tags: - Hardware

AICR System Overview¶

AICR is a GPU-focused AI computing cluster housed at the MGHPCC in Holyoke, MA. It serves researchers at Boston University, Harvard, MIT, Northeastern, UMass (five campuses), and Yale as part of the Massachusetts AI Hub.

Hardware¶

Component	Count	Details
NVIDIA B200 GPU Nodes	31	8 B200 GPUs per node (248 total), 180 GB HBM3e per GPU, NVLink 5, liquid-cooled
NVIDIA RTX PRO 6000 GPU Nodes	19	8 RTX PRO 6000 GPUs per node (152 total), 96 GB GDDR7 per GPU, PCIe Gen5
Service Nodes	32	Dell R6715, 128-core AMD EPYC 9745, 1.125 TB RAM, login/Slurm/OnDemand/Globus/CPU compute
Storage	~14.3 PiB effective	VAST Data all-flash, 360 GB/s throughput, encryption at rest, GPU Direct Storage
Networking	N/A	Rail-optimized IB (B200 GPU), cluster IB (storage/MPI), Ethernet (management)

Partitions¶

Partition	Cores	Memory	GPUs/Node	GPU Memory	Max Time	Use Case
`rtx-batch`	128	2.25 TB	8 RTX PRO 6000	96 GB	24h	RTX GPU batch workloads
`rtx-devel`	128	2.25 TB	8 RTX PRO 6000	96 GB	4h	Interactive GPU development
`b200-batch`	128	2.25 TB	8 B200	180 GB	24h	B200 GPU batch workloads
`b200-devel`	128	2.25 TB	8 B200	180 GB	4h	Interactive GPU development
`cpu`	128	1 TB	—	—	24h	Data analysis, workflow orchestration

Default memory per CPU: 1 GB. Development partitions have a 4-concurrent-job limit per user.

Storage¶

Storage	Path	Quota	Snapshots	Purge
Home	`/home/USERNAME`	100 GiB	7-day	None
Scratch	`/scratch/USERNAME`	10 TiB	None	Regular
Work	`/work/INSTITUTION/PROJECT`	Varies	7-day	None

Warning

AICR storage relies on snapshots only — there are no off-site backups. Back up irreplaceable data externally.

Interconnect¶

AICR nodes are connected by InfiniBand NDR400 (400 Gb/s) for high-bandwidth, low-latency communication between GPUs and storage. This enables fast distributed training across multiple GPU nodes and high-throughput access to the VAST storage system via NFS-RDMA.