GPU Jobs

AICR's primary resource is its GPUs. This page covers how to request GPUs, choose the right partition, and submit GPU jobs.

Choosing a Partition

Partition	GPU Model	GPUs/Node	Max Time	Best For
`rtx-batch`	RTX PRO 6000	8	24h	Production GPU workloads
`rtx-devel`	RTX PRO 6000	8	4h	Testing and debugging
`b200-batch`	B200	8	24h	Large-scale AI/ML training
`b200-devel`	B200	8	4h	Testing and debugging

Tip

Use the devel partitions for testing: they have shorter queue times. Switch to a batch partition for production runs.
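
The table above can also be expressed as a small lookup, which may help when generating job scripts programmatically. This is only a sketch: `pick_partition` and its `testing` flag are hypothetical names, not an AICR tool, and the limits are copied from the table on this page.

```python
# Partition table copied from this page: name -> (GPU model, max wall time in hours).
PARTITIONS = {
    "rtx-batch": ("RTX PRO 6000", 24),
    "rtx-devel": ("RTX PRO 6000", 4),
    "b200-batch": ("B200", 24),
    "b200-devel": ("B200", 4),
}


def pick_partition(gpu_model: str, hours: float, testing: bool = False) -> str:
    """Return a partition name for the given GPU model and run time.

    Hypothetical helper: testing=True prefers the devel partition.
    """
    suffix = "devel" if testing else "batch"
    for name, (model, max_hours) in PARTITIONS.items():
        if model == gpu_model and name.endswith(suffix):
            if hours > max_hours:
                raise ValueError(f"{name} allows at most {max_hours}h, requested {hours}h")
            return name
    raise ValueError(f"no partition offers GPU model {gpu_model!r}")


print(pick_partition("B200", 2, testing=True))   # b200-devel
print(pick_partition("RTX PRO 6000", 12))        # rtx-batch
```

A request that exceeds the partition's time limit (for example 8 hours on a devel partition) raises a `ValueError` instead of silently choosing another partition.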

Requesting GPUs

The simplest way to request GPUs is with one of the following directives:

#SBATCH --gpus=1                    # one GPU (any type on the partition)
#SBATCH --gpus=4                    # four GPUs
#SBATCH --gpus-per-node=8           # all 8 GPUs on a node

The partition determines the GPU type:

#SBATCH --gpus=4               # four GPUs
#SBATCH --partition=rtx-devel  # rtx-devel nodes have RTX PRO 6000 GPUs

Note

Partitions on AICR are homogeneous, meaning node configurations are identical within a partition, including the GPU type. Since you include the partition in your job script, you do not need to specify the GPU type.

Single-GPU Job

The following script requests one GPU, 8 CPUs, and 32 GB of RAM for 8 hours in the rtx-batch partition. Because the job runs in rtx-batch, it will get an RTX PRO 6000 GPU.

#!/bin/bash
#SBATCH --job-name=single_gpu
#SBATCH --partition=rtx-batch
#SBATCH --gpus=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=08:00:00
#SBATCH --account=ACCOUNT_NAME
#SBATCH --output=%x-%j.out

module load miniforge3
module load cuda

python train.py

Multi-GPU Job (Single Node)

The following script requests four GPUs on one node in the b200-batch partition, along with 32 CPUs, 128 GB of CPU memory, and 12 hours of run time. Because the job runs in b200-batch, it will get four B200 GPUs.

#!/bin/bash
#SBATCH --job-name=multi_gpu
#SBATCH --partition=b200-batch
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --time=12:00:00
#SBATCH --account=ACCOUNT_NAME
#SBATCH --output=%x-%j.out

module load miniforge3
module load cuda

torchrun --nproc_per_node=4 train.py
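
With `--nproc_per_node=4`, torchrun starts four copies of `train.py` and exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each one. The contents of `train.py` are not shown on this page, so the following is only a sketch of how a script might read those variables; `torchrun_env` is a hypothetical helper name.

```python
import os


def torchrun_env(environ=os.environ) -> dict:
    """Read the rank variables that torchrun exports to each worker process.

    Falls back to single-process defaults when the script is run directly.
    """
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }


if __name__ == "__main__":
    env = torchrun_env()
    # In a real train.py, each process would typically bind to its own GPU,
    # e.g. torch.cuda.set_device(env["local_rank"]), before building the model.
    print(f"worker {env['rank']} of {env['world_size']} uses GPU {env['local_rank']}")
```

Reading the variables with defaults keeps the same script usable both under torchrun and as a plain `python train.py` single-GPU run.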

Verifying GPU Access

Inside a running job, check that GPUs are visible with:

nvidia-smi

This shows the GPU model, memory, utilization, and temperature. If no GPUs appear, verify that you requested them with --gpus or --gpus-per-node and submitted the job to a GPU partition.

For PyTorch:

import torch
print(torch.cuda.is_available())       # should print True
print(torch.cuda.device_count())       # should match requested GPUs
print(torch.cuda.get_device_name(0))   # GPU model name
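
If PyTorch is not installed in the environment, a rough framework-free check is to look at `CUDA_VISIBLE_DEVICES`, which Slurm commonly sets for GPU allocations. Whether it is set on AICR depends on the site configuration, so treat this as a sketch under that assumption; `visible_gpu_ids` is a hypothetical helper name.

```python
import os


def visible_gpu_ids(environ=os.environ) -> list[str]:
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES, if any."""
    raw = environ.get("CUDA_VISIBLE_DEVICES", "")
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    ids = visible_gpu_ids()
    # Assumption: the scheduler exports CUDA_VISIBLE_DEVICES for this job.
    print(f"{len(ids)} GPU(s) visible to this process: {ids}")
```

An empty list means the variable is unset or empty, which is worth comparing against the number of GPUs requested in the job script.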

GPU Memory Considerations

RTX PRO 6000 and B200 GPUs each have substantial memory, but large models or large batch sizes can still cause out-of-memory (OOM) errors.
If you encounter OOM errors, reduce the batch size, enable mixed precision (torch.amp), or use gradient checkpointing.
Monitor GPU memory usage with nvidia-smi during interactive sessions to right-size your requests.
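
To track memory over time rather than reading the full nvidia-smi table, the tool's CSV query mode can be parsed from a script. The sketch below is one possible approach, not an AICR-provided utility: `parse_memory` and `gpu_memory` are hypothetical helper names, and the sample numbers are made up for illustration.

```python
import subprocess

# nvidia-smi query mode: one CSV row per GPU, values in MiB, no header or units.
QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory(csv_text: str) -> list[tuple[int, int, int]]:
    """Parse 'index, used, total' rows (MiB) into integer tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def gpu_memory() -> list[tuple[int, int, int]]:
    """Run nvidia-smi on the current node and return per-GPU memory usage."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)


# Made-up sample output, used here so the parser can be tried without a GPU.
sample = "0, 20480, 97887\n1, 512, 97887\n"
for index, used, total in parse_memory(sample):
    print(f"GPU {index}: {used}/{total} MiB ({100 * used / total:.0f}% used)")
```

Inside an interactive session on a GPU node, calling `gpu_memory()` periodically while a training step runs gives a rough peak-usage figure to compare against the GPU's capacity.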