About
The Engaging cluster is a mixed CPU and GPU computing cluster managed with Slurm. On Engaging, you can run your work in two ways:
- Interactive jobs give you a live shell on a compute node. Use these for development and troubleshooting.
- Batch jobs run your code automatically from a job script. Submit them with sbatch for heavy or long-running computations. Batch jobs wait in the queue and run when resources are available.
Quick Start: Running an Interactive Job
If you don’t already have access, visit the OnDemand Portal and log in with your MIT Kerberos credentials. This will automatically initiate account creation for the Engaging cluster.
Log in via your terminal:
ssh USERNAME@orcd-login001.mit.edu
Replace USERNAME with your Kerberos username. You’ll be prompted for your Kerberos password and two-factor authentication.
You can also use orcd-login002, orcd-login003, or orcd-login004. These are functionally the same; if one login node is slow, try another.
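If you log in often, you can optionally add a host alias to your ~/.ssh/config so that ssh engaging connects directly. The alias name engaging below is just an example; replace USERNAME as above:
Host engaging
    HostName orcd-login001.mit.edu
    User USERNAME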
Login nodes are not meant for heavy computations, so let’s request an interactive session on one of our group’s nodes (partition: sched_mit_kburdge_r8). First check which group compute nodes are available:
sinfo -N -l -p sched_mit_kburdge_r8
Sample output:
NODELIST  NODES  PARTITION             STATE  CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
node2002  1      sched_mit_kburdge_r8  idle   256   2:64:2  515000  0         1       node2002  none
node2003  1      sched_mit_kburdge_r8  idle   256   2:64:2  515000  0         1       node2003  none
node2004  1      sched_mit_kburdge_r8  idle   256   2:64:2  515000  0         1       node2004  none
node2005  1      sched_mit_kburdge_r8  idle   256   2:64:2  515000  0         1       node2005  none
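If you only want to see nodes that are currently free, you can filter by state:
sinfo -N -l -p sched_mit_kburdge_r8 -t idle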
Choose an idle node (e.g., node2002), then request an interactive session:
salloc --nodelist=node2002 --ntasks=1 --cpus-per-task=1 --mem=2G --time=01:00:00 -p sched_mit_kburdge_r8
This example requests 1 task, 1 CPU, 2 GB of memory, and 1 hour of runtime on a node in our group partition (sched_mit_kburdge_r8). See below for other partitions.
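Once the allocation is granted, you get a live shell on the compute node. Use hostname to confirm which node you landed on, and exit to end the session and release the resources:
hostname
exit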
Always set --time. Jobs without a time limit may fail to submit.
To request GPUs interactively, add, e.g., --gres=gpu:2 to your salloc command.
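For example, a single-GPU interactive session might look like the following (the CPU, memory, and time values are just illustrative); once it starts, nvidia-smi should list the allocated GPU:
salloc --ntasks=1 --cpus-per-task=4 --mem=16G --time=01:00:00 --gres=gpu:1 -p sched_mit_kburdge_r8
nvidia-smi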
On our cluster, most software (compilers, libraries, applications) is managed using the module system. To see what’s already installed, run:
module avail
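A few other everyday module subcommands, standard across module systems:
module list             # show currently loaded modules
module show <name>      # see what a module sets up
module purge            # unload all loaded modules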
Troubleshooting module errors: some older compute nodes (e.g., those in sched_mit_mki) don’t automatically load the environment needed for the module system. If you see an error like bash: module: command not found, simply run: source /etc/profile
You might need specialized tools that aren’t installed system-wide. We encourage each user to use conda environments to manage their own software. On Engaging, Conda is available via the Miniforge module. To make conda commands available in your shell, run:
module load miniforge
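From there you can create and activate an environment of your own; the name myenv and the Python version below are placeholders:
conda create -n myenv python=3.11
conda activate myenv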
Quick Start: Running a Batch Job
Batch jobs on Engaging let you run work without staying logged in. You submit a job script, and Slurm runs it automatically once resources are available.
Create a Slurm script (e.g. script.sh) with the following format:
#!/bin/bash
#SBATCH -p sched_mit_kburdge_r8 # partition name
#SBATCH --job-name=myjob # name for your job
#SBATCH --gres=gpu:2 # if you need GPUs
#SBATCH --ntasks=1 # number of tasks (often 1 for serial jobs)
#SBATCH --cpus-per-task=4 # CPU cores per task
#SBATCH --mem=16G # memory per node
#SBATCH --time=02:00:00 # max walltime (HH:MM:SS)
#SBATCH --output=slurm-%j.out # output file (%j = job ID) to capture logs for debugging
# Load your shell environment to activate your Conda environment
source ~/.bashrc
conda activate myconda
# Load any modules or software you need
module load cuda/12.0
# Run your command or script
python my_analysis.py
Then submit your job:
sbatch script.sh
Slurm will print the assigned job ID (Submitted batch job <jobid>) and run your job when resources are free.
You can see your running and pending jobs with:
squeue -u $USER
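For jobs still pending, squeue can also report Slurm’s estimated start times:
squeue -u $USER --start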
You can also check a job’s details, including its CPUs, memory, time limit, and node allocation, with:
scontrol show job <jobid>
Cancel the job with:
scancel <jobid>
Lastly, view accounting information for finished jobs (job IDs, run times, exit codes, etc.) with:
sacct -u $USER
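The default sacct output is terse; a custom field list is often more readable (the fields below are one reasonable choice):
sacct -u $USER --format=JobID,JobName,Partition,Elapsed,State,ExitCode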
Filesystems
| Storage Type | Path | Quota | Backed up | Purpose |
| --- | --- | --- | --- | --- |
| Home Directory Flash | /home/<username> | 200 GB | Backed up with snapshots | Use for important files and software |
| Burdge Group Storage | /orcd/data/kburdge/001 | 100 TB | Backed up | Use for important files and software |
| Pool Hard Disk | /home/<username>/orcd/pool | 1 TB | Disaster recovery backup | Storing larger datasets |
| Scratch Flash | /home/<username>/orcd/scratch | 1 TB | Not backed up | Scratch space for I/O-heavy jobs |
If you have not logged in for 6 months, files in /home/<username>/orcd/scratch will be deleted.
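To check how much space a directory is using (for example, before cleaning up scratch), standard tools work; note that du can be slow on large directory trees:
du -sh /home/<username>/orcd/scratch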
Partition Information
| Partition Name | Purpose | Nodes | Resources | Max Time Limit |
| --- | --- | --- | --- | --- |
| sched_mit_kburdge_r8 | Long-running batch or interactive jobs needing CPU or GPU resources for the Burdge group. | 4 nodes (node2002-node2005) | 256 CPU cores/node, 4 GPUs/node, ~515 GB RAM/node | 14 days |
| sched_mit_mki | Longer CPU-only jobs on shared MKI (CentOS 7) nodes. | 27 nodes (node1399-node1414, node1447-node1458) | Varies by node; ~64 cores/node, ~384 GB RAM/node | 14 days |
| sched_mit_mki_r8 | GPU workloads on shared MKI nodes. | 2 nodes (node2000, node2001) | 128 cores/node, 4 GPUs/node, ~515 GB RAM/node | 7 days |
| mit_normal | Batch and interactive CPU-only jobs on MIT-wide shared nodes. | 52 nodes (node1600-1625, node2704-2705, node3103-3114, node3303-3314) | 96 cores/node, ~520 GB RAM/node | 12 hours |
| mit_normal_gpu | Batch or interactive jobs that need a GPU. | 75 nodes (node2433-2434, node2804, node2906, node3000-3008, node3100-3101, node3200-3208, node3300-3302, node3400, node3402-3408, node3500-3512, node4102-4108, node4200-4212, node4302-4305, node4502-4504) | Partition totals: 4 H100, 88 H200, and 252 L40S GPUs; 5416 cores (~72 cores/node); ~1.18 TB RAM/node | 6 hours |
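If you need a specific GPU model on a mixed partition such as mit_normal_gpu, Slurm generally accepts a typed gres request. The type string l40s below is an assumption; confirm the exact name from a node's Gres field (see scontrol show node under Other Useful Slurm Commands):
sbatch --gres=gpu:l40s:1 script.sh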
Other Useful Slurm Commands
- List all partitions: shows which partitions exist, how many nodes are available, and time limits.
sinfo
- Summarized view of partitions: one-line summary per partition for a quick overview.
sinfo -s
- Show nodes in a specific partition:
sinfo -p <partition_name> -N -l
- Show hardware and limits of a partition:
scontrol show partition <partition_name>
- Check details of a specific node:
scontrol show node <node_name>
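For a quick look at just a node's hardware inventory rather than the full record, you can filter the output (CPUTot, RealMemory, and Gres are standard fields in scontrol output):
scontrol show node node2002 | grep -E 'CPUTot|RealMemory|Gres'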