
My Compute Canada Cheatsheet

HPC
ComputeCanada
Tips
All the Compute Canada things I actually use on the Alliance clusters: login, storage, modules, uv, SLURM, and submitit patterns.
Published: April 22, 2026

~/compute-canada/cheatsheet.sh
> Tips & Tricks · HPC / Compute Canada

Compute Canada Cheatsheet

HPC · SLURM · by Meghana Bhange

Things I look up every time I start a new project on the Alliance clusters — login, storage, modules, environments, and job submission. Take everything here with a grain of salt; I keep updating this as I learn new tips (and unlearn old ones that were wrong).

01

Login

bash
ssh <username>@<cluster>.computecanada.ca
Note: SSH keys and 2FA

Make sure your SSH keys are set up in the Compute Canada portal, and that you have 2FA configured for your account. You won't be able to log in to some of the clusters without these!
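
Optional, but a host alias in ~/.ssh/config saves a lot of retyping. A minimal sketch (the alias, cluster, and key path are placeholders for whatever you actually use):

bash
cat >> ~/.ssh/config <<'EOF'
Host narval
    HostName narval.computecanada.ca
    User <username>
    IdentityFile ~/.ssh/id_ed25519
EOF
ssh narval   # instead of the full hostname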

02

Storage

Path          Backed up   Notes
~/ (home)     ✅ Yes      Code, configs — not for data
~/scratch/    ❌ No       Large datasets, job outputs. Purged after 60 days if unused.
~/projects/   ✅ Yes      Persistent project data, shared with group
Rule of thumb

Code in home, datasets + checkpoints in scratch, important results in projects.
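
Not sure which side of a quota you're on? The Alliance clusters ship a diskusage_report command that shows usage against each filesystem's quota:

bash
diskusage_report     # usage vs. quota for home, scratch, and project
du -sh $SCRATCH/*    # find what is actually eating scratch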

03

Modules

Load software via the module system — do not install system-wide.

bash
module avail                          # list available modules
module spider python                  # search for a module
module load StdEnv/2023 python/3.11   # load specific versions
module list                           # see loaded modules
module purge                          # unload everything

My default stacks:

CPU jobs
module purge
module load StdEnv/2023 gcc/14.3 python/3.11 arrow/17.0.0

GPU jobs
module purge
module load StdEnv/2020 gcc/9.3.0 cuda/11.8.0
module load StdEnv/2023 gcc/14.3 python/3.11 arrow/17.0.0
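
The module system is Lmod, so you can save a loaded stack as a named collection instead of retyping it (the collection name here is made up):

bash
module purge
module load StdEnv/2023 gcc/14.3 python/3.11 arrow/17.0.0
module save cpu_stack      # snapshot the currently loaded modules
module restore cpu_stack   # bring them all back later in one line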
04

Virtual Environments

I use uv now — much faster than pip, works great on the clusters.

1. Point TMPDIR at scratch (avoids quota issues):

bash
echo 'export TMPDIR=$SCRATCH/tmp' >> ~/.bashrc
mkdir -p $SCRATCH/tmp
source ~/.bashrc

2. Install uv into home:

bash
curl -L https://github.com/astral-sh/uv/releases/latest/download/uv-x86_64-unknown-linux-gnu.tar.gz \
  -o $SCRATCH/tmp/uv.tar.gz
tar -xzf $SCRATCH/tmp/uv.tar.gz -C $SCRATCH/tmp/
mkdir -p ~/.local/bin
cp $SCRATCH/tmp/uv-x86_64-unknown-linux-gnu/uv ~/.local/bin/
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
uv --version

3. Create and activate a venv on scratch:

bash
uv venv $SCRATCH/envs/my_env --python 3.11
source $SCRATCH/envs/my_env/bin/activate
uv pip install -r ~/scratch/path_to/requirements.txt
Why scratch for envs?

Home has a 50 GB quota. Large ML environments with torch + transformers can easily blow past that. Keep envs on scratch and just re-create them if they get purged.
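
Since a purge can take the env with it, I keep a snapshot of what was installed somewhere backed up. uv supports pip-style freeze, so rebuilding takes a minute (the project path is a placeholder):

bash
uv pip freeze > ~/projects/<project>/requirements.txt   # snapshot to backed-up storage
# after a purge, rebuild from the snapshot:
uv venv $SCRATCH/envs/my_env --python 3.11
source $SCRATCH/envs/my_env/bin/activate
uv pip install -r ~/projects/<project>/requirements.txt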

05

SLURM Basics

Interactive session — Compute Canada (salloc):

bash · cedar / beluga / narval
salloc --time=1:00:00 --mem=16G --gres=gpu:1 --account=<account-pi>

Interactive session — NIBI Cloud (Mila):

bash · nibi.calculquebec.ca
ssh <username>@nibi.calculquebec.ca
salloc --time=1:00:00 --mem=16G --gres=gpu:1 --account=<account-pi>

Check your running jobs:

bash
squeue -u $USER              # all your jobs
squeue -u $USER -t RUNNING   # only running
scancel <job_id>             # cancel a job
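
For anything long-running, a batch script beats babysitting salloc. A minimal sketch using the stacks and env from above (account, time, and resources are placeholders to adjust):

bash · job.sh
#!/bin/bash
#SBATCH --account=<account-pi>
#SBATCH --time=4:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --output=slurm_logs/%j.out
#SBATCH --error=slurm_logs/%j.err

module purge
module load StdEnv/2023 gcc/14.3 python/3.11 arrow/17.0.0
source $SCRATCH/envs/my_env/bin/activate

python train.py

Submit with sbatch job.sh; the %j in the log paths expands to the job ID.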

Tail logs for multiple jobs at once:

bash
for id in JOB_ID_1 JOB_ID_2 JOB_ID_3; do
  echo "=== $id ERR ==="; tail -n 5 slurm_logs/${id}_0_log.err
  echo "=== $id OUT ==="; tail -n 5 slurm_logs/${id}_0_log.out
done

Register a Jupyter kernel for your env:

bash
TMPDIR=$SCRATCH/tmp python -m ipykernel install --user \
  --name my_env \
  --display-name "My ENV"
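
To actually use the kernel, one pattern that works for me is running Jupyter inside a salloc session and tunnelling to it from my laptop. The port and node name below are assumptions; use whatever Jupyter prints when it starts:

bash
# on the compute node, inside salloc:
jupyter lab --no-browser --ip $(hostname -f)
# on your laptop, in a second terminal:
ssh -L 8888:<node-name>:8888 <username>@<cluster>.computecanada.ca
# then open the http://localhost:8888/... link from Jupyter's output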
06

submitit

submitit lets you submit any Python function directly to SLURM — no sbatch scripts needed. Pair it with click for a clean CLI launcher.

Concrete example with a simple add(x, y) function — swap it for your actual training code:

submit.py
import os, click, submitit

SCRATCH = os.environ["SCRATCH"]
SETUP = [
    "module purge",
    "module load StdEnv/2020 gcc/9.3.0 cuda/11.8.0",
    "module load StdEnv/2023 gcc/14.3 python/3.11 arrow/17.0.0",
    f"source {SCRATCH}/envs/my_env/bin/activate",
]

# ── The actual work that runs on the cluster node ──────────────
def add(x: float, y: float) -> float:
    result = x + y
    print(f"{x} + {y} = {result}")  # shows up in slurm_logs/*.out
    return result

# ── Boilerplate: build executor with resource params ───────────
def make_executor(account, gres, timeout, cpus, mem):
    ex = submitit.AutoExecutor(folder="slurm_logs")
    ex.update_parameters(
        slurm_account=account,
        slurm_gres=gres,
        timeout_min=timeout,
        cpus_per_task=cpus,
        mem_gb=mem,
        slurm_setup=SETUP,
    )
    return ex

# ── One @cli.command() per job type ────────────────────────────
@click.group()
def cli():
    pass

@cli.command()
@click.option("--x", default=1.0, type=float)
@click.option("--y", default=2.0, type=float)
@click.option("--account", default="def-yourpi")
@click.option("--gres", default="gpu:h100:1")
@click.option("--timeout", default=60, type=int)
@click.option("--cpus", default=2, type=int)
@click.option("--mem", default=8, type=int)
def run_add(x, y, account, gres, timeout, cpus, mem):
    """Submit add(x, y) to SLURM."""
    ex = make_executor(account, gres, timeout, cpus, mem)
    job = ex.submit(add, x, y)
    click.echo(f"Submitted job {job.job_id} ({x} + {y})")

if __name__ == "__main__":
    cli()
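
Launching looks like this (recent click turns the function name run_add into the command run-add; the job ID is whatever SLURM assigns):

bash
python submit.py run-add --x 2 --y 3 --account def-yourpi
# → Submitted job <job_id> (2.0 + 3.0)
# the function's return value is also pickled under slurm_logs/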

Running a Python script file instead of a function?

submit.py should only know about SLURM args. Use -- as a separator — everything after it gets passed directly to your script, so train.py owns its own args:

submit.py — running a script
import sys, subprocess

def run_script(cmd: list[str]) -> None:
    subprocess.run(cmd, check=True)

# context_settings + UNPROCESSED lets unknown flags pass through after --
@cli.command(context_settings=dict(ignore_unknown_options=True))
@click.option("--account", default="def-yourpi")
@click.option("--gres", default="gpu:h100:1")
@click.option("--timeout", default=240, type=int)
@click.option("--cpus", default=4, type=int)
@click.option("--mem", default=32, type=int)
@click.argument("script_args", nargs=-1, type=click.UNPROCESSED)
def train(account, gres, timeout, cpus, mem, script_args):
    """Submit train.py — SLURM opts before --, script opts after."""
    cmd = [sys.executable, "train.py"] + list(script_args)
    ex = make_executor(account, gres, timeout, cpus, mem)
    job = ex.submit(run_script, cmd)
    click.echo(f"Submitted job {job.job_id}")
    click.echo(f"  slurm : {gres}, {cpus} cpus, {mem}GB, {timeout}min")
    click.echo(f"  script: {' '.join(cmd)}")
bash
# submit.py args ─────────────────────────┐  train.py args ────┐
meghana@beluga:~/project$ python submit.py train --gres gpu:h100:1 -- --lr 3e-4 --epochs 20
Submitted job 14823999
  slurm : gpu:h100:1, 4 cpus, 32GB, 240min
  script: python train.py --lr 3e-4 --epochs 20
Tip: env vars go inside the function

Set HF_HOME, TMPDIR, TRANSFORMERS_CACHE, etc. at the top of the submitted function using os.environ.setdefault() — they need to be set in the Python process that runs on the node, not just in the shell setup commands.

Last updated: April 2026