My Compute Canada Cheatsheet
Things I look up every time I start a new project on the Alliance clusters — login, storage, modules, environments, and job submission. Take everything here with a grain of salt; I keep updating this as I learn new tips (and unlearn old ones that were wrong).
Login
ssh <username>@<cluster>.computecanada.ca
Make sure your SSH keys are set up in the Compute Canada portal, and that you have 2FA configured for your account. You won't be able to log in to some of the clusters without these!
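Typing the full hostname every time gets old. An ~/.ssh/config entry shortens it — the alias, username, and key path below are placeholders, swap in your own:

```
Host beluga
    HostName beluga.computecanada.ca
    User <username>
    IdentityFile ~/.ssh/id_ed25519
```

After that, `ssh beluga` is enough.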
Storage
| Path | Backed up | Notes |
|---|---|---|
| ~/ (home) | ✅ Yes | Code, configs — not for data |
| ~/scratch/ | ❌ No | Large datasets, job outputs. Purged after 60 days if unused. |
| ~/projects/ | ✅ Yes | Persistent project data, shared with group |
Code in home, datasets + checkpoints in scratch, important results in projects.
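The rule above, as a sketch. The project name "myproj" and account "def-yourpi" are placeholders, and the SCRATCH fallback on the first line just lets the snippet run off-cluster:

```shell
# Hypothetical layout for a project called "myproj"
: "${SCRATCH:=$HOME/scratch}"                                  # Alliance clusters set $SCRATCH for you
mkdir -p ~/code/myproj                                         # code + configs: home (backed up)
mkdir -p "$SCRATCH/myproj/data" "$SCRATCH/myproj/checkpoints"  # big + purgeable: scratch
mkdir -p ~/projects/def-yourpi/$USER/myproj                    # results worth keeping: projects
```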
Modules
Load software via the module system — do not install system-wide.
module avail # list available modules
module spider python # search for a module
module load StdEnv/2023 python/3.11 # load specific versions
module list # see loaded modules
module purge # unload everything
My default stacks:
# Python / data stack
module purge
module load StdEnv/2023 gcc/14.3 python/3.11 arrow/17.0.0

# CUDA stack
module purge
module load StdEnv/2020 gcc/9.3.0 cuda/11.8.0
Virtual Environments
I use uv now — much faster than pip, works great on the clusters.
1. Point TMPDIR at scratch (avoids quota issues):
echo 'export TMPDIR=$SCRATCH/tmp' >> ~/.bashrc
mkdir -p $SCRATCH/tmp
source ~/.bashrc
2. Install uv into home:
curl -L https://github.com/astral-sh/uv/releases/latest/download/uv-x86_64-unknown-linux-gnu.tar.gz \
-o $SCRATCH/tmp/uv.tar.gz
tar -xzf $SCRATCH/tmp/uv.tar.gz -C $SCRATCH/tmp/
mkdir -p ~/.local/bin
cp $SCRATCH/tmp/uv-x86_64-unknown-linux-gnu/uv ~/.local/bin/
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
uv --version
3. Create and activate a venv on scratch:
uv venv $SCRATCH/envs/my_env --python 3.11
source $SCRATCH/envs/my_env/bin/activate
uv pip install -r ~/scratch/path_to/requirements.txt
Home has a 50 GB quota, and large ML environments with torch + transformers can easily blow past it. Keep envs on scratch and just re-create them if they get purged.
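To soften the purge risk, a small guard in ~/.bashrc can rebuild the env whenever it disappears. This is only a sketch: it assumes uv is on PATH, and the function name, env path, and requirements location are my own placeholders.

```shell
# Recreate the scratch venv if the purge removed it, then activate it.
ensure_env () {
    local env="$SCRATCH/envs/my_env"
    if [ ! -x "$env/bin/python" ]; then   # env gone? rebuild from requirements
        uv venv "$env" --python 3.11
        uv pip install --python "$env/bin/python" -r ~/code/myproj/requirements.txt
    fi
    . "$env/bin/activate"
}
```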
SLURM Basics
Interactive session — Compute Canada (salloc):
salloc --time=1:00:00 --mem=16G --gres=gpu:1 --account=<account-pi>
Interactive session — NIBI Cloud (Mila):
ssh <username>@nibi.calculquebec.ca
salloc --time=1:00:00 --mem=16G --gres=gpu:1 --account=<account-pi>
Check your running jobs:
squeue -u $USER # all your jobs
squeue -u $USER -t RUNNING # only running
scancel <job_id> # cancel a job
Tail logs for multiple jobs at once:
for id in JOB_ID_1 JOB_ID_2 JOB_ID_3; do
    echo "=== $id ERR ==="; tail -n 5 slurm_logs/${id}_0_log.err
    echo "=== $id OUT ==="; tail -n 5 slurm_logs/${id}_0_log.out
done
Register a Jupyter kernel for your env:
TMPDIR=$SCRATCH/tmp python -m ipykernel install --user \
    --name my_env \
    --display-name "My ENV"
submitit
submitit lets you submit any Python function directly to SLURM — no sbatch scripts needed. Pair it with click for a clean CLI launcher.
Concrete example with a simple add(x, y) function — swap it for your actual training code:
import os, click, submitit

SCRATCH = os.environ["SCRATCH"]

SETUP = [
    "module purge",
    "module load StdEnv/2023 gcc/14.3 python/3.11 arrow/17.0.0",
    f"source {SCRATCH}/envs/my_env/bin/activate",
]

# ── The actual work that runs on the cluster node ──────────────
def add(x: float, y: float) -> float:
    result = x + y
    print(f"{x} + {y} = {result}")  # shows up in slurm_logs/*.out
    return result

# ── Boilerplate: build executor with resource params ───────────
def make_executor(account, gres, timeout, cpus, mem):
    ex = submitit.AutoExecutor(folder="slurm_logs")
    ex.update_parameters(
        slurm_account=account, slurm_gres=gres,
        timeout_min=timeout, cpus_per_task=cpus, mem_gb=mem,
        slurm_setup=SETUP,
    )
    return ex

# ── One @cli.command() per job type ───────────────────────────
@click.group()
def cli():
    pass

@cli.command()
@click.option("--x", default=1.0, type=float)
@click.option("--y", default=2.0, type=float)
@click.option("--account", default="def-yourpi")
@click.option("--gres", default="gpu:h100:1")
@click.option("--timeout", default=60, type=int)
@click.option("--cpus", default=2, type=int)
@click.option("--mem", default=8, type=int)
def run_add(x, y, account, gres, timeout, cpus, mem):
    """Submit add(x, y) to SLURM."""
    ex = make_executor(account, gres, timeout, cpus, mem)
    job = ex.submit(add, x, y)
    click.echo(f"Submitted job {job.job_id} ({x} + {y})")

if __name__ == "__main__":
    cli()
Running a Python script file instead of a function?
submit.py should only know about SLURM args. Use -- as a separator — everything after it gets passed directly to your script, so train.py owns its own args:
import sys, subprocess

def run_script(cmd: list[str]) -> None:
    subprocess.run(cmd, check=True)

# context_settings + UNPROCESSED lets unknown flags pass through after --
@cli.command(context_settings=dict(ignore_unknown_options=True))
@click.option("--account", default="def-yourpi")
@click.option("--gres", default="gpu:h100:1")
@click.option("--timeout", default=240, type=int)
@click.option("--cpus", default=4, type=int)
@click.option("--mem", default=32, type=int)
@click.argument("script_args", nargs=-1, type=click.UNPROCESSED)
def train(account, gres, timeout, cpus, mem, script_args):
    """Submit train.py — SLURM opts before --, script opts after."""
    cmd = [sys.executable, "train.py"] + list(script_args)
    ex = make_executor(account, gres, timeout, cpus, mem)
    job = ex.submit(run_script, cmd)
    click.echo(f"Submitted job {job.job_id}")
    click.echo(f"  slurm : {gres}, {cpus} cpus, {mem}GB, {timeout}min")
    click.echo(f"  script: {' '.join(cmd)}")
# submit.py args ──────────────┐ train.py args ──────────────┐
meghana@beluga:~/project$ python submit.py train --gres gpu:h100:1 -- --lr 3e-4 --epochs 20
Submitted job 14823999
slurm : gpu:h100:1, 4 cpus, 32GB, 240min
script: python train.py --lr 3e-4 --epochs 20
Set HF_HOME, TMPDIR, TRANSFORMERS_CACHE, etc. at the top of the submitted function using os.environ.setdefault() — they need to be set in the Python process itself, not just in the shell setup lines.
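A minimal sketch of that pattern — the function name, cache paths, and the SCRATCH fallback are my own placeholders:

```python
import os

def run_training(lr: float) -> str:
    # Runs on the compute node: set caches before any library reads them.
    scratch = os.environ.get("SCRATCH", "/tmp")
    os.environ.setdefault("HF_HOME", f"{scratch}/hf_cache")
    os.environ.setdefault("TMPDIR", f"{scratch}/tmp")
    # ... import transformers / start the actual training here, after the env is set ...
    return os.environ["HF_HOME"]

# submitted as: ex.submit(run_training, 3e-4)
```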