My Compute Canada Cheatsheet
Things I look up every time I start a new project on the Alliance clusters — login, storage, modules, environments, and job submission. Take everything here with a grain of salt; I keep updating this as I learn new tips (and unlearn old ones that were wrong).
Login
ssh <username>@<cluster>.computecanada.ca
Make sure your SSH keys are set up in the Compute Canada portal, and that you have 2FA configured for your account. You won't be able to log in to some of the clusters without these!
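Typing the full hostname every time gets old. An ~/.ssh/config entry shortens it — the alias, username, and key path below are placeholders, swap in your own:

```
Host beluga
    HostName beluga.computecanada.ca
    User <username>
    IdentityFile ~/.ssh/id_ed25519
```

After that, `ssh beluga` is enough.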
Storage
| Path | Backed up | Notes |
|---|---|---|
| ~/ (home) | ✅ Yes | Code, configs — not for data |
| ~/scratch/ | ❌ No | Large datasets, job outputs. Purged after 60 days if unused. |
| ~/projects/ | ✅ Yes | Persistent project data, shared with group |
Code in home, datasets + checkpoints in scratch, important results in projects.
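The rule above, as a sketch. The project name "myproj" and account "def-yourpi" are placeholders, and the SCRATCH fallback on the first line just lets the snippet run off-cluster:

```shell
# Hypothetical layout for a project called "myproj"
: "${SCRATCH:=$HOME/scratch}"                                  # Alliance clusters set $SCRATCH for you
mkdir -p ~/code/myproj                                         # code + configs: home (backed up)
mkdir -p "$SCRATCH/myproj/data" "$SCRATCH/myproj/checkpoints"  # big + purgeable: scratch
mkdir -p ~/projects/def-yourpi/$USER/myproj                    # results worth keeping: projects
```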
Modules
Load software via the module system — do not install system-wide.
module avail # list available modules
module spider python # search for a module
module load StdEnv/2023 python/3.11 # load specific versions
module list # see loaded modules
module purge # unload everything
My default stacks:
# Python / data stack
module purge
module load StdEnv/2023 gcc/14.3 python/3.11 arrow/17.0.0

# CUDA stack
module purge
module load StdEnv/2020 gcc/9.3.0 cuda/11.8.0
Virtual Environments
I use uv now — much faster than pip, works great on the clusters.
1. Point TMPDIR at scratch (avoids quota issues):
echo 'export TMPDIR=$SCRATCH/tmp' >> ~/.bashrc
mkdir -p $SCRATCH/tmp
source ~/.bashrc
2. Install uv into home:
curl -L https://github.com/astral-sh/uv/releases/latest/download/uv-x86_64-unknown-linux-gnu.tar.gz \
-o $SCRATCH/tmp/uv.tar.gz
tar -xzf $SCRATCH/tmp/uv.tar.gz -C $SCRATCH/tmp/
mkdir -p ~/.local/bin
cp $SCRATCH/tmp/uv-x86_64-unknown-linux-gnu/uv ~/.local/bin/
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
uv --version
3. Create and activate a venv on scratch:
uv venv $SCRATCH/envs/my_env --python 3.11
source $SCRATCH/envs/my_env/bin/activate
uv pip install -r ~/scratch/path_to/requirements.txt
Home has a 50 GB quota, and large ML environments with torch + transformers can easily blow past it. Keep envs on scratch and just re-create them if they get purged.
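To soften the purge risk, a small guard in ~/.bashrc can rebuild the env whenever it disappears. This is only a sketch: it assumes uv is on PATH, and the function name, env path, and requirements location are my own placeholders.

```shell
# Recreate the scratch venv if the purge removed it, then activate it.
ensure_env () {
    local env="$SCRATCH/envs/my_env"
    if [ ! -x "$env/bin/python" ]; then   # env gone? rebuild from requirements
        uv venv "$env" --python 3.11
        uv pip install --python "$env/bin/python" -r ~/code/myproj/requirements.txt
    fi
    . "$env/bin/activate"
}
```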
SLURM Basics
Interactive session — Compute Canada (salloc):
salloc --time=1:00:00 --mem=16G --gres=gpu:1 --account=<account-pi>
Interactive session — NIBI Cloud (Mila):
ssh <username>@nibi.calculquebec.ca
salloc --time=1:00:00 --mem=16G --gres=gpu:1 --account=<account-pi>
Check your running jobs:
squeue -u $USER # all your jobs
squeue -u $USER -t RUNNING # only running
scancel <job_id> # cancel a job
Tail logs for multiple jobs at once:
for id in JOB_ID_1 JOB_ID_2 JOB_ID_3; do
    echo "=== $id ERR ==="; tail -n 5 slurm_logs/${id}_0_log.err
    echo "=== $id OUT ==="; tail -n 5 slurm_logs/${id}_0_log.out
done
Register a Jupyter kernel for your env:
TMPDIR=$SCRATCH/tmp python -m ipykernel install --user \
    --name my_env \
    --display-name "My ENV"
submitit
submitit lets you submit any Python function directly to SLURM — no sbatch scripts needed. Pair it with click for a clean CLI launcher.
Concrete example with a simple add(x, y) function — swap it for your actual training code:
import os, click, submitit

SCRATCH = os.environ["SCRATCH"]

SETUP = [
    "module purge",
    "module load StdEnv/2023 gcc/14.3 python/3.11 arrow/17.0.0",
    f"source {SCRATCH}/envs/my_env/bin/activate",
]

# ── The actual work that runs on the cluster node ──────────────
def add(x: float, y: float) -> float:
    result = x + y
    print(f"{x} + {y} = {result}")  # shows up in slurm_logs/*.out
    return result

# ── Boilerplate: build executor with resource params ───────────
def make_executor(account, gres, timeout, cpus, mem):
    ex = submitit.AutoExecutor(folder="slurm_logs")
    ex.update_parameters(
        slurm_account=account, slurm_gres=gres,
        timeout_min=timeout, cpus_per_task=cpus, mem_gb=mem,
        slurm_setup=SETUP,
    )
    return ex

# ── One @cli.command() per job type ───────────────────────────
@click.group()
def cli():
    pass

@cli.command()
@click.option("--x", default=1.0, type=float)
@click.option("--y", default=2.0, type=float)
@click.option("--account", default="def-yourpi")
@click.option("--gres", default="gpu:h100:1")
@click.option("--timeout", default=60, type=int)
@click.option("--cpus", default=2, type=int)
@click.option("--mem", default=8, type=int)
def run_add(x, y, account, gres, timeout, cpus, mem):
    """Submit add(x, y) to SLURM."""
    ex = make_executor(account, gres, timeout, cpus, mem)
    job = ex.submit(add, x, y)
    click.echo(f"Submitted job {job.job_id} ({x} + {y})")

if __name__ == "__main__":
    cli()
Running a Python script file instead of a function?
submit.py should only know about SLURM args. Use -- as a separator — everything after it gets passed directly to your script, so train.py owns its own args:
import sys, subprocess

def run_script(cmd: list[str]) -> None:
    subprocess.run(cmd, check=True)

# context_settings + UNPROCESSED lets unknown flags pass through after --
@cli.command(context_settings=dict(ignore_unknown_options=True))
@click.option("--account", default="def-yourpi")
@click.option("--gres", default="gpu:h100:1")
@click.option("--timeout", default=240, type=int)
@click.option("--cpus", default=4, type=int)
@click.option("--mem", default=32, type=int)
@click.argument("script_args", nargs=-1, type=click.UNPROCESSED)
def train(account, gres, timeout, cpus, mem, script_args):
    """Submit train.py — SLURM opts before --, script opts after."""
    cmd = [sys.executable, "train.py"] + list(script_args)
    ex = make_executor(account, gres, timeout, cpus, mem)
    job = ex.submit(run_script, cmd)
    click.echo(f"Submitted job {job.job_id}")
    click.echo(f"  slurm : {gres}, {cpus} cpus, {mem}GB, {timeout}min")
    click.echo(f"  script: {' '.join(cmd)}")
# submit.py args ──────────────┐ train.py args ──────────────┐
meghana@beluga:~/project$ python submit.py train --gres gpu:h100:1 -- --lr 3e-4 --epochs 20
Submitted job 14823999
slurm : gpu:h100:1, 4 cpus, 32GB, 240min
script: python train.py --lr 3e-4 --epochs 20
Set HF_HOME, TMPDIR, TRANSFORMERS_CACHE, etc. at the top of the submitted function using os.environ.setdefault() — they need to be set in the Python process itself, not just in the shell setup lines.
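A minimal sketch of that pattern — the function name, cache paths, and the SCRATCH fallback are my own placeholders:

```python
import os

def run_training(lr: float) -> str:
    # Runs on the compute node: set caches before any library reads them.
    scratch = os.environ.get("SCRATCH", "/tmp")
    os.environ.setdefault("HF_HOME", f"{scratch}/hf_cache")
    os.environ.setdefault("TMPDIR", f"{scratch}/tmp")
    # ... import transformers / start the actual training here, after the env is set ...
    return os.environ["HF_HOME"]

# submitted as: ex.submit(run_training, 3e-4)
```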