Submitting jobs

VSC uses Slurm for cluster/resource management and job scheduling. Slurm is responsible for allocating resources to users, providing a framework for starting, executing and monitoring work on allocated resources and scheduling work for future execution.

What is a batch job?

A batch job is a non-interactive way to run an application in a pre-determined way. What happens during the batch job is controlled by the job script. When a batch job is submitted to the system, it is put in a queue, and is then started at a later time. With this approach one can queue many batch jobs at the same time, which will start automatically once resources are available.

Preparing a batch job:

  • Copy any needed input files.
  • Load any modules your application needs by placing the corresponding “module load” commands in your job script (see the sketch after this list).
  • Include the job options (e.g. the amount of memory, number of CPU cores, maximum wall time) in the job script.
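
A minimal sketch of such a job script might look like this (the account, module name and resource values below are only placeholders):

#!/bin/bash
#SBATCH -J my_job              # job name
#SBATCH -A your_account        # project to charge
#SBATCH -N 1                   # number of nodes
#SBATCH -t 01:00:00            # maximum wall time

# load the modules the application needs (placeholder module name)
module load some_application/1.0

./my_program input.dat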

Submitting a batch job:

sbatch is the command to submit a job to the queue for execution. When you submit a job you will get a job ID to identify it, and Slurm will:

  • allocate resources (nodes, tasks, partition, etc.)
  • run a single copy of the batch script on the first allocated node

Here is an example job script:

#!/bin/bash
#SBATCH -A your_account
#SBATCH -J jobname
#SBATCH -t 00:10:00
#SBATCH --partition=zen3_0512
#SBATCH --qos=zen3_0512
#SBATCH -N 2
#SBATCH --exclusive

#execute your program
./my_program
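
Assuming the script above is saved as, for example, job.sh, it can be submitted and monitored like this (the job ID shown is illustrative):

sbatch job.sh
# Submitted batch job 1234567

squeue -u $USER     # show your jobs and their state in the queue
scancel 1234567     # cancel the job if it is no longer needed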

Info

  • Core hours will be charged to the specified account.
  • Account, partition, and qos have to be compatible.
  • If the account is not given, the default account will be used.
  • If partition and QOS are not given, default values are zen3_0512 for both on VSC-5 and skylake_0096 for both on VSC-4.
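
To check which account, partition and QOS combinations you may use, a query along these lines should work (a standard Slurm accounting command; the format fields can be adjusted):

sacctmgr show associations user=$USER format=account%20,partition,qos%40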

Warning

You must submit the job in the correct partition!
For VSC-5:
  • zen partitions for CPU jobs
  • cuda-zen partitions for GPU jobs
This allows you to load the correct modules.

One way to avoid errors is to specify the module tree directly with the spackup command in the submission script, for example:

#!/bin/bash
#SBATCH --partition=zen2_0256_a40x2
#SBATCH --qos=zen2_0256_a40x2
#SBATCH -J cp2k_test
#SBATCH --gres=gpu:1
#SBATCH -t 00:05:00

spackup cuda-zen
module load cp2k/master-gcc-11.4.0-nmxvrkm

mpirun -n 2 cp2k.popt -i argon.inp

Commonly used options

Option (long form)    Option (short form)   Meaning                                            Format
--time                -t                    maximum wall time                                  days-hours:minutes:seconds (not all parts are required)
--time-min            (none)                minimum wall time                                  days-hours:minutes:seconds (not all parts are required)
--mem                 (none)                total memory required                              xxxM or xxxG
--mem-per-cpu         (none)                memory per task (= core)                           xxxM or xxxG
--nodes               -N                    number of nodes
--ntasks              -n                    number of tasks
--ntasks-per-node     (none)                number of tasks run in parallel on a single node
--ntasks-per-core     (none)                number of tasks a single core should work on
--cpus-per-task       -c                    number of processors per task
--gres=gpu:           -G                    total number of GPUs                               allowed values are 1 or 2
--partition           -p                    partition name
--qos                 -q                    quality of service (QOS)
--account             -A                    project to charge for this job
--job-name            -J                    name of job
--mail-user           (none)                sends an e-mail to this address                    mail address
--mail-type           (none)                sends an e-mail at specific events                 BEGIN, END, FAIL, REQUEUE, ALL
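
As an illustration, a hypothetical job script combining several of these options could look like this (account, resource values and e-mail address are placeholders):

#!/bin/bash
#SBATCH -A your_account
#SBATCH -J option_demo
#SBATCH --partition=zen3_0512
#SBATCH --qos=zen3_0512
#SBATCH -N 1                         # one node
#SBATCH --ntasks-per-node=32         # 32 tasks on that node
#SBATCH --cpus-per-task=1            # one core per task
#SBATCH --mem=64G                    # total memory for the job
#SBATCH -t 0-02:00:00                # 2 hours wall time
#SBATCH --mail-user=you@example.org
#SBATCH --mail-type=END,FAIL

./my_program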

Info

If you do not request a full node, the --mem option must be specified.

Note

--mem-per-cpu should be set such that: mem_per_cpu * ntasks < memory of the node - 2 GB
The reduction of approximately 2 GB in available memory is due to the operating system, which is kept in memory.
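
For example, on a node with roughly 512 GB of memory (as the partition name zen3_0512 suggests), about 510 GB remain usable, so a 16-task job could request up to about 31 GB per task (the numbers are illustrative):

#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=31G      # 16 * 31 GB = 496 GB < 512 GB - 2 GB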

Tip

VSC has many example job scripts available at VSC's public GitLab.

Where to run

You should run your jobs in, and store their data in, the $DATA directory. We don't recommend running from your $HOME directory, since storage there is limited.

Scratch?

We don't have a scratch file system.

Single node runs

For single node runs, it is possible to use /local or /tmp.

Warning

The data in /local or /tmp is NOT stored permanently! It is deleted after the job finishes. You must copy the files you want to keep from your runs to your $HOME or $DATA.
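
A common pattern for single node runs is therefore to stage data to the node-local disk, run there, and copy the results back before the job ends. A minimal sketch (directory and file names are placeholders):

#!/bin/bash
#SBATCH -J local_disk_demo
#SBATCH -N 1
#SBATCH -t 02:00:00

# create a working directory on the node-local disk
WORKDIR=/tmp/$SLURM_JOB_ID
mkdir -p $WORKDIR

# stage the input, run, and copy the results back to permanent storage
cp $DATA/myproject/input.dat $WORKDIR/
cd $WORKDIR
./my_program input.dat
cp results.out $DATA/myproject/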

Wall time

The “wall time” limit determines how long your job may run (in actual hours, not core hours) before it is terminated by the system.

Each QOS's run time limit can also be queried via the command

sacctmgr show qos  format=name%20s,priority,grpnodes,maxwall,description%40s

Tip

It is good practice to include the wall time in your submission script; it can reduce the time your job spends waiting in the queue. The default maximum wall time is 1 day. Use the #SBATCH --time=DD-HH:MM:SS flag to specify an appropriate time.

If your job ends before the time limit is up, your project is only charged for the time actually used. Note that short jobs can start faster thanks to backfill, while jobs that require more resources may take longer to start.

Please keep in mind that it is wise to add a margin to the wall time setting to prevent jobs from failing if they run slightly slower than expected for some unforeseen reason.
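
For example, for a job that is expected to take about four hours, requesting five hours leaves a reasonable margin (the value is illustrative):

#SBATCH --time=0-05:00:00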

Testing your batch job

It is a good idea to test that your script works, instead of finding out that it contains an error after two days of waiting in the queue.

VSC has QOS reserved for testing and development; check the VSC-5 QOS/VSC-4 QOS pages. These can be used to quickly check your job script before submitting it to the normal queue, where your job might wait for hours or days before it starts, only for you to discover a simple error in the job script.
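
For a quick test, the development QOS can be given on the command line, which overrides the values in the script (the QOS name here is only an example; check the linked pages for the names valid on your cluster):

sbatch --qos=zen3_0512_devel --time=00:10:00 job.sh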

Warning

Do not run production jobs on the development QOS. They are a shared resource for all VSC users.
DO NOT ABUSE THE DEVEL QOS.

You can also use an interactive session via the interactivejobs command to check your script. interactivejobs accepts the same command line options as sbatch. Information about how to use the interactivejobs command can be found here.

The advantage of testing batch jobs in an interactive session is that you can quickly fix a bug, re-run the script, find the next bug, fix it, and so on. This can speed up debugging job scripts considerably compared to submitting them normally.
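
Since interactivejobs accepts the same options as sbatch, a short interactive test session might be requested along these lines (partition, QOS and time are placeholders):

interactivejobs -N 1 --partition=zen3_0512 --qos=zen3_0512 -t 00:30:00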