
Job monitoring

squeue

squeue reports the state of jobs in the Slurm scheduling queue and is best used for viewing job and job step information for active jobs.

To display information only for your own jobs (or those of a specific user), use: squeue -u <username>

To display the job ID and the working directory where a current job was submitted:

squeue -u <username> -o "%A %Z"
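The two-column output above is easy to post-process with standard tools. A minimal sketch, using sample lines in place of live output (in practice, pipe squeue -u <username> -h -o "%A %Z" into the loop; the job IDs and paths here are made up):

```shell
# Print each job ID with the basename of its submit directory.
# The here-doc stands in for real `squeue -h -o "%A %Z"` output.
while read -r jobid workdir; do
    echo "job $jobid was submitted from ${workdir##*/}"
done <<'EOF'
123456 /home/user/project_a
123457 /home/user/project_b
EOF
```

The -h flag suppresses the header line, which keeps the output machine-readable.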

squeue has many other useful options; complete details can be found by running squeue --help or man squeue. For example, the following command displays details about each queued job, sorted by priority:

squeue -o "%.12Q %.10i %.8j %.8u %.15a %.12l %.19S %.4D %.4C %.12v %r" --state=PD -S "-p"

The "START_TIME" reported by squeue (shown for pending jobs by squeue --start) is an estimate by the scheduler, based on the jobs currently in the queue. Newly submitted high-priority jobs, or jobs ending earlier than expected, can change this time; the estimated start time is therefore extremely unreliable.

Job reason codes

Code                     Meaning
InvalidQoS               The job's QoS is invalid (does not exist).
QoSNotAllowed            The job is not allowed to run in that QoS.
PartitionNodeLimit       The number of nodes required by this job is outside its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit       The job's time limit exceeds its partition's current time limit.
Priority                 One or more higher-priority jobs exist for this partition or advanced reservation.
QoSJobLimit              The job's QoS has reached its maximum job count.
QoSResourceLimit         The job's QoS has reached some resource limit.
QoSTimeLimit             The job's QoS has reached its time limit.
QoSGrpNodeLimit          The job's QoS has reached its maximum number of nodes. No free nodes are available within this QoS.
QoSGrpCpuLimit           The job's QoS has reached its maximum number of CPUs. No free CPUs are available within this QoS.
QoSMaxNodePerUserLimit   The maximum number of nodes allowed per user has been reached in that QoS. No free nodes are available for the user.
Resources                The job is waiting for resources to become available.
None                     There are so many jobs waiting that the requested job cannot yet be assigned a priority.
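To get a quick overview of why jobs are pending, the reason codes can be tallied with standard tools. A sketch using sample data (in practice, replace the here-doc with the output of squeue --state=PD -h -o "%r"):

```shell
# Tally pending jobs by reason code, most common first.
# The here-doc stands in for real `squeue --state=PD -h -o "%r"` output.
sort <<'EOF' | uniq -c | sort -rn
Priority
Resources
Priority
QoSGrpCpuLimit
EOF
```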

scontrol

scontrol can be used to obtain detailed information for a job:

scontrol show job <jobid>
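Since scontrol prints space-separated Key=Value pairs, a single field can be extracted with grep. A sketch using a sample line in place of the real output (the job details shown are made up):

```shell
# Pull one field out of `scontrol show job <jobid>` output.
# The echo stands in for the real command; pipe its output instead.
echo 'JobId=123456 JobName=myjob JobState=RUNNING RunTime=01:02:03' \
    | tr ' ' '\n' | grep '^JobState='
```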

sinfo

sinfo reports partition status and node information:

sinfo --partition=zen2_0256_a40x2

To list all available partitions, run sinfo -o "%P"
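sinfo output can also be summarized per node state. A sketch over sample data (in practice, replace the here-doc with the output of sinfo -h -o "%P %t %D"; the partition names and counts here are made up):

```shell
# Sum node counts per state (columns: partition, state, node count).
# The here-doc stands in for real `sinfo -h -o "%P %t %D"` output.
awk '{ n[$2] += $3 } END { for (s in n) print s, n[s] }' <<'EOF'
zen2_0256_a40x2 idle 4
zen2_0256_a40x2 alloc 10
zen3_0512 alloc 6
EOF
```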

Cancel a job

To cancel a job, use scancel:

scancel <jobid>
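Several jobs can be cancelled in one go by feeding a list of job IDs through xargs. A sketch with made-up IDs, where echo shows the command that would run instead of executing it (in practice, generate the list with squeue -u <username> -h -t PD -o "%A" and drop the echo):

```shell
# Build a scancel command from a list of job IDs.
# The printf stands in for real `squeue -h -t PD -o "%A"` output;
# `echo` is kept so the sketch prints the command rather than running it.
printf '1001\n1002\n' | xargs -r echo scancel
```

Note that scancel can also filter directly, e.g. scancel -u <username> -t PENDING cancels all of a user's pending jobs.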

lastjobs

The command lastjobs is a wrapper around sacct that, by default, displays the 10 most recently submitted jobs from the last month. Run lastjobs -h for more details.

Node status

Once Slurm has assigned a node to a job, you can ssh into that node and observe its activity with the htop command. For example, if node n3504-05 is occupied by your job:

[...@... ~]$ ssh n3504-05
[...@... ~]$ htop