Job monitoring¶

squeue¶

squeue provides information about jobs in the Slurm scheduling queue and is best used for viewing jobs and job step information for active jobs.

To display only your own jobs (or those of a specific user), use: squeue -u <username>

To display the job ID and the working directory where a current job was submitted:

squeue -u <username> -o "%A %Z"

squeue has many other useful options; complete details are available via squeue --help or man squeue. For example, the following command displays details about each queued job, sorted by priority:

squeue -o "%.12Q %.10i %.8j %.8u %.15a %.12l %.19S %.4D %.4C %.12v %r" --state=PD -S "-p"

The “START_TIME” given by squeue is an estimate by the scheduler, based on the jobs currently in the queue. New high-priority jobs, or jobs ending earlier than expected, change this time, so the estimated start time is unreliable.
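
The estimate can also be queried directly with squeue's --start option. A minimal sketch, assuming a helper name (pending_starts) that is purely illustrative and not part of Slurm:

```shell
# Hypothetical helper: show the scheduler's current start-time estimates
# for the pending jobs of one user (defaults to the calling user).
pending_starts() {
    user="${1:-$(id -un)}"
    # --start lists pending jobs together with their expected start time;
    # the value is only an estimate and can change at any moment.
    squeue --start -u "$user"
}

# Usage: pending_starts <username>
```

Running it again later may well show a different time, for the reasons given above.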

Job reason codes¶

Code	Meaning
InvalidQoS	The job's QoS is invalid (does not exist).
QoSNotAllowed	The job is not allowed to run on that QoS.
PartitionConfig	The job has requirements that the partition cannot fulfil.
PartitionNodeLimit	The number of nodes required by this job is outside of its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit	The job's time limit exceeds its partition's current time limit.
Priority	One or more higher priority jobs exist for this partition or advanced reservation.
QoSJobLimit	The job's QoS has reached its maximum job count.
QoSResourceLimit	The job's QoS has reached some resource limit.
QoSTimeLimit	The job's QoS has reached its time limit.
QoSGrpNodeLimit	The job's QoS has reached the maximum number of nodes. No free nodes are available within this QoS.
QoSGrpCpuLimit	The job's QoS has reached the maximum number of available CPUs. No free CPUs are available within this QoS.
QoSGrpGRES	The job's QoS has reached the maximum number of available GPUs. No free GPUs are available within this QoS.
QoSMaxNodePerUserLimit	The maximum number of nodes allowed per user has been reached in that QoS. No free nodes are available for the user.
JobArrayTaskLimit	The job array's limit on the number of simultaneously running tasks has been reached.
Resources	The job is waiting for resources to become available.
JobHeldUser	The job has been blocked from running by the user.
JobHeldAdmin	The job has been blocked from running by an admin.
Dependency	The job is waiting for another job to finish (set up that way by the user).
None	So many jobs are waiting that the requested job has not yet been assigned a priority.
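
To see at a glance why your pending jobs are waiting, the reason codes can be tallied. The following is a sketch only: count_reasons is a hypothetical helper name, and it assumes one-word reason codes like those in the table above.

```shell
# Hypothetical helper: count jobs per reason code.
# Reads lines of the form "<jobid> <reason>" on stdin and prints
# "<count> <reason>", most frequent reason first.
count_reasons() {
    awk '{ n[$2]++ } END { for (r in n) print n[r], r }' | sort -k1,1nr -k2,2
}

# Usage (on a login node where squeue is available):
#   squeue -u <username> --state=PD --noheader -o "%i %r" | count_reasons
```

Here %i is the job ID and %r the reason, the same format specifiers used in the squeue example earlier.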

scontrol¶

scontrol can be used to obtain detailed information for a job:

scontrol show job <jobid>
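
The output of scontrol show job is a list of Key=Value pairs. When only one of them is needed in a script, it can be filtered out. A minimal sketch, assuming a hypothetical helper named job_field and a value without spaces (as is the case for fields such as JobState or Reason):

```shell
# Hypothetical helper: print the value of one Key=Value field
# from "scontrol show job" output read on stdin.
# Assumes the value itself contains no whitespace.
job_field() {
    # Put every Key=Value pair on its own line, then pick the requested key.
    tr -s ' \t' '\n\n' |
        awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Usage: scontrol show job <jobid> | job_field JobState
```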

sinfo¶

sinfo reports partition status and node information:

sinfo --partition=zen2_0256_a40x2

To see all available partitions, run sinfo -o %P
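
For a compact view of how many nodes of a partition are in each state, the output format can be narrowed down. A sketch under the assumption that a small wrapper is wanted; partition_states is a hypothetical name:

```shell
# Hypothetical helper: print "<state> <node count>" lines for one partition.
partition_states() {
    # %t = node state (compact form), %D = number of nodes in that state
    sinfo --noheader --partition="$1" -o "%t %D"
}

# Usage: partition_states zen2_0256_a40x2
```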

Cancel a job¶

To cancel a job, use scancel:

scancel <jobid>
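
scancel can also select jobs by user and state instead of by job ID. A minimal sketch, assuming a hypothetical wrapper named cancel_pending; use it with care, since it cancels every pending job of the given user:

```shell
# Hypothetical helper: cancel all PENDING jobs of one user
# (defaults to the calling user). Running jobs are left untouched.
cancel_pending() {
    user="${1:-$(id -un)}"
    scancel --user="$user" --state=PENDING
}

# Usage: cancel_pending <username>
```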

lastjobs¶

The command lastjobs is a wrapper around sacct that, by default, displays the last 10 jobs submitted in the last month. Run lastjobs -h for more details.
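
If a similar listing is needed from plain sacct, something along the following lines may serve. This is not the actual implementation of lastjobs; the helper name recent_jobs, the chosen columns, and the reliance on sacct's own output order are all assumptions.

```shell
# Hypothetical helper: approximate a "recent jobs" listing with plain sacct.
# $1 = user (defaults to the calling user), $2 = number of jobs (default 10)
recent_jobs() {
    user="${1:-$(id -un)}"
    count="${2:-10}"
    # -X            one line per job allocation (job steps are omitted)
    # -S now-30days only consider jobs since 30 days ago
    sacct -X --noheader -u "$user" -S now-30days \
          --format=JobID,JobName,Partition,State,Elapsed | tail -n "$count"
}

# Usage: recent_jobs <username> 10
```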

Node status¶

Once Slurm has assigned a node to a job, you can ssh into that node and watch its activity with the htop command. For example, if node n3504-05 is occupied by your job:

[...@... ~]$ ssh n3504-05
[...@... ~]$ htop
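
To find out which nodes a job is running on before connecting, the node list can be taken from squeue and expanded with scontrol show hostnames. A sketch, assuming a hypothetical helper named job_nodes:

```shell
# Hypothetical helper: list the nodes allocated to a job, one per line.
job_nodes() {
    # %N prints the job's node list, possibly in compressed form
    nodelist=$(squeue --noheader -j "$1" -o "%N")
    if [ -n "$nodelist" ]; then
        # scontrol expands a compressed node list into individual host names
        scontrol show hostnames "$nodelist"
    fi
}

# Usage: job_nodes <jobid>
```

Any of the printed host names can then be used as the ssh target.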