Job monitoring

squeue

squeue provides information about jobs in the Slurm scheduling queue and is best used for viewing jobs and job step information for active jobs.

To display only your own jobs (or those of a specific user), use: squeue -u <username>

To display the job ID and the working directory where a current job was submitted:

squeue -u <username> -o "%A %Z"

squeue has many other useful options; for complete details, run squeue --help or man squeue. For example, the following command displays details about each pending job, sorted by descending priority:

 squeue -o "%.12Q %.10i %.8j %.8u %.15a %.12l %.19S %.4D %.4C %.12v %r" --state=PD -S "-p" 

The “START_TIME” given by squeue is only an estimate made by the scheduler, based on the jobs currently in the queue. New high-priority jobs, or jobs that end earlier than expected, change this time, so the estimated start time should not be relied on.
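As a minimal sketch, the estimated start times can be listed with squeue's --start option. The function name pending_start_times is only illustrative and is not part of Slurm; the format string is one possible choice.

```shell
# Sketch: list the scheduler's estimated start times for one user's pending jobs.
# pending_start_times is a hypothetical helper name, not a Slurm command.
pending_start_times() {
    local user="${1:-$USER}"
    # --start asks squeue for the expected start time of pending jobs.
    # %i = job ID, %j = job name, %S = (estimated) start time, %r = reason.
    squeue --start -u "$user" -o "%.10i %.20j %.19S %r"
}
```

Calling pending_start_times <username> prints one line per pending job. The times shown are the same scheduler estimates described above and can change at any moment.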

Job reason codes

Code Meaning
InvalidQoS The job's QoS is invalid (does not exist).
QoSNotAllowed The job is not allowed to run on that QoS.
PartitionConfig The job has requirements that are not fulfilled on that partition.
PartitionNodeLimit The number of nodes required by this job is outside of its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit The job's time limit exceeds its partition's current time limit.
Priority One or more higher priority jobs exist for this partition or advanced reservation.
QoSJobLimit The job's QoS has reached its maximum job count.
QoSResourceLimit The job's QoS has reached some resource limit.
QoSTimeLimit The job's QoS has reached its time limit.
QoSGrpNodeLimit The job's QoS has reached the maximum number of nodes. No free nodes are available within this QoS.
QoSGrpCpuLimit The job's QoS has reached the maximum number of available CPUs. No free CPUs are available within this QoS.
QoSGrpGRES The job's QoS has reached the maximum number of available GPUs. No free GPUs are available within this QoS.
QoSMaxNodePerUserLimit The maximum number of nodes allowed per user has been reached in that QoS. No free nodes are available for the user.
JobArrayTaskLimit The job array's limit on the number of simultaneously running tasks has been reached.
Resources The job is waiting for resources to become available.
JobHeldUser The job has been blocked from running by the user.
JobHeldAdmin The job has been blocked from running by an admin.
Dependency The job is waiting for another job to finish (set up that way by the user).
None There are so many jobs waiting that the requested job cannot even get a priority.
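
To see at a glance which of the reason codes above apply to your pending jobs, the reasons can be counted. This is a rough sketch, assuming a helper named pending_reason_summary (a made-up name) and the standard squeue options -h, -t and -o:

```shell
# Sketch: count one user's pending jobs per reason code.
# pending_reason_summary is a hypothetical helper name, not a Slurm command.
pending_reason_summary() {
    local user="${1:-$USER}"
    # -h drops the header, -t PD keeps only pending jobs, %r prints the reason.
    squeue -h -u "$user" -t PD -o "%r" |
        # Count identical reason lines and print "<count> <reason>".
        awk '{ count[$0]++ } END { for (r in count) printf "%d %s\n", count[r], r }' |
        # Most frequent reason first.
        sort -rn
}
```

Running pending_reason_summary <username> prints one line per reason, with the number of pending jobs that currently report it.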

scontrol

scontrol can be used to obtain detailed information for a job:

scontrol show job <jobid>
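
The full record is long, so it can help to filter it. The sketch below keeps only a few key=value pairs; the helper name job_summary and the choice of fields are assumptions made for illustration, and values that contain spaces would be cut at the first space.

```shell
# Sketch: print a few selected fields of a job record.
# job_summary is a hypothetical helper name, not a Slurm command.
job_summary() {
    local jobid="$1"
    # scontrol prints the record as KEY=VALUE pairs separated by spaces;
    # grep -o prints each matching pair on its own line.
    scontrol show job "$jobid" |
        grep -oE '(JobState|Reason|StartTime|WorkDir)=[^ ]+'
}
```

For example, job_summary <jobid> shows the job state, the reason code, the start time, and the working directory.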

sinfo

sinfo reports partition status and node information:

sinfo --partition=zen2_0256_a40x2

To see all available partitions, run sinfo -o "%P".
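
As a small sketch, the output can be condensed to one line per node state, which shows quickly how many nodes of a partition are idle, allocated, or down. The helper name partition_node_states is hypothetical; %t and %D are standard sinfo format fields.

```shell
# Sketch: show how many nodes of a partition are in each state.
# partition_node_states is a hypothetical helper name, not a Slurm command.
partition_node_states() {
    local partition="$1"
    # -h drops the header; %t is the compact node state, %D the node count.
    sinfo -h --partition="$partition" -o "%t %D"
}
```

For example, partition_node_states zen2_0256_a40x2 prints one line per state for that partition.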

Cancel a job

To cancel a job, use scancel:

scancel <jobid>

lastjobs

The command lastjobs is a wrapper around sacct that, by default, displays the last 10 jobs submitted within the last month. Run lastjobs -h for more details.

Node status

Once Slurm has assigned a node to your job, you can ssh into that node and watch its activity with htop. For example, if your job is running on n3504-05:

[...@... ~]$ssh n3504-05
[...@... ~]$htop
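
To find out which nodes your running jobs occupy before connecting, squeue can print the node list. This is a minimal sketch; the helper name running_job_nodes is an assumption made for illustration.

```shell
# Sketch: list job IDs and allocated nodes for one user's running jobs.
# running_job_nodes is a hypothetical helper name, not a Slurm command.
running_job_nodes() {
    local user="${1:-$USER}"
    # -h drops the header, -t R keeps only running jobs;
    # %i is the job ID, %N the allocated node list.
    squeue -h -u "$user" -t R -o "%i %N"
}
```

Multi-node jobs are shown with a compressed node list; scontrol show hostnames <nodelist> expands such a list into one hostname per line.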