Job monitoring¶
squeue¶
squeue provides information about jobs in the Slurm scheduling queue and is best used for viewing jobs and job step information for active jobs.
It is possible to display only the information for your own jobs (or those of a specific user) by using:
squeue -u <username>
To display the job ID and the working directory where a current job was submitted:
squeue -u <username> -o "%A %Z"
squeue has many other useful options; complete details can be found by running squeue --help or man squeue. For example, the following command will display details about each queued job, sorted by priority:
squeue -o "%.12Q %.10i %.8j %.8u %.15a %.12l %.19S %.4D %.4C %.12v %r" --state=PD -S "-p"
The “START_TIME” given by squeue is an estimate by the scheduler, based on the jobs currently in the queue. New high-priority jobs, or jobs ending earlier than expected, alter this time, so this estimated start time is extremely unreliable.
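For a quick look at the scheduler's current estimate for a pending job, squeue can also print the expected start time directly (the same caveat about reliability applies); one possible invocation is:
squeue --start -j <jobid>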
Job reason codes¶
Code | Meaning |
---|---|
InvalidQoS | The job's QoS is invalid (does not exist). |
QoSNotAllowed | The job is not allowed to run on that QoS. |
PartitionNodeLimit | The number of nodes required by this job is outside of its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED. |
PartitionTimeLimit | The job's time limit exceeds its partition's current time limit. |
Priority | One or more higher priority jobs exist for this partition or advanced reservation. |
QoSJobLimit | The job's QoS has reached its maximum job count. |
QoSResourceLimit | The job's QoS has reached some resource limit. |
QoSTimeLimit | The job's QoS has reached its time limit. |
QoSGrpNodeLimit | The job's QoS has reached the maximum number of nodes. No free nodes are available within this QoS. |
QoSGrpCpuLimit | The job's QoS has reached the maximum number of available CPUs. No free CPUs are available within this QoS. |
QoSMaxNodePerUserLimit | The maximum number of nodes allowed per user has been reached in that QoS. No free nodes are available for the user. |
Resources | The job is waiting for resources to become available. |
None | There are so many jobs waiting that the requested job cannot even get a priority. |
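To check which reason code currently applies to your pending jobs, a squeue format string such as the following can be used (the field selection here is just one possible choice):
squeue -u <username> --state=PD -o "%.12i %.8T %r"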
scontrol¶
scontrol can be used to obtain detailed information for a job:
scontrol show job <jobid>
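The output of scontrol show job is quite verbose; when only a few fields are of interest it can be filtered with grep, for example (the chosen fields are illustrative):
scontrol show job <jobid> | grep -E 'JobState|Reason|StartTime|NodeList'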
sinfo¶
sinfo shows partition status and node information:
sinfo --partition=zen2_0256_a40x2
To see all the available partitions, run sinfo -o %P
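A custom format string can combine partition and node state information in one view; for instance (the field selection is just an example):
sinfo -o "%.20P %.5a %.12l %.6D %.6t"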
Cancel a job¶
To cancel a job, use scancel:
scancel <jobid>
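scancel also accepts filters, which is convenient for cancelling several jobs at once; for example, to cancel all of your pending jobs:
scancel -u <username> --state=PENDING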
lastjobs¶
The command lastjobs is a wrapper around sacct that, by default, displays the last 10 jobs submitted within the last month. Run lastjobs -h for more details.
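For job histories beyond what lastjobs shows by default, sacct can be queried directly; a typical invocation (the field list is just an example) is:
sacct -u <username> --starttime=<YYYY-MM-DD> --format=JobID,JobName,Partition,State,Elapsed,ExitCode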
Node status¶
Once a node has been assigned to a job by Slurm, one can ssh into the node and observe the node activity visually with the htop command. Say, for example, n3504-05 is occupied; then:
[...@... ~]$ ssh n3504-05
[...@... ~]$ htop
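If you are unsure which node(s) a running job occupies, squeue can print the allocated node list before you ssh in, for example:
squeue -j <jobid> -o "%N"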