
Out of memory (OOM)

Symptoms

Out of memory (OOM) problems are not easy to diagnose: they can end a job prematurely or even immediately, and a job can fail for many other reasons as well. Sometimes one is lucky and the Slurm output contains a slurmstepd: Exceeded step memory limit message, but not always.

  • squeue displays a CG (completing) status that the job never leaves (the job cannot be cancelled with scancel and an admin has to reboot the node).
  • The job stops making progress, but continues to run until it hits the walltime limit.
  • The job gets killed by the queue system.
  • A NODE_FAIL status is usually an indication of out of memory.

Note

A compute node that has 64 GB RAM does not use all of it for running applications; part of it is reserved for the operating system, disk cache, etc. Use a safe margin when estimating memory requirements.
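
As a minimal sketch, a job whose peak usage is estimated at roughly 50 GB could request somewhat more via the standard Slurm --mem option. The value is only illustrative, and whether per-job memory requests are honoured depends on the cluster's configuration:

#SBATCH --mem=60G    # request ~20% more than the estimated 50 GB peak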

How to check for OOM

Check the memory usage while the job is running

One can ssh to the job's compute nodes and monitor the memory usage in real time with the top command, as sketched below.
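
A minimal sketch of this workflow, with the job ID and node name as placeholders:

$ squeue -j <jobid> -o "%N"    # list the node(s) allocated to the job
$ ssh <nodename>               # log in to one of them
$ top -u $USER                 # the RES column shows resident memory per process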

Seff

seff is a command line tool that reports how efficiently a completed job used its resources. It displays data that Slurm collected while the job was running. Note that the data is sampled at regular intervals and can miss short peaks in memory usage.

$ seff 1003731
Job ID: 1003731
Cluster: vsc4
User/Group: 
State: FAILED (exit code 137)
Nodes: 1
Cores per node: 96
CPU Utilized: 00:59:47
CPU Efficiency: 43.45% of 02:17:36 core-walltime
Job Wall-clock time: 00:01:26
Memory Utilized: 94.25 GB
Memory Efficiency: 0.00% of 0.00 MB
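
The exit code 137 corresponds to 128 + 9, i.e. the process was killed with SIGKILL, the signal the kernel's OOM killer sends. Together with the roughly 94 GB of memory utilized, close to the 96 GB of a standard VSC-4 node, this points to the job having run out of memory.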

Possible solutions

  • Check the memory usage while the job is running: ssh to one or more of the job's compute nodes and monitor memory usage in real time with the top or htop command.

  • Use nodes with more memory. VSC-5 has nodes with 512 GB, 1 TB and 2 TB of memory, and VSC-4 has nodes with 96, 384 and 768 GB. The larger the memory, the fewer nodes are available, so requesting nodes with more memory may mean longer queue waiting times.

  • Use fewer cores per compute node. A parallel application can request fewer ranks per node and either run on more nodes or accept a longer runtime (see the sketch after this list).

  • Use more compute nodes. This option is less efficient and uses more core hours for the job.
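
As a sketch of the last two options, the number of MPI ranks per node can be halved while the node count is doubled, so each rank has roughly twice as much memory available. The values and the application name are only illustrative and assume an MPI job launched with srun:

#SBATCH --nodes=2              # instead of --nodes=1
#SBATCH --ntasks-per-node=48   # instead of 96 ranks on a single node

srun ./my_application          # same total number of ranks, twice the memory per rank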