Memory Options

We recommend setting both mem and vmem options when submitting batch jobs.

For single-node jobs, the mem option directs the scheduler to put your job on a node with at least mem bytes of memory. If your job exceeds mem, then your job will swap but continue to run.  The vmem option directs the scheduler to reserve vmem bytes of memory+swap on a node.  If your job exceeds vmem, the system will begin killing your processes until the memory usage is reduced below vmem.  So, set mem to the total amount of memory you expect your job to use at any one time and set vmem to 1.5 * mem.

-l mem=16gb,vmem=24gb
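
The 1.5x rule of thumb above can be computed in a submit script. A minimal sketch, assuming you pick mem yourself and use shell integer arithmetic (multiplying by 3 and dividing by 2 stands in for 1.5x):

```shell
# Derive a vmem request from a chosen mem value using the 1.5x rule of thumb.
mem_gb=16
vmem_gb=$(( mem_gb * 3 / 2 ))   # 1.5 * mem, done with integer arithmetic
echo "-l mem=${mem_gb}gb,vmem=${vmem_gb}gb"   # -l mem=16gb,vmem=24gb
```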

If you are submitting a multi-node batch job, the scheduler will divide mem and vmem by the number of nodes requested and put your job on nodes with at least mem/nodes bytes of memory.  If your job exceeds mem/nodes bytes it will start swapping but continue to run.  If it exceeds vmem/nodes, the system will begin killing your processes until the memory usage is reduced below vmem/nodes.  So, set mem to the total amount of memory you expect your job to use at any one time and set vmem to 1.5 * mem.

-l nodes=2:ppn=4,mem=64gb,vmem=96gb
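
For that multi-node request, the per-node limits the scheduler effectively enforces can be sketched with a little shell arithmetic:

```shell
# Per-node limits for -l nodes=2:ppn=4,mem=64gb,vmem=96gb:
# the scheduler divides mem and vmem by the number of nodes requested.
nodes=2
mem_gb=64
vmem_gb=96
echo "per-node memory limit:      $(( mem_gb / nodes ))gb"    # 32gb
echo "per-node memory+swap limit: $(( vmem_gb / nodes ))gb"   # 48gb
```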

The default for all of the memory options, such as mem and vmem, is unlimited.  This is why we ask that you set limits on your jobs, so that resources remain available for other users.

In the examples below, we use a fictional program called memgrab, whose -s option is the amount of memory it allocates in gigabytes.  It is meant as a stand-in for a real memory-hungry application such as CASA.

 

-l mem

For single-node jobs, mem is the maximum amount of memory expected to be used by all the processes in the job combined. If the total amount of memory used by all the processes in the job combined exceeds mem, those processes will swap.

For multi-node jobs, the scheduler divides mem by the number of nodes requested and will use that as the maximum amount of memory expected to be used by all the processes on each node in the job.  If the amount of memory used by the job on any given node exceeds mem/nodes then those processes will swap.

  • Scheduler will not kill the job if any process or the whole job exceeds mem, but it will swap.
  • Sets "data seg size" in ulimit a.k.a. "Max data size" in /proc/$$/limits only if nodes=1 (the default).
  • Sets "max memory size" in ulimit a.k.a. "Max resident set" in /proc/$$/limits only if nodes=1 (the default).
  • Sets "memory.limit_in_bytes" in the memory cgroup to mem/nodes.
#!/bin/sh

#PBS -l nodes=1:ppn=2,mem=10gb,vmem=15gb

# Creates a cgroup on each node requested (E.g. 1) with a memory limit of mem/nodes (E.g. 10gb).

memgrab -s 8 &       # 8GB < 10GB so it doesn't swap, yet
memgrab -s 8         # 16GB > 10GB so both processes start swapping
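
Since the bullets above describe what mem sets in ulimit and /proc/$$/limits, you can verify the limits a job actually received from inside the job. A minimal sketch, assuming a Linux node (the labels are kernel-provided):

```shell
# Inspect the limits the batch system applied to this shell.
ulimit -a | grep -i 'data seg size'
grep -E 'Max (data size|resident set|address space)' /proc/$$/limits
```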

 

-l vmem

For single-node jobs, vmem is the maximum amount of memory+swap allowed for all processes in the job combined.  If any process exceeds vmem, the scheduler will kill it.  If the total amount of memory used by all the processes in the job combined exceeds vmem, then oom-killer will kill processes until the usage falls below vmem.

For multi-node jobs, the scheduler divides vmem by the number of nodes requested and will use that as the maximum amount of memory+swap allowed by all the processes on each node in the job.  If any process exceeds vmem/nodes, the scheduler will kill it.  If the amount of memory used by the job on any given node exceeds vmem/nodes then oom-killer will kill processes in that job on that node until the usage falls below vmem/nodes.

  • Linux will kill any process that exceeds vmem with a Segmentation fault.  This may end the job with "Exit_status=-10".
  • Either the Scheduler or the Linux oom-killer will kill the job if the whole job exceeds vmem/nodes.
  • Sets "virtual memory" in ulimit a.k.a. "Max address space" in /proc/$$/limits to vmem.
  • Sets "memory.limit_in_bytes" and "memory.memsw.limit_in_bytes" in memory cgroup to vmem/nodes.
#!/bin/sh

#PBS -l nodes=1:ppn=2,mem=10gb,vmem=15gb

# Creates a cgroup on each node requested (E.g. 1) with a memory+swap limit of vmem/nodes (E.g. 15gb).

memgrab -s 12 &      # 12GB > 10GB mem so it swaps, but < 15GB vmem so it isn't killed, yet
memgrab -s 12        # both memgrabs total 24GB > 15GB so oom-killer will kill one

 

-l pmem

For both single-node and multi-node jobs, pmem is the maximum amount of memory expected to be used per processor, per node.  If asking for multiple processors (via ppn) then the scheduler will multiply pmem by the number of processors requested and look for that much available memory.  For example, if you use -l nodes=1,ppn=2,pmem=3gb then the scheduler will look for one node with 6GB of memory available.  If you use -l nodes=2,ppn=4,pmem=3gb then the scheduler will look for two nodes, each with 12GB of memory available.

If the total amount of memory used by all the processes combined on any node in the job exceeds pmem*ppn, then those processes on that node will swap.  Processes on a node can exceed pmem without swapping as long as the total stays under pmem*ppn.
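
The pmem*ppn totals the scheduler looks for in the two example requests above can be sketched as:

```shell
# Memory the scheduler seeks per node under pmem: pmem * ppn.
pmem_gb=3
ppn=2; echo "-l nodes=1,ppn=2,pmem=3gb -> $(( pmem_gb * ppn ))gb per node"   # 6gb
ppn=4; echo "-l nodes=2,ppn=4,pmem=3gb -> $(( pmem_gb * ppn ))gb per node"   # 12gb
```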

  • Scheduler will not kill the job if any process or the whole job exceeds pmem*ppn, but it will swap.
  • Sets "data seg size" in ulimit a.k.a. "Max data size" in /proc/$$/limits to pmem.
  • Sets "max memory size" in ulimit a.k.a. "Max resident set" in /proc/$$/limits to pmem.
  • Sets "memory.limit_in_bytes" in memory cgroup to pmem*ppn.
#!/bin/sh

#PBS -l nodes=1:ppn=2,pmem=16gb

# Creates a cgroup on each node requested (E.g. 1) with a memory limit of pmem*ppn (E.g. 32gb).

memgrab -s 17 &       # 17GB < 32GB so it doesn't swap, yet
memgrab -s 17        # 34GB > 32GB so both processes start swapping

 

-l pvmem

For both single-node and multi-node jobs, pvmem is the maximum amount of memory+swap allowed by any single process in the job.  If asking for multiple processors (via ppn) then the scheduler will multiply pvmem by the number of processors requested and look for that much available memory+swap.

If any process exceeds pvmem (not pvmem*ppn), it will be killed but the job will continue.

If the total amount of memory used by all the processes combined in the job exceeds pvmem*ppn, then oom-killer will kill processes until the usage falls below pvmem*ppn.

  • Linux will kill any process that exceeds pvmem with a Segmentation fault.  This may end the job with "Exit_status=137" or "Exit_status=139" or "Exit_status=255".
  • Sets "virtual memory" in ulimit a.k.a. "Max address space" in /proc/$$/limits to pvmem.
  • Sets "memory.limit_in_bytes" and "memory.memsw.limit_in_bytes" in memory cgroup to pvmem*ppn.
#!/bin/sh

#PBS -l nodes=1:ppn=2,pvmem=16gb

# Creates a cgroup on each node requested (E.g. 1) with a memory+swap limit of pvmem*ppn (E.g. 32gb).

memgrab -s 17 &       # 17GB > 16GB so this is killed
memgrab -s 15 &       # 15GB < 16GB so this isn't killed, yet
memgrab -s 15 &      # 15GB < 16GB and 30GB < 32GB so this isn't killed, yet
memgrab -s 15        # 15GB < 16GB but 45GB > 32GB so oom-killer kills a memgrab process
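
The per-process "Max address space" cap that pvmem configures can be observed outside the batch system with ulimit -v. A rough sketch, assuming a Linux machine with python3 available as the allocator (inside a real job the scheduler sets this limit for you):

```shell
# Emulate pvmem's per-process address-space cap with ulimit -v (value in KB),
# then try to allocate past it; the oversized allocation fails.
(
  ulimit -v 524288                                        # cap at 512 MB
  python3 -c 'b = bytearray(1024 * 1024 * 1024)' 2>/dev/null   # try 1 GB
  echo "oversized allocation exit status: $?"             # nonzero: the cap held
)
```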

 

-L memory

Using this option requires the -L syntax (E.g. -L tasks=1:lprocs=2:memory=10gb) instead of the -l syntax.

Maximum amount of memory expected to be used by all the processes in the job combined.  If the total amount of memory used by all the processes in the job combined exceeds memory, then those processes will swap.

Essentially the same as the mem option but does not get divided by the number of nodes requested.

  • Scheduler will not kill the job if any process or the whole job exceeds memory, but it will swap.
  • Doesn't set anything in ulimit or /proc/$$/limits.
  • Sets "memory.limit_in_bytes" in the memory cgroup to memory.
#!/bin/sh

#PBS -L tasks=1:lprocs=2:memory=10gb

# Creates a cgroup on each node requested (E.g. 1) with a memory limit of memory (E.g. 10gb).

memgrab -s 8 &       # 8GB < 10GB so it doesn't swap, yet
memgrab -s 8         # 16GB > 10GB so both processes swap
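
The contrast with -l mem for a multi-node job can be sketched as arithmetic, assuming the behavior described above (mem divided across nodes, memory applied as-is on each node):

```shell
# Per-node limit for a 2-node job under each option.
nodes=2
req_gb=10
echo "-l mem=${req_gb}gb over ${nodes} nodes -> $(( req_gb / nodes ))gb per node"   # 5gb
echo "-L memory=${req_gb}gb                  -> ${req_gb}gb per node"               # 10gb
```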