Troubleshooting (Slurm)
Node was rebooted
We don't allow persistent reservations because they would restart your batch job after a node reboots, which is usually not what you want. So if your node is rebooted, your reservation is released. Please request another node.
Node is swapping
A node is said to be "swapping" if it is moving data between RAM and its swap partition on disk. This can happen when the available RAM is oversubscribed. Since disk speed is around two orders of magnitude slower than RAM speed, any swapping to disk results in very poor performance. You can view the memory usage of a node graphically by either starting a web browser on an NRAO machine (preferably in VNC) and going to our ganglia server, or by using wget.
NMASC (nmpost)
- Connect to http://ganglia.aoc.nrao.edu/
- Pull down Choose a Source and select NM Postprocessing
- Pull down Choose a Node and select the node you are using
If you see any purple in the graph, then that machine has data on its swap disk and may be performing poorly.
You can also try downloading the graph manually like so (nmpost050 is an example):
wget "http://ganglia.aoc.nrao.edu/graph.php?c=NM%20Postprocessing&h=nmpost050.aoc.nrao.edu&g=mem_report&z=large" -O graph.png
eog graph.png
NAASC (cvpost)
- Connect to http://ganglia.cv.nrao.edu/
- Pull down Choose a Source and select NAASC HPC
- Pull down Choose a Node and select the node you are using
If you see any purple in the graph, then that machine has data on its swap disk and may be performing poorly.
You can also try downloading the graph manually like so (cvpost050 is an example):
wget --no-check-certificate "https://neuron.cv.nrao.edu/graph.php?c=NAASC+HPC&h=cvpost050.cv.nrao.edu&g=mem_report&z=xlarge" -O graph.png
eog graph.png
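If you already have a shell on the node (for example, from an interactive Slurm job), you can also check swap usage directly with standard Linux tools; these commands are not specific to our clusters and are just a quick alternative to the ganglia graphs:
free -h
vmstat 5 3
In the free output, a non-zero "used" value on the Swap line means the node has data on swap; in the vmstat output, non-zero si/so columns mean the node is actively swapping right now.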
The best way to keep a node from swapping is to ask for enough memory to run your job properly.
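For example, you can set the memory request in your batch script or on the command line; the values below are purely illustrative, so replace them with what your job actually needs:
#!/bin/bash
#SBATCH --mem=64G          # total memory for the job; size this to your peak usage (illustrative value)
#SBATCH --time=1-00:00:00  # one-day wall time limit (illustrative value)
/path/to/your/program      # placeholder for your actual command
Or, equivalently, pass it on the command line: sbatch --mem=64G yourscript.sh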
Not enough slots
If you get a message like the following, it may be because you are using mpicasa with a -n option that is larger than the number of tasks you have requested from Slurm.
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 9
slots that were requested by the application:

  /home/casa/packages/RHEL7/release/casa-6.1.2-7-pipeline-2020.1.0.36/bin/casa

Either request fewer slots for your application, or make more slots
available for use.
--------------------------------------------------------------------------
With Torque, some users would request 8 cores and then run mpicasa -n9, knowing that one of those nine processes would be the parent process, the other eight would be worker processes, and the parent process wouldn't use a significant amount of resources. This behavior is specific to mpicasa, which counts the parent process in its -n total just like the worker processes; most other MPI software doesn't do this. It worked with Torque because Torque and MPI were never really aware of each other. With Slurm, MPI knows it is running in a Slurm environment and will stop you from using more slots than you were allocated. The easy solution in this case is to request 9 cores.
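As a sketch of what that looks like under Slurm (the CASA path, script name, and memory value are placeholders, not a recommendation):
#!/bin/bash
#SBATCH --ntasks=9     # one parent process plus eight workers
#SBATCH --mem=64G      # illustrative memory request
# The -n value matches --ntasks above, so MPI finds enough slots.
mpicasa -n 9 /path/to/casa --nogui -c your_pipeline_script.py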