Troubleshooting (Torque/Slurm)
Torque job exits unexpectedly
Your job may have exceeded the requested amount of RAM. The scheduler may kill your job if you set vmem and/or pvmem and exceeded either of those limits. If you have the -m option for qsub set to at least e, then you should receive an e-mail message when the job ends. Look at the Exit_status in that message and consult the table below.
Exit_status | Probable Cause(s)
---|---
-11 | walltime exceeded
-10 | vmem exceeded
0 | success
1 | your script produced an error
2 | no such file or directory on Lustre
137 | A process exceeded mem/vmem or pmem/pvmem, or the process was killed with the KILL signal
139 | A process exceeded vmem or pvmem, or the process was killed with the SEGV signal
265 | Common when using qsub -I
271 | Job killed with qdel
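For example, a Torque submission that sets memory limits and requests e-mail at job end might look like the following (a sketch; the resource values, e-mail address, and script name are placeholders):

# Request 4 cores on one node, 32 GB of virtual memory, and 24 hours of walltime;
# -m e sends mail when the job ends and -M sets the destination address.
qsub -l nodes=1:ppn=4,vmem=32gb,walltime=24:00:00 -m e -M you@nrao.edu myjob.sh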
Node was rebooted
We don't allow persistent reservations because they will restart your batch job after a node reboots, which is not usually what you want to happen. So if your node is rebooted, your reservation is released. Please request another node.
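If your cluster does not provide its own reservation tool, a generic way to request a new interactive allocation under Slurm is shown below (a sketch; the memory and time values are placeholders):

# Ask Slurm for a one-node interactive allocation (16 GB of RAM for 8 hours)
# and start a shell on the allocated node.
srun --nodes=1 --mem=16G --time=08:00:00 --pty bash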
Node is swapping
A node is said to be "swapping" if it is moving data between memory and its swap partition on disk. This can happen when the available RAM is oversubscribed. Since disk speed is around two orders of magnitude slower than RAM speed, any swapping to disk results in very poor performance. You can view the memory usage of a node graphically either by starting a web browser on an NRAO machine (preferably in a VNC session) and going to our ganglia server, or by using wget.
NMASC (nmpost)
- Connect to http://ganglia.aoc.nrao.edu/
- Pull down Choose a Source and select NM Postprocessing
- Pull down Choose a Node and select the node you are using
If you see any purple in the graph, then that machine has data on its swap disk and may be performing poorly.
You can also try downloading the graph manually like so (nmpost050 is an example)
wget "http://ganglia.aoc.nrao.edu/graph.php?c=NM%20Postprocessing&h=nmpost050.aoc.nrao.edu&g=mem_report&z=large" -O graph.png
eog graph.png
NAASC (cvpost)
- Connect to http://ganglia.cv.nrao.edu/
- Pull down Choose a Source and select NAASC HPC
- Pull down Choose a Node and select the node you are using
If you see any purple in the graph, then that machine has data on its swap disk and may be performing poorly.
You can also try downloading the graph manually like so (cvpost050 is an example)
wget --no-check-certificate "https://neuron.cv.nrao.edu/graph.php?c=NAASC+HPC&h=cvpost050.cv.nrao.edu&g=mem_report&z=xlarge" -O graph.png
eog graph.png
The best way to keep a node from swapping is to ask for enough memory to run your job properly.
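For example, the memory request might look like either of the following, depending on the scheduler (a sketch; the 64 GB value and script name are placeholders):

# Torque: request 64 GB of memory for the job.
qsub -l mem=64gb myjob.sh

# Slurm: request 64 GB of memory for the job.
sbatch --mem=64G myjob.sh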
For more information, see the Torque manual.
Not enough slots
If you get a message like the following, it may be because you are using mpicasa with a -n option that is larger than the number of tasks you have requested from Slurm.
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 9
slots that were requested by the application:

  /home/casa/packages/RHEL7/release/casa-6.1.2-7-pipeline-2020.1.0.36/bin/casa

Either request fewer slots for your application, or make more slots
available for use.
--------------------------------------------------------------------------
With Torque, some users would request 8 cores and then run mpicasa -n 9, knowing that one of those nine processes would be a parent process, the other eight would be worker processes, and the parent process wouldn't use a significant amount of resources. This is unique to mpicasa, which treats the parent process just like the worker processes; most other software doesn't do this. It worked with Torque because Torque and MPI were never really aware of each other. With Slurm, MPI knows it is running in a Slurm environment and will stop you from doing this. The easy solution is to request 9 tasks in this case.
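For example, a matching Slurm request might look like this (a sketch; the memory and time values, the casa invocation, and the script name are placeholders):

#!/bin/sh
#SBATCH --ntasks=9        # one mpicasa parent process plus eight workers
#SBATCH --mem=64G         # placeholder memory request
#SBATCH --time=24:00:00   # placeholder walltime
# Launch CASA with a -n value that matches --ntasks.
mpicasa -n 9 casa --nogui -c my_pipeline.py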
Submit command not found
The submit script was deprecated in 2017 and was only available with the old Torque system. There is no submit script in the Slurm system. Please create a script of your own and submit it with sbatch.
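A minimal sbatch script might look like the following (a sketch; the resource values, e-mail address, paths, and commands are placeholders):

#!/bin/sh
#SBATCH --time=24:00:00            # walltime
#SBATCH --mem=32G                  # memory
#SBATCH --ntasks=1                 # number of tasks
#SBATCH --mail-type=END            # send e-mail when the job ends
#SBATCH --mail-user=you@nrao.edu   # where to send it
cd /lustre/path/to/your/working/dir
./run_my_job.sh

Submit it with sbatch myscript.sh and check its status with squeue.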