Troubleshooting

 

Job exits unexpectedly

Your job may have exceeded the requested amount of RAM.  The scheduler may kill your job if you set VMEM and/or PVMEM and the job exceeds either of those limits.  If you set the -m option for qsub to at least e, you should receive an e-mail message when the job ends.  Look at the Exit_status in that message and consult the table below.

Exit_status  Probable Cause(s)
-11          walltime exceeded
-10          vmem exceeded
0            success
1            your script produced an error
2            no such file or directory on Lustre
137          A process exceeded mem/vmem or pmem/pvmem, or the process was killed with the KILL signal
139          A process exceeded vmem or pvmem, or the process was killed with the SEGV signal
265          Common when using qsub -I
271          Job killed with qdel
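
The lookup in the table above can be sketched as a small shell function. This is only an illustration: the function name explain_exit_status is hypothetical, and the messages are copied from the table rather than taken from the scheduler itself. Note that 137 and 139 follow the usual shell convention of 128 plus the signal number (9 for KILL, 11 for SEGV).

```shell
#!/bin/sh
# Hypothetical helper: map an Exit_status value to the probable
# cause listed in the table above. Not part of Torque.
explain_exit_status() {
    case "$1" in
        -11) echo "walltime exceeded" ;;
        -10) echo "vmem exceeded" ;;
        0)   echo "success" ;;
        1)   echo "your script produced an error" ;;
        2)   echo "no such file or directory on Lustre" ;;
        137) echo "a process exceeded mem/vmem or pmem/pvmem, or was killed with the KILL signal" ;;
        139) echo "a process exceeded vmem or pvmem, or was killed with the SEGV signal" ;;
        265) echo "common when using qsub -I" ;;
        271) echo "job killed with qdel" ;;
        *)   echo "not listed in the table" ;;
    esac
}

# Example: look up the status reported in the end-of-job e-mail.
explain_exit_status 137
```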

Node was rebooted

Persistent reservations cause problems within the Torque job scheduler, because it tries to restart your batch process after a reboot, which is not usually what you want.  Therefore, if your node is rebooted, your reservation is released.  Please request another node.

 

Node is swapping

A node is said to be "swapping" if it is moving data between memory and its swap partition on disk.  This can happen when the available RAM is oversubscribed.  Since disk is around two orders of magnitude slower than RAM, any swapping to disk results in very poor performance.  You can view the memory usage of a node graphically in either of two ways: start a web browser on an NRAO machine, preferably in VNC, and go to our ganglia server, or download the graph with wget.

 

NMASC (nmpost)

  1. Connect to http://ganglia.aoc.nrao.edu/
  2. Pull down Choose a Source and select NM Postprocessing
  3. Pull down Choose a Node and select the node you are using

If you see any purple in the graph, then that machine has data on its swap disk and may be performing poorly.

You can also try downloading the graph manually, for example for nmpost050:

wget "http://ganglia.aoc.nrao.edu/graph.php?c=NM%20Postprocessing&h=nmpost050.aoc.nrao.edu&g=mem_report&z=large" -O graph.png

eog graph.png

 

NAASC (cvpost)

  1. Connect to http://ganglia.cv.nrao.edu/
  2. Pull down Choose a Source and select NAASC HPC
  3. Pull down Choose a Node and select the node you are using

If you see any purple in the graph, then that machine has data on its swap disk and may be performing poorly.

You can also try downloading the graph manually, for example for cvpost050:

wget --no-check-certificate "https://neuron.cv.nrao.edu/graph.php?c=NAASC+HPC&h=cvpost050.cv.nrao.edu&g=mem_report&z=xlarge" -O graph.png

eog graph.png
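
The two wget commands above differ only in the server, the cluster name, and the graph size, so the URL can be built from the node name. The following is a minimal sketch; the function name ganglia_mem_url is hypothetical, and the URLs are simply the ones shown above with the node name substituted.

```shell
#!/bin/sh
# Hypothetical helper: print the ganglia memory-graph URL for a node,
# using the URL patterns from the wget examples above.
ganglia_mem_url() {
    node="$1"
    case "$node" in
        nmpost*)
            # %% prints a literal % so the output contains NM%20Postprocessing
            printf 'http://ganglia.aoc.nrao.edu/graph.php?c=NM%%20Postprocessing&h=%s.aoc.nrao.edu&g=mem_report&z=large\n' "$node" ;;
        cvpost*)
            printf 'https://neuron.cv.nrao.edu/graph.php?c=NAASC+HPC&h=%s.cv.nrao.edu&g=mem_report&z=xlarge\n' "$node" ;;
        *)
            echo "unknown node prefix: $node" >&2
            return 1 ;;
    esac
}

# Example: print the URL for nmpost050.
ganglia_mem_url nmpost050
# To download and view the graph (cvpost nodes also need --no-check-certificate):
#   wget "$(ganglia_mem_url nmpost050)" -O graph.png && eog graph.png
```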

The best way to keep a node from swapping is to ask for enough memory to run your job properly.
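
As an illustration, a job script might request memory and end-of-job mail with #PBS directives. This is a sketch under assumptions: the file name job.sh, the 16gb and walltime values, and the program name ./my_program are placeholders, and the right memory request depends on your job.

```shell
#!/bin/sh
# Write a hypothetical job script; all values below are placeholders.
cat > job.sh <<'EOF'
#!/bin/sh
#PBS -m e
#PBS -l mem=16gb
#PBS -l walltime=04:00:00
# Placeholder for the actual work of the job.
./my_program
EOF

# Submit it with:
#   qsub job.sh
```

The -m e directive asks for the e-mail that reports Exit_status when the job ends.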
