
Access and Running Jobs (Torque/Slurm)

If you are an NRAO staff member, you can use your regular Linux login to access the cluster. If you are a remote observer, you will need an observer login; see User Accounts for instructions on creating a remote observer account.


Login to the NRAO

Note: to preserve ssh agent and X11 forwarding, you may need to use the -AX (Linux) or -AY (macOS) options on the ssh command line.

If you are outside the Socorro and Charlottesville facilities, you will need to login to our ssh gateway first.

For Socorro (NMASC/nmpost)

ssh <account name>@ssh.aoc.nrao.edu

For Charlottesville (NAASC/cvpost)

ssh <account name>@ssh.cv.nrao.edu

Login to the Master Node

Once on the NRAO network, login to the master node. It is from this node that you submit either batch or interactive jobs. Please refrain from doing compute-intensive work, like CASA, on the master node, as it is a very limited resource that everyone must share.

For NMASC (nmpost)

ssh nmpost-master

For NAASC (cvpost)

ssh cvpost-master

For more advanced connections like VNC, see the appendix.

 

Schedulers

The NMASC cluster in Socorro (nmpost) is subdivided into three clusters: Torque, Slurm, and HTCondor. Torque and Slurm are both High Performance Computing systems focused on completing jobs as quickly as possible. They are commonly used for interactive as well as batch jobs. Slurm will ultimately replace Torque, but there will be a transition period during which cluster nodes are managed by both schedulers. Your NRAO Linux login gives you access to both systems.

The NAASC cluster in Charlottesville (cvpost) only supports Torque at this time.


Running Jobs

There are two types of jobs: batch and interactive. Batch jobs involve submitting a script (bash, python, etc.) to the cluster along with memory and CPU requirements; the cluster then decides where and when to run that job. Interactive jobs give you exclusive access to a subset of a node, allowing you to run software interactively as you might on a desktop computer. Batch is preferred over interactive because it uses resources more efficiently and allows more users to share the cluster, so please use batch jobs when possible.


Batch Jobs

There are several different ways of starting batch jobs depending on the cluster you wish to use.

 

Torque

The command to submit scripts to be run on the cluster is qsub.  If you have used other PBS-style schedulers like OpenPBS or PBS Pro, you may be familiar with qsub.  It has many options which you can see in the qsub manual.  These options can be added either to the command-line or into a submit script via the #PBS directive.  Below is a very simple example using qsub.  More detailed examples are in the appendix.
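As the text notes, the same options can go on the qsub command line instead of in the script. As a sketch, the command-line equivalent of the #PBS directives in the example script would look like the following (the Lustre path is the example account's area, not a fixed value):

```shell
# Command-line equivalent of the #PBS directives in the example script.
# The Lustre path below is the example observer's area; substitute your own.
qsub -V \
     -l pmem=16gb,pvmem=16gb \
     -d /lustre/aoc/observers/nm-4386 \
     -l walltime=2:30:00 \
     -m ae \
     run_casa.sh
```

Options given on the command line override the corresponding #PBS directives in the script, which is convenient for one-off changes such as a longer walltime.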

Create a run_casa.sh script like the following (nmpost example)

#!/bin/sh

#Don't put any commands before the #PBS options or they will not work
#PBS -V                               # Export all environment variables from the qsub command's environment to the batch job
#PBS -l pmem=16gb,pvmem=16gb          # Amount of memory needed by each processor (ppn) in the job
#PBS -d /lustre/aoc/observers/nm-4386 # Working directory (PBS_O_WORKDIR) set to your Lustre area
#PBS -l walltime=2:30:00              # Expected runtime of 2 hours and 30 minutes
#PBS -m ae                            # Send email when the job ends or aborts

# CASA's python requires a DISPLAY for matplotlib, so create a virtual X server
xvfb-run -d casa --nogui -c /lustre/aoc/observers/nm-4386/run_casa.py

Make it executable

chmod u+x run_casa.sh

Run job

qsub run_casa.sh

Slurm

The command to submit scripts to be run on the cluster is sbatch. It has many options which you can see in the sbatch manual. These options can be added either to the command-line or into a submit script via the #SBATCH directive. Below is a very simple example using sbatch. More detailed examples are in the appendix.
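As with qsub, the #SBATCH options can be given on the sbatch command line instead. A sketch of the command-line equivalent of the example script's directives (the Lustre path is the example account's area):

```shell
# Command-line equivalent of the #SBATCH directives in the example script.
# The Lustre path below is the example observer's area; substitute your own.
sbatch --export=ALL \
       --mem=16G \
       -D /lustre/aoc/observers/nm-4386 \
       --time=0-2:30:00 \
       --mail-type=END,FAIL \
       run_casa.sh
```

Command-line options take precedence over #SBATCH directives in the script, so this is handy for one-off changes without editing the script.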

Create a run_casa.sh script like the following (nmpost example)

#!/bin/sh

#Don't put any commands before the #SBATCH options or they will not work
#SBATCH --export=ALL                     # Export all environment variables to the job
#SBATCH --mem=16G                        # Amount of memory needed by the whole job
#SBATCH -D /lustre/aoc/observers/nm-4386 # Working directory set to your Lustre area
#SBATCH --time=0-2:30:00                 # Expected runtime of 2 hours and 30 minutes
#SBATCH --mail-type=END,FAIL             # Send email when the job ends or fails

# CASA's python requires a DISPLAY for matplotlib, so create a virtual X server
xvfb-run -d casa --nogui -c /lustre/aoc/observers/nm-4386/run_casa.py

Run job

sbatch run_casa.sh

Interactive Jobs

The nodescheduler program is used to grant interactive access to a node and works with both the Torque and Slurm clusters.  You are allowed to request up to two weeks of time on a single node. For example:

nodescheduler -r 14 1

This requests 14 days of time on one node. The nodescheduler command will return something like 3001.nmpost-master.aoc.nrao.edu or Submitted batch job 3001 to your terminal window. The number in this output (e.g. 3001) is your unique job id. You will need this number to terminate your job if you finish with it early, or as a reference if you need to ask for more time.
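If you want to capture the job id in a script, a minimal sketch (assuming the two output formats quoted above) is to take everything before the first dot for Torque, or the last word of the message for Slurm:

```shell
# Sketch: extract the numeric job id from either scheduler's output.
# The example strings below are the output formats quoted above.
torque_out="3001.nmpost-master.aoc.nrao.edu"
slurm_out="Submitted batch job 3001"

# Torque: the id is everything before the first dot
torque_id="${torque_out%%.*}"

# Slurm: the id is the last word of the message
slurm_id="${slurm_out##* }"

echo "$torque_id $slurm_id"   # 3001 3001
```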

When your node is ready, you will receive an email from the system with the subject "Cluster interactive: time begun" telling you the name of the node available to you. Please login to this node from our ssh gateway instead of the cluster master. If you do not receive an email within a few minutes, your request is likely waiting in the queue because there are no available nodes. You can check the status of your jobs with qstat -1nu <USERNAME> for Torque

qstat -1nu nm-4386

or squeue -l --me for Slurm

squeue -l --me

After approximately 75% of your time has passed, the system will send you an email warning. This would be a good time to request an extension if you need one; please do not wait until the last minute. You will receive another email warning approximately one hour before the end of your requested time. When your time expires, the system will kill all your processes and remove your /tmp files on that node.

If you finish your work before your time ends, please release the node. The original email you received has the terminate command in it. For example:

nodescheduler -t 3001

It is best to release a node if you finish early and then request another node later, rather than leaving a node reserved idle for weeks at a time.


Checking Status

All the following commands should be run from either the cluster master or a compute node.

Torque

To see all the jobs on the cluster, both running and queued, run the following. The S column indicates whether a job is running (R) or waiting to start (Q).

qstat -1n

To see only your jobs (user nm-4386 is an example)

qstat -1nu nm-4386

To see the full status of a running or queued job, including start time and memory usage (job id 3001 is an example)

qstat -f 3001

To see what nodes are available with no jobs running on them and what queue they are in

nodesfree

Slurm

To see all your jobs on the cluster, run the following.

squeue --me

The ST column indicates the status of the job: R means running and PD means pending (waiting to start).

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    5     batch run_casa    krowe  R       0:04      1 nmpost029
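Note that squeue only lists pending and running jobs. If job accounting is enabled on the cluster (an assumption worth checking with the helpdesk), finished jobs can be inspected with Slurm's sacct command, for example:

```shell
# Show name, state, elapsed time, and peak memory for a finished job
# (job id 3001 is an example)
sacct -j 3001 --format=JobID,JobName,State,Elapsed,MaxRSS
```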

Killing jobs

Torque

To kill a running or queued job (either batch or interactive), use the qdel command with the job id as an argument (job id 3001 is an example)

qdel 3001

 

Slurm

To kill a running or queued job (either batch or interactive), use the scancel command with the job id as an argument (job id 3001 is an example)

scancel 3001

 

nodescheduler

You can also use the nodescheduler command to kill interactive jobs

nodescheduler --terminate 3001

or the following which will kill all your interactive jobs

nodescheduler --terminate me

 

 
