
Access and Running Jobs

If you are an NRAO staff member, you can use your regular Linux login to access the cluster. If you are a remote observer, you will need an observer login; see User Accounts for instructions on creating a remote observer account.


Login to the NRAO

Note: to preserve ssh agent and X11 forwarding, you may need to add the -AX (Linux) or -AY (OS X) options to the ssh command line.

If you are outside the Socorro and Charlottesville facilities, you will need to login to our ssh gateway first.

For NMASC (nmpost)

ssh <account name>@ssh.aoc.nrao.edu

For NAASC (cvpost)

ssh <account name>@ssh.cv.nrao.edu
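
For example, a Linux user with the observer account nm-4386 (the example account used throughout this page) could log in to the New Mexico gateway with agent and X11 forwarding enabled:

ssh -AX nm-4386@ssh.aoc.nrao.edu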

Login to the Master Node

Once on the NRAO network, login to the master node. It is from this node that you submit batch or interactive jobs. Please refrain from doing compute-intensive work, such as running CASA, on the master node, as it is a very limited resource that everyone must share.

For NMASC (nmpost)

ssh nmpost-master

For NAASC (cvpost)

ssh cvpost-master
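
Putting the two steps together, a login to the New Mexico cluster from outside the NRAO might look like the following, again using the example account nm-4386:

ssh -AX nm-4386@ssh.aoc.nrao.edu   # first hop: the ssh gateway
ssh -AX nmpost-master              # second hop: the cluster master node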

For more advanced connections like VNC, see the appendix.

Running Jobs

There are two types of jobs: batch and interactive. A batch job submits a script (bash, python, etc.) to the cluster along with its memory and CPU requirements, and the cluster decides where and when to run it. An interactive job gives you exclusive access to an entire node, including all of its memory and CPUs. Batch jobs use cluster resources more efficiently and let more users share the cluster, so please use batch jobs instead of interactive jobs whenever possible.


Batch Jobs

The NRAO provides two methods to start batch jobs: qsub and submit. qsub is the native program that comes with Torque for submitting jobs; if you have used Torque or another PBS-style scheduler, you may already be familiar with it. submit is an in-house program for submitting batch jobs that tries to simplify the options required. We suggest trying qsub first, as it is more flexible and not much harder to use than submit.


qsub

The command to submit scripts to be run on the cluster is qsub. It has many options, which you can see in the qsub manual. These options can be given either on the command line or in the submit script via #PBS directives. Below is a very simple example using qsub. More detailed examples are in the appendix.

Create a run_casa.sh script like the following (nmpost example)

#!/bin/sh

#Don't put any commands before the #PBS options or they will not work
#PBS -V    # Export all environment variables from the qsub command's environment to the batch job.
#PBS -l pmem=16gb,pvmem=16gb       # Amount of memory needed by each processor (ppn) in the job.
#PBS -d /lustre/aoc/observers/nm-4386 # Working directory (PBS_O_WORKDIR) set to your Lustre area
#PBS -m ae    # Send email when the job ends or aborts

# casa's python requires a DISPLAY for matplotlib, so create a virtual X server
xvfb-run -d casa --nogui -c /lustre/aoc/observers/nm-4386/run_casa.py

Make it executable

chmod u+x run_casa.sh

Run job

qsub run_casa.sh
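
The same options can be given on the qsub command line instead of as #PBS directives in the script. For example, a command-line form roughly equivalent to the directives above would be:

qsub -V -l pmem=16gb,pvmem=16gb -d /lustre/aoc/observers/nm-4386 -m ae run_casa.sh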

Submit

nm-* and cv-* type accounts have their home directories on the Lustre filesystem. NRAO staff accounts are on a different filesystem and should therefore set WORK_DIR to their Lustre area. There are options (MEM, PMEM, VMEM, PVMEM) that allow for more granularity in requesting memory. We recommend using PMEM, so that the cluster scheduler knows how much memory each process needs, and PVMEM, so that it can kill processes that go over the requested amount.

Below is a very simple example using submit.  More detailed examples are in the appendix.

Create a cluster.req file like the following (nmpost example)

WORK_DIR="/lustre/aoc/observers/nm-4386" # set to your Lustre area
COMMAND="/lustre/aoc/observers/nm-4386/run_casa.sh"

PMEM="16gb"    # physmem used by any process. Won't kill job.
PVMEM="16gb"    # physmem + virtmem used by any process. Kills job if exceeded.

MAIL_OPTIONS="abe"   # default is "n" therefore no email

Create a run_casa.sh file like the following (nmpost example)

#!/bin/sh

WORK_DIR="/lustre/aoc/observers/nm-4386/scheduler"
cd ${WORK_DIR}

# casa's python requires a DISPLAY for matplotlib, so create a virtual X server
xvfb-run -d casa --nogui -c /lustre/aoc/observers/nm-4386/ParallelScript.py

Make it executable

chmod u+x run_casa.sh

Submit job

submit -f cluster.req

Interactive Jobs

If you require exclusive access to a node, run the nodescheduler program. You are allowed to request up to two weeks of time on a single node. For example:

nodescheduler --request 14 1

This requests 14 days of time on one node. The nodescheduler command will return a job id like 3001.nmpost-master.aoc.nrao.edu to your terminal window. This job id is used by the system to uniquely identify the job. You will need the number from this job id (e.g. 3001) to terminate your job if you finish with it early, or as a reference if you need to ask for more time.
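
If you want to capture the job id in a script, a minimal sketch like the following should work, assuming the job id is the only text nodescheduler writes to standard output (if it prints additional text, simply note the number by hand):

# Hypothetical sketch: keep only the numeric part of the returned job id.
jobid=$(nodescheduler --request 14 1)   # e.g. 3001.nmpost-master.aoc.nrao.edu
echo ${jobid%%.*}                       # e.g. 3001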

When your node is ready, you will receive an email from the system with the subject "Cluster interactive: time begun" telling you the name of the cluster machine available to you. Please login to this node from our ssh gateway instead of the cluster master node. If you do not receive an email within a few minutes, it is likely that your request is waiting in the queue because there are no available nodes. You can use the qstat -1nu <USERNAME> command to check the status of your jobs.

qstat -1nu nm-4386

After approximately 75% of your time has passed, the system will send you an email warning. This is a good time to request an extension if you need one; please do not wait until the last minute. You will receive another email warning approximately one hour before the end of your requested time. When your time expires, the system will kill all of your processes and remove your /tmp files on that node.

If you finish your work before your time ends, please release the node. The original email you received has the terminate command in it. For example:

nodescheduler --terminate 3001

It's best to release a node if you finish early and then request another node later, rather than locking one up for weeks at a time and leaving it idle. Only one user per node is allowed for interactive time.


Checking Status

All the following commands should be run from either the master node or a compute node.

To see all the jobs on the cluster, both running and queued, run the following. The S column indicates whether a job is running (R) or waiting to start (Q).

qstat -1n

To see only your jobs (user nm-4386 is an example)

qstat -1nu nm-4386

To see the full status of a running or queued job, such as start time and memory usage (job id 3001 is an example)

qstat -f 3001
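
The full listing can be long. If you only care about a few fields, you can filter it; for example, assuming a Torque-style qstat whose full output includes job_state, start_time, and resources_used fields:

qstat -f 3001 | grep -E -i 'job_state|start_time|resources_used'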

To see what nodes are available with no jobs running on them and what queue they are in

nodesfree

Killing Jobs

To kill a running or queued job (either batch or interactive), use the qdel command with the job id as an argument (job id 3001 is an example)

qdel 3001

You can also use the nodescheduler command to kill interactive jobs

nodescheduler --terminate 3001

or the following, which will kill all of your interactive jobs

nodescheduler --terminate me
