Access and Running Jobs
If you are an NRAO staff member, you can use your regular Linux login to access the cluster. If you are a remote observer, you will need an observer login. See User Accounts for instructions on creating a remote observer account.
Login to the NRAO
Note: to preserve ssh agent and X11 forwarding, you may need to add the -AX (Linux) or -AY (macOS) options to the ssh command line.
If you are outside the Socorro and Charlottesville facilities, you will need to log in to our ssh gateway first.
For NMASC (nmpost)
ssh <account name>@ssh.aoc.nrao.edu
For NAASC (cvpost)
ssh <account name>@ssh.cv.nrao.edu
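For example, logging in to the NMASC gateway with agent and X11 forwarding enabled (nm-4386 is the example account used throughout this page):
ssh -AX nm-4386@ssh.aoc.nrao.edu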
Login to the Master Node
Once on the NRAO network, log in to the master node. It is from this node that you submit either batch or interactive jobs. Please refrain from doing compute-intensive work, like running CASA, on the master node, as it is a very limited resource that everyone must share.
For NMASC (nmpost)
ssh nmpost-master
For NAASC (cvpost)
ssh cvpost-master
Running Jobs
There are two types of jobs: batch and interactive. Batch jobs involve submitting a script (bash, python, etc.) to the cluster along with memory and CPU requirements; the cluster decides where and when to run that job. Interactive jobs give you exclusive access to an entire node, including all of its memory and CPUs. Batch jobs are preferred over interactive jobs because they use resources more efficiently and allow many other users to share the cluster. Please use batch jobs instead of interactive jobs when possible.
Batch Jobs
The NRAO provides two methods to start batch jobs: qsub and submit. qsub is the native program that comes with Torque for submitting jobs; if you have used Torque or another PBS-style scheduler, you may be familiar with it. submit is an in-house program for submitting batch jobs that tries to simplify the required options. We suggest trying qsub, as it is more flexible and not much harder to use than submit.
qsub
The command to submit scripts to be run on the cluster is qsub. It has many options, which you can see in the qsub manual. These options can be given either on the command line or in the submit script via #PBS directives. Below is a very simple example using qsub. More detailed examples are in the appendix.
Create a run_casa.sh script like the following (nmpost example)
#!/bin/sh
# Don't put any commands before the #PBS options or they will not work

#PBS -V                                # Export all environment variables from the qsub command's environment to the batch job.
#PBS -l pmem=16gb,pvmem=16gb           # Amount of memory needed by each processor (ppn) in the job.
#PBS -d /lustre/aoc/observers/nm-4386  # Working directory (PBS_O_WORKDIR) set to your Lustre area
#PBS -m ae                             # Send email when jobs end or abort

# casa's python requires a DISPLAY for matplot, so create a virtual X server
xvfb-run -d casa --nogui -c /lustre/aoc/observers/nm-4386/run_casa.py
Make it executable
chmod u+x run_casa.sh
Run job
qsub run_casa.sh
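The same options shown as #PBS directives in the script can instead be given directly on the qsub command line. The following is roughly equivalent to the run_casa.sh example above:
qsub -V -l pmem=16gb,pvmem=16gb -d /lustre/aoc/observers/nm-4386 -m ae run_casa.sh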
Submit
nm-* and cv-* type accounts have their home areas on the Lustre filesystem. NRAO staff accounts are on a different filesystem and should therefore set WORK_DIR to their Lustre area. There are new options (MEM, PMEM, VMEM, PVMEM) which allow for more granularity in requesting memory. We recommend using PMEM, so that the cluster scheduler knows how much memory each process needs, and PVMEM, so that it can kill processes that exceed the requested amount.
Below is a very simple example using submit. More detailed examples are in the appendix.
Create a cluster.req file like the following (nmpost example)
WORK_DIR="/lustre/aoc/observers/nm-4386" # set to your Lustre area COMMAND="/lustre/aoc/observers/nm-4386/run_casa.sh" PMEM="16gb" # physmem used by any process. Won't kill job. PVMEM="16gb" # physmem + virtmen used by any process. Kills job if exceeded. MAIL_OPTIONS="abe" # default is "n" therefore no email
Create a run_casa.sh file like the following (nmpost example)
#!/bin/sh
WORK_DIR="/lustre/aoc/observers/nm-4386/scheduler"
cd ${WORK_DIR}

# casa's python requires a DISPLAY for matplot, so create a virtual X server
xvfb-run -d casa --nogui -c /lustre/aoc/observers/nm-4386/ParallelScript.py
Make it executable
chmod u+x run_casa.sh
Submit job
submit -f cluster.req
Interactive Jobs
If you require exclusive access to a node, run the nodescheduler program. You are allowed to request up to two weeks of time on a single node. For example:
nodescheduler --request 14 1
This requests 14 days of time on one node. The nodescheduler command will print a job id like 3001.nmpost-master.aoc.nrao.edu to your terminal window. This job id is used by the system to uniquely identify the job. You will need the number from this job id (e.g. 3001) to terminate your job if you finish with it early, or as a reference if you need to ask for more time.
When your node is ready, you will receive an email from the system with the subject "Cluster interactive: time begun" telling you the name of the cluster machine available to you. Please log in to this node from our ssh gateway instead of from the cluster master node. If you do not receive an email within a few minutes, your request is likely waiting in the queue because no nodes are available. You can use the qstat -1nu <USERNAME> command to check the status of your jobs.
qstat -1nu nm-4386
After approximately 75% of your time has passed, the system will send you a warning email. This is a good time to ask for an extension if you need one; please do not wait until the last minute. You will receive another warning email approximately one hour before the end of your requested time. When your time expires, the system will kill all of your processes and remove your /tmp files on that node.
If you finish your work before your time ends, please release the node. The original email you received contains the terminate command. For example:
nodescheduler --terminate 3001
It's best to release a node if you finish early and then request another node later, rather than locking one up for weeks at a time and leaving it idle. Only one user per node is allowed for interactive time.
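Putting the interactive steps together, a typical session might look like the following sketch (the node name nmpost042 is made up; the real name arrives in the "time begun" email):
nodescheduler --request 14 1      # ask for one node for 14 days; note the job id, e.g. 3001
# ...wait for the "Cluster interactive: time begun" email naming your node...
ssh nmpost042                     # log in to the node named in the email (via the ssh gateway if outside NRAO)
# ...do your work...
nodescheduler --terminate 3001    # release the node as soon as you are done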
Checking Status
All the following commands should be run from either the master node or a compute node.
To see all the jobs on the cluster, both running and queued, run the following. The S column indicates whether a job is running (R) or waiting to start (Q).
qstat -1n
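The output looks roughly like the following (illustrative only; the exact columns depend on the Torque version, and some are elided here):
Job ID                  Username  Queue  Jobname      ...  S  Elap Time
----------------------  --------  -----  -----------       -  ---------
3001.nmpost-master.aoc  nm-4386   batch  run_casa.sh  ...  R  02:13:11   nmpost042/0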
To see only your jobs (user nm-4386 is an example)
qstat -1nu nm-4386
To see the full status of a running or queued job, like start time and memory usage (job id 3001 is an example)
qstat -f 3001
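qstat -f prints many attributes; to pull out just the job state and resource usage, you can filter the output (attribute names here follow Torque's qstat -f format):
qstat -f 3001 | grep -E 'job_state|resources_used'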
To see what nodes are available with no jobs running on them and what queue they are in
nodesfree
Killing Jobs
To kill a running or queued job (either batch or interactive), use the qdel command and the job id as an argument (job id 3001 is an example)
qdel 3001
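If you have more than one job to kill, qdel accepts a list of job ids:
qdel 3001 3002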
You can also use the nodescheduler command to kill interactive jobs
nodescheduler --terminate 3001
or the following which will kill all your interactive jobs
nodescheduler --terminate me