
Advanced Job Submitting (Torque/Slurm)

Below are some advanced examples of submitting jobs to the various scheduling systems available at NRAO.

As of Oct. 2019, mpicasa is not aware of the Torque scheduling system. This means there is no cooperation between CASA and Torque for things like cgroups or other resource containers for multi-node jobs. It works, but on a sort of honor system. For example, imagine a script requesting 2 nodes, each with 4 cores. Torque will return the list of hostnames and mpicasa will launch processes on those hostnames, but only the mother superior node has any resource limits. The processes on the other hosts are not bound to the cgroup created by Torque, which limits the number of cores via cpuset. So as long as they are well behaved, it all works.

However, mpicasa is aware of the Slurm scheduling system because of the version of Open MPI that mpicasa uses.  This means that all processes on all nodes in the job will be contained in the proper cgroups.  It also means that you no longer need to use the -n or the -machinefile options with mpicasa because mpicasa will get this information from Slurm-created variables.
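As a rough illustration of how this works, inside an sbatch job Slurm exports environment variables such as SLURM_NTASKS and SLURM_JOB_NUM_NODES, which Open MPI (and therefore mpicasa) can read in place of -n and -machinefile. The values below are simulated; in a real job Slurm sets them automatically:

```shell
# Simulated values; a real sbatch job inherits these from Slurm.
SLURM_NTASKS=8          # total number of tasks (cores) in the job
SLURM_JOB_NUM_NODES=2   # number of nodes allocated to the job
echo "tasks=$SLURM_NTASKS nodes=$SLURM_JOB_NUM_NODES"
```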

 

Torque

Serial Job

You can use #PBS directives in the script you submit via qsub.  These directives are the same as command-line options to qsub.  For example, if you wanted to use the -V command-line option to qsub, you could instead include it in your script with the line #PBS -V.  See below for more examples.

The default walltime for batch jobs is 100 days. Your job will be killed if it is still running after 100 days unless you have set a walltime. Also, setting a walltime shorter than 100 days will increase the odds of your job starting when resources are scarce.
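Walltime strings in the examples below use a four-field days:hours:minutes:seconds format (Torque also accepts shorter forms). A quick sketch of the arithmetic, assuming that four-field format:

```shell
# Convert a d:h:m:s walltime string to total seconds.
WALLTIME="1:0:0:0"   # 1 day, as in the examples below
IFS=: read -r D H M S <<EOF
$WALLTIME
EOF
SECONDS_TOTAL=$(( ((D * 24 + H) * 60 + M) * 60 + S ))
echo "$SECONDS_TOTAL"   # 86400 seconds in one day
```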

Jobs are not restarted if there is a node failure.  Also, any reservations are removed from a node if that node reboots.

#!/bin/sh

# Set PBS Directives
# Lines starting with "#PBS", before any shell commands are
# interpreted as command line arguments to qsub.
# Don't put any commands before the #PBS options or they will not work.
#
#PBS -V                               # Export all environment variables from the qsub command environment to the batch job.
#PBS -l pmem=16gb                     # Amount of memory needed by each process (ppn) in the job.
#PBS -d /lustre/aoc/observers/nm-4386 # Working directory (PBS_O_WORKDIR)
#PBS -m bea                           # Send email on begin, end, and abort of job

# Because these start with "##PBS", they are not read by qsub.
# These are here as examples.
##PBS -l mem="16gb"       # physmem used by job. Ignored if NUM_NODES > 1. Won't kill job.
##PBS -l pmem="16gb"      # physmem used by any process. Won't kill job.
##PBS -l vmem="16gb"      # physmem + virtmem used by job. Kills job if exceeded.
##PBS -l pvmem="16gb"     # physmem + virtmem used by any process. Kills job if exceeded.
##PBS -l nodes=1:ppn=1    # default is 1 core on 1 node
##PBS -M nm-4386@nrao.edu # default is submitter
##PBS -W umask=0117       # default is 0077
##PBS -l walltime=1:0:0:0 # default is 100 days. This sets it to 1 day.

# casa's python requires a DISPLAY for matplot, so create a virtual X server
xvfb-run -d casa --nogui -c /lustre/aoc/observers/nm-4386/run_casa.py

 

Parallel Single-node Job

The procedure for submitting parallel batch jobs is very similar to submitting serial jobs.  The differences are setting the ppn qsub option to something other than 1 and how casa is executed.

The qsub option ppn specifies the number of cores per node requested by the job.  If this option is not set, it defaults to 1.  It is used in conjunction with the -l nodes option.  For example, to request one node with 8 cores you would type -l nodes=1:ppn=8.

The scheduler creates a file containing the requested node and core count assigned to the job.  The location of this file is stored in the environment variable PBS_NODEFILE.  This file can tell mpicasa on which nodes to run.
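To see what such a nodefile looks like, the sketch below builds a stand-in file by hand (a real job reads $PBS_NODEFILE instead, and the hostnames here are made up). Torque lists one hostname per requested core, so a node with 2 cores appears twice:

```shell
# Simulate a PBS_NODEFILE for -l nodes=2:ppn=2 (hypothetical hostnames).
NODEFILE=$(mktemp)
printf 'nodeA\nnodeA\nnodeB\nnodeB\n' > "$NODEFILE"

CORES=$(wc -l < "$NODEFILE")          # one line per core
NODES=$(sort -u "$NODEFILE" | wc -l)  # unique hostnames = node count
echo "cores=$CORES nodes=$NODES"
rm -f "$NODEFILE"
```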

#!/bin/sh

# Set PBS Directives
# Lines starting with "#PBS", before any shell commands are
# interpreted as command line arguments to qsub.
# Don't put any commands before the #PBS options or they will not work.
#
#PBS -V    # Export all environment variables from the qsub command environment to the batch job.
#PBS -l pmem=16gb        # Amount of memory needed by each process (ppn) in the job.
#PBS -d /lustre/aoc/observers/nm-4386 # Working directory (PBS_O_WORKDIR)
#PBS -l nodes=1:ppn=8 # Request one node with 8 cores

CASAPATH=/home/casa/packages/RHEL7/release/current

xvfb-run -d mpicasa -machinefile $PBS_NODEFILE $CASAPATH/bin/casa --nogui -c /lustre/aoc/observers/nm-4386/run_mpicasa.py

For more information regarding how to set memory requests see the Memory Options section of the documentation.

 

Parallel Multi-node Job

For multi-node jobs we recommend using the -L options instead of the -l options.

Since some of our nodes run docker, which creates a 172.17.0.1 interface, you should tell mpicasa to exclude that interface with the --mca option (see below).  Otherwise mpicasa may try to use this docker interface to talk to other nodes, which will fail.

#!/bin/sh

# Set PBS Directives
# Lines starting with "#PBS", before any shell commands are
# interpreted as command line arguments to qsub.
# Don't put any commands before the #PBS options or they will not work.
#
#PBS -V    # Export all environment variables from the qsub command environment to the batch job.
#PBS -d /lustre/aoc/observers/nm-4386 # Working directory (PBS_O_WORKDIR)

#PBS -L tasks=2:lprocs=4:memory=10gb
# tasks is the number of nodes
# lprocs is the number of cores per node
# memory is the amount of memory per node

CASAPATH=/home/casa/packages/RHEL7/release/current

xvfb-run -d mpicasa --mca btl_tcp_if_exclude "172.17.0.0/16" -machinefile $PBS_NODEFILE $CASAPATH/bin/casa --nogui -c /lustre/aoc/observers/nm-4386/run_mpicasa.py

 

Slurm

You can use #SBATCH directives in the script you submit via sbatch. These directives are the same as command-line options to sbatch. For example, if you wanted to use the --mem=2G command-line option to sbatch, you could instead include it in your script with the line #SBATCH --mem=2G. See below for more examples.

If possible, please set a time limit with the --time option.  Your job will be killed after this amount of runtime, but setting it can also allow your job to start sooner because the scheduler knows how much time it needs.  If you have not set a time limit, your job will be killed after 100 days.  Jobs are also killed if the node reboots.

Serial Job

Save the following example to a file called run_casa.sh and edit as needed.

#!/bin/sh

# Set SBATCH Directives
# Lines starting with "#SBATCH", before any shell commands are
# interpreted as command line arguments to sbatch.
# Don't put any commands before the #SBATCH directives or they will not work.
#
#SBATCH --export=ALL                          # Export all environment variables to job.
#SBATCH --mail-type=BEGIN,END,FAIL # Send email on begin, end and fail of job.
#SBATCH --chdir=/lustre/aoc/observers/nm-4386 # Working directory
#SBATCH --time=1-2:3:4                        # Request 1 day, 2 hours, 3 minutes, and 4 seconds.
#SBATCH --mem=16G                             # Memory needed by the whole job.

# casa's python requires a DISPLAY for matplot, so create a virtual X server
xvfb-run -d casa --nogui -c /lustre/aoc/observers/nm-4386/run_casa.py

Run job

sbatch run_casa.sh

 

Parallel Single-node Job

Because CASA uses one process as an "MPI Client", requesting 8 cores, for example, will produce 7-way parallelization.  If you actually want 8-way parallelization, you have two options.  1. You can request N + 1 cores, e.g. --ntasks-per-node=9, but this is less efficient since the "MPI Client" usually uses very little resources.  2. You can add the --oversubscribe option to mpicasa along with -n 9, which forces the "MPI Client" to share one of the 8 processing cores and should not affect performance in most cases.
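The arithmetic is simple but easy to forget; a quick sketch, with NTASKS standing in for whatever --ntasks-per-node value you request:

```shell
# One mpicasa rank acts as the MPI Client, so parallel workers = ranks - 1.
NTASKS=8
WORKERS=$((NTASKS - 1))
echo "ranks=$NTASKS workers=$WORKERS"   # 8 cores -> 7-way parallelization
```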

#!/bin/sh

# Set SBATCH Directives
# Lines starting with "#SBATCH", before any shell commands are
# interpreted as command line arguments to sbatch.
# Don't put any commands before the #SBATCH directives or they won't work.

#SBATCH --export=ALL                          # Export all environment variables to job
#SBATCH --chdir=/lustre/aoc/observers/nm-4386 # Working directory
#SBATCH --time=8-0:0:0 # Request 8 days
#SBATCH --mem=128G # Memory for the whole job
#SBATCH --nodes=1 # Request 1 node
#SBATCH --ntasks-per-node=8 # Request 8 cores
CASAPATH=/home/casa/packages/RHEL7/release/current # Use a specific version of CASA

xvfb-run -d mpicasa ${CASAPATH}/bin/casa --nogui -c run_mpicasa.py

# mpicasa should be able to detect the number of nodes and cores
# defined by Slurm, so a machinefile shouldn't be necessary.
# But if you still want one, here is how to create it and use it.
#srun hostname > /tmp/machinefile.$$
#xvfb-run -d mpicasa -machinefile /tmp/machinefile.$$ ${CASAPATH}/bin/casa --nogui -c run_mpicasa.py
#rm -f /tmp/machinefile.$$

# If you actually want 8-way parallelization instead of 7-way
#xvfb-run -d mpicasa --oversubscribe -n 9 casa --nogui -c run_mpicasa.py

Parallel Multi-node Job

The procedure for submitting parallel multi-node batch jobs is very similar to submitting serial jobs. The differences are setting the --nodes and --ntasks-per-node options, and how casa is executed. The --nodes and --ntasks-per-node options specify the number of nodes and the number of cores on each node requested by the job. Both options default to 1.  Since the version of Open MPI that CASA uses is aware of Slurm, a machinefile is not necessary as it is with Torque/Moab. If you still want to use a machinefile, you will need to create it with the srun command. See the comments in the example file below.

Since some of our nodes run docker, which uses IPs on the 172.17.0.0/16 network, you should tell mpicasa to exclude interfaces on that network with the --mca option (see below).  Otherwise mpicasa may try to use this docker-created interface to talk to other nodes, which will fail.
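For reference, 172.17.0.0/16 covers every address from 172.17.0.0 through 172.17.255.255, which is why the docker-created 172.17.0.1 interface falls inside it. A shell sketch of the match, since a /16 fixes the first two octets and a simple prefix comparison suffices:

```shell
# Check whether an address falls in the 172.17.0.0/16 docker range.
IP="172.17.0.1"
case "$IP" in
  172.17.*) RESULT="excluded" ;;
  *)        RESULT="allowed"  ;;
esac
echo "$IP is $RESULT"
```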

#!/bin/sh

# Set SBATCH Directives
# Lines starting with "#SBATCH", before any shell commands are
# interpreted as command line arguments to sbatch.
# Don't put any commands before the #SBATCH directives or they will not work.

#SBATCH --export=ALL                          # Export all environment variables to job
#SBATCH --chdir=/lustre/aoc/observers/nm-4386 # Working directory
#SBATCH --time=30-0:0:0 # Request 30days after which the job will be killed
#SBATCH --mem=64G # Memory per node
#SBATCH --nodes=2 # Request exactly 2 nodes
#SBATCH --ntasks-per-node=6 # Request 12 cores total (6 per node)
CASAPATH=/home/casa/packages/RHEL7/release/casa-6.4.0-16 # Use specific version of CASA

xvfb-run -d mpicasa --mca btl_tcp_if_exclude "172.17.0.0/16" ${CASAPATH}/bin/casa --nogui -c /lustre/aoc/observers/nm-4386/run_mpicasa.py

# mpicasa should be able to detect the number of nodes and cores
# defined by Slurm, so a machinefile shouldn't be necessary.
# But if you still want one, here is how to create it and use it.
#srun hostname > /tmp/machinefile.$$
#xvfb-run -d mpicasa -machinefile /tmp/machinefile.$$ ${CASAPATH}/bin/casa --nogui -c /lustre/aoc/observers/nm-4386/run_mpicasa.py
#rm -f /tmp/machinefile.$$

# If you actually want 12-way parallelization instead of 11-way use --oversubscribe
#xvfb-run -d mpicasa --oversubscribe -n 13 casa --nogui -c run_mpicasa.py

Run job

sbatch run_casa.sh

 
