Translating between Torque, Slurm, and HTCondor
This page describes how common options, commands, and variables translate between several cluster scheduling systems. While Torque and Slurm are very similar in their usage, HTCondor is somewhat different: with Torque and Slurm you specify the requirements of a job with command-line arguments, whereas with HTCondor you create a Submit Description File that specifies the requirements and names the script to execute.
These translations are not meant to be exact, and a basic understanding of the cluster systems involved is expected. There are examples at the end of this document.
Submit Options
Description | Torque/Moab | Slurm | HTCondor |
---|---|---|---|
Script directive | #PBS | #SBATCH | NA |
Queue/Partition | -q <queue> | -p <partition> | requirements = (<partition> == True) OR +partition = "<partition>" |
Node count | -l nodes=<count> | -N <min>[-max]> | NA |
Core count | -l ppn=<count> | -n <count> OR -c <count> | request_cpus = <count> |
Wall clock limit | -l walltime=<hh:mm:ss> | -t <min> OR -t <days-hh:mm:ss> | periodic_remove = (time() - JobStartDate) > (<seconds>) |
Stdout | -o <filename> | -o <filename> | output = <filename> |
Stderr | -e <filename> | -e <filename> | error = <filename> |
Copy environment | -V | --export=ALL | getenv = true |
Email notification | -m [a|b|e] | --mail-type=[ALL, END, FAIL, BEGIN, NONE] | notification = [Always, Complete, Error, Never] |
Email address | -M <user_list> | --mail-user=<user_list> | notify_user = <user_list> |
Job name | -N <name> | -J <name> OR --job-name=<name> | batch_name = <name> |
Working directory | -d <path> OR -w <path> | -D <path> | initialdir = <path> |
Memory per node | -l mem=<count[kb, mb, gb, tb]> | --mem=<count[K, M, G, T]> | request_memory = <count> G |
Memory per core | -l pmem=<count[kb, mb, gb, tb]> | --mem-per-cpu=<count[K, M, G, T]> | NA |
Virtual memory per node | -l vmem=<count[kb, mb, gb, tb]> | NA | NA |
Virtual memory per core | -l pvmem=<count[kb, mb, gb, tb]> | NA | NA |
Memory per job | -L tasks=1:memory=<count[kb, mb, gb, tb]> | --mem=<count[K, M, G, T]> | request_memory = <count> G |
Job arrays | -t <arrayspec> | --array=<arrayspec> | queue seq <first> [<increment>] <last> |
Variable list | -v <var>=<val>[,<var>=<val>] | --export=<var>=<val>[,<var>=<val>] | environment = "<var>=<val> [<var>=<val>]" |
Script args | -F <arg1>[,<arg2>,...] | sbatch script <arg1>[,<arg2>,...] | arguments = <arg1> <arg2> ... |
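As a rough illustration of how these options line up, the three submissions below all ask for 1 node, 8 cores, 16 GB of memory, and a one-hour wall clock limit for a hypothetical script my_job.sh (the script name and numbers are placeholders, and the HTCondor line assumes a submit description file built from the table above):

qsub -l nodes=1:ppn=8,mem=16gb,walltime=1:00:00 my_job.sh
sbatch -N 1 -n 8 --mem=16G -t 0-1:00:00 my_job.sh
condor_submit my_job.htc    # my_job.htc would contain request_cpus = 8 and request_memory = 16 G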
Commands
Description | Torque/Moab | Slurm | HTCondor |
---|---|---|---|
Job alter | qalter | scontrol update | condor_qedit |
Job connect to | NA | srun --jobid <jobid> --pty bash -l | condor_ssh_to_job <jobid> |
Job delete | qdel <jobid> | scancel <jobid> | condor_rm <jobid> |
Job delete all user's jobs | qdel all | scancel --user=<user> | condor_rm <user> |
Job info detailed | qstat -f <jobid> | scontrol show job <jobid> | condor_q -long <jobid> |
Job info detailed | qstat -f <jobid> | scontrol show job <jobid> | condor_q -analyze -verbose <jobid> |
Job info detailed | qstat -f <jobid> | scontrol show job <jobid> | condor_q -better-analyze -verbose <jobid> |
Job info detailed | qstat -f <jobid> | scontrol show job <jobid> | condor_q -better-analyze -reverse -verbose <jobid> |
Job show all | qstat -1n | squeue | condor_q -global -all |
Job show all verbose | qstat -1n | squeue -all | condor_q -global -all -nobatch |
Job show all verbose | qstat -1n | squeue -all | condor_q -global -all -nobatch -run |
Job show DAGs | NA | NA | condor_q -dag -nobatch |
Job submit | qsub | sbatch | condor_submit |
Job submit simple | echo "sleep 27" | qsub | srun sleep 27 | condor_run "sleep 27" & |
Job submit interactive | qsub -I | srun --pty bash | condor_submit -i |
Node show free nodes | nodesfree | sinfo --states=idle --partition=<partition> -N | condor_status -const 'PartitionableSlot && Cpus == TotalCpus' |
Node show resources | qstat -q | sjstat -c | |
Node show state | pbsnodes -l all | sinfo -Nl | condor_status -state |
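For example, to find out why a submitted job has not started yet, the detailed-info commands from the table above can be run against the job ID (12345 is a placeholder):

qstat -f 12345
scontrol show job 12345
condor_q -better-analyze 12345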
Variables
Description | Torque/Moab | Slurm | HTCondor |
---|---|---|---|
Job Name | PBS_JOBNAME | SLURM_JOB_NAME | |
Job ID | PBS_JOBID | SLURM_JOBID | |
Tasks per node | PBS_NUM_PPN | SLURM_NTASKS_PER_NODE | |
Cores per step on this node | PBS_NUM_PPN | SLURM_CPUS_ON_NODE | |
Queue/Partition submitted to | PBS_O_QUEUE | SLURM_JOB_PARTITION | |
Queue/Partition running on | PBS_QUEUE | SLURM_JOB_PARTITION | |
User | PBS_O_LOGNAME | SLURM_JOB_USER | |
Number of nodes in job | PBS_NUM_NODES | SLURM_NNODES | |
Number of nodes in job | PBS_NUM_NODES | SLURM_JOB_NUM_NODES | |
Submit Host | PBS_O_HOST | SLURM_SUBMIT_HOST | |
Working dir | PBS_O_WORKDIR | PWD | |
Machine file | PBS_NODEFILE | NA | |
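A job script can read these variables at run time. The fragment below is only a sketch of how the Slurm variables might be used to record where a job ran; a Torque version would substitute PBS_JOBNAME, PBS_JOBID, PBS_QUEUE, PBS_O_LOGNAME, PBS_O_HOST, and PBS_NUM_NODES:

#!/bin/sh
echo "Job ${SLURM_JOB_NAME} (${SLURM_JOBID}) in partition ${SLURM_JOB_PARTITION}"
echo "Run by ${SLURM_JOB_USER} from ${SLURM_SUBMIT_HOST} on ${SLURM_NNODES} node(s)"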
Example Commands
Torque
qsub -V -N casatest01 -l pmem=16gb,pvmem=16gb -d /lustre/aoc/observers/nm-4386 -l walltime=2:30:00 -m ae run_casa.sh
Slurm
sbatch --export ALL -J casatest01 --mem=16G -D /lustre/aoc/observers/nm-4386 -t 0-2:30:00 --mail-type=END,FAIL run_casa.sh
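The same options can also be embedded at the top of run_casa.sh itself as #SBATCH directives, with the script then submitted as a plain "sbatch run_casa.sh". This is only a sketch and assumes run_casa.sh is a shell script:

#!/bin/sh
#SBATCH --export=ALL
#SBATCH -J casatest01
#SBATCH --mem=16G
#SBATCH -D /lustre/aoc/observers/nm-4386
#SBATCH -t 0-2:30:00
#SBATCH --mail-type=END,FAIL
# ... rest of run_casa.sh ...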
HTCondor
Create a Submit Description File (e.g. run_casa.htc)
executable = run_casa.sh
getenv = true
batch_name = casatest01
request_memory = 16 G
notification = Always
environment = "CASA_HOME=/home/casa/packages/RHEL7/release/current PPR_FILENAME=PPR.xml"
initialdir = /lustre/aoc/observers/nm-4683
log = condor.$(ClusterId).log
output = condor.$(ClusterId).log
error = condor.$(ClusterId).log
queue
Then submit that file
condor_submit run_casa.htc
While you can set a wall clock limit for an HTCondor job, it isn't advised in most cases.
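If a limit is needed anyway, the periodic_remove expression from the Submit Options table can be added to the Submit Description File. For example, the following line (a sketch, with 9000 seconds corresponding to the 2.5-hour limit used in the examples above) removes the job once it has been running that long:

periodic_remove = (time() - JobStartDate) > (9000)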