Cluster Scheduler - CV
New Cluster Paradigm as of February 5th, 2015
Please see the updated documentation here.
We will eventually incorporate the FAQ into this page, and merge with the Cluster Processing page.
DO NOT USE THE INSTRUCTIONS BELOW - THEY ARE OUTDATED.
Using the CV Cluster Scheduler
Currently we use only an interactive scheduler, where you submit "time requests" instead of traditional "job requests". Time requests are queued according to the order in which they are received and the number of nodes requested. Once access is granted, you and the root user can SSH to the node(s) in question and run whatever you wish on them.
The scheduler is in its most basic form and will soon incorporate more error checking, as well as "job requests" alongside "time requests". Eventually "time requests" will go away in favor of "job requests" as the scheduler and CASA mature for such an environment. The scheduler as it exists now is only a preliminary solution for scheduling access to the nodes. A different version currently exists at the AOC and will eventually be ported to Charlottesville when needed.
Prerequisites
These are the basic instructions for requesting dedicated time on one or more of the Lustre nodes. As a prerequisite, you must be able to SSH into foundation (a.k.a. elwood). From the NRAO internal network, this should work directly. Externally, you must first connect through the login host polaris.cv.nrao.edu and then to foundation.cv.nrao.edu.
nodescheduler is the command (available only on foundation, a.k.a. elwood) for requesting access to the nodes and for checking their state.
: dklopp @ elwood ; nodescheduler
Usage:
nodescheduler
--list [free|busy|all|mynodes|nodesbyuser|expiration|duration]
--request [dd:]hh:mm:ss [nodecount]
--request days.fractionofdays [nodecount]
--terminate [#|me]
: dklopp @ elwood ;
Reserving Time And Terminating Node Access
Requesting time
SSH into foundation (password required):
ssh foundation
To request time on a node or nodes, use the command:
nodescheduler --request dd:hh:mm:ss nodecount
adding the length of time, and how many nodes you would like, e.g.
nodescheduler --request 01:01:02:03 2
This requests 2 nodes for 1 day, 1 hour, 2 minutes, and 3 seconds.
Alternatively, you can request time in days:
nodescheduler --request 2.0 1
This requests 1 node for 2 days. Days must be given as a floating-point number: never write 2 days as "2"; always write "2.0". This is a limitation of the parser.
Once submitted, the request is placed in a queue and starts when the nodes become available. The command immediately returns a job ID of the form XXX.elwood.
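The two time formats above are easy to get wrong by hand, so a small helper can build the argument list. The following is only a sketch: the function names format_walltime, format_days, and request_command are hypothetical and not part of nodescheduler, and the zero-padded dd:hh:mm:ss layout is inferred from the example "01:01:02:03".

```python
from datetime import timedelta
from typing import List


def format_walltime(duration: timedelta) -> str:
    """Render a duration as dd:hh:mm:ss (layout assumed from the example above)."""
    total = int(duration.total_seconds())
    if total <= 0:
        raise ValueError("duration must be positive")
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days:02d}:{hours:02d}:{minutes:02d}:{seconds:02d}"


def format_days(days: float) -> str:
    """Render a day count so that a decimal point is always present (2 -> "2.0")."""
    if days < 0.0001:
        raise ValueError("day count too small to represent")
    text = f"{float(days):.4f}".rstrip("0")
    # "2.0000" strips down to "2."; restore one digit after the point.
    return text + "0" if text.endswith(".") else text


def request_command(time_arg: str, nodecount: int) -> List[str]:
    """Build the argument list for a time request; it does not run anything."""
    if nodecount < 1:
        raise ValueError("nodecount must be at least 1")
    return ["nodescheduler", "--request", time_arg, str(nodecount)]


if __name__ == "__main__":
    span = timedelta(days=1, hours=1, minutes=2, seconds=3)
    print(" ".join(request_command(format_walltime(span), 2)))
    print(" ".join(request_command(format_days(2), 1)))
```

The helper only assembles the command; actually submitting the request would still have to happen on foundation.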
When the nodes become available, you will receive an email containing something like:
PBS Job Id: 139.elwood
Job Name: requestnodes
Exec host: multivac19/0+multivac18/0+multivac17/0+multivac14/0
Begun execution
Three important things to notice:
- your job number (in this case 139)
- Exec host: nodes you are using during your requested time
- Begun execution: your requested time is underway, and the system is ready for you to use
Now the nodes are dedicated to your use for the amount of time requested. It may take up to one minute after the email is received before you can access the node. If you still cannot access the node after that, please send an email to the Charlottesville Helpdesk.
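If you want to pull the job number and node names out of the notification email programmatically, a sketch along the following lines may help. It assumes the email body matches the sample shown above (a "Job Id:" line, an "Exec host:" line with entries joined by "+", and a "Begun execution" line). The function name parse_notification is hypothetical.

```python
import re
from typing import Any, Dict


def parse_notification(body: str) -> Dict[str, Any]:
    """Extract job number, node names, and start flag from a notification email."""
    info: Dict[str, Any] = {"job_number": None, "nodes": [], "started": False}
    for raw in body.splitlines():
        line = raw.strip()
        match = re.search(r"Job Id:\s*(\d+)\.", line)
        if match:
            info["job_number"] = int(match.group(1))
        elif line.startswith("Exec host:"):
            hosts = line.split(":", 1)[1].strip().split("+")
            # Each entry looks like "multivac19/0"; keep only the host name.
            info["nodes"] = [h.split("/")[0] for h in hosts if h]
        elif line == "Begun execution":
            info["started"] = True
    return info


SAMPLE = """PBS Job Id: 139.elwood
Job Name: requestnodes
Exec host: multivac19/0+multivac18/0+multivac17/0+multivac14/0
Begun execution
"""

if __name__ == "__main__":
    print(parse_notification(SAMPLE))
```

The returned job number is the value you would later pass to --terminate.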
Terminating Your Time
If you complete your work on the node(s) before your allotted time expires, please free up the nodes for others by using this command:
nodescheduler --terminate [#|me]
where # is your job ID, e.g.
nodescheduler --terminate 139
following the example above. Again, you will receive an email when termination completes. It will look something like this:
PBS Job Id: 139.elwood
Job Name: requestnodes
Exec host: multivac19/0+multivac18/0+multivac17/0+multivac14/0
Execution terminated
Exit_status=271
resources_used.cput=00:00:00
resources_used.mem=3804kb
resources_used.vmem=210452kb
resources_used.walltime=00:03:39
If you have multiple requests underway concurrently, or otherwise in the queue, and you want to release them all, you can use this command:
nodescheduler --terminate me
This terminates all jobs requested under your login, whether they are running or still in the queue.
Checking Nodes' Status
For a quick, high-level view of the status of all nodes (as in the previous section, you must first SSH into foundation):
nodescheduler --list all
Typical output:
multivac09 rindebet
multivac10 nkimani
multivac11 jcrossle
multivac12 mrawling
multivac13 ahale
multivac14 jtobin
multivac17 akimball
multivac18 aremijan
multivac19 free
multivac20 free
multivac21 free
multivac22 down
multivac23 free
multivac24 rindebet
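The listing above is a simple two-column format, so it can be summarized with a few lines of Python. This sketch assumes each line is "node status-or-user" separated by whitespace, as in the typical output, and that "free" and "down" are the only non-user values. The names parse_node_list and summarize are hypothetical.

```python
from typing import Dict, List


def parse_node_list(text: str) -> Dict[str, str]:
    """Map node name to its second column ("free", "down", or a user name)."""
    nodes: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or unexpected lines
            nodes[parts[0]] = parts[1]
    return nodes


def summarize(nodes: Dict[str, str]) -> Dict[str, object]:
    """Group nodes into free, down, and per-user lists."""
    free: List[str] = []
    down: List[str] = []
    by_user: Dict[str, List[str]] = {}
    for node, value in nodes.items():
        if value == "free":
            free.append(node)
        elif value == "down":
            down.append(node)
        else:
            by_user.setdefault(value, []).append(node)
    return {"free": free, "down": down, "by_user": by_user}


SAMPLE = """multivac09 rindebet
multivac19 free
multivac20 free
multivac22 down
multivac24 rindebet
"""

if __name__ == "__main__":
    print(summarize(parse_node_list(SAMPLE)))
```

A user whose login happened to be "free" or "down" would be misclassified by this sketch.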
To list all the free nodes:
nodescheduler --list free
Typical output:
multivac19 free
multivac20 free
multivac21 free
multivac23 free
Which Nodes are Reserved by Whom
To see which nodes are currently reserved, and by which user:
nodescheduler --list busy
Typical output:
multivac09 rindebet
multivac10 nkimani
multivac11 jcrossle
multivac12 mrawling
multivac13 ahale
multivac14 jtobin
multivac17 akimball
multivac18 aremijan
multivac24 rindebet
To see the nodes you have requested that are under your control:
nodescheduler --list mynodes
Typical output:
multivac10
multivac11
Checking Job Expiration and Duration Times
To see how much time remains before each node is released, broken down by user (useful for planning around resources that will soon be released):
nodescheduler --list expiration
Typical output:
jcrossle will free 1 nodes in 1 day, 20 minutes
nkimani will free 1 nodes in 1 day, 59 minutes
mrawling will free 1 nodes in 6 days, 7 hours
ahale will free 1 nodes in 7 days, 1 hour, 7 minutes
aremijan will free 1 nodes in 8 hours, 58 minutes
jtobin will free 1 nodes in 9 hours, 54 minutes
rindebet will free 1 nodes in 9 days, 7 hours, 56 minutes
rindebet will free 1 nodes in 12 days, 22 hours, 59 minutes
akimball will free 1 nodes in 15 days, 6 hours, 58 minutes
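The expiration listing is not sorted by time remaining, so finding the next node to be released takes a moment of reading. A sketch like the following could do it, assuming every line follows the "USER will free N nodes in ..." wording of the typical output, with units of days, hours, minutes, or seconds. The function name parse_expiration is hypothetical.

```python
import re
from datetime import timedelta
from typing import List, Tuple

LINE = re.compile(r"^(\S+) will free (\d+) nodes? in (.+)$")
PART = re.compile(r"(\d+) (day|hour|minute|second)s?")


def parse_expiration(text: str) -> List[Tuple[str, int, timedelta]]:
    """Return (user, node count, time remaining) for each recognized line."""
    entries = []
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # ignore lines that do not fit the assumed wording
        user, count, rest = match.group(1), int(match.group(2)), match.group(3)
        # "1 day, 20 minutes" becomes {"days": 1, "minutes": 20}
        fields = {unit + "s": int(number) for number, unit in PART.findall(rest)}
        entries.append((user, count, timedelta(**fields)))
    return entries


SAMPLE = """jcrossle will free 1 nodes in 1 day, 20 minutes
aremijan will free 1 nodes in 8 hours, 58 minutes
jtobin will free 1 nodes in 9 hours, 54 minutes
"""

if __name__ == "__main__":
    soonest = min(parse_expiration(SAMPLE), key=lambda entry: entry[2])
    print(soonest[0], "frees", soonest[1], "node(s) in", soonest[2])
```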
To see how long each user has held their nodes:
nodescheduler --list duration
This should provide output similar to:
nkimani has had 1 nodes for 28 days, 23 hours, 1 minute
rindebet has had 1 nodes for 20 days, 16 hours, 4 minutes
rindebet has had 1 nodes for 17 days, 1 hour, 1 minute
mrawling has had 1 nodes for 14 days, 17 hours
aremijan has had 1 nodes for 6 days, 15 hours, 3 minutes
akimball has had 1 nodes for 5 days, 17 hours, 2 minutes
jtobin has had 1 nodes for 2 days, 14 hours, 7 minutes
jcrossle has had 1 nodes for 2 days, 23 hours, 40 minutes
ahale has had 1 nodes for 2 days, 22 hours, 54 minutes