Cluster Scheduler - CV

New Cluster Paradigm as of February 5th, 2015

Please see the updated documentation here.

We will eventually incorporate the FAQ into this page, and merge with the Cluster Processing page.

DO NOT USE THE INSTRUCTIONS BELOW - THEY ARE OUTDATED.

Using the CV Cluster Scheduler

Currently we only make use of an interactive scheduler, where you submit "time requests" instead of traditional "job requests". Your time requests are queued according to the order received and the number of nodes requested. Once access is granted, you and the root user can SSH to the node(s) in question and run what you wish on them.

The scheduler is in its most basic form and will soon incorporate more error checking and "job requests" as well as "time requests". Eventually "time requests" will go away in favor of "job requests" as the scheduler and CASA mature for such an environment. The scheduler as it exists now is only a preliminary solution for scheduling access to the nodes. A different version currently exists at the AOC, which will eventually be ported to Charlottesville when needed.

Prerequisites

These are the basic instructions for requesting dedicated time on one (or several) of the lustre nodes. As a prerequisite, the user must be able to ssh into foundation (a.k.a. elwood). From the NRAO internal network, this should work directly. From outside, a user must first log in to the login host polaris.cv.nrao.edu and from there ssh to foundation.cv.nrao.edu.

nodescheduler is the command, available only on foundation (a.k.a. elwood), used to request access to nodes or to check the state of the nodes in general.

Reserving Time And Terminating Node Access

Requesting time

ssh into foundation (password required):

ssh foundation

To request time on a node or nodes, use the command:

nodescheduler --request dd:hh:mm:ss nodecount

adding the length of time, and how many nodes you would like, e.g.

nodescheduler --request 01:01:02:03 2

This requests 2 nodes for 1 day, 1 hour, 2 minutes, and 3 seconds.
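
To sanity-check a dd:hh:mm:ss value before submitting it, the format can be decoded with a few lines of Python. This is only an illustrative sketch: the function name parse_request_duration is hypothetical and is not part of nodescheduler, and the field order is assumed to be exactly the days:hours:minutes:seconds order described above.

```python
from datetime import timedelta


def parse_request_duration(spec: str) -> timedelta:
    """Convert a 'dd:hh:mm:ss' string into a timedelta.

    Hypothetical helper for checking a request by hand; it does not
    call nodescheduler and assumes four colon-separated integer fields.
    """
    parts = spec.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected dd:hh:mm:ss, got {spec!r}")
    days, hours, minutes, seconds = (int(p) for p in parts)
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


if __name__ == "__main__":
    # The example request from the text: 1 day, 1 hour, 2 minutes, 3 seconds.
    print(parse_request_duration("01:01:02:03"))  # prints "1 day, 1:02:03"
```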

Alternatively, you can request time in days:

nodescheduler --request 2.0 1

This would request 1 node for 2 days. The number of days must be given as a floating point number. For example, never write 2 days as "2"; always write "2.0". This is a limitation of the parser.
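
If requests are assembled by a script, the decimal point can be enforced before the command line is built. The sketch below is an assumption-laden illustration: format_days is a hypothetical helper, and rounding to three decimal places is an arbitrary choice, not something the scheduler is known to require.

```python
def format_days(days) -> str:
    """Render a day count so it always contains a decimal point.

    Hypothetical helper: 2 becomes "2.0", 1.5 stays "1.5". Values are
    rounded to three decimals, which is an arbitrary choice here.
    """
    value = float(days)
    if value < 0.001:
        raise ValueError("day count must be at least 0.001")
    text = f"{value:.3f}".rstrip("0")  # "2.000" -> "2.", "1.500" -> "1.5"
    if text.endswith("."):
        text += "0"                    # "2." -> "2.0"
    return text


if __name__ == "__main__":
    # Build the argument list for the 2-day, 1-node request shown above.
    command = ["nodescheduler", "--request", format_days(2), "1"]
    print(" ".join(command))  # prints "nodescheduler --request 2.0 1"
```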

Once submitted, the request is placed in a queue and starts when the nodes become available. The command immediately returns a Job Id of the form XXX.elwood.

When the nodes become available, you will receive an email containing something like:

PBS Job Id: 139.elwood
Job Name: requestnodes
Exec host: multivac19/0+multivac18/0+multivac17/0+multivac14/0
Begun execution

Three important things to notice:

your job number (in this case 139)
Exec host: nodes you are using during your requested time
Begun execution: your time requested is underway, and the system is ready for you to use
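
The job number and node names can also be pulled out of such an email programmatically. The following is a minimal sketch based only on the sample message shown above; parse_start_email is a hypothetical name, and the assumption that each Exec host entry has the form node/slot joined by "+" is inferred from that one sample.

```python
import re


def parse_start_email(body: str) -> tuple[int, list[str]]:
    """Extract (job number, node names) from a 'Begun execution' email.

    Hypothetical helper; the patterns are inferred from the sample
    message in this page, not from any PBS specification.
    """
    job = re.search(r"Id:\s*(\d+)\.", body)
    hosts = re.search(r"Exec host:\s*(\S+)", body)
    if job is None or hosts is None:
        raise ValueError("email does not look like a scheduler notification")
    # "multivac19/0+multivac18/0" -> ["multivac19", "multivac18"]
    nodes = [entry.split("/")[0] for entry in hosts.group(1).split("+")]
    return int(job.group(1)), nodes


if __name__ == "__main__":
    sample = (
        "PBS Job Id: 139.elwood Job Name: requestnodes "
        "Exec host: multivac19/0+multivac18/0+multivac17/0+multivac14/0 "
        "Begun execution"
    )
    job_id, nodes = parse_start_email(sample)
    print(job_id, nodes)
```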

The nodes are now dedicated to your use for the amount of time requested. It may take up to one minute after the email is received before you can access the nodes. If you still can't access a node after that, please send an email to the Charlottesville Helpdesk.

Terminating Your Time

If you complete your work on the node(s) before your allotted time expires, please free up the nodes for others to use with this command:

nodescheduler --terminate [#|me]

where # is your job ID, e.g.

nodescheduler --terminate 139

following the example above. Again, you will receive an email when termination completes. It will look something like this:

PBS Job Id: 139.elwood
Job Name: requestnodes
Exec host: multivac19/0+multivac18/0+multivac17/0+multivac14/0
Execution terminated
Exit_status=271
resources_used.cput=00:00:00
resources_used.mem=3804kb
resources_used.vmem=210452kb
resources_used.walltime=00:03:39

If you have multiple requests underway concurrently, or otherwise in the queue, and you want to release them all, you can use this command:

nodescheduler --terminate me

This will terminate all jobs under your login, whether they are running or still in the queue.

Checking Nodes' Status

A quick way to see the status of all nodes at a high level (as in the previous section, you must first ssh into foundation):

nodescheduler --list all

Typical output:

multivac09 rindebet
multivac10 nkimani
multivac11 jcrossle
multivac12 mrawling
multivac13 ahale
multivac14 jtobin
multivac17 akimball
multivac18 aremijan
multivac19 free
multivac20 free
multivac21 free
multivac22 down
multivac23 free
multivac24 rindebet
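
Output in this shape is easy to turn into a lookup table. The sketch below assumes the listing is strictly whitespace-separated pairs of node name and status (a username, "free", or "down"), as in the sample above; parse_node_list is a hypothetical helper and not part of nodescheduler. A user whose login happened to be "free" or "down" would be indistinguishable from those states.

```python
def parse_node_list(output: str) -> dict[str, str]:
    """Map each node name to its status from '--list all' style output.

    Hypothetical helper; assumes alternating node/status tokens, so it
    works whether the pairs are on one line or one pair per line.
    """
    tokens = output.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens (node, status)")
    return dict(zip(tokens[0::2], tokens[1::2]))


if __name__ == "__main__":
    sample = """
    multivac18 aremijan
    multivac19 free
    multivac20 free
    multivac22 down
    """
    nodes = parse_node_list(sample)
    free = [name for name, status in nodes.items() if status == "free"]
    print(free)  # prints "['multivac19', 'multivac20']"
```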

To list all the free nodes:

nodescheduler --list free

Typical output:

multivac19 free
multivac20 free
multivac21 free
multivac23 free

Which Nodes are Reserved by Whom

To see which nodes are currently reserved, and by which user:

nodescheduler --list busy

Typical output:

multivac09 rindebet
multivac10 nkimani
multivac11 jcrossle
multivac12 mrawling
multivac13 ahale
multivac14 jtobin
multivac17 akimball
multivac18 aremijan
multivac24 rindebet

To see the nodes you have requested that are under your control:

nodescheduler --list mynodes

Typical output:

multivac10
multivac11

Checking Job Expiration and Duration Times

To see how much time remains before each node is released, broken down by user (useful for planning around resources that will soon be released):

nodescheduler --list expiration

Typical output:

jcrossle will free 1 nodes in 1 day, 20 minutes
nkimani will free 1 nodes in 1 day, 59 minutes
mrawling will free 1 nodes in 6 days, 7 hours
ahale will free 1 nodes in 7 days, 1 hour, 7 minutes
aremijan will free 1 nodes in 8 hours, 58 minutes
jtobin will free 1 nodes in 9 hours, 54 minutes
rindebet will free 1 nodes in 9 days, 7 hours, 56 minutes
rindebet will free 1 nodes in 12 days, 22 hours, 59 minutes
akimball will free 1 nodes in 15 days, 6 hours, 58 minutes
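
Because the sample listing does not appear to be in chronological order, finding the next node to be released takes a little parsing. The sketch below assumes the wording "USER will free N nodes in ..." and the day/hour/minute units seen in the sample output; parse_expirations is a hypothetical helper, not a nodescheduler feature.

```python
import re
from datetime import timedelta

# One entry: "<user> will free <n> nodes in <d> days, <h> hours, <m> minutes"
ENTRY = re.compile(r"(\w+) will free (\d+) nodes? in (\d+ \w+(?:, \d+ \w+)*)")
UNIT = re.compile(r"(\d+) (day|hour|minute)s?")


def parse_expirations(output: str) -> list[tuple[str, int, timedelta]]:
    """Return (user, node count, time remaining) for each entry.

    Hypothetical helper; the patterns are inferred from the sample
    output and work for one entry per line or entries run together.
    """
    entries = []
    for user, count, duration in ENTRY.findall(output):
        remaining = timedelta()
        for amount, unit in UNIT.findall(duration):
            remaining += timedelta(**{unit + "s": int(amount)})
        entries.append((user, int(count), remaining))
    return entries


if __name__ == "__main__":
    sample = (
        "mrawling will free 1 nodes in 6 days, 7 hours\n"
        "aremijan will free 1 nodes in 8 hours, 58 minutes\n"
        "jtobin will free 1 nodes in 9 hours, 54 minutes\n"
    )
    # Pick the entry with the least time remaining.
    soonest = min(parse_expirations(sample), key=lambda entry: entry[2])
    print(soonest[0], soonest[2])
```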

To see how long each user has held their nodes:

nodescheduler --list duration

This should provide output similar to:

nkimani has had 1 nodes for 28 days, 23 hours, 1 minute
rindebet has had 1 nodes for 20 days, 16 hours, 4 minutes
rindebet has had 1 nodes for 17 days, 1 hour, 1 minute
mrawling has had 1 nodes for 14 days, 17 hours
aremijan has had 1 nodes for 6 days, 15 hours, 3 minutes
akimball has had 1 nodes for 5 days, 17 hours, 2 minutes
jtobin has had 1 nodes for 2 days, 14 hours, 7 minutes
jcrossle has had 1 nodes for 2 days, 23 hours, 40 minutes
ahale has had 1 nodes for 2 days, 22 hours, 54 minutes
