SLURM Queue Manager
Chimera uses SLURM to manage its queues. Please see the SLURM Quick Start User Guide for basic usage instructions. The man pages for the individual SLURM commands contain more detailed information. This page summarizes some of the most commonly used commands and describes Chimera-specific considerations.
Partitions and Jobs
Every job handled by SLURM is inserted into a partition. Currently, there are at least three partitions on Chimera: all, 64g, and 128g. 64g and 128g are for nodes with 64GB and 128GB of memory, respectively. To see what partitions exist and get basic information about their status, use sinfo:
    -bash-3.2$ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    all          up   infinite     64  alloc  node[00-63]
    64g*         up   infinite     56  alloc  node[08-63]
    128g         up   infinite      8  alloc  node[00-07]
    test         up   infinite      2   idle  node[64-65]
    -bash-3.2$
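To run a job in a particular partition, pass -p (or --partition) when submitting. The sbatch script below is a minimal sketch; the job name, node count, time limit, and program (a.out) are illustrative placeholders, not Chimera-specific values:

    #!/bin/bash
    #SBATCH --job-name=myjob       # illustrative job name
    #SBATCH --partition=128g       # request the 128GB-memory nodes
    #SBATCH --nodes=4              # number of nodes to allocate
    #SBATCH --time=01:00:00        # wall-clock limit (1 hour)
    srun ./a.out                   # launch the program on the allocation

Submit it with "sbatch myjob.sh" (the file name is arbitrary).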
To see a list of all jobs currently queued or running, use squeue:
    -bash-3.2$ squeue
      JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         26       all   sbatch   siegel  R    1:06:37     64 node[00-63]
    -bash-3.2$
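squeue also accepts filters, and scancel removes a queued or running job by ID. Both are standard SLURM commands; the job ID below is just the one from the example above:

    -bash-3.2$ squeue -u $USER    # show only your own jobs
    -bash-3.2$ scancel 26         # cancel job 26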
SLURM epilog
After a job exits a node, an epilog script runs and kills all processes belonging to users who are not authorized to be running on that node. This has two useful effects: it cleans up jobs that declare themselves done without actually killing all of their sub-processes, and it terminates programs started on the node through means other than the SLURM manager.
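Chimera's actual epilog is not reproduced here, but the sketch below illustrates the general idea, assuming that only system accounts and the job's owner are authorized on the node (SLURM exports SLURM_JOB_USER to the epilog environment):

    #!/bin/bash
    # Illustrative epilog sketch -- not Chimera's actual script.
    # SLURM exports SLURM_JOB_USER (the job owner) to the epilog environment.
    for user in $(ps -eo user= | sort -u); do
        [ "$user" = "$SLURM_JOB_USER" ] && continue   # job owner is authorized
        uid=$(id -u "$user" 2>/dev/null) || continue
        [ "$uid" -lt 1000 ] && continue               # skip system accounts
        pkill -9 -u "$user"                           # kill stray user processes
    done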
Partition priorities and preemption
In order to accommodate large jobs as well as long-running jobs, Chimera has the following partitions. Each partition has different priorities at different times. Higher-priority jobs will preempt lower-priority jobs, causing them to be suspended until the higher-priority job completes. Note that priorities depend on the time of day!
Note: currently, only the first three partitions are implemented!
| Partition | Description | Normal Priority | Evening Priority | Weekend Priority | Max Nodes | Max Run Time |
|---|---|---|---|---|---|---|
| all | All nodes, for general use | 10 | 10 | 10 | 64 | 1 week |
| 64g | Nodes with 64GB of memory, for general use | 10 | 10 | 10 | 56 | 1 week |
| 128g | Nodes with 128GB of memory, for general use | 10 | 10 | 10 | 8 | 1 week |
| full | For jobs needing the whole machine | 5 | 5 | 20 | 64 | 64 hours* |
| half | For jobs needing half the nodes | 5 | 20 | 18 | 32 | 16 hours* |
| long | For long-running jobs | 5 | 5 | 5 | 8 | 12 weeks |
| express | For short turn-around jobs | 15 | 15 | 15 | 2 | 15 minutes |
(*) Jobs running in the "full" or "half" queues will be killed at 7AM on the next business day
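A partition and time limit can also be given directly on the command line with -p and -t. For example, a submission to the (implemented) 64g partition, staying within its one-week limit, might look like the following; job.sh is a placeholder script name:

    -bash-3.2$ sbatch -p 64g -t 7-00:00:00 -N 8 job.sh   # 8 nodes, 1-week limit

The -t value uses SLURM's days-hours:minutes:seconds format.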
MPI
As indicated in [Quick Start], there are several versions of MPI installed, each built with a different compiler. SLURM can be used to submit MPI jobs directly, but this requires some additional configuration. Please see the MPI Use Guide on the main SLURM site for full instructions.
Note: slurm.conf has MpiDefault set to "none". This is the correct (though counter-intuitive) setting for MVAPICH2 and OpenMPI.
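The relevant line in slurm.conf (typically /etc/slurm/slurm.conf, though the exact path may differ on Chimera) is simply:

    MpiDefault=none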
MVAPICH2
Choose your build and execution environment using modules.
MVAPICH2 jobs may be started directly using the srun command, but the program must be linked with the SLURM PMI library:
    mpicc -L/usr/local/lib -lpmi ...
    srun -n6 a.out
The [modules] environment includes MPI*_PROFILE variables for the SLURM profile; using one of these causes the SLURM PMI library to be linked.
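For example, a compile-and-run sequence might look like the following. The module name (mvapich2) and variable name (MPICC_PROFILE) are assumptions based on the MPI*_PROFILE pattern above; check "module avail" and your environment for the actual names:

    -bash-3.2$ module load mvapich2                    # assumed module name
    -bash-3.2$ mpicc $MPICC_PROFILE -o a.out prog.c    # profile links the SLURM PMI library
    -bash-3.2$ srun -n6 a.out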
OpenMPI
Choose your build and execution environment with modules.
OpenMPI (1.4.2) uses SLURM to allocate resources. The job is then run using mpirun.
    $ salloc -n4 sh   # allocates 4 processors and spawns shell for job
    > mpirun a.out
    > exit            # exits shell spawned by initial salloc command
You can also do this in one step if no other steps need to be taken within the allocation:
$ salloc -n4 mpirun a.out
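The same pattern works non-interactively with sbatch: inside the batch allocation, OpenMPI discovers the SLURM-allocated nodes on its own, so mpirun needs no host list. A minimal sketch (the file name and program are placeholders):

    #!/bin/bash
    #SBATCH -n 4          # request 4 tasks
    mpirun a.out          # OpenMPI reads the allocation from SLURM

Submit with "sbatch ompi_job.sh".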