SLURM Queue Manager
Chimera uses SLURM to manage its queues. Please see the SLURM Quick Start User Guide for basic usage instructions. The man pages for the individual SLURM commands contain more detailed information. This page summarizes some of the most commonly used commands and describes Chimera-specific considerations.
Partitions and Jobs
Every job handled by SLURM is inserted into a partition. Currently, there are at least three partitions on Chimera: all, 64g, and 128g. 64g and 128g are for nodes with 64GB and 128GB of memory, respectively. To see what partitions exist and get basic information about their status, use sinfo:
    -bash-3.2$ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    all          up   infinite     64  alloc  node[00-63]
    64g*         up   infinite     56  alloc  node[08-63]
    128g         up   infinite      8  alloc  node[00-07]
    test         up   infinite      2   idle  node[64-65]
    -bash-3.2$
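To run a job in a particular partition, pass -p (or --partition) when submitting. The sbatch script below is a minimal sketch; the job name, node count, time limit, and program (a.out) are illustrative placeholders, not Chimera-specific values:

    #!/bin/bash
    #SBATCH --job-name=myjob       # illustrative job name
    #SBATCH --partition=128g       # request the 128GB-memory nodes
    #SBATCH --nodes=4              # number of nodes to allocate
    #SBATCH --time=01:00:00        # wall-clock limit (1 hour)
    srun ./a.out                   # launch the program on the allocation

Submit it with "sbatch myjob.sh" (the file name is arbitrary).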
To see a list of all jobs currently queued or running, use squeue:
    -bash-3.2$ squeue
      JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         26       all   sbatch   siegel  R    1:06:37     64 node[00-63]
    -bash-3.2$
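squeue also accepts filters, and scancel removes a queued or running job by ID. Both are standard SLURM commands; the job ID below is just the one from the example above:

    -bash-3.2$ squeue -u $USER    # show only your own jobs
    -bash-3.2$ scancel 26         # cancel job 26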
SLURM epilog
After a job exits a node, an epilog script runs and kills all processes belonging to users who are not authorized to be running on that node. This has two useful effects: it cleans up jobs that declare themselves done without actually killing all of their sub-processes, and it terminates programs started on the node through means other than the SLURM manager.
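Chimera's actual epilog is not reproduced here, but the sketch below illustrates the general idea, assuming that only system accounts and the job's owner are authorized on the node (SLURM exports SLURM_JOB_USER to the epilog environment):

    #!/bin/bash
    # Illustrative epilog sketch -- not Chimera's actual script.
    # SLURM exports SLURM_JOB_USER (the job owner) to the epilog environment.
    for user in $(ps -eo user= | sort -u); do
        [ "$user" = "$SLURM_JOB_USER" ] && continue   # job owner is authorized
        uid=$(id -u "$user" 2>/dev/null) || continue
        [ "$uid" -lt 1000 ] && continue               # skip system accounts
        pkill -9 -u "$user"                           # kill stray user processes
    done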
Partition priorities and preemption
In order to accommodate large jobs as well as long-running jobs, Chimera has the following partitions. Each partition has different priorities at different times. Higher-priority jobs will preempt lower-priority jobs, causing them to be suspended until the higher-priority job completes. Note that priorities depend on the time of day!
Note: currently, only the first three partitions are implemented!
| Partition | Description | Normal Priority | Evening Priority | Weekend Priority | Max Nodes | Max Run Time |
|---|---|---|---|---|---|---|
| all | All nodes, for general use | 10 | 10 | 10 | 64 | 1 week |
| 64g | Nodes with 64GB of memory, for general use | 10 | 10 | 10 | 56 | 1 week |
| 128g | Nodes with 128GB of memory, for general use | 10 | 10 | 10 | 8 | 1 week |
| full | For jobs needing the whole machine | 5 | 5 | 20 | 64 | 64 hours* |
| half | For jobs needing half the nodes | 5 | 20 | 18 | 32 | 16 hours* |
| long | For long-running jobs | 5 | 5 | 5 | 8 | 12 weeks |
| express | For short turn-around jobs | 15 | 15 | 15 | 2 | 15 minutes |
(*) Jobs running in the "full" or "half" queues will be killed at 7AM on the next business day
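A partition and time limit can also be given directly on the command line with -p and -t. For example, a submission to the (implemented) 64g partition, staying within its one-week limit, might look like the following; job.sh is a placeholder script name:

    -bash-3.2$ sbatch -p 64g -t 7-00:00:00 -N 8 job.sh   # 8 nodes, 1-week limit

The -t value uses SLURM's days-hours:minutes:seconds format.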
MPI
As indicated in [Quick Start], there are several versions of MPI installed, each built with a different compiler. SLURM can be used to submit MPI jobs directly, but this requires some additional configuration. Please see the MPI Use Guide on the main SLURM site for full instructions.
Note: slurm.conf has MpiDefault set to "none". This is the correct (though counter-intuitive) setting for MVAPICH2 and OpenMPI.
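The relevant line in slurm.conf (typically /etc/slurm/slurm.conf, though the exact path may differ on Chimera) is simply:

    MpiDefault=none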
MVAPICH2
Choose your build and execution environment using modules.
MVAPICH2 jobs may be started directly using the srun command, but the program must be linked with the SLURM PMI library:
    mpicc -L/usr/local/lib -lpmi ...
    srun -n6 a.out
The [modules] environment includes MPI*_PROFILE variables for the SLURM profile; using one of these causes the SLURM PMI library to be linked.
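For example, a compile-and-run sequence might look like the following. The module name (mvapich2) and variable name (MPICC_PROFILE) are assumptions based on the MPI*_PROFILE pattern above; check "module avail" and your environment for the actual names:

    -bash-3.2$ module load mvapich2                    # assumed module name
    -bash-3.2$ mpicc $MPICC_PROFILE -o a.out prog.c    # profile links the SLURM PMI library
    -bash-3.2$ srun -n6 a.out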
OpenMPI
Choose your build and execution environment with modules.
OpenMPI (1.4.2) uses SLURM to allocate resources. The job is then run using mpirun.
    $ salloc -n4 sh   # allocates 4 processors and spawns shell for job
    > mpirun a.out
    > exit            # exits shell spawned by initial salloc command
You can also do this in one step if no other steps need to be taken within the allocation:
$ salloc -n4 mpirun a.out
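The same pattern works non-interactively with sbatch: inside the batch allocation, OpenMPI discovers the SLURM-allocated nodes on its own, so mpirun needs no host list. A minimal sketch (the file name and program are placeholders):

    #!/bin/bash
    #SBATCH -n 4          # request 4 tasks
    mpirun a.out          # OpenMPI reads the allocation from SLURM

Submit with "sbatch ompi_job.sh".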