
Running Jobs

This page covers how to submit jobs to run on the COSMOS machines, basic usage of the MOAB scheduler, the different batch queues on the systems, sample job scripts, and memory considerations for your jobs.

Job submission

It is strongly recommended to use only the MOAB commands, not the PBS/Torque ones, as the latter are unaware of the multiple partitions: jobs submitted with 'qsub' on universe will be scheduled to run on universe only. Please read the MOAB documentation's end-user commands overview, especially for the commonly used commands: msub, showq, canceljob, checkjob, showstart, showbf. You can read the manual for each of these commands with man, for example:

$ man msub

Usage in brief:

Command     Purpose
showbf      To find out how many cores are available right now, on which partition, and for how long.
showq       To see the queues.
msub        To submit a job via a job script.
showstart   To get the system's opinion on when the job may start.
checkjob    To check job status, how it is running, or why it is not running.
canceljob   To cancel a job or jobs.

Read the manual pages - there are useful flags for each command to filter the information.
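For example (the job ID 123456 here is just a placeholder):

$ showq -u $USER       # list only your own jobs
$ checkjob -v 123456   # verbose status of job 123456
$ showbf -p cosmos2    # free resources on the cosmos2 partition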

N.B. Please do not be tempted to run automatic scripts which run the above commands over and over, especially the querying commands like showq - MOAB is not very tolerant of insistent poking.

Job Accounting

Your job will be submitted as part of the project which is your primary group. You can check that with the command

id

which will output something like

uid=N(X) gid=M(Y) groups=M(Y),P(Z)

Here, the M and Y following gid= are your primary group number and name; P and Z are the number and name of a secondary group, of which there can be any number.

In order to submit a job as a non-primary group, the following needs to be added to the job script:

#PBS -W group_list=GROUP

or on the command line:

msub -W group_list=GROUP USUAL PARAMETERS

where GROUP is the desired group name. The order of the command-line options makes no difference, so -W group_list=GROUP can just as well come after your usual options, but it must come before the job script file name.
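For example, a complete submission under a hypothetical group cosmoproj with a hypothetical job script myjob.sh might look like:

$ msub -q small -l nodes=1:ppn=6,walltime=02:00:00 -W group_list=cosmoproj myjob.sh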

Moab Accounts

If you are a member of more than one project (DiRAC project or internal project), you will need to specify which "Moab account" your job runs under. You can find your default using the command

mcredctl -q config user:YOURUSERNAME

and look for "ADEF". The list following "ALIST" contains the other accounts you are allowed to access. If you need to use a non-default account, pass msub the -A parameter, like this:

msub -A accountname USUAL PARAMETERS
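For example, to charge a job to a hypothetical account dp002 (the queue, resource request and script name are placeholders too):

$ msub -A dp002 -q small2 -l nodes=2:ppn=8,walltime=06:00:00 myjob.sh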

Job Dependencies

If you need to run jobs in a particular order, but would like to just submit them all in one go, Moab supports job dependencies. There are several dependency types available, but most likely you will want job B to start after and only after a successful completion of job A. This can be accomplished by the following msub parameters, where JOBAID is the job ID of job A:

msub -W x=depend:afterok:JOBAID USUAL PARAMETERS
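A minimal sketch of chaining two jobs this way, assuming hypothetical scripts jobA.sh and jobB.sh (msub prints the new job's ID on standard output; on some systems it is preceded by a blank line, so trim whitespace if needed):

# submit job A and capture its job ID
JOBAID=$(msub jobA.sh)
# submit job B so that it starts only after job A completes successfully
msub -W x=depend:afterok:$JOBAID jobB.sh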

Resource allocation

The examples below were written with the UV1 architecture in mind, but all of the considerations apply equally to jobs run on the UV2 architecture, with the following changes:
- there are 8 cores per node, not 6 (so use ppn=8 instead of ppn=6)
- the queue names are small2, large2 and super2
Ask for further clarification if confused! It can be daunting at times.

On UV1 nodes (UNIVERSE and COSMOS): please assume, for simplicity, that 2.5GB of RAM are available per core and that cores are allocated in chunks of 6. There are currently 372 cores available on each partition for batch processing, and 288 cores is the current maximum size for a job. It is possible to run larger jobs, of course, but that requires special consideration.

Default maximum walltime is now 12 hours for all but special cases and the long5d queue.

On the UV2 node (COSMOS2): the default maximum job size is 512 cores (nodes=64:ppn=8) and the available memory per core is 7.3GB, so up to 3737GB of RAM can be allocated for the largest regular job. Even larger jobs require privileged access to the super2 queue; ask for it if you believe you have a justification for such jobs. Below are three templates for job submission scripts for different types of jobs - MPI, OpenMP and Hybrid. Let the COSMOS support team know if you have problems, questions or suggestions about them.

There is now a new queue called "mic" on cosmic, which is intended for using the Intel Xeon Phi coprocessors.

Queues

Please note that the "-q queuename" parameter is now mandatory.

It can be specified either as a job script option or as a command-line parameter; see below for examples of both.
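For example (myjob.sh is a placeholder name), inside the job script:

#PBS -q small2

or on the command line:

$ msub -q small2 myjob.sh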

On COSMOS / UNIVERSE

Name    Min Procs  Max Procs  Max Memory / Job  Max Wall Clock Time
debug   1          6          15.65 GB          08:00:00
small   6          48         125.25 GB         12:00:00
large   48         288        751.5 GB          12:00:00
long5d  1          6          15.65 GB          5:00:00:00

On COSMOS2/COSMIC

Name     Min Procs  Max Procs  Max Memory / Job  Max Wall Clock Time
debug2   1          16         117 GB            08:00:00
small2   8          64         468 GB            12:00:00
large2   64         512        3737 GB           12:00:00
super2*  64         1024       7474 GB           120:00:00
mic      1          200        1460 GB           12:00:00

NB debug and debug2 are for interactive job sessions only.

*super2 requires permission from COSMOS management to be used.

Sample Job Scripts

1. A pure MPI job.

#!/bin/bash 
#PBS -q small
#PBS -N my_mpitest_x6   # Name of your job
#PBS -j oe              # Capture STDOUT and STDERR into one output file
#PBS -V                 # Pass on to job your whole current environment
#
# Resource specifications - please be as accurate as possible
#
#PBS -l nodes=1:ppn=6     # No. processors required
#PBS -l walltime=06:00:00 # wall clock time required

# Get the number of allocated CPUs
NP=$PBS_NP
C_LAUNCH="mpiexec_mpt -np $NP dplace -s1"

cd $PBS_O_WORKDIR

########################################################################
# User commands below. Modify accordingly
########################################################################
$C_LAUNCH ./my_exe params ...
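To submit this script, save it (say as mpi_job.sh - the name is just an illustration) and pass it to msub:

$ msub mpi_job.sh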

2. A pure OpenMP job.

#!/bin/bash 
#PBS -q small
#PBS -N my_omptest_x6   # Name of job
#PBS -j oe              # Capture STDOUT and STDERR into one output file
#PBS -V                 # Pass on to job your whole current environment
#
# Resource specifications - please be accurate
#
#PBS -l nodes=1:ppn=6     # No. processors required
#PBS -l walltime=04:00:00 # wall clock time required

# Get the number of allocated CPUs, or force a smaller number
# if you want to run fewer threads to have more RAM per thread, for example
OMP_NUM_THREADS=$PBS_NP

# additional settings, which you may want to change later
KMP_LIBRARY=throughput
KMP_STACKSIZE=1gb
OMP_NESTED=FALSE

export KMP_LIBRARY KMP_STACKSIZE OMP_NESTED OMP_NUM_THREADS

# Set unlimited soft stacksize limit - important for OpenMP applications
ulimit -Ss unlimited

# It is very important to ensure thread/cpu affinity, using placement
C_LAUNCH="dplace -x2"

cd $PBS_O_WORKDIR

########################################################################
# User commands below
########################################################################
$C_LAUNCH ./my_exe params ...
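As noted in the script's comments, you can force fewer threads than allocated cores to get more RAM per thread; for example, replacing the OMP_NUM_THREADS line above with a fixed value (3 is just an illustration):

OMP_NUM_THREADS=3   # only 3 threads on the 6 allocated cores: ~5GB RAM per thread on UV1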

3. A hybrid MPI/OpenMP job. This is the most complex one - make sure you set
the correct numbers for the total number of cpus (N), MPI processes (P) and
OMP threads (M), so that N = P x M holds.

#!/bin/bash 
#PBS -q small
#PBS -N my_hybrid_x2x6  # Name of a job
#PBS -j oe              # Capture STDOUT and STDERR into one output file
#PBS -V                 # Pass on to job your whole current environment

#
# Resource specifications - please be accurate
#
#PBS -l nodes=2:ppn=6     # No. processors required
#PBS -l walltime=04:00:00 # wall clock time required

# In hybrid cases it is always necessary to explicitly set the numbers
NP=2                    # number of MPI processes (P)
OMP_NUM_THREADS=6     # number of OMP threads (M) per MPI process

# some defaults
KMP_LIBRARY=throughput
KMP_STACKSIZE=1gb
OMP_NESTED=FALSE

export KMP_STACKSIZE KMP_LIBRARY OMP_NESTED OMP_NUM_THREADS

# Set unlimited soft stacksize limit
ulimit -Ss unlimited

# the following may increase or decrease the performance -
# try with or without
#export MPI_OPENMP_INTEROP=1

# This is default, but may also require explicit cpu selections
# ask for help if you are not getting the expected performance
C_LAUNCH="mpiexec_mpt -np $NP omplace -nt $OMP_NUM_THREADS"
cd $PBS_O_WORKDIR

########################################################################
# User commands below
########################################################################
$C_LAUNCH ./my_hybrid_exe params ...
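The same 12-core allocation (nodes=2:ppn=6) could, for instance, be split differently, as long as N = P x M still holds - e.g. 4 MPI processes with 3 OpenMP threads each:

NP=4                  # number of MPI processes (P)
OMP_NUM_THREADS=3     # number of OMP threads (M) per MPI process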

Interactive Sessions for Application development/debugging

For application development and debugging you may need a few more processors and more memory. Say you reckon that 4 cores and 28GB should be enough; then start a development session like this:

$ msub -q debug2 -I -V -l nodes=1:ppn=4,mem=28gb,walltime=02:00:00 -N dev_session

But before that, it is always worth checking if there are nodes available -

$ showbf -p cosmos2
resources not available

At the moment, unfortunately, all CPUs are claimed, so you can try to figure out when sufficient resources will become available:

$ showstart -c debug2 4@7200
job 4@7200 requires 4 procs for 2:00:00

Estimated Rsv based start in                00:00:00 on Tue Feb  5 19:50:00
Estimated Rsv based completion in            2:00:00 on Tue Feb  5 21:50:00

Best Partition: cosmos2

- ok then, resources have just become available so you can run a session as above.
N.B. Make sure you exit the session when you have finished debugging, so as not to leave the resources idle and unavailable to other users.
Interactive sessions can also be run on the UV1 systems - COSMOS and UNIVERSE - when required, say for interactive data analysis with Matlab, IDL, Paraview, Visit or similar:

To run on a UV1 architecture:

$ msub -q debug -I -V -l ncpus=6,mem=15gb,walltime=01:00:00 -N debug_session
qsub: waiting for job XXXXXX.universe.damtp.cam.ac.uk to start

... (you may have to wait until the resources are available, but eventually you get:)

qsub: job XXXXXX.universe.damtp.cam.ac.uk ready
-
- Interactive job (XXXXXX) started on 6 cpu cores
-

time left: 00:59:59
/nfs/home/users/ak591$

- which means that you have requested 6 cores for 1 hour

(When finished - enter command: exit)

N.B. With a command like the above, the scheduler will place the interactive session on either cosmos or universe - whichever is less loaded at the time. But sometimes the interactive session needs to run on the same node you are logged onto. In that case you need to tell the scheduler:

$ msub -q debug -I -V -l partition=cosmos -l ncpus=6,mem=15gb,walltime=01:00:00 -N debug_session


Other Useful Torque Options

You can have the scheduler e-mail you about the status of the job you've submitted.

#PBS -m abe    # OR...
#PBS -m ae

Specify a string which consists of either the single character "n" (no mail), or one or more of the characters "a" (send mail when the job is aborted), "b" (send mail when the job begins), and "e" (send mail when the job terminates). The default is "a" if not specified. These e-mails will be sent to the e-mail address you signed up to COSMOS with. You can also specify the address the e-mails are sent to via the PBS option:

#PBS -M joe.bloggs@email.com

Efficient Memory Usage

Important - resource requests should maintain the balance between cores and memory, meaning that 'nodes*ppn' should give you enough memory to actually run your program. That is, mempercore*nodes*ppn should be larger than what your program needs, where mempercore = 2.5GB on UV1 and mempercore = 7.3GB on UV2.
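As a worked example (the figures are purely illustrative): an MPI code needing roughly 100GB of RAM on UV2 needs at least 100/7.3 ≈ 14 cores, which rounds up to nodes=2:ppn=8, i.e. 16 cores providing about 16 x 7.3 = 116.8GB.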

You can request more nodes*ppn than the minimum, and quite often that will give a faster turnaround, i.e. if nodes=32:ppn=8 executes your code in 10 hours, nodes=64:ppn=8 might run it in about 5 hours. However, the downside is that you should expect longer queueing times for bigger jobs. Also note that if your code is sped up by noticeably less than a factor of 2 when doubling the number of cores it uses (its "strong scaling" is poor), then you might end up running for, say, 3/4 of the time but "paying" for 2 times the cores, i.e. paying 6/4 the price of the half-size job. Therefore, judicious use of resources is always prudent.

The "mem" argument in job scripts is not usually necessary: it is only useful if you request "partial nodes", i.e. ppn<6 on cosmos/universe or ppn<8 on cosmos2. If you request complete nodes, you will get all the memory available on your nodes and asking for less will only cause some of that memory to be useless to everyone, including yourself. On partial nodes, it is prudent to limit your memory use so other users can access the remaining memory, but on full nodes there are no other users so it is not necessary to be friendly towards other users.

If you requested, say, 24 cores but do not actually want to use them all (you still "pay" for the idle ones!), simply tell your application how many cores to use via the mpiexec_mpt flags and/or the OMP_NUM_THREADS variable.

Say you want to run 2 MPI processes - just put it in:

mpiexec_mpt -np 2 dplace -s1 ./your_app ...

Or with 12 OpenMP threads:

export OMP_NUM_THREADS=12 
dplace -x2 ./omp_app ...

Or a hybrid code - 6 MPI x 2 OpenMP:

export OMP_NUM_THREADS=2 
mpiexec_mpt -np 6 omplace -nt 2 ./hyb_app ...

Parallel I/O

The new storage directly attached to the COSMOS2 compute system, although not a 'proper' parallel file system like Lustre or GPFS on a cluster, is still capable of parallel I/O operations. A few rules need to be followed to achieve maximum throughput:

  1. The number of concurrent I/O operations should be kept low - 2-8 is the best choice, although up to 16 can be OK, depending on the file sizes. This means that large MPI applications which sometimes allow each MPI process to perform I/O will suffer if left with their default settings. Gadget2/3 is an example of a sensible approach: it allows the user to choose the number of parallel I/O streams.
  2. Only parallel access to different files is efficient - access to a single file cannot be done in parallel, unless the file is small enough to fit into the system cache.
  3. As is common on many HPC systems, the COSMOS stores are optimised for operations on large files - it is always better to keep the data in a few large files rather than in many small ones. The optimal file size is in the range of 100KB - 100GB.