Stoomboot¶
The Stoomboot batch cluster allows you to submit jobs to run on its nodes. All Stoomboot nodes run CentOS 7.
The stbc-users mailing list has been set up for communication about Stoomboot: announcements, issues, etc. All users are encouraged to subscribe at https://mailman.nikhef.nl/mailman/listinfo/stbc-users
Interactive Nodes¶
Six interactive nodes are available for interactive and testing use:

- stbc-i1 and stbc-i2 run CentOS 7
- stbc-g1 runs CentOS 7 and contains an NVIDIA 1080Ti GPU
- stbc-g2 and wn-lot-008 run CentOS 7 and contain NVIDIA V100 GPUs
- wn-lot-001 runs CentOS 7 and contains AMD MI50 GPUs
Please keep CPU/GPU consumption and testing time to a minimum, and run your real jobs on the batch nodes.
Queues¶
Jobs have to be submitted to one of the available queues, listed below:
| Queue | Default Length | Max Length |
| --- | --- | --- |
| express | 10m | 10m |
| generic | 24h | 24h |
| gpu-nv | 24h | 96h |
| gpu-amd | 24h | 96h |
| long | 48h | 96h |
| multicore | 96h | 96h |
| short | 4h | 4h |
The default queue is “generic”.
The status of the queues can be queried using:
$> qstat -Q
Queue             Max  Tot  Ena  Str  Que  Run  Hld  Wat  Trn  Ext  T  Cpt
----------------  ---  ---  ---  ---  ---  ---  ---  ---  ---  ---  -  ---
gpu7                0    0  yes  yes    0    0    0    0    0    0  E    0
gpu                 0    0  yes  yes    0    0    0    0    0    0  R    0
express             0    0  yes  yes    0    0    0    0    0    0  R    0
short               0    0  yes  yes    0    0    0    0    0    0  R    0
long7             168   47  yes  yes    0   47    0    0    0    0  E    0
generic             0    0  yes  yes    0    0    0    0    0    0  R    0
long                0    0  yes  yes    0    0    0    0    0    0  R    0
multicore7          0    0  yes  yes    0    0    0    0    0    0  E    0
multicore           0    0  yes  yes    0    0    0    0    0    0  R    0
generic7          168    1  yes  yes    0    1    0    0    0    0  E    0
short7              0  650  yes  yes    0  404    0    0    0    0  E  246
express7            0    0  yes  yes    0    0    0    0    0    0  E    0
Allocated resources¶
Jobs are allocated a single core and 8 GiB of memory by default. If more cores are needed, please use the multicore queue, as sketched below.
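On Torque-based clusters, the number of cores is typically requested via the resource list; the exact form required on Stoomboot is not documented here, so treat the following as a sketch:

$> qsub -q multicore -l nodes=1:ppn=4 some_script.sh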
Jobs submitted to the queues prefixed with gpu will be allocated a full node (all cores, at least 64 GiB of memory, and all of the GPUs on the node). The gpu-nv queue contains two nodes with a 1080Ti GPU and one node with two V100 GPUs. To select a particular type of GPU, additional parameters can be passed, as explained below.
Submitting Jobs¶
To submit jobs, the qsub command may be used, with the following basic syntax:
$> qsub [ -q <queue> ] some_script.sh
`qsub` has the following default behaviour:

- jobs are submitted to the `generic` queue
- `some_script.sh` is executed in your home directory
- `some_script.sh` is executed in a “clean” login shell
- standard output and standard error are sent to separate files
These defaults are not very convenient: they tend to fill your home directory with a lot of log files, and `some_script.sh` itself has to include proper initialisation of the environment the job requires.
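A job script therefore typically starts by moving to a working directory and setting up its environment. A minimal sketch, in which all paths and names are placeholders:

```bash
#!/bin/bash
# some_script.sh -- minimal job script sketch; paths are placeholders.

# The job starts in the home directory in a clean login shell, so
# move to the working directory and initialise the environment first.
cd /project/myexperiment/myanalysis
source ./setup_environment.sh

# Run the actual workload.
./run_analysis
```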
Common Command Line Options¶
- `-j oe`: merge stdout and stderr in a single file; a single “.o” file is written.
- `-q <queuename>`: choose the batch queue; the default queue is “generic”.
- `-o <filename>`: choose a different filename for stdout.
- `-V`: pass all environment variables of the submitting shell to the batch job (with the exception of `$PATH`).
- `-l host=v100`: run the job on a Stoomboot node with a specific type of GPU.
A full list of options can be obtained from `man qsub`. For more info on the available labels, see https://wiki.nikhef.nl/ct/index.php?title=Stoomboot_cluster.
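As an illustration, several of these options can be combined in a single submission (the script and log file names below are placeholders):

$> qsub -q gpu-nv -j oe -o train_gpu.log -l host=v100 train_gpu.sh

This submits train_gpu.sh to the gpu-nv queue, asks for a node with a V100 GPU, and writes the merged stdout/stderr to train_gpu.log.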
Job Status¶
The `qstat` command shows the status of all jobs in the system. Status code ‘C’ indicates completed, ‘R’ indicates running and ‘Q’ indicates queued:
$> qstat
Job ID                  Name              User    Time Use  S  Queue
----------------------  ----------------  ------  --------  -  --------
1237000.burrell         ...e_files_batch  user_1  07:05:17  R  long6
1237001.burrell         ...e_files_batch  user_1  07:04:59  R  long6
1237002.burrell         ...e_files_batch  user_1  07:05:30  R  long6
1240157.burrell         ...81115_0910.sh  user_2  04:07:05  C  generic7
1240186.burrell         ...a_asm_p42_p42  user_3  03:25:06  R  generic7
1240187.burrell         ...a_asm_p27_p27  user_3  03:25:00  R  generic7
1240188.burrell         ...a_asm_p43_p43  user_3  03:24:57  R  generic7
1240189.burrell         ...ana_asm_p5_p5  user_3  03:24:52  R  generic7
The `qstat -u <username>` command shows the status of your own jobs:
$> qstat -u username
Job ID                  Username  Queue     Jobname       SessID  NDS  TSK  Memory  Time      S  Time
----------------------  --------  --------  ------------  ------  ---  ---  ------  --------  -  --------
1241010.burrell.nikhef  username  generic7  test_qsub.sh  27126   --   --   --      00:10:00  R  00:00:11
Completed jobs are listed with status ‘C’ only for the first 10 minutes after completion. The output of jobs that completed longer ago is kept, but they are simply no longer listed in the status overview.
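To inspect a single job in more detail, the standard Torque option `qstat -f` can be used (the job ID below is a placeholder):

$> qstat -f 1241010.burrell.nikhef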
The general level of activity on Stoomboot is graphically monitored at http://www.nikhef.nl/grid/stats/stbc/.
Debugging and Troubleshooting¶
If you want to debug a problem that occurs in a Stoomboot batch job, or you want to make a short trial run for a larger series of batch jobs, there are two ways to gain interactive login access to Stoomboot. You can either log in directly to the [interactive nodes](#interactive-nodes) or request an ‘interactive’ batch job through
$> qsub -X -I
In this mode you can consume as many CPU resources as allowed by the queue that the interactive job was submitted to. The ‘look and feel’ of interactive batch jobs is nearly identical to that of using ssh to connect to an interactive node. The main exception is that when no free job slot is available, the qsub command will hang until one becomes available.
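The `-X` option enables X11 forwarding for graphical applications and can be omitted for a plain shell. A queue can be selected as usual (the choice of express here is only an example):

$> qsub -I -q express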
Scratch disk usage and NFS disk access¶
When running on Stoomboot, please be sure to locate all local ‘scratch’ files in the directory pointed to by the environment variable `$TMPDIR`, and not in /tmp. The latter is very small (a few GiB) and, when filled up, will cause all kinds of problems for you and other users. The disk pointed to by `$TMPDIR` is large and fast. Be sure to clean up when your job ends to avoid filling up these disks.
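A minimal sketch of a job script that works in `$TMPDIR` and cleans up on exit (the work directory name is illustrative):

```bash
#!/bin/bash
# Create a private work directory under $TMPDIR (name is illustrative).
WORKDIR="$TMPDIR/myjob.$$"
mkdir -p "$WORKDIR"

# Remove the scratch files when the job exits, even on failure.
trap 'rm -rf "$WORKDIR"' EXIT

cd "$WORKDIR"
# ... produce scratch files here ...
```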
When accessing NFS-mounted disks (/project/, /data/), please keep in mind that the network bandwidth between Stoomboot nodes and the NFS server is limited, and that the NFS server capacity is also limited. Running e.g. 50 jobs that read from or write to files on NFS disks at a high rate (‘ntuple analysis’) may result in poor performance of both the NFS server and your jobs.
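A common pattern is therefore to stage input from NFS to `$TMPDIR` once, process it locally, and copy the results back once at the end. A sketch, in which all paths are placeholders:

```bash
#!/bin/bash
# Stage input from NFS to fast local scratch once (paths are placeholders).
cp /data/myexperiment/input.root "$TMPDIR/"

# Process locally at local-disk speed.
cd "$TMPDIR"
/project/myexperiment/bin/analyse input.root output.root

# Copy the result back to NFS in one go, then clean up.
cp output.root /project/myexperiment/results/
rm -f input.root output.root
```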
qsub Wrapper¶
W. Verkerke created a small wrapper script which fixes the inconvenient defaults of `qsub`: bsub. Save this script as `bsub` in a directory that is part of your `PATH` and make it executable using `chmod u+x bsub` to use it.
It can be used with the following syntax:
$> bsub [ -J <jobname> ] some_command
The `bsub` script has the following default behaviour:

- jobs are submitted to the `generic` queue
- `some_command` is executed in the current working directory
- `some_command` inherits the environment of the current shell
- standard output and standard error are sent to the same file