The Stoomboot batch cluster allows you to submit jobs to run on its nodes. All Stoomboot nodes run CentOS 7.

The stbc-users mailing list has been set up for communication about Stoomboot: announcements, issues, etc. All users are encouraged to subscribe here:

Interactive Nodes

Six interactive nodes are available for interactive and testing use:

  • stbc-i1 and stbc-i2 run CentOS 7

  • stbc-g1 runs CentOS 7 and contains an NVIDIA 1080Ti GPU

  • stbc-g2 and wn-lot-008 run CentOS 7 and contain NVIDIA V100 GPUs

  • wn-lot-001 runs CentOS 7 and contains AMD MI50 GPUs

Please keep CPU/GPU consumption and testing time to a minimum, and run your real jobs on the batch nodes.


Queues

Jobs have to be submitted to one of the available queues, listed below:

Queue       Default Length   Max Length
express     10m              10m
generic     24h              24h
gpu-nv      24h              96h
gpu-amd     24h              96h
long        48h              96h
multicore   96h              96h
short       4h               4h

The default queue is “generic”.

The status of the queues can be queried using:

$> qstat -Q
Queue              Max    Tot   Ena   Str   Que   Run   Hld   Wat   Trn   Ext T   Cpt
----------------   ---   ----    --    --   ---   ---   ---   ---   ---   --- -   ---
gpu7                 0      0   yes   yes     0     0     0     0     0     0 E     0
gpu                  0      0   yes   yes     0     0     0     0     0     0 R     0
express              0      0   yes   yes     0     0     0     0     0     0 R     0
short                0      0   yes   yes     0     0     0     0     0     0 R     0
long7              168     47   yes   yes     0    47     0     0     0     0 E     0
generic              0      0   yes   yes     0     0     0     0     0     0 R     0
long                 0      0   yes   yes     0     0     0     0     0     0 R     0
multicore7           0      0   yes   yes     0     0     0     0     0     0 E     0
multicore            0      0   yes   yes     0     0     0     0     0     0 R     0
generic7           168      1   yes   yes     0     1     0     0     0     0 E     0
short7               0    650   yes   yes     0   404     0     0     0     0 E   246
express7             0      0   yes   yes     0     0     0     0     0     0 E     0

Allocated resources

Jobs are allocated a single core and 8 GiB of memory by default. If more cores are needed, please use the multicore queue.
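
For example, a four-core job might be submitted as follows (a sketch: `myjob.sh` is a placeholder for your own job script, and `nodes=1:ppn=4` is the usual Torque/PBS resource syntax, which may differ per site):

```shell
# Request 4 cores on a single node in the multicore queue.
# "myjob.sh" is a placeholder for your own job script.
qsub -q multicore -l nodes=1:ppn=4 myjob.sh
```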

Jobs submitted to the queues prefixed with gpu will be allocated a full node (all cores, at least 64 GiB of memory, and all of the GPUs on the node). The gpu-nv queue contains two nodes with a 1080Ti GPU and one node with two V100 GPUs. To select a particular type of GPU, additional parameters can be passed as explained below.
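
For example, to ask specifically for a node with a V100 GPU, the `-l host=...` label can be added (a sketch; `myjob.sh` is a placeholder):

```shell
# Run on a gpu-nv node with a V100 GPU;
# omit "-l host=v100" to take any free GPU node in the queue.
qsub -q gpu-nv -l host=v100 myjob.sh
```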

Submitting Jobs

To submit jobs, the qsub command may be used, with the following basic syntax:

$> qsub [ -q <queue> ] <script>

qsub has the following default behaviour:

  • jobs are submitted to the generic queue

  • the script is executed in your home directory

  • the script is executed in a “clean” login shell

  • standard output and standard error are sent to separate files

These defaults are not very convenient: they tend to fill your home directory with a lot of log files, and the script has to include proper initialisation of the environment the job requires.

Common Command line Options

  • -j oe: merge stdout and stderr into a single file; a single “.o” file is written.

  • -q <queuename>: choose batch queue; the default queue is “generic”

  • -o <filename>: choose a different filename for stdout

  • -V: pass all environment variables of the submitting shell to the batch job (with the exception of $PATH)

  • -l host=v100: run the job on a stoomboot node with a specific type of GPU
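
Rather than repeating these flags on every submission, they can also be placed in the job script itself as `#PBS` directives (a sketch; the queue, log file name, and workload are placeholders):

```shell
#!/bin/sh
#PBS -q generic        # submit to the generic queue
#PBS -j oe             # merge stdout and stderr into one ".o" file
#PBS -o myjob.log      # write the merged output to this file
#PBS -V                # inherit the environment of the submitting shell

# Placeholder workload: report where the job is running.
msg="Job running on $(hostname)"
echo "$msg"
```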

A full list of options can be obtained from man qsub. For more info on the available labels, see

Job Status

The qstat command shows the status of all jobs in the system. Status code ‘C’ indicates completed, ‘R’ indicates running and ‘Q’ indicates queued:

$> qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1237000.burrell            ...e_files_batch user_1          07:05:17 R long6
1237001.burrell            ...e_files_batch user_1          07:04:59 R long6
1237002.burrell            ...e_files_batch user_1          07:05:30 R long6
1240157.burrell   user_2          04:07:05 C generic7
1240186.burrell            ...a_asm_p42_p42 user_3          03:25:06 R generic7
1240187.burrell            ...a_asm_p27_p27 user_3          03:25:00 R generic7
1240188.burrell            ...a_asm_p43_p43 user_3          03:24:57 R generic7
1240189.burrell            ...ana_asm_p5_p5 user_3          03:24:52 R generic7

The qstat -u <username> command shows the status of your own jobs:

$> qstat -u username
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
1241010.burrell.nikhef  username    generic7      27126   --     --     --   00:10:00 R  00:00:11

Jobs that completed less than 10 minutes ago are listed with status ‘C’. The output of jobs that completed longer ago is kept, but those jobs are no longer listed in the status overview.

The general level of activity on Stoomboot is graphically monitored here:

Debugging and Troubleshooting

If you want to debug a problem that occurs in a Stoomboot batch job, or you want to make a short trial run for a larger series of batch jobs, there are two ways to gain interactive login access to Stoomboot.

You can either directly log in to the [interactive nodes](#interactive-nodes) or you can request an ‘interactive’ batch job through

$> qsub -X -I

In this mode you may consume as much CPU as the queue that the interactive job was submitted to allows. The ‘look and feel’ of interactive batch jobs is nearly identical to that of using ssh to connect to an interactive node. The main exception is that when no free job slot is available, the qsub command will hang until one becomes available.

Scratch disk usage and NFS disk access

When running on Stoomboot, please be sure to place all local ‘scratch’ files in the directory pointed to by the environment variable $TMPDIR, and not in /tmp. The latter is very small (a few GiB) and, when filled up, will cause all kinds of problems for you and other users. The disk pointed to by $TMPDIR is large and fast. Be sure to clean up when your job ends to avoid filling up these disks.
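
A typical job-script pattern (a sketch: the file names are placeholders, and $HOME stands in for a real /project or /data destination) stages all work in $TMPDIR and cleans up on exit:

```shell
#!/bin/sh
# Work in a private directory under $TMPDIR; the fallback to /tmp is only
# so this sketch also runs outside the batch system, where TMPDIR may be unset.
workdir="${TMPDIR:-/tmp}/myjob.$$"
mkdir -p "$workdir"
trap 'rm -rf "$workdir"' EXIT   # remove the scratch area even on failure

cd "$workdir" || exit 1
# ... the real workload would read its input from NFS once, then work
# locally in $workdir; here we just produce a dummy result file ...
echo "dummy result" > output.txt

# Copy results back to permanent storage before the trap removes them.
cp output.txt "$HOME/"
```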

When accessing NFS-mounted disks (/project/, /data/), please keep in mind that the network bandwidth between Stoomboot nodes and the NFS server is limited and that the NFS server capacity is also limited. Running e.g. 50 jobs that read from or write to files on NFS disks at a high rate (‘ntuple analysis’) may result in poor performance of both the NFS server and your jobs.

qsub Wrapper

W. Verkerke created a small wrapper script, bsub, which fixes the inconvenient defaults of qsub. Save this script as bsub in a directory that is part of your PATH and make it executable using chmod u+x bsub to use it.

It can be used with the following syntax:

$> bsub [ -J <jobname> ] some_command

The bsub script has the following default behaviour:

  • jobs are submitted to the generic queue

  • the job is executed in the current working directory

  • the job inherits the environment of the current shell

  • standard output and standard error are sent to the same file
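
The wrapper's behaviour can be approximated as follows (a hypothetical sketch, not the actual script; `-d` is Torque's working-directory option, and `${QSUB:-qsub}` only exists so the sketch can be tried without a cluster):

```shell
# Hypothetical sketch of a bsub-style wrapper; NOT the actual script.
# Usage: bsub_sketch [ -J <jobname> ] some_command [args...]
bsub_sketch() {
    jobname="stbc-job"              # default job name (assumption)
    if [ "$1" = "-J" ]; then
        jobname="$2"
        shift 2
    fi
    # -V inherits the current shell's environment, -j oe merges stdout
    # and stderr into one file, -d runs the job from the current working
    # directory. The command itself is fed to qsub on stdin.
    echo "$*" | ${QSUB:-qsub} -V -j oe -N "$jobname" -d "$PWD" -q generic
}
```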