Stoomboot job submission

Documentation for stoomboot (CT Wiki)

The Stoomboot cluster is the local batch computing facility at Nikhef. It is available to users from the scientific groups for tasks such as data analysis and Monte Carlo calculations.

The Stoomboot facility consists of 2 interactive nodes and a batch cluster of 25 nodes with 32 cores each and 18 nodes with 64 cores each, all running the CentOS 7 operating system. The dCache storage system is accessible from all interactive and worker nodes.

Interactive nodes

There are 2 interactive nodes:

  • stbc-i1 and stbc-i2 run CentOS 7 on the same node type as the Stoomboot batch nodes. They have 32 cores and 256 GB RAM.

These machines are intended for the following purposes:

  • Running interactive jobs, like analysis work (making plots etc).
  • Interaction with the batch system (see below for the relevant commands).

They are not suitable for very resource (memory or CPU) intensive work because they are shared with other users. If your program needs a lot of memory, or needs to run for a long time, and/or heavily uses multiple cores, you should use the batch nodes instead.

If, for some reason, you have to run a long CPU-intensive job on the interactive nodes (i.e., you cannot run it on the batch system), please be considerate of your fellow users of the system and run your job with low CPU priority using the nice command:

nice -20 <your script or program and its arguments>

This way, your program gets a lower priority than other programs, which means it will take a bit longer to complete for you, but makes the system more responsive for other users.

You can login to these machines via ssh.
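For example (replace <username> with your Nikhef account name; the full nikhef.nl hostname used here is an assumption):

```shell
# Log in to the first interactive node via ssh:
ssh <username>@stbc-i1.nikhef.nl
```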

Batch cluster

The Stoomboot batch cluster uses a combination of the Torque resource manager and the Maui scheduler. To interact with the batch system (e.g. submit a job, query a job status or delete a job), you need to login on a linux machine (either on the console or via ssh). Machines that can be used include desktops managed by the CT department, the login hosts and the interactive Stoomboot nodes.

Job Submission

The command qsub is used to submit jobs to the cluster. A typical use of this command is:

qsub [-q <queue>] [-l resource_name[=[value]][,resource_name[=[value]],...]] [script]

The optional argument script is the user-provided script that does the work. If no script is provided, the input is read from the console (STDIN).

Please read the section Queues below for more information about available queues and their properties. It is recommended to specify a queue.

For simple jobs it is usually not necessary to provide the option -l with its resource list. However, if the job needs more than one core or node, or if the wall time limit of the job (the maximum time the job may run) should be specified, this option is needed. Examples:

  • -l nodes=1:ppn=4 requests 4 cores on 1 node
  • -l walltime=32:10:05 requests a wall time of 32 hours, 10 minutes and 5 seconds
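As a sketch, the qsub options above can also be embedded in the job script itself as #PBS directives (the script name myjob.sh and the workload line are hypothetical):

```shell
#!/bin/sh
# Minimal Torque job script sketch. The #PBS lines are read by qsub and
# replace the equivalent command-line options.
#PBS -q generic7
#PBS -l nodes=1:ppn=1
#PBS -l walltime=04:00:00

# Torque starts the job in your home directory; PBS_O_WORKDIR points to
# the directory the job was submitted from (fall back to the current
# directory when running the script interactively for testing).
cd "${PBS_O_WORKDIR:-$PWD}"

echo "Job running on $(hostname)"
# ... your analysis program goes here ...
```

Submit it with `qsub myjob.sh`; options given on the command line override the directives in the script.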

More detailed information can be found in the manual page for qsub:

man qsub

Submitting under another group

By default, qsub uses the primary group of the Unix account when submitting a job. Sometimes it may be needed to use a different group, for example because the primary group has no allocation on the batch server. To submit with a different group, add -W group_list=<new-group> to qsub. The following example forces the use of group atlas:

qsub -W group_list=atlas

Job Status

Users can look at the status of all current jobs with the command qstat:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1001.burrell                               user1           21:22:33 R generic6
1002.burrell              myscript.csh     user2           08:15:38 C long6
1003.burrell              myscript.csh     user2           00:02:13 R long6
1004.burrell              myscript.csh     user2                  0 Q long6

The above example shows 4 jobs from 2 different users. Two jobs are running (1001 and 1003), one has finished (1002) and one is still waiting in the queue (1004).

More detailed output is shown with qstat -n1:

qstat -n1
                                                                       Req'd  Req'd   Elap
Job ID                Username Queue    Jobname          SessID NDS TSK Memory Time  S Time
--------------------- -------- -------- ---------------- ------ --- --- ------ ----- - -----
1001.burrell.nikhef.n user1    generic6                  10280   --  --     -- 36:00 R 21:22   stbc-081
1002.burrell.nikhef.n user2    long6    myscript.csh     28649   --  --     -- 96:00 R 08:15   stbc-043
1003.burrell.nikhef.n user2    long6    myscript.csh     12365   --  --     -- 96:00 R 00:02   stbc-028
1004.burrell.nikhef.n user2    long6    myscript.csh        --   --  --     -- 96:00 Q    --     --

More information can be found in the manual page:

man qstat

Job Deletion

Users can delete their own jobs with the command qdel:

qdel 1001

This example removes the job with ID 1001. The ID can be obtained with the qstat command described above.
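To delete several jobs at once, qdel can be combined with qselect, which (like qdel) is part of Torque and prints the IDs of jobs matching a filter; treat the exact flags as an assumption to verify against man qselect:

```shell
# Delete all of your own jobs that are still queued (state Q):
# qselect prints one matching job ID per line; qdel removes each of them.
qdel $(qselect -u "$USER" -s Q)
```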

More information can be found in the manual page:

man qdel


Queues

The queues have different parameters for how many jobs you can submit and run at the same time, and for how long the jobs can keep running (measured in wall clock time).

centos7 queues   walltime [HH:MM]   remarks
express7         00:10              test jobs (max 2 running jobs per user)
short7           04:00
generic7         24:00              default queue
long7            48:00              max. walltime of 96:00 via resource list
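For example, the extended 96-hour maximum on long7 is requested through the resource list (the script name myscript.sh is hypothetical):

```shell
# Request the maximum 96-hour wall time on the long7 queue:
qsub -q long7 -l walltime=96:00:00 myscript.sh
```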

The routing queues are a holdover from when we still had CentOS 6. They are equivalent to their '7' destinations.

routing queues   destination queue
express          express7
short            short7
generic          generic7
long             long7

GPU Nodes

The Stoomboot system has a number of GPU nodes that are suitable for running certain types of algorithms. There are both interactive and batch GPU nodes.

Currently, each GPU can be used by only one user at a time. The interactive GPU nodes therefore require more discipline from users in sharing this resource than the generic interactive nodes do. Contact for coordination.

The GPU batch nodes can be used through the gpu queues, but use these queues only for actual GPU jobs!

There are two main GPU manufacturers and software will typically only work on one brand or the other, so we have queues that direct jobs only to the nodes with one type:

  • gpu-amd is the queue to use for jobs that run on AMD GPUs
  • gpu-nv is the queue to use for jobs that run on NVIDIA GPUs

To make a more precise subselection of GPU types, see the section on batch GPU nodes below.

Types of GPU nodes

There are different generations and brands of GPUs in our systems. Depending on your software you may see different performance figures on these nodes, and you may need to tweak and/or recompile your sources for different types.

To make things a bit easier, we have an interactive node available for each type. These nodes can be used for compilation and tests, as well as for computing. To scale up your computing, use the GPU nodes in the batch system instead.

Interactive GPU nodes
Node name   Node manufacturer  Node type name     GPU manufacturer  GPU type              GPU number
stbc-g1     Fujitsu            CELSIUS C740       NVIDIA            GeForce GTX 1080      1
stbc-g2     Fujitsu            CELSIUS C740       NVIDIA            Quadro GV100          1
wn-lot-001  Lenovo             ThinkSystem SR655  AMD               Radeon Instinct MI50  2
wn-lot-008  Lenovo             ThinkSystem SR655  NVIDIA            Tesla V100            2

Batch GPU nodes

In order to direct your jobs to a particular type of node, you can set the job requirements to match the node property. The following example would direct the job to a full node with the dual AMD MI50 cards.

qsub -l 'nodes=1:mi50,walltime=12:00:00,mem=4gb' -q gpu-amd

This example selects the NVIDIA V100 node(s):

qsub -l 'nodes=1:v100,walltime=12:00:00,mem=4gb' -q gpu-nv
Batch GPU nodes
Node name           Number of nodes  Node manufacturer  Node type name     GPU manufacturer  GPU type              GPU number  qsub tags
wn-cuyp-{002..003}  2                Fujitsu            CELSIUS C740       NVIDIA            GeForce GTX 1080      1           nvidia, gtx1080
wn-lot-{002..007}   6                Lenovo             ThinkSystem SR655  AMD               Radeon Instinct MI50  2           amd, mi50
wn-lot-009          1                Lenovo             ThinkSystem SR655  NVIDIA            Tesla V100            2           nvidia, v100

Where to put my data

Not all places are appropriate for your output data. Home directories, /project, /data, and dCache all have different strengths (and weaknesses). Read about our storage classes at .

Where to find software and additional resources

There are some useful hints about using Stoomboot and additional software on the computing course pages:

General Considerations on submission and scheduling

The system is a shared facility; mistakes you make can cause others to be blocked or to lose work, so please pay attention to the following considerations.

  • If you'll be doing a lot of I/O, think about it a bit (or come talk to us in the PDP group). Some example gotchas:
    • large output files (like logs) going to /tmp might fill up /tmp if many of your jobs land on the same node, and this will hang the node.
    • submitting lots of jobs, each of which opens lots of files, can cause problems on the storage server. Organizing information in thousands of small files is problematic.
  • there is an inherent dead time of about 30 seconds associated with each job. Very short jobs are therefore horrendously inefficient; they also place high loads on several services, which sometimes causes problems for others. If your average job run time is not at least 1 minute, please reconsider how to pack your work into jobs.
  • the system does not handle a mix of multicore and single-core scheduling within a single queue; this is why there is a separate multicore queue. Do not submit single-core work to the multicore queue (in fact, you may only use it after a request to the admin team), and do not submit multicore work to any other queue.
  • make sure that your program actually does what it says it does with respect to cores: some programs automatically try to grab all the cores they see. Such jobs should be submitted to the multicore queue; otherwise, turn this feature off so that the program uses only a single core.
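For the last point, a common way to keep a single-core job on one core is to cap the thread counts of the usual threading runtimes in the job script; the variables below are the standard ones for OpenMP, OpenBLAS and Intel MKL (whether your program honours them depends on what it links against):

```shell
# Limit common threading runtimes to one thread each, so the program
# does not automatically grab every core it sees on the worker node:
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
```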

How scheduling is done

Or, some things to think about if you don't understand why your jobs are not running, or there are fewer running than you expected.

The system works on a fair-share scheduling basis. Each group at Nikhef gets an equal share allocated, and within each group, all users are equal. The scheduler makes its decision based on a couple pieces of information:

  • how much time has your group (eg atlas or virgo or km3net) used over the last 8 hours
  • how much time have you (eg templon or verkerke or stanb) used over the last 8 hours.

The group number is converted into a group ranking component: if your group has used less than its standard share in the last 8 hours, this component is positive, growing larger the less the group has used; if your group has used more than its standard share, it is negative, growing more negative the more the group has used. The algorithm is exactly the same for all groups at Nikhef. There is a similar conversion for the user number, with the scale of the group component being larger than that of the user component. The two components are added, resulting in a ranking: the jobs with the highest ranking run first. Jobs are thus essentially run in this order:

  • low group usage in the past 8 hours, also low user usage
  • low group usage in the past 8 hours, higher user usage
  • higher group usage in the past 8 hours, lower user usage
  • higher group usage in the past 8 hours, higher user usage

The system is fair in that it gives priority to jobs from users and groups that have not used many cycles in the last 8 hours. However, the response is not always immediate: the scheduler cannot start a new job until another one ends, and if more than two groups are running, your group may rank higher than one but lower than another.

The scheduling for the multicore queue and for the rest of the (single-core) queues is independent. One more consideration in scheduling is that the maximum number of running jobs is limited per queue and per user. Some examples of why: the long queue is limited to 312 running single-core jobs, not even half of the total capacity; this prevents long jobs from blocking all other use of the cluster for a long (yes, that's why the queue is called long) period of time. Similarly, although the generic queue can run at most 390 jobs, a single user cannot run more than 342 of them; this prevents a single user from grabbing all the slots in the generic queue. These numbers are a compromise: if the cluster is empty you want them to be large, and if it is full you want them to be smaller; there is no good way to automate this, so we have set them at a compromise.

If somebody submits multicore work to a single-core queue, this can block all other users from the same group, due to a limitation in the scheduling software, so please do not submit multicore work to single-core queues.