Dennis van Dok
2024 European HTCondor Workshop, Nikhef, Amsterdam, Thu 2024-09-26
How do you move away from a system you’ve been operating for over twenty years? How do you transfer two decades’ worth of technical proficiency to an entirely new system, while keeping the lights on? How do you get your users, who are set in their ways and stand to gain little, to make the transition with you?
The Physics Data Processing group (PDP) at Nikhef operates two computational facilities at some scale. Since the early 2000s we have participated in the worldwide Grid, where we are committed to running the NL-T1 together with our partners at SURF.
The Stoomboot facility is a local cluster for our own researchers in physics. While smaller in scale, it can be more complicated because of a diversity of users and their workloads.
In the coming pages we will learn something about the design of the original system and how this has recently been re-implemented with HTCondor.
The work described here was done by my esteemed colleagues: Mary Hester, Jeff Templon, Emily Kooistra, and Andrew Pickford. I am grateful for their hard work, the spirited discussions we had and their insightful contributions.
Many thanks also to the team of Miron Livny for their steadfast commitment and support.
At the time there weren’t many options for running batch systems, and we opted for Torque, an open-source PBS derivative.
Both Stoomboot and Grid were (eventually) aligned on the same version.
Stoomboot and Grid differ in scale, usage pattern, and data access. The piece of software that actually matters is not Torque, but its companion scheduler, Maui.
Users will almost never have to interact with Maui directly, but it controls which jobs can run on which worker nodes at what time. It decides on the prioritisation of all jobs based on a policy document that describes how we distribute the available resources to our various user (groups).
Maui is clever about distributing job slots over a variable supply of jobs. With the intent of letting no cycles go to waste, a group can use more than its allocated share, at least for a while, if no other groups have jobs queued. This use above share is taken into account when other groups do show up, so eventually things should balance out.
This mechanism relies on a steady rate of job slots becoming available for scheduling decisions. (To be clear: this only matters when the cluster is full; scheduling in a cluster that is not fully occupied is trivial.)
The churn of job slots is guaranteed by the fact that a job can only last so long.
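To give a flavour of how such a policy is typically written down for Maui, here is a small illustrative maui.cfg fragment; the group names, targets, and window settings are invented and are not our actual policy file.

# Illustrative Maui fair-share settings (hypothetical groups and numbers)
FSPOLICY        DEDICATEDPS      # charge usage in dedicated processor-seconds
FSDEPTH         7                # keep seven fair-share windows of history
FSINTERVAL      24:00:00         # each window covers one day
FSDECAY         0.80             # older windows count for less

# per-group fair-share targets (percentage of the cluster)
GROUPCFG[atlas]   FSTARGET=40
GROUPCFG[alice]   FSTARGET=20
GROUPCFG[theory]  FSTARGET=10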
Stoomboot | Grid
---|---
Shared file system | No shared file system
Accounting by user and group | Accounting by VO
shorter jobs | longer jobs
bursty | always full
The Stoomboot cluster has short, medium, and long jobs. There is also an express category.
name | maximum duration | availability
---|---|---
express | 10 minutes | immediate
short | 4 hours | same day
medium | 24 hours | some days
long | 96 hours | maybe next week
The choice of these queues is a careful balance between various interests. We could fill the entire cluster with long jobs, but that might lock up the system for a week, which is not nice for users who want to run a couple of jobs today.
We have some hardware that has been paid for by specific groups. It is part of the large shared pool and others may use it, but the funding group has priority. Since we do not preempt or kill jobs, we only allow short jobs from other groups on these nodes; this way the hardware becomes available to its funding group within 4 hours.
The express queue is purely for testing: very short jobs, with a low cap on the total number of jobs.
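Under Torque, duration limits like these are typically enforced as per-queue walltime maxima. A qmgr sketch (illustrative, not our exact configuration) would look like this:

# Illustrative Torque queue definitions (not our exact settings)
qmgr -c "create queue short queue_type=execution"
qmgr -c "set queue short resources_max.walltime = 04:00:00"
qmgr -c "set queue short enabled = true"
qmgr -c "set queue short started = true"

qmgr -c "create queue long queue_type=execution"
qmgr -c "set queue long resources_max.walltime = 96:00:00"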
We learned over the past twenty years how to do things with Torque and Maui. This introduced a certain inertia: moving to anything else was going to be difficult (i.e. a lot of work) and there was hardly ever a reason to do it. The tooling around this system, from custom command-line tools to monitoring and accounting, was all developed over those many years.
With time, we found ourselves further and further from the common ground of labs that run batch systems. Torque and Maui were no longer updated (by us) and out of support. Yet everything kept working.
At intervals there was discussion about what software should replace it (SLURM, for example).
It was gradually becoming clear that HTCondor was going to be the right choice.
The (re-)introduction of HTCondor on Nikhef infrastructure already started in 2020: a researcher from the gravitational waves group had some grant money to spend on computers, and the plan was that we would buy the hardware and integrate it into our existing systems. It became apparent that setting up a dedicated HTCondor cluster would be a good plan.
This cluster (’Ganymede’) was much simpler in setup than what we have now: it served only a single group and we implemented practically no special policies. We have learned valuable lessons operating this system since around 2021.
One of the driving forces was the end of support for CentOS 7. To condense a really long discussion: we looked at the drama around the RHEL rebuilders and opted to install Debian as the base OS for most systems, except where a system was directly used as a platform by users (or Debian was otherwise unfeasible); for those we would use AlmaLinux 9.
We really did not want to carry Torque into the new era on AlmaLinux 9, so there was pressure to implement HTCondor as soon as possible. In the end we managed to do so only for Stoomboot; the Grid is still a work in progress.
There was mounting pressure to get everything done in time: the upgrades of all our CentOS 7 systems, the installation of HTCondor on Debian, and the move of Stoomboot.
We managed in the end but it was not a pleasant experience.
A few words on the considerations that went into the transition.
We set out to make the transition as smooth as possible for our users. While some users were already familiar with HTCondor on other systems (e.g. CERN) we wanted to give our Torque addicts at least some sense of familiarity.
We started by setting up a brand new cluster with only a few worker nodes. We invited some early adopters to be our guinea pigs and validate the basic setup. Then we sneakily started to move more and more worker nodes over to the new system and suggested to our users that they join the nice, newer (and larger) cluster. A freshly added set of worker nodes with 128-core AMD Bergamo CPUs was a great lure.
During the transition phase we spent more time on user support via various channels. Luckily most of the users found the transition quite easy and were reasonably happy with the new system.
At some point we had to announce that the old system was going to be turned off. We allowed the last stragglers to finish their work so they did not have to make all kinds of late changes to their workflows.
We received great support from the HTCondor team in Wisconsin. A timely visit there, on the heels of an in-person LIGO meeting, was especially useful to hammer out the policies and settings.
When we started the process of implementing a local facility under HTCondor control, the mindset was very much the classic way of thinking about batch systems. The consideration was that operators and users were already familiar with a certain way of working, and the transition would go much more smoothly if we could shape the new system to look like the old one.
This meant that we had to introduce a concept that is not native to HTCondor: a limited wall time. This can be done through a job transformation.
Users set the MaxWallTime attribute (directly, or via a job category) and we add a SYSTEM_PERIODIC_HOLD condition that triggers if it gets exceeded.
We therefore introduced the concept of job categories (a bad name indeed!) and used a transform to enforce the choice of a category and to place a job on hold once it overruns the specified duration.
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) JobCategoryNotDefined JobCategorySetDefaults \
    SetMaxWallTime JobCategoryChecks
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) JobCategoryNotDefinedCheck JobCategoryInvalidCheck \
    JobCategoryMaxWallTimeCheck
SCHEDD_CLASSAD_USER_MAP_NAMES = $(SCHEDD_CLASSAD_USER_MAP_NAMES) JobCategoryDefaultWallTime \
    JobCategoryMaxWallTime
SYSTEM_PERIODIC_HOLD_NAMES = $(SYSTEM_PERIODIC_HOLD_NAMES) JobCategoryWallTime
SYSTEM_PERIODIC_REMOVE_NAMES = $(SYSTEM_PERIODIC_REMOVE_NAMES) JobMaxHoldTime

JOB_TRANSFORM_JobCategoryNotDefined @=end
    NAME JobCategoryNotDefined
    REQUIREMENTS (JobUniverse =?= 5 && isUndefined(JobCategory))
    SET JobCategoryNotDefined TRUE
@end

SUBMIT_REQUIREMENT_JobCategoryNotDefinedCheck = isUndefined(JobCategoryNotDefined)
SUBMIT_REQUIREMENT_JobCategoryNotDefinedCheck_REASON = "+JobCategory must be defined, see https://__knowledge_base_url__"

# Default Walltime (in seconds) per job category
CLASSAD_USER_MAPDATA_JobCategoryDefaultWallTime @=end
* express 600
* short 4*3600
* medium 24*3600
* long 96*3600
@end

# Maximum Walltime (in seconds) per job category
CLASSAD_USER_MAPDATA_JobCategoryMaxWallTime @=end
* express 600
* short 4*3600
* medium 24*3600
* long 96*3600
@end
JOB_TRANSFORM_JobCategorySetDefaults @=end
    NAME JobCategorySetDefaults
    REQUIREMENTS (JobUniverse =?= 5 && !isUndefined(JobCategory))
    EVALSET JobCategoryDefaultWallTime eval(userMap("JobCategoryDefaultWallTime", toLower(JobCategory)))
    EVALSET JobCategoryMaxWallTime eval(userMap("JobCategoryMaxWallTime", toLower(JobCategory)))
    EVALSET JobCategoryOK (isError(JobCategoryDefaultWallTime) || isError(JobCategoryMaxWallTime)) ? FALSE : TRUE
@end

SUBMIT_REQUIREMENT_JobCategoryInvalidCheck = JobUniverse =?= 5 ? JobCategoryOK : TRUE
SUBMIT_REQUIREMENT_JobCategoryInvalidCheck_REASON = "Invalid JobCategory, see https://__knowledge_base_url__"

JOB_TRANSFORM_SetMaxWallTime @=end
    NAME SetMaxWallTime
    REQUIREMENTS (JobUniverse =?= 5 && isUndefined(MaxWallTime))
    EVALSET MaxWallTime JobCategoryDefaultWallTime
@end

JOB_TRANSFORM_JobCategoryChecks @=end
    NAME JobCategoryChecks
    REQUIREMENTS JobUniverse =?= 5
    EVALSET JobCategoryMaxWallTimeOK MaxWallTime > JobCategoryMaxWallTime ? FALSE : TRUE
@end

SUBMIT_REQUIREMENT_JobCategoryMaxWallTimeCheck = JobUniverse =?= 5 ? JobCategoryMaxWallTimeOK : TRUE
SUBMIT_REQUIREMENT_JobCategoryMaxWallTimeCheck_REASON = strcat("MaxWallTime must be less than or equal to \
    the job category (", JobCategory, ") maximum walltime of ", JobCategoryMaxWallTime, " seconds")

# hold running jobs that exceed their max walltime
SYSTEM_PERIODIC_HOLD_JobCategoryWallTime = JobStatus == 2 && time() - EnteredCurrentStatus > MaxWallTime
SYSTEM_PERIODIC_HOLD_JobCategoryWallTime_REASON = strcat("job exceeded MaxWallTime of ", MaxWallTime)

# delete held jobs after one week
SYSTEM_PERIODIC_REMOVE_JobMaxHoldTime = JobStatus == 5 && time() - EnteredCurrentStatus > 7 * 24 * 3600
SYSTEM_PERIODIC_REMOVE_JobMaxHoldTime_REASON = "job removed after being held for 1 week"
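From the user’s point of view this boils down to one extra line in the submit file. As a sketch (the executable and file names are placeholders), a medium job could be submitted like this:

# example.sub -- minimal submit file using the job categories above
executable   = analysis.sh
arguments    = run42
output       = analysis.out
error        = analysis.err
log          = analysis.log

+JobCategory = "medium"
# optionally request a shorter limit than the category maximum (in seconds)
+MaxWallTime = 8*3600

queue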
The timetable for the move to HTCondor coincided with the end of life of CentOS 7. We were worried that some users would not be prepared to port their work to a newer operating system, and therefore we opted to run everything in containers from the get-go.
Because we run everything in a container, users will not notice that they are on a Debian system unless they take a close look at the kernel.
HTCondor has support for Singularity (or Apptainer), and we went with three flavours of Enterprise Linux derivatives: el7, el8, and el9. We also allow users to bring their own container image.
The following job transforms and submit requirements take care of setting SingularityImage accordingly.
# Submit Requirement: ValidSingularityJob
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) NotDefinedOSorImageCheck DefinedOSandImageCheck \
    SetSingularityImageFromOSCheck ValidSingularityImage
SUBMIT_REQUIREMENT_ValidSingularityImage = JobUniverse =?= 5 ? (!isUndefined(SingularityImage) && \
    !isError(SingularityImage)) : TRUE

SCHEDD_CLASSAD_USER_MAP_NAMES = $(SCHEDD_CLASSAD_USER_MAP_NAMES) Trust JobImages JobOS
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) NotDefinedOSorImage DefinedOSandImage SetSingularityImageFromOS

# Handle container image
CLASSAD_USER_MAPDATA_JobImages @=end
* el7 /cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/sft/docker/centos7-core:latest
* el8 /cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/sft/docker/alma8-core:latest
* el9 /cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/sft/docker/alma9-core:latest
@end

CLASSAD_USER_MAPDATA_JobOS @=end
* el7 CentOS7
* el8 CentOS8
* el9 AlmaLinux9
@end
JOB_TRANSFORM_NotDefinedOSorImage @=end
    NAME NotDefinedOSorImage
    REQUIREMENTS (JobUniverse =?= 5 && isUndefined(SingularityImage) && isUndefined(UseOS))
    SET NotDefinedOSorImage TRUE
@end

SUBMIT_REQUIREMENT_NotDefinedOSorImageCheck = isUndefined(NotDefinedOSorImage)
SUBMIT_REQUIREMENT_NotDefinedOSorImageCheck_REASON = "Either set +UseOS = \"el7\" or \"el8\" or \"el9\"; or \
    set +SingularityImage = (path to container image file or unpackaged container dir)"

JOB_TRANSFORM_DefinedOSandImage @=end
    NAME DefinedOSandImage
    REQUIREMENTS (JobUniverse =?= 5 && !isUndefined(SingularityImage) && !isUndefined(UseOS))
    SET DefinedOSandImage TRUE
@end

SUBMIT_REQUIREMENT_DefinedOSandImageCheck = isUndefined(DefinedOSandImage)
SUBMIT_REQUIREMENT_DefinedOSandImageCheck_REASON = "Specify either +UseOS or +SingularityImage not both"

JOB_TRANSFORM_SetSingularityImageFromOS @=end
    NAME SetSingularityImageFromOS
    REQUIREMENTS (JobUniverse =?= 5 && isUndefined(SingularityImage) && !isUndefined(UseOS))
    EVALSET MappedImage userMap("JobImages", toLower(UseOS))
    EVALSET SingularityImage (MappedImage =?= Undefined) ? Error : MappedImage
    SET SetSingularityImageFromOS TRUE
@end

SUBMIT_REQUIREMENT_SetSingularityImageFromOSCheck = isUndefined(SetSingularityImageFromOS) ? \
    TRUE : !isError(SingularityImage)
SUBMIT_REQUIREMENT_SetSingularityImageFromOSCheck_REASON = "The requested OS is not valid. Set \
    +UseOS = \"el7\" or \"el8\" or \"el9\""
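For the user this is again a single line in the submit file: either pick one of the provided flavours, or point to your own image (the path below is a placeholder), but never both.

# pick one of the provided OS flavours ...
+UseOS = "el9"

# ... or bring your own image instead (do not set both)
# +SingularityImage = "/project/myexperiment/containers/myimage.sif"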
One of the more complicated tweaks of the original system was that we accounted for users by their groups. For instance, the alice group is treated as a single entity for the purposes of its share, so one particularly productive user could take a considerable chunk out of it.
To complicate matters further, we have some groups that have contributed entire subclusters through their funding. This entitles them to have priority use on the hardware they paid for.
In the transition to HTCondor we loosened the ties to specific hardware somewhat, because there is nothing special about one worker node or another; a cycle is a cycle. Instead, we acknowledge each group’s entitlement to an amount of benchmark units equivalent to what the hardware they funded provides.
We still struggle to define a meaningful configuration that would reflect the respective priorities for these groups.
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) ValidAcctGroup
SCHEDD_CLASSAD_USER_MAP_NAMES = $(SCHEDD_CLASSAD_USER_MAP_NAMES) PosixGroups QuotaGroups
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) ValidAcctGroup SetAccountingGroupDefault SetAccountingGroupRequested

CLASSAD_USER_MAPFILE_PosixGroups = /etc/condor/user-group.map

# Groups with quotas, plus datagrid for testing
CLASSAD_USER_MAPDATA_QuotaGroups @=end
* smefit smefit
* gravwav gravwav
* datagrid datagrid
@end
# Check if the user is a member of the posixgroup that they requested
# via accounting_group (which is the AcctGroup Classad)
JOB_TRANSFORM_ValidAcctGroup @=end
    NAME ValidAcctGroup
    REQUIREMENTS (JobUniverse =?= 5 || JobUniverse =?= 7) && !isUndefined(AcctGroup)
    EVALSET PosixGroup = userMap("PosixGroups", Owner, AcctGroup)
    EVALSET isMemberofPosixGroup = PosixGroup == AcctGroup
@end

SUBMIT_REQUIREMENT_ValidAcctGroup = isUndefined(AcctGroup) || isMemberofPosixGroup
SUBMIT_REQUIREMENT_ValidAcctGroup_REASON = "Invalid accounting_group set. accounting_group must \
    either be undefined, or set to a posix group that you are a member of."

JOB_TRANSFORM_SetAccountingGroupDefault @=end
    NAME SetAccountingGroupDefault
    REQUIREMENTS (JobUniverse =?= 5 || JobUniverse =?= 7) && isUndefined(AcctGroup)
    EVALSET AcctGroupUser = Owner
    EVALSET AccountingGroup = Owner
@end

JOB_TRANSFORM_SetAccountingGroupRequested @=end
    NAME SetAccountingGroupRequested
    REQUIREMENTS (JobUniverse =?= 5 || JobUniverse =?= 7) && !isUndefined(AcctGroup)
    EVALSET isQuotaGroup = userMap("QuotaGroups", AcctGroup)
    EVALSET AcctGroupUser = Owner
    EVALSET AccountingGroup = isUndefined(isQuotaGroup) ? Owner : join(".", AcctGroup, AcctGroupUser)
@end

IMMUTABLE_JOB_ATTRS = $(IMMUTABLE_JOB_ATTRS) AccountingGroup
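In practice, a member of one of the quota groups opts in with accounting_group in the submit file; anyone else simply leaves it out and is accounted under their own username.

# member of a quota group: the transform accounts the job as gravwav.<username>
accounting_group = gravwav

# anyone else (or no accounting_group line at all): accounted as <username>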
We’ve set aside a machine whose slots will only go to jobs with a very short deadline; essentially these are test jobs. By combining two negotiators, one for express jobs and one for everything else, we make sure these express jobs end up on that one node.
# regular negotiator
NEGOTIATOR_SLOT_CONSTRAINT = !regexp("wn-sate-079", Machine)
# express negotiator
NEGOTIATOR_SLOT_CONSTRAINT = regexp("wn-sate-079", Machine)
# Match only specific jobs
NEGOTIATOR_JOB_CONSTRAINT = MaxWallTime < 600
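These two constraint sets live in the configurations of two separate negotiator daemons. One way to run the second negotiator alongside the regular one is via a local name; the sketch below only illustrates that mechanism (the daemon and local names are our own invention, and our actual wiring may differ).

# sketch: run a second negotiator instance with its own local name
NEGOTIATOR_EXPRESS       = $(NEGOTIATOR)
NEGOTIATOR_EXPRESS_ARGS  = -local-name NEGOTIATOR_EXPRESS
DAEMON_LIST              = $(DAEMON_LIST) NEGOTIATOR_EXPRESS

# settings that only apply to the express negotiator
NEGOTIATOR_EXPRESS.NEGOTIATOR_SLOT_CONSTRAINT = regexp("wn-sate-079", Machine)
NEGOTIATOR_EXPRESS.NEGOTIATOR_JOB_CONSTRAINT  = MaxWallTime < 600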
We have a single access point, running on an old worker node. The choice of somewhat beefy hardware was due to the expected load on the machine. We do not allow users to log in to the AP at all, but they can reach it from any of the interactive nodes.
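One way to arrange this is to point the HTCondor tools on the interactive nodes at the schedd on the access point, so condor_submit and condor_q work without logging in to it; a minimal sketch (the hostname is a placeholder):

# on the interactive nodes: talk to the remote schedd on the access point
SCHEDD_HOST = ap.example.nikhef.nl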
Now that our local cluster has successfully made the transition to HTCondor, it is time to look at the Grid cluster. Although the parameters are different, we’ve built up enough confidence with HTCondor that we should be able to set up a cluster to suit our needs. But work on this has only just begun.
The advantage is that users won’t notice anything at all, as the front-end system (ARC-CE) stays the same.
We did not want to carry our ageing Torque system to a newer platform, but for the Grid we ended up doing exactly that, because there was no time to implement HTCondor for the Grid before the CentOS 7 EOL.
It turns out that Torque does run on AlmaLinux 9, but it is beginning to show some cracks.
We will continue working on making this transition.
We are still tweaking the Stoomboot system; the part that schedules and prioritises jobs from different users and groups needs some work.
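One possible direction, sketched here with invented numbers and not what we currently run, is to express the funded entitlements as accounting-group quotas on the negotiator, with surplus sharing so idle cycles still go to whoever has work.

# Sketch: accounting-group quotas on the negotiator (illustrative numbers)
GROUP_NAMES = gravwav, smefit

# static quotas in units of cores (or use GROUP_QUOTA_DYNAMIC_* for fractions)
GROUP_QUOTA_gravwav = 512
GROUP_QUOTA_smefit  = 256

# let groups use idle capacity beyond their quota
GROUP_ACCEPT_SURPLUS = TRUE
GROUP_AUTOREGROUP    = TRUE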
He that breaks a thing to find out what it is has left the path of wisdom.
— Gandalf the Grey
Go not to the HTCondor developers for counsel, for they will say both no and yes.
— Frodo