More Experience with upgrading from CentOS 7 to Debian

Dennis van Dok

CaRCC Systems-Facing Track, Thursday 2024-06-20

A bit of background

  • This is Nikhef
  • That is what we call Grid Computing

Nikhef

  • Dutch National Institute for Subatomic Physics
  • Involved in 3 LHC experiments
  • KM3NeT (Neutrino telescope), XENON (Dark matter), LIGO/VIRGO, Einstein Telescope (Gravitational Waves)
  • Contributions in technical expertise, instrumentation, theory, and computing.

Grid Computing

  • Driven by LHC developments and scale of computing needs
  • European coordination of development through EU projects
  • Focus lies on High Throughput rather than High Performance (think delivery truck vs. race car).
  • Welcoming other sciences outside of (particle) physics

Grid development

  • Grid computing development since 2000
    • European Data Grid (initial development)
    • EGEE I, II, and III (until 2012)
    • EGI (infrastructure and federation)
  • Close ties with CERN's WLCG (Worldwide LHC Computing Grid)
  • The single supported platform: Red Hat EL(x) (compatible)

OK, Red Hat

  • We were content to have a stable, company-controlled distribution that we could basically leech off of by way of the 'free' rebuilders.
  • Scientific Linux, Scientific Linux CERN and CentOS were the choices since version 3.
  • All the way to version 7 this 'worked', but by then the only remaining choice was CentOS, and it was owned by Red Hat

Grid Operations

  • This modest (400 kW) data room houses one part of the Netherlands Tier 1 for the ATLAS experiment as part of the larger Dutch National Infrastructure.
  • We heavily rely on automation since we are only a 3-person team.

The Problem

As we all know, Red Hat dismantled CentOS as a binary compatible rebuild of their Enterprise Linux OS.

While others have stepped in to fill that void (Rocky, Alma, do we even mention Oracle?) Red Hat seems to actively discourage these efforts by limiting access to the (packaging) sources and restrictive licences.

This has made some of us a bit queasy about sticking with the Red Hat compatible platform that served us for so long.

The plan

Many of the services that are not user-facing are probably served just as well on any other platform. And what other platform would offer better long-term stability and guaranteed openness than Debian?

The Case for (or against) Debian

For the longest time, most labs did not consider Debian as a viable alternative. Commonly cited factors:

  • Hardware vendors officially support 'enterprise' Linux vendors, not Debian
  • The release cycle is unpredictable
  • The support for older releases is not so long
  • Supporting an additional platform would add to the maintenance burden for developers/packagers

Pros and cons

There are however some very good reasons why Debian would be a viable choice:

  • Many sysadmins already know it and even prefer it
  • There is no dependency on an entity driven by 'business needs' or 'shareholder value'
  • Proven track record of very stable distribution and excellent response times for security issues
  • Upgrade-in-place instead of a clean re-install looks very good for some types of systems (see the sketch below)
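As a rough illustration of that last point: the standard Debian in-place upgrade (sketched here for bullseye to bookworm, loosely following the release notes; site-specific repositories and the new non-free-firmware component are left out) boils down to a handful of commands.

  # Bring the current Debian 11 (bullseye) system fully up to date first
  apt update && apt full-upgrade
  # Point apt at the new release (also check any files under sources.list.d)
  sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list
  # Perform the release upgrade and reboot into Debian 12 (bookworm)
  apt update
  apt full-upgrade
  reboot

No reinstallation is needed, and the host keeps its identity and local configuration throughout.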

The plan was to port all CentOS 7 work to either:

Alma Linux 9
for the user-facing systems, such as user interfaces, worker nodes, login nodes, and systems that otherwise have to be RHEL compatible.
Debian 12
for all other systems

Problems on the non-HTC side

One colleague came up with the following metaphor:

Trying to renovate an old building from the ground up while all the residents remain living and working in it.

(Which is not very different from our recent experience with renovating our building and keeping the lights on in the data centre.)

On the CT-b side

(Nikhef IT is basically split between general services, called CT-beheer, and the High-Throughput computing side or NDPF. They face many challenges, some similar, some different.)

In the end it is not so much an issue of the sheer number of systems, but of the number of different services that have to be ported.

Not everything is well

We would have loved to have a unified, automated setup and configuration system, but various admins used to do things differently from one another in the past and such a system never materialised. This has built up a large amount of technical debt over time that now needs to be paid off.

During a crucial phase in the project, effective management and supervision were lacking. There are various reasons for this; the renovation mentioned earlier certainly contributed. The result was a lack of overarching planning and guidance.

  • Many services were built up over the years, accumulating layers of complexity that are difficult to reverse-engineer.
  • Porting legacy software with outdated Python and PHP code was hard because often the original developers were unavailable.
  • The overall amount of work was underestimated by management; only recently was external capacity hired to help out.

A happy end

The admin who shared these points for the HEPiX presentation in March 2024 (and who could not attend, being too busy fixing things) got it done in time in the end.

The HTC side

(That is the Nikhef Data Processing Facility.) Here, things look slightly different:

  • Standardised configuration management for many years
  • Fewer legacy systems
  • Many nodes, but fewer different types of services overall

Moving away from Torque (to HTCondor)

What complicates things a little more is the plan to (finally) say goodbye to our Torque batch systems and move everything to HTCondor. This effort is under considerable time pressure.

In the short term

  • Move our Grid Computing cluster to Alma Linux 9 on Torque (this is the quickest, easiest path)
  • Move the cluster for our local users to HTCondor on Debian

HTCondor on Debian

  • All jobs run in a container
  • If the user does not provide their own container, they have a choice of

    • "el7"
    • "el8"
    • "el9"

    as provided by the CVMFS repository unpacked.cern.ch

  • Actively pressing users to move to the new cluster with carrot-and-stick tactics:
    • Shrinking capacity on the old cluster, to be turned off on July 1 (stick)
    • Increasing capacity on the new cluster (carrot)
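To make the container setup above a bit more concrete, here is a minimal sketch of what a submit description on the new cluster might look like when a user explicitly picks an EL9 environment. The container universe and container_image knobs are standard HTCondor; the exact unpacked.cern.ch image path is illustrative, and the el7/el8/el9 shorthand itself is implemented on the site side rather than in the user's submit file.

  # Hypothetical HTCondor submit description (image path is illustrative)
  universe        = container
  container_image = /cvmfs/unpacked.cern.ch/registry.hub.docker.com/library/almalinux:9
  executable      = analysis.sh
  output          = job.out
  error           = job.err
  log             = job.log
  request_cpus    = 1
  request_memory  = 2 GB
  queue

Users who bring their own container point container_image at that image instead; otherwise they pick one of the site-provided el7/el8/el9 environments.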

Longer Term

  • As we gain more experience operating the local cluster, we will move the Grid cluster to HTCondor as well, possibly on Debian with containers in the same fashion.

Shameless plug

Come to Amsterdam in September! The HTCondor Workshop Autumn 2024 is held at Nikhef from 24–27 September.

https://indico.cern.ch/event/1386170/

Reflection

One can fairly ask why we decided to do two things at once (or three, if you count HTCondor) instead of one: why not simply upgrade to Alma Linux 9 everywhere instead of also switching platforms?

In response to this argument: many of the issues we face would have been a problem on Alma 9 as well, simply because the legacy software needs to be ported to newer stacks such as Python 3.

Investment in Debian

A reproducible, fully automated way to set up a Debian system was only finalised rather late, which delayed a good chunk of the work.

However, this currently works like clockwork.

In order to port more software to Debian we are setting up a build system to produce native packages.

This is currently an old worker node running Debian 12, with a GitLab runner that invokes sbuild.

For building RPM packages we use podman+mock on the same machine.
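As a rough sketch of what the build jobs amount to (package names and distribution identifiers are placeholders, and the surrounding GitLab CI and podman plumbing is omitted):

  # Debian 12 packages: clean chroot build with sbuild, driven by the GitLab runner
  sbuild --dist=bookworm --arch-all mypackage_1.0-1.dsc

  # EL9 RPM packages: rebuild a source RPM with mock; on our host mock itself
  # runs inside a podman container (container setup not shown)
  mock -r almalinux-9-x86_64 --resultdir=results --rebuild mypackage-1.0-1.el9.src.rpm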