More Experience with upgrading from CentOS 7 to Debian

Dennis van Dok

CaRCC Systems-Facing Track, Thursday 2024-06-20

A bit of background

  • This is Nikhef
  • That is what we call Grid Computing

Nikhef

  • Dutch National Institute for Subatomic Physics
  • Involved in 3 LHC experiments
  • KM3NeT (Neutrino telescope), XENON (Dark matter), LIGO/VIRGO, Einstein Telescope (Gravitational Waves)
  • Contributions in technical expertise, instrumentation, theory, and computing.

Grid Computing

  • Driven by LHC developments and scale of computing needs
  • European coordination of development through EU projects
  • Focus lies on High Throughput rather than High Performance (think delivery truck vs. race car).
  • Welcoming other sciences outside of (particle) physics

Grid development

  • Grid computing development since 2000
    • European Data Grid (initial development)
    • EGEE I, II, and III (until 2012)
    • EGI (infrastructure and federation)
  • Close ties with CERN's WLCG (Worldwide LHC Computing Grid)
  • The single supported platform: Red Hat EL(x) (compatible)

OK, Red Hat

  • We were content to have a stable, company-controlled distribution that we could basically leech off of by way of the 'free' rebuilders.
  • Scientific Linux, Scientific Linux CERN and CentOS were the choices since version 3.
  • All the way to version 7 this 'worked', but by then the only remaining choice was CentOS, and it was owned by Red Hat

Grid Operations

  • This modest (400 kW) data room houses one part of the Netherlands Tier 1 for the ATLAS experiment as part of the larger Dutch National Infrastructure.
  • We heavily rely on automation since we are only a 3-person team.

The Problem

As we all know, Red Hat dismantled CentOS as a binary compatible rebuild of their Enterprise Linux OS.

While others have stepped in to fill that void (Rocky, Alma, do we even mention Oracle?) Red Hat seems to actively discourage these efforts by limiting access to the (packaging) sources and restrictive licences.

This has made some of us a bit queasy about sticking with the Red Hat compatible platform that served us for so long.

The plan

Many of the services that are not user-facing are probably served just as well on any other platform. And what other platform would offer better long-term stability and guaranteed openness than Debian?

The Case for (or against) Debian

For the longest time, most labs did not consider Debian as a viable alternative. Commonly cited factors:

  • Hardware vendors officially support 'enterprise' Linux vendors, not Debian
  • The release cycle is unpredictable
  • The support for older releases is not so long
  • Supporting an additional platform would add to the maintenance burden for developers/packagers

Pros and cons

There are however some very good reasons why Debian would be a viable choice:

  • Many sysadmins already know it and even prefer it
  • There is no dependency on an entity driven by 'business needs' or 'shareholder value'
  • Proven track record of very stable distribution and excellent response times for security issues
  • Upgrade-in-place instead of a clean re-install looks very good for some types of systems (see the sketch below)
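As a rough illustration of that last point: the standard Debian in-place upgrade (sketched here for bullseye to bookworm, loosely following the release notes; site-specific repositories and the new non-free-firmware component are left out) boils down to a handful of commands.

  # Bring the current Debian 11 (bullseye) system fully up to date first
  apt update && apt full-upgrade
  # Point apt at the new release (also check any files under sources.list.d)
  sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list
  # Perform the release upgrade and reboot into Debian 12 (bookworm)
  apt update
  apt full-upgrade
  reboot

No reinstallation is needed, and the host keeps its identity and local configuration throughout.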

The plan was to port all CentOS 7 work to either:

Alma Linux 9
for the user-facing systems, such as user interfaces, worker nodes, login nodes, and systems that otherwise have to be RHEL compatible.
Debian 12
for all other systems

Problems on the non-HTC side

One colleague came up with the following metaphor:

Trying to renovate an old building from the ground up while all the residents remain living and working in it.

(Which is not very different from our recent experience with renovating our building and keeping the lights on in the data centre.)

On the CT-b side

(Nikhef IT is basically split between general services, called CT-beheer, and the High-Throughput computing side or NDPF. They face many challenges, some similar, some different.)

In the end it is not so much an issue of the sheer number of systems, but of the number of different services that have to be ported.

Not everything is well

We would have loved to have a unified, automated setup and configuration system, but various admins used to do things differently from one another in the past and such a system never materialised. This has built up a large amount of technical debt over time that now needs to be paid off.

During a crucial phase in the project, effective management and supervision were lacking. There are various reasons for this; the renovation mentioned earlier certainly contributed. The result was a lack of overarching planning and guidance.

  • Many services were built up over the years, accumulating layers of complexity that are difficult to reverse-engineer.
  • Porting legacy software with outdated Python and PHP code was hard because often the original developers were unavailable.
  • The overall amount of work was underestimated by management; only recently was external capacity hired to help out.

A happy end

The admin who shared these points for the HEPiX presentation in March 2024 (and who could not attend, being too busy fixing things) got it done in time in the end.

The HTC side

(That is the Nikhef Data Processing Facility.) Here, things look slightly different:

  • Standardised configuration management for many years
  • Fewer legacy systems
  • Many nodes, but fewer different types of services overall

Moving away from Torque (to HTCondor)

What complicates things a little more is the plan to (finally) say goodbye to our Torque batch systems and move everything to HTCondor. This effort is under considerable time pressure.

In the short term

  • Move our Grid Computing cluster to Alma Linux 9 on Torque (this is the quickest, easiest path)
  • Move the cluster for our local users to HTCondor on Debian

HTCondor on Debian

  • All jobs run in a container
  • If the user does not provide their own container, they have a choice of

    • "el7"
    • "el8"
    • "el9"

    as provided by the CVMFS repository unpacked.cern.ch

  • Actively pressing users to move to the new cluster with carrot-and-stick tactics:
    • Shrinking capacity on the old cluster, to be turned off on July 1 (stick)
    • Increasing capacity on the new cluster (carrot)
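To make the container setup above a bit more concrete, here is a minimal sketch of what a submit description on the new cluster might look like when a user explicitly picks an EL9 environment. The container universe and container_image knobs are standard HTCondor; the exact unpacked.cern.ch image path is illustrative, and the el7/el8/el9 shorthand itself is implemented on the site side rather than in the user's submit file.

  # Hypothetical HTCondor submit description (image path is illustrative)
  universe        = container
  container_image = /cvmfs/unpacked.cern.ch/registry.hub.docker.com/library/almalinux:9
  executable      = analysis.sh
  output          = job.out
  error           = job.err
  log             = job.log
  request_cpus    = 1
  request_memory  = 2 GB
  queue

Users who bring their own container point container_image at that image instead; otherwise they pick one of the site-provided el7/el8/el9 environments.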

Longer Term

  • As we gain more experience operating the local cluster, we will move the Grid cluster to HTCondor as well, possibly on Debian with containers in the same fashion.

Shameless plug

Come to Amsterdam in September! The HTCondor Workshop Autumn 2024 is held at Nikhef from 24–27 September.

https://indico.cern.ch/event/1386170/

Reflection

One can fairly ask why we decided to do two things at once (or three, if you count HTCondor) instead of one: why not simply upgrade to Alma Linux 9 everywhere instead of also switching platforms?

In response to this argument: many of the issues we face would have been a problem on Alma 9 as well, simply because the legacy software needs to be ported to newer stacks such as Python 3.

Investment in Debian

A reproducible, fully automated way to set up a Debian system was only finalised rather late, which delayed a good chunk of the work.

However, this currently works like clockwork.

In order to port more software to Debian we are setting up a build system to produce native packages.

This is currently an old worker node running Debian 12, with a GitLab runner that invokes sbuild.

For building RPM packages we use podman+mock on the same machine.
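As a rough sketch of what the build jobs amount to (package names and distribution identifiers are placeholders, and the surrounding GitLab CI and podman plumbing is omitted):

  # Debian 12 packages: clean chroot build with sbuild, driven by the GitLab runner
  sbuild --dist=bookworm --arch-all mypackage_1.0-1.dsc

  # EL9 RPM packages: rebuild a source RPM with mock; on our host mock itself
  # runs inside a podman container (container setup not shown)
  mock -r almalinux-9-x86_64 --resultdir=results --rebuild mypackage-1.0-1.el9.src.rpm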