CaRCC Systems-Facing Track, Thursday 2024-06-20
As we all know, Red Hat dismantled CentOS as a binary-compatible rebuild of their Enterprise Linux OS.
While others have stepped in to fill that void (Rocky, Alma, do we even mention Oracle?), Red Hat seems to actively discourage these efforts by limiting access to the (packaging) sources and by imposing restrictive licences.
This has made some of us a bit queasy about sticking with the Red Hat-compatible platform that served us for so long.
Many of the services that are not user-facing are probably served just as well on any other platform. And what other platform would offer better long-term stability and guaranteed openness than Debian?
For the longest time, most labs did not consider Debian a viable alternative. Factors that were named:
There are, however, some very good reasons why Debian would be a viable choice:
The plan was to port all CentOS 7 work to either:
One colleague came up with the following metaphor:
Trying to renovate an old building from the ground up while all the residents remain living and working in it.
(Which is not very different from our recent experience with renovating our building and keeping the lights on in the data centre.)
(Nikhef IT is basically split between general services, called CT-beheer, and the High-Throughput Computing side, or NDPF. Both face many challenges, some shared, some different.)
In the end, the issue is not so much the number of systems as the number of different services that have to be ported.
We would have loved to have a unified, automated system for setup and configuration, but in the past various admins did things differently from one another, and such a system never materialised. This has built up a large technical debt over time that now needs to be paid off.
During a crucial phase in the project, effective management and supervision were lacking. There were various reasons; the renovation mentioned earlier certainly contributed. The result was a lack of overarching planning and guidance.
The admin who shared these points for the HEPiX presentation in March 2024 (and who could not attend, being too busy fixing things) got it done in time in the end.
For the NDPF (that is, the Nikhef Data Processing Facility), things look slightly different:
What complicates things a little more is the plan to (finally) say goodbye to our Torque batch systems and move everything to HTCondor. This effort is under considerable time pressure.
If the user does not provide their own container, they have a choice of
as provided by the CVMFS repository unpacked.cern.ch
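To illustrate the fallback on the HTCondor side, here is a minimal Python sketch that generates a container-universe submit description, using the user's image if one is given and a site default otherwise. The image paths are hypothetical placeholders under /cvmfs/unpacked.cern.ch, not the actual images offered by the NDPF, and the helper name `submit_description` is made up for this example.

```python
"""Sketch: build an HTCondor submit description that falls back to a
site-provided container image when the user supplies none.

All image paths below are hypothetical placeholders, not the actual
images offered by the NDPF.
"""
from typing import Optional

UNPACKED = "/cvmfs/unpacked.cern.ch"

# Hypothetical site defaults: short name -> unpacked image directory on CVMFS.
DEFAULT_IMAGES = {
    "el9": f"{UNPACKED}/registry.example.org/os/el9:latest",
    "debian12": f"{UNPACKED}/registry.example.org/os/debian12:latest",
}


def submit_description(executable: str,
                       user_image: Optional[str] = None,
                       default: str = "el9") -> str:
    """Return submit-file text for a container-universe job."""
    # Prefer the user's own container; otherwise use a site default.
    image = user_image if user_image else DEFAULT_IMAGES[default]
    lines = [
        "universe = container",
        f"container_image = {image}",
        f"executable = {executable}",
        "log = job.log",
        "output = job.out",
        "error = job.err",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(submit_description("analysis.sh"))
```

The generated text can be written to a file and handed to `condor_submit`; the HTCondor Python bindings would be an alternative to plain text generation.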
Come to Amsterdam in September! The HTCondor Workshop Autumn 2024 will be held at Nikhef from 24 to 27 September.
One could fairly ask why we decided to do two things (or three, if you count HTCondor) instead of one: why not just upgrade to AlmaLinux 9 instead of also switching platforms?
In response: many of the issues we face would have come up on Alma 9 as well, simply because the legacy software needs to be ported to newer software such as Python 3.
A fully automated, reproducible way to set up a Debian system was finalised rather late, which delayed a good chunk of the work.
Now, however, it works like clockwork.
In order to port more software to Debian we are setting up a build system to produce native packages.
For now, this is an old worker node running Debian 12, with a GitLab runner that runs sbuild.
For building RPM packages we use podman+mock on the same machine.
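A minimal Python sketch of how a CI job on such a build node might dispatch packages: Debian source packages go to sbuild, source RPMs go to mock running inside a podman container. It only constructs the command lines and executes nothing. The mock chroot config `almalinux-9-x86_64`, the container image `localhost/mock-builder:latest` and the helper `build_command` are assumptions for illustration, not the actual Nikhef setup.

```python
"""Sketch: map a source package to the command that would build it.

Assumptions (hypothetical, not the actual Nikhef CI setup):
- Debian source packages (.dsc) are built with sbuild for Debian 12
  (bookworm), e.g. from a GitLab CI job on the build node.
- Source RPMs (.src.rpm) are rebuilt with mock inside a podman container;
  the image name and mock config below are placeholders.
"""
from pathlib import Path
from typing import List

DEBIAN_DIST = "bookworm"                      # Debian 12
MOCK_CONFIG = "almalinux-9-x86_64"            # placeholder mock chroot config
MOCK_IMAGE = "localhost/mock-builder:latest"  # hypothetical image with mock installed


def build_command(source: str) -> List[str]:
    """Return the argv that would build `source`; nothing is executed."""
    path = Path(source)
    if path.name.endswith(".dsc"):
        # sbuild builds the package in a clean chroot for the given suite.
        return ["sbuild", f"--dist={DEBIAN_DIST}", "--arch=amd64", str(path)]
    if path.name.endswith(".src.rpm"):
        # Mount the package's directory into the container and let mock
        # rebuild it there; mock in a container typically needs extra privileges.
        workdir = path.parent.resolve()
        return [
            "podman", "run", "--rm", "--privileged",
            "-v", f"{workdir}:/work:Z",
            MOCK_IMAGE,
            "mock", "-r", MOCK_CONFIG,
            "--resultdir=/work/results",
            "--rebuild", f"/work/{path.name}",
        ]
    raise ValueError(f"no builder known for {source!r}")


if __name__ == "__main__":
    for pkg in ["hello_2.10-3.dsc", "hello-2.12-1.src.rpm"]:
        print(" ".join(build_command(pkg)))
```

Keeping the dispatch in one place means the same runner can serve both package formats; a CI job would pass the returned list to `subprocess.run`.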