2019-03-26
The mission of the National Institute for Subatomic Physics Nikhef is to study the interactions and structure of all elementary particles and fields at the smallest distance scale and the highest attainable energy.
New sysadmin in the PDP group: Mary Hester joined in February.
The Computer Technology group hired a new webmaster: Roel Roomeijer. joined the system administration team in March.
Nikhef-PDP is half of the Netherlands Tier-1.
The other half is SURFsara.
Together we also participate in the DNI, the Dutch National Infrastructure for data/compute intensive sciences.
(thanks to Onno Zweers)
81 Dell R6415 nodes (3 racks), each with one AMD EPYC 7551P 32-core processor.
Price/performance is good.
CPU | HEPSPEC06/core | €/core |
---|---|---|
Intel(R) Xeon(R) Gold 6148 | 19.57 | 315 |
AMD EPYC 7551P | 14.94 | 247 |
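The table can be reduced to a single cost figure per unit of benchmark performance. The sketch below is only arithmetic on the numbers quoted above; it assumes the per-core HEPSPEC06 figure applies to all 32 cores of each of the 81 nodes, which the report does not state explicitly.

```python
# Figures copied from the table; node and core counts from the hardware line.
cpus = {
    "Intel Xeon Gold 6148": {"hs06_per_core": 19.57, "eur_per_core": 315},
    "AMD EPYC 7551P": {"hs06_per_core": 14.94, "eur_per_core": 247},
}


def eur_per_hs06(cpu):
    """Cost in euro of one HEPSPEC06 unit for the given CPU."""
    return cpu["eur_per_core"] / cpu["hs06_per_core"]


cost = {name: round(eur_per_hs06(spec), 2) for name, spec in cpus.items()}

# Assumption: every core of every node delivers the per-core figure.
nodes, cores_per_node = 81, 32
total_cores = nodes * cores_per_node
total_hs06 = total_cores * cpus["AMD EPYC 7551P"]["hs06_per_core"]

for name, value in cost.items():
    print(f"{name}: {value:.2f} EUR per HEPSPEC06")
print(f"{total_cores} cores, about {total_hs06:.0f} HEPSPEC06 in total")
```

On these numbers both CPUs land between 16 and 17 EUR per HEPSPEC06, so the lower price per core of the AMD part is roughly offset by its lower per-core score.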
Tested both with and without hyperthreading. It provided little benefit, so it is now turned off.
Single socket system doesn't suffer cache coherence penalty.
Fast 3.2 TB NVMe SSDs for local storage
Old Hitachi storage system replaced by Netapp.
This was bought for the /project storage systems, which hold important data with an external backup. The data is available on both Unix and Windows systems, so it uses a mixed mode of Unix permissions and Windows ACLs.
Not too many vendors offer this combination. In fact, the EMC is the last of its kind to offer it.
Partnership with University of Groningen led to a geographically separated off-site backup solution.
Hardware and software managed and maintained by Groningen; component replacements done by local team.
Much more economical than the previous commercial offering.
Setting up a high-throughput private compute cloud with OpenStack.
Project was not getting much traction. Several avenues were explored; lack of long-term stability in the software development played a role.
Currently given a higher priority due to internal demand for alternatives to classic cluster computing. Nikhef as a lab is trying agile as a way of managing projects, and the cloud project is likewise run in an agile way.
Progress has been made in bringing more systems under Salt's control: LDAP, DNS, dCache.
Integrated generation of Icinga checks.
Not replacing the legacy grid services which may not be with us much longer.
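As an illustration of generating monitoring configuration from host data, the sketch below renders Icinga 2 `Host` objects from a small inventory. It is not the Nikhef setup: the inventory, host names, addresses and the `vars.roles` custom variable are hypothetical, and `generic-host` is assumed to be the template from the Icinga 2 sample configuration.

```python
# Hypothetical inventory; in practice this would come from the
# configuration management data rather than a literal dict.
inventory = {
    "ldap01.example.org": {"address": "192.0.2.10", "roles": ["ldap"]},
    "dns01.example.org": {"address": "192.0.2.20", "roles": ["dns"]},
}


def render_host(name, spec):
    """Render one Icinga 2 Host object definition as text."""
    roles = ", ".join(f'"{role}"' for role in spec["roles"])
    return "\n".join([
        f'object Host "{name}" {{',
        '  import "generic-host"',  # assumed template name
        f'  address = "{spec["address"]}"',
        f'  vars.roles = [ {roles} ]',  # hypothetical custom variable
        "}",
    ])


# Sort by host name so the generated file is stable between runs.
config = "\n\n".join(
    render_host(name, spec) for name, spec in sorted(inventory.items())
)
print(config)
```

Service checks could then be attached with `apply Service` rules that match on the custom variable, so that adding a host to the inventory is enough to get it monitored.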
High-throughput clustered local file access.
Much appreciated by users
But…see below.
Standard Icinga2 install; configuration on GitLab server; server will install and test new config automatically.
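A validate-then-reload step of this kind can be sketched as follows. This is an assumption about how such automation might look, not the actual pipeline: the `deploy` helper is hypothetical, `icinga2 daemon -C` is the standard configuration check, and the reload is assumed to go through systemd.

```python
import subprocess
from types import SimpleNamespace

VALIDATE = ["icinga2", "daemon", "-C"]       # check the configuration only
RELOAD = ["systemctl", "reload", "icinga2"]  # assumes a systemd-managed service


def deploy(validate_cmd=VALIDATE, reload_cmd=RELOAD, run=subprocess.run):
    """Reload Icinga 2 only if the new configuration validates.

    Returns True when the reload was attempted and succeeded.
    """
    if run(validate_cmd).returncode != 0:
        return False  # keep the running configuration untouched
    return run(reload_cmd).returncode == 0


# Dry run with a fake runner, so nothing is executed on this machine.
calls = []


def fake_run(cmd):
    calls.append(cmd)
    # Pretend that validation fails.
    return SimpleNamespace(returncode=1 if cmd == VALIDATE else 0)


print(deploy(run=fake_run), calls)
```

Passing the runner as a parameter keeps the decision logic testable without a live Icinga 2 instance.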
Transferring ACLs from our older systems overloaded the write buffer, which locked up the entire volume.
A workaround was eventually found but it took months for DELL EMC to have the correct engineer on the case and to come up with a good diagnosis and a workable solution.
The problem was reported in July, a resolution came in October.
A second issue involved the memory consumption of TCP NFS connections. This turned out to be significantly higher than for UDP, leading to a system crash and downtime.
This issue took even longer to resolve. The initial problem was found in spring 2018, and the conclusion was that the machine had been sold with the wrong specs. A solution was proposed in December, which means a planned downtime for a system upgrade in April.
The result was too much traffic through the NFS door and hanging clients.
Resolution:
The vision of this is still blurry. It is hard to get a good sense from the users what they would want. Finding a representative pilot community is important.
If possible, we will virtualize the current legacy grid capacity and explore its elasticity.
Considering alternatives:
Talk to Tristan.
Trying to minimize legacy knowledge maintenance (technical debt): since dCache is now used at Nikhef, having both dCache and DPM in one lab makes no sense. We're going to phase out DPM and set up dCache for grid storage.
HEPiX Amsterdam, 14–18 October 2019