Nikhef Overview

Dennis van Dok

HEPiX Autumn 2023 Workshop, University of Victoria, Tuesday 2023-10-17

About Nikhef

  • Long-term partner in the HEPiX community
  • Traditionally in accelerator physics, but also:
  • Neutrinos (KM3NeT)
  • Dark matter (XENON)
  • Cosmics (Auger)
  • Gravitational waves (LIGO-VIRGO-KAGRA)
  • Also theory and eEDM

Organisation

  • ~243 scientific staff (96 permanent, 147 PhD candidates and Postdocs)
  • ~122 technical and support staff

(source: https://www.nikhef.nl/en/facts-figures-2022/personnel-2022/)

Nikhef is a partnership between the Institutes Organisation of NWO and six universities.

Datacentre housing

Nikhef's early involvement with the internet puts it right on one of the oldest internet exchanges in the world.

The housing facility is currently being extended due to popular demand.

See talk by Floris Bieshaar later this week.

High Throughput Computing

  • Running a modest facility 'at scale'.
  • Limited to 400 kW total power because of air cooling capacity

The Dutch National Infrastructure

  • Together with SURF (SARA), we provide compute power and storage to projects that have a Dutch partner.
  • NL-T1 is run jointly with SURF.
  • SURF provides funding for personnel and equipment.

Clusters

  • Still running Torque, but thinking very hard now about a transition to HTCondor.
  • Under investigation: running all payloads in containers to support more platforms
  • All our current systems (< 5 years old) have single-socket AMD CPUs.
  • Three ARC CEs on the front-end

Storage

  • Happily running dCache for several years now.
  • Latest addition was a 'bargain': 6 PB in 6 Lenovo SR655/D3284 systems of 84 × 16 TB each (1.3 PB raw, 1.0 PB net per system)
  • 3 installed in 2022, 3 more in May 2023.
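
The capacity figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 PB = 1000 TB) and reading the quoted raw and net figures as per-system values; the variable names are illustrative only.

```python
# Capacity check for the storage purchase described on this slide.
# Assumption: decimal units, 1 PB = 1000 TB.
DRIVES_PER_SYSTEM = 84
DRIVE_TB = 16
SYSTEMS = 6
NET_PB_PER_SYSTEM = 1.0  # net figure quoted on the slide

raw_pb_per_system = DRIVES_PER_SYSTEM * DRIVE_TB / 1000  # 1.344 PB, quoted as 1.3 PB
total_raw_pb = raw_pb_per_system * SYSTEMS               # about 8.1 PB raw in total
total_net_pb = NET_PB_PER_SYSTEM * SYSTEMS               # the 6 PB headline figure
net_fraction = NET_PB_PER_SYSTEM / raw_pb_per_system     # share of raw space left after overhead

print(f"raw per system: {raw_pb_per_system:.2f} PB")
print(f"total raw:      {total_raw_pb:.2f} PB")
print(f"total net:      {total_net_pb:.1f} PB")
print(f"net/raw:        {net_fraction:.0%}")
```

Under these assumptions roughly a quarter of the raw space goes to redundancy and file system overhead.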

Configuration Management

Token transitions

There are too many incompatible token profiles doing the rounds today.

  • SciTokens profile
  • WLCG profile
  • EGI/AARC profile

Everybody agrees it would be a good idea to introduce:

The GUT (Grand Unified Token) profile

[Image: xkcd comic "Standards"]

(source: https://xkcd.com/927/)

Providing neutral terrain

  • We don't have a special interest in pushing any token technology.
  • But we do like to talk to everyone.
  • The main goal is interoperability.
  • So we also advocate for a mapping plug-in standard.

Leading this effort is Mischa Sallé.
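
What a mapping plug-in could look like, as a rough sketch only. It is not the interface proposed by the effort mentioned above. The function names and the `GROUP_TO_ACCOUNT` table are hypothetical. The claim names (`wlcg.ver`, `wlcg.groups`, `ver`, `eduperson_entitlement`) reflect our reading of the three profiles and should be checked against the specifications. Token signature validation is assumed to have happened already.

```python
from typing import Dict, List, Optional

# Hypothetical site policy: group path -> local account.
GROUP_TO_ACCOUNT: Dict[str, str] = {
    "/atlas": "atlas001",
    "/dune": "dune001",
}


def detect_profile(claims: dict) -> str:
    """Guess the token profile from the claims that are present."""
    if "wlcg.ver" in claims:
        return "wlcg"
    if str(claims.get("ver", "")).startswith("scitoken:"):
        return "scitokens"
    if "eduperson_entitlement" in claims:
        return "aarc"
    return "unknown"


def extract_groups(claims: dict) -> List[str]:
    """Normalise group information to '/group/subgroup' paths."""
    profile = detect_profile(claims)
    if profile == "wlcg":
        return list(claims.get("wlcg.groups", []))
    if profile == "aarc":
        groups = []
        for entitlement in claims.get("eduperson_entitlement", []):
            # Assumed form: <namespace>:group:<group>[:<subgroup>][:role=<r>]#<authority>
            body = entitlement.split("#", 1)[0]
            if ":group:" not in body:
                continue
            path = body.split(":group:", 1)[1]
            parts = [p for p in path.split(":") if not p.startswith("role=")]
            groups.append("/" + "/".join(parts))
        return groups
    # SciTokens express capabilities in 'scope' rather than groups,
    # so this sketch finds nothing to map for them.
    return []


def map_to_account(claims: dict) -> Optional[str]:
    """Return the first local account matching one of the token's groups."""
    for group in extract_groups(claims):
        if group in GROUP_TO_ACCOUNT:
            return GROUP_TO_ACCOUNT[group]
    return None


print(map_to_account({"wlcg.ver": "1.0", "wlcg.groups": ["/atlas"]}))
```

The point of the sketch: the profile-specific part stays small, and the site policy (the table) is the same for every profile.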

DDoSing our government (with permission)

Joint exercise between government, IT providers, academia

  • High bandwidth enables use for 'other' purposes
  • Integrated exercise between government departments, IT providers and academia (the red team)
  • Actually DDoSing the live systems (at night, to minimise disruption)
  • Regularly repeated to improve resilience over time
  • …but attacks get more sophisticated, too

Innovation

Several new(ish) developments were tested to see how they suit our needs.

Storage

Storage II

  • WEKA distributed file system
  • extensively tested
  • interesting maybe for large shared scratch space across worker nodes
  • No support for IPv6?

Storage III

  • SSD backed data transfer node
  • System not capable of 400 Gbit/s yet
  • reached about 30 GB/s
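
The two figures above use different units (bits versus bytes). A small conversion sketch, assuming decimal prefixes and 8 bits per byte, puts them side by side; the helper name is illustrative.

```python
def gbytes_to_gbits(gbytes_per_s: float) -> float:
    """Convert GB/s to Gbit/s (8 bits per byte, decimal prefixes)."""
    return gbytes_per_s * 8


TARGET_GBIT_S = 400   # target rate from the slide
ACHIEVED_GB_S = 30    # measured rate from the slide, approximate

achieved_gbit_s = gbytes_to_gbits(ACHIEVED_GB_S)
fraction_of_target = achieved_gbit_s / TARGET_GBIT_S

print(f"{ACHIEVED_GB_S} GB/s = {achieved_gbit_s:.0f} Gbit/s "
      f"({fraction_of_target:.0%} of {TARGET_GBIT_S} Gbit/s)")
```

So about 30 GB/s corresponds to roughly 240 Gbit/s, a bit over half of the 400 Gbit/s target.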

Networking

  • New core router setup, dual Nokia 7750
  • 800 Gbit/s ready
  • More in talk by Bart van der Wal
  • Plans for 800 Gbit/s (coherent) link to CERN.
  • Generally testing routing/switching platforms

Computing

  • AMD Genoa test (96 cores, 192 threads)

    Seems to deliver the same performance per thread as AMD Rome delivers per core.

  • AMD Bergamo (128 cores) is coming.

End of generic computing?

  • Trade-offs between number of cores and memory bandwidth
  • Adding more cores won't increase the total memory bandwidth.
  • Genoa-X has a (much) larger L3 cache, which could be interesting for applications that need high throughput with small data.
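
The core count versus memory bandwidth trade-off, as a back-of-the-envelope sketch. The memory configuration is an assumption, not a measurement from our tests: 12 channels of DDR5-4800 with 8 bytes per transfer, the same for both parts. Only the core counts come from the slides.

```python
# Assumed memory configuration (theoretical peak, not measured):
CHANNELS = 12
MEGATRANSFERS_PER_S = 4800
BYTES_PER_TRANSFER = 8

# Peak socket bandwidth in GB/s; it does not change with the core count.
total_gb_s = CHANNELS * MEGATRANSFERS_PER_S * BYTES_PER_TRANSFER / 1000


def bandwidth_per_core(total: float, cores: int) -> float:
    """Share of the socket's peak memory bandwidth available per core."""
    return total / cores


for name, cores in (("Genoa", 96), ("Bergamo", 128)):
    share = bandwidth_per_core(total_gb_s, cores)
    print(f"{name}: {cores} cores, {share:.1f} GB/s per core")
```

With the total fixed, going from 96 to 128 cores cuts the per-core share by a quarter. Memory-bound payloads feel this; cache-friendly ones much less.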

Conclusion: type of system really begins to depend on the type of application we are aiming for.

More computing

  • AMD GPU MI210 now on par with NVIDIA offerings

Generally: we have good relationships with hardware vendors and are open to collaboration on testing!

And more computing

In the new lab (post-renovation) we're going to dive into water cooling.

  • Currently too many competing systems and standards
  • Understanding what is out there and what we could buy is key
  • Avoiding vendor lock-in
  • This would lift the 400 kW air-cooling limit

To be presented at a future HEPiX!

Renovation

  • Oldest building on the campus. Needed a facelift.

Challenges

  • Many challenges:
  • This is not your ordinary office building
  • Keeping the lights on for computing facilities 24/7
  • Building a new extension to our housing facility.

See talk by Floris on Friday.

Come visit Nikhef in our new building some time! Maybe for the HTCondor Europe Workshop in 2024.

Thanks for your attention

With thanks to my colleagues who provided this material:

Mischa Sallé, Tristan Suerink, David Groep, Erik Kooistra