Next business day (or whenever)

Dennis van Dok

HEPiX Spring 2022 on-line workshop, 27 April 2022

How to own a computer

So you buy computers

to implement services

to fulfil your mission

but the hardware breaks

service goes down

mission in jeopardy

what do you do?

Support contracts

Here is how it should work. We pay for the hardware plus a 3 or 4 year support contract with the vendor (or support organisation).

Something breaks, we call them up and the next day a support engineer shows up with parts to fix the machine.

(Or, a replacement part is shipped overnight and we replace it ourselves. We ship the defective part back using the same box and carrier.)

Case study

Here is a virtual exchange with Support Unit™ based on real experience.

A complaining controller

Us: Hey, one of our 4 storage blocks is complaining that controller A has reached end-of-life on its supercapacitor

SU: OK, we will send you a replacement overnight

Us: Thanks, that was quick

(a little later)

Us: Hey, it happened again on another storage block

SU: OK, we will send you a replacement overnight

Us: thanks, good job

(a few weeks later)

Us: Hey, another controller with the same problem?

SU: Er, hang on, your controller will be there in two weeks.

Us: ???

What happened?

This case raises all kinds of questions:

  • What went wrong?
  • Will this happen again?
  • Can they just get away with this?
  • What are we paying for, exactly?
  • How does this impact our operations?

This is not a unique case

We've experienced an increase in cases where vendors do not meet the agreed deadline.

This raises some concerns; what is it we are doing here exactly?

The case for hardware support

How to take care of your computer

  • given that everything breaks at some point
  • consider the consequences when it does
  • whose problem is this anyway?
    • ours?
    • the manufacturer?
    • the vendor/integrator?

Warranty

  • 'standard' warranty for \(x\) years
  • dependent on manufacturer and component
  • claim and replacement procedure may vary

Expected lifetime

Although it is possible to extend the warranty, the cost goes up with age[citation needed]. Vendors don't really like to support stuff endlessly…

…but then again if a customer is willing to pay a high price this is a potential goldmine.

Economic lifetime

We buy hardware with the expectation of operating it for a number of years (typically 4). This is driven by the aforementioned price of support, but also by the fact that at some point the cost of operation (cooling and electricity) catches up with the initial purchase price. Although Moore's law no longer holds, there are still sufficient improvements in CPU speed and efficiency, data density, and network bandwidth to render hardware obsolete in half a decade.
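The point where the cost of operation catches up with the purchase price can be estimated with a back-of-the-envelope calculation. The following is a minimal sketch; the function name, the PUE factor used to account for cooling, and all numbers are hypothetical and only illustrate the arithmetic, not our actual costs.

```python
def years_until_operating_cost_matches_purchase(
    purchase_price, power_kw, price_per_kwh, pue=1.5
):
    """Years until cumulative electricity and cooling cost equals the purchase price.

    pue (power usage effectiveness) is total facility power divided by IT power,
    so it folds the cooling overhead into the electricity bill. The default of
    1.5 is an assumption for illustration.
    """
    hours_per_year = 24 * 365
    annual_cost = power_kw * pue * hours_per_year * price_per_kwh
    return purchase_price / annual_cost


# Hypothetical worker node: 6000 EUR, 0.5 kW average draw, 0.25 EUR/kWh.
years = years_until_operating_cost_matches_purchase(6000, 0.5, 0.25)
print(f"{years:.1f} years")
```

With these made-up inputs the break-even lands in the same range as the typical 4-year operating period, which is the effect described above; real numbers will differ per site.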

Commodity vs. special hardware

Some components are generic and interchangeable

  • hard drives (but mind firmware)
  • memory modules
  • network cards
  • optics

But many are specific to the system

  • motherboards
  • power supplies
  • switch/router modules

Classes of service hardware

Not all hardware is created equal.

Mission critical services go onto hardware with sufficient redundancy and resilience.

Less critical systems (worker nodes) often don't even get dual power supplies.

Implementing redundancy

  • buying redundant hardware ($$$)
  • stocking up on spare parts
    • but which parts? What if other things break?
  • taking out paid support ($$)

Special hardware

In some cases we don't really have much of a choice; critical storage and network systems operate with special hardware, hard drive firmware, network modules, etc. that cannot be sourced other than from the vendor.

This kind of vendor lock-in is undesirable but unavoidable for the higher-tier equipment. This type of support is also the most expensive, as we need solutions within hours, not days.

Cost/benefit vs. Risk analysis

  • cost/benefit considerations are less useful here
  • taking a risk based approach:
    • risk of losing services
    • risk of losing data
    • risk of damaged reputation

Risk mitigation

We could make sure we maintain availability by buying double the hardware up front so we have enough spare parts. But that is very expensive.

It is possible to mitigate some of the risk by taking out insurance in the form of a support contract. This takes care of getting replacements to you quickly so you don't suffer downtime (or not as much).

The sweet spot

Taking everything into account and with the right risk analysis we usually land on next business day support for most systems. We don't run a 24/7 operation anyway; some resilience goes into the design of the system to hold out for at least a few days for the most important systems.

What happens when things go wrong

We have kept our part of the bargain by paying for support, but the support organisation did not keep up their end.

What gives?

No hard data

We do not measure supplier performance when it comes to meeting their targets. I'd love to hear from people who do this.
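A minimal sketch of what such a measurement could look like, assuming we kept a simple log of when each support case was opened and when the part arrived. The function names, the case log and the one-business-day target are hypothetical; public holidays are ignored.

```python
from datetime import date, timedelta


def business_days_between(opened, delivered):
    """Count weekdays (Mon-Fri) after `opened` up to and including `delivered`.

    Public holidays are ignored in this sketch.
    """
    days = 0
    current = opened
    while current < delivered:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days


def fraction_on_time(cases, target_business_days=1):
    """Fraction of cases delivered within the agreed number of business days."""
    met = sum(
        1
        for opened, delivered in cases
        if business_days_between(opened, delivered) <= target_business_days
    )
    return met / len(cases)


# Hypothetical case log: (date opened, date the part arrived).
cases = [
    (date(2022, 1, 10), date(2022, 1, 11)),  # Monday -> Tuesday
    (date(2022, 2, 4), date(2022, 2, 7)),    # Friday -> Monday
    (date(2022, 3, 1), date(2022, 3, 15)),   # two weeks later
]
print(f"on time: {fraction_on_time(cases):.0%}")
```

Even a crude number like this, tracked per vendor, would turn anecdotes into something that can be brought to the table in a conversation with the supplier.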

We have seen suppliers struggle to deliver in the past; we've also seen marked improvement through restructuring and shaping up their support organisation.

Why is this happening

  • No single cause
  • logistics worldwide working on same-day shipping, why doesn't this work for us?
  • no more weak links in the supply chain: they're all equally weak

Economic complications

  • centralisation of warehouses
  • overextended supply lines
  • low stock
  • difficulty in supplying stock

Additional complications

  • global chip shortage
  • high oil prices
  • geopolitical unrest/war
  • pandemics
  • climate change

Is there anything we can do?

A better use of ‘the five stages of grief.’

Denial

This is not so bad, a little slip-up with few consequences and surely they'll do better next time

Anger

We're not going to stand for this, you better shape up or else we will send in our lawyers! (Wait, do we have those?)

Bargaining

OK so we don't really have a lawyer on retainer but you could at least try and talk to us and explain what went wrong? We can be really annoying when we keep calling you up and we could tell our friends how bad your service is.

Depression

This is not going to get better, is it?

Acceptance

The path forward

Dealing with this new reality

  • refuse to buy from Bad Vendor™
    • ‘cut off your nose to spite your face’
  • open a dialog and work with them
  • (move to the cloud)

Local spare parts kit

One vendor agreed to send us a collection of the most common spare parts for our systems.

In case we need a replacement:

  1. use part from crash kit
  2. open a case with support
  3. wait for the shipment, send back replaced part
  4. received part replenishes crash kit
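The crash-kit cycle above can be sketched as a small piece of bookkeeping. This is an illustration only: the `CrashKit` class, its method names and the part names are hypothetical, not a tool we actually run.

```python
from collections import Counter


class CrashKit:
    """Local stock of spare parts, replenished through support cases."""

    def __init__(self, stock):
        self.stock = Counter(stock)
        self.open_cases = []  # parts we are still waiting for

    def replace(self, part):
        """Steps 1 and 2: take a part from the kit and open a support case."""
        if self.stock[part] == 0:
            raise LookupError(f"no {part} left in kit; wait for the vendor")
        self.stock[part] -= 1
        self.open_cases.append(part)

    def receive(self, part):
        """Steps 3 and 4: the shipment arrives and replenishes the kit."""
        self.open_cases.remove(part)
        self.stock[part] += 1


kit = CrashKit({"power supply": 1, "hard drive": 2})
kit.replace("hard drive")
print(kit.stock["hard drive"], kit.open_cases)
kit.receive("hard drive")
print(kit.stock["hard drive"], kit.open_cases)
```

The point of the arrangement is visible in `replace`: the repair no longer waits for the shipment, and a slow delivery only matters when the same part fails again before the kit has been refilled.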

Not taking out a support contract

For one class of worker nodes we went with the 'standard warranty':

  • different warranty conditions for different parts
  • advanced replacement still an option

Backplane replacement

[Figure: backplane.png]

An interesting case where a defective backplane rendered one of six blades unusable.

  • choice was between letting it go or replacing the part ourselves
  • opted for replacement: about one hour of work for two people, all told

Conclusions

We are living in interesting times.

Buying hardware involves taking into account what kind of support is going to be needed, but also the reality of how that support is organised.

Care to share experiences?