Here is how it should work. We pay for the hardware plus a 3- or 4-year support
contract with the vendor (or support organisation).
Something breaks, we call them up and the next day a support
engineer shows up with parts to fix the machine.
(Or, a replacement part is shipped overnight and we replace it
ourselves. We ship the defective part back using the same box and
Here is a virtual exchange with Support Unit™ based on real experience.
A complaining controller
Us: Hey, one of our 4 storage blocks is complaining that controller A
has reached end-of-life on its supercapacitor
SU: OK, we will send you a replacement overnight
Us: Thanks, that was quick
(a little later)
Us: Hey, it happened again on another storage block
SU: OK, we will send you a replacement overnight
Us: Thanks, good job
(a few weeks later)
Us: Hey, another controller with the same problem?
SU: Er, hang on, your controller will be there in two weeks.
This case raises all kinds of questions:
What went wrong?
Will this happen again?
Can they just get away with this?
What are we paying for, exactly?
How does this impact our operations?
This is not a unique case
We've experienced an increase in cases where vendors do not meet
the agreed deadline.
This raises some concerns; what is it we are doing here exactly?
The case for hardware support
How to take care of your computer
given that everything breaks at some point
consider the consequences when it does
whose problem is this anyway?
'standard' warranty for \(x\) number of years
dependent on manufacturer and component
claim and replacement procedure may vary
Although it is possible to extend the warranty, the cost goes up
with age. Vendors don't really like to support old hardware
…but then again if a customer is willing to pay a high price this
is a potential goldmine.
We buy hardware expecting to operate it for a number of years
(typically 4). This is driven by the aforementioned price of support,
but also by running costs: at some point the cost of operation
(cooling and electricity) catches up with the initial purchase price.
Although Moore's law no longer holds, there are still sufficient
improvements in CPU speed and efficiency, data density, and network
bandwidth to render hardware obsolete in half a decade.
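To get a feeling for when running costs catch up with the purchase price, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function name, the use of a PUE factor to account for cooling, and all the numbers, which are made up and not taken from our systems.

```python
HOURS_PER_YEAR = 24 * 365


def years_until_opex_matches_purchase(purchase_price, power_kw, price_per_kwh, pue=1.5):
    """Years of continuous operation after which energy cost equals the purchase price.

    pue (power usage effectiveness) scales the machine's own draw to include
    cooling and other data centre overhead; 1.5 is an assumed value.
    """
    if min(purchase_price, power_kw, price_per_kwh, pue) <= 0:
        raise ValueError("all inputs must be positive")
    annual_energy_cost = power_kw * pue * HOURS_PER_YEAR * price_per_kwh
    return purchase_price / annual_energy_cost


# Hypothetical node: price 6000 (any currency), 0.5 kW draw, 0.30 per kWh.
break_even_years = years_until_opex_matches_purchase(6000, 0.5, 0.30, pue=1.5)
print(f"energy cost matches purchase price after {break_even_years:.1f} years")
```

With these invented figures the break-even lands at roughly three years, which is in the same range as the typical 4-year lifetime; real numbers will differ per site and per system.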
Commodity vs. special hardware
Some components are generic and interchangeable
hard drives (but mind firmware)
But many are specific to the system
Classes of service hardware
Not all hardware is created equal.
Mission critical services go onto hardware with sufficient redundancy and
Less critical systems (worker nodes) often don't even get dual power supplies.
buying redundant hardware ($$$)
stocking up on spare parts
but which parts? What if other things break?
taking out paid support ($$)
In some cases we don't really have much of a choice; critical storage
and network systems operate with special hardware, hard drive firmware,
network modules, etc. that cannot be sourced other than from the vendor.
This kind of vendor lock-in is undesirable but unavoidable for the higher
tier equipment. This type of support is also the most expensive, as we need
solutions within hours, not days.
Cost/benefit vs. Risk analysis
cost/benefit considerations are less useful here
taking a risk based approach:
risk of losing services
risk of losing data
risk of damaged reputation
We could make sure we maintain availability by buying double the hardware
up front so we have enough spare parts. But that is very expensive.
It is possible to mitigate some of the risk by taking out insurance in the
form of a support contract. This takes care of getting replacements
to you quickly so you don't suffer downtime (or not as much).
The sweet spot
Taking everything into account and with the right risk analysis we
usually land on next business day support for most systems. We
don't run a 24/7 operation anyway; some resilience goes into the design
of the system to hold out for at least a few days for the most important
What happens when things go wrong
We have kept our part of the bargain by paying for support, but
the support organisation did not keep up their end.
No hard data
We do not measure supplier performance when it comes to meeting
their targets. I'd love to hear from people who do this.
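Measuring this does not have to be complicated. Below is a minimal sketch in Python of what tracking could look like, assuming we log the date a case was opened and the date the part arrived. The names (SupportCase, on_time_rate), the next-business-day target, and the example dates are all hypothetical; public holidays are ignored.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SupportCase:
    vendor: str
    opened: date
    part_arrived: date


def next_business_day(day):
    """Next weekday after `day` (public holidays are ignored in this sketch)."""
    nxt = day + timedelta(days=1)
    while nxt.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        nxt += timedelta(days=1)
    return nxt


def on_time_rate(cases):
    """Per vendor, the fraction of cases where the part arrived by the next business day."""
    totals, met = {}, {}
    for case in cases:
        totals[case.vendor] = totals.get(case.vendor, 0) + 1
        if case.part_arrived <= next_business_day(case.opened):
            met[case.vendor] = met.get(case.vendor, 0) + 1
    return {vendor: met.get(vendor, 0) / n for vendor, n in totals.items()}


# Made-up cases, loosely shaped like the exchange earlier: two quick, one late.
cases = [
    SupportCase("Support Unit", date(2022, 3, 1), date(2022, 3, 2)),
    SupportCase("Support Unit", date(2022, 3, 4), date(2022, 3, 7)),
    SupportCase("Support Unit", date(2022, 4, 5), date(2022, 4, 19)),
]
rates = on_time_rate(cases)
```

Even a simple per-vendor rate like this would give us something concrete to bring to the conversation with a supplier.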
We have seen suppliers struggle to deliver in the past; we've also
seen marked improvement through restructuring and shaping up their
Why is this happening
No single cause
logistics worldwide is working on same-day shipping, so why doesn't this work?
no more weak links in the supply chain—they're all equally weak
centralisation of warehouses
overextended supply lines
difficulty in supplying stock
global chip shortage
high oil prices
Is there anything we can do?
A better use of ‘the five stages of grief.’
This is not so bad, a little slip-up with few consequences and surely
they'll do better next time
We're not going to stand for this, you better shape up or else we will
send in our lawyers! (Wait, do we have those?)
OK so we don't really have a lawyer on retainer but you could at least
try and talk to us and explain what went wrong? We can be really annoying
when we keep calling you up and we could tell our friends how bad your
This is not going to get better, is it?
The path forward
Dealing with this new reality
refuse to buy from Bad Vendor™
‘cut off your nose to spite your face’
open a dialogue and work with them
(move to the cloud)
Local spare parts kit
One vendor agreed to send us a collection of the most common spare parts
for our systems.
In case we need a replacement:
use part from crash kit
open a case with support
wait for the shipment, send back replaced part
received part replenishes crash kit
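The replacement cycle above can be sketched as a small bookkeeping model in Python. This is only an illustration under assumptions: the CrashKit class, its method names, and the stock levels are hypothetical, and the real procedure involves the vendor's case system rather than a list in memory.

```python
from collections import Counter


class CrashKit:
    """Local stock of spare parts, replenished by vendor shipments."""

    def __init__(self, stock):
        self.stock = Counter(stock)
        self.open_cases = []  # parts we are still waiting on from the vendor

    def replace_part(self, part):
        """Take the spare from the kit, then open a case with support."""
        if self.stock[part] <= 0:
            raise LookupError(f"no spare {part} in the kit; wait for the vendor")
        self.stock[part] -= 1
        self.open_cases.append(part)

    def receive_shipment(self, part):
        """The shipped part refills the kit; the defective one is sent back."""
        self.open_cases.remove(part)
        self.stock[part] += 1


# Hypothetical kit contents.
kit = CrashKit({"controller": 1, "power supply": 2})
kit.replace_part("controller")      # machine is fixed right away
in_flight = list(kit.open_cases)    # the vendor still owes us a controller
kit.receive_shipment("controller")  # kit is back to full strength
```

The point of the model is that the repair no longer waits on the shipment: a late delivery only matters if the same part fails again before the kit is replenished.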
Not taking out a support contract
For one class of worker nodes we went with 'standard warranty'
different warranty conditions for different parts
advanced replacement still an option
Interesting case where a defective backplane rendered one of six blades unusable
choice was between letting it go or replacing the part ourselves
opted for replacement: about one hour of work for two people, all told.
We are living in interesting times.
Buying hardware involves taking into account what kind of support
is going to be needed, but also the reality of how that support is