Here is how it should work. We pay for the hardware plus a three- or four-year
support contract with the vendor (or support organisation).
When something breaks, we call them up and the next day a support
engineer shows up with parts to fix the machine.
(Or, a replacement part is shipped overnight and we replace it
ourselves. We ship the defective part back using the same box and
carrier.)
Case study
Here is a virtual exchange with Support Unit™ based on real experience.
A complaining controller
Us: Hey, one of our 4 storage blocks is complaining that controller A
has reached end-of-life on its supercapacitor
SU: OK, we will send you a replacement overnight
Us: Thanks, that was quick
(a little later)
Us: Hey, it happened again on another storage block
SU: OK, we will send you a replacement overnight
Us: Thanks, good job
(a few weeks later)
Us: Hey, another controller with the same problem?
SU: Er, hang on, your controller will be there in two weeks.
Us: ???
What happened?
This case raises all kinds of questions:
What went wrong?
Will this happen again?
Can they just get away with this?
What are we paying for, exactly?
How does this impact our operations?
This is not a unique case
We've experienced an increase in cases where vendors do not meet
the agreed deadline.
This raises some concerns; what is it we are doing here exactly?
The case for hardware support
How to take care of your computer
given that everything breaks at some point
consider the consequences when it does
whose problem is this anyway?
ours?
the manufacturer?
the vendor/integrator?
Warranty
'standard' warranty for \(x\) years
dependent on manufacturer and component
claim and replacement procedure may vary
Expected lifetime
Although it is possible to extend the warranty, the cost goes up
with age[citation needed]. Vendors don't really like to support
stuff endlessly…
…but then again if a customer is willing to pay a high price this
is a potential goldmine.
Economic lifetime
We buy hardware expecting to operate it for a number of
years (typically 4). This is driven by the aforementioned price of
support, but also because at some point the cost of operation (cooling and
electricity) catches up with the initial purchase price. Although Moore's law no
longer holds, there are still sufficient improvements in CPU speed and
efficiency, data density, and network bandwidth to render hardware
obsolete in half a decade.
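To get a feel for when the running cost catches up with the purchase price, here is a rough sketch. The purchase price, power draw, electricity tariff and cooling overhead factor (PUE) are all made-up illustration values, not figures from our own accounts, and the function name is hypothetical.

```python
def years_until_opex_matches_purchase(purchase_price, power_kw,
                                      price_per_kwh, pue=1.5):
    """Years of continuous operation after which electricity plus
    cooling has cost as much as the machine itself.

    pue is the power usage effectiveness: total facility power divided
    by IT power, used here as a crude stand-in for cooling overhead.
    """
    hours_per_year = 24 * 365
    yearly_cost = power_kw * pue * hours_per_year * price_per_kwh
    return purchase_price / yearly_cost


# Hypothetical worker node: 6000 (any currency), 0.5 kW average draw,
# 0.30 per kWh, PUE of 1.5. All of these are assumptions.
years = years_until_opex_matches_purchase(6000, 0.5, 0.30, pue=1.5)
print(f"Operating cost matches purchase price after {years:.1f} years")
```

With these invented numbers the break-even lands at roughly three years, which is in the same ballpark as a typical support contract. Plug in your own tariff and power measurements before drawing any conclusions.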
Commodity vs. special hardware
Some components are generic and interchangeable
hard drives (but mind firmware)
memory modules
network cards
optics
But many are specific to the system
motherboards
power supplies
switch/router modules
Classes of service hardware
Not all hardware is created equal.
Mission critical services go onto hardware with sufficient redundancy and
resilience.
Less critical systems (worker nodes) often don't even get dual power supplies.
Implementing redundancy
buying redundant hardware ($$$)
stocking up on spare parts
but which parts? What if other things break?
taking out paid support ($$)
Special hardware
In some cases we don't really have much of a choice; critical storage
and network systems operate with special hardware, hard drive firmware,
network modules, etc. that can only be sourced from the vendor.
This kind of vendor lock-in is undesirable but unavoidable for the higher
tier equipment. This type of support is also the most expensive, as we need
solutions within hours, not days.
Cost/benefit vs. Risk analysis
cost/benefit considerations are less useful here
taking a risk-based approach:
risk of losing services
risk of losing data
risk of damaged reputation
Risk mitigation
We could make sure we maintain availability by buying double the hardware
up front so we have enough spare parts. But that is very expensive.
It is possible to mitigate some of the risk by taking out insurance in the
form of a support contract. This takes care of getting replacements
to you quickly so you don't suffer downtime (or not as much).
The sweet spot
Taking everything into account and with the right risk analysis we
usually land on next-business-day support for most systems. We
don't run a 24/7 operation anyway; for the most important systems, some
resilience goes into the design so that they can hold out for at least a
few days.
What happens when things go wrong
We have kept our part of the bargain by paying for support, but
the support organisation did not hold up their end.
What gives?
No hard data
We do not measure supplier performance when it comes to meeting
their targets. I'd love to hear from people who do this.
We have seen suppliers struggle to deliver in the past; we've also
seen marked improvement through restructuring and shaping up their
support organisation.
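If we did want to start measuring, a minimal sketch could look like this. It assumes we log the date a case was opened and the date the part arrived, and it counts weekdays only (public holidays are ignored). The function names and the three example cases are hypothetical; the cases are loosely modelled on the controller exchange earlier, not on real ticket data.

```python
from datetime import date, timedelta


def business_days_between(opened, delivered):
    """Count weekdays (Mon-Fri) after `opened` up to and including
    `delivered`. Public holidays are not taken into account."""
    days = 0
    current = opened
    while current < delivered:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days


def met_target_ratio(cases, target_business_days=1):
    """Fraction of cases delivered within the agreed number of
    business days (1 = next business day)."""
    met = sum(
        1 for opened, delivered in cases
        if business_days_between(opened, delivered) <= target_business_days
    )
    return met / len(cases)


# Invented example data: (case opened, part delivered)
cases = [
    (date(2022, 3, 1), date(2022, 3, 2)),   # delivered overnight
    (date(2022, 3, 4), date(2022, 3, 7)),   # Friday to Monday
    (date(2022, 4, 4), date(2022, 4, 18)),  # two weeks later
]
ratio = met_target_ratio(cases)
print(f"Target met in {ratio:.0%} of cases")
```

Even a crude number like this, kept per vendor, would give us something concrete to bring to the conversation.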
Why is this happening?
No single cause
logistics worldwide are moving to same-day shipping, so why doesn't this
work for us?
no more weak links in the supply chain: they're all equally weak
Economic complications
centralisation of warehouses
overextended supply lines
low stock
difficulty in supplying stock
Additional complications
global chip shortage
high oil prices
geopolitical unrest/war
pandemics
climate change
Is there anything we can do?
A better use of ‘the five stages of grief.’
Denial
This is not so bad: a little slip-up with few consequences, and surely
they'll do better next time.
Anger
We're not going to stand for this, you'd better shape up or else we will
send in our lawyers! (Wait, do we have those?)
Bargaining
OK so we don't really have a lawyer on retainer but you could at least
try and talk to us and explain what went wrong? We can be really annoying
when we keep calling you up and we could tell our friends how bad your
service is.
Depression
This is not going to get better, is it?
Acceptance
The path forward
Dealing with this new reality
refuse to buy from Bad Vendor™
‘cutting off your nose to spite your face’
open a dialogue and work with them
(move to the cloud)
Local spare parts kit
One vendor agreed to send us a collection of the most common spare parts
for our systems.
In case we need a replacement:
use part from crash kit
open a case with support
wait for the shipment, send back replaced part
received part replenishes crash kit
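The cycle above can be sketched as a tiny bit of bookkeeping. This is only an illustration of the procedure: the `CrashKit` class, its method names and the part names are all hypothetical, and the real process runs through the vendor's case system rather than through code.

```python
from collections import Counter


class CrashKit:
    """Tracks spare parts on site and cases awaiting a replacement."""

    def __init__(self, stock):
        self.stock = Counter(stock)
        self.open_cases = []  # parts for which a shipment is pending

    def replace_from_kit(self, part):
        """Take a part from the kit and open a case with support."""
        if self.stock[part] == 0:
            raise LookupError(f"no spare {part} in kit, wait for the vendor")
        self.stock[part] -= 1
        self.open_cases.append(part)

    def receive_shipment(self, part):
        """The shipped part replenishes the kit and closes the case.
        (The defective part goes back to the vendor at this point.)"""
        self.open_cases.remove(part)
        self.stock[part] += 1


# Invented kit contents
kit = CrashKit({"controller": 1, "power supply": 2})
kit.replace_from_kit("controller")   # machine is fixed right away
print(kit.stock["controller"], kit.open_cases)
kit.receive_shipment("controller")   # kit is back to full strength
print(kit.stock["controller"], kit.open_cases)
```

The point of the kit is visible in the first step: the repair no longer waits for the courier, only the replenishment does. The exposed window is when a second part of the same type fails before the shipment arrives.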
Not taking out a support contract
For one class of worker nodes we went with 'standard warranty'
different warranty conditions for different parts
advanced replacement still an option
backplane replacement
An interesting case where a defective backplane rendered one of six blades
unusable.
the choice was between letting it go and replacing the part ourselves
we opted for replacement: about one hour of work for two people, all told
Conclusions
We are living in interesting times.
Buying hardware involves taking into account what kind of support
is going to be needed, but also the reality of how that support is
organised.