Clever is Stupid

Factoring Failure Rates Into Service Architecture

Alex King
FAUN — Developer Community 🐾

--

A while ago, someone was talking to me about a design idea they had. They were thinking about adding some logic to their infrastructure with the hope of improving reliability. One part of the discussion reminded me that — in spite of best intentions — adding complexity to a system can make it less reliable.

Fortunately, there are some useful tricks that can guide you in the right direction when thinking about designing for higher availability.

In one of my classes at university, we spent what seemed like a horrific amount of time learning about reliability and failure. In particular, there were metrics described by the four-letter acronyms MTBF, MTTD, MTTR, and MTTF. It was the most boring of things! I was far more interested in learning how to write code and build hardware, and far less interested in what I saw as stupid theoretical calculations of little relevance. Although I’ve been subconsciously using them since those days, they were important lessons that I failed to articulate well to others.

Recently, though, the topic has become relevant enough that I thought I should write about it. But first, some background. Let’s talk about Mean Time Between Failures (MTBF).

A component of any particular specification will have a corresponding mean time between failures. That’s not to say that all components of that specification will fail after a given number of hours of operation: some will fail much faster, others will perform perfectly for longer. But on average, given a statistically significant sample size, you’ll see a fairly consistent mean time between failures for that type of component. Consequently, the failure rate (λ) is calculated as 1/MTBF ✝︎.
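
As a quick illustration of that conversion in Python (a minimal sketch; the 1,000-hour figure is just the number used for the lamps below):

    # Failure rate (lambda) is the reciprocal of MTBF.
    mtbf_hours = 1000.0              # e.g. a lamp rated at 1,000 hours MTBF
    failure_rate = 1.0 / mtbf_hours  # 0.001 failures per hour
    print(failure_rate)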

Serial Dependency Chains

It gets more interesting when looking at a system made of several or many components. In this case, we can say that, in general, the failure rate of the system is the sum of the individual failure rates of its components, like this:

λ_system = λ_1 + λ_2 + … + λ_n

Maybe that’s not super intuitive, though, so let’s try it another way. Consider a strip of fairy lights connected in series.

At some point, the tungsten filament in one of these lamps will fail and go open circuit.

When that happens, the circuit is broken, and the current can no longer flow. The whole string of lamps will go out.

But really, any lamp can fail at any time, and every lamp failure will cause the entire strip to go dark.

If each lamp has an MTBF of 1,000 hours, then their individual failure rate is 1/1000. For three lamps connected in series, the failure rate is 3/1000, and for 100 lamps connected in series, the failure rate is 100/1000 (or 1/10). Since failure rates are always positive fractions, adding component dependencies to a system will always increase the rate of failure.
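
Here’s that series arithmetic as a quick Python sketch, using the same numbers and the same sum() that the postscript mentions:

    # Series: the failure rates of the components simply add up.
    lamp = 1.0 / 1000.0       # one lamp, MTBF of 1,000 hours

    print(sum([lamp] * 3))    # three lamps in series: 3/1000
    print(sum([lamp] * 100))  # a hundred lamps in series: 1/10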

The equivalent from a services perspective would be something like a web service that has a dependency on several other resources or web services: if one of those goes down, then the whole service will also go down.

Improving Reliability Through Parallelization

But if you know much about fairy lights, you’ll realize that they’re rarely designed as a series circuit these days. If your power supply is designed right, all of the lamps can be connected in parallel. To calculate the system failure rate in this case, we take the reciprocal of the sum of the reciprocals of each of the component failure rates. Like this:

λ_system = 1 / (1/λ_1 + 1/λ_2 + … + 1/λ_n)

I’m not a maths whizz, so that’s starting to look a bit hairy. Fortunately, picturing the electrical schematic, with each lamp on its own path back to the power supply, makes it a bit more intuitive.

Now if one of the lamps fails, all of the others will keep on going. Because each is still powered, there is still light.

The net is that parallel systems offer significantly lower failure rates than systems run in series.
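
Here’s a rough Python sketch of that comparison, using the 1,000-hour lamps again; the parallel helper just encodes the reciprocal-of-the-sum-of-reciprocals formula above:

    def parallel(failure_rates):
        # Parallel: reciprocal of the sum of the reciprocals of the failure rates.
        return 1.0 / sum(1.0 / r for r in failure_rates)

    lamp = 1.0 / 1000.0

    print(sum([lamp] * 3))       # three lamps in series:   3/1000
    print(parallel([lamp] * 3))  # three lamps in parallel: 1/3000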

In Practice, Though…

The real world is very different, of course: rarely does a system fit perfectly into one of those categories or the other.

A case in point is the parallel example above. I neglected to point out that all of those lamps are in series with the battery. The lamps will all go out when the battery runs flat.

So let’s take a look at some other, more practical examples.

Cloud Load Balancer

A load balancer distributes requests to a set of compute instances. The instances are effectively in parallel and connect serially to the load balancer.

But it’s not quite as simple as that.

Because load balancers are a commodity, battle-hardened across the millions of instances running at any one time, both the code they execute and the infrastructure they run on are incredibly reliable.

This means that their MTBF will be really, really high and, consequently, their failure rate will be really, really low. Although it’s greater than zero, it’s not by much. Let’s call it negligible.

The compute instances are way more complicated: they run one of a near-infinite number of possible combinations of operating system, configuration, patch levels, and packages. They run code that is likely undergoing much more churn and executing in a far less commoditized way.

One of those instances is likely to fail at some point: it will get a bad request, run out of disk space, leak memory, or something. But that’s okay, because it has friends who will quickly step in to help.
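
Put rough numbers on it and you can see the shape of the thing. These figures are entirely made up for illustration; the structure is the point: a very reliable load balancer in series with a parallel pool of much less reliable instances.

    def parallel(failure_rates):
        return 1.0 / sum(1.0 / r for r in failure_rates)

    # Hypothetical failure rates, purely for illustration.
    load_balancer = 1.0 / 1_000_000.0  # commodity, battle-hardened: huge MTBF
    instance = 1.0 / 500.0             # a single compute instance: far flakier

    pool = parallel([instance] * 4)    # four instances in parallel: 1/2000
    system = load_balancer + pool      # the load balancer sits in series with the pool

    print(system)                      # dominated by the pool, not the load balancer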

Active-Active Versus Active-Standby

Imagine a hypothetical cloud service that runs in a region (Block One). To improve availability, a second copy of the service is also created (Block Two).

Imagine that this service runs either as…

Active-Active, where requests are routed to either Block One or Block Two in real time. This is like two or more lamps running in parallel.

If all of the services autoscale, then the only challenge (and it’s not always trivial) is keeping data eventually consistent across regions and designing the services that use it. But Netflix has been doing that for years.

Active-Standby, where requests are always routed to Block One and, in case of error, all traffic is switched to Block Two. This is like running one lamp and keeping a second in the closet in case the first goes out.

Probably the replacement will work, but you can never be sure until you need to use it.

But there are other challenges with this approach: are you paying for capacity sitting idle, on the off-chance that something will go wrong? How much does that cost you? Are you sure that services are patched and updated, and how confident can you be that patching didn’t break anything? How confident are you that the data was replicated correctly, and do you have a plan to roll back when normal service is restored? Probably you’d want to do that manually. All of these things reduce predictability and increase the failure rate of Block Two, not to mention the toil they add.

So it’s like having a second lamp in the closet, but you have to pay rent on the closet space, and you still have to take the lamp out every week to keep it clean and make sure it hasn’t been stolen.
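
To put illustrative numbers on the comparison (entirely hypothetical, and generously treating failover as if it were true parallelism): a block that handles live traffic every day tends to be more predictable than a standby that sits idle, so the standby effectively carries a higher failure rate of its own.

    def parallel(failure_rates):
        return 1.0 / sum(1.0 / r for r in failure_rates)

    # Hypothetical failure rates, purely for illustration.
    active_block = 1.0 / 2000.0   # exercised by real traffic every day
    standby_block = 1.0 / 500.0   # idle, less exercised, switched over manually

    print(parallel([active_block, active_block]))   # active-active:  1/4000
    print(parallel([active_block, standby_block]))  # active-standby: 1/2500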

Back-Porting a Library

Software rots. Just by leaving it idle, it rots.

It doesn’t physically rot, of course, but with respect to the things it needs to function, it ages. Most obviously, hardware and media become obsolete, but more subtly, versions of languages, frameworks, and libraries change and evolve, and develop new features… themselves eventually succumbing to rot.

Generally, it’s good practice to take time out of feature development to update your code to use the latest version of its various dependencies. Avoiding regular updates makes the task increasingly difficult.

Eventually, you’ll realize you need features in version six of a library, but your service is stuck three years in the past on version two. What do you do?

The smart thing is usually to bite the bullet and go through the tedious effort of upgrading to the current version. Sometimes, though, the pressures of project delivery force us into short-term thinking, and we decide to back-port the code in version six of the library so that it will work with version two. This is invariably a mistake.

Software libraries can be viewed as being organized serially: if any one library fails, then the service will fail. And the failure rate of a piece of software consumed and tested by thousands of teams is usually far lower than that of software created or modified by a single team or company.

Finally…

You probably realized that I’ve talked in grossly general terms, and likely have questions about how to take these formulas and turn them into concrete, quantifiable numbers. Once you’ve read an intro on the topic, something like the Wikipedia article on Reliability Engineering, you’ll figure out that the answer is: it’s complicated.

The qualitative principle remains, however: the more complexity there is in a system that contains dependent components, the more likely it will fail. The more redundancy there is in a system, the less likely it will fail.

Another way of saying this is that a system with components in parallel will always have a failure rate lower than that of its individual components, whereas a system with components in series will always have a failure rate higher than that of any individual component.

As we architect and design systems and services (or even processes), it’s vital to understand that clever, complex approaches with many interdependencies are more prone to failure than those based on simplicity and redundancy through parallelization. So:

  • Drive simplicity. Minimize the number of pieces or components in your system, and — in turn — the complexity of each sub-component in its own right. Be ruthless about it.
  • Where necessary, use parallelization to reduce dependence on any one thing and so improve reliability.

It doesn’t matter if it’s a set of fairy lights, a load-balanced microservice, or a set of Raptor engines. It will help. These are essential foundations when thinking about creating reliable systems that remain trustworthy at scale.

✝︎ Failure rate for non-repairable systems (such as lamps) is calculated as 1/MTTF. But hopefully, the systems that you work on are repairable.

p.s. If you prefer Python to algebra, the series equation reads as sum(failure_rates) and the parallel equation reads as 1/sum(1.0/i for i in failure_rates).

p.p.s. Thanks to Chelsea Dorich, Jan Baez, and Kevin Tuber for their review and comments.

