Part 3: Sweat the small stuff — Building a 5–nines (99.999%) Available Platform

8 min read · May 8, 2022

Thank you for coming back to read Part 3. We discussed the math behind a 5–nines platform in Part 2 and the context in Part 1.

In this part of the series, we will discuss some of the critical aspects of building and maintaining a 5–nines platform. We previously discussed how the complexity of keeping the platform available increases step-wise as we approach the goal of fewer than 5 minutes of downtime in a year. That goal is not achievable unless we focus on the smallest of details.
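
To make the downtime goal concrete, here is a minimal sketch of the arithmetic behind it. The function name `downtime_budget_minutes` and the use of an average 365.25-day year are assumptions made for illustration; they are not taken from the earlier parts of this series.

```python
# Assumption: an average year of 365.25 days (accounts for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the downtime allowed per year, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare the yearly budget for three, four, and five nines.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes per year")
```

At five nines the budget comes to roughly 5.26 minutes per year, which is why a single slow failover can consume the entire allowance.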

No amount of system design will help you reach this lofty goal if you don’t handle the nuances we are going to discuss today, because we know things will fail sooner or later. To quote Amazon CTO Werner Vogels, “Everything fails, all the time”.

Photo by Guillaume de Germain on Unsplash

Based on our discussion in Part 2, let’s assume you have come up with a 5–nines architecture: a multi-region, active-standby solution with synchronous replication of some data and asynchronous replication of high-volume but low-priority data. The specifics that the business agreed on (e.g., RTO and RPO) do not matter for this discussion.
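
The difference between the two replication modes can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `Region`, `ReplicatedStore`, the region names, and the keys are all hypothetical, and a real platform would replicate over a network with acknowledgements, retries, and timeouts.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical stand-in for a region's data store."""
    name: str
    store: dict = field(default_factory=dict)


@dataclass
class ReplicatedStore:
    """Toy active-standby pair; not a real replication protocol."""
    active: Region
    standby: Region
    backlog: deque = field(default_factory=deque)  # async writes not yet shipped

    def write_sync(self, key, value):
        """Critical data: the write completes only once both regions hold it."""
        self.active.store[key] = value
        self.standby.store[key] = value

    def write_async(self, key, value):
        """High-volume, low-priority data: complete after the local write only."""
        self.active.store[key] = value
        self.backlog.append((key, value))

    def drain(self):
        """Ship queued async writes to the standby region."""
        while self.backlog:
            key, value = self.backlog.popleft()
            self.standby.store[key] = value


store = ReplicatedStore(Region("region-a"), Region("region-b"))
store.write_sync("order:1", "paid")
store.write_async("clickstream:1", "page_view")

# If region-a were lost at this point, the standby would hold the order
# but not the click event, which is still sitting in the backlog.
print(sorted(store.standby.store))

store.drain()
print(sorted(store.standby.store))
```

Whatever remains in the backlog when a region fails is the data at risk, which is the trade-off accepted for the low-priority data.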

When you have a scenario where a single region is unavailable, which happened to us, when we were in the middle of a launch of a new product in…

Written by Suresh Kandula