Part 3: Sweat the small stuff — Building a 5–nines (99.999%) Available Platform

May 8, 2022

Thank you for coming back to read Part 3. We covered the context in Part 1 and the math behind a 5–9’s platform in Part 2.

In this part of the series, we will discuss some of the critical aspects of building and maintaining a 5-nines platform. Earlier, we discussed how the complexity of keeping the platform available rises step by step as we approach a goal of roughly 5 minutes (about 5.26) of downtime in a year. That goal is not achievable unless we focus on the smallest of details.
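To make the downtime budget concrete, here is a minimal sketch of the arithmetic. It assumes a 365-day year, and the function name is only illustrative.

```python
# Yearly downtime budget implied by an availability target.
# Assumption: a 365-day year (525,600 minutes).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year, in minutes, for an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten, which is why the last step to five nines leaves only a few minutes per year for every failure, switchover, and deployment combined.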

No amount of system design will get you to this lofty goal if you don’t handle the nuances we are going to discuss today, because we know things will fail sooner or later. To quote AWS CTO Werner Vogels, “Everything fails, all the time”.

Based on our discussion in Part 2, let’s assume you have come up with a 5–9’s architecture: a multi-region, active-standby solution with synchronous replication of some data and asynchronous replication of high-volume but low-priority data. The specifics that the business agreed on (e.g., RTO and RPO) do not matter for this discussion.

When a single region becomes unavailable, you have to be ready for a switchover. This is not uncommon: it happened to us in December 2021, in the middle of a new product launch. Some platforms run multi-region as an active-active solution, not for HA purposes but to direct traffic to local regions. Nonetheless, they too have to worry about these specific details.

If you want to achieve 5–9’s availability on cloud service providers, the recent history of region-level outages makes a multi-region deployment unavoidable. That is a separate topic of its own, so let’s come back to the smaller but very important details.

Here are 8 critical areas to be prepared for when designing highly available, critical solutions on a 5-nines platform:

1 — Timeouts

Have you ever seen this on your production site? If so, it generally means the timeout settings across your API Gateway, Application Server, and Database Server were never aligned…
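One common reading of "alignment" is that each layer should wait slightly longer than the layer it calls, so the innermost layer fails first and the outer layers can still return a clean error. That reading is an assumption here, as are the `Hop` type, the hop names, and the timeout values in this minimal sketch of a configuration check.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hop:
    """One layer in the request path (hypothetical names and values)."""
    name: str
    timeout_s: float


def misaligned_hops(chain: List[Hop], margin_s: float = 1.0) -> List[str]:
    """Flag every hop whose timeout is not at least margin_s longer
    than the timeout of the hop it calls (the next one in the chain)."""
    problems = []
    for outer, inner in zip(chain, chain[1:]):
        if outer.timeout_s < inner.timeout_s + margin_s:
            problems.append(
                f"{outer.name} ({outer.timeout_s}s) may give up before "
                f"{inner.name} ({inner.timeout_s}s) finishes"
            )
    return problems


# Outermost layer first. These numbers are made up for illustration.
chain = [Hop("api-gateway", 30), Hop("app-server", 60), Hop("database", 20)]
for problem in misaligned_hops(chain):
    print(problem)
```

A check like this could run in CI against the real configuration values, so a timeout change in one layer cannot silently break the ordering.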



