Part 4: The downgrade & resilience — Building a 5–nines (99.999%) Available Platform
Thank you for coming back to read Part 4.
Part 3 covered the small nuances you need to account for, Part 2 the math behind a 5-nines platform, and Part 1 the context.
In this part, we will discuss some of the more interesting (and relatively easy to achieve) aspects of building a 5-nines platform.
We will cheat a little here (we will take every advantage we can get) and look at ways to keep our application working in a reasonably user-friendly way even when some services are unavailable.
We will discuss two main topics today: how to downgrade your application or platform APIs, and how to go beyond resiliency.
Application downgrade
Imagine your application is set up for high availability (HA) across two sites, with both sites available to serve traffic according to the DNS rules that apply to your platform. This is the happy state: you are serving your front-end application or APIs through both sites.
Now imagine that Site 1 goes down because of a region-level failure at your cloud service provider. You are still up, but you might have trouble meeting your response SLAs with Site 2 alone.
You have 2 options:
a) Continue to serve all the traffic through Site 2 and risk bringing it down, since you might not have provisioned it for twice its normal load. (We will discuss this further when we cover antifragility later in this write-up.)
b) Degrade the functionality: serve only the "must-haves", with a minimal set of menu options, so you can continue to offer the most important features of your application. For example, keep balance checks, customer care, the core feature of the application, and the chatbot, while avoiding streaming and turning off batch jobs. The exact split is very specific to your business. Sometimes the degradation has to be quite granular, as in the "Robinhood Problem".
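To make the choice between the two options concrete, here is a minimal sketch of how a platform might decide when to degrade and which features to keep. Everything in it is an assumption made for illustration: the feature names, the split between must-have and optional features, the capacity figures, and the helper names `choose_mode` and `is_enabled` are all hypothetical and would come from your own business rules and capacity planning.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"


# Hypothetical feature names. Which features count as "must-haves" is a
# business decision; this split is an assumption made for the example.
MUST_HAVE = frozenset({"check_balance", "customer_care", "chatbot"})
OPTIONAL = frozenset({"streaming", "batch_jobs"})


def choose_mode(healthy_sites: int, per_site_capacity_rps: float,
                expected_rps: float) -> Mode:
    """Return DEGRADED when the surviving sites cannot carry the expected load."""
    available_rps = healthy_sites * per_site_capacity_rps
    return Mode.NORMAL if available_rps >= expected_rps else Mode.DEGRADED


def is_enabled(feature: str, mode: Mode) -> bool:
    """Must-haves are always on; optional features run only in NORMAL mode."""
    if feature in MUST_HAVE:
        return True
    if feature in OPTIONAL:
        return mode is Mode.NORMAL
    raise ValueError(f"unknown feature: {feature}")


# Both sites healthy: 2 * 600 = 1200 rps covers the 1000 rps we expect.
normal = choose_mode(healthy_sites=2, per_site_capacity_rps=600, expected_rps=1000)

# Site 1 lost: 600 rps cannot cover 1000 rps, so optional features are shed.
degraded = choose_mode(healthy_sites=1, per_site_capacity_rps=600, expected_rps=1000)

for feature in sorted(MUST_HAVE | OPTIONAL):
    print(feature, is_enabled(feature, degraded))
```

In a real platform the mode would more likely be driven by health checks and measured load than by fixed numbers, and you might want more than two levels so that features can be shed gradually rather than all at once.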