Part 1: The Setup — Building a 5–nines (99.999%) Available Platform
As engineers, we strive to build platforms that are available at all times.
It is exciting to work on mission-critical platforms and solve system design problems that balance throughput, availability, and the cost of keeping the lights on.
The more pessimistic you are about the availability of your infrastructure, applications, and integrations, the more it will cost you to meet the SLOs and associated SLIs of each application component.
For a quick introduction to SLAs, SLOs, and SLIs, and for the math and documentation behind building an SLA (technically SLO) table for your application, refer to my article on the topic.
In this 5-part series, we are going to look at the anatomy of a typical 5-nines application platform and understand what it takes to build one, along with the constraints on achieving maximum availability.
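To make the 5-nines target concrete, here is a minimal sketch that converts an availability percentage into a yearly downtime budget. The function name `downtime_budget_minutes` is my own, and the calculation assumes a 365-day year; it is a back-of-the-envelope illustration, not a formal SLO definition.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the downtime allowed per year, in minutes, for a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Compare three, four, and five nines side by side.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes/year")
```

At 99.999%, the budget works out to roughly 5.26 minutes of downtime per year, which is why every component and failover path in the platform matters.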
Before Cloud (in "BC" times), only large corporations attempted to build highly available platforms, given the capital expense, the need for enterprise licenses such as Oracle HA and SQL Server Replication, and the luxury of operating multiple data centers separated by 50 to 250 miles, depending on the industry and associated regulations.
To set the stage, we are going to use the application below: a hypothetical transaction-processing web application with real monetary implications in case of data loss or downtime. The discussion is cloud-agnostic and focuses on system design, infrastructure, and architecture.
Below, we have an internet-facing web application that accepts transactions. It can fail over within the same site (via the load balancer) and across multiple sites, and it has containerized compute resources (APIs, batch), queuing, storage, cache, and file systems. The data is replicated to the secondary site (more specifics later). Pretty standard.
The primary and secondary sites are ‘carbon copies’ of each other, although the secondary site may run at partial capacity.
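As a rough illustration of why a second site helps, the sketch below combines component availabilities in series (every component in a site must be up) and then treats the two sites as parallel replicas (at least one site must be up). The per-component numbers are hypothetical placeholders, not measurements. The model also assumes independent failures and instant, flawless failover, and it ignores the reduced capacity of the secondary site, so treat the result as an optimistic upper bound.

```python
from math import prod


def serial(availabilities):
    """All components are required: multiply the individual availabilities."""
    return prod(availabilities)


def parallel(availabilities):
    """Any one replica suffices: 1 minus the product of the unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical per-component availabilities for a single site (illustrative only).
site_components = {
    "load_balancer": 0.9999,
    "api": 0.9995,
    "queue": 0.9999,
    "storage": 0.9999,
    "cache": 0.9995,
    "file_system": 0.9999,
}

single_site = serial(site_components.values())
two_sites = parallel([single_site, single_site])

print(f"single site: {single_site:.6f}")
print(f"two sites:   {two_sites:.8f}")
```

Note how the serial chain pulls a single site below its weakest component, while the parallel pair recovers several nines. In practice, correlated failures and failover time eat into that gain.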