High Availability Patterns: Building the right Heartbeats (Health checks) for effective failover
System designers and cloud architects tend to look at software as components on paper, working inside the boxes of infrastructure, PaaS Services, Serverless, and compute clusters.
Cloud architects implicitly assume that application (and API) developers and SREs will work together to operate this 'paper' design in real life. That is a fair assumption, but as the paper becomes real, quite a bit gets left on the table.
One topic that is seldom thought through when it comes to High Availability and keeping up with SLAs/SLOs/SLIs is heartbeats, or health checks, which I will cover today. (Note: I will use the terms heartbeats and health checks interchangeably.)
A better health check strategy leads to timely failover and recovery. Put simply, it makes the system more resilient.
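To make "timely" a little more concrete, here is a minimal sketch of how a checker might turn a stream of probe results into a healthy/unhealthy decision. It assumes the common pattern of consecutive-failure and consecutive-success thresholds. The names `HealthTracker` and `worst_case_detection_seconds` are hypothetical and not taken from any cloud provider's API, and the detection-time formula is a rough upper bound that assumes probes fire at a fixed interval.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one instance's health from a stream of probe results (illustrative only)."""

    unhealthy_threshold: int = 3  # consecutive failures before marking unhealthy
    healthy_threshold: int = 2    # consecutive successes before marking healthy again
    healthy: bool = True          # assumed starting state
    _streak: int = 0              # consecutive results that disagree with `healthy`

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the current health state."""
        if probe_ok == self.healthy:
            # The result agrees with the current state, so any streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        limit = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= limit:
            # Enough consecutive disagreeing results: flip the state.
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


def worst_case_detection_seconds(
    interval: float, timeout: float, unhealthy_threshold: int
) -> float:
    """Rough upper bound on the time to notice a dead instance.

    Assumes the instance dies right after a successful probe, probes are
    sent every `interval` seconds, and each failed probe waits out `timeout`.
    """
    return interval * unhealthy_threshold + timeout
```

With these assumed defaults, a single failed probe does not trigger failover; it takes three in a row. A 10-second interval with a 5-second timeout and a threshold of 3 gives a worst case of about 35 seconds before traffic moves away, which is the kind of number worth comparing against your SLO.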
Global Health Check Hierarchy
- A platform has one or more APIs in front of a load balancer that is regional or zonal (a typical setup is depicted below)
- Each regional load balancer talks to auto-scaled instances (cloud-specific clusters such as ECS, MIG, or Kubernetes) and continuously checks which instances are healthy. We will cover this in more detail later.