High Availability Series | Mean Time to Recovery (MTTR) is the only KPI (a DORA metric) you need to measure for Product Engineering maturity
Product Engineering extends well beyond defining and building a great product. The product also needs to run flawlessly, day in and day out.
Once you lose a customer, you lose them for good.
In our book, Shiva and I discuss in detail how to run Software Engineering Operations through this principle:
Invest as Much Time in the Health of the Journey that Produces Your Products as You Do in the Product Itself.
MTTR is about measuring this health. Let’s get into it.
If your product needs to be 99.99% available, you have less than 5 minutes to recover from a major defect that impacts your SLA (as defined by your business need). If you need a refresher on SLO/SLI/SLA, please read my intro article here.
However, if you need to be 99.9% available, you have approximately 45 minutes to recover. There is a bit more about designing your infrastructure for high availability here.
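To make the arithmetic behind these recovery windows concrete, here is a minimal sketch. The article does not say which measurement window the figures use; they roughly line up with a 30-day window, so that is assumed here. The function names `downtime_budget_minutes` and `mttr_budget_minutes` are hypothetical, and splitting the budget evenly across incidents is a simplification.

```python
def downtime_budget_minutes(availability_pct: float, window_days: float = 30.0) -> float:
    """Return the minutes of downtime allowed in the window for an availability target.

    window_days defaults to 30, which is an assumption, not something the article states.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - availability_pct) / 100.0


def mttr_budget_minutes(availability_pct: float, incidents: int,
                        window_days: float = 30.0) -> float:
    """Return the average recovery time each incident can take within the budget.

    Assumes the downtime budget is split evenly across incidents (a simplification).
    """
    if incidents <= 0:
        raise ValueError("incidents must be a positive integer")
    return downtime_budget_minutes(availability_pct, window_days) / incidents


if __name__ == "__main__":
    for target in (99.99, 99.9):
        budget = downtime_budget_minutes(target)
        print(f"{target}% over 30 days allows {budget:.2f} minutes of downtime")
```

Under the assumed 30-day window, this gives roughly 4.3 minutes for 99.99% and roughly 43 minutes for 99.9%, in line with the figures above. If you expect more than one SLA-impacting incident per window, `mttr_budget_minutes` shows how quickly the per-incident recovery time shrinks.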
Both of the above metrics assume you have fewer than 10 production issues in a year. That might seem like a lot of issues, but if your deployment frequency is high, there is a…