Beyond SLAs: A Practical Guide to Calculating and Leveraging TTx Metrics for System Reliability

Even for a service with a 99.999% SLA, there is still an allowable downtime of about 5 minutes and 16 seconds per year. To make the SLA calculation clearer, we need more detailed metrics that break down the entire downtime period; this family of metrics is called TTx (Time to X).

TTx

TTx is a set of time metrics for measuring an outage from the reliability/servicing perspective. The diagram below illustrates the relationship between these metrics; TTM covers the entire range.

Source: Seeking SRE by David N. Blank-Edelman and Martin Check, Microsoft

You can also refer to the original post from O'Reilly for the diagram to understand the meanings of these TTx metrics.

| Abbreviation | Metric Name | Formula | Definition |
| --- | --- | --- | --- |
| TTM | Time to Mitigate | (Incident Mitigation Time, SLA re-established) - (Incident Impact Start Time, SLA breached) | Total elapsed downtime (total SLA impact time) |
| TTD | Time to Detect | (Incident Detection Time) - (Incident Impact Start Time, SLA breached) | Total time taken to detect the incident |
| TTE | Time to Engage | (Last Engineer Engaged <= Mitigation Time) - (Incident Detection Time) | Total time until the last engineer is engaged |
| TTF | Time to Fix | (Incident Mitigation Time, SLA re-established) - (Last Engineer Engaged <= Mitigation Time) | Total time spent on the fix |
| TTN | Time to Notify | (First Customer Advisory Time) - (Incident Impact Start Time, SLA breached) | Total time taken to notify users |

Example of TTx

Let's use the example below to understand the meaning of each TTx metric.

First, we need to record the specific time of each event, as listed below:

  • 10:00 UTC+8 on 27 June 2025 - Service anomaly occurred, SLA breached
  • 10:01 UTC+8 on 27 June 2025 - Someone, or the monitoring system, raised a case to notify the system administrator
  • 10:10 UTC+8 on 27 June 2025 - Someone from the service side began the first conversation with the user or customer
  • 10:20 UTC+8 on 27 June 2025 - The last engineer engaged (often also the first), responsible for maintenance on the service side, identified the root cause
  • 10:30 UTC+8 on 27 June 2025 - Service restoration completed, SLA re-established

Based on the above timeline, you can easily calculate each TTx metric:

| Abbreviation | Metric Name | Formula (HH:MM) | Total elapsed time |
| --- | --- | --- | --- |
| TTM | Time to Mitigate | 10:30 - 10:00 | 30 mins |
| TTD | Time to Detect | 10:01 - 10:00 | 1 min |
| TTE | Time to Engage | 10:20 - 10:01 | 19 mins |
| TTF | Time to Fix | 10:30 - 10:20 | 10 mins |
| TTN | Time to Notify | 10:10 - 10:00 | 10 mins |

The unit of time calculation here is HH:MM. If you need precision down to seconds, you can adjust to HH:MM:SS as the measurement standard.
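
If you record these timestamps in a structured way, the calculation can be automated. The following is a minimal Python sketch based on the timeline above; the variable names are my own and not part of any standard tooling.

```python
from datetime import datetime, timezone, timedelta

# Event timestamps from the example timeline (UTC+8, 27 June 2025)
tz = timezone(timedelta(hours=8))
impact_start   = datetime(2025, 6, 27, 10, 0, tzinfo=tz)   # SLA breached
detected       = datetime(2025, 6, 27, 10, 1, tzinfo=tz)   # case raised
first_advisory = datetime(2025, 6, 27, 10, 10, tzinfo=tz)  # first customer update
last_engaged   = datetime(2025, 6, 27, 10, 20, tzinfo=tz)  # last engineer engaged
mitigated      = datetime(2025, 6, 27, 10, 30, tzinfo=tz)  # SLA re-established

ttx = {
    "TTM": mitigated - impact_start,        # Time to Mitigate
    "TTD": detected - impact_start,         # Time to Detect
    "TTE": last_engaged - detected,         # Time to Engage
    "TTF": mitigated - last_engaged,        # Time to Fix
    "TTN": first_advisory - impact_start,   # Time to Notify
}

for name, delta in ttx.items():
    print(f"{name}: {int(delta.total_seconds() // 60)} min")
```

Running this prints 30, 1, 19, 10, and 10 minutes respectively, matching the table above.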

TTx Improvement

TTD (Time to Detect)

Azure Network Watcher

This is the part of the entire restoration process that is easiest to automate and monitor. In addition to using Azure Network Watcher for network monitoring, you can also utilize Azure Managed Prometheus, Azure Managed Grafana, Azure Monitor, or any third-party monitoring and alerting solution to accomplish this.
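
As a simple illustration of automated detection, the sketch below polls a health endpoint and records the detection time after several consecutive failures. The endpoint URL and thresholds are placeholders; in practice you would rely on the monitoring platforms above rather than a hand-rolled probe.

```python
import time
import urllib.request
from datetime import datetime, timezone

# Hypothetical health endpoint - replace with your service's real probe URL.
HEALTH_URL = "https://example.com/healthz"
FAILURE_THRESHOLD = 3          # consecutive failures before declaring an incident
POLL_INTERVAL_SECONDS = 30

def is_healthy(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

failures = 0
while True:
    if is_healthy(HEALTH_URL):
        failures = 0
    else:
        failures += 1
        if failures >= FAILURE_THRESHOLD:
            detection_time = datetime.now(timezone.utc)
            print(f"Incident detected at {detection_time.isoformat()} - raise a case here")
            break
    time.sleep(POLL_INTERVAL_SECONDS)
```

The timestamp captured here becomes the Incident Detection Time used in the TTD and TTE formulas.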

Azure DevOps Work Items

Regarding the creation of incident cases, Azure DevOps can be used to record and manage them by automatically creating Work Items. Of course, I believe that project management tools on the market with boards / kanban, such as Jira, can achieve the same functionality; this can be evaluated independently.
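
For example, a monitoring pipeline could create a Work Item automatically through the Azure DevOps REST API. The sketch below is a minimal illustration, not a complete integration; the organization, project, personal access token, and work item type are placeholders, and you should verify the api-version against your organization's setup.

```python
import base64
import json
import urllib.request

# Assumed placeholders - replace with your own organization, project, and PAT.
ORG = "your-organization"
PROJECT = "your-project"
PAT = "your-personal-access-token"
WORK_ITEM_TYPE = "Bug"  # or "Issue" / "Task", depending on your process template

url = (
    f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/wit/workitems/"
    f"${WORK_ITEM_TYPE}?api-version=7.1-preview.3"  # adjust api-version if needed
)

# The create-work-item endpoint expects a JSON Patch document.
payload = [
    {"op": "add", "path": "/fields/System.Title",
     "value": "Incident: service anomaly detected at 10:00 UTC+8"},
    {"op": "add", "path": "/fields/System.Description",
     "value": "Auto-created by the monitoring pipeline. SLA breached; investigation required."},
]

auth = base64.b64encode(f":{PAT}".encode()).decode()
request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    method="POST",
    headers={
        "Content-Type": "application/json-patch+json",
        "Authorization": f"Basic {auth}",
    },
)

with urllib.request.urlopen(request) as resp:
    work_item = json.load(resp)
    print("Created work item", work_item["id"])
```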

TTE (Time to Engage)

Wikipedia - Swiss cheese model
The Swiss cheese model of accident causation illustrates that, although many layers of defense lie between hazards and accidents, there are flaws in each layer that, if aligned, can allow the accident to occur. In this diagram, three hazard vectors are stopped by the defenses, but one passes through where the "holes" are lined up.

This is the most time-consuming part of the entire process. In practice, it is usually the responsible engineers who collect information and analyze a large volume of logs. However, with the advancement of AI, information collection and preliminary assessments are now being carried out by AI, providing references for the engineers who follow. Microsoft has attempted to adopt AIOps technology to optimize incident management with the Triangle System - Optimizing incident management with AIOps using the Triangle System. Google has also launched kubectl-ai for Kubernetes to help administrators identify issues and propose solutions more quickly.
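
As a trivial illustration of the "preliminary assessment" idea, the sketch below scans a log file, groups error lines into signatures, and surfaces the most recent ones so the engaged engineer starts from a summary rather than raw logs. The file path and patterns are placeholders, and real AIOps systems are far more sophisticated than this.

```python
import re
from collections import Counter
from pathlib import Path

LOG_PATH = Path("/var/log/myservice/app.log")  # placeholder path
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|Exception)\b")

signatures = Counter()
recent_errors = []

for line in LOG_PATH.read_text(errors="ignore").splitlines():
    if ERROR_PATTERN.search(line):
        # Group similar errors by stripping digits, so "timeout after 503ms"
        # and "timeout after 771ms" count as the same signature.
        signature = re.sub(r"\d+", "N", line).strip()
        signatures[signature] += 1
        recent_errors.append(line)

print("Top error signatures:")
for signature, count in signatures.most_common(5):
    print(f"  {count:5d}  {signature[:120]}")

print("\nMost recent errors:")
for line in recent_errors[-5:]:
    print(f"  {line[:120]}")
```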

As data collection becomes more refined, I am personally quite optimistic that AIOps can help many administrators reduce operational pressure. However, the accompanying issue is: as humans, how can we continuously cultivate talent so that they can identify which solution is best for customers or users?

TTF (Time to Fix)

Old-fashioned style SRE

Whether it's a reboot, update, rebuild, hot patch, or rollback, it counts as a solution, because users don't care how you fix it - the key is that the service is restored to normal.

Assuming it's a reboot or rollback, the TTF time shouldn't be too long. In other situations, it might take a very long time, and at that point, alternative solutions MUST BE sought to mitigate the incident.
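
To make that decision concrete, here is a minimal sketch of the kind of logic an on-call runbook might encode: if the estimated fix would take longer than an acceptable TTF, fall back to a mitigation such as rollback. The thresholds and action names are illustrative only, not a prescribed policy.

```python
from datetime import timedelta

# Illustrative threshold - tune this to your own SLA and error budget.
MAX_ACCEPTABLE_TTF = timedelta(minutes=15)

def choose_action(estimated_fix_time: timedelta, rollback_available: bool) -> str:
    """Pick a mitigation path based on how long the proper fix is expected to take."""
    if estimated_fix_time <= MAX_ACCEPTABLE_TTF:
        return "apply the fix now"
    if rollback_available:
        return "roll back first, fix properly later"
    return "find an alternative mitigation (failover, feature flag, capacity shift)"

print(choose_action(timedelta(minutes=10), rollback_available=True))   # apply the fix now
print(choose_action(timedelta(hours=2), rollback_available=True))      # roll back first
print(choose_action(timedelta(hours=2), rollback_available=False))     # alternative mitigation
```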

TTN (Time to Notify)

Notify user

This notification target could be users, customers, or your boss. It is recommended to have a single point of contact who provides them with regular and continuous updates.

The key points of feedback to users are:

  1. Explain alternative solutions
  2. Ensure transparency in progress handling
  3. Provide information about the next steps.

To ensure that engineers can focus on completing repair tasks, it is more suitable to have a dedicated person handle communication, which is usually carried out by a Technical Account Manager (TAM) or Customer Success Account Manager (CSAM).
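
Whoever handles the communication, the update itself can be delivered consistently through a chat webhook. The sketch below posts a status update covering the three points above to a Slack-style incoming webhook; the webhook URL and the message text are placeholders.

```python
import json
import urllib.request

# Placeholder webhook URL - use your own Slack/Teams incoming webhook.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

update = {
    "text": (
        "*Incident update - 10:10 UTC+8*\n"
        "1. Workaround: traffic has been failed over to the secondary region.\n"
        "2. Progress: root cause analysis is ongoing; engineers are engaged.\n"
        "3. Next steps: next update at 10:40 UTC+8 or sooner if mitigated."
    )
}

request = urllib.request.Request(
    WEBHOOK_URL,
    data=json.dumps(update).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(request) as resp:
    print("Notification sent, HTTP", resp.status)
```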

Phil's Memo

Microsoft Reactor - Azure Incident Retrospectives

Although having an incident is not a good thing, the interesting part is that Microsoft Azure is willing to conduct a retrospective for Azure incidents, not only documenting them on Azure Status History but also recording the discussion and publicly sharing the recording on YouTube.