For this section, let's assume you operate a simple service that lets users log in. The service consists of a database, a backend, and a frontend.
Cause-based alerting has been a prevalent practice since the rise of monitoring and observability. With cause-based alerting, alerts are added based on system metrics and known failures. In our example, we'd add an availability alert for each of our components (database, backend, frontend), as unavailability in any one of them would leave users unable to log in. We'd then add further alerts for other potential failures of each component. Following this procedure, many companies have ended up with an overly complex alerting stack that exhibits some of the symptoms mentioned above.
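To make the growth in alert count concrete, here is a minimal Python sketch of how a cause-based rule set for the example service might be assembled. The `AlertRule` type, the `cause_based_rules` helper, and the listed failure modes are all hypothetical and exist only for illustration; they do not correspond to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    component: str
    condition: str  # human-readable description, illustrative only


COMPONENTS = ["database", "backend", "frontend"]

# Hypothetical known failure modes per component (assumed for this sketch).
KNOWN_FAILURES = {
    "database": ["disk nearly full", "replication lag high"],
    "backend": ["error rate high", "memory usage high"],
    "frontend": ["latency high"],
}


def cause_based_rules(components, known_failures):
    """Build one availability alert per component plus one alert per known failure."""
    rules = []
    for component in components:
        rules.append(
            AlertRule(f"{component}_unavailable", component, f"{component} is not reachable")
        )
        for failure in known_failures.get(component, []):
            rule_name = f"{component}_{failure.replace(' ', '_')}"
            rules.append(AlertRule(rule_name, component, failure))
    return rules


rules = cause_based_rules(COMPONENTS, KNOWN_FAILURES)
for rule in rules:
    print(rule.name)
print(f"total alerts: {len(rules)}")
```

Even in this toy setup, three components and a handful of assumed failure modes already produce eight alerts, and every new component or newly discovered failure adds more.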
For some time, reliability professionals have suggested a more user-centric approach to alerting, often referred to as "symptom-based alerting" or "customer-centric alerting". At a high level, the idea is that your users don't care about your system internals. They don't care which storage or database technology you are using, or whether it is up or not; they care about whether the functionality exposed to them is working! Your alerting should follow the same mentality and be based on metrics representative of your users' experience.
While going into all the details of how to implement user-centric alerting is beyond the scope of this article, at a high level, the steps to follow are:
- Identify and prioritize critical user journeys (CUJs), as you won't be able to cover everything at the same time
- Determine metrics that are representative of the user experience of those journeys, often referred to as Service Level Indicators (SLIs)
- Define clear objectives for your teams and services based on those metrics, often referred to as Service Level Objectives (SLOs)
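As a rough illustration of these steps, the sketch below takes "log in" as the critical user journey, defines the SLI as the fraction of successful login attempts, and checks it against an SLO. The `LoginAttempt` type, the function names, and the 99% target are assumptions made for this example, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoginAttempt:
    succeeded: bool


# Assumed objective for this sketch: 99% of login attempts succeed.
SLO_TARGET = 0.99


def login_success_sli(attempts):
    """SLI: fraction of login attempts that succeeded.

    With no traffic there is nothing to measure, so this sketch
    reports 1.0; that is a design choice, not a rule.
    """
    if not attempts:
        return 1.0
    good = sum(1 for attempt in attempts if attempt.succeeded)
    return good / len(attempts)


def meets_slo(sli, target=SLO_TARGET):
    """SLO check: the measured SLI must be at or above the target."""
    return sli >= target


# Example window: 995 successful and 5 failed login attempts.
attempts = [LoginAttempt(True)] * 995 + [LoginAttempt(False)] * 5
sli = login_success_sli(attempts)
print(f"SLI: {sli:.3f}, meets SLO: {meets_slo(sli)}")
```

In a real setup the attempts would come from request logs or metrics over a time window rather than an in-memory list, but the relationship between journey, indicator, and objective stays the same.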
Using the previous example, instead of adding alerts for each of the components, we'd add a metric that represents the users' ability to log in and alert based on that metric. This one alert would cover failures in any of our components, resulting in fewer, simpler alerts.
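The toy simulation below sketches why a single login alert is enough. It assumes, purely for illustration, that a login succeeds only when every component it passes through is up; `try_login`, `login_alert_fires`, and the 99% threshold are hypothetical names and values.

```python
def try_login(component_up):
    """A login succeeds only if every component it depends on is up."""
    return all(component_up.values())


def login_alert_fires(results, threshold=0.99):
    """Single symptom-based alert: fire when the login success ratio drops below the threshold."""
    if not results:
        return False  # no traffic, nothing to alert on in this sketch
    return sum(results) / len(results) < threshold


healthy = {"frontend": True, "backend": True, "database": True}

# Healthy system: all logins succeed, so the alert stays silent.
print("healthy", login_alert_fires([try_login(healthy) for _ in range(100)]))

# Break one component at a time: the same alert fires in every case.
for broken in healthy:
    state = {**healthy, broken: False}
    results = [try_login(state) for _ in range(100)]
    print(broken, login_alert_fires(results))
```

The alert never needs to know which component failed; it only observes that users can no longer log in. Finding the cause then becomes a debugging task rather than an alerting concern.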
Symptom-based alerting addresses the shortcomings of cause-based alerting and promotes a shared, comprehensive understanding of the system's health. Here's why symptom-based alerting shines:
- Protecting user-centric reliability: Symptom-based alerting is purpose-built to protect the user experience. Monitoring and alerting based on business-driven metrics ensures that alerts are triggered only when there is a direct impact on the customer.
- Better on-call experience: Fewer, targeted alerts based on symptoms improve the on-call experience, which helps increase engineering productivity and reduce outage duration.
- Independent of system implementation: Symptom-based alerts don't depend on how the system is built. In our previous example, we wouldn't need to adapt our symptom-based alert if we replaced the storage system.
- Simplifying understanding across the organization: Symptom-based alerting provides a shared language of reliability throughout the organization. Product teams and executives can easily comprehend user journeys and the associated impacts on the business, ensuring a unified understanding of the system's health.
You might wonder where to start if you are a company with an existing alerting stack. You cannot do everything at once. A transition will take time. Focus on the most critical user journeys and implement some metrics there. No need to have the perfect metric right away; having something now is better than having something better later. Your metrics and alerts will evolve over time.