Three Site Reliability Engineering practices that help you improve service operations: Incident Detection (1/3)

José Velez

Founder & CEO

Rely.io

Pascal Bovet

Reliability Advisor & Former Head of SRE at Robinhood

Rely.io

August 11, 2023

Over the following weeks, we will publish a mini-series on three concepts that help improve reliability and engineering velocity at your company. In this series, we focus on pain points that we believe every company has, show how to tell whether your company suffers from them, and offer actionable solutions to overcome them.

Incident detection

Every company that offers multiple products knows that those products break from time to time, whether because of an update that went wrong or a massive spike of users that caused the servers to melt down. Outages range from full-service outages to performance degradation or quality issues that impact users. Sometimes companies notice an outage quickly because it significantly affects their users; at other times it takes a while to notice.

As a company, we want to know whether our products work as expected and whether we live up to our customers' expectations. We want to know quickly if components affecting our users break so we can work on fixing the underlying issues.

How can we implement an effective alerting stack?

Maintaining an effective alerting stack is hard

When companies start on their reliability journey, monitoring coverage is sparse and spotty. Monitoring and alerting get added for parts that break regularly or that an engineer perceives as useful to cover, and the thresholds for those alerts are then tuned over weeks and months. Monitoring and alerting are often added organically.

The monitoring stack grows, and the more it grows, the harder it becomes to maintain and understand. Alerts need to be tuned and refined constantly, either because they are not efficient or due to system changes. Alert tuning is often time-consuming as you want to gain confidence in the changes before entirely relying on them.

It is not uncommon for those parts of the system to be maintained by a few engineers who are experts in this field. Every time an engineer leaves the team to work on a different product, a little bit of knowledge about why things are set up the way they are is lost. The knowledge loss is even more noticeable if the engineer actively maintained those configurations.

Eventually, teams no longer fully understand why things are set up and configured in a certain way and are left with large, complex configurations that are hard to maintain. At this point, the effectiveness of the alerting stack decreases quickly because it receives little maintenance.

The above is one example of what can lead to an ineffective alerting stack, but there are many reasons companies end up with one. Alerting may not have been enough of a priority because of tight launch deadlines or other engineering priorities; the team may have lacked the necessary knowledge; the product may have been acquired or built by a different team; products may have different levels of maturity; or there may have been no central standards.

Whatever the reason, we have seen that the quality and effectiveness of the alerting stack vary greatly between companies and, within larger companies, even between teams.

How do you recognize if your alerting stack is suboptimal?

Seven common symptoms of a suboptimal alerting stack

On a high level, there are two common extremes on the spectrum of alert quality:

First, companies don't have enough signal - oncall engineers aren't informed about outages.
Second, companies have too much noise - oncall engineers are overwhelmed with notifications.

Either of those cases leads to real customer-impacting issues going unnoticed. While companies rarely end up on one of the extremes, often there is room for improvement, and companies need to find the optimal point for them.

Symptoms that help you spot an inefficient alerting stack include:

Alert Fatigue
The system sends so many alerts that the team can't respond to each one, leading to ignored alerts and possible overlooked issues.
False Positives
The system regularly sends alerts for issues that are nonexistent or transient. This contributes to alert fatigue and can undermine trust in the alerting system.
False Negatives
The system fails to send alerts for real, customer-facing production issues. This could lead to significant downtime or performance issues.
No Alert Prioritization
Alerts are not categorized or prioritized based on severity or impact, making identifying the most critical issues challenging.
Unactionable Alerts
Alerts are sent for issues that cannot be acted upon or are beyond the control of the team.
No Continuous Improvement
The team does not learn from past alerts to reduce false positives/negatives and improve alert accuracy over time. At the same time, if the team is spending too much time improving and tuning alerts, it could also signal an inefficient setup.
Overemphasis on System Metrics
The system is overly focused on raw metrics like CPU usage, disk space, etc., without considering customer-focused metrics like response time or error rates.
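
To make the false-positive and false-negative symptoms measurable, a team could review the alerts from a given period and compute two ratios. The sketch below is a minimal illustration, not a prescribed method: the `Alert` record, the `alert_quality` function, and the sample numbers are hypothetical, and it assumes that each alert that pointed at a real issue maps to one distinct incident.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    real_incident: bool  # True if the alert pointed at a real, customer-facing issue


def alert_quality(alerts, missed_incidents):
    """Return (precision, recall) for a review period.

    precision: share of fired alerts that were real (low => false positives, fatigue)
    recall:    share of real incidents that fired an alert (low => false negatives)
    Assumption: each real alert corresponds to exactly one distinct incident.
    """
    true_positives = sum(1 for a in alerts if a.real_incident)
    precision = true_positives / len(alerts) if alerts else 1.0
    total_incidents = true_positives + missed_incidents
    recall = true_positives / total_incidents if total_incidents else 1.0
    return precision, recall


# Hypothetical review: 8 alerts fired, 2 were real, and 1 incident fired no alert.
alerts = [Alert("login-errors", True), Alert("db-latency", True)]
alerts += [Alert("cpu-high", False) for _ in range(6)]
precision, recall = alert_quality(alerts, missed_incidents=1)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up review, only a quarter of the alerts were worth waking up for, and one in three incidents went undetected. Both numbers point to room for improvement.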

All of the above can lead to a longer than necessary time to detection (TTD), which leads to longer time to recovery (TTR) - meaning that customers are experiencing outages for longer than needed.
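
The relationship between TTD and TTR can be shown with simple arithmetic. The sketch below assumes that both are measured from the start of customer impact (definitions vary between companies), and the function name, timestamps, and 40-minute fix duration are hypothetical.

```python
from datetime import datetime, timedelta


def ttd_and_ttr_minutes(impact_start, detected_at, recovered_at):
    """Return (TTD, TTR) in minutes, both measured from the start of impact."""
    ttd = (detected_at - impact_start).total_seconds() / 60
    ttr = (recovered_at - impact_start).total_seconds() / 60
    return ttd, ttr


impact_start = datetime(2023, 8, 11, 10, 0)
fix_duration = timedelta(minutes=40)  # hypothetical time from detection to recovery

# Slow detection: the outage is noticed after 25 minutes.
slow_detected = impact_start + timedelta(minutes=25)
slow = ttd_and_ttr_minutes(impact_start, slow_detected, slow_detected + fix_duration)

# Fast detection: same fix effort, but the outage is noticed after 5 minutes.
fast_detected = impact_start + timedelta(minutes=5)
fast = ttd_and_ttr_minutes(impact_start, fast_detected, fast_detected + fix_duration)

print(slow)  # (25.0, 65.0)
print(fast)  # (5.0, 45.0)
```

With the same effort spent on the fix, every minute saved on detection is a minute less of customer impact.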

How can this be prevented? Is there a better approach for alerting?

User-centric incident detection

For this section, let's assume you operate a simple service with the ability for users to log in. The service consists of a database, a backend, and a frontend.

Cause-based alerting has been a prevalent practice since the rise of monitoring and observability. With cause-based alerting, alerts are added based on system metrics and known failures. In our example, we'd add an availability alert for each of our components (database, backend, frontend), as unavailability in any one of them would result in users being unable to log in. Further, we'd add more alerts for potential failures of our components. Following this procedure, many companies have ended up with an overly complex alerting stack that exhibits some of the above-mentioned symptoms.

For some time, reliability professionals have suggested a more user-centric approach to alerting, often referred to as "symptom-based alerting" or "customer-centric alerting". On a high level, the idea is that your users don't care about your system internals. They don't care which storage or database technology you use or whether it is up; they care whether the functionality exposed to them works. Your alerting should follow the same mentality and be based on metrics representative of your users' experience.

While going into all the details of how to implement user-centric alerting is beyond the scope of this article, on a high level, the steps to follow are:

Identify and prioritize critical user journeys (CUJ), as you'll not be able to cover everything at the same time
Determine metrics that are representative of the user experience of those user journeys (often referred to as Service Level Indicators (SLIs))
Define clear objectives for your teams and services based on those metrics (often referred to as Service Level Objectives (SLOs))

Using the previous example, instead of adding alerts for each of the components, we'd add a metric that represents the users' ability to log in and alert based on that metric. This one alert would cover failures in any of our components, resulting in fewer, simpler alerts.
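
As a rough illustration of the login example, the SLI could be the ratio of successful logins to login attempts in a time window, with an alert firing when that ratio drops below the SLO target. This is a minimal sketch under stated assumptions: the `LoginWindow` record, the function names, and the 99% target are hypothetical, and a real setup would typically evaluate this in the monitoring system rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class LoginWindow:
    attempts: int   # login attempts observed in the time window
    successes: int  # attempts that ended with the user logged in


def login_success_ratio(window):
    """SLI: share of login attempts that succeeded."""
    if window.attempts == 0:
        return 1.0  # assumption: no traffic is treated as healthy
    return window.successes / window.attempts


def should_alert(window, slo_target=0.99):
    """Fire when the SLI falls below the (hypothetical) SLO target."""
    return login_success_ratio(window) < slo_target


healthy = LoginWindow(attempts=1000, successes=998)
# A failure in the database, backend, or frontend all show up the same way:
# users cannot log in, so the success ratio drops.
degraded = LoginWindow(attempts=1000, successes=900)

print(should_alert(healthy))   # False
print(should_alert(degraded))  # True
```

The alert does not need to know which component failed. That is what keeps it stable when the implementation behind the login flow changes.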

Symptom-based alerting addresses the shortcomings of cause-based alerting and promotes a comprehensive understanding of the system's health. Here's why symptom-based alerting shines:

Protecting User-Centric Reliability: Symptom-based alerting is purpose-built to protect the user experience. Monitoring and alerting based on business-driven metrics ensures that alerts are triggered only when there is a direct impact on the customer.
Better experience: Fewer, targeted alerts based on symptoms create an improved oncall experience, which helps increase engineering productivity and reduce outage duration.
Independent of system implementation: Symptom-based alerts are independent of system implementation. In our previous example, we don't need to adapt our symptom-based alert if we replace the storage system.
Simplifying Understanding Across the Organization: Symptom-based alerting provides a shared language of reliability throughout the organization. Product teams and executives can easily comprehend user journeys and the associated impacts on the business, ensuring a unified understanding of the system's health.

You might wonder where to start if you are a company with an existing alerting stack. You cannot do everything at once. A transition will take time. Focus on the most critical user journeys and implement some metrics there. No need to have the perfect metric right away; having something now is better than having something better later. Your metrics and alerts will evolve over time.

Common objections to symptom-based alerting

When organizations migrate to symptom-based alerting, common objections are:

Purely symptom-based alerting doesn't work. There are cases where you want to know beforehand.
You are right. While the majority of alerts should be symptom-based, there is room for a small number of cause-based alerts. Think about running out of disk space. You don't need to wait until the disk is full and you have customer-facing impact. Use cause-based alerts sparingly, for issues that, if left unattended, will soon lead to customer impact.
Symptom-based alerting means that you are constantly reacting to impact. We want to be proactive with cause-based alerts.
In general, you should only alert if there is an impact and urgent action is needed. With complex systems, something is always broken. As long as it isn't causing customer impact, it might be okay not to worry about it. But similar to the previous point, there is a small set of cases where cause-based alerts might make sense. You should be especially careful not to alert on transient causes that might resolve on their own.
Do you want me to throw away all the alerts we currently have? We spent so much time fine-tuning them.
Partly! We understand there is hesitation in throwing away existing solutions, and we don't expect you to do this overnight. Instead, use shadow alerting: set up new symptom-based alerts that are not yet active and let them run in parallel with the existing alerts for some time to gain confidence. Once there is confidence in the new alerts, you can switch over component by component. In some cases, it makes sense to keep some of the cause-based alerts as non-alerting signals, as they can help to diagnose root causes.
This doesn't work for our team; we are an internal service.
Your team might not be externally facing but might provide infrastructure used by other teams. Your users are not the end customers but rather the internal users. Similar to externally facing teams, you should set up your alerting to ensure that your (internal) users are receiving the service they expect.
With symptom-based alerting, how do you alert the right team?
That is a challenge that needs to be solved on a case-by-case basis. There is no easy answer, and the best way forward depends on the specifics of a user journey. Zalando described a concept called "Adaptive Paging" in a blog post.
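
Two of the answers above lend themselves to a small sketch. The first function illustrates a cause-based alert worth keeping: it extrapolates disk usage linearly to estimate how soon the disk fills up. The second compares which incidents the existing alerts and the shadow alerts would have caught during a parallel run. All names, numbers, and incident IDs are hypothetical, and linear extrapolation is a simplifying assumption.

```python
def hours_until_full(used_gb_then, used_gb_now, hours_between, capacity_gb):
    """Estimate hours until the disk is full, assuming linear growth.

    Returns None if usage is flat or shrinking.
    """
    growth_per_hour = (used_gb_now - used_gb_then) / hours_between
    if growth_per_hour <= 0:
        return None
    return (capacity_gb - used_gb_now) / growth_per_hour


def compare_shadow(incidents, legacy_fired, shadow_fired):
    """Compare existing and shadow alerts over the same period.

    All arguments are sets of IDs; an ID that is not in `incidents`
    is a firing without a confirmed customer-facing incident.
    """
    return {
        "caught_by_both": incidents & legacy_fired & shadow_fired,
        "only_legacy": (incidents & legacy_fired) - shadow_fired,
        "only_shadow": (incidents & shadow_fired) - legacy_fired,
        "shadow_noise": shadow_fired - incidents,
    }


# Disk grew from 400 GB to 430 GB in 6 hours on a 500 GB volume.
remaining = hours_until_full(400, 430, hours_between=6, capacity_gb=500)
print(remaining)  # 14.0

report = compare_shadow(
    incidents={"inc-1", "inc-2", "inc-3"},
    legacy_fired={"inc-1", "inc-2", "noise-7"},
    shadow_fired={"inc-1", "inc-3"},
)
print(report)
```

In this made-up comparison, the shadow alert caught an incident the existing alerts missed and produced no noise, but it also missed `inc-2`. That is the kind of gap to investigate before switching over.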

Your opinion matters

We hope you enjoyed this episode on effective incident detection and symptom-based alerting. Please let us know if you have any feedback on how to make our content more useful for you or any other topics you'd like us to cover.

Do you have experience with an ineffective alerting system, revamping an alerting stack, or implementing symptom-based alerting? We are curious to hear from you. Leave a comment on LinkedIn or drop us a note!
