In-depth guide on SLIs, SLOs and Error Budgets

André Cavalheiro

Customer Reliability Engineer

Rely.io

December 14, 2021

The Reliability Stack

Behind what the customer sees, products are complex systems made of interconnected components that rely on each other. This complexity makes ensuring reliability harder. However, as usual, you can deal with a big problem by breaking it down into several smaller problems. Although your product is a service on its own, you can further divide it into smaller and smaller services depending on the function that each component (or group of components) has. You can then think about the reliability of your product as the combined reliability of the services that compose it.

Since services have well-defined objectives and interfaces, it is easier to determine whether they are performing as they are supposed to. This strategy also makes it easier to find the root cause of undesired behaviors.

Regarding service reliability, three things are always true, as perfectly stated by Hidalgo et al. [1]:

A proper level of reliability is the most important operational requirement of a service. A service exists to perform reliably enough for its users, where reliability encompasses not only availability but also quality, dependability, and responsiveness.
How you appear to be operating to your users is what determines whether you’re being reliable or not — not what things look like from your end.
Nothing is perfect all the time, so your service doesn’t have to be either. Not only is it impossible to be perfect, but the costs in both financial and human resources as you creep ever closer to perfection scale at something much steeper than linear.

Site Reliability Engineering (SRE) emerged as a way of measuring reliability built upon these three pillars. At its core is the reliability stack, made of three building blocks that feed into each other: Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets (EBs).

We will go over each one of them on a conceptual and technical level. Then we'll go over a few examples and describe how we, at detech.ai, provide you with the proper tools to make your journey into an SLO-driven methodology as convenient as possible. Hopefully, when you finish reading it you'll have a greater understanding of how these concepts can be useful in your organization.

To keep in mind...

Before we start, let us highlight three things to keep in mind when implementing an SLO-based approach to reliability.

The purpose of this methodology is to provide you with new and valuable data that allows you to look at your services in a new way. Data that is more easily discussed within your organization and that you can use to make better decisions. It won't fix all your problems and it won't make your service reliable on its own, but if used correctly, it can become the most powerful tool that you have at your disposal to assess the state of your product and plan ahead during your organization's lifetime.

This leads us to the second point. SLOs are a journey! They're a continuous process, not a project that you can complete in a sprint or two. A common misconception is that you can just use your SLOs as OKRs that you discuss in your quarterly meeting. They are actually more like a philosophy that you can implement within your organization, a new way of thinking that must be practiced and refined.

And finally, given that each service has its specificities and that services are always evolving, you shouldn't be afraid to adjust your SLO/SLI configurations. Obviously, this shouldn't happen too often, or people will lose track of (or interest in) what they are supposed to follow. Nonetheless, given that there is no such thing as one model that fits all, it is natural not to get the ideal configuration on your first try... or on your second... or on your third... We live and we learn! We learn new perspectives and gain new insights every day, so it's only normal to adapt. It's also normal for a particular service to evolve to the point where old configurations don't make as much sense as they once did. The most important thing is to keep adapting to the evolution of your product so that you extract as much value from this methodology as possible.

Understanding SLIs

A service level indicator is a quantitative indicator that should measure some level of quality in your service. A common good practice in the SRE community is to define SLIs that have a clear translation into user happiness. Don't forget that behind the OKRs and day-to-day, the users' happiness is the ultimate goal of any service.

The point behind indicators is to allow for an explicit and objective view into how users experience a particular aspect of your product. Because of this, and despite the math behind the calculation of an SLI, it is of the utmost importance that it can be clearly defined via a short sentence that any stakeholder in an organization can understand, whether they are engineers, product owners, managers, or investors.

A meaningful SLI provides you with a way to create binary outcomes for specific events, namely: "The service did what the users needed it to" or "The service did not do what its users needed it to". Once this is defined, you can measure the ratio of good events to total events (equation 1). While in practice some cases require some processing to get here, this is the only bit of math one needs to understand to interpret an SLI.
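Since the original equation did not survive on this page, here is a reconstruction of equation 1 from the definition above. The notation is our own shorthand, not necessarily the one used in the original figure:

```latex
\begin{equation}
\mathrm{SLI} = \frac{\text{good events}}{\text{total events}} \times 100\%
\tag{1}
\end{equation}
```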

For example, you may determine that your users are happy if your webapp loads within 1.5 seconds. An event is then automatically defined as your webapp loading, and a good event is a load that completes within 1.5 seconds. If in a given week there are 200,000 accesses and 198,600 of them load within the threshold, then the percentage of good events is 198,600 / 200,000 = 99.3%.

The fact that the reliability of your system is the output of the continuous conflict between these two forces over time (good events and bad events) is fundamental to understanding some of the concepts explained later in this article.
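To make the webapp example concrete, here is a minimal Python sketch that classifies each page load as good or bad and computes the percentage of good events. The function name, the constant and the synthetic load times are assumptions made for illustration only; in practice the load times would come from your own monitoring data.

```python
from typing import Iterable

LOAD_THRESHOLD_SECONDS = 1.5  # threshold taken from the example above


def good_event_percentage(
    load_times: Iterable[float],
    threshold: float = LOAD_THRESHOLD_SECONDS,
) -> float:
    """Return the percentage of page loads that finished within the threshold."""
    times = list(load_times)
    if not times:
        raise ValueError("at least one event is required to compute an SLI")
    # A good event is a load that completed within the threshold.
    good = sum(1 for t in times if t <= threshold)
    return 100.0 * good / len(times)


# Synthetic week of traffic (made-up values) matching the counts in the example:
# 198,600 fast loads and 1,400 slow ones.
weekly_load_times = [0.8] * 198_600 + [2.3] * 1_400
sli = good_event_percentage(weekly_load_times)
print(f"{sli:.1f}%")  # 99.3%
```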

Defining SLOs

Assuming that you have a meaningful quality indicator, you can then set an objective for your organization to strive for. As is commonly said in the SRE community, trying to ensure anything is reliable 100% of the time is a fool’s errand. In the information age, nothing runs in a vacuum and everything depends on something else: mistakes are made, things break, failures happen and unforeseen events take place. It's all part of the journey, and that's ok, as long as these don't happen too often. SLOs serve exactly that purpose: they let you determine how often you are allowed to fail while ensuring your users are still happy.

So let's get technical, how do you go from an indicator to an objective? Besides having an SLI, there are two things you need to define:

The first is the quality target you are aiming for. This is simply a value (between 0 and 100) for your SLI's success ratio, as given by equation 1. This threshold defines the number of failures that you are comfortable enduring over a given period.

The second is the time window over which the objective is tracked, which involves picking its size and type.

WINDOW SIZE

The size of your window defines how long-term your SLO-based decision-making process should be.

Shorter time windows are better for short-term decisions. If you missed your SLO last week then you can prioritize small optimizations, bug fixes, and reducing technical debt so that you can do better during the next few weeks. Longer periods allow you to be more strategic about the general direction your team is heading. "Should I have my engineers focus on moving our back-end to another framework that's more reliable, since they're always complaining about the amount of trouble that the current one presents, or should I have them automate a pipeline block with a new ML model?" Simply put: do you want to increase the stability of your back-end with a better framework, or increase the amount of uncertainty with automation? One week of data doesn't give you enough information to make such a big decision. Regardless of your choice, your engineers will have their hands full for the next few weeks, so you want to make sure that during that time your service meets your quality standards.

WINDOW TYPE

There are two types of windows: rolling windows, which move continuously as time passes, and static windows, which are bound to calendar periods (e.g., a week, a month, a year). They both have their advantages and limitations. You can even decide to use both simultaneously as long as you are well aware of how to correctly interpret each one.

Rolling windows are more aligned with user experience.

As a rule of thumb, rolling windows should be defined in terms of weeks to consistently include the same number of weekends since it is common for the amount of traffic to vary significantly between weekends and weekdays.

Static windows are more aligned with the usual planning within an organization and the overall project calendar. For example, you may wish to evaluate monthly or quarterly whether you achieved your objective, so you can plan accordingly.
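The difference between the two window types can be sketched in a few lines of Python. This is only an illustration: the function names and the four-week and calendar-month choices are assumptions, not a prescribed configuration.

```python
from datetime import datetime, timedelta, timezone


def rolling_window(now: datetime, weeks: int = 4) -> tuple[datetime, datetime]:
    """Window that always covers the last `weeks` whole weeks, ending at `now`."""
    return now - timedelta(weeks=weeks), now


def calendar_month_window(now: datetime) -> tuple[datetime, datetime]:
    """Static window bound to the calendar month that contains `now`."""
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    # Jump safely into the next month, then snap back to its first day.
    end = (start + timedelta(days=32)).replace(day=1)
    return start, end


now = datetime(2021, 12, 14, 9, 30, tzinfo=timezone.utc)
print(rolling_window(now))         # moves forward every time `now` changes
print(calendar_month_window(now))  # stays fixed until the month ends
```

Expressing the rolling window in whole weeks keeps the number of weekends inside it constant, which is the rule of thumb mentioned above.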

Don't worry if you are feeling a bit overwhelmed by the number of decisions you need to make to establish an SLO. SRE is still an emerging field and there are no absolute right answers! Each service is a service, each organization is an organization. Recall what we said previously, this is a journey, not a destination! Start by choosing something that you can understand. As soon as you have at least one SLO up and running you'll immediately start to get valuable data about your service! You can iterate and make the appropriate changes as you start to get a feel for how these metrics evolve specifically in your service. Over time you can (and should!) re-think and re-iterate your SLO configurations to continuously improve their usefulness to your organization.

Defining Error Budgets

Error budgets are the most advanced part of the reliability stack because they rely on the two previous building blocks being meaningful.

As previously stated, when you define your SLO's target you are basically defining two states for your service: either your success ratio is acceptable, in which case you are in budget, or it is not, in which case you are out of budget. The image below illustrates the two scenarios:

As you can see, your error budget is automatically defined once you set your target:
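As a reconstruction of the relationship described here (the symbol T for the quality target, expressed as a percentage, is our own notation):

```latex
\begin{equation}
\mathrm{EB} = 100\% - T
\end{equation}
```

For instance, a target of T = 95% leaves an error budget of 5% of all events in the window.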

What you are interested in measuring is the remaining error budget (REB), meaning the wiggle room you still have left until your good event ratio breaks the quality target that you have defined. To generalize, this is measured as a percentage of the total error budget, as expressed by the equation below.

Simply put, you can think of an error budget as a representation of how much your service is allowed to fail until it breaks your definition of quality.
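One way to write this down, as a reconstruction under our own notation rather than the original figure, is to take the SLI as the percentage of good events and T as the quality target (also a percentage), so that the total budget is 100% - T and the consumed share is 100% - SLI:

```latex
\begin{equation}
\mathrm{REB}
  = \frac{(100\% - T) - (100\% - \mathrm{SLI})}{100\% - T} \times 100\%
  = \frac{\mathrm{SLI} - T}{100\% - T} \times 100\%
\end{equation}
```

With T = 95% and an SLI of 99%, this gives (99 - 95) / (100 - 95) = 80% of the budget remaining. A negative value means the budget has been exceeded.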

Error budgets allow you to say things such as "We can only suffer 3000 more errors until we break our budget this month", or "We only have 2 more hours of budget this quarter". The REB can be a powerful tool for decision-making: it provides you with an intuitive way to identify whether you should focus on developing new features or on solidifying the current state of your service. The figure below presents the most basic of algorithms that an organization can employ regarding error budgets (borrowed from Hidalgo et al. [1]).

A common mistake SRE beginners make when thinking about error budgets is to simply think of them as gas tanks that get filled at the beginning of the SLO window and drain as time goes on and bad events occur. While it is true that your budget is full when your period starts and that bad events consume budget, you must remember that the consistent occurrence of good events can also give budget back. Only with time and practice can you truly grasp the dynamics involved in an error budget, which vary according to your SLI, your SLO's time window type, and your own service.
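As a rough sketch of how such statements and decisions could be derived, the Python below computes how many more errors fit in the budget and applies a very simple rule: keep shipping features while budget remains, otherwise focus on reliability. The rule, the function names and the numbers are assumptions made for illustration, not a transcription of the original figure.

```python
import math


def errors_left(expected_events: int, bad_events: int, target_pct: float) -> int:
    """Number of additional bad events that still fit in the error budget."""
    allowed_bad = expected_events * (100.0 - target_pct) / 100.0
    return max(0, math.floor(allowed_bad - bad_events))


def next_focus(remaining_errors: int) -> str:
    """Most basic policy: features while in budget, reliability otherwise."""
    if remaining_errors > 0:
        return "ship new features"
    return "focus on reliability"


# Hypothetical month: 1,000,000 expected requests, a 99.5% target,
# and 2,000 failed requests so far.
left = errors_left(expected_events=1_000_000, bad_events=2_000, target_pct=99.5)
print(left, "->", next_focus(left))  # 3000 -> ship new features
```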

Examples

Now for a more practical way of understanding how the concepts that we have just introduced relate to one another. These notions can be grasped through careful analysis of the formulas presented throughout the article. Nonetheless, examples and visual representations usually help build a deeper understanding.

Imagine the common case where HTTP requests with a 200 code are the good events and those with a 500 code are the bad events, and you have established a target value of 95%. For those of you with no web development background, HTTP codes are just a simple way of indicating whether a request was fulfilled correctly or not.

At a given point, let's say tA, you have a total of 100k requests, of which 99k are successful (HTTP 200) and 1k are unsuccessful (HTTP 500). The percentage of good events is therefore 99%.

Suddenly, due to some unforeseen circumstance, your platform misbehaves for a period of time, during which you endure 3k unsuccessful requests. At the end of that period, let's say at tB, your percentage of good events is 99,000 / 103,000 ≈ 96.1%.

You are able to fix the problem and things go back to normal. Some time passes, during which you respond successfully to another 2k requests. At that point, tC, your percentage of good events is 101,000 / 105,000 ≈ 96.2%.
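The three moments can be replayed with a short Python sketch that accumulates the request counts and reports both the percentage of good events and the share of the error budget still remaining. The function names and the way the remaining budget is computed (allowed bad events as 5% of all requests seen so far) are assumptions made for illustration.

```python
TARGET_PCT = 95.0  # target from the example


def sli_pct(good: int, bad: int) -> float:
    """Percentage of good events among all events seen so far."""
    return 100.0 * good / (good + bad)


def remaining_budget_pct(good: int, bad: int, target_pct: float = TARGET_PCT) -> float:
    """Share of the error budget that has not been consumed yet."""
    allowed_bad = (good + bad) * (100.0 - target_pct) / 100.0
    return 100.0 * (1.0 - bad / allowed_bad)


# (label, new good requests, new bad requests) for each step of the story.
steps = [("tA", 99_000, 1_000), ("tB", 0, 3_000), ("tC", 2_000, 0)]

good = bad = 0
history = {}
for label, new_good, new_bad in steps:
    good += new_good
    bad += new_bad
    history[label] = (sli_pct(good, bad), remaining_budget_pct(good, bad))
    print(f"{label}: SLI={history[label][0]:.1f}%  REB={history[label][1]:.1f}%")
```

Run as written, the remaining budget drops from 80.0% at tA to about 22.3% at tB, then recovers slightly to about 23.8% at tC even though no time-based refill took place: the extra good events alone enlarged the number of failures the target can absorb.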

What's next?

Congratulations! If you got here, hopefully you now have a better understanding of the fundamental concepts behind SRE. Nonetheless, you are just getting started! There is a lot more built on top of these building blocks that can help your organization make even more informed decisions and keep you up to date with the reliability of your service. As we said at the beginning of this article, SLOs are a journey, not a destination, and there are always ways to improve! We at [Rely.io](https://Rely.io/) are here to help you every step of the way.

Now that you understand how your error budgets evolve with time, it can be helpful to monitor how fast (or slow) you are losing budget, or when it is predicted to run out. This is usually known as the rate of error budget consumption (or burn rate). Moreover, you may even want to set up policies for being alerted when specific events occur so you can act on them as quickly as possible! You can learn about these topics in our upcoming blog articles!

Bibliography

[1] "Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets" by Alex Hidalgo | 2020

[2] "The Site Reliability Workbook: Practical Ways to Implement SRE" by Betsy Beyer, Niall Richard Murphy, et al. | 2018

[3] "Site Reliability Engineering: How Google Runs Production Systems" by Niall Richard Murphy, Betsy Beyer, et al. | 2016
