Assuming that you have a meaningful quality indicator, you can set an objective for your organization to strive for. As is commonly said in the SRE community, trying to keep anything reliable 100% of the time is a fool’s errand. In the information age nothing runs in a vacuum and everything depends on something else: mistakes are made, things break, failures happen, and unforeseen events take place. It's all part of the journey, and that's OK, as long as these don't happen too often. That is exactly what SLOs are for: they let you determine how often you are allowed to fail while still keeping your users happy.
So let's get technical: how do you go from an indicator to an objective? Besides having an SLI, there are two things you need to define:
The first is the quality target you are aiming for. This is simply a value (between 0 and 100) for your SLI's success ratio (equation 1). This threshold defines how many failures you are comfortable enduring at any given point in time.
The second is the time window over which the objective is tracked, which involves picking its size and type.
The size of your window defines how long-term your SLO-based decision-making process should be.
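To make these two ingredients concrete, here is a minimal sketch of how a target translates into a number of tolerated failures. The `SLO` class, its field names, and the helper functions are hypothetical names chosen for illustration, and the code assumes the SLI is a ratio of good events to valid events expressed as a percentage.

```python
import math
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class SLO:
    """Hypothetical container for the two things an SLO needs on top of an SLI."""
    target_percent: float  # quality target, between 0 and 100
    window_days: int       # size of the time window, in days


def allowed_failures(slo: SLO, valid_events: int) -> int:
    """Failed events the target tolerates for a given volume of valid events."""
    # Fraction avoids floating-point surprises such as 100 - 99.9 != 0.1.
    failure_share = (100 - Fraction(str(slo.target_percent))) / 100
    return math.floor(failure_share * valid_events)


def is_met(slo: SLO, good_events: int, valid_events: int) -> bool:
    """True if the failures observed in the window fit within the target."""
    failures = valid_events - good_events
    return failures <= allowed_failures(slo, valid_events)


# Example: a 99.9% target over 28 days with one million valid events.
slo = SLO(target_percent=99.9, window_days=28)
print(allowed_failures(slo, 1_000_000))
print(is_met(slo, good_events=999_000, valid_events=1_000_000))
```

The window size is carried along here only as data; it determines which events you count as `valid_events` in the first place.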
Shorter time windows are better for short-term decisions. If you missed your SLO last week, you can prioritize small optimizations, bug fixes, and reducing technical debt so that you do better over the next few weeks. Longer periods allow you to be more strategic about the general direction your team is heading. "Should I have my engineers focus on moving our back-end to a more reliable framework, since they're always complaining about how much trouble the current one causes, or should I have them automate a pipeline block with a new ML model?" Simply put: do you want to increase the stability of your back-end with a better framework, or increase the amount of uncertainty with automation? One week of data doesn't give you enough information to make such a big decision. Regardless of your choice, your engineers will have their hands full for the next few weeks, so you want to make sure that your service meets your quality standards during that time.
Regarding the window type, there are two options: rolling windows, which move continuously as time passes, and static windows, which are bound to calendar periods (e.g., a week, a month, a year). Both have their advantages and limitations. You can even decide to use both simultaneously, as long as you are well aware of how to interpret each one correctly.
Rolling windows are more aligned with user experience.
As a rule of thumb, rolling windows should be defined as a whole number of weeks so that they consistently include the same number of weekends, since it is common for the amount of traffic to vary significantly between weekends and weekdays.
Static windows are more aligned with an organization's usual planning and overall project calendar. For example, you may wish to evaluate monthly or quarterly whether you achieved your objective, so that you can plan accordingly.
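The difference between the two window types, and the weekend rule of thumb, can be sketched as follows. The function names are hypothetical, the static window is assumed to be a calendar month, and windows are represented as half-open `[start, end)` date ranges (the type hints assume Python 3.9 or later).

```python
from datetime import date, timedelta


def rolling_window(today: date, days: int) -> tuple[date, date]:
    """Half-open [start, end) range covering the last `days` days, today included."""
    end = today + timedelta(days=1)
    return end - timedelta(days=days), end


def calendar_month_window(today: date) -> tuple[date, date]:
    """Half-open [start, end) range for the calendar month containing `today`."""
    start = today.replace(day=1)
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    return start, end


def weekend_days(start: date, end: date) -> int:
    """Count Saturdays and Sundays in the half-open range [start, end)."""
    count = 0
    day = start
    while day < end:
        if day.weekday() >= 5:  # Monday is 0, so 5 and 6 are the weekend
            count += 1
        day += timedelta(days=1)
    return count


# Compare a 28-day (four-week) and a 30-day rolling window over two weeks.
for offset in range(14):
    today = date(2024, 1, 1) + timedelta(days=offset)
    four_weeks = weekend_days(*rolling_window(today, 28))
    thirty_days = weekend_days(*rolling_window(today, 30))
    print(today, four_weeks, thirty_days)
```

A window made of whole weeks always contains the same number of weekend days no matter when it ends, whereas the count for a 30-day window changes as the window slides, which is one reason traffic mix can shift between evaluations.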
Don't worry if you are feeling a bit overwhelmed by the number of decisions you need to make to establish an SLO. SRE is still an emerging field and there are no absolute right answers! Every service is different, and so is every organization. Recall what we said previously: this is a journey, not a destination! Start by choosing something that you can understand. As soon as you have at least one SLO up and running, you'll start to get valuable data about your service. You can then iterate and make the appropriate changes as you get a feel for how these metrics evolve in your particular service. Over time you can (and should!) rethink and revise your SLO configurations to keep improving their usefulness to your organization.