In the first two posts, we looked at reliability from a reactive (Incident Detection) and a preventive (Release Management) perspective. A third pillar that helps companies improve system operation and their overall reliability practices is operational maturity.
Operational maturity consists of two parts:
- Reliability standards and best practices
- Measuring compliance and other performance indicators
Keeping up with the latest SRE and Platform Engineering trends/developments is hard
Reliability is a huge and evolving field, with new information and knowledge popping up constantly; there are plenty of blogs, newsletters, and books to read and follow. With such a variety of information available, many companies don't know which principles and best practices to follow to increase their teams' engineering efficiency while improving the reliability and maintainability of their systems, or how to assess their operational maturity.
What does operational maturity even mean? Compare it to owning a car: if you own and drive a car, you follow specific rules and do preventive maintenance to extend its life. If the oil light comes on, you add oil.
Similarly, if you operate a system at scale, certain practices make it easier to operate and maintain, and certain metrics indicate how well-run it is. People with experience operating systems at scale have documented their learnings and best practices, sometimes condensed into books, articles, or videos. However, those resources are scattered, and it is unclear which practices to apply. Operational maturity indicates how well a company adheres to current standards and best practices.
People with deep knowledge of software development as well as DevOps and SRE practices are rare. It is common for companies to either hire people with system operation backgrounds alongside pure software engineers or have their infrastructure/development teams operate their services. Staying on top of the most recent developments in the field is challenging for DevOps engineers and SREs, and even harder for developers who own services but for whom service operations are not the main part of the job.
While engineering departments of a certain size can afford a few people dedicated to keeping up with evolving principles and standards, it is unreasonable and unproductive to expect most developers to learn and memorize all of them. Base-level knowledge of operating services at scale is beneficial, but ramping up every engineer to expert level in this field is rarely time well spent. Instead, organizations should invest in providing the tools, guardrails, and guidance needed to improve the developer experience of operating their systems.
Another common theme at big companies, where teams have varying levels of experience, is that you end up with many localized solutions: teams build tooling and track and report metrics differently from their neighboring teams. It is positive that a team cares about a topic and wants to improve. Still, when every team is building "special snowflake solutions" and tracking different metrics, it is hard for senior engineering leaders to get a clear picture of how certain areas look across the whole organization.
Lastly, following reliability standards and best practices often doesn't yield immediate results or overnight successes. More often than not, it takes time and practice to see improvements; it is a long-term game. This can hinder adoption and cause some teams to abandon their efforts.
[Image: Eight symptoms of low operational maturity]
Introducing reliability standards and best practices
Reliability standards
Rather than reinventing the wheel, it is advisable to follow practices and standards that have proven successful; they will lead to the desired results in the long run.
Where is a good place to start? One of the most popular standards in the reliability field comes from DevOps Research and Assessment (DORA), whose four key metrics cover:
- Lead time: How long does it take for code to go from committed to being live in production?
- Deploy frequency: How often do you deploy code to production?
- Change fail percentage: What percentage of changes result in degraded service and require remediation?
- Time to restore: How long does it take to restore your service if an incident occurs?
DORA found that those four performance indicators are great predictors of organizational performance.
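As an illustration, here is a minimal sketch of how those four metrics could be computed from deployment and incident records. The record shapes are hypothetical, not DORA's official tooling; a real pipeline would pull them from your CI/CD and incident-management systems.

```python
from datetime import datetime
from statistics import median

# Hypothetical records; in practice these come from CI/CD and incident tooling.
deployments = [
    # (commit_time, deploy_time, caused_failure)
    (datetime(2024, 5, 1, 9), datetime(2024, 5, 1, 15), False),
    (datetime(2024, 5, 2, 10), datetime(2024, 5, 3, 11), True),
    (datetime(2024, 5, 6, 8), datetime(2024, 5, 6, 12), False),
]
incidents = [
    # (start_time, restore_time)
    (datetime(2024, 5, 3, 12), datetime(2024, 5, 3, 14)),
]
period_days = 7

lead_time = median(d - c for c, d, _ in deployments)       # commit -> production
deploy_frequency = len(deployments) / period_days          # deploys per day
change_failure_rate = sum(f for *_, f in deployments) / len(deployments)
time_to_restore = median(r - s for s, r in incidents)

print(lead_time, deploy_frequency, round(change_failure_rate, 2), time_to_restore)
```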
On top of DORA, other operational standards can be applied across the board to improve reliability and engineering velocity. Unfortunately, those standards don't come from a single source, nor are they collected in a central place for easy reference. To follow them, teams have to wade through a lot of material and assess what applies to their systems.
Some of the best practices, standards, and indicators we recommend are:
- Production readiness: these indicators can be used as a checklist for production readiness, as well as for post-production audits and for detecting degradation or missing elements.
  - Ownership: Clearly defined ownership of, and responsibility for, the product or service.
  - SLOs: An initial set of SLOs is in place. What are the most critical user journeys, and which metrics tell you whether you are delivering the experience users expect?
  - Alerting and escalation policy: Alerting based on the SLOs above, with clear expectations about who is on call for the service at any given time. For larger companies, clearly defining and communicating an escalation policy and points of contact is recommended.
  - Documentation: Documentation that helps readers understand how the system operates and how to interact with it (e.g., API documentation).
  - Runbooks: Especially for new launches, have accessible runbooks in place that describe expected failure modes and how to resolve them. Runbooks have their limitations, but those are beyond the scope of this post.
  - Code quality metrics / Testing: Require some level of test coverage, but be mindful of the effort spent and focus on the most impactful tests (e.g., the parts that are most critical or most likely to break). The quality of tests matters more than their quantity.
  - Dependencies: A service or product depends on many other services. A good understanding of your dependencies, and of how to handle their unavailability, can significantly improve product quality.
- Operational maturity: determines the maturity of on-call, deployment, and service management practices, using the right indicators (e.g., SLOs) to assess the health of services in production.
  - Incident management: Indicators to watch around handling production incidents:
    - Number of incidents
    - Time to close incidents
    - Time to write the incident postmortem
  - Time to release: How long does it take to do a release? How much manual work is involved?
  - Product and service health: Are we meeting the SLOs we defined for our user journeys and services?
- Observability maturity: assesses whether teams are collecting the right telemetry data, have the setup needed to quickly detect and troubleshoot incidents, and understand how their systems behave and perform.
  - Metrics: The minimum indicators required to debug production issues are implemented.
  - Golden signals: Metrics for traffic, latency, error rate, and saturation are in place.
  - Dashboards: An overview dashboard that lets engineers quickly assess whether the system is working as expected, plus troubleshooting dashboards that support debugging.
  - Alerting: Alerting should be based on a multi-window, multi-burn-rate implementation; see the sketch after this list.
  - Monitoring view: Companies often have a combination of infrastructure-, service-, and product-based monitoring. Ideally, you should be able to correlate those metrics from the user down to the infrastructure.
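To make the alerting item concrete, here is a minimal sketch of multi-window, multi-burn-rate evaluation for an availability SLO. The window/threshold pairs follow the commonly cited SRE Workbook example; the per-window error ratios are assumed inputs from your metrics backend, not a prescribed API.

```python
ERROR_BUDGET = 1.0 - 0.999  # 99.9% availability SLO -> 0.1% error budget

def burn_rate(error_ratio: float) -> float:
    """How many times faster than budget-neutral the error budget is burning."""
    return error_ratio / ERROR_BUDGET

# (long window, short window, burn-rate threshold, severity), following the
# commonly cited SRE Workbook example: 14.4x over 1h burns ~2% of a 30-day
# budget, 6x over 6h burns ~5%, 1x over 3d catches slow leaks.
ALERT_POLICIES = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("3d", "6h", 1.0, "ticket"),
]

def evaluate(error_ratio_for: dict) -> list:
    """Return the severities that should fire, given per-window error ratios."""
    fired = []
    for long_w, short_w, threshold, severity in ALERT_POLICIES:
        # Requiring both windows keeps us from paging on spikes that
        # have already recovered.
        if (burn_rate(error_ratio_for[long_w]) >= threshold
                and burn_rate(error_ratio_for[short_w]) >= threshold):
            fired.append(severity)
    return fired

# Example: a sustained ~1.5% error ratio against a 0.1% budget pages,
# and the slow three-day burn also opens a ticket.
print(evaluate({"1h": 0.015, "5m": 0.02, "6h": 0.004, "30m": 0.01, "3d": 0.001}))
# -> ['page', 'ticket']
```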
Measurement
The second important part of operational maturity is measurement. To see how well a company is doing with the operation of its services, it is crucial to look at performance indicators as well as compliance with standards and best practices.
Some performance indicators are easy to measure; for other best practices, it is harder to obtain precise metrics. Don't let that discourage you: it is better to have an imperfect metric now than a perfect one later. The exact measurement matters less than a rough idea of where things stand.
Collecting those metrics allows you to put things into perspective. It will highlight which areas need focus and on which fronts the company is performing well. Those insights can flow into the next planning round and set direction. It will also help balance speed and stability.
Collecting the metrics centrally allows senior engineering leaders to see performance across their departments and compare teams, helping them make better decisions. Similarly, comparing results to industry peers gives valuable insight into which areas the company is doing well in and where it has room for improvement.
It is advisable to put some form of review in place to look at those metrics and spot trends. How often a metric should be reviewed depends on the metric and can range from weekly to semi-annually. A critical aspect of those reviews is transparency: the metrics senior leadership consults should be the same ones engineers look at daily.
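As a sketch of what such central collection could look like, the snippet below rolls per-team compliance checks into team scores and org-wide adoption rates. The check names and data shapes are assumptions for illustration, not a prescribed schema.

```python
from collections import defaultdict

# Hypothetical check results: team -> {check: passed?}
results = {
    "payments": {"slos_defined": True, "runbooks": True, "burn_rate_alerts": False},
    "search":   {"slos_defined": True, "runbooks": False, "burn_rate_alerts": False},
    "checkout": {"slos_defined": False, "runbooks": True, "burn_rate_alerts": True},
}

def scorecard(results):
    # Per-team compliance: share of checks passed.
    team_scores = {team: sum(checks.values()) / len(checks)
                   for team, checks in results.items()}
    # Per-practice adoption across the org: highlights where to focus.
    by_check = defaultdict(list)
    for checks in results.values():
        for name, passed in checks.items():
            by_check[name].append(passed)
    practice_adoption = {name: sum(v) / len(v) for name, v in by_check.items()}
    return team_scores, practice_adoption

teams, adoption = scorecard(results)
print(teams)     # rough per-team compliance, enough to spot trends
print(adoption)  # per-practice adoption, shows org-wide gaps
```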
Closing thoughts
Once organizations have identified the desired standards they want to follow, they should start to invest in automation to adopt those standards, ideally during service creation. Automation will reduce the time it takes to build new services and the number of diverse solutions created at the company.
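One possible form of that automation is a "checklist as code" gate at service-creation time. The sketch below validates a hypothetical service manifest against the production-readiness basics discussed earlier; the field names are illustrative and would map to your own service catalog.

```python
# Production-readiness items a new service must declare (illustrative names).
REQUIRED_FIELDS = ["owner", "slos", "oncall_rotation", "runbook_url", "dashboard_url"]

def readiness_gaps(manifest: dict) -> list:
    """Return the checklist items the service manifest is still missing."""
    return [field for field in REQUIRED_FIELDS if not manifest.get(field)]

new_service = {
    "name": "recommendations",
    "owner": "team-discovery",
    "slos": [{"name": "availability", "target": 0.999}],
    "runbook_url": "",  # empty -> flagged as missing
}

gaps = readiness_gaps(new_service)
if gaps:
    raise SystemExit(f"service not production ready, missing: {gaps}")
```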
Common objections to adopting industry-leading operational and reliability standards
- It takes a long time to see results
Yes, that is true. Following best practices will not lead to results overnight. It will take time to see improvements in the metrics, and even longer to see the full benefits of the operational practices. But rest assured that investing time in those practices will improve service operations over time.
- Some practices are hard to measure
Often, the exact value of a metric is irrelevant; what matters is a ballpark understanding of where on the spectrum it sits. It doesn't matter whether it is 24% or 28%; the question is whether we believe it is around 20% or around 80%.
- Hard to compare teams
One observation from doing similar work in the past is that comparing results between teams is hard, especially for subjective metrics. Some teams will interpret a metric strictly, whereas others will interpret it more loosely. As with the point above, the exact values matter less than seeing overall trends in strengths and opportunities for improvement.
- There is no correct answer to what the right value should be
As mentioned before, the exact value is less important, and teams shouldn't spend time arguing whether a value should be 25% or 30%. The focus should be on introspection: analyzing the trends and interpreting the values. What can be learned from the assessment? Which efforts should be prioritized? Where should we focus more on speed, and where more on reliability?
- Those standards are not for my kind of company
It might be true that certain standards apply only to certain types of companies, but there are best practices for companies of all sizes and stages of maturity. If a standard as defined doesn't fully apply, try to understand its intention and adapt it to your company's setup.
Your opinion matters
We hope you enjoyed this episode on operational maturity. Please let us know if you have feedback on making our content more useful for you, or if there are other topics you'd like us to cover.
Do you have experience assessing and improving your company's operational maturity? We are curious to hear from you. Leave a comment on LinkedIn or drop us a note!