Three Site Reliability Engineering practices that help you improve service operations: Release Management (2/3)

José Velez
Founder & CEO
Rely.io

Pascal Bovet
Reliability Advisor & Former Head of SRE at Robinhood
Rely.io

August 29, 2023 · 10 min read

Changes are necessary to evolve our products and remain competitive. Every company wants to ship features and products as fast as possible, but companies need to balance speed and reliability. Shipping new features doesn't matter if, in the process, existing features break. Our users expect our products to remain reliable and new features to be added constantly, ideally without them noticing any side effects of the update.

How can we maintain high velocity while not compromising on the quality of our releases?

High velocity while maintaining reliability is hard

What are the secrets behind Silicon Valley companies that release multiple times a day, compared to legacy companies whose release cycles span a month or more?

As mentioned above, companies need to balance velocity against reliability. One aspect to consider is how a bad change impacts customers. Naturally, if the impact is high (for example, a company dealing with financial data, where an issue can cause a loss of millions of dollars), we want more rigor and greater certainty that the release will not break production.

But not all companies require that amount of rigor, yet many of them still cannot ship with high frequency. A common phenomenon among those companies is long release cycles.

Long release cycles have multiple side effects. If a feature misses a release, it is delayed by a month or more before it reaches users, so the pressure to make sure the feature works is naturally higher. Another side effect of big releases is that they encompass many changes, often to various parts of the system and across multiple user journeys, and they also roll in updates to dependencies. That wide variety of changes makes the release harder to test and increases the chance of causing issues.

Lastly, companies with long release cycles often have release testing windows in which a large variety of tests are performed, many of them manual. The low release frequency doesn't seem to justify automating those tedious testing tasks.

How do you recognize if your release process can be improved?

Seven common symptoms of a bad release qualification process

  1. Production issues shortly after the release
    The most apparent sign is regular production issues in the hours or days after a release.
  2. Frequent rollbacks or hotfixes
    Releases often introduce production issues that require fixing, either by rolling back the change or by providing a hotfix. This is even more visible when rollbacks or hotfixes involve a lot of manual work or are poorly documented.
  3. Bad visibility
    Missing crucial metrics about the impact of code changes means you'll rely on your engineers or customers to detect new bugs after the release. If the support volume regularly increases after a release, that's another symptom of flaws in the release process.
  4. Long deployment window
    The deployment window is long because the release is unpredictable. Hours or days are set aside to deploy a new release and carefully observe it for side effects. During those windows, the product might be unstable or unusable at times.
  5. Unclear release qualification procedures
    The release qualification and deployment process is unclear and only partly documented, or not documented at all. The documents are long, complicated, and frequently outdated by the time they are used.
  6. Lack of automation
    The release process includes a lot of manual work during the testing and release cycles. This increases the time the process takes and makes it more subjective and prone to errors.
  7. Lack of confidence in the release
    Employees from different groups regularly raise concerns about the quality of the release or of features within it, resulting in last-minute changes and discussions. The engineers doing the release and/or customer support are afraid of or stressed about upcoming releases.

How can this be prevented? Is there a better approach for release qualification?

Improving the release procedure

Let's break down the release cycle into three parts and discuss some concepts and best practices in those areas: 

  • Release qualification to gain confidence
  • Release gating to decide whether or not to move forward with a release
  • Release rollout to make the release available to your users in a controlled manner

Release qualification

The goal is to assess the quality of the release and ensure that there are no regressions. Usual steps in the release qualification process include unit, integration, and end-to-end testing. A dedicated team might run QA (quality assurance) depending on the company. 

While we don't want to dive deep into the details, let's briefly touch on two points - test coverage and manual testing. It is up to companies to find the right ambition level for test coverage. While we can't make any precise recommendations, we want to highlight that an extremely high level of test coverage might only be required if the criticality of the product justifies that amount of work. A general recommendation is that testing should be automated as far as possible. Manual tests often slow down the release process and are prone to errors.

One frequently run form of test is performance testing, which determines whether the recent release introduces any performance degradation: whether parts of the system have become significantly slower or their resource requirements have increased, potentially causing reduced functionality or availability issues. For load tests, the system is often exposed to organic or synthetic traffic to observe its characteristics under specified conditions. High-performing teams automate those tests against predefined quality and reliability indicators to ensure consistent service quality and meet customer expectations.
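As a concrete illustration, such an automated performance check can be as simple as comparing a latency percentile against an absolute limit and against the previous release. This is a minimal sketch; the function names, thresholds, and percentile math below are illustrative, not a prescription:

```python
# Hypothetical sketch of an automated performance gate. All names and
# thresholds are illustrative.

def p95(samples_ms: list[float]) -> float:
    """95th-percentile latency from a list of request durations (ms)."""
    ordered = sorted(samples_ms)
    index = max(0, int(len(ordered) * 0.95) - 1)
    return ordered[index]


def performance_gate(candidate_ms: list[float],
                     baseline_ms: list[float],
                     absolute_limit_ms: float = 500.0,
                     max_regression: float = 0.10) -> bool:
    """Pass only if the candidate stays under the absolute limit and
    within 10% of the baseline release's p95 latency."""
    cand, base = p95(candidate_ms), p95(baseline_ms)
    return cand <= absolute_limit_ms and cand <= base * (1 + max_regression)
```

In practice the samples would come from a load-test run against the release candidate, with the baseline taken from the currently deployed version under the same synthetic traffic.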

Two practices commonly used in testing to create predictable results are golden queries and replaying traffic. If the underlying data changes frequently and it is hard to compare between different test runs, it makes sense to define "golden queries," a series of queries run on a predetermined data set. Similarly, with complex systems, it can make sense to replay traffic or transactions from a certain period (e.g., the previous day) and compare the output to that of the previous version.
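A golden-query check can be sketched as a fixed list of queries whose results are compared between the current version and the candidate. The query names and the `run_current`/`run_candidate` callables below are hypothetical stand-ins for a real client:

```python
# Hypothetical sketch of a golden-query comparison. Query names and the
# runner callables are illustrative stand-ins for a real service client.
from typing import Callable

GOLDEN_QUERIES = ["top_customers_2023", "orders_by_region", "daily_revenue"]


def golden_query_diff(run_current: Callable[[str], object],
                      run_candidate: Callable[[str], object]) -> list[str]:
    """Return the golden queries whose candidate output differs from the
    current version's output on the predetermined data set."""
    return [query for query in GOLDEN_QUERIES
            if run_current(query) != run_candidate(query)]
```

Replaying a day's worth of traffic works the same way at a larger scale: capture requests, run them against both versions, and diff the outputs.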

Release gating

The second part of the release process is release gating. It answers the question of whether or not to promote the release candidate to production. The most crucial part of release gating is having a predefined set of criteria that are consulted for a go-ahead or abort decision.

Those criteria should include a set of clearly articulated and specific metrics; it is not enough to point towards a dashboard and instruct engineers to look at the dashboard and see if the metrics look reasonable. There should be clear expectations on which metrics should not be affected and for which metrics the release should see a difference.

If the system is complex, it will not be possible to consult every single metric. We highly recommend implementing and consulting metrics representative of the customer experience, starting coverage with the most critical parts of the system. A recommended approach is to also define clear objectives for your service based on those metrics, often referred to as Service Level Objectives (SLOs).
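To make this concrete, such a gate can be expressed as explicit thresholds consulted for a go/no-go decision, rather than a dashboard to eyeball. The metric names and SLO values below are purely illustrative:

```python
# Hypothetical sketch of an SLO-based release gate. Metric names and
# thresholds are illustrative.
SLO_GATES = {
    "availability": {"min": 0.999},      # success ratio over the test window
    "p95_latency_ms": {"max": 300.0},
    "error_rate": {"max": 0.001},
}


def release_gate(metrics: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (go/no-go, list of violated criteria)."""
    violations = []
    for name, bounds in SLO_GATES.items():
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: missing measurement")
        elif "min" in bounds and value < bounds["min"]:
            violations.append(f"{name}: {value} below {bounds['min']}")
        elif "max" in bounds and value > bounds["max"]:
            violations.append(f"{name}: {value} above {bounds['max']}")
    return (not violations, violations)
```

Note that a missing measurement counts as a violation: a gate that silently passes when a metric isn't reported defeats its purpose.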

Other criteria that could influence how to proceed with a release could be on the compliance side. Specific requirements (e.g., SOC2 or SOX) might need to be met to stay compliant.

Lastly, if the change is a major update or a new product, companies often run through a more extensive set of requirements, such as launch or production readiness checklists. Those can include further sign-offs from non-technical stakeholders like marketing or support.

Rollout

The last part of the release is the rollout to production. The rollout should be gradual to minimize the risk of a change negatively affecting your customers. If possible, you want to avoid all your customers receiving the updated version simultaneously. 

As we discussed in our previous article, having a set of metrics representative of our customers' experience is critical to detecting issues early and reliably. If we don't have high confidence in our alerting stack, we want to watch those metrics closely during the rollout to catch problems early and roll back if needed. 

A common practice used to minimize the impact of a bad rollout is canary analysis, basically testing your change on a small number of users for a fixed amount of time and then, based on that outcome, determining if the change is safe to propagate further.
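A minimal sketch of that canary decision, assuming we can count requests and errors for both the canary slice and the baseline (the tolerance value is illustrative):

```python
# Hypothetical sketch of a canary decision: compare the canary's error rate
# against the baseline's before widening the rollout. The tolerance is
# illustrative; real canary analysis often uses statistical tests instead.

def canary_decision(canary_errors: int, canary_requests: int,
                    baseline_errors: int, baseline_requests: int,
                    tolerance: float = 0.005) -> str:
    """'promote' if the canary's error rate is within `tolerance` of the
    baseline's error rate, otherwise 'rollback'."""
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests
    return "promote" if canary_rate <= baseline_rate + tolerance else "rollback"
```

The same comparison can be run on latency or saturation metrics; the key point is that the promote/rollback decision is made against predefined criteria, not gut feeling.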

Conclusion

The methods outlined above help teams increase confidence in their releases, ultimately allowing them to release more frequently. As much of the release process as possible should be automated and accessible in a single place.

Implementing these recommendations allows you to empower your engineering teams to own their releases, with guardrails in place to protect your customer experience. That makes it possible to ship faster with higher quality.

Common objections

Common objections around shipping faster include:

  • We need 100% test coverage
    It is doubtful that you need such high test coverage, and it is even less likely that you need it for all user journeys. Usually, the return on investment for such high coverage doesn't justify the opportunity cost.
  • Testing in production is bad
    Testing in production is bad only if it is the only testing done. Some issues can only be caught once a change hits production, so you should also invest in detecting production issues quickly. Think of it as a multi-layer approach with different safety nets.
  • My release is too big to follow those recommendations
    Try to break down your release into smaller parts and focus on improving the most critical user journeys first. You can apply less rigor and spend less time on smaller, less critical components.
  • We have a lot of manual tests. Releasing that frequently doesn’t work for us.
    In short, try automating as much as possible, or consider reducing manual tests for less crucial user journeys. Some tests will be harder to automate, but having manual testing in your release process will make it slower and more error-prone. Also, releasing more frequently means that the investments in automation will pay off sooner.
  • We cannot deploy during critical business hours
    Companies need to understand their risk profile. It might make sense to limit releases during especially busy times when an issue could cause a massive loss (think Black Friday for e-commerce), but this should be the exception, not the norm. Only very few companies, where an issue can put the firm at risk (e.g., companies in the financial space), can justify such restrictions to some extent. Even then, the goal should be to improve confidence in the release process to the point where you can release at almost any time. Further, release restrictions should be kept to a minimum and not affect every user journey, as different user journeys have different risk profiles.

Your opinion matters

We hope you enjoyed this episode on effective release management. Please let us know if you have any feedback on making our content more useful for you or any other topic you'd wish us to cover.

Do you have experience improving your company's release process, release gating, or rollout? We are curious to hear from you. Leave a comment on LinkedIn or drop us a note!
