Let's break down the release cycle into three parts and discuss some concepts and best practices in those areas:
- Release qualification to gain confidence
- Release gating to decide whether or not to move forward with a release
- Release roll out to make the release available to your users in a controlled manner
Let's start with release qualification. The goal is to assess the quality of the release and ensure that there are no regressions. Typical steps in the release qualification process include unit, integration, and end-to-end testing. Depending on the company, a dedicated QA (quality assurance) team might also be involved.
While we don't want to dive deep into the details, let's briefly touch on two points: test coverage and manual testing. Each company has to find the right ambition level for test coverage. While we can't make a precise recommendation, we want to highlight that an extremely high level of test coverage is usually only warranted if the criticality of the product justifies that amount of work. A general recommendation is to automate testing as far as possible: manual tests slow down the release process and are prone to errors.
One frequently run form of testing is performance testing, whose goal is to detect whether parts of the system have become significantly slower or their resource requirements have increased with the recent release, potentially causing reduced functionality or availability issues. The system is often exposed to organic or synthetic traffic in load tests to observe its characteristics under specified conditions. High-performing teams automate those tests against predefined quality and reliability indicators to ensure consistent service quality and meet customer expectations.
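Such an automated check can be quite small. Here is a minimal sketch, assuming we have already collected per-request latencies (in milliseconds) for the current release and the candidate; the function names and the 10% threshold are illustrative, not from the text:

```python
def p95(samples):
    """95th-percentile latency of a list of samples."""
    ordered = sorted(samples)
    return ordered[int(0.95 * (len(ordered) - 1))]

def has_perf_regression(baseline_ms, candidate_ms, max_increase=0.10):
    """Flag the candidate if its p95 latency grew by more than max_increase."""
    return p95(candidate_ms) > p95(baseline_ms) * (1 + max_increase)

baseline = [100, 102, 98, 101, 99, 103, 97, 100]
candidate = [100, 140, 135, 138, 142, 139, 137, 141]
print(has_perf_regression(baseline, candidate))  # clear slowdown → True
```

In practice the thresholds would be derived from the quality and reliability indicators mentioned above rather than hard-coded.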
Two practices commonly used in testing to create predictable results are golden queries and replaying traffic. If the underlying data changes frequently and it is hard to compare between different test runs, it makes sense to define "golden queries," a series of queries run on a predetermined data set. Similarly, with complex systems, it can make sense to replay traffic or transactions from a certain period (e.g., the previous day) and compare the output to that of the previous version.
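A golden-query check can be as simple as running the fixed query set against both versions and diffing the results. In this sketch, the query executors are hypothetical stand-ins (here stubbed with dictionaries) for however your system actually runs a query:

```python
# Fixed query set run against a predetermined data set.
GOLDEN_QUERIES = [
    "SELECT count(*) FROM orders",
    "SELECT sum(total) FROM orders",
]

def diff_golden_queries(run_query_old, run_query_new, queries=GOLDEN_QUERIES):
    """Return the queries whose results differ between the two versions."""
    return [q for q in queries if run_query_old(q) != run_query_new(q)]

# Usage with stubbed executors standing in for the two versions:
old = {"SELECT count(*) FROM orders": 42, "SELECT sum(total) FROM orders": 900}
new = {"SELECT count(*) FROM orders": 42, "SELECT sum(total) FROM orders": 905}
print(diff_golden_queries(old.get, new.get))
```

Any non-empty diff is then investigated before the release proceeds; the same comparison shape works for replayed traffic.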
The second part of the release process is release gating. It answers the question of whether or not the release candidate should go to production. The most crucial part of release gating is having a predefined set of criteria that are consulted for a go-ahead or abort decision.
Those criteria should include a set of clearly articulated and specific metrics; it is not enough to point engineers at a dashboard and ask whether the metrics look reasonable. There should be clear expectations about which metrics should remain unaffected and for which metrics the release should show a difference.
If the system is complex, it will not be possible to consult every single metric. We highly recommend implementing and consulting metrics representative of the customer experience and starting the coverage with the most critical parts of the system. A recommended approach is to also define clear objectives for your service based on those metrics, often referred to as Service Level Objectives (SLOs).
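Such a gate can be expressed as data rather than tribal knowledge. This sketch assumes hypothetical metric names and SLO thresholds; the metric values would normally come from your monitoring system, and are passed in directly here for illustration:

```python
# Each criterion: metric name, SLO threshold, and whether the value
# must stay at or above ("min") or at or below ("max") the threshold.
GATE_CRITERIA = [
    ("availability", 0.999, "min"),
    ("p99_latency_ms", 500, "max"),
    ("error_rate", 0.001, "max"),
]

def gate_decision(metrics, criteria=GATE_CRITERIA):
    """Return (go_ahead, violations) for a release candidate."""
    violations = []
    for name, threshold, direction in criteria:
        value = metrics[name]
        ok = value >= threshold if direction == "min" else value <= threshold
        if not ok:
            violations.append((name, value, threshold))
    return (not violations, violations)

go, why = gate_decision(
    {"availability": 0.9995, "p99_latency_ms": 620, "error_rate": 0.0004}
)
print(go, why)  # aborts: p99 latency exceeds its objective
```

The point is that the go/abort decision becomes mechanical and auditable instead of depending on someone eyeballing a dashboard.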
Other criteria that influence how to proceed with a release can come from the compliance side. Specific requirements (e.g., SOC 2 or SOX) might need to be met to stay compliant.
Lastly, if the change is a major update or a new product, companies often run through a more extensive set of requirements like launch or production readiness checklists. Those can include further sign-offs from non-technical stakeholders like marketing or support.
The last part of the release is the rollout to production. The rollout should be gradual to minimize the risk of a change negatively affecting your customers. If possible, you want to avoid all your customers receiving the updated version simultaneously.
As we discussed in our previous article, having a set of metrics representative of our customers' experience is critical to detecting issues early and reliably. If we don't have high confidence in our alerting stack, we want to watch those metrics closely during the rollout to catch problems early and roll back if needed.
A common practice used to minimize the impact of a bad rollout is canary analysis: testing the change on a small subset of users for a fixed amount of time and then, based on the outcome, determining whether the change is safe to propagate further.
The methods outlined above help teams increase confidence in their releases, ultimately allowing them to release more frequently. As much of the release process as possible should be automated and accessible in a single place.
Implementing these recommendations empowers your engineering teams to own their releases, with guardrails in place to protect your customer experience, making it possible to ship faster with higher quality.