The Reliability Champion
Ensuring reliability is essential to any organisation's success. Achieving it, however, is a complex task: there is no single way to do it, though there are many proven ways to get started.
Some companies have dedicated people, such as SREs or DevOps engineers, to manage reliability, while others rely on engineering leaders or senior developers to lead the charge. Regardless of who takes on the role, the main objective is to establish and maintain good practices that ensure reliability.
Reliability Champions (RCs) need a way to keep track of their reliability coverage, generate insights that are easy to communicate, and know where to invest their efforts next. The right tools can make a significant difference in achieving this goal. With Rely, Reliability Champions don’t need to worry about miscommunication, manual reports, or learning how to navigate a dozen different platforms: all the insights and literature exist in a single place.
Applying best practices
The SRE field is fast-growing, both in terms of market and technology. Every day, there are new best practices and recommendations coming from the most advanced practitioners and tech providers. Keeping up with everything that’s happening in the field and picking the practices that make the most sense for their own companies is the daunting task Reliability Champions undertake.
Rely enables Reliability Champions to:
- Have a single source of truth fully integrated with their entire stack thanks to plug-and-play integrations.
- Get access to the recommendations that best apply to their own use cases and available telemetry through Rely’s curated Reliability Templates, which have been developed by field experts.
- Set the standards to be applied across the company and track adoption across all teams through the Product Catalog.
Setting and tracking reliability targets
With best practices in hand, Reliability Champions need to work with business owners (PMs, Executives, etc.) and engineering teams to agree on and strive for reliability targets that ensure the company is meeting customer expectations. Leading this conversation can be challenging as reliability is viewed through two different lenses:
- To business owners, reliability is a matter of ensuring the quality of user journeys and the reliability of the user actions that make them up.
- To engineering teams, reliability is a matter of ensuring the quality of their services and the reliability of the different service operations they’re responsible for.
Bridging the gap between the two viewpoints is a necessity, but it can involve multiple back-and-forths, miscommunication, and trial and error. This leads to frustration, uncontrollable timelines and increased costs, often slowing or even halting reliability investments entirely.
Rely bridges the gap between business owners and engineering teams thanks to:
- Data-driven targets: Setting reliability targets is a data-driven process, based on both industry standards and past behaviour thanks to the Historical Performance Report.
- Standardised reliability process: The product catalog gives all users access to a standardised and comprehensive inventory thanks to both a service catalog and a user journeys catalog.
- Transparent communication for team collaboration: Reviewing and investigating the performance of services and user journeys does not happen in silos. Thanks to the business map, which articulates how services support user journeys, everyone can view and assess the performance of both services and user journeys through the most important lens: how does it actually impact end-users?
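To make the idea of a data-driven target concrete, here is a minimal sketch of how a starting target could be derived from past behaviour and capped by an industry-style ceiling. It is not Rely's method: the function name `suggest_slo_target`, the percentile rule and the default values are all assumptions chosen for illustration.

```python
import math


def suggest_slo_target(daily_success_ratios, percentile=0.05, ceiling=0.999):
    """Suggest a starting target that the service already met on most past days.

    All parameters are hypothetical: `percentile` is the share of worst days
    to disregard, `ceiling` stands in for an industry-standard upper bound.
    """
    if not daily_success_ratios:
        raise ValueError("need at least one day of history")
    ordered = sorted(daily_success_ratios)
    # Nearest-rank percentile: the ratio that `percentile` of days fell at or below.
    rank = max(1, math.ceil(percentile * len(ordered)))
    historical_floor = ordered[rank - 1]
    # Never suggest a target stricter than the agreed ceiling.
    return min(historical_floor, ceiling)


history = [0.9990, 0.9995, 0.9980, 0.9999, 0.9992,
           0.9970, 0.9998, 0.9996, 0.9993, 0.9991]
suggested = suggest_slo_target(history)
```

A target obtained this way is only a starting point for the conversation between business owners and engineering teams, not a replacement for it.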
Alerting when it matters
With more telemetry available across increasingly complex technological stacks, the temptation to set alerts on all potential signals of failure is high. This is further aggravated by the pressure put on teams to not miss something that could lead to a severe outage.
As a consequence, it is not rare to see a single real-world failure trigger dozens of alerts across different tools, for different teams, and via multiple channels. This contributes to alert fatigue and worse overall performance.
Reliability Champions are tasked with helping engineering teams set up a monitoring and alerting strategy that balances coverage and false positives, protecting both the business from failures and the teams from alert fatigue. To support them in this task, Rely offers out-of-the-box alerting that complies with best practices:
- Meaningful and manageable alert volumes: Alerts are computed based on SLOs, guaranteeing someone is only alerted (or worse, woken up) if the end-users, and thus the business, are actually being impacted.
- Contextualised and actionable alerts: Thanks to the information pulled from the product catalog and the business map, all alerts are contextualised with:
  - The impact it is having on end-users (e.g. which user journeys are involved)
  - The nature of the problem (e.g. outage, degrading performance, slow experience)
- Centralised information: The product catalog also makes alerts more actionable, since links to relevant information (e.g. runbooks, internal documentation, observability dashboards) are available in a single place. This ensures time-pressed incident managers have all the resources needed to restore service readily available.
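A common way to compute alerts from SLOs is to look at the error budget burn rate, i.e. how fast the allowed share of failures is being used up. The sketch below illustrates that general technique and is not a description of how Rely computes its alerts. The function names, the two-window check and the threshold of 14.4 (a value commonly used for a one-hour window on a 30-day SLO) are assumptions.

```python
def burn_rate(good_events, total_events, slo_target):
    """Return how fast the error budget is consumed (1.0 means exactly on budget)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic, so no budget is being burned
    error_rate = 1 - good_events / total_events
    error_budget = 1 - slo_target
    return error_rate / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only when both windows burn the budget faster than `threshold`.

    Each window is a (good_events, total_events) tuple. Requiring the short
    window to agree keeps an incident that is already over from paging anyone.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)


# Hypothetical numbers: 2% of requests failing against a 99.9% target.
ongoing = should_page((9800, 10000), (980, 1000), slo_target=0.999)
# Same hour of failures, but the last few minutes are healthy again.
recovered = should_page((9800, 10000), (1000, 1000), slo_target=0.999)
```

Because the check is expressed in terms of the error budget rather than raw signals, a noisy but harmless metric never pages anyone unless end-users are affected.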
Creating and nurturing a reliability culture
Reliability is a never-ending endeavour, and Reliability Champions are tasked with promoting it, tracking its progress and reporting on its achievements, among other things.
Once again, this can result in multiple back-and-forths with teams across all business units to ask for KPIs, review progress, and push for increased efforts. This can quickly become time-consuming and frustrating, but it is necessary to report meaningful progress to stakeholders and, most importantly, to improve the end-user experience.
With Rely, Reliability Champions are assisted in fostering bottom-up adoption and stronger enthusiasm around reliability efforts:
- Performance leaderboards: Achievements are celebrated thanks to the performance leaderboards, where top-performing teams are put forward.
- Rank achievements: Teams are continually pushed to further adopt the standards set by the Reliability Champions thanks to Rank Achievements that show and celebrate milestones of progress, instead of putting them in a “reliability tunnel” until they achieve full compliance and performance results.
- Reporting views: Reliability is no longer confined to being “one more KPI for the OKRs review”. The Reporting Views allow reliability to be ingrained in daily operations and rituals (e.g. planning sprints that take remaining error budgets into account, or writing post-mortems based on end-user impact).
Making informed decisions based on business impact
When it comes to reliability and its impact on designing roadmaps or setting engineering priorities, companies navigate a few pitfalls:
- It can be tempting to always strive for more ambitious reliability targets, but this means delaying roadmaps and increasing investment (whether as engineering hours or as budget increases, because collecting and storing observability data is not free).
- It can also be tempting to celebrate far exceeding targets (e.g. 100% availability) while overlooking the unnecessary cost this bears (redundancies, duplications, etc.).
As a result of the former, trust is often compromised between product and engineering teams (e.g. “feature deliveries are never on time” and “product expectations are unrealistic”). As for the latter, it becomes challenging to gain and, most importantly, maintain sponsorship from leadership, who rightfully expect results from reliability investments, be it in time or budget.
It comes down to the Reliability Champions to promote grounded conversations based on measurable facts. Rely helps this process by providing automated reports:
- Reliability report: Thanks to information from the product catalog and Business Map, Rely provides an out-of-the-box reliability report that shows how the user experience improved over the past cycles (e.g., quarterly). It also features the rank achievements, showing in one report the increased performance and increased maturity of the services and their teams.
- Tech & business report: The tech & business report clearly outlines which services are below, at, or above target and by how much. For each service, a stance is then recommended: “halt new features and reduce tech debt until performance is on target”, “keep current balance between tech debt and features”, or “increase feature shipping and encourage innovation that would tap into the unused error budget”.
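The three stances above map naturally onto a comparison between measured performance and target. The following is a minimal sketch of that mapping, not Rely's actual logic: the function name `recommend_stance` and the `tolerance` band that defines "at target" are hypothetical.

```python
HALT = "halt new features and reduce tech debt until performance is on target"
KEEP = "keep current balance between tech debt and features"
SHIP = ("increase feature shipping and encourage innovation that would "
        "tap into the unused error budget")


def recommend_stance(measured, target, tolerance=0.0005):
    """Map a service's measured reliability against its target to a stance.

    `tolerance` is an assumed band around the target within which the
    service counts as being at target.
    """
    gap = measured - target
    if gap < -tolerance:
        return HALT  # below target: the error budget is overspent
    if gap > tolerance:
        return SHIP  # above target: there is unused error budget
    return KEEP      # at target: keep the current balance


# Hypothetical services measured against a 99.9% target.
stances = {
    "checkout": recommend_stance(0.9950, 0.999),
    "search": recommend_stance(0.9991, 0.999),
    "catalog": recommend_stance(0.9999, 0.999),
}
```

Treating a service that far exceeds its target as a signal to ship faster, rather than as a success in itself, is what keeps over-investment in reliability visible.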
By leveraging Rely's features, Reliability Champions can drive a culture of reliability throughout the organisation, ensuring that decisions are made based on measurable business impact and fostering trust between product and engineering teams.
Try out Rely.io
- If you want to know more or see Rely in action, book a demo with us.
- If you want to discuss industry best practices, see how your peers implement them, and contribute to our product direction, join our Slack Community.