How to Overcome Silos between Product and Engineering Teams to achieve Reliability

André Cavalheiro
Customer Reliability Engineer, Rely.io
June 20, 2023
8 min read

Reliability from different lenses

Product and engineering teams share the goal of providing a superior user experience through their apps, but they approach this from two distinct perspectives.

For product teams, this objective translates into the careful design of user journeys, the series of actions taken by a user to accomplish a particular goal. Each action in this journey is expected to have a dependable, predictable, and consistent response ensuring a reliable and seamless experience.

On the other hand, engineering teams are entrusted with building and maintaining the infrastructure and software that support an application. They are tasked with keeping a multitude of services running efficiently, each of which exposes several operations. For instance, an API hosts several endpoints that perform different functions. The reliability of each endpoint must be evaluated separately to ensure the overall service's reliability.

In essence, product teams view reliability as the assurance of quality in user journeys and the actions within them, while engineering teams perceive it as guaranteeing that services run smoothly and perform their operations accurately. These perspectives converge because every user action triggers one or more service operations.

Entity diagram that illustrates how product and engineering teams view and interact with the aspects of reliability.

Ultimately, reliability is about ensuring the quality of the underlying engineering systems. But viewing it through these dual lenses facilitates a clearer connection between business responsibilities and technical considerations, enhancing coordination, observability, and communication.
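To make the relationship between these entities concrete, here is a minimal sketch of how journeys, actions, services, and operations could be modeled. The class names, service names, and endpoints are illustrative assumptions, not Rely.io's actual data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServiceOperation:
    """One operation exposed by a service, e.g. a single API endpoint."""
    service: str
    name: str


@dataclass
class UserAction:
    """One step of a user journey and the operations it triggers."""
    name: str
    triggers: list[ServiceOperation] = field(default_factory=list)


@dataclass
class UserJourney:
    """An ordered series of user actions toward one goal."""
    name: str
    actions: list[UserAction] = field(default_factory=list)

    def operations(self) -> set[ServiceOperation]:
        """Every service operation this journey depends on."""
        return {op for action in self.actions for op in action.triggers}


# Hypothetical example data, not taken from a real catalog.
get_form = ServiceOperation("web-frontend", "GET /subscribe")
charge = ServiceOperation("payments-api", "POST /charges")

journey = UserJourney(
    "Subscription Purchase",
    [
        UserAction("Access the purchase form", [get_form]),
        UserAction("Submit the form", [charge]),
    ],
)

# The product view (journey) resolves to the engineering view (services).
services = sorted({op.service for op in journey.operations()})
print(services)  # ['payments-api', 'web-frontend']
```

The link from `UserAction` to `ServiceOperation` is the point where the two perspectives meet: following it in one direction answers product questions, and in the other, engineering ones.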

The Product Catalog

The Product Catalog serves as the bridge between these two views, providing a standardized and comprehensive inventory of services, user journeys, the relationships between them, and their health status. It allows for transparent communication and collaboration, making reliability initiatives a collective effort. Consequently, the focus remains on how performance impacts end-users, leading to informed decisions and strategies to improve reliability.

Product Managers receiving reports about users experiencing issues can use the product catalog to access all the required information to assess a problem on their own:

  • Where's the issue happening?
    • Open the user journey (UJ) catalog and quickly identify the non-compliant journeys.
    • Open the UJ page and identify the faulty steps in those journeys.
  • How bad is it?
    • See the success ratio for every action over the time period of your choosing.
  • And, perhaps most importantly, who can help solve it?
    • See who owns the services that are compromising this journey.

The "Subscription Purchase" user-journey page, showing its metadata, links to external resources, user action list, and health status of related service operations
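As a rough illustration of the "how bad is it?" question, the sketch below computes a success ratio per action and flags the steps that miss their target. The `ActionStats` type, the counts, and the 99% target are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ActionStats:
    action: str
    successes: int
    total: int
    target: float  # e.g. 0.99 means 99% of attempts should succeed

    @property
    def success_ratio(self) -> float:
        # Assumption: an action with no traffic counts as fully successful.
        return self.successes / self.total if self.total else 1.0

    @property
    def compliant(self) -> bool:
        return self.success_ratio >= self.target


def faulty_steps(stats: list[ActionStats]) -> list[str]:
    """Names of the actions whose success ratio is below target."""
    return [s.action for s in stats if not s.compliant]


# Hypothetical counts for one time window.
stats = [
    ActionStats("Access the purchase form", 4990, 5000, 0.99),
    ActionStats("Submit the form", 1900, 2000, 0.99),
    ActionStats("Redirect to the summary page", 1899, 1900, 0.99),
]

print(faulty_steps(stats))  # ['Submit the form']
```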

Engineers navigating bugs, dependencies, owners, codebases, and observability tools, whether in their day-to-day work or during chaotic incident handling, can use the product catalog as their compass to make sense of problems and find the information they need.

  • Has a release impacted service performance?
    • Open the Service Catalog, pick a fitting time window, and see whether any service has depleted its error budget.
  • What exactly is the nature of the problem?
    • Open the Service page and check which operations are underperforming, and in what way. Have they become unavailable? Have they become slow?
  • What is the impact of this? Who on the product team should I give a heads-up to?
    • Identify which User Journeys are supported by the faulty Service Operations. How business-critical are they?
    • Communicate the nature of the problem to the owners of the User Journeys or through the documented contact links.
  • How can I debug the problem?
    • Use the resources attached to your service, all accessible via the service page: logs, traces, runbooks, cloud provider tools, documentation, and so on.

Service Catalog page displaying a list of services and their respective health statuses
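The impact question above amounts to a reverse lookup from faulty service operations to the journeys that rely on them. A minimal sketch follows, assuming a hypothetical in-memory catalog; the journey owners, services, and endpoints are invented for illustration.

```python
# Hypothetical catalog: journey -> owner and the (service, operation)
# pairs it depends on. None of these names come from a real system.
CATALOG = {
    "Subscription Purchase": {
        "owner": "growth-team",
        "operations": {
            ("web-frontend", "GET /subscribe"),
            ("payments-api", "POST /charges"),
        },
    },
    "Content Streaming": {
        "owner": "playback-team",
        "operations": {("cdn-gateway", "GET /stream")},
    },
}


def impacted_journeys(faulty_ops, catalog=CATALOG):
    """Return {journey: owner} for journeys relying on any faulty operation."""
    faulty = set(faulty_ops)
    return {
        name: entry["owner"]
        for name, entry in catalog.items()
        if entry["operations"] & faulty  # non-empty intersection
    }


impact = impacted_journeys([("payments-api", "POST /charges")])
print(impact)  # {'Subscription Purchase': 'growth-team'}
```

The returned owners are the people to notify; an empty result suggests the faulty operation does not sit on any catalogued journey.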

For SREs striving to improve reliability, the product catalog serves as a comprehensive resource and reference guide:

  • Where are the gaps in our current monitoring and observability setup?
    • Check the SLO coverage rate per service and per user journey to efficiently identify blind spots and ensure adequate monitoring across all aspects of the platform.
  • Where can system performance be optimized further?
    • By identifying the processes that perform worst against their reliability or latency targets, SREs can pinpoint bottlenecks. The product catalog enables them to provide actionable recommendations on how to improve specific aspects of the platform.
  • How can we ensure smooth collaboration between different teams?
    • The product catalog acts as living documentation, providing direct insights into reliability. This aids in onboarding new team members, clarifying questions during quality assurance processes, and aligning with user analytics. It eases cross-team alignment, promoting a more cohesive, informed workforce.
  • How can we standardize our processes for deploying new services?
    • The product catalog, with its wealth of information, allows SREs to create streamlined processes and define clear requirements for deploying new services. This standardization ensures consistent performance across services and sets a clear expectation for new deployments.
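One way to read "SLO coverage rate" is the fraction of a service's operations that have at least one SLO defined. The sketch below works under that assumption; the function name, services, and endpoints are hypothetical.

```python
def slo_coverage(operations_by_service, operations_with_slo):
    """Fraction of each service's operations that have at least one SLO."""
    covered = set(operations_with_slo)
    coverage = {}
    for service, operations in operations_by_service.items():
        if not operations:
            continue  # nothing to cover, so no rate to report
        with_slo = sum(1 for op in operations if (service, op) in covered)
        coverage[service] = with_slo / len(operations)
    return coverage


# Hypothetical inventory and SLO list.
OPERATIONS = {
    "payments-api": ["POST /charges", "GET /charges/{id}", "POST /refunds", "GET /health"],
    "web-frontend": ["GET /subscribe", "POST /subscribe"],
}
SLOS = [
    ("payments-api", "POST /charges"),
    ("web-frontend", "GET /subscribe"),
    ("web-frontend", "POST /subscribe"),
]

coverage = slo_coverage(OPERATIONS, SLOS)
# Services with incomplete coverage, least covered first.
blind_spots = sorted((rate, service) for service, rate in coverage.items() if rate < 1.0)
print(blind_spots)  # [(0.25, 'payments-api')]
```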

Whether you're a Product Manager, a Business Leader, or an Engineer, the product catalog offers a unique, data-driven perspective. It's a one-stop solution for investigating issues, understanding user journeys, and making informed decisions. 

Steering Reliability Initiatives: A Practical Example

Let’s now transition to a practical application of this knowledge and use the product catalog as the backbone of a structured reliability initiative.

Imagine you're an SRE at a prominent streaming company. Your mission? Launch a reliability drive across the organization. A streaming company's core objectives revolve around two pivotal user journeys: 'Subscription Purchase', which should ensure a frictionless subscription process, and 'Content Streaming', which should guarantee smooth content delivery. In this context, let's zero in on the 'Subscription Purchase' user journey. This journey comprises several steps:

  1. Accessing the subscription purchase form
  2. Submitting it
  3. Being redirected to the purchase summary page.

As we know by now, complex technical machinery may lie beneath these seemingly simple interactions. Let us visualize the dependencies between actions and services, and among the services themselves, in the diagram below.

Blue-filled rectangles represent User Journeys and Actions. Circles within the image denote Service Operations, which are the touch-points of the Services. These circles are placed within different layers that symbolize the Services. The color of each circle indicates the user action that triggers the corresponding Service Operation.

Identifying these dependencies can be tricky. Regardless of the method, the outcome is a comprehensive understanding of which operations to monitor to ensure the reliability of the entire user journey. The final leap? Implementing Service Level Indicators (SLIs) to track these operations and setting targets for them to meet, their Service Level Objectives (SLOs), based on the desired user experience.
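If the direct calls between components are already known (for example from tracing data), one possible method is a breadth-first traversal that collects every operation reachable from a user action. This is only a sketch of that idea: the `CALLS` map and all node names are hypothetical, and real dependency discovery is usually messier.

```python
from collections import deque

# Hypothetical map of direct calls: node -> operations it invokes.
CALLS = {
    "action:Submit the form": ["web-frontend:POST /subscribe"],
    "web-frontend:POST /subscribe": ["subscriptions-api:POST /subscriptions"],
    "subscriptions-api:POST /subscriptions": [
        "payments-api:POST /charges",
        "users-api:GET /users/{id}",
    ],
    "payments-api:POST /charges": [],
    "users-api:GET /users/{id}": [],
}


def operations_to_monitor(start, calls=CALLS):
    """All operations reachable from `start`, directly or transitively."""
    seen = set()
    queue = deque(calls.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited; also guards against call cycles
        seen.add(node)
        queue.extend(calls.get(node, []))
    return seen


ops = operations_to_monitor("action:Submit the form")
for op in sorted(ops):
    print(op)
```

Each operation in the result is a candidate for its own SLI and SLO.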

Establishing SLOs at this stage enables a more streamlined configuration process. It crystallizes what constitutes a business-impacting problem and provides clarity to all stakeholders involved.

  • Referencing the diagram presented earlier, it's clear that when a user submits their payment, six service operations are triggered (the ones colored in red).
  • With the help of the SLO wizard's previews, we can know the weekly and monthly volumes of this user journey.
  • Given this volume, the challenge becomes to determine what constitutes a problem that requires action. Is it when 20, 10, or even just 1 user struggles with payments in a month? This pivotal decision point calls for a comprehensive team discussion. Our platform leverages industry benchmarks and norms for various use-cases to facilitate these conversations and guide teams towards an aligned definition of what constitutes a satisfactory user experience.
  • As a basic guideline, this information can inform the SLO target values of the six underlying service operations.
  • These targets, in conjunction with the event volume, determine the number of errors an SLO's error budget can absorb. For instance, with an event volume of 10,000 and a reliability target of 99.9%, your error budget will accommodate up to 10 errors. This means each error will consume 10% of your error budget.
  • By using historical data to preview the SLO, you can retroactively assess if the target you've set is realistic. More importantly, by setting SLOs for all operations, you can swiftly pinpoint the operations that are weighing your reliability down.
  • Fortunately, this process is more straightforward than it sounds. As soon as you add a data source, Rely ingests your telemetry data and matches it against a curated list of reliability templates developed by field experts to generate out-of-the-box recommendations. This allows you to create dozens of SLOs in minutes with a few clicks.

First step of the SLO wizard from the product catalog, enabling component-specific recommendations, metric visualization over time, and SLO value testing with performance previews.
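The error budget arithmetic is simple enough to check by hand: the budget is the event volume multiplied by the allowed failure rate (100% minus the target). The sketch below uses exact fractions to avoid floating-point rounding; the function names are illustrative and are not part of any Rely.io API.

```python
from fractions import Fraction


def error_budget(event_volume: int, target_percent: str) -> Fraction:
    """Number of failed events the SLO tolerates in the window."""
    allowed_failure_rate = 1 - Fraction(target_percent) / 100
    return event_volume * allowed_failure_rate


def budget_consumed(errors: int, event_volume: int, target_percent: str) -> Fraction:
    """Share of the error budget used up by `errors` failed events."""
    budget = error_budget(event_volume, target_percent)
    if budget == 0:
        raise ValueError("a 100% target leaves no error budget")
    return Fraction(errors) / budget


print(error_budget(10_000, "99.9"))               # 10
print(error_budget(1_000, "99.9"))                # 1
print(float(budget_consumed(1, 10_000, "99.9")))  # 0.1
```

The same target yields very different budgets at different volumes, which is why the event volume previews matter when choosing a target.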

With SLOs in place, the product catalog shows its true value, enabling stakeholders to identify the health of both services and user journeys at a glance. It provides a dedicated health dashboard for each of these entities, offering comprehensive insights into the functionality and performance of your product. 

Try out Rely.io

  • If you want to know more or see Rely in action, book a demo with us.
  • If you want to discuss industry best practices, learn how your peers implement them, and contribute to our product direction, join our Slack Community.
