
FAQs - Frequently Asked Questions

Everything you need to know is here at your fingertips. Ask questions, browse for answers and submit requests.

Reliability intelligence

What should I use Rely.io for?

Engineering teams use Rely.io (1) to define acceptable reliability levels (objectives) and (2) to measure the actual quality of service delivered against those objectives. With this setup, teams can make data-driven decisions faster and with higher ROI, create powerful automations for alerting and release rollbacks, promote a healthier and more efficient incident management culture, and give the whole company, up to the executive level, visibility into the reliability of their products.

With Rely.io, you'll reduce the number of reliability issues and degradations by aggregating fragmented data, removing noise, and aligning teams together towards the same reliability goals.

Please contact us if you want to see a demo of the different use cases we support.

What is Rely.io?

Rely.io is a mission control center for your business reliability. We help companies measure and improve the reliability of their products and services.

The platform productizes industry-leading Site Reliability Engineering (SRE) practices that use Service-Level Objectives as the foundation of the reliability stack. SRE leaders at Facebook, Google, Netflix, Zalando, and others have used this approach for years, relying on custom technology built in-house. Rely.io is making it available to everyone; you just need to integrate with the tools you're already using for monitoring and observability.

Can you help with onboarding?

Sure thing! We provide a dedicated onboarding service for all teams joining the beta and a white-glove onboarding service for the Enterprise plan. In the latter, we will also provide consulting advice from field experts to help you master your reliability strategy.

How much does it cost?

Rely.io is currently in Beta and can be used for free. Check out our pricing page here.

How can I contact Rely.io?

You can contact us here or by email at support@rely.io.

How do I get started?

You can request a free account here. We will then get in touch and share a comprehensive SLO Adoption Guide with you.

How long does the onboarding take?

For self-service onboardings, integrating your preferred data source and creating your first SLOs should take you less than an hour. Assisted onboardings, which include scoping, configuration, and training, take two weeks on average.

Can I use Rely.io on-premises?

No. At the moment we are only offering Rely.io as a software-as-a-service web application, hosted by our team on secure AWS infrastructure.

Platform

Who are SLOs for?

SLOs are for engineering teams and organizations wanting to use existing streams of monitoring & observability data to their advantage.

SLO adoption is generally promoted by Site Reliability, DevOps or Infrastructure engineers (SLO Champions), with sponsorship from engineering leadership. The SLO Champions also oversee the creation of internal standards and policies for SLO adoption, the technical implementation, and advocacy across development teams.

Although it is less common, product managers can also implement SLOs thanks to Rely.io's “no-query no-code” approach.

The benefits provided by SLO adoption are leveraged by many stakeholders, such as the CTO, engineering managers, product, finance, and more.

What is Reliability Intelligence?

Reliability Intelligence is the engineering practice of using Service-Level Objectives (SLOs) at scale to continuously improve the quality of service. When you track reliability indicators against business goals, they become easy for anyone in the company to discuss and can be used for intelligent automations. The goal is to derive actionable insights from data you are already collecting from product usage, services, and infrastructure.

What is a Service-Level Indicator (SLI)?

From Google’s SRE Book:

An SLI is a service level indicator — a carefully defined quantitative measure of some aspect of the level of service that is provided.

This quantitative measure is generally a ratio between two numbers:

$$\text{SLI} = \frac{\text{Good Events}}{\text{Valid Events}}$$

In reliability engineering, the events used to compute this ratio are software and IT infrastructure metrics collected by monitoring tools, such as Prometheus, New Relic or Datadog.

In a nutshell, an SLI converts a stream of raw and complex metrics about a software application into a simple but informative health score, expressed as a percentage. For a given time window, the SLI starts at 100% and decreases as the number of bad events increases.

Here’s a practical example of an availability SLI of a REST API:

SLI Type: Availability

SLI Specification: Proportion of requests to the REST API that did not return a 5xx

SLI Implementation:

Good events: Total count of all non-5xx events logged for calls to the REST API

Valid events: Total count of all logged events for calls to the REST API
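The availability SLI above can be sketched in a few lines of Python. This is only an illustration of the good/valid ratio, not Rely.io's implementation: the function name `availability_sli` and the list of status codes are hypothetical, and in practice the counts would come from your monitoring tool.

```python
from typing import Optional, Sequence


def availability_sli(status_codes: Sequence[int]) -> Optional[float]:
    """Return good events / valid events, or None if there were no valid events."""
    valid = len(status_codes)  # every logged call counts as a valid event
    if valid == 0:
        return None  # the ratio is undefined without traffic
    # A good event is any response that is not a 5xx
    good = sum(1 for code in status_codes if not 500 <= code <= 599)
    return good / valid


# Hypothetical sample: 10 logged calls, 2 of which returned a 5xx
codes = [200, 200, 404, 500, 201, 503, 200, 200, 302, 200]
sli = availability_sli(codes)
print(f"{sli:.1%}")  # 80.0%
```

Note that the 404 counts as a good event here, because the SLI specification only excludes 5xx responses.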

What is an error budget?

An SLO targets a certain percentage of good events that's smaller than 100%. The error budget represents how much your SLO allows for failure or bad events.

If your service logs 1,000 valid events per week and your weekly SLO for that service is 95%, you have an error budget of 50 bad events, or 5%.

At any point during an SLO's time window, the remaining error budget and the error budget burn rate can be used to make decisions (e.g. improve reliability vs. ship new features) and to create automatic alerts (e.g. if the forecasted error budget at the end of the time window is negative, create an incident for on-call engineers).
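As a rough sketch of the arithmetic, the Python below reproduces the 1,000-event, 95% example and adds a simple alert check. The function names are hypothetical, and the forecast assumes bad events keep arriving at the same average pace (a linear projection); real tooling may use a different forecasting method.

```python
def error_budget(valid_events: int, slo_target: float) -> int:
    """Number of bad events the SLO tolerates in one window."""
    return round(valid_events * (1 - slo_target))


def burn_rate(bad_so_far: int, budget: int,
              elapsed_days: float, window_days: float) -> float:
    """Budget consumed relative to time elapsed; 1.0 means exactly on pace.

    Assumes budget > 0, i.e. an SLO target below 100%.
    """
    return (bad_so_far * window_days) / (budget * elapsed_days)


def forecast_remaining(bad_so_far: int, budget: int,
                       elapsed_days: float, window_days: float) -> float:
    """Linear projection of the budget left at the end of the window."""
    projected_bad = bad_so_far * window_days / elapsed_days
    return budget - projected_bad


budget = error_budget(1000, 0.95)                  # 50 bad events
rate = burn_rate(30, budget, 3, 7)                 # 30 bad events after 3 of 7 days
remaining = forecast_remaining(30, budget, 3, 7)   # projected budget at day 7
should_alert = remaining < 0                       # negative forecast: open an incident
print(budget, round(rate, 2), remaining, should_alert)  # 50 1.4 -20.0 True
```

A burn rate above 1.0 means the budget will run out before the window ends if nothing changes, which is why the forecast here is negative.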

What is a Service-Level Objective (SLO)?

From Google’s SRE Book:

An SLO is a service level objective: a target value […] for a service level that is measured by an SLI

If the SLI is what we measure, the SLO is what we aim for. If the SLI measures the level of service, then the SLO establishes the acceptable level of service — learn why we should never expect 100% reliability.

By establishing an objective for a specific time window, teams are enabled to plan and measure reliability over time. A service's reliability stops being an abstract concept that is difficult to grasp and starts being an objective, tangible, and easily understandable metric.
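To make the SLI/SLO relationship concrete, here is a minimal sketch that models an SLO as a target plus a time window and checks a measured SLI against it. The `SLO` class and its fields are hypothetical names chosen for illustration and are not part of any Rely.io API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target: float      # e.g. 0.95 means 95% of valid events must be good
    window_days: int   # length of the time window the target applies to

    def is_met(self, good_events: int, valid_events: int) -> bool:
        """Compare the measured SLI for the window against the target."""
        if valid_events == 0:
            return True  # assumption: no traffic means nothing was violated
        return good_events / valid_events >= self.target


weekly_availability = SLO("REST API availability", target=0.95, window_days=7)
print(weekly_availability.is_met(960, 1000))  # True  (SLI = 96%)
print(weekly_availability.is_met(940, 1000))  # False (SLI = 94%)
```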

What are the prerequisites for SLO adoption?

The only prerequisite for SLO adoption is access to monitoring or observability data coming from a tool such as Prometheus, New Relic, Amazon CloudWatch, or Datadog.

What are the benefits of implementing SLOs?

Adopting SLOs at scale will decrease the overall number of reliability issues, which will save an organization money, improve brand awareness, and reduce customer churn, among other things. You’ll encounter several other benefits along the way, such as:

  • Establishing Reliability Ownership
  • Improving Visibility And Decision Making
  • Increasing Engineering Velocity
  • Reducing Alert Fatigue
  • Flagging Degradations Earlier
  • Improving And Unifying Observability

Check out our blog post “What Are The Benefits Of Service-Level Objectives (SLOs)?”

How much effort is required to implement SLOs?

If you’re looking to (1) integrate an observability data source, (2) explore the Rely.io platform, and (3) create your first SLOs (fewer than 10), it will take you less than an hour.

A fully fledged SLO adoption requires some planning and investment of time. All organizations require a custom strategy to implement SLOs at scale. The strategy you create will include your own policies & practices and will take into account your current team topology.

Check out our (Public) Adoption Guide & Checklist to learn more.