The first quarter of 2022 was a busy one for the Rely.io team. Between seemingly impossible technical challenges, amazing new hires and a flourishing customer pipeline, our team worked non-stop to bring some awesome new features to life and to continue evolving our product towards facilitating SRE adoption in organizations of all sizes. In this article, we will highlight some of the main features shipped in this past quarter that most contributed to the evolution of our product.
As many are already aware, Rely.io is on a mission to democratize SRE and help companies gradually make the shift from traditional DevOps methodologies to more contemporary site reliability engineering practices. The latest version of our SLO Wizard helps us move a step closer towards this goal.
Service-level objectives are at the core of SRE, so we made the ability to intuitively configure SLOs a major focus of our development during Q1. Our revamped SLO Wizard now allows users not just to access and use monitoring data coming from multiple data sources, but also to manipulate it in a number of ways. Users can leverage the SLO Wizard to combine multiple time series into high-level monitoring metrics and create the most expressive service-level indicators (SLIs) possible. The wizard also allows for applying different alignment functions to each individual metric, giving users fine-grained control when creating their SLIs. These SLIs can then be used to create quality SLOs that more accurately translate the relation between operational performance and business expectations.
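To illustrate what combining time series with alignment functions can look like, here is a minimal Python sketch. It is not how the SLO Wizard is implemented: the function names, the 60-second window and the sample data are all assumptions made for the example. It aligns two raw series to fixed windows and then combines them into a good/total ratio, a common shape for an availability SLI.

```python
from collections import defaultdict
from statistics import mean

# Alignment functions: how the raw points inside one window collapse to one value.
# (Hypothetical names, chosen for this example only.)
ALIGNERS = {"sum": sum, "mean": mean, "max": max}


def align(points, window_s, how):
    """Bucket (timestamp_s, value) points into fixed windows and reduce each bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_s].append(value)
    reduce = ALIGNERS[how]
    return {start: reduce(values) for start, values in sorted(buckets.items())}


def ratio_sli(good, total):
    """Combine two aligned series into a good/total ratio per shared window."""
    return {
        start: good[start] / total[start]
        for start in good
        if start in total and total[start] > 0
    }


# Made-up raw samples: (unix seconds, request count)
successful = [(0, 98), (30, 99), (60, 100), (90, 95)]
all_requests = [(0, 100), (30, 100), (60, 100), (90, 100)]

good = align(successful, window_s=60, how="sum")
total = align(all_requests, window_s=60, how="sum")
sli = ratio_sli(good, total)  # availability per 60-second window
```

Swapping `how="sum"` for `"mean"` or `"max"` changes how each window is summarized, which is the kind of per-metric tuning an alignment function provides.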
A major pain point we identified in discussions with our users was that many don’t know where to start when creating an SLI. We also understand that learning how to leverage all the functionalities offered by the SLO Wizard might seem like a mountain too steep to climb at first. So, to mitigate this problem and help our users extract value from Rely.io as soon as they are onboarded, we came up with a set of templates that can automatically generate ready-to-use service-level indicators.
Our SLI recommendations engine looks through our users’ metric data and autonomously creates SLIs that measure different areas of application performance, like availability, latency and throughput. This out-of-the-box SLI offering means users can immediately dive into SRE regardless of their expertise level and learn more as they go. We have dozens of SLI templates already in use and we are focused on improving and growing our offering in the months to come.
As more feedback came in, we realized that the way we were collecting metrics through our data source integrations didn’t give users the flexibility to access all the data they deemed important. Thus, the Query Selector was born. This new metric selection option in the SLO Wizard allows for directly querying data sources in their native query language. Users can now create queries as complex as they’d like to retrieve time series data that is otherwise difficult to access. It also allows more technically fluent users to reuse existing queries that they might have already set up in tools like Grafana, in order to access metrics they already know are relevant for their SLO use cases. For now, this functionality is only available for New Relic integrations, and we are working on expanding it to all of our other data sources.
In summary, with this new version of the SLO Wizard we are trying to create a low-code/no-code playground where users can easily explore and access their monitoring data in a multitude of different ways so as to create relevant and actionable SLOs.
SLO Insights V2
Because we know simply creating SLOs is not enough, our team also dedicated itself to finding new ways to extract insightful information from those SLOs. One of the major outcomes of this effort was the complete revamp of our SLO Insights page. The main goal of this revamp was to abstract the technicalities behind SLOs and translate them into easy-to-understand, actionable information.
We added intuitive and dynamic descriptions to every chart to help users more easily interpret what they’re seeing. This gives users a clearer picture of how an SLO is being impacted and lets them translate that impact into actual customer experience. More specifically, we made error budgets and burn rates easier to interpret by extrapolating insights from them, such as telling users when an SLO is expected to miss its goal or identifying the periods of downtime that affected the greatest number of end users. As a result, users can act preemptively on reliability issues and decide more confidently which development initiatives to prioritize.
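To make the idea of projecting a missed goal concrete, here is a minimal Python sketch of the usual error budget and burn rate arithmetic. It is not Rely.io's implementation: the function names, the 30-day window and the sample numbers are assumptions for illustration. A burn rate of 1 means the budget would last exactly the length of the SLO window, so a higher rate shortens the time left proportionally.

```python
def error_budget(slo_target, total_events):
    """Number of bad events the SLO tolerates over the whole window."""
    return (1 - slo_target) * total_events


def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the allowed error rate; 1.0 means on pace."""
    return (bad_events / total_events) / (1 - slo_target)


def hours_until_exhausted(budget_remaining_fraction, rate, window_hours):
    """Hours until the budget runs out if the current burn rate holds.

    Returns None when the budget is not being consumed at all.
    """
    if rate <= 0:
        return None
    return budget_remaining_fraction * window_hours / rate


# Made-up example: a 99.9% SLO over a 30-day (720-hour) window.
SLO_TARGET = 0.999
WINDOW_HOURS = 30 * 24

# In the last hour, 50 of 10,000 requests failed: about 5x the allowed error rate.
rate = burn_rate(bad_events=50, total_events=10_000, slo_target=SLO_TARGET)

# With half of the budget left, that pace exhausts it in about 72 hours.
eta = hours_until_exhausted(0.5, rate, WINDOW_HOURS)
print(f"burn rate: {rate:.1f}x, budget exhausted in about {eta:.0f} hours")
```

A projection like this is what turns a raw burn rate into a statement such as "this SLO is expected to miss its goal in three days".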
Another outcome of our effort to extrapolate actionable data from SLOs was the development of the first version of the Summary Dashboard. As the name suggests, this dashboard provides an overview of how an application is behaving based on SLO performance and allows for identifying areas of the application that might require some extra attention. We divided the dashboard into three distinct sections: Summary, Application Performance and Reliability Breakdown.
The Summary section shows everything that requires attention, either immediately or in the near future, displaying the number of SLOs currently being impacted and how many alerts are currently firing. The Application Performance section builds upon the information displayed in the Summary section to break down the application’s performance. It shows which specific areas of an application are currently being impacted and which areas are underperforming the most. The Reliability Breakdown section displays a full performance breakdown of the entire application from the perspective of the SLOs that cover the application’s services.
Integrations, Integrations, Integrations…
You asked and we delivered! As we gathered more and more feedback from our prospective customers about our platform, one thing became obvious: we needed to increase our data source offering to be able to integrate with existing monitoring stacks. During this past quarter, we increased our data source offering with the introduction of three new popular integrations.
The first integration we worked on was New Relic. If you’re part of the IT Operations world, you’ve surely heard about this very popular monitoring tool. Our users can now use the monitoring data they are collecting through New Relic to set up SLOs inside Rely.io and start generating actionable and customer-focused insights from them.
The next integration we worked on was the GCP integration. As per Google’s own statement, GCP’s Cloud Monitoring offers automatic out-of-the-box metric collection for Google Cloud services while also supporting monitoring of hybrid and multi-cloud environments. Because it’s a product widely used by organizations of all sizes, this integration was next in line for development.
Finally, by the end of the quarter we began working on an integration with one of the most widely used and far-reaching monitoring solutions, Prometheus, and more specifically on a Prometheus Remote Write integration. This integration was heavily requested by our customers: many already use Prometheus to instrument and collect monitoring data from their applications but didn’t want to use our already supported Prometheus HTTP API integration, which many consider a security concern. By allowing remote write integrations, our users can selectively export the data most relevant to their SLO use cases without the need to deploy any public-facing endpoints that could compromise their application’s security and risk exposing sensitive information.
Looking forward to Q2
In retrospect, it was a tough but successful quarter in terms of product development. But this is just the beginning! We are continuing to improve our current offering to address evolving client requirements and are already developing amazing new features, like reliability reporting and error budget alerting, that will boost the value our users can extract from the Rely.io platform.
If you’d like to try out our product, join our upcoming Open Beta program, where you’ll have first-hand access to all these features and more, and will get the chance to contribute to our roadmap and help tailor it to your needs!