Why I joined Rely.io — António Araújo

António Araújo

Go To Market Lead

Rely.io

March 8, 2022

Over the past five years, I partnered with dozens of fast-growing startups while at AWS, and later joined one myself at Unbabel. Throughout, I’ve seen engineering teams still invest large amounts of time in keeping systems reliable, however much easier that became once public cloud providers arrived.

Cloud-based architectures are growing more complex and taking on more and more third-party dependencies. Building reliable systems in 2022 is still incredibly hard for engineers. On top of that, SRE teams lack accurate data on how their work affects customer journeys over time, positively or negatively.

For executive and business teams, the site reliability function is often reduced to a single question: “how often were the systems down last quarter?” Availability matters enormously, but it can’t be the only focus, since SREs also care about latency, security, developer experience, infrastructure costs and more. Taken to the extreme, this view can bias engineers towards availability and short-term fixes. A shortsighted focus on basic threshold alerting can crowd out the work that actually ensures long-term reliability.

By agreeing on Service Level Objectives (SLOs) with the business, SRE teams can now set data-driven goals or OKRs, report accurately on the starting metric and the corresponding results, and, as a result, show the organization the ROI of their work, all by tracking those SLOs on Rely.io. The thesis can now evolve into a longer-term, holistic approach:

How often did the systems go down this quarter compared with the previous one?
Are we complying with the organization's agreed availability/latency/throughput goals?
What was the ROI of the investments made in SRE?
In which user journeys will we make reliability investments next quarter?

Behemoths like Google, Microsoft or Netflix follow an SLO-based approach to site reliability, relying on many internally built components developed by the hundreds of site reliability engineers working at each of these companies.

In early 2021, I met José Velez for the first time and he told me he wanted to productize this framework and make it available to any organization, no matter its size. It’s been exciting to watch José and the rest of the team succeed at it. The Rely.io platform is up and running in pre-beta, with over 10 organizations using it weekly to:

Create SLOs in a few clicks — really easy, I’ve even done it myself already!
Prioritize between new features and technical debt using SLO data
Reduce on-call engineers’ alert fatigue
Capture issues that are slowly depleting their error budget
Integrate their monitoring tools, such as Amazon CloudWatch, Prometheus, New Relic, Elasticsearch and/or Datadog (with more coming soon)

Now, I’m joining Rely.io to lead our Go-To-Market activities and I’m really looking forward to speaking with engineering teams around the world about their reliability efforts!

If you are an engineer or engineering leader looking after site reliability or DevOps and would like to learn more, please reach out.
