Site Reliability Engineering (SRE) and its related areas, such as Observability, Platform Engineering, and DevOps, have typically operated without Product Managers. I believe that happened because IT Operations was seen solely as a cost center and not as a source of competitive advantage.
With the rise of technology giants such as Google, Amazon, and Facebook, other companies started adopting similar SRE practices to improve the efficiency, security, development speed, and reliability of large-scale systems. Everyone is trying to move at the same speed as big tech and nimble startups. Bets on SRE or DevOps are now seen as investments with positive returns, rather than sunk costs.
There’s little to no literature coming from Google describing how Product Managers can be part of an SRE team. Although there’s been lots to say about the SRE Team Lifecycles and their different topologies, there hasn’t been much around bringing non-engineers into this function. I think that’s going to change soon.
Why do SRE teams need Product Managers?
There’s an increasing number of product owners and program managers in SRE and Platform teams because they have to:
Build products for users (other engineering teams)
Prioritize which reliability investments have the highest impact on customers
Create the long-term reliability strategy for a company
Make build vs. buy decisions
Liaise with several functional groups, including teams outside of engineering
Define reliability targets and report on performance from the perspective of customers' expectations
Manage relationships with new and existing vendors
An SRE's plate is already full, so the tasks listed above are arguably stealing time from reliability-ensuring activities. A few weeks ago, not thinking I'd be writing this blog post today, I ran a poll on r/sre asking "How do you spend most of your time?"
The results were not that surprising. They confirmed that a large proportion of SREs do build and/or manage developer tooling, which means they must care for their users. Commenters also mentioned that a portion of their time is spent answering questions, doing admin tasks, and sitting in vendor meetings.
We expect our Technical Product Managers in the platform tribe to have a tight working relationship with the product engineering, infrastructure and security teams as they’re usually the key stakeholders and consumers of the products that our platform teams are building.
Product Managers supporting SRE and Platform teams are asked to bring traditional product management techniques, such as user research, roadmap prioritization, and stakeholder alignment, into the reliability world. According to several job descriptions I’ve analyzed, their responsibilities often include:
Partnering with engineering and product leads to build product roadmaps for SRE
Creating a long-term strategy for observability and tooling investments, including managing vendor relationships
Implementing and maintaining Service-Level Indicators (SLIs) and Service-Level Objectives (SLOs)
Creating profiles of users (software engineers) and ensuring SRE’s products address their needs
Championing reliability ownership across non-SRE teams and enabling them to account for and track the reliability of the services they’re responsible for
Owning the vision and strategy for incident management, disaster recovery, performance testing, chaos engineering, and similar areas
Note: Responsibilities vary from one organization to another, as do job titles: SRE Product Lead, Technical Program Manager, SRE Product Owner, etc.
Below is a visual example of how a Product Manager might fit into an SRE team, along with some of their responsibilities. Don’t take the SRE work areas as absolute truth: I know many are missing, and some of these are always shared responsibilities across the team!
Given SRE’s principle of applying software to manage and automate IT, the function has successfully taken on many areas of responsibility, and it has done so with fewer people than would normally be needed to move at the same speed reliably. As a result, complexity has increased drastically, and there’s now a need for a focused strategy, planning, and management function within SRE.
I believe that we will start seeing more and more product managers step into this area or, more likely, more engineers formally take on a technical product management role within reliability. My second hypothesis is that the SLO methodology will become the product manager’s best friend because it will allow them to:
Agree with non-engineering functions on the reliability goals needed to meet or exceed customer expectations
Communicate about reliability performance with SLIs/SLOs as a standardized language
Prioritize the roadmap according to historical SLO performance
Design better alerting and incident management strategies with burn rate alerting
Enable teams to own the reliability of their services with out-of-the-box service SLIs
Monitor data-driven KPIs/OKRs, allowing for weighted, justified and fast decision making
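To make the list above a little more concrete, here is a minimal sketch of how SLI and burn-rate numbers could feed a roadmap conversation. Everything in it is an assumption for illustration: the `SloWindow` shape, the service names, and the request counts are hypothetical, and burn rate is computed with the common definition of observed error rate divided by the error rate the SLO allows.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one service over one evaluation window (hypothetical shape)."""
    service: str
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int


def sli(window: SloWindow) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if window.total_requests == 0:
        return 1.0  # no traffic in the window; treat the objective as met
    return 1 - window.failed_requests / window.total_requests


def burn_rate(window: SloWindow) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent at exactly the
    sustainable pace; anything above 1.0 means it runs out early.
    """
    if not 0 < window.slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_error_rate = 1 - window.slo_target
    observed_error_rate = 1 - sli(window)
    return observed_error_rate / allowed_error_rate


def rank_by_burn_rate(windows):
    """Order services so the fastest budget burners come first."""
    return sorted(windows, key=burn_rate, reverse=True)


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    windows = [
        SloWindow("search", slo_target=0.99, total_requests=500_000, failed_requests=1_000),
        SloWindow("checkout", slo_target=0.999, total_requests=1_000_000, failed_requests=2_000),
    ]
    for w in rank_by_burn_rate(windows):
        print(f"{w.service}: SLI={sli(w):.4f}, burn rate={burn_rate(w):.2f}")
```

In this sketch both services have the same raw error rate, but the one with the stricter target burns its budget much faster, which is the kind of signal a product manager could use to decide where reliability work goes first.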
More on the above, with demos of Rely.io, in a future blog post coming soon!