Reliability is a huge and evolving field, with new information and knowledge appearing constantly. There are plenty of blogs, newsletters, and books to read and follow. With so much information available, many companies don't know which principles and best practices to follow to increase the engineering efficiency of their teams while improving the reliability and maintainability of their systems. Nor do they know how to assess their operational maturity.
What does operational maturity even mean? Compare it to owning a car. If you own and drive a car, you follow specific rules, and you do preventive maintenance to extend the car's life. If the oil light comes on, you add oil.
Similarly, if you operate systems at scale, some practices make those systems easier to operate and maintain, and some metrics indicate how well-run a system is. People with experience operating systems at scale have documented their learnings and best practices, sometimes condensed in books, articles, or videos. However, these resources can be hard to find, and it is often unclear which practices to apply. Operational maturity indicates how well a company adheres to current standards and best practices.
People with deep knowledge of both software development and DevOps and SRE practices are rare. It is common for companies to either hire people with system operations backgrounds alongside pure software engineers or have their infrastructure or development teams operate their own services. Staying on top of the most recent developments in the field is challenging for DevOps engineers and SREs, and even harder for developers who own services but for whom operating them is not the main part of the job.
While engineering departments of a certain size can afford to have a few people dedicated to keeping up with evolving principles and standards, it is unreasonable and unproductive to expect most developers to spend time learning and memorizing all of them. Base-level knowledge of operating services at scale is beneficial, but ramping up every engineer to become an expert in this field is not a good use of time. Instead, organizations should invest in the tools, guardrails, and guidance that improve the developer experience of operating their systems.
Another common theme at big companies, where teams have varying levels of experience, is that you end up with many localized solutions: teams build their own tooling and track and report metrics differently from their neighboring teams. It is positive that a team cares about a topic and wants to improve. Still, when every team builds "special snowflake solutions" and tracks different metrics, it is hard for senior engineering leaders to get a clear picture of how a given area looks across the whole organization.
Lastly, following reliability standards and best practices often doesn't yield immediate results or overnight successes. More often than not, it takes time and practice to see improvements; it is a long-term game. This delay can hinder adoption and cause some teams to abandon their efforts.