The notes below are curated from a talk by Mike Sparr on SRE adoption, grounded in Google SRE principles. They represent the conceptual foundation behind the implementation work described in the other posts in this collection. If you are new to SRE, start here before reading the implementation posts. If you are already familiar, this serves as a useful reference for the cultural and organizational dimensions that tooling alone cannot address.

Source: Mike Sparr, Google - SRE Best Practices


SRE Best Practices

  • Don’t expect a tool to solve it
  • Cultural change requires “believers” in senior roles to advocate within the organization
  • People need to absorb information within their own mental model - a new framework imposed top-down rarely sticks

“Reliability is a Journey”

  • It is a process that can span 6–9 months in organizations with 5,000 engineers - nothing happens immediately
  • Step 1: “I want to be reliable when I grow up” (you must believe you have a problem first)
  • Step 2: Read the SRE book and watch “SRE vs. DevOps”
  • Step 3: “Panic!” (myth: you must fire the team and hire trained SREs; reality: teams can be retrained in-house)
  • Step 4: Start small, be patient, celebrate each step - spread the word

Common Language: Traditionally, Incentives Aren’t Aligned

  • Developers are incentivized with rapid change (ship more)
  • Operators are incentivized with stability (limit change)
  • SRE provides a shared language to negotiate between these two incentives

The problem - incentive misalignment:

[Image: Incentive misalignment]

The solution - shared language:

[Image: Shared language]

Latency

  • Research in cognitive psychology shows that users cannot perceive the difference between three nines and four nines of availability - other failures (ISP, device, Wi-Fi) dominate the experience
  • The “art of an SLO” is figuring out how much error budget you can spend while keeping customers happy
  • It is not about collecting nines - “we need to be good enough and keep our customers happy”
  • It may be more beneficial to invest engineering effort in features than in additional reliability
  • Defining goals is a fundamental collaboration between development and operations together
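
A quick way to see why “collecting nines” gets expensive: each extra nine cuts the allowed downtime by a factor of ten. A minimal sketch (the 30-day window is an assumption for illustration):

```python
# Sketch: allowed downtime (the error budget expressed as time) for common
# availability targets over a 30-day window. Each extra nine shrinks the
# budget by 10x, which is engineering effort not spent on features.

def downtime_budget_minutes(availability: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for a given availability target."""
    window_minutes = window_days * 24 * 60
    return (1.0 - availability) * window_minutes

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%}: {downtime_budget_minutes(target):.1f} min / 30 days")
```

Three nines allows roughly 43 minutes of downtime a month; four nines allows about four - a difference most customers never notice, but one that costs real engineering effort to close.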

CUJ (Critical User Journey)

  • Not everything needs to be measured - for example, a feature three clicks deep in a left navigation panel
  • Focus on the actions that define whether the product works for the user

Best Practices Summary

  • Shared ownership - reduce organizational silos
  • Error budgets - accept failure as normal (blame is not useful)
  • Reduce cost of failure - implement gradual changes
  • Automate common cases - leverage tooling and automation; if a process follows a decision tree, even an undocumented one, it is a candidate for automation
  • Measure toil and reliability - measure everything, but alert only on CUJs

SRE as the personification of DevOps:

[Image: SRE best practices - building blocks]

[Image: SRE best practices - personification of DevOps]

SLI (Metrics)

  • Measurements need to be transparent - everyone should understand what is being measured and why
  • Latency is difficult to measure end-to-end (browser through load balancer to backend)
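
One common way to keep an SLI transparent is to express it as good events over valid events. A minimal sketch, assuming a request-based latency SLI; the function name and the 300 ms threshold are illustrative assumptions, not from the talk:

```python
# Sketch: a request-based latency SLI = fraction of requests served faster
# than a threshold. Everyone can see what is measured (the threshold) and
# why (requests slower than it count against the SLO).

def latency_sli(latencies_ms: list[float], threshold_ms: float = 300.0) -> float:
    """Fraction of requests at or under the threshold (good events / valid events)."""
    if not latencies_ms:
        return 1.0  # no traffic in the window: nothing violated the target
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)

samples = [120.0, 250.0, 480.0, 90.0, 310.0]
print(f"latency SLI: {latency_sli(samples):.2f}")  # 3 of 5 under 300 ms -> 0.60
```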

SLI/SLO/SLA terminology overview:

[Image: Terms]

SLO (SLI + Goal) - Internal Target

  • Common initial mistake: “how many nines can we achieve?” (wrong starting question)
  • Correct approach: what does success look like to the customer and what are their expectations?
  • Inputs: customer research, support case review, measured experienced latency
  • If you lack data: the first goal should reflect current experience (current latency, current availability)
  • Should be re-evaluated at weeks 1, 4, and 6 - the first six weeks are critical to getting the target right
  • After six weeks, associate an alert to it (needs time before alerting becomes meaningful rather than noise)
  • Re-evaluate at least once per quarter; sit with all parties involved and decide whether to raise or adjust the target
  • SLOs have a life of their own - think about how to maintain them over time, not just how to set them
  • They need to reflect how customers actually use the product; if usage patterns change, the SLO needs to change
  • Avoid “collecting nines as a badge of honor” - starting with 99% or 99.9% is appropriate
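
The bullets above can be made concrete: the error budget is simply one minus the SLO target, and tracking how much of it has been spent is what gives the SLO “a life of its own.” A sketch with illustrative numbers:

```python
# Sketch: error budget consumption against an SLO target. A value above 1.0
# means the SLO is violated; watching the trend is what makes quarterly
# re-evaluation a data-driven conversation rather than a guess.

def error_budget_consumed(slo_target: float, measured_availability: float) -> float:
    """Fraction of the error budget spent (> 1.0 means the SLO is violated)."""
    budget = 1.0 - slo_target            # total unreliability the SLO allows
    spent = 1.0 - measured_availability  # unreliability actually observed
    return spent / budget

# 99.9% SLO with 99.95% measured availability: half the budget is spent.
print(f"{error_budget_consumed(0.999, 0.9995):.0%} of budget consumed")
```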

SLA (Adding Consequences) - Making Some SLOs Public

  • An SLA is an internal SLO exposed to customers with legal consequences; refund is the most common mechanism
  • If you have an SLA, create a stricter internal SLO and catch violations earlier - the SLO is the safety buffer
  • Do not publish an SLA with an unreasonable number of nines
  • Do not publish an SLA on something you historically fail to meet
  • Be deliberate about what you publish to avoid eroding customer trust
  • Consider global and regional SLAs: separate CUJs per region (e.g., NA, EMEA, APAC) may be appropriate
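
The SLO-as-safety-buffer idea can be sketched as a two-threshold check; the specific targets below are illustrative assumptions, not from the talk:

```python
# Sketch: a stricter internal SLO sits above the published SLA, so teams
# react while there is still buffer, before contractual consequences apply.

SLA_TARGET = 0.999     # published to customers; refund attached if missed
INTERNAL_SLO = 0.9995  # stricter internal target; the safety buffer

def availability_status(measured_availability: float) -> str:
    """Classify measured availability against the internal SLO and the SLA."""
    if measured_availability < SLA_TARGET:
        return "SLA breached: contractual consequences apply"
    if measured_availability < INTERNAL_SLO:
        return "internal SLO breached: react now, before the SLA is at risk"
    return "within internal SLO"

print(availability_status(0.9992))  # between the two targets: early warning
```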

Myth: Adopting SLOs Results in Moving Slower

  • Developer teams respond well to an error budget framed as an allowance: spend it on innovation, or lose it
  • An unspent error budget does not roll over - not spending it means missing an innovation opportunity
  • Error budgets give teams permission to take risk without creating conflict with operators
  • Without visibility into budget, teams become risk-averse by default

Benefits of Error Budgets

  • Common incentives for developers and SREs - a shared mechanism for balancing innovation and reliability
  • Developer teams manage risk themselves - they decide how to spend their error budget
  • Unrealistic reliability goals dampen velocity; realistic ones align teams
  • Developer teams become self-policing - the error budget is a resource they have reason to protect
  • Shared responsibility for system uptime - infrastructure failures consume developer error budget

Error budget as an allowance (ice cream analogy):

[Image: Benefits of error budgets]

What happens when the budget is exhausted? Agree upfront, not when things are on fire:

[Image: Error budget agreement]
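
Agreeing upfront can mean writing the policy down as an explicit decision rule; the thresholds and actions below are illustrative assumptions, not from the talk:

```python
# Sketch: an error-budget policy agreed in advance makes the "budget is
# exhausted" decision mechanical instead of a mid-incident negotiation.

def release_decision(budget_remaining_fraction: float) -> str:
    """Map remaining error budget to the action agreed upfront."""
    if budget_remaining_fraction <= 0.0:
        return "freeze feature releases; prioritize reliability work"
    if budget_remaining_fraction < 0.25:
        return "slow down: extra review for risky changes"
    return "normal release cadence"

print(release_decision(0.10))  # low budget: extra caution kicks in
```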

SRE customer journey and SLO setup guide:

[Image: SRE Customer Journey]

[Image: SLO Setup Guide]

