The notes below are curated from a talk by Mike Sparr on SRE adoption, grounded in Google SRE principles. They represent the conceptual foundation behind the implementation work described in the other posts in this collection. If you are new to SRE, start here before reading the implementation posts. If you are already familiar, this serves as a useful reference for the cultural and organizational dimensions that tooling alone cannot address.

Source: Mike Sparr, Google - SRE Best Practices


SRE Best Practices

  • Don’t expect a tool to solve it
  • Cultural change requires “believers” in senior roles to advocate within the organization
  • People need to absorb information within their own mental model - a new framework imposed top-down rarely sticks

“Reliability is a Journey”

  • It is a process that can span 6–9 months in organizations with 5,000 engineers - nothing happens immediately
  • Step 1: “I want to be reliable when I grow up” (you must believe you have a problem first)
  • Step 2: Read the SRE book and watch “SRE vs. DevOps”
  • Step 3: “Panic!” (myth: you must fire the team and hire trained SREs; reality: teams can be retrained in-house)
  • Step 4: Start small, be patient, celebrate each step - spread the word

Common Language: Traditionally, Incentives Aren’t Aligned

  • Developers are incentivized with rapid change (ship more)
  • Operators are incentivized with stability (limit change)
  • SRE provides a shared language to negotiate between these two incentives

The problem - incentive misalignment:

[Image: Incentive misalignment]

The solution - shared language:

[Image: Shared language]

Latency

  • Research in cognitive psychology shows that users cannot perceive the difference between three nines and four nines of availability - other failures (ISP, device, Wi-Fi) dominate the experience
  • The “art of an SLO” is figuring out how much error budget you can spend while keeping customers happy
  • It is not about collecting nines - “we need to be good enough and keep our customers happy”
  • It may be more beneficial to invest engineering effort in features than in additional reliability
  • Defining goals is a fundamental collaboration between development and operations together
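
A quick way to see why “collecting nines” gets expensive: each extra nine cuts the allowed downtime by a factor of ten. A minimal sketch (the 30-day window is an assumption for illustration):

```python
# Sketch: allowed downtime (the error budget expressed as time) for common
# availability targets over a 30-day window. Each extra nine shrinks the
# budget by 10x, which is engineering effort not spent on features.

def downtime_budget_minutes(availability: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for a given availability target."""
    window_minutes = window_days * 24 * 60
    return (1.0 - availability) * window_minutes

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%}: {downtime_budget_minutes(target):.1f} min / 30 days")
```

Three nines allows roughly 43 minutes of downtime a month; four nines allows about four - a difference most customers never notice, but one that costs real engineering effort to close.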

CUJ (Critical User Journey)

  • Not everything needs to be measured - for example, a feature three clicks deep in a left navigation panel
  • Focus on the actions that define whether the product works for the user

Best Practices Summary

  • Shared ownership - reduce organizational silos
  • Error budgets - accept failure as normal (blame is not useful)
  • Reduce cost of failure - implement gradual changes
  • Automate common cases - leverage tooling and automation; if a process follows a decision tree, even an undocumented one, it is a candidate for automation
  • Measure toil and reliability - measure everything, but alert only on CUJs

SRE as the personification of DevOps:

[Image: SRE best practices - building blocks]

[Image: SRE best practices - personification of DevOps]

SLI (Metrics)

  • Measurements need to be transparent - everyone should understand what is being measured and why
  • Latency is difficult to measure end-to-end (browser through load balancer to backend)
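
One common way to keep an SLI transparent is to express it as good events over valid events. A minimal sketch, assuming a request-based latency SLI; the function name and the 300 ms threshold are illustrative assumptions, not from the talk:

```python
# Sketch: a request-based latency SLI = fraction of requests served faster
# than a threshold. Everyone can see what is measured (the threshold) and
# why (requests slower than it count against the SLO).

def latency_sli(latencies_ms: list[float], threshold_ms: float = 300.0) -> float:
    """Fraction of requests at or under the threshold (good events / valid events)."""
    if not latencies_ms:
        return 1.0  # no traffic in the window: nothing violated the target
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)

samples = [120.0, 250.0, 480.0, 90.0, 310.0]
print(f"latency SLI: {latency_sli(samples):.2f}")  # 3 of 5 under 300 ms -> 0.60
```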

SLI/SLO/SLA terminology overview:

[Image: Terms]

SLO (SLI + Goal) - Internal Target

  • Common initial mistake: “how many nines can we achieve?” (wrong starting question)
  • Correct approach: what does success look like to the customer and what are their expectations?
  • Inputs: customer research, support case review, measured experienced latency
  • If you lack data: the first goal should reflect current experience (current latency, current availability)
  • Should be re-evaluated at weeks 1, 4, and 6 - the first six weeks are critical to getting the target right
  • After six weeks, associate an alert to it (needs time before alerting becomes meaningful rather than noise)
  • Re-evaluate at least once per quarter; sit with all parties involved and decide whether to raise or adjust the target
  • SLOs have a life of their own - think about how to maintain them over time, not just how to set them
  • They need to reflect how customers actually use the product; if usage patterns change, the SLO needs to change
  • Avoid “collecting nines as a badge of honor” - starting with 99% or 99.9% is appropriate
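
The bullets above can be made concrete: the error budget is simply one minus the SLO target, and tracking how much of it has been spent is what gives the SLO “a life of its own.” A sketch with illustrative numbers:

```python
# Sketch: error budget consumption against an SLO target. A value above 1.0
# means the SLO is violated; watching the trend is what makes quarterly
# re-evaluation a data-driven conversation rather than a guess.

def error_budget_consumed(slo_target: float, measured_availability: float) -> float:
    """Fraction of the error budget spent (> 1.0 means the SLO is violated)."""
    budget = 1.0 - slo_target            # total unreliability the SLO allows
    spent = 1.0 - measured_availability  # unreliability actually observed
    return spent / budget

# 99.9% SLO with 99.95% measured availability: half the budget is spent.
print(f"{error_budget_consumed(0.999, 0.9995):.0%} of budget consumed")
```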

SLA (Adding Consequences) - Making Some SLOs Public

  • An SLA is an internal SLO exposed to customers with legal consequences; refund is the most common mechanism
  • If you have an SLA, create a stricter internal SLO and catch violations earlier - the SLO is the safety buffer
  • Do not publish an SLA with an unreasonable number of nines
  • Do not publish an SLA on something you historically fail to meet
  • Be deliberate about what you publish to avoid eroding customer trust
  • Consider global and regional SLAs: separate CUJs per region (e.g., NA, EMEA, APAC) may be appropriate
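
The SLO-as-safety-buffer idea can be sketched as a two-threshold check; the specific targets below are illustrative assumptions, not from the talk:

```python
# Sketch: a stricter internal SLO sits above the published SLA, so teams
# react while there is still buffer, before contractual consequences apply.

SLA_TARGET = 0.999     # published to customers; refund attached if missed
INTERNAL_SLO = 0.9995  # stricter internal target; the safety buffer

def availability_status(measured_availability: float) -> str:
    """Classify measured availability against the internal SLO and the SLA."""
    if measured_availability < SLA_TARGET:
        return "SLA breached: contractual consequences apply"
    if measured_availability < INTERNAL_SLO:
        return "internal SLO breached: react now, before the SLA is at risk"
    return "within internal SLO"

print(availability_status(0.9992))  # between the two targets: early warning
```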

Myth: Adopting SLOs Results in Moving Slower

  • Developer teams respond well to an error budget framed as an allowance: spend it on innovation, or lose it
  • An unspent error budget does not roll over - not spending it means missing an innovation opportunity
  • Error budgets give teams permission to take risk without creating conflict with operators
  • Without visibility into budget, teams become risk-averse by default

Benefits of Error Budgets

  • Common incentives for developers and SREs - a shared mechanism for balancing innovation and reliability
  • Developer teams manage risk themselves - they decide how to spend their error budget
  • Unrealistic reliability goals dampen velocity; realistic ones align teams
  • Developer teams become self-policing - the error budget is a resource they have reason to protect
  • Shared responsibility for system uptime - infrastructure failures consume developer error budget

Error budget as an allowance (ice cream analogy):

[Image: Benefits of error budgets]

What happens when the budget is exhausted? Agree upfront, not when things are on fire:

[Image: Error budget agreement]
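
Agreeing upfront can mean writing the policy down as an explicit decision rule; the thresholds and actions below are illustrative assumptions, not from the talk:

```python
# Sketch: an error-budget policy agreed in advance makes the "budget is
# exhausted" decision mechanical instead of a mid-incident negotiation.

def release_decision(budget_remaining_fraction: float) -> str:
    """Map remaining error budget to the action agreed upfront."""
    if budget_remaining_fraction <= 0.0:
        return "freeze feature releases; prioritize reliability work"
    if budget_remaining_fraction < 0.25:
        return "slow down: extra review for risky changes"
    return "normal release cadence"

print(release_decision(0.10))  # low budget: extra caution kicks in
```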

SRE customer journey and SLO setup guide:

[Image: SRE Customer Journey]

[Image: SLO Setup Guide]

