Defining a CUJ SLO is a cross-functional process. The technical work - monitors, SLI queries, alert thresholds - comes relatively late. Most of the effort is in establishing agreement: on which journeys matter, what success looks like, and what happens when the target is missed. This post documents a framework for that process from start to finish.

Phase 1: Identifying and Prioritizing CUJs

A Critical User Journey is a sequence of user actions that delivers business value. The selection of which journeys to measure cannot be made by SRE alone - it requires input from product management, customer success, and engineering leadership, since the answer depends on revenue impact, support volume, and strategic priority.

The inputs that matter for prioritization:

  • UX research and usage patterns - where do users actually spend time, and where do they drop off?
  • Customer success and support data - which failures generate the most support tickets?
  • Revenue and conversion impact - which journeys gate transactions or renewals?
  • Engineering feasibility - what can actually be measured given current instrumentation?

Prioritization is a business decision with technical constraints, not the other way around.
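As a deliberately simplistic illustration of how the four inputs above can be combined, the sketch below scores hypothetical candidate journeys with made-up weights and 1-5 scores; it structures the discussion, it does not replace it.

```python
# Hypothetical weighted scoring of candidate CUJs across the four prioritization
# inputs. Weights and 1-5 scores are illustrative placeholders, not a prescribed scale.
WEIGHTS = {"usage": 0.3, "support_load": 0.2, "revenue_impact": 0.4, "feasibility": 0.1}

candidates = {
    "checkout":       {"usage": 5, "support_load": 4, "revenue_impact": 5, "feasibility": 4},
    "search":         {"usage": 5, "support_load": 2, "revenue_impact": 3, "feasibility": 5},
    "profile_update": {"usage": 2, "support_load": 1, "revenue_impact": 1, "feasibility": 5},
}

def score(journey: dict) -> float:
    """Weighted sum of the prioritization inputs for one journey."""
    return sum(WEIGHTS[k] * v for k, v in journey.items())

# Rank candidates; the final selection still belongs to the cross-functional group.
for name, inputs in sorted(candidates.items(), key=lambda kv: score(kv[1]), reverse=True):
    print(f"{name}: {score(inputs):.2f}")
```

The numbers only make the trade-offs explicit; the ranking they produce is an input to the business decision, not its output.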

Phase 2: Defining Success Criteria

Once journeys are selected, the next question is what “reliable” means for each one. This requires translating business expectations into measurable criteria:

  • At what point does poor performance cause customers to fail or abandon the journey?
  • What level of unavailability is acceptable before it creates business consequences?
  • How does user perception of latency affect satisfaction for this specific journey?

These questions are best answered through a cross-functional workshop with product, UX, engineering, and SRE. The output is a documented CUJ scope: what the journey includes, where it starts and ends, and what constitutes success from the user’s perspective.
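A documented CUJ scope can be as simple as a small structured record. The sketch below shows one possible shape; the field names, journey, and thresholds are hypothetical, not a required template.

```python
from dataclasses import dataclass, field

@dataclass
class CUJScope:
    """Minimal sketch of a documented CUJ scope; field names are illustrative."""
    name: str
    starts_at: str                 # first user action in the journey
    ends_at: str                   # user-visible outcome that marks success
    success_criteria: list[str] = field(default_factory=list)

checkout = CUJScope(
    name="checkout",
    starts_at="user clicks 'Place order'",
    ends_at="order confirmation page rendered",
    success_criteria=[
        "confirmation visible within 3 seconds",   # hypothetical latency expectation
        "no payment error shown to the user",
    ],
)
```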

Phase 3: Technical Mapping and Instrumentation

With the journey defined, SRE and engineering map it to technical components: service dependencies, data flows, external integrations, and known failure modes. This mapping feeds the SLI definition: what to measure, at which service boundary, and with what query.
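The output of this mapping is a candidate SLI per journey. A minimal sketch of what that might capture is below; the service boundary, event definitions, and journey name are assumptions for illustration.

```python
# Minimal sketch of a candidate SLI definition produced by the technical mapping.
# Service names, endpoints, and boundaries here are hypothetical.
checkout_availability_sli = {
    "journey": "checkout",
    "boundary": "checkout-api, measured at the load balancer edge",
    "good_events": "HTTP responses with status < 500 on POST /orders",
    "total_events": "all HTTP responses on POST /orders",
    "type": "request-based availability",   # good / total over a rolling window
}
```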

If the required instrumentation does not exist, it needs to be funded and scheduled as engineering work before SLO targets can be set. An SLO built on incomplete instrumentation produces misleading numbers. See Observability Prerequisites for SLO Implementation for how to handle this dependency.

Phase 4: Baseline and Target Negotiation

The target-setting process starts with a baseline measurement period on existing data. “What does success look like to the customer?” is a better starting question than “how many nines can we achieve?” Targets based on current performance and user expectations tend to be more durable than aspirational targets that have no historical grounding.
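A minimal sketch of the baseline step, using invented counts: compute what the journey actually achieved over the measurement period, then derive a candidate target from it rather than from a desired number of nines. The 0.1 percentage-point margin is an assumption, not a rule.

```python
# Baseline calculation on historical (hypothetical) counts over a 90-day period.
good_events = 9_940_000        # e.g. successful checkout requests
total_events = 10_000_000

achieved = good_events / total_events
print(f"Achieved success ratio: {achieved:.4%}")        # 99.4000%

# A target slightly below current performance leaves room for normal variation.
proposed_target = achieved - 0.001
print(f"Candidate SLO target: {proposed_target:.2%}")   # 99.30%
```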

Target negotiation involves multiple parties:

  • SRE brings historical data and technical feasibility constraints
  • Product brings user expectations and business requirements
  • Engineering brings cost-of-improvement analysis
  • Leadership brings resource allocation authority

When targets are too strict relative to what is achievable, negotiate or defer rather than set a target that will be breached immediately. When targets are too loose, product should escalate with evidence of user impact.

Phase 5: Formalizing the Agreement

An agreed SLO is not just a number. The formalization step includes:

  • Error budget policy - what happens when budget is exhausted? (freeze deployments, post-mortem, escalation)
  • Ownership assignment - both product and SRE own the SLO jointly
  • Escalation paths - who gets notified at what burn rate thresholds?
  • Resource commitments - what engineering capacity is allocated for reliability improvements when budget is low?

Error budget policy is the most important of these. Without it, an SLO is informational rather than actionable. Teams that don’t know what will happen when they exhaust budget treat SLOs as dashboards, not operational contracts.
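One way to keep the policy actionable is to write it down as data: each budget threshold maps to a pre-agreed consequence. The thresholds and actions below are examples, not a recommended policy.

```python
# Sketch of an error budget policy as data; thresholds and actions are illustrative.
ERROR_BUDGET_POLICY = [
    # (fraction of budget consumed, agreed action)
    (0.50, "review reliability backlog in the next sprint planning"),
    (0.75, "notify product and engineering leads; prioritize reliability work"),
    (1.00, "freeze feature deployments; run post-mortem; escalate to leadership"),
]

def actions_for(budget_consumed: float) -> list[str]:
    """Return every agreed action whose threshold has been crossed."""
    return [action for threshold, action in ERROR_BUDGET_POLICY if budget_consumed >= threshold]

print(actions_for(0.8))   # both the 50% and 75% actions apply
```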

Phase 6: Monitoring Implementation

With targets and policies agreed, monitoring is configured:

  • SLO tracking (metric-based or monitor-based, depending on what SLI data is available)
  • Dashboards combining SLO status with service context
  • Multi-burn-rate alerts for fast and slow burn scenarios
  • Runbooks linked to each alert

Naming conventions and tagging standards applied at this stage have long-term consequences for dashboard filtering, ownership tracking, and Terraform maintainability. Establish them before creating resources, not after.
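For the multi-burn-rate alerts listed above, a minimal sketch of the underlying arithmetic is shown below. The window/threshold pairs are commonly cited examples (e.g. 14.4x over one hour for fast burn), not values this process mandates, and the target and observed error ratios are hypothetical.

```python
# Burn rate = how many times faster than "exactly on budget" errors are consumed.
SLO_TARGET = 0.999                 # hypothetical 99.9% target
ERROR_BUDGET = 1 - SLO_TARGET

def burn_rate(error_ratio: float) -> float:
    """Observed error ratio relative to the ratio that would exactly spend the budget."""
    return error_ratio / ERROR_BUDGET

ALERTS = [
    ("fast burn (page)",   "1h window", 14.4),
    ("slow burn (ticket)", "6h window",  6.0),
]

observed = {"1h window": 0.02, "6h window": 0.004}   # hypothetical error ratios

for name, window, threshold in ALERTS:
    rate = burn_rate(observed[window])
    if rate >= threshold:
        print(f"{name}: burn rate {rate:.1f}x >= {threshold}x over {window}")
```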

Phase 7: Operational Review

SLOs are not static. A regular review cycle - at minimum quarterly - should assess:

  • Does the SLO still reflect the actual user experience?
  • Have service architectures changed in ways that affect the SLI measurement?
  • Is the target still calibrated to business needs, or has it drifted?
  • What does the error budget consumption pattern reveal about reliability trends?

Users change how they interact with a product over time. SLOs that were accurate at definition can become misleading without periodic reassessment.
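One small input that helps the review: track error budget consumption per period and look for trends, not just breaches. The monthly numbers below are invented for illustration.

```python
# Hypothetical error budget consumed per month within one quarter. A steady upward
# trend is a signal for the review even if no single month exhausted the budget.
monthly_budget_consumed = {"Jan": 0.35, "Feb": 0.55, "Mar": 0.80}

values = list(monthly_budget_consumed.values())
trend_rising = all(later > earlier for earlier, later in zip(values, values[1:]))

if trend_rising:
    print("Budget consumption is rising month over month; flag for the quarterly review.")
```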

```mermaid
flowchart TD
    A[Business Strategy & Objectives] --> B[Product Management: Define Value Streams]
    B --> C[Customer Journey Mapping]
    C --> C1[UX Research: User Behavior]
    C --> C2[Product Analytics: Usage Patterns]
    C --> C3[Customer Success: Pain Points]
    C --> C4[Revenue/Conversion Impact]
    C1 --> D[Identify Critical User Journeys]
    C2 --> D
    C3 --> D
    C4 --> D
    D --> E[Business Acumen Discussion]
    E --> E1[Executive Leadership]
    E --> E2[Product Management]
    E --> E3[Customer Success/Support]
    E --> E4[Engineering Leadership]
    E1 --> F{Prioritize CUJs by Business Impact}
    E2 --> F
    E3 --> F
    E4 --> F
    F --> G[Define Success Criteria - Business View]
    G --> G1[What does 'reliable' mean to users?]
    G --> G2[At what point do we lose customers/revenue?]
    G --> G3[What's acceptable unavailability?]
    G1 --> H[Cross-Functional Workshop: CUJ Definition]
    G2 --> H
    G3 --> H
    H --> H1[Product: User Journey Steps]
    H --> H2[UX: User Expectations]
    H --> H3[Engineering: Architecture]
    H --> H4[SRE: Technical Feasibility]
    H --> H5[Support: Known Issues]
    H1 --> I[Documented CUJ Scope]
    H2 --> I
    H3 --> I
    H4 --> I
    H5 --> I
    I --> J[Engineering/SRE: Map to Technical Components]
    J --> J1[Service Dependencies]
    J --> J2[Data Flows]
    J --> J3[External Dependencies]
    J --> J4[Failure Modes]
    J1 --> K[SRE: Define Candidate SLIs]
    J2 --> K
    J3 --> K
    J4 --> K
    K --> L{Instrumentation Exists?}
    L -->|No| M[Engineering: Implement Observability]
    M --> M1[Define Instrumentation Requirements]
    M --> M2[Prioritize in Engineering Roadmap]
    M2 --> N{Funded & Scheduled?}
    N -->|No| O[Escalate to Leadership]
    O --> E1
    N -->|Yes| P[Engineering Implements]
    P --> K
    L -->|Yes| Q[SRE: Baseline Measurement Period]
    Q --> R[Analyze Current Performance vs Business Needs]
    R --> S[Cross-Functional: Draft SLO Proposal]
    S --> S1[SRE: Technical Targets]
    S --> S2[Product: User Expectations]
    S --> S3[Engineering: Feasibility & Cost]
    S --> S4[Leadership: Resource Allocation]
    S1 --> T[Negotiation & Alignment]
    S2 --> T
    S3 --> T
    S4 --> T
    T --> U{All Parties Agree?}
    U -->|No - Target Too Strict| V[Negotiate: Relax or Defer]
    U -->|No - Target Too Loose| W[Product Escalates: User Impact]
    V --> S
    W --> S
    U -->|Yes| X[Formalize SLO Agreement]
    X --> X1[Define Error Budget Policy]
    X --> X2[Assign Ownership: Product + SRE]
    X --> X3[Define Escalation Paths]
    X --> X4[Commit Resources for Breaches]
    X1 --> Y[SRE: Implement Monitoring]
    X2 --> Y
    X3 --> Y
    X4 --> Y
    Y --> Y1[Configure SLO Tracking]
    Y --> Y2[Build Dashboards]
    Y --> Y3[Create Alerts: Multi-Burn Rate]
    Y --> Y4[Document Runbooks]
    Y1 --> Z[Operational Phase]
    Y2 --> Z
    Y3 --> Z
    Y4 --> Z
    Z --> AA[Regular Business Review]
    AA --> AB{SLO Reflects User Experience?}
    AB -->|No - Users Unhappy| AC[Product: Gather User Feedback]
    AC --> AD[Tighten SLO or Redefine CUJ]
    AD --> I
    AB -->|No - SLO Too Strict| AE[Engineering: Cost-Benefit Analysis]
    AE --> AF[Relax SLO or Improve Architecture]
    AF --> S
    AB -->|Yes| AG[Maintain SLO]
    AG --> AH[Quarterly Strategic Review]
    AH --> AI{Business Priorities Changed?}
    AI -->|Yes| D
    AI -->|No| AA
    style A fill:#ffebee
    style E fill:#fff3e0
    style H fill:#e8f5e9
    style T fill:#e1f5fe
    style Z fill:#f3e5f5
```

Key Takeaways

  • CUJ selection is a business decision - SRE provides feasibility constraints, not the priorities
  • Target-setting starts with user expectations and historical data, not a desired number of nines
  • An error budget policy is required for an SLO to be operational rather than informational
  • Naming conventions and tagging standards established at implementation time are hard to change retroactively
  • Quarterly review is necessary - user behavior and service architectures change, and SLOs need to reflect that

Related posts: