Quick definitions for this post:

  • SLI (Service Level Indicator) - a metric that measures some aspect of service behavior (e.g., request success rate, latency)
  • SLO (Service Level Objective) - a target applied to an SLI (e.g., 99.9% of requests succeed)
  • CUJ (Critical User Journey) - an end-to-end sequence of user actions that delivers business value
  • Error budget - the amount of failure an SLO target permits (e.g., a 99.9% target allows 0.1% of requests to fail over a 30-day window)
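The error budget arithmetic can be sketched in a few lines. This is a minimal illustration for a request-based SLO; the function names and the traffic numbers are made up for the example.

```python
def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Number of requests that may fail before the SLO is breached."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    return 1.0 - failed_requests / allowed_failures(slo_target, total_requests)


# Hypothetical 30-day window: 1,000,000 requests against a 99.9% target.
budget = allowed_failures(0.999, 1_000_000)        # roughly 1,000 requests
remaining = budget_remaining(0.999, 1_000_000, 250)  # roughly 0.75
```

With 250 failures against a budget of about 1,000, roughly three quarters of the budget is still available for the rest of the window.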

Two Different Monitoring Philosophies

CUJ SLOs and service SLOs answer different questions. Mixing the two - or treating them as equivalent - produces SLO configurations that are either redundant or miss the point.

CUJ SLOs measure whether a user can complete a meaningful task from end to end. They are black-box by design: measured at the entry point of the journey, from the user’s perspective. They do not care about internal service health - only the outcome.

Service SLOs measure the reliability of a specific service at its API boundary. They answer “is this service healthy?” not “can the user accomplish their goal?”

The difference matters because a service can be healthy while the user journey fails (due to integration issues, dependent service failures, or frontend problems), and a service can appear degraded while the user journey succeeds (due to retries, fallbacks, or caching).
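The retry case can be made concrete with a small sketch. It assumes a simplified model in which a journey counts as successful if any of its attempts succeeds; the `Attempt` type and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    journey_id: str   # which user journey this call belongs to
    succeeded: bool   # whether this individual service call succeeded


def service_success_rate(attempts: list[Attempt]) -> float:
    """Service view: every call counts, including failed calls that were retried."""
    if not attempts:
        raise ValueError("no attempts recorded")
    return sum(a.succeeded for a in attempts) / len(attempts)


def journey_success_rate(attempts: list[Attempt]) -> float:
    """User view: a journey succeeds if at least one of its attempts succeeded."""
    if not attempts:
        raise ValueError("no attempts recorded")
    outcome: dict[str, bool] = {}
    for a in attempts:
        outcome[a.journey_id] = outcome.get(a.journey_id, False) or a.succeeded
    return sum(outcome.values()) / len(outcome)


attempts = [
    Attempt("j1", False), Attempt("j1", True),   # failed once, retry succeeded
    Attempt("j2", True),
    Attempt("j3", False), Attempt("j3", True),   # failed once, retry succeeded
]
service_rate = service_success_rate(attempts)   # 3 of 5 calls succeeded
journey_rate = journey_success_rate(attempts)   # 3 of 3 journeys succeeded
```

The same traffic yields a degraded service number and a perfect journey number, which is why the two SLO types cannot stand in for each other.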

Understanding Service Dependencies

When building CUJ SLOs, the key decision is which services to include in the measurement. This requires understanding the dependency type of each service in the journey.

Hard Dependencies

A hard dependency is a service whose failure causes the entire user journey to fail. The user cannot complete their goal if this service is unavailable.

Examples:

  • Payment gateway in a checkout flow - no payment means no order
  • Authentication service for login - failure blocks all access
  • Database for data retrieval - without data, nothing can be displayed

Hard dependencies are always in scope for the CUJ SLO: their failures count as journey failures.

Soft Dependencies

A soft dependency is a service whose failure does not prevent the user journey from completing. The experience may be degraded, but the core goal is still achievable.

Examples:

  • Analytics or logging service - the journey completes even if metrics are not recorded
  • Recommendation engine - users can browse without personalized suggestions
  • Email notification service - an order processes even if the confirmation email fails

Soft dependencies are excluded from CUJ SLO measurement. Failures there do not represent a user journey failure.

In-Between Dependencies

Some services fall between these categories: their failure causes degradation but not complete failure. A caching layer with database fallback is a typical example - without the cache, the journey is slower but still succeeds.

Whether to include these in the CUJ SLO depends on whether the degradation violates the SLO threshold. If the fallback still meets the latency target, exclude it. If it does not, treat it as a hard dependency.

Decision rule: “Can the user complete their goal without this service?” If no, it is a hard dependency. If yes, it is soft, provided the degraded path still meets the SLO threshold.
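The classification rules from this section can be expressed as a small function. This is a sketch under stated assumptions: the `Dependency` fields, the latency figures, and the 500 ms target are hypothetical and only illustrate the decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DependencyType(Enum):
    HARD = "hard"
    SOFT = "soft"


@dataclass(frozen=True)
class Dependency:
    name: str
    journey_completes_without_it: bool
    # Journey latency when this service is down and the fallback path is used.
    # None means its absence has no measurable latency impact.
    fallback_latency_ms: Optional[float] = None


def classify(dep: Dependency, latency_target_ms: float) -> DependencyType:
    """Apply the decision rule, treating SLO-breaking degradation as hard."""
    if not dep.journey_completes_without_it:
        return DependencyType.HARD
    if dep.fallback_latency_ms is not None and dep.fallback_latency_ms > latency_target_ms:
        return DependencyType.HARD  # in-between case: fallback misses the target
    return DependencyType.SOFT


TARGET_MS = 500.0  # hypothetical latency threshold for the journey
payment = classify(Dependency("payment-gateway", False), TARGET_MS)
recommendations = classify(Dependency("recommendation-engine", True), TARGET_MS)
slow_fallback = classify(Dependency("cache", True, fallback_latency_ms=800.0), TARGET_MS)
fast_fallback = classify(Dependency("cache", True, fallback_latency_ms=300.0), TARGET_MS)
```

Here the cache is hard when its database fallback takes 800 ms and soft when the fallback takes 300 ms, matching the in-between rule.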

Avoiding Redundant Measurement

A common mistake in CUJ SLO design is measuring both a service and its hard dependencies separately. This counts the same failure twice.

If Service A calls Service B, and a failure in Service B causes Service A to return an error to the user, then monitoring Service A’s error rate already captures Service B’s failures from the user’s perspective. Adding a separate monitor on Service B’s endpoint inside the CUJ SLO double-counts the failure and inflates the apparent error rate.

The principle is: measure once, at the entry point where users interact. Failures in downstream hard dependencies automatically manifest at the entry point. This is the black-box principle applied to dependency chains.
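A toy calculation shows the inflation. It assumes a simplified setup: Service A fails only when Service B fails, and the SLO tallies failure events from every monitored endpoint against the number of user requests. The types and traffic numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    request_id: str
    service_b_failed: bool

    @property
    def service_a_failed(self) -> bool:
        # Hard dependency: a Service B failure propagates to Service A's response.
        return self.service_b_failed


def entry_point_error_rate(requests: list[Request]) -> float:
    """Measure once, at Service A, where the user interacts."""
    return sum(r.service_a_failed for r in requests) / len(requests)


def double_counted_error_rate(requests: list[Request]) -> float:
    """Tally failure events from both A and B against the same user requests."""
    failure_events = sum(r.service_a_failed for r in requests)
    failure_events += sum(r.service_b_failed for r in requests)
    return failure_events / len(requests)


# 100 user requests, 2 of which hit a Service B failure.
requests = [Request(f"r{i}", service_b_failed=(i < 2)) for i in range(100)]
measured_once = entry_point_error_rate(requests)      # 2 failures / 100 requests
measured_twice = double_counted_error_rate(requests)  # 4 failure events / 100 requests
```

Two users were affected, but the second tally reports twice the error rate because each failed request is seen at both layers.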

Monitors, Service SLOs, and CUJ SLOs

These three tools serve different purposes and should not be conflated:

Tool | Purpose | Notification style
Monitors/Alerts | Immediate notification for clear, actionable conditions | Fires when threshold is crossed
Service SLOs | Track reliability of a single service over time | Burn rate alerts for sustained degradation
CUJ SLOs | Black-box measurement of user journey outcomes | Burn rate alerts tied to business impact

Alerts are for operators. Service SLOs are for service owners. CUJ SLOs are for product and business stakeholders. The audience shapes how each is configured and reviewed.
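Both SLO types rely on burn rate alerts, so a short sketch of the calculation may help. It uses the common definition of burn rate as the observed error rate divided by the error rate the SLO allows; the function names and sample numbers are illustrative.

```python
def burn_rate(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    observed_error_rate = failed_requests / total_requests
    budget_error_rate = 1.0 - slo_target
    return observed_error_rate / budget_error_rate


def days_until_budget_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Time to spend a full window's budget if the current burn rate persists."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Hypothetical recent traffic: 50 failures in 10,000 requests against 99.9%.
rate = burn_rate(0.999, 10_000, 50)            # roughly 5x the sustainable pace
days_left = days_until_budget_exhausted(rate)  # roughly 6 days for a 30-day window
```

A threshold monitor would fire on the first spike, while a burn rate alert asks whether the current pace, if sustained, exhausts the budget before the window ends.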

What Counts as a CUJ Service

For a service to belong in a CUJ SLO, it must be directly invoked by users (or be a hard dependency of something directly invoked), and its failure must result in a user-visible failure of the journey.

A service that is called by other services for internal purposes, but whose failure is handled gracefully by callers, does not belong in the CUJ SLO. Measure it in a service SLO if it matters for reliability, but do not conflate internal health with user outcome.

Example - workflow lifecycle CUJ:

The minimal endpoint set for a Create → Edit → Delete workflow journey maps to three service calls, all on the same service:

Action | Endpoint
Create | POST /[workflow-service]/api/v1/workflows
Edit | PUT /[workflow-service]/api/v1/workflows/{id}
Delete | DELETE /[workflow-service]/api/v0/workflows/{id}

Downstream services that the [workflow-service] calls internally are excluded. Their failures surface as failures on these three endpoints - measuring them separately would double-count the same events.
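One way to compute this CUJ's SLI is to filter request logs down to the three entry-point endpoints and ignore everything else. The sketch below is illustrative only: the `LogEntry` shape, the sample log lines, and the rule that any status below 500 counts as success are assumptions, and the endpoint paths (including the `[workflow-service]` placeholder) are copied as written in the table above.

```python
import re
from dataclasses import dataclass

# Entry-point endpoints for the journey, copied verbatim from the table.
CUJ_ENDPOINTS = [
    ("POST", "/[workflow-service]/api/v1/workflows"),
    ("PUT", "/[workflow-service]/api/v1/workflows/{id}"),
    ("DELETE", "/[workflow-service]/api/v0/workflows/{id}"),
]


def _template_to_regex(template: str) -> re.Pattern:
    # Escape the literal parts and let "{id}" match exactly one path segment.
    parts = template.split("{id}")
    return re.compile("^" + "[^/]+".join(re.escape(p) for p in parts) + "$")


_PATTERNS = [(method, _template_to_regex(path)) for method, path in CUJ_ENDPOINTS]


@dataclass(frozen=True)
class LogEntry:
    method: str
    path: str
    status: int


def in_cuj(entry: LogEntry) -> bool:
    """True if the request hit one of the journey's entry-point endpoints."""
    return any(entry.method == m and p.match(entry.path) is not None for m, p in _PATTERNS)


def cuj_success_rate(entries: list[LogEntry]) -> float:
    """Share of journey requests that did not fail with a server error."""
    relevant = [e for e in entries if in_cuj(e)]
    if not relevant:
        raise ValueError("no requests matched the CUJ endpoints")
    good = sum(e.status < 500 for e in relevant)  # assumption: 5xx means failure
    return good / len(relevant)


log = [
    LogEntry("POST", "/[workflow-service]/api/v1/workflows", 201),
    LogEntry("PUT", "/[workflow-service]/api/v1/workflows/42", 200),
    LogEntry("DELETE", "/[workflow-service]/api/v0/workflows/42", 503),
    LogEntry("GET", "/internal/storage/blobs/42", 500),  # downstream call, ignored
]
sli = cuj_success_rate(log)  # 2 good out of 3 journey requests
```

The downstream 500 is deliberately not counted on its own: if it mattered to the user, it already shows up as the 503 on the DELETE endpoint.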

Key Takeaways

  • CUJ SLOs measure user outcomes; service SLOs measure service health - these are complementary, not interchangeable
  • Include hard dependencies in CUJ SLOs; exclude soft dependencies
  • Measure at the entry point where users interact - downstream hard dependency failures surface automatically
  • Double-measuring services and their hard dependencies double-counts failures and produces misleading SLO scores
  • The right question for dependency classification is: “Can the user complete their goal without this service?”
