This document presents an assessment of implementing Critical User Journey (CUJ) SLOs with Datadog in a production analytics platform. It covers what was delivered, the constraints encountered, and recommendations for improving the observability foundation. The intent is an honest accounting rather than a clean success story.

Key outcomes at the time of assessment:

  • 3 CUJs defined with SLO coverage across critical customer scenarios (in progress as of January 2026)
  • Technical gaps identified in logging, service ownership, and instrumentation affecting measurement accuracy
  • Workarounds documented and applied to deliver functional SLOs within existing system constraints
  • Prioritized recommendations produced for improving the observability foundation

Critical finding: Several challenges affected SLO reliability - log quality issues causing misattributed errors, unclear service boundaries creating overlapping monitors, and incomplete instrumentation limiting journey visibility. These are documented to inform future improvements, not to frame current delivery as blocked.


What Are Critical User Journeys?

Critical User Journeys represent end-to-end customer experiences that deliver business value. They differ fundamentally from traditional service-level monitoring.

  • CUJ focus: “Can a customer complete the workflow run?” (outcome-based)
  • Traditional SLO focus: “Is the workflow service responding?” (availability-based)

A true CUJ tracks whether users can accomplish meaningful tasks from start to finish, spanning multiple services and dependencies. Success means the entire flow works from the customer’s perspective, not just that individual services are healthy.

When building CUJ-based SLOs, you are measuring customer impact rather than technical uptime. This requires different instrumentation, cross-service correlation, and a clear understanding of service boundaries. The challenges in this assessment directly affect the ability to measure true customer outcomes versus proxy metrics.
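
To make the distinction concrete, the two queries below contrast the approaches. This is an illustrative sketch only: the service name, trace metric names, and resource filter are assumptions, not the production configuration.

```hcl
# Illustrative only - service, metric, and resource names are assumed.
locals {
  # Traditional SLO: is the workflow service responding at all?
  availability_query = "avg(last_5m):sum:trace.http.request.hits{service:workflow-service,env:prod}.as_count() < 1"

  # CUJ SLO: are workflow runs completing successfully end to end?
  # Scoped to the journey's entry endpoint, not the whole service.
  journey_query = "sum(last_30m):sum:trace.http.request.errors{service:workflow-service,env:prod,resource_name:post_/workflows/run}.as_count() / sum:trace.http.request.hits{service:workflow-service,env:prod,resource_name:post_/workflows/run}.as_count() > 0.01"
}
```

In practice a CUJ query spans several services; this single-endpoint version is the entry-point proxy discussed under Challenge 1 below.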


What Was Implemented

SLO architecture:

  • Primary approach: Datadog APM endpoints → monitors → SLOs
  • SLO type: monitor-based
  • Services instrumented: 4 services across 4 teams
    • [orchestration-engine]
    • [frontend-monolith]
    • [workflow-service]
    • [optimizer-service]

Datadog components created:

  • Monitors for availability, latency, and error rate tracking
  • SLOs covering: Workflow load (Workflow Canvas), Workflow run, Workflow create/edit/delete
  • Tags applied:
    • SLO: env:, service:, slo_type:
    • Monitor: team:, service:, env:, monitor_type:

Naming conventions established for Terraform:

  • SLO: cuj_slo_[service]_[journey]_[env]
  • Monitor: [service]_[metric]_[what it does]_[env]
  • Consistent tagging for dashboard filtering and ownership tracking
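
A minimal Terraform sketch of the monitor → SLO pattern and these conventions. The query, thresholds, and notification handle are hypothetical, not the deployed configuration:

```hcl
# Hypothetical sketch - names follow the conventions above; the query,
# thresholds, and @-handle are illustrative, not production values.
resource "datadog_monitor" "workflow_run_error_rate" {
  name    = "workflow-service_error_rate_run_failures_prod"
  type    = "query alert"
  message = "Workflow run error rate is elevated. @slack-workflow-oncall"

  query = "sum(last_5m):sum:trace.http.request.errors{service:workflow-service,env:prod}.as_count() / sum:trace.http.request.hits{service:workflow-service,env:prod}.as_count() > 0.05"

  monitor_thresholds {
    critical = 0.05
  }

  tags = ["team:workflows", "service:workflow-service", "env:prod", "monitor_type:error_rate"]
}

resource "datadog_service_level_objective" "workflow_run" {
  name        = "cuj_slo_workflow-service_workflow-run_prod"
  type        = "monitor"
  description = "Can a customer complete a workflow run end to end?"
  monitor_ids = [datadog_monitor.workflow_run_error_rate.id]

  thresholds {
    timeframe = "30d"
    target    = 99.5
    warning   = 99.8
  }

  tags = ["env:prod", "service:workflow-service", "slo_type:error_rate"]
}
```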

Implementation flow:

```mermaid
flowchart TD
    Start([Start]) --> DefineJourneys[Define CUJs]

    %% Analysis Phase
    DefineJourneys --> DecisionCUJ{Is this a true CUJ or service health check?}
    DecisionCUJ -->|Service Check| RedefineScope[Redefine Scope]
    RedefineScope --> DefineJourneys
    DecisionCUJ -->|True CUJ| ValidateStakeholders[Validate with Stakeholders]

    ValidateStakeholders --> AssessBoundaries[Assess Service Boundaries]
    AssessBoundaries --> DecisionBoundaries{Are service boundaries clear?}
    DecisionBoundaries -->|No| ClarifyScope[Clarify Scope]
    ClarifyScope --> AssessBoundaries
    DecisionBoundaries -->|Yes| CheckCorrelation{Is trace-log correlation available?}

    CheckCorrelation -->|No| CorrelationGap[Correlation gaps]
    CorrelationGap --> DocWorkarounds1[Document Workarounds]
    CheckCorrelation -->|Yes| IdentifyGaps[Identify Instrumentation Gaps]
    DocWorkarounds1 --> IdentifyGaps

    IdentifyGaps --> DecisionInstrumented{Are all services instrumented?}
    DecisionInstrumented -->|No| InstGap[Instrumentation gaps]
    InstGap --> ScopeLimits[Scope Limitations]
    DecisionInstrumented -->|Yes| CreateMonitors[Create Monitors]
    ScopeLimits --> CreateMonitors

    %% Implementation Phase
    CreateMonitors --> ConfigSLOs[Configure SLOs]
    ConfigSLOs --> ApplyTags[Apply Tags]
    ApplyTags --> SetBaselines[Set Baselines]

    SetBaselines --> DecisionTargets{SLO targets realistic?}
    DecisionTargets -->|No| AdjustTargets[Adjust Targets]
    AdjustTargets --> SetBaselines
    DecisionTargets -->|Yes| TestCoverage[Test Coverage]

    %% Finalization Phase
    TestCoverage --> CheckFalsePos[Check False Positives]
    TestCoverage --> CheckFalseNeg[Check False Negatives]
    CheckFalsePos --> FalsePositives[False Positives Detected]
    CheckFalseNeg --> FalseNegatives[False Negatives Detected]
    FalsePositives --> ApplyFilters[Apply Filters and Thresholds]
    FalseNegatives --> SyntheticTests[Implement Synthetic Tests]

    ApplyFilters --> DocWorkarounds2[Document Workarounds]
    SyntheticTests --> DocWorkarounds2

    DocWorkarounds2 --> DecisionReady{Deployment ready?}
    DecisionReady -->|No| FixIssues[Fix Issues]
    FixIssues --> TestCoverage
    DecisionReady -->|Yes| Deploy[Deploy SLOs]

    Deploy --> Monitor[Monitor Effectiveness]
    Monitor --> End([End: Functional SLOs])

    %% Styling
    classDef analysisPhase fill:#B3E5EC,stroke:#1FB8CD,stroke-width:2px,color:#000
    classDef implementationPhase fill:#A5D6A7,stroke:#2E8B57,stroke-width:2px,color:#000
    classDef finalizationPhase fill:#FFE0B2,stroke:#D2BA4C,stroke-width:2px,color:#000
    classDef riskPhase fill:#FFCDD2,stroke:#DB4545,stroke-width:2px,color:#000
    classDef decisionPhase fill:#FFEB8A,stroke:#D2BA4C,stroke-width:2px,color:#000

    class DefineJourneys,ValidateStakeholders,AssessBoundaries,IdentifyGaps,RedefineScope,ClarifyScope analysisPhase
    class CreateMonitors,ConfigSLOs,ApplyTags,SetBaselines,AdjustTargets implementationPhase
    class TestCoverage,Deploy,Monitor,DocWorkarounds2,FixIssues,CheckFalsePos,CheckFalseNeg finalizationPhase
    class CorrelationGap,InstGap,ScopeLimits,FalsePositives,FalseNegatives,ApplyFilters,SyntheticTests,DocWorkarounds1 riskPhase
    class DecisionCUJ,DecisionBoundaries,CheckCorrelation,DecisionInstrumented,DecisionTargets,DecisionReady decisionPhase
```

Implementation Constraints

Four primary constraint categories affected delivery quality:

  1. Observability gaps - missing instrumentation, incomplete logging, or absent correlation IDs
  2. Service boundary ambiguity - overlapping responsibilities or unclear ownership between teams
  3. Log quality issues - false positives, false negatives, and misattributed errors
  4. Infrastructure as Code technical debt - Terraform repository disorder affecting maintainability and change confidence

These constraints required workarounds that are documented below. The SLOs delivered are functional but may not capture complete journey fidelity given current system limitations.


Challenge 1: Distinguishing True CUJs from Service Health Checks

The problem: Some requested “journeys” were actually service-level health checks rather than customer outcome measurements. A monitor that tracks whether a service is responding does not tell you whether customers can complete their task.

Example scenario:

  • Labeled as CUJ: “User Login Journey” measuring authentication service uptime
  • Actual measurement: Service availability, not whether login works end-to-end
  • What is missing: Frontend rendering, session creation, redirect success, dependent profile data loading

A service can be healthy while the journey fails due to integration issues, network problems, or downstream dependencies. This produces green dashboards while customer experience degrades.

What was done: Documented which SLOs measure true end-to-end journeys versus service proxies. Where end-to-end measurement was not possible due to instrumentation gaps, the limitation and scope were noted explicitly.

Recommendation: Validate each journey with product and business stakeholders to confirm it represents actual customer value. Prioritize instrumentation improvements for true journey tracking.


Challenge 2: Service Boundary and Ownership Confusion

The problem: Overlapping service definitions, unclear team ownership, and shared responsibilities make it difficult to attribute errors correctly and avoid duplicate monitoring.

Example scenario:

  • Services A and B both handle parts of a shared responsibility
  • Error logs appear in both services for the same customer failure
  • Monitors fire in both places, creating duplicate alerts
  • No clear owner for resolving the underlying issue

Without clear boundaries, you create redundant monitors, count errors multiple times in SLO calculations, and generate alert fatigue from duplicate notifications.

What was done:

  • Tagged services with best-known ownership information
  • Documented services with ambiguous boundaries requiring architecture team clarification
  • Created monitors scoped to avoid double-counting where possible (see the sketch after this list)
  • Noted overlaps in this assessment for resolution
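
Endpoint-level scoping was the main tool for avoiding double-counting. A hypothetical illustration, with service and resource names assumed:

```hcl
# Hypothetical: each monitor watches only the endpoints one team owns,
# so a shared failure is counted once rather than once per service.
locals {
  # Team A owns workflow submission on the orchestration engine.
  submit_errors_query = "sum(last_10m):sum:trace.http.request.errors{service:orchestration-engine,env:prod,resource_name:post_/workflows/submit}.as_count() > 20"

  # Team B owns status reads; upstream-caused errors here are excluded
  # from Team A's SLO rather than double-counted in both.
  status_errors_query = "sum(last_10m):sum:trace.http.request.errors{service:workflow-service,env:prod,resource_name:get_/workflows/status}.as_count() > 20"
}
```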

Identifiable markers of this issue:

  • Multiple services logging the same error event with different service tags
  • Service catalog showing missing or multiple owners for the same functionality
  • Runbook links absent or pointing to the wrong team documentation
  • Duplicate alerts firing for a single customer-impacting incident

Recommendation: Service ownership audit using Datadog Software Catalog with scorecard requirements covering: documented owner, runbook links, on-call contacts, and clear responsibility boundaries.


Challenge 3: Log Quality Issues and Error Attribution

The problem: Logs may surface issues that are not real (false positives), miss actual problems (false negatives), or incorrectly attribute errors to services that are merely propagating failures from upstream dependencies.

Monitors are only as reliable as the data they consume. If logs misattribute errors, SLOs blame the wrong services and alerts fire on non-issues while missing real customer impact.

False Positives: Alerts Without Customer Impact

Common sources identified:

Expected behavior logged as errors:

  • Validation failures (user entered invalid data - working as designed)
  • Rate limiting triggers (protecting the system - intended behavior)
  • Authentication rejections (wrong password - not a system failure)

Transient events overcounted:

  • Network timeouts that succeed on automatic retry
  • Load balancer health check failures during valid autoscaling events
  • Database connection pool exhaustion during legitimate traffic spikes

Cascading alarm scenarios:

  • One upstream service fails, generating errors in multiple downstream services
  • All dependent services alert for the same root cause
  • On-call teams respond to noise rather than the actual problem

Workarounds applied:

  • Exclude 4xx client errors from availability SLIs where appropriate (see the sketch after this list)
  • Add evaluation windows requiring sustained error rates before alerting
  • Implement recovery thresholds to suppress alerts that self-resolve
  • Document expected error patterns to filter from critical monitors
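
A minimal sketch of the first two workarounds combined, assuming illustrative thresholds and log facets:

```hcl
# Illustrative: counts only server-side (5xx) errors so validation
# failures, rate limits, and auth rejections (4xx) do not burn the
# availability SLI. Window and thresholds are assumed values.
resource "datadog_monitor" "workflow_server_errors" {
  name    = "workflow-service_error_rate_server_errors_prod"
  type    = "log alert"
  message = "Sustained server-side errors on workflow-service. @slack-workflow-oncall"

  # 15-minute rollup requires a sustained error rate before alerting.
  query = "logs(\"service:workflow-service env:prod status:error @http.status_code:[500 TO 599]\").index(\"*\").rollup(\"count\").last(\"15m\") > 100"

  monitor_thresholds {
    critical          = 100
    critical_recovery = 25  # suppresses flapping from self-resolving spikes
  }

  tags = ["team:workflows", "service:workflow-service", "env:prod", "monitor_type:error_rate"]
}
```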

False Negatives: Missing Real Customer Impact

Common sources identified:

Silent failures:

  • Services failing without generating error logs (poorly instrumented error handling)
  • Timeout scenarios not properly caught and logged
  • Circuit breakers tripping without notification

Severity misclassification:

  • Critical errors logged at WARNING level, bypassing monitor thresholds
  • Customer-impacting issues logged as INFO rather than ERROR
  • Status codes that do not map cleanly to user journey outcomes

Aggregate metric masking:

  • Critical journey failures hidden within overall service success rates
  • Rare but high-impact errors below percentage thresholds
  • Geographic or customer-segment issues averaged away in global metrics

Workarounds applied:

  • Log volume monitoring to detect missing-log scenarios (see the sketch after this list)
  • Documented instrumentation gaps where false negatives likely exist
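
Log volume monitoring inverts the usual alert: it fires when a normally chatty service goes quiet, which is often the only visible symptom of a silent failure. A hypothetical sketch with an assumed baseline floor:

```hcl
# Hypothetical "silence detector": alerts when log volume drops to near
# zero. The floor should come from the service's baseline log volume.
resource "datadog_monitor" "workflow_log_silence" {
  name    = "workflow-service_log_volume_silence_detected_prod"
  type    = "log alert"
  message = "workflow-service has nearly stopped logging - possible silent failure. @slack-workflow-oncall"

  query = "logs(\"service:workflow-service env:prod\").index(\"*\").rollup(\"count\").last(\"30m\") < 10"

  monitor_thresholds {
    critical = 10
  }

  tags = ["team:workflows", "service:workflow-service", "env:prod", "monitor_type:log_volume"]
}
```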

Error Misattribution: Blaming the Wrong Service

How this happens:

Missing correlation IDs:

  • Logs lack trace_id and span_id to link requests across services
  • Cannot determine which upstream service sent bad data
  • Each service in the chain logs an error, appearing like multiple independent problems

Cascading failure confusion:

  • Service A fails → Service B cannot complete requests → Service C times out
  • All three services log errors; only A has the root cause
  • Monitors fire across all services, obscuring the actual problem

Workarounds applied:

  • Enable trace-log correlation where possible using DD_LOGS_INJECTION=true
  • Tag monitors and SLOs with service boundaries to limit scope
  • Document services lacking proper correlation for future instrumentation
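
DD_LOGS_INJECTION is typically set alongside the standard unified service tags. A hypothetical container environment fragment, expressed as Terraform locals for illustration; the service and version values are assumed:

```hcl
# Hypothetical fragment: tracer env vars so trace_id/span_id are
# injected into every log line. Values below are placeholders.
locals {
  datadog_tracer_env = [
    { name = "DD_LOGS_INJECTION", value = "true" },  # trace-log correlation
    { name = "DD_SERVICE",        value = "workflow-service" },
    { name = "DD_ENV",            value = "prod" },
    { name = "DD_VERSION",        value = "1.0.0" },
  ]
}
```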

Identifiable markers of this issue:

  • Error logs missing trace_id, span_id, or parent service context
  • Multiple services simultaneously reporting errors for the same customer request
  • Teams claiming “it’s not our issue” without clear evidence


Challenge 4: Infrastructure as Code Disorder

Terraform repository state: Inconsistent module structure, unclear resource organization, and redundant code that should be unified across environments. Exceptions were treated as the rule rather than the exception.

Impact:

  • Manual verification required to confirm deployed configuration matches documented intent
  • Risk of unintended changes when modifying existing resources
  • Difficult to onboard new team members to CUJ SLO maintenance
  • Inconsistent tagging across resources affecting dashboard filtering

Workarounds applied:

  • External mapping document: “Terraform Resource to CUJ Journey mapping”
  • Peer review requirement for all Terraform changes
  • Naming conventions applied to new resources only (legacy resources remain inconsistent)
  • Manual audit before each deployment to catch drift

Recommendation: Terraform refactoring in phases:

  1. Audit - inventory all CUJ-related resources, identify drift and orphans
  2. Consolidate - migrate to a shared module structure (e.g., modules/datadog_cuj_slo/; sketched after this list)
  3. Tag remediation - retroactively apply consistent tagging to existing resources
  4. Validation - implement pre-commit hooks and CI checks for naming/tagging compliance
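
A hypothetical shape for the consolidation step; the module interface is an assumption about where the refactor could land, not existing code:

```hcl
# Hypothetical consolidated module call - one instance per journey.
module "workflow_run_slo" {
  source = "./modules/datadog_cuj_slo"

  service   = "workflow-service"
  journey   = "workflow_run"
  env       = "prod"
  team      = "workflows"
  target    = 99.5
  timeframe = "30d"
}
```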

Implementation Checklist

This checklist documents the assessment state at the time of writing. Use it to validate completeness and identify remaining gaps.

Journey Definition Quality

| Requirement | Notes |
| --- | --- |
| Journeys represent actual end-user value | Validated for implemented journeys; document which are service health proxies |
| Clear start and end points tied to business outcomes | Documented per journey |
| Business stakeholder validation completed | Schedule validation with Product team |
| Maximum 3–5 critical journeys identified | 3 journeys prioritized at assessment date |
| Journey steps documented with service dependencies | Some gaps in dependency mapping remain |

Service and Ownership Clarity

| Requirement | Notes |
| --- | --- |
| Clear team ownership defined for each service | List services with unclear ownership |
| No overlapping or duplicate service definitions | Document overlaps identified |
| Services tagged with team, service, environment | Applied per Datadog standards |
| Service catalog populated with runbooks, contacts | Completion varies by service |
| On-call rotation defined for each service | Gap services should be listed explicitly |

Observability Foundation

| Requirement | Notes |
| --- | --- |
| Distributed tracing configured across journey services | List services lacking APM |
| Logs include correlation IDs (trace_id, span_id) | List services with missing correlation |
| Structured logging with consistent field names | Some legacy services use unstructured logs |
| Service emits availability, latency, error metrics | Most services instrumented |

Log Quality and Error Attribution

| Requirement | Notes |
| --- | --- |
| Services distinguish client vs server errors | Many services log all errors as 5xx-equivalent |
| Error logs include originating service context | Requires instrumentation work |
| Expected errors excluded from SLI calculations | Filtered: validation errors, rate limits |
| Error severity accurately reflects customer impact | Some misclassification identified |

SLO Configuration Standards

| Requirement | Notes |
| --- | --- |
| Appropriate SLO type selected | Rationale documented in SLO descriptions |
| Meaningful names following naming convention | Convention: cuj_slo_[service]_[journey]_[env] |
| Descriptions include what, why, how, runbook links | Varies across resources |
| Targets based on historical data and business needs | Some targets aspirational, not evidence-based |
| Targets provide buffer above customer SLA | All SLOs stricter than external commitments |

Known Limitations and Workarounds

Correlation Gaps

Impact: Cannot reliably attribute errors across services without trace correlation. May see duplicate error counting or blame propagated from the wrong source.

Workaround: Separate monitors per service with limited cross-service visibility; conservative SLO scope using service boundary tags; dependency relationships documented manually in the service catalog.

Recommendation: Enable DD_LOGS_INJECTION=true in tracer configuration for affected services. Priority: High - this affects accuracy of all SLOs involving those services.

Service Boundary Ambiguities

Impact: Potential double-counting of errors, unclear team ownership on incidents, duplicate alerts for single root cause.

Workaround: Monitors scoped to specific endpoints within services to avoid overlap; ambiguous services tagged with multiple team identifiers; escalation documentation created for unclear ownership scenarios.

Recommendation: Architecture review to clarify service boundaries. In the interim, establish an ownership decision matrix per service area.

Instrumentation Gaps

Impact: Cannot measure the full customer journey. SLO coverage limited to instrumented portions only.

Workaround: SLO definitions explicitly scoped to exclude uninstrumented sections; synthetic tests added as compensating controls where possible (see the sketch below); gaps noted in SLO descriptions to set accurate expectations.

Recommendation: Instrumentation backfill prioritized for gap services based on journey criticality.
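
Where instrumentation is missing, a synthetic test gives a coarse journey signal in the interim. A minimal hypothetical example; the URL, location, and cadence are placeholders:

```hcl
# Hypothetical compensating control: a synthetic API test against the
# journey entry point that lacks instrumentation.
resource "datadog_synthetics_test" "workflow_run" {
  name      = "cuj_synthetic_workflow-service_workflow_run_prod"
  type      = "api"
  subtype   = "http"
  status    = "live"
  locations = ["aws:us-east-1"]
  message   = "Workflow run synthetic check failed. @slack-workflow-oncall"
  tags      = ["team:workflows", "service:workflow-service", "env:prod"]

  request_definition {
    method = "GET"
    url    = "https://app.example.com/api/workflows/health"
  }

  assertion {
    type     = "statusCode"
    operator = "is"
    target   = "200"
  }

  options_list {
    tick_every = 300
  }
}
```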


Recommendations and Next Steps

Immediate (1–2 weeks)

  1. Validate journey definitions with stakeholders. Meet with product and business owners to confirm each “journey” represents actual customer value, not just service health.

  2. Enable trace-log correlation for affected services. Deploy DD_LOGS_INJECTION=true to services lacking correlation. This enables accurate error attribution and reduces misattribution across SLOs.

  3. Document error budget policies. Define what happens when SLOs are breached - deployment freeze, post-mortem, or escalation. Without this, SLOs are informational rather than operational.

Short-term (1–3 months)

  1. Service ownership audit using Datadog Software Catalog. Use scorecard requirements to identify and resolve ownership gaps.

  2. Implement error taxonomy tagging. Add error.type tags (client_error, server_error, dependency_error) to differentiate error sources and enable proper attribution (see the sketch after this list).

  3. Expand synthetic test coverage. Create Datadog Synthetic tests for identified false-negative risks, particularly for services with silent failure modes.
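
Once error.type tags exist, queries can separate failures a service owns from propagated and user-caused errors. A hypothetical sketch with assumed facet names:

```hcl
# Hypothetical: error.type facets split owned failures (server_error)
# from upstream blame (dependency_error) and user mistakes (client_error).
locals {
  owned_errors_query      = "logs(\"service:workflow-service env:prod @error.type:server_error\").index(\"*\").rollup(\"count\").last(\"10m\") > 50"
  dependency_errors_query = "logs(\"service:workflow-service env:prod @error.type:dependency_error\").index(\"*\").rollup(\"count\").last(\"10m\") > 50"
}
```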

Long-term (3–6 months)

  1. Instrumentation backfill for gap services. Prioritize adding logging and metrics to uninstrumented services. This enables true end-to-end journey measurement rather than partial proxies.

  2. Log quality improvement program. Establish logging standards: structured format, severity guidelines, required context fields. Enforce via code review and linting.

  3. Unified SLO dashboards with context. Build dashboards combining SLO status with relevant service and infrastructure metrics to improve troubleshooting during SLO breaches.


Measuring Success

These metrics indicate whether improvements are working:

SLO fidelity metrics:

  • Correlation between SLO breaches and customer-reported issues (target: >90% alignment)
  • False positive rate: alerts without customer impact (target: <5% of all alerts)
  • False negative rate: customer issues not caught by SLOs (target: <10% of incidents)

Operational health metrics:

  • Mean time to detect (MTTD): time from issue start to SLO breach detection
  • Mean time to resolve (MTTR): time from SLO breach to resolution
  • Alert fatigue: number of alerts that require no action (lower is better)

Instrumentation coverage:

  • Percentage of critical journeys with complete end-to-end tracing
  • Percentage of services with correlation IDs in logs
  • Percentage of SLOs with validated business stakeholder sign-off

Key Takeaways

  • A functional SLO with documented limitations is more useful than a stalled SLO waiting for perfect conditions
  • Log quality problems are a systemic issue, not an SRE configuration problem - they require engineering investment
  • Service boundary ambiguity causes both operational confusion and measurement inaccuracy; clarifying it is high-leverage work
  • Infrastructure as Code disorder compounds over time; naming conventions and module structure established early are much cheaper than remediation later
  • Error budget policies need to be agreed before an SLO is deployed, not negotiated during an incident
