This document presents an assessment of implementing Critical User Journey (CUJ) SLOs with Datadog in a production analytics platform. It covers what was delivered, the constraints encountered, and recommendations for improving the observability foundation. The intent is an honest accounting rather than a clean success story.

Key outcomes at the time of assessment:

  • 3 CUJs defined with SLO coverage across critical customer scenarios (in progress as of January 2026)
  • Technical gaps identified in logging, service ownership, and instrumentation affecting measurement accuracy
  • Workarounds documented and applied to deliver functional SLOs within existing system constraints
  • Prioritized recommendations produced for improving the observability foundation

Critical finding: Several challenges affected SLO reliability - log quality issues causing misattributed errors, unclear service boundaries creating overlapping monitors, and incomplete instrumentation limiting journey visibility. These are documented to inform future improvements, not to frame current delivery as blocked.


What Are Critical User Journeys?

Critical User Journeys represent end-to-end customer experiences that deliver business value. They differ fundamentally from traditional service-level monitoring.

  • CUJ focus: “Can a customer complete the workflow run?” (outcome-based)
  • Traditional SLO focus: “Is the workflow service responding?” (availability-based)

A true CUJ tracks whether users can accomplish meaningful tasks from start to finish, spanning multiple services and dependencies. Success means the entire flow works from the customer’s perspective, not just that individual services are healthy.

When building CUJ-based SLOs, you are measuring customer impact rather than technical uptime. This requires different instrumentation, cross-service correlation, and a clear understanding of service boundaries. The challenges in this assessment directly affect the ability to measure true customer outcomes versus proxy metrics.
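
To make the distinction concrete, the two queries below contrast the approaches. This is an illustrative sketch only: the service name, trace metric names, and resource filter are assumptions, not the production configuration.

```hcl
# Illustrative only - service, metric, and resource names are assumed.
locals {
  # Traditional SLO: is the workflow service responding at all?
  availability_query = "avg(last_5m):sum:trace.http.request.hits{service:workflow-service,env:prod}.as_count() < 1"

  # CUJ SLO: are workflow runs completing successfully end to end?
  # Scoped to the journey's entry endpoint, not the whole service.
  journey_query = "sum(last_30m):sum:trace.http.request.errors{service:workflow-service,env:prod,resource_name:post_/workflows/run}.as_count() / sum:trace.http.request.hits{service:workflow-service,env:prod,resource_name:post_/workflows/run}.as_count() > 0.01"
}
```

In practice a CUJ query spans several services; this single-endpoint version is the entry-point proxy discussed under Challenge 1 below.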


What Was Implemented

SLO architecture:

  • Primary approach: Datadog APM endpoints → monitors → SLOs
  • SLO type: monitor-based
  • Services instrumented: 4 services across 4 teams
    • [orchestration-engine]
    • [frontend-monolith]
    • [workflow-service]
    • [optimizer-service]

Datadog components created:

  • Monitors for availability, latency, and error rate tracking
  • SLOs covering: Workflow load (Workflow Canvas), Workflow run, Workflow create/edit/delete
  • Tags applied:
    • SLO: env:, service:, slo_type:
    • Monitor: team:, service:, env:, monitor_type:

Naming conventions established for Terraform:

  • SLO: cuj_slo_[service]_[journey]_[env]
  • Monitor: [service]_[metric]_[what it does]_[env]
  • Consistent tagging for dashboard filtering and ownership tracking
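
A minimal Terraform sketch of the monitor → SLO pattern and these conventions. The query, thresholds, and notification handle are hypothetical, not the deployed configuration:

```hcl
# Hypothetical sketch - names follow the conventions above; the query,
# thresholds, and @-handle are illustrative, not production values.
resource "datadog_monitor" "workflow_run_error_rate" {
  name    = "workflow-service_error_rate_run_failures_prod"
  type    = "query alert"
  message = "Workflow run error rate is elevated. @slack-workflow-oncall"

  query = "sum(last_5m):sum:trace.http.request.errors{service:workflow-service,env:prod}.as_count() / sum:trace.http.request.hits{service:workflow-service,env:prod}.as_count() > 0.05"

  monitor_thresholds {
    critical = 0.05
  }

  tags = ["team:workflows", "service:workflow-service", "env:prod", "monitor_type:error_rate"]
}

resource "datadog_service_level_objective" "workflow_run" {
  name        = "cuj_slo_workflow-service_workflow-run_prod"
  type        = "monitor"
  description = "Can a customer complete a workflow run end to end?"
  monitor_ids = [datadog_monitor.workflow_run_error_rate.id]

  thresholds {
    timeframe = "30d"
    target    = 99.5
    warning   = 99.8
  }

  tags = ["env:prod", "service:workflow-service", "slo_type:error_rate"]
}
```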

Implementation flow:

```mermaid
flowchart TD
    Start([Start]) --> DefineJourneys[Define CUJs]

    %% Analysis Phase
    DefineJourneys --> DecisionCUJ{Is this a true CUJ or service health check?}
    DecisionCUJ -->|Service Check| RedefineScope[Redefine Scope]
    RedefineScope --> DefineJourneys
    DecisionCUJ -->|True CUJ| ValidateStakeholders[Validate with Stakeholders]

    ValidateStakeholders --> AssessBoundaries[Assess Service Boundaries]
    AssessBoundaries --> DecisionBoundaries{Are service boundaries clear?}
    DecisionBoundaries -->|No| ClarifyScope[Clarify Scope]
    ClarifyScope --> AssessBoundaries
    DecisionBoundaries -->|Yes| CheckCorrelation{Is trace-log correlation available?}

    CheckCorrelation -->|No| CorrelationGap[Correlation gaps]
    CorrelationGap --> DocWorkarounds1[Document Workarounds]
    CheckCorrelation -->|Yes| IdentifyGaps[Identify Instrumentation Gaps]
    DocWorkarounds1 --> IdentifyGaps

    IdentifyGaps --> DecisionInstrumented{Are all services instrumented?}
    DecisionInstrumented -->|No| InstGap[Instrumentation gaps]
    InstGap --> ScopeLimits[Scope Limitations]
    DecisionInstrumented -->|Yes| CreateMonitors[Create Monitors]
    ScopeLimits --> CreateMonitors

    %% Implementation Phase
    CreateMonitors --> ConfigSLOs[Configure SLOs]
    ConfigSLOs --> ApplyTags[Apply Tags]
    ApplyTags --> SetBaselines[Set Baselines]

    SetBaselines --> DecisionTargets{SLO targets realistic?}
    DecisionTargets -->|No| AdjustTargets[Adjust Targets]
    AdjustTargets --> SetBaselines
    DecisionTargets -->|Yes| TestCoverage[Test Coverage]

    %% Finalization Phase
    TestCoverage --> CheckFalsePos[Check False Positives]
    TestCoverage --> CheckFalseNeg[Check False Negatives]
    CheckFalsePos --> FalsePositives[False Positives Detected]
    CheckFalseNeg --> FalseNegatives[False Negatives Detected]
    FalsePositives --> ApplyFilters[Apply Filters and Thresholds]
    FalseNegatives --> SyntheticTests[Implement Synthetic Tests]

    ApplyFilters --> DocWorkarounds2[Document Workarounds]
    SyntheticTests --> DocWorkarounds2

    DocWorkarounds2 --> DecisionReady{Deployment ready?}
    DecisionReady -->|No| FixIssues[Fix Issues]
    FixIssues --> TestCoverage
    DecisionReady -->|Yes| Deploy[Deploy SLOs]

    Deploy --> Monitor[Monitor Effectiveness]
    Monitor --> End([End: Functional SLOs])

    %% Styling
    classDef analysisPhase fill:#B3E5EC,stroke:#1FB8CD,stroke-width:2px,color:#000
    classDef implementationPhase fill:#A5D6A7,stroke:#2E8B57,stroke-width:2px,color:#000
    classDef finalizationPhase fill:#FFE0B2,stroke:#D2BA4C,stroke-width:2px,color:#000
    classDef riskPhase fill:#FFCDD2,stroke:#DB4545,stroke-width:2px,color:#000
    classDef decisionPhase fill:#FFEB8A,stroke:#D2BA4C,stroke-width:2px,color:#000

    class DefineJourneys,ValidateStakeholders,AssessBoundaries,IdentifyGaps,RedefineScope,ClarifyScope analysisPhase
    class CreateMonitors,ConfigSLOs,ApplyTags,SetBaselines,AdjustTargets implementationPhase
    class TestCoverage,Deploy,Monitor,DocWorkarounds2,FixIssues,CheckFalsePos,CheckFalseNeg finalizationPhase
    class CorrelationGap,InstGap,ScopeLimits,FalsePositives,FalseNegatives,ApplyFilters,SyntheticTests,DocWorkarounds1 riskPhase
    class DecisionCUJ,DecisionBoundaries,CheckCorrelation,DecisionInstrumented,DecisionTargets,DecisionReady decisionPhase
```

Implementation Constraints

Four primary constraint categories affected delivery quality:

  1. Observability gaps - missing instrumentation, incomplete logging, or absent correlation IDs
  2. Service boundary ambiguity - overlapping responsibilities or unclear ownership between teams
  3. Log quality issues - false positives, false negatives, and misattributed errors
  4. Infrastructure as Code technical debt - Terraform repository disorder affecting maintainability and change confidence

These constraints required workarounds that are documented below. The SLOs delivered are functional but may not capture complete journey fidelity given current system limitations.


Challenge 1: Distinguishing True CUJs from Service Health Checks

The problem: Some requested “journeys” were actually service-level health checks rather than customer outcome measurements. A monitor that tracks whether a service is responding does not tell you whether customers can complete their task.

Example scenario:

  • Labeled as CUJ: “User Login Journey” measuring authentication service uptime
  • Actual measurement: Service availability, not whether login works end-to-end
  • What is missing: Frontend rendering, session creation, redirect success, dependent profile data loading

A service can be healthy while the journey fails due to integration issues, network problems, or downstream dependencies. This produces green dashboards while customer experience degrades.

What was done: Documented which SLOs measure true end-to-end journeys versus service proxies. Where end-to-end measurement was not possible due to instrumentation gaps, the limitation and scope were noted explicitly.

Recommendation: Validate each journey with product and business stakeholders to confirm it represents actual customer value. Prioritize instrumentation improvements for true journey tracking.


Challenge 2: Service Boundary and Ownership Confusion

The problem: Overlapping service definitions, unclear team ownership, and shared responsibilities make it difficult to attribute errors correctly and avoid duplicate monitoring.

Example scenario:

  • Services A and B both handle parts of a shared responsibility
  • Error logs appear in both services for the same customer failure
  • Monitors fire in both places, creating duplicate alerts
  • No clear owner for resolving the underlying issue

Without clear boundaries, you create redundant monitors, count errors multiple times in SLO calculations, and generate alert fatigue from duplicate notifications.

What was done:

  • Tagged services with best-known ownership information
  • Documented services with ambiguous boundaries requiring architecture team clarification
  • Created monitors scoped to avoid double-counting where possible (see the sketch after this list)
  • Noted overlaps in this assessment for resolution
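
Endpoint-level scoping was the main tool for avoiding double-counting. A hypothetical illustration, with service and resource names assumed:

```hcl
# Hypothetical: each monitor watches only the endpoints one team owns,
# so a shared failure is counted once rather than once per service.
locals {
  # Team A owns workflow submission on the orchestration engine.
  submit_errors_query = "sum(last_10m):sum:trace.http.request.errors{service:orchestration-engine,env:prod,resource_name:post_/workflows/submit}.as_count() > 20"

  # Team B owns status reads; upstream-caused errors here are excluded
  # from Team A's SLO rather than double-counted in both.
  status_errors_query = "sum(last_10m):sum:trace.http.request.errors{service:workflow-service,env:prod,resource_name:get_/workflows/status}.as_count() > 20"
}
```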

Identifiable markers of this issue:

  • Multiple services logging the same error event with different service tags
  • Service catalog showing missing or multiple owners for the same functionality
  • Runbook links absent or pointing to the wrong team documentation
  • Duplicate alerts firing for a single customer-impacting incident

Recommendation: Service ownership audit using Datadog Software Catalog with scorecard requirements covering: documented owner, runbook links, on-call contacts, and clear responsibility boundaries.


Challenge 3: Log Quality Issues and Error Attribution

The problem: Logs may surface issues that are not real (false positives), miss actual problems (false negatives), or incorrectly attribute errors to services that are merely propagating failures from upstream dependencies.

Monitors are only as reliable as the data they consume. If logs misattribute errors, SLOs blame the wrong services and alerts fire on non-issues while missing real customer impact.

False Positives: Alerts Without Customer Impact

Common sources identified:

Expected behavior logged as errors:

  • Validation failures (user entered invalid data - working as designed)
  • Rate limiting triggers (protecting the system - intended behavior)
  • Authentication rejections (wrong password - not a system failure)

Transient events overcounted:

  • Network timeouts that succeed on automatic retry
  • Load balancer health check failures during valid autoscaling events
  • Database connection pool exhaustion during legitimate traffic spikes

Cascading alarm scenarios:

  • One upstream service fails, generating errors in multiple downstream services
  • All dependent services alert for the same root cause
  • On-call teams respond to noise rather than the actual problem

Workarounds applied:

  • Exclude 4xx client errors from availability SLIs where appropriate (see the sketch after this list)
  • Add evaluation windows requiring sustained error rates before alerting
  • Implement recovery thresholds to suppress alerts that self-resolve
  • Document expected error patterns to filter from critical monitors
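
A minimal sketch of the first two workarounds combined, assuming illustrative thresholds and log facets:

```hcl
# Illustrative: counts only server-side (5xx) errors so validation
# failures, rate limits, and auth rejections (4xx) do not burn the
# availability SLI. Window and thresholds are assumed values.
resource "datadog_monitor" "workflow_server_errors" {
  name    = "workflow-service_error_rate_server_errors_prod"
  type    = "log alert"
  message = "Sustained server-side errors on workflow-service. @slack-workflow-oncall"

  # 15-minute rollup requires a sustained error rate before alerting.
  query = "logs(\"service:workflow-service env:prod status:error @http.status_code:[500 TO 599]\").index(\"*\").rollup(\"count\").last(\"15m\") > 100"

  monitor_thresholds {
    critical          = 100
    critical_recovery = 25  # suppresses flapping from self-resolving spikes
  }

  tags = ["team:workflows", "service:workflow-service", "env:prod", "monitor_type:error_rate"]
}
```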

False Negatives: Missing Real Customer Impact

Common sources identified:

Silent failures:

  • Services failing without generating error logs (poorly instrumented error handling)
  • Timeout scenarios not properly caught and logged
  • Circuit breakers tripping without notification

Severity misclassification:

  • Critical errors logged at WARNING level, bypassing monitor thresholds
  • Customer-impacting issues logged as INFO rather than ERROR
  • Status codes that do not map cleanly to user journey outcomes

Aggregate metric masking:

  • Critical journey failures hidden within overall service success rates
  • Rare but high-impact errors below percentage thresholds
  • Geographic or customer-segment issues averaged away in global metrics

Workarounds applied:

  • Log volume monitoring to detect missing-log scenarios (see the sketch after this list)
  • Documented instrumentation gaps where false negatives likely exist
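
Log volume monitoring inverts the usual alert: it fires when a normally chatty service goes quiet, which is often the only visible symptom of a silent failure. A hypothetical sketch with an assumed baseline floor:

```hcl
# Hypothetical "silence detector": alerts when log volume drops to near
# zero. The floor should come from the service's baseline log volume.
resource "datadog_monitor" "workflow_log_silence" {
  name    = "workflow-service_log_volume_silence_detected_prod"
  type    = "log alert"
  message = "workflow-service has nearly stopped logging - possible silent failure. @slack-workflow-oncall"

  query = "logs(\"service:workflow-service env:prod\").index(\"*\").rollup(\"count\").last(\"30m\") < 10"

  monitor_thresholds {
    critical = 10
  }

  tags = ["team:workflows", "service:workflow-service", "env:prod", "monitor_type:log_volume"]
}
```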

Error Misattribution: Blaming the Wrong Service

How this happens:

Missing correlation IDs:

  • Logs lack trace_id and span_id to link requests across services
  • Cannot determine which upstream service sent bad data
  • Each service in the chain logs an error, appearing like multiple independent problems

Cascading failure confusion:

  • Service A fails → Service B cannot complete requests → Service C times out
  • All three services log errors; only A has the root cause
  • Monitors fire across all services, obscuring the actual problem

Workarounds applied:

  • Enable trace-log correlation where possible using DD_LOGS_INJECTION=true
  • Tag monitors and SLOs with service boundaries to limit scope
  • Document services lacking proper correlation for future instrumentation
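
DD_LOGS_INJECTION is typically set alongside the standard unified service tags. A hypothetical container environment fragment, expressed as Terraform locals for illustration; the service and version values are assumed:

```hcl
# Hypothetical fragment: tracer env vars so trace_id/span_id are
# injected into every log line. Values below are placeholders.
locals {
  datadog_tracer_env = [
    { name = "DD_LOGS_INJECTION", value = "true" },  # trace-log correlation
    { name = "DD_SERVICE",        value = "workflow-service" },
    { name = "DD_ENV",            value = "prod" },
    { name = "DD_VERSION",        value = "1.0.0" },
  ]
}
```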

Identifiable markers of this issue:

  • Error logs missing trace_id, span_id, or parent service context
  • Multiple services simultaneously reporting errors for the same customer request
  • Teams claiming “it’s not our issue” without clear evidence


Challenge 4: Infrastructure as Code Disorder

Terraform repository state: Inconsistent module structure, unclear resource organization, and redundant code that should be unified across environments. Exceptions were treated as the rule rather than the exception.

Impact:

  • Manual verification required to confirm deployed configuration matches documented intent
  • Risk of unintended changes when modifying existing resources
  • Difficult to onboard new team members to CUJ SLO maintenance
  • Inconsistent tagging across resources affecting dashboard filtering

Workarounds applied:

  • External mapping document: “Terraform Resource to CUJ Journey mapping”
  • Peer review requirement for all Terraform changes
  • Naming conventions applied to new resources only (legacy resources remain inconsistent)
  • Manual audit before each deployment to catch drift

Recommendation: Terraform refactoring in phases:

  1. Audit - inventory all CUJ-related resources, identify drift and orphans
  2. Consolidate - migrate to a shared module structure (e.g., modules/datadog_cuj_slo/; sketched after this list)
  3. Tag remediation - retroactively apply consistent tagging to existing resources
  4. Validation - implement pre-commit hooks and CI checks for naming/tagging compliance
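
A hypothetical shape for the consolidation step; the module interface is an assumption about where the refactor could land, not existing code:

```hcl
# Hypothetical consolidated module call - one instance per journey.
module "workflow_run_slo" {
  source = "./modules/datadog_cuj_slo"

  service   = "workflow-service"
  journey   = "workflow_run"
  env       = "prod"
  team      = "workflows"
  target    = 99.5
  timeframe = "30d"
}
```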

Implementation Checklist

This checklist documents the assessment state at the time of writing. Use it to validate completeness and identify remaining gaps.

Journey Definition Quality

| Requirement | Notes |
| --- | --- |
| Journeys represent actual end-user value | Validated for implemented journeys; document which are service health proxies |
| Clear start and end points tied to business outcomes | Documented per journey |
| Business stakeholder validation completed | Schedule validation with Product team |
| Maximum 3–5 critical journeys identified | 3 journeys prioritized at assessment date |
| Journey steps documented with service dependencies | Some gaps in dependency mapping remain |

Service and Ownership Clarity

| Requirement | Notes |
| --- | --- |
| Clear team ownership defined for each service | List services with unclear ownership |
| No overlapping or duplicate service definitions | Document overlaps identified |
| Services tagged with team, service, environment | Applied per Datadog standards |
| Service catalog populated with runbooks, contacts | Completion varies by service |
| On-call rotation defined for each service | Gap services should be listed explicitly |

Observability Foundation

| Requirement | Notes |
| --- | --- |
| Distributed tracing configured across journey services | List services lacking APM |
| Logs include correlation IDs (trace_id, span_id) | List services with missing correlation |
| Structured logging with consistent field names | Some legacy services use unstructured logs |
| Service emits availability, latency, error metrics | Most services instrumented |

Log Quality and Error Attribution

| Requirement | Notes |
| --- | --- |
| Services distinguish client vs server errors | Many services log all errors as 5xx-equivalent |
| Error logs include originating service context | Requires instrumentation work |
| Expected errors excluded from SLI calculations | Filtered: validation errors, rate limits |
| Error severity accurately reflects customer impact | Some misclassification identified |

SLO Configuration Standards

| Requirement | Notes |
| --- | --- |
| Appropriate SLO type selected | Rationale documented in SLO descriptions |
| Meaningful names following naming convention | Convention: cuj_slo_[service]_[journey]_[env] |
| Descriptions include what, why, how, runbook links | Varies across resources |
| Targets based on historical data and business needs | Some targets aspirational, not evidence-based |
| Targets provide buffer above customer SLA | All SLOs stricter than external commitments |

Known Limitations and Workarounds

Correlation Gaps

Impact: Cannot reliably attribute errors across services without trace correlation. May see duplicate error counting or blame propagated from the wrong source.

Workaround: Separate monitors per service with limited cross-service visibility; conservative SLO scope using service boundary tags; dependency relationships documented manually in the service catalog.

Recommendation: Enable DD_LOGS_INJECTION=true in tracer configuration for affected services. Priority: High - this affects accuracy of all SLOs involving those services.

Service Boundary Ambiguities

Impact: Potential double-counting of errors, unclear team ownership on incidents, duplicate alerts for single root cause.

Workaround: Monitors scoped to specific endpoints within services to avoid overlap; ambiguous services tagged with multiple team identifiers; escalation documentation created for unclear ownership scenarios.

Recommendation: Architecture review to clarify service boundaries. In the interim, establish an ownership decision matrix per service area.

Instrumentation Gaps

Impact: Cannot measure the full customer journey. SLO coverage limited to instrumented portions only.

Workaround: SLO definitions explicitly scoped to exclude uninstrumented sections; synthetic tests added as compensating controls where possible (see the sketch below); gaps noted in SLO descriptions to set accurate expectations.

Recommendation: Instrumentation backfill prioritized for gap services based on journey criticality.
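
Where instrumentation is missing, a synthetic test gives a coarse journey signal in the interim. A minimal hypothetical example; the URL, location, and cadence are placeholders:

```hcl
# Hypothetical compensating control: a synthetic API test against the
# journey entry point that lacks instrumentation.
resource "datadog_synthetics_test" "workflow_run" {
  name      = "cuj_synthetic_workflow-service_workflow_run_prod"
  type      = "api"
  subtype   = "http"
  status    = "live"
  locations = ["aws:us-east-1"]
  message   = "Workflow run synthetic check failed. @slack-workflow-oncall"
  tags      = ["team:workflows", "service:workflow-service", "env:prod"]

  request_definition {
    method = "GET"
    url    = "https://app.example.com/api/workflows/health"
  }

  assertion {
    type     = "statusCode"
    operator = "is"
    target   = "200"
  }

  options_list {
    tick_every = 300
  }
}
```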


Recommendations and Next Steps

Immediate (1–2 weeks)

  1. Validate journey definitions with stakeholders. Meet with product and business owners to confirm each “journey” represents actual customer value, not just service health.

  2. Enable trace-log correlation for affected services. Deploy DD_LOGS_INJECTION=true to services lacking correlation. This enables accurate error attribution and reduces misattribution across SLOs.

  3. Document error budget policies. Define what happens when SLOs are breached - deployment freeze, post-mortem, or escalation. Without this, SLOs are informational rather than operational.

Short-term (1–3 months)

  1. Service ownership audit using Datadog Software Catalog. Use scorecard requirements to identify and resolve ownership gaps.

  2. Implement error taxonomy tagging. Add error.type tags (client_error, server_error, dependency_error) to differentiate error sources and enable proper attribution (see the sketch after this list).

  3. Expand synthetic test coverage. Create Datadog Synthetic tests for identified false-negative risks, particularly for services with silent failure modes.
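
Once error.type tags exist, queries can separate failures a service owns from propagated and user-caused errors. A hypothetical sketch with assumed facet names:

```hcl
# Hypothetical: error.type facets split owned failures (server_error)
# from upstream blame (dependency_error) and user mistakes (client_error).
locals {
  owned_errors_query      = "logs(\"service:workflow-service env:prod @error.type:server_error\").index(\"*\").rollup(\"count\").last(\"10m\") > 50"
  dependency_errors_query = "logs(\"service:workflow-service env:prod @error.type:dependency_error\").index(\"*\").rollup(\"count\").last(\"10m\") > 50"
}
```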

Long-term (3–6 months)

  1. Instrumentation backfill for gap services. Prioritize adding logging and metrics to uninstrumented services. This enables true end-to-end journey measurement rather than partial proxies.

  2. Log quality improvement program. Establish logging standards: structured format, severity guidelines, required context fields. Enforce via code review and linting.

  3. Unified SLO dashboards with context. Build dashboards combining SLO status with relevant service and infrastructure metrics to improve troubleshooting during SLO breaches.


Measuring Success

These metrics indicate whether improvements are working:

SLO fidelity metrics:

  • Correlation between SLO breaches and customer-reported issues (target: >90% alignment)
  • False positive rate: alerts without customer impact (target: <5% of all alerts)
  • False negative rate: customer issues not caught by SLOs (target: <10% of incidents)

Operational health metrics:

  • Mean time to detect (MTTD): time from issue start to SLO breach detection
  • Mean time to resolve (MTTR): time from SLO breach to resolution
  • Alert fatigue: number of alerts that require no action (lower is better)

Instrumentation coverage:

  • Percentage of critical journeys with complete end-to-end tracing
  • Percentage of services with correlation IDs in logs
  • Percentage of SLOs with validated business stakeholder sign-off

Key Takeaways

  • A functional SLO with documented limitations is more useful than a stalled SLO waiting for perfect conditions
  • Log quality problems are a systemic issue, not an SRE configuration problem - they require engineering investment
  • Service boundary ambiguity causes both operational confusion and measurement inaccuracy; clarifying it is high-leverage work
  • Infrastructure as Code disorder compounds over time; naming conventions and module structure established early are much cheaper than remediation later
  • Error budget policies need to be agreed before an SLO is deployed, not negotiated during an incident
