This document reflects an assessment of implementing Critical User Journey (CUJ) SLOs using Datadog in a production analytics platform. It covers what was delivered, the constraints encountered, and recommendations for improving the observability foundation. The intent is to provide an honest accounting rather than a clean success story.
Key outcomes at the time of assessment:
- 3 CUJs defined with SLO coverage across critical customer scenarios (in progress as of January 2026)
- Technical gaps identified in logging, service ownership, and instrumentation affecting measurement accuracy
- Workarounds documented and applied to deliver functional SLOs within existing system constraints
- Prioritized recommendations produced for improving the observability foundation
Critical finding: Several challenges affected SLO reliability - log quality issues causing misattributed errors, unclear service boundaries creating overlapping monitors, and incomplete instrumentation limiting journey visibility. These are documented to inform future improvements, not to frame current delivery as blocked.
What Are Critical User Journeys?
Critical User Journeys represent end-to-end customer experiences that deliver business value. They differ fundamentally from traditional service-level monitoring.
- CUJ focus: “Can a customer complete the workflow run?” (outcome-based)
- Traditional SLO focus: “Is the workflow service responding?” (availability-based)
A true CUJ tracks whether users can accomplish meaningful tasks from start to finish, spanning multiple services and dependencies. Success means the entire flow works from the customer’s perspective, not just that individual services are healthy.
When building CUJ-based SLOs, you are measuring customer impact rather than technical uptime. This requires different instrumentation, cross-service correlation, and a clear understanding of service boundaries. The challenges in this assessment directly affect the ability to measure true customer outcomes versus proxy metrics.
What Was Implemented
SLO architecture:
- Primary approach: Datadog APM endpoints → monitors → SLOs
- SLO type: monitor-based
- Services instrumented: 4 services across 4 teams
  - [orchestration-engine]
  - [frontend-monolith]
  - [workflow-service]
  - [optimizer-service]
Datadog components created:
- Monitors for availability, latency, and error rate tracking
- SLOs covering: Workflow load (Workflow Canvas), Workflow run, Workflow create/edit/delete
- Tags applied:
  - SLO: `env:`, `service:`, `slo_type:`
  - Monitor: `team:`, `service:`, `env:`, `monitor_type:`
Naming conventions established for Terraform:
- SLO: `cuj_slo_[service]_[journey]_[env]`
- Monitor: `[service]_[metric]_[what it does]_[env]`
- Consistent tagging for dashboard filtering and ownership tracking
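A convention only pays off if it is checked mechanically. Below is a hypothetical validator sketch for the naming patterns above; the allowed character sets and the environment names (`dev`, `staging`, `prod`) are assumptions, not the team's actual spec.

```python
import re

# Hypothetical validators for the naming conventions above. The permitted
# characters and environment suffixes are assumptions for illustration.
SLO_NAME = re.compile(r"cuj_slo_[a-z0-9-]+_[a-z0-9_]+_(dev|staging|prod)")
MONITOR_NAME = re.compile(r"[a-z0-9-]+_[a-z0-9_]+_[a-z0-9_]+_(dev|staging|prod)")

def is_valid_slo_name(name: str) -> bool:
    """True when the name matches cuj_slo_[service]_[journey]_[env]."""
    return SLO_NAME.fullmatch(name) is not None

def is_valid_monitor_name(name: str) -> bool:
    """True when the name matches [service]_[metric]_[what it does]_[env]."""
    return MONITOR_NAME.fullmatch(name) is not None

print(is_valid_slo_name("cuj_slo_workflow-service_workflow_run_prod"))     # True
print(is_valid_monitor_name("optimizer-service_latency_p95_breach_prod"))  # True
print(is_valid_slo_name("Workflow Run SLO"))                               # False
```

A check like this can run in CI or a pre-commit hook so non-conforming resources never reach the Terraform repository.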
Implementation flow:
```mermaid
flowchart TD
    Start([Start]) --> DefineJourneys[Define CUJs]
    %% Analysis Phase
    DefineJourneys --> DecisionCUJ{Is this a true CUJ or service health check?}
    DecisionCUJ -->|Service Check| RedefineScope[Redefine Scope]
    RedefineScope --> DefineJourneys
    DecisionCUJ -->|True CUJ| ValidateStakeholders[Validate with Stakeholders]
    ValidateStakeholders --> AssessBoundaries[Assess Service Boundaries]
    AssessBoundaries --> DecisionBoundaries{Are service boundaries clear?}
    DecisionBoundaries -->|No| ClarifyScope[Clarify Scope]
    ClarifyScope --> AssessBoundaries
    DecisionBoundaries -->|Yes| CheckCorrelation{Is trace-log correlation available?}
    CheckCorrelation -->|No| CorrelationGap[Correlation gaps]
    CorrelationGap --> DocWorkarounds1[Document Workarounds]
    CheckCorrelation -->|Yes| IdentifyGaps[Identify Instrumentation Gaps]
    DocWorkarounds1 --> IdentifyGaps
    IdentifyGaps --> DecisionInstrumented{Are all services instrumented?}
    DecisionInstrumented -->|No| InstGap[Instrumentation gaps]
    InstGap --> ScopeLimits[Scope Limitations]
    DecisionInstrumented -->|Yes| CreateMonitors[Create Monitors]
    ScopeLimits --> CreateMonitors
    %% Implementation Phase
    CreateMonitors --> ConfigSLOs[Configure SLOs]
    ConfigSLOs --> ApplyTags[Apply Tags]
    ApplyTags --> SetBaselines[Set Baselines]
    SetBaselines --> DecisionTargets{SLO targets realistic?}
    DecisionTargets -->|No| AdjustTargets[Adjust Targets]
    AdjustTargets --> SetBaselines
    DecisionTargets -->|Yes| TestCoverage[Test Coverage]
    %% Finalization Phase
    TestCoverage --> CheckFalsePos[Check False Positives]
    TestCoverage --> CheckFalseNeg[Check False Negatives]
    CheckFalsePos --> FalsePositives[False Positives Detected]
    CheckFalseNeg --> FalseNegatives[False Negatives Detected]
    FalsePositives --> ApplyFilters[Apply Filters and Thresholds]
    FalseNegatives --> SyntheticTests[Implement Synthetic Tests]
    ApplyFilters --> DocWorkarounds2[Document Workarounds]
    SyntheticTests --> DocWorkarounds2
    DocWorkarounds2 --> DecisionReady{Deployment ready?}
    DecisionReady -->|No| FixIssues[Fix Issues]
    FixIssues --> TestCoverage
    DecisionReady -->|Yes| Deploy[Deploy SLOs]
    Deploy --> Monitor[Monitor Effectiveness]
    Monitor --> End([End: Functional SLOs])
    %% Styling
    classDef analysisPhase fill:#B3E5EC,stroke:#1FB8CD,stroke-width:2px,color:#000
    classDef implementationPhase fill:#A5D6A7,stroke:#2E8B57,stroke-width:2px,color:#000
    classDef finalizationPhase fill:#FFE0B2,stroke:#D2BA4C,stroke-width:2px,color:#000
    classDef riskPhase fill:#FFCDD2,stroke:#DB4545,stroke-width:2px,color:#000
    classDef decisionPhase fill:#FFEB8A,stroke:#D2BA4C,stroke-width:2px,color:#000
    class DefineJourneys,ValidateStakeholders,AssessBoundaries,IdentifyGaps,RedefineScope,ClarifyScope analysisPhase
    class CreateMonitors,ConfigSLOs,ApplyTags,SetBaselines,AdjustTargets implementationPhase
    class TestCoverage,Deploy,Monitor,DocWorkarounds2,FixIssues,CheckFalsePos,CheckFalseNeg finalizationPhase
    class CorrelationGap,InstGap,ScopeLimits,FalsePositives,FalseNegatives,ApplyFilters,SyntheticTests,DocWorkarounds1 riskPhase
    class DecisionCUJ,DecisionBoundaries,CheckCorrelation,DecisionInstrumented,DecisionTargets,DecisionReady decisionPhase
```
Implementation Constraints
Four primary constraint categories affected delivery quality:
- Observability gaps - missing instrumentation, incomplete logging, or absent correlation IDs
- Service boundary ambiguity - overlapping responsibilities or unclear ownership between teams
- Log quality issues - false positives, false negatives, and misattributed errors
- Infrastructure as Code technical debt - Terraform repository disorder affecting maintainability and change confidence
These constraints required workarounds that are documented below. The SLOs delivered are functional but may not capture complete journey fidelity given current system limitations.
Challenge 1: Distinguishing True CUJs from Service Health Checks
The problem: Some requested “journeys” were actually service-level health checks rather than customer outcome measurements. A monitor that tracks whether a service is responding does not tell you whether customers can complete their task.
Example scenario:
- Labeled as CUJ: “User Login Journey” measuring authentication service uptime
- Actual measurement: Service availability, not whether login works end-to-end
- What is missing: Frontend rendering, session creation, redirect success, dependent profile data loading
A service can be healthy while the journey fails due to integration issues, network problems, or downstream dependencies. This produces green dashboards while customer experience degrades.
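The gap between the two views can be sketched with a toy dataset. The step names mirror the login example above (authentication, session creation, redirect, profile load); the numbers are invented for illustration.

```python
# Minimal sketch: per-service availability vs end-to-end journey success.
# Each row is one login attempt; each key is a journey step and whether it
# succeeded. All data here is hypothetical.
attempts = [
    {"auth": True, "session": True, "redirect": True,  "profile": True},
    {"auth": True, "session": True, "redirect": False, "profile": True},
    {"auth": True, "session": True, "redirect": True,  "profile": False},
    {"auth": True, "session": True, "redirect": True,  "profile": True},
]

# The auth service alone looks perfect...
auth_availability = sum(a["auth"] for a in attempts) / len(attempts)

# ...but the journey succeeds only when every step succeeds.
journey_success = sum(all(a.values()) for a in attempts) / len(attempts)

print(auth_availability)  # 1.0 - the "service" metric shows green
print(journey_success)    # 0.5 - half the customers never completed login
```

The AND across steps is the defining property of a CUJ SLI: any single-service availability number is an upper bound on it, never a substitute.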
What was done: Documented which SLOs measure true end-to-end journeys versus service proxies. Where end-to-end measurement was not possible due to instrumentation gaps, the limitation and scope were noted explicitly.
Recommendation: Validate each journey with product and business stakeholders to confirm it represents actual customer value. Prioritize instrumentation improvements for true journey tracking.
Challenge 2: Service Boundary and Ownership Confusion
The problem: Overlapping service definitions, unclear team ownership, and shared responsibilities make it difficult to attribute errors correctly and avoid duplicate monitoring.
Example scenario:
- Services A and B both handle parts of a shared responsibility
- Error logs appear in both services for the same customer failure
- Monitors fire in both places, creating duplicate alerts
- No clear owner for resolving the underlying issue
Without clear boundaries, you create redundant monitors, count errors multiple times in SLO calculations, and generate alert fatigue from duplicate notifications.
What was done:
- Tagged services with best-known ownership information
- Documented services with ambiguous boundaries requiring architecture team clarification
- Created monitors scoped to avoid double-counting where possible
- Noted overlaps in this assessment for resolution
Identifiable markers of this issue:
- Multiple services logging the same error event with different service tags
- Service catalog showing missing or multiple owners for the same functionality
- Runbook links absent or pointing to the wrong team documentation
- Duplicate alerts firing for a single customer-impacting incident
Recommendation: Service ownership audit using Datadog Software Catalog with scorecard requirements covering: documented owner, runbook links, on-call contacts, and clear responsibility boundaries.
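The scorecard idea can be sketched as a simple completeness check. The catalog structure, field names, and service entries below are hypothetical stand-ins, not the Datadog Software Catalog API.

```python
# Hypothetical ownership-audit sketch: every catalog entry must name an
# owner, a runbook, and an on-call contact (the scorecard fields above).
REQUIRED_FIELDS = ("owner", "runbook", "on_call")

def scorecard_gaps(catalog: dict) -> dict:
    """Return {service: [missing fields]} for every incomplete entry."""
    return {
        name: [f for f in REQUIRED_FIELDS if not entry.get(f)]
        for name, entry in catalog.items()
        if any(not entry.get(f) for f in REQUIRED_FIELDS)
    }

# Invented example data.
catalog = {
    "workflow-service":  {"owner": "team-a", "runbook": "wiki/workflow-runbook",
                          "on_call": "team-a-oncall"},
    "optimizer-service": {"owner": "", "runbook": None,
                          "on_call": "team-b-oncall"},
}
print(scorecard_gaps(catalog))  # {'optimizer-service': ['owner', 'runbook']}
```

In practice the same check would consume exported catalog data and fail CI when any CUJ-relevant service has gaps.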
Challenge 3: Log Quality Issues and Error Attribution
The problem: Logs may surface issues that are not real (false positives), miss actual problems (false negatives), or incorrectly attribute errors to services that are merely propagating failures from upstream dependencies.
Monitors are only as reliable as the data they consume. If logs misattribute errors, SLOs blame the wrong services and alerts fire on non-issues while missing real customer impact.
False Positives: Alerts Without Customer Impact
Common sources identified:
Expected behavior logged as errors:
- Validation failures (user entered invalid data - working as designed)
- Rate limiting triggers (protecting the system - intended behavior)
- Authentication rejections (wrong password - not a system failure)
Transient events overcounted:
- Network timeouts that succeed on automatic retry
- Load balancer health check failures during valid autoscaling events
- Database connection pool exhaustion during legitimate traffic spikes
Cascading alarm scenarios:
- One upstream service fails, generating errors in multiple downstream services
- All dependent services alert for the same root cause
- On-call teams respond to noise rather than the actual problem
Workarounds applied:
- Exclude 4xx client errors from availability SLIs where appropriate
- Add evaluation windows requiring sustained error rates before alerting
- Implement recovery thresholds to suppress alerts that self-resolve
- Document expected error patterns to filter from critical monitors
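The "sustained error rate" workaround can be sketched in a few lines. The window length and the 5% threshold are illustrative assumptions, not the monitor's actual configuration.

```python
from collections import deque

def sustained_breach(error_rates, threshold=0.05, window=5):
    """Alert only if the error rate exceeds the threshold for every
    consecutive sample in the window; a single spike never fires.
    Threshold and window size are illustrative, not production values."""
    recent = deque(maxlen=window)
    for rate in error_rates:
        recent.append(rate)
        if len(recent) == window and all(r > threshold for r in recent):
            return True
    return False

# A transient spike (e.g. a retry storm that self-resolves) stays quiet:
print(sustained_breach([0.01, 0.20, 0.01, 0.01, 0.01, 0.01]))  # False
# A genuinely elevated error rate fires:
print(sustained_breach([0.06, 0.07, 0.09, 0.08, 0.06]))        # True
```

This is the logic a Datadog evaluation window approximates; encoding it explicitly makes the trade-off visible (longer windows suppress noise but add detection latency).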
False Negatives: Missing Real Customer Impact
Common sources identified:
Silent failures:
- Services failing without generating error logs (poorly instrumented error handling)
- Timeout scenarios not properly caught and logged
- Circuit breakers tripping without notification
Severity misclassification:
- Critical errors logged at WARNING level, bypassing monitor thresholds
- Customer-impacting issues logged as INFO rather than ERROR
- Status codes that do not map cleanly to user journey outcomes
Aggregate metric masking:
- Critical journey failures hidden within overall service success rates
- Rare but high-impact errors below percentage thresholds
- Geographic or customer-segment issues averaged away in global metrics
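Aggregate masking is worth seeing numerically. The request counts below are invented, but the arithmetic is exactly how a regional outage disappears inside a global success rate.

```python
# Hypothetical traffic: (requests, successes) per region.
requests = {
    "us":   (100_000, 99_900),
    "eu":   (100_000, 99_950),
    "apac": (2_000, 1_000),   # small region, severe outage
}

total = sum(n for n, _ in requests.values())
ok = sum(s for _, s in requests.values())

print(round(ok / total, 4))                       # 0.9943 - global SLI looks fine
print(requests["apac"][1] / requests["apac"][0])  # 0.5    - APAC customers see 50% failures
```

The remedy is per-segment SLIs (or at minimum per-segment breakdowns on the SLO dashboard) for any dimension where traffic is badly skewed.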
Workarounds applied:
- Log volume monitoring to detect missing log scenarios
- Documented instrumentation gaps where false negatives likely exist
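The log-volume workaround can be sketched as a floor check against a baseline: a service that stops logging entirely looks "healthy" to error-rate monitors, but its volume collapse is detectable. The baseline and tolerance values are illustrative assumptions.

```python
def volume_drop(counts_per_window, baseline, tolerance=0.5):
    """Flag each window whose log count falls below a fraction of baseline.
    baseline and tolerance are illustrative; a real monitor would derive
    the baseline from historical volume per service."""
    return [c < baseline * tolerance for c in counts_per_window]

# Invented per-window log counts: normal, normal, degraded, fully silent.
print(volume_drop([980, 1020, 40, 0], baseline=1000))  # [False, False, True, True]
```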
Error Misattribution: Blaming the Wrong Service
How this happens:
Missing correlation IDs:
- Logs lack `trace_id` and `span_id` to link requests across services
- Cannot determine which upstream service sent bad data
- Each service in the chain logs an error, appearing as multiple independent problems
Cascading failure confusion:
- Service A fails → Service B cannot complete requests → Service C times out
- All three services log errors; only A has the root cause
- Monitors fire across all services, obscuring the actual problem
Workarounds applied:
- Enable trace-log correlation where possible using `DD_LOGS_INJECTION=true`
- Tag monitors and SLOs with service boundaries to limit scope
- Document services lacking proper correlation for future instrumentation
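For services where the tracer cannot be enabled yet, the same effect can be approximated by hand. This is a generic Python `logging` sketch of what `DD_LOGS_INJECTION=true` does automatically in ddtrace; the logger name and format are illustrative.

```python
import logging
import uuid

class TraceContextFilter(logging.Filter):
    """Attach a trace_id to every log record so logs from different
    services handling the same request can be joined. A manual stand-in
    for what DD_LOGS_INJECTION=true provides via the tracer."""

    def __init__(self, trace_id: str):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        return True

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
log = logging.getLogger("workflow-service")  # illustrative service name
log.addHandler(handler)
log.addFilter(TraceContextFilter(uuid.uuid4().hex))
log.error("workflow run failed")  # emitted with a trace_id field
```

The key design point: the ID must be propagated from the inbound request (e.g. a header), not generated per service, or correlation breaks at every hop.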
Identifiable markers of this issue:
- Error logs missing `trace_id`, `span_id`, or parent service context
- Multiple services simultaneously reporting errors for the same customer request
- Teams claiming “it’s not our issue” without clear evidence
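Once trace IDs exist, misattribution becomes a grouping problem: errors sharing a trace collapse into one incident, and the earliest error in the trace points at the likely origin. The services and timestamps below are hypothetical.

```python
# Sketch: root-cause attribution via trace grouping. With correlation IDs,
# the A -> B -> C cascade above is one incident, not three.
errors = [
    {"service": "C", "trace_id": "t1", "ts": 3},  # timed out last
    {"service": "B", "trace_id": "t1", "ts": 2},  # failed because of A
    {"service": "A", "trace_id": "t1", "ts": 1},  # actual root cause
]

by_trace: dict[str, list[dict]] = {}
for e in errors:
    by_trace.setdefault(e["trace_id"], []).append(e)

# Earliest error per trace is the best first guess at the origin.
origins = {t: min(es, key=lambda e: e["ts"])["service"]
           for t, es in by_trace.items()}
print(origins)  # {'t1': 'A'} - one incident, rooted in service A
```

"Earliest error wins" is a heuristic, not a proof; span parent/child relationships give a stronger answer when full traces are available.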
Challenge 4: Infrastructure as Code Disorder
Terraform repository state: Inconsistent module structure, unclear resource organization, and redundant code that should be unified across environments. Exceptions were treated as the rule rather than the exception.
Impact:
- Manual verification required to confirm deployed configuration matches documented intent
- Risk of unintended changes when modifying existing resources
- Difficult to onboard new team members to CUJ SLO maintenance
- Inconsistent tagging across resources affecting dashboard filtering
Workarounds applied:
- External mapping document: “Terraform Resource to CUJ Journey mapping”
- Peer review requirement for all Terraform changes
- Naming conventions applied to new resources only (legacy resources remain inconsistent)
- Manual audit before each deployment to catch drift
Recommendation: Terraform refactoring in phases:
- Audit - inventory all CUJ-related resources, identify drift and orphans
- Consolidate - migrate to a module structure (e.g., `modules/datadog_cuj_slo/`)
- Tag remediation - retroactively apply consistent tagging to existing resources
- Validation - implement pre-commit hooks and CI checks for naming/tagging compliance
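The validation phase can start small. Below is a hypothetical sketch of a tag-compliance check over parsed resources; the required tag keys come from the tagging standard described earlier (`team`, `service`, `env`), while the input format is an assumption.

```python
# Required tag keys per the tagging standard in this assessment.
REQUIRED_TAGS = {"team", "service", "env"}

def missing_tags(resource_tags: list[str]) -> list[str]:
    """Given Datadog-style "key:value" tags, return required keys not present.
    A pre-commit hook would run this over every resource in the plan."""
    present = {t.split(":", 1)[0] for t in resource_tags}
    return sorted(REQUIRED_TAGS - present)

print(missing_tags(["team:platform", "service:workflow-service", "env:prod"]))  # []
print(missing_tags(["service:optimizer-service"]))  # ['env', 'team']
```

A real hook would extract tags from `terraform show -json` output; the check itself stays this simple.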
Implementation Checklist
This checklist documents the assessment state at the time of writing. Use it to validate completeness and identify remaining gaps.
Journey Definition Quality
| Requirement | Notes |
|---|---|
| Journeys represent actual end-user value | Validated for implemented journeys; document which are service health proxies |
| Clear start and end points tied to business outcomes | Documented per journey |
| Business stakeholder validation completed | Schedule validation with Product team |
| Maximum 3–5 critical journeys identified | 3 journeys prioritized at assessment date |
| Journey steps documented with service dependencies | Some gaps in dependency mapping remain |
Service and Ownership Clarity
| Requirement | Notes |
|---|---|
| Clear team ownership defined for each service | List services with unclear ownership |
| No overlapping or duplicate service definitions | Document overlaps identified |
| Services tagged with team, service, environment | Applied per Datadog standards |
| Service catalog populated with runbooks, contacts | Completion varies by service |
| On-call rotation defined for each service | Gap services should be listed explicitly |
Observability Foundation
| Requirement | Notes |
|---|---|
| Distributed tracing configured across journey services | List services lacking APM |
| Logs include correlation IDs (trace_id, span_id) | List services with missing correlation |
| Structured logging with consistent field names | Some legacy services use unstructured logs |
| Service emits availability, latency, error metrics | Most services instrumented |
Log Quality and Error Attribution
| Requirement | Notes |
|---|---|
| Services distinguish client vs server errors | Many services log all errors as 5xx-equivalent |
| Error logs include originating service context | Requires instrumentation work |
| Expected errors excluded from SLI calculations | Filtered: validation errors, rate limits |
| Error severity accurately reflects customer impact | Some misclassification identified |
SLO Configuration Standards
| Requirement | Notes |
|---|---|
| Appropriate SLO type selected | Rationale documented in SLO descriptions |
| Meaningful names following naming convention | Convention: cuj_slo_[service]_[journey]_[env] |
| Descriptions include what, why, how, runbook links | Varies across resources |
| Targets based on historical data and business needs | Some targets aspirational, not evidence-based |
| Targets provide buffer above customer SLA | All SLOs stricter than external commitments |
Known Limitations and Workarounds
Correlation Gaps
Impact: Cannot reliably attribute errors across services without trace correlation. May see duplicate error counting or blame propagated from the wrong source.
Workaround: Separate monitors per service with limited cross-service visibility; conservative SLO scope using service boundary tags; dependency relationships documented manually in the service catalog.
Recommendation: Enable `DD_LOGS_INJECTION=true` in the tracer configuration for affected services. Priority: High - this affects the accuracy of all SLOs involving those services.
Service Boundary Ambiguities
Impact: Potential double-counting of errors, unclear team ownership on incidents, duplicate alerts for single root cause.
Workaround: Monitors scoped to specific endpoints within services to avoid overlap; ambiguous services tagged with multiple team identifiers; escalation documentation created for unclear ownership scenarios.
Recommendation: Architecture review to clarify service boundaries. In the interim, establish an ownership decision matrix per service area.
Instrumentation Gaps
Impact: Cannot measure the full customer journey. SLO coverage limited to instrumented portions only.
Workaround: SLO definitions explicitly scoped to exclude uninstrumented sections; synthetic tests added as compensating controls where possible; gap noted in SLO descriptions to set accurate expectations.
Recommendation: Instrumentation backfill prioritized for gap services based on journey criticality.
Recommendations and Next Steps
Immediate (1–2 weeks)
- Validate journey definitions with stakeholders: meet with product and business owners to confirm each “journey” represents actual customer value, not just service health.
- Enable trace-log correlation for affected services: deploy `DD_LOGS_INJECTION=true` to services lacking correlation. This enables accurate error attribution and reduces misattribution across SLOs.
- Document error budget policies: define what happens when an SLO is breached - deployment freeze, post-mortem, or escalation. Without this, SLOs are informational rather than operational.
Short-term (1–3 months)
- Service ownership audit using Datadog Software Catalog: use scorecard requirements to identify and resolve ownership gaps.
- Implement error taxonomy tagging: add `error.type` tags (`client_error`, `server_error`, `dependency_error`) to differentiate error sources and enable proper attribution.
- Expand synthetic test coverage: create Datadog Synthetic tests for identified false-negative risks, particularly for services with silent failure modes.
Long-term (3–6 months)
- Instrumentation backfill for gap services: prioritize adding logging and metrics to uninstrumented services. This enables true end-to-end journey measurement rather than partial proxies.
- Log quality improvement program: establish logging standards covering structured format, severity guidelines, and required context fields. Enforce via code review and linting.
- Unified SLO dashboards with context: build dashboards combining SLO status with relevant service and infrastructure metrics to improve troubleshooting during SLO breaches.
Measuring Success
These metrics indicate whether improvements are working:
SLO fidelity metrics:
- Correlation between SLO breaches and customer-reported issues (target: >90% alignment)
- False positive rate: alerts without customer impact (target: <5% of all alerts)
- False negative rate: customer issues not caught by SLOs (target: <10% of incidents)
Operational health metrics:
- Mean time to detect (MTTD): time from issue start to SLO breach detection
- Mean time to resolve (MTTR): time from SLO breach to resolution
- Alert fatigue: number of alerts that require no action (lower is better)
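MTTD and MTTR fall out directly from incident timestamps. The incident data below is invented; the computation is the standard one.

```python
from statistics import mean

# Hypothetical incidents; all times in minutes from issue start.
incidents = [
    {"start": 0, "detected": 4,  "resolved": 34},
    {"start": 0, "detected": 12, "resolved": 72},
]

mttd = mean(i["detected"] - i["start"] for i in incidents)     # issue start -> breach detected
mttr = mean(i["resolved"] - i["detected"] for i in incidents)  # breach detected -> resolved

print(mttd, mttr)  # 8 45
```

Tracking both separately matters: trace-log correlation work mostly moves MTTD, while ownership clarity mostly moves MTTR, so the two metrics attribute improvement to the right investment.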
Instrumentation coverage:
- Percentage of critical journeys with complete end-to-end tracing
- Percentage of services with correlation IDs in logs
- Percentage of SLOs with validated business stakeholder sign-off
Key Takeaways
- A functional SLO with documented limitations is more useful than a stalled SLO waiting for perfect conditions
- Log quality problems are a systemic issue, not an SRE configuration problem - they require engineering investment
- Service boundary ambiguity causes both operational confusion and measurement inaccuracy; clarifying it is high-leverage work
- Infrastructure as Code disorder compounds over time; naming conventions and module structure established early are much cheaper than remediation later
- Error budget policies need to be agreed before an SLO is deployed, not negotiated during an incident
Related posts:
- CUJ SLOs vs Service SLOs - the conceptual foundation for the boundary decisions described in this assessment
- CUJ SLO Implementation: Workflows and Job Execution - the specific implementation work this assessment reflects on