Quick context for this post:

  • CUJ (Critical User Journey) - an end-to-end sequence of user actions that delivers business value
  • SLI (Service Level Indicator) - a metric measuring service behavior (e.g., request success rate)
  • SLO (Service Level Objective) - a target applied to an SLI
  • APM (Application Performance Monitoring) - service-level tracing and metrics, used here via Datadog

This post covers two CUJ SLO implementations: one for workflow lifecycle management (create, edit, delete), and one for job execution (triggering and monitoring a run). Both surface the same core challenge - identifying the minimal set of endpoints that accurately represents the user journey without over-measuring - but they differ substantially in how complex the dependency mapping turns out to be.

Workflow CUJ: Identifying the Right Endpoints

The CUJ Scope

The workflow user journey covers three actions: Create → Edit → Delete. These are the essential lifecycle operations that define meaningful use of the product. Loading a workflow canvas or browsing a list is not part of this CUJ - those are supporting interactions, not the outcome the user came for.

Endpoint Identification Method

The most direct method for identifying which endpoints a user journey actually calls is to observe the network traffic during a real session:

  1. Log into the product with the target journey in scope
  2. Open browser developer tools → Network tab
  3. Perform the journey actions (create workflow, edit it, delete it)
  4. Export the HAR file

The HAR file contains every network call made during the session, including exact endpoint paths, methods, and response codes. Filter to calls on the service under observation, then remove internal-only calls (service-to-service, health checks, telemetry) that users never directly trigger.
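
As an illustration, this filtering step can be scripted against a standard HAR 1.2 export. A minimal sketch, assuming a hypothetical target host and internal-path prefixes (the names below are illustrative, not the real values):

  # Sketch: extract candidate CUJ endpoints from a HAR export.
  import json
  from urllib.parse import urlparse

  TARGET_HOST = "workflow-service.example.com"                # hypothetical
  INTERNAL_PREFIXES = ("/healthz", "/metrics", "/telemetry")  # hypothetical

  def cuj_endpoints(har_path: str) -> set[tuple[str, str]]:
      with open(har_path) as f:
          entries = json.load(f)["log"]["entries"]
      endpoints = set()
      for entry in entries:
          url = urlparse(entry["request"]["url"])
          if url.netloc != TARGET_HOST:
              continue  # not the service under observation
          if url.path.startswith(INTERNAL_PREFIXES):
              continue  # internal-only calls users never trigger directly
          endpoints.add((entry["request"]["method"], url.path))
      return endpoints

  # e.g. {("POST", "/api/v1/workflows"), ("PUT", "/api/v1/workflows/123")}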

From this process, the minimal set of endpoints for the workflow CUJ was identified as:

Action   Service              Endpoint
Create   [workflow-service]   POST /api/v1/workflows
Edit     [workflow-service]   PUT /api/v1/workflows/{id}
Delete   [workflow-service]   DELETE /api/v0/workflows/{id}

Why Minimal Endpoints Matter

The principle, confirmed through discussions with the [workflow-service] Lead Software Engineer and the Staff SRE, is this: track the minimum viable set of endpoints that represents the journey, not every service call in the chain.

Over-measuring produces two problems:

  • Signal noise: failures in endpoints that are not user-facing burn error budget on events that never impact users
  • Ownership ambiguity: the more services included, the less clear it is who is responsible when the SLO degrades

SLO Calculation

Each endpoint monitor carries an individual error budget of 0.1%. The overall SLO target is the product of the individual endpoint targets. With three endpoints each at 99.9%, the composite SLO target is 99.9% × 99.9% × 99.9% ≈ 99.7%. This means the combined SLO is intentionally slightly looser than any individual endpoint, which is the correct behaviour - a journey succeeds only when all endpoints succeed.
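
The arithmetic is easy to sanity-check. A sketch of the composite calculation for a serial journey, where the journey succeeds only if every endpoint call succeeds:

  # Sketch: composite SLO target for a serial user journey.
  import math

  endpoint_targets = [0.999, 0.999, 0.999]  # create, edit, delete

  composite = math.prod(endpoint_targets)
  print(f"composite SLO target: {composite:.4%}")         # 99.7003%
  print(f"composite error budget: {1 - composite:.4%}")   # 0.2997%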

The Status Code Question

One open question during implementation was whether to monitor for 2xx responses only, or all non-5xx responses (i.e., allowing 4xx as “success”). The Staff Observability Engineer’s position was that this depends on the service contract.

For these endpoints, 4xx responses indicate client errors (invalid input, unauthorized requests) rather than service failures. Including them as failures in the SLI would penalize the SLO for user errors, not infrastructure problems. The approach taken was to count only 5xx responses as failures, with the understanding that the SLI trusts the service architecture to distinguish client from server errors correctly.
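
The classification rule itself is one line; a minimal sketch, assuming the monitor sees raw HTTP status codes:

  # Sketch: only 5xx burns error budget; 4xx is a client error,
  # not a service failure.
  def counts_as_sli_failure(status_code: int) -> bool:
      return 500 <= status_code <= 599

  assert not counts_as_sli_failure(404)  # user error: no budget burned
  assert counts_as_sli_failure(503)      # service failure: budget burned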

The scope of CUJ SLOs is not to validate the correctness of API design - it is to measure whether the service is available and responsive for legitimate user requests.


Job Execution CUJ: Mapping Service Dependencies

Why Job Execution Is More Complex

A job execution CUJ - triggering a workflow run and confirming it completed - spans more services than the workflow lifecycle. The initial instinct was to include the obvious candidates: the scheduling service, the job execution service, notification delivery. This turned out to be the wrong starting point.

The correct approach is to start from the user-facing surface and trace inward.

Initial (Incorrect) Approach

The first pass at service identification produced a list based on what seemed relevant to “job execution” conceptually:

  • [workflow-service] (handled separately)
  • [scheduling-service] (for schedule-triggered runs)
  • [notification-service] (job event notifications)
  • Relevant event queue feedback

This list was built from knowledge of the system architecture, not from user behavior. The problem is that it does not reflect what the user actually touches.

Corrected Approach: Start from the User-Facing Surface

The [frontend-monolith] and [job-metadata-service] handle customer-facing job requests and route them to downstream services. This means that checking the endpoints on [frontend-monolith] is sufficient to capture the user-facing CUJ - failures in downstream services that affect the user journey will surface as failures on the [frontend-monolith] endpoints.

This follows the black-box principle: measure at the entry point where users interact. Hard dependency failures downstream automatically manifest at the entry point.

From APM trace analysis, the job execution journey can be triggered by four distinct scenarios:

  1. Running a workflow from the Library section
  2. Running a workflow from the Workflow Canvas section
  3. Running a scheduled workflow
  4. Running a workflow from Plans

Each of these triggers a different entry path, which affects which endpoints are in scope for the CUJ monitor.

Validating with APM Traces

Rather than relying on documentation or service team knowledge alone, APM trace analysis was used to observe which endpoints were actually called during job execution. Filtering traces by the known [frontend-monolith] job group endpoints across different trigger scenarios revealed which paths were common to all scenarios and which were specific to individual trigger types.
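
The comparison itself reduces to set intersection once per-scenario endpoint lists have been exported from the trace explorer. A sketch, with illustrative scenario names and endpoint values:

  # Sketch: split endpoints into the core CUJ (common to every trigger
  # scenario) and scenario-specific extras. Values are illustrative.
  scenario_endpoints = {
      "library_run": {"POST /v4/jobGroups",
                      "POST /api/v1/workflows/:id/run"},
      "canvas_run":  {"POST /v4/jobGroups",
                      "POST /api/v1/workflows/:id/run",
                      "POST /v1/runWorkloads"},
  }

  core = set.intersection(*scenario_endpoints.values())
  specific = {name: eps - core for name, eps in scenario_endpoints.items()}

  print("core CUJ endpoints:", core)
  print("scenario-specific:", specific)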

From this analysis, the core endpoints identified for the job execution CUJ were:

Service               Endpoint
[workflow-service]    POST /api/v1/workflows/:id/run
[frontend-monolith]   POST /v4/jobGroups
[optimizer-service]   POST /v1/runWorkloads

Further analysis on the remaining trigger scenarios (scheduled runs, Plans) was ongoing at the time of this assessment.

Service Dependencies: What to Include

Applying the dependency classification from CUJ SLOs vs Service SLOs:

  • [workflow-service] - hard dependency: failure prevents job from being submitted
  • [frontend-monolith] - hard dependency: this is the user-facing entry point
  • [optimizer-service] - hard dependency for certain job types: failure prevents execution
  • [scheduling-service] - soft dependency: failure affects scheduled triggers only, not on-demand runs
  • [notification-service] - soft dependency: failure does not prevent job execution or completion visibility

Services like [job-execution-service], [orchestration-engine], [cloud-execution-service], [ml-automation-service], [analytics-insights-service], and [data-platform-service] were identified as potential downstream dependencies during the initial exploration, but their failures surface through [frontend-monolith] or [workflow-service] responses. Monitoring them separately in the CUJ SLO would double-count failures.
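
One way to make the hard/soft distinction operational is to let only hard dependencies enter the composite CUJ target, since soft-dependency failures do not fail the journey. A sketch, with illustrative targets:

  # Sketch: only hard dependencies multiply into the journey target;
  # soft dependencies get separate monitors. Targets are illustrative.
  import math

  dependencies = [
      ("workflow-service",     "hard", 0.999),
      ("frontend-monolith",    "hard", 0.999),
      ("optimizer-service",    "hard", 0.999),
      ("scheduling-service",   "soft", 0.995),
      ("notification-service", "soft", 0.995),
  ]

  composite = math.prod(t for _, kind, t in dependencies if kind == "hard")
  print(f"job execution CUJ composite target: {composite:.4%}")  # 99.7003%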


Cross-Cutting Principles from Both Implementations

Start from user behavior, not architecture knowledge. The HAR file method and APM trace analysis both anchor endpoint selection in what users actually trigger, not what engineers think is important. This produces a shorter, more defensible list.

Validate the endpoint list with service owners. For the workflow CUJ, the Lead Software Engineer confirmed which endpoints were minimal and correct. For job execution, the analysis was validated by checking APM traces across real production traffic. Service owner sign-off reduces ambiguity during incidents.

Separate trigger scenarios from core CUJ scope. The job execution CUJ has four trigger scenarios, but the core journey (a job is submitted and runs) is common to all of them. Define the core CUJ first, then decide whether to add trigger-specific endpoints or handle them in separate monitors.

Determine status code policy per endpoint. Not all endpoints have the same error semantics. Establish whether 4xx counts as failure before configuring monitors - this decision is hard to change later without affecting SLO history.


Key Takeaways

  • The minimal endpoint set is a deliberate design decision, not a shortcut - over-measuring creates noise and ambiguity
  • HAR file export and APM trace analysis are more reliable than architectural assumptions for identifying CUJ endpoints
  • Start from the user-facing service and trace inward; hard dependency failures surface automatically at the entry point
  • Validate the endpoint list with service owners before configuring monitors
  • Establish status code policy (what counts as failure) before SLO configuration - it affects all future measurements

Related posts: