Covered Experience SLIs (next step)

This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. The development, release, and timing of any products, features, or functionality may be subject to change or delay and remain at the sole discretion of GitLab Inc.
Status Authors Coach DRIs Owning Stage Created
proposed hmerscher reprazent andrewn team Observability 2025-03-24

Motivation

Continuing the work from Covered Experience SLIs, this addendum focus on the augmentation of the Covered Experience Framework by introducing a new service, the Covered Experience Tracker, that will add the capability of tracking Covered Experiences that never complete (due to errors mid journey), or do not complete within its expected lifetime.

This is being kept separate from the main blueprint for the reason that it is subject to change. The content presented here is aspirational only.

Scope

The Covered Experience Tracker is going to be responsible for the Covered Experience time out verification – especially relevant for tracking the asynchronous Covered Experience SLIs.

With the SDK consolidated, we can move and centralize the functionality of emitting events to this service, removing the complexity from the SDK.

The project will consist of:

  1. Covered Experience Tracker
  2. LabKit SDK

Example:

flowchart LR
    User((User))

    subgraph ServiceA
        subgraph ProcessA
            LabKit
        end
    end
    subgraph ServiceB
        subgraph ProcessB
            LabKitB
        end
    end

    subgraph tracker[Tracker]
        missing_end[Missing end event]
        timeout_check{Timeout check}
        timeout_action[Timeout action]

        missing_end --> timeout_check
        timeout_check --No timeout--> missing_end
        timeout_check --Timeout reached--> timeout_action
    end

    User --Request--> ServiceA
    LabKit --Checkpoint--> tracker
    ServiceA --> ServiceB
    LabKitB --Checkpoint--> tracker

The project work items are scoped in the epic #1540.

Covered Experience Tracker

  • Centralized Covered Experience state tracking
  • Sensible time to live (TTL) threshold for Covered Experience duration
  • State is stored and managed by Redis. Allowing the querying of stale Covered Experiences, timing out after configured threshold.
  • Authentication
  • Deployments:
    • Runway service for GitLab.com
    • Runway hosted service for Dedicated

It will serve an endpoint that will respond to the client generated payload:

Field Type Required Description Example
correlation_id string (ULID) Yes Unique identifier for the covered experience “01JP0EM7HB39WSJNR4662MYZ6V”
covered_experience_id string Yes Identification of the Covered Experience “http_request”
checkpoint string Yes Which step in the lifecycle “start” | “end” | “intermediate”
checkpoint_category string No A domain specific category for the checkpoint. TBD: impose limited cardinality. “authorize”
type string Yes Service/component generating the event “web”, “database”
feature_category string Yes GitLab feature category “source_code_management”
urgency string Yes How quickly a process needs to complete based on user expectations “sync_fast”
client_timestamp string (ISO-8601) Yes Timestamp when event occurred “2025-02-06T14:30:00Z”
server_timestamp string (ISO-8601) No (Response only) Server processing timestamp “2025-02-06T14:30:00.123Z”
meta object No i.e. https://docs.gitlab.com/development/logging/#logging-context-metadata-through-rails-or-grape-requests

A background process verifies all stale Covered Experiences and clears them out, emitting failure events.

Components interactions given a synchronous Covered Experience:

sequenceDiagram
    participant User
    participant App as Service A
    participant AppB as Service B
    participant Tracker as Covered Experience Tracker
    participant Redis
    participant Event as Logs and Metrics

    User->>App: Request
    activate App

    App->>Tracker: Checkpoint 1
    Tracker->>Redis: Store Initial State
    Tracker->>Event: Emit Start Event
    Tracker-->>App: Response

    App->>AppB: Forward Request
    activate AppB

    AppB->>Tracker: Checkpoint 2
    Tracker->>Redis: Update State
    Tracker->>Event: Emit Intermediate Event
    Tracker-->>AppB: Response

    AppB-->>App: Response
    deactivate AppB

    App->>Tracker: Checkpoint 3
    Tracker->>Redis: Mark Complete
    Tracker->>Event: Emit Success Event
    Tracker-->>App: Response

    App-->>User: Response
    deactivate App

    loop Expired Covered Experience
        Tracker->>Redis: Check for Timeout
        alt Timeout Reached
            Tracker->>Redis: Mark Failed
            Tracker->>Event: Emit Failure Event
        end
    end

Components interactions given an asynchronous Covered Experience:

sequenceDiagram
    participant User
    participant Web as Web Service
    participant Worker
    participant Tracker as Covered Experience Tracker
    participant Redis
    participant Event as Logs and Metrics

    User->>Web: Request
    activate Web

    Web->>Tracker: Checkpoint 1
    Tracker->>Redis: Store Initial State
    Tracker->>Event: Emit Start Event
    Tracker-->>Web: Response

    Web->>Worker: Enqueue Job
    Web-->>User: Response
    deactivate Web

    Note over Worker: Job wait in queue

    Note over Worker: Job starts
    activate Worker
    Worker->>Tracker: Checkpoint 2
    Tracker->>Redis: Update State

    alt Success Case
        Worker->>Tracker: End Covered Experience (Success)
        Tracker->>Redis: Mark Complete
        Tracker->>Event: Emit Success Event
    else Failure Case
        Worker->>Tracker: End Covered Experience (Failed)
        Tracker->>Redis: Mark Failed
        Tracker->>Event: Emit Failure Event
    end

    Tracker-->>Worker: Response
    deactivate Worker

    loop Expired Covered Experiences
        Tracker->>Redis: Check for Missing End Events
        alt Timeout Reached
            Tracker->>Redis: Mark Failed
            Tracker->>Event: Emit Failure Event
        end
    end

Authentication

Authentication between the SDK and the Covered Experience Tracker is required to prevent the sending of unexpected events that could distort the reliability metrics of GitLab features. E.g. the ai-gateway authentication and authorization.

SDK Requirements

  • Trigger events towards the Tracker, instead of directly pushing events
  • Automatic retries with exponential backoff for sending events to the Covered Experience Tracker
Last modified April 28, 2025: Cleanup and reorg shortcodes (eef3c341)