# Database Traffic Replay

| Status | Authors | Coach | DRIs | Owning Stage | Created |
|---|---|---|---|---|---|
| proposed | mattkasa, stomlinson, zbraddock | tkuah | alexives, rmar1 | devops data access | 2025-05-07 |
## Summary
Develop comprehensive tooling to capture, store, and replay SQL query traffic from GitLab.com. This solution will implement a lightweight query forwarding mechanism within GitLab that sends SQL queries to an external service with minimal performance impact on Rails and Sidekiq processes. Combined with purpose-built replay utilities, this system will enable performance testing, capacity planning, and database architecture evaluation, allowing us to simulate production loads at variable speeds, identify saturation and contention points, assess potential database configuration changes, and validate sharding strategies without adversely affecting production systems.
## Motivation

This tool will allow us to measure our database capacity directly. It will effectively settle questions about the capacity of our current setup, as well as the effectiveness of mitigations such as changes to our production database infrastructure.
## Goals
- Negligible performance impact on production Rails/Sidekiq processes.
- Increased confidence in database scaling decisions and configuration changes.
- Enhanced ability to identify and mitigate performance bottlenecks proactively.
- Improved database architecture testing capabilities without production risk.
## Non-Goals

- This is not a backup tool.
- This is not a data analysis tool: it does not run continuously and is not load-bearing for any other use.
- We do not expect to converge on exactly the same database state that occurred in production. Specifically, we are concerned with database performance under load, not correctness or data consistency.
- We expect captures to be only partially consistent; some queries will fail to execute during replay.
## Proposal

We will capture all query traffic from the GitLab application and replay it against the benchmarking environment. Captures will cover a period of time, generally not exceeding an hour, and will be used to verify configuration changes to database hosts.

We will also support compressing the captured traffic into a shorter period of time to simulate the higher production load that the application might see in the future.
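For example, an hour of captured traffic replayed at 2x speed completes in 30 minutes: every inter-query gap is halved, so the benchmarking database sees roughly twice the query rate that production saw during the capture.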
## Design and implementation details

We will capture query data from Rails nodes, publish it to a Pub/Sub topic, then aggregate the query data and persist it to a bucket.
```mermaid
flowchart TD
    subgraph DataPlane1["Data Plane"]
        subgraph GitlabProd["gitlab-production"]
            GitlabRails1["Gitlab Rails"]
            GitlabRails2["Gitlab Rails"]
            GitlabRails3["Gitlab Rails"]
        end
        subgraph PubSub["PubSub"]
            Topic["Topic"]
            PS["Pull Subscription"]
            note1["Each message in the pubsub topic and subscriber is a query + metadata."]
        end
        subgraph Dataflow["DataFlow"]
            Subscriber["Subscriber"]
            note2["Inputs messages, outputs formatted querydata into the bucket."]
        end
        subgraph Bucket["Bucket"]
            CurrentFolder["Folder for this run"]
            PreviousFolder["Previous Run"]
            note3["Data is stored with same permission and retention restrictions as WAL data"]
        end
    end

    %% Define connections
    GitlabRails1 -->|"Queries"| Topic
    GitlabRails2 -->|"Queries"| Topic
    GitlabRails3 -->|"Queries"| Topic
    Topic --> PS
    PS --> Subscriber
    Subscriber --> CurrentFolder

    %% Styling
    classDef sectionHeader fill:#f9f,stroke:#333,stroke-width:2px
    classDef pauseStep fill:#ffe6cc,stroke:#d79b00,stroke-width:1px
    classDef pipelineStep fill:#d5e8d4,stroke:#82b366,stroke-width:1px
    class QueryCapture,QueryReplay sectionHeader
    class PauseA,PauseB pauseStep
    class PipelineStageA,PipelineStageB,PipelineStageC,MachineCreation,ReplayPerformed,MachinesDeleted pipelineStep
```
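As a rough sketch of the capture side, the forwarding hook on a Rails node could be an `ActiveSupport::Notifications` subscriber registered in an initializer that publishes each query asynchronously, so the request path never blocks on Pub/Sub. The topic name, attribute names, and filtering below are assumptions for illustration, not decisions:

```ruby
# Hypothetical capture hook -- a sketch only, not the agreed implementation.
require "google/cloud/pubsub"

pubsub = Google::Cloud::PubSub.new(project_id: "gitlab-production") # illustrative project
topic  = pubsub.topic("traffic-capture")                            # illustrative topic name

ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, started, finished, _id, payload|
  next if payload[:name] == "SCHEMA" # skip schema reflection queries

  # publish_async batches messages on a background thread, keeping the
  # overhead on the Rails/Sidekiq process minimal.
  topic.publish_async payload[:sql],
                      started_at:  started.getutc.iso8601(6),
                      duration_ms: ((finished - started) * 1000).round(3).to_s,
                      connection:  payload[:connection]&.object_id.to_s,
                      process:     Process.pid.to_s
end

# Flush any queued messages when the process shuts down.
at_exit { topic.async_publisher&.stop&.wait! }
```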
To replay, we will load the query data from the bucket and run it against a benchmarking database restored to its production state at the time of the capture, using the same number of connections as were originally used in production.
```mermaid
flowchart TD
    subgraph DataPlane2["Data Plane"]
        subgraph Benchmark["gitlab-db-benchmarking"]
            MachineCreation["Machines Created GPRD Snapshot used for machine creation is calculated on timestamp of folder of query data being replayed. WAL is replayed up until start LSN of the query data capture, then stopped."]
        end
        subgraph PipelineStageA["Trigger Pipeline Stage A"]
            StageAVars["input variables include: - foldername of querydata we want to replay - how many replay nodes we want to create - primary database instance only or also replicas - if we want database replicas, how many? - if we want to override default machine sizes, machine types etc"]
        end
        PauseA["Pause until next step manually triggered A good time for engineers to make any other manual changes to environment"]
        subgraph PipelineStageB["Trigger Pipeline Stage B"]
            StageBVars["input variables include: - replay speed (1x, 1.2x, 2x, etc) - write only or read & write Engineers can choose not to run Stage B if they just wanted an environment created and do not need a querydata replay"]
        end
        subgraph ReplayPerformed["Replay Performed"]
            ReplayProcess["Query Data will be gathered from GPRD bucket then replayed"]
        end
        PauseB["Pause until next step manually triggered A good time for engineers to make any other checks on the environment"]
        subgraph PipelineStageC["Trigger Pipeline Stage C"]
            StageCNote["- Pipeline C needs to be able to be run even if Pipelines A and B either a) were not run or b) failed"]
        end
        subgraph MachinesDeleted["Machines Deleted"]
            DeleteNote["Query Data will NOT be deleted from GPRD bucket"]
        end
    end

    PipelineStageA -->|"Creates"| MachineCreation
    MachineCreation --> PauseA
    PauseA --> PipelineStageB
    PipelineStageB -->|"Executes"| ReplayPerformed
    ReplayPerformed --> PauseB
    PauseB --> PipelineStageC
    PipelineStageC -->|"Cleans up"| MachinesDeleted

    %% Styling
    classDef sectionHeader fill:#f9f,stroke:#333,stroke-width:2px
    classDef pauseStep fill:#ffe6cc,stroke:#d79b00,stroke-width:1px
    classDef pipelineStep fill:#d5e8d4,stroke:#82b366,stroke-width:1px
    class QueryCapture,QueryReplay sectionHeader
    class PauseA,PauseB pauseStep
    class PipelineStageA,PipelineStageB,PipelineStageC,MachineCreation,ReplayPerformed,MachinesDeleted pipelineStep
```
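For illustration, a replay worker could preserve production concurrency by opening one benchmarking connection per original production connection, and compress time by dividing each inter-query gap by the replay speed. This is a minimal sketch assuming a newline-delimited JSON capture format with `sql`, `started_at`, and `connection` fields and a `REPLAY_SPEED` variable; all of these names are illustrative:

```ruby
# Hypothetical replay worker -- a sketch only; field names and file layout are assumptions.
require "json"
require "pg"
require "time"

speed   = Float(ENV.fetch("REPLAY_SPEED", "1.0"))                      # e.g. 2.0 replays twice as fast
records = File.readlines("querydata.ndjson").map { |l| JSON.parse(l) } # assumed format, ordered by started_at

capture_start = Time.parse(records.first["started_at"])
replay_start  = Time.now

# One thread and one database connection per original production connection.
records.group_by { |r| r["connection"] }.map do |_conn_id, queries|
  Thread.new do
    conn = PG.connect(host: ENV["BENCHMARK_DB_HOST"], dbname: "gitlabhq_production")
    queries.each do |q|
      # Replay each query at its original offset from the start of the capture,
      # compressed by the replay speed factor.
      offset = (Time.parse(q["started_at"]) - capture_start) / speed
      delay  = (replay_start + offset) - Time.now
      sleep(delay) if delay.positive?
      begin
        conn.exec(q["sql"])
      rescue PG::Error
        # Some queries are expected to fail during replay (see Non-Goals); keep going.
      end
    end
  ensure
    conn&.close
  end
end.each(&:join)
```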
## Security and Retention

Note that the recorded traffic capture will contain RED data. We plan to treat this data the same way that we treat the WAL archives we already store for backup / restore validation, which contain roughly the same data in a different format (the results of queries rather than their text). Specifically, we will store the query text in a traffic-capture bucket in the production environment, similar to WAL data, and perform replay testing in the db-benchmarking environment, also similar to how we currently run other testing.

We will retain capture data for only 14 days, after which it will be deleted by a lifecycle rule on the traffic-capture bucket.

The traffic-capture bucket will inherit most of its permissions from gitlab-production GCP project rules or gitlab.com GCP organization rules. However, it will also be accessible by a traffic-capture service account, which will have the roles/storage.objectUser permission necessary to perform traffic capture, and a traffic-replay service account, which will have the roles/storage.objectUser permission necessary to perform traffic replay.
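The retention and access rules above could be sketched with the google-cloud-storage gem as follows; the bucket and service-account names are illustrative, and in practice these settings would be managed in our infrastructure-as-code rather than in Ruby:

```ruby
# Sketch of the 14-day lifecycle rule and bucket-level IAM -- illustrative names throughout.
require "google/cloud/storage"

storage = Google::Cloud::Storage.new(project_id: "gitlab-production")
bucket  = storage.bucket("traffic-capture")

# Delete captured query data 14 days after it is written.
bucket.lifecycle do |rules|
  rules.add_delete_rule age: 14
end

# Grant the capture and replay service accounts object-level access.
bucket.policy do |policy|
  policy.add "roles/storage.objectUser",
             "serviceAccount:traffic-capture@gitlab-production.iam.gserviceaccount.com"
  policy.add "roles/storage.objectUser",
             "serviceAccount:traffic-replay@gitlab-production.iam.gserviceaccount.com"
end
```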
We are using GCP's native Pub/Sub, so message encryption is handled by default, and service accounts are used for authentication. A service account representing GitLab's Rails nodes will receive the role roles/pubsub.publisher on the traffic-capture Pub/Sub topic to allow it to add messages to the queue.
The traffic-capture service account will receive the following roles:
- roles/pubsub.subscriber on the traffic-capture pubsub subscription and the traffic-capture pubsub topic
- roles/pubsub.viewer on the traffic-capture pubsub topic
- roles/dataflow.worker for the entire project. Unfortunately, this role can only be assigned at the project level. Fortunately, traffic capture is the only functionality in gitlab-production that uses Dataflow.
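Expressed with the google-cloud-pubsub gem, the topic- and subscription-level bindings above might look roughly like the sketch below; the resource and service-account names are illustrative, and the project-wide roles/dataflow.worker grant is not shown because it cannot be assigned on the topic:

```ruby
# Sketch of the Pub/Sub IAM bindings -- illustrative resource and account names.
require "google/cloud/pubsub"

pubsub       = Google::Cloud::PubSub.new(project_id: "gitlab-production")
topic        = pubsub.topic("traffic-capture")
subscription = pubsub.subscription("traffic-capture")

rails_sa   = "serviceAccount:gitlab-rails@gitlab-production.iam.gserviceaccount.com"
capture_sa = "serviceAccount:traffic-capture@gitlab-production.iam.gserviceaccount.com"

# Rails nodes may only publish queries onto the topic.
topic.policy do |policy|
  policy.add "roles/pubsub.publisher", rails_sa
  policy.add "roles/pubsub.subscriber", capture_sa
  policy.add "roles/pubsub.viewer", capture_sa
end

# The capture service account pulls messages from the subscription.
subscription.policy do |policy|
  policy.add "roles/pubsub.subscriber", capture_sa
end
```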
## Alternative Solutions

- We could capture query traffic at the connection pooler level, using something like pgcat or pgdog.
  - We aren't yet running such a pooler in production (we're currently running pgbouncer), and this tool will be useful to evaluate a change in connection pooler.
- We could use an existing tool, such as https://github.com/gocardless/pgreplay-go.
  - pgreplay-go and similar tools capture data from the PostgreSQL log file, but that won't work at our scale: the volume of query text would exceed the capacity of a disk very quickly.