Self-Service Performance Regression Testing

Guide for teams to independently run performance regression testing

Overview

This guide enables teams to run their own performance regression testing without requiring Performance Enablement support.

Who this is for: Teams who understand what they’re testing and why. If you need help defining what to test or how to test a new component, escalate to Performance Enablement.

When to Use This Guide

  • Validating performance under load
  • Testing major upgrades for performance regressions
  • Validating performance before infrastructure changes

Prerequisites

Before starting, ensure you have:

Skills & Experience

  • GitLab Environment Toolkit (GET) experience: You’ve deployed a reference architecture before (or can follow the GET quick start guide)
  • GitLab Performance Tool (GPT) experience: You’ve loaded data and run tests with GPT (or can follow the GPT quick start guide)
  • Understanding of the change: You know what component/feature is being tested and why

Access & Accounts

  • Access to GitLab Sandbox
  • GitLab.com account with pull access to pre-release images
  • GCP/AWS accounts within your sandbox account

Not sure if you’re ready? Review the GET and GPT quick start guides above. If you get stuck, escalate to Performance Enablement.

Using This Guide in Your Work

This handbook page provides comprehensive implementation guidance for running performance regression tests and is paired with an issue template on the GitLab project to track progress. Create a work item using the Performance Regression Testing template to begin.

How to use these resources together

  1. Read this handbook page to understand the full process and implementation details
  2. Open a work item using the template to track your progress and results
  3. Reference specific sections of this handbook as you work through each phase
  4. Document your findings in the work item for team visibility

The handbook page provides the “why” and “how” details, while the work item template provides the “what” checklist for tracking progress.

Quick Decision Tree

flowchart TD
    %% Nodes
    Prep["Preparation<br/>(Define Test & Success, Choose Architecture)"]
    Setup["Setup<br/>(GCP, GET Config, Deploy, Seed Data)"]
    baseline{Need<br>Baseline?}
    Baseline_test["Baseline Test<br/>(Run Test, Capture Results)"]
    Upgrade_test["Upgrade & Test<br/>(Upgrade, Run Test, Capture Results)"]
    Analysis["Analysis<br/>(Compare, Document, Cleanup)"]
    Upgrade{Need<br>Upgrade Test?}

    %% Map
    Prep --> Setup
    Setup --> baseline
    baseline -->|No| Upgrade_test
    baseline -->|Yes| Baseline_test
    Baseline_test --> Upgrade
    Upgrade -- Yes -->Upgrade_test
    Upgrade -- No --> Analysis
    Upgrade_test --> Analysis

    %% Links
    click Prep "#test-preparation"
    click Setup "#setup-testing"
    click Baseline_test "#baseline-testing"
    click Upgrade_test "#upgrade-testing"
    click Analysis "#analysis-and-results"

Test Preparation

flowchart LR
    %% Nodes
    Testing["Define<br/>Testing"]
    Success["Identify<br/>Success Criteria"]
    Architecture["Choose<br/>Reference Architecture"]
    Test_image["Identify<br/>Test Image"]

    %% Map
    Testing --> Success
    Success --> Architecture
    Architecture --> Test_image


    %% Links
    click Testing "#define-testing"
    click Success "#identify-success-criteria"
    click Architecture "#choose-reference-architecture"
    click Test_image "#identify-test-image"

Define Testing

Your team should already understand what you’re testing and why.

  • What component/feature is being tested
  • What change is being evaluated (e.g., Rails 7.2 upgrade, Ruby 3.3 upgrade)
  • Expected impact areas (API response times, throughput, memory usage)

This step also includes deciding which tests to run and which test tooling you need to run them. This guide works from GPT being selected as the test tool, but there are others that can be used:

If GPT is selected as the test tool, a number of tests and load levels are already predefined and can be used as-is.

If no existing test or load level covers the desired use case, a new one will need to be written.

If you’re unsure about scope: Escalate to Performance Enablement before proceeding.

Identify Success Criteria

When running a performance test, knowing what success looks like can make the difference between a useful result and wasted effort. Define success criteria up front to guide your work; otherwise you may not capture the information you need and have to re-run the testing. Some examples:

  • Performance didn’t degrade from the baseline
  • Performance has improved by X % in comparison to the baseline
  • We’ve identified the performance characteristics of a new feature under load
  • We’ve established a baseline for this component

Part of determining the Success Criteria is determining what the baseline to compare against is. Is there an existing baseline that can be used or should you run a new baseline? Or is establishing a baseline the goal?

Baseline testing produces valuable performance metrics even without subsequent comparison testing. Teams documenting current system performance can stop after running the baseline if that meets the goal.

Important Considerations

  • Avoid tunnel vision: Going into a performance test with too specific a goal (e.g. "we expect a 2 ms response on the Workhorse P95 metric") can leave you so focused on one result that you miss interesting findings outside that scope.
  • Test results don’t translate 1:1 to production: A 3 ms improvement in your test environment may well translate into an improvement in production, but the magnitude will differ due to different hardware, load patterns, and data characteristics.

Choose Reference Architecture

Use the X Large Reference Architecture unless you have a specific reason not to.

The X Large architecture provides:

  • A good balance between complexity and resource usage
  • A well-documented and well-tested configuration
  • Alignment with published performance benchmarks
  • Sufficient HA components for realistic testing

Use a different size only if:

  • Testing a component that doesn’t need HA → Use Medium
  • Testing at production scale → Use 2X Large or larger
  • Testing Geo or specialized features → Use appropriate architecture (e.g., Geo setup)

You can use the Reference Architecture sizing guide to help decide on an architecture. There are a couple of repos containing sample GET configs:

If unsure which to use: Default to X Large Reference Architecture.

Identify Test Image

Coordinate with the team responsible for the change to obtain the image URL. They’ll have the most up-to-date version and can provide any special considerations.

Common image sources (for reference):

Document:

  • Image URL
  • Version/commit information
  • Source of the image (in case you need to get a refreshed image)

Setup Testing

flowchart LR
    %% Nodes
    GET_prep["Prepare<br/>GET Configuration"]
    Test_infra["Setup<br/>Test Infrastructure"]

    %% Map
    GET_prep --> Test_infra

    %% Links
    click GET_prep "#prepare-get-configuration"
    click Test_infra "#setup-test-infrastructure"

Prepare GET Configuration

Most of the effort is covered in the GET Quick Start Guide.

The important step is configuring gitlab_deb_download_url.

To have GET use the test image identified earlier, add the following to the Ansible vars.yml file:

all:
  vars:
    # ... existing vars ...

    gitlab_repo_script_url: "https://packages.gitlab.com/install/repositories/gitlab/nightly-builds/script.deb.sh"
    gitlab_deb_download_url: "{{ lookup('env','GITLAB_UBUNTU_IMAGE') | default('https://gitlab.com/api/v4/projects/14588374/jobs/11423868576/artifacts/pkg/ubuntu-jammy/gitlab.deb', true)}}" # update to use latest image url
    gitlab_deb_download_url_headers: {
        "PRIVATE-TOKEN": "{{ lookup('env','PRIVATE_PROD_TOKEN')}}",
    } # use your .com token

  # ... rest of file ...

For GPT testing: Ensure the required rate limits are disabled. The custom Post Configure task (https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit-configs/quality/-/blob/main/custom_task_files/gitlab_tasks/post_configure.yml#L1-35) applies this automatically; see https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit-configs/quality/-/blob/main/configs/reference_architectures/10k/ansible/inventory/vars.yml#L24-40 and https://gitlab.com/gitlab-org/gitlab-environment-toolkit/-/blob/main/docs/environment_advanced.md#custom-tasks.

Setup Test Infrastructure

It is recommended that you run the load test from as close to the test environment as possible (unless you want to include internet latency in the test, which is normally outside our testing scope).

On GCP, an n2-standard-2 VM will be sufficient. If you intend to run a large number of tests, increasing the boot disk size to 100 GB helps prevent running out of disk space mid-run.
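
As a rough sketch (assuming GCP and the gcloud CLI; the instance name, zone, and image family below are placeholders to adjust for your sandbox project), the load generator VM could be created with:

# Hypothetical example: create a small load-generator VM close to the test environment.
gcloud compute instances create gpt-load-generator \
  --machine-type=n2-standard-2 \
  --boot-disk-size=100GB \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud \
  --zone=us-central1-a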


Baseline Testing

flowchart LR
    %% Nodes
    deploy["Deploy<br/>Base Environment"]
    baseline_test["Run<br/>Baseline Test"]

    %% Map
    deploy --> baseline_test

    %% Links
    click deploy "#deploy-base-environment"
    click baseline_test "#run-baseline-test"

Deploy Base Environment

This links into the Stand up Environment steps below. To run a baseline, stand up a version of the environment that reflects how the system runs today, so you will probably want to comment out the gitlab_deb_download_url parameter.
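
For example, in the vars.yml shown under Prepare GET Configuration, the baseline deploy might simply comment the override out (a sketch only; keep the rest of your vars unchanged):

all:
  vars:
    # ... existing vars ...

    # Commented out for the baseline run so the environment installs the current release:
    # gitlab_deb_download_url: "{{ lookup('env','GITLAB_UBUNTU_IMAGE') }}"
    # gitlab_deb_download_url_headers: { "PRIVATE-TOKEN": "{{ lookup('env','PRIVATE_PROD_TOKEN') }}" }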

Run Baseline Test

Execute GPT against your baseline environment to establish comparison metrics.

Steps: See “Running a Performance Test”

Specific to baseline:

  • Use a current image (not the custom image you identified as your test image)
  • Document baseline results with timestamp
  • Save results as baseline_metrics_YYYY-MM-DD.csv

Note: If generating a baseline / documenting current system performance is the goal, the team can stop testing here and go directly to documenting the findings.


Upgrade Testing

flowchart LR
    %% Nodes
    deploy["Upgrade<br/>Environment"]
    test["Run<br/>Post-Upgrade Test"]

    %% Map
    deploy --> test

    %% Links
    click deploy "#upgrade-environment"
    click test "#run-post-upgrade-test"

Upgrade Environment

Re-run the Stand up Environment steps with the changes necessary to implement the change under test. Make sure you uncomment the gitlab_deb_download_url parameter if you commented it out for the baseline test.

Run Post-Upgrade Test

Execute the same test configuration against your upgraded environment.

Steps: See “Running a Performance Test”

Specific to post-upgrade:

  • Use upgraded image
  • Document post-upgrade results with timestamp
  • Save results as post-upgrade_metrics_YYYY-MM-DD.csv

Analysis and Results

flowchart LR
    %% Nodes
    compare["Compare Results"]
    document["Document Findings"]
    cleanup["Cleanup"]

    %% Map
    compare --> document
    document --> cleanup

    %% Links
    click compare "#compare-results"
    click document "#document-findings"
    click cleanup "#cleanup"

Compare Results

Compare the baseline and post-upgrade results against the success criteria you defined during preparation; see Interpreting the Results below for a framework and an example comparison.

Document Findings

  • Create summary of results
  • Note any concerning metrics
  • Determine if escalation needed

Cleanup

  • Destroy test environments and VMs
    • For GET (GitLab Environment Toolkit)

      # Ansible tear down
      cd <ansible_folder_in_GET>
      . ./get-python-env/bin/activate
      ansible-playbook -i environments/<ENV_NAME>/inventory playbooks/uninstall.yml
      
      # Terraform tear down
      cd <terraform_folder_in_GET>
      terraform destroy
      
    • You can leave the test load generator in an off state instead of destroying it if you expect to run load tests in the future.

  • Verify GCP resources are cleaned up
    • Pay particular attention to resources you created outside of GET, e.g. external IP addresses (see the sketch below)
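
A quick way to spot leftovers (assuming the gcloud CLI and that you know which project you deployed into; the project ID is a placeholder) is to list what remains after the teardown:

# Hypothetical check for leftover resources after teardown (adjust the project ID).
gcloud compute instances list --project=<YOUR_PROJECT_ID>
gcloud compute addresses list --project=<YOUR_PROJECT_ID>
gcloud compute disks list --project=<YOUR_PROJECT_ID>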

Stand up Environment

Deploy Environment

Provisioning the environment with GET is as simple as:

cd <terraform_folder_in_GET>
terraform apply

Once Terraform is complete, run the Ansible configure:

cd <ansible_folder_in_GET>
. ./get-python-env/bin/activate
ansible-playbook -i environments/<ENV_NAME>/inventory playbooks/all.yml

NOTE: Both of these tasks take a while to run, so be prepared for a wait.
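
Once the playbook finishes, a quick sanity check can confirm the environment responds before you seed data. This is a hedged sketch: the URL is your environment's external address, and the /-/health endpoint only answers requests from IPs on the monitoring allowlist.

# Hypothetical smoke check against the freshly deployed environment.
curl -sk "https://<ENVIRONMENT_URL>/users/sign_in" -o /dev/null -w "HTTP %{http_code}\n"
curl -sk "https://<ENVIRONMENT_URL>/-/health"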


Seed with Performance Data

To seed the environment, use the load generator you set up in Setup Test Infrastructure. Ensure your environment config file is set up for the environment you are loading data into. More details can be found in the GPT environment prep documentation. You will probably want to seed with both horizontal and vertical data.

The data generator that ships with GPT runs as:

docker run -it \
  -e ACCESS_TOKEN=your-access-token \
  -v $(pwd)/config:/config \
  -v $(pwd)/results:/results \
  gitlab/gpt-data-generator --environment my-env.json -u

The -u flag enables it to run unattended.

If you want to load a different repo, you can include a tarball of the project to be used via the --large-project-tarball=/home/user/<CUSTOM PROJECT TARBALL>.tar.gz flag. Alternate tarballs are available.

NOTE: Loading data can take 1-2 hours. It is recommended to run it in a screen session so that the data loading continues if you get disconnected.
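
A minimal screen workflow for the seeding run might look like this (the session name is arbitrary):

# Start a named screen session for the long-running data load.
screen -S gpt-seed

# ... run the gpt-data-generator command from above inside the session ...

# Detach with Ctrl-A then D; reattach later with:
screen -r gpt-seed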


Running a Performance Test

With GPT, running the test is as simple as:

docker run -it \
  -e ACCESS_TOKEN=<TOKEN> \
  -v <HOST CONFIG FOLDER>:/config \
  -v <HOST TESTS FOLDER>:/tests \
  -v <HOST RESULTS FOLDER>:/results \
  gitlab/gitlab-performance-tool --environment <ENV FILE NAME>.json --options 60s_500rps.json --tests <TEST FILE>.js

Before running a test, verify that your setup captures the metrics you need (either directly in the test results or in your observability tooling).

After the test completes, capture and save the metrics for comparison:

  • Test pass/failure statistics
  • Response timings from the test tool
  • System metrics from Observability

Make sure you capture the metrics relevant to your success criteria.


Capturing an image URL from a Master pipeline

In this example, I am grabbing from a test run in e2e-run-master

e2e-run-master slack channel

Click on the Pipeline link.

Pipeline view

Expand the GitLab Upstream and find the build-images job:

build-images stage

On the build-gdk-image job it shows the URL that was built:

image url

The URL is in the line: Built image 'registry.gitlab.com/gitlab-org/gitlab/gitlab-qa-gdk:2fe86491afa3db4d9b48c06302e295f038863c11'


Capturing an image URL from a branch pipeline

This example pulls an image from the rails-next pipeline

Rails Next Pipeline

Expand the downstream pipelines

Downstream pipelines

Open the trigger-omnibus job; this will open another pipeline that builds the images.

omnibus pipeline

Choose the appropriate job in the package stage for your architecture (I used Ubuntu-24.04-branch) and view its artifacts.

artifacts

In this case I needed to navigate a couple of folders, pkg > ubuntu-noble.

Copy the URL for the *.deb file and save it to be used as the GITLAB_UBUNTU_IMAGE environment variable for GET.

In this case the URL was:

https://gitlab.com/gitlab-org/build/omnibus-gitlab-mirror/-/jobs/12607844542/artifacts/file/pkg/ubuntu-noble/gitlab-ee_18.7.0+rfbranch.2245624022.06bed6be-0_amd64.deb

Note: ⚠️ Job artifacts expire and are deleted within a day or two, so you will probably need to re-capture the image URL if you need to rebuild your test environment.
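
To wire the captured URL into GET, one approach (a sketch; the variable names match the env lookups in the vars.yml shown under Prepare GET Configuration) is to export it before running the Ansible playbooks:

# Hypothetical shell setup before running ansible-playbook; substitute the URL you just copied.
export GITLAB_UBUNTU_IMAGE="<package URL copied above>"
export PRIVATE_PROD_TOKEN="<your GitLab.com access token>"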


Interpreting the Results

After running your baseline and post-upgrade tests, you need to understand what the metrics mean.

You can analyze the metrics several different ways:

  • Using the summary GPT generates, which compares results against thresholds
  • Reviewing the raw output
    • Loading the output into a spreadsheet and comparing (as was done in the examples)
    • Writing a script to compare (a minimal sketch follows this list)
    • Generating graphs (with a script or another tool)
    • Manually comparing the results
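
As an illustration only (the column layout is an assumption; adjust the field numbers to match the CSV your GPT run actually produces), a comparison script could be as simple as:

# Hypothetical comparison of two CSVs with the shape: endpoint,p95_ms
# Prints baseline vs post-upgrade p95 and the percentage change per endpoint.
join -t, -1 1 -2 1 \
  <(sort baseline_metrics_2025-01-01.csv) \
  <(sort post-upgrade_metrics_2025-01-01.csv) |
awk -F, '{ delta = ($3 - $2) / $2 * 100;
           printf "%-40s %8.1f ms -> %8.1f ms (%+.1f%%)\n", $1, $2, $3, delta }'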

Use this framework to evaluate your results and determine next steps.

Green (On Track)

  • Metrics within published thresholds
  • No significant regressions vs baseline (< 5% difference)
  • Performance stable or improved
  • Action: Document results and close testing

Yellow (Needs Attention)

  • Metrics slightly above thresholds (5-10% over)
  • Minor regressions vs baseline (5-10% slower)
  • Root cause needs investigation; optimization may be required
  • Action: Determine whether the performance trade-off is acceptable for your use case. If acceptable, document the changed baseline and proceed. If not, investigate optimization opportunities.

Red (Rework Needed)

  • Metrics significantly above thresholds (> 10% over)
  • Major regressions vs baseline (> 10% slower)
  • Functionality breaks under load
  • Action: Investigate root cause and implement fixes. Re-run testing after changes to verify improvement.

Mixed Results

If your results are mixed (some metrics Green, some Yellow or Red), focus on metrics most relevant to your change. Document trade-offs and fix the unacceptable problems.

Example Metric Comparison

When comparing baseline vs post-upgrade results:

Metric                     Baseline   Post-Upgrade   Change    Status
API Response Time (p95)    250ms      260ms          +4%       Green
Throughput (RPS)           195        192            -1.5%     Green
Error Rate                 0.1%       0.15%          +0.05%    Green
Memory Usage               4.2GB      4.5GB          +7%       Yellow
CPU Utilization            65%        68%            +4.6%     Green

In this example, memory usage shows a minor increase (Yellow), but other metrics are within acceptable ranges (Green). This would warrant investigation into the memory increase, but doesn’t necessarily block the upgrade.

Identifying Relevant Metrics

Not all metrics are relevant to every change. Focus on the metrics related to your specific change; some common metrics by change type:

For Rails/Ruby upgrades

  • API response times (p50, p95, p99)
  • Throughput (requests per second)
  • Error rates
  • Memory usage
  • CPU utilization

For database upgrades

  • Query execution time
  • Database connection pool usage
  • Disk I/O metrics
  • Memory usage

For infrastructure changes

  • Network latency
  • Disk throughput
  • CPU utilization
  • Memory usage

Real-World Examples

Rails 7.2 Upgrade Testing

The Rails 7.2 upgrade testing (#579847) demonstrates the complete workflow:

Setup:

  • Deployed 10k environment with Rails 7.2 omnibus package
  • Used custom image from MR artifacts
  • Configured Jammy (Ubuntu 22.04) for package compatibility
  • Seeded with horizontal and vertical performance data

Baseline Testing:

  • Ran 60s_200rps.json configuration against baseline environment
  • Exported Grafana metrics for comparison
  • Documented all results with timestamps

Upgrade & Testing:

  • Applied Rails 7.2 upgrade to environment
  • Verified new version running
  • Ran same 60s_200rps.json configuration
  • Captured the test results
  • Exported post-upgrade Grafana metrics (screenshots of the graphs)

Results:

  • Compared baseline vs post-upgrade metrics from the test run in a Google Sheet
  • Visually compared the Grafana graph exports
  • Checked against Reference Architecture benchmarks
  • Identified any performance regressions
  • Documented findings in issue

Key Learnings:

  • Use screen command to prevent test interruption from unstable connections
  • Ensure rate limits are disabled before testing
  • Export Grafana metrics with clear timestamps for accurate comparison

See the full issue for detailed results and analysis: #579847

Ruby 3.3 Upgrade Testing

The Ruby 3.3 upgrade testing (#516194) demonstrates baseline comparison methodology:

Setup:

  • Deployed 10k environment with Ruby 3.3 omnibus package
  • Created parallel environment with Ruby 3.2 for direct comparison
  • Seeded both environments with identical performance data

Baseline Testing:

  • Ran 60s_200rps.json configuration against both environments
  • Documented all results with timestamps
  • Captured the test results from each
  • Exported post-upgrade Grafana metrics (screenshots of the graphs) from each

Results Comparison:

  • Compared Ruby 3.3 vs Ruby 3.2 metrics in a Google Sheet
  • Visually compared the Grafana graph exports
  • Checked against Reference Architecture benchmarks
  • Identified endpoints with performance differences
  • Documented findings in comparison spreadsheet

Key Findings:

  • Overall test score slightly better for Ruby 3.3 (88.81% vs 88.75%)
  • Only a few endpoints showed degraded performance (api_v4_users, web_user, api_v4_groups_group, web_project_file_rendered)
  • Memory utilization slightly higher for Ruby 3.3 but showed similar patterns
  • Conclusion: Ruby 3.3 maintains acceptable performance levels

Key Learnings:

  • Creating a parallel environment with the previous version enables direct comparison
  • Spreadsheet-based results tracking helps identify specific endpoints with issues
  • Memory usage patterns are as important as response times
  • Document findings in a way that helps future decision-making

Resources:


Common Issues & Troubleshooting

This section will be populated with concrete issues and solutions as they’re encountered during testing. Check back here for troubleshooting guidance.


When to Escalate to Performance Enablement

Contact Performance Enablement if:

  • Test tool selection: Need help choosing which tool to use for testing
  • Test framework changes: Need to modify GPT for new components or test scenarios
  • Results interpretation: Results don’t match expectations and you need expert analysis
  • Custom testing: Need performance testing for components not in standard suite

How to escalate: Create an RFH in the Developer Experience RFH Project with:

  • Link to your test environment/results
  • Description of the issue
  • What you’ve already tried
  • Expected vs actual results

Questions or Feedback?

If you have questions about this guide or encounter issues not covered here, please:

  1. Check the troubleshooting section
  2. Review related resources above
  3. Create an issue in quality-engineering/team-tasks if you need Performance Enablement support