Self-Managed Platform team

Self-Managed Platform team in Test Platform sub-department
GitLab Team Handle: @gl-quality/tp-self-managed-platform
Team Boards: Team Board

Engineers on this team support the Core Platform and SaaS Platforms product sections, along with maintaining the self-managed platform tools.

Team members

Engineering Manager: Kassandra Svoboda

| S.No | Section | Stage / Group / Tool | SET Counterpart |
|------|---------|----------------------|-----------------|
| 1 | Core Platform | Data Stores | John McDonnell |
| 2 | Core Platform | Foundations | Nivetha Prabakaran |
| 3 | Core Platform | Tenant Scale | Andy Hohenner |
| 4 | Core Platform | Systems | Vishal Patel |
| 5 | Core Platform | Geo | Nick Westbury |
| 6 | SaaS Platforms | GitLab Dedicated | Brittany Wilkerson |
| 7 | SaaS Platforms | US Public Sector Services | Jim Baumgardner |
| 8 | - | GitLab Environment Toolkit & Reference Architectures | Grant Young |
| 9 | - | GitLab Performance Tool | Nailia Iskhakova |

OKRs

Every quarter, the team commits to Objectives and Key Results (OKRs). The list below shows the current quarter's OKRs and is updated regularly as the quarter progresses.

Here is an overview of our current Self-Managed Platform team OKRs.

Primary Projects

The Self-Managed Platform team owns several tools that form a three-pronged trident for Self-Managed Excellence: the Reference Architectures (RA), the GitLab Environment Toolkit (GET), and the GitLab Performance Tool (GPT). Together, these tools support our broader strategy of cementing customer confidence and contributing to customers' ongoing success by ensuring their instances are built to a rigorously tested standard that performs smoothly at scale.

```mermaid
flowchart LR
  subgraph selfManageExcel["Self-Managed Excellence"]
    RA("Reference Architectures (RA)")
    GET["GitLab Environment Toolkit (GET)"]
    GPT["GitLab Performance Tool (GPT)"]
  end
  style selfManageExcel fill:#FFF
  style RA color:#6b4fbb, stroke:#9370DB
  style GET color:#6b4fbb, stroke:#9370DB
  style GPT color:#6b4fbb, stroke:#9370DB
  click RA "https://docs.gitlab.com/ee/administration/reference_architectures/" _blank
  click GET "https://gitlab.com/gitlab-org/gitlab-environment-toolkit" _blank
  click GPT "https://gitlab.com/gitlab-org/quality/performance" _blank
```

Reference Architectures are officially recommended environment designs for deploying GitLab at scale in production, tested and maintained by the Reference Architecture group. The group, led by the Self-Managed Platform team, comprises individuals from various GitLab disciplines and has the following responsibilities:

  • To test, maintain, and update the Reference Architectures: officially recommended environment designs and guidance for deploying GitLab at scale in production.
  • To review any existing or proposed environment designs not already covered in the documentation.
  • To assess the need for updates to the Reference Architectures during and after escalations involving performance issues suspected to be caused by environment design.

GitLab Environment Toolkit (GET), our provisioning toolkit, is a collection of tools to deploy and operate production GitLab instances based on our Reference Architectures.

GitLab Performance Tool (GPT) is our performance testing tool for validating GitLab environments at scale.

The Self-Managed Excellence dashboard tracks merge request and issue metrics for the GitLab Environment Toolkit, GitLab Performance Tool, and Reference Architectures projects.
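
For context, metrics like these can be pulled from the GitLab REST API. The sketch below is illustrative only (it is not how the dashboard itself is built) and fetches issue counts for one of the tracked projects:

```python
# Illustrative only: pull issue counts for one tracked project via the
# public GitLab REST API. The dashboard itself is built differently.
import json
import urllib.request

# URL-encoded path of the GitLab Environment Toolkit project
project = "gitlab-org%2Fgitlab-environment-toolkit"
url = f"https://gitlab.com/api/v4/projects/{project}/issues_statistics"

with urllib.request.urlopen(url, timeout=10) as response:
    counts = json.load(response)["statistics"]["counts"]

print(f"open: {counts['opened']}, closed: {counts['closed']}, all: {counts['all']}")
```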

All Projects

| Name | Description |
|------|-------------|
| Reference Architectures | Officially recommended environment designs for deploying GitLab at scale in production. |
| GitLab Environment Toolkit | Provisioning toolkit. |
| GitLab Performance Tool | Performance testing tool for validation at scale. |
| Upgrade Tester | Pipelines that build environments using GET based on different Reference Architectures, seed them with data, and then upgrade and test each environment with every upgrade up to either a specified version or the latest nightly package. |
| Backup and Restore | Pipelines that build environments using GET based on different Reference Architectures, run through the backup and restore process, and verify the restored data. |
| GitLab Browser Performance Tool | A sister pipeline to GPT's backend performance pipelines, designed specifically to test web page frontend performance in browsers. |
| Performance Test Data | An LFS data repository for the GitLab Performance Tool. |
| Performance Docker Images | Docker builder and registry for GitLab performance testing. |
| Zero Downtime Testing Tool | A testing tool that monitors for downtime during a zero-downtime upgrade by continuously performing git operations and sending requests to the readiness?all=1 endpoint (a minimal sketch of this polling approach follows the table). |
| Self Managed Platform Team Channels Issue Tracker | Project used to track requests and questions from the Self-Managed Platform team Slack channels. |
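
The zero-downtime check works by repeatedly probing the instance while an upgrade is in flight and recording any window in which the readiness endpoint stops responding. Below is a minimal sketch of that polling loop; the URL is a placeholder, and the real tool's continuous git operations are omitted:

```python
# Minimal sketch of downtime monitoring during a zero-downtime upgrade.
# The target URL is a placeholder; the real tool also continuously
# performs git operations, which this sketch omits. Runs until interrupted.
import time
import urllib.request

URL = "https://gitlab.example.com/-/readiness?all=1"  # placeholder instance
INTERVAL_SECONDS = 1

downtime_started = None

while True:
    try:
        with urllib.request.urlopen(URL, timeout=5) as response:
            healthy = response.status == 200
    except Exception:
        healthy = False  # connection errors and non-2xx both count as downtime

    now = time.monotonic()
    if not healthy and downtime_started is None:
        downtime_started = now
        print("downtime started")
    elif healthy and downtime_started is not None:
        print(f"downtime ended after {now - downtime_started:.1f}s")
        downtime_started = None

    time.sleep(INTERVAL_SECONDS)
```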

Roadmap

Supporting internal customer initiatives

The Self-Managed Platform team prioritizes internal customer requests that impact large business initiatives. The team roadmap may change based on those ongoing priorities.

FY25

The key capabilities the Self-Managed Platform team plans to deliver in FY25 are summarized below.

Note: We aim to address user feedback and feature requests from the community for GET and the Reference Architectures throughout every quarter.

Q1 - Completed

  • Improve test coverage for upgrades
    • Create E2E Upgrade Tester
    • Enhance GitLab QA test scenario Test::Omnibus::UpdateFromPrevious for major and minor upgrades
      • Improve GitLab QA output logs for GitLab containers
      • Debug documentation for upgrade test jobs
    • Migration testing for multi-version* upgrades without building an environment
      • Create new required-to-pass test job db:migrate:multi-version-upgrade
      • Implement db:migrate:multi-version-upgrade to run against latest GitLab supported PostgreSQL versions
      • Create PG Dump Generator project
      • Add Rubocop to enforce factories for new tables
  • Improve Large Monorepo performance tests and validation
    • Set up additional Chromium repo performance testing pipeline
    • Identify monorepo performance hotspots

Q2 - Completed

  • Improve test coverage for Switchboard
    • Automate the onboarding flow of tenant creation
  • Improve Dedicated adoption speed of GET updates
    • Automate non-regression tests for GitLab Dedicated features at the infrastructure layer
      • Scope and add blueprint for integration testing of Allow Listing
      • Optimize use of GitLab-QA for Dedicated
  • Review cost of Reference Architecture internal usage
    • Dashboard to track the costs of Reference Architectures
    • Dedicated Development environment cost optimization
    • Identify cost optimization opportunities within GitLab based on performance hotspots
  • Improve test coverage for unified backup and restore
    • Test framework and pipeline for Backup and Restore
  • Enable confidence in the multi region deployments of AI Gateway
    • Test Framework and pipeline for client side latency metrics
    • Dashboard for historical data of baselines
  • GET feature enhancements and bug fixes
    • 3.3.0 Release
      • GKE Workload Identity support
      • GCP / AWS Customer Managed Encryption Keys expanded support
      • EKS Node Group AMI expanded support
      • Single Node expanded support
      • RHEL 9 support

Q3

  • Improve Dedicated adoption speed of GET updates
    • Automate non-regression tests for GitLab Dedicated features at the infrastructure layer
      • Integration testing of Allow Listing
      • Automate SAML tests
      • Automate Advanced Search tests
  • Enable confidence in the multi region deployments of AI Gateway
    • Latency testing baselines for bypassing monolith
  • Establish feature pipeline for Self Managed Platform Team
    • GitLab Environment Toolkit readiness template
    • Reference Architecture readiness template
  • GET feature enhancements and bug fixes
    • 3.4.0 Release
      • Firewall Rules refactors
      • GCP VPC Network Peering support
      • GCP Cloud SQL Geo support
      • Gitaly in Kubernetes experimental support
    • GitLab Operator support
  • Test Data Generator enhancements for large data seeding
  • Upgrade Tester enhancements
    • Geo support
  • Reference Architecture Updates
    • Reference Architecture Design Guide
    • Blog post series
    • Add customer testimonials to Reference Architecture

Q4

  • Establish feature pipeline for Self Managed Platform Team
    • Document requirements for a new component to be added to Reference Architecture
  • Improve Dedicated adoption speed of GET updates
    • Automate non-regression tests for GitLab Dedicated features at the infrastructure layer
      • Automate PrivateLink tests
      • Automate BYOK tests
      • Automate the Logging Stack tests
  • Upgrade Staging Ref to use latest GET version
  • GET feature enhancements

Future

  • GPT feature enhancements
    • Rate limits for Projects, Groups, and Users APIs (18.0)
    • Grafana slack alerts for high Rails memory use
    • GPT 3.0
  • Reference Architecture updates (GitLab feature readiness dependent)
    • Gitaly on Kubernetes (Cloud Native Service Reference Architecture)
    • Validation of RAFT
    • Container Registry Metadata Database support
    • CI Decomp support
    • Cloud Native Hybrid documentation impressions
    • Add 10k Azure test pipeline to Reference Architecture performance test pipeline schedule
  • GET Feature Enhancements
    • GitLab Pages (Linux Package) support
    • Full Infra Custom Tags and Labels support
    • More graceful Zero Downtime Upgrade support
    • Azure Kubernetes support
    • Azure PostgreSQL Flexible Server support
    • Azure Redis service support
  • Validate ARM Cloud Native Hybrid environment performance
  • Test Data Generator enhancements
    • Expand data seeding for Gitaly
  • Chromium megarepo data seeding
  • Improve test coverage for GitLab Upgrades
    • Running GitLab QA upgrade test jobs for unreleased patches
    • Reduce functionality degradation in upgrades
  • Improve test coverage for Postgres Upgrades

Working with us

There are occasions where the expertise of the Reference Architecture or Self-Managed Platform team may be needed in support of a customer issue.

Any requests relating to customer environments, either proposed or existing, must be raised in the Reference Architectures project with the appropriate template. Requests should be opened two or more business days before action is needed so the team has time to prepare, and we kindly ask that this process be followed for tracking and capacity reasons. Requests made outside of this process, such as direct asks to join customer calls or projects, will be rejected and should instead be directed to Support or Professional Services accordingly.

For issues specific to the GitLab Environment Toolkit (e.g. feature requests, bugs) or the GitLab Performance Tool (e.g. requests for help, performance testing of a new feature*, bugs), raise an issue in the respective project.

*To request help with performance testing of a new feature, please create a new issue in the GPT project using the request for help template.

For individual questions please reach out to the team via our Slack channels.

Slack Channels

| Channel | Purpose |
|---------|---------|
| #reference-architectures | Ask questions relating to the Reference Architectures |
| #gitlab_environment_toolkit | Discuss and ask questions relating to the GitLab Environment Toolkit |
| #gitlab_performance_tool | Discuss and ask questions relating to the GitLab Performance Tool and TP performance testing |
| #self-managed-platform-team | Engage with the Self-Managed Platform team |

How we work

Meetings and Scheduled Calls

Our preference is to work asynchronously within our projects' issue trackers.

The team does have a set of regular synchronous calls:

  • Self-Managed Environment Triage
  • 1-1s between the Individual Contributors and Engineering Manager

We also hold a stand-up twice per week, on Tuesdays and Thursdays, via our team's Slack channel.

Project Management

Issue Boards

We track our work on our team issue boards (see the Team Boards link above).

Capacity Planning

We use a simple issue weighting system for capacity planning, ensuring a manageable amount of work for each milestone. We consider both the team’s throughput and each engineer’s upcoming availability from Workday.

The weights are intended to be used in aggregate, and what takes one person a certain amount of time may be different for another, depending on their level of knowledge of the issue. We should strive to be accurate, but understand that they are estimates. We will change the weight if it is not accurate or if the issue becomes more difficult than originally expected, leave a comment indicating why the weight was changed, and tag the EM and any assigned DRIs so we can better understand the scope and continue to improve.
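
As a rough illustration of this kind of capacity check (all names and numbers below are hypothetical, not the team's actual figures), planned weights for a milestone can be compared against historical throughput scaled by availability:

```python
# Hypothetical capacity check: compare the issue weights planned for a
# milestone against expected capacity (historical throughput per engineer,
# scaled by each engineer's availability). All values are illustrative.
AVERAGE_WEIGHT_PER_ENGINEER = 10  # weight a fully available engineer completes

availability = {  # fraction of the milestone each engineer is available
    "engineer_a": 1.0,
    "engineer_b": 0.6,  # e.g. PTO recorded in Workday
    "engineer_c": 0.8,
}

capacity = AVERAGE_WEIGHT_PER_ENGINEER * sum(availability.values())
planned = sum([1, 2, 3, 5, 5, 3, 2])  # weights of issues scheduled this milestone

print(f"capacity = {capacity}, planned = {planned}")
if planned > capacity:
    print("over capacity: move lower-priority issues to a later milestone")
```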

Weights

To weigh an issue, consider the following important factors:

  • Volume of work: expected size of the change to the code base or validation testing required.
  • Amount of investigation or research expected.
  • Complexity:
    • Problem understanding: how well the problem is understood.
    • Problem-solving difficulty: the level of difficulty we expect to encounter.

The available weights are based on the Fibonacci sequence, with 8 being the highest assignable number. The definitions are as follows:

| Weight | Description | Examples |
|--------|-------------|----------|
| 1 - Trivial | Simple and quick changes. | Documentation fixes or smaller additions. |
| 2 - Small | Straightforward changes with no underlying dependencies and little investigation or research required. | Smaller Ansible additions or changes, e.g. within one role. |
| 3 - Medium | Well understood changes with a few dependencies that should only require a reasonable amount of investigation or research. | Large Ansible changes, e.g. affecting multiple roles. Small Terraform additions or changes, such as an additional setting for a Cloud Service. |
| 5 - Large | A larger task that will require a notable amount of investigation and research. | All changes relating to security. Large Terraform additions or changes, such as a new Cloud Service or changes affecting multiple components. |
| 8 - X-Large | A very large task that will require a significant amount of investigation and research. Pushing initiative level. | Large GitLab changes, such as a new component that requires joint Reference Architecture, GET, and GPT work. |
Anything that would be assigned a weight of 8 or larger should be broken down.

Status Updates

  • By 20:00 UTC / 03:00 PM ET on Fridays, DRIs of OKRs provide a status update in the comment section of the OKR.
    • Format for the weekly update:
      • Date of update (YYYY-MM-DD)
      • Brief update (a sentence or a couple of bullets) for each of these four bullets:
        • Status update - Progress has been updated to X%.
        • What was done ✅ - Unblocked blockers, any other progress achieved
        • Next steps 👷
        • Blockers 🛑 - Issues or unexpected work that blocked or affected progress, for example customer escalations or on-call DRI duties
  • Async stand-up on Tuesdays and Thursdays - reply to the Geekbot questionnaire on Slack.

GPT Pipeline Triage

Self-Managed Platform team members currently on the Pipeline DRI on-call rotation also monitor the #gpt-performance-run Slack channel. Open issues to be reviewed can be found in the GPT pipeline triage board.

Self Managed Platform Channels Issue Tracker

The issue tracker is used to capture requests and questions raised in the Self-Managed Platform team Slack channels (GitLab Environment Toolkit, Reference Architectures, and GitLab Performance Tool) as issues for tracking purposes.

See the project's Wiki page for more details on how the issue tracker is implemented.

Metrics

Reference Architectures

GitLab Environment Toolkit

GitLab Performance Tool

Requests from Slack

Test Platform process across product sections

Overall, we follow the same process as defined in our Test Platform handbook across all groups in Core Platform and SaaS Platforms, with a few exceptions tailored to the needs of specific groups.


Test Platform in Distribution group

Overview

The goal of this page is to document existing Quality Engineering activities in the Distribution group.

Dashboards

Quality work

Quality work is tracked in epic#9057. The epic lists large initiatives that need to be worked on to better support quality in the Distribution group.
