Self-Managed Platform team
Self-Managed Platform team in Test Platform sub-department
Common Links
Engineers on this team support the product sections covered by Core Platform and SaaS Platforms, along with maintaining the self-managed platform tools.
Team members
Engineering Manager: Kassandra Svoboda
OKRs
Every quarter, the team commits to Objectives and Key Results (OKRs). The list below shows the current quarter's OKRs and is updated regularly as the quarter progresses.
Here is an overview of our current Self-Managed Platform team OKRs.
Primary Projects
The Self-Managed Platform team owns several tools that form a three-pronged trident for Self-Managed Excellence: the Reference Architectures (RA), the GitLab Environment Toolkit (GET), and the GitLab Performance Tool (GPT). Together, these tools support our broader strategy of cementing customer confidence and contributing to customers' ongoing success by ensuring their instances are built to a rigorously tested standard that performs smoothly at scale.
```mermaid
flowchart LR
    subgraph selfManageExcel["Self-Managed Excellence"]
        RA("Reference Architectures (RA)")
        GET["GitLab Environment Toolkit (GET)"]
        GPT["GitLab Performance Tool (GPT)"]
    end
    style selfManageExcel fill:#FFF
    style RA color:#6b4fbb, stroke:#9370DB
    style GET color:#6b4fbb, stroke:#9370DB
    style GPT color:#6b4fbb, stroke:#9370DB
    click RA "https://docs.gitlab.com/ee/administration/reference_architectures/" _blank
    click GET "https://gitlab.com/gitlab-org/gitlab-environment-toolkit" _blank
    click GPT "https://gitlab.com/gitlab-org/quality/performance" _blank
```
Reference Architectures are officially recommended environment designs for deploying GitLab at scale in production. They are tested and maintained by the Reference Architecture group, which is led by the Self-Managed Platform team, is composed of individuals across various GitLab disciplines, and has the following responsibilities:
- Test, maintain, and update the Reference Architectures: officially recommended environment designs and guidance for deploying GitLab at scale in production
- Review any existing or proposed environment designs not already covered in the documentation
- Assess the need for updates to the Reference Architectures during and after escalations involving performance issues suspected to be caused by environment design
The GitLab Environment Toolkit (GET), our provisioning toolkit, is a collection of tools for deploying and operating production GitLab instances based on our Reference Architectures.
The GitLab Performance Tool (GPT) is our performance testing tool for validating GitLab environments at scale.
The Self-Managed Excellence dashboard tracks merge request and issue metrics for the GitLab Environment Toolkit, GitLab Performance Tool, and Reference Architectures projects.
All Projects
| Name | Description |
|------|-------------|
| Reference Architectures | Officially recommended environment designs for deploying GitLab at scale in production |
| GitLab Environment Toolkit | Provisioning toolkit |
| GitLab Performance Tool | Performance testing tool for validation at scale |
| Upgrade Tester | Pipeline that builds environments with GET based on different Reference Architectures. Each pipeline builds an environment, seeds it with data, and then upgrades and tests the environment with each upgrade to either a specified version or the latest nightly package |
| Backup and Restore | Pipelines that build environments with GET based on different Reference Architectures. Each runs through the backup and restore process and verifies the restored data |
| GitLab Browser Performance Tool | A sister pipeline to GPT's backend performance pipelines, designed specifically to test web page frontend performance in browsers |
| Performance Test Data | LFS data repository for the GitLab Performance Tool |
| Performance Docker Images | Docker builder and registry for GitLab performance testing |
| Zero Downtime Testing Tool | A testing tool that monitors any downtime occurring during a zero-downtime upgrade by continuously performing Git operations and sending requests to the `readiness?all=1` endpoint (see the sketch after this table) |
| Self Managed Platform Team Channels Issue Tracker | Project used to track requests and questions from Self-Managed Platform team Slack channels |
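To make the Zero Downtime Testing Tool's approach concrete, below is a minimal, illustrative sketch of the readiness-polling half of that workflow. It is not the tool's actual implementation: the instance URL is a placeholder, and the real tool also performs continuous Git operations in parallel with these requests.

```python
# Illustrative sketch only: detect downtime windows during a zero-downtime
# upgrade by polling GitLab's readiness health-check endpoint.
import time
from datetime import datetime, timezone

import requests

BASE_URL = "https://gitlab.example.com"  # placeholder instance URL
READINESS_URL = f"{BASE_URL}/-/readiness?all=1"  # check all service components


def poll(interval: float = 1.0) -> None:
    """Poll the readiness endpoint and report any window where it is unhealthy."""
    outage_started = None
    while True:
        try:
            ok = requests.get(READINESS_URL, timeout=5).status_code == 200
        except requests.RequestException:
            ok = False  # connection errors count as downtime
        now = datetime.now(timezone.utc)
        if not ok and outage_started is None:
            outage_started = now  # a downtime window opens
        elif ok and outage_started is not None:
            duration = (now - outage_started).total_seconds()
            print(f"Downtime: {outage_started:%H:%M:%S} -> {now:%H:%M:%S} ({duration:.1f}s)")
            outage_started = None
        time.sleep(interval)


if __name__ == "__main__":
    poll()
```

During a genuinely zero-downtime upgrade, a run of this loop should report no downtime windows at all.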
Roadmap
Supporting internal customer initiatives
The Self-Managed Platform team prioritizes internal customer requests that impact large business initiatives. The team roadmap may change based on those ongoing priorities.
FY25
The key capabilities the Self-Managed Platform team plans to deliver in FY25 are summarized below.
Note: We aim to address community feedback and feature requests for GET and the Reference Architectures throughout every quarter.
Q1 - Completed
- Improve test coverage for upgrades
  - Create E2E Upgrade Tester
  - Enhance GitLab QA test scenario `Test::Omnibus::UpdateFromPrevious` for major and minor upgrades
  - Improve GitLab QA output logs for GitLab containers
  - Debug documentation for upgrade test jobs
- Migration testing for multi-version* upgrades without building an environment
  - Create new required-to-pass test job `db:migrate:multi-version-upgrade`
  - Implement `db:migrate:multi-version-upgrade` to run against the latest GitLab-supported PostgreSQL versions
  - Create PG Dump Generator project
  - Add RuboCop rule to enforce factories for new tables
- Improve Large Monorepo performance tests and validation
  - Set up additional Chromium repo performance testing pipeline
  - Identify monorepo performance hotspots
Q2 - Completed
- Improve test coverage for Switchboard
  - Automate the onboarding flow of tenant creation
- Improve Dedicated adoption speed of GET updates
  - Automate non-regression tests for GitLab Dedicated features at the infrastructure layer
  - Scope and add blueprint for integration testing of Allow Listing
  - Optimize use of GitLab-QA for Dedicated
- Review cost of Reference Architecture internal usage
  - Dashboard to track the costs of Reference Architectures
  - Dedicated Development environment cost optimization
  - Identify cost optimization opportunities within GitLab based on performance hotspots
- Improve test coverage for unified backup and restore
  - Test framework and pipeline for Backup and Restore
- Enable confidence in the multi-region deployments of AI Gateway
  - Test framework and pipeline for client-side latency metrics
  - Dashboard for historical data of baselines
- GET feature enhancements and bug fixes
  - 3.3.0 Release
    - GKE Workload Identity support
    - GCP / AWS Customer Managed Encryption Keys expanded support
    - EKS Node Group AMI expanded support
    - Single Node expanded support
    - RHEL 9 support
Q3
- Improve Dedicated adoption speed of GET updates
  - Automate non-regression tests for GitLab Dedicated features at the infrastructure layer
    - Integration testing of Allow Listing
    - Automate SAML tests
    - Automate Advanced Search tests
- Enable confidence in the multi-region deployments of AI Gateway
  - Latency testing baselines for bypassing monolith
- Establish feature pipeline for Self-Managed Platform team
  - GitLab Environment Toolkit readiness template
  - Reference Architecture readiness template
- GET feature enhancements and bug fixes
  - 3.4.0 Release
    - Firewall Rules refactors
    - GCP VPC Network Peering support
    - GCP Cloud SQL Geo support
    - Gitaly in Kubernetes experimental support
    - GitLab Operator support
- Test Data Generator enhancements for large data seeding
- Upgrade Tester enhancements
- Reference Architecture updates
  - Reference Architecture Design Guide
  - Blog post series
  - Add customer testimonials to Reference Architecture
Q4
- Establish feature pipeline for Self-Managed Platform team
  - Document requirements for a new component to be added to Reference Architecture
- Improve Dedicated adoption speed of GET updates
  - Automate non-regression tests for GitLab Dedicated features at the infrastructure layer
    - Automate PrivateLink tests
    - Automate BYOK tests
    - Automate the Logging Stack tests
- Upgrade Staging Ref to use latest GET version
- GET feature enhancements
Future
- GPT feature enhancements
  - Rate limits for Projects, Groups, and Users APIs (18.0)
  - Grafana Slack alerts for high Rails memory use
  - GPT 3.0
- Reference Architecture updates (GitLab feature readiness dependent)
  - Gitaly on Kubernetes (Cloud Native Service Reference Architecture)
  - Validation of Raft
  - Container Registry Metadata Database support
  - CI Decomp support
  - Cloud Native Hybrid documentation impressions
  - Add 10k Azure test pipeline to Reference Architecture performance test pipeline schedule
- GET feature enhancements
  - GitLab Pages (Linux Package) support
  - Full Infra Custom Tags and Labels support
  - More graceful Zero Downtime Upgrade support
  - Azure Kubernetes support
  - Azure PostgreSQL Flexible Server support
  - Azure Redis service support
  - Validate ARM Cloud Native Hybrid environment performance
- Test Data Generator enhancements
  - Expand data seeding for Gitaly
  - Chromium megarepo data seeding
- Improve test coverage for GitLab Upgrades
  - Running GitLab QA upgrade test jobs for unreleased patches
  - Reduce functionality degradation in upgrades
  - Improve test coverage for Postgres Upgrades
Working with us
There are occasions where the expertise of the Reference Architecture or Self-Managed Platform team may be needed in support of a customer issue.
Any requests relating to customer environments, either proposed or existing, must be raised in the Reference Architectures project with the appropriate template. Requests should be opened two or more business days before action is needed to ensure the team has time to prepare, and we kindly ask that this process be followed for tracking and capacity reasons. Any requests made outside of this process, such as direct asks to join customer calls or projects, will be rejected and should instead be directed to Support or Professional Services accordingly.
Issues specifically with the GitLab Environment Toolkit (e.g. feature requests, bugs) or the GitLab Performance Tool (e.g. requests for help, performance testing of a new feature*, bugs) can be raised in each respective project.
*To request help with performance testing of a new feature, please create a new issue in the GPT project using the request for help template.
For individual questions please reach out to the team via our Slack channels.
Slack Channels
How we work
Meetings and Scheduled Calls
Our preference is to work asynchronously, within our projects' issue trackers.
The team does have a set of regular synchronous calls:
- Self-Managed Environment Triage
- 1-1s between the Individual Contributors and Engineering Manager
- Stand-up twice per week, on Tuesday and Thursday, via our team's Slack channel
Project Management
Issue Boards
We track our work on the following issue boards:
Capacity Planning
We use a simple issue weighting system for capacity planning, ensuring a manageable amount of work for each milestone. We consider both the team's throughput and each engineer's upcoming availability from Workday.
The weights are intended to be used in aggregate: what takes one person a certain amount of time may take another more or less, depending on their level of knowledge of the issue. We should strive to be accurate, but understand that weights are estimates. We will change the weight if it is not accurate or if the issue becomes more difficult than originally expected, leave a comment indicating why the weight was changed, and tag the EM and any assigned DRIs so we can better understand the scope and continue to improve.
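As a rough illustration of using weights in aggregate (a hypothetical calculation, not a formal formula the team follows), milestone capacity can be estimated from recent throughput scaled by upcoming availability:

```python
# Illustrative sketch only: estimate milestone capacity from historical
# throughput and per-engineer availability. All figures are hypothetical.

# Weight points completed in recent milestones.
recent_throughput = [21, 18, 24]

# Fraction of the upcoming milestone each engineer is available,
# e.g. derived from PTO recorded in Workday.
availability = {"engineer_a": 1.0, "engineer_b": 0.6, "engineer_c": 0.8}

# Historical average assumes the whole team was available.
avg_throughput = sum(recent_throughput) / len(recent_throughput)

# Scale by the fraction of the team available this milestone.
team_fraction = sum(availability.values()) / len(availability)
capacity = avg_throughput * team_fraction

print(f"Plan roughly {capacity:.0f} weight points for this milestone")
# -> Plan roughly 17 weight points for this milestone
```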
Weights
To weigh an issue, consider the following important factors:
- Volume of work: expected size of the change to the code base or validation testing required.
- Amount of investigation or research expected.
- Complexity:
- Problem understanding: how well the problem is understood.
- Problem-solving difficulty: the level of difficulty we expect to encounter.
The available weights are based on the Fibonacci sequence, with 8 being the highest assignable number. The definitions are as follows:
| Weight | Description | Examples |
|--------|-------------|----------|
| 1 - Trivial | Simple and quick changes | Documentation fixes or smaller additions |
| 2 - Small | Straightforward changes with no underlying dependencies and little investigation or research required | Smaller Ansible additions or changes, e.g. within one role |
| 3 - Medium | Well-understood changes with a few dependencies that should only require a reasonable amount of investigation or research | Large Ansible changes, e.g. affecting multiple roles. Small Terraform additions or changes, such as an additional setting for a Cloud Service |
| 5 - Large | A larger task that will require a notable amount of investigation and research. All changes relating to security | Large Terraform additions or changes, such as a new Cloud Service or changes affecting multiple components |
| 8 - X-Large | A very large task that will require a significant amount of investigation and research. Pushing initiative level | Large GitLab changes, such as a new component requiring joint Reference Architecture, GET, and GPT work |
Anything that would be assigned a weight of 8 or larger should be broken down.
Status Updates
- By 20:00 UTC / 03:00 PM ET on Fridays, DRIs of OKRs provide a status update in the comment section of the OKR
- Format for weekly update (see the example after this list):
  - Date of update (YYYY-MM-DD)
  - Brief update (a sentence or a couple of bullets) for each of these four bullets:
    - Status update: progress has been updated to X%
    - What was done ✅: unblocked blockers, any other progress achieved
    - Next steps 👷
    - Blockers 🛑: issues or unexpected work that blocked or affected progress, for example customer escalations or on-call DRI duties
- Async stand-up on Tuesdays and Thursdays: reply to the Geekbot questionnaire on Slack
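For illustration, a weekly OKR update following this format might look like the following (all details are hypothetical):

2025-01-17
- Status update: progress has been updated to 60%.
- What was done ✅: resolved the failing upgrade pipeline and merged the remaining scoped MRs.
- Next steps 👷: begin validation testing against a 10k environment.
- Blockers 🛑: none this week.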
GPT Pipeline Triage
Self-Managed Platform team members who are currently on the Pipeline DRI on-call rotation also monitor the #gpt-performance-run Slack channel. Open issues to be reviewed can be found in the GPT pipeline triage board.
The issue tracker is used to capture requests and questions from the Self-Managed Platform team Slack channels (GitLab Environment Toolkit, Reference Architectures, and GitLab Performance Tool) as issues for tracking purposes.
See the project's Wiki page for more details on how the issue tracker is implemented.
Metrics
Reference Architectures
GitLab Environment Toolkit
GitLab Performance Tool
Requests from Slack
Overall, we follow the same process as defined in our Test Platform handbook across all groups in Core Platform and SaaS Platform, with a few exceptions curated to fit the needs of specific groups.
Overview
The goal of this page is to document existing Quality Engineering activities in the Distribution group.
Dashboards
Quality work
Quality work is tracked in epic #9057. The epic lists large initiatives that need to be worked on to better support quality in the Distribution group.