Infrastructure

The Infrastructure Department is responsible for the availability, reliability, performance, and scalability of GitLab.com and other supporting services.

Mission

The Infrastructure Department enables GitLab (the company) to deliver a single DevOps application, and GitLab SaaS users to focus on generating value for their own businesses by ensuring that we operate an enterprise-grade SaaS platform.

The Infrastructure Department does this by focusing on availability, reliability, performance, and scalability efforts. These responsibilities have cost efficiency as an additional driving force, reinforced by the properly prioritized dogfooding efforts.

Many other teams also contribute to the success of the SaaS platform because GitLab.com is not a role. However, it is the responsibility of the Infrastructure Department to drive the ongoing evolution of the SaaS platform, enabled by platform observability data.

Getting Assistance

If you’re a GitLab team member and are looking to alert the Infrastructure teams about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.

For all other queries, please see the getting assistance page.

Vision

The Infrastructure Department operates a fast, secure, and reliable SaaS platform to which (and with which) everyone can contribute.

An integral part of this vision is to:

  1. Build a highly performant team of engineers, combining operational and software development experience to influence the best in reliable infrastructure.
  2. Work publicly in accordance with our transparency value.
  3. Use our own product to prepare, build, deliver work, and support the company strategy.
  4. Align our strategy with the industry trends, company direction, and end customer needs.

Direction

The direction is accomplished by using Objectives and Key Results (OKRs).

Other strategic initiatives to achieve this vision are driven by the needs of enterprise customers looking to adopt GitLab.com. The GitLab.com strategy catalogs top customer requests for the SaaS offering and outlines the strategic initiatives across both Infrastructure and Stage Groups needed to address these gaps.

We are also Product Development

Unlike typical companies, part of the mandates of our Security, Infrastructure, and Support Departments is to contribute to the development of the GitLab Product. This follows from these concepts, many of which are also behaviors attached to our core values:

As such, everyone in the department should be familiar with, and be acting upon, the following statements:

  • We should all feel comfortable contributing to the GitLab open source project
  • If we need something, our first instinct should be to get it into the open source project so it can be given back to the community
  • Try to get it in the open source project first, rather than later, even if it’s 2x harder
  • We should be using the whole product to do our jobs
  • We are all familiar with our Dogfooding process and follow it
  • We should not expect new team members to join the company with these instincts, so we should be willing to teach them
  • It is part of managers’ responsibility to teach these values and behaviors

Organization structure

(click the boxes for more details)

flowchart LR
    I[Infrastructure]
    click I "/handbook/engineering/infrastructure/"

    I --> TPM[Technical Program Management]
    click TPM "/handbook/engineering/infrastructure/technical-program-management/"

    I --> EP[Engineering Productivity]
    click EP "/handbook/engineering/infrastructure/engineering-productivity/"
    I --> C[Core Platform]
    click C "/handbook/engineering/infrastructure/core-platform/"
    I --> EA[Engineering Analytics]
    click EA "/handbook/engineering/quality/engineering-analytics/"
    I --> TP[Test Platform]
    click TP "/handbook/engineering/infrastructure/test-platform/"
    I --> SP[SaaS Platforms]
    click SP "/handbook/engineering/infrastructure/platforms/"

    C --> SS[Systems Stage]
    click SS "/handbook/engineering/infrastructure/core-platform/systems/"

    SS --> GC[Gitaly]
    click GC "/handbook/engineering/infrastructure-platforms/data-access/gitaly/"
    SS --> Git[Git]
    click Git "/handbook/engineering/infrastructure-platforms/data-access/git/"
    SS --> Geo
    click Geo "/handbook/engineering/infrastructure/core-platform/systems/geo/"
    SS --> DB[Distribution::Build]
    click DB "/handbook/engineering/infrastructure/core-platform/systems/distribution/"
    SS --> DD[Distribution::Deploy]
    click DD "/handbook/engineering/infrastructure/core-platform/systems/distribution/"

    C --> DS[Data Stores Stage]
    click DS "/handbook/engineering/infrastructure/core-platform/data_stores/"
    DS --> TS[Tenant Scale]
    click TS "/handbook/engineering/infrastructure/core-platform/tenant-scale/"
    DS --> Database
    click Database "/handbook/engineering/infrastructure-platforms/data-access/database-framework/"
    DS --> CC[Cloud Connector]
    click CC "/handbook/engineering/infrastructure/core-platform/data_stores/cloud-connector/"

    SP --> DE[Delivery]
    click DE "/handbook/engineering/infrastructure/team/delivery/"
    DE --> Deployments
    DE --> Releases
    SP --> Ops
    click Ops "/handbook/engineering/infrastructure/team/ops/"
    SP --> Foundations
    click Foundations "/handbook/engineering/infrastructure/team/foundations/"
    SP --> Scalability
    click Scalability "/handbook/engineering/infrastructure/team/scalability/"
    Scalability --> Observability
    Scalability --> Practices

    SP --> D[Dedicated]
    click D "/handbook/engineering/infrastructure/team/gitlab-dedicated/"
    D --> E[Environment Automation]
    click E "/handbook/engineering/infrastructure/team/gitlab-dedicated/"
    D --> PSS[Public Sector Services]
    click PSS "/handbook/engineering/infrastructure/team/gitlab-dedicated/us-public-sector-services/"
    D --> Switchboard
    click Switchboard "/handbook/engineering/infrastructure/team/gitlab-dedicated/switchboard/"

    TP --> SMP[Self-Managed Platform]
    click SMP "/handbook/engineering/infrastructure/test-platform/self-managed-platform-team/"
    TP --> TE[Test Engineering]
    click TE "/handbook/engineering/infrastructure/test-platform/test-engineering-team/"
    TP --> TTI[Test and Tools Infrastructure]
    click TTI "/handbook/engineering/infrastructure/test-platform/test-and-tools-infrastructure-team/"

Design

The Infrastructure Library contains documents that outline our thinking about the problems we are solving and represents the current state for any topic, playing a significant role in how we produce technical solutions to meet the challenges we face.

Dogfooding

The Infrastructure department uses GitLab and GitLab features extensively as the main tool for operating many environments, including GitLab.com.

We follow the same dogfooding process as the rest of the Engineering function, while keeping the department mission statement as the primary prioritization driver. The prioritization process is aligned with the Engineering function level prioritization process, which defines where the priority of dogfooding lies with regard to other technical decisions the Infrastructure department makes.

When we consider building tools to help us operate GitLab.com, we follow the 5x rule to determine whether to build the tool as a feature in GitLab or outside of GitLab. To track Infrastructure’s contributions back into the GitLab product, we tag those issues with the appropriate Dogfooding label.

Handbook use at the Infrastructure department

At GitLab, we have a handbook first policy. It is how we communicate process changes, and how we build up a single source of truth for work that is being delivered every day.

The handbook usage guide lists a number of general tips. The ones encountered most frequently in the Infrastructure department are:

  1. The wider community can benefit from training materials, architectural diagrams, technical documentation, and how-to documentation. A good place for this detailed information is in the related project documentation. A handbook page can contain a high level overview, and link to more in-depth information placed in the project documentation.
  2. Think about the audience consuming the material in the handbook. A detailed run through of a GitLab.com operational runbook in the handbook might provide information that is not applicable to self-managed users, potentially causing confusion. Additionally, the handbook is not a go-to place for operational information, and grouping operational information together in a single place while explaining the general context with links as a reference will increase visibility.
  3. Ensure that the handbook pages are easy to consume. Checklists, onboarding, and repeatable tasks should be either automated or created in the form of a template that can be linked from the handbook.
  4. The handbook is the process. The handbook describes our principles, and our epics and issues are our principles put into practice.

Projects

Classification of the Infrastructure department projects is described on the infrastructure department projects page.

The infrastructure issue tracker is the backlog and a catch-all project for the infrastructure teams. It tracks the work our teams are doing that is unrelated to an ongoing change or incident.

In addition to tracking the backlog, Infrastructure Department projects are captured in our Infrastructure Department Epic as well as in our Quarterly Objectives & Key Results.

Supporting Product Features

We have a model that we use to help us support product features. This model provides details on how we collaborate to ship new features to Production.

Ownership

The Infrastructure team maintains responsibility for the underlying infrastructure on which customer-facing services run. Specific ownership details are in the GitLab Service Ownership Policy.

Stable Counterparts

Infrastructure SREs may be aligned with stage groups as stable counterparts.

Stable Counterparts are used as a framework for managing reliable services at GitLab. The framework provides guidelines for collaboration between Stage Groups and Infrastructure Teams.

Interviewing

The Infrastructure department hires for a number of different technical specialisms and positions across its teams. This Infrastructure Interviewing Guide offers more detail on some of our regular openings, interview process and other useful information related to applying to jobs with us. More information on our current openings can be found on the careers page.

Slack Channels

General Issue Trackers

Resources

Other Pages


Alert Playbook Management

Purpose

During an incident, playbooks are vital to the engineer on call (EOC) in resolving an alert. Having all of the salient information laid out in one place saves the EOC time in diagnosing and resolving the incident. It empowers the EOC with a set of standard steps for responding to incidents. Additionally, it can greatly reduce the stress of dealing with an alert when working on an unfamiliar service.

Capacity Planning for GitLab Infrastructure

Introduction

To scale GitLab infrastructure at the right time and to prevent incidents, we employ a capacity planning process, for example for GitLab.com and GitLab Dedicated.

In part, this process is predictive and gets input from a forecasting tool to predict future needs. This aims to provide an earlier and less obtrusive warning to infrastructure teams before components reach their individual saturation levels. The forecasting tool generates capacity warnings, which are converted to issues, and these issues are raised in various status meetings.
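To make the idea of a predictive capacity warning concrete, here is a minimal sketch. It is not the forecasting tool referred to above: it assumes one utilization sample per day, a plain least-squares trend line, and a hypothetical saturation threshold of 0.9. The function name and parameters are illustrative only.

```python
from typing import Optional, Sequence


def days_until_saturation(
    utilization: Sequence[float],
    threshold: float = 0.9,  # hypothetical saturation level, not a GitLab value
) -> Optional[float]:
    """Estimate days until a daily utilization series crosses `threshold`.

    Fits a least-squares line through the samples (one per day) and
    extrapolates it. Returns None when the trend is flat or decreasing,
    and 0.0 when the latest sample already meets the threshold.
    """
    n = len(utilization)
    if n < 2:
        raise ValueError("need at least two samples")
    if utilization[-1] >= threshold:
        return 0.0

    # Ordinary least-squares slope over day indices 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(utilization) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(utilization))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    if slope <= 0:
        return None  # no upward trend, so no projected crossing

    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    # Report the distance from the most recent sample, never negative.
    return max(0.0, crossing_day - (n - 1))


if __name__ == "__main__":
    # Utilization growing by two percentage points per day.
    print(days_until_saturation([0.50, 0.52, 0.54, 0.56, 0.58]))
```

A warning could then be raised whenever the projected number of days drops below some planning horizon; the horizon itself would be a team decision and is not specified here.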

Career Development in the Infrastructure Department

Career Development Discovery & Planning

There are a number of tools we use to plot and manage career development:

Maintaining current role descriptions which establish expectations for hiring and ongoing performance expectations is an important supporting function for effective Career Development planning.

The rest of the tools are for active engagement by the Team Member along with their Manager. The origin activity for this is the Big Picture Career Conversation, followed up with quarterly checkpoints and frequent 1:1s. Finally, 360 Feedback and Talent Assessment provide annual opportunities for additional insight on progress.

Cells
This is the handbook page for the Cells project. Cells is one of the top priorities for FY2025, with the goal of providing additional scalability for GitLab.com. This handbook page contains the project information such as the project plan, roadmap, workstreams, DRIs, stakeholders, and communication channels. It also has links to important documentation such as the Cells design blueprints.
Change Management

Purpose

Change Management has traditionally referred to the processes, procedures, tools and techniques applied in IT environments to carefully manage changes in an operational environment: change tickets and plans, approvals, change review meetings, scheduling, and other red tape.

In our context, Change Management refers to the guidelines we apply to manage changes in the operational environment with the aim of doing so (in order of highest to lowest priority) safely, effectively and efficiently. In some cases, this will require the use of elements from traditional change management; in most cases, we aim to build automation that removes those traditional aspects of change management to increase our speed in a safe manner.

Core Platform Sub-department

Vision

Offer enterprise-grade operational experience of GitLab products from streamlined deployment and maintenance, disaster recovery, secure search and discoverability, to high availability, scalability, and performance.

Mission

Core Platform focuses on improving our capabilities and metrics in the following areas:

All Team Members

The following people are permanent members of teams that belong to the Core Platform Sub-department:

Cost Management
GitLab Cost Management
Database

Database Reliability at GitLab

The group of Database Reliability Engineers (DBREs) are on the Reliability Engineering teams that run GitLab.com. We care most about database reliability aspects of the infrastructure and GitLab as a product.

We strive to approach database reliability from a data driven perspective as much as we can. As such, we start by defining Service Level Objectives below and document what service levels we currently aim to maintain for GitLab.com.

Emergency Change Processes for GitLab SaaS

The Infrastructure Department, responsible for managing the GitLab SaaS environment, has a number of processes that have an implicit emergency process component as a part of a regular workflow. This page serves as a high level overview of the most important components of those processes, with links to pages describing said processes in more depth.

Workflow

An integral part of any irregular situation occurring on GitLab SaaS is the incident management process. This process is used for platform degradation and outage events, but it is also the process for emergency changes such as addressing critical vulnerabilities.

Engineering Productivity team
The Engineering Productivity team maximizes the value and throughput of Product Development teams and wider community contributors by improving the developer experience, streamlining the product development processes, and keeping projects secure, compliant, and easy to work on for everyone.
Getting Assistance on SaaS Platforms
How to get assistance for problems on Production Platforms
GitLab Service Ownership Policy

Purpose

This policy establishes service ownership within the engineering organization for customer-facing services, outlining responsibilities and ownership structure.

Scope

This policy applies specifically to customer-facing services and the underlying infrastructure services that support them.

Service Ownership

Customer Facing Services

Reliability::General

  • Contains all customer-facing services tied to the monolith architecture.
  • Responsibilities include design, development, deployment, and operational stability.
  • Ensuring alignment with organizational standards and meeting service level objectives (SLOs) for customer-facing services.

Reliability::Practices

  • Contains all services that require designated engineering resources and expertise.
  • Responsibilities include design, development, deployment, and operational stability of these services.
  • Collaboration with the General Team, relevant stakeholders, and development teams to ensure compliance with organizational standards, overall system architecture, and specialized requirements.
  • Higher level of collaboration between the Infrastructure and Development factions to leverage expertise, align goals, and optimize service delivery.

Infrastructure Services

The Reliability Team maintains responsibility for the underlying infrastructure on which customer-facing services run. This includes:

Incident Management

Incident Management

Incidents are anomalous conditions that result in—or may lead to—service degradation or outages. These events require human intervention to avert disruptions or restore service to operational status. Incidents are always given immediate attention.

Incident Review

The primary goals of writing an Incident Review are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.

Introduction

An Incident Review is a crucial opportunity for fostering deeper understanding within a blameless culture. Its purpose extends beyond collecting action items to prevent recurrence; it is a process for learning about both the systems and the engineering culture that contribute to incidents. By discussing and analyzing how these components operate and interact, we gain valuable insights into the technical environments we support and the broader organizational context in which they function.

Infrastructure Department Frequently Asked Questions

GitLab.com Backups

Q: How often is GitLab.com backed up?

A: See our summary of our backup strategy.

Q: Are GitLab.com backups encrypted?

A: Yes. We use GCP Persistent Storage volumes underneath all of our filesystems, and that is implicitly encrypted. So the live filesystems, their snapshot-based backups, database replicas, and logical backups are all fully encrypted at the block device layer. Additionally, GCP encrypts and encapsulates traffic between our nodes within our VPCs, so data in motion is also protected from eavesdropping and tampering.

Infrastructure Department Performance Indicators

Executive Summary

KPI | Health | Status

GitLab.com Availability SLO | Okay
  • June 2024 100.00%
  • May 2024 99.99%
  • April 2024 99.95%

Corrective Action SLO | Attention
  • Corrective Action SLO is at 2

Master Pipeline Stability | Attention
  • April 2024 decreased to 91%
  • Causes of broken `master` in April 2024 were flaky tests (45%), infrastructure/runner issues (42%), job timing out (17%), various infrastructure issues (11%), failed to pull job image (9%), runner disk full (5%), merge train missing (3%), test gap (3%), dependency upgrade (3%), broken ci config (2%), GitLab.com overloaded (2%)
  • We automated the test quarantine process to remove very disruptive flaky tests from the pipelines and report them to their team's weekly triage report
  • More communication has been added to merge requests and Slack channels to seek earlier actions on failed pipelines

Merge request pipeline duration | Attention
  • The previous chart made some assumptions about the dependency of CI jobs, and those assumptions no longer hold, causing the chart to sometimes not take child pipelines into account when computing pipeline duration. A fix was made on 2024-04-29 to ensure that we're using the pipeline duration from the GitLab database directly instead of calculating it from queries ([see investigation](https://gitlab.com/gitlab-org/quality/engineering-productivity/team/-/issues/378#note_1740584179)). As a result, the average duration and percentile durations are higher than previously thought.

S1 Open Customer Bug Age (OCBA) | Okay
  • Promoted to KPI in FY24Q2
  • Near target for 3 consecutive months, uptick in current month due to ongoing triaging of issues
  • All S1 bugs are being reviewed for upcoming milestone planning

S2 Open Customer Bug Age (OCBA) | Attention
  • Promoted to KPI in FY24Q2
  • Above target, significant reduction will require a focus on older customer-impacting S2 bugs

Quality Team Member Retention | Confidential
  • Confidential metric, see notes in Key Review agenda
  • Will be merged into a combined department retention metric

Infrastructure Team Member Retention | Confidential
  • Confidential metric, see notes in Key Review agenda
  • Will be merged into a combined department retention metric

Key Performance Indicators

GitLab.com Availability SLO

Percentage of time during which GitLab.com is fully operational and providing service to users within SLO parameters. Definition is available on the GitLab.com Service Level Availability page. Historical Availability is available on the Service Level Availability page.
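As a rough illustration of what these percentages mean in practice, the sketch below converts a monthly availability figure into minutes outside SLO. It assumes availability is a simple fraction of wall-clock time, which is a simplification; the authoritative definition is the one on the Service Level Availability page. The function name is hypothetical.

```python
def minutes_outside_slo(availability_pct: float, days_in_month: int) -> float:
    """Convert a monthly availability percentage into minutes outside SLO.

    Assumes availability is measured as a plain fraction of wall-clock
    time in the month (an assumption made for this example only).
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Figures taken from the Executive Summary on this page.
    for month, pct, days in [
        ("April 2024", 99.95, 30),
        ("May 2024", 99.99, 31),
        ("June 2024", 100.00, 30),
    ]:
        print(f"{month}: {minutes_outside_slo(pct, days):.1f} minutes")
```

Under this simplified reading, 99.95% over a 30-day month corresponds to roughly 21.6 minutes, which shows how little room each additional "nine" leaves.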

Infrastructure Department Projects
GitLab's approach to the types, data classifications, canonical locations, ownership, workflow and organization of infrastructure department projects
Infrastructure Environments

Environments

The Terraform configuration for the environments can be found in config-mgmt.

Future Iteration with Infrastructure Standards

We have a WIP initiative to iterate on our company-wide infrastructure standards. You can learn more about this on the infrastructure standards handbook page.

This page will be refactored incrementally as the standards are documented, implemented, and changes to environments take place.

Development

Name | URL | Purpose | Deploy | Database | Terminal access
Development | various | Development | on save | Fixture | individual dev

Development happens on a local machine. Therefore there is no way to provide any SLA. Access is to the individual dev. This could be either EE/CE depending on what the developer is working on.

Infrastructure Feature Support
How the Infrastructure Department supports shipping features to Production.
Infrastructure OKRs
Infrastructure OKRs
Infrastructure Product Management

Responsibilities

The responsibilities of the Infrastructure Product Manager are documented in the job-families page.

Engagement Model

Inbound Requests

The Infra PM can help triage and prioritize inbound requests to Infrastructure from internal teams and GitLab.com customers.

Types of requests:

  1. Dogfooding requests
  2. Security and Compliance Requests
  3. GitLab.com customer requests in remit of the Infrastructure department:

Examples of requests related to operational capabilities of GitLab.com include:

Infrastructure Technical Program Management Team
The Infrastructure Technical Program Management Team drives the planning, execution, and delivery of complex infrastructure projects across Engineering and Product.
Library
Network Security Management Procedure

Purpose

GitLab architects a defense-in-depth methodology that enforces the concept of “least functionality” through restricting network access to systems, applications and services and ensures sufficient security and privacy controls are executed to protect the confidentiality, integrity, availability and safety of the organization’s network infrastructure, as well as to provide situational awareness of activity on GitLab’s networks.

Scope

GitLab’s network architecture is available to both internal and external users and hosts our DNS with Cloudflare, including gitlab.com and gitlab.net.

Production

The Production Environment

The GitLab.com production environment consists of services that operate, or support the operation of, gitlab.com. For a complete list of production services, see the service catalog.

Rate Limiting
This page exists to consolidate GitLab Rate Limiting documentation into a single source of truth. It is intended to reflect the current state of our rate limits, with the target audience being Operators (SRE and Support team members).
Release Tools
Guide to GitLab's tools for new releases
Service Maturity Model

Introduction

This page shows the output of our service maturity model for each service in our metrics catalog. The model itself is part of the metrics catalog, and uses information from the metrics catalog and the service catalog to score each service.

To achieve a particular level in the maturity model, a service must meet all the criteria for that level and all previous levels. Some criteria do not apply to all services (for instance, services like PgBouncer do not need development documentation).
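The scoring rule described above can be sketched as follows. This is an illustrative model only, not the implementation in the metrics catalog: the criterion names are hypothetical, and each criterion is represented as `True` (met), `False` (not met), or `None` (does not apply to this service).

```python
from typing import Dict, List, Optional

# One dict per level, ordered from level 1 upwards.
# Values: True = met, False = not met, None = not applicable to the service.
Criteria = Dict[str, Optional[bool]]


def maturity_level(levels: List[Criteria]) -> int:
    """Return the highest level whose criteria, and all earlier ones, are met.

    Criteria marked None are skipped, mirroring the rule that some
    criteria do not apply to every service. Returns 0 if level 1 fails.
    """
    achieved = 0
    for index, criteria in enumerate(levels, start=1):
        applicable = [met for met in criteria.values() if met is not None]
        if not all(applicable):
            break  # a failed level blocks every level above it
        achieved = index
    return achieved


if __name__ == "__main__":
    # Hypothetical service: development documentation does not apply,
    # level 3 has an unmet criterion, level 4 would otherwise pass.
    example = [
        {"has_service_indicators": True},
        {"development_documentation": None, "has_alerting": True},
        {"has_dashboards": False},
        {"has_capacity_forecast": True},
    ]
    print(maturity_level(example))
```

Stopping at the first failed level is what enforces "all previous levels": meeting the criteria of a higher level does not compensate for a gap lower down.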

Team

See the SaaS Platforms Organizational Structure for teams in Infrastructure.

Test Platform Sub-Department
Test Platform Sub-Department enables successful development and deployment of high quality GitLab software applications by providing innovative build automated solutions, reliable tooling, refined test efficiency, and fostering an environment where Quality is Everyone's responsibility.
The Infrastructure Platforms Section

Mission

The Infrastructure Platforms section enables GitLab Engineering to build and deliver safe, scalable and efficient features for multi-tenant and single-tenant GitLab SaaS platforms (GitLab.com and GitLab Dedicated).

Vision

To deliver on the mission, we are in the process of formalising the building blocks we need to work on.

Direction

In FY25, teams in the Platforms Section of the Infrastructure Department have collaborated on the “North Star”, which is then used to set the SaaS Platforms Strategy.