The Infrastructure Department is responsible for the availability, reliability, performance, and scalability of GitLab.com and other supporting services
Mission
The Infrastructure Department enables GitLab (the company) to deliver a single DevOps application, and GitLab SaaS users to focus on generating value for their own businesses by ensuring that we operate an enterprise-grade SaaS platform.
The Infrastructure Department does this by focusing on availability, reliability, performance, and scalability efforts.
These responsibilities have cost efficiency as an additional driving force, reinforced by the properly prioritized dogfooding efforts.
Many other teams also contribute to the success of the SaaS platform because GitLab.com is not a role.
However, it is the responsibility of the Infrastructure Department to drive the ongoing evolution of the SaaS platform, enabled by platform observability data.
Getting Assistance
If you’re a GitLab team member and are looking to alert the Infrastructure teams about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
The Infrastructure Department operates a fast, secure, and reliable SaaS platform to which (and with which) everyone can contribute.
An integral part of this vision is to:
Build a highly performant team of engineers, combining operational and software development experience to influence the best in reliable infrastructure.
Work publicly in accordance with our transparency value.
Other strategic initiatives to achieve this vision are driven by the needs of enterprise customers looking to adopt GitLab.com. The GitLab.com strategy catalogs top customer requests for the SaaS offering and outlines strategic initiatives across both Infrastructure and Stage Groups needed to address these gaps.
We are also Product Development
Unlike typical companies, part of the mandates of our Security, Infrastructure, and Support Departments is to contribute to the development of the GitLab Product. This follows from these concepts, many of which are also behaviors attached to our core values:
We should not expect new team members to join the company with these instincts, so we should be willing to teach them
It is part of managers’ responsibility to teach these values and behaviors
Organization structure
(click the boxes for more details)
flowchart LR
I[Infrastructure]
click I "/handbook/engineering/infrastructure/"
I --> TPM[Technical Program Management]
click TPM "/handbook/engineering/infrastructure/technical-program-management/"
I --> EP[Engineering Productivity]
click EP "/handbook/engineering/infrastructure/engineering-productivity/"
I --> C[Core Platform]
click C "/handbook/engineering/infrastructure/core-platform/"
I --> EA[Engineering Analytics]
click EA "/handbook/engineering/quality/engineering-analytics/"
I --> TP[Test Platform]
click TP "/handbook/engineering/infrastructure/test-platform/"
I --> SP[SaaS Platforms]
click SP "/handbook/engineering/infrastructure/platforms/"
C --> SS[Systems Stage]
click SS "/handbook/engineering/infrastructure/core-platform/systems/"
SS --> GC[Gitaly]
click GC "/handbook/engineering/infrastructure-platforms/data-access/gitaly/"
SS --> Git[Git]
click Git "/handbook/engineering/infrastructure-platforms/data-access/git/"
SS --> Geo
click Geo "/handbook/engineering/infrastructure/core-platform/systems/geo/"
SS --> DB[Distribution::Build]
click DB "/handbook/engineering/infrastructure/core-platform/systems/distribution/"
SS --> DD[Distribution::Deploy]
click DD "/handbook/engineering/infrastructure/core-platform/systems/distribution/"
C --> DS[Data Stores Stage]
click DS "/handbook/engineering/infrastructure/core-platform/data_stores/"
DS --> TS[Tenant Scale]
click TS "/handbook/engineering/infrastructure/core-platform/tenant-scale/"
DS --> Database
click Database "/handbook/engineering/infrastructure-platforms/data-access/database-framework/"
DS --> CC[Cloud Connector]
click CC "/handbook/engineering/infrastructure/core-platform/data_stores/cloud-connector/"
SP --> DE[Delivery]
click DE "/handbook/engineering/infrastructure/team/delivery/"
DE --> Deployments
DE --> Releases
SP --> Ops
click Ops "/handbook/engineering/infrastructure/team/ops/"
SP --> Foundations
click Foundations "/handbook/engineering/infrastructure/team/foundations/"
SP --> Scalability
click Scalability "/handbook/engineering/infrastructure/team/scalability/"
Scalability --> Observability
Scalability --> Practices
SP --> D[Dedicated]
click D "/handbook/engineering/infrastructure/team/gitlab-dedicated/"
D --> E[Environment Automation]
click E "/handbook/engineering/infrastructure/team/gitlab-dedicated/"
D --> PSS[Public Sector Services]
click PSS "/handbook/engineering/infrastructure/team/gitlab-dedicated/us-public-sector-services/"
D --> Switchboard
click Switchboard "/handbook/engineering/infrastructure/team/gitlab-dedicated/switchboard/"
TP --> SMP[Self-Managed Platform]
click SMP "/handbook/engineering/infrastructure/test-platform/self-managed-platform-team/"
TP --> TE[Test Engineering]
click TE "/handbook/engineering/infrastructure/test-platform/test-engineering-team/"
TP --> TTI[Test and Tools Infrastructure]
click TTI "/handbook/engineering/infrastructure/test-platform/test-and-tools-infrastructure-team/"
Design
The Infrastructure Library contains documents that outline our thinking about the problems we are solving and represents the current state for any topic, playing a significant role in how we produce technical solutions to meet the challenges we face.
Dogfooding
The Infrastructure department uses GitLab and GitLab features extensively as the main tool for operating many environments, including GitLab.com.
When we consider building tools to help us operate GitLab.com, we follow the 5x rule to determine whether to build the tool as a feature in GitLab or outside of GitLab. To track Infrastructure’s contributions back into the GitLab product, we tag those issues with the appropriate Dogfooding label.
Handbook use at the Infrastructure department
At GitLab, we have a handbook first policy. It is how we communicate process changes, and how we build up a single source of truth for work that is being delivered every day.
The handbook usage guide lists a number of general tips. The ones most frequently encountered in the Infrastructure department are:
The wider community can benefit from training materials, architectural diagrams, technical documentation, and how-to documentation. A good place for this detailed information is in the related project documentation. A handbook page can contain a high level overview, and link to more in-depth information placed in the project documentation.
Think about the audience consuming the material in the handbook. A detailed run through of a GitLab.com operational runbook in the handbook might provide information that is not applicable to self-managed users, potentially causing confusion. Additionally, the handbook is not a go-to place for operational information, and grouping operational information together in a single place while explaining the general context with links as a reference will increase visibility.
Ensure that the handbook pages are easy to consume. Checklists, onboarding, and repeatable tasks should be either automated or created in the form of a template that can be linked from the handbook.
The handbook is the process. The handbook describes our principles, and our epics and issues are our principles put into practice.
The infrastructure issue tracker is the backlog and a catch-all project for the infrastructure teams and tracks the work our teams are doing–unrelated to an ongoing change or incident.
We have a model that we use to help us support product features. This model provides details on how we collaborate to ship new features to Production.
Ownership
The Infrastructure team maintains responsibility for the underlying infrastructure on which customer-facing services run. Specific ownership details are in the GitLab Service Ownership Policy.
The Infrastructure department hires for a number of different technical specialisms and positions across its teams. This Infrastructure Interviewing Guide offers more detail on some of our regular openings, interview process and other useful information related to applying to jobs with us. More information on our current openings can be found on the careers page.
During an incident, playbooks are vital to the engineer on call (EOC) in resolving an alert. Having all of the salient information laid out in one place saves the EOC time in diagnosing and resolving the incident. It empowers the EOC with a set of standard steps for responding to incidents. Additionally, it can greatly reduce the stress of dealing with an alert when working on an unfamiliar service.
To scale GitLab infrastructure at the right time and to prevent incidents, we employ a capacity planning process, for example for GitLab.com and GitLab Dedicated.
In parts, this process is predictive and gets input from a forecasting tool to predict future needs.
This aims to provide an earlier and less obtrusive warning to infrastructure teams before components reach their individual saturation levels.
The forecasting tool generates capacity warnings which are converted to issues and these issues are raised in various status meetings.
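The predictive step described above can be sketched as follows. This is a minimal illustration that assumes a plain linear trend over daily utilization samples. The actual forecasting tool, its model, and its thresholds are not described on this page, so the function names, the 90% threshold, and the 90-day horizon below are all hypothetical.

```python
from typing import Optional, Sequence


def days_until_saturation(samples: Sequence[float], threshold: float = 0.9) -> Optional[float]:
    """Fit a least-squares line to daily utilization samples (0.0-1.0).

    Returns the number of days after the last sample at which the fitted
    line crosses `threshold`, 0.0 if the last sample is already at or above
    it, or None if the trend is flat or decreasing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    if samples[-1] >= threshold:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    return max(crossing_day - (n - 1), 0.0)


def capacity_warning(component: str, samples: Sequence[float],
                     threshold: float = 0.9, horizon_days: float = 90) -> Optional[str]:
    """Return a warning title (hypothetical format) if saturation is forecast
    within the horizon, otherwise None. A real pipeline would turn this
    string into an issue."""
    days = days_until_saturation(samples, threshold)
    if days is None or days > horizon_days:
        return None
    return f"{component}: forecast to reach {threshold:.0%} saturation in ~{days:.0f} days"


if __name__ == "__main__":
    # Utilization growing by one percentage point per day.
    print(capacity_warning("disk", [0.50, 0.51, 0.52, 0.53, 0.54]))
```

A warning is only produced when the forecast crossing falls inside the horizon, which mirrors the goal of warning teams early without raising issues for components that are nowhere near saturation.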
Maintaining current role descriptions which establish expectations for hiring and ongoing performance expectations is an important supporting function for effective Career Development planning.
The rest of the tools are for active engagement by the Team Member along with their Manager. The origin activity for this is the Big Picture Career Conversation, followed up with quarterly checkpoints and frequent 1:1s. Finally, 360 Feedback and Talent Assessment provide annual opportunities for additional insight on progress.
This is the handbook page for the Cells project. Cells is one of the top priorities for FY2025, with the goal of providing additional scalability for GitLab.com. This handbook page contains the project information such as the project plan, roadmap, workstreams, DRIs, stakeholders, and communication channels. It also has links to important documentation such as the Cells design blueprints.
Change Management has traditionally referred to the processes, procedures, tools and techniques applied in IT environments to carefully manage changes in an operational environment: change tickets and plans, approvals, change review meetings, scheduling, and other red tape.
In our context, Change Management refers to the guidelines we apply to manage changes in the operational environment with the aim of doing so (in order of highest to lowest priority) safely, effectively and efficiently. In some cases, this will require the use of elements from traditional change management; in most cases, we aim to build automation that removes those traditional aspects of change management to increase our speed in a safe manner.
Offer enterprise-grade operational experience of GitLab products from streamlined deployment and maintenance, disaster recovery, secure search and discoverability, to high availability, scalability, and performance.
Mission
Core Platform focuses on improving our capabilities and metrics in the following areas:
The group of Database Reliability Engineers (DBREs) are on the Reliability
Engineering teams that run GitLab.com. We care most about database
reliability aspects of the infrastructure and GitLab as a product.
We strive to approach database reliability from a data driven
perspective as much as we can. As such, we start by defining Service
Level Objectives below and document what service levels we currently aim
to maintain for GitLab.com.
The Infrastructure Department, responsible for managing the GitLab SaaS environment, has a number of processes that include an implicit emergency process component as part of a regular workflow. This page serves as a high level overview of the most important components of those processes, with links to pages describing said processes in more depth.
Workflow
An integral part of any irregular situation occurring on GitLab SaaS is the incident management process.
This process is used for platform degradation and outage events, but it is also the process for emergency changes such as addressing critical vulnerabilities.
The Engineering Productivity team maximizes the value and throughput of Product Development teams and wider community contributors by improving the developer experience, streamlining the product development processes, and keeping projects secure, compliant, and easy to work on for everyone.
This policy establishes service ownership within the engineering organization for customer-facing services, outlining responsibilities and ownership structure.
Scope
This policy applies specifically to customer-facing services and the underlying infrastructure services that support them.
Service Ownership
Customer Facing Services
Reliability::General
Contains all customer-facing services tied to the monolith architecture.
Responsibilities include design, development, deployment, and operational stability.
Ensuring alignment with organizational standards and meeting service level objectives (SLOs) for customer-facing services.
Reliability::Practices
Contains all services that require designated engineering resources and expertise.
Responsibilities include design, development, deployment, and operational stability of these services.
Collaboration with the General Team, relevant stakeholders, and development teams to ensure compliance with organizational standards, overall system architecture, and specialized requirements.
Higher level of collaboration between the Infrastructure and Development factions to leverage expertise, align goals, and optimize service delivery.
Infrastructure Services
The Reliability Team maintains responsibility for the underlying infrastructure on which customer-facing services run. This includes:
If you’re a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
If you’re a GitLab team member looking for who is currently the Engineer On Call (EOC), please see the Who is the Current EOC? section.
If you’re a GitLab team member looking for the status of a recent incident, please see the incident board. For detailed information about incident status changes, please see the Incident Workflow section.
Incident Management
Incidents are anomalous conditions that result in—or may lead
to—service degradation or outages. These events require human
intervention to avert disruptions or restore service to operational status.
Incidents are always given immediate attention.
The primary goals of writing an Incident Review are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.
Introduction
An Incident Review is a crucial opportunity for fostering deeper understanding within a blameless culture. Its purpose extends beyond collecting action items to prevent recurrence; it is a process for learning about both the systems and the engineering culture that contribute to incidents. By discussing and analyzing how these components operate and interact, we gain valuable insights into the technical environments we support and the broader organizational context in which they function.
Causes of broken `master` in April 2024 were flaky tests (45%), infrastructure/runner issues (42%), job timing out (17%), various infrastructure issues (11%), failed to pull job image (9%), runner disk full (5%), merge train missing (3%), test gap (3%), dependency upgrade (3%), broken ci config (2%), GitLab.com overloaded (2%)
We automated the test quarantine process to remove very disruptive flaky tests from the pipelines and report them to their team's weekly triage report
More communication has been added to merge requests and Slack channels to seek earlier actions on failed pipelines
The previous chart we were showing made some assumptions on the dependency of CI jobs, and those assumptions do not hold anymore,
causing our chart to sometimes not take child pipelines into account when computing pipeline duration.
A fix was made on 2024-04-29 to ensure that we're using the pipeline duration from GitLab database directly instead of calculating from queries ([see investigation](https://gitlab.com/gitlab-org/quality/engineering-productivity/team/-/issues/378#note_1740584179)).
As a result, the average duration and percentiles duration are higher than previously thought.
The Terraform configuration for the environments can be found in config-mgmt.
Future Iteration with Infrastructure Standards
We have a WIP initiative to iterate on our company-wide infrastructure standards. You can learn more about this on the infrastructure standards handbook page.
This page will be refactored incrementally as the standards are documented, implemented, and changes to environments take place.
Development
| Name | URL | Purpose | Deploy | Database | Terminal access |
| --- | --- | --- | --- | --- | --- |
| Development | various | Development | on save | Fixture | individual dev |
Development happens on a local machine. Therefore there is no way to provide any SLA. Access is to the individual dev. This could be either EE/CE depending on what the developer is working on.
GitLab.com customer requests in remit of the Infrastructure department:
GitLab.com customers, especially enterprises, may often have requests related to operational capabilities or non-functional requirements of GitLab.com (e.g. availability, security, and performance of the service). Requests related to functionality within the application itself should be directed to the appropriate stage team using the standard feature request template.
Examples of requests related to operational capabilities of GitLab.com include:
Infrastructure Technical Program Management Team drives the planning, execution, and delivery of complex infrastructure projects across Engineering and Product.
GitLab architects a defense-in-depth methodology that enforces the concept of “least functionality” through restricting network access to systems, applications and services and ensures sufficient security and privacy controls are executed to protect the confidentiality, integrity, availability and safety of the organization’s network infrastructure, as well as to provide situational awareness of activity on GitLab’s networks.
Scope
GitLab’s network architecture is available to both internal and external users and hosts our DNS with Cloudflare, including gitlab.com and gitlab.net.
If you’re a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
If you’re a GitLab team member looking for help with a security problem, please see the Engaging the Security On-Call section.
The Production Environment
The GitLab.com production environment is comprised of services that operate–or support the operation of–gitlab.com.
For a complete list of production services, see the service catalog.
This page exists to consolidate GitLab Rate Limiting documentation into a single source of truth. It is intended to reflect the current state of our rate limits, with the target audience being Operators (SRE and Support team members).
This page shows the output of our service maturity model for each
service in our metrics catalog. The model itself is part of the
metrics catalog, and uses information from the metrics catalog and the
service catalog to score each service.
To achieve a particular level in the maturity model, a service must meet
all the criteria for that level and all previous levels. Some criteria
do not apply to all services (for instance, services like PgBouncer do
not need development documentation).
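The level rule above (a service must meet every applicable criterion for a level and for all previous levels) can be sketched as below. This is an illustrative assumption about how such scoring could work, not the metrics catalog's actual implementation. The criterion names are hypothetical, and `None` is used here to mark a criterion that does not apply to a service.

```python
from typing import Dict, List, Optional

# One dict per level, in order. Each maps a criterion name to
# True (met), False (not met), or None (not applicable to this service).
Level = Dict[str, Optional[bool]]


def achieved_level(levels: List[Level]) -> int:
    """Return the highest 1-based level for which that level and every
    previous level pass; 0 if the first level is not met.

    Criteria marked None are skipped, so a level whose criteria are all
    not applicable counts as passed.
    """
    achieved = 0
    for index, criteria in enumerate(levels, start=1):
        applicable = [met for met in criteria.values() if met is not None]
        if not all(applicable):
            break  # a failed level blocks all later levels
        achieved = index
    return achieved


if __name__ == "__main__":
    # Hypothetical scoring for a service that needs no development docs.
    pgbouncer = [
        {"slo_monitoring": True},
        {"development_docs": None, "runbooks": True},
        {"dashboards": False},
    ]
    print(achieved_level(pgbouncer))
```

Stopping at the first failed level is what enforces the "all previous levels" requirement: passing level 3 criteria does not help a service that still misses a level 2 criterion.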
Test Platform Sub-Department enables successful development and deployment of high quality GitLab software applications by providing innovative build automated solutions, reliable tooling, refined test efficiency, and fostering an environment where Quality is Everyone's responsibility.
The Infrastructure Platforms section enables GitLab Engineering to build and deliver safe, scalable and efficient features for multi-tenant and single-tenant GitLab SaaS platforms (GitLab.com and GitLab Dedicated).
Vision
To deliver on the mission, we are in the process of formalising the building blocks we need to work on.