The Infrastructure Department is responsible for the availability, reliability, performance, and scalability of GitLab.com and other supporting services
Mission
The Infrastructure Department enables GitLab (the company) to deliver a single DevOps application, and GitLab SaaS users to focus on generating value for their own businesses by ensuring that we operate an enterprise-grade SaaS platform.
The Infrastructure Department does this by focusing on availability, reliability, performance, and scalability efforts.
These responsibilities have cost efficiency as an additional driving force, reinforced by the properly prioritized dogfooding efforts.
Many other teams also contribute to the success of the SaaS platform because GitLab.com is not a role.
However, it is the responsibility of the Infrastructure Department to drive the ongoing evolution of the SaaS platform, enabled by platform observability data.
Getting Assistance
If you’re a GitLab team member and are looking to alert the Infrastructure teams about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
The Infrastructure Department operates a fast, secure, and reliable SaaS platform to which (and with which) everyone can contribute.
Integral part of this vision is to:
Build a highly performant team of engineers, combining operational and software development experience to influence the best in reliable infrastructure.
Work publicly in accordance with our transparency value.
Other strategic initiatives to achieve this vision are driven by the needs of enterprise customers looking to adopt GitLab.com. The GitLab.com strategy catalogs top customer requests for the SaaS offering and outlines strategic initiatives across both Infrastructure and Stage Groups needed to address these gaps.
We are also Product Development
Unlike typical companies, part of the mandates of our Security, Infrastructure, and Support Departments is to contribute to the development of the GitLab Product. This follows from these concepts, many of which are also behaviors attached to our core values:
We should not expect new team members to join the company with these instincts, so we should be willing to teach them
It is part of managers’ responsibility to teach these values and behaviors
Organization structure
(click the boxes for more details)
Technical Roadmap
Infrastructure maintains a Technical Roadmap
for planning projects over the short (1y), medium (2y), and long term (3y).
This serves as our strategic compass,
helping us balance immediate needs with long-term sustainability.
The Technical Roadmap is based on the Product Roadmap,
where Product provides the “What” (customer needs) and “Why” (business strategy).
Engineers then determine the “How” (technical implementation),
while Engineering Managers plan the “When” (scheduling).
This comprehensive roadmap emphasizes building high-quality,
complete features in a sustainable manner.
The Technical Roadmap serves three key purposes:
It helps build engineering excellence by addressing critical areas that might not show up in product backlogs,
such as technical debt, performance improvements, platform improvements, and system scalability.
It enables the department to be proactive rather than reactive.
By regularly asking key questions like “Where do we see the biggest instability in our systems?” or
“What is generating the most toil?”, we can address issues before they become critical problems.
This helps maintain our SLOs and keeps our customers happy.
It aligns engineering efforts with business goals, ensuring technical improvements drive GitLab’s success.
Each technical roadmap item is prioritized based on business value and strategic alignment.
Current State
The Infrastructure Roadmap is maintained as a static site.
GitLab team-members can review the current technical roadmap,
at infra-roadmap.gitlab.com.
NOTE:
The Infrastructure Roadmap is not publicly available as some of the projects and
initiatives may not be considered unSAFE.
The site presents the roadmap in a visual manner, showing:
Dependencies between planned initiatives
Filtering options by confidence, stage, or tags
Individual roadmaps for each stage within the department
Impact analysis through dependency visualization
Updating the Roadmap
Changes to the Roadmap are made through merge requests to the infra-roadmap project.
The data is stored in YAML format, and changes can be made by editing the YAML.
This allows for version control and collaborative discussion through the merge request process.
Full instructions for making changes to the Infrastructure Roadmap are available
in the project’s README.md.
Everyone is encouraged to contribute to the roadmap,
whether proposing new initiatives or making smaller changes
like updating descriptions or adding links to relevant issues.
Design
The Infrastructure Library contains documents that outline our thinking about the problems we are solving and represents the current state for any topic, playing a significant role in how we produce technical solutions to meet the challenges we face.
Dogfooding
The Infrastructure department uses GitLab and GitLab features extensively as the main tool for operating many environments, including GitLab.com.
When we consider building tools to help us operate GitLab.com, we follow the 5x rule to determine whether to build the tool as a feature in GitLab or outside of GitLab. To track Infrastructure’s contributions back into the GitLab product, we tag those issues with the appropriate Dogfooding label.
Handbook use at the Infrastructure department
At GitLab, we have a handbook first policy. It is how we communicate process changes, and how we build up a single source of truth for work that is being delivered every day.
The handbook usage page guide lists a number of general tips. Highlighting the ones that can be encountered most frequently in the Infrastructure department:
The wider community can benefit from training materials, architectural diagrams, technical documentation, and how-to documentation. A good place for this detailed information is in the related project documentation. A handbook page can contain a high level overview, and link to more in-depth information placed in the project documentation.
Think about the audience consuming the material in the handbook. A detailed run through of a GitLab.com operational runbook in the handbook might provide information that is not applicable to self-managed users, potentially causing confusion. Additionally, the handbook is not a go-to place for operational information, and grouping operational information together in a single place while explaining the general context with links as a reference will increase visibility.
Ensure that the handbook pages are easy to consume. Checklists, onboarding, repeatable tasks should be either automated or created in a form of template that can be linked from the handbook.
The handbook is the process. The handbook describes our principles, and our epics and issues are our principles put into practice.
The infrastructure issue tracker is the backlog and a catch-all project for the infrastructure teams and tracks the work our teams are doing–unrelated to an ongoing change or incident.
The Infrastructure department hires for a number of different technical specialisms and positions across its teams. This Infrastructure Interviewing Guide offers more detail on some of our regular openings, interview process and other useful information related to applying to jobs with us. More information on our current openings can be found on the careers page.
During an incident, playbooks are vital to the engineer on call (EOC) in resolving an alert. Having all of the salient information laid out in one place saves the EOC time in diagnosing and resolving the incident. It empowers the EOC with a set of standard steps for responding to incidents. Additionally, it can greatly reduce the stress of dealing with an alert when working on an unfamiliar service.
In order to scale GitLab infrastructure at the right time and to prevent incidents, we employ a capacity planning process for example for GitLab.com and GitLab Dedicated.
In parts, this process is predictive and gets input from a forecasting tool to predict future needs.
This aims to provide an earlier and less obstrusive warning to infrastructure teams before components reach their individual saturation levels.
The forecasting tool generates capacity warnings which are converted to issues and these issues are raised in various status meetings.
Maintaining current role descriptions which establish expectations for hiring and ongoing performance expectations is an important supporting function for effective Career Development planning.
The rest of the tools are for active engagement by the Team Member along with their Manager. The origin activity for this is the Big Picture Career Conversation, followed up with quarterly checkpoints and frequent 1:1s. Finally, 360 Feedback and Talent Assessment provide annual opportunities for additional insight on progress.
This is the handbook page for the Cells project. Cells is one of the top priorities for FY2025, with the goal of providing additional scalability for GitLab.com. This handbook page contains the project information such as the project plan, roadmap, workstreams, DRIs, stakeholders, and communication channels. It also has links to important documentation such as the Cells design blueprints.
Change Management has traditionally referred to the processes, procedures, tools and techniques applied in IT environments to carefully manage changes in an operational environment: change tickets and plans, approvals, change review meetings, scheduling, and other red tape.
In our context, Change Management refers to the guidelines we apply to manage changes in the operational environment with the aim of doing so (in order of highest to lowest priority) safely, effectively and efficiently. In some cases, this will require the use of elements from traditional change management; in most cases, we aim to build automation that removes those traditional aspects of change management to increase our speed in a safe manner.
Offer enterprise-grade operational experience of GitLab products from streamlined deployment and maintenance, disaster recovery, secure search and discoverability, to high availability, scalability, and performance.
Mission
Core Platform focuses on improving our capabilities and metrics in the following areas:
The group of Database Reliability Engineers (DBREs) are on the Reliability
Engineering teams that run GitLab.com. We care most about database
reliability aspects of the infrastructure and GitLab as a product.
We strive to approach database reliability from a data driven
perspective as much as we can. As such, we start by defining Service
Level Objectives below and document what service levels we currently aim
to maintain for GitLab.com.
The Infrastructure Department, responsible for managing GitLab SaaS environment, has a number of processes that have an implicit emergency process component as a part of a regular workflow. This page serves as a high level overview of the most important components of those processes, with links to pages describing said processes in more depth.
Workflow
An integral part of any irregular situation occurring on GitLab SaaS is the incident management process.
This process is used for platform degradation and outage events, but it is also the process for emergency changes such as addressing critical vulnerabilities.
The Engineering Productivity team maximizes the value and throughput of Product Development teams and wider community contributors by improving the developer experience, streamlining the product development processes, and keeping projects secure, compliant, and easy to work on for everyone.
If you’re a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
If you’re a GitLab team member looking for who is currently the Engineer On Call (EOC), please see the Who is the Current EOC? section.
If you’re a GitLab team member looking for the status of a recent incident, please see the incident board. For detailed information about incident status changes, please see the Incident Workflow section.
Incident Management
Incidents are anomalous conditions that result in—or may lead
to—service degradation or outages. These events require human
intervention to avert disruptions or restore service to operational status.
Incidents are always given immediate attention.
The primary goals of writing an Incident Review are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.1
Introduction
An Incident Review is a crucial opportunity for fostering deeper understanding within a blameless culture. Its purpose extends beyond collecting action items to prevent recurrence; it is a process for learning about both the systems and the engineering culture that contribute to incidents. By discussing and analyzing how these components operate and interact, we gain valuable insights into the technical environments we support and the broader organizational context in which they function.
Cause for broken `master` for April 2024 are flaky tests (45%), infrastructure/runner issues (42%), job timing out (17%), various infrastructure issues (11%), failed to pull job image (9%), runner disk full (5%), merge train missing (3%), test gap (3%), dependency upgrade (3%), broken ci config (2%), GitLab.com overloaded (2%)
We automated the test quarantine process to remove very disruptive flaky tests from the pipelines and report them to their team's weekly triage report
More communication has been added to merge requests and Slack channels to seek earlier actions on failed pipelines
The previous chart we were showing made some assumptions on the dependency of CI jobs, and those assumptions do not hold anymore,
causing our chart to sometimes not take child pipelines into account when computing pipeline duration.
A fix was made on 2024-04-29 to ensure that we're using the pipeline duration from GitLab database directly instead of calculating from queries ([see investigation](https://gitlab.com/gitlab-org/quality/engineering-productivity/team/-/issues/378#note_1740584179)).
As a result, the average duration and percentiles duration are higher than previously thought.
The Terraform configuration for the environments can be found in config-mgmt.
Future Iteration with Infrastructure Standards
We have a WIP initiative to iterate on our company-wide infrastructure standards. You can learn more about this on the infrastructure standards handbook page.
This page will be refactored incrementally as the standards are documented, implemented, and changes to environments take place.
Development
Name
URL
Purpose
Deploy
Database
Terminal access
Development
various
Development
on save
Fixture
individual dev
Development happens on a local machine. Therefore there is no way to provide any SLA. Access is to the individual dev. This could be either EE/CE depending on what the developer is working on.
GitLab.com customer requests in remit of the Infrastructure department:
GitLab.com customers, especially enterprises, may often have requests related to operational capabilities or non-functional requirements of GitLab.com (e.g. availability, security, and performance of the service). Requests related to functionality within the application itself should be directed to the appropriate stage team using the standard feature request template.
Examples of requests related to operational capabilities of GitLab.com include:
GitLab architects a defense-in-depth methodology that enforces the concept of “least functionality” through restricting network access to systems, applications and services and ensures sufficient security and privacy controls are executed to protect the confidentiality, integrity, availability and safety of the organization’s network infrastructure, as well as to provide situational awareness of activity on GitLab’s networks.
Scope
GitLab’s network architecture is available to both internal and external users and hosts our DNS with Cloudflare incluing gitlab.com and gitlab.net.
If you’re a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
If you’re a GitLab team member looking for help with a security problem, please see the Engaging the Security On-Call section.
The Production Environment
The GitLab.com production environment is comprised of services that operate–or support the operation of–gitlab.com.
For a complete list of production services see the service catalog
This page exists to consolidate GitLab Rate Limiting documentation into a single source of truth. It is intended to reflect the current state of our rate limits, with the target audience being Operators (SRE and Support team members).
This page shows the output of our service maturity model for each
service in our metrics catalog. The model itself is part of the
metrics catalog, and uses information from the metrics catalog and the
service catalog to score each service.
To achieve a particular level in the maturity model, a service must meet
all the criteria for that level and all previous levels. Some criteria
do not apply to all services (for instance, services like PgBouncer do
not need development documentation).
The Infrastructure Platforms section enables GitLab Engineering to build and deliver safe, scalable and efficient features for multi-tenant and single-tenant GitLab SaaS platforms (GitLab.com and GitLab Dedicated).
Vision
To deliver on the mission, we are in the process of formalising the building blocks we need to work on.
When you visit any website, it may store or retrieve information on your browser, mostly in the form of cookies. This information might be about you, your preferences or your device and is mostly used to make the site work as you expect it to. The information does not usually directly identify you, but it can give you a more personalized web experience. Because we respect your right to privacy, you can choose not to allow some types of cookies. Click on the different category headings to find out more and change our default settings. However, blocking some types of cookies may impact your experience of the site and the services we are able to offer.
Cookie Policy
User ID: da726d7f-b04e-4b1d-af52-f1b39e569a46
This User ID will be used as a unique identifier while storing and accessing your preferences for future.
Timestamp: --
Strictly Necessary Cookies
Always Active
These cookies are necessary for the website to function and cannot be switched off in our systems. They are usually only set in response to actions made by you which amount to a request for services, such as setting your privacy preferences, enabling you to securely log into the site, filling in forms, or using the customer checkout. GitLab processes any personal data collected through these cookies on the basis of our legitimate interest.
Functionality Cookies
These cookies enable helpful but non-essential website functions that improve your website experience. By recognizing you when you return to our website, they may, for example, allow us to personalize our content for you or remember your preferences. If you do not allow these cookies then some or all of these services may not function properly. GitLab processes any personal data collected through these cookies on the basis of your consent
Performance and Analytics Cookies
These cookies allow us and our third-party service providers to recognize and count the number of visitors on our websites and to see how visitors move around our websites when they are using it. This helps us improve our products and ensures that users can easily find what they need on our websites. These cookies usually generate aggregate statistics that are not associated with an individual. To the extent any personal data is collected through these cookies, GitLab processes that data on the basis of your consent.
Targeting and Advertising Cookies
These cookies enable different advertising related functions. They may allow us to record information about your visit to our websites, such as pages visited, links followed, and videos viewed so we can make our websites and the advertising displayed on it more relevant to your interests. They may be set through our website by our advertising partners. They may be used by those companies to build a profile of your interests and show you relevant advertisements on other websites. GitLab processes any personal data collected through these cookies on the basis of your consent.