FY26 Production Engineering Ops Team Roadmaps

Overview

This section contains the FY26 roadmaps for the Production Engineering Ops team’s key focus areas:


FY26 - Disaster Recovery

References

RTO High-Level Overview

Background

Disaster Recovery covers the tools and processes needed to restore GitLab.com to a working state with all customer data in a disaster event. It is a controlled activity that is reported on for compliance purposes. We must have confidence in our ability to recover from a disaster event, and we must be able to communicate that confidence to our customers through consistent, measurable results.

FY26 - Hosted Runners

Background

GitLab-hosted runners (a.k.a. hosted runners for GitLab.com) are managed, deployed, and maintained through cross-team collaboration. SRE supports and sometimes leads deployment solutions to ensure the scalability and reliability of the service in Production.

In FY25, SRE supported Runners in the following efforts, listed chronologically:

  1. Assisted Dedicated Runners with the Beta release in AWS
  2. Released new runner type offerings: Medium and Large ARM64-based runners
  3. Explored a PoC for deploying Hosted Runners using the Dedicated Runners architecture
  4. Found a solution to the VPC peering limits that prevented .com hosted runners from scaling
  5. Implemented the VPC re-design across all .com runner shards
  6. Consolidated .com runner environment deviations in Terraform

In FY26, the focus will be on enabling the Runners team to self-serve more of the operational work for runners.

FY26 - Incident Management

Goals and plans for incident management in FY26

FY26 - Patching & OS Modernization

Background

In FY25, the Ops team took ownership of the patching processes that are applied to the supporting infrastructure of GitLab.com. The areas we are currently focusing on are:

  1. Establishing runbooks for VM-based patching operations
  2. Automating notifications to service owners for required patches and reboots
  3. Creating an automation framework for patching our VM fleet

North star

All deployed software represents a potential security risk. We aim to ensure that all vulnerabilities impacting the supporting infrastructure of our SaaS products can be detected and resolved automatically.

Last modified February 24, 2025: Adding FY26 roadmap docs to handbook (15bc3b55)