FY26 - Hosted Runners
Background
GitLab-hosted runners (a.k.a. hosted runners for GitLab.com) are managed, deployed, and maintained through cross-team collaboration. SRE supports, and sometimes leads, deployment solutions to ensure the scalability and reliability of the service in Production.
In FY25, SRE supported Runners in the following efforts, listed chronologically:
- Assisted Dedicated Runners with the Beta release in AWS
- Released new runner type offerings: Medium and Large ARM64-based runners
- Explored a PoC for deploying Hosted Runners using the Dedicated Runners architecture
- Found a solution to the VPC peering limits that prevented .com hosted runners from scaling
- Implemented the VPC re-design across all .com runner shards
- Consolidated .com runner environment deviations in Terraform
In FY26, the focus will be on enabling the Runners team to better self-serve runner operational work.
North Star
Achieving fully automated, hands-off scalability and maintainability for GitLab-hosted runners, ensuring seamless, continuous operations with minimal manual intervention.
Current Landscape
The following sections highlight the shortcomings of the current landscape. While focusing on areas needing the most growth may seem pessimistic, our setup and processes are in a better position today compared to a year ago. However, we’re still far from ideal and have room for improvement.
Knowledge Silos
Access to information about maintaining and scaling Hosted runners is limited. Expertise is concentrated in a few engineers, and the process is not well documented.
Scalability Fatigue
The scalability process, both for offering new runner types and for scaling existing ones, is hands-on, demanding, prone to human error, repetitive, and boring.
Cost Efficiency
The cost efficiency of some shards is far from ideal. While insights into this data are available to the FinOps team, they’re not empowered to take action or make decisions based on it. Additionally, we’re not investing in exploring alternative deployment methods or in understanding current CPU usage to minimize the cost of hosting .com runners.
Goals for FY26
Effort Level Key:
- LF: Low Effort
- MF: Medium Effort
- TF: Tremendous Effort
Knowledge Transparency
Objectives:
- Improve the quality of deployment and scalability docs (MF)
- Create discoverable, easy-to-reach information (LF)
- Convert docs to CR template for better cross-team transparency (LF)
- Coordinate with Runners team to transfer Runbooks ownership (MF)
Related Links:
Scalability Fatigue
Objectives: Automate the process of scaling existing shards:
- Author an epic/issue in the runner’s tracker to (LF):
  - Improve existing tools (deployer and GRIT)
  - Consider embedding the process in a pipeline
- Research alternative deployment solutions (e.g., revisit runner-managers in k8s)
Related Links:
Cost Efficiency
Objectives:
- Work with FinOps to determine objectives (LF)
- Achieve reasonable cost efficiency for all existing shards (MF)
- Explore more cost-efficient hosting methods (MF)
- Optimize resource usage (TF)
Related Links:
- Investigate FinOps Cloud Efficiency vs Grafana Idle Efficiency
- Reduce autoscaling parameters for larger Linux runners due to high compute inefficiency