Runner Managers on Kubernetes
| Status | Authors | Coach | DRIs | Owning Stage | Created |
|---|---|---|---|---|---|
| proposed | igorwwwwwwwwwwwwwwwwwwww | josephburnett | kkyrala | team Runners Platform | 2025-12-01 |
Summary
We propose migrating GitLab.com hosted runner managers from Chef-managed VMs to Kubernetes-managed deployments. Our current infrastructure requires 2 engineer-days per month for manual deployments, batching 30-50 runner changes into a single risky release per month. This deployment bottleneck limits the pace of innovation for 40+ engineers across 5 teams working on runner infrastructure and products that depend on it.
This migration delivers immediate value by enabling daily deployments. Configuration changes that currently take weeks can be deployed same-day. Version updates become automated via Renovate. The 2 engineer-days per month currently spent on deployment toil can be redirected to higher-value work.
Additionally, consolidating configuration from the current Terraform/Chef split into a single ArgoCD-managed system makes the infrastructure easier to reason about and maintain. Scaling beyond the 12 shards we manage today is operationally infeasible, forcing us to reject custom shard requests from Duo Workflow, Runner Core, and the Secure stage. This migration removes the deployment bottleneck.
The migration is low-risk: it is a lift-and-shift that keeps the same executor architecture, runs both systems in parallel during the transition, and can be rolled back at any time.
Motivation
Our current runner manager infrastructure is Chef-managed VMs in a blue/green deployment model. This consumes roughly 2 engineer-days per month in deployment toil, but the strategic cost is far greater: it limits the pace of innovation for 40+ engineers across 5 teams (Runner Core, CI Functions Platform, Runners Platform, and the Duo Agent Platform product teams).
Deployment velocity mismatch: GitLab Runner sees 30-50 merges per month (1-2 per day average), but our monthly deployment cadence batches all these changes into a single risky deployment per month. This prevents rapid iteration on features and improvements.
Following GitLab’s own principles: GitLab development depends heavily on CI infrastructure—not just for stability, but for continuous innovation in features, performance, and developer experience. GitLab champions continuous deployment: GitLab.com deploys 12 times per day with far higher development velocity. Our runner infrastructure should follow the same principle: frequent, low-risk deployments enable rapid iteration while maintaining stability.
Pain points:
- Manual deployment process: change management issues, draining runner managers via ChatOps, blue/green coordination
- Split configuration: VMs provisioned with Terraform (config-mgmt), configured with Chef (chef-repo) — fragmented and hard to reason about
- Unmanaged infrastructure between deploys: We disable Chef on the non-active fleet to prevent runner managers from starting, leaving those hosts unmanaged (no SSH key updates, user access changes, system configuration updates) until the next deployment
- Rejected custom configurations: The operational complexity of maintaining 12 existing shards forces us to reject new shard requests from teams including Duo Workflow, Runner Core, and Secure stage, directly blocking strategic product initiatives
Goals
- Daily deployments with minimal manual intervention (down from ~2 engineer-days per deployment)
- Consolidated configuration in ArgoCD, eliminating the Terraform/Chef split
- Automated version updates with Renovate
- Foundation for future automation: shard creation, custom configurations, maybe autoscaling
Non-Goals
- Changing executor architecture: We keep Docker Machine for now; the separate Kata/Firecracker/gVisor evaluation is out of scope
- Modifying runner functionality: No changes to how runners execute jobs or interact with GitLab
- Immediate Chef deprecation: While this reduces our Chef footprint, it does not address Chef usage elsewhere in GitLab.com infrastructure
- Automating shard creation: GCP projects, quotas, and runner tokens still require manual provisioning
Proposal
A lift-and-shift migration of runner managers from Chef-managed VMs to Kubernetes. We keep Docker Machine executors unchanged — only the runner manager deployment changes.
Relation to Unified Runners Platform v5
This work is a subset of the broader Unified Runners Platform v5 initiative, specifically part of Block 0: Design and Foundation. This establishes the foundation for deploying runners going forward. We are extracting it into its own scope to ship value quickly rather than waiting for the full v5 rollout.
Why incremental? Decoupling runner manager deployment from executor changes lets us deliver immediate value while limiting risk. A combined migration would increase complexity and delay benefits.
Why this is low-risk:
- Ephemeral job VMs stay in the same networks; we only need to validate runner-manager-to-executor connectivity
- Run both runner manager architectures side-by-side during migration; control traffic using replica counts and concurrency settings
- Two-way-door decision: roll back to Chef VMs anytime if issues arise
What changes:
- Deploy runner managers using ArgoCD + GitLab Runner Helm chart (with Docker Machine support)
- Automated version updates with Renovate (merge MR → deploy)
- No more manual blue/green coordination
Deployment Method Selection
We use the GitLab Runner Helm chart for Kubernetes deployment. This is the industry-standard method for deploying applications on Kubernetes and the most widely adopted approach for runners on Kubernetes.
Why Helm chart:
- Dogfooding: Using what customers use for runners on Kubernetes. We also use Helm to deploy GitLab itself. See Dogfooding Strategy below.
- Industry standard: Helm is the de facto standard for Kubernetes application packaging.
- Proven maturity: Production-validated with extensive documentation and community support.
- Executor flexibility: Supports all executors including Docker Machine, allowing us to decouple deployment mechanism changes from executor migration.
Evaluation:
| Method | Kubernetes | Maturity | Adoption | Dogfooding Value | Notes |
|---|---|---|---|---|---|
| Helm Chart | Native | High | High | High - benefits most k8s users | Selected approach |
| Operator | Native | Low | Low (OpenShift-focused) | Low - limited user base | Less mature. Future consideration if custom autoscaling needed. |
| GRIT | Via Operator | Medium (VMs), Low (k8s) | Low | Low - limited user base | VM-first, not Kubernetes-native. No Docker Machine support. |
| Omnibus | No | High | High (VMs) | N/A - not k8s | Current approach. Manual deployment orchestration, no native rolling deploys. |
See the alternatives section below for detailed analysis.
Note on adoption metrics: Quantitative adoption data does not exist for any deployment method. Assessments are based on qualitative indicators such as documentation positioning and community engagement.
Future flexibility: Using Helm today enables migration to the GitLab Runner Operator in the future if requirements emerge. If the Operator provides a Helm chart, we have a clear migration path.
Future executor migration: Docker Machine is deprecated and scheduled for removal in GitLab 20.0 (May 2027). This migration simplifies a future move away from Docker Machine, whether to docker-autoscaler or a Kubernetes-based executor.
Dogfooding Strategy
The Runner Platform is Runner Core’s “Customer-0”. By using the GitLab Runner Helm chart for this migration, we dogfood what self-managed customers use for Kubernetes deployments. Many of our customers are up against the same operational challenges we face—deployment velocity, operational complexity, and scale. This approach ensures that any problems we encounter at GitLab.com scale will benefit all customers, not just internal tooling.
Operating runners at scale with the Helm chart validates it for advanced use cases and feeds improvements back to the broader GitLab community. This demonstrates consistency: we use Helm charts to deploy GitLab on GitLab.com, and GitLab Dedicated uses Helm charts to deploy GitLab. Using Helm for runners extends this pattern across our infrastructure.
This deployment approach also enables faster iteration on new runner features. With continuous deployment and significantly lower deployment friction, we can test experimental configurations daily and validate new runner capabilities at GitLab.com scale before general availability. Early adoption surfaces edge cases and performance characteristics that benefit all customers, while giving Runner Core immediate feedback on real-world behavior at scale.
Risks
- Network connectivity: Runner managers in Kubernetes must maintain proper network access to existing Docker Machine subnets
- Performance and scaling: Kubernetes deployments may exhibit different behavior under high load; validation required
- Configuration parity: Exact parity between the old and new systems during migration is critical. After migration, the old system's configuration will be left unmaintained and eventually removed
- Single cluster blast radius: Cluster-wide issues could affect multiple shards. Mitigated by multi-AZ cluster. If needed, we can deploy multiple clusters - ArgoCD supports multi-cluster management natively
- Kubernetes operational complexity: Pod-specific failure modes (out-of-memory kills, image pull failures, evictions) require Kubernetes-specific troubleshooting expertise
- Shallow health checks: Existing health checks primarily verify the process is running. The Helm chart additionally calls `gitlab-runner verify`, but this doesn't catch issues like invalid tokens or configuration problems that prevent job processing. For continuous deployment, deeper health checks are needed—ideally verifying that a runner has successfully processed a real job before considering it healthy (a rough probe sketch follows this list)
- Cilium health check bug: GKE clusters with Cilium have a bug where pods with flapping health checks lose network connectivity. The upstream fix is merged but not yet in GKE. Workaround: ensure stable health checks
- Pod ephemeral storage: If a pod dies, in-progress jobs are abandoned (not transferable). Orphaned VMs cleaned up by ci-project-cleaner
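To make the shallow health check risk concrete, one possible shape for a deeper readiness check is an exec probe that inspects the runner's own metrics endpoint and only reports ready once at least one job has been picked up. This is a sketch only: the metric name, the threshold, and the availability of a shell and wget in the runner image are assumptions that would need validation.

```yaml
# Sketch of a deeper readiness probe than `gitlab-runner verify`.
# Assumptions: the metrics endpoint is enabled on port 9252, the image ships a
# shell and wget, and gitlab_runner_jobs_total is an appropriate counter to watch.
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        # Ready only once the runner has picked up at least one job since start.
        wget -qO- http://127.0.0.1:9252/metrics \
          | grep '^gitlab_runner_jobs_total' \
          | awk '{ sum += $NF } END { exit (sum > 0 ? 0 : 1) }'
  initialDelaySeconds: 120
  periodSeconds: 60
```

A readiness gate like this would hold a rollout until a new pod has actually processed work, approximating "verify a real job ran" without external tooling; whether that is desirable for low-traffic shards needs evaluation.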
Design and implementation details
Architecture Overview
The following diagrams illustrate the current state, migration phase, and target state of the runner manager infrastructure.
Current State (Before)
```mermaid
graph TB
subgraph "Configuration Management"
TF[Terraform]
Chef[Chef]
ChatOps[ChatOps]
end
subgraph "gitlab-ci-155816 Project"
subgraph "Blue Fleet"
VM1B[Runner Manager VM]
end
subgraph "Green Fleet"
VM1G[Runner Manager VM]
end
end
subgraph "Shard Projects (e.g. gitlab-r-saas-l-p-amd64-1)"
DM1[Docker Machine VMs]
end
TF -->|Provisions| VM1B
TF -->|Provisions| VM1G
Chef -->|Configures| VM1B
Chef -->|Configures| VM1G
ChatOps -->|Deploys| VM1B
ChatOps -->|Deploys| VM1G
VM1B -->|Spawns| DM1
VM1G -->|Spawns| DM1
style TF fill:#ff9999
style Chef fill:#ff9999
style ChatOps fill:#ff9999
```
During Migration
Note: The old system continues to use ChatOps for deployment. The new system uses Terraform to provision the GKE cluster and ArgoCD for deployments.
```mermaid
graph TB
subgraph "Old System"
subgraph "Configuration Management"
TF[Terraform]
Chef[Chef]
end
subgraph "Chef-Managed VMs"
VM1[Runner Manager VM]
VM2[Runner Manager VM]
end
end
subgraph "New System"
ArgoCD[ArgoCD]
subgraph "GKE Cluster: runner-managers"
K8S1[Runner Manager Pod]
K8S2[Runner Manager Pod]
end
end
subgraph "Shard Projects (e.g. gitlab-r-saas-l-p-amd64-1)"
DM1[Docker Machine VMs]
DM2[Docker Machine VMs]
end
TF -->|Provisions| VM1
TF -->|Provisions| VM2
Chef -->|Configures| VM1
Chef -->|Configures| VM2
ArgoCD -->|Deploys| K8S1
ArgoCD -->|Deploys| K8S2
VM1 -->|Spawns| DM1
VM2 -->|Spawns| DM1
K8S1 -->|Spawns| DM2
K8S2 -->|Spawns| DM2
style TF fill:#ff9999
style Chef fill:#ff9999
style ArgoCD fill:#99ff99
```
Target State (After)
```mermaid
graph TB
subgraph "Configuration Management"
TF[Terraform]
Renovate[Renovate]
ArgoCD[ArgoCD]
end
subgraph "gitlab-ci-155816 Project"
subgraph GKE["GKE Cluster: runner-managers"]
K8S1[Pod: env:gprd, shard:private]
K8S2[Pod: env:gprd, shard:small]
K8S3[Pod: env:gstg, shard:private]
end
end
subgraph "Shard Projects"
subgraph "gitlab-r-saas-l-p-amd64-1"
DM1[Docker Machine VMs]
end
subgraph "gitlab-r-saas-l-s-amd64-1"
DM2[Docker Machine VMs]
end
end
TF -.->|Provisions| GKE
Renovate -->|Creates MRs| ArgoCD
ArgoCD -->|Deploys| K8S1
ArgoCD -->|Deploys| K8S2
ArgoCD -->|Deploys| K8S3
K8S1 -->|Spawns| DM1
K8S2 -->|Spawns| DM2
K8S3 -->|Spawns| DM1
style ArgoCD fill:#99ff99
style Renovate fill:#99ff99
```
Infrastructure
GKE Cluster:
- Name: `runner-managers`
- GCP Project: `gitlab-ci-155816` (existing project)
- Network access to existing Docker Machine subnets
Cost Impact:
GKE adds a flat ~$80/month cluster management fee. While Docker Machine job VMs represent the bulk of our CI/CD infrastructure spending, the current runner manager fleet itself represents significant cost: 58 c2-standard-30 machines at ~$50k/month.
Recent analysis shows these machines are heavily underutilized: CPU never exceeds 30% and memory typically uses ≤10%. This suggests the runner managers were over-provisioned when introduced 5 years ago, and workload distribution has since evolved. Kubernetes right-sizing could deliver substantial savings beyond the operational benefits.
Deployment Structure:
- One runner manager deployment per `(environment, shard, project)` tuple
- Shards may contain multiple projects for horizontal scaling
- Docker Machine executors continue to provision VMs in shard-specific GCP projects:
  - `gitlab-r-saas-l-p-amd64-1` for private runners
  - `gitlab-r-saas-l-s-amd64-1` for the amd64-small shard
- Network configuration remains in the shared `gitlab-ci-155816` project
Configuration Management
- Single ArgoCD application in argocd/apps, with separate config per `(environment, shard)` (a minimal Application sketch follows this list)
- Uses the official GitLab Runner Helm chart with Docker Machine support
- The Helm chart calls `gitlab-runner register`, which generates config from a template rather than using a stable config file we manage directly. This requires careful validation during migration
- Renovate creates MRs for version updates; human review and merge triggers a deploy (fully automated deploys may come later as we gain confidence)
- Configure Renovate to track pre-release images — nightly releases are currently pushed to packagecloud (migrating to Pulp), and nightly image tags per commit are available in the container registry
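As a rough illustration of the target structure, a per-`(environment, shard)` ArgoCD Application could look like the following. The application name, project, namespaces, and pinned chart version are assumptions for the sketch, not decided values.

```yaml
# Illustrative only: names, namespaces, and the chart version are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: runner-manager-gprd-private
  namespace: argocd
spec:
  project: runners
  source:
    repoURL: https://charts.gitlab.io        # official GitLab Runner Helm chart repository
    chart: gitlab-runner
    targetRevision: 0.71.0                   # pinned; Renovate proposes version bumps via MR
    helm:
      values: |
        concurrent: 10                       # per-shard values, see the sketch in the next section
  destination:
    server: https://kubernetes.default.svc
    namespace: runner-manager-gprd-private
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```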
Per-shard configuration tuning:
- `request_concurrency`: Controls job request parallelism (currently varies from 10 to 15 across shards); a values sketch follows this list
- Kubernetes `terminationGracePeriodSeconds`: Must exceed the max job timeout for the shard (see Rolling Deploys section)
- Deployment rollout parameters: `maxUnavailable` / `maxSurge` (see Rolling Deploys section)
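A minimal sketch of per-shard Helm values, using the private (gitlab-org, small) shard as an example. The numbers come from the shard table below, but the exact keys rendered into `config.toml` by the chart's register template still need validation during migration.

```yaml
# Hypothetical values.yaml for the private (gitlab-org, small) shard.
concurrent: 1980                      # max simultaneous jobs for this runner manager
terminationGracePeriodSeconds: 14700  # 4h max job timeout + 5m cleanup overhead

runners:
  config: |
    [[runners]]
      request_concurrency = 15
      executor = "docker+machine"
      [runners.docker]
        privileged = true
      [runners.machine]
        IdleCount = 10
        MaxBuilds = 40
        MachineDriver = "google"
        # machine options (GCP project, zone, subnet, disk size) omitted for brevity
```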
Rolling Deploys and Graceful Shutdown
Unlike the legacy rolling deployment approach, Kubernetes handles rolling deploys natively using maxSurge/maxUnavailable settings.
Graceful shutdown timing:
The current VM-based deployment uses a 2-hour systemd timeout (TimeoutStopSec=2h, configured in cookbook-wrapper-gitlab-runner) for graceful shutdown.
Recommendation: Set `terminationGracePeriodSeconds` to the shard's max job timeout plus 5 minutes for cleanup overhead. For the private (gitlab-org, small) shard, this means 14700s (4h 5m); for most other shards, 11100s (3h 5m).
Idle VM cleanup adds shutdown time: Runner managers sequentially remove idle VMs on stop. High-capacity shards (e.g., 600 idle VMs on small-amd64) could add 20-30 minutes to shutdown. Consider setting `remove_machines_on_stop: false` and relying on ci-project-cleaner for orphan cleanup.
Configuration:
- Shutdown signal: Must be `SIGQUIT` for graceful runner shutdown (already set in Helm chart defaults)
- Termination grace period: Set per shard based on the maximum job timeout (see timing above)
- `maxSurge`: Set to 100% to mirror current blue/green behavior — new pods come up before old pods drain. This avoids capacity reduction during deploys (see the strategy sketch after this list)
- `maxUnavailable`: Set to 0 during rollout to maintain capacity
- Monitoring: The GitLab Runner Helm chart sets `publishNotReadyAddresses: true` on the metrics Service, ensuring Prometheus continues scraping metrics from terminating pods during graceful shutdown
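For reference, the rollout behavior above corresponds to the following fragment of the rendered Deployment. Whether these fields are set through chart values or through an ArgoCD/Kustomize patch on the rendered manifest is an implementation detail to confirm; the numbers are examples.

```yaml
# Target rollout behaviour for a runner manager Deployment (sketch).
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: "100%"      # bring up the full new generation first (mirrors blue/green)
      maxUnavailable: 0     # never reduce capacity during a rollout
  template:
    spec:
      terminationGracePeriodSeconds: 11100   # 3h max job timeout + 5m; 14700 for the 4h private shard
```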
ArgoCD and long-running deploys: ArgoCD does not block new syncs while pods are terminating — rapid successive deploys may cause resource pressure (multiple pod generations running). Renovate’s MR-based workflow provides a natural gate.
Secrets Management
- Runner tokens already in Vault; provision fresh tokens for new runners
- One token per `(environment, shard)` instead of one per host (see the sketch below for how a token could be surfaced to pods)
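One way a Vault-held token could reach the pod is via an ExternalSecret, assuming External Secrets Operator or a similar Vault bridge is available in the cluster. The tooling choice, Vault path, and key names below are assumptions, not decisions.

```yaml
# Sketch: sync one (environment, shard) runner token out of Vault into a
# Kubernetes Secret that the Helm chart can reference. Paths and names are illustrative.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: runner-token-gprd-private
  namespace: runner-manager-gprd-private
spec:
  secretStoreRef:
    name: vault
    kind: ClusterSecretStore
  target:
    name: runner-token-gprd-private      # referenced from the chart's secret setting
  data:
    - secretKey: runner-token            # key name expected by the chart (assumption)
      remoteRef:
        key: ci/runners/gprd/private     # hypothetical Vault path
        property: token
```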
GCP Authentication
Runner managers need GCP credentials to provision Docker Machine VMs. Currently, Chef-managed VMs use a shared, long-lived service account key file on disk (`/etc/gitlab-runner/service-account.json`, using `runners-cache@gitlab-ci-155816.iam.gserviceaccount.com`).
For Kubernetes, we use Workload Identity to map Kubernetes ServiceAccounts to GCP service accounts. This eliminates long-lived keys and allows pods to authenticate to GCP APIs using short-lived, automatically rotated credentials.
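Concretely, Workload Identity ties the pod's Kubernetes ServiceAccount to a GCP service account through an annotation. The GCP service account name below is illustrative, and the corresponding `roles/iam.workloadIdentityUser` binding would be managed separately (for example in Terraform).

```yaml
# Sketch: Kubernetes ServiceAccount mapped to a GCP service account via Workload Identity.
# The GSA name is an assumption; the IAM policy binding is created outside this manifest.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gitlab-runner
  namespace: runner-manager-gprd-private
  annotations:
    iam.gke.io/gcp-service-account: runner-manager@gitlab-ci-155816.iam.gserviceaccount.com
```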
Production Change Lock (PCL)
Integrate with change-lock to prevent deployments during freeze periods. See PCL guide.
GKE Node Upgrades
GKE node upgrades evict pods, which is problematic with termination grace periods of 4+ hours. Use blue/green node pool upgrades with soak time >= termination grace period. Configure PodDisruptionBudgets to control eviction rate. Blue/green upgrades can run beyond the maintenance window, be cancelled/resumed/rolled back, and respect PDBs and termination grace periods.
Unlike our current blue/green VM deployments, GKE blue/green node pool upgrades are fully automated and transparent to the application — no manual coordination required.
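A PodDisruptionBudget per shard is a small amount of YAML. The label selector below is illustrative and must match whatever labels the Helm chart actually applies.

```yaml
# Sketch: allow at most one runner manager pod per shard to be evicted at a time
# during voluntary disruptions such as node pool upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: runner-manager-gprd-private
  namespace: runner-manager-gprd-private
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: gitlab-runner        # assumed chart label; verify against the rendered manifests
```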
Monitoring and Observability
- Configure Prometheus/Mimir for the new cluster
- Enable metrics and debugging endpoints by setting `metrics.enabled: true` in the Helm chart (exposes `/metrics`, `/debug/pprof/`, and `/debug/jobs/list` on port 9252); a scrape sketch follows this list
- Enable JSON logging and ship logs to Elasticsearch
- Update CI dashboards in runbooks to support both fleets during migration
- Consider adding process-exporter to monitor docker-machine processes (not currently in place for VM-based runner managers, not a blocker)
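If the cluster uses Prometheus Operator CRDs (as a Mimir agent setup typically does), scraping the metrics port could look like the sketch below. The selector labels and port name are assumptions to verify against the chart's rendered Pod labels.

```yaml
# Sketch: scrape runner manager metrics (port 9252) via a PodMonitor.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: gitlab-runner
  namespace: runner-manager-gprd-private
spec:
  selector:
    matchLabels:
      app: gitlab-runner          # assumed chart label
  podMetricsEndpoints:
    - port: metrics               # assumed port name for the 9252 metrics endpoint
      interval: 30s
```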
Migration Execution
Migration order: Start with a low-risk shard (e.g., tamland) for initial validation, then gradually migrate larger shards.
- Create GKE cluster, deploy first runner manager with minimal capacity, validate connectivity, monitoring, and deploys
- Gradually deploy additional shards, increase capacity, validate parity with Chef-managed fleet
- Decommission Chef-managed VMs; clean up config-mgmt and chef-repo
Rollback Procedures
During migration: Reduce Kubernetes replica count to 0, increase Chef-managed VM capacity. Runner managers are stateless.
After full migration:
- Version/config rollback: Revert the MR in Git; ArgoCD syncs the previous state automatically
- Emergency (return to Chef VMs): Scale down Kubernetes deployments to 0, re-enable Chef-managed VMs. This path remains viable as long as Chef configuration is maintained
Process Updates
Internal processes that need to be updated or replaced:
- Update or replace ChatOps runner commands
- Update CI runbooks
- Update deployment process (see also runner-rollout-gen)
- Update gameday template
Note: ci-project-cleaner for stale VM cleanup remains unchanged.
Open Questions
- Runner manager pod sizing: Current VMs vary significantly — private and shared-gitlab-org shards use `n2d-standard-4`, while the bulk of the fleet uses `c2-standard-30`. Metrics suggest significant right-sizing opportunity. Need to determine appropriate resource requests/limits per shard.
- Validate Kubernetes deployment compatibility with the docker-machine executor.
- GKE Autopilot compatibility: GKE Autopilot limits `terminationGracePeriodSeconds` to 600 seconds (10 minutes), which is incompatible with our 3-4 hour job timeouts. Extended-duration pods can run up to 7 days with an annotation, but it's unclear whether this extends the grace period limit. Investigate whether Autopilot is viable.
- Scope: macOS and Windows shards use different executors (instance/fleeting and custom/autoscaler). macOS runs on AWS bare metal via fleeting and requires VPN connectivity to the GKE cluster, adding complexity around network wiring and potential IP conflicts. Windows runner managers currently run on Windows VMs but should be deployable on Linux with some code changes, allowing them to use the same Kubernetes deployment approach. Determine whether they are in scope for this migration or handled separately.
Current Shard Configuration
| Shard | Executor | Instance Type | Concurrent | Limit | IdleCount | Runner Count | Privileged | MaxBuilds | DiskSize | Job Timeout |
|---|---|---|---|---|---|---|---|---|---|---|
| saas-linux-small-amd64 | docker+machine | c2-standard-30 | 1200 | 1300 | 600 | 12 | true | 1 | 30 GB | 3h |
| saas-linux-medium-amd64 | docker+machine | c2-standard-30 | 1200 | 1300 | 200 | 10 | true | 1 | 50 GB | 3h |
| saas-linux-large-amd64 | docker+machine | c2-standard-30 | 1200 | 1300 | 125 | 10 | true | 1 | 100 GB | 3h |
| saas-linux-xlarge-amd64 | docker+machine | c2-standard-30 | 375 | 1200 | 5 | 10 | true | 1 | 200 GB | 3h |
| saas-linux-2xlarge-amd64 | docker+machine | c2-standard-30 | 187 | 1200 | 2 | 10 | true | 1 | 200 GB | 3h |
| saas-linux-small-arm64 | docker+machine | c2-standard-30 | 220 | 220 | 40 | 6 | true | 1 | 30 GB | 3h |
| saas-linux-medium-arm64 | docker+machine | c2-standard-30 | 375 | 1200 | 15 | 6 | true | 1 | 50 GB | 3h |
| saas-linux-large-arm64 | docker+machine | c2-standard-30 | 375 | 1200 | 15 | 6 | true | 1 | 100 GB | 3h |
| saas-linux-medium-amd64-gpu-standard | docker+machine | c2-standard-30 | 1200 | 1300 | 25 | 6 | true | 1 | 50 GB | 3h |
| private (gitlab-org, small) | docker+machine | n2d-standard-4 | 1980 | 1125 | 10 | 16 | true | 40 | 100 GB | 4h |
| private (gitlab-org, medium) | docker+machine | n2d-standard-4 | 1980 | 625 | 10 | 16 | true | 40 | 100 GB | 3h |
| private (gitlab-org, large) | docker+machine | n2d-standard-4 | 1980 | 100 | 10 | 16 | true | 40 | 100 GB | 3h |
| private (gitlab-com) | docker+machine | n2d-standard-4 | 1980 | 150 | 10 | 16 | true | 40 | 100 GB | 2h |
| shared-gitlab-org | docker+machine | n2d-standard-4 | 1200 | 900 | 15 | 12 | false | 10 | 50 GB | 1.5h |
| shared-gitlab-org (dind) | docker+machine | n2d-standard-4 | 1200 | 100 | 15 | 12 | true | 1 | 50 GB | 1.5h |
| tamland | docker | n2d-standard-4 | 20 | 10 | - | 1 | - | - | - | - |
| saas-macos-medium-m1 | instance (fleeting) | c2-standard-30 | 40 | 40 | - | 4 | - | - | - | 3h |
| saas-macos-large-m2pro | instance (fleeting) | c2-standard-30 | 12 | 40 | - | 4 | - | - | - | 3h |
| saas-windows-medium-amd64 | custom (autoscaler) | n1-standard-4 | 100 | 100 | - | 2 | - | - | - | 2h |
- Executor: Runner executor type (docker+machine, instance, custom)
- Concurrent (gitlab-runner): Max simultaneous jobs per runner manager
- Limit (gitlab-runner): Max jobs per runner (within a manager); private shard has multiple runners with different limits
- IdleCount (docker-machine): Pre-warmed idle VMs; high values impact shutdown time
- Runner Count: Current fleet size (for reference)
- Privileged (docker): Whether containers run in privileged mode (required for Docker-in-Docker)
- MaxBuilds (docker-machine): Number of jobs a VM can run before being destroyed; 1 = ephemeral VMs, higher values = VM reuse
- DiskSize (docker-machine): Job VM disk size; scales with runner size (30GB for small, 50GB for medium, 100GB for large, 200GB for xlarge/2xlarge)
- Job Timeout: Maximum job duration configured in GitLab
Alternative Solutions
Do Nothing
Keep current Chef + Terraform setup.
Pros: No migration effort, team familiar with tooling, existing runbooks remain valid.
Cons: Operational toil remains (~2 engineer-days/deployment), no path to daily deployments.
Decision: Rejected. Signed off by VP of Infrastructure Platforms.
Improve Automation in Current VM-Based System
Automate the existing Chef + Terraform workflow instead of migrating.
Pros: Lower migration risk, no need to validate Kubernetes behavior.
Cons: Still maintains split config (Terraform + Chef), blue/green complexity remains, goes against infrastructure direction (ArgoCD, Kubernetes), custom automation requires significant investment.
Decision: Rejected. Doesn’t address fundamental problems; investment better spent on migration. Signed off by VP of Infrastructure Platforms.
Use GRIT (GitLab Runner Infrastructure Toolkit)
Use GRIT to manage runner infrastructure with Terraform modules. GRIT is actively used for Hosted Runners on GitLab Dedicated.
Pros:
- Purpose-built for GitLab Runner with standardized module interfaces
- Supports multiple cloud providers (AWS, GCP)
- Successfully used by Dedicated team for greenfield AWS deployments
- Includes deployer tool for zero-downtime VM deployments
Cons:
- VM-first, not Kubernetes-native. GRIT's primary use case is provisioning VMs with docker-autoscaler. While Kubernetes modules exist, they use the GitLab Runner Operator (which has lower maturity than the Helm chart).
- No Docker Machine executor support. GRIT supports docker-autoscaler, instance, shell, and kubernetes executors. We have shards running Docker Machine today. While we want to migrate off of Docker Machine, coupling the executor migration with the deployment mechanism migration creates unnecessary risk. The Helm chart supports Docker Machine, allowing us to tackle these migrations independently.
- Terraform-based, not GitOps-native. GRIT uses Terraform modules to manage infrastructure. Our target for Kubernetes is ArgoCD/GitOps, consistent with the rest of the Production Engineering organization.
- Doesn't solve deployment toil. GRIT's deployer provides zero-downtime deployments for VMs through custom orchestration (blue/green coordination, SSH tunnels, gRPC process wrapper). Kubernetes provides these capabilities natively (rolling deploys, graceful termination, health checks). Our goal is continuous deployment, not building custom deployment tooling.
- Building blocks, not turnkey. Dedicated uses a subset of GRIT modules and built additional infrastructure for their specific needs. This flexibility is valuable for greenfield deployments but requires integration work.
- Limited adoption for Kubernetes. GRIT is used primarily by internal teams (Dedicated, Demo Architecture) for VM-based deployments, though some external customers use it as well. The Helm chart is the most widely adopted method for deploying runners on Kubernetes.
Decision: Not selected. GRIT’s primary use case is VM-based runner deployments with scheduled releases. We need Kubernetes-native deployment for continuous delivery. Signed off by VP of Infrastructure Platforms.