Disaster Recovery

This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. The development, release, and timing of any products, features, or functionality may be subject to change or delay and remain at the sole discretion of GitLab Inc.
| Status  | Authors | Coach | DRIs | Owning Stage | Created    |
|---------|---------|-------|------|--------------|------------|
| ongoing | jarv    |       |      |              | 2024-01-29 |

This document is a work in progress and proposes architecture changes for the GitLab.com SaaS. The goal of these changes is to maintain GitLab.com service continuity in the case of a regional or zonal outage.

  • A zonal recovery is required when all resources become unavailable in one of the three availability zones in us-east1 or us-central1.
  • A regional recovery is required when all resources become unavailable in one of the regions critical to the operation of GitLab.com, either us-east1 or us-central1.

Services not included in the current DR strategy for FY24 and FY25

We have limited the scope of DR to services that support primary services (Web, API, Git, Pages, Sidekiq, CI, and Registry). These services tie directly into our overall availability score (internal link) for GitLab.com.

For example, DR does not include the following:

  • AI services including code suggestions
  • Error tracking and other observability services like tracing
  • CustomersDot, responsible for billing and new subscriptions
  • Advanced Search

DR Implementation Targets

The FY24 targets were:

|          | Recovery Time Objective (RTO) | Recovery Point Objective (RPO) |
|----------|-------------------------------|--------------------------------|
| Zonal    | 2 hours                       | 1 hour                         |
| Regional | 96 hours                      | 2 hours                        |

The FY25 targets before cell architecture are:

|          | Recovery Time Objective (RTO) | Recovery Point Objective (RPO) |
|----------|-------------------------------|--------------------------------|
| Zonal    | 0 minutes                     | 0 minutes                      |
| Regional | 48 hours                      | 0 minutes                      |

Note: While the RPO values are targets, they cannot be met exactly due to the limitations of regional bucket replication and replication lag of Gitaly and PostgreSQL.

Current Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for Zonal Recovery

We have not yet simulated a full zonal outage on GitLab.com. The following are RTO/RPO estimates based on what we have been able to test using the disaster recovery runbook. It is assumed that each service can be restored in parallel. A parallel restore is the only way we are able to meet the FY24 RTO target of 2 hours for a zonal recovery.

| Service                                                            | RTO    | RPO            |
|--------------------------------------------------------------------|--------|----------------|
| PostgreSQL                                                         | 1.5 hr | <=5 min        |
| Redis [1]                                                          | 0      | 0              |
| Gitaly                                                             | 30 min | <=1 hr         |
| CI                                                                 | 30 min | not applicable |
| Load balancing (HAProxy)                                           | 30 min | not applicable |
| Frontend services (Web, API, Git, Pages, Registry) [2]             | 15 min | 0              |
| Monitoring (Prometheus, Thanos, Grafana, Alerting)                 | 0      | not applicable |
| Operations (Deployments, runbooks, operational tooling, Chef) [3]  | 30 min | 4 hr           |
| PackageCloud (distribution of packages for self-managed)           | 0      | 0              |
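To illustrate why a parallel restore is required to meet the 2-hour target, the sketch below compares the parallel recovery time (bounded by the slowest service) with a hypothetical serial recovery (the sum of all restores), using the RTO estimates from the table above.

```python
# Rough arithmetic sketch: parallel vs. serial zonal recovery time,
# using the per-service RTO estimates from the table above (in minutes).
SERVICE_RTO_MINUTES = {
    "PostgreSQL": 90,
    "Redis": 0,
    "Gitaly": 30,
    "CI": 30,
    "Load balancing (HAProxy)": 30,
    "Frontend services": 15,
    "Monitoring": 0,
    "Operations": 30,
    "PackageCloud": 0,
}

FY24_ZONAL_RTO_TARGET_MINUTES = 120  # 2-hour FY24 target

# Parallel restore: total time is bounded by the slowest single service.
parallel_rto = max(SERVICE_RTO_MINUTES.values())

# Serial restore: total time is the sum of every service restore.
serial_rto = sum(SERVICE_RTO_MINUTES.values())

print(f"parallel restore: {parallel_rto} min "
      f"(target met: {parallel_rto <= FY24_ZONAL_RTO_TARGET_MINUTES})")
print(f"serial restore:   {serial_rto} min "
      f"(target met: {serial_rto <= FY24_ZONAL_RTO_TARGET_MINUTES})")
```

With these estimates, the parallel path is bounded by PostgreSQL at roughly 1.5 hours, while a serial restore would exceed the 2-hour target well before every service is back.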

Current Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for Regional Recovery

Regional recovery requires a complete rebuild of GitLab.com using backups that are stored in multi-region buckets. The recovery has not yet been validated end-to-end, so we don’t know how long the RTO is for a regional failure. Our target RTO for FY25 is to have a procedure to recover from a regional outage in under 48 hours.

The following are considerations for choosing multi-region buckets over dual-region buckets:

  • We operate out of a single region, so multi-region storage is used only for disaster recovery.
  • Although Google recommends dual-region for disaster recovery, dual-region is not an available storage type for disk snapshots.
  • To mitigate the bandwidth limitation of multi-region buckets, we spread the Gitaly VM infrastructure across multiple GCP projects (see the sketch after this list).
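The bandwidth point in the last bullet can be made concrete with a back-of-the-envelope calculation. The data volume and per-project throughput below are placeholders rather than measured GitLab.com figures; the sketch only shows how spreading the Gitaly fleet over more GCP projects multiplies the aggregate restore throughput and shortens the restore window.

```python
# Back-of-the-envelope sketch of why Gitaly restore bandwidth matters.
# All inputs are illustrative placeholders, not measured GitLab.com figures.
TOTAL_GIT_DATA_TB = 100    # placeholder: total Gitaly data to restore
PER_PROJECT_GBPS = 10      # placeholder: sustained restore throughput per GCP project
GCP_PROJECTS = [1, 4, 8]   # number of projects the Gitaly fleet is spread across

def restore_hours(data_tb: float, projects: int, gbps_per_project: float) -> float:
    """Hours to copy `data_tb` terabytes when each project restores in parallel."""
    total_gbps = projects * gbps_per_project
    total_gigabits = data_tb * 8 * 1000  # TB -> gigabits (decimal units)
    return total_gigabits / total_gbps / 3600

for n in GCP_PROJECTS:
    print(f"{n} project(s): ~{restore_hours(TOTAL_GIT_DATA_TB, n, PER_PROJECT_GBPS):.1f} h")
```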

Proposals for Regional and Zonal Recovery



  1. Most of the Redis load is on the primary node, so losing replicas should not cause any service interruption.

  2. We set maximum replica counts in the Kubernetes clusters serving front-end traffic to avoid saturating downstream dependencies. After a zonal failure, a cluster reconfiguration is necessary to raise these maximums.

  3. There is a 4 hr RPO for Operations because Chef is a single point of failure in a single availability zone and our restore method uses disk snapshots taken every 4 hours. While most of our Chef configuration is also stored in Git, some data (like node registrations) is only stored on the server.
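As an operational sketch related to footnote 3, the age of the most recent Chef server disk snapshot is the worst-case data-loss window if its zone fails right now. This assumes the google-cloud-compute Python client; the project ID and snapshot name prefix are placeholders, not our real values.

```python
# Sketch: estimate the current worst-case RPO for the Chef server by checking
# the age of its most recent disk snapshot. Project ID and name prefix are
# placeholders for illustration only.
from datetime import datetime, timezone

from google.cloud import compute_v1

PROJECT = "example-ops-project"   # placeholder project ID
NAME_PREFIX = "chef-server"       # placeholder snapshot name prefix

client = compute_v1.SnapshotsClient()
snapshots = [s for s in client.list(project=PROJECT) if s.name.startswith(NAME_PREFIX)]

if not snapshots:
    print("no matching snapshots found")
else:
    latest = max(snapshots, key=lambda s: datetime.fromisoformat(s.creation_timestamp))
    age = datetime.now(timezone.utc) - datetime.fromisoformat(latest.creation_timestamp)
    print(f"latest snapshot {latest.name} is {age} old; "
          "this is the worst-case RPO if the zone fails now")
```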


Regional Recovery

Improving the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for Regional Recovery

The following lists the top challenges that limit our ability to bring the RTO for a regional recovery down to 48 hours.

  1. We have a large amount of legacy infrastructure managed using Chef. This configuration has been difficult for us to manage and would require a large amount of manual copying and duplication to create new infrastructure in an alternate region.
  2. Operational infrastructure is located in a single region, us-central1. A regional failure in that region would require rebuilding the ops infrastructure using only local copies of runbooks and tooling scripts.
  3. Observability is hosted in a single region.
  4. The infrastructure (dev.gitlab.org) that builds Docker images and packages is located in a single region, and is a single point of failure.
  5. There is no launch pad that would give us a head start on a regional recovery: our IaC (Infrastructure-as-Code) does not allow us to switch regions for provisioning.
  6. We don’t have confidence that Google can provide us with the capacity we need in a new region, specifically the large amount of SSD necessary to restore all of our customer Git data.
  7. We use global DNS for internal DNS, which makes it difficult to run multiple instances with the same name across regions. We also don't incorporate regions into the DNS names of our internal endpoints (for example, dashboards and logs).
  8. If we deploy replicas in another region to reduce RPO, we are not yet sure of the latency or cloud-spend impacts.
  9. We have specially negotiated quota increases for Compute, Network, and API with Google Cloud Platform for a single region only. We would have to match these quotas in a new region and keep them in sync.
  10. We have not standardized a way to divert traffic at the edge from one region to another.
  11. Monitoring and configuration contain places where the region is hardcoded to us-east1 (see the sketch after this list).
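For item 11, a simple scan of configuration and monitoring repositories can inventory hardcoded region names before any multi-region work begins. The repository path and file extensions below are placeholders.

```python
# Sketch: find hardcoded region names in configuration and monitoring repos.
# The repository path, region list, and file extensions are placeholders.
import pathlib
import re

REPO_ROOT = pathlib.Path("/path/to/config-repo")      # placeholder checkout path
HARDCODED_REGIONS = re.compile(r"\bus-(east1|central1)\b")
EXTENSIONS = {".yml", ".yaml", ".json", ".tf", ".rb"}  # typical config file types

for path in REPO_ROOT.rglob("*"):
    if path.suffix not in EXTENSIONS or not path.is_file():
        continue
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
        if HARDCODED_REGIONS.search(line):
            print(f"{path}:{lineno}: {line.strip()}")
```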

Regional recovery work-streams

The first step of our regional recovery plan creates new infrastructure in the recovery region, which involves a large number of manual steps. To give us a head start on recovery, we propose a “regional bulkhead” deployment in a new GCP region.

Zonal Recovery

Improving the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for Zonal Recovery

The following are our current DR challenges and candidate problems to address in this architecture blueprint.

  1. Postgres replicas run close to capacity and are scaled manually. New instances must go through Terraform CI pipelines and Chef configuration. Over-provisioning to absorb a zone failure would add significant cloud-spend (see proposal section at the end of the document for details).
  2. HAProxy (load balancing) is scaled manually and must go through Terraform CI pipelines and Chef configuration.
  3. CI runner managers are present in 2 availability zones and scaled close to capacity. New instances must go through Terraform CI pipelines and Chef configuration.
  4. Each zone has saturation limits, such as maximum replica counts, that must be manually adjusted when load shifts away from a failed availability zone (see the sketch after this list).
  5. Gitaly RPO is limited by the frequency of disk snapshots; RTO is limited by the time it takes to provision and configure through Terraform CI pipelines and Chef configuration.
  6. Monitoring infrastructure that collects metrics from Chef managed VMs is redundant across 2 availability zones and scaled manually. New instances must go through Terraform CI pipelines and Chef configuration.
  7. The Chef server, which is responsible for all configuration of Chef-managed VMs, is a single point of failure located in us-central1. It has a local Postgres database and stores files on local disk.
  8. The infrastructure (dev.gitlab.org) that builds Docker images and packages is located in a single region, and is a single point of failure.
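For item 4 (and footnote 2 above), the cluster reconfiguration amounts to raising the autoscaler ceilings once traffic shifts to the surviving zones. The sketch below uses the Kubernetes Python client; the namespace, HPA names, and replica ceiling are placeholders, not production values.

```python
# Sketch: raise the HPA maxReplicas ceiling for front-end workloads after a
# zonal failure. Namespace, HPA names, and the new ceiling are placeholders.
from kubernetes import client, config

NAMESPACE = "gitlab"                          # placeholder namespace
FRONTEND_HPAS = ["web", "api", "git-https"]   # placeholder HPA names
NEW_MAX_REPLICAS = 200                        # placeholder ceiling for two-zone operation

config.load_kube_config()                     # or load_incluster_config() inside the cluster
autoscaling = client.AutoscalingV1Api()

for name in FRONTEND_HPAS:
    autoscaling.patch_namespaced_horizontal_pod_autoscaler(
        name=name,
        namespace=NAMESPACE,
        body={"spec": {"maxReplicas": NEW_MAX_REPLICAS}},
    )
    print(f"patched {NAMESPACE}/{name}: maxReplicas={NEW_MAX_REPLICAS}")
```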

Zonal recovery work-streams

Improvements for zonal recovery focus on reducing the time it takes to provision fleets that do not automatically scale. Work is already in progress to completely eliminate statically allocated VMs such as HAProxy. Additionally, we can shorten launch and configuration times for fleets that cannot automatically scale, like Gitaly, PostgreSQL, and Redis.
