Disaster Recovery
This page contains information related to upcoming products, features, and functionality.
It is important to note that the information presented is for informational purposes only.
Please do not rely on this information for purchasing or planning purposes.
The development, release, and timing of any products, features, or functionality may be
subject to change or delay and remain at the sole discretion of GitLab Inc.
| Status | Authors | Coach | DRIs | Owning Stage | Created |
|---|---|---|---|---|---|
| ongoing | jarv | | | | 2024-01-29 |
This document is a work-in-progress and proposes architecture changes for the GitLab.com SaaS.
The goal of these changes is to maintain GitLab.com service continuity in the case of a regional or zonal outage.
For the current state see Disaster Recovery Policies for GitLab Backups.
- A zonal recovery is required when all resources are unavailable in one of the three availability zones in `us-east1` or `us-central1`.
- A regional recovery is required when all resources become unavailable in one of the regions critical to operation of GitLab.com, either `us-east1` or `us-central1`.
Services not included in the current DR strategy for FY24 and FY25
We have limited the scope of DR to services that support primary services (Web, API, Git, Pages, Sidekiq, CI, and Registry).
These services tie directly into our overall availability score (internal link) for GitLab.com.
For example, DR does not include the following:
- AI services including code suggestions
- Error tracking and other observability services like tracing
- CustomersDot, responsible for billing and new subscriptions
- Advanced Search
DR Implementation Targets
The FY24 targets were:
| | Recovery Time Objective (RTO) | Recovery Point Objective (RPO) |
|---|---|---|
| Zonal | 2 hours | 1 hour |
| Regional | 96 hours | 2 hours |
The FY25 targets before cell architecture are:
| | Recovery Time Objective (RTO) | Recovery Point Objective (RPO) |
|---|---|---|
| Zonal | 0 minutes | 0 minutes |
| Regional | 48 hours | 0 minutes |
Note: While the RPO values are targets, they cannot be met exactly due to the limitations of regional bucket replication and replication lag of Gitaly and PostgreSQL.
Current Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for Zonal Recovery
We have not yet simulated a full zonal outage on GitLab.com.
The following are RTO/RPO estimates based on what we have been able to test using the disaster recovery runbook.
It is assumed that each service can be restored in parallel.
A parallel restore is the only way we are able to meet the FY24 RTO target of 2 hours for a zonal recovery.
| Service | RTO | RPO |
|---|---|---|
| PostgreSQL | 1.5 hr | <=5 min |
| Redis | 0 | 0 |
| Gitaly | 30 min | <=1 hr |
| CI | 30 min | not applicable |
| Load balancing (HAProxy) | 30 min | not applicable |
| Frontend services (Web, API, Git, Pages, Registry) | 15 min | 0 |
| Monitoring (Prometheus, Thanos, Grafana, Alerting) | 0 | not applicable |
| Operations (Deployments, runbooks, operational tooling, Chef) | 30 min | 4 hr |
| PackageCloud (distribution of packages for self-managed) | 0 | 0 |
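The effect of the parallel-restore assumption on the zonal RTO can be checked with a short calculation. The following is a minimal sketch, not part of the disaster recovery runbook: the function and variable names are hypothetical, the RTO values are the estimates from the table above converted to minutes, and it assumes that no service has to wait for another service to finish restoring.

```python
# Estimated per-service RTOs in minutes, copied from the table above.
RTO_MINUTES = {
    "PostgreSQL": 90,
    "Redis": 0,
    "Gitaly": 30,
    "CI": 30,
    "Load balancing (HAProxy)": 30,
    "Frontend services": 15,
    "Monitoring": 0,
    "Operations": 30,
    "PackageCloud": 0,
}

FY24_ZONAL_RTO_TARGET_MINUTES = 120  # FY24 zonal RTO target of 2 hours


def parallel_rto(rtos):
    """Overall RTO when every service is restored at the same time.

    Assumption: services have no restore-order dependencies, so the
    slowest service determines the total.
    """
    return max(rtos.values())


def sequential_rto(rtos):
    """Overall RTO when services are restored one after another."""
    return sum(rtos.values())


if __name__ == "__main__":
    totals = {
        "parallel": parallel_rto(RTO_MINUTES),
        "sequential": sequential_rto(RTO_MINUTES),
    }
    for label, total in totals.items():
        verdict = "meets" if total <= FY24_ZONAL_RTO_TARGET_MINUTES else "misses"
        print(f"{label}: {total} min ({verdict} the 2 hour target)")
```

Under these assumptions the parallel estimate is bounded by PostgreSQL at 90 minutes, while a one-at-a-time restore adds up to 225 minutes and would exceed the 2 hour target.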
Current Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for Regional Recovery
Regional recovery requires a complete rebuild of GitLab.com using backups that are stored in multi-region buckets.
The recovery has not yet been validated end-to-end, so we don’t know how long the RTO is for a regional failure.
Our target RTO for FY25 is to have a procedure to recover from a regional outage in under 48 hours.
The following are considerations for choosing multi-region buckets over dual-region buckets:
- We operate out of a single region so multi-region storage is only used for disaster recovery.
- Although Google recommends dual-region for disaster recovery, dual-region is not an available storage type for disk snapshots.
- To mitigate the bandwidth limitation of multi-region buckets, we spread the Gitaly VM infrastructure across multiple projects.
Proposals for Regional and Zonal Recovery
Improving the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for Regional Recovery
The following lists the top challenges that limit our ability to drive RTO
to 48 hours for a regional recovery.
- We have a large amount of legacy infrastructure managed using Chef. This configuration has been difficult for us to manage and would require a large amount of manual copying and duplication to create new infrastructure in an alternate region.
- Operational infrastructure is located in a single region, `us-central1`. A regional failure in this region would require rebuilding the ops infrastructure with only local copies of runbooks and tooling scripts.
- Observability is hosted in a single region.
- The infrastructure (`dev.gitlab.org`) that builds Docker images and packages is located in a single region, and is a single point of failure.
- There is no launch-pad that would allow us to get a head-start on a regional recovery. Our IaC (Infrastructure-as-Code) does not allow us to switch regions for provisioning.
- We don’t have confidence that Google can provide us with the capacity we need in a new region, specifically the large amount of SSD necessary to restore all of our customer Git data.
- We use Global DNS for internal DNS, making it difficult to use multiple instances with the same name across multiple regions. We also don’t incorporate regions into DNS names for our internal endpoints (for example dashboards, logs, etc).
- If we deploy replicas in another region to reduce RPO we are not yet sure of the latency or cloud spend impacts.
- We have special/negotiated quota increases for Compute, Network, and API with the Google Cloud Platform for a single region only. We would have to match these quotas in a new region and keep them in sync.
- We have not standardized a way to divert traffic at the edge from one region to another.
- In monitoring and configuration, we have places where we hardcode the region to `us-east1`.
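Two of the challenges above, region-less internal DNS names and a hardcoded region, come down to the region not being a parameter. The following is a minimal sketch of what treating it as one could look like. It is not taken from GitLab's configuration: the `RegionConfig` class, the `recovery_config` helper, the `internal.example.com` domain, and the naming scheme are all hypothetical, and `us-west1` is an arbitrary placeholder rather than a proposed recovery region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionConfig:
    """Hypothetical per-region settings; not an existing GitLab structure."""

    region: str
    internal_domain: str  # placeholder domain, assumed for illustration

    def endpoint(self, service: str) -> str:
        """Build an internal hostname that carries the region in its name."""
        return f"{service}.{self.region}.{self.internal_domain}"


def recovery_config(primary: RegionConfig, recovery_region: str) -> RegionConfig:
    """Derive the recovery region's settings by changing only the region."""
    return RegionConfig(
        region=recovery_region,
        internal_domain=primary.internal_domain,
    )


if __name__ == "__main__":
    primary = RegionConfig(region="us-east1", internal_domain="internal.example.com")
    # "us-west1" is an arbitrary example value, not a decision from this document.
    recovery = recovery_config(primary, "us-west1")
    for service in ("dashboards", "logs"):
        print(primary.endpoint(service), recovery.endpoint(service))
```

With the region carried in each name, the same service can exist in two regions without a naming collision, and no monitoring or configuration code needs a literal `us-east1`.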
Regional recovery work-streams
The first step of our regional recovery plan is to create new infrastructure in the recovery region, which involves a large number of manual steps.
To give us a head-start on recovery, we propose a “regional bulkhead” deployment in a new GCP region.
Improving the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for Zonal Recovery
The following are our current DR challenges and are candidates for problems that we should address in this architecture blueprint.
- Postgres replicas run close to capacity and are scaled manually. New
instances must go through Terraform CI pipelines and Chef configuration.
Over-provisioning to absorb a zone failure would add significant cloud-spend
(see proposal section at the end of the document for details).
- HAProxy (load balancing) is scaled manually and must go through Terraform CI
pipelines and Chef configuration.
- CI runner managers are present in 2 availability zones and scaled close to
capacity. New instances must go through Terraform CI pipelines and Chef
configuration.
- In a zone there are saturation limits, like the number of replicas that need
to be manually adjusted if load is shifted away from a failed availability
zone.
- Gitaly RPO is limited by the frequency of disk snapshots, and RTO is limited
by the time it takes to provision and configure through Terraform CI
pipelines and Chef configuration.
- Monitoring infrastructure that collects metrics from Chef managed VMs is
redundant across 2 availability zones and scaled manually. New instances must
go through Terraform CI pipelines and Chef configuration.
- The Chef server, which is responsible for all configuration of Chef managed
VMs, is a single point of failure located in `us-central1`. It has a local
Postgres database and files on local disk.
- The infrastructure (`dev.gitlab.org`) that builds Docker images and packages
is located in a single region, and is a single point of failure.
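Several of the items above describe fleets that run close to capacity in each zone. The arithmetic behind the over-provisioning cost can be sketched as follows. This is an illustration only: the function names are hypothetical, and it assumes load is spread evenly across zones and redistributes evenly across the surviving zones after a failure.

```python
def load_multiplier_after_failure(zones: int, failed: int = 1) -> float:
    """Factor by which per-zone load grows when `failed` zones are lost.

    Assumption: load is evenly spread before the failure and evenly
    redistributed across the surviving zones afterwards.
    """
    surviving = zones - failed
    if zones < 1 or failed < 0 or surviving < 1:
        raise ValueError("need at least one surviving zone")
    return zones / surviving


def max_safe_utilization(zones: int, failed: int = 1) -> float:
    """Highest steady-state utilization that still fits in the surviving zones."""
    return 1 / load_multiplier_after_failure(zones, failed)


if __name__ == "__main__":
    # 3 zones matches the regional layout; 2 zones matches fleets such as
    # the CI runner managers that are present in only 2 availability zones.
    for zones in (3, 2):
        multiplier = load_multiplier_after_failure(zones)
        ceiling = max_safe_utilization(zones)
        print(f"{zones} zones: x{multiplier:.2f} load per zone, "
              f"stay at or below {ceiling:.0%} utilization")
```

Under these assumptions, a fleet spread across three zones must stay at or below roughly two thirds utilization to absorb the loss of one zone without scaling, and a fleet in two zones must stay at or below one half.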
Zonal recovery work-streams
Improvements to zonal recovery focus on reducing the time it takes to provision fleets that do not automatically scale.
There is already work in progress to completely eliminate statically allocated VMs like HAProxy.
Additionally, efforts can be made to shorten launch and configuration times for fleets that are not able to automatically scale, like Gitaly, PostgreSQL, and Redis.