Cloud Spanner backup strategy selection for Topology Service

This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. The development, release, and timing of any products, features, or functionality may be subject to change or delay and remain at the sole discretion of GitLab Inc.
Status Authors Coach DRIs Owning Stage Created
daveyleach sxuereb devops tenant-scale 2025-05-09

Summary

This ADR documents the decision to implement a comprehensive backup strategy for the Cloud Spanner instance supporting the Topology Service: a multi-region configuration, 36-hour Point-in-Time Recovery (PITR), and daily incremental backups with 90-day retention. Together these provide robust disaster recovery capabilities while maintaining acceptable performance.

Context

The Topology Service uses Cloud Spanner as its primary database for storing critical infrastructure metadata, including:

  • Cell configurations and metadata
  • Sequence ranges for ID allocation
  • Claim records for uniqueness enforcement (usernames, emails, routes)
  • Classification data for routing requests

As documented in the Topology Service design, this service is critical for Cells infrastructure operation, requiring robust backup and disaster recovery capabilities.

Technical Constraints

Problem Statement

We need to determine the optimal backup strategy for our Topology Service Spanner instance that:

  • Provides comprehensive protection against various failure scenarios
  • Maintains compliance with GitLab’s disaster recovery requirements (GitLab.com current state, GitLab.com targets, Cells targets)
  • Minimizes performance impact on production workloads
  • Balances backup persistence with operational flexibility
  • Supports restore validation on a regular basis

Testing Methodology

Performance Impact Testing

Google Cloud’s official documentation warns that extending Spanner’s point-in-time recovery retention beyond the default one-hour period may impact performance. However, our internal load testing contradicts this concern for our specific use case. When we tested our exact database configuration at production-equivalent scale, extending the retention period from 1 hour to 36 hours produced no detectable impact on either performance metrics or CPU utilization.

Load testing was conducted in Issue #474 to evaluate PITR impact.

Backup Recovery Testing

Testing performed in project spanner_backups validated the following (a hypothetical restore-drill sketch appears after this list):

  • Backup creation and restoration procedures
  • PITR capabilities and limitations
  • Recovery time objectives
  • Data consistency after recovery
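
Such a drill can be scripted. Below is a minimal sketch using the google-cloud-spanner Python client; the instance, database, backup, and table identifiers are hypothetical, and the final query stands in for real consistency checks. It creates an on-demand backup, restores it into a scratch database, and spot-checks the restored data.

```python
from datetime import datetime, timedelta, timezone
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("topology-service")           # hypothetical instance ID
source_db = instance.database("topology")                # hypothetical database ID

# 1. Create an on-demand backup that expires in 14 days.
backup = instance.backup(
    "restore-drill-backup",                              # hypothetical backup ID
    database=source_db.name,
    expire_time=datetime.now(timezone.utc) + timedelta(days=14),
)
backup.create().result(3600)                             # wait up to an hour

# 2. Restore the backup into a scratch database on the same instance.
restored_db = instance.database("topology-restore-drill")
restored_db.restore(source=backup).result(3600)

# 3. Spot-check the restored data (replace with real consistency checks).
with restored_db.snapshot() as snapshot:
    rows = list(snapshot.execute_sql("SELECT COUNT(*) FROM cells"))  # hypothetical table
    print("restored row count:", rows[0][0])
```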

Results

Performance Impact Analysis

Testing details are documented in the issue referenced above under Testing Methodology.

Failure Scenario Coverage Analysis

Based on testing and analysis documented in Spanner Failure Scenarios and Backup Protection, the following table summarizes coverage:

| Failure Scenario | Description | Multi-Region Config | Full/Incremental Backup | Point-in-Time Recovery (PITR) | Best Solution | RPO | RTO |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Logical Corruption (<1.5 days) | Application bug corrupts data within PITR window | ❌ Unprotected | ✅ Protected | ✅ Protected | PITR - Restore to exact moment before corruption | < 1 minute | ~Hours |
| Failed Migration | Schema migration fails and corrupts data | ❌ Unprotected | ✅ Protected | ✅ Protected | PITR - Immediate rollback capability | < 1 minute | ~Hours |
| Accidental Table Drop | Production table dropped by mistake | ❌ Unprotected | ✅ Protected | ⚠️ Partially Protected (within PITR window) | Full Backup - Recreate table with original data | 24 hours | ~Hours |
| Logical Corruption (>1.5 days) | Data corruption discovered after PITR expires | ❌ Unprotected | ✅ Protected | ❌ Unprotected | Full Backup - Longer retention period | 24 hours | ~Hours |
| Accidental Database Deletion | Admin accidentally deletes entire database | ❌ Unprotected | ✅ Protected | ❌ Unprotected | Full Backup - Deletion protection in place | 24 hours | ~Hours |
| Regional Failure | Entire region becomes unavailable | ✅ Protected | ❌ Unprotected | ❌ Unprotected | Multi-Region Config - Automatic failover | < 1 minute | < 1 minute |
| Multi-Region Disaster | Natural disaster affects multiple regions | ✅ Protected | ❌ Unprotected | ❌ Unprotected | Multi-Region Config - Geographic redundancy | < 1 minute | ~Hours if both writer regions fail, otherwise < 1 minute |

Storage Cost Analysis

Cloud Spanner Pricing Reference

Source: https://cloud.google.com/spanner/pricing

| Region | Type | Count | Rate per 100 GB/hour |
| --- | --- | --- | --- |
| us-east4 | Read-write | 2x | $0.03014 |
| us-east1 | Read-write | 2x | $0.02740 |
| us-west2 | Read-only | 1x | $0.01644 |
| europe-west1 | Read-only | 1x | $0.01370 |
| asia-southeast1 | Read-only | 1x | $0.01567 |

Backup storage: $0.30/GB/month (flat rate for multi-region backups)

Usage Cost Calculation

Conservative estimates assume a 75.96 GB database size (2x current projections) to account for incomplete table and index designs. The sketch after the key assumptions below reproduces the arithmetic behind this table.

| Component | Storage Size | Monthly Cost | Annual Cost | Notes |
| --- | --- | --- | --- | --- |
| Multi-region base storage | 75.96 GB | $89.20 | $1,070.40 | 4 read-write + 3 read-only replicas |
| 36-hour PITR overhead | ~7.6 GB (+10%) | $8.92 | $107.04 | MVCC versions for 6.48M changes |
| Backup Strategy Options: | | | | |
| Option A: Daily full schedule (90-day) | 6,836.4 GB | $2,050.92 | $24,611.04 | 90× base storage |
| Option B: Incremental schedule (90-day) | 1,230.6 GB | $369.18 | $4,430.16 | 8 fulls + 82 incrementals @ 10% |
| Total with Option A | - | $2,149.04 | $25,788.48 | Full backup strategy |
| Total with Option B | - | $467.30 | $5,607.60 | Incremental strategy (78% savings) |

Key Assumptions:

  • Database doubled from 37.98 GB projection to account for design uncertainty
  • 10% annual growth rate based on historical PostgreSQL growth trends
  • Incremental sizing: 1 full backup every 14 days (8 total) + daily incrementals at 10% of full size
  • All costs include multi-region replication across 7 total replicas
  • Backup storage charged at $0.30/GB/month
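
As a sanity check, the monthly figures above can be recomputed directly from the pricing table and the stated assumptions. The short Python sketch below reproduces the table’s numbers to within rounding; the 730 hours-per-month figure is an assumption for the conversion from hourly rates.

```python
# Reproduce the usage cost table from the pricing table and key assumptions.
DB_SIZE_GB = 75.96                       # 2x the 37.98 GB projection
BACKUP_RATE = 0.30                       # $/GB/month, multi-region backup storage
REPLICA_RATES = {                        # $ per 100 GB per hour, from the pricing table
    "us-east4": (2, 0.03014),
    "us-east1": (2, 0.02740),
    "us-west2": (1, 0.01644),
    "europe-west1": (1, 0.01370),
    "asia-southeast1": (1, 0.01567),
}
HOURS_PER_MONTH = 730                    # assumed average month length

# Multi-region base storage across all seven replicas (~$89.20/month).
hourly = sum(count * rate for count, rate in REPLICA_RATES.values()) * DB_SIZE_GB / 100
base_monthly = hourly * HOURS_PER_MONTH

# 36-hour PITR overhead estimated at 10% of base storage (~$8.92/month).
pitr_monthly = base_monthly * 0.10

# Option A: 90 daily full backups retained at any point (~$2,050.92/month).
option_a_monthly = DB_SIZE_GB * 90 * BACKUP_RATE

# Option B: 1 full every 14 days (8 fulls) + 82 incrementals at 10% (~$369.18/month).
option_b_monthly = (DB_SIZE_GB * 8 + DB_SIZE_GB * 0.10 * 82) * BACKUP_RATE

print(round(base_monthly + pitr_monthly + option_a_monthly, 2))  # ~2149.04 (Total A)
print(round(base_monthly + pitr_monthly + option_b_monthly, 2))  # ~467.30 (Total B)
```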

Key Findings

  1. PITR Performance Impact: Testing showed no significant performance degradation with 36-hour PITR compared to the 1-hour baseline, tested at an approximation of our production database size
  2. Backup Storage: According to Cloud Spanner documentation, backups are automatically replicated across all configured regions
  3. Recovery Time: Database restoration takes approximately 20-30 minutes, plus ~1 hour for redeployment, for a conservative total RTO of ~2 hours. More thorough testing is planned in issue #483
  4. Version Retention: PITR maintains multiple versions using Multi-Version Concurrency Control (MVCC) with minimal overhead at our scale
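
Within the PITR window, data from before an incident can also be read directly with a timestamp-bound (stale) read instead of a full restore. The sketch below uses the google-cloud-spanner Python client; the instance, database, table, and column names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from google.cloud import spanner

client = spanner.Client()
database = client.instance("topology-service").database("topology")  # hypothetical IDs

# Read data as it existed two hours ago. The timestamp must fall inside the
# configured version_retention_period (36 hours in this design).
read_timestamp = datetime.now(timezone.utc) - timedelta(hours=2)
with database.snapshot(read_timestamp=read_timestamp) as snapshot:
    rows = snapshot.execute_sql("SELECT cell_id, name FROM cells")    # hypothetical table
    for row in rows:
        print(row)
```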

Decision

We will implement the following backup strategy for the Cloud Spanner Topology Service:

  1. Configure production as multi-region, per the existing ADR, to handle regional failures automatically and, in combination with backups, multi-region failures
  2. Enable Point-in-Time Recovery with a 36-hour retention window to provide immediate recovery capabilities for recent issues and protection from most identified failure scenarios not already covered by the multi-region configuration. Data corruption is likely to cause issues that are discovered quickly, and the 36-hour window provides a buffer for discovery and response
  3. Implement a daily incremental backup schedule with 90-day retention to leverage Spanner’s automatic full backup management (full backups are created as needed, followed by up to 13 incrementals per chain), matching our current backup policies and protecting against data corruption that goes unnoticed for longer periods

Rationale

  1. Comprehensive Coverage: This strategy provides protection against all identified failure scenarios
  2. Performance Validated: Load testing confirms 36-hour PITR has no significant performance impact at our scale. A longer PITR retention period would provide minimal benefit with increased performance risk.
  3. Alignment with Standards: Meets or exceeds GitLab’s existing PostgreSQL backup policies
  4. Cost-Effective: Balances storage costs with recovery capabilities. A longer PITR retention period would provide minimal benefit at increased storage cost.
  5. Operational Flexibility: Daily backups provide recovery options beyond PITR window

Implementation Details

PITR Configuration:

  • Retention Period: 36 hours (see the configuration sketch after this list)
  • Provides overlap with 24-hour incremental backup windows
  • Enables precise recovery to any point within the window
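
A minimal sketch of applying this retention setting via the documented `ALTER DATABASE` DDL, using the google-cloud-spanner Python client; the project, instance, and database IDs are hypothetical.

```python
from google.cloud import spanner

client = spanner.Client(project="gitlab-topology")                    # hypothetical project ID
database = client.instance("topology-service").database("topology")   # hypothetical IDs

# Extend the PITR window (version_retention_period) from the 1-hour default to 36 hours.
operation = database.update_ddl(
    ["ALTER DATABASE topology SET OPTIONS (version_retention_period = '36h')"]
)
operation.result(300)  # schema changes are asynchronous; wait for completion
```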

Backup Schedule:

  • Incremental Backups: Daily at 02:00 UTC (full backups are created automatically by Spanner as needed; see the sketch after this list)
  • Retention: 90 days
  • Location: Multi-region replication (automatic)
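
The schedule above could be provisioned through Spanner’s backup schedules in the database admin API. The sketch below is a rough illustration only: it assumes the spanner_admin_database_v1 BackupSchedule types with incremental backup support, the exact type and field names should be verified against the current client library, and the resource names are hypothetical.

```python
from google.protobuf import duration_pb2
from google.cloud import spanner
from google.cloud.spanner_admin_database_v1 import types as admin_types

client = spanner.Client()
admin_api = client.database_admin_api  # database admin client
database_name = (
    "projects/gitlab-topology/instances/topology-service/databases/topology"  # hypothetical
)

schedule = admin_types.BackupSchedule(
    spec=admin_types.BackupScheduleSpec(
        cron_spec=admin_types.CrontabSpec(text="0 2 * * *")            # daily at 02:00 UTC
    ),
    retention_duration=duration_pb2.Duration(seconds=90 * 24 * 3600),  # 90-day retention
    incremental_backup_spec=admin_types.IncrementalBackupSpec(),       # incremental chain
)

admin_api.create_backup_schedule(
    parent=database_name,
    backup_schedule_id="daily-incremental",
    backup_schedule=schedule,
)
```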

Access Control:

  • Restricted to broken glass escalation procedures
  • IAM-based authentication (no username/password)
  • Audit logging for all backup operations

Consequences

Positive Consequences

  1. Robust Disaster Recovery: Comprehensive protection against hardware failures, data corruption, and human errors
  2. Validated Performance: Load testing proved 36-hour PITR adds negligible latency (<0.25ms) at our scale
  3. Geographic Redundancy: Multi-region configuration provides continental-level disaster protection with automatic failover
  4. Flexible Recovery Options: 36-hour PITR for precise recent recovery, 90-day backups for older incidents
  5. Compliance Ready: Meets enterprise DR requirements with <1 minute RTO/RPO for regional failures, ~2 hours RTO for backup restoration scenarios (including redeployment time)
  6. Operational Efficiency: Fully managed service eliminates backup maintenance overhead (saves ~0.5 FTE)

Negative Consequences

  1. Regional Change Limitation: Cannot change instance regional configuration while backups exist
  2. Performance Implications: Possible long-term performance implications of retaining PITR data for longer than the default period
  3. Recovery Point Objective: For data corruption discovered after 36 hours, recovery point is limited to backup intervals (up to 24 hours of data loss)
  4. Storage Costs: Maintaining 90 days of daily backups incurs additional storage expenses
  5. Recovery Time for Backup Restores: Database restoration from backups takes 20-30 minutes, plus ~1 hour for redeployment (~2 hours total conservative estimate), with no point-in-time recovery capabilities on restored backups

Mitigations

  1. Regional Changes: Implement controlled process for regional migrations using change management process
  2. Recovery Testing: Regular restore drills to optimize procedures and reduce recovery time
  3. Monitoring and Alerts: Implement monitoring to ensure that backup configuration changes or removal of drop protection do not go undetected

References