AWS Primary Region Selection for Cells Infrastructure

This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. The development, release, and timing of any products, features, or functionality may be subject to change or delay and remain at the sole discretion of GitLab Inc.

Status	Authors	Coach	DRIs	Owning Stage	Created
	`tkhandelwal3`	`sxuereb`		devops tenant-scale	2025-05-08

Summary

This ADR documents the decision to use us-east-1 (Northern Virginia) as the primary AWS region for Cells infrastructure based on comprehensive latency testing against the Topology Service hosted in GCP.

Context

As part of the Cells architecture evolution, we are using AWS to provision our first Cells while maintaining critical infrastructure components like the Topology Service in GCP. This cross-cloud architecture requires careful consideration of network latency to ensure optimal performance.

The cross-cloud communication between AWS Cells and GCP infrastructure impacts:

Organization classification and routing.
Data migration when moving organizations from Legacy Cell to Cells.
Cell discovery and health checks.
Configuration management.

Given this architecture, the latency between AWS-hosted Cells and the GCP-hosted Topology Service directly impacts:

The latency to Claim a resource.
Data transfer speed when migrating organizations from Legacy Cell to Cell.
SSH Routing for Git traffic.

Problem Statement

We need to determine the optimal AWS region for hosting Cells that minimizes latency to the Topology Service while considering:

Network performance and latency characteristics
Future scalability requirements
Geographic distribution of traffic
Disaster recovery capabilities
Availbility of features

Testing Methodology

Test Infrastructure

Load Testing Tool: k6 with custom script for generating consistent load.
Test Duration: 5 runs of 360 seconds each per region
Virtual Users: 100 concurrent connections
Target Endpoint: https://topology-rest.gitlab.net/v1/classify (Production Topology Service)
Instance Type: m5a.xlarge (4 vCPUs, 16GB RAM, up to 10Gbps network)

Regions Tested

us-east-1: Northern Virginia (geographically closest to GCP us-east1)
us-east-2: Ohio

Note: us-central-1 does not exist in AWS; only us-east and us-west regions are available.

Test Configuration

// k6 test configuration
export const options = {
  vus: 100,           // 100 virtual users
  duration: '360s',   // 6 minutes per run
};

Testing was performed and documented in GitLab Issue #475 Comment.

Results

Latency Measurements

The following results represent the average across 5 test runs for each region:

Region	Median	P95	P99.9	Requests/sec
us-east-1	23.608ms	32.31ms	80.88ms	3,589/s
us-east-2	43.768ms	51.234ms	106.462ms	2,031/s

[!note] See the complete test data at: https://gitlab.com/gitlab-com/gl-infra/tenant-scale/cells-infrastructure/team/-/issues/475#note_2705749726

Performance Analysis

P95 Latency: us-east-1 demonstrates 40% better performance than us-east-2
- us-east-1: 32.31ms
- us-east-2: 51.234ms
- Improvement: ~37% reduction
P99.9 Latency: us-east-1 shows ~30% better performance for tail latencies
- us-east-1: 80.88ms
- us-east-2: 106.462ms
- Improvement: ~24% reduction
Throughput: us-east-1 achieved 76% higher request throughput
- us-east-1: ~3,589 requests/second
- us-east-2: ~2,031 requests/second

Decision

We will use us-east-1 (Northern Virginia) as the primary region for AWS Cells deployment.

Rationale

Superior Performance: 40% better P95 latency ensures better user experience for the majority of requests
Higher Throughput: 76% higher request processing capability provides better scalability
Geographic Proximity: Closest AWS region to GCP’s us-east1 where Topology Service is hosted
Network Path Optimization: Shorter network paths between Northern Virginia and GCP’s eastern regions
Consistency: More stable performance characteristics across test runs

Consequences

Positive Consequences

Improved Performance: Lower latency translates to faster response times for Cell operations.
Better Scalability: Higher throughput capacity allows for more efficient resource utilization.
Optimal Data Transfer: Reduced latency benefits organization migrations from Legacy Cell to Cells.
Feature Availability: Access to the latest AWS features as they typically launch in us-east-1 first.

Negative Consequences

Regional Concentration Risk: us-east-1 historically experiences the most AWS incidents due to its high concentration of infrastructure.
Reliability Concerns: Higher incident rate compared to other AWS regions may impact overall system reliability.
Quota Request Delays: As the busiest AWS region, us-east-1 may have longer processing times for quota increase requests needed for Cell provisioning, potentially impacting deployment timelines

Mitigations

Multi-AZ Deployment: Deploy across multiple availability zones within us-east-1 to mitigate AZ-specific failures
Secondary Region Readiness: Maintain us-east-2 as a backup_region with Geo.
Proactive Quota Management: Submit quota increase requests well in advance of Cell provisioning needs and maintain buffer capacity to account for potential delays

Last modified August 26, 2025: feat: add ADR for primary region selection for AWS cells (60732add)

View page source - Edit this page - please contribute.