Cloud Spanner Region Configuration for Topology Service
| Status | Authors | Coach | DRIs | Owning Stage | Created |
|---|---|---|---|---|---|
| | a_richter | | | devops tenant-scale | 2025-05-08 |
The Topology Service in the Cells architecture requires a highly available database with strong consistency guarantees. Cloud Spanner offers both regional and multi-regional configurations. For our implementation, we need to determine the optimal configuration that balances performance, disaster recovery capabilities, and cost.
Key considerations:
- Client-side latency, for example HTTP Router -> Topology Service.
- Server-side latency: inter-Spanner communication such as Paxos replication.
- Disaster recovery needs and resilience against regional outages.
- Cost implications for different configurations.
- SLAs provided by Google.
Multi-regional configurations under consideration:
nam3
: Read-write replicas in the eastern US, and optional replicas in Europe and Asia

nam11
: Read-write replicas across multiple North American regions

nam-eur-asia3
: Read-write replicas across North America, and read-only replicas across Europe and Asia
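These are Google-managed base configurations; their replica layouts can be inspected directly with gcloud, for example:

```shell
# List the instance configurations available to the project
gcloud spanner instance-configs list

# Show the replica layout (read-write, read-only, witness regions) of nam3
gcloud spanner instance-configs describe nam3
```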
Client Side Latency Testing
We tested the latency between various GCP regions and the current Topology Service hosted in us-east1/us-central1 to mimic our production environment, where the HTTP Router can send requests to it from all over the globe. Testing was performed using GCP VMs running the k6 load-testing tool, which gives an indication of cross-region latency patterns.
Test VMs were created in the following regions:
- asia-east1-a (n1-standard-4)
- europe-west1-b (n1-standard-4)
- us-east1 (n1-standard-4)
- us-central1 (n1-standard-4)
The k6 test script made POST requests to the Topology Service’s /v1/classify endpoint with a single virtual user over 30 seconds.
k6 Latency Testing Methodology
See this Comment for further details
```javascript
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 1,          // Number of virtual users
  duration: '30s', // Test duration
};

export default function () {
  const url = 'https://topology-rest.runway.gitlab.net/v1/classify';
  const payload = JSON.stringify({
    type: "CELL_ID",
    value: "2"
  });
  const params = {
    headers: {
      'Content-Type': 'application/json',
    },
  };

  const response = http.post(url, payload, params);

  // Check if the request was successful
  check(response, {
    'status is 200': (r) => r.status === 200,
  });
}
```
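Each run was executed from the test VMs with the k6 CLI; the script filename below is illustrative:

```shell
# Execute the latency test against the /v1/classify endpoint
k6 run classify.js
```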
Results
Baseline Configuration (us-east1/us-central1 deployment):
Run | us-east1 | europe-west1-b | asia-east1-c |
---|---|---|---|
1 | 11.57ms | 105.69ms | 182.41ms |
2 | 11.27ms | 106.12ms | 188.10ms |
3 | 12.04ms | 105.12ms | 182.79ms |
Avg | 11.63ms | 105.64ms | 184.43ms |
Multi-Regional Configuration (us-east1/us-central1/europe-west1/asia-east1 deployment):
Run | us-east1 | europe-west1 | asia-east1 | us-central1 |
---|---|---|---|---|
1 | 14.35ms | 17.59ms | 11.48ms | 31.30ms |
2 | 14.82ms | 18.12ms | 11.23ms | 32.87ms |
3 | 13.92ms | 17.21ms | 11.79ms | 30.76ms |
Avg | 14.36ms | 17.64ms | 11.50ms | 31.64ms |
Analysis
The latency testing reveals significant improvements for international regions when using multi-regional deployment:
- Europe latency improvement: 83% reduction (105.64ms → 17.64ms)
- Asia latency improvement: 94% reduction (184.43ms → 11.50ms)
- US East Coast: Slight increase (11.63ms → 14.36ms) due to distributed routing
These results demonstrate that while our proposed nam11 configuration serves North American users well, there would be substantial performance benefits for future EU- or Asia-based deployments if we migrate to nam-eur-asia3.
Server Side Latency
Spanner uses Paxos for replication. In a multi-regional configuration, a successful write must be acknowledged by a quorum of voting replicas, which always spans more than one region. The read-write regions therefore need low latency between them to keep writes fast.
The Performance Dashboard below shows that the lowest latency is between us-east1 and us-east4, which are the read-write regions for nam3.
Decision
We will base our instance configuration on nam3 with the following modifications (a gcloud sketch follows the table below):
- us-east1 as primary instead of us-east4: our primary region is us-east1.
- Additional read-only replicas: when we add the Legacy Cell to the Cells Cluster, we will add read-only replicas to optimize client-side latency.
Read-Write Regions | Read-Only Regions | Witness Region | Optional Read-Only Regions |
---|---|---|---|
us-east1 (primary), us-east4 | us-west2, asia-southeast1, europe-west1 | us-central1 | asia-southeast2, europe-west2 |
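A rough gcloud sketch of this setup is shown below. The configuration, instance, and database names are placeholders, and the flags for creating a custom instance configuration should be verified against the current gcloud reference; only the default_leader DDL is strictly required to make us-east1 the primary region.

```shell
# Sketch only: custom-nam3-topology, topology-service, and topology are
# placeholder names; verify flags with `gcloud spanner instance-configs create --help`.

# 1. Create a custom instance configuration based on nam3, adding the
#    optional read-only replicas from the table above.
gcloud spanner instance-configs create custom-nam3-topology \
  --clone-config=nam3 \
  --display-name="nam3 + read-only replicas for Topology Service" \
  --add-optional-replicas=location=us-west2,type=READ_ONLY \
  --add-optional-replicas=location=asia-southeast1,type=READ_ONLY \
  --add-optional-replicas=location=europe-west1,type=READ_ONLY

# 2. Make us-east1 the leader region instead of nam3's default us-east4.
#    The default leader is a per-database option, applied via a DDL update.
gcloud spanner databases ddl update topology \
  --instance=topology-service \
  --ddl="ALTER DATABASE topology SET OPTIONS (default_leader = 'us-east1')"
```

Note that default_leader can only point at a region that hosts read-write replicas, which is why basing the configuration on nam3 (us-east1/us-east4) is a prerequisite for making us-east1 the primary.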
We will rely on Google’s default encryption for data at rest, which is automatically enabled, rather than implementing Customer-Managed Encryption Keys (CMEK).
Consequences
Changing the topology of the Spanner instance will trigger an instance move. Doing this through Terraform can result in data loss, since Terraform will re-create the instance; however, if we add the read-only replicas right before we add the Legacy Cell to the Cells Cluster, there would be no data loss.
Benefits
- Reduced costs until we need more: we can start small and expand without changing the read-write topology.
- Lowest server-side latency for writes.
- Ability to expand to other continents, such as Europe and Asia, to improve client-side latency.
- Less maintenance overhead for encryption keys compared to CMEK.
Alternatives Considered
Alternative 1: nam-eur-asia3
Context: Read-write in us-east1 and us-central1 with out-of-the-box read-only regions.
Decision: Not ideal for server-side latency, and high cost from the start with little regional expandability.
Alternative 2: nam11
Context: Read-write in us-east1 and us-central1 with a single optional read-only region.
Decision: Not ideal for server-side latency, with little regional expandability.
Cost Analysis
Configuration | Compute | Replication | Total Cost |
---|---|---|---|
nam3 | $12,138.60 | $888.18 | $13,026.78 |
nam11 | $11,702.50 | $888.18 | $12,590.68 |
nam-eur-asia3 | $20,751.58 | $1,937.12 | $22,688.70 |
This estimate is based on:
- 5 nodes of compute capacity
- 1 TB of SSD storage
- 10 GB of data written/modified per hour
- Enterprise Plus Edition