Topology Service Transactional Behavior
| Status | Authors | Coach | DRIs | Owning Stage | Created |
|---|---|---|---|---|---|
| accepted | ayufan | | | | 2025-07-02 |
This document outlines the design goals and architecture of the Topology Service implementing transactional behavior for the Claims Service.
This document simplifies some concepts (intentionally or unintentionally), so it is not reflective of the actual implementation. The API presented should be considered an example to illustrate the concepts, not the final state.
Note: The protobuf and database structures used in this document are only for presenting the behavior and do not represent the final data structures used.
Essential Concepts
Distributed Lease-Based Coordination
This system implements a distributed lease-based coordination mechanism for managing globally unique claims (usernames, emails, routes) across multiple GitLab cells. It ensures only one cell can “own” a particular claim at any given time, preventing conflicts in a distributed environment.
Core Behavioral Principles
1. Lease-First Coordination
The system follows a “lease-first, commit-later” pattern (a minimal sketch follows the list below):
- Before making any local changes, acquire a lease from the central Topology Service
- Only after successful lease acquisition, proceed with local database operations
- After local success, commit the lease to make changes permanent
- If anything fails, rollback the lease to maintain consistency
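A minimal sketch of the pattern from the Rails side, assuming a hypothetical topology_service client and the request messages defined later in this document:
# 1. Acquire the lease before any local change
request = BeginUpdateRequest.new(cell_id: current_cell_id, creates_claims: new_claims, destroys_claims: removed_claims)
lease = topology_service.begin_update(request).lease_payload

begin
  # 2. Local database operations happen only after the lease is held
  ApplicationRecord.transaction do
    records.each(&:save!)
    LeasesOutstanding.create!(lease_id: lease.lease_id)
  end

  # 3. Commit the lease to make the claims permanent
  topology_service.commit_update(CommitUpdateRequest.new(cell_id: current_cell_id, lease_id: lease.lease_id))
rescue StandardError
  # 4. On any failure, roll back the lease to keep the global state consistent
  topology_service.rollback_update(RollbackUpdateRequest.new(cell_id: current_cell_id, lease_id: lease.lease_id))
  raise
end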
2. Atomic Batch Operations
Multiple related models MUST be processed together:
- Collect all claim changes from multiple models
- Send a single batch request to Topology Service
- Process all local database changes in one transaction
- Commit or rollback all claims together
Critical Constraint: The same claim can only be marked for one operation within a batch, meaning it cannot be marked for both creation and destruction at the same time (see the validation sketch below).
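A minimal sketch of such a validation on the Rails side; the claim_type/claim_value accessors follow the ClaimRecord message defined later, everything else is an assumption:
# Reject batches where the same claim appears in both creates and destroys
def validate_batch!(creates, destroys)
  create_keys = creates.map { |c| [c.claim_type, c.claim_value] }
  destroy_keys = destroys.map { |d| [d.claim_type, d.claim_value] }
  conflicting = create_keys & destroy_keys

  raise ArgumentError, "claims marked for both create and destroy: #{conflicting.inspect}" if conflicting.any?
end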
3. Time-Bounded Leases
Leases are time-bounded through the reconciliation process:
- Leases only have creation timestamps, no explicit expiration
- Reconciliation process determines staleness based on age (default 10 minutes threshold)
- Prevents indefinite locks as leases are rolled back after some threshold
- Background reconciliation ensures consistency
4. Lease Exclusivity
Critical Rule: Objects with active leases (lease_id != NULL AND lease_op != NULL) cannot be claimed by other operations:
- Create Operations: Will fail with primary key constraint if object already exists
- Destroy Operations: Will fail with conditional update if object has active lease
- Temporal Lock: Objects remain locked until committed/rolled back
- Automatic Release: Stale leases are rolled back through reconciliation, making objects available again
5. Ownership Security
Critical Security Constraint: Only the cell that created the claim can destroy it:
- Cell ID Verification: Cells can only operate on their own leases
- Prevents Interference: Cells cannot destroy claims created by other cells
- Security Isolation: Malicious or buggy cells cannot disrupt other cells’ data
System Participants
Rails Concern: Claim Attributes
- Primary Role: Handles application actions (create, update or destroy) that modify claimable attributes
- Responsibilities:
- Validates models locally before claiming
- Collects all claims from multiple models into batch requests
- Coordinates with Topology Service for lease acquisition
- Manages local database transactions
- Handles immediate commit/rollback after local operations
- Retries network failures a reasonable number of times
Sidekiq Worker or Cron Job: Lost Transaction Recovery
- Frequency: Every minute
- Operation: Reconciles leases recorded in the Topology Service against those recorded in Rails (and vice versa)
- Strategy: Rails-driven cleanup with idempotent Topology Service operations
- Process:
- List outstanding leases from Topology Service with cursor-based pagination
- Separate stale and active leases based on creation time and staleness threshold
- Commit active leases that exist locally, rollback stale leases
- Clean up orphaned local lease records
Sidekiq Cron Job: Data Verification Process
- Frequency: A few times a day. The cron job execution should ideally be randomized across cells to avoid creating a spiky workload. This might mean limiting concurrency globally via the Topology Service.
- Operation: Validates the consistency of claims between Rails and Topology Service
- Strategy: Rails-driven validation process
- Process:
- Iterate information stored on Topology Service
- Compare with information stored in Rails DB
- Process all claims defined in Rails codebase
- Correct information stored in Topology Service (Create, Update or Destroy)
Topology Service
Primary Role: Centralized coordination service managing global claim state
Responsibilities:
- Enforces lease exclusivity through database constraints
- Manages lease lifecycle (create, commit, rollback)
- Provides atomic batch operations for multiple claims
- Ensures cells can only operate on their own claims
Happy Path Workflow
sequenceDiagram
  participant User
  participant Rails as Rails App
  participant RailsDB as Rails DB
  participant TopologyService as Topology Service
  participant CloudSpanner as Cloud Spanner
  User->>Rails: Save Multiple Models (User + Email + Route)
  Note over Rails: Step 1: Pre-flight Claims Acquisition
  Rails->>Rails: Validate all models
  Rails->>Rails: Collect claims from all models
  Rails->>Rails: Generate batch BeginUpdateRequest
  Note over Rails: BEFORE Rails DB Transaction
  Rails->>TopologyService: BeginUpdate(batched creates, destroys)
  TopologyService->>CloudSpanner: BEGIN
  TopologyService->>CloudSpanner: INSERT INTO leases_outstanding SET lease_id=@id
  TopologyService->>CloudSpanner: INSERT INTO claims SET claim_attributes, lease_id=@id,<br/>lease_op='create', cell_id=@id
  TopologyService->>CloudSpanner: UPDATE claims SET lease_id=@id, lease_op='destroy'<br/>WHERE lease_id IS NULL AND cell_id=@id (destroys)
  Note over CloudSpanner: Primary key constraints + conditional updates enforce exclusivity
  CloudSpanner-->>TopologyService: Success - All constraints satisfied
  TopologyService->>CloudSpanner: COMMIT Transaction
  TopologyService-->>Rails: BeginUpdateResponse(lease_payload)
  Note over Rails: Step 2: Local Database Transaction
  Rails->>RailsDB: BEGIN Transaction
  Rails->>RailsDB: Save user changes
  Rails->>RailsDB: Save email changes
  Rails->>RailsDB: Save route changes
  Rails->>RailsDB: Insert lease in leases_outstanding
  Rails->>RailsDB: COMMIT Transaction
  Note over Rails: Step 3: Lease Commitment
  Rails->>TopologyService: CommitUpdate(lease_id)
  TopologyService->>CloudSpanner: BEGIN Transaction
  TopologyService->>CloudSpanner: DELETE claims WHERE lease_op='destroy' AND lease_id=@id AND cell_id=@id
  TopologyService->>CloudSpanner: UPDATE claims SET lease_id=NULL, lease_op='no-op'<br/>WHERE lease_op='create' AND lease_id=@id AND cell_id=@id
  TopologyService->>CloudSpanner: DELETE FROM leases_outstanding WHERE lease_id=@id
  TopologyService->>CloudSpanner: COMMIT Transaction
  TopologyService-->>Rails: CommitUpdateResponse()
  Rails->>RailsDB: DELETE lease from leases_outstanding
  Rails-->>User: Success - All models saved atomically
Detailed Steps
- Pre-flight Validation: Rails validates all models locally and generates batch claim requests
- Lease Acquisition: Single atomic transaction in Topology Service acquires leases for all claims
- Local Database Transaction: Rails saves all changes and creates lease tracking record
- Lease Commitment: Topology Service finalizes all claims and removes lease
- Cleanup: Rails removes local lease tracking record
Detailed Process Flow
Phase 1: Pre-Flight Claims Acquisition
User saves models → Rails validates → Claims generated → Topology Service called
What happens:
- Model Validation: Rails validates the model locally first
- Claim Generation: For each changed unique attribute, generate create/destroy claims
- Batch Collection: If multiple models are being saved, collect all claims together
- Topology Service Call: Send BeginUpdate() request with all claims BEFORE local DB transaction
Why this order:
- Fail Fast: If claims conflict, fail before making any local changes
- Atomicity: Either all claims succeed or all fail
- Efficiency: Single network call for multiple models. Important due to the high write latency of multi-region writes (a batching sketch follows below).
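For illustration, a minimal sketch of collecting claims from several changed models into one batch request; the pending_claim_* helpers are assumptions, not part of the actual API:
# Collect create/destroy claims from every model participating in the save
creates = []
destroys = []
[user, email, route].each do |model|
  creates.concat(model.pending_claim_creates)    # e.g. new username, email, or path claims
  destroys.concat(model.pending_claim_destroys)  # e.g. previous values being replaced
end

# One BeginUpdate call for the whole batch, issued before the local DB transaction
request = BeginUpdateRequest.new(cell_id: current_cell_id, creates_claims: creates, destroys_claims: destroys)
lease = topology_service.begin_update(request).lease_payload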
Phase 2: Atomic Lease Creation in Topology Service
BeginUpdate() → Single transaction → Database constraints + lease exclusivity enforced
What happens in Topology Service during BeginUpdate():
- Lease Record: Insert into leases_outstanding with full payload
- Create Claims: Insert new claims with lease_op='create' and the lease_id
  - Primary key constraint on (claim_type, claim_value) prevents duplicates
  - Constraint: Creates must reference claims that don’t exist in the system
- Mark Destroys: Update existing claims ONLY if they are not leased (lease_id IS NULL) AND cell_id matches the requesting cell
  - Conditional update ensures no concurrent operations on the same object AND only the creator can destroy
  - Constraint: Destroys must reference claims that are committed and not leased, and are owned by the requesting cell
- Batch Validation: The same claim (claim_type, claim_value) can only have one operation (create or destroy) to avoid irreconcilable state transitions
- Lease Exclusivity: Only objects not being leased (lease_id IS NULL) are open for new operations
- Atomic Success/Failure: If any operation fails, the entire transaction automatically rolls back
Ownership Rule:
- Only unlocked objects (where lease_id IS NULL) can be claimed
- Owned by Cell: only claims that belong to the cell can be modified
- This prevents concurrent modifications and ensures operation isolation
Why this approach:
- Minimize Global Replication Latency: The create portion of changes becomes routable in the system immediately, which hides cross-regional Cloud Spanner replication latency. This aligns with the expected failure rate: we expect 99.9% of operations to succeed, so creating records that might later be rolled back is the exception rather than the rule.
- Exclusive Access: Only one operation can work on an object at a time
- Prevents Race Conditions: Cannot claim objects already being modified
- Temporal Isolation: Leases provide time-bounded exclusive access
- Database-Level Enforcement: Constraints are faster and more reliable than application checks
- No Explicit Conflict Checking: Database constraints handle uniqueness automatically
Phase 3: Local Database Transaction
Lease acquired → Rails DB transaction → Save all models → Create lease record
What happens in Rails:
- Transaction Start: Begin Rails database transaction
- Model Saves: Save all the model changes that generated the claims
- Lease Tracking: Create leases_outstanding record with lease_id and creation date
- Transaction Commit: Commit all changes together
Why after lease acquisition:
- Safety: Local changes only happen after global coordination succeeds
- Tracking: Inserting lease_id into leases_outstanding ensures that the local transaction was properly committed
- Immediate Cleanup: Lease record in Rails DB enables prompt cleanup after transaction completion
- Rollback Capability: If the local DB fails, we have the lease_id to clean up (see the sketch below)
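A minimal sketch of Phase 3, assuming a LeasesOutstanding ActiveRecord model backed by the leases_outstanding table shown later in this document:
# Phase 3: local database transaction, executed only after the lease was acquired
ApplicationRecord.transaction do
  user.save!
  email.save!
  route.save!
  # Tracking record; its presence later proves the local transaction committed
  LeasesOutstanding.create!(lease_id: lease.lease_id)
end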
Phase 4: Lease Commitment
Local success → CommitUpdate() → Finalize claims → Remove lease
What happens in Rails before CommitUpdate():
- Immediate Commit: The CommitUpdate()/RollbackUpdate() is triggered by Rails after_commit/after_rollback hooks (see the hook sketch below)
- No-transaction Check: Rails checks that there’s no local DB transaction open
- Rails Check: Before doing CommitUpdate(), Rails checks leases_outstanding to ensure that the transaction was committed successfully
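A minimal sketch of this hook wiring, assuming a hypothetical CurrentLease holder for the lease acquired in Phase 1 and the client helpers used elsewhere in this document (illustrative only, not the final implementation):
# Illustrative callbacks; these could live in the CellsUniqueness concern
after_commit :commit_claim_lease
after_rollback :rollback_claim_lease

def commit_claim_lease
  lease_id = CurrentLease.lease_id # hypothetical per-operation lease holder
  return unless lease_id && LeasesOutstanding.exists?(lease_id: lease_id)

  topology_service.commit_update(CommitUpdateRequest.new(cell_id: current_cell_id, lease_id: lease_id))
  LeasesOutstanding.find_by(lease_id: lease_id)&.destroy!
end

def rollback_claim_lease
  lease_id = CurrentLease.lease_id
  topology_service.rollback_update(RollbackUpdateRequest.new(cell_id: current_cell_id, lease_id: lease_id)) if lease_id
end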
What happens in Topology Service during CommitUpdate():
- Destroy Processing: DELETE claims WHERE lease_op='destroy' AND lease_id=@id
- Create Finalization: UPDATE claims SET lease_id=NULL, lease_op='no-op' WHERE lease_op='create' AND lease_id=@id
- Lease Cleanup: DELETE FROM leases_outstanding WHERE lease_id=@id
- Rails Cleanup: Rails deletes its leases_outstanding record
Why this two-phase approach:
- Durability: Creates become permanent, destroys are executed
- Immediate Cleanup: Leases are removed immediately after successful completion
- Idempotency: CommitUpdate can be retried safely - operations are idempotent
Terminology by Participant
Rails Client
Local Schema
-- Rails migration for leases_outstanding table (synchronized with Cloud Spanner)
-- This table only contains active leases - entries are deleted when consumed
CREATE TABLE leases_outstanding (
lease_id STRING(36) NOT NULL UNIQUE,
created_at TIMESTAMP NOT NULL,
updated_at TIMESTAMP NOT NULL
);
-- Indexes for performance and operational queries
CREATE INDEX idx_leases_outstanding_created ON leases_outstanding(created_at);
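The schema above uses Cloud Spanner types for illustration; in Rails the equivalent table would likely be created by a migration along these lines (a sketch, not the actual migration):
class CreateLeasesOutstanding < ActiveRecord::Migration[7.1]
  def change
    create_table :leases_outstanding, id: false do |t|
      t.string :lease_id, limit: 36, null: false
      t.timestamps null: false
    end

    add_index :leases_outstanding, :lease_id, unique: true
    add_index :leases_outstanding, :created_at, name: 'idx_leases_outstanding_created'
  end
end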
Model Definition
# Rails ActiveRecord concern for handling distributed claims
module CellsUniqueness
  extend ActiveSupport::Concern

  included do
    # Table metadata used when generating claims; set by the including model
    class_attribute :cell_cluster_table_name, :cell_cluster_table_record_id
  end

  class_methods do
    def cell_cluster_unique_attributes(*attributes, claim_type:, owner:, owner_type:)
      @cell_unique_config = {
        attributes: attributes,
        claim_type: claim_type,
        owner: owner,
        owner_type: owner_type
      }
    end
  end
end
# Example model implementation
class Route < ApplicationRecord
  include CellsUniqueness

  self.cell_cluster_table_name = Gitlab::Cells::ClaimRecord::TableName::ROUTES
  self.cell_cluster_table_record_id = :id

  cell_cluster_unique_attributes :path,
    claim_type: Gitlab::Cells::ClaimRecord::ClaimType::ROUTES,
    owner: -> { project || group },
    owner_type: -> { project ? Gitlab::Cells::ClaimRecord::Owner::PROJECT : group ? Gitlab::Cells::ClaimRecord::Owner::GROUP : Gitlab::Cells::ClaimRecord::Owner::UNSPECIFIED }
end
Operations
- Claim Generation: Extract unique attributes from model changes into create/destroy claims
- Batch Collection: Aggregate claims from multiple models into a single BeginUpdateRequest
- Batch Validation: Ensure creates and destroys within a batch reference different claims to avoid modeling conflicts
- Lease Tracking: Store lease_id locally for reconciliation and cleanup
- Immediate Cleanup: Remove lease records after successful commit/rollback
- Error Handling: Convert gRPC errors to Rails validation errors (see the sketch below)
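A minimal sketch of mapping gRPC errors to validation errors, assuming the BeginUpdate call raises standard GRPC::BadStatus subclasses from the grpc gem; the mapping itself is illustrative:
# Translate Topology Service conflicts into user-facing validation errors
def begin_update_with_error_handling(model, request)
  topology_service.begin_update(request)
rescue GRPC::AlreadyExists
  model.errors.add(:base, 'is already taken by another cell')
  nil
rescue GRPC::FailedPrecondition
  model.errors.add(:base, 'is temporarily locked, please try again later')
  nil
rescue GRPC::Unavailable, GRPC::DeadlineExceeded
  model.errors.add(:base, 'claim service temporarily unavailable')
  nil
end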
Background Reconciliation
Rails-Driven Cleanup Process
# Rails-driven cleanup with idempotent Topology Service operations
class ClaimsLeaseReconciliationService
LEASE_STALENESS_THRESHOLD = 10.minutes # Consider lease stale if older than threshold
def self.reconcile_outstanding_leases
topology_service = Gitlab::Cells::TopologyServiceClient.new
cursor = nil
loop do
# Get leases from Topology Service with cursor-based pagination
response = topology_service.list_outstanding_leases(
ListOutstandingLeasesRequest.new(
cell_id: current_cell_id,
cursor: cursor
)
)
break if response.leases.empty?
# Process active leases: commit if they exist locally
local_active_leases = LeasesOutstanding.where(lease_id: response.leases.map { |lease| lease.lease_payload.lease_id }).pluck(:lease_id)
local_active_leases.each do |lease_id|
topology_service.commit_update(CommitUpdateRequest.new(cell_id: current_cell_id, lease_id: lease_id))
LeasesOutstanding.find_by(lease_id: lease_id)&.destroy!
end
# Find stale leases that are missing locally
now = Time.current
local_active_leases = local_active_leases.to_set
stale_leases = response.leases
.reject { |lease| local_active_leases.include?(lease.lease_payload.lease_id) }
.select { |lease| lease.lease_payload.created_at.to_time < (now - LEASE_STALENESS_THRESHOLD) }
# Process stale leases: rollback all (idempotent)
stale_leases.each do |lease|
topology_service.rollback_update(RollbackUpdateRequest.new(cell_id: current_cell_id, lease_id: lease.lease_payload.lease_id))
# Clean up any local record that might exist
LeasesOutstanding.find_by(lease_id: lease.lease_payload.lease_id)&.destroy!
end
# Update cursor for next iteration
cursor = response.next_cursor
break if cursor.blank? # No more pages
end
# Exception case: leases missing from TS but present locally
# This might happen during "Scenario 5A: Failed to Delete Local Lease Record"
stale_local_leases = LeasesOutstanding.where('created_at < ?', Time.current - LEASE_STALENESS_THRESHOLD)
if stale_local_leases.exists?
Rails.logger.error "Found #{stale_local_leases.count} stale local leases without TS counterparts"
stale_local_leases.destroy_all
end
end
end
Reconciliation Principles
- Primary Cleanup: Rails is responsible for cleaning up outstanding leases
- Cursor-Based Pagination: Guarantees iteration through all leases without infinite loops
- Idempotent Operations: Topology Service CommitUpdate/RollbackUpdate operations are idempotent
- Staleness-Based Cleanup: Leases older than threshold are considered stale (local property)
- Complete Processing: Both stale and active leases are processed to ensure forward progress
- Exception Handling: Local leases without corresponding TS leases indicate system issues
- Immediate Cleanup: Leases are removed immediately once processed (a scheduling sketch follows below)
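A sketch of how the reconciliation job could be scheduled, assuming the sidekiq-cron gem and a hypothetical ClaimsLeaseReconciliationWorker that wraps the service above (illustrative, not the final configuration):
# config/initializers/claims_reconciliation.rb (illustrative)
Sidekiq::Cron::Job.create(
  name: 'Claims lease reconciliation - every minute',
  cron: '* * * * *',
  class: 'ClaimsLeaseReconciliationWorker'
)

# Hypothetical worker delegating to the service defined above
class ClaimsLeaseReconciliationWorker
  include Sidekiq::Worker

  def perform
    ClaimsLeaseReconciliationService.reconcile_outstanding_leases
  end
end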
Topology Service Data Verification
Overview
The data verification process ensures consistency between the Topology Service’s global claim state and each cell’s local database records. This is a cell-initiated process that runs periodically to detect and reconcile discrepancies.
flowchart TD
  A[Scheduled Verification] --> B[For Each Model Class]
  B --> C[List Claims from TS by Table]
  C --> D[Process Claims in ID Ranges]
  D --> E[Compare with Local Records]
  E --> F[Identify Discrepancies]
  F --> G[Filter Recent Records]
  G --> H[Execute Corrections]
  H --> I[Next Range]
  I --> J{More Ranges?}
  J -->|Yes| C
  J -->|No| K[Next Model]
  K --> B
  K -->|Done| L[End]
  style A fill:#e1f5fe
  style L fill:#e8f5e8
  style G fill:#fff3e0
  style H fill:#fff3e0
Core Verification Process
The system uses cursor-based pagination to fetch claims from TS in ID ranges, then compares them with local records using efficient hash-based matching. Three types of discrepancies are detected: missing claims in TS, different claim values, and extra claims in TS. Recent records (within 1 hour) are skipped to avoid correcting transient changes.
Optionally, gRPC streaming can be used to fetch records live, while still allowing the stream to resume from a specific cursor.
Verification Service Implementation
class ClaimsVerificationService
CURSOR_RANGE_SIZE = 1000 # Records per TS request
LOCAL_BATCH_SIZE = 500 # Local records per batch
RECENT_RECORD_THRESHOLD = 1.hour # Skip recent records
def self.verify_all_claim_models(models_with_claims)
models_with_claims.each do |model_class|
verify_model_claims(model_class)
end
end
def self.verify_model_claims(model_class)
topology_service = Gitlab::Cells::TopologyServiceClient.new
cursor = 0
loop do
# Fetch claims from TS in cursor-based ranges
response = topology_service.list_claims(
ListClaimsRequest.new(
cell_id: current_cell_id,
table_name: model_class.cell_cluster_table_name,
cursor: cursor,
limit: CURSOR_RANGE_SIZE
)
)
break if response.claims.empty?
# Process the returned ID range
process_claims_range(model_class, response.claims, response.start_range, response.end_range)
cursor = response.next_cursor
break if cursor.nil?
end
end
def self.process_claims_range(model_class, ts_claims, start_range, end_range)
# Create hash map for O(1) claim lookups
mapped_ts_claims = ts_claims.index_by { |ts_claim| claim_key(ts_claim) }
recent_threshold = Time.current - RECENT_RECORD_THRESHOLD
# Query local records matching the exact TS range
model_class.where(id: start_range...end_range).find_in_batches(batch_size: LOCAL_BATCH_SIZE) do |local_batch|
missing_ts_claims = []
different_local_claims = []
different_ts_claims = []
local_batch.each do |record|
# Skip recent records to avoid transient conflicts
next if record_is_recent?(record, recent_threshold)
model_class.cell_claim_attributes.each do |claim_attributes|
# Generate expected claim from local record
claim_record = record.generate_claim_record(claim_attributes)
# Look up and remove from hash (tracks processed claims)
ts_claim_record = mapped_ts_claims.delete(claim_key(claim_record))
if ! ts_claim_record
# Missing in TS: local exists, no TS claim
missing_ts_claims << claim_record
elsif ! claim_records_match?(claim_record, ts_claim_record)
# Different: both exist but data differs
next if ts_claim_is_recent?(ts_claim_record, recent_threshold)
different_local_claims << claim_record
different_ts_claims << ts_claim_record
end
# If match exactly, no action needed
end
end
# Execute corrections in batches
if missing_ts_claims.present?
# TODO: This can fail if it's already claimed by the other cell. Need manual intervene.
begin_update_and_commit(creates: missing_ts_claims)
end
if different_ts_claims.present?
# TODO: This can fail if it's already claimed by the other cell. Need manual intervene.
# Non-atomic update: destroy then create
begin_update_and_commit(destroys: different_ts_claims)
begin_update_and_commit(creates: different_local_claims)
end
end
# Remaining hash entries are extra TS claims
extra_ts_claims = mapped_ts_claims.values.reject { |ts_claim| ts_claim_is_recent?(ts_claim, recent_threshold) }
if extra_ts_claims.present?
begin_update_and_commit(destroys: extra_ts_claims)
end
end
def self.record_is_recent?(record, recent_threshold)
record.created_at > recent_threshold
end
def self.ts_claim_is_recent?(ts_claim, recent_threshold)
ts_claim.created_at.to_time > recent_threshold
end
def self.claim_records_match?(expected_claim, ts_claim)
# TODO: Compare claim type, value, and owner
end
def self.claim_key(ts_claim)
# Unique identifier for hash lookups
"#{ts_claim.claim_type}/#{ts_claim.claim_value}"
end
def self.begin_update_and_commit(creates: [], destroys: [])
topology_service = Gitlab::Cells::TopologyServiceClient.new
# Use standard BeginUpdate/CommitUpdate pattern
response = topology_service.begin_update(
BeginUpdateRequest.new(
cell_id: current_cell_id,
creates: creates,
destroys: destroys
)
)
# Immediate commit (no local DB changes for verification)
topology_service.commit_update(
CommitUpdateRequest.new(
cell_id: current_cell_id,
lease_id: response.lease_payload.lease_id
)
)
Rails.logger.info "Verified: #{creates.size} creates, #{destroys.size} destroys"
end
# Note: `private` does not apply to `def self.` methods; mark the class method explicitly
private_class_method def self.current_cell_id
Gitlab::Cells.current_cell_id
end
end
Key Behaviors
- Range-Based Processing: TS returns claims in ID ranges (start_range to end_range), local queries match exact ranges
- Hash-Based Matching: O(1) lookups using claim_key, processed claims removed via delete()
- Three-Way Comparison: Missing (local only), Different (both exist, data differs), Extra (TS only)
- Recent Record Protection: Skip records created within threshold to avoid transient conflicts
- Batch Corrections: BeginUpdate/CommitUpdate pattern used for all corrections, separate batches per discrepancy type
- Cursor Pagination: Automatic advancement through all records, no gaps or overlaps
Recent Record Protection
The verification system protects against correcting transient changes:
- Configurable Threshold: RECENT_RECORD_THRESHOLD = 1.hour (adjustable)
- Local Record Protection: Skips entire record if created_at is within threshold
- Claim Comparison Protection: Skips correction if either local or TS claim is recent
- Extra Claim Protection: Filters out recent TS claims from deletion
Benefits:
- Transient Change Handling: Prevents correction of records still propagating
- Race Condition Prevention: Avoids interfering with ongoing operations
- Operational Safety: Reduces risk of correcting legitimate new data
Topology Service
Cloud Spanner Schema
-- Claims table with integrated lease tracking
CREATE TABLE claims (
claim_id STRING(36) NOT NULL, -- UUID for internal tracking
claim_type INT64 NOT NULL, -- Type of claim (maps to protobuf ClaimType enum)
claim_value STRING(255) NOT NULL, -- The actual value being claimed
owner_type INT64 NOT NULL, -- Type of owner (maps to protobuf Owner enum)
owner_value STRING(255) NOT NULL, -- Owner identifier
cell_id STRING(100) NOT NULL, -- Cell ID that created this claim
table_name INT64 NOT NULL, -- Database table name (maps to protobuf TableName enum)
table_record_id INT64 NOT NULL, -- Record ID in the source table
lease_id STRING(36), -- NULL for committed claims, UUID for leased claims
lease_op INT64 NOT NULL DEFAULT (0), -- 0: 'no-op', 1: 'create', 2: 'destroy'
created_at TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
updated_at TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
) PRIMARY KEY (claim_type, claim_value);
-- CRITICAL CONSTRAINTS:
-- 1. Objects with lease_id != NULL cannot be claimed by other operations
-- 2. Only the cell that created a claim (cell_id) can destroy it
-- These constraints prevent concurrent operations and unauthorized access
-- Outstanding leases table (mirrored with Rails for synchronization)
CREATE TABLE leases_outstanding (
lease_id STRING(36) NOT NULL, -- UUID of the lease
cell_id STRING(100) NOT NULL, -- Cell ID that owns the lease
lease_payload BYTES(MAX) NOT NULL, -- Serialized LeasePayload protobuf
created_at TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
) PRIMARY KEY (lease_id);
-- Performance and operational indexes
CREATE INDEX idx_leases_outstanding_cell ON leases_outstanding(cell_id);
CREATE INDEX idx_claims_cell ON claims(cell_id);
CREATE INDEX idx_claims_lease_id ON claims(lease_id) STORING (lease_op);
-- Additional indexes for verification queries
CREATE INDEX idx_claims_table_record ON claims(cell_id, table_name, table_record_id);
CREATE INDEX idx_claims_table_cursor ON claims(cell_id, table_name, table_record_id) STORING (claim_type, claim_value, owner_type, owner_value);
Protobuf Definitions
syntax = "proto3";
package gitlab.cells.claims.v1;
option go_package = "gitlab.com/gitlab-org/cells/claims/v1;claimsv1";
import "google/protobuf/timestamp.proto";
// Service definition for Topology Service Claims API
service ClaimsService {
// BeginUpdate creates/destroys with lease - atomic operation
rpc BeginUpdate(BeginUpdateRequest) returns (BeginUpdateResponse);
// CommitUpdate finalizes the operations and removes lease
rpc CommitUpdate(CommitUpdateRequest) returns (CommitUpdateResponse);
// RollbackUpdate reverts the operations and removes lease
rpc RollbackUpdate(RollbackUpdateRequest) returns (RollbackUpdateResponse);
// List outstanding leases for a client (for reconciliation)
rpc ListOutstandingLeases(ListOutstandingLeasesRequest) returns (ListOutstandingLeasesResponse);
// List claims by table for verification
rpc ListClaims(ListClaimsRequest) returns (ListClaimsResponse);
}
// Core claim record definition
message ClaimRecord {
enum ClaimType {
UNSPECIFIED = 0;
ROUTES = 1;
EMAIL = 3;
}
enum Owner {
UNSPECIFIED = 0;
GROUP = 1;
PROJECT = 2;
USER = 3;
}
enum TableName {
TABLE_UNSPECIFIED = 0;
USERS = 1;
EMAILS = 2;
ROUTES = 3;
PROJECTS = 4;
GROUPS = 5;
}
ClaimType claim_type = 1;
string claim_value = 2; // The actual value being claimed (e.g., "john@example.com")
Owner owner = 3;
string owner_value = 4; // Owner identifier
// Table information for verification
TableName table_name = 6;
int64 table_record_id = 7;
}
// BeginUpdate operation request - can include multiple creates and destroys
message BeginUpdateRequest {
string cell_id = 1; // Cell ID requesting the lease
repeated ClaimRecord creates_claims = 2; // Claims to create
repeated ClaimRecord destroys_claims = 3; // Claims to destroy
}
// Lease payload stored in Cloud Spanner and returned to clients
message LeasePayload {
string lease_id = 1; // UUID of the lease
string cell_id = 2; // Cell ID that owns the lease
google.protobuf.Timestamp created_at = 3;
BeginUpdateRequest original_request = 4; // Complete original request for reconciliation
}
// BeginUpdate operation response
message BeginUpdateResponse {
LeasePayload lease_payload = 1;
}
// CommitUpdate operation request
message CommitUpdateRequest {
string cell_id = 1;
string lease_id = 2;
}
// CommitUpdate operation response
message CommitUpdateResponse {
// Empty response - success indicated by no gRPC error
}
// RollbackUpdate operation request
message RollbackUpdateRequest {
string cell_id = 1;
string lease_id = 2;
}
// RollbackUpdate operation response
message RollbackUpdateResponse {
// Empty response - success indicated by no gRPC error
}
// List outstanding leases request
message ListOutstandingLeasesRequest {
string cell_id = 1;
string cursor = 2; // Optional cursor for pagination
}
// Outstanding lease information
message OutstandingLease {
LeasePayload lease_payload = 1;
}
// List outstanding leases response
message ListOutstandingLeasesResponse {
repeated OutstandingLease leases = 1;
string next_cursor = 2; // Cursor for next page, empty if no more pages
}
// List claims request for verification
message ListClaimsRequest {
string cell_id = 1;
ClaimRecord.TableName table_name = 2;
int64 cursor = 3;
int32 limit = 4;
}
// Claim information for verification
message ClaimInfo {
ClaimRecord.ClaimType claim_type = 1;
string claim_value = 2;
ClaimRecord.Owner owner_type = 3;
string owner_value = 4;
ClaimRecord.TableName table_name = 6;
int64 table_record_id = 7;
google.protobuf.Timestamp created_at = 8;
google.protobuf.Timestamp updated_at = 9;
}
// List claims response with range information
message ListClaimsResponse {
repeated ClaimInfo claims = 1;
int64 start_range = 2;
int64 end_range = 3;
int64 next_cursor = 4;
}
gRPC API Behaviors
- BeginUpdate(): Atomically acquire leases for batch operations, enforce exclusivity constraints
- Validation: Ensures creates and destroys reference different claims within the same batch
- Create Processing: Inserts new claims that don’t exist in the system
- Destroy Processing: Updates existing claims owned by the requesting cell
- CommitUpdate(): Finalize claims (delete destroys, clear lease_id from creates) and remove leases
- RollbackUpdate(): Revert claims (delete creates, clear lease_id from destroys) and remove leases
- ListOutstandingLeases(): Retrieve leases for reconciliation with cursor-based pagination
- ListClaims(): Retrieve claims by table for verification with range information
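For illustration, a sketch of a Ruby client calling these RPCs, assuming stubs generated by the standard grpc Ruby plugin from the protobuf above (class and method names follow the usual codegen conventions, but the endpoint and values are assumptions):
require 'grpc'

# Namespaces follow the proto package gitlab.cells.claims.v1 under standard Ruby codegen
stub = Gitlab::Cells::Claims::V1::ClaimsService::Stub.new(
  'topology-service.internal:443',            # hypothetical endpoint
  GRPC::Core::ChannelCredentials.new          # mutual TLS credentials in practice
)

request = Gitlab::Cells::Claims::V1::BeginUpdateRequest.new(
  cell_id: 'cell-1',
  creates_claims: [
    Gitlab::Cells::Claims::V1::ClaimRecord.new(
      claim_type: :EMAIL,
      claim_value: 'john@example.com',
      owner: :USER,
      owner_value: '42',
      table_name: :EMAILS,
      table_record_id: 7
    )
  ]
)

lease = stub.begin_update(request).lease_payload
stub.commit_update(
  Gitlab::Cells::Claims::V1::CommitUpdateRequest.new(cell_id: 'cell-1', lease_id: lease.lease_id)
)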
Unhappy Path Workflows
Failure Point 1: Pre-flight Validation Failure
Scenario: Model validation fails before lease acquisition
sequenceDiagram
  participant User
  participant Rails
  User->>Rails: Save invalid model
  Rails->>Rails: Validate model
  Note over Rails: Validation fails
  Rails-->>User: Validation error
Recovery: No recovery needed - no resources acquired, user sees validation error
Failure Point 2: Lease Acquisition Failure
Scenario 2A: Topology Service Conflict - Permanent Claim
sequenceDiagram
  participant User
  participant Rails as Rails App
  participant RailsDB as Rails DB (PostgreSQL)
  participant TopologyService as Topology Service (Go)
  participant CloudSpanner as Cloud Spanner
  User->>Rails: Save Model with conflicting claim
  Rails->>Rails: Validate model and generate claims
  Rails->>Rails: Validate batch (creates/destroys reference different claims)
  Rails->>TopologyService: BeginUpdate(creates, destroys)
  TopologyService->>CloudSpanner: BEGIN Transaction
  TopologyService->>CloudSpanner: Validate batch constraints
  TopologyService->>CloudSpanner: Insert lease in leases_outstanding
  TopologyService->>CloudSpanner: Try to INSERT claim (create) or UPDATE claim (destroy)
  Note over CloudSpanner: Primary key constraint violation OR conditional update fails
  Note over CloudSpanner: Object already exists OR has active lease (lease_id != NULL)
  CloudSpanner-->>TopologyService: ABORTED: Constraint/lease violation
  TopologyService->>CloudSpanner: Transaction automatically rolled back
  TopologyService-->>Rails: gRPC Error (AlreadyExists/FailedPrecondition)
  Rails->>Rails: Add validation error to model
  Rails-->>User: Save failed: "Object already claimed or temporarily locked"
Different conflict types:
- Permanent Conflict: Object permanently owned by another cell → “Already taken”
- Temporary Conflict: Object temporarily leased by another operation → “Try again later”
- Batch Validation Conflict: Creates and destroys reference same claim → “Invalid batch operation”
Recovery: No recovery needed - user sees appropriate error message
Scenario 2B: Network Failure During Execute
sequenceDiagram
  participant Rails
  participant TopologyService
  Rails->>TopologyService: BeginUpdate(creates, destroys)
  Note over TopologyService: Network timeout
  TopologyService->>Rails: Connection lost
  Rails-->>User: "Service temporarily unavailable"
Recovery: No recovery needed - no lease acquired, safe to retry. A limited number of retries might be performed by the application.
Failure Point 3: Local Database Transaction Failure
Scenario 3A: Rails Database Constraint Violation
sequenceDiagram
  participant User
  participant Rails as Rails App
  participant RailsDB as Rails DB (PostgreSQL)
  participant TopologyService as Topology Service (Go)
  participant CloudSpanner as Cloud Spanner
  User->>Rails: Save Models
  Rails->>TopologyService: BeginUpdate(creates, destroys)
  TopologyService->>CloudSpanner: Execute operations successfully
  TopologyService-->>Rails: BeginUpdateResponse(lease_payload)
  Rails->>RailsDB: BEGIN Transaction
  Rails->>RailsDB: Save user (success)
  Rails->>RailsDB: Save email (database constraint violation)
  RailsDB-->>Rails: ERROR (unique constraint violated)
  Rails->>RailsDB: ROLLBACK Transaction
  Note over Rails: Cleanup lease - Rails DB failed
  Rails->>TopologyService: RollbackUpdate(lease_id)
  TopologyService->>CloudSpanner: BEGIN Transaction
  TopologyService->>CloudSpanner: DELETE FROM claims WHERE lease_op='create' AND lease_id=@id
  TopologyService->>CloudSpanner: UPDATE claims SET lease_id=NULL, lease_op='no-op'<br/>WHERE lease_op='destroy' AND lease_id=@id
  TopologyService->>CloudSpanner: DELETE FROM leases_outstanding WHERE lease_id=@id
  TopologyService->>CloudSpanner: COMMIT Transaction
  TopologyService-->>Rails: RollbackUpdateResponse()
  Rails-->>User: Save failed: Database constraint violation
Recovery: Automatic rollback of lease, user sees error
Scenario 3B: Application Crash During Local Transaction
sequenceDiagram
  participant User
  participant Rails as Rails App
  participant Reconciliation as Rails Reconciliation Job
  participant RailsDB as Rails DB (PostgreSQL)
  participant TopologyService as Topology Service (Go)
  participant CloudSpanner as Cloud Spanner
  User->>Rails: Save Models
  Rails->>TopologyService: BeginUpdate(creates, destroys)
  TopologyService->>CloudSpanner: Execute operations successfully
  TopologyService-->>Rails: BeginUpdateResponse(lease_payload)
  Rails->>RailsDB: BEGIN Transaction
  Rails->>RailsDB: Start saving changes
  Note over Rails: Application crash / Network failure
  Note over Rails: Transaction lost, no commit/rollback
  Note over Reconciliation: Background reconciliation detects issue
  Reconciliation->>TopologyService: ListOutstandingLeases()
  TopologyService-->>Reconciliation: Returns lease still outstanding
  Reconciliation->>RailsDB: Check if lease exists locally
  RailsDB-->>Reconciliation: Lease NOT found (was never committed)
  Note over Reconciliation: Lease older than LEASE_STALENESS_THRESHOLD
  Reconciliation->>TopologyService: RollbackUpdate(lease_id)
  TopologyService->>CloudSpanner: Process rollback operation
  TopologyService->>CloudSpanner: BEGIN Transaction
  TopologyService->>CloudSpanner: DELETE claims WHERE lease_op='create' AND lease_id=@id
  TopologyService->>CloudSpanner: UPDATE claims SET lease_id=NULL, lease_op='no-op'<br/>WHERE lease_op='destroy' AND lease_id=@id
  TopologyService->>CloudSpanner: DELETE FROM leases_outstanding WHERE lease_id=@id
  TopologyService->>CloudSpanner: COMMIT Transaction
  TopologyService-->>Reconciliation: Success
Recovery: Background reconciliation rolls back orphaned lease
Failure Point 4: Lease Commitment Failure
Scenario 4A: Missing Commit After Successful Local Transaction
sequenceDiagram
  participant User
  participant Rails as Rails App
  participant Reconciliation as Rails Reconciliation Job
  participant RailsDB as Rails DB (PostgreSQL)
  participant TopologyService as Topology Service (Go)
  participant CloudSpanner as Cloud Spanner
  User->>Rails: Save Models
  Rails->>TopologyService: BeginUpdate(creates, destroys)
  TopologyService->>CloudSpanner: Execute operations successfully
  TopologyService-->>Rails: BeginUpdateResponse(lease_payload)
  Rails->>RailsDB: BEGIN Transaction
  Rails->>RailsDB: Save all changes successfully
  Rails->>RailsDB: Insert lease in leases_outstanding
  Rails->>RailsDB: COMMIT Transaction
  Note over Rails: Network failure / Application crash
  Note over Rails: CommitUpdate() call to TS never made
  Note over Reconciliation: Background reconciliation detects issue
  Reconciliation->>TopologyService: ListOutstandingLeases()
  TopologyService-->>Reconciliation: Returns lease still outstanding
  Reconciliation->>RailsDB: Check if lease exists locally
  RailsDB-->>Reconciliation: Lease found (should be committed)
  Reconciliation->>TopologyService: CommitUpdate(lease_id)
  TopologyService->>CloudSpanner: Process commit operation
  TopologyService-->>Reconciliation: Success
  Reconciliation->>RailsDB: DELETE FROM leases_outstanding WHERE lease_id=@id
Recovery: Background reconciliation commits the lease
Scenario 4B: Topology Service Unavailable During Commit
sequenceDiagram
  participant Rails
  participant TopologyService
  Note over Rails: Local transaction committed
  Rails->>TopologyService: CommitUpdate(lease_id)
  Note over TopologyService: Service unavailable
  TopologyService-->>Rails: ServiceUnavailable error
  Rails->>Rails: Retry operation
Recovery: The operation will be retried a reasonable number of times. Otherwise, the Reconciliation Job will handle it at a later time.
Failure Point 5: Cleanup Failure
Scenario 5A: Failed to Delete Local Lease Record
sequenceDiagram
  participant Rails
  participant Reconciliation as Rails Reconciliation
  participant RailsDB
  Note over Rails: Commit successful
  Rails->>RailsDB: DELETE lease record
  Note over RailsDB: Delete fails
  RailsDB-->>Rails: ERROR
  Rails->>Rails: Log warning, continue
  Note over Reconciliation: Background cleanup finds orphaned record
  Reconciliation->>RailsDB: DELETE stale lease records
Recovery: Background reconciliation eventually removes stale records
Advanced Behaviors
Lease Exclusivity and Concurrency Control
Cell A: BeginUpdate(create email@example.com) → Lease acquired
Cell B: BeginUpdate(destroy email@example.com) → BLOCKED until Cell A commits or rolls back
Behavior: Objects with active leases cannot be claimed:
- Creates: Will fail with primary key constraint if object exists (regardless of lease status)
- Destroys: Will fail with conditional update if object has lease_id != NULL
- Temporal Lock: Object remains locked until lease expires or is committed/rolled back
- Automatic Release: Stale leases are cleaned up through reconciliation, making objects available again
Example scenarios:
- Email change collision: User changes email while admin tries to delete it → First wins, second fails. User will have to retry at a later time.
- Route transfer conflict: Two operations try to move same route → First wins, second fails
- Concurrent creation: Two cells try to create same username → First wins, second fails permanently
Multi-Model Coordination
User + Email + Route changes → Single batch BeginUpdate() → All-or-nothing semantics
Example: Creating user with email and route:
- User model generates username claim
- Email model generates email claim
- Route model generates path and name claims
- All 4 claims sent in single BeginUpdate() call
- All models saved in single Rails transaction
- All claims committed together
Why: Business operations often span multiple models - atomic coordination required
Performance Characteristics
Optimizations
- Batch Processing: Multiple claims in single RPC call
- Efficient Indexes: Cloud Spanner indexes optimized for lease operations
- Connection Pooling: Reuse gRPC connections
- Background Processing: Async cleanup doesn’t block user operations
- Constraint-Based Conflicts: Database handles uniqueness without application logic
Scalability
- Cell Independence: Each cell operates independently until conflicts
- Centralized Coordination: Conflicts prevent other Cells from making the change, with no cross-cell communication required
- Time-Bounded Locks: Automatic cleanup prevents indefinite blocking through staleness detection
- Horizontal Scaling: Cloud Spanner scales with claim volume
Why This Design Works
1. Exclusive Access Control
- Objects can only be modified by one operation at a time
- Prevents data corruption from concurrent modifications
- Clear temporal boundaries for exclusive access
2. Graceful Concurrency Handling
- Permanent conflicts (duplicates) fail immediately
- Temporary conflicts (leases) can be retried
- Users get appropriate feedback for different conflict types
3. Consistency Without Distributed Transactions
- Uses leases instead of 2PC (Two-Phase Commit), following a distributed-locking model
- Simpler failure modes than distributed transactions
- Time-bounded recovery from failures through staleness-based cleanup
- No 2PC Required: Each cell manages its own local state independently and pushes its data to the Topology Service
4. Operational Simplicity
- Clear failure modes and recovery procedures
- Observable through standard metrics and logs
- Self-healing through Rails-driven cleanup with idempotent Topology Service operations
- Immediate Cleanup: Leases are removed as soon as possible - lingering leases indicate exceptions
5. Developer Ergonomics
- Transparent integration with ActiveRecord
- Declarative configuration
- Familiar transaction semantics
6. Production Robustness
- Handles network partitions gracefully
- Automatic recovery from cell crashes
- Comprehensive error handling and retry logic
7. Comprehensive Verification
- Detects and corrects missing, extra, and different claims
- Efficient cursor-based processing for large datasets
- Recent record protection prevents interference with ongoing operations
- Hash-based matching for optimal performance
- Migration of the data is an integral part of the verification
Open Questions and Considerations
Performance Considerations
- Batch Size Limits: What’s the maximum number of claims per batch to optimize performance vs. transaction size?
- Lease Duration: What is a good staleness threshold that still accommodates in-flight transactions?
- Better Local Transaction Detection: Should the system track the local database transaction and attach it to the Topology Service lease? This would allow the reconciliation process to see whether the transaction finished
- Connection Pooling: How many concurrent connections should Rails maintain to Topology Service, and how should they be distributed across cells?
Security Considerations
- Authentication: The Topology Service is secured with mutual TLS
- Authorization: Should there be additional authorization beyond the cell_id in the request payload to validate the request?
- Audit Trail: Should all claim operations be logged for security auditing and compliance?
- Rate Limiting: What are the rate-limits for the operations?
Operational Considerations
- Monitoring: What metrics should be tracked - lease age, conflict rates, reconciliation frequency?
- Alerting: When should operators be notified - stale lease threshold, reconciliation failures, or high conflict rates?
- Disaster Recovery: How to handle Topology Service outages and ensure data consistency during recovery?
Edge Cases
- Lease Staleness Race: What happens if a lease becomes stale during local transaction - should it be allowed or rejected?
- Partial Batch Failures: The system will not support partial success in batch operations and only maintain all-or-nothing semantics
- Concurrent Reconciliation: How to handle multiple reconciliation processes running simultaneously? Should different reconciliation processes work on different ranges?
- Create/Destroy Conflicts: How should the system handle requests that try to create and destroy the same claim in a single batch? This greatly complicates Commit and Rollback operations on the Topology Service. This might be required to correct claims as part of the Verification Process
Future Enhancements
- Lease Renewal: Should long-running operations be able to refresh leases to prevent staleness?
- Lease Queuing: Should there be a queue for waiting operations when leases conflict? Or are we fine with always presenting the failure to the user? Maybe lease conflicts would only be retryable when running Background Jobs (Sidekiq).
- Lease Priorities: Should certain operations (admin vs. user) have priority over others?
- Admin Controls: What type of admin API access is needed to resolve failure modes that occur in edge cases?
- Multi-layered approach: How should the system implement a multi-layered approach for storing leases, where we have layered Topology Services, each having its own database and sending claims to upstream databases?
Testing Strategy
- Chaos Engineering: How to test behavior under various failure scenarios - network partitions, service crashes, clock skew?
- Load Testing: What’s the maximum throughput the system can handle under various conflict scenarios?
- Consistency Testing: How to verify consistency across all failure modes?
- Integration Testing: How to test the complete flow across Rails, Topology Service, and Cloud Spanner?
Additional Topics
Cell downtime vs Cell decommissioning
The Topology Service cannot roll back leases on its own, as this might leave the Cell's information outdated. This is critical if other Cells claim records that were previously held by the Cell that is down. As such, the Topology Service does not implement any reconciliation process, as it is not authoritative to make decisions about the lease.
In the case of Cell downtime, removing outstanding leases belonging to the Cell would have to be an administrative action.
This is different from Cell decommissioning. In that case it is expected that the decommissioned Cell should not hold any information in the Topology Service (all information having been migrated out). However, decommissioning might mean deliberate removal of the data. In such a case, an administrative interface might be exposed to drop all data belonging to the particular Cell that is present in the Topology Service, to make names available again.
Auditing and logging
The data structure describing leases stores the whole lease_payload.
This makes the lease_payload available to a Cell to validate the payload sent (if required as part of Commit or Rollback).
This makes it possible for the Cell as part of the Reconciliation Process to validate or revert all changes for outstanding leases.
This also makes it easier to expose those outstanding leases as part of the administrative interface in case of resolving conflicts or edge cases.
Race condition when doing CommitUpdate()/RollbackUpdate()
The CommitUpdate()/RollbackUpdate() will be executed opportunistically as part of after_commit/after_rollback. However, the Reconciliation Process might execute them at the same time. This is likely to happen rather often.
The Rollback will by design be executed only after the staleness period. We could introduce a similar threshold for performing Commit to ensure that after_commit had time to execute. This should greatly reduce the chance of this happening, as sketched below.
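A minimal sketch of such a commit grace period inside the reconciliation loop, assuming a hypothetical COMMIT_GRACE_PERIOD constant (the value is an assumption, not part of the design):
# Inside the reconciliation loop: skip very fresh leases so that the
# after_commit hook has a chance to run CommitUpdate() itself first.
COMMIT_GRACE_PERIOD = 30.seconds # assumed value for illustration

local_active_leases.each do |lease_id|
  record = LeasesOutstanding.find_by(lease_id: lease_id)
  next if record && record.created_at > COMMIT_GRACE_PERIOD.ago

  topology_service.commit_update(CommitUpdateRequest.new(cell_id: current_cell_id, lease_id: lease_id))
  record&.destroy!
end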
Hide global replication latency
The BeginUpdate() has an immediate effect on routing. Objects that are created are routable right away. This optimizes for the 99.99% case, where we follow the happy path and the changes made are sticky to the Cell.
Using BeginUpdate() to impact routing hides the global replication latency, as we can expect that making the local Cell changes will usually take longer than the Cloud Spanner replication lag.
It means that in the period between BeginUpdate() and CommitUpdate() we will have extra records pointing to the resource, which are cleaned up with CommitUpdate().
This should not pose any side effects to the system in the case of claims. If other buckets of data with different commit expectations are stored, they can be modeled accordingly. For routing purposes it poses no side effects, other than improving the user experience.
You can read about the reasons here.
Alternative Approaches to Consider
In-Transaction Claims Processing
Execute claims within the Rails transaction rather than before it:
1. Local DB: BEGIN
2. Local DB: INSERT/UPDATE/DESTROY (local operations)
3. TS DB: BeginUpdate() => lease (~100ms)
4. Local DB: INSERT lease
5. Local DB: COMMIT
6. TS DB: CommitUpdate(lease)
Considerations:
- Connection Pool Impact: Would require strict 250ms timeout on TS BeginUpdate to prevent connection pool bottlenecks
- Transaction Duration: Extends every local database transaction by network round-trip time
- Lock Contention: Holds database locks and connections during network operations
- Scalability Impact: Could exhaust connection pools under high concurrency
Trade-offs:
- Error handling: Simpler error handling vs. resource contention and scalability concerns
- Implementation: Fits well with the current lazy-evaluation approach in Rails, allowing all claims to be properly captured as they are saved to the database, which significantly reduces development complexity.
Separate Leased Table with UUID Cross-Join
Create a separate leased table that references claims by UUID:
CREATE TABLE leased (
claim_id STRING(36) NOT NULL, -- References claims.claim_id
lease_id STRING(36) NOT NULL,
lease_op STRING(10) NOT NULL, -- 'create', 'destroy'
client_id STRING(100) NOT NULL,
created_at TIMESTAMP NOT NULL,
) PRIMARY KEY (claim_id);
Benefits:
- Cleaner Separation: Permanent ownership (claims) vs. temporary locking (leased)
- Efficient Lease Queries: Direct queries against leased table
- Complex Lease Metadata: Room for detailed lease analytics without cluttering claims table
Trade-offs: Better separation of concerns vs. increased transaction complexity and performance overhead
Separate Leased Table Considerations
- Eventual Consistency: Does eventual consistency affect the cross-join between the claims and leased tables?
- JOIN Complexity: Most queries require joins between claims and leased tables
- Transaction Atomicity: More complex to ensure referential integrity across tables
- Race Conditions: Higher risk of inconsistent state between table operations
- Cloud Spanner Limitations: No foreign key constraints, potential for orphaned records
- Structure: Would the table need to duplicate claim_type and claim_value with a unique index to model claims semantics?
Two-Phase Commit (2PC)
Use traditional distributed transactions across cells:
Benefits:
- Proven Pattern: Well-understood distributed transaction semantics
Considerations:
- Overkill: 2PC is needed when many writers write to a single dataset. The information stored in the Topology Service by design does not overlap between Cells. Cells do not have to modify another Cell's information, so 2PC is not needed
- Complex: Requires coordination between all participants
- Coordinator Failure: Single point of failure that can block all participants
- Performance Overhead: Multiple network round-trips and blocking phases
- Operational Complexity: Requires distributed transaction coordinator management
- Recovery Complexity: Manual intervention often needed for failed transactions
Documents
- AIP-158 - Google guidance on implementing cursor-based pagination