Topology Service Claims In Database Transactions
This page contains information related to upcoming products, features, and functionality.
It is important to note that the information presented is for informational purposes only.
Please do not rely on this information for purchasing or planning purposes.
The development, release, and timing of any products, features, or functionality may be
subject to change or delay and remain at the sole discretion of GitLab Inc.
| Status | Authors | Coach | DRIs | Owning Stage | Created |
|---|---|---|---|---|---|
| | sxuereb | ayufan | ayufan | devops tenant-scale | 2025-05-09 |
Context
GitLab’s Cells architecture requires coordination with the Topology Service to claim cluster-wide unique resources such as emails and routes, as described in detail in Topology Service transactions. The critical decision is whether to perform these claim requests to the Topology Service inside or outside database transactions.
Key Requirements:
- Claim resources to prevent conflicts across cells.
- Maintain data consistency between the Rails DB and the Topology Service.
- Support batching of claims for performance.
- Handle ~50 claims/sec on average, with headroom to grow 6x to ~300 claims/sec.
- Do not affect database stability.
Technical Constraints:
- Network calls to Topology Service: ~200ms P99.95
- Current connection pool: 58 connections, which support ~290 claims/sec at 200ms per claim
- Claims need Postgres primary keys, for example user.id for users. These exist only once the record has been saved to the database inside the transaction.
- Rails automatically wraps creates/updates in transactions
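The ~290 claims/sec figure follows from the pool size and the claim latency. A minimal sketch of that arithmetic, assuming every in-flight claim holds exactly one pooled connection for the full 200ms (the method and constant names are illustrative, not part of the GitLab codebase):

```ruby
# Upper bound on claims/sec when each in-flight claim holds one pooled
# database connection for the whole Topology Service call.
def max_claims_per_sec(pool_size:, claim_latency_s:)
  pool_size / claim_latency_s
end

POOL_SIZE = 58         # connections, from the constraints above
CLAIM_LATENCY_S = 0.2  # ~200ms P99.95, treated here as the per-claim worst case
AVERAGE_RATE = 50.0    # claims/sec, from the key requirements

ceiling = max_claims_per_sec(pool_size: POOL_SIZE, claim_latency_s: CLAIM_LATENCY_S)
headroom = ceiling / AVERAGE_RATE

puts format("ceiling: %.0f claims/sec, headroom: %.1fx", ceiling, headroom)
```

Because 200ms is a tail latency, typical claims would hold a connection for less time, so this ceiling is a conservative estimate.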
Decision
Implement claims INSIDE ActiveRecord transactions using callbacks. The implementation will:
- Use an `after_save` (officially documented) or `before_commit` (not documented) callback that runs inside the transaction.
- Claim resources after INSERT but before COMMIT (when all IDs are available).
- Batch multiple claims from the same transaction when possible.
- Set a client-side 200ms timeout on Topology Service requests.
- Implement a client-side circuit breaker that trips once Topology Service requests made inside transactions reach `N` claims/sec. `N` should be configurable and take database load into consideration.
- Roll out claims behind a feature flag, progressively increasing the percentage of claims performed while watching for negative database effects. The exact rollout process is yet to be determined and will be part of Adopt GitLab.com to Cell Cluster as Legacy Cell.
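The ordering described above (write rows, send one batched claim, then commit or roll back) can be sketched in plain Ruby without ActiveRecord. Everything here is a hypothetical stand-in: `FakeTopologyClient`, `SketchTransaction`, and `ClaimRejected` are illustrative names, and a real client would rely on its own request deadline rather than `Timeout.timeout`.

```ruby
require "timeout"

# Hypothetical error raised when the Topology Service refuses a claim.
class ClaimRejected < StandardError; end

# Hypothetical in-memory stand-in for the Topology Service client.
# A batch is accepted or rejected as a whole.
class FakeTopologyClient
  attr_reader :requests

  def initialize(already_claimed = [])
    @claimed = already_claimed.dup
    @requests = 0
  end

  def claim_batch(claims)
    @requests += 1
    conflict = claims.find { |c| @claimed.include?(c[:value]) }
    raise ClaimRejected, "already claimed: #{conflict[:value]}" if conflict

    @claimed.concat(claims.map { |c| c[:value] })
  end
end

# Minimal model of a transaction: rows are staged, claims are collected,
# and one batched request is sent after the writes but before commit.
class SketchTransaction
  CLAIM_TIMEOUT_S = 0.2 # client-side timeout from the decision above

  def initialize(client, committed_rows)
    @client = client
    @committed_rows = committed_rows
    @staged_rows = []
    @pending_claims = []
  end

  def insert(row, claims)
    @staged_rows << row
    @pending_claims.concat(claims)
  end

  # Returns true on commit, false when the claim fails and the work is rolled back.
  def commit!
    unless @pending_claims.empty?
      Timeout.timeout(CLAIM_TIMEOUT_S) { @client.claim_batch(@pending_claims) }
    end
    @committed_rows.concat(@staged_rows)
    true
  rescue ClaimRejected, Timeout::Error
    @staged_rows.clear # rollback: nothing reaches committed_rows
    false
  end
end

rows = []
client = FakeTopologyClient.new(["taken@example.com"])

tx = SketchTransaction.new(client, rows)
tx.insert({ id: 1, username: "alice" },
          [{ type: :email, value: "alice@example.com" }, { type: :route, value: "alice" }])
first_ok = tx.commit!   # two claims, one batched request

tx = SketchTransaction.new(client, rows)
tx.insert({ id: 2, username: "bob" }, [{ type: :email, value: "taken@example.com" }])
second_ok = tx.commit!  # claim rejected, row never committed
```

In this sketch the second transaction leaves `rows` untouched, which mirrors the atomic-rollback property the decision relies on.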
Monitoring Requirements:
- Connection pool utilization and wait times.
- Transaction duration for claim-related operations.
- Topology Service request latencies (P50, P99, P99.95).
- Client-side circuit-breaker hits.
- Client-side timeout hits.
- Failed claims and rollback rates from the client and server.
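One way to picture the rate-based circuit breaker and its "hits" metric is a sliding one-second window with a counter. This is only a sketch under assumptions: `ClaimRateBreaker` is a hypothetical name, the window length is fixed at one second, and the class is not thread-safe as written.

```ruby
# Hypothetical client-side breaker: allows at most `max_per_sec` claim
# requests within any one-second sliding window and counts rejections.
# Not thread-safe; a real implementation would need synchronization.
class ClaimRateBreaker
  attr_reader :hits

  def initialize(max_per_sec:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @max_per_sec = max_per_sec # the configurable N
    @clock = clock
    @timestamps = []
    @hits = 0 # would be exported as the circuit-breaker hits metric
  end

  # Returns true if the claim request may proceed, false if the breaker is open.
  def allow?
    now = @clock.call
    @timestamps.reject! { |t| now - t >= 1.0 } # drop entries older than the window
    if @timestamps.size >= @max_per_sec
      @hits += 1
      return false
    end
    @timestamps << now
    true
  end
end

# Usage with an injected clock so the behaviour is deterministic.
now = 0.0
breaker = ClaimRateBreaker.new(max_per_sec: 2, clock: -> { now })
burst = 3.times.map { breaker.allow? } # third request in the same second is refused
now = 1.5
after_window = breaker.allow?          # window has passed, requests flow again
```

Injecting the clock keeps the sketch testable; how `N` would be derived from database load is left open, as in the decision itself.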
Consequences
Positive:
- Simpler implementation: Works within existing Rails patterns and callbacks
- Faster time to production: Minimal architectural changes required
- Atomic rollback: Transaction rolls back cleanly if the claim fails
- No side-effect issue: Claims happen before commit, preventing the situation where we commit to the database before checking with the Topology Service.
- Sufficient capacity: Current infrastructure supports ~290 claims/sec, roughly 6x the current ~50 claims/sec average
- Batching support: Can batch claims within the same transaction after all IDs are available
Negative:
- Connection pool usage: Network calls hold database connections during the claim.
- Transaction duration: Longer transactions (~200ms) can put more load on the database:
  - WAL accumulation.
  - Connection pool exhaustion, since each transaction holds a connection.
- Scalability ceiling: Limited by connection pool size (currently 290 claims/sec).
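The connection cost can be estimated with Little's law: average connections held equals the claim rate multiplied by how long each claim holds a connection. A small sketch, assuming the 200ms worst-case hold time for every claim (the method name is illustrative):

```ruby
# Little's law: average connections held = arrival rate * hold time.
def connections_in_use(claims_per_sec:, hold_time_s:)
  claims_per_sec * hold_time_s
end

pool_size = 58
usage = [50, 290, 300].to_h do |rate|
  [rate, connections_in_use(claims_per_sec: rate, hold_time_s: 0.2)]
end

usage.each do |rate, used|
  puts format("%3d claims/sec -> %.0f of %d connections", rate, used, pool_size)
end
```

Under this worst-case assumption the average rate occupies about 10 connections, while ~300 claims/sec would need about 60, slightly more than the pool holds. Since 200ms is a P99.95 figure, real hold times would usually be shorter, which is one reason the monitoring of pool utilization and wait times matters.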
Alternatives
Option 1: Claims Outside Transactions
Deferred because:
- Massive refactoring required: for example, thousands of `User.create!` calls would need modification, and this would be required for every model that makes claims.
- Breaks Rails conventions of automatic transaction wrapping.
- Higher implementation complexity and error handling.
- Difficult to batch claims when IDs are not yet available.
- Current capacity (6x peak) makes optimization premature.
Option 2: Database Trigger
Rejected because:
- Cannot handle non-database side effects.
- Adds complexity outside the Rails application layer.
- Difficult to test and maintain.
- Verification loop already handles cascade deletion cases.
- Still doesn’t verify whether the resource is claimable cluster-wide.
Option 3: Override ActiveRecord save/update
Rejected because:
- Cannot batch multiple attributes efficiently.
- Triggers one claim per attribute instead of a batched request.
- Doesn’t work with `update_column` and similar methods.
Last modified October 13, 2025: docs(cells): claims inside of database transactions (298cbc71)
