Cells: HTTP Routing Service
This document describes the design goals and architecture of the Routing Service used by Cells. To better understand where the Routing Service fits into the overall architecture, see Infrastructure Architecture.
Goals
The routing layer is meant to offer a consistent user experience where all Cells are presented under a single domain (for example, `gitlab.com`), instead of users having to go to separate domains.
The user will be able to use `https://gitlab.com` to access Cell-enabled GitLab.
Depending on the URL accessed, the request will be transparently proxied to the Cell that can serve that particular information.
For example:
- All requests going to `https://gitlab.com/users/sign_in` are randomly distributed to all Cells.
- All requests going to `https://gitlab.com/gitlab-org/gitlab/-/tree/master` are always directed to Cell 5, for example.
- All requests going to `https://gitlab.com/my-username/my-project` are always directed to Cell 1.
- Technology.
  We decide what technology the routing service is written in. The choice depends on the best performing language, and on the expected way and place of deployment of the routing layer. If the service must be multi-cloud, it might need to be deployed to the CDN provider. In that case, the service needs to be written using a technology compatible with the CDN provider.
- Cell discovery.
  The routing service needs to be able to discover and monitor the health of all Cells.
- Users can use a single domain to interact with many Cells.
  The routing service will intelligently route all requests to Cells by matching the resource being accessed to the Cell that contains its data.
- Router endpoints classification.
  The stateless routing service will fetch and cache information about endpoints from one of the Cells. We need to implement a protocol that will allow us to accurately describe the incoming request (its fingerprint), so it can be classified by one of the Cells, and the results of that can be cached. We also need to implement a mechanism for negative caching and cache eviction.
- GraphQL and other ambiguous endpoints.
  Most endpoints have a unique classification key: the Organization, which directly or indirectly (via a Group or Project) can be used to classify endpoints. Some endpoints are ambiguous in their usage (they don’t encode the classification key), or the classification key is stored deep in the payload. In these cases, we need to decide how to handle endpoints like `/api/graphql`.
- Small.
  The Routing Service is configuration-driven and rules-driven, and does not implement any business logic. The maximum size of the project source code in the initial phase is 1,000 lines, excluding tests. The hard limit exists so that the Routing Service contains no special logic and could be rewritten in any technology in a matter of a few days.
Requirements
Requirement | Description | Priority |
---|---|---|
Discovery | needs to be able to discover and monitor the health of all Cells. | high |
Security | only authorized cells can be routed to | high |
Single domain | for example GitLab.com | high |
Caching | can cache routing information for performance | high |
Low latency | adds less than 50 ms of latency | high |
Path-based | can make routing decision based on path | high |
Complexity | the routing service should be configuration-driven and small | high |
Rolling | the routing service works with Cells running mixed versions | high |
Feature Flags | features can be turned on, off, and % rollout | high |
Progressive Rollout | we can slowly rollout a change | medium |
Stateless | does not need database, Cells provide all routing information | medium |
Secrets-based | can make routing decision based on secret (for example JWT) | medium |
Observability | can use existing observability tooling | low |
Self-managed | can be eventually used by self-managed | low |
Regional | can route requests to different regions | low |
Low Latency
The target latency added by the routing service should be less than 50 ms.
Looking at `urgency: high` requests, we don’t have a lot of headroom on the p50.
Adding an extra 50 ms still allows us to stay within our SLO at the p95 level.
There are three primary entry points for the application: `web`, `api`, and `git`.
Each service is assigned a Service Level Indicator (SLI) based on latency using the apdex standard.
The corresponding Service Level Objectives (SLOs) for these SLIs require low latencies for a large share of requests.
It’s crucial to ensure that adding the routing layer in front of these services does not impact the SLIs.
Because the routing layer is a proxy for these services, and we lack a comprehensive SLI monitoring system for the entire request flow (including components like the Edge network and Load Balancers), we use the SLIs for `web`, `git`, and `api` as a target.
The main SLI we use is rails requests.
It has multiple `satisfied` targets (apdex) depending on the request urgency:
Urgency | Duration in ms |
---|---|
:high | 250 ms |
:medium | 500 ms |
:default | 1000 ms |
:low | 5000 ms |
Analysis
The way we calculate the headroom we have is by using the following:
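Judging by the column names in the tables below, headroom is presumably the target duration minus the latency observed at the given percentile. As a sketch (the symbol names are chosen here for illustration and are an assumption, not taken from the linked analysis):

```latex
\text{Headroom}_{p} = \text{Target Duration} - \text{Latency}_{p}
```

Under that reading, a headroom of 4000 ms against the 5000 ms target at p99 would correspond to an observed p99 latency of about 1000 ms.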
`web`:
Target Duration | Percentile | Headroom |
---|---|---|
5000 ms | p99 | 4000 ms |
5000 ms | p95 | 4500 ms |
5000 ms | p90 | 4600 ms |
5000 ms | p50 | 4900 ms |
1000 ms | p99 | 500 ms |
1000 ms | p95 | 740 ms |
1000 ms | p90 | 840 ms |
1000 ms | p50 | 900 ms |
500 ms | p99 | 0 ms |
500 ms | p95 | 60 ms |
500 ms | p90 | 100 ms |
500 ms | p50 | 400 ms |
250 ms | p99 | 140 ms |
250 ms | p95 | 170 ms |
250 ms | p90 | 180 ms |
250 ms | p50 | 200 ms |
Analysis was done in https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1667993089
`api`:
Target Duration | Percentile | Headroom |
---|---|---|
5000 ms | p99 | 3500 ms |
5000 ms | p95 | 4300 ms |
5000 ms | p90 | 4600 ms |
5000 ms | p50 | 4900 ms |
1000 ms | p99 | 440 ms |
1000 ms | p95 | 750 ms |
1000 ms | p90 | 830 ms |
1000 ms | p50 | 950 ms |
500 ms | p99 | 450 ms |
500 ms | p95 | 480 ms |
500 ms | p90 | 490 ms |
500 ms | p50 | 490 ms |
250 ms | p99 | 90 ms |
250 ms | p95 | 170 ms |
250 ms | p90 | 210 ms |
250 ms | p50 | 230 ms |
Analysis was done in https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1669995479
`git`:
Target Duration | Percentile | Headroom |
---|---|---|
5000 ms | p99 | 3760 ms |
5000 ms | p95 | 4280 ms |
5000 ms | p90 | 4430 ms |
5000 ms | p50 | 4900 ms |
1000 ms | p99 | 500 ms |
1000 ms | p95 | 750 ms |
1000 ms | p90 | 800 ms |
1000 ms | p50 | 900 ms |
500 ms | p99 | 280 ms |
500 ms | p95 | 370 ms |
500 ms | p90 | 400 ms |
500 ms | p50 | 430 ms |
250 ms | p99 | 200 ms |
250 ms | p95 | 230 ms |
250 ms | p90 | 240 ms |
250 ms | p50 | 240 ms |
Analysis was done in https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1671385680
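To make the 50 ms budget concrete, the following TypeScript sketch checks which table entries could not absorb an extra 50 ms. The headroom numbers are copied from the three tables above; the `headroomMs` structure and the `entriesBelowBudget` helper are hypothetical names used only for this illustration.

```typescript
// Headroom in ms, copied from the tables above.
// Each array is ordered [p99, p95, p90, p50].
const percentiles = ["p99", "p95", "p90", "p50"] as const;

const headroomMs: Record<string, Record<number, number[]>> = {
  web: {
    5000: [4000, 4500, 4600, 4900],
    1000: [500, 740, 840, 900],
    500: [0, 60, 100, 400],
    250: [140, 170, 180, 200],
  },
  api: {
    5000: [3500, 4300, 4600, 4900],
    1000: [440, 750, 830, 950],
    500: [450, 480, 490, 490],
    250: [90, 170, 210, 230],
  },
  git: {
    5000: [3760, 4280, 4430, 4900],
    1000: [500, 750, 800, 900],
    500: [280, 370, 400, 430],
    250: [200, 230, 240, 240],
  },
};

interface Entry {
  service: string;
  targetMs: number;
  percentile: string;
  headroom: number;
}

// Hypothetical helper: list every entry whose headroom is smaller than the budget.
function entriesBelowBudget(budgetMs: number): Entry[] {
  const result: Entry[] = [];
  for (const [service, targets] of Object.entries(headroomMs)) {
    for (const [target, values] of Object.entries(targets)) {
      values.forEach((headroom, i) => {
        if (headroom < budgetMs) {
          result.push({ service, targetMs: Number(target), percentile: percentiles[i], headroom });
        }
      });
    }
  }
  return result;
}

const tooTight = entriesBelowBudget(50);
console.log(tooTight);
```

With the numbers above, only the `web` p99 entry for the 500 ms target has less than 50 ms of headroom. Every p95 entry has at least 60 ms, which is consistent with the statement that the SLO still holds at the p95 level.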
Non-Goals
Not yet defined.
Proposal
The Routing Service implements the following design guidelines:
- Simple:
  - Routing service does not buffer requests.
  - Routing service can only proxy to a single Cell based on request headers.
- Stateless:
  - Routing service does not have permanent storage.
  - Routing service uses a multi-level cache: in-memory, external shared cache.
- Zero-trust:
  - Routing service signs each request that is being proxied.
  - Trust is established by using a JWT token or a mutual authentication scheme.
  - Cells can be available over the public internet, as long as they follow the zero-trust model.
- Configuration-based:
  - Routing service is configured with a static list of Cells.
  - Routing service configuration is applied as part of service deployment.
- Rule-based:
  - Routing rules are a static JSON file that is part of the routing service.
  - Configured rules need to be compatible with all versions of GitLab running in a cluster.
  - Rules can match by any criteria: header, content of the header, or route path.
- Agnostic:
  - Routing service is not aware of high-level concepts like organizations.
  - Classification is done according to the specification provided in the rules, to find the classification key.
  - The classification key result is cached.
  - A single cached classification key is used to handle many similar requests.
The following diagram shows how a user request routes through DNS to the Routing Service, deployed as a Cloudflare Worker, and how the router chooses a Cell to send the request to.
graph TD;
    user((User));
    router[Routing Service];
    cell_us0{Cell US0};
    cell_us1{Cell US1};
    cell_eu0{Cell EU0};
    cell_eu1{Cell EU1};
    user-->router;
    router-->cell_eu0;
    router-->cell_eu1;
    router-->cell_us0;
    router-->cell_us1;
    subgraph Europe
        cell_eu0;
        cell_eu1;
    end
    subgraph United States
        cell_us0;
        cell_us1;
    end
Routing rules
- The routing rules describe how to decode the request, find the classification key, and make the routing decision.
- The routing rules are static and defined ahead of time as part of the HTTP Router deployment.
- The routing rules are defined as a JSON document describing an ordered sequence of operations.
- The routing rules might be compiled to application code to provide a much faster execution scheme.
- Each routing rule is described by `cookies`, `headers`, `path`, `method`, and `action`.
- The `action` can be `classify`, which indicates that the Topology Service should be used to perform dynamic classification.
- The `action` can be `proxy`, which indicates a passthrough to the fixed host stored in the HTTP Router configuration, unless a `proxy` address is specified. Usually, that host would be `Cell 1` in a cluster.
The routing rules JSON structure describes all matchers:
{
  "rules": [
    {
      "cookies": {
        "<cookie_name>": {
          "match_regex": "<regex_match>"
        },
        "<cookie_name2>": {
          "match_regex": "<regex_match>"
        }
      },
      "headers": {
        "<header_name>": {
          "match_regex": "<regex_match>"
        },
        "<header_name2>": {
          "match_regex": "<regex_match>"
        }
      },
      "path": {
        "match_regex": "<regex_match>"
      },
      "method": ["<list_of_accepted_methods>"],
      // Either classify dynamically through the Topology Service...
      "action": "classify",
      "classify": {
        "type": "session_prefix|project_path|...",
        "value": "string_build_from_regex_matchers"
      },
      // ...or proxy to a fixed address. A rule uses one of the two actions.
      "action": "proxy",
      "proxy": {
        "address": "cell1.gitlab.com"
      }
    }
  ]
}
Example of routing rules that make routing decisions based on the session cookie and a secret:
{
  "rules": [
    {
      "cookies": {
        "_gitlab_session": {
          "match_regex": "^(?<cell_name>cell.*:)" // accept `_gitlab_session` values that are prefixed with `cell1:`
        }
      },
      "action": "classify",
      "classify": {
        "type": "session_prefix",
        "value": "${cell_name}"
      }
    },
    {
      "headers": {
        "GITLAB_TOKEN": {
          "match_regex": "^(?<cell_name>cell.*:)" // accept `GITLAB_TOKEN` values that are prefixed with `cell1:`
        }
      },
      "action": "classify",
      "classify": {
        "type": "token_prefix",
        "value": "${cell_name}"
      }
    }
  ]
}
Example of routing rules published by all Cells that make routing decisions based on the path:
{
  "rules": [
    {
      "path": {
        "match_regex": "^/api/v4/projects/(?<project_id_or_path_encoded>[^/]+)(/.*)?$"
      },
      "action": "classify",
      "classify": {
        "type": "project_id_or_path",
        "value": "${project_id_or_path_encoded}"
      }
    }
  ]
}
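As a rough illustration of how such rules could be evaluated in order, here is a minimal TypeScript sketch. The `Rule`, `Matcher`, and `IncomingRequest` types and the `evaluateRules` function are hypothetical names that mirror the JSON structure above; they are not the HTTP Router's actual code. The sketch also assumes exact, case-sensitive header and cookie names, which is a simplification.

```typescript
// Hypothetical types mirroring the JSON rule structure shown above.
interface Matcher { match_regex: string }
interface Rule {
  cookies?: Record<string, Matcher>;
  headers?: Record<string, Matcher>;
  path?: Matcher;
  method?: string[];
  action: "classify" | "proxy";
  classify?: { type: string; value?: string };
  proxy?: { address: string };
}
interface IncomingRequest {
  method: string;
  path: string;
  headers: Record<string, string>;
  cookies: Record<string, string>;
}
type Captures = Record<string, string>;

// Returns the named capture groups if `input` matches, otherwise null.
function matchOne(matcher: Matcher, input: string | undefined): Captures | null {
  if (input === undefined) return null;
  const result = new RegExp(matcher.match_regex).exec(input);
  return result === null ? null : { ...(result.groups ?? {}) };
}

// Every named matcher must match; captures are merged into `captures`.
function matchAll(
  matchers: Record<string, Matcher> | undefined,
  values: Record<string, string>,
  captures: Captures,
): boolean {
  for (const [name, matcher] of Object.entries(matchers ?? {})) {
    const found = matchOne(matcher, values[name]);
    if (found === null) return false;
    Object.assign(captures, found);
  }
  return true;
}

// Walks the rules in order and returns the first one that matches,
// with `${name}` placeholders in `classify.value` filled from the captures.
function evaluateRules(rules: Rule[], req: IncomingRequest): { rule: Rule; value?: string } | null {
  for (const rule of rules) {
    const captures: Captures = {};
    if (rule.method && !rule.method.includes(req.method)) continue;
    if (!matchAll(rule.cookies, req.cookies, captures)) continue;
    if (!matchAll(rule.headers, req.headers, captures)) continue;
    if (rule.path) {
      const found = matchOne(rule.path, req.path);
      if (found === null) continue;
      Object.assign(captures, found);
    }
    const value = rule.classify?.value?.replace(
      /\$\{(\w+)\}/g,
      (_match: string, name: string) => captures[name] ?? "",
    );
    return { rule, value };
  }
  return null;
}

// The path-based rule from the example above.
const pathRules: Rule[] = [
  {
    path: { match_regex: "^/api/v4/projects/(?<project_id_or_path_encoded>[^/]+)(/.*)?$" },
    action: "classify",
    classify: { type: "project_id_or_path", value: "${project_id_or_path_encoded}" },
  },
];

const decision = evaluateRules(pathRules, {
  method: "GET",
  path: "/api/v4/projects/1000/issues",
  headers: {},
  cookies: {},
});
console.log(decision?.rule.classify?.type, decision?.value);
```

For the request to `/api/v4/projects/1000/issues`, the sketch selects the path rule and resolves the classification value to `1000`. A request that matches no rule returns `null`, which a real router would need to handle with a default rule.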
Classification
The classification is implemented by the Classify Service of the Topology Service.
- The classification endpoint uses REST, with mTLS to secure access.
- The classification endpoint returns only the name of the Cell to which the request should be routed.
- The classification can return other equivalent classification keys to pre-populate the cache for similar requests. This ensures that similar requests can be handled quickly without having to classify each time.
- The HTTP Router retries the `classify` call for a reasonable amount of time.
- The classification for a given value is cached regardless of the returned response (positive or negative). A rejected classification is cached to prevent an excessive number of requests for classification keys that are not found.
- The response contains `Cache-*` headers that control how long the results are cached (see Cloudflare Workers - Cache).
- If a cache is used, `Cache-Tag:` is required, as a mechanism to selectively wipe a particular type of cache on the edge.
- The cache is controlled by the Topology Service, but the HTTP Router might force some responses into the cache.
For the above example:
- The router sees a request to `/api/v4/projects/1000/issues`.
- It selects the above `rule` for this request, which requests `classify` for `project_id_or_path_encoded`.
- It decodes `project_id_or_path_encoded` to be `1000`.
- It checks the cache for whether `project_id_or_path_encoded=1000` is associated with any Cell.
- It sends the request to `/api/v1/classify` (`type=project_id_or_path`, `value=1000`) if no Cell was found in the cache.
- The Topology Service responds with the Cell holding the given project, and also with all other equivalent classification keys for the resource that should be put in the cache.
- The Routing Service caches the result for the duration specified in the configuration or in the response.
# POST /api/v1/classify
## Request:
{
  "type": "project_id_or_path",
  "value": 1000
}
## Response:
{
  "action": "proxy",
  "proxy": {
    "address": "cell1.gitlab.com"
  },
  "other_classifications": [ // list of all equivalent keys that should be put in the cache
    { "type": "session_prefix", "value": "cell1" },
    { "type": "project_full_path", "value": "gitlab-org/gitlab" },
    { "type": "namespace_full_path", "value": "gitlab-org" }
  ]
}
The following code represents a negative response when a classification key was not found:
# POST /api/v1/classify
## Request:
{
  "type": "project_id_or_path",
  "value": 1000
}
## Response:
{
  "action": "reject",
  "reject": {
    "http_status": 404
  }
}
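The caching behavior described in this section (positive and negative responses are both cached, and equivalent keys are stored alongside the original one) can be sketched as a small in-memory cache. This is a minimal TypeScript illustration under stated assumptions: the `ClassificationCache` class, its method names, and the single fixed TTL are hypothetical. In the design above, the cache duration comes from the configuration or from the `Cache-*` response headers instead.

```typescript
// Hypothetical shapes based on the request and response examples above.
interface ClassificationKey { type: string; value: string | number }
type ClassifyResponse =
  | { action: "proxy"; proxy: { address: string }; other_classifications?: ClassificationKey[] }
  | { action: "reject"; reject: { http_status: number } };

interface CacheEntry { response: ClassifyResponse; expiresAt: number }

class ClassificationCache {
  private entries = new Map<string, CacheEntry>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so that expiry can be exercised without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  private static keyOf(key: ClassificationKey): string {
    return `${key.type}=${key.value}`;
  }

  get(key: ClassificationKey): ClassifyResponse | undefined {
    const id = ClassificationCache.keyOf(key);
    const entry = this.entries.get(id);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(id); // expired: drop it and report a miss
      return undefined;
    }
    return entry.response;
  }

  // Stores both positive and negative responses. For a positive response,
  // the equivalent keys are stored too, so similar requests hit the cache.
  put(key: ClassificationKey, response: ClassifyResponse): void {
    const entry: CacheEntry = { response, expiresAt: this.now() + this.ttlMs };
    this.entries.set(ClassificationCache.keyOf(key), entry);
    if (response.action === "proxy") {
      for (const other of response.other_classifications ?? []) {
        this.entries.set(ClassificationCache.keyOf(other), entry);
      }
    }
  }
}

let fakeNow = 0;
const cache = new ClassificationCache(60_000, () => fakeNow);

cache.put(
  { type: "project_id_or_path", value: 1000 },
  {
    action: "proxy",
    proxy: { address: "cell1.gitlab.com" },
    other_classifications: [{ type: "namespace_full_path", value: "gitlab-org" }],
  },
);
cache.put(
  { type: "project_id_or_path", value: 404404 },
  { action: "reject", reject: { http_status: 404 } },
);

const viaEquivalentKey = cache.get({ type: "namespace_full_path", value: "gitlab-org" });
const negative = cache.get({ type: "project_id_or_path", value: 404404 });
console.log(viaEquivalentKey, negative);
```

In this sketch, a lookup by the equivalent `namespace_full_path` key is served from the cache without a second classification, and the rejected key is served from the cache as well until its entry expires.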
Configuration
All configuration will be provided via environment variables:
- The HTTP Router will only be configured with the address of the Topology Service.
- mTLS will be used for authentication and authorization when connecting to the Topology Service.
Deployment
There are several phases to fully deploy the HTTP Routing service to GitLab.com.
- The first phase is to deploy a simple pass-through proxy in front of the webservice (`gitlab.com`).
  - First, we will use Cloudflare Routes to roll out the worker gradually, without the need to change DNS.
  - (Maybe optional) The next step is to provision an internal-only DNS for the legacy cell (for example, `cell-1.gprd.int.gitlab.com`). We then proxy the HTTP router to this new DNS, and secure this connection with a solution like `mTLS` or Cloudflare Tunnel. To do this, the HTTP Router will need to be assigned the `gitlab.com` DNS record, likely with custom domains.
- The second phase is to deploy a simple pass-through proxy in front of the container registry (`registry.gitlab.com`). This will use the same deployment of the HTTP Router as for `gitlab.com`.
  - First, we will use Cloudflare Routes to roll out the worker gradually, without the need to change DNS.
  - (Maybe optional) The next step is to provision an internal-only DNS for the legacy cell (for example, `cell-1-registry.gprd.int.gitlab.com`). We then proxy the HTTP router to this DNS, and secure this connection with a solution like `mTLS` or Cloudflare Tunnel. To do this, the HTTP Router will need to be assigned the `registry.gitlab.com` DNS record, likely with custom domains.
- The third phase involves multiple cells.
  - For any new cell the HTTP Router routes to, the cell will have:
    - An internal-only DNS, like `cell-2.gdrd.int.gitlab.com`, that is only accessible via the HTTP Router.
    - A secure, encrypted connection between the HTTP Router and the cell.
Rolling Out Rule Sets
HTTP Router rule sets define how HTTP requests are routed within the Cells environment.
Modifying these rule sets can potentially impact the availability of the entire site or the SLO of any specific service.
Therefore, it is crucial to exercise extreme caution when rolling out changes to the rule sets.
To implement these changes with minimal user impact and zero downtime, we will use the Gradual deployments functionality provided by Cloudflare. This approach allows us to limit the impact of any faulty changes to a small subset of total requests.
The rule set in use is configured in the HTTP Router configuration file, where the `GITLAB_RULES_CONFIG` environment variable defines the name of the rule set file relative to the `src/rules` directory.
We will use the existing deployment mechanism. We will gradually increase the rollout percentage, proceeding only when we are confident in the quality and expected outcomes of the rule set changes. The following sequence of rollout percentages is recommended: 5% → 25% → 50% → 75% → 100%.
Prerequisites
- Before proceeding with the rollout steps, make sure you have clearly defined the timeline.
- Schedule the change.
- Add a new Change Lock entry to the configuration file. Use the `http-router` Change Lock tag for this entry.
Note: It is important for this rollout strategy to follow the timeline. You will need to merge MRs at set intervals, so it’s recommended to work in pairs.
Rollout steps
1. Create an MR to modify the CI configuration of the HTTP Router Deployer, `.gitlab-ci.yml`. In the global variables section, set both the `CHANGE_LOCK_OVERRIDE` and `OVERRIDE_LAST_PERCENTAGE` environment variables to `true`, linking to a change management issue.
2. In the same MR, change the `ROLLOUT_PERCENTAGES` environment variable in the deploy-worker.sh script. Set the value to `5`. Example: `ROLLOUT_PERCENTAGES="5"`
3. Merge the MR.
4. Create and merge an MR to update the `GITLAB_RULES_CONFIG` setting in `wrangler.toml` to the new rule set.
5. Validate the new rule set and verify that no SLO was affected.
6. Before increasing `ROLLOUT_PERCENTAGES`, allow some baking time, which can vary depending on the environment.
7. If no anomalies are found and there is no impact on SLOs, repeat step 1 for `25`, `50`, `75`, and `100` percent. Keep `CHANGE_LOCK_OVERRIDE` and `OVERRIDE_LAST_PERCENTAGE` set to `true` through the entire rollout cycle.
8. Once 100% of traffic is rolled out, open an MR on the deploy-worker.sh script to set the value back to the full sequence `"5 25 50 75 100"`. Example: `ROLLOUT_PERCENTAGES="5 25 50 75 100"`. Remove the `OVERRIDE_LAST_PERCENTAGE` and `CHANGE_LOCK_OVERRIDE` environment variables in `.gitlab-ci.yml`.
Request flows
- There are two Cells.
- `gitlab-org` is a top-level namespace and lives in `Cell US0` in the `GitLab.com Public` organization.
- `my-company` is a top-level namespace and lives in `Cell EU0` in the `my-organization` organization.
The router is configured to perform the following routing:
- Cell US0 supports all other public-facing projects.
- Cell EU0 is configured to generate all secrets and session cookies with a prefix like `cell_eu0_`.
  - The Personal Access Token is scoped to an Organization, and because the Organization is part of only a single Cell, the generated PATs are prefixed with the Cell identifier.
  - The session cookie encodes the Organization in use, and because the Organization is part of only a single Cell, the generated session cookie is prefixed with the Cell identifier.
- Cell EU0 allows only private organizations, groups, and projects.
- Cell US0 is the target Cell for all requests unless explicitly prefixed.
Router rules:
{
  "rules": [
    {
      "cookies": {
        "_gitlab_session": {
          "match_regex": "^(?<cell_name>cell.*:)"
        }
      },
      "action": "classify",
      "classify": {
        "type": "session_prefix",
        "value": "${cell_name}"
      }
    },
    {
      "headers": {
        "GITLAB_TOKEN": {
          "match_regex": "^(?<cell_name>cell.*-)"
        }
      },
      "action": "classify",
      "classify": {
        "type": "token_prefix",
        "value": "${cell_name}"
      }
    },
    {
      "action": "classify",
      "classify": {
        "type": "first_cell"
      }
    }
  ]
}
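The effect of these three rules on the flows below can be sketched in TypeScript. Note the assumptions: the flows use a cookie such as `cell_eu0_uwwz7rdavil9` and classify with `value=cell_eu0`, so this sketch assumes an underscore-separated prefix of the form `cell_<id>_`. The `CELL_PREFIX` pattern and the `classificationFor` function are hypothetical and differ from the `:` and `-` separators written in the rules above.

```typescript
// Hypothetical result type: which classification the router would request.
interface ClassifyRequest {
  type: "session_prefix" | "token_prefix" | "first_cell";
  value?: string;
}

// Assumed prefix format `cell_<id>_`, as in the cookie `cell_eu0_uwwz7rdavil9`
// used in the flows below. This pattern is an assumption made for this sketch.
const CELL_PREFIX = /^(?<cell_name>cell_[a-z0-9]+)_/;

function classificationFor(sessionCookie?: string, token?: string): ClassifyRequest {
  // Rule 1: a session cookie with a Cell prefix.
  const fromSession = sessionCookie ? CELL_PREFIX.exec(sessionCookie) : null;
  if (fromSession?.groups) {
    return { type: "session_prefix", value: fromSession.groups.cell_name };
  }
  // Rule 2: a token with a Cell prefix.
  const fromToken = token ? CELL_PREFIX.exec(token) : null;
  if (fromToken?.groups) {
    return { type: "token_prefix", value: fromToken.groups.cell_name };
  }
  // Rule 3: no recognizable prefix, so fall through to the last rule.
  return { type: "first_cell" };
}

const loggedIn = classificationFor("cell_eu0_uwwz7rdavil9");
const anonymous = classificationFor();
console.log(loggedIn, anonymous);
```

Under these assumptions, a request carrying the prefixed session cookie is classified as `session_prefix` with value `cell_eu0`, while a request without a cookie or token falls through to `first_cell`.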
Goes to `/my-company/my-project` while logged in to Cell EU0
- Because the user switched the Organization to `my-company`, their session cookie is prefixed with `cell_eu0_`.
- The user sends a request to `/my-company/my-project`, and because the cookie is prefixed with `cell_eu0_`, it is directed to Cell EU0.
- `Cell EU0` returns the correct response.
sequenceDiagram
    participant user as User
    participant router as Router
    participant cache as Cache
    participant ts as Topology Service
    participant cell_eu0 as Cell EU0
    participant cell_eu1 as Cell EU1
    user->>router: GET /my-company/my-project<br/>_gitlab_session=cell_eu0_uwwz7rdavil9
    router->>+cache: GetClassify(type=session_prefix, value=cell_eu0)
    cache->>-router: NotFound
    router->>+ts: Classify(type=session_prefix, value=cell_eu0)
    ts->>-router: Proxy(address="cell-eu0.gitlab.com")
    router->>cache: Cache(type=session_prefix, value=cell_eu0) = Proxy(address="cell-eu0.gitlab.com")
    router->>cell_eu0: GET /my-company/my-project
    cell_eu0->>user: <h1>My Project...
Goes to `/my-company/my-project` while not logged in
- The user visits `/my-company/my-project`, and because the request has no session cookie, it is forwarded to `Cell US0`.
- The user signs in.
- GitLab sees that the user’s default organization is `my-company`, so it assigns a session cookie prefixed with `cell_eu0_` to indicate that the user is meant to interact with `my-company`.
- The user sends a request to `/my-company/my-project` again, now with the session cookie that proxies to `Cell EU0`.
- `Cell EU0` returns the correct response.
NOTE:
The `cache` is intentionally skipped here to reduce diagram complexity.
sequenceDiagram
    participant user as User
    participant router as Router
    participant ts as Topology Service
    participant cell_us0 as Cell US0
    participant cell_eu0 as Cell EU0
    user->>router: GET /my-company/my-project
    router->>ts: Classify(type=first_cell)
    ts->>router: Proxy(address="cell-us0.gitlab.com")
    router->>cell_us0: GET /my-company/my-project
    cell_us0->>user: HTTP 302 /users/sign_in?redirect=/my-company/my-project
    user->>router: GET /users/sign_in?redirect=/my-company/my-project
    router->>cell_us0: GET /users/sign_in?redirect=/my-company/my-project
    cell_us0-->>user: <h1>Sign in...
    user->>router: POST /users/sign_in?redirect=/my-company/my-project
    router->>cell_us0: POST /users/sign_in?redirect=/my-company/my-project
    cell_us0->>user: HTTP 302 /my-company/my-project<br/>_gitlab_session=cell_eu0_uwwz7rdavil9
    user->>router: GET /my-company/my-project<br/>_gitlab_session=cell_eu0_uwwz7rdavil9
    router->>ts: Classify(type=session_prefix, value=cell_eu0)
    ts->>router: Proxy(address="cell-eu0.gitlab.com")
    router->>cell_eu0: GET /my-company/my-project<br/>_gitlab_session=cell_eu0_uwwz7rdavil9
    cell_eu0->>user: <h1>My Project...
Goes to `/gitlab-org/gitlab` after last step
The user visits `/gitlab-org/gitlab`, and because the request has a session cookie, it is forwarded to `Cell EU0`.
There is no need to ask the Topology Service, since the classification for the session cookie prefix is already cached.
sequenceDiagram
    participant user as User
    participant router as Router
    participant cache as Cache
    participant ts as Topology Service
    participant cell_eu0 as Cell EU0
    participant cell_eu1 as Cell EU1
    user->>router: GET /my-company/my-project<br/>_gitlab_session=cell_eu0_uwwz7rdavil9
    router->>+cache: GetClassify(type=session_prefix, value=cell_eu0)
    cache->>-router: Proxy(address="cell-eu0.gitlab.com")
    router->>cell_eu0: GET /my-company/my-project
    cell_eu0->>user: <h1>My Project...
Performance and reliability considerations
- A penalty is expected when learning a new classification key. However, the multi-layer cache should provide a very high cache hit ratio, due to the low cardinality of classification keys. A classification key effectively maps to a resource (organization, group, or project), and there is a finite number of those.
Alternatives
Buffering requests
The Stateless Router using Requests Buffering describes an approach where a Cell answers with `X-Gitlab-Cell-Redirect` to redirect the request to another Cell:
- This is based on a need to buffer the whole request (headers + body) which is very memory intensive.
- This proposal does not provide an easy way to handle mixed deployment of Cells, where Cells might be running different versions.
- This proposal likely requires caching significantly more information, since it is based on requests, rather than on decoded classification keys.
Learn request
The Stateless Router using Routes Learning describes an approach similar to the one in this document, except that the route rules and classification are done in a single step, in the form of a pre-flight check to `/api/v4/internal/cells/learn`:
- This makes the whole routes learning dynamic, and dependent on availability of the Cells.
- This proposal does not provide an easy way to handle mixed deployment of Cells, where Cells might be running different versions.
- This proposal likely requires caching significantly more information, since it is based on requests, rather than on decoded classification keys.
FAQ
- How and when will the Routing Service compile the set of rules?
To be defined.