GitLab Observability in GitLab.com and Self-Managed GitLab Instances
Status | Authors | Coach | DRIs | Owning Stage | Created |
---|---|---|---|---|---|
ongoing |
mappelman
|
sguyn
nklick
mkaeppler
|
devops monitor | 2023-09-11 |
Summary
GitLab observability will first be available to self-managed instances through Cloud Connector so self-managed users can take advantage of GitLab Observability without the need to manage an observability platform. This document outlines the architecture.
In addition to making GitLab observability available to self-managed instances through Cloud Connector, GitLab observability available in GitLab.com will also realign architecture to utilize Cloud Connector.
The following architectural overview can be applied to both self-managed GitLab instances and .com GitLab instances.
Goals
- Offer seamless Observability features to self-managed users without requiring them to manage a scalable and reliable observability system.
- Move Observability APIs to GitLabs project resource API.
- Maintain API consistency between .com and self-managed GitLab
Architecture
@startuml
node "GitLab Inc. Infrastructure" {
package "Private Services" {
[GitLab Observability Backend (GOB)] as GOB #yellow
}
package "Public Services" {
[GitLab (SaaS)] as SAAS
[Cloud Connector gateway] as CC
[Customers Portal] as CDOT
}
}
node "Customer Infrastructure" {
[OpenTelemetry SDK/Collector] as OT
[GitLab (Self-Managed)] as SM
[Sidekiq] as SK
}
OT --> SM : "send"
SAAS -[#blue]-> CC : " access GOB via CC"
CC -[#blue]-> GOB: " access GOB"
GOB --> SAAS : "fetch JWKS"
GOB --> CDOT : "fetch JWKS"
SM -[#blue]-> CC : "access GOB via CC"
SK --> CDOT : "sync CC access token"
note bottom of CC
Makes routing decisions:
cloud.gitlab.com/observe/* ---> GOB
...
end note
note top of GOB
- Use JWKS from SaaS to auth SaaS requests
- Use JWKS from Customers Portal to auth
Self-Managed GitLab requests
...
end note
@enduml
GitLab Observability Backend (GOB) deployments will be managed by GitLab Inc. as part of the GitLab Inc. infrastructure. GitLab.com and all self-managed GitLab instances will initially use the same GOB backend that exists today. However, in the future, we could also deploy regional GOB instances, or even customer-specific GOB instances if the need arises, and route requests to the desired GOB instance.
Blue arrows highlight the request path from both GitLab SaaS and GitLab self-managed instances to GOB. All requests pass through the Cloud Connector Gateway public service and on to the private GOB service. GitLab SaaS cells will all have access to the public Cloud Connector Gateway.
The Cloud Connector Gateway will be a single entry point Load balancer.
Detailed request architecture is outlined in the next section.
Walkthrough of Observability Requests
The following two diagrams highlight the request flow for GitLab.com customers and for self-managed GitLab customers. A couple small differences exist between the two:
- The JWT sent to GOB in GitLab.com requests is created in Rails, whereas the JWT or otherwise referred to as IJWT in the self-managed flow originates from CustomersDot.
- There is no CustomersDot JWT sync in the GitLab.com flow.
Observability offered in GitLab.com
sequenceDiagram autonumber participant U as GitLab.com user participant SDK as OpenTelemetry <br/> SDK/Collector participant .comUI as GitLab.com (UI) participant .comW as GitLab.com (WorkHorse) participant .comR as GitLab.com (Rails) participant GOB as GOB SaaS behind cloud.gitlab.com <br/>(GitLab Observability Backend) Note over U,GOB: Configure SDK/Collector U->>.comR: create Group/Project PAT .comR-->>U: PAT U->>SDK: configure with PAT & Project endpoint Note over U,GOB: Sending OpenTelemetry Data loop Requests issued by SDK/Collector SDK->>.comW: send opentelemetry data with PAT .comW->>.comR: preAuthHandler without body .comR->>.comR: auth with PAT .comR-->>.comW: metadata + JWT .comW->>GOB: send data with JWT + metadata headers GOB->>GOB: validate JWT & store data GOB-->>.comW: response .comW-->>SDK: response end Note over U,GOB: GitLab UI U->>.comUI: view .comUI->>.comW: request with cookies .comW->>.comR: preAuthHandler forward cookies .comR->>.comR: auth with cookies .comR-->>.comW: metadata + JWT .comW->>GOB: forward with JWT + metadata headers GOB->>GOB: validate JWT & return results GOB-->>.comW: response .comW-->>.comUI: response Note over U,GOB: Direct to API U->>.comR: create Group/Project PAT .comR-->>U: PAT U->>.comW: request with PAT .comW->>.comR: preAuthHandler .comR->>.comR: auth with PAT .comR-->>.comW: metadata + JWT .comW->>GOB: forward with JWT + metadata headers GOB->>GOB: validate JWT & return results GOB-->>.comW: response .comW-->>U: response
Observability offered in Self-Managed GitLab
sequenceDiagram autonumber participant U as SM user participant SDK as OpenTelemetry <br/> SDK/Collector participant SMUI as SM GitLab (UI) participant SMW as SM GitLab (WorkHorse) participant SMR as SM GitLab (Rails) participant CD as CustomersDot participant GOB as GOB SaaS behind cloud.gitlab.com <br/>(GitLab Observability Backend) Note over U,GOB: instance-to-instance auth loop chron job SMR->>CD: sync Cloud Connector access data Note over CD,SMR: Includes IJWT scoped to eligible services CD-->>SMR: access data + IJWT SMR->>SMR: store access data + IJWT end Note over U,GOB: Configure SDK/Collector U->>SMR: create Group/Project PAT SMR-->>U: PAT U->>SDK: configure with PAT & Project endpoint Note over U,GOB: Sending OpenTelemetry Data loop Requests issued by SDK/Collector SDK->>SMW: send opentelemetry data with PAT SMW->>SMR: preAuthHandler without body SMR->>SMR: auth with PAT SMR-->>SMW: metadata + IJWT SMW->>GOB: send data with IJWT + metadata headers GOB->>GOB: validate IJWT & store data GOB-->>SMW: response SMW-->>SDK: response end Note over U,GOB: GitLab UI U->>SMUI: view SMUI->>SMW: request with cookies SMW->>SMR: preAuthHandler forward cookies SMR->>SMR: auth with cookies SMR-->>SMW: metadata + IJWT SMW->>GOB: forward with IJWT + metadata headers GOB->>GOB: validate IJWT GOB-->>SMW: response SMW-->>SMUI: response Note over U,GOB: Direct to API U->>SMR: create Group/Project PAT SMR-->>U: PAT U->>SMW: request with PAT SMW->>SMR: preAuthHandler SMR->>SMR: auth with PAT SMR-->>SMW: metadata + IJWT SMW->>GOB: forward with IJWT + metadata headers GOB->>GOB: validate IJWT GOB-->>SMW: response SMW-->>U: response
Performance
It is critically important to keep the body of any observability requests from burdening Rails/Puma. All preAuthHandlers in Workhorse will ensure the body is not forwarded to Rails and will only be forwarded to GOB when auth is successful.
If we consider the daily data transmitted and stored concerning our observability of GitLab.com, and we extrapolate that across our self-managed customers who might have equally demanding needs, we can get an idea of how much data will pass through Workhorse and Cloud Connector. GitLab.com produces 150M+ metrics series, sampled every 30-60 seconds, and 18-22 TB of logs per day.
We could assume that any Ultimate tier root-level namespace on GitLab.com or any Ultimate tier self-managed instance could send a similar magnitude of data through cloud.GitLab.com
.
To get an idea of what this equals to in requests/second and bytes/second, we walk through the following basic example.
Assumptions:
- Compression: use this chart that illustrates compression ratios for observability data types
- An example metric is 396 Bytes raw and translates to a
md_metric_request
in the compression table linked above:{__name__="cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits", cluster="play-db-cluster", container="k6-grpc", instance="10.4.3.6:8080", job="integrations/Kubernetes/kube-state-metrics", namespace="play-backends", node="gke-raintank-dev-pla-raintank-dev-pla-3cd3aafc-hijt", pod="k6-grpc-5c74969fdc-6n2bn", resource="cpu", uid="56e073b7-ca28-4cf4-b3da-83dc97763bef", unit="core"}
- Trace volume can easily exceed log volume, with similar context attached to each span as would be found in a log line. Tracing is more similar to debug logging, so sampling is used to reduce the overall volume to reasonable levels. Let’s assume for argument’s sake that tracing volume after sampling is equal to logging volume.
- For compression ratios, assume
md_log_request
,md_trace_request
andmd_metric_request
Single large Ultimate tier Organization (via GitLab.com or self-managed):
-
Metrics volume: 150M active metrics series sampled every 30 seconds
- Samples/sec = 150M / 30 = 5M
- Bytes/sec = 5M * 396B / 2.21 compression ratio = 896MB/s
-
Logs volume: 18TB/day
- Bytes/sec uncompressed = 18e+12 / ( 24 x 60 x 60 ) = 208MB/s
- Bytes/sec compressed = 208 / 1.35 compression ratio = 154Mb/s
-
Traces volume: from assumptions above, assume to be the same as log volume
- Bytes/sec compressed = 208 / 1.57 compression ratio = 132Mb/s
Treating this as an upper bound assumption that the total possible demand from an Ultimate tier customer could be 896 + 154 + 132 = 1.2GB/s for data ingestion. Data ingestion demands are always magnitudes higher than reading back observability data, so we can disregard read load for the sake of this back-of-napkin math.
It is also useful to consider the average requests per second that deliver the 1.2GB/s. This will allow us to infer the number of preAuthHandler requests against Rails.
The OpenTelemetry Batch Processor is the recommended way to send metrics, logs, and traces in configurable-sized batches. The default batch size is 8192 events.
If we take the average compressed size of md_log_request
, md_trace_request
and md_metric_request
from the Opentelemetry compression comparison and the
Metric, Log, Trace Volume outlined above we can compute the average request/second using the default batch size.
-
Metrics req/s:
- 896MB per sec / 145 compressed bytes per event = 6.2M events/sec
- 6.2M events per sec / 8192 events per batch = 760 reqs/sec
-
Logs req/s:
- 154MB per sec / 268 compressed bytes per event = 0.57M events/sec
- 0.57M events per sec / 8192 events per batch = 70 reqs/sec
-
Traces req/s:
- 132MB per sec / 288 compressed bytes per event = 0.46M events/sec
- 0.46M events per sec / 8192 events per batch = 56 reqs/sec
Total requests per second using default batching is 890 requests/second for each large customer with 150M active metrics series, 18TB of daily logs, and 18TB of daily traces.
Observability for GitLab.com
For each customer of this magnitude on GitLab.com, there would be a demand of 1.2GB/s through Workhorse.
To maximize performance and GitLab service availability, a dedicated web fleet should be created for observability just like the recent addition for code suggestions. All incoming
request paths starting with /api/v4/projects/:id/observability/
can be routed to the dedicated observability web fleet in .com.
This mitigates the risk of observability requests saturating the main GitLab web fleet and allows for horizontal autoscaling of the observability fleet based on demand.
Observability for Self-Managed
The self-managed Workhorse throughput for this magnitude of data would be 1.2GB/s. Self-managed customers will have access to the appropriate configuration in our reference architecture documentation.
Authentication & Authorization
As outlined in the request flow, there are two main request legs for any interaction with GOB.
- User/Client to their GitLab instance
- GitLab instance to GOB
For all leg 1 requests, authentication and authorization will be enforced by leveraging a preAuthorizeHandler in Workhorse. This handler will call an internal Rails API to perform both authN and authZ using either the supplied PAT or browser cookies. The internal Rails APIs will first authenticate the request to ensure it was issued by Workhorse using the Workhorse/Rails JWT and shared signing secret. If successful, the endpoint will then perform the appropriate auth checks against the supplied PAT/browser cookies. If successful, Rails will return in the response back to Workhorse, all the headers that need to be sent to GOB including the IJWT (if instance is self-managed) and any metadata, in a similar manner to Workhorse/Rails websocket auth-data.
Workhorse will then forward the request to GOB with either the IJWT (self-managed) or the Workhorse JWT (SaaS). GOB will then verify the JWT to authenticate the request.
Both Rails and Workhorse use the same secret to sign JWTs so the JWKS available at the jwks_uri
found at https://gitlab.com/.well-known/openid-configuration
can be used to verify the identity of the JWT created from Rails or Workhorse, meaning requests to GOB could also be sent from Rails. For the IJWTs, the JWKS can be retrieved in accordance with this.
Enabling
Self-Managed users will enable the use of Cloud Connected GitLab Observability through the Customers Portal. All relevant licenses and subscription info will then be encoded into the IJWT which Workhorse, Rails, and GOB will use to manage access and enforce limits and quotas.
Rate-limits and Quota Management
GitLab Observability Backend already has rate limiting on its API gateway to prevent denial of service attacks.
In addition to these existing rate limits, we will also add rate limiting to Workhorse/Rails in the same idiomatic manner to which other GitLab APIs manage rate limits to ensure rate limits are enforced at the edge, mitigating any DOS attacks at the edge before calling on more downstream services.
GitLab Observability Quota management is enforced in GOB. GOB configures compute quota for each top-level namespace, depending on the customer’s license. In addition that, it will enforce storage quota appropriately based on the customer license by tracking storage and purging old data to make space for new data, ensuring storage is always kept below an upper bound for each top-level namespace. Eventually, this quota management will be mapped to the GitLab Organization construct instead of top-level namespace.
To ensure rate limiting and quota management in the GOB service is enforced correctly, GOB will use the IJWT token to extract relevant information about the customer license. Metadata headers will also be sent as part of the request leg from Workhorse to GOB that provides any additional information required by GOB to fulfill requests and enforce quotas and limits.
APIs
All observability APIs will proxy through Workhorse and will live under GitLabs project resource. HTTP will be supported initially and gRPC will be added at a later date.
System Monitoring
GitLab Observability Backend already has system monitoring in place that plugs into GitLabs centralized metrics and logging system. Additional metrics will also be captured in Workhorse and in Rails to provide visibility and alerting on the user-facing API.
e47101dc
)