GitLab Observability in GitLab.com and Self-Managed GitLab Instances

This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. The development, release, and timing of any products, features, or functionality may be subject to change or delay and remain at the sole discretion of GitLab Inc.
Status Authors Coach DRIs Owning Stage Created
ongoing mappelman sguyn nklick mkaeppler devops monitor 2023-09-11

Summary

GitLab observability will first be available to self-managed instances through Cloud Connector so self-managed users can take advantage of GitLab Observability without the need to manage an observability platform. This document outlines the architecture.

In addition to making GitLab Observability available to self-managed instances through Cloud Connector, the GitLab Observability offering on GitLab.com will also be realigned to use Cloud Connector.

The following architectural overview applies to both self-managed GitLab instances and GitLab.com.

Goals

  • Offer seamless Observability features to self-managed users without requiring them to manage a scalable and reliable observability system.
  • Move Observability APIs to GitLab's project resource API.
  • Maintain API consistency between GitLab.com and self-managed GitLab.

Architecture

GitLab Observability Backend (GOB) deployments will be managed by GitLab Inc. as part of the GitLab Inc. infrastructure. GitLab.com and all self-managed GitLab instances will initially use the same GOB backend that exists today. However, in the future, we could also deploy regional GOB instances, or even customer-specific GOB instances if the need arises, and route requests to the desired GOB instance.

Blue arrows highlight the request path from both GitLab SaaS and GitLab self-managed instances to GOB. All requests pass through the Cloud Connector Gateway public service and on to the private GOB service. GitLab SaaS cells will all have access to the public Cloud Connector Gateway.

The Cloud Connector Gateway will be a single-entry-point load balancer.

Detailed request architecture is outlined in the next section.

Walkthrough of Observability Requests

The following two diagrams highlight the request flow for GitLab.com customers and for self-managed GitLab customers. A couple of small differences exist between the two:

  1. The JWT sent to GOB in GitLab.com requests is created in Rails, whereas the JWT in the self-managed flow, referred to as the IJWT, originates from CustomersDot.
  2. There is no CustomersDot JWT sync in the GitLab.com flow.

Observability offered in GitLab.com

Participants: GitLab.com user, OpenTelemetry SDK/Collector, GitLab.com (UI), GitLab.com (Workhorse), GitLab.com (Rails), GOB SaaS behind cloud.gitlab.com (GitLab Observability Backend).

Configure SDK/Collector:

  1. create Group/Project PAT
  2. PAT
  3. configure with PAT & Project endpoint

Sending OpenTelemetry Data (loop: requests issued by SDK/Collector):

  4. send OpenTelemetry data with PAT
  5. preAuthHandler without body
  6. auth with PAT
  7. metadata + JWT
  8. send data with JWT + metadata headers
  9. validate JWT & store data
  10. response
  11. response

GitLab UI:

  12. view
  13. request with cookies
  14. preAuthHandler forward cookies
  15. auth with cookies
  16. metadata + JWT
  17. forward with JWT + metadata headers
  18. validate JWT & return results
  19. response
  20. response

Direct to API:

  21. create Group/Project PAT
  22. PAT
  23. request with PAT
  24. preAuthHandler
  25. auth with PAT
  26. metadata + JWT
  27. forward with JWT + metadata headers
  28. validate JWT & return results
  29. response
  30. response

Observability offered in Self-Managed GitLab

Participants: SM user, OpenTelemetry SDK/Collector, SM GitLab (UI), SM GitLab (Workhorse), SM GitLab (Rails), CustomersDot, GOB SaaS behind cloud.gitlab.com (GitLab Observability Backend).

Cron job loop (instance-to-instance auth; includes IJWT scoped to eligible services):

  1. sync Cloud Connector access data
  2. access data + IJWT
  3. store access data + IJWT

Configure SDK/Collector:

  4. create Group/Project PAT
  5. PAT
  6. configure with PAT & Project endpoint

Sending OpenTelemetry Data (loop: requests issued by SDK/Collector):

  7. send OpenTelemetry data with PAT
  8. preAuthHandler without body
  9. auth with PAT
  10. metadata + IJWT
  11. send data with IJWT + metadata headers
  12. validate IJWT & store data
  13. response
  14. response

GitLab UI:

  15. view
  16. request with cookies
  17. preAuthHandler forward cookies
  18. auth with cookies
  19. metadata + IJWT
  20. forward with IJWT + metadata headers
  21. validate IJWT
  22. response
  23. response

Direct to API:

  24. create Group/Project PAT
  25. PAT
  26. request with PAT
  27. preAuthHandler
  28. auth with PAT
  29. metadata + IJWT
  30. forward with IJWT + metadata headers
  31. validate IJWT
  32. response
  33. response

Performance

It is critically important to keep the body of any observability request from burdening Rails/Puma. All preAuthHandlers in Workhorse will ensure that the request body is never forwarded to Rails and is forwarded to GOB only when auth is successful.
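The pre-authorization flow described above can be sketched roughly as follows. This is a minimal illustration in Python, not the Workhorse implementation (Workhorse itself is written in Go). The names `AuthResult`, `handle_observability_request`, `rails_pre_authorize`, and `forward_to_gob` are hypothetical, and the 401 status for a failed auth is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AuthResult:
    ok: bool
    gob_headers: Dict[str, str]  # headers Rails returns for GOB (JWT/IJWT + metadata)


def handle_observability_request(
    headers: Dict[str, str],
    body: bytes,
    rails_pre_authorize: Callable[[Dict[str, str]], AuthResult],
    forward_to_gob: Callable[[Dict[str, str], bytes], int],
) -> int:
    """Return an HTTP status code for an incoming observability request."""
    # Only the headers (PAT or cookies) are sent to Rails; the body never is.
    result = rails_pre_authorize(headers)
    if not result.ok:
        # Assumed status code: the body is dropped without reaching Rails or GOB.
        return 401
    # On success, the body goes to GOB together with the headers Rails returned.
    return forward_to_gob(result.gob_headers, body)
```

The point of the design is that Rails only ever sees a small, body-less request per call, regardless of how large the telemetry payload is.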

If we consider the daily data transmitted and stored concerning our observability of GitLab.com, and we extrapolate that across our self-managed customers who might have equally demanding needs, we can get an idea of how much data will pass through Workhorse and Cloud Connector. GitLab.com produces 150M+ metrics series, sampled every 30-60 seconds, and 18-22 TB of logs per day.

We could assume that any Ultimate tier root-level namespace on GitLab.com or any Ultimate tier self-managed instance could send a similar magnitude of data through cloud.gitlab.com.

To get an idea of what this equates to in requests/second and bytes/second, we walk through the following basic example.

Assumptions:

  • Compression: use this chart that illustrates compression ratios for observability data types
  • An example metric is 396 Bytes raw and translates to a md_metric_request in the compression table linked above:
    • {__name__="cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits", cluster="play-db-cluster", container="k6-grpc", instance="10.4.3.6:8080", job="integrations/Kubernetes/kube-state-metrics", namespace="play-backends", node="gke-raintank-dev-pla-raintank-dev-pla-3cd3aafc-hijt", pod="k6-grpc-5c74969fdc-6n2bn", resource="cpu", uid="56e073b7-ca28-4cf4-b3da-83dc97763bef", unit="core"}
  • Trace volume can easily exceed log volume, with similar context attached to each span as would be found in a log line. Tracing is more similar to debug logging, so sampling is used to reduce the overall volume to reasonable levels. Let’s assume for argument’s sake that tracing volume after sampling is equal to logging volume.
  • For compression ratios, assume md_log_request, md_trace_request and md_metric_request

Single large Ultimate tier Organization (via GitLab.com or self-managed):

  • Metrics volume: 150M active metrics series sampled every 30 seconds

    • Samples/sec = 150M / 30 = 5M
    • Bytes/sec = 5M * 396B / 2.21 compression ratio = 896MB/s
  • Logs volume: 18TB/day

    • Bytes/sec uncompressed = 18e+12 / ( 24 x 60 x 60 ) = 208MB/s
    • Bytes/sec compressed = 208 / 1.35 compression ratio = 154MB/s
  • Traces volume: from assumptions above, assume to be the same as log volume

    • Bytes/sec compressed = 208 / 1.57 compression ratio = 132MB/s

Treating this as an upper bound, the total possible demand from a single Ultimate tier customer could be 896 + 154 + 132 = 1182MB/s, or roughly 1.2GB/s, for data ingestion. Data ingestion demands are typically orders of magnitude higher than the load from reading back observability data, so we can disregard read load for the sake of this back-of-napkin math.
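The throughput estimate above can be reproduced with a few lines of Python. This is only a sketch of the same back-of-napkin arithmetic; the helper name `compressed_rate` is illustrative, and because it avoids intermediate rounding, individual figures may differ from the rounded values in the text by about 1MB/s.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def compressed_rate(raw_bytes_per_sec: float, compression_ratio: float) -> float:
    """Bytes/sec on the wire after applying a compression ratio."""
    return raw_bytes_per_sec / compression_ratio


# Metrics: 150M active series sampled every 30 seconds, 396 raw bytes per sample.
metric_samples_per_sec = 150e6 / 30
metrics = compressed_rate(metric_samples_per_sec * 396, 2.21)

# Logs: 18TB/day; traces are assumed equal to log volume before compression.
log_raw = 18e12 / SECONDS_PER_DAY
logs = compressed_rate(log_raw, 1.35)
traces = compressed_rate(log_raw, 1.57)

total = metrics + logs + traces

for name, value in [("metrics", metrics), ("logs", logs), ("traces", traces)]:
    print(f"{name}: {value / 1e6:.0f} MB/s")
print(f"total: {total / 1e9:.2f} GB/s")
```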

It is also useful to consider the average requests per second that deliver the 1.2GB/s. This will allow us to infer the number of preAuthHandler requests against Rails.

The OpenTelemetry Batch Processor is the recommended way to send metrics, logs, and traces in configurable-sized batches. The default batch size is 8192 events. If we take the average compressed size of md_log_request, md_trace_request and md_metric_request from the OpenTelemetry compression comparison and the metric, log, and trace volumes outlined above, we can compute the average requests/second using the default batch size.

  • Metrics req/s:

    • 896MB per sec / 145 compressed bytes per event = 6.2M events/sec
    • 6.2M events per sec / 8192 events per batch = 760 reqs/sec
  • Logs req/s:

    • 154MB per sec / 268 compressed bytes per event = 0.57M events/sec
    • 0.57M events per sec / 8192 events per batch = 70 reqs/sec
  • Traces req/s:

    • 132MB per sec / 288 compressed bytes per event = 0.46M events/sec
    • 0.46M events per sec / 8192 events per batch = 56 reqs/sec

The total using default batching is approximately 890 requests/second (760 + 70 + 56 = 886) for each large customer with 150M active metrics series, 18TB of daily logs, and 18TB of daily traces.
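The request-rate estimate can be checked the same way. The sketch below uses the rounded byte rates and per-event compressed sizes quoted above; `requests_per_sec` is an illustrative helper name. Without the intermediate rounding used in the text, the total comes out slightly lower than 890, which does not change the order of magnitude.

```python
DEFAULT_BATCH_SIZE = 8192  # batch size quoted in the text above


def requests_per_sec(
    bytes_per_sec: float,
    compressed_bytes_per_event: float,
    batch_size: int = DEFAULT_BATCH_SIZE,
) -> float:
    """Average batched requests/sec needed to carry the given byte rate."""
    events_per_sec = bytes_per_sec / compressed_bytes_per_event
    return events_per_sec / batch_size


rates = {
    "metrics": requests_per_sec(896e6, 145),
    "logs": requests_per_sec(154e6, 268),
    "traces": requests_per_sec(132e6, 288),
}
total_rps = sum(rates.values())

for name, value in rates.items():
    print(f"{name}: {value:.0f} reqs/sec")
print(f"total: {total_rps:.0f} reqs/sec")
```

Each of these requests triggers one body-less preAuthHandler call against Rails, so the total is also an estimate of the added auth request rate per large customer.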

Observability for GitLab.com

For each customer of this magnitude on GitLab.com, there would be a demand of 1.2GB/s through Workhorse.

To maximize performance and GitLab service availability, a dedicated web fleet should be created for observability, just like the recent addition for code suggestions. All incoming request paths starting with /api/v4/projects/:id/observability/ can be routed to the dedicated observability web fleet on GitLab.com. This mitigates the risk of observability requests saturating the main GitLab web fleet and allows for horizontal autoscaling of the observability fleet based on demand.
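As a rough illustration of the path-based routing rule, the sketch below classifies a request path by prefix. In practice this decision would live in load balancer or ingress configuration rather than application code; the fleet names `"observability"` and `"main"` and the function `select_fleet` are hypothetical.

```python
import re

# Matches /api/v4/projects/:id/observability/..., where :id is a single path
# segment (a numeric ID or a URL-encoded project path).
OBSERVABILITY_PATH = re.compile(r"^/api/v4/projects/[^/]+/observability/")


def select_fleet(path: str) -> str:
    """Pick the web fleet that should serve the given request path."""
    return "observability" if OBSERVABILITY_PATH.match(path) else "main"
```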

Observability for Self-Managed

The self-managed Workhorse throughput for this magnitude of data would be 1.2GB/s. Self-managed customers will have access to the appropriate configuration in our reference architecture documentation.

Authentication & Authorization

As outlined in the request flow, there are two main request legs for any interaction with GOB.

  1. User/Client to their GitLab instance
  2. GitLab instance to GOB

For all leg 1 requests, authentication and authorization will be enforced by a preAuthorizeHandler in Workhorse. This handler will call an internal Rails API to perform both authN and authZ using either the supplied PAT or browser cookies. The internal Rails API will first authenticate the request, using the Workhorse/Rails JWT and shared signing secret, to ensure it was issued by Workhorse. If that succeeds, the endpoint will then perform the appropriate auth checks against the supplied PAT or browser cookies. If those checks also succeed, Rails will respond to Workhorse with all the headers that need to be sent to GOB, including the IJWT (if the instance is self-managed) and any metadata, in a similar manner to Workhorse/Rails websocket auth-data.
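To make the first check concrete, the following sketch signs and verifies a JWT with a shared secret using only the Python standard library. It assumes an HMAC-SHA256 (HS256) signature, which is the usual choice for shared-secret JWTs; the claim names, secret value, and the functions `sign_jwt` and `verify_jwt` are illustrative and not taken from the GitLab codebase.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(claims: dict, secret: bytes) -> str:
    """Create an HS256-signed JWT (assumed algorithm) for the given claims."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_jwt(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if the signature and expiry are valid, else raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" in claims and claims["exp"] < current:
        raise ValueError("token expired")
    return claims
```

In this sketch, the caller would sign a short-lived token for each internal request and the receiving endpoint would call `verify_jwt` before running any PAT or cookie checks.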

Workhorse will then forward the request to GOB with either the IJWT (self-managed) or the Workhorse JWT (SaaS). GOB will then verify the JWT to authenticate the request.

Both Rails and Workhorse use the same secret to sign JWTs, so the JWKS available at the jwks_uri found at https://gitlab.com/.well-known/openid-configuration can be used to verify the identity of a JWT created by either Rails or Workhorse. This means requests to GOB could also be sent from Rails. For the IJWTs, the JWKS can be retrieved in accordance with this.

Enabling

Self-Managed users will enable the use of Cloud Connected GitLab Observability through the Customers Portal. All relevant licenses and subscription info will then be encoded into the IJWT which Workhorse, Rails, and GOB will use to manage access and enforce limits and quotas.

Rate-limits and Quota Management

GitLab Observability Backend already has rate limiting on its API gateway to prevent denial of service attacks.

In addition to these existing rate limits, we will add rate limiting to Workhorse/Rails in the same idiomatic manner in which other GitLab APIs manage rate limits. This ensures rate limits are enforced at the edge, mitigating DoS attacks before more downstream services are called.
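The document does not specify the rate-limiting algorithm, so the following is only a generic sketch of what enforcing a per-key limit at the edge can look like. It uses a fixed-window counter; the class name `FixedWindowRateLimiter`, the key format, and the limits are all assumptions, and a production limiter would use shared storage and expire old windows.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per key in each fixed time window."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # Counts per (key, window index). Old windows are never cleaned up here.
        self._counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, key: str, now: float) -> bool:
        """Record a request for `key` at time `now`; return False if over the limit."""
        window = int(now // self.window_seconds)
        bucket = (key, window)
        if self._counts[bucket] >= self.limit:
            return False
        self._counts[bucket] += 1
        return True
```

A rejected request would be answered at the edge without calling Rails auth or GOB, which is the mitigation the paragraph above describes.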

GitLab Observability quota management is enforced in GOB. GOB configures compute quota for each top-level namespace, depending on the customer's license. In addition to that, it will enforce storage quota based on the customer license by tracking storage and purging old data to make space for new data, ensuring storage is always kept below an upper bound for each top-level namespace. Eventually, this quota management will be mapped to the GitLab Organization construct instead of the top-level namespace.

To ensure rate limiting and quota management in the GOB service are enforced correctly, GOB will use the IJWT to extract relevant information about the customer license. Metadata headers will also be sent as part of the request leg from Workhorse to GOB to provide any additional information GOB requires to fulfill requests and enforce quotas and limits.
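The storage quota behavior described above (purge the oldest data so that usage stays below an upper bound) can be sketched as follows. This is an in-memory illustration only, assuming data arrives in timestamp order; `NamespaceStorage` and its fields are hypothetical and do not reflect how GOB actually stores or purges data.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Tuple


@dataclass
class NamespaceStorage:
    """Tracks stored bytes for one top-level namespace against its quota."""

    quota_bytes: int
    used_bytes: int = 0
    # (timestamp, size) pairs, oldest first.
    chunks: Deque[Tuple[float, int]] = field(default_factory=deque)

    def store(self, timestamp: float, size: int) -> int:
        """Store a chunk, purging the oldest data first; return bytes purged."""
        if size > self.quota_bytes:
            raise ValueError("chunk larger than the namespace quota")
        purged = 0
        # Drop the oldest chunks until the new one fits under the quota.
        while self.used_bytes + size > self.quota_bytes:
            _, old_size = self.chunks.popleft()
            self.used_bytes -= old_size
            purged += old_size
        self.chunks.append((timestamp, size))
        self.used_bytes += size
        return purged
```

For example, with a 100-byte quota, storing chunks of 60, 30, and then 50 bytes purges the 60-byte chunk on the third call and leaves 80 bytes in use.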

APIs

All observability APIs will proxy through Workhorse and will live under GitLab's project resource. HTTP will be supported initially and gRPC will be added at a later date.

System Monitoring

GitLab Observability Backend already has system monitoring in place that plugs into GitLab's centralized metrics and logging system. Additional metrics will also be captured in Workhorse and in Rails to provide visibility and alerting on the user-facing API.

Last modified August 23, 2024: Ensure frontmatter is consistent (e47101dc)