Cloud Connector architecture evolution
This page contains information related to upcoming products, features, and functionality.
It is important to note that the information presented is for informational purposes only.
Please do not rely on this information for purchasing or planning purposes.
The development, release, and timing of any products, features, or functionality may be
subject to change or delay and remain at the sole discretion of GitLab Inc.
Summary
This design doc covers architectural decisions and proposed changes to
Cloud Connector’s technical foundations.
Refer to the official architecture documentation
for an accurate description of the current status.
Motivation
Our “big problem to solve” is to bring feature parity to our SaaS and self-managed offerings.
Until now, SaaS and self-managed (SM) GitLab instances have consumed features only from the
AI gateway,
which also implements an Access Layer
to verify that a given request is allowed
to access the respective AI feature endpoint.
This approach has served us well because it:
- Required minimal changes from an architectural standpoint to allow SM users to consume AI features hosted by us.
- Caused minimal friction with ongoing development on GitLab.com.
- Reduced time to market.
However, the AI gateway alone does not sufficiently abstract over a wider variety of features,
as by definition it is designed to serve AI features only.
Goals
We will use this blueprint to make incremental changes to Cloud Connector’s technical framework
to enable other backend services to service self-managed/GitLab Dedicated customers in the same way
the AI gateway does today. This will directly support our mission of bringing feature parity
to all GitLab customers.
The major areas we are focused on are:
- Provide a single access point for customers.
We found that customers are not keen on configuring their web proxies and firewalls
to allow outbound traffic to an ever-growing list of GitLab-hosted services. We therefore decided to
install a global, load-balanced entry point at cloud.gitlab.com. This entry point can make simple
routing decisions based on the requested path, which allows us to target different backend services
as we broaden the feature scope covered by Cloud Connector.
- Status: done. The decision was documented as ADR-001.
- Remove OIDC key discovery.
The original architecture for Cloud Connector relied heavily on OIDC discovery to fetch JWT validation keys.
OIDC discovery is prone to networking and caching problems and adds complexity to solve a problem we don’t have.
Our proposed alternative to OIDC discovery is to package the public keys used for token validation from our well-known token issuers with Cloud Connector backends directly instead of fetching them over the network.
- Rate-limiting features.
During periods of elevated traffic, backends integrated with Cloud Connector such as
AI gateway or TanuKey may experience resource constraints. GitLab should apply a consistent strategy when deciding which instance
should be prioritized over others. This strategy should be uniform across all Cloud Connector services.
- Extract Cloud Connector unit_primitive configuration and logic.
We will implement a new unit primitive-based configuration system by extracting it to an external library (gitlab-cloud-connector) that will serve as the Single Source of Truth (SSoT).
This library will be available as both a Ruby gem and a Python package. The decision was documented as ADR-003.
Decisions
Context
Problems with the current Unit Primitives configuration
Service abstraction
The service abstraction was introduced to simplify client permission checks and centralize permission logic. By organizing features around “services,” clients could determine if a user had access to a specific interface (for example, opening the GitLab Duo Chat window) without needing to understand underlying unit primitives or configurations.
Solved problems
Permission checks for UI elements
- Problem: Clients needed to check if the user had access to a UI element, not just the underlying feature.
- Solution: The service abstraction allowed clients to perform permission checks based on services like duo_chat or code_suggestions, without needing to delve into unit primitives.
Shielding clients from internal changes
- Problem: Internal backend changes, such as splitting or deprecating unit primitives, could disrupt client functionality.
- Solution: The service abstraction hid these internal changes from clients, ensuring that such changes didn’t affect the interface presented to them.
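The two solved problems above can be illustrated with a minimal sketch. It assumes that a service is a named group of unit primitives and that access to the service is granted when at least one of its unit primitives is granted. The mapping, the unit primitive names, and the `service_allowed` helper are hypothetical and are not taken from the Cloud Connector code base.

```python
# Hypothetical mapping from service name to the unit primitives behind it.
# The real definitions live in GitLab's own configuration, not here.
SERVICE_UNIT_PRIMITIVES = {
    "duo_chat": {"explain_code", "documentation_search"},
    "code_suggestions": {"complete_code"},
}


def service_allowed(service: str, granted_unit_primitives: set) -> bool:
    """Return True if any unit primitive behind `service` has been granted.

    Assumption: "at least one granted unit primitive" is enough to show the
    UI element. The source does not specify the exact rule.
    """
    required = SERVICE_UNIT_PRIMITIVES.get(service, set())
    return bool(required & granted_unit_primitives)


# A client only asks about the service, never about unit primitives.
show_chat_window = service_allowed("duo_chat", {"explain_code"})
```

Because clients call only `service_allowed("duo_chat", ...)`, splitting or renaming a unit primitive changes the mapping but not the client-facing check.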
Current problems
Despite its initial benefits, the services abstraction now presents several challenges.
Context
The original iteration of the blueprint suggested standing up a dedicated Cloud Connector edge service,
through which all traffic that uses features under the Cloud Connector umbrella would pass.
The primary reasons we wanted this to be a dedicated service were to:
- Provide a single entry point for customers. We identified the ability for any GitLab instance
around the world to consume Cloud Connector features through a single endpoint such as
cloud.gitlab.com as a must-have property.
- Have the ability to execute custom logic. There was a desire from product to create a space where we can
run cross-cutting business logic such as application-level rate limiting, which is hard or impossible to
do using a traditional load balancer such as HAProxy.
Decision
We decided to take a smaller incremental step toward having a “smart router” by focusing on
the ability to provide a single endpoint through which Cloud Connector traffic enters our
infrastructure. This can be accomplished using simpler means than deploying dedicated services, specifically
by pulling in a load balancing layer listening at cloud.gitlab.com that can also perform simple routing
tasks to forward traffic into feature backends.
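As a rough illustration of the "simple routing tasks" mentioned above, the following sketch picks a backend by longest matching path prefix. The path prefixes, the backend names, and the `pick_backend` function are assumptions made for this example. The actual routing rules are configured in the load balancing layer and are not described in this document.

```python
from typing import Optional

# Hypothetical path prefixes and backend names, for illustration only.
ROUTES = {
    "/ai/": "ai-gateway",
    "/other-feature/": "other-backend",
}


def pick_backend(path: str) -> Optional[str]:
    """Return the backend for the longest matching path prefix, or None."""
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    if not matches:
        return None  # No route: the entry point would reject the request.
    return ROUTES[max(matches, key=len)]
```

Adding a new feature backend in this model means adding one prefix entry, while customers keep allowing outbound traffic to the same single host.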
Context
We are exploring approaches to move away from OIDC discovery for Cloud Connector token validation for the following reasons:
- OIDC discovery is prone to networking and caching problems. If the endpoint or system is degraded or caches are stale, Cloud Connector backends cannot validate any AI requests anymore. With an increasing number of Cloud Connector backends, this issue is multiplied. An example is issue 480018 (internal for security reasons).
- OIDC adds complexity to solve a problem we don’t have. It primarily solves the problem of third parties that don’t know or control each other exchanging identity and key information through a standardized web interface. This is not necessary for Cloud Connector, because all systems involved that dispense or authenticate tokens are built, operated, and controlled by GitLab.
- OIDC discovery requires a callback to gitlab.com from all Cloud Connector backends. This means that to support Cells, where customers can reside in different Cells, the application secret holding these keys must either be shared across all Cells, or be sharded with the request routed accordingly so that the backend obtains the right key set. We can eliminate this problem entirely by simply not having gitlab.com publish these keys through OIDC endpoints and removing this callback. See issue 451149 for more information.
Our proposed alternative to OIDC discovery is to package the public keys used for token validation from our well-known token issuers with Cloud Connector backends directly instead of fetching them over the network, which currently requires 4 network calls to succeed whenever a server process expires its key cache.
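A minimal sketch of the proposed alternative, assuming the backend ships with an in-process table of public keys indexed by issuer and key ID. The `BUNDLED_KEYS` table, its placeholder contents, and the `find_validation_key` helper are hypothetical. The sketch only selects the key without any network call. Verifying the token signature with that key is left to a JWT library and is not shown.

```python
import base64
import json

# Hypothetical keys packaged with the backend at build time.
# Keyed by (issuer, key ID); values are placeholders, not real keys.
BUNDLED_KEYS = {
    ("https://issuer.example", "key-1"): "PEM-ENCODED-PUBLIC-KEY-PLACEHOLDER",
}


def _b64url_decode(segment: str) -> bytes:
    """Decode a base64url JWT segment, restoring stripped padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def find_validation_key(token: str) -> str:
    """Pick the bundled public key for a JWT without any network call.

    The header and claims are read unverified, only to choose a key.
    The caller must still verify the signature before trusting the token.
    """
    header_b64, payload_b64, _signature = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    claims = json.loads(_b64url_decode(payload_b64))
    key = BUNDLED_KEYS.get((claims.get("iss"), header.get("kid")))
    if key is None:
        raise KeyError("no bundled key for this issuer and key ID")
    return key
```

In this model a degraded discovery endpoint or a stale cache cannot block validation, at the cost of shipping a new backend release when keys change.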
Last modified November 22, 2024: Update status of ADRs (d625ac32)