# Mutual authentication between Cell services
Status | Authors | Coach | DRIs | Owning Stage | Created |
---|---|---|---|---|---|
proposed | daveyleach, tkhandelwal3 | sxuereb | andrewn, glopezfernandez, daveyleach, tkhandelwal3 | devops tenant scale | 2024-07-01 |
## Pre-Reads
## Summary

We require that all communication between Cell services is secure and that both parties’ identities are verified.

## Goals
- Ensure that all communication between Cell services is encrypted, uniquely identified, authenticated, and authorized on both sides.
- Leverage the existing PKI Interface that is used for Internal TLS.
- Provide a clear implementation path for both server and client services.
- Enable secure service-to-service authentication without introducing unnecessary complexity.
- Support Authorization based on mTLS certificates where appropriate.
## Scope

This document focuses specifically on implementing mutual TLS (mTLS) for authentication and authorization between a Cell and its services.

### Out of Scope
- Implementation of a service mesh (e.g., Istio) for encrypting traffic via mTLS is out of scope for the following reasons:
  - We already leverage TLS inside a Cell using Internal TLS, so extending this existing blueprint to support external services is more consistent with our architecture.
  - While a service mesh offers application developers a transparent way to implement mTLS with external services, it introduces significant security risks when used as the sole authorization mechanism without additional application-level controls (such as JWTs). If a vulnerability exists, attackers could exploit it to use the client as a proxy for unauthorized requests to the mTLS server. The fundamental issue is that service meshes authorize communications based solely on service identity rather than validating the legitimacy of individual requests, making them inadequate as a complete authorization solution in the absence of request-level validation.
- While TLS is used to secure communications between CDNs/load balancers and their backends, as well as between internal services, this scope explicitly excludes:
  - Communication from external clients to GitLab services.
  - Communication between services inside a Cell and services outside a Cell that are not managed Cell services.
## Implementation Principles
- Developer Experience: mTLS implementation should be transparent to developers with minimal code changes.
- Security: Only authorized services should be able to communicate with one another.
- High Availability: Certificate rotation must occur automatically without service interruption.
- Scalability: Solution must work for all Cell services.
## Requirements

Requirement | Description | Priority |
---|---|---|
Security | Only authorized services can communicate with one another | high |
High availability | Certificates can be rotated automatically without service interruption | high |
Cells support | Can be used by all Cell services | high |
Authorization support | Application Developers can use mTLS for authorization | high |
Provision certificates within seconds | We can create a new certificate in seconds | high |
Multiple Protocol support | Support HTTP/1.1, HTTP/2, gRPC | high |
Internal Traffic | All traffic between client and server services must remain within the cloud provider’s network | high |
Gradual Adoption | Initially accept traffic without a client certificate, to support gradual rollout | med |
Auditable | Can validate all services are using secure authentication via mTLS | med |
Cloud-managed | Can be integrated with cloud services | med |
## Non-Goals

- mTLS should not be used for managing user-level authorization.

## Design and implementation details

### mTLS Architecture
```mermaid
sequenceDiagram
    participant client as Client
    participant server as Server
    client->>server: Client connects to server
    server->>client: Server presents TLS certificate
    client->>client: Client verifies server's certificate
    client->>server: Client presents TLS certificate
    server->>server: Server validates TLS certificate
    client->>server: Communicates over encrypted TLS connection
    server->>client: Communicates over encrypted TLS connection
```
### mTLS Implementation Flow

As described in the Internal TLS Blueprint, we obtain end-entity client certificates for the GKE cluster from GCP Secret Manager.
#### Server/Producer Configuration
The external service host requires these changes:
- Deploy the service behind an Internal Load Balancer to ensure the service is not publicly accessible
- Configure mTLS Client Authentication on the Load Balancer
- Upload the Private Root CA Certificate to the Trust Config to enforce authenticated access only
- Configure Private Service Connect with the Load Balancer as a backend
- Set up the permissions for the Cell’s GCP project to connect to the Private Service Connect endpoint of the Server.
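In our architecture, the client-certificate validation above is performed by the GCP Internal Load Balancer against the Trust Config, so the backend never terminates mTLS itself. For local testing, a Go service can enforce the equivalent behavior directly; the sketch below is a minimal example under that assumption, with hypothetical certificate mount paths.

```go
// Minimal sketch of server-side mTLS enforcement. In this architecture the GCP
// Internal Load Balancer performs this validation via its Trust Config; a Go
// service would only do this itself when terminating mTLS directly (e.g., in
// local testing). File paths are hypothetical.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	// Trust only clients whose certificates chain to the Private Root CA.
	caPEM, err := os.ReadFile("/etc/tls/ca.crt") // hypothetical mount path
	if err != nil {
		log.Fatalf("failed to read CA certificate: %v", err)
	}
	clientCAs := x509.NewCertPool()
	if !clientCAs.AppendCertsFromPEM(caPEM) {
		log.Fatal("failed to add CA certificate to pool")
	}

	server := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			ClientCAs: clientCAs,
			// Reject any connection without a valid client certificate.
			ClientAuth: tls.RequireAndVerifyClientCert,
		},
	}
	// The server certificate/key pair is provisioned by the Internal TLS PKI.
	log.Fatal(server.ListenAndServeTLS("/etc/tls/tls.crt", "/etc/tls/tls.key"))
}
```

For the gradual-adoption requirement listed above, `tls.VerifyClientCertIfGiven` could replace `tls.RequireAndVerifyClientCert`: it accepts connections without a client certificate while still validating any certificate that is presented.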
#### Client/Consumer Configuration
The service consuming the external API requires these changes:
- Connect to the Private Service Connect endpoint using the VPC where the client service is deployed
- Mount the certificate/key pair in the client application
- Update client code to establish mTLS connections through the Private Service Connect endpoint
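The Authentication section below shows the gRPC connection flow from the POC. Since the requirements also call for HTTP/1.1 and HTTP/2, the following sketch applies the same pattern with `net/http`; it reuses the certificate environment variables from the POC, while the request path is hypothetical.

```go
// Sketch of an HTTP/1.1 and HTTP/2 client using the same certificate mounts as
// the gRPC example shown later. The request path is hypothetical; the
// certificate and key paths come from the same environment variables used in
// the POC.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func newMTLSClient() (*http.Client, error) {
	caPEM, err := os.ReadFile(os.Getenv("SERVER_CA_CERT"))
	if err != nil {
		return nil, err
	}
	rootCAs := x509.NewCertPool()
	rootCAs.AppendCertsFromPEM(caPEM)

	// Client certificate/key pair mounted into the pod by the PKI tooling.
	cert, err := tls.LoadX509KeyPair(os.Getenv("MTLS_CERT_CHAIN"), os.Getenv("MTLS_KEY"))
	if err != nil {
		return nil, err
	}

	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{cert},
				RootCAs:      rootCAs,
			},
			// net/http only negotiates HTTP/2 over a custom TLS config when
			// ForceAttemptHTTP2 is set.
			ForceAttemptHTTP2: true,
		},
	}, nil
}

func main() {
	client, err := newMTLSClient()
	if err != nil {
		log.Fatalf("failed to build mTLS client: %v", err)
	}
	// Hypothetical path on the Private Service Connect-backed endpoint.
	resp, err := client.Get("https://topology-service.gitlab.net/readiness")
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()
	log.Printf("status: %s", resp.Status)
}
```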
The diagram below illustrates the complete request flow between a Pod in a Cell and an external service, including the supporting infrastructure:
For detailed implementation examples and proof-of-concept documentation of this architecture, refer to: https://gitlab.com/gitlab-org/gitlab/-/issues/468640.
### Authentication and Authorization with mTLS

#### Authentication

mTLS authentication in our Cell services architecture works through explicit certificate loading and connection setup rather than transparent proxying:
- Certificate Loading: Each service must explicitly load its client certificate and private key from the filesystem. This is done in trusted code paths, as shown in the mTLS POC client code.
- Connection Establishment: The service must explicitly add the TLS credentials to outgoing requests, as demonstrated in the request creation code.
- Certificate Validation: The GCP Load Balancer validates the client’s certificate against the trusted CA, ensuring only services with valid certificates can connect.
Example of loading and using TLS credentials in a Go client:
```go
// Imports required by this excerpt:
import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"log"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

tlsCredentials, err := loadTLSCredentials()
if err != nil {
	log.Fatalf("Failed to load TLS credentials: %v", err)
}

// Create a connection with the TLS credentials
conn, err := grpc.Dial(serverAddr, grpc.WithTransportCredentials(tlsCredentials))
...

// loadTLSCredentials loads TLS credentials from file paths provided in environment variables
func loadTLSCredentials() (credentials.TransportCredentials, error) {
	// Get paths to TLS files from environment variables
	serverCACertPath := os.Getenv("SERVER_CA_CERT")
	if serverCACertPath == "" {
		return nil, fmt.Errorf("SERVER_CA_CERT environment variable is not set")
	}

	clientCertPath := os.Getenv("MTLS_CERT_CHAIN")
	if clientCertPath == "" {
		return nil, fmt.Errorf("MTLS_CERT_CHAIN environment variable is not set")
	}

	clientKeyPath := os.Getenv("MTLS_KEY")
	if clientKeyPath == "" {
		return nil, fmt.Errorf("MTLS_KEY environment variable is not set")
	}

	// Load the server CA certificate into a pool so the client can verify the server
	serverCACertBytes, err := os.ReadFile(serverCACertPath)
	if err != nil {
		return nil, fmt.Errorf("failed to read server CA certificate file: %v", err)
	}
	certPool := x509.NewCertPool()
	if !certPool.AppendCertsFromPEM(serverCACertBytes) {
		return nil, fmt.Errorf("failed to add server CA's certificate to pool")
	}

	// Load the client certificate and key presented to the server
	clientCert, err := tls.LoadX509KeyPair(clientCertPath, clientKeyPath)
	if err != nil {
		return nil, fmt.Errorf("failed to load client certificate and key: %v", err)
	}

	// Create the credentials and return them
	config := &tls.Config{
		Certificates: []tls.Certificate{clientCert},
		RootCAs:      certPool,
	}
	return credentials.NewTLS(config), nil
}
```
Source: mTLS POC client code
#### Authorization

Authorization in our mTLS implementation occurs after successful authentication and relies on client identity information:
- Certificate-Based Identity: After authenticating the client connection, the server extracts identity information from the client’s certificate for authorization decisions.
- GCP Load Balancer Headers: We leverage custom mTLS headers passed by the GCP Load Balancer to the backend service, which contain pre-extracted certificate information.
- Header Processing: The server extracts these headers from incoming requests to determine the client’s identity and permissions without needing to re-parse the certificate.
- Access Control Enforcement: Based on the extracted identity (typically the Common Name), the server determines whether the client is authorized to access the requested resource.
Example of extracting and using certificate information for authorization in a gRPC server:
```go
md, ok := metadata.FromIncomingContext(ctx)
if !ok {
	return nil, status.Error(codes.Internal, "failed to get metadata")
}

// Get the value of X-Client-Cert-Subject-Dn
// Header keys in gRPC metadata are lowercase
clientCertDNs := md.Get("x-client-cert-subject-dn")

var cellName string
if len(clientCertDNs) > 0 {
	// Call the function to extract common name
	commonName, err := extractCommonNameFromSubjectDN(clientCertDNs[0])
	...
}
```
Function to extract the Common Name from the Subject DN:
```go
// Imports required by this function:
import (
	"crypto/x509/pkix"
	"encoding/asn1"
	"encoding/base64"
	"fmt"
)

// extractCommonNameFromSubjectDN decodes the base64, DER-encoded Subject DN
// passed by the load balancer and returns its Common Name.
func extractCommonNameFromSubjectDN(base64SubjectDN string) (string, error) {
	// Decode base64
	derBytes, err := base64.StdEncoding.DecodeString(base64SubjectDN)
	if err != nil {
		return "", fmt.Errorf("failed to decode base64: %w", err)
	}

	// Parse the DER-encoded subject DN
	var rdnSequence pkix.RDNSequence
	if _, err = asn1.Unmarshal(derBytes, &rdnSequence); err != nil {
		return "", fmt.Errorf("failed to parse ASN.1 DER encoding: %w", err)
	}

	// Convert to a Name and return the common name
	var subject pkix.Name
	subject.FillFromRDNSequence(&rdnSequence)
	return subject.CommonName, nil
}
```
Source: mTLS Server Code
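Building on the extracted Common Name, an access-control check could look like the sketch below. The `authorizeClient` helper and the allow-list entries are hypothetical illustrations, not part of the POC.

```go
// Hypothetical sketch of access-control enforcement based on the extracted
// Common Name; the allow-list contents and helper name are illustrative.
package server

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// allowedClients maps certificate Common Names to access; hypothetical values.
var allowedClients = map[string]bool{
	"cell-01.gitlab.internal": true,
	"cell-02.gitlab.internal": true,
}

// authorizeClient rejects requests from identities outside the allow-list.
func authorizeClient(commonName string) error {
	if !allowedClients[commonName] {
		return status.Errorf(codes.PermissionDenied, "client %q is not authorized", commonName)
	}
	return nil
}
```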
### Advantages of mTLS over Alternative Authentication Methods

We chose mTLS as our primary mechanism for both authentication and authorization in our Cell services architecture for several key reasons:

- Infrastructure-Managed Identity: mTLS allows us to leverage infrastructure-provided identity rather than application-managed tokens. This shifts the responsibility of identity management from application code to our infrastructure components, which are more specialized for this purpose.
- Simplified Secret Management: Unlike token-based approaches (such as JWT tokens) that often require storing tokens as environment variables or in configuration files, mTLS certificates can be automatically provisioned, rotated, and managed by our existing PKI infrastructure. This provides several benefits:
  - No hardcoded secrets in application code or environment variables
  - Reduced risk of token leakage through logs or configuration dumps
  - Standard certificate lifecycle management rather than custom token management
- Multi-Service Compatibility: The mTLS approach scales effectively across multiple services without requiring service-specific implementation details. Each service follows the same pattern for authentication and authorization, providing a consistent security model across our entire Cell architecture.
- Automated Certificate Rotation: Certificates can be rotated automatically without service interruption, which is often more complex with token-based approaches. Our existing certificate management infrastructure handles rotation seamlessly, reducing operational overhead.
- Dual-Purpose Security: mTLS provides both encryption and authentication in a single mechanism, simplifying our security architecture compared to approaches that separate these concerns.
### Security Considerations
- If an attacker gains Remote Code Execution (RCE) on a pod, they could access certificates and keys stored on the filesystem. This is an inherent limitation, as RCE generally compromises all security boundaries within the pod.
- The design focuses on preventing unauthorized service-to-service communication in scenarios where an attacker has limited access to manipulate network requests but not full system access.
- Certificate rotation and proper secret management help mitigate risks associated with potential certificate compromise.
### Private Root CA Regional Resilience

#### Regional Limitations of Certificate Authority Service
Google Cloud Certificate Authority Service is a regional service, meaning that CA Service resources are stored in specific geographical regions and cannot be moved or exported after creation. As documented in Google Cloud’s disaster recovery guidance, regional resources cannot withstand a regional failure, making them vulnerable to regional outages.
This regional limitation presents a specific risk for our Cell architecture:
- New Cell Provisioning Disruption: With a single Private Root CA, an outage in the CA’s region prevents us from provisioning new Cells even in other, unaffected regions, because the Private Root CA would be unavailable to issue certificates.
- Existing Cell Operations: Importantly, existing Cells would continue to operate normally during a regional CA outage, as they rely on already-issued certificates that remain valid and functional.
#### Multi-Regional Private Root CA Strategy
To achieve regional resilience and ensure continuous certificate operations for new Cell provisioning, we implement a multi-regional Private Root CA strategy where each Cell uses the Private Root CA located in its deployment region.
#### Current Implementation
We currently provision Private Root CAs in the following regions:
- us-east1 - For Cells deployed in the US East region
- us-central1 - For Cells deployed in the US Central region
#### Planned Expansion
The strategy includes expanding Private Root CA deployment to cover all primary regions where Cells will be deployed:
- Americas: us-east1, us-central1, us-west1, northamerica-northeast1
- Europe: europe-west1, europe-west3, europe-north1
- Asia Pacific: asia-northeast1, asia-southeast1, australia-southeast1
#### Implementation Architecture and Considerations

Each Cell is provisioned with certificates from the Private Root CA located in the same region:

- Regional Alignment: Cells deployed in `us-east1` use the Private Root CA in `us-east1`
- Certificate Trust Configuration: All Cells and the Server (Topology Service) must be configured to trust certificates issued by any of the regional Root CAs, to maintain interoperability between Cells across regions (a trust-pool sketch follows this list)
- Monitoring and Alerting: Comprehensive monitoring ensures rapid detection of regional CA failures and provides visibility into provisioning capabilities across regions
- Certificate Lifecycle Coordination: Ensure certificate rotation and lifecycle management works consistently across all regional CAs
- Regional Capacity Planning: Monitor certificate issuance capacity and usage across regions to ensure adequate resources for Cell provisioning demands
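To illustrate the trust-configuration consideration above, the hedged sketch below loads one Root CA certificate per region into a single `x509.CertPool` that could then back client or server validation; the file paths and regions shown are illustrative.

```go
// Hypothetical sketch of trusting all regional Root CAs at once: each regional
// CA certificate is appended to a single pool used for peer validation.
// File paths and the set of regions are illustrative.
package main

import (
	"crypto/x509"
	"log"
	"os"
)

func regionalTrustPool() (*x509.CertPool, error) {
	pool := x509.NewCertPool()
	// One Root CA certificate per region where Cells are deployed.
	for _, path := range []string{
		"/etc/tls/roots/us-east1-ca.crt",
		"/etc/tls/roots/us-central1-ca.crt",
	} {
		pem, err := os.ReadFile(path)
		if err != nil {
			return nil, err
		}
		if !pool.AppendCertsFromPEM(pem) {
			log.Printf("no certificates parsed from %s", path)
		}
	}
	return pool, nil
}

func main() {
	if _, err := regionalTrustPool(); err != nil {
		log.Fatalf("failed to build trust pool: %v", err)
	}
	log.Println("regional trust pool built")
}
```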
#### Benefits of Multi-Regional Approach
- New Cell Provisioning Resilience: Even if one region experiences an outage, new Cells can continue to be provisioned in other regions using their respective regional Private Root CAs
- Reduced Latency: Each Cell uses a geographically close Root CA for optimal performance during certificate operations
- Regional Independence: Each region operates independently, so a regional outage only affects new Cell provisioning in that specific region
- Regional Isolation: A regional outage is contained to that region and doesn’t affect Cell operations or provisioning in other regions
- Existing Operations Continuity: Existing Cells continue operating normally during regional outages, as they don’t require new certificate issuance for ongoing operations
This multi-regional strategy ensures that our Cell provisioning capabilities remain resilient to regional outages while maintaining the security posture and operational efficiency of our Cell architecture. While existing Cells continue to operate during regional CA outages, this approach ensures we can continue expanding our infrastructure in unaffected regions.
### Certificate Lifecycle Management

#### Certificate TTLs

Certificate Time-To-Live (TTL) values are configured according to industry best practices and our internal security requirements. The following TTL values have been established for our PKI infrastructure:
Certificate Type | TTL |
---|---|
Root CA | 10 years |
Intermediate/Subordinate CA | 5 years |
End-entity certificates | 13 months |
These TTL values strike a balance between security requirements (limiting exposure time of compromised certificates) and operational overhead (frequency of rotation).
#### Certificate Rotation
To ensure continuous operation without interruption, certificates are proactively rotated before expiration:
- Intermediate CA certificates: Rotated 90 days before expiry, aligned with the Internal TLS blueprint.
- End-entity certificates: Rotated 60 days before expiry.
The rotation of intermediate and end-entity certificates is automated through the Instrumentor, which handles the provisioning and rotation of these certificates. A sketch of the rotation-window check is shown below.
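To make the rotation policy concrete, the sketch below checks whether a certificate has entered its rotation window. It is illustrative only, not the Instrumentor’s implementation, and the certificate path is hypothetical.

```go
// Illustrative check of the rotation policy described above; this is not the
// Instrumentor's actual implementation. rotationLead would be 60 days for
// end-entity certificates and 90 days for intermediate CA certificates.
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
	"time"
)

func needsRotation(certPath string, rotationLead time.Duration) (bool, error) {
	pemBytes, err := os.ReadFile(certPath)
	if err != nil {
		return false, err
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return false, fmt.Errorf("no PEM data found in %s", certPath)
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return false, err
	}
	// Rotate once we are within the lead window before NotAfter.
	return time.Now().After(cert.NotAfter.Add(-rotationLead)), nil
}

func main() {
	due, err := needsRotation("/etc/tls/tls.crt", 60*24*time.Hour) // hypothetical path
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("rotation due:", due)
}
```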
#### Root CA Rotation
Root CA rotation requires more careful orchestration. The rotation is performed every 5 years (or as needed) following Google Cloud’s CA rotation best practices outlined in the Certificate Authority Service documentation.
The Root CA rotation process is summarized as follows:

1. Identify the CA pool containing the existing Root CA that is due to expire.
2. Create a new CA in the same CA pool in STAGED state.
3. Change the state of the new CA to ENABLED, enabling certificate issuance from both old and new CAs.
4. Update the Trust Store configuration on the Server (Topology Service) to accept requests authenticated with certificates issued by the new CA.
5. Update the Instrumentor to use the new Root CA for provisioning new Intermediate CA certificates.
6. Change the state of the old CA to DISABLED, preventing new certificate issuance while maintaining trust.
7. Wait until all clients have stopped using certificates issued from the old CA (either by waiting for maximum certificate lifetime or by monitoring client certificate usage).
8. Delete the old CA once all intermediate certificates have been issued using the new CA.
This carefully orchestrated process ensures zero downtime during Root CA rotation while maintaining the security integrity of the PKI infrastructure.
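The ENABLED and DISABLED state changes (steps 3 and 6) can be automated through the Certificate Authority Service API. The sketch below uses the Google Cloud Go client (`cloud.google.com/go/security/privateca/apiv1`) under the assumption that its long-running-operation surface follows the standard GAPIC pattern; resource names are hypothetical, and the exact calls should be confirmed against the CAS documentation.

```go
// Hedged sketch of steps 3 and 6 of the rotation process using the GCP
// Certificate Authority Service Go client. Resource names are hypothetical and
// error handling is abbreviated.
package main

import (
	"context"
	"log"

	privateca "cloud.google.com/go/security/privateca/apiv1"
	"cloud.google.com/go/security/privateca/apiv1/privatecapb"
)

func main() {
	ctx := context.Background()
	client, err := privateca.NewCertificateAuthorityClient(ctx)
	if err != nil {
		log.Fatalf("failed to create CAS client: %v", err)
	}
	defer client.Close()

	const pool = "projects/example-project/locations/us-east1/caPools/example-pool" // hypothetical

	// Step 3: enable the new (staged) Root CA so both CAs can issue certificates.
	enableOp, err := client.EnableCertificateAuthority(ctx, &privatecapb.EnableCertificateAuthorityRequest{
		Name: pool + "/certificateAuthorities/root-ca-new",
	})
	if err != nil {
		log.Fatalf("enable failed: %v", err)
	}
	if _, err := enableOp.Wait(ctx); err != nil {
		log.Fatalf("enable operation failed: %v", err)
	}

	// Step 6: disable the old Root CA; already-issued certificates remain trusted.
	disableOp, err := client.DisableCertificateAuthority(ctx, &privatecapb.DisableCertificateAuthorityRequest{
		Name: pool + "/certificateAuthorities/root-ca-old",
	})
	if err != nil {
		log.Fatalf("disable failed: %v", err)
	}
	if _, err := disableOp.Wait(ctx); err != nil {
		log.Fatalf("disable operation failed: %v", err)
	}
}
```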
### DNS Resolution for mTLS Server Communication

For mTLS to function correctly, clients must reach the server using the DNS name present in the server certificate’s Subject Alternative Name (SAN) field. In our Cell architecture with Private Service Connect, this presents a unique challenge, as each Cell may have a different IP address for the same service.

#### Implementation Details

To ensure consistent DNS resolution across all Cells while maintaining proper certificate validation, we will implement the following approach:
```mermaid
sequenceDiagram
    participant Client as Client in Cell
    participant KubeDNS as KubeDNS
    participant CloudDNS as CloudDNS Private Zone
    participant PSCIP as Cell-specific PSC IP
    participant ILB as Internal Load Balancer
    participant Service as Server Service
    Note over PSCIP: PSC = Private Service Connect
    Client->>KubeDNS: DNS lookup request
    KubeDNS->>CloudDNS: Forward DNS query
    CloudDNS->>KubeDNS: Return Cell-specific PSC IP
    KubeDNS->>Client: Return PSC IP address
    Client->>PSCIP: Connect to PSC IP
    PSCIP->>ILB: Route to internal load balancer
    ILB->>Service: Route to appropriate service instance
    Service-->>ILB: Response
    ILB-->>PSCIP: Response
    PSCIP-->>Client: Response
```
- Private CloudDNS Zone per Cell:
  - Each Cell project will have its own CloudDNS Private Zone
  - This zone will contain the same DNS name (e.g., `topology-service.gitlab.net`) for all Cells
  - Each zone will resolve to the Cell-specific Private Service Connect IP that is dynamically reserved from the Cell’s VPC when the PSC endpoint is created
- DNS Resolution Flow:
  - Client services use the standard DNS name in their requests
  - KubeDNS forwards the request to the CloudDNS Private Zone
  - CloudDNS resolves the name to the Cell’s specific Private Service Connect endpoint IP
  - The request reaches the correct service through the Private Service Connect endpoint
- Certificate Validation:
  - The server certificate’s SAN includes the standard DNS name
  - Clients validate the certificate against this name, ensuring proper mTLS authentication (see the sketch after this list)
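A minimal client-side sketch of this validation, assuming the shared DNS name used elsewhere in this document and a hypothetical port; the client-certificate setup shown in earlier examples is omitted for brevity.

```go
// Sketch of how certificate validation ties to the shared DNS name. The client
// dials whatever Cell-specific PSC IP DNS returns, while ServerName pins
// verification to the name in the server certificate's SAN. The port is
// hypothetical, and the mTLS client certificate is omitted for brevity.
package main

import (
	"crypto/tls"
	"log"
)

func main() {
	conn, err := tls.Dial("tcp", "topology-service.gitlab.net:443", &tls.Config{
		// Must match a SAN entry in the server certificate; this is the
		// standard name configured in every Cell's private DNS zone.
		ServerName: "topology-service.gitlab.net",
	})
	if err != nil {
		log.Fatalf("TLS handshake failed: %v", err)
	}
	defer conn.Close()
	log.Printf("negotiated with %s", conn.ConnectionState().PeerCertificates[0].Subject.CommonName)
}
```

Because every Cell’s private zone resolves the same name to its own PSC IP, the same `ServerName` works in every Cell without per-Cell configuration.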
#### Advantages
- Consistent Naming: All Cells use the same DNS name to access services, simplifying configuration
- Certificate Compatibility: The DNS name matches the certificate’s SAN, enabling proper mTLS validation
- Isolation: Each Cell maintains its own DNS resolution to its specific Private Service Connect endpoint
- Proven Solution: This approach is already implemented and tested in our Production environment for Vault services
- Infrastructure as Code: All DNS configurations are managed through Terraform
#### Implementation Reference
This implementation leverages our existing infrastructure patterns:
- Service Exposure: The internal Load Balancer is exposed through Private Service Connect via a `serviceAttachment`
- Access Control: Projects are dynamically configured to connect to the service
- DNS Configuration: Private CloudDNS zones are created in each consumer project
This approach works seamlessly with KubeDNS as the cluster’s DNS provider, without requiring additional permissions or switching the cluster to CloudDNS. This solution is already implemented and running in our Production environment for Vault services.
## Supported clients & servers
Client | Server |
---|---|
GitLab | Topology Service |