GitLab Dependency Scanning ADR 001: SBOM Vulnerability Scanner and Package Metadata Database

Context

GitLab’s dependency scanning required a fundamental architectural shift to enable scanning workflows beyond traditional CI pipelines. The legacy Gemnasium analyzer, while effective for CI-based scanning, created significant limitations for emerging security requirements:

Event-Driven Scanning Requirements: New security workflows like Continuous Vulnerability Scanning needed to react to external events (security advisory disclosures) rather than user-initiated CI pipeline runs. The Gemnasium analyzer’s CI-centric design couldn’t support this reactive scanning model.

Universal Deployment Constraints: GitLab needed scanning capabilities that worked immediately across all deployment types (SaaS, self-managed, Dedicated, air-gapped) without requiring additional infrastructure that might not be available in every environment.

Decomposed Analysis Requirements: New security workflows required separating dependency detection from security analysis. The Gemnasium analyzer’s atomic approach tightly coupled dependency discovery with vulnerability detection, making it impossible to reuse security analysis logic across different contexts or integrate with external dependency detection tools.

Two critical architectural decisions needed to be made:

Scanning Engine Architecture: How to implement a vulnerability scanner that separates dependency detection from security analysis, enabling reuse across different dependency discovery mechanisms and scanning contexts
Vulnerability Data Access: How to provide the scanner with comprehensive, up-to-date vulnerability advisory data across all GitLab deployment types, including offline environments

Decision

We implemented the GitLab SBOM Vulnerability Scanner as a Rails-integrated component with access to locally synchronized Package Metadata Database (PMDB) data for vulnerability advisory information.

GitLab SBOM Vulnerability Scanner

Decomposed Analysis Architecture

The scanner operates as a stateless Rails service that implements decomposed dependency analysis:

Separation of Concerns: Our new architecture separates dependency detection (what components exist) from security analysis (which components have vulnerabilities). This enables different contexts to discover dependencies through various mechanisms while sharing identical vulnerability analysis logic through the scanner.

Component-Based Processing: Accepts standardized software component lists in SBOM format rather than raw dependency manifests, providing a clean integration point between dependency discovery and security analysis phases.

Stateless Operation: Performs pure vulnerability detection without maintaining state or prescribing result processing, allowing each scanning context to handle findings according to its specific requirements.
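As an illustration of the component-based input described above, the following minimal sketch turns an SBOM document into a normalized component list. It assumes a CycloneDX-style JSON document whose entries carry a package URL (`purl`); this ADR only says "SBOM format", so the document shape, the `Component` type, and the `components_from_cyclonedx` helper are all hypothetical and not taken from the GitLab codebase.

```python
import json
from dataclasses import dataclass
from urllib.parse import unquote


@dataclass(frozen=True)
class Component:
    """One software component as a scanner could receive it (hypothetical type)."""
    name: str
    version: str
    ecosystem: str  # taken from the purl type, e.g. "npm" or "pypi"


def components_from_cyclonedx(document: str) -> list[Component]:
    """Extract components from a CycloneDX-style JSON document.

    Assumption: each usable entry has a `purl` such as
    pkg:<type>/<name>@<version>. Entries without one are skipped.
    """
    components = []
    for entry in json.loads(document).get("components", []):
        purl = entry.get("purl", "")
        if not purl.startswith("pkg:") or "@" not in purl:
            continue
        # Drop the optional "#subpath" and "?qualifiers" suffixes first.
        body = purl[len("pkg:"):].split("#", 1)[0].split("?", 1)[0]
        path, version = body.rsplit("@", 1)
        ecosystem, _, name = path.partition("/")
        components.append(
            Component(name=unquote(name), version=unquote(version), ecosystem=ecosystem)
        )
    return components
```

Any dependency detection mechanism that can produce such a list could hand it to the same security analysis step, which is the integration point the decomposed design relies on.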

Semantic Version Processing

To perform the security analysis, the scanner must handle version matching across different package ecosystems. The legacy Gemnasium analyzer uses native subcommands for this, but that approach is not suitable in the context of the GitLab Rails application.

To achieve this, the GitLab SBOM Vulnerability Scanner incorporates GitLab’s custom Semantic Version library:

Cross-Ecosystem Support: Different package managers use varying semantic versioning schemes, requiring specialized parsing and comparison logic for accurate vulnerability matching.

Version Range Matching: Vulnerability advisories often specify affected version ranges using ecosystem-specific notation, necessitating precise version comparison algorithms.

Custom Dialect Support: GitLab-developed semantic version handling ensures accurate vulnerability detection across all supported package ecosystems without relying on external libraries that might not support specialized versioning schemes.
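To make the idea of version range matching concrete, here is a deliberately simplified sketch. It is not GitLab's Semantic Version library: the range syntax (`||` between alternatives, whitespace between constraints that must all hold) and the function names are assumptions chosen for illustration, and only dotted numeric versions are handled. Real ecosystems add pre-release tags, epochs, and other rules that need dialect-specific logic.

```python
import re

# Comparison operators supported by this illustrative range syntax.
_OPERATORS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
    "=": lambda a, b: a == b,
}
_CONSTRAINT = re.compile(r"^(>=|<=|>|<|=)(\d+(?:\.\d+)*)$")


def parse_version(text: str) -> tuple[int, ...]:
    """Parse '1.2.0' into (1, 2); trailing zeros are dropped so 1.2 == 1.2.0."""
    parts = [int(piece) for piece in text.split(".")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def _satisfies(candidate: tuple[int, ...], constraint: str) -> bool:
    match = _CONSTRAINT.match(constraint)
    if match is None:
        raise ValueError(f"unsupported constraint: {constraint!r}")
    operator, bound = match.groups()
    return _OPERATORS[operator](candidate, parse_version(bound))


def is_affected(version: str, affected_range: str) -> bool:
    """Return True if `version` falls inside any alternative of the range."""
    candidate = parse_version(version)
    for alternative in affected_range.split("||"):
        constraints = alternative.split()
        if constraints and all(_satisfies(candidate, c) for c in constraints):
            return True
    return False
```

For example, `is_affected("1.2.0", ">=1.0.0 <1.2.3 || >=2.0.0 <2.0.5")` evaluates the first alternative, finds both constraints satisfied, and reports the version as affected.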

Core Scanning Process

The technical implementation follows a straightforward workflow that leverages the decomposed architecture:

flowchart LR
    %% Define styles
    classDef inputStyle fill:#d0e0ff,stroke:#3080ff,stroke-width:2px
    classDef processStyle fill:#fff0c0,stroke:#ffaa00,stroke-width:2px
    classDef dataStyle fill:#ffd0e0,stroke:#ff60a0,stroke-width:2px
    classDef outputStyle fill:#c0ffc0,stroke:#40a040,stroke-width:2px
    classDef internalStyle fill:#ffe0d0,stroke:#ff8040,stroke-width:2px

    %% Input
    SBOM_INPUT[SBOM Components:<br/>Package name, version, ecosystem]:::inputStyle

    %% Data Sources
    PMDB_DATA[(PMDB Advisory Data:<br/>Vulnerability title, affected ranges, severity, etc.)]:::dataStyle

    %% Processing with internal component
    subgraph SCANNER_BOX[GitLab SBOM Vulnerability Scanner]
        direction TB
        SCANNER_LOGIC[Scanning Logic]:::processStyle
        VERSION_CHECK[Semver Dialects<br/>Version Range Matching]:::internalStyle
        SCANNER_LOGIC --> VERSION_CHECK
    end

    %% Output
    FINDINGS[Formatted Vulnerability Findings:<br/>CVE details, severity, affected components]:::outputStyle
    NO_FINDINGS[No Vulnerabilities Found]:::outputStyle

    %% Flow
    SBOM_INPUT --> SCANNER_LOGIC
    PMDB_DATA --> SCANNER_LOGIC
    VERSION_CHECK -->|Match Found| FINDINGS
    VERSION_CHECK -->|No Match| NO_FINDINGS
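
The flow in the diagram can be sketched as a pure function: components and advisories go in, findings (possibly none) come out, and nothing is stored. This is a hedged illustration only. The `Component`, `Advisory`, and `Finding` types and the `scan` function are hypothetical, and the default matcher compares against an explicit set of affected versions as a stand-in for real range matching.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Component:
    name: str
    version: str
    ecosystem: str


@dataclass(frozen=True)
class Advisory:
    identifier: str
    ecosystem: str
    package: str
    severity: str
    # Stand-in for an affected range: an explicit set of affected versions.
    affected_versions: frozenset


@dataclass(frozen=True)
class Finding:
    component: Component
    advisory: Advisory


def exact_match(version: str, advisory: Advisory) -> bool:
    """Placeholder matcher; a real one would evaluate version ranges."""
    return version in advisory.affected_versions


def scan(
    components: Iterable[Component],
    advisories: Iterable[Advisory],
    is_affected: Callable[[str, Advisory], bool] = exact_match,
) -> list[Finding]:
    """Match components against advisories without keeping any state."""
    # Index advisories by (ecosystem, package) so each component is a lookup.
    by_package: dict = {}
    for advisory in advisories:
        by_package.setdefault((advisory.ecosystem, advisory.package), []).append(advisory)

    findings = []
    for component in components:
        for advisory in by_package.get((component.ecosystem, component.name), []):
            if is_affected(component.version, advisory):
                findings.append(Finding(component, advisory))
    return findings
```

Because `scan` returns plain data and takes the matcher as a parameter, each caller stays free to decide what to do with the findings, which mirrors the stateless operation described earlier.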

Package Metadata Database (PMDB)

PMDB is an external service that provides security intelligence to GitLab instances:

External Service Architecture: PMDB runs as a standalone system outside GitLab, consisting of multiple specialized components deployed in Google Cloud Platform:

Data Ingestion Pipeline: Automated feeders collect data from multiple sources (National Vulnerability Database, GitLab Advisory Database, Trivy DB, CISA KEV, FIRST.org EPSS)
Processing Components: Dedicated processors handle license data, security advisories, and CVE enrichments through secure pub/sub messaging
Export System: Hourly exports aggregate all processed data into public GCP storage buckets for GitLab instance consumption

GitLab Instance Synchronization: Each GitLab installation maintains local PostgreSQL tables synchronized with PMDB data:

5-Minute Sync Cycle: Automated synchronization pulls updated data from public GCP buckets every 5 minutes
Local Database Storage: Vulnerability data, license information, and CVE enrichments are stored locally for fast scanner access
Resilient Operation: Local storage ensures scanning operations continue even if the external PMDB service becomes unavailable
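
The resilience property of this synchronization model can be sketched as follows. This is an assumption-laden illustration, not the GitLab implementation: an in-memory SQLite database stands in for the local PostgreSQL tables, the table layout is invented, and `fetch_export` is a hypothetical callable representing a download from the public export bucket.

```python
import sqlite3
from typing import Callable, Iterable

# (advisory_id, ecosystem, package, affected_range): hypothetical row layout.
Row = tuple[str, str, str, str]


def create_store() -> sqlite3.Connection:
    """Create the local advisory table (SQLite as a stand-in for PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE advisories ("
        "advisory_id TEXT PRIMARY KEY, ecosystem TEXT, package TEXT, affected_range TEXT)"
    )
    return conn


def sync(conn: sqlite3.Connection, fetch_export: Callable[[], Iterable[Row]]) -> bool:
    """Pull the latest export into the local table.

    Returns False and leaves the existing rows untouched when the export
    cannot be fetched, so lookups keep working from local data.
    """
    try:
        rows = list(fetch_export())
    except OSError:
        return False
    with conn:  # commit all upserts as one transaction
        conn.executemany("INSERT OR REPLACE INTO advisories VALUES (?, ?, ?, ?)", rows)
    return True


def advisories_for(conn: sqlite3.Connection, ecosystem: str, package: str) -> list:
    """Read advisories for one package from local storage only."""
    cursor = conn.execute(
        "SELECT advisory_id, ecosystem, package, affected_range FROM advisories "
        "WHERE ecosystem = ? AND package = ? ORDER BY advisory_id",
        (ecosystem, package),
    )
    return cursor.fetchall()
```

A scheduler would call `sync` on the configured interval, while the scanner only ever calls `advisories_for`, so a failed download delays freshness but does not interrupt scanning.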

Comprehensive Security Intelligence: PMDB provides enriched vulnerability data beyond basic advisories:

Multi-Source Advisories: Aggregates vulnerabilities from GitLab Advisory Database, Trivy DB, and other curated sources
EPSS Integration: Exploit Prediction Scoring System data enables vulnerability risk prioritization
KEV Catalog: Known Exploited Vulnerabilities from CISA for critical threat identification
CVE Enrichments: Additional context and metadata for comprehensive vulnerability assessment
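
One way enrichment data could feed risk prioritization is sketched below. The ordering policy (KEV membership first, then EPSS score, then severity) is an assumption made for illustration and is not specified by this ADR; the `EnrichedFinding` type and `prioritize` function are likewise hypothetical. EPSS scores are probabilities between 0 and 1.

```python
from dataclasses import dataclass

# Hypothetical ranking of severity labels, highest first.
SEVERITY_RANK = {"critical": 4, "high": 3, "medium": 2, "low": 1}


@dataclass(frozen=True)
class EnrichedFinding:
    advisory_id: str
    severity: str
    epss_score: float  # estimated probability of exploitation, 0.0 to 1.0
    in_kev: bool       # listed in the CISA Known Exploited Vulnerabilities catalog


def prioritize(findings: list) -> list:
    """Order findings so the most urgent come first (illustrative policy)."""
    return sorted(
        findings,
        key=lambda f: (f.in_kev, f.epss_score, SEVERITY_RANK.get(f.severity, 0)),
        reverse=True,
    )
```

Under this policy a low-severity finding that is known to be exploited sorts ahead of a critical finding with a low exploitation probability, which shows why the enrichments add information beyond the advisory's severity alone.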

Offline Environment Support: Air-gapped GitLab installations can access PMDB data through documented offline synchronization procedures, enabling vulnerability scanning without internet connectivity.

Scalable Data Pipeline: The external architecture supports growing security intelligence requirements:

Hourly Export Cycle: Hourly data aggregation balances freshness with system performance
Modular Processing: Separate components for different data types enable independent scaling and maintenance
Future Extensibility: Architecture supports additional data types through the same pipeline

Last modified October 31, 2025: Add the Dependency Scanning Engine architecture design document (138ff26f)