Code Indexing Architecture
Code Indexing ETL
This document outlines the approach for building the ETL pipeline for Code Indexing in the gkg-indexer. The key difference between code and SDLC entities is that code versions can exist in parallel on different branches.
If we want the Knowledge Graph to answer questions about code, it needs to understand the relationships on any given branch and at different commits.
The ETL pipeline will be responsible for:
- Reading the code from the GitLab repositories using Gitaly RPC calls
- Transforming the code into the desired format, including the call graph and the filesystem hierarchy
- Writing the entities and relationships to the Knowledge Graph ClickHouse database
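As a rough mental model, the pipeline composes these three stages per repository ref. The sketch below uses illustrative names only; the actual gkg-indexer components (such as the FileProcessor, AnalysisService, and WriterService) are described later in this document.

```rust
use std::path::{Path, PathBuf};

/// Illustrative identifier for one indexing run; not the real gkg-indexer type.
struct RepoRef {
    project_id: u64,
    branch: String,
    commit_sha: String,
}

/// Nodes (directories, files, definitions, imported symbols) and edges
/// (containment, calls, imports) extracted from one repository ref.
struct CodeGraph;

trait IndexingPipeline {
    /// Extract: download the repository contents at `commit_sha` into a
    /// temporary working directory via Gitaly RPCs.
    fn extract(&self, repo: &RepoRef) -> PathBuf;

    /// Transform: parse the supported files and build the call graph plus
    /// the filesystem hierarchy.
    fn transform(&self, workdir: &Path) -> CodeGraph;

    /// Load: write the entities and relationships to the Knowledge Graph
    /// ClickHouse database.
    fn load(&self, repo: &RepoRef, graph: &CodeGraph);

    /// One indexing run for a single repository ref.
    fn index_repository(&self, repo: &RepoRef) {
        let workdir = self.extract(repo);
        let graph = self.transform(&workdir);
        self.load(repo, &graph);
    }
}
```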
What is a call graph?
Here is a simple example of a call graph between two files:
// fileA.ts
import { Bar } from "./fileB";

export class Foo {
  hello() {
    const bar = new Bar();
    bar.myMethod();
  }
}

// fileB.ts
export class Bar {
  myMethod() {
    console.log("Hello from Bar.myMethod()");
  }
}
Explanation:
- `fileA.ts` defines `class Foo` with a method `hello()`.
- Inside `hello()`, it creates an instance of `Bar` and calls `myMethod()`.
- `fileB.ts` defines `class Bar` with the `myMethod()` implementation.
We create a call graph between the two files, and the result looks like this:
graph LR
%% Define subgraphs
subgraph DIR["my_directory -"]
subgraph FILEA["fileA.ts"]
A1["class Foo"]
A2["hello()"]
end
subgraph FILEB["fileB.ts"]
B1["class Bar"]
B2["myMethod()"]
end
end
%% Edges
FILEA --> A1
FILEB --> B1
A1 -->|"method"| A2
B1 -->|"method"| B2
A2 -.->|"calls"| B2
%% Styling
%% Lighter directory background
style DIR fill:#fafafa,stroke:#bbb,stroke-width:1px,color:#111,font-weight:bold
style FILEA fill:#dce9ff,stroke:#90c2ff,stroke-width:1px,color:#111,font-weight:bold
style FILEB fill:#d9f7d9,stroke:#90e090,stroke-width:1px,color:#111,font-weight:bold
%% Class and method nodes (colored distinctly)
style A1 fill:#cce0ff,stroke:#4080ff,stroke-width:1px,color:#003366
style A2 fill:#dde9ff,stroke:#4080ff,stroke-width:1px,color:#003366
style B1 fill:#ccffcc,stroke:#33aa33,stroke-width:1px,color:#004400
style B2 fill:#ddffdd,stroke:#33aa33,stroke-width:1px,color:#004400
%% Edge styling
linkStyle 0,1,2,3 stroke:#444,stroke-width:1.5px
linkStyle 4 stroke:#5b32ff,stroke-width:2.5px,stroke-dasharray: 4 2,background-color:#f0f0ff
Core components
- `gkg-indexer`: The ETL pipeline for GitLab's code.
- `gkg-webserver`: The REST + MCP interface to query the Knowledge Graph.
- NATS JetStream: The message broker for the Knowledge Graph.
- NATS KV: The key-value store for the Knowledge Graph.
- ClickHouse: The OLAP database for the Knowledge Graph.
- Gitaly: Handles all Git repository RPC file access for GitLab.
Data storage
The Knowledge Graph code data is going to be stored in a separate ClickHouse database.
- For .com, this will probably be in a separate instance.
- For small dedicated environments and self-hosted instances, this can be done in the same instance as the main ClickHouse database. This choice ultimately depends on what the operators think is best for their environment.
Some numbers
As of November 2025, the GitLab monolith has over 4000 branches considered “active” (committed to within the last 3 months) and even more that are considered “stale” (last committed more than 3 months ago).
Locally, with the limited support for Ruby, we currently index about 300,000 definitions and over 1,000,000 relationships.
For simplicity's sake, let's say we want to keep an active code index for branches that are considered "active". This would require us to index (300,000 definitions × 4000 branches) = 1.2 billion definitions and (1,000,000 relationships × 4000 branches) = 4 billion relationships just for the GitLab monolith. This is simply not feasible if we extrapolate to all the repositories on .com.
Use cases
To come up with a solution for the scale issue, it's probably best to outline some use cases where the Knowledge Graph can be used to answer questions about code.
- Does this merge request change the behavior of the existing code in unexpected ways?
- What is the impact of this merge request on the existing code?
- Perform code exploration to help understand the codebase and reveal architectural patterns.
- Provide guidance when a user wants to refactor the code or add a new feature.
- Identify the potential risks of a vulnerability in the codebase.
- Create queryable APIs for code exploration and analysis.
- Generate documentation for the codebase.
Of course, there are many more use cases that can be thought of, but these seem to be the most common ones. This raises the question: do we need to index all the code for every active branch for every repository? The answer is probably not.
Indexing the main branch
Let's first focus on indexing the main branch of every repository. This should cover most of the use cases for the Knowledge Graph; we can then think about a strategy for indexing the active branches if the need arises.
Extract
The extract phase involves listening to events from NATS and leveraging ClickHouse as both the data store and the mechanism for deriving project hierarchies and full paths.
Push events from GitLab’s PostgreSQL database are published to NATS JetStream subjects like:
- `gkg_siphon_stream.events`
- `gkg_siphon_stream.push_event_payloads`
The indexing service subscribes to these NATS subjects and correlates events across tables:
- Events table: contains `event_id`, `project_id`, `author_id`, and the push action.
- Push payloads table: contains `event_id`, `ref` (branch name), `commit_to` (SHA), and the ref type.
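To make the correlation concrete, here is a minimal sketch of the two payloads joined on `event_id`. Only the columns listed above come from this document; the struct names, field types, and the default-branch check are assumptions.

```rust
use serde::Deserialize;

/// Row from the events table/subject (hypothetical shape).
#[derive(Debug, Deserialize)]
struct PushEvent {
    event_id: u64,
    project_id: u64,
    author_id: u64,
    action: String, // push action
}

/// Row from the push payloads table/subject (hypothetical shape).
#[derive(Debug, Deserialize)]
struct PushEventPayload {
    event_id: u64,
    r#ref: String,     // branch name, e.g. "main"
    commit_to: String, // SHA of the new HEAD
    ref_type: String,  // e.g. "branch"
}

/// Correlate the two streams on `event_id` and keep only pushes to the
/// project's default branch.
fn is_main_branch_push(
    event: &PushEvent,
    payload: &PushEventPayload,
    default_branch: &str,
) -> bool {
    event.event_id == payload.event_id
        && payload.ref_type == "branch"
        && payload.r#ref == default_branch
}
```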
The indexer receives the events and confirms it’s a push to the main branch before proceeding with the indexing process. Then, the service acquires a lock on the project + branch + ref combination. This is to prevent other workers or pods from indexing the same branch at the same time.
Example NATS KV:
- Key: `/gkg-indexer/indexing/{project_id}/{branch_name}/{ref_type}/lock`
- Value: `{ "worker_id": String, "started_at": Instant }`
- TTL: 1 hour (estimated based on the amount of resources)
Once the service acquires the lock, it will make a direct RPC call to Gitaly to download the files temporarily to disk. The service will query ClickHouse as needed to build the namespace hierarchy and gather additional metadata to enrich both the project’s code graph and NATS locking.
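A minimal sketch of the lock acquisition step, assuming a thin abstraction over the NATS KV bucket (the concrete client calls would come from the NATS JetStream KV API and are not shown):

```rust
use serde::Serialize;

/// Value stored under the lock key. The example above lists `started_at: Instant`;
/// a serializable Unix timestamp is used here instead.
#[derive(Serialize)]
struct LockValue {
    worker_id: String,
    started_at: u64,
}

/// Thin abstraction over the NATS KV bucket. The bucket itself would be
/// configured with a ~1 hour max age so abandoned locks expire on their own.
trait LockStore {
    /// Creates the key only if it does not already exist.
    /// Returns `Ok(false)` when another worker already holds the lock.
    fn create_if_absent(&self, key: &str, value: &[u8]) -> Result<bool, String>;
}

fn try_acquire_lock(
    kv: &dyn LockStore,
    project_id: u64,
    branch_name: &str,
    ref_type: &str,
    worker_id: &str,
) -> Result<bool, String> {
    // Key layout follows the example above.
    let key = format!("/gkg-indexer/indexing/{project_id}/{branch_name}/{ref_type}/lock");
    let value = LockValue {
        worker_id: worker_id.to_string(),
        started_at: std::time::SystemTime::now()
            .duration_since(std::time::UNIX_EPOCH)
            .unwrap_or_default()
            .as_secs(),
    };
    let payload = serde_json::to_vec(&value).map_err(|e| e.to_string())?;
    kv.create_if_absent(&key, &payload)
}
```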
Transform (Call Graph Construction)
The next phase is documented at https://gitlab.com/gitlab-org/rust/knowledge-graph; however, here's a brief overview of the process:
- `FileProcessor` chooses a language parser by extension and feeds the content through the analyzers to the GitLab Code Parser. Supported languages include:
  - Ruby
  - Python
  - Kotlin
  - Java
  - C#
  - TypeScript/JavaScript
  - Rust
- The parser output is normalized into `FileProcessingResult` objects that contain definitions, imports, and references with precise byte offsets.
- For each file, the `AnalysisService` turns those results into a graph:
  - `FileSystemAnalyzer` builds directory and file nodes plus containment edges.
  - Language-specific analyzers emit definition and import nodes using consistent structs.
- References are stitched into call-graph edges (definition → definition) and dependency edges (definition → imported symbol, imported symbol → definition/file).
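To illustrate the output of this phase, here is a sketch of possible node and edge shapes for the in-memory graph. The real structs in the knowledge-graph crates use their own names and carry more metadata (FQNs, byte offsets, language, and so on).

```rust
/// Illustrative node kinds produced by the analyzers.
enum Node {
    Directory { path: String },
    File { path: String },
    Definition { fqn: String, file: String, start_byte: u32, end_byte: u32 },
    ImportedSymbol { name: String, file: String },
}

/// Illustrative edge kinds; indices point into the node list.
enum Edge {
    /// Filesystem containment: directory -> directory/file, file -> definition.
    Contains { parent: usize, child: usize },
    /// Call graph: definition -> definition.
    Calls { caller: usize, callee: usize },
    /// Dependency: definition -> imported symbol.
    UsesImport { definition: usize, import: usize },
    /// Dependency: imported symbol -> the definition or file it resolves to.
    ResolvesTo { import: usize, target: usize },
}

/// The per-repository graph handed to the load phase.
struct CodeGraph {
    nodes: Vec<Node>,
    edges: Vec<Edge>,
}
```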
Load
Once the graph is constructed, the WriterService converts nodes and edges into Arrow batches and writes Parquet files (directory.parquet, file.parquet, definition.parquet, imported_symbol.parquet, plus one Parquet file per relationship table defined in RELATIONSHIP_TABLES).
The RepositoryIndexer::load_into_database method will then leverage the ClickHouse client to stream the parquet data into the appropriate node and edge tables. Loading the data should be done using upsert operations to prevent duplicate data. The nodes and edges that have been deleted should be soft-deleted so a subsequent cleaning job can remove them.
Once the process is complete, the finished database is ready for query services, MCP adapters, and AI tooling.
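One way to get the upsert and soft-delete behavior described above is to back each node and edge table with a ReplacingMergeTree keyed on the node identity, with a version column and an `is_deleted` flag. The schema below is an assumption for illustration, not the actual table layout used by the indexer.

```rust
/// Illustrative DDL for one node table; column names beyond those implied by
/// the document (project, FQN, file path, byte offsets) are assumptions.
const CREATE_DEFINITION_TABLE: &str = r#"
CREATE TABLE IF NOT EXISTS code_definition
(
    project_id  UInt64,
    fqn         String,
    file_path   String,
    start_byte  UInt32,
    end_byte    UInt32,
    indexed_at  DateTime64(3),   -- acts as the row version for ReplacingMergeTree
    is_deleted  UInt8 DEFAULT 0  -- soft-delete marker, removed later by a cleanup job
)
ENGINE = ReplacingMergeTree(indexed_at)
ORDER BY (project_id, fqn, file_path)
"#;

/// Re-inserting a row with the same ORDER BY key and a newer `indexed_at`
/// behaves like an upsert; soft-deleted rows are filtered out at query time.
const SELECT_LIVE_DEFINITIONS: &str = r#"
SELECT fqn, file_path
FROM code_definition FINAL
WHERE project_id = {project_id:UInt64}
  AND is_deleted = 0
"#;
```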
Flow visual representation
graph TB
subgraph "Data Sources"
GL[GitLab PostgreSQL]
end
subgraph "Event Streaming Layer"
NATS[NATS JetStream]
E1[gkg_siphon_stream.events]
E2[gkg_siphon_stream.push_event_payloads]
GL --> NATS
NATS --> E1
NATS --> E2
end
subgraph "Extract Phase"
IS[Indexing Service]
E1 --> IS
E2 --> IS
LOCK{Acquire Lock?}
IS --> LOCK
KV[(NATS KV Store<br/>TTL: 1 hour)]
LOCK -->|Lock Key| KV
GITALY[Gitaly RPC<br/>Download Files]
LOCK -->|Success| GITALY
end
subgraph "Transform Phase - Call Graph Construction"
FP[FileProcessor<br/>Choose Parser by Extension]
GITALY --> FP
GCP[GitLab Code Parser]
FP --> GCP
LANGS[Supported Languages:<br/>Ruby, Python, Kotlin<br/>Java, C#, TS/JS, Rust]
GCP -.-> LANGS
FPR[FileProcessingResult<br/>definitions, imports, references]
GCP --> FPR
AS[AnalysisService]
FPR --> AS
FSA[FileSystemAnalyzer<br/>directories, files, containment]
AS --> FSA
LA[Language Analyzers<br/>definitions, imports]
AS --> LA
CG[Call Graph Construction<br/>def→def edges<br/>def→import edges<br/>import→def/file edges]
FSA --> CG
LA --> CG
end
subgraph "Load Phase"
WS[WriterService]
CG --> WS
ARROW[Convert to Arrow Batches]
WS --> ARROW
PQ[Write Parquet Files:<br/>directory.parquet<br/>file.parquet<br/>definition.parquet<br/>imported_symbol.parquet<br/>+ relationship tables]
ARROW --> PQ
UPSERT[Stream to ClickHouse<br/>Upsert Operations<br/>Soft Delete Removed Nodes]
PQ --> UPSERT
CHDB[(ClickHouse Database<br/>Node & Edge Tables)]
UPSERT --> CHDB
end
subgraph "Consumers"
QS[Query Services]
MCP[MCP Adapters]
AI[AI Tooling]
CHDB --> QS
CHDB --> MCP
CHDB --> AI
end
style IS fill:#e1f5ff
style FP fill:#fff4e1
style AS fill:#fff4e1
style WS fill:#e8f5e9
style CHDB fill:#f3e5f5
Indexing the active branches
The problem
As discussed in the previous section, the main branch is the most common branch to index. However, it still feels relevant to document a strategy to index the active branches if the need arises. Let’s also not forget that the Knowledge Graph includes a local version that customers can use to query code against their local repository at any version.
To reiterate, the issue with indexing active branches is the sheer volume of data that would need to be indexed. We're talking about billions of definitions and relationships for each GitLab-sized repository. This is a complex problem that would take effort away from releasing a first version of the Knowledge Graph service without providing clear value.
A future strategy
Once we deploy the initial version, if our metrics and customer feedback show that the ability to explore codebases at any version is valuable, we can then explore our options.
As stated above, GitLab has the concept of a branch being "active" or "stale". An active branch is one that has been committed to within the last 3 months. A stale branch is one that has not been committed to in the last 3 months.
Given the amount of data and the uneven query distribution (some branches will never be queried), it's best we don't keep this data alongside the main-branch data in the same database, since that would result in a lot of wasted storage and compute resources.
Ideally, we would reuse the same indexing strategy as for the main branch: index the active branches by listening to push events from NATS, but instead of loading the data into ClickHouse, store it in cold storage (such as S3 or GCS).
On request, we would load the data into ClickHouse from cold storage in materialized tables. This would allow us to then query the data in ClickHouse during the current session and then unload the data from ClickHouse after the session is complete (based on a variable TTL).
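For illustration, loading a branch index on demand could be a single ClickHouse statement per Parquet file using the `s3` table function. The target table name and bucket layout below are assumptions; GCS can be read through the same interface via its S3-compatible endpoint.

```rust
/// Builds the illustrative load statement for one branch's definitions.
/// The relationship Parquet files would be loaded the same way.
fn load_branch_definitions_sql(project_id: u64, branch: &str) -> String {
    format!(
        "INSERT INTO branch_code_definition \
         SELECT * FROM s3('https://example-bucket.s3.amazonaws.com/gkg/{project_id}/{branch}/definition.parquet', 'Parquet')"
    )
}
```

The materialized table itself could also carry a ClickHouse TTL expression so the branch data expires automatically once the session window passes.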
Flow visual representation
graph TB
subgraph "Active Branch Processing"
J[Push Event] --> K[NATS Event Stream]
K --> L[Index Data]
L --> M[Cold Storage<br/>S3/GCS]
end
subgraph "On-Demand Loading"
N[User Request] --> O{Data in ClickHouse?}
O -->|No| P[Load from Cold Storage]
O -->|Yes| Q[Query Data]
P --> R[Materialized<br/>in ClickHouse]
R --> Q
Q --> S[Session Complete]
S --> T[Unload Branch Data After TTL]
T --> M
end
style M fill:#2196F3
style P fill:#2196F3
Cleaning up
Once the branch either becomes stale or is deleted, we should clean up the data in our cold storage. This would be done by a separate job that would run periodically and clean up the data based on the latest state of the branches.
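A sketch of that periodic pass, with the cold-storage client and the branch-state lookup left as hypothetical interfaces:

```rust
/// Hypothetical view of a branch's lifecycle state.
#[derive(PartialEq)]
enum BranchState {
    Active,
    Stale,
    Deleted,
}

trait BranchStates {
    fn state(&self, project_id: u64, branch: &str) -> BranchState;
}

trait ColdStorage {
    /// Lists the (project_id, branch) prefixes currently stored.
    fn list_indexed_branches(&self) -> Vec<(u64, String)>;
    fn delete_branch_prefix(&self, project_id: u64, branch: &str);
}

/// Runs periodically and drops data for branches that were deleted or went stale.
fn cleanup(storage: &dyn ColdStorage, states: &dyn BranchStates) {
    for (project_id, branch) in storage.list_indexed_branches() {
        if states.state(project_id, &branch) != BranchState::Active {
            storage.delete_branch_prefix(project_id, &branch);
        }
    }
}
```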
Alternative approach
An alternative approach, if the time to first response is not critical, is to index the active branches on demand, and then the stale branches on demand as well. Depending on the indexing speed of the servers, this would allow us to keep the temporary data in ClickHouse and then dispose of it after the session is complete or at a later time. This would eliminate the need to manage cold storage and the associated costs.
Indexing the stale branches
Stale branches are, in most cases, branches that have been abandoned by the original author and are not actively being worked on. If we were to index them, we could follow the same strategy as the alternative approach described for active branches.
Zero-Downtime Schema Changes
Code Indexing is going to follow the same schema migration strategy as the main branch as described in Zero-Downtime Schema Changes.
How Code Querying Works Today
- Purpose-built MCP tools
  - The Knowledge Graph team originally built dedicated MCP tools, which include code-specific tools implemented under `crates/mcp/tools`. Each tool wraps a focused workflow on top of the indexed call graph. Reference documentation lives at `docs/mcp/tools`.
- What the tools currently do
  - `list_projects` enumerates indexed repositories for agent discovery.
  - `search_codebase_definitions` searches Definition nodes by name, FQN, or partial match and streams back signatures plus context.
  - `get_definition` resolves a usage line to its Definition or ImportedSymbol node by leveraging call graph edges such as `CALLS` and `AMBIGUOUSLY_CALLS`.
  - `get_references` pivots the other way through relationships like `DEFINES_IMPORTED_SYMBOL` and `FILE_IMPORTS` to list every referencing definition with contextual snippets.
  - `read_definitions` batches definition bodies so agents can retrieve implementations efficiently.
  - `repo_map` walks the directory nodes and summarizes contained definitions, using the graph to stay `.gitignore`-aware.
  - `index_project` invokes the repository indexer inside the MCP process, wiring the reindexing flow described earlier into an on-demand tool call.
- How they execute queries
  - Tools rely on `database::querying::QueryLibrary` (for example, `search_codebase_definitions` delegates to `QueryingService` via the shared query library) and on the same database connections managed by `crates/database`. This keeps query plans consistent with the schema imported during indexing.
  - Many tools supplement database hits with filesystem reads (see `file_reader_utils`) so responses include code snippets, respecting byte offsets captured in the graph.
- Other consumers
  - The HTTP/GQL surfaces continue to use the shared schema metadata published at `docs/reference/schema`; the MCP tools simply package the most common graph traversals for AI agents and IDE features while reusing the same underlying query service.
Important Note: We intend to replace the above tools, where it makes sense, with our Graph Query Engine technology to enable agents and analytics to traverse the graph using tools that will be shared with SDLC querying. Agents will never write or execute raw queries themselves. They can only interact with the graph through these exposed, parameterized tools, which enforce security and access controls.
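To illustrate what a parameterized tool means in practice, here is a hypothetical sketch (none of these names come from the Graph Query Engine): the agent supplies typed arguments only, and the tool performs the authorization check and binds the arguments into a pre-written traversal.

```rust
/// Per-request context established by the server, never by the agent.
struct ToolContext {
    current_user_id: u64,
}

trait Authorizer {
    fn can_read_project(&self, user_id: u64, project_id: u64) -> bool;
}

trait GraphQueries {
    /// Executes a pre-defined, parameterized traversal; agents never see or
    /// author the underlying query text.
    fn callers_of(&self, project_id: u64, fqn: &str) -> Vec<String>;
}

/// Hypothetical tool handler: validates access, then runs the canned traversal.
fn get_callers_tool(
    ctx: &ToolContext,
    auth: &dyn Authorizer,
    graph: &dyn GraphQueries,
    project_id: u64,
    fqn: &str,
) -> Result<Vec<String>, String> {
    if !auth.can_read_project(ctx.current_user_id, project_id) {
        return Err("not authorized to read this project".to_string());
    }
    Ok(graph.callers_of(project_id, fqn))
}
```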
