Repository X-Ray RAG
Status | Authors | Coach | DRIs | Owning Stage | Created |
---|---|---|---|---|---|
ongoing |
mikolaj_wawrzyniak
|
jprovaznik
maddievn
mkaeppler
|
devops create | 2024-04-23 |
Group ~“group::global search” is leading effort to build RAG at GitLab. Because this is a global effort, it is in the spirit of efficiency and collaboration values for ~“group::code creation” to join that effort and integrate Repository X-Ray data into GitLab RAG. Doing so should not only result in more efficient resource allocation, but also enable other AI features to integrate and reuse Repository X-Ray data, for example users could ask questions to GitLab Duo Chat that could be answered with X-Ray report data.
Goal
Integrate the existing Repository X-Ray scan flow with the RAG platform.
Proof of Concept
Proof of concept has been built at merge request 144715. This MR contains a large amount of information that can be helpful during implementation.
Implementation
Current state
Currently, Repository X-Ray does not use semantic search. The MVC approach naively selects first 50 entities from the X-Ray report and includes them into a code generation request. For more details, see the following diagrams.
Current state of Repository X-Ray diagrams
The Repository X-Ray scan is processed as shown on the following diagram:
sequenceDiagram actor USR as User participant RN as GitLab Runner participant GLR as GitLab Rails participant PG as GitLab PostgreSQL DB participant AIGW as AI Gateway USR->>GLR: commits changes <br> to a package manager file <br>eg. Gemfile.lock GLR->>+RN: triggers Repository X Ray CI scanner job loop for each batch of packages RN->>GLR: Request packages description by AI GLR->>AIGW: Forward request for packages description AIGW->>GLR: Packages description GLR->>RN: Forwards packages description end RN->>-GLR: Repository X Ray report GLR->>+GLR: triggers Repository X Ray ingestion background job GLR->>-PG: upserts xray_reports record
The report is later used as shown in the following diagram:
sequenceDiagram actor USR as User participant IDE participant PG as GitLab PostgreSQL DB participant GLR as GitLab Rails participant AIGW as AI Gateway USR->>+IDE: types: "#35; generate a function that transposes a matrix" IDE->>+GLR: trigger code generation for line ` "#35; generate function ` GLR->>PG: fetch X Ray report for project and language PG->>GLR: xray_reports record GLR->>GLR: include first 50 entities from xray report into code generation prompt GLR->>-AIGW: trigger code generation ` "#35; generate function `
Desired outcome
After this effort is completed, the Repository X-Ray scan is processed in the following manner:
sequenceDiagram actor USR as User participant RN as GitLab Runner participant GLR as GitLab Rails participant ES as Elasticsearch participant PG as GitLab PostgreSQL DB participant AIGW as AI Gateway USR->>GLR: commits changes <br> to a package manager file <br>eg. Gemfile.lock GLR->>+RN: triggers Repository X Ray CI scanner job loop for each batch of packages RN->>GLR: Request packages description by AI GLR->>AIGW: Forward request for packages description AIGW-->>GLR: Packages description GLR-->>RN: Forwards packages description end RN->>-GLR: Repository X Ray report GLR->>+GLR: triggers Repository X Ray ingestion background job GLR->>PG: Fetch currently stored packages for given repository and language PG-->>GLR: List of repository packages GLR->>GLR: Group packages into lists#58; "skip", "add", "remove" GLR->>PG: Insert new packages from "add" list for given repository GLR->>PG: Delete packages from "remove" list for given repository rect rgb(0, 223, 0, .1) note right of RN: Embeddings flow opt with Elasticsearch available on an instance GLR->>ES: Check if packages from "add" list already exists in ES index ES-->>GLR: List of existing packages from "add" list GLR->>GLR: Remove already indexed packages from "add" list loop for each remaining package in "add" list GLR->>AIGW: Request embeddings for package description AIGW-->>GLR: Embeddings for package description end GLR->>ES: Create documents for packages in "add" list in ES index GLR->>PG: Check if packages form "remove" list exists in any repository PG-->>GLR: List of packages from "remove" group still in use GLR->>GLR: Filter packages that are still in use from "remove" list GLR->>ES: Delete all remaining packages in "remove" list from ES index end end
Later on, the Repository X-Ray report is used as follows:
sequenceDiagram actor USR as User participant IDE participant GLR as GitLabRails participant ES as Elasticsearch participant PG as PostgreSQL participant AIGW as AI Gateway USR->>+IDE: types: "#35; generate method that fetches <br>top charts from Spotify" IDE->>+GLR: trigger code generation for "#35; generate method <br>that fetches top charts from Spotify" alt with Elasticsearch available on an instance rect rgb(0, 223, 0, .1) note left of GLR: new embeddings flow GLR->>+AIGW: fetch embedding for instruction "in utils.js generate method that ... AIGW-->>-GLR: embeddings vector for instruction GLR->>PG: fetch packages for given repository and programming language PG-->>GLR: List of repository package names for given language GLR->>ES: fetches packages documents using KNN for instruction embeddings vector <br> from filtered list using repository package names ES-->>GLR: spotify/web-api-ts-sdk - A package that wraps ... GLR->>AIGW: code generation request with prompt including <br>spotify/web-api-ts-sdk - A package that wraps... end else rect rgb(128, 128, 128, .1) note left of GLR: current flow as fallback GLR->>PG: fetch X Ray report for repository and language PG-->>GLR: xray_report_packages records GLR->>AIGW: code generation request with prompt first 50 <br> entities from xray report end end
Required changes
On X Ray write path
1. Create new AI Gateway endpoint that generate embeddings
Embeddings for GitLab documentation are requested directly from GitLab Rails. That approach limits embeddings availability to GitLab.com only, and this is also misaligned with the architecture blueprint that outlines AI Gateway as a provider of AI services for GitLab features (for more information, see the related AI Gateway epic 13024). To avoid similar problems for Repository X-Ray, a new endpoint should be added to AI Gateway API that accepts batches of strings, and responds with embeddings vectors.
In the following iterations, it is expected that the AI Gateway endpoint (as the central point which has complete overview of all connected instances traffic) will enforce rate limiting and manage token consumption. However, from the very start clients will be responsible for correctly handling rate limiting and token exhaustion type errors.
Open question:
Embeddings are generated with the textembedding-gecko
model (768 dimensions). When adding a new API endpoint, we might be able to change model if needed, if so we should decide which one.
Because Repository X Ray report data has small volume and size (at this moment there are 379 reports on GitLab.com), the decision to switch model and rebuild embeddings data is low cost and can be deferred in order to unblock iteration.
2. Store embeddings vectors for Repository X-Ray report
The current Elasticsearch framework uses ActiveRecord to keep the index up to date. It uses callbacks on create/update/delete to action the corresponding record in Elasticsearch.
Because xray_reports
table in PostgreSQL main database stores whole reports as JSON blobs in order to persist generated embeddings vectors for each of report item (representing library used in given repository)
we would need to either:
- Modify the current Elastic framework as defined in issue 442197.
- If, due to higher urgency, migrate
xray_reports
table to a new structure, where each record represents a single entity (package/library) in the Repository X-Ray report, which would be compatible with the current Elasticsearch upload pipeline.
The index in Elasticsearch that stores Repository X-Ray packages has the following format:
index: xray-packages
document: {
id:
name:
language:
description:
embedding:
}
This index would store:
- Description.
- Name.
- The embeddings generated for the concatenated package name and description.
- Programming language.
This index will be shared between all repositories in a given GitLab instance. However, upon search the list of relevant packages will be filtered using prefetched lists of names of packages that belong to the given repository in a given programming language. That approach should:
- Reduce required storage capacity.
- Improve search performance.
- Reduce AI consumption used to generate embeddings.
Open question
We need to check if descriptions vary significantly depending on version, and decide whether to store records per package and its version, or just one per package.
3. Modify storage on GitLab Rails layer
Right now Repository X-Ray packages are stored in xray_reports
table at the main PostgreSQL DB.
CREATE TABLE xray_reports (
id bigint NOT NULL,
project_id bigint NOT NULL,
created_at timestamp with time zone NOT NULL,
updated_at timestamp with time zone NOT NULL,
lang text NOT NULL,
payload jsonb NOT NULL,
file_checksum bytea NOT NULL,
);
In order to support sharing embeddings between repositories whilst providing way to remove stale data from Elasticsearch, we must create a new table, xray_report_packages
.
CREATE TABLE xray_report_packages (
id bigint NOT NULL,
project_id bigint NOT NULL,
created_at timestamp with time zone NOT NULL,
updated_at timestamp with time zone NOT NULL,
lang text NOT NULL,
name text NOT NULL,
version text NOT NULL,
description text, --nullable filed as with Elasticsearch available this file will not be in use
);
After the new table is created, all reports from xray_reports
should be migrated there, and xray_reports
should be removed.
4. Modify Repository X-Ray import pipeline
The Repository X-Ray report, after being generated during the CI job, is imported using a background job.
That job uses Ai::StoreRepositoryXrayService
to parse and save the report file into the xray_reports
table at the main PostgreSQL DB.
In order to support semantic search for Repository X-Ray, we must apply the following modifications:
- Load the new Repository X-Ray report from scanner into the list of
current_report_packages
. - Load the list of packages reported in previous report for the given project stored in
xray_report_packages
into the list ofprevious_report_packages
(For example,SELECT * FROM xray_report_packages WHERE lang = 'ruby' AND project_id = 123
). - Filter packages from the new X-Ray report into three groups:
skip
- a collection of unmodified packages that do not require any action.add
- a collection of new packages that should be added toxray_report_packages
.remove
- a collection of packages that were present in the old report but missing in the new one (previous_report_packages
-current_report_packages
).
- Insert the new records into the PostgreSQL
xray_report_packages
for theadd
list of packages. - From the
add
list, filter out packages that already exist in Elasticsearch. Those are packages that are being used by other repositories, and in this case embeddings and descriptions will be shared. - Update and insert all packages that are left in the
add
list of packages in Elasticsearch. - Delete the
remove
list of packages from PostgreSQLxray_report_packages
for a given project. - Create an
orphaned
list of packages from theremove
list by selecting ones of the samename
andlang
that are not being used by any other project (For example,SELECT 1 FROM xray_report_packages WHERE name = 'kaminari' AND lang = 'ruby' LIMIT 1
). - Remove the
orphaned
packages from Elasticsearch.
On X-Ray read path
- Retrieve code generation instruction either from:
- IDE:
- IDE / LS upon detecting generation should send a code generation instruction (for example, a content of a comment that triggered code generation).
- GitLab Rails Code Suggestions API needs to add optional string parameter
instruction
.
- GitLab Rails if due to different priorities the IDE will not be able to deliver instruction on time to unblock this effort, for the sake of the initial iteration, the code generation instruction can be also retrieved by GitLab Rails.
- IDE:
- GitLab Rails needs to detect if Elasticsearch is available and:
- When Elasticsearch is available:
- Fetch list of
names
of packages that are stored inxray_report_packages
PostgreSQL database for the given project and language. - Use k-nearest neighbor (kNN) search of Elasticsearch filtered with
names
list of projects packages to retrieve the most relevant context.
- Fetch list of
- When Elasticsearch is not available:
- Select the first 50 records from
xray_report_packages
in PostgreSQL database aligned with the current state diagram.
- Select the first 50 records from
- When Elasticsearch is available:
e47101dc
)