Codebase as Chat Context

This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. The development, release, and timing of any products, features, or functionality may be subject to change or delay and remain at the sole discretion of GitLab Inc.
Status Authors Coach DRIs Owning Stage Created
proposed partiaga tgao3701908 jessieay dgruzd jordanjanes mnohr devops create 2025-04-02

Summary

We are introducing the capability to include Codebase as additional contexts to Duo Chat requests. This can refer to the entire repository or to a sub-directory under the repository.

To achieve this, we will index the codebase as vector embeddings, referred to as Code Embeddings.

When the user asks a question on Duo Chat, the system executes a semantic search over the Code Embeddings to retrieve relevant context from repositories, which is then processed by large language models to generate helpful responses.

This new feature will be available to GitLab Premium or Ultimate users with the Duo Pro or Duo Enterprise add-ons.

Epic: The work for this feature is tracked in this epic.

Motivation

Currently, we don’t do a great job of helping customers understand their repository and code base. Competitors support a broader aperture – a user can ask questions about an entire repository, or scope the context to multiple folders, multiple files, and portions of code. This functional gap is commonly mentioned by customers, and here’s a recent summary of research in this space. LLM’s are only as good as the context we give them, so it is important that we achieve parity with our competitors in this area.

This initiative aims to bridge this critical functional gap in GitLab’s Duo Chat offering by enabling users to interact with their entire codebase through natural language queries. This capability allows users to more effectively understand, navigate, and plan changes to their repositories – a feature already offered by competing products.

Goals

The main goal is to add Codebase as an additional context to Duo Chat. In this initiative, this is scoped to Repository and Directory. A semantic search will then be done over the Repository or Directory, with the results used to enhance the Chat prompt sent to the AI model.

To support semantic search, the creation of code embeddings is included in the initial scope of this work. Indexing of the default branch will be done for the first phase of the work, with feature branches indexed in the second phase.

We will only generate embeddings for projects or namespaces with Duo enabled.

Non-Goals

The following is out of scope for this initiative, but could theoretically be built upon it:

  • Codebase as additional context for Agentic Duo Chat.
  • Codebase as additional context for Duo Chat Slash Commands.
  • Codebase as additional context for Code Suggestions.
  • Support for indexing and querying locally changed files as vector embeddings.
  • A Knowledge Graph representation of the codebase as additional context to Duo Chat.

Please see Next Steps and Future Proofing for proposed plans regarding the above topics.

Proposal

In order to support Codebase as Chat Context, we need to:

  1. Introduce Code Embeddings

    • This is a vector representation of files in the codebase.
    • This includes a pipeline to index the codebase as vector embeddings
    • This gives the ability to perform a semantic search over the embeddings
    • This will be developed in 2 phases:
      • Phase 1: Support code embeddings on the main branch
      • Phase 2: Support code embeddings on feature branches
  2. Update Duo Chat to support codebase as additional context.

    • When asking a question on Chat, the user will have the option to include the following as contexts:
      • repository - refers to the entire codebase of a project
      • directory - this is a “subset” of the repository, and refers to a subfolder in a project
    • When a repository is selected as additional context
      • a semantic search is done over the Code Embeddings representation of the files in the repository. The search result is then used to enhance the Chat prompt sent to the AI model.
    • When a directory is selected as additional context:
      • a semantic search is done over the Code Embeddings representation of the files in the directory. The search result is then used to enhance the Chat prompt sent to the AI model.
      • as an additional consideration: if a directory only has a few files, the file contents are included directly as additional context to enhance the Chat prompt sent to the AI model.

Design and implementation details

Components

This initiative introduces or updates the following components:

Code Embeddings

Please refer to the Code Embeddings blueprint for the detailed design of this component.

This is a vector representation of files in the Codebase. We will introduce an indexing pipeline to generate embeddings when a change is pushed to a branch. We will also introduce the capability to search over these embeddings.

Indexing the Code Embeddings

  • The changes are done on both the Gitlab Elasticsearch Indexer and GitLab Rails.
  • On Rails, we will make use of the AI Context Abstraction Layer.
  • We will make use of a new Code Parser library to parse the code files into logical chunks before generating embeddings on those chunks.
    • The Code Parser lives in its own repository so that it can be used in different projects.
    • For further design and implementation details, please see the One Parser proposal.

Searching the Code Embeddings

Duo Chat

Duo Chat is already an existing AI feature on GitLab.

For details on the current Duo Chat workflow and architecture, please refer to the following documentations:

In this initiative, we will introduce changes on the GitLab Language Server, GitLab Rails, and the GitLab AI Gateway. For further implementation details, please proceed to the Adding the Codebase as Context on Duo Chat section below.

Adding the Codebase as Context on Duo Chat

The diagram below shows the workflow for including codebase (repository or directory) semantic search results as additional context. The areas highlighted in blue are where we’ll introduce new changes.

sequenceDiagram
    actor USR as User
    participant FE as IDE/Language Server
    box GitLab Rails
      participant GLRGQL as GraphQL API
      participant GLRLLMS as LLM Services
      participant GLRLLMCHAT as LLM Chat Module
      participant GLRLLMSEM as LLM Semantic Search Tool
      participant CES as Code Embeddings Search Service
    end
    participant CE as Code Embeddings Storage
    participant AIGW as AI Gateway
    participant LLM as LLM

    USR->>FE: Asks a question, indicating<br /> repository or directory<br /> as additional context

    FE->>GLRGQL: Sends chat request to `aiAction` mutation<br /> with repository or directory<br /> as additional context
    Note over FE: The Language Server will send the ID and category<br /> of the repository and directory additional contexts,<br /> but the content will still be empty at this point.
    GLRGQL->>GLRLLMS: Sends chat request, with<br /> repository or directory<br /> as additional context
    GLRLLMS->>GLRLLMCHAT: Sends chat request, with<br /> repository or directory<br /> as additional context

    rect rgb(240, 248, 255)
      GLRLLMCHAT->>GLRLLMSEM: Executes the Semantic Search Tool,<br /> with a repository or directory filter
      GLRLLMSEM->>CES: Performs the semantic search

      CES->>AIGW: Generates embeddings for the question
      AIGW-->>CES: Returns embeddings for the question
      CES->>CE: Queries Code Embeddings with<br /> the question embeddings as target<br /> and filtered by the given repository or directory
      CE-->>CES: Returns Code Embeddings
      CES-->>GLRLLMSEM: Returns semantic search result

      GLRLLMSEM-->>GLRLLMCHAT: Returns semantic search result
      GLRLLMCHAT->>GLRLLMCHAT: Adds the semantic search result as the content<br /> to the repository and directory additional context
      Note over GLRLLMCHAT: The content of the repository and directory<br /> additional contexts will be set here.

      GLRLLMCHAT->>AIGW: Request to /v2/chat/agent<br /> with repository or directory +<br /> semantic search result as additional context
      AIGW->>AIGW: Builds a prompt with the<br /> repository or directory +<br /> semantic search result<br /> as additional context
    end

    AIGW->>LLM: Sends prompt with the additional contexts

    LLM-->>AIGW: Returns the final answer
    AIGW-->>GLRLLMSEM: Returns the final answer
    GLRLLMSEM-->>GLRLLMCHAT: Returns the final answer
    GLRLLMCHAT-->>GLRLLMS: Returns the final answer
    GLRLLMS-->>GLRGQL: Returns the final answer
    GLRGQL-->>FE: Returns the final answer
    FE-->>USR: Shows the answer

Additional Context Category

We will add one additional context category: repository. If a directory is given, it will be considered a repository additional context, with the relative path of the directory specified in the metadata.

Unit Primitives

We will add the following unit primitives as part of this initiative:

Include Context

  • include_repository_context - used by both the Language Server and AIGW

Tool

Embeddings Generation

  • generate_embeddings_codebase - unit primitive used for the embeddings generation endpoint (/v1/proxy/vertex-ai)

Code Embeddings Search Service

This is a service class that handles the calls to the Code Embeddings. This makes use of the AI Context Abstraction Layer. For details on how this is done, please refer to the Search section in the Code Embeddings blueprint.

Duo Semantic Search Tool

This is a new Duo Chat tool that will be introduced in this initiative.

Similar to the Slash Command Tools, there is no need to have LLM infer whether the tool is needed. This tool will be called as long as the repository additional context is present.

On Rails, we will introduce an LLM Tool class that will call the Code Embeddings Search Service to fetch the matching embeddings of a Chat question. This new Semantic Search Tool will be called from the LLM Chat module, with the search results then included as the content of either the repository additional context. The Chat request is then sent to AIGW with the new additional contexts.

API Changes - Duo Chat Available Features

The Language Server calls the GraphQL query {currentUser { duoChatAvailableFeatures } } to fetch the list of available Duo Chat features.

As part of this initiative, we will add the include_repository_context unit primitive to this list of features.

API Changes - Chat Request

The GraphQL mutation used by Duo Chat (aiAction) already accepts additionalContext as a parameter for a chat input.

With the repository additional context category, the chat input to the aiAction mutation should then look like:

For repository as additional context

mutation newChatMessage {
  aiAction(
    input: {
      chat: {
        content: "the user question"
        additionalContext: [{
          category: "repository",
          id: "the-project-id",
          content: "", # should be empty
          metadata: {
            directory: "some/dir" # this is optional, specified when the user selects directory as an additional context
            branch: "some-branch" # including the branch is a second phase iteration
          }
        }]
      }
    }
  ) {
    requestId
  }
}

Duo Chat Changes - Frontend

See UI Design.

Evaluations

Codebase context enhancement can have different results depending on different factors such as the granularity of embeddings or the embeddings model used. Beyond the MVC iteration of this feature, we should evaluate the effectivity of different embeddings models, chunking granularity, and other approaches to embeddings.

Possible approaches for evaluation

Approach Description
Size-based chunking Split files into chunks of fixed size or token count
Tree-sitter chunking Parse code structure using AST to create semantically meaningful chunks
Whole File Embedding Generate embeddings for entire file contents (blob content)
Different Embedding Models Use purpose-built models for code vs. general text models

Next Steps and Future Proofing

Proposed steps for porting to the Agentic Chat architecture

Once we introduce the Agentic Chat architecture, either the Duo Workflow Service on the AI Gateway or the Duo Workflow Executor on the Language Server will need to query the vector embeddings.

In order to support this, we will introduce an API over the Code Embeddings Search Service to be called either from the Duo Workflow Service or the Duo Workflow Executor.

Codebase as additional context for Duo Chat Slash Commands

Slash commands include /refactor, /fix, /test.

The Slash commands can either make use of the Semantic Search Tool or directly call the Code Embeddings Search Service to search over Code Embeddings.

Alternatively, we can support codebase as additional context for Slash commands only in Agentic Chat.

Codebase as additional context for Code Suggestions

Code Completion or Code Generation can make use of the Code Embeddings Search Service, which abstracts all the logic needed for searching over the Code Embeddings.

Proposed steps for supporting local file indexing

TBA

Indexing the codebase as a Knowledge Graph

TBA

Alternative Solutions

In this solution, we would introduce a tool definition for the Codebase Semantic Search on the AIGW. This tool will then be made available to the LLM, which infers whether the tool is needed based on the question.

We decided not to go with this solution for the following reasons:

  • There is no need for the LLM to infer whether the codebase search tool is required. If a repository or directory additional context is included, then we can immediately do a codebase search.
  • The MVC proposal and the requirement from Product is that the user should be able to explicitly decided whether a codebase semantic search is needed. In this solution, while the user can specify repository or directory as additional context, the LLM still has the final decision on whether the codebase search is performed.
  • This solution means an additional round of requests between Rails and AIGW, making the latency higher.

PoC for the tool definition: MR: POC: Codebase search tool.

Workflow diagram:

sequenceDiagram
    actor USR as User
    participant FE as IDE/Language Server
    box GitLab Rails
      participant GLRGQL as GraphQL API
      participant GLRLLMS as LLM Services
      participant GLRLLMCHAT as LLM Chat Module
      participant GLRLLMSEM as LLM Semantic Search Tool
      participant CES as Code Embeddings Search Service
    end
    participant CE as Code Embeddings Storage
    participant AIGW as AI Gateway
    participant LLM as LLM

    USR->>FE: Asks a question, indicating<br /> codebase as additional context

    FE->>GLRGQL: Sends chat request to `aiAction` mutation<br /> with codebase as additional context
    GLRGQL->>GLRLLMS: Sends chat request, with<br /> codebase as additional context
    GLRLLMS->>GLRLLMCHAT: Sends chat request, with<br /> codebase as additional context
    GLRLLMCHAT->>AIGW: Request to /v2/chat/agent<br /> with codebase as additional context

    rect rgb(240, 248, 255)
      AIGW->>LLM: Sends chat request, indicating<br /> that codebase context is present
      LLM->>LLM: Determines that the<br /> semantic search tool is needed
      Note over LLM: The LLM will have a new available<br /> tool for semantic search.<br /> This tool will have instructions<br /> to perform a semantic search if<br /> the codebase context is present.
      LLM-->>AIGW: Returns response, indicating that<br /> the semantic search tool is needed
    end

    AIGW-->>GLRLLMCHAT: Returns response, indicating that<br /> the semantic search tool is needed

    GLRLLMCHAT->>GLRLLMSEM: Executes the Semantic Search Tool

    rect rgb(240, 248, 255)
      GLRLLMSEM->>CES: Performs the semantic search
      CES->>AIGW: Generates embeddings for the question
      AIGW-->>CES: Returns embeddings for the question
      CES->>CE: Queries Code Embeddings with<br /> the question embeddings as target
      CE-->>CES: Returns Code Embeddings
      CES-->>GLRLLMSEM: Returns semantic search result
    end

    GLRLLMSEM->>AIGW: Request to /v1/prompts/chat<br />with search results as additional context

    AIGW->>AIGW: Builds a prompt with the<br /> search results as additional context
    AIGW->>LLM: Sends prompt with the<br /> search results as additional context

    LLM-->>AIGW: Returns the final answer
    AIGW-->>GLRLLMSEM: Returns the final answer
    GLRLLMSEM-->>GLRLLMCHAT: Returns the final answer
    GLRLLMCHAT-->>GLRLLMS: Returns the final answer
    GLRLLMS-->>GLRGQL: Returns the final answer
    GLRGQL-->>FE: Returns the final answer
    FE-->>USR: Shows the answer

Introduce an “Ask Codebase” tool and prompt

In this solution, we will introduce an “Ask Codebase” tool:

  • On Rails, the LLM Chat module will determine that the tool is needed if there is a repository additional context
  • The “Ask Codebase” tool will use the “Codebase Embeddings Search Service” to get the semantic search results
  • The “Ask Codebase” tool will then send a request to the /v1/prompts/chat endpoint
  • On AIGW, there will be a new prompt for ask_codebase available through the /v1/prompts/chat endpoint
  • The workflow will essentially be similar to the Slash Command tools workflow
sequenceDiagram
    actor USR as User
    participant FE as IDE/Language Server
    box GitLab Rails
      participant GLRGQL as GraphQL API
      participant GLRLLMS as LLM Services
      participant GLRLLMCHAT as LLM Chat Module
      participant GLRLLMAC as LLM Ask Codebase Tool
      participant CES as Code Embeddings Search Service
    end
    participant CE as Code Embeddings Storage
    participant AIGW as AI Gateway
    participant LLM as LLM

    USR->>FE: Asks a question, indicating<br /> repository or directory<br /> as additional context

    FE->>GLRGQL: Sends chat request to `aiAction` mutation<br /> with repository or directory<br /> as additional context
    Note over FE: The Language Server will send the ID and category<br /> of the repository and directory additional contexts,<br /> but the content will still be empty at this point.
    GLRGQL->>GLRLLMS: Sends chat request, with<br /> repository or directory<br /> as additional context
    GLRLLMS->>GLRLLMCHAT: Sends chat request, with<br /> repository or directory<br /> as additional context

    rect rgb(240, 248, 255)
      GLRLLMCHAT->>GLRLLMAC: Executes the Ask Codebase Tool,<br /> with a repository or directory filter
      GLRLLMAC->>CES: Performs the semantic search

      CES->>AIGW: Generates embeddings for the question
      AIGW-->>CES: Returns embeddings for the question
      CES->>CE: Queries Code Embeddings with<br /> the question embeddings as target<br /> and filtered by the given repository or directory
      CE-->>CES: Returns Code Embeddings
      CES-->>GLRLLMAC: Returns semantic search result

      GLRLLMAC-->>GLRLLMCHAT: Returns semantic search result
      GLRLLMCHAT->>GLRLLMCHAT: Adds the semantic search result as additional context

      GLRLLMCHAT->>AIGW: Request to /v1/prompts/chat<br /> with repository or directory +<br /> semantic search result as additional context
      AIGW->>AIGW: Builds a prompt with the<br /> repository or directory +<br /> semantic search result<br /> as additional context
      Note over AIGW: This will be a new `ask_codebase` prompt
    end

    AIGW->>LLM: Sends prompt with the additional contexts

    LLM-->>AIGW: Returns the final answer
    AIGW-->>GLRLLMAC: Returns the final answer
    GLRLLMAC-->>GLRLLMCHAT: Returns the final answer
    GLRLLMCHAT-->>GLRLLMS: Returns the final answer
    GLRLLMS-->>GLRGQL: Returns the final answer
    GLRGQL-->>FE: Returns the final answer
    FE-->>USR: Shows the answer

Using different categories and unit primitives for repository and directory

Instead of using a single repository category for the repository and directory additional context, we will use repository and directory categories.

Note: we decided not to go with this option because a category has a 1-to-1 mapping to a unit primitive, and conceptually, we should only have 1 unit primitive for the “include codebase” context.

The chat input to the aiAction mutation should then look like:

For repository as additional context

mutation newChatMessage {
  aiAction(
    input: {
      chat: {
        content: "the user question"
        additionalContext: [{
          category: "repository",
          id: "the-project-id",
          content: "", // should be empty
          metadata: {'branch': 'some-branch'} // including the branch is a second phase iteration
        }]
      }
    }
  ) {
    requestId
  }
}

For directory as additional context

mutation newChatMessage {
  aiAction(
    input: {
      chat: {
        content: "the user question"
        additionalContext: [{
          category: "directory",
          id: "file:///home/user/workspace/src/dir",
          content: "", // should be empty
          metadata: {
            'relativePath': 'src/dir',
            'branch': 'some-branch' // including the branch is a second phase iteration
          }
        }]
      }
    }
  ) {
    requestId
  }
}

Code Embeddings
Code Embeddings Available tools for indexing Code Embeddings GitLab Active Context Gem A Ruby gem …