Code Suggestions Model Evaluation Guide

This document serves as a technical how-to guide for evaluating new Code Suggestions models.

Evaluation template

When starting a model evaluation process, you must create an issue using the Model Evaluation Template.

Evaluation criteria

Before supporting a model for Code Suggestions, we must evaluate that model against several criteria, including correctness and latency. For a more detailed list of criteria to consider, please refer to the evaluation template.

Evaluating correctness

To evaluate model correctness, use ELI5.

Evaluating latency

To evaluate model latency, use either ELI5 or the ai-model-latency-tester.

When evaluating latency, we recommend checking requests originating from different regions. The common regions to test are North America, Europe, and APAC.

We can evaluate latency in the following ways; a minimal timing sketch follows the list:

  • Direct to provider
    • Sending requests directly to the AI model provider, for example Vertex AI or Anthropic.
  • Routed through AIGW to provider
    • Sending requests to the AIGW, which in turn sends requests to the provider.
    • Before this can be done, you must implement the model in the AIGW. You can implement a model in the AIGW without making it generally available to GitLab users.
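
As a rough illustration of the direct-to-provider path, the sketch below times repeated completion requests from a single client. The endpoint URL, headers, and payload are placeholders rather than a real provider or AIGW configuration; substitute the values for the model under evaluation.

```python
import statistics
import time

import requests

# Placeholder values -- substitute the real provider (or AIGW) endpoint,
# credentials, and request payload for the model under evaluation.
ENDPOINT_URL = "https://provider.example/v1/completions"
HEADERS = {"Authorization": "Bearer <API_KEY>"}
PAYLOAD = {"model": "<model-under-test>", "prompt": "def fibonacci(n):", "max_tokens": 64}


def measure_latency(samples: int = 20) -> None:
    """Send repeated completion requests and print latency percentiles."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter()
        response = requests.post(ENDPOINT_URL, headers=HEADERS, json=PAYLOAD, timeout=30)
        response.raise_for_status()
        durations.append(time.perf_counter() - start)

    durations.sort()
    print(f"median: {statistics.median(durations):.3f}s")
    print(f"p95:    {durations[int(len(durations) * 0.95) - 1]:.3f}s")


if __name__ == "__main__":
    measure_latency()
```

Pointing the same harness at the AIGW endpoint, and running it from clients in each target region, makes the two paths and the regional differences directly comparable.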

Evaluation methods

Evaluating by ELI5

ELI5 (Eval like I’m 5) provides a structured way to evaluate AI models using LangSmith. The ELI5 repository includes the evaluation scripts, while the sample datasets and the evaluation results are stored in the LangSmith platform.
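
The evaluation scripts live in the ELI5 repository, so you normally do not write this code yourself. As a minimal sketch of the LangSmith pattern that ELI5 builds on, the example below runs a dataset through a target function and scores the results with a custom evaluator; the dataset name, field names, and placeholder model call are assumptions, not values taken from ELI5.

```python
from langsmith.evaluation import evaluate

# Hypothetical dataset name; the real datasets live in the LangSmith workspace.
DATASET_NAME = "code-suggestions-sample-dataset"


def call_model(prefix: str) -> str:
    """Placeholder for the model under evaluation (provider or AIGW call)."""
    return "pass  # completion returned by the model under test"


def generate_completion(inputs: dict) -> dict:
    """Target function: map a dataset example's inputs to the model's output."""
    return {"completion": call_model(inputs["prefix"])}


def exact_match(run, example) -> dict:
    """Custom evaluator: score 1 when the completion matches the reference."""
    return {
        "key": "exact_match",
        "score": int(run.outputs["completion"] == example.outputs["completion"]),
    }


# Runs every example in the dataset through the target function and records
# the scores as an experiment in LangSmith.
results = evaluate(
    generate_completion,
    data=DATASET_NAME,
    evaluators=[exact_match],
    experiment_prefix="code-suggestions-model-eval",
)
```

Running such an evaluation requires a LangSmith API key and a dataset whose input and output fields match the target function; the scores then appear as an experiment in the LangSmith UI.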

Running and analyzing evaluations on ELI5

For guidance on running evaluations and analyzing the results, refer to the ELI5 repository.

Running evaluations on a GCP instance

Running ELI5 evaluations on a GCP instance is ideal for getting consistent latency values that are not affected by your internet connection or your current location. Currently, there is no automated way to run evaluations on a GCP instance, so you must do this manually.
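
One possible manual workflow, sketched below, is to create a short-lived Compute Engine VM in the region you want to measure from, run the evaluation from an SSH session on that VM, and delete the VM afterwards. The instance name, zone, and machine type are illustrative assumptions, and the exact ELI5 invocation depends on the evaluation you are running.

```python
import subprocess

# Illustrative values only -- pick the zone that matches the region you want to
# measure from and a machine type that fits the evaluation workload.
INSTANCE = "eli5-latency-eval"
ZONE = "us-central1-a"


def run(cmd: list[str]) -> None:
    """Run a gcloud command and fail loudly if it errors."""
    subprocess.run(cmd, check=True)


# 1. Create a short-lived Compute Engine VM in the target region.
run(["gcloud", "compute", "instances", "create", INSTANCE,
     "--zone", ZONE, "--machine-type", "e2-standard-4"])

# 2. Open an interactive SSH session on the VM; clone ELI5 there and run the
#    evaluation so latency is measured from the VM's region, not your workstation.
run(["gcloud", "compute", "ssh", INSTANCE, "--zone", ZONE])

# 3. Delete the VM when finished to avoid idle cost.
run(["gcloud", "compute", "instances", "delete", INSTANCE, "--zone", ZONE, "--quiet"])
```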

Please reach out to the #g_code_creation Slack channel for guidance.

Evaluating by AI Model Latency Tester

The AI Model/Provider Latency Tester automates latency evaluation of third-party AI service providers, using clients in different geographic regions to simulate the experience of geographically dispersed users. It helps inform data-driven decisions about which models should power GitLab’s AI features.

See the Latency evaluations issue for further guidance and updates.