AI Model Validation Group

The Model Validation group focuses on supporting GitLab teams in making data-driven feature development decisions that leverage ML/AI.

πŸ“ Mission

The AI Model Validation team's mission is to support and improve the integrity, reliability, and effectiveness of Generative AI solutions through evaluation, validation, and research science processes for GitLab Duo Features. We offer a centralized evaluation framework that promotes data-driven decision-making and pragmatic refinement of AI features. We further explore methods and techniques for forward-deployed research science in the GenAI space.

Direction

Here is the group direction page. We have two categories under this group: Category AI Evaluation and Category AI Research. This group is part of the AI-Powered stage.

Central Evaluation Framework (CEF)

Model Validation is the primary maintainer of our Central Evaluation Framework (CEF). This tool supports the entire end-to-end process of AI feature creation, from selecting the appropriate model for a use case to evaluating the AI features’ output. AI Model Validation works in concert with other types of evaluation, such as SET Quality testing and diagnostic testing, but is specifically focused on the interaction with Generative AI.

Read How we validate and test AI models at scale on the GitLab blog.

πŸš€ Team members

The team is composed of ML engineers focused on ML science and MLOps backend; they are permanent members of the AI Model Validation Group.

Who | Role
Hongtao Yang | ML Engineer
Andras Herczeg | Senior ML Engineer
Stephan Rayner | Senior ML Engineer
Tan Le | Senior ML Engineer
Monmayuri Ray | Engineering Manager
Susie Bitters | Senior Product Manager

☎️ How to contact us

Customer outcomes we are driving for GitLab Duo

The customer outcomes we are focused on can be divided into the themes below:

Benchmark for Quality and Performance Metric of Foundational Model and Feature

We first assess the models and the features on a large-scale dataset to benchmark quality against a defined set of metrics. We provide dashboards for diagnostic purposes as well as a continuous daily-run dashboard so we can track how the features are performing against the benchmark.

Evaluation as a Tool for Software Engineers to Experiment as They Iterate and Build AI Features

After the initial assessment, we provide a dynamic dataset pulled from scheduled runs so feature teams can run the datasets with every code and prompt change via the CLI. This helps them understand how changes in code, prompts, or the system can impact quality, based on the variance between control (before the change) and test (after the change) runs on a primary metric of choice.
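
The sketch below illustrates this control/test comparison in Python; the helper names, file paths, and JSONL result format are assumptions, not the actual CEF CLI.

```python
# Minimal sketch of a control-vs-test comparison on a primary metric.
# load_scores, the file paths, and the JSONL result format are hypothetical;
# the real CEF CLI and output formats may differ.
import json
import statistics


def load_scores(path: str, metric: str) -> list[float]:
    """Load per-example scores for one metric from a JSONL results file."""
    with open(path) as f:
        return [json.loads(line)[metric] for line in f]


def compare_runs(control_path: str, test_path: str, metric: str = "similarity_score") -> float:
    """Return the mean delta between the test run and the control run."""
    control = load_scores(control_path, metric)
    test = load_scores(test_path, metric)
    delta = statistics.mean(test) - statistics.mean(control)
    print(f"{metric}: control={statistics.mean(control):.3f} "
          f"test={statistics.mean(test):.3f} delta={delta:+.3f}")
    return delta


# Example: compare the runs before and after a prompt change.
# compare_runs("results/control.jsonl", "results/test.jsonl")
```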

Documentation as We Use New Ways and Processes for GenAI Evaluation

We are further iterating and documenting an evaluation-centric way of building GenAI features. This is mainly for the internal team, and the epic to track this can be found here.

Our current customers include GitLab AI-powered Duo feature teams.

The immediate customers include:

  1. AI Powered: Duo-Chat team
  2. Create: Code Creation team
  3. Govern: Threat Insights AI Vulnerability Management team
  4. Root Cause Analysis
  5. RAG Evaluation
  6. Issue Summarization
  7. AI Powered: Group Custom Models

πŸ§ͺ Top FY25 Priorities

Data-driven, evaluated AI solutions with every code change. Our work encompasses two categories: AI Evaluation and AI Research. Our goal is to empower each team building AI features to confidently deliver meaningful and relevant features for GitLab customers. As a long-term initiative, we aim to expand our Centralized Evaluation Framework to assess various models, AI features, and components based on quality, cost, and latency. The primary decision factors for AI content quality are the following (a rubric sketch follows the list):

  • Is it honest? (consistent with facts)
  • Is it harmless? (does not include content that might offend or harm)
  • Is it helpful? (accomplishing the end goal of the user)
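
The sketch below shows one way these three criteria could be encoded as a judging rubric; the class, field wording, and prompt format are hypothetical, not the CEF's actual schema.

```python
# Minimal sketch of the three quality criteria expressed as a judging rubric.
# The class, field wording, and prompt format are hypothetical, not the CEF's
# actual schema.
from dataclasses import dataclass


@dataclass
class QualityRubric:
    honest: str = "Is the answer consistent with the facts in the provided context?"
    harmless: str = "Does the answer avoid content that might offend or harm?"
    helpful: str = "Does the answer accomplish the end goal of the user?"

    def as_judge_prompt(self, question: str, answer: str) -> str:
        criteria = "\n".join(f"- {name}: {text}" for name, text in vars(self).items())
        return (
            "Rate the answer from 1 to 5 on each criterion below.\n"
            f"{criteria}\n\nQuestion: {question}\nAnswer: {answer}"
        )
```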

We also aim for AI Engineers to leverage the Centralized Evaluation Framework for experimentation as they expand from prompt engineering and RAG to agents and model tuning. This can be achieved through the Framework’s API for the Prompt Library, recognizing that every code change significantly impacts the inputs and outputs of LLMs.

Further, there are novel research topics, and we would love for GitLab to be represented in the AI research community by publishing our approaches to evaluation.

πŸ“š Prompt Library (Data)

We create large libraries (prompts as data) that serve as a proxy for production. We do this by understanding the various complexities of tasks and methods, which allows us to evaluate holistically across a large dataset rather than a handful of tests, approximating performance in production. We use a combination of industry benchmarks and customized datasets for various tasks. The current tasks we have included or plan to include in the prompt library are as follows:

  1. Code Completion
  2. Code Generation
  3. Code Explanation
  4. Issue/Epic Question Answering
  5. GitLab Documentation Question Answering Dataset
  6. /slash Commands: /explain, /test, and /refactor (In Progress)
  7. Vulnerability Explanation/ Resolve (In Progress)
  8. Root Cause Analysis (In Progress)
  9. Feature Summarization (To be added)

We are further planning to build customized workflow datasets, particularly for system evaluation (RAG, agents) and contextual evaluation (text follow-up questions).
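
To make "prompts as data" concrete, here is a minimal sketch of what a single prompt-library entry might look like; the field names and example values are illustrative assumptions, not the actual CEF schema.

```python
# Minimal sketch of a single prompt-library entry treated as data.
# Field names and example values are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class PromptLibraryEntry:
    task: str                          # e.g. "code_generation", "documentation_qa"
    prompt: str                        # the input sent to the model
    ground_truth: str | None = None    # reference answer, if one exists
    context: dict = field(default_factory=dict)  # e.g. file contents, issue text
    source: str = "custom"             # "industry_benchmark" or "custom"


example = PromptLibraryEntry(
    task="documentation_qa",
    prompt="How do I create a merge request from the command line?",
    ground_truth="Use the glab CLI: glab mr create ...",
)
```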

πŸ” Metrics

There are a few different metrics that we use for assessment. If we have established ground truth, we conduct an assessment with similarity and cross-similarity scores. If ground truth is not established, we use Consensus Filtering as an LLM-based evaluator, through an Independent LLM Judge and a Collective LLM Judge. We are always iterating on and evolving our metric pipeline.

Similarity Score

This metric evaluates the degree of similarity between an answer generated by a point solution and those produced by other LLMs, such as Claude-2, Text-Bison, and GPT-4, in response to the same question or to ground truth.
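
As one possible realization, the sketch below computes a similarity score as cosine similarity over embeddings; the embed function is a placeholder, and the CEF's actual scoring may differ.

```python
# Minimal sketch of a similarity score: cosine similarity between the embedding
# of a generated answer and the embedding of a reference (ground truth or
# another LLM's answer). The embed callable is a placeholder.
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_score(candidate: str, reference: str, embed) -> float:
    """Score in [-1, 1]; higher means the candidate is closer to the reference."""
    return cosine_similarity(embed(candidate), embed(reference))
```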

Independent LLM Judge

This metric involves soliciting evaluations from LLM Judges to assess the quality of an answer given a specific question and context. Judges are tasked with assigning scores based on three key aspects: correctness, comprehensiveness, and readability. To enhance the credibility of these scores, multiple LLMs can participate as judges. For instance, if three judges unanimously agree that an answer is subpar, we can conclude with confidence that it is of low quality.
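
The sketch below illustrates the pattern; call_llm is a placeholder client, not a specific GitLab or vendor API, and the rubric wording is an assumption.

```python
# Minimal sketch of an Independent LLM Judge: each judge model scores the same
# answer separately on correctness, comprehensiveness, and readability.
# call_llm is a placeholder for whatever client is used to query a model.
import json

RUBRIC = (
    "Score the answer from 1 to 5 on correctness, comprehensiveness, and "
    "readability. Respond as JSON, e.g. "
    '{"correctness": 4, "comprehensiveness": 3, "readability": 5}.'
)


def independent_judge(question: str, answer: str, judges: list[str], call_llm) -> dict:
    """Return each judge's scores, keyed by judge model name."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    return {judge: json.loads(call_llm(model=judge, prompt=prompt)) for judge in judges}
```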

Collective LLM Judge

This metric operates similarly to the “LLM Judge” metric but consolidates all answers generated by each answering model into a single prompt. Judges are then tasked with comparing these consolidated responses and assigning scores accordingly.
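
A minimal sketch of the consolidated variant, using the same hypothetical call_llm placeholder:

```python
# Minimal sketch of a Collective LLM Judge: all answering models' responses are
# consolidated into one prompt so the judge can compare them side by side.
# call_llm is again a placeholder client.
def collective_judge(question: str, answers: dict[str, str], judge: str, call_llm) -> str:
    """answers maps answering-model name -> its response to the question."""
    consolidated = "\n\n".join(
        f"Answer from model {name}:\n{text}" for name, text in answers.items()
    )
    prompt = (
        "Compare the following answers to the same question and score each one "
        "from 1 to 5 on correctness, comprehensiveness, and readability.\n\n"
        f"Question: {question}\n\n{consolidated}"
    )
    return call_llm(model=judge, prompt=prompt)
```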

Performance Metrics (To be Added)

In addition to the similarity and consensus-based metrics, we will also track several performance metrics to evaluate the overall effectiveness of our AI models and features:

  • Latency: We measure the time it takes for a model to generate a response, ensuring that the latency is within acceptable limits for a seamless user experience.

  • Requests per Second (Concurrency): We monitor the number of requests our models can handle per second, allowing us to understand the scalability and capacity of our AI infrastructure.

  • Tokens per Second: We monitor the number of tokens rendered per second during LLM response streaming. This helps assess the speed and efficiency of the LLM in generating and streaming responses, which is critical for user experience in real-time applications.

By continuously monitoring these performance metrics, we can make data-driven decisions to optimize the performance, reliability, and user experience of our AI-powered solutions through our synthetic CEF as a proxy for production.
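
The sketch below shows how latency, time to first token, and tokens per second could be measured for a streaming response; stream_tokens is a placeholder, and the CEF's actual instrumentation may differ.

```python
# Minimal sketch of measuring latency and tokens per second for a streaming
# LLM response. stream_tokens is a placeholder iterable that yields tokens as
# they arrive; the CEF's actual instrumentation may differ.
import time


def measure_streaming(stream_tokens) -> dict:
    start = time.perf_counter()
    first_token_at = None
    token_count = 0
    for _ in stream_tokens:
        token_count += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": (first_token_at - start) if first_token_at else None,
        "total_latency_s": total,
        "tokens_per_second": token_count / total if total else 0.0,
    }
```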

πŸ“¦ Team Processes

We are a globally distributed team spanning EMEA, AMER, and APAC. We hold two synchronous sessions weekly, scheduled around the team's preferences and periodically adjusted as those preferences change. We have meetings dedicated to milestone planning, as well as engineering discussions and ideation.

πŸ“† Regular Team Meetings

Team Meetings

  1. Weekly Team Sync

    • When: Every Wednesday, 21:00 GMT
    • What: This meeting is dedicated to working on the vision and roadmap. The Engineering Manager and Product Manager ideate, discuss, and assign work as needed for the entire team.
  2. Weekly Engineering Sync

    • When: Every Tuesday, 21:00 GMT
    • What: This meeting is dedicated to the engineering team for the purpose of syncing up on progress, discussing technical challenges, and planning the upcoming week and milestones. It is also a space to ideate on future milestones and on building validation as a product.
  3. Quarterly Creative Destruction Labs

    • When: Once every 10 weeks
    • What: This is a 48-hour working session, comprising both synchronous and asynchronous activities, where the team comes together under the AI research category as part of a lab. The goal is to take a topic, deconstruct the old approach, and rebuild it in a new way to rapidly iterate toward the product roadmap and vision.

🌍 Shared Calendars

  • AI-Powered Stage Calendar (Calendar ID: c_n5pdr2i2i5bjhs8aopahcjtn84@group.calendar.google.com)

πŸ–– Weekly and Quarterly Updates

Each week, the team publishes a report here on the progress made during the past week and the focus for the upcoming week. The report also includes a GenAI reading list to ensure that the engineers stay up to date in the ever-changing GenAI space.

We also publish a quarterly report here that summarizes how we perform against our OKRs, highlights our achievements, celebrates milestones, identifies opportunities, and shares learnings.

πŸ“Ή GitLab Playlist

We conduct regular walkthroughs as we add data, metrics, and evaluation workflows. The GitLab AI Model Validation Playlist includes these walkthroughs. Some of the published videos might be for internal purposes only.

🎯 Current OKR

Our current OKR can be viewed here (GitLab internal).

πŸ”— Epics and Themes

We have two major epics that can be subdivided into further sub-epics and issues. The themes are based on Category AI Evaluation and Category AI Research, as outlined below.

πŸ”„ How to work with us?

We have issue templates for requesting a new model evaluation or for evaluating a feature (Internal Only). Below are the request templates that can be used.

  1. If a feature team would like a model to be evaluated for a certain task, here is the request template: Model Request
  2. If a feature team would like to evaluate a certain use case, here is the request template: Use-Case Request
  3. If a feature team finds bugs that impact their ability to use the framework, here is a template to request a fix in a new issue. We also encourage posting in #g_ai_model_validation.

Further, feedback helps us iterate and act more quickly; here is the best place to provide feedback.

πŸ“ Dashboards and Additional Resources (Internal Only)

πŸ”— Additional Resources

Required labels

  • Group: ~group::ai model validation