AI Model Validation Group

The Model Validation group is focused on supporting GitLab teams in making data-driven feature development decisions when leveraging ML/AI.

Mission

The AI Model Validation team's mission is to support and improve the integrity, reliability, and effectiveness of Generative AI solutions through evaluation, validation, and research science processes. We offer a centralized evaluation framework that promotes data-driven decision-making and pragmatic refinement of AI features.

Direction

Group direction

Team members

The following people are permanent members of the Model Validation Group:

Who | Role
Hongtao Yang | ML Engineer
Andras Herczeg | Senior Backend Engineer
Stephan Rayner | Senior ML Engineer
Tan Le | Senior ML Engineer
Monmayuri Ray | Engineering Manager
Susie Bitters | Senior Product Manager

How to contact us

  • Tag a team member in a merge request or issue
  • Post a message in the #g_ai_model_validation Slack channel (internal only)

Customer outcomes we are driving for GitLab

If you are a team building or seeking to build an AI solution, you are our customer. Additionally, we provide dashboards, insights, and guidance to empower you to confidently communicate with YOUR customers using data throughout the process. Example questions we help you answer include:

  • How did you select the appropriate model for a use case?
  • How did you systematically evaluate your AI solution AT SCALE as a proxy to production?
  • What measures were taken for various prompt engineering techniques?
  • Could you explain some of the benchmark datasets you used for evaluation?
  • Do you have insight into whether RAG or the AI Agent is truly effective, and how?

And a lot more!

Our current customers include:

  1. AI Powered: Duo-Chat team
  2. Create: Code Creation team
  3. Govern: Threat Insights Vulnerability Explanation team

Top FY25 Priorities

Data-driven, evaluated AI solutions with every code change.

Our work spans two categories: AI Evaluation and AI Research. Our goal is to empower each team building AI features to confidently deliver meaningful and relevant features for GitLab customers. As a long-term initiative, we aim to expand our Centralized Evaluation Framework to assess various models, AI features, and components based on quality, cost, and latency. The primary decision factors for AI content quality are:

  • Is it honest? (consistent with facts)
  • Is it harmless? (does not include content that might offend or harm)
  • Is it helpful? (accomplishes the end goal of the user)

We also aim for AI Engineers to leverage the Centralized Evaluation Framework to experiment and expand from prompt engineering and RAG to agents and model tuning. This can be achieved through the Framework’s API for the Prompt Library, recognizing that every code change can significantly impact the input and output of LLMs.
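
As an illustration only, the sketch below shows one way an evaluation record covering those quality, cost, and latency dimensions could be structured. The field and class names are assumptions made for this example, not the Framework's actual schema or API.

```python
# Illustrative sketch only: one possible shape for an evaluation record
# covering quality (honest / harmless / helpful), cost, and latency.
# Field names are assumptions, not the Centralized Evaluation Framework's schema.
from dataclasses import dataclass


@dataclass
class QualityScores:
    honest: float    # consistency with facts, 0-1
    harmless: float  # absence of offensive or harmful content, 0-1
    helpful: float   # accomplishes the user's end goal, 0-1


@dataclass
class EvaluationRecord:
    feature: str          # e.g. "duo-chat"
    code_change: str      # reference to the commit or MR being evaluated
    quality: QualityScores
    cost_usd: float       # inference/evaluation cost for the run
    latency_ms: float     # end-to-end response latency


record = EvaluationRecord(
    feature="duo-chat",
    code_change="example-mr-ref",  # hypothetical reference for illustration
    quality=QualityScores(honest=0.9, harmless=1.0, helpful=0.8),
    cost_usd=0.004,
    latency_ms=850.0,
)
print(record)
```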

Further, there are novel research topics, and we would love to see GitLab represented in the AI research community by publishing our approaches to evaluation!

Metrics we love

Similarity Score

This metric evaluates the degree of similarity between an answer generated by a point solution and the answers produced by other LLMs, such as Claude-2, Text-Bison, and GPT-4, in response to the same question, or to a ground-truth answer.
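
As a rough illustration of the idea (not the framework's actual implementation), the sketch below scores a candidate answer against a set of reference answers. The toy bag-of-words `embed` function and the cosine-similarity choice are stand-ins for whatever embedding model or text-similarity measure is actually used.

```python
# Minimal sketch: compare a candidate answer against reference answers
# (other LLMs' responses, or ground truth) with cosine similarity.
# `embed` is a toy bag-of-words stand-in for a real embedding model.
from collections import Counter
import math


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (illustrative stand-in only)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_score(candidate: str, references: list[str]) -> float:
    """Average similarity of the candidate answer to each reference answer."""
    cand_vec = embed(candidate)
    return sum(cosine_similarity(cand_vec, embed(r)) for r in references) / len(references)


candidate = "Use git revert to undo a commit without rewriting history."
references = [
    "You can undo a commit safely with git revert.",
    "Run git revert <sha> to create a commit that undoes the change.",
]
print(similarity_score(candidate, references))
```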

LLM Judge

This metric involves soliciting evaluations from LLM judges to assess the quality of an answer given a specific question and context. Judges are tasked with assigning scores based on three key aspects: correctness, comprehensiveness, and readability. To enhance the credibility of these scores, multiple LLMs can participate as judges. For instance, if three judges unanimously agree that an answer is subpar, we can conclude with greater confidence that it is low quality.
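
A minimal sketch of the idea follows. The judge prompt wording and the way judge models are wrapped as callables are assumptions for illustration, not the framework's API; in practice each judge would be a call to a real model such as Claude or GPT-4.

```python
# Minimal sketch of an LLM-judge evaluation. The prompt and `judges` wrapper
# are illustrative assumptions, not the evaluation framework's actual API.
import json
from statistics import mean
from typing import Callable

JUDGE_PROMPT = """You are evaluating an AI-generated answer.

Question: {question}
Context: {context}
Answer: {answer}

Score the answer from 1 (poor) to 5 (excellent) on correctness,
comprehensiveness, and readability. Reply with JSON, e.g.
{{"correctness": 4, "comprehensiveness": 3, "readability": 5}}."""


def judge_answer(question: str, context: str, answer: str,
                 judges: list[Callable[[str], str]]) -> dict:
    """Ask each judge LLM to score the answer, then average per aspect."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    scores = [json.loads(judge(prompt)) for judge in judges]
    return {
        aspect: mean(s[aspect] for s in scores)
        for aspect in ("correctness", "comprehensiveness", "readability")
    }


# In practice each judge wraps a real model call; a stub is used here.
stub_judge = lambda prompt: '{"correctness": 4, "comprehensiveness": 3, "readability": 5}'
print(judge_answer("How do I revert a commit?", "git docs", "Use git revert.",
                   judges=[stub_judge, stub_judge, stub_judge]))
```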

Collective Judge

This metric operates similarly to the “LLM Judge” metric but consolidates all answers generated by each answering model into a single prompt. Judges are then tasked with comparing these consolidated responses and assigning scores accordingly.
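
The sketch below illustrates only the consolidation step: every answering model's response is folded into a single comparative prompt. The prompt wording is an assumption, and the `call_judge` mentioned in the final comment is a hypothetical placeholder for a real judge-model call.

```python
# Minimal sketch of a collective-judge prompt: all answering models' responses
# are consolidated into one prompt so judges can compare them side by side.


def build_collective_prompt(question: str, answers: dict[str, str]) -> str:
    """Consolidate every model's answer into one comparative judging prompt."""
    listed = "\n\n".join(f"[{name}]\n{answer}" for name, answer in answers.items())
    return (
        f"Question: {question}\n\n"
        f"Candidate answers:\n{listed}\n\n"
        "Compare the candidate answers and assign each a score from 1 to 5 "
        "for correctness, comprehensiveness, and readability."
    )


answers = {
    "point-solution": "Use git revert to undo a commit.",
    "claude-2": "Run git revert <sha> to create an inverse commit.",
    "gpt-4": "git revert creates a new commit that undoes the target commit.",
}
prompt = build_collective_prompt("How do I undo a commit?", answers)
# The consolidated prompt would then be sent to one or more judge LLMs,
# e.g. scores = call_judge(prompt)  # `call_judge` is hypothetical
print(prompt)
```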

Short term priorities

Our OKRs can be viewed here (GitLab internal)

Issues

Our team works across GitLab projects, including:

Required labels

  • Group: ~group::ai model validation