AI Model Validation Group
Mission
The AI Model Validation team's mission is to support and improve the integrity, reliability, and effectiveness of Generative AI solutions through evaluation, validation, and research science processes. We offer a centralized evaluation framework that promotes data-driven decision-making and pragmatic refinement of AI features.
Direction
Team members
The following people are permanent members of the Model Validation Group:
| Who | Role |
|---|---|
| Hongtao Yang | ML Engineer |
| Andras Herczeg | Senior Backend Engineer |
| Stephan Rayner | Senior ML Engineer |
| Tan Le | Senior ML Engineer |
| Monmayuri Ray | Engineering Manager |
| Susie Bitters | Senior Product Manager |
How to contact us
- Tag a team member in a merge request or issue
- Post a message in the #g_ai_model_validation Slack channel (internal only)
Customer outcomes we are driving for GitLab
If you are a team building or seeking to build an AI solution, you are our customer. Additionally, we provide dashboards, insights, and guidance to empower you to confidently communicate with YOUR customers using data throughout the process. Some examples might include:
- How did you select the appropriate model for a use case?
- How did you systematically evaluate your AI solution AT SCALE as a proxy to production?
- What measures were taken for various prompt engineering techniques?
- Could you explain some of the benchmark datasets you used for evaluation?
- Do we have insight into whether RAG or the AI Agent is truly effective, and how?
And a lot more!
Our current customers include:
- AI Powered: Duo-Chat team
- Create: Code Creation team
- Govern: Threat Insights (Vulnerability Explanation) team
Top FY25 Priorities
Data-driven, evaluated AI solutions with every code change.
Our work encompasses two categories: AI Evaluation and AI Research. Our goal is to empower each team building AI features to confidently deliver meaningful and relevant features for GitLab customers. As a long-term initiative, we aim to expand our Centralized Evaluation Framework to assess various models, AI features, and components based on quality, cost, and latency. The primary decision factors for AI content quality are:
- Is it honest? (consistent with facts)
- Is it harmless? (does not include content that might offend or harm)
- Is it helpful? (accomplishing the end goal of the user)
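These three factors can be captured as a simple per-answer verdict record. The sketch below is illustrative only; the class and field names are assumptions, not the framework's actual schema:

```python
from dataclasses import dataclass

@dataclass
class QualityVerdict:
    # The three primary decision factors for AI content quality.
    honest: bool    # consistent with facts
    harmless: bool  # no content that might offend or harm
    helpful: bool   # accomplishes the end goal of the user

    def acceptable(self) -> bool:
        # An answer must satisfy all three factors to pass.
        return self.honest and self.harmless and self.helpful
```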
We also aim for AI Engineers to leverage the Centralized Framework for experimentation, expanding from prompt engineering and RAG to agents and model tuning. This can be achieved through the Framework's API for the Prompt Library, recognizing that every code change significantly impacts the input and output of LLMs.
Further, there are novel research topics, and we would love to see GitLab represented in the AI research community by publishing our approaches to evaluation!
Metrics we love
Similarity Score
This metric evaluates the degree of similarity between an answer generated by a point solution and the answers produced by other LLMs, such as Claude-2, Text-Bison, and GPT-4, in response to the same question, or to ground truth.
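In practice, this kind of comparison is often computed over vector embeddings of the answers. Below is a minimal sketch assuming each answer has already been embedded into a vector; the embedding step and the `similarity_score` helper are assumptions for illustration, not the framework's actual API:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def similarity_score(candidate_vec, reference_vecs):
    # Average similarity of a candidate answer's embedding against the
    # embeddings of reference answers (other LLMs or ground truth).
    sims = [cosine_similarity(candidate_vec, r) for r in reference_vecs]
    return sum(sims) / len(sims)
```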
LLM Judge
This metric involves soliciting evaluations from LLM Judges to assess the quality of answers provided given a specific question and context. Judges are tasked with assigning scores based on three key aspects: correctness, comprehensiveness, and readability. To enhance the credibility of these scores, multiple LLMs can participate as judges. For instance, if three judges unanimously agree that an answer is subpar, we can conclude with confidence that its quality is low.
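One way to aggregate such judge scores is sketched below; the example scores, the 1-5 scale, and the subpar threshold are all illustrative assumptions, not the framework's actual implementation:

```python
from statistics import mean

# Hypothetical scores: each judge rates one answer 1-5 on three aspects.
judge_scores = [
    {"correctness": 2, "comprehensiveness": 1, "readability": 2},  # judge A
    {"correctness": 1, "comprehensiveness": 2, "readability": 2},  # judge B
    {"correctness": 2, "comprehensiveness": 2, "readability": 1},  # judge C
]

def aggregate(scores, subpar_threshold=3):
    # Mean score per aspect across judges, plus a unanimity check:
    # the answer is flagged as subpar only if every judge's overall
    # mean falls below the threshold.
    aspects = scores[0].keys()
    per_aspect = {a: mean(s[a] for s in scores) for a in aspects}
    unanimous_subpar = all(mean(s.values()) < subpar_threshold for s in scores)
    return per_aspect, unanimous_subpar
```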
Collective Judge
This metric operates similarly to the “LLM Judge” metric but consolidates all answers generated by each answering model into a single prompt. Judges are then tasked with comparing these consolidated responses and assigning scores accordingly.
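A consolidated prompt of this kind might be assembled as follows; the template, model names, and scoring instruction are illustrative assumptions rather than the framework's actual prompt format:

```python
def build_collective_prompt(question, answers):
    # Consolidate all model answers into a single prompt so a judge
    # can compare them side by side and score each one.
    lines = [f"Question: {question}", "", "Candidate answers:"]
    for i, (model, answer) in enumerate(answers.items(), start=1):
        lines.append(f"{i}. [{model}] {answer}")
    lines.append("")
    lines.append("Score each answer 1-5 on correctness, "
                 "comprehensiveness, and readability.")
    return "\n".join(lines)
```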
Short term priorities
Our OKRs can be viewed here (GitLab internal)
Issues
Our team works across GitLab projects, including:
Required labels
- Group: ~group::ai model validation