Custom Models Group

The Custom Models group owns GitLab Duo’s customer-facing model operations/intelligence layer: which models are available and how they’re selected, the health and connectivity of the customer experience, and the gateway service surface (prompts, internal events, and AIGW billing).

Vision

The Custom Models group focuses on additional, custom models that power GitLab Duo functionality in support of our customers’ unique data and use-cases.

Mission: what we own

Custom Models is the end-to-end owner of GitLab’s customer-facing model intelligence layer: which models are available, how they’re selected, the health and connectivity of the customer experience, and the gateway service surface (prompts, internal events, and AIGW billing). One team, full stack, across Customer Zero, SaaS, Self-Managed, and Dedicated.

We are responsible for all backend aspects of the product categories that fall under the Custom Models group of the AI Powered stage of the DevOps lifecycle. Our product direction is on the Category Direction, Custom Models Management page, and the features we work with are listed on the Features by Group page.

Organisation

The group is organised into three functional teams. Each functional team owns its scope end-to-end, including the Requests for Help (RFHs) and support escalations in its area (see Customer support). Staff Engineers are expected to support multiple functional teams as needed.

Model Selection

Owns which models are available and how they’re chosen, across .com and self-hosted.

Scope
Model lifecycle | Model additions/removals across Customer Zero, .com, Self-Managed, and Dedicated
Selection Engine | Unit Primitives, large/small models, agents
Selection UI | Customer Zero, SaaS, Self-Managed, Dedicated
Supporting | Prompts, evals, and docs for newly added models

Evaluation results from Model Evaluation Infra feed selection decisions through a shared review cadence between the two groups (see What falls outside our scope).

Health & Connectivity

Owns the diagnostic surface and the wider class of customer-facing setup and connectivity issues.

Scope
Diagnostics | Duo Health Check diagnostic surface
Setup & config | Customer setup and configuration
Connectivity | Connectivity across GitLab / AIGW / Models / DWS
Operations | Debugging, version drift, and related support escalations
LLMOps | Expanding monitoring, observability, and tracking of LLM performance; identifying errors; and optimizing model connectivity down the line

“Health” is deliberately broad: it covers the diagnostic surface and the failure modes those diagnostics are meant to surface, so the people building the diagnostics are the same people fielding the issues.

Gateway Services

Owns the event-driven surfaces that sit on top of the gateway: prompts flowing in, telemetry flowing through, and billing/credit events flowing out.

Scope
Prompt Registry | Prompt management on the gateway
Telemetry | Internal events and tracking
AIGW billing | The AIGW side of billing: metering and billing events for self-hosted billing, SaaS billing, on-demand credits, and AWS Marketplace flows

Billing boundary: Gateway Services owns the AIGW side of billing (metering and billing events). The purchasing and subscription side (buying, plan management) is owned by the Fulfillment team.

Instrumentation boundary: Analytics Instrumentation owns the tooling: internal events (Snowplow), billing-events tooling, and Service Ping collection. We own the domain-specific events, billing events, and metrics for the gateway/AIGW surface and instrument them using that tooling.

How we’re organised into functional teams

Acting Engineering Manager: Mohamed Hamda

Engineering Manager & Engineers

Name | Role
Mohamed Hamda | Backend Engineer, AI-Powered:Custom Models

Staff Engineer (cross-team): Manoj M J contributes across functional teams, taking on initiatives wherever they’re most needed.

Functional team | Members
Model Selection | Julie Huang; Manoj M J (supporting)
Health & Connectivity | Cindy Halim, Newick Lee; Manoj M J (main team)
Gateway Services | Patrick Cyiza; Manoj M J (supporting)

Stable Counterparts

The following members of other functional teams are our stable counterparts:

Name | Role
Jordan Janes | Principal Product Manager

What falls outside our scope

To make boundaries clear for teams inside and outside the org, the following areas are owned by counterpart teams. When work touches these areas, loop in the listed owner.

Area | Owner | Relationship to Custom Models
Subscription purchase / buying flows, plan management | Fulfillment | Counterpart for Gateway Services. We own AIGW-side billing/metering; they own purchasing and subscriptions.
Product analytics tooling & instrumentation | Analytics Instrumentation | Counterpart for Gateway Services. We use their tooling for internal events and tracking.
Raw gateway infrastructure, routing, streaming, self-hosted AIGW | AI Platform Engineering (Duo Service Infra) | They own the gateway as infrastructure; we own the customer-visible service surface that runs on top of it.
Model evaluation infrastructure, CEF as a service, benchmark pipelines | AI Platform Engineering (Model Evaluation Infra) | They run and maintain evaluation infra; we consume the results to drive model selection.

How to reach us

Organisational Labels

Issues owned by the Custom Models group should have these labels, as appropriate:

  • ~"group::custom models"
  • ~"devops::ai-powered"
  • ~"section::data science"
  • ~"Category:Model personalization"
  • ~"Category:Self-Hosted models"

In addition, issues should contain the relevant ~type: and subtype labels.
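
As a rough illustration of the labelling convention above, the following Python sketch reports which expected labels an issue is still missing. The function name `missing_labels` is hypothetical and not part of any GitLab tooling. It also assumes that the group, stage, and section labels are always expected and that at least one Category label applies, which is a stricter reading than "as appropriate".

```python
# Label names are taken from the list above; everything else is illustrative.
REQUIRED_LABELS = (
    "group::custom models",
    "devops::ai-powered",
    "section::data science",
)
CATEGORY_LABELS = {
    "Category:Model personalization",
    "Category:Self-Hosted models",
}


def missing_labels(labels):
    """Return descriptions of the labelling requirements an issue does not meet."""
    present = set(labels)
    missing = [label for label in REQUIRED_LABELS if label not in present]
    # Assumption: at least one of the two Category labels is expected.
    if not present & CATEGORY_LABELS:
        missing.append("one Category label")
    # Assumption: a type label is any label starting with "type::".
    if not any(label.startswith("type::") for label in present):
        missing.append("a type:: label")
    return missing


if __name__ == "__main__":
    print(missing_labels(["group::custom models", "type::bug"]))
```

An empty list means the issue carries every label this sketch checks for; subtype labels are not checked here.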

How we work

Our operating framework keeps ownership visible, keeps the team honest on progress, and makes space to push back when priorities don’t make sense, without adding bureaucracy.

Directly Responsible Individuals (DRIs)

Every issue, feature, bug fix, or initiative the team is actively working on has a single named DRI. The DRI is not necessarily the person doing all the work; they are the person responsible for driving it forward and keeping it unblocked.

Being a DRI means:

  • You own the outcome, not just the task. If it’s stuck, you raise it; you don’t wait to be asked.
  • You update the planning issue with status at least once a week.
  • You can delegate work, but accountability stays with you.
  • You can (and should) push back if scope, priority, or timeline doesn’t make sense, with a clear explanation of why.

How DRIs get assigned:

  • New issues: discussed at the weekly sync; someone volunteers or is proposed. If no one volunteers, we discuss why: maybe it isn’t actually a priority, or the team is overloaded.
  • RFHs and escalations: the triage DRI in the relevant functional team either takes ownership or explicitly hands it off with context.
  • DRI assignment is tracked on the team’s planning issue, the single source of truth for who owns what.

Reassigning a DRI is a normal, low-friction thing, not a political event. It should happen explicitly (not by drift) and with enough context handed over that the new DRI isn’t starting from scratch.

Weekly status updates

Each active DRI posts a short async update on the team planning issue (a few lines, not a novel):

  • What moved forward this week
  • What’s blocked and what you need
  • What’s next for the coming week

Example:


Async Status Update YYYY-MM-DD

issue-title (link)

  • Progress: …
  • Blockers: …
  • Confidence for current milestone: 🟡 Slightly confident

issue-title (link)

  • Progress: …
  • Blockers: …
  • Confidence for current milestone: 🟢 Very confident

Confidence key: 🔴 Not confident · 🟡 Slightly confident · 🟢 Very confident


Async updates on the planning issue are the default and the single place to see the state of all active work. The existing engineering sync is used mainly to raise hands on blockers; attendance is optional, and nothing else changes if you can’t join (for example, because of timezones).

Raising hands early

If you’re stuck, overloaded, or realize the work is bigger than expected, raise your hand. This is how a healthy team operates, not a sign of failure:

  • Flag it in your weekly update.
  • Ping the team in Slack with a clear ask: “I need help with X” or “I need someone to take over Y because Z.”
  • Blocked items get discussed first at the weekly sync.

Pushing back, and quality over quantity

The team is empowered to push back on work that doesn’t align with current priorities or that would compromise quality:

  • Pushback comes with a reason (“We can’t take this on this milestone because we’re committed to X and Y”).
  • It’s directed at the work, not the person requesting it.
  • It’s documented (a comment on the issue is enough) so there’s a record of the decision.

We resist the pressure to add new models or configurations until the existing ones are solid. New model onboarding goes through a lightweight readiness check:

  1. Is the documentation complete?

  2. Are known issues resolved?

  3. Is support equipped to handle tickets?

“We’re not adding this yet; here’s what we need to get right first” is a valid and encouraged answer.

Cadence

Activity | Frequency | Owner
Weekly sync (review blockers, assign DRIs) | Weekly | EM
Status update on planning issue | Weekly (async) | Each active DRI
Triage of new RFHs / escalations | Ongoing | Triage DRI per functional team
Retrospective on process health | Monthly | Team

Scoping work using Epics and Tech Leads

Epics are the primary definition of scope for any work item larger than a single issue, such as a new feature, a complex refactor, or a bug. The issues in the epic constitute the entire scope of the work item; when they are all closed, the work is complete and the epic is closed. An epic should enclose an iteration that adds a clear improvement, but an epic does not necessarily represent a whole feature, which might require multiple epics.

Technical ownership of the work defined by an epic is delegated to a Tech Lead, an engineer who is assigned to the epic and ensures the scope is correct. The Tech Lead works with the EM, the PM, and other engineers. Any engineer on the team can work on the issues in the epic, self-assigned using the Kanban process, including the Tech Lead themselves.

Team Milestone Planning Process

Custom Models follows the Product Development Flow and Cross Functional Prioritization. The team uses a planning issue and boards to manage the planning process. Planning automation scripts are available to make this easier.

Planning issues for each milestone are created by the PM and used to coordinate upcoming work between the PM, EM, and stable counterparts.

During each milestone, planning is completed for the next milestone:

  1. Creation of planning issues and boards (EM or PM)
  2. Refinement issues created every week with automation
  3. Identification of candidate issues for the milestone and addition to the Planning Board (PM, EM)
  4. Team member capacity planning (EM)
  5. Estimation of effort using weights (Engineers and EM)
  6. Joint planning session to finalise the planning board (PM, EM)
  7. Assignment of work to engineers, addition of the ~Deliverable label, update to the planning issue (EM)

Planning Issue

Each month a planning issue is created by the PM using automation and the Custom Models Planning template. This is the discussion area for the planning team members (PM, EM) for a specific milestone and links to the Planning and Build Boards.

Planning Board

The Planning Board is created for each milestone by the PM and is a curated list of issues by category. It can be overloaded with issues; the excess is moved to the next milestone or to the Next 1-3 Milestones board during the planning call.

The PM marks issues with ~workflow::planning breakdown, signaling to the EM to request engineers to review the issue description and ensure it’s clear and ready for development. The engineer then assigns a weight and applies the ~workflow::ready for development label.

Ready for Development Status

Issues ready to be worked on are labeled ~workflow::ready for development. Only issues with this label should be assigned to an engineer as a Deliverable. If research is required, the ~spike label is assigned; the scope of the spike should be clearly stated, and the outcome might be code or a refined issue.

Capacity Planning

The EM maintains a method for calculating team capacity and assigning work lanes to the release based on priority. The EM posts the team capacity and DRIs on the planning issue.

Build Board

The EM selects issues from the Planning Board based on previous-milestone slippage, PM preference, weight, and priority. The EM then applies the ~Deliverable label to each issue in the release and assigns it to an engineer. Issues are tracked throughout the release with the Build Board.

Say / Do Ratio

The Say / Do ratio is calculated using Completed Issues / Assigned Issues:

  • Issues added to the Build Board with the ~Deliverable label are the Assigned Issues.
  • Issues closed by the end of the milestone are the Completed Issues.
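
A minimal Python sketch of the ratio described above. The dictionary keys `labels` and `closed_in_milestone` are hypothetical names chosen for illustration, not fields of a real API. It also assumes that only issues carrying the ~Deliverable label count towards Completed Issues, which is one reading of the definition.

```python
def say_do_ratio(issues):
    """Return Completed Issues / Assigned Issues for one milestone.

    `issues` is a list of dicts with two hypothetical keys:
    "labels" (list of str) and "closed_in_milestone" (bool).
    Returns None when no issues were assigned, because the ratio
    is undefined in that case.
    """
    # Assigned Issues: everything on the Build Board with the Deliverable label.
    assigned = [issue for issue in issues if "Deliverable" in issue["labels"]]
    if not assigned:
        return None
    # Completed Issues: assigned issues closed by the end of the milestone.
    completed = [issue for issue in assigned if issue["closed_in_milestone"]]
    return len(completed) / len(assigned)


if __name__ == "__main__":
    milestone = [
        {"labels": ["Deliverable"], "closed_in_milestone": True},
        {"labels": ["Deliverable"], "closed_in_milestone": True},
        {"labels": ["Deliverable"], "closed_in_milestone": True},
        {"labels": ["Deliverable"], "closed_in_milestone": False},
        {"labels": [], "closed_in_milestone": True},  # not a commitment, ignored
    ]
    print(say_do_ratio(milestone))
```

With three of four Deliverable issues closed, the sample milestone yields a ratio of 0.75.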

Issue Weights

A weight is assigned to each issue as an estimate of the work needed to close it. A weight of 1 is approximately 2 working days of effort. Issues are generally not weighted above 3; larger weights indicate the issue should be broken down further.
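
The weight guidance can be written as simple arithmetic. This is a sketch only: it assumes effort scales linearly with weight (the text gives just the "weight 1 is about 2 working days" data point), and the function names are hypothetical.

```python
DAYS_PER_WEIGHT = 2  # a weight of 1 is approximately 2 working days
MAX_WEIGHT = 3       # issues are generally not weighted above this


def estimated_days(weight):
    """Approximate working days for an issue, assuming linear scaling."""
    return weight * DAYS_PER_WEIGHT


def needs_breakdown(weight):
    """True when the weight suggests the issue should be split further."""
    return weight > MAX_WEIGHT


if __name__ == "__main__":
    for weight in (1, 3, 5):
        print(weight, estimated_days(weight), needs_breakdown(weight))
```

Under these assumptions, a weight of 3 corresponds to roughly 6 working days, and a weight of 5 would be flagged for breakdown.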

Planning and Delivery Boards

All workflow statuses in the Product Development Flow are valid. The Next 1-3 and Next 4-6 milestones boards house issues which need refinement or are ready to be worked on.

Board | Filters | Columns
Planning Board | Milestone, ~group::custom models | ~type::bug, ~type::maintenance, ~type::feature
Build Board | Milestone, ~group::custom models, ~Deliverable | ~workflow::ready for development, ~workflow::in dev, ~workflow::in review, ~workflow::awaiting security release, ~workflow::blocked
Next 1-3 Milestones | %Next 1-3 Milestones | ~workflow::problem validation, ~workflow::design, ~workflow::solution validation, ~workflow::planning breakdown, ~workflow::ready for development
Next 4-6 Milestones | %Next 4-6 Milestones | Same as Next 1-3 Milestones

Issue Milestones

  • Issues are assigned the current or next milestone if they are planned to be or are currently being worked on.
  • %Backlog is assigned if issues are not intended to be worked on, although they may be addressed by a community contribution.
  • %Awaiting Customer Feedback may be worked on, pending customer interest.

The issue triage report highlights issues that need a milestone assignment.

Customer support & Requests for Help

To better support calls with customers (existing and prospective), Custom Models provides engineers who prioritise customer support requests, so that load and knowledge are shared across the team.

RFHs are owned by the functional team whose scope they fall under: Gateway Services handles gateway RFHs, Health & Connectivity handles setup/connectivity RFHs, and so on. Each functional team assigns a triage DRI who either drives the RFH to resolution or explicitly hands it off with context. Support requests should be acknowledged within 24 hours.

To request help for a customer, create a request for help issue and share it in #g_custom_models. Don’t hesitate to ask for help from other team members in the same channel.

Responsibilities of the triage DRI in support

  • Triage Requests for Help.
  • Monitor incoming requests in #g_custom_models and #custom_model_rfh.
  • Make sure request-for-help issues are created.
  • Answer support questions on Slack, redirecting to documentation whenever possible.
  • Join customer calls led by Solution Architects or Sales reps when needed, and own communication with the customer until it’s resolved or handed to a support engineer.
  • Act on outcomes:
    1. Can we add documentation to help SAs and customers be more self-sufficient?
    2. Could better tooling help? Create an issue with the changes needed.
    3. Was it a bug we didn’t catch? How do we avoid it next time?
  • Notify the EM in advance if you won’t be available.
  • Hand over the necessary context to the next engineer in support.

Engineers in support are not expected to:

  • Be available outside their preferred working hours, though some requests may be urgent and should be tackled first thing on the next working day. Consult the EM and PM in those situations.
  • Lead customer calls, unless agreed for a specific case.
  • Present demos, unless agreed for a specific case.

Communication

Custom Models communicates based on the following guidelines:

  1. Always prefer async communication over sync meetings.
  2. Don’t shy away from arranging a sync call when async is proving inefficient, but try to record it to share with team members.
  3. Transparency by Default.
  4. The primary channel for work-related communication is the #g_custom_models Slack channel.
  5. Internal team issues and projects are namespaced under gitlab-org/ai-powered/custom-models.

Acknowledgement of Pings

If you are pinged by name in either Slack or GitLab, please acknowledge it with either a threaded comment or an emoji.

Time Off

Team members should add any Paid Time Off in the “Workday” Slack app, so the EM can use the proper number of days off during capacity planning. Where possible, add time off a full milestone in advance.

There can always be last-minute, unplanned PTO needs. Please take any time you need, but enter it into Workday and communicate with the EM as soon as you can.

Ad-hoc sync calls

We operate using async communication by default. There are times when a sync discussion is beneficial, and we encourage team members to schedule sync calls with the required people as needed.

Blog Posts

Blog posts written by Custom Models team members: