GitLab Dedicated IMOC Response Team
Overview
The GitLab Dedicated Incident Manager On Call (IMOC) response team staffs the Incident Lead role for incidents affecting GitLab Dedicated customer tenants. As part of the GitLab Dedicated Platform Leadership Escalation rotation, Dedicated IMOC provides leadership and coordination during high-severity incidents, serving as the escalation point for GitLab Dedicated Engineers On Call (GDEOC).
This page documents Dedicated-specific workflows, tools, and procedures for Engineering Managers serving in the Dedicated IMOC rotation.
Key Context:
- Role Staffed: Incident Lead
- Platform: Single-tenant AWS infrastructure deployed per customer
- Customer Relations: Dedicated tenants have CSMs (Customer Success Managers) and ASEs (Assigned Support Engineers). These roles add an additional communication channel, and their details are kept up to date in Switchboard.
- Rotation: Dedicated Platform Leadership escalation providing 24/7 coverage
- Tooling: incident.io for incident management, Switchboard for customer communications
When to Engage the Dedicated IMOC
| Scenario | Engagement Method | Urgency |
|---|---|---|
| S1/S2 incidents | Automatic or manual escalation | Immediate |
| GDEOC non-responsive (30 min exceeded) | Automatic PagerDuty escalation | Immediate |
| Critical decisions (Geo failover, emergency maintenance) | Manual escalation by GDEOC | Immediate |
| Complex coordination needed | Manual escalation by GDEOC | As needed |
| Security vulnerabilities (high/critical) | SIRT engages IMOC | Immediate |
How to Page:
- In the PagerDuty incident, click “Escalate” → select Level 2 (a programmatic alternative is sketched below)
- This pages the GitLab Dedicated Platform Leadership Escalation schedule
- IMOC acknowledges within 30 minutes
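If the PagerDuty UI is not reachable, the same escalation can usually be performed through the PagerDuty REST API. The sketch below is illustrative only: the token, email, and incident ID are placeholders, and the `escalation_level` field should be verified against the current PagerDuty API reference before relying on it.

```python
import requests

# Illustrative sketch: escalate an existing PagerDuty incident to escalation level 2.
# PD_API_TOKEN, FROM_EMAIL, and INCIDENT_ID are placeholders to fill in.
PD_API_TOKEN = "<rest-api-token>"
FROM_EMAIL = "imoc@example.com"  # must be a valid PagerDuty user email
INCIDENT_ID = "<incident-id>"

response = requests.put(
    f"https://api.pagerduty.com/incidents/{INCIDENT_ID}",
    headers={
        "Authorization": f"Token token={PD_API_TOKEN}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
        "From": FROM_EMAIL,
    },
    # escalation_level is intended to bump the incident to Level 2 of its
    # escalation policy (the IMOC schedule); confirm the field in the API docs.
    json={"incident": {"type": "incident_reference", "escalation_level": 2}},
)
response.raise_for_status()
print("Escalated to Level 2")
```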
Incident Lead Responsibilities (Dedicated Context)
As Incident Lead for Dedicated incidents, the IMOC provides coordination and decision-making during incidents. This section describes how these responsibilities are executed in the Dedicated environment.
1. Incident Leadership & Coordination
IMOC leads, does not solve. Your role is coordination and decision-making, not technical troubleshooting.
Key Actions:
- Provide strategic oversight for S1/S2 incidents
- Make critical decisions when GDEOC needs escalation support
- Coordinate resources across teams and external partners
- Determine when to escalate to Director/VP leadership
Example Questions:
- “What’s your mitigation plan? What do you need from me?”
- “Who else should we bring into this incident?”
- “Based on customer impact, let’s proceed with [option X]”
2. Customer Communication Oversight
Communication Flow: GDEOC → IMOC → GDCMOC → CSM/ASE → Customer
Critical Timing:
- S1: The Comms Lead sends customer updates at least every 60 minutes; incident.io provides a reminder for this. Updates may be sent through the GDCMOC Zendesk ticket or Switchboard Incident Comms.
- S2: Regular updates based on impact level
- Emergency maintenance: Customer informed (not approval-gated) before changes
Your Actions:
- Ensure GDCMOC is paged immediately for S1/S2 incidents
- Provide clear status updates GDCMOC can relay to customers
- Review external RCAs before customer delivery
- Engage with customers as needed, as specified in Communications Lead Role in Customer Calls
- Approve cost-impacting mitigations as needed (follow Incident Mitigation process)
3. Technical Decision-Making
| Decision Type | When Required | Authority |
|---|---|---|
| Geo Failover | S1 outage, Geo-enabled tenant | IMOC decision |
| Emergency Maintenance | Changes outside maintenance window | EM+ (IMOC) |
| Cost-Impacting Mitigations | Infrastructure scaling | Follow Incident Mitigation process |
| Severity Adjustment | Impact level changes | IMOC + GDEOC |
Emergency Maintenance Protocol:
- Always “customer informed” - never wait for approval
- Engage GDCMOC immediately
- Proceed with changes while notification is in progress
4. Escalation Management
As IMOC, you have several escalation paths available to resolve incidents quickly. This section outlines when and how to engage each escalation point for prompt issue resolution.
GDCMOC (GitLab Dedicated Communications Manager On Call)
- When: S1/S2 customer impact, emergency maintenance, any customer notification needed
- How: `/pd trigger` → Service: “Incident Management - GDCMOC”
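If the Slack shortcut is unavailable, a page can also be triggered against the GDCMOC service with the PagerDuty Events API v2. A minimal sketch, assuming you have the Events API v2 integration (routing) key for the “Incident Management - GDCMOC” service; the key and summary text below are placeholders:

```python
import requests

# Minimal sketch: trigger a page on the GDCMOC PagerDuty service via Events API v2.
# ROUTING_KEY is a placeholder for the service's Events API v2 integration key.
ROUTING_KEY = "<gdcmoc-events-v2-integration-key>"

response = requests.post(
    "https://events.pagerduty.com/v2/enqueue",
    json={
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": "S1 on tenant <name>: customer communication needed",
            "source": "dedicated-imoc",
            "severity": "critical",
        },
    },
)
response.raise_for_status()
print(response.json())  # contains the dedup_key for follow-up events
```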
Dedicated Group Technical Escalation
- When: Need Dedicated-specific technical expertise
- How: PagerDuty escalation: “Dedicated Group Technical Escalation”
AWS Enterprise Support
- When: AWS infrastructure issues
- How: Follow AWS Enterprise Support escalation process
- Reference: AWS Enterprise Support Sheet
Tier 2 / Dev Escalation
- When: GitLab app bugs (Database, Gitaly, Rails)
- How: `/inc escalate` → tier2: [team] OR `/devoncall <incident-url>` in #dev-escalation
SIRT (Security Incident Response Team)
- When: High/critical security vulnerabilities
- How: SIRT engages you; support with Dedicated-specific mitigation
5. Post-Incident Oversight
Required:
- S1: Internal RCA + External RCA (within 1 week)
- S2/S3/S4: Internal RCA only if requested by leadership
Checklist:
- Internal RCA completed within 1 week
- External RCA reviewed and approved (S1)
- Corrective actions created (`/inc follow-up`; see the sketch below)
- Incident follow-ups documented and linked
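Corrective actions are normally created with `/inc follow-up` in incident.io. If that shortcut is unavailable (for example for Dedicated for Gov incidents, which cannot use incident.io), an equivalent issue can be opened directly with the GitLab REST API. A minimal sketch; the project path, label, and token below are placeholders, not the team's actual configuration:

```python
import requests

# Minimal sketch: open a corrective-action issue via the GitLab REST API.
# PROJECT is a URL-encoded project path placeholder; adjust to the real tracker.
GITLAB_TOKEN = "<access-token>"
PROJECT = "gitlab-com%2Fgl-infra%2Fgitlab-dedicated%2Fteam"

response = requests.post(
    f"https://gitlab.com/api/v4/projects/{PROJECT}/issues",
    headers={"PRIVATE-TOKEN": GITLAB_TOKEN},
    data={
        "title": "Corrective action: <short description> (incident <incident-url>)",
        "description": "Follow-up from incident <incident-url>. Link the incident issue here.",
        "labels": "corrective action",  # placeholder label
    },
)
response.raise_for_status()
print(response.json()["web_url"])
```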
GitLab Dedicated Platform Context
Dedicated Platform Characteristics
Single-Tenant:
- Dedicated AWS infrastructure in customer’s account
- Customer-specific maintenance windows
- Isolated from other customers
Geo-Enabled:
- Secondary region for disaster recovery
- Geo failover: 30-45 min (5K Reference Architecture)
- Post-failover requires manual failback
US Dedicated for Gov (PubSec):
- FedRAMP-authorized, strict compliance
- Cannot use incident.io - PagerDuty only
- No customer protected data in PagerDuty
Maintenance Windows
Standard: Weekly customer-specific windows for planned changes. See Dedicated Maintenance Windows for details.
Emergency:
- Critical security patches, S1 mitigations, infrastructure failures
- Customer informed (not approval-gated)
- Requires EM+ approval (IMOC has authority)
Key Tools
Switchboard (Dedicated-specific)
- Customer notification tool
- GDCMOC sends notifications (IMOC provides status updates)
- Templates: Investigation start, Update, Escalated response, Mitigation in progress, Resolved
- Contains: customer contacts (ASE and CSM emails), maintenance window information, and GitLab versions (in the Tenant information section)
incident.io (same as .com)
- All incidents except Dedicated for Gov (PubSec)
- Automatic Slack channel and GitLab issue creation
- Works alongside PagerDuty for paging and escalations
Key Differences: Dedicated vs. .com IMOC
| Aspect | GitLab.com | Dedicated |
|---|---|---|
| Communication | StatusPage (public) | Switchboard (direct to customer) |
| Comm Manager | CMOC (company-wide) | GDCMOC (Dedicated-specific) |
| Scope | Platform-wide outages | Customer-specific incidents |
| Infrastructure | GCP, continuous deployment | AWS, maintenance windows |
| Emergency Approval | [.com process] | IMOC approval |
| DR/Failover | [.com approach] | Geo failover (30-45 min) |
| Compliance | Standard | Dedicated for Gov (PubSec) FedRAMP restrictions |
| Infra Escalation | [.com path] | AWS Enterprise Support |
Critical Decision Scenarios
Note: Geo failover decisions require prior training to understand the procedure, tooling, and operational impact. Review the following materials before initiating or approving a failover:
- Geo Failover Runbook: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/blob/main/runbooks/geo-failover.md
- Geo Failover Fire Drill Training: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/blob/main/engineering/disaster_recovery/fire_drills/drill-failover-operator.md
Scenario 1: S1 Complete Outage - Geo Failover
Situation: Customer’s GitLab instance completely unavailable.
Actions (First 15 minutes):
- Acknowledge PagerDuty, confirm with GDEOC: “Is GitLab completely unavailable?”
- Page GDCMOC: `/pd trigger` → “S1 complete outage, investigating mitigation”
- Ask GDEOC: “Is tenant Geo-enabled? Preconditions met?”
- Decision: Geo failover vs. other recovery?
- If yes: “Approve Geo failover. GDEOC, proceed. GDCMOC, notify customer.”
- Monitor progress, ensure 60-min updates to GDCMOC, escalate to Director if >1 hour without mitigation
Key Lesson: Geo failover is a complex procedure. Evaluate pros and cons with GDEOC and make a balanced decision.
Scenario 2: Emergency Maintenance - Security Patch
Situation: SIRT identifies critical vulnerability requiring patch outside maintenance window.
Actions:
- Confirm with SIRT: “Risk if we wait?” If high: proceed with emergency maintenance
- Page GDCMOC: One-line summary + estimated downtime
- Approve as IMOC: “What’s deployment + rollback plan?”
- Critical: Customer informed while deployment in progress (not approval-gated)
- Monitor, provide updates to GDCMOC, verify success
- Ensure RCA within 1 week
Key Lesson: Emergency maintenance must be weighed against the risk of waiting; it requires resources and coordination outside the normal maintenance windows. Use the communication tools liberally to keep customers and the next week’s on-callers informed.
Scenario 3: GDEOC Non-Responsive
Situation: PagerDuty escalates to you. GDEOC hasn’t acknowledged (30 min exceeded).
Actions:
- Acknowledge alert, check who’s current GDEOC
- Slack DM: “Hi [name], you have a PagerDuty alert. Can you respond?”
- If no response (10 min): Reassign to another Dedicated SRE in current timezone and post in #gitlab-dedicated-team
- If still stuck: Page all Dedicated Management (@fviegas, @o-lluch, @denhams, @nitinduttsharma)
- Document, let GDEOC’s manager follow up on missed page
Key Lesson: Find coverage first, address missed page later.
Communication Expectations
S1 Incidents
- GDCMOC sends customer updates at least every 60 minutes. Updates may be sent through the GDCMOC Zendesk ticket or Switchboard Incident Comms.
- Include: Current status, actions, ETA if available
- Follow the template comms style in Switchboard for updates. For manual updates to GDCMOC, avoid internal details and give an impact-focused statement (for example: “Your GitLab instance is experiencing degraded availability; mitigation is in progress and the next update will follow within 60 minutes.”)
- Escalate to the GitLab Dedicated Director if: >1 hour without a mitigation in sight, significant coordination is needed, or the customer escalates
S2 Incidents
- Regular updates based on impact
- IMOC engagement lighter than S1
- Monitor for escalation to S1
S3/S4 Incidents
- IMOC typically not required
- May engage for complex coordination
Quick Reference
S1 Incident Checklist
☐ Acknowledge PagerDuty within 30 min
☐ Engage GDCMOC immediately
☐ Confirm severity and impact with GDEOC
☐ Ask: "What is the mitigation plan? What do you need?"
☐ For outage: "Geo-enabled? Consider failover?"
☐ Make critical decisions (failover, emergency maintenance, costs)
☐ Ensure GDCMOC gets 60-min updates
☐ Approve emergency changes (EM+ authority)
☐ If >1 hour without mitigation: Escalate to Dedicated Director
☐ Post-incident: RCA within 1 week + external RCA
Who to Page
| Need | Who | How |
|---|---|---|
| Customer communication | GDCMOC | /pd trigger → “Incident Management - GDCMOC” |
| Dedicated tech help | Dedicated Technical Escalation | PagerDuty escalation |
| AWS infrastructure | AWS Enterprise Support | AWS Support Sheet |
| GitLab app bug | Tier 2 / Dev Escalation | /inc escalate or /devoncall |
| Security vulnerability | SIRT | They engage you |
| Management support | Dedicated Management | @fviegas, @o-lluch, @denhams, @nitinduttsharma |
Leading vs. Doing
IMOC SHOULD:
- Ask: “What’s your mitigation plan?” “What is our next step?” “What options do we have?” “Who do we need?”
- Focus: Keep the team focused on mitigation, not on permanent fixes or root-cause analysis
- Decide: “Based on impact, proceed with Geo failover”
- Coordinate: “I’ll engage AWS Enterprise Support”
- Ensure process: “Has GDCMOC been updated?”
- Communicate: Ensure updates are ready and sent to customers
- Remove blockers: “I’ll approve the cost increase”
IMOC SHOULD NOT:
- Jump into debugging: “Let me SSH and check logs”
- Take over: “I’ll handle this”
- Contact customers directly: “I’ll email the customer”
- Get lost in details: Spending 30 min reading logs
Exception: Step in tactically if no GDEOC available or incident escalating rapidly.
Common Pitfalls
- Not engaging GDCMOC early: Engage within 10 min for S1/S2 customer impact
- Waiting for perfect info: Make decisions with 70% information - speed matters
- Taking over vs. leading: Ask, don’t solve. Trust GDEOC as technical expert
- Forgetting post-incident: Set reminders for RCA (1 week), use /inc follow-up
- Not documenting decisions: Update Slack with rationale: “Decision: Geo failover based on 30-45 min recovery vs. 2-4 hour backup restore”
Glossary
- GDEOC: GitLab Dedicated Engineer On Call
- GDCMOC: GitLab Dedicated Communications Manager On Call
- CSM: Customer Success Manager
- ASE: Assigned Support Engineer
- Switchboard: Customer notification tool (Dedicated)
- Geo-enabled tenant: Instance with secondary region for DR
- EM+: Engineering Manager or above
- SIRT: Security Incident Response Team
- FedRAMP: Federal Risk and Authorization Management Program
Resources
Required:
Dedicated-Specific:
Training:
- Video: Incident Response Roles vs Response Teams (Lyle Kozloff): https://youtu.be/vmK9-7roDFM
- Geo Failover Runbook: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/blob/main/runbooks/geo-failover.md
- Geo Failover Fire Drill Training: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/blob/main/engineering/disaster_recovery/fire_drills/drill-failover-operator.md
- LevelUp course: “GitLab Dedicated IMOC Training” (20-25 min) - to be created
- Video walkthrough: “Switchboard Customer Notifications” (5 min) - to be created
Help:
- Current IMOC: PagerDuty “GitLab Dedicated Platform Leadership Escalation” schedule
- EMs: @fviegas, @o-lluch, @denhams, @nitinduttsharma
- Slack: #gitlab-dedicated-team, #incident-management
Maintained by: GitLab Dedicated Infrastructure Team
