Support Incident Response Framework

Delivering customer results during critical moments through collaboration, iteration, and transparency—transforming incidents into opportunities for trust.


Support Engineers regularly encounter customer-facing incidents that require rapid, coordinated responses while maintaining technical excellence. The Support Incident Response Framework gives structure to these critical moments, helping us balance customer advocacy with effective problem-solving. It complements broader organisational incident processes while focusing on the requirements unique to Support: customer advocacy and technical problem-solving.

The framework delivers practical benefits through clear processes, defined roles, and systematic knowledge sharing. By transforming individual experiences into team wisdom, it reduces cognitive load during stressful situations and improves our coordination. This ultimately leads to faster resolutions for our customers while supporting our professional growth as Support Engineers.

Note

If you are seeking information to help GitLab ensure incidents are reported, investigated and handled, you might be looking for the Incident Response Guidance.

If an incident requires a non-standard workflow or communication from Support, create a Support Response issue if one does not already exist.

If you are looking for details about being on-call in Support, see the GitLab Support On-Call Guide.

In Scope

Customer-reported production emergencies
GitLab.com service disruptions requiring support coordination
Security incidents requiring support response
Mass-impact product issues
Post-release customer impact scenarios

Out of Scope

Pure infrastructure incidents without customer impact
Internal system outages
Individual customer service requests
Feature requests and bug reports
Non-emergency support inquiries

Integration Points

flowchart LR
A[Infrastructure] --> B[Incident]
C[Security/SIRT] --> B
B --> D[Customer Success]
B --> E[Support Incident Response]
B --> F[Product]    
D --> G[Coordinated Response]
E --> G
F --> G

sequenceDiagram
participant Detection
participant Triage
participant Response
participant Communication
participant Resolution
Detection->>Triage: Incident Reported
Triage->>Response: Severity Assessment
Response->>Communication: Initial Response
Communication->>Resolution: Status Updates
Resolution->>Detection: Incident Closure
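The sequence diagram describes a fixed cycle: each participant hands the incident to exactly one next participant, and closure loops back to Detection. As a minimal sketch, assuming each participant can be treated as a phase, the cycle can be modelled as a lookup table. The `Phase` enum, `TRANSITIONS` table, and `advance` function are hypothetical names for illustration, not part of any GitLab tooling.

```python
from enum import Enum


class Phase(Enum):
    """Participants in the sequence diagram, treated here as incident phases (hypothetical)."""
    DETECTION = "Detection"
    TRIAGE = "Triage"
    RESPONSE = "Response"
    COMMUNICATION = "Communication"
    RESOLUTION = "Resolution"


# Each phase hands off to one next phase, carrying the diagram's message label.
TRANSITIONS: dict[Phase, tuple[Phase, str]] = {
    Phase.DETECTION: (Phase.TRIAGE, "Incident Reported"),
    Phase.TRIAGE: (Phase.RESPONSE, "Severity Assessment"),
    Phase.RESPONSE: (Phase.COMMUNICATION, "Initial Response"),
    Phase.COMMUNICATION: (Phase.RESOLUTION, "Status Updates"),
    Phase.RESOLUTION: (Phase.DETECTION, "Incident Closure"),
}


def advance(phase: Phase) -> tuple[Phase, str]:
    """Return the next phase and the message that triggers the hand-off."""
    return TRANSITIONS[phase]


# Walk one full cycle; closure hands back to Detection, ready for the next incident.
phase = Phase.DETECTION
for _ in range(len(Phase)):
    next_phase, message = advance(phase)
    print(f"{phase.value} -> {next_phase.value}: {message}")
    phase = next_phase
```

Keeping the transitions in one table means a change to the diagram maps to a one-line change in the sketch.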

Working Principles

Working principles are behaviors that empower team members to carry out incident response work in line with the needs of our customers and the wider organisational Incident Response process. They illustrate what applying GitLab’s core values and operating principles looks like in your work as a Support Engineering Incident Responder. These working principles complement, and are subordinate to, GitLab’s core values and operating principles. If the two conflict, please create an MR proposing a change to, or removal of, the working principles.

Customer-First Response

Incident decisions prioritize customer impact above technical considerations. Customer experience metrics serve as the primary success indicators. Response strategies aim for minimal workflow disruption and the fastest path to resolution. Resource allocation focuses on customer-impacting components first.

Clear Accountability

Incidents operate with defined RACI matrices and explicit role assignments. Decision authority follows documented hierarchies to prevent ambiguity. Escalation thresholds trigger specific notification protocols. Critical path tasks keep designated owners throughout the incident lifecycle.

Continuous Improvement

Incidents generate standardized postmortem analyses with tracked action items. Process reviews occur at scheduled intervals with defined completion criteria. Performance metrics drive framework enhancements through data validation. Process changes undergo controlled testing before full implementation.

Incident Response Guidance

This Support Incident Response Framework is designed to complement existing organisational security and infrastructure incident response processes and should branch off from the established organisational workflows defined in the Incident Response Guidance.

As improvement efforts progress across the organisation, any content that appears duplicated between the Support Incident Response Framework and other incident frameworks will be identified. When duplication is detected, please create a Merge Request (MR) to remove the duplicated content from this framework and integrate it into the wider organisational incident framework.

This consolidation effort aims to create a unified approach to incident management while preserving the specialised workflows needed for customer-facing Support incidents. The Support Incident Response Framework will evolve accordingly, with emphasis on the unique aspects of customer support during incidents rather than replicating general incident procedures.

If you notice areas of duplication or opportunities for integration, please create an issue or MR in the appropriate project to help facilitate this alignment work.

Key Roles and Their Responsibilities

This section defines who does what during customer-impacting incidents, establishing clear lines of authority, communication paths, and accountabilities.

Clearly defining these roles, responsibilities, and interfaces eliminates confusion during critical incidents and high-pressure situations, ensures comprehensive coverage of all necessary functions, and provides a foundation for continuous improvement in our incident response.

Support-Specific Roles

CMOC (Communications Manager On-Call)

Primary Focus: Customer impact management and support coordination

Handbook: CMOC Workflows

Assess the scope and nature of customer impact through tickets and monitoring
Coordinate support team resource allocation based on incident severity
Develop and execute the customer communications strategy
Review and approve all customer-facing messaging for clarity and accuracy
Create and apply incident-specific tags in Zendesk for tracking
Handle bulk ticket responses for incident-related inquiries
Manage support documentation and macros specific to the incident
Coordinate with regional support teams to ensure 24/7 coverage
Track evolving customer impact patterns throughout the incident

SMOC (Support Manager On-Call)

Primary Focus: Escalation management and support team resource coordination

Handbook: Support Manager On-Call

Handle Support Ticket Attention Requests during incidents
Make definitive determinations on emergency qualification
Find additional coverage when multiple emergencies occur simultaneously
Act as notification point for security incidents affecting support
Lead emergency calls with customers when needed
Assist with particularly difficult customer communications
Prevent SLA breaches through proactive intervention
Find a Support Manager DRI for Account Escalations
Review and validate support team response strategies

Cross-Functional Coordination

These roles are described in further detail in various handbook pages. The definitions below provide summary context for Support Engineering team members.

Incident Manager On-Call (IMOC)

Coordinates overall incident response and technical aspects
Manages status.io updates and public communications
Facilitates cross-team collaboration during resolution
Determines incident severity and closure timing

Infrastructure Team

Provides technical resolution for platform issues
Gives technical status updates to support teams
Estimates resolution timeframes for customer communications
Collaborates on post-incident analysis

Customer Success Team

Manages communications with strategic customers
Provides context on customer-specific needs
Joins customer calls when appropriate
Helps measure post-incident customer satisfaction

Product Team

Assists with product-specific incidents and bugs
Provides product expertise for customer communications
Prioritizes fixes based on customer impact data
Collaborates on bug-related messaging

Security Incident Response Team (SIRT)

Provides security expertise during security-related incidents
Determines appropriate information disclosure restrictions
Guides support messaging for security incidents
Reviews security-related customer communications

Role Interfaces and Handoffs

The framework defines clear interaction points between roles:

CMOC <-> IMOC

IMOC provides technical status for customer communications
CMOC provides customer impact details to inform technical priorities
Joint approval of public-facing status updates
Regular sync points at defined intervals based on severity

CMOC <-> SMOC

SMOC provides guidance on complex support scenarios
CMOC escalates resource needs and complex customer situations
Joint decisions on emergency qualification
Collaboration on support team resource allocation

SMOC <-> IMOC

IMOC provides technical context for support escalations
SMOC provides support impact details to inform response
Collaboration on incident severity determinations
Joint review of customer impact assessment

Regional Handoffs

Defined documentation requirements for cross-region transfers
Structured handoff calls at region boundaries
Common tools and templates for consistency
Clear escalation paths across time zones

Support-Role Engagement and Exit

Engagement Triggers

CMOC: Multiple customers affected OR bulk communications needed OR support resource coordination required
SMOC: Complex customer impact OR resource conflicts OR SLA risk OR SIRT involvement
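Both trigger lists are plain OR conditions: any single signal is enough to engage the role. The Python sketch below encodes them that way. The `IncidentSignals` fields are hypothetical, and reading "multiple customers affected" as "more than one customer" is an assumption rather than a documented threshold.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Hypothetical snapshot of what Support currently knows about an incident."""
    customers_affected: int = 0
    bulk_communications_needed: bool = False
    resource_coordination_needed: bool = False
    complex_customer_impact: bool = False
    resource_conflicts: bool = False
    sla_at_risk: bool = False
    sirt_involved: bool = False


def needs_cmoc(s: IncidentSignals) -> bool:
    # Multiple customers affected OR bulk communications OR resource coordination.
    # Assumption: "multiple" means more than one customer.
    return (s.customers_affected > 1
            or s.bulk_communications_needed
            or s.resource_coordination_needed)


def needs_smoc(s: IncidentSignals) -> bool:
    # Complex customer impact OR resource conflicts OR SLA risk OR SIRT involvement.
    return (s.complex_customer_impact
            or s.resource_conflicts
            or s.sla_at_risk
            or s.sirt_involved)
```

The two checks are independent, so an incident can engage the CMOC, the SMOC, both, or neither.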

Exit Criteria

Customer communications stable
Support queue normalized
No new impact patterns
Regular ticket flow resumed
Documented final status
Handback to regular support flow
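The exit list does not say whether every criterion must hold. The sketch below assumes it does, so Support roles stand down only once all six are confirmed. The criterion strings come from the list; the function names are hypothetical.

```python
EXIT_CRITERIA = (
    "Customer communications stable",
    "Support queue normalized",
    "No new impact patterns",
    "Regular ticket flow resumed",
    "Documented final status",
    "Handback to regular support flow",
)


def outstanding_exit_criteria(confirmed: set[str]) -> list[str]:
    """Return the criteria not yet confirmed, in the documented order."""
    return [c for c in EXIT_CRITERIA if c not in confirmed]


def can_exit(confirmed: set[str]) -> bool:
    # Assumption: every criterion must be confirmed before the role exits.
    return not outstanding_exit_criteria(confirmed)
```

Returning the outstanding items, rather than only a yes/no answer, gives the next shift a ready-made checklist.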

Future Considerations

Measuring Role Effectiveness

PROPOSED: Each role has specific KPIs to evaluate performance | ISSUE: TBC

CMOC Metrics

Time to first customer communication
Customer satisfaction during incidents
Communication consistency across incidents
Support resource utilization efficiency
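"Time to first customer communication" needs two timestamps: when the incident was detected and when the earliest customer communication went out. A minimal sketch, assuming those timestamps are available as timezone-aware datetimes; the function name is hypothetical, and ignoring messages sent before detection is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def time_to_first_communication(detected_at: datetime,
                                sent_at: list[datetime]) -> Optional[timedelta]:
    """Gap between detection and the earliest customer communication sent at or
    after detection. Returns None if nothing has been sent yet."""
    after_detection = [t for t in sent_at if t >= detected_at]
    if not after_detection:
        return None
    return min(after_detection) - detected_at
```

For example, detection at 14:00 UTC with messages at 14:25 and 15:00 UTC gives a 25-minute gap.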

SMOC Metrics

Time to resolve escalations
Resource allocation effectiveness
SLA compliance during incidents
Escalation appropriateness
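"SLA compliance during incidents" can be read as the share of incident-related replies sent on or before their SLA deadline. The sketch below is based on that reading. The `TicketReply` record is hypothetical; in practice the due and sent times would come from the ticketing system.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TicketReply:
    """Hypothetical record: when a reply was due under the SLA and when it was sent."""
    due_at: datetime
    replied_at: datetime


def sla_compliance(replies: list[TicketReply]) -> float:
    """Percentage of replies sent on or before their SLA deadline."""
    if not replies:
        raise ValueError("no replies to measure")
    on_time = sum(r.replied_at <= r.due_at for r in replies)
    return on_time * 100 / len(replies)
```

When there are no replies, the function raises an error instead of returning 100%, so an empty queue is not reported as perfect compliance.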

Implementation and Training

PROPOSED | ISSUE: TBC

Role-specific training curricula
Regular simulation exercises
Shadowing opportunities for new team members
Continuous skill development pathways

Metrics & Success Indicators

PROPOSED | ISSUE: TBC

Response Time
- Description: Time from detection to initial response
- Target: < ____ minutes for SEV1/SEV2
Resolution Time
- Description: Time from detection to resolution
- Target: Varies by severity
Customer Satisfaction
- Description: CSAT scores for incident handling
- Target: > 90%
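Response time and resolution time are both elapsed times measured from detection, and CSAT is a ratio. A minimal sketch, assuming each CSAT survey response is recorded as either satisfied or not; the function names are hypothetical. The response-time target is still blank above, so the sketch does not encode one.

```python
from datetime import datetime


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed minutes. Use detection -> first response for response time,
    and detection -> resolution for resolution time."""
    return (end - start).total_seconds() / 60


def csat_percent(satisfied: int, total_responses: int) -> float:
    """CSAT as the share of survey responses that were satisfied."""
    if total_responses == 0:
        raise ValueError("no CSAT responses to score")
    return satisfied * 100 / total_responses


def meets_csat_target(satisfied: int, total_responses: int,
                      target: float = 90.0) -> bool:
    # The proposed target is strictly greater than 90%, so exactly 90% misses it.
    return csat_percent(satisfied, total_responses) > target
```

Because the comparison is strict, 45 satisfied out of 50 (90%) misses the proposed target and 46 out of 50 (92%) meets it.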

Handover Summary Templates

The summary templates below, provided as code blocks, cover communication scenarios where the CMOC and SMOC need to share information with other stakeholders in Slack channels and/or issues.

CMOC Communication Templates

Initial Status Update Template

## Incident #[number] - [title]
**Status:** In Progress
**Severity:** [SEV1/SEV2/SEV3]
**Time Detected:** [YYYY-MM-DD HH:MM UTC]

### Issue Summary
[Brief description of the issue - 1-2 sentences]

### Customer Impact
- Systems/services affected: [list affected services]
- Impact type: [complete outage/degraded performance/feature unavailability]
- Estimated affected customers: [number/percentage if known]

### Current Actions
- [Bullet points of what the team is currently doing]

### Next Update
Next status update expected by [time] UTC
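Every template uses the same `[YYYY-MM-DD HH:MM UTC]` timestamp format. Filling it by hand across regions is error-prone, so the Python sketch below shows one way to render it. The helper names are hypothetical, and the update interval is supplied by the caller because the framework does not fix one. Inputs are assumed to be timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone


def utc_stamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as 'YYYY-MM-DD HH:MM UTC'."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


def initial_status_header(number: int, title: str, severity: str,
                          detected_at: datetime) -> str:
    """Fill the header fields of the Initial Status Update Template."""
    return "\n".join([
        f"## Incident #{number} - {title}",
        "**Status:** In Progress",
        f"**Severity:** {severity}",
        f"**Time Detected:** {utc_stamp(detected_at)}",
    ])


def next_update_line(last_update: datetime, interval: timedelta) -> str:
    """Fill the 'Next Update' line; the interval is chosen by the CMOC."""
    due = (last_update + interval).astimezone(timezone.utc)
    return f"Next status update expected by {due:%H:%M} UTC"
```

Converting to UTC inside the helper means an engineer can pass a local timestamp from any region and still get a consistent result.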

Regular Status Update Template

## Incident #[number] - [title] - UPDATE #[X]
**Status:** In Progress
**Severity:** [SEV1/SEV2/SEV3]
**Time Detected:** [YYYY-MM-DD HH:MM UTC]
**Last Updated:** [YYYY-MM-DD HH:MM UTC]

### Current Status
[Brief description of the current state - 1-2 sentences]

### Progress Since Last Update
- [Bullet points of actions taken and progress made]

### Ongoing Customer Impact
- [Updated impact assessment]
- Current ticket volume: [number]
- Notable patterns: [describe any patterns in customer reports]

### Next Steps
- [Bullet points of planned actions]

### Next Update
Next status update expected by [time] UTC

Resolution Update Template

## Incident #[number] - [title] - RESOLVED
**Status:** Resolved
**Severity:** [SEV1/SEV2/SEV3]
**Time Detected:** [YYYY-MM-DD HH:MM UTC]
**Time Resolved:** [YYYY-MM-DD HH:MM UTC]
**Duration:** [X hours Y minutes]

### Resolution Summary
[Brief description of how the issue was resolved]

### Final Impact Assessment
- Systems/services affected: [list affected services]
- Total customers impacted: [number/percentage if known]
- Total tickets received: [number]

### Follow-up Actions
- [Any post-incident actions customers should take]
- [Any monitoring customers should perform]

### Additional Information
A full post-incident review will be conducted and findings shared as appropriate.

For any additional questions, please contact support referencing Incident #[number].
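The Resolution Update's **Duration** field can be derived from the detection and resolution times instead of being worked out by hand. A minimal sketch with a hypothetical helper name; seconds are truncated, and singular units are used where they read better than the template's literal "X hours Y minutes".

```python
from datetime import datetime


def format_duration(detected_at: datetime, resolved_at: datetime) -> str:
    """Render the Duration field as 'X hours Y minutes' (seconds truncated)."""
    total_minutes = int((resolved_at - detected_at).total_seconds() // 60)
    if total_minutes < 0:
        raise ValueError("resolved_at is earlier than detected_at")
    hours, minutes = divmod(total_minutes, 60)
    hour_unit = "hour" if hours == 1 else "hours"
    minute_unit = "minute" if minutes == 1 else "minutes"
    return f"{hours} {hour_unit} {minutes} {minute_unit}"
```

Subtracting full datetimes, rather than clock times, handles incidents that cross midnight.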

Regional Handoff Template

## Incident #[number] - [title] - HANDOFF
**Status:** In Progress
**Current Region:** [EMEA/AMER/APAC]
**Handoff To:** [EMEA/AMER/APAC]
**Handoff Time:** [YYYY-MM-DD HH:MM UTC]

### Current Situation
[Brief summary of current status - 2-3 sentences]

### Customer Impact Status
- Active tickets: [number]
- Pending responses: [number]
- Escalated issues: [number]

### Communication Status
- Last status.io update: [time] UTC
- Next scheduled update: [time] UTC
- Draft status update: [link or text]

### Priority Actions for Next Shift
1. [Most important action]
2. [Second priority action]
3. [Additional actions as needed]

### Key Stakeholders
- [List of key contacts involved]

### Handoff Acknowledgement
Please acknowledge receipt of this handoff in the incident channel.

SMOC Communication Templates

Support Resource Allocation Template

## Incident #[number] - [title] - SUPPORT RESOURCES
**Status:** In Progress
**Time:** [YYYY-MM-DD HH:MM UTC]
**Resource Request Type:** [Initial/Update/Release]

### Current Support Load
- Active incident-related tickets: [number]
- Current response time: [time]
- Queue health status: [Healthy/Strained/Critical]

### Resource Allocation
- AMER: [X] engineers allocated to incident
- EMEA: [X] engineers allocated to incident
- APAC: [X] engineers allocated to incident

### Priority Guidelines
1. [Top priority issue type]
2. [Second priority issue type]
3. [Standard handling for other issues]

### Special Handling Instructions
- [Any special routing or handling instructions]
- [Any customer-specific considerations]

### Actions Needed
- [Team leads]: [specific action requested]
- [Regional managers]: [specific action requested]
- [Other stakeholders]: [specific action requested]

### Duration Estimate
This resource allocation is expected to remain in place for approximately [time period].

Customer Impact Report Template

## Incident #[number] - [title] - CUSTOMER IMPACT REPORT
**Status:** [In Progress/Resolved]
**Time:** [YYYY-MM-DD HH:MM UTC]

### Impact Summary
[Brief description of customer impact - 2-3 sentences]

### Impact Metrics
- Total customers reporting issues: [number]
- Percentage of customer base: [estimated percentage]
- Geographic distribution: [regions affected]
- Customer segments affected: [Enterprise/SMB/Personal]

### Common Issues Reported
1. [Most common issue] - [X] reports
2. [Second most common issue] - [X] reports
3. [Third most common issue] - [X] reports

### Customer Sentiment
- Current CSAT trending: [Stable/Declining/Improving]
- Notable customer concerns: [list major themes]

### Recommended Actions
- [Technical team]: [recommended action]
- [Communications team]: [recommended action]
- [Customer success]: [recommended action]

### Additional Information
[Any other relevant details about customer impact]
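The "Common Issues Reported" list is a frequency ranking. The Python sketch below assumes each incident ticket has already been labelled with a single issue category, for example through the incident-specific Zendesk tags the CMOC applies. The labels and the function name are hypothetical.

```python
from collections import Counter


def common_issues(issue_labels: list[str], top_n: int = 3) -> list[str]:
    """Rank issue categories by report count, formatted like the template lines."""
    ranked = Counter(issue_labels).most_common(top_n)
    return [f"{i}. {issue} - {count} reports"
            for i, (issue, count) in enumerate(ranked, start=1)]
```

`Counter.most_common` breaks ties by first occurrence, so check equally frequent issues by hand before publishing the report.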

CMOC Activation Template

## Incident #[number] - [title] - CMOC ACTIVATION
**Status:** In Progress
**Activation Time:** [YYYY-MM-DD HH:MM UTC]
**Requested By:** [Name/Role]

### Activation Criteria Met
- [List specific criteria that triggered activation]

### Current Support Status
- Active tickets: [number]
- Affected customers: [number/types]
- Current response time: [time]

### Initial CMOC Actions
1. [First immediate action]
2. [Second immediate action]
3. [Ongoing monitoring focus]

### Resource Requirements
- Personnel needed: [specific roles/numbers]
- Tools/access needed: [specific requirements]
- Stakeholder engagement needed: [specific teams]

### Actions Needed
- [CMOC]: Acknowledge activation and implement response plan
- [Regional managers]: [specific action requested]
- [Technical teams]: [specific action requested]

### Communication Plan
- Initial customer communication to be sent by: [time]
- Coordination meeting scheduled for: [time]
- Reporting cadence: [frequency]

Post-Incident Support Summary Template

## Incident #[number] - [title] - SUPPORT SUMMARY
**Status:** Resolved
**Incident Duration:** [start time] to [end time] UTC
**Report Time:** [YYYY-MM-DD HH:MM UTC]

### Support Response Summary
[Brief overview of the support response - 3-4 sentences]

### Key Metrics
- Total tickets handled: [number]
- Peak tickets per hour: [number]
- Average response time: [time]
- Support resources utilized: [number of staff]

### Customer Impact Analysis
- Most affected customer segments: [details]
- Geographic distribution: [details]
- Common workarounds provided: [list]

### Effectiveness Assessment
- What worked well: [bullet points]
- Improvement areas: [bullet points]
- Tool/process gaps identified: [bullet points]

### Follow-up Actions
- [Specific action items with owners and timelines]

### Lessons Learned
[Key takeaways for future incident response]
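"Peak tickets per hour" in the Key Metrics section can be computed from ticket creation times. A minimal sketch that buckets tickets by clock hour, assuming all timestamps are in the same timezone (ideally UTC). The function name is hypothetical, and a rolling 60-minute window would be an equally valid reading of the metric.

```python
from collections import Counter
from datetime import datetime


def peak_tickets_per_hour(created_at: list[datetime]) -> int:
    """Highest number of tickets created within a single clock hour.
    Assumes every timestamp is in the same timezone."""
    if not created_at:
        return 0
    buckets = Counter(t.replace(minute=0, second=0, microsecond=0)
                      for t in created_at)
    return max(buckets.values())
```

Clock-hour buckets are simple to reproduce from a ticket export. They can understate a burst that straddles the hour boundary, which is the trade-off against a rolling window.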

Cross-Functional Templates

Technical-to-Support Handoff Template

## Incident #[number] - [title] - TECHNICAL TO SUPPORT HANDOFF
**Status:** [In Progress/On Hold/Resolved]
**Time:** [YYYY-MM-DD HH:MM UTC]

### Technical Summary
[Brief technical explanation of the issue - keep simple and customer-focused]

### Customer-Facing Impact
- What customers are seeing: [observable symptoms]
- Affected components/features: [specific details]
- Scope of impact: [broad/limited/specific customers]

### Workaround Instructions
[Step-by-step workaround if available]

### Customer Communication Guidance
- Key points to communicate: [bullet points]
- Points to avoid mentioning: [bullet points]
- Technical accuracy verified by: [name]

### Expected Resolution
- Estimated time to resolution: [timeframe if known]
- Fix delivery method: [hotfix/regular release/etc.]

### Actions Needed
- [Support team]: [specific guidance on ticket handling]
- [CMOC]: [guidance on status.io messaging]
- [Other teams]: [any other coordination needed]

Executive Update Template

## Incident #[number] - [title] - EXECUTIVE SUMMARY
**Status:** [In Progress/On Hold/Resolved]
**Time:** [YYYY-MM-DD HH:MM UTC]

### Situation Overview
[Concise explanation of the incident - 1-2 sentences]

### Business Impact
- Customer impact: [High/Medium/Low] - [brief description]
- Revenue impact: [Yes/No/Unknown] - [brief description if Yes]
- Reputation risk: [High/Medium/Low] - [brief explanation]

### Response Status
- Technical response: [On track/Delayed/Blocked] - [brief status]
- Support response: [On track/Delayed/Blocked] - [brief status]
- Communications: [On track/Delayed/Blocked] - [brief status]

### Key Metrics
- Duration so far: [time]
- Estimated time to resolution: [time or unknown]
- Support tickets: [number]
- Affected customers: [number/percentage]

### Critical Decisions Needed
[List any decisions requiring executive input]

### Next Update
Next executive update scheduled for: [time] UTC

These templates give the CMOC and SMOC structured formats for the different communications they share with stakeholders during an incident. They’re designed to be clear, actionable, and adaptable to different incident scenarios while keeping a consistent format.

Last modified June 3, 2025: Add Support Incident Response framework to handbook (662913b4)
