Support Incident Response Framework

Delivering customer results during critical moments through collaboration, iteration, and transparency—transforming incidents into opportunities for trust.

Support Engineers regularly encounter customer-facing incidents that require rapid, coordinated responses while maintaining technical excellence. The Support Incident Response Framework provides structure for these critical moments, helping us balance customer advocacy with effective problem-solving. It complements broader organisational incident processes while focusing on the requirements unique to support contexts.

The framework delivers practical benefits through clear processes, defined roles, and systematic knowledge sharing. By transforming individual experiences into team wisdom, it reduces cognitive load during stressful situations and improves our coordination. This ultimately leads to faster resolutions for our customers while supporting our professional growth as Support Engineers.

In Scope

  • Customer-reported production emergencies
  • GitLab.com service disruptions requiring support coordination
  • Security incidents requiring support response
  • Mass-impact product issues
  • Post-release customer impact scenarios

Out of Scope

  • Pure infrastructure incidents without customer impact
  • Internal system outages
  • Individual customer service requests
  • Feature requests and bug reports
  • Non-emergency support inquiries

Integration Points

flowchart LR
A[Infrastructure] --> B[Incident]
C[Security/SIRT] --> B
B --> D[Customer Success]
B --> E[Support Incident Response]
B --> F[Product]
D --> G[Coordinated Response]
E --> G
F --> G

sequenceDiagram
participant Detection
participant Triage
participant Response
participant Communication
participant Resolution
Detection->>Triage: Incident Reported
Triage->>Response: Severity Assessment
Response->>Communication: Initial Response
Communication->>Resolution: Status Updates
Resolution->>Detection: Incident Closure
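
The stage order in the sequence diagram above can also be read as a simple cycle of handoffs. The following is a minimal sketch of that reading, not an official implementation: the `Stage` enum, the `HANDOFFS` table, and the `next_stage` helper are hypothetical names introduced here for illustration, while the stage names and handoff messages are copied from the diagram.

```python
from enum import Enum


class Stage(Enum):
    """Participants from the sequence diagram, in order."""
    DETECTION = "Detection"
    TRIAGE = "Triage"
    RESPONSE = "Response"
    COMMUNICATION = "Communication"
    RESOLUTION = "Resolution"


# Each stage hands off to exactly one next stage with the message
# shown on the corresponding arrow in the diagram.
HANDOFFS = {
    Stage.DETECTION: (Stage.TRIAGE, "Incident Reported"),
    Stage.TRIAGE: (Stage.RESPONSE, "Severity Assessment"),
    Stage.RESPONSE: (Stage.COMMUNICATION, "Initial Response"),
    Stage.COMMUNICATION: (Stage.RESOLUTION, "Status Updates"),
    Stage.RESOLUTION: (Stage.DETECTION, "Incident Closure"),
}


def next_stage(current: Stage) -> tuple:
    """Return the (next stage, handoff message) pair for a stage."""
    return HANDOFFS[current]
```

Walking the table from `Stage.DETECTION` through five handoffs returns to `Stage.DETECTION`, which mirrors the closing arrow from Resolution back to Detection.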

Working Principles

Working principles are behaviors that empower team members to carry out incident response work in line with the needs of our customers and the wider business incident response process. They illustrate what applying GitLab’s core values and operating principles looks like in your work as a Support Engineering Incident Responder. These working principles complement, and are subordinate to, GitLab’s core values and operating principles. If the two conflict, please create an MR proposing a change to, or removal of, the affected working principle.

Customer-First Response

Incident decisions prioritize customer impact above technical considerations. Customer experience metrics serve as primary success indicators. Response strategies target minimal workflow disruption and the fastest path to resolution. Resource allocation focuses on customer-impacting components first.

Clear Accountability

Incidents operate with defined RACI matrices and explicit role assignments. Decision authority follows documented hierarchies to prevent ambiguity. Escalation thresholds trigger specific notification protocols. Critical path tasks maintain designated owners throughout the incident lifecycle.

Continuous Improvement

Incidents generate standardized postmortem analysis with tracked action items. Process reviews occur at scheduled intervals with defined completion criteria. Performance metrics drive framework enhancements through data validation. Process changes undergo controlled testing before full implementation.

Incident Response Guidance

This Support Incident Response Framework is designed to complement existing organisational security and infrastructure incident response processes and should branch off established organisational workflows defined in the Incident Response Guidance.

As improvement efforts progress across the organisation, any content that appears duplicated between the Support Incident Response Framework and other incident frameworks will be identified. When duplication is detected, please create a Merge Request (MR) to remove the duplicated content from this framework and integrate it into the wider organisational incident framework.

This consolidation effort aims to create a unified approach to incident management while preserving the specialised workflows needed for customer-facing Support incidents. The Support Incident Response Framework will evolve accordingly, with emphasis on the unique aspects of customer support during incidents rather than replicating general incident procedures.

If you notice areas of duplication or opportunities for integration, please create an issue or MR in the appropriate project to help facilitate this alignment work.

Key Roles and Their Responsibilities

The Role Structure and Responsibilities component defines who does what during customer-impacting incidents, establishing clear lines of authority, communication paths, and accountabilities.

Clearly defining these roles, responsibilities, and interfaces eliminates confusion during critical incidents and high-pressure situations, ensures comprehensive coverage of all necessary functions, and provides a foundation for continuous improvement in our incident response.

Support-Specific Roles

CMOC (Communications Manager On-Call)

Primary Focus: Customer impact management and support coordination

Handbook: CMOC Workflows

  • Assess the scope and nature of customer impact through tickets and monitoring
  • Coordinate support team resource allocation based on incident severity
  • Develop and execute the customer communications strategy
  • Review and approve all customer-facing messaging for clarity and accuracy
  • Create and apply incident-specific tags in Zendesk for tracking
  • Handle bulk ticket responses for incident-related inquiries
  • Manage support documentation and macros specific to the incident
  • Coordinate with regional support teams to ensure 24/7 coverage
  • Track evolving customer impact patterns throughout the incident

SMOC (Support Manager On-Call)

Primary Focus: Escalation management and support team resource coordination

Handbook: Support Manager On-Call

  • Handle Support Ticket Attention Requests during incidents
  • Make definitive determinations on emergency qualification
  • Find additional coverage when multiple emergencies occur simultaneously
  • Act as notification point for security incidents affecting support
  • Lead emergency calls with customers when needed
  • Assist with particularly difficult customer communications
  • Prevent SLA breaches through proactive intervention
  • Find Support Manager DRI for Account Escalations
  • Review and validate support team response strategies

Cross-Functional Coordination

These roles are described in further detail in various handbook pages. The definitions below provide summary context for Support Engineering team members.

Incident Manager On-Call (IMOC)

  • Coordinates overall incident response and technical aspects
  • Manages status.io updates and public communications
  • Facilitates cross-team collaboration during resolution
  • Determines incident severity and closure timing

Infrastructure Team

  • Provides technical resolution for platform issues
  • Gives technical status updates to support teams
  • Estimates resolution timeframes for customer communications
  • Collaborates on post-incident analysis

Customer Success Team

  • Manages communications with strategic customers
  • Provides context on customer-specific needs
  • Joins customer calls when appropriate
  • Helps measure post-incident customer satisfaction

Product Team

  • Assists with product-specific incidents and bugs
  • Provides product expertise for customer communications
  • Prioritizes fixes based on customer impact data
  • Collaborates on bug-related messaging

Security Incident Response Team (SIRT)

  • Provides security expertise during security-related incidents
  • Determines appropriate information disclosure restrictions
  • Guides support messaging for security incidents
  • Reviews security-related customer communications

Role Interfaces and Handoffs

The framework defines clear interaction points between roles:

CMOC <-> IMOC

  • IMOC provides technical status for customer communications
  • CMOC provides customer impact details to inform technical priorities
  • Joint approval of public-facing status updates
  • Regular sync points at defined intervals based on severity

CMOC <-> SMOC

  • SMOC provides guidance on complex support scenarios
  • CMOC escalates resource needs and complex customer situations
  • Joint decisions on emergency qualification
  • Collaboration on support team resource allocation

SMOC <-> IMOC

  • IMOC provides technical context for support escalations
  • SMOC provides support impact details to inform response
  • Collaboration on incident severity determinations
  • Joint review of customer impact assessment

Regional Handoffs

  • Defined documentation requirements for cross-region transfers
  • Structured handoff calls at region boundaries
  • Common tools and templates for consistency
  • Clear escalation paths across time zones

Support-Role Engagement and Exit

Engagement Triggers

  • CMOC: Multiple customers affected OR bulk communications needed OR support resource coordination required
  • SMOC: Complex customer impact OR resource conflicts OR SLA risk OR SIRT involvement
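
The engagement triggers above are plain OR conditions, so they can be sketched as boolean checks. This is a minimal illustration under stated assumptions: the `IncidentSignals` dataclass, its field names, and `roles_to_engage` are hypothetical and only restate the bullet points; the framework does not define how each signal is detected.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Hypothetical flags, one per trigger named in the list above."""
    multiple_customers_affected: bool = False
    bulk_communications_needed: bool = False
    support_resource_coordination_required: bool = False
    complex_customer_impact: bool = False
    resource_conflicts: bool = False
    sla_risk: bool = False
    sirt_involvement: bool = False


def roles_to_engage(signals: IncidentSignals) -> list:
    """Return the support roles whose trigger conditions are met."""
    roles = []
    # CMOC: any one of the three CMOC triggers is enough.
    if (signals.multiple_customers_affected
            or signals.bulk_communications_needed
            or signals.support_resource_coordination_required):
        roles.append("CMOC")
    # SMOC: any one of the four SMOC triggers is enough.
    if (signals.complex_customer_impact
            or signals.resource_conflicts
            or signals.sla_risk
            or signals.sirt_involvement):
        roles.append("SMOC")
    return roles
```

For example, `roles_to_engage(IncidentSignals(sla_risk=True))` returns `["SMOC"]`, and an incident with no signals set returns an empty list.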

Exit Criteria

  • Customer communications stable
  • Support queue normalized
  • No new impact patterns
  • Regular ticket flow resumed
  • Documented final status
  • Handback to regular support flow
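
The exit criteria can be treated as a checklist. The sketch below assumes that every criterion must be met before a role disengages; the list above does not state this explicitly, so treat it as one possible reading. `EXIT_CRITERIA` copies the bullet text, and `unmet_exit_criteria` is a hypothetical helper.

```python
# Criteria copied from the Exit Criteria list above.
EXIT_CRITERIA = [
    "Customer communications stable",
    "Support queue normalized",
    "No new impact patterns",
    "Regular ticket flow resumed",
    "Documented final status",
    "Handback to regular support flow",
]


def unmet_exit_criteria(confirmed: set) -> list:
    """Return the criteria not yet confirmed, in checklist order."""
    return [item for item in EXIT_CRITERIA if item not in confirmed]


def ready_to_exit(confirmed: set) -> bool:
    """Assumption: a role exits only when nothing is outstanding."""
    return not unmet_exit_criteria(confirmed)
```

Returning the outstanding items, rather than only a yes/no answer, gives the responder something concrete to post in the incident channel.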

Future Considerations

Measuring Role Effectiveness

PROPOSED: Each role has specific KPIs to evaluate performance | ISSUE: TBC

CMOC Metrics

  • Time to first customer communication
  • Customer satisfaction during incidents
  • Communication consistency across incidents
  • Support resource utilization efficiency

SMOC Metrics

  • Time to resolve escalations
  • Resource allocation effectiveness
  • SLA compliance during incidents
  • Escalation appropriateness

Implementation and Training

PROPOSED | ISSUE: TBC

  • Role-specific training curricula
  • Regular simulation exercises
  • Shadowing opportunities for new team members
  • Continuous skill development pathways

Metrics & Success Indicators

PROPOSED | ISSUE: TBC

  • Response Time

    • Description: Time from detection to initial response
    • Target: < ____ minutes for SEV1/SEV2
  • Resolution Time

    • Description: Time from detection to resolution
    • Target: Varies by severity
  • Customer Satisfaction

    • Description: CSAT scores for incident handling
    • Target: > 90%
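
As a rough illustration of how these proposed indicators could be computed, the sketch below measures response and resolution time from the detection timestamp and compares a CSAT percentage with the 90% target. The function names, the sample timestamps, and the definition of CSAT as the share of satisfied responses are assumptions made for this example; the response-time target is left out because the framework has not set it yet.

```python
from datetime import datetime, timezone

CSAT_TARGET_PERCENT = 90.0  # from "Target: > 90%" above


def elapsed_minutes(start: datetime, end: datetime) -> float:
    """Minutes between two timestamps, e.g. detection and first response."""
    return (end - start).total_seconds() / 60


def csat_percent(satisfied: int, total: int):
    """Share of satisfied responses as a percentage, or None if no responses."""
    if total == 0:
        return None
    return 100 * satisfied / total


# Illustrative timestamps only, not real incident data.
detected = datetime(2025, 1, 1, 10, 0, tzinfo=timezone.utc)
first_response = datetime(2025, 1, 1, 10, 12, tzinfo=timezone.utc)
resolved = datetime(2025, 1, 1, 13, 30, tzinfo=timezone.utc)

response_time = elapsed_minutes(detected, first_response)
resolution_time = elapsed_minutes(detected, resolved)
meets_csat_target = csat_percent(46, 50) > CSAT_TARGET_PERCENT
```

With these sample values the response time is 12 minutes, the resolution time is 210 minutes, and a CSAT of 92% clears the target.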

Handover Summary Templates

The summary templates below are provided as code blocks for communication scenarios in which the CMOC and SMOC need to share information with other stakeholders in Slack channels or issues.

CMOC Communication Templates

Initial Status Update Template

## Incident #[number] - [title]
**Status:** In Progress
**Severity:** [SEV1/SEV2/SEV3]
**Time Detected:** [YYYY-MM-DD HH:MM UTC]

### Issue Summary
[Brief description of the issue - 1-2 sentences]

### Customer Impact
- Systems/services affected: [list affected services]
- Impact type: [complete outage/degraded performance/feature unavailability]
- Estimated affected customers: [number/percentage if known]

### Current Actions
- [Bullet points of what the team is currently doing]

### Next Update
Next status update expected by [time] UTC

Regular Status Update Template

## Incident #[number] - [title] - UPDATE #[X]
**Status:** In Progress
**Severity:** [SEV1/SEV2/SEV3]
**Time Detected:** [YYYY-MM-DD HH:MM UTC]
**Last Updated:** [YYYY-MM-DD HH:MM UTC]

### Current Status
[Brief description of the current state - 1-2 sentences]

### Progress Since Last Update
- [Bullet points of actions taken and progress made]

### Ongoing Customer Impact
- [Updated impact assessment]
- Current ticket volume: [number]
- Notable patterns: [describe any patterns in customer reports]

### Next Steps
- [Bullet points of planned actions]

### Next Update
Next status update expected by [time] UTC

Resolution Update Template

## Incident #[number] - [title] - RESOLVED
**Status:** Resolved
**Severity:** [SEV1/SEV2/SEV3]
**Time Detected:** [YYYY-MM-DD HH:MM UTC]
**Time Resolved:** [YYYY-MM-DD HH:MM UTC]
**Duration:** [X hours Y minutes]

### Resolution Summary
[Brief description of how the issue was resolved]

### Final Impact Assessment
- Systems/services affected: [list affected services]
- Total customers impacted: [number/percentage if known]
- Total tickets received: [number]

### Follow-up Actions
- [Any post-incident actions customers should take]
- [Any monitoring customers should perform]

### Additional Information
A full post-incident review will be conducted and findings shared as appropriate.

For any additional questions, please contact support referencing Incident #[number].
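
The **Duration** field in the resolution template can be derived from the two timestamps rather than worked out by hand. The following is a small sketch assuming both timestamps are written exactly in the template's `YYYY-MM-DD HH:MM UTC` form; `format_duration` is a hypothetical helper, not part of any existing tooling.

```python
from datetime import datetime

# Matches the template's timestamp form, with "UTC" as literal text.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M UTC"


def format_duration(time_detected: str, time_resolved: str) -> str:
    """Return the elapsed time as 'X hours Y minutes'."""
    start = datetime.strptime(time_detected, TIMESTAMP_FORMAT)
    end = datetime.strptime(time_resolved, TIMESTAMP_FORMAT)
    if end < start:
        raise ValueError("Time Resolved is earlier than Time Detected")
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"
```

For example, `format_duration("2025-01-01 22:15 UTC", "2025-01-02 01:40 UTC")` returns `"3 hours 25 minutes"`, which also shows that incidents crossing midnight are handled.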

Regional Handoff Template

## Incident #[number] - [title] - HANDOFF
**Status:** In Progress
**Current Region:** [EMEA/AMER/APAC]
**Handoff To:** [EMEA/AMER/APAC]
**Handoff Time:** [YYYY-MM-DD HH:MM UTC]

### Current Situation
[Brief summary of current status - 2-3 sentences]

### Customer Impact Status
- Active tickets: [number]
- Pending responses: [number]
- Escalated issues: [number]

### Communication Status
- Last status.io update: [time] UTC
- Next scheduled update: [time] UTC
- Draft status update: [link or text]

### Priority Actions for Next Shift
1. [Most important action]
2. [Second priority action]
3. [Additional actions as needed]

### Key Stakeholders
- [List of key contacts involved]

### Handoff Acknowledgement
Please acknowledge receipt of this handoff in the incident channel.

SMOC Communication Templates

Support Resource Allocation Template

## Incident #[number] - [title] - SUPPORT RESOURCES
**Status:** In Progress
**Time:** [YYYY-MM-DD HH:MM UTC]
**Resource Request Type:** [Initial/Update/Release]

### Current Support Load
- Active incident-related tickets: [number]
- Current response time: [time]
- Queue health status: [Healthy/Strained/Critical]

### Resource Allocation
- AMER: [X] engineers allocated to incident
- EMEA: [X] engineers allocated to incident
- APAC: [X] engineers allocated to incident

### Priority Guidelines
1. [Top priority issue type]
2. [Second priority issue type]
3. [Standard handling for other issues]

### Special Handling Instructions
- [Any special routing or handling instructions]
- [Any customer-specific considerations]

### Actions Needed
- [Team leads]: [specific action requested]
- [Regional managers]: [specific action requested]
- [Other stakeholders]: [specific action requested]

### Duration Estimate
This resource allocation is expected to remain in place for approximately [time period].

Customer Impact Report Template

## Incident #[number] - [title] - CUSTOMER IMPACT REPORT
**Status:** [In Progress/Resolved]
**Time:** [YYYY-MM-DD HH:MM UTC]

### Impact Summary
[Brief description of customer impact - 2-3 sentences]

### Impact Metrics
- Total customers reporting issues: [number]
- Percentage of customer base: [estimated percentage]
- Geographic distribution: [regions affected]
- Customer segments affected: [Enterprise/SMB/Personal]

### Common Issues Reported
1. [Most common issue] - [X] reports
2. [Second most common issue] - [X] reports
3. [Third most common issue] - [X] reports

### Customer Sentiment
- Current CSAT trending: [Stable/Declining/Improving]
- Notable customer concerns: [list major themes]

### Recommended Actions
- [Technical team]: [recommended action]
- [Communications team]: [recommended action]
- [Customer success]: [recommended action]

### Additional Information
[Any other relevant details about customer impact]

CMOC Activation Template

## Incident #[number] - [title] - CMOC ACTIVATION
**Status:** In Progress
**Activation Time:** [YYYY-MM-DD HH:MM UTC]
**Requested By:** [Name/Role]

### Activation Criteria Met
- [List specific criteria that triggered activation]

### Current Support Status
- Active tickets: [number]
- Affected customers: [number/types]
- Current response time: [time]

### Initial CMOC Actions
1. [First immediate action]
2. [Second immediate action]
3. [Ongoing monitoring focus]

### Resource Requirements
- Personnel needed: [specific roles/numbers]
- Tools/access needed: [specific requirements]
- Stakeholder engagement needed: [specific teams]

### Actions Needed
- [CMOC]: Acknowledge activation and implement response plan
- [Regional managers]: [specific action requested]
- [Technical teams]: [specific action requested]

### Communication Plan
- Initial customer communication to be sent by: [time]
- Coordination meeting scheduled for: [time]
- Reporting cadence: [frequency]

Post-Incident Support Summary Template

## Incident #[number] - [title] - SUPPORT SUMMARY
**Status:** Resolved
**Incident Duration:** [start time] to [end time] UTC
**Report Time:** [YYYY-MM-DD HH:MM UTC]

### Support Response Summary
[Brief overview of the support response - 3-4 sentences]

### Key Metrics
- Total tickets handled: [number]
- Peak tickets per hour: [number]
- Average response time: [time]
- Support resources utilized: [number of staff]

### Customer Impact Analysis
- Most affected customer segments: [details]
- Geographic distribution: [details]
- Common workarounds provided: [list]

### Effectiveness Assessment
- What worked well: [bullet points]
- Improvement areas: [bullet points]
- Tool/process gaps identified: [bullet points]

### Follow-up Actions
- [Specific action items with owners and timelines]

### Lessons Learned
[Key takeaways for future incident response]

Cross-Functional Templates

Technical-to-Support Handoff Template

## Incident #[number] - [title] - TECHNICAL TO SUPPORT HANDOFF
**Status:** [In Progress/On Hold/Resolved]
**Time:** [YYYY-MM-DD HH:MM UTC]

### Technical Summary
[Brief technical explanation of the issue - keep simple and customer-focused]

### Customer-Facing Impact
- What customers are seeing: [observable symptoms]
- Affected components/features: [specific details]
- Scope of impact: [broad/limited/specific customers]

### Workaround Instructions
[Step-by-step workaround if available]

### Customer Communication Guidance
- Key points to communicate: [bullet points]
- Points to avoid mentioning: [bullet points]
- Technical accuracy verified by: [name]

### Expected Resolution
- Estimated time to resolution: [timeframe if known]
- Fix delivery method: [hotfix/regular release/etc.]

### Actions Needed
- [Support team]: [specific guidance on ticket handling]
- [CMOC]: [guidance on status.io messaging]
- [Other teams]: [any other coordination needed]

Executive Update Template

## Incident #[number] - [title] - EXECUTIVE SUMMARY
**Status:** [In Progress/On Hold/Resolved]
**Time:** [YYYY-MM-DD HH:MM UTC]

### Situation Overview
[Concise explanation of the incident - 1-2 sentences]

### Business Impact
- Customer impact: [High/Medium/Low] - [brief description]
- Revenue impact: [Yes/No/Unknown] - [brief description if Yes]
- Reputation risk: [High/Medium/Low] - [brief explanation]

### Response Status
- Technical response: [On track/Delayed/Blocked] - [brief status]
- Support response: [On track/Delayed/Blocked] - [brief status]
- Communications: [On track/Delayed/Blocked] - [brief status]

### Key Metrics
- Duration so far: [time]
- Estimated time to resolution: [time or unknown]
- Support tickets: [number]
- Affected customers: [number/percentage]

### Critical Decisions Needed
[List any decisions requiring executive input]

### Next Update
Next executive update scheduled for: [time] UTC

These templates give the CMOC and SMOC a consistent structure for the updates they share with stakeholders during an incident. They are designed to be clear, actionable, and adaptable to different incident scenarios.
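
Because every template marks its blanks with square brackets, a small script can fill in known values and report anything still left blank before an update is posted. This is a sketch only: `fill_template` and the sample values are hypothetical, and it assumes each placeholder is keyed by the exact text between its brackets.

```python
import re

# A placeholder is any text wrapped in square brackets, e.g. [title].
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def fill_template(template: str, values: dict) -> tuple:
    """Replace known placeholders; return (text, unfilled placeholder names)."""
    unfilled = []

    def substitute(match):
        key = match.group(1)
        if key in values:
            return values[key]
        unfilled.append(key)
        return match.group(0)  # leave unknown placeholders untouched

    return PLACEHOLDER.sub(substitute, template), unfilled


# Header lines from the Initial Status Update Template.
header = (
    "## Incident #[number] - [title]\n"
    "**Severity:** [SEV1/SEV2/SEV3]\n"
    "**Time Detected:** [YYYY-MM-DD HH:MM UTC]"
)

# Illustrative values only, not a real incident.
text, unfilled = fill_template(
    header,
    {"number": "1234", "title": "Example outage", "SEV1/SEV2/SEV3": "SEV2"},
)
```

Here `unfilled` is `["YYYY-MM-DD HH:MM UTC"]`, flagging that the detection time still needs to be supplied before the update goes out.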