Incident Response Lifecycle
The Incident Response Lifecycle working group is intended to document a shared incident response protocol and knowledge base.
Attributes
Property | Value |
---|---|
Date Created | February 01, 2022 |
Target End Date | April 30, 2022 |
Slack | wg-incident-respose-management-framework |
Google Doc | Incident Response Management Working Group (internal) |
Issue Label | WG-IRM (gitlab-com/-org) |
Business Goal
- Increase efficiency through common incident response, analysis, documentation, ongoing management and reporting methods.
- Increase transparency by improving the visibility and communication of incidents to the business and e-group
- Support results by building our clients’ confidence in GitLab’s ability to quickly resolve and communicate incidents when they occur
- Align Incident Management activities and priorities with those of the business
- Prepare materials for the creation of training modules for the Engineering Department on the Incident Management Process at GitLab
- Highlight dogfooding opportunities
Exit Criteria
- Single source of truth documenting incident response management that will be applicable to all areas of Engineering and teams who provide Incident Response
- Each functional area of Engineering will develop its own incident management requirements for identifying and reacting to service outages or security threats.
- Create a comprehensive knowledge base for GitLab team members to help them understand how incident response teams implement the IR process
Outcome
- Help teams across GitLab lower MTTR
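MTTR is read here as mean time to resolve, which is an assumption because the abbreviation is also used for mean time to recover or repair. Under that reading it is the average interval between detection and resolution across incidents. A minimal sketch, assuming a hypothetical `Incident` record with `detected_at` and `resolved_at` timestamps; the field names and sample data are illustrative and not taken from GitLab's tooling:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape; these field names are assumptions.
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average of (resolved_at - detected_at) over all incidents."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


# Illustrative data only: one 30-minute and one 90-minute incident.
sample = [
    Incident(datetime(2022, 2, 1, 10, 0), datetime(2022, 2, 1, 10, 30)),
    Incident(datetime(2022, 2, 3, 14, 0), datetime(2022, 2, 3, 15, 30)),
]
print(mean_time_to_resolve(sample))  # 1:00:00
```

Tracking this value per team over time is one way to check whether the outcome above is being met.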
Other Investigations
- Improvements in feedback and learnings from incidents to build resilience
- Service Catalog
What do other companies do?
How is IR done today?
- SIRT
- On-call
- Reliability
- Support
- How to Perform CMOC Duties
- Contacting Customers
- Sending Notices (small number of users)
Noted issues
Related Issues
Roles and Responsibilities
Working Group Role | Person | Title |
---|---|---|
Facilitator | Anna Liisa Moter | Manager Reliability |
Exec Sponsor | Steve Loyd | VP Infrastructure |
Member | Anthony Fappiano | Manager Reliability |
Development Functional Lead | Dan Croft | Senior Engineering Manager, Ops |
Member | Sam Goldstein | Director of Engineering, Ops |
Member (CMOC) | Kenneth Chu | Support team |
Member | Kevin Chu | Group Manager of Product, Monitor |
Requirements and Considerations
Actors
- Reliability Engineers
- SIRT Engineers
- Development Team
- Quality Team
- Support Team
General
- As a GitLab team member who can raise an incident, I know how an incident can be initiated
- As a GitLab team member who can raise an incident, I have a general understanding about incident severity levels
- As a GitLab team member who can raise an incident, I understand the high level process of Incident Management and its importance to the business
- As a GitLab team member who can raise an incident, I can contact the right team via a dedicated Slack channel.
- As a GitLab team member who can raise an incident, I can easily find a page in the handbook that documents the Incident Response Procedures
SIRT Engineer
- As a SIRT Engineer I know how to pull relevant resources from other teams when I need assistance
- As a SIRT Engineer I can easily categorize the incident
- As a SIRT Engineer I can identify triggers and indicators
- As a SIRT Engineer I know where to document the incident details
- As a SIRT Engineer I know when to transition from incident identification to mitigation, to remediation, and to post-incident activities
- As a SIRT Engineer I can follow a reporting process to hand off incidents or provide updates to Management
Reliability Engineers
- As a Reliability Engineer, I know how to level an incident in a manner that is consistent across the org
- As a Reliability Engineer, I know how to engage the other roles during an incident
- As a Reliability Engineer, I know when to transition from incident identification to mitigation, to resolution, and to post-incident activities
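The phase transitions named in the last requirement can be sketched as a small state machine. This is an illustration only: it assumes the phases are strictly linear, and the `Phase` enum and `advance` helper are hypothetical names, not part of any GitLab tooling.

```python
from enum import Enum


class Phase(Enum):
    IDENTIFICATION = "identification"
    MITIGATION = "mitigation"
    RESOLUTION = "resolution"
    POST_INCIDENT = "post-incident"


# Assumption: each phase has exactly one successor, and the last has none.
NEXT_PHASE = {
    Phase.IDENTIFICATION: Phase.MITIGATION,
    Phase.MITIGATION: Phase.RESOLUTION,
    Phase.RESOLUTION: Phase.POST_INCIDENT,
}


def advance(current: Phase) -> Phase:
    """Return the phase that follows `current`, or raise if it is the last one."""
    if current not in NEXT_PHASE:
        raise ValueError(f"{current.value} is the final phase")
    return NEXT_PHASE[current]


phase = Phase.IDENTIFICATION
phase = advance(phase)
print(phase.value)  # mitigation
```

Making the allowed transitions explicit gives every role the same reference for when an incident moves from one phase to the next.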
Development Team
- As a leader in Development who is part of the Incident Manager rotation, I am clear on the role’s responsibilities and how the role supports the Incident Management process.
Quality Teams
Support Team
- As a Support Engineer, I know how to create a status page
- As a Support Engineer, I know the differences between the Incident Status states on the status page
- As a Support Engineer, I know how frequently to update the status page
- As a Support Engineer, I know how to engage the Incident Manager or EOC when asking for feedback on an update I am about to post on the status page
- As a Support Engineer, I know how to notify the stakeholders
- As a Support Engineer, I know how to find related tickets in Zendesk and the GitLab issue tracker to help assess the impact of an incident
- As a Support Engineer, I know how to contact users if their usage of GitLab SaaS was restricted due to an incident
Last modified October 29, 2024: Fix broken links (455376ee)