Incident Roles - Incident Lead
Responsibilities of the Incident Lead
During the incident
- Post regular status updates using the /incident update command in the incident Slack channel. These updates should summarize the current customer impact of the incident and the actions we are taking to mitigate it. This is the most important section of the incident timeline. It will be referenced in status page updates and should provide a summary of the incident and impact that can be understood by the wider community.
- Ensure that the incident issue has all of the required fields applied. If not, set them using the /incident field command from the incident Slack channel.
- Ensure that the incident issue is appropriately restricted based on data classification. To mark the issue as confidential, use /incident field and set Keep GitLab Issue Confidential to true.
- The Incident Lead should not consider immediate work on an incident completed until the Incident Summary is filled out with useful information to describe all the key aspects of the incident.
- Ensure that the Timeline section of the incident in the post-incident tab is accurate and complete with the start and end of the customer impact.
- Ensure that the root cause is stated clearly and plainly in the incident description by updating the causes section in the /incident summary. Alternatively, it can be shared as an internal status update using emoji reactions: reacting with :pushpin: will post a public comment on the GitLab incident issue, and reacting with :star: will add an internal comment to the GitLab incident issue.
- Ensure all follow-up items are properly documented and assign initial owners when possible.
- Be available for customer interactions when requested by the Communications Lead. See Communications Lead - Customer Call Management.
After the incident
- Review automatically created follow-up issues within one business day. Issues pasted into the incident channel are automatically linked as follow-ups, so it is possible some of these are not valid follow-up items.
- Verify each follow-up issue has appropriate context from the incident.
- Move follow-up issues from the follow-up issues project to the correct project for the responsible team (typically this will be gitlab-org/gitlab or production-engineering).
- Apply appropriate labels such as team and group to follow-up issues.
- The Incident Lead should review the comments and ensure that the corrective actions are added to the issue description, regardless of the incident severity.
- For all Severity 1 and Severity 2 incidents, initiate an async incident review and inform the Engineering Manager of the team owning the root cause that they may need to initiate the Feature Change Lock process.
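The post-incident steps above depend on severity only in the last item, which can be sketched as a small checklist builder. This is an illustrative sketch, not part of any GitLab tooling: the function name `post_incident_actions` and the integer severity encoding (1 is the highest) are assumptions made for the example.

```python
from typing import List


def post_incident_actions(severity: int) -> List[str]:
    """Return the Incident Lead's post-incident checklist for a severity.

    Hypothetical helper: severity is an integer where 1 is the highest.
    """
    # These steps apply regardless of the incident severity.
    actions = [
        "Review automatically created follow-up issues within one business day",
        "Verify each follow-up issue has appropriate context from the incident",
        "Move follow-up issues to the responsible team's project",
        "Apply team and group labels to follow-up issues",
        "Add corrective actions to the issue description",
    ]
    # Severity 1 and Severity 2 incidents add the review and notification steps.
    if severity in (1, 2):
        actions.append("Initiate an async incident review")
        actions.append(
            "Inform the owning team's Engineering Manager about a possible Feature Change Lock"
        )
    return actions
```

Calling `post_incident_actions(3)` returns only the five common steps, while severities 1 and 2 return seven.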
Special Handling for S1 / S2 Incidents
When paged, the IMOC has the following responsibilities during a Sev1 or Sev2 incident and should be engaged on these tasks immediately when an incident is declared:
- In the event of an incident which has been triaged and confirmed as a clear Severity 1 impact:
  - Notify Infrastructure Leadership by typing /incident escalate in Slack. In the On-call teams drop-down menu, select dotcom leadership escalation with the appropriate message in the Notification Message field. This notification should happen 24/7.
  - In the case of a large scale outage where there is a serious disruption of service, the IMOC should check in with Infrastructure Leadership whether a senior member should be brought into the incident to coordinate and manage recovery efforts. This is to ensure that the person in charge of coordinating multiple parallel recovery efforts has a deeper understanding of what is required to bring services back online.
- Consider engaging the release-management team if a code change related issue is identified as a potential cause and we need to explore rollbacks or expedited deployment. This can be done by using their Slack handle release-managers.
- Ensure that necessary public communications are made accurately and in a timely fashion by the Communications Lead. Be mindful that, due to the directive to err on the side of declaring incidents early and often, we should first confirm customer impact with the Incident Responder prior to approving customer status updates.
- If necessary, help the Incident Responder to engage development using the InfraDev escalation process.
- If applicable, coordinate the incident response with business contingency activities.
- Following the first significant Severity 1 or 2 incident for a new member of the IMOC, schedule a feedback coffee chat with the Engineer On Call, Communications Manager On Call, and (optionally) any other key participants to receive actionable feedback on your engagement.
The IMOC is the DRI for all of the items listed above, but it is expected that they will do it with the support of the Incident Lead, Incident Responder, or others who are involved with the incident. If an incident runs beyond a scheduled shift, the IMOC is responsible for handing over to the incoming IMOC member.
The IMOC won’t be engaged on these tasks unless they are paged, which is why the default is to page them for all Sev1 and Sev2 incidents. In other situations, page the IMOC to engage them.
Paging Infrastructure Leadership
To page Infrastructure Leadership directly, run /inc escalate and choose dotcom leadership escalation from the Oncall Teams drop-down menu.
Paging the Infrastructure Liaison
During a verified Severity 1 Incident the IMOC will page the Infrastructure Liaison.
To page the Infrastructure Liaison directly, run /pd trigger and choose the Infrastructure Liaison as the impacted service.
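The paging rules in this section can be summarized as a small decision function. The sketch below is illustrative only: `who_to_page` and `ESCALATION_COMMANDS` are hypothetical names, severity is assumed to be an integer where 1 is the highest, and the Slack commands and menu selections are copied from the text above rather than from any API. This section does not state a command for paging the IMOC, so none is listed.

```python
from typing import Dict, List, Tuple

# Slack command and menu selection for each target, as given in the text above.
ESCALATION_COMMANDS: Dict[str, Tuple[str, str]] = {
    "Infrastructure Leadership": ("/incident escalate", "dotcom leadership escalation"),
    "Infrastructure Liaison": ("/pd trigger", "Infrastructure Liaison"),
}


def who_to_page(severity: int, verified: bool) -> List[str]:
    """Return who should be paged for an incident (hypothetical helper).

    `verified` means the incident has been triaged and confirmed at that severity.
    """
    targets: List[str] = []
    # The default is to page the IMOC for all Sev1 and Sev2 incidents.
    if severity in (1, 2):
        targets.append("IMOC")
    # A verified Severity 1 additionally escalates to leadership and the liaison.
    if severity == 1 and verified:
        targets.append("Infrastructure Leadership")
        targets.append("Infrastructure Liaison")
    return targets
```

For lower severities the function returns an empty list, which corresponds to the case where the IMOC is only engaged if someone pages them explicitly.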
