How to Perform GitLab Dedicated CMOC Duties
Introduction
The GitLab Dedicated Communications Manager on Call (GDCMOC) is an async role with the purpose of keeping GitLab Dedicated customers up-to-date about their environments. It involves liaising with Dedicated infrastructure team members on Slack or GitLab issues, and then relaying the information to the customer.
The GDCMOC rotation currently uses the GitLab.com CMOC rotation to determine who is oncall. When you go oncall as a GitLab.com CMOC, you will also be the GDCMOC. The Communications Lead is currently staffed by CMOC and GDCMOC, with plans to evolve this structure in the future.
Guidelines for the Role
- The GDCMOC is not expected to perform troubleshooting.
- GDCMOCs do not need to devote their full attention to monitoring the relevant threads or issues. As a guideline, check existing communication threads every 30 minutes for updates that need to be shared with the customer.
Modes of Communication
The GDCMOC role involves two types of customer communication, each serving a different purpose and using different tools. When paged, the GitLab Dedicated SRE can advise which method is needed based on whether you need to inform customers or gather information from them.
Mode One: Switchboard Notifications
- One-way broadcast communication for notifying customers of incident status or planned emergency maintenance impacting one or more customer environments
- Requires creating a notification using pre-approved templates in Switchboard
- Notifications are sent to the Operational Email Addresses list of the tenant and Switchboard Users with email notifications enabled
- This mode of communication replaces the previous manual process of creating individual support tickets for each tenant, to provide scalable and compliant customer communication during incidents. See STM#6768
- Watch an internal demo of this feature using the GitLab Unfiltered YouTube account
- → Follow Sending Notifications using Switchboard
Mode Two: Contact Request using Zendesk
- Used for two-way communication for information gathering, and when no appropriate Switchboard template exists
- Requires creating a Zendesk ticket
- → Follow Initiating a Contact Request
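The choice between the two modes can be sketched as a small decision function. This is only an illustration of the criteria listed above; the names `Mode` and `choose_mode` are hypothetical, and in practice the GitLab Dedicated SRE advises which mode to use.

```python
from enum import Enum


class Mode(Enum):
    SWITCHBOARD_NOTIFICATION = "Switchboard notification"
    ZENDESK_CONTACT_REQUEST = "Zendesk contact request"


def choose_mode(needs_customer_reply: bool, template_available: bool) -> Mode:
    """Pick a communication mode from the two criteria described above."""
    # Two-way information gathering goes through a Zendesk contact request.
    if needs_customer_reply:
        return Mode.ZENDESK_CONTACT_REQUEST
    # One-way broadcasts need a pre-approved Switchboard template;
    # without one, fall back to a Zendesk contact request.
    if not template_available:
        return Mode.ZENDESK_CONTACT_REQUEST
    return Mode.SWITCHBOARD_NOTIFICATION


if __name__ == "__main__":
    print(choose_mode(needs_customer_reply=False, template_available=True).value)
```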
Engaging the GDCMOC
The GDCMOC can be paged using Slack or directly using PagerDuty.
- Slack: Using the `/pd trigger` command in Slack, select `Incident Management - GDCMOC` in the Impacted Service modal. Fill in the Title and click the Add Details button. Add a description with a link to the issue or Slack channel where you need the GDCMOC’s attention, then click Create.
- PagerDuty: From the Incident Management - GDCMOC page, click New Incident. Fill in the Title, add a description with a link to the issue or Slack channel where you need the GDCMOC’s attention, and click Create Incident.
The Description field is optional; however, it is the only way to tell the on-call support engineer what is required or where they are needed, so please ensure it is filled in.
There is additional information about engaging the GDCMOC in the on-call runbook for the GitLab Dedicated team.
Incident Management for GDCMOC
Acknowledge the PagerDuty Page
Mark the page as acknowledged. This can be done through the mobile app, web interface or PagerDuty App in the #support_gitlab-dedicated Slack channel.
Dedicated SREs will reach out when customer communication is needed. The description in the PagerDuty alert should contain details about an issue, or a Slack thread you need to follow. Follow any communication threads, and let the Dedicated Incident team know you are available to assist. If you’re unsure, check the GitLab Dedicated incidents issue tracker or ask in the #g_dedicated-team Slack channel.
Understand What Action to Take
Understand from the Dedicated SRE what type of communication is required:
- Create an Incident Status Notification
- Create an Emergency Maintenance Notification
- Initiate a Contact Request
Sending Notifications Using Switchboard
Creating an Incident Status Notification Using Switchboard
- Log in to Switchboard
- Select `Customer notifications` from the top-right drop-down menu where you see your email address displayed.
- Click `+ New notification`
- Select the impacted Tenant(s)
- Select the relevant template for incidents:
  1. `Incident investigation start` is used at the beginning of the incident. It is the most generic template available.
  2. `Incident investigation update` is used as an update to show we are working on the incident and have information to share.
  3. `Incident escalated response` is used to show we are giving the incident maximum priority.
  4. `Incident mitigation in progress` is used to show we are actively working on mitigating the incident.
  5. `Incident resolved` is sent to close out the incident when a fix or mitigation is deployed.
- For templates 2-4: If known, select an `Investigation focus area` and/or `Affected components`.
- For templates 2-4: Optionally, if the customer has reached out regarding the impact they are seeing, and it aligns with the incident, check the box `Include customer reported impact` and include it.
  - This free-text box should only be used for customer-reported impact.
  - The goal is to confirm with the customer that we are aligned by sharing the details of the impact that they have shared with us.
- Preview the notification to ensure it is as expected
- Click `Send`
- After sending the initial Switchboard notification, mark the PagerDuty alert as Resolved. The alert’s purpose is specifically to engage the GDCMOC to start communication.
- Continue to provide ongoing incident updates to the customer
Providing Ongoing Incident Updates Using Switchboard
Update the customer on the incident status by creating a new incident status notification. If nothing has changed since the last update, reuse the same information.
Provide an update every 60 minutes, or whenever the incident progresses to a new stage (Investigation Start → Investigation Update → Mitigation in Progress → Resolved), whichever comes first.
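The "every 60 minutes or on a stage change, whichever comes first" rule can be expressed as a short check. This is a minimal sketch for illustration; `update_due` and its parameters are hypothetical names, and the stage labels are passed as plain strings rather than read from Switchboard.

```python
from datetime import datetime, timedelta

# Maximum time allowed between customer updates, per the guideline above.
UPDATE_INTERVAL = timedelta(minutes=60)


def update_due(last_sent: datetime, now: datetime,
               last_stage: str, current_stage: str) -> bool:
    """Return True when a new Switchboard notification should go out."""
    stage_changed = current_stage != last_stage
    interval_elapsed = now - last_sent >= UPDATE_INTERVAL
    # Whichever condition is met first triggers the update.
    return stage_changed or interval_elapsed


if __name__ == "__main__":
    sent = datetime(2024, 1, 1, 12, 0)
    print(update_due(sent, sent + timedelta(minutes=20),
                     "Investigation Start", "Investigation Update"))
```

A stage change triggers an update even if the last notification went out only a few minutes earlier.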
Handling Customer-created Zendesk Tickets during Incidents
After creating incident notifications on Switchboard, customers may open new Zendesk tickets seeking information about the incident. Inform them that the incident is being actively investigated and updates will be provided through Switchboard notifications as progress is made, or at least every 60 minutes.
Continue regular notification updates using Switchboard: Responding to Zendesk tickets does not replace updating Switchboard notifications. Continue to provide ongoing incident updates using Switchboard.
Viewing Past Notifications on Switchboard
All customer notifications are logged in Switchboard. To view past notifications:
- Click on your profile in the top left corner
- Select `Customer notifications`
- Click on the Title of the relevant notification to view the message and its recipients
Creating an Emergency Maintenance Notification on Switchboard
A security vulnerability fix might result in emergency maintenance for GitLab Dedicated environments.
NOTE: “Emergency maintenance” refers exclusively to security-related maintenance. Maintenance that happens outside of the weekly scheduled maintenance window is referred to as “out-of-band maintenance”, and this workflow does not apply to it.
Follow the steps in Creating an Incident Status Notification Using Switchboard, and select the templates for maintenance:
- `Emergency maintenance planned` is used to give advance notice of emergency maintenance due to a critical vulnerability
- `Emergency maintenance completed` is used to confirm that the emergency maintenance finished successfully
Initiating a Contact Request on Zendesk
Use this workflow when you need to gather additional information from customers for incident investigation or when no pre-existing Switchboard template is available for the communication.
Locate the customer’s contact email in Switchboard, then create a customer support ticket in Zendesk using the contact information.
Locating Customer Email Addresses in Switchboard
- Log in to Switchboard
- You should see the `Tenants` page when logged in. Find the relevant tenant and click `Manage`.
- Expand the `Cloud Account Config` section, and look for the `Primary Region`. This should tell us which region the customer is based in. See the AWS docs if you’re unsure of the AWS region code. Make a note of the region.
- Search for the `Contact information` section, and expand it. You should see values for `Operational email addresses` and `Customer Success Manager CSM`.
Creating a Zendesk Ticket
- Follow the instructions here to create a Zendesk ticket for the outbound request.
  - For the subject of the ticket, use the following template: `GitLab Dedicated Notice: <description>`.
  - Apply the macro `General::Outbound Contact Request`
  - For the ticket requestor, use the first Operational Email Address listed.
  - CC the other Operational Email Addresses and the Customer CSM and ASE (if any).
  - Set the Preferred Region for Support to the region that best matches where the tenant’s `Primary Region` is located.
  - Add a `dedicated_contacted_request` tag to the ticket.
  - Set the “Support Resolution Codes” to Incident.
- Assign the ticket to yourself.
- After sending the initial outreach message to the customer, mark the PagerDuty alert as resolved. The alert’s purpose is specifically to engage the GDCMOC to start communication.
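The ticket fields described above can be collected into a simple checklist structure before filling in the Zendesk form. This is a hedged sketch only: `OutboundTicket` and `draft_ticket` are hypothetical helpers, not part of any Zendesk API, and the preferred region is supplied by the caller as a plain string.

```python
from dataclasses import dataclass, field


@dataclass
class OutboundTicket:
    subject: str
    requester: str
    cc: list[str]
    preferred_region: str
    # Fixed values taken from the steps above.
    tags: list[str] = field(default_factory=lambda: ["dedicated_contacted_request"])
    macro: str = "General::Outbound Contact Request"
    resolution_code: str = "Incident"


def draft_ticket(description: str, operational_emails: list[str],
                 extra_cc: list[str], preferred_region: str) -> OutboundTicket:
    """Assemble the fields for a contact request (illustrative only)."""
    if not operational_emails:
        raise ValueError("at least one Operational Email Address is required")
    return OutboundTicket(
        subject=f"GitLab Dedicated Notice: {description}",
        # The first Operational Email Address becomes the requester.
        requester=operational_emails[0],
        # Remaining operational addresses plus the CSM and ASE (if any) are CC'd.
        cc=list(operational_emails[1:]) + list(extra_cc),
        preferred_region=preferred_region,
    )
```

Raising an error when no Operational Email Address is available makes the missing contact visible before any outreach is attempted.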
Closing the Zendesk Ticket
Before closing the Zendesk ticket, you should:
- Send a final update to the customer confirming the completion.
- Close the outreach ticket.
- Add a brief internal note summarizing the communication timeline (optional).
Note: If the customer responds with follow-up questions after closure, create a new ticket to handle those inquiries separately from the original outreach communication.
Keep the Customer Informed
- Work with the customer to set expectations about the frequency of updates, especially if you are the GDCMOC within the same region as the customer. They will likely expect more updates during their regional business hours.
- If we proceed with lower frequency updates, the important thing is that we communicate our expected update frequency to them. For example, we can let the customer know that during their regional business hours, we will provide an update every 1-2 hours, and during their non-regional hours we will update them if there is anything substantial to share.
- Keep in mind the information that we should not share with the customer
- If you’d like a second pair of eyes to review messages before sending them out to customers, refer to the table below to find an appropriate DRI.
- Approval of message content is required for security-related communications.
- Approval is optional for all other communication.
| Communication type | Who reviews content? | Who approves content? |
|---|---|---|
| Non-security out-of-band maintenance | SRE | Optional |
| Security-related out-of-band maintenance | SIRT | SIRT |
| Incident communication | SRE / Incident manager | Optional |
| Other urgent communication | It depends | Optional |
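The table above can also be read as a lookup from communication type to reviewer and approver. The sketch below simply restates the table; `REVIEW_MATRIX` and `approval_required` are hypothetical names, and an approver of `None` stands for "Optional".

```python
# (reviewer, approver) per communication type; None means approval is optional.
REVIEW_MATRIX = {
    "Non-security out-of-band maintenance": ("SRE", None),
    "Security-related out-of-band maintenance": ("SIRT", "SIRT"),
    "Incident communication": ("SRE / Incident manager", None),
    "Other urgent communication": ("It depends", None),
}


def approval_required(communication_type: str) -> bool:
    """Return True when message content must be approved before sending."""
    _reviewer, approver = REVIEW_MATRIX[communication_type]
    return approver is not None


if __name__ == "__main__":
    for kind in REVIEW_MATRIX:
        print(f"{kind}: approval required = {approval_required(kind)}")
```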
Getting Paged for Concurrent Incidents
Support Engineers are not expected to manage multiple incidents. If a concurrent GitLab.com incident or GitLab Dedicated contact request comes in, engage with the Support Manager oncall to help find cover for the new incident.
You can ping the Support Manager oncall in Slack with @support-manager-oncall.
GDCMOC Handover
Follow the End of Shift Handover Procedure from the CMOC workflows. Make the incoming GDCMOC aware of any Switchboard notifications sent out, and of any issues, Slack threads or tickets they should CC themselves on. Assign the Zendesk ticket used for communication to the next CMOC.
