How to Perform GitLab Dedicated CMOC Duties

Describes the role and responsibilities for the GitLab Dedicated CMOC rotation in Support Engineering

Introduction

The GitLab Dedicated Communications Manager on Call (GDCMOC) is an async role with the purpose of keeping GitLab Dedicated customers up-to-date about their environments. It involves liaising with Dedicated infrastructure team members on Slack or GitLab issues, and then relaying the information to the customer.

The GDCMOC rotation currently uses the GitLab.com CMOC rotation to determine who is on call. When you go on call as a GitLab.com CMOC, you are also the GDCMOC. The Communications Lead is currently staffed by CMOC and GDCMOC, with plans to evolve this structure in the future.

Guidelines for the Role

  • The GDCMOC is not expected to perform any troubleshooting.
  • GDCMOCs do not need to devote their full attention to monitoring the relevant threads or issues. As a guideline, check existing communication threads every 30 minutes for updates that need to be shared with the customer.

Modes of Communication

The GDCMOC role involves two types of customer communication, each serving a different purpose and using different tools. When you are paged, the GitLab Dedicated SRE can advise which method is needed, based on whether you need to inform customers or gather information from them.

Mode One: Switchboard Notifications

  • One-way broadcast communication for notifying customers of incident status or planned emergency maintenance that impacts one or more customer environments
  • Requires creating a notification using pre-approved templates in Switchboard
  • Notifications are sent to the Operational Email Addresses list of the tenant and Switchboard Users with email notifications enabled
  • This mode of communication replaces the previous manual process of creating individual support tickets for each tenant, to provide scalable and compliant customer communication during incidents. See STM#6768
  • Watch an internal demo of this feature using the GitLab Unfiltered YouTube account
  • → Follow Sending Notifications using Switchboard

Mode Two: Contact Request using Zendesk

  • Used for two-way communication for information gathering, and when no appropriate Switchboard template exists
  • Requires creating a Zendesk ticket
  • → Follow Initiating a Contact Request

Engaging the GDCMOC

The GDCMOC can be paged using Slack or directly using PagerDuty.

  • Slack: Using the /pd trigger command in Slack, select Incident Management - GDCMOC in the Impacted Service modal. Fill in the Title and click the Add Details button. Add a description with a link to the issue or Slack channel where you need the GDCMOC’s attention, then click Create.
  • PagerDuty: From the Incident Management - GDCMOC page, click New Incident. Fill in the Title, add a description with a link to the issue or Slack channel where you need the GDCMOC’s attention, and click Create Incident.

The Description field is optional. However, it is the only way to tell the on-call support engineer what is required or where they are needed, so please make sure it is filled in.

There is additional information about engaging the GDCMOC in the on-call runbook for the GitLab Dedicated team.

Incident Management for GDCMOC

Acknowledge the PagerDuty Page

Mark the page as acknowledged. This can be done through the mobile app, web interface or PagerDuty App in the #support_gitlab-dedicated Slack channel.

Dedicated SREs will reach out when customer communication is needed. The description in the PagerDuty alert should contain details about an issue, or a Slack thread you need to follow. Follow any communication threads, and let the Dedicated Incident team know you are available to assist. If you’re unsure, check the GitLab Dedicated incidents issue tracker or ask in the #g_dedicated-team Slack channel.

Understand What Action to Take

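Confirm with the Dedicated SRE which type of communication is required: a Switchboard notification (Mode One) or a Zendesk contact request (Mode Two). Then follow the matching workflow in the sections that follow.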

Sending Notifications Using Switchboard

Creating an Incident Status Notification Using Switchboard

  1. Log in to Switchboard
  2. Select Customer notifications from the top-right drop down menu where you see your email address displayed.
  3. Click + New notification
  4. Select the impacted Tenant(s)
  5. Select the relevant template for incidents:
    1. Incident investigation start is used at the beginning of the incident. It is the most generic template available
    2. Incident investigation update is used as an update to show we are working on the incident and have information to share
    3. Incident escalated response is used to show we are giving the incident maximum priority
    4. Incident mitigation in progress is used to show we are actively working on mitigating the incident
    5. Incident resolved is sent to close out the incident when a fix or mitigation is deployed
  6. For templates 2-4: If known, select an Investigation focus area and/or Affected components.
  7. For templates 2-4 (optional): If the customer has reached out about the impact they are seeing, and it aligns with the incident, check the box Include customer reported impact and enter the impact they described.
    • This freetext box should only be used for customer-reported impact.
    • The goal is to confirm with the customer that we are aligned by sharing the details of the impact that they have shared with us.
  8. Preview the notification to ensure it is as expected
  9. Click Send
  10. After sending the initial Switchboard notification, mark the PagerDuty alert as Resolved. The alert’s purpose is specifically to engage the GDCMOC to start communication.
  11. Continue to provide ongoing incident updates to the customer
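
The template rules in steps 5 to 7 can be summarised as a small lookup. The following is a minimal sketch for illustration only: the template and field names come from the steps above, but the function and constant names are hypothetical and nothing here calls Switchboard itself.

```python
from typing import List

# Incident templates in the order listed in step 5.
INCIDENT_TEMPLATES = [
    "Incident investigation start",
    "Incident investigation update",
    "Incident escalated response",
    "Incident mitigation in progress",
    "Incident resolved",
]

# Steps 6 and 7 apply to templates 2-4 only (list indices 1 to 3).
TEMPLATES_WITH_OPTIONAL_FIELDS = set(INCIDENT_TEMPLATES[1:4])


def optional_fields_for(template: str) -> List[str]:
    """Return the optional fields that may be filled in for a template.

    Hypothetical helper: it only encodes the rule from steps 6 and 7.
    """
    if template not in INCIDENT_TEMPLATES:
        raise ValueError(f"Unknown incident template: {template}")
    if template in TEMPLATES_WITH_OPTIONAL_FIELDS:
        return [
            "Investigation focus area",
            "Affected components",
            "Include customer reported impact",
        ]
    return []
```

For example, `optional_fields_for("Incident resolved")` returns an empty list, because the extra fields are offered only for templates 2-4.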

Providing Ongoing Incident Updates Using Switchboard

Update the customer on the incident status by creating a new incident status notification. If nothing has changed since the last update, send the same information again.

Provide an update every 60 minutes, or whenever the incident progresses to a new stage (Investigation Start → Investigation Update → Mitigation in Progress → Resolved), whichever comes first.
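
The "whichever comes first" rule can be expressed as a simple check. This is an illustrative sketch only, assuming you track the time and stage of the last notification yourself; the function name and parameters are hypothetical and are not part of Switchboard.

```python
from datetime import datetime, timedelta

# Maximum time between updates, per the guideline above.
UPDATE_INTERVAL = timedelta(minutes=60)


def update_due(last_sent: datetime, now: datetime,
               last_stage: str, current_stage: str) -> bool:
    """Return True when a new Switchboard notification should be sent.

    An update is due when the incident has moved to a new stage, or when
    60 minutes have passed since the last notification, whichever is first.
    """
    stage_changed = current_stage != last_stage
    interval_elapsed = (now - last_sent) >= UPDATE_INTERVAL
    return stage_changed or interval_elapsed
```

With a last notification sent at 12:00 in the Investigation Start stage, the check returns `False` at 12:59 if the stage is unchanged, and `True` at 13:00 or as soon as the stage changes.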

Handling Customer-created Zendesk Tickets during Incidents

After creating incident notifications on Switchboard, customers may open new Zendesk tickets seeking information about the incident. Inform them that the incident is being actively investigated and updates will be provided through Switchboard notifications as progress is made, or at least every 60 minutes.

Continue regular notification updates using Switchboard: Responding to Zendesk tickets does not replace updating Switchboard notifications. Continue to provide ongoing incident updates using Switchboard.

Viewing Past Notifications on Switchboard

All customer notifications are logged in Switchboard. To view past notifications:

  1. Click on your profile in the top left corner
  2. Select Customer notifications
  3. Click on the Title of the relevant notification to view the message and its recipients

Creating an Emergency Maintenance Notification on Switchboard

A security vulnerability fix might result in emergency maintenance for GitLab Dedicated environments.

NOTE: “Emergency maintenance” refers exclusively to security-related maintenance. Maintenance that happens outside of the weekly scheduled maintenance window is referred to as “out-of-band maintenance”, and this workflow does not apply to it.

Follow the steps in Creating an Incident Status Notification Using Switchboard, and select the templates for maintenance:

  1. Emergency maintenance planned is used to give advance notice of emergency maintenance due to a critical vulnerability
  2. Emergency maintenance completed is used to confirm that the emergency maintenance finished successfully

Initiating a Contact Request on Zendesk

Use this workflow when you need to gather additional information from customers for incident investigation or when no pre-existing Switchboard template is available for the communication.

Locate the customer’s contact email in Switchboard, then create a customer support ticket in Zendesk using the contact information.

Locating Customer Email Addresses in Switchboard

  1. Log in to Switchboard
  2. You should see the Tenants page when logged in. Find the relevant tenant and click Manage.
  3. Expand the Cloud Account Config section, and look for the Primary Region. This should tell us which region the customer is based in. See the AWS docs if you’re unsure of the AWS region code. Make a note of the region.
  4. Find the Contact information section and expand it. You should see values for Operational email addresses and Customer Success Manager (CSM).

Creating a Zendesk Ticket

  1. Follow the instructions here to create a Zendesk ticket for the outbound request.
    1. For the subject of the ticket, use the following template: GitLab Dedicated Notice: <description>.
    2. Apply the macro General::Outbound Contact Request
    3. For the ticket requestor, use the first Operational Email Address listed.
    4. CC the other Operational Email Addresses and the customer’s CSM and ASE (if any).
    5. Set the Preferred Region for Support to the support region that corresponds to the tenant’s Primary Region.
    6. Add a dedicated_contacted_request tag to the ticket.
    7. Set the “Support Resolution Codes” to Incident.
  2. Assign the ticket to yourself.
  3. After sending the initial outreach message to the customer, mark the PagerDuty alert as resolved. The alert’s purpose is specifically to engage the GDCMOC to start communication.
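
The field values from step 1 can be collected in one place before you fill in the Zendesk form. The sketch below is for illustration only: the dictionary keys and the function name are hypothetical and do not correspond to Zendesk API fields, and the preferred region is passed in rather than derived, since the mapping from AWS region to support region is not defined here.

```python
from typing import Dict, List, Optional


def build_contact_request(
    description: str,
    operational_emails: List[str],
    preferred_region: str,
    csm_email: Optional[str] = None,
    ase_email: Optional[str] = None,
) -> Dict[str, object]:
    """Assemble the ticket values described in step 1 (hypothetical helper)."""
    if not operational_emails:
        raise ValueError("At least one Operational Email Address is required")

    # The first operational address is the requester; the rest are CC'd,
    # followed by the CSM and ASE when they exist.
    cc = list(operational_emails[1:])
    cc += [email for email in (csm_email, ase_email) if email]

    return {
        "subject": f"GitLab Dedicated Notice: {description}",
        "macro": "General::Outbound Contact Request",
        "requester": operational_emails[0],
        "cc": cc,
        "preferred_region": preferred_region,
        "tags": ["dedicated_contacted_request"],
        "support_resolution_code": "Incident",
    }
```

For example, calling it with two operational addresses and a CSM address puts the first operational address in `requester` and the other two addresses in `cc`. The region value you pass (for instance a placeholder such as `"AMER"`) is whatever you noted from the tenant’s Primary Region.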

Closing the Zendesk Ticket

To close out the Zendesk ticket:

  1. Send a final update to the customer confirming the completion.
  2. Close the outreach ticket.
  3. Add a brief internal note summarizing the communication timeline (optional).

Note: If the customer responds with follow-up questions after closure, create a new ticket to handle those inquiries separately from the original outreach communication.

Keep the Customer Informed

  • Work with the customer to set expectations about the frequency of updates, especially if you are the GDCMOC within the same region as the customer. They will likely expect more updates during their regional business hours.
    • If we proceed with lower frequency updates, the important thing is that we communicate our expected update frequency to them. For example, we can let the customer know that during their regional business hours, we will provide an update every 1-2 hours, and during their non-regional hours we will update them if there is anything substantial to share.
  • Keep in mind the information that we should not share with the customer
  • If you’d like a second pair of eyes to review messages before sending them out to customers, refer to the table below to find an appropriate DRI.
    • Approval of message content is required for security-related communications.
    • Approval is optional for all other communication.

Communication type                       | Who reviews content?   | Who approves content?
Non-security out-of-band maintenance     | SRE                    | Optional
Security-related out-of-band maintenance | SIRT                   | SIRT
Incident communication                   | SRE / Incident manager | Optional
Other urgent communication               | It depends             | Optional
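
The review table can also be read as a lookup from communication type to reviewer and approver. This is a minimal sketch whose values are copied from the table; the constant and function names are hypothetical.

```python
from typing import Dict, Tuple

# Communication type -> (who reviews content, who approves content).
REVIEW_MATRIX: Dict[str, Tuple[str, str]] = {
    "Non-security out-of-band maintenance": ("SRE", "Optional"),
    "Security-related out-of-band maintenance": ("SIRT", "SIRT"),
    "Incident communication": ("SRE / Incident manager", "Optional"),
    "Other urgent communication": ("It depends", "Optional"),
}


def approval_required(communication_type: str) -> bool:
    """Return True when the table names an approver instead of 'Optional'."""
    _reviewer, approver = REVIEW_MATRIX[communication_type]
    return approver != "Optional"
```

Only security-related communication returns `True` here, which matches the rule that approval is required for security-related communications and optional for everything else.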

Getting Paged for Concurrent Incidents

Support Engineers are not expected to manage multiple incidents at once. If a concurrent GitLab.com incident or GitLab Dedicated contact request comes in, engage the on-call Support Manager to help find cover for the new incident.

You can ping the on-call Support Manager in Slack with @support-manager-oncall.

GDCMOC Handover

Follow the End of Shift Handover Procedure from the CMOC workflows. Make the incoming GDCMOC aware of any Switchboard notifications that were sent out, and of any issues, Slack threads or tickets they should CC themselves on. Assign the Zendesk ticket used for communication to the next CMOC.