Incident Response Matrix

A guide for incidents that may occur on the Marketing site

Overview

This page describes incidents that can occur on the GitLab marketing site, explains how to assess their severity, and outlines how to get support.

First, it’s important to note that the marketing site is made up of multiple projects. While all deployments converge into the same GCP bucket, each project uses different technologies to generate its pages.

  1. The marketing site is composed of multiple repositories: the Blog, www, Navigation, Slippers, and Buyer Experience.

  2. www-gitlab-com, Buyer Experience, and the Blog generate pages during their build processes and upload the resulting artifacts to a single GCP bucket. When a pipeline runs, all artifacts are consolidated under the /public directory of that bucket.
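
As a rough illustration of that consolidation step, the sketch below (Python, using the google-cloud-storage client) uploads a locally built public/ directory into a shared bucket under a public/ prefix. The bucket name and prefix here are placeholders for illustration, not the real deployment configuration used by these projects.

    from pathlib import Path

    from google.cloud import storage  # pip install google-cloud-storage

    # Placeholder values; the real bucket and prefix live in each project's deployment job.
    BUCKET_NAME = "example-marketing-site-bucket"
    LOCAL_BUILD_DIR = Path("public")

    def sync_build_artifacts() -> None:
        """Upload every file from the local build output into the bucket's public/ prefix."""
        client = storage.Client()
        bucket = client.bucket(BUCKET_NAME)
        for local_file in LOCAL_BUILD_DIR.rglob("*"):
            if local_file.is_file():
                # Keep the relative path so artifacts from every project land side by side.
                blob = bucket.blob(f"public/{local_file.relative_to(LOCAL_BUILD_DIR)}")
                blob.upload_from_filename(str(local_file))

    if __name__ == "__main__":
        sync_build_artifacts()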

What level is this incident?

The following are the questions to consider when determining incident severity:

  1. What’s the impact level of the marketing site outage?
  2. Have you monitored the #digital-experience-team or #dex-alerts Slack channels for any ongoing incidents?
  3. How extensive is the incident? It’s crucial to look beyond the raw number of affected individuals, considering:
    • The total number of impacted users.
    • The potential impact on various categories of our key stakeholders.
    • Whether the incident affects significant customers or partners, regardless of scale.
  4. Are any affected individuals influential among our key audiences or stakeholders?
  5. Does the incident directly affect our core business operations?
  6. Have we encountered a similar incident in the past? In essence, is this a recurring issue for the company?
  7. Is the incident linked to broader industry challenges or trends? Are competitors or others facing similar issues?
  8. Are vital business pages currently accessible?

Incident Matrix

Level 1 (High risk)
  • Leaked mission-critical keys and environment variables
  • Major vendor failures related to infrastructure (GCP, Contentful)
  • Mission-critical pages are missing (ex: homepage, primary navigation)
  Response: see Reporting an incident below.

Level 2 (Medium risk)
  • Mission-critical or legal content errors (ex: incorrect pricing, drastic typos or verbiage errors on our high-converting pages)
  • Integration failures (6sense, GA, etc.)
  • Significant performance issues
  Response: create an issue and post in the #digital-experience Slack channel.

Level 3 (Low risk)
  • Section of the site is missing (ex: events, press releases)
  • Performance issues
  Response: create an issue and post in the #digital-experience Slack channel.

Reporting an incident

Point person: Nathan Dubord - Working hours: 9am - 6pm Eastern

  1. Post in the #digital-experience Slack channel and tag @digital-experience.
  2. If there is no response within five minutes, please text or phone the following people:
    1. Eastern Timezone (UTC−5):
      1. Nathan Dubord
      2. Laura Duggan
    2. Central Timezone (UTC−6):
      1. Megan Filo
    3. Pacific Timezone (UTC−8):
      1. Lauren Barker
  3. A DEX team member creates an incident issue using the root cause analysis incident issue template.

Call on the phone if there is no response within 15 minutes.

PagerDuty

In the case of an emergency, Digital Experience Engineers can create a PagerDuty incident and trigger alerts on each other’s mobile devices. These PagerDuty incidents can also be triggered by GitLab team members with access to PagerDuty (IT Security, Reliability, SIRT, etc.).

To trigger a PagerDuty incident for the Digital Experience team, follow these steps:

  1. Report an incident by typing /pd trigger anywhere in Slack.
  2. Select about.gitlab.com as the impacted service.
  3. Complete the fields: priority, description, and urgency.
  4. This will create an incident and notify PagerDuty to alert members of the Digital Experience team.
  5. PagerDuty will continuously escalate until contact with a team member has been made.
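
If Slack itself is unavailable, an equivalent alert can in principle be raised directly against PagerDuty’s Events API v2. The sketch below (Python) is illustrative only; the routing key and summary text are placeholders rather than the team’s real configuration.

    import requests  # pip install requests

    # Placeholder routing key; the real key belongs to the about.gitlab.com service in PagerDuty.
    ROUTING_KEY = "YOUR_PAGERDUTY_ROUTING_KEY"

    def trigger_incident(summary: str, severity: str = "critical") -> str:
        """Send a trigger event to PagerDuty's Events API v2 and return the dedup key."""
        response = requests.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": ROUTING_KEY,
                "event_action": "trigger",
                "payload": {
                    "summary": summary,
                    "source": "about.gitlab.com",
                    "severity": severity,
                },
            },
            timeout=10,
        )
        response.raise_for_status()
        return response.json()["dedup_key"]

    if __name__ == "__main__":
        print(trigger_incident("Marketing site outage: homepage returning 404s"))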