Incident Response Matrix

A guide for incidents that may occur on the Marketing site

Overview

This page describes incidents that can occur on the GitLab marketing site, explains how to assess their severity, and outlines how to get support.

First, note that the marketing site is composed of multiple projects. All deployments converge into the same GCP bucket, but the projects use different technologies to generate their pages.

  1. The marketing site is composed of multiple repositories: the Blog, www-gitlab-com, Navigation, Slippers, and Buyer Experience.

  2. www-gitlab-com, Buyer Experience, and the Blog generate pages during their build processes and upload the resulting artifacts to a single GCP bucket. When a pipeline runs, all artifacts are consolidated under the /public directory in that bucket (see the sketch after this list).
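
As a rough illustration of that consolidation, the following Python sketch lists objects under a few /public prefixes in the shared bucket to confirm each project's artifacts are present. The bucket name and prefixes are hypothetical placeholders, not the real deployment configuration.

```python
# Minimal sketch: confirm each project's artifacts exist under /public in the
# shared GCP bucket. The bucket name and prefixes are illustrative assumptions.
from google.cloud import storage

BUCKET = "example-marketing-site-bucket"   # hypothetical bucket name
PREFIXES = {
    "www-gitlab-com": "public/",
    "Buyer Experience": "public/pricing/",
    "Blog": "public/blog/",
}

client = storage.Client()  # uses Application Default Credentials
for project, prefix in PREFIXES.items():
    # Fetch at most one object per prefix; an empty result suggests that
    # project's artifacts are missing from the consolidated bucket.
    found = any(True for _ in client.list_blobs(BUCKET, prefix=prefix, max_results=1))
    print(f"{project}: {'artifacts present' if found else 'NO artifacts found'}")
```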

What level is this incident?

Consider the following questions when determining incident severity:

  1. What’s the impact level of the marketing site outage?
  2. Have you monitored the #digital-experience-team or #dex-alerts Slack channels for any ongoing incidents?
  3. How extensive is the incident? It’s crucial to assess beyond the raw number of affected individuals, considering:
    • The total number of impacted users.
    • The potential impact on various categories of our key stakeholders.
    • Whether the incident affects significant customers or partners, regardless of the scale.
  4. Are any affected individuals influential among our key audiences or stakeholders?
  5. Does the incident directly affect our core business operations?
  6. Have we encountered a similar incident in the past? In essence, is this a recurring issue for the company?
  7. Is the incident linked to broader industry challenges or trends? Are competitors or others facing similar issues?
  8. Are vital business pages currently accessible? (A quick way to spot-check this is sketched after this list.)
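
For the last question, a scripted spot-check can quickly confirm whether vital pages respond. The sketch below is a minimal Python example; the list of URLs is an illustrative assumption and should be adjusted to the pages your team considers business-critical.

```python
# Minimal sketch: spot-check that vital business pages respond.
# The URL list is an illustrative assumption, not an official list.
import requests

VITAL_PAGES = [
    "https://about.gitlab.com/",
    "https://about.gitlab.com/pricing/",
    "https://about.gitlab.com/blog/",
]

for url in VITAL_PAGES:
    try:
        status = requests.get(url, timeout=10).status_code
    except requests.RequestException as exc:
        status = f"request failed ({exc})"
    print(f"{url} -> {status}")
```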

Incident Matrix

Level 1 (High risk)
  • Leaked mission-critical keys and environment variables.
  • Major vendor failures related to infrastructure (GCP, Contentful).
  • Mission-critical pages are missing (ex: homepage, primary navigation).
  • Response: see Reporting an incident below.

Level 2 (Medium risk)
  • Mission-critical or legal content errors (ex: incorrect pricing, drastic typos or verbiage errors on our high-converting pages).
  • Integration failures (6sense, GA, etc.).
  • Significant performance issues.
  • Response: create an issue and post in the #digital-experience Slack channel.

Level 3 (Low risk)
  • Section of the site is missing (ex: events, press releases).
  • Performance issues.
  • Response: create an issue and post in the #digital-experience Slack channel.
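
Where automation helps with triage, the matrix can also be expressed as a small lookup. The sketch below is a minimal Python example that assumes incidents are tagged with one of a few illustrative category strings; both the categories and the default level are assumptions, not an established process.

```python
# Minimal sketch: encode the incident matrix as a triage helper.
# The category strings are illustrative assumptions.
LEVEL_1 = {"leaked credentials", "infrastructure vendor failure", "mission-critical page missing"}
LEVEL_2 = {"content error", "integration failure", "significant performance issue"}
LEVEL_3 = {"site section missing", "performance issue"}

def incident_level(category: str) -> int:
    """Return 1, 2, or 3 for a known category; unknown categories default to 2 for human review."""
    if category in LEVEL_1:
        return 1
    if category in LEVEL_2:
        return 2
    if category in LEVEL_3:
        return 3
    return 2

print(incident_level("mission-critical page missing"))  # -> 1
```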

Reporting an incident

We now use incidents instead of issues to document site outages. Incidents behave similarly to issues and allow us to use templates tailored to outage documentation. This gives us better insight into site reliability and ensures downtime events are tracked and resolved across our projects.

Point person: Nathan Dubord - Working hours: 9am - 6pm Eastern

  1. Post in the #digital-experience Slack channel and tag @digital-experience.
  2. If there is no response within five minutes, please text or phone the following people:
    1. Eastern Timezone (UTC−5):
      1. Nathan Dubord
      2. Laura Duggan
    2. Central Timezone (UTC−6):
      1. Megan Filo
    3. Pacific Timezone (UTC−8):
      1. Javier Garcia
  3. A DEX team member creates the incident in the affected project. For example, an outage in the Buyer Experience project would be created here. Note: any time we would create an issue for an outage, create an incident instead. Make sure incidents are opened in the appropriate project, as this affects our reporting and metrics. As a general rule, create an incident if we are circumventing the triage process, there is no existing open issue, and our site uptime is affected.
    1. Consider filling in the Severity and Timeline Events fields when appropriate.
    2. After the incident is resolved, you may close the incident. This is what drives our time-to-resolve outage metric (Time to Resolve = time the incident was closed - time the incident was opened). A sketch of this calculation follows below.

If there is still no response within 15 minutes, call on the phone.
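
To illustrate the Time to Resolve calculation mentioned in step 3, the sketch below pulls closed incidents from a single project through the GitLab REST API and subtracts the opened timestamp from the closed timestamp. The project ID and token are placeholders, and the per-project scope is an assumption.

```python
# Minimal sketch: compute Time to Resolve for closed incidents in one project
# via the GitLab REST API. PROJECT_ID and the token are placeholders.
from datetime import datetime
import requests

GITLAB = "https://gitlab.com/api/v4"
PROJECT_ID = "12345"                       # hypothetical project ID
HEADERS = {"PRIVATE-TOKEN": "<your-token>"}

resp = requests.get(
    f"{GITLAB}/projects/{PROJECT_ID}/issues",
    params={"issue_type": "incident", "state": "closed", "per_page": 100},
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()

for incident in resp.json():
    opened = datetime.fromisoformat(incident["created_at"].replace("Z", "+00:00"))
    closed = datetime.fromisoformat(incident["closed_at"].replace("Z", "+00:00"))
    print(f"#{incident['iid']}: time to resolve = {closed - opened}")
```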

PagerDuty

In an emergency, Digital Experience engineers can create a PagerDuty incident and trigger alerts on each other's mobile devices. These PagerDuty incidents can also be triggered by GitLab team members with access to PagerDuty (IT Security, Reliability, SIRT, etc.).

To trigger a PagerDuty incident for the Digital Experience team, follow these steps:

  1. Report an incident by typing /pd trigger anywhere in Slack.
  2. Select about.gitlab.com as the impacted service.
  3. Complete the fields: priority, description, and urgency.
  4. This will create an incident and notify PagerDuty to alert members of the Digital Experience team.
  5. PagerDuty will continuously escalate until contact with a team member has been made.
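
For cases where the /pd Slack command is unavailable, an incident can also be triggered programmatically through PagerDuty's Events API v2. The sketch below is a minimal Python example; the routing key is a placeholder for the integration key of the about.gitlab.com service, and this path is an assumption rather than the documented team process.

```python
# Minimal sketch: trigger a PagerDuty incident through the Events API v2,
# as an alternative to the /pd Slack command. The routing key is a placeholder.
import requests

EVENTS_API = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "<integration-key-for-about.gitlab.com>"  # placeholder

event = {
    "routing_key": ROUTING_KEY,
    "event_action": "trigger",
    "payload": {
        "summary": "about.gitlab.com outage: homepage returning 5xx",
        "source": "marketing-site-monitoring",
        "severity": "critical",
    },
}

resp = requests.post(EVENTS_API, json=event, timeout=10)
resp.raise_for_status()
print(resp.json())  # response includes a dedup_key for the new incident
```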