Disaster Recovery Working Group

The Disaster Recovery Working Group improves the disaster recovery mechanism for GitLab SaaS and Self-Hosted Products.

Attributes

| Property | Value |
|----------|-------|
| Date Restarted | August 1, 2022 |
| Date Created | November 11, 2020 |
| End Date | TBD |
| Slack | #wg_disaster-recovery (only accessible from within the company) |
| Google Doc | Working Group Agenda (only accessible from within the company) |
| Issue Board | Working Group Issue Board |
| Epic | Link |
| Overview & Status | Main Epic, Internal Handbook (more specific) |

Scope and Definitions

In the context of this working group:

  1. Recovery Point Objective (RPO): the maximum period of time over which data might be lost due to an incident.
  2. Recovery Time Objective (RTO): the maximum period of time for which a service may be unavailable due to an incident.
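
To make the two definitions concrete, here is a minimal sketch of how an incident could be checked against an RPO and an RTO. The `Incident` class, its field names, and the objective values used below are hypothetical and purely illustrative; they are not the working group's tooling or its agreed targets.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of one incident (field names are assumptions)."""
    last_recoverable_point: datetime  # newest data that survived, e.g. last good backup
    outage_start: datetime            # when the service became unavailable
    service_restored: datetime        # when the service was available again

    @property
    def data_loss_window(self) -> timedelta:
        # Span of time for which data was lost; this is what an RPO bounds.
        return self.outage_start - self.last_recoverable_point

    @property
    def downtime(self) -> timedelta:
        # Span of time the service was unavailable; this is what an RTO bounds.
        return self.service_restored - self.outage_start


def meets_objectives(incident: Incident, rpo: timedelta, rto: timedelta) -> tuple[bool, bool]:
    """Return (rpo_met, rto_met) for the given incident."""
    return incident.data_loss_window <= rpo, incident.downtime <= rto


# Illustrative values only: a 1-hour RPO and a 4-hour RTO.
example = Incident(
    last_recoverable_point=datetime(2024, 1, 1, 9, 30),
    outage_start=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 13, 15),
)
print(meets_objectives(example, rpo=timedelta(hours=1), rto=timedelta(hours=4)))  # (True, True)
```

In this example, 30 minutes of data is lost and the service is down for 3 hours 15 minutes, so both illustrative objectives are met.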

Exit criteria

The exit criteria and target goals for the working group are defined in the internal handbook.

Sequence Order Of Deliverables and Exit Criteria

Planned:

  1. Complete an assessment of a zonal outage and identify next-step iterations toward the 4-hour recovery goal (Epic: gitlab.com&1900). DRI: John Jarvis
  2. Improve node snapshot capabilities. DRI: John Jarvis
  3. Define a medium- to long-term strategy for DR capabilities for GitLab Dedicated and Cells via Geo. DRI: Sampath Ranasinghe

Completed:

Roles and Responsibilities

| Working Group Role | Person | Title |
|--------------------|--------|-------|
| Executive Stakeholder | Jörg Heilig | CTO |
| Facilitator/DRI | Andras Horvath | Engineering Manager, Gitaly |
| Product Management DRI | Mark Wood | Senior Product Manager, Gitaly |
| Member | Ethan Guo | Director, Infrastructure Technical Program Management |
| Member | Gerardo Lopez-Fernandez | Engineering Fellow, Infrastructure |
| Member | Chun Du | Director of Engineering, Enablement |
| Member | Juan Silva | Fullstack Engineering Manager, Geo |
| Member | Sampath Ranasinghe | Senior Product Manager, Geo |
| Member | John Jarvis | Staff SRE, Infrastructure |
| Member | Michele Bursi | Engineering Manager, Delivery |
| Member | Sami Hiltunen | Senior Backend Engineer, Gitaly |
| Member | Joshua Lambert | Director of Product Management, Enablement |
| Member | Ahmad Sherif | Senior SRE, Infrastructure |
| Member | Fabian Zimmer | Director of Product Management, SaaS Platforms |
| Member | Nick Westbury | Senior Software Engineer in Test, Geo |
| Member | Sean Carroll | Engineering Manager, Source Code |

Last modified January 17, 2024: Update file disaster-recovery.md (a495e71e)