Production Engineering Ops Team

Mission

The Ops team is an infrastructure team under Infrastructure Platforms that focuses on improving processes that are vital to the successful operation of GitLab.

Vision

The Ops team's vision is to enable service owners to operate their own services using standardized processes, frameworks, architectures and tools. Some of those processes and tools will be built by the Ops team, but many will come from other Infrastructure teams.

Ownership and Responsibilities

There are three areas that are the Ops team's primary focus:

  1. Incident Management - Ops is responsible for improving the processes GitLab uses for incident management
  2. Disaster Recovery - Ops is responsible for managing our disaster recovery processes with a particular focus on reducing our Recovery Time Objective (RTO)
  3. Patching Processes - Ops is responsible for defining and maintaining the GitLab.com patching process

Getting Assistance

Should you require assistance from the Ops team, please open an issue in the Production Engineering tracker and make sure to add the label ~"team::Ops".

  • We also have team handles that ping the full team
    • GitLab: @gitlab-org/production-engineering/ops

How We Work - Prioritization

Project Management

The Ops team's top-level epic can be found here. We follow the Infrastructure Platforms Project Management practices as outlined in the Handbook.

OKRs

For Objectives and Key Results, we align with Platforms guidance for creation and structure.

Epics

In addition to the format described in the Platforms project management page, these sections may be helpful:


## Administrative

<!-- A copy paste section for creating child epics/issues, ensuring that they relate to the current epic and have the correct labels -->

```
/epic [current epic]
/labels ~"group::Production Engineering" ~"Sub-Department::SaaS Platforms" ~"team::Ops" ~"workflow-infra::Triage" ~"Production Engineering::P2"
```

## References

<!-- Links to related OKRs, Epics or issues, external resources etc -->

## Demos

| Demo Date | Demo Link | Highlights |
|-----------|-----------|------------|

## Decision log
<!-- A collapsible section to aggregate any decisions made along the way. Be sure to include "why" in addition to "what". -->

<details  >
  <summary>Log</summary>
  
<details  >
  <summary>date</summary>
  <p>[decision taken and why]</p>
</details>

</details>
  • Apply any applicable service labels.
  • Make sure to give good context for the status and progress of the project in the weekly status update. If the epic is not on-track, please provide a plan for getting back on-track when possible.

Epic Status Updates

Project status is maintained in the description of the top-level epic so that it is visible at a glance. This is auto-generated using the epic issues summary project. You can watch a short demo of this process to see how to use status labels on the epics to make use of this automation.

Issues

Open planned work for our team is located in the Production Engineering project. Issues should be updated whenever significant work occurs. New issues are expected to:

  • Link to a related Epic.

  • Include the following Labels (update the priority as needed):

    /labels ~"group::Production Engineering" ~"Sub-Department::SaaS Platforms" ~"team::Ops" ~"workflow-infra::Triage" ~"Production Engineering::P4"
    
  • If there is a service label that is applicable, also apply that.
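
The label expectations above can be checked mechanically. The following is a minimal sketch, assuming an issue's labels are available as a plain list of strings. The names `missing_labels` and `labels_quick_action` are hypothetical and not part of any GitLab tooling; the default priority of 4 mirrors the quick action above and should be adjusted as needed.

```python
# Required labels for new Ops issues, copied from the quick action above.
REQUIRED_LABELS = [
    "group::Production Engineering",
    "Sub-Department::SaaS Platforms",
    "team::Ops",
    "workflow-infra::Triage",
]
PRIORITY_PREFIX = "Production Engineering::P"


def missing_labels(issue_labels):
    """Return the expected labels that an issue lacks (hypothetical check)."""
    present = set(issue_labels)
    missing = [label for label in REQUIRED_LABELS if label not in present]
    # Any priority label (P1, P2, ...) satisfies the priority expectation.
    if not any(label.startswith(PRIORITY_PREFIX) for label in present):
        missing.append(PRIORITY_PREFIX + "<n>")
    return missing


def labels_quick_action(priority=4):
    """Build the /labels quick action line for a given priority number."""
    labels = REQUIRED_LABELS + [f"{PRIORITY_PREFIX}{priority}"]
    return "/labels " + " ".join(f'~"{label}"' for label in labels)
```

For example, `missing_labels(["team::Ops"])` reports the three other required labels plus a priority placeholder, and `labels_quick_action(2)` produces the line used in the epic template.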

Processes

Monthly Availability Updates

The Ops team is responsible for ensuring the published Monthly Availability Updates are maintained. This is currently a manual process. Items to update include:

  1. Historical Service Level Availability including maintenance windows from the month in the comments

Each of these items should be updated to reflect the most recent month. (Sample MR).

Latest results are on the GitLab.com General SLA Dashboard (internal only)

The Ops team coordinates the monthly process to identify incident and pager trends across the engineering organization. This is an async process with the following objectives:

  • Identify actions to address issues identified in the Reliability Team Monthly Availability Reports.
  • Generate action items based on the review of key metrics for incidents and pages.
  • Generate and delegate action items to the relevant teams based on the review process. This includes:

These efforts are coordinated asynchronously via the GitLab Incident and Pager Trends Monthly Review Agenda

The process is scheduled on the Ops Team Calendar to kick off on the first Tuesday of each month.
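
For anyone scripting a reminder for the kickoff, the first Tuesday of a month can be computed with the standard library. This is a minimal sketch; the function name `first_tuesday` is hypothetical and the Ops Team Calendar remains the source of truth for the actual date.

```python
import datetime


def first_tuesday(year, month):
    """Return the date of the first Tuesday in the given month."""
    first_day = datetime.date(year, month, 1)
    # date.weekday() numbers Monday as 0, so Tuesday is 1.
    days_until_tuesday = (1 - first_day.weekday()) % 7
    return first_day + datetime.timedelta(days=days_until_tuesday)
```

For example, `first_tuesday(2025, 6)` returns `datetime.date(2025, 6, 3)`.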

The DRI kicking off the process and ensuring its progress is rotated among members of the Ops Team.

All are welcome to participate in the process of identifying trends. EOCs, especially, are encouraged to participate.

  1. Add a new section to the agenda for the current month.
  2. Announce that the process is kicking off in #infrastructure-lounge and #reliability-lounge on Slack and solicit feedback.
  3. Week 1: Review the agenda and respond to any questions or comments
  4. Week 2: Reply to the announcement thread and solicit additional feedback.
  5. Week 2: Review the agenda and respond to any questions or comments
  6. Week 3: Review the Identified Trends section of the agenda and coordinate the creation of any required Corrective Actions, Infradev Issues, or Infrastructure Improvement Issues.
  7. Week 4: Reply to the announcement thread that the process is coming to a close
  8. Week 4: Add an item to the Reliability Leadership Sync Agenda and include a summary of action items created. Please include severity for each item.
  9. Week 4: Send a final reply to the announcement thread indicating that the process is closed for the month.
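
The weekly cadence above can be laid out as dated reminders for the DRI. This is a minimal sketch under two assumptions that the process itself does not state: week N is taken to begin 7 × (N − 1) days after the kickoff date, and the two steps without a week label are treated as happening at kickoff. The name `review_schedule` is hypothetical and the task text is abbreviated from the list above.

```python
import datetime

# (week number, abbreviated task) pairs taken from the checklist above.
# Steps without a week label are assumed to happen at kickoff (week 1).
STEPS = [
    (1, "Add a new section to the agenda"),
    (1, "Announce kickoff on Slack and solicit feedback"),
    (1, "Review the agenda and respond to comments"),
    (2, "Reply to the announcement thread and solicit feedback"),
    (2, "Review the agenda and respond to comments"),
    (3, "Review Identified Trends and coordinate follow-up issues"),
    (4, "Announce that the process is coming to a close"),
    (4, "Add a summary to the Reliability Leadership Sync Agenda"),
    (4, "Announce that the process is closed for the month"),
]


def review_schedule(kickoff):
    """Pair each step with the assumed start date of its week."""
    return [
        (kickoff + datetime.timedelta(weeks=week - 1), task)
        for week, task in STEPS
    ]
```

Calling `review_schedule(datetime.date(2025, 6, 3))` yields nine `(date, task)` pairs, one per step, in the same order as the checklist.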

System patching notifications

The Ops team maintains a project, patching-notifier, that automates the creation of GitLab issues when security problems are detected on our VM-based infrastructure.

Details relating to the operation of this notification system can be found in our runbooks.

Continuous Disaster Recovery Testing and Practice

The Ops team creates, manages, and coordinates regular DR Practices (or “Gamedays”) to test and measure our Disaster Recovery processes.

Our Disaster Recovery Gameday process can be found here.

Team Members

| Name | Role |
|------|------|
| Kam Kyrala | Engineering Manager, Production Engineering |
| Alex Hanselka | Senior Site Reliability Engineer |
| Anton Starovoytov | Site Reliability Engineer |
| Cameron S McFarland | Senior Site Reliability Engineer |
| Davis Bickford | Backend Engineer |
| Devin Sylva | Senior Site Reliability Engineer |
| Ermia Qasemi | Site Reliability Engineer |
| Igor Wiedler | Staff Site Reliability Engineer, Scalability |
| Joe Shaw | Senior Backend Engineer |
| Rehab Hassanein | Site Reliability Engineer |
| Silvester Wainaina | Site Reliability Engineer |
| Shreya Shah | Junior Site Reliability Engineer |
| Tomasz Maczukin | Senior Backend Engineer |
| Zoe Braddock | Site Reliability Engineer |

Roadmaps

The Production Engineering Ops team maintains roadmaps for our key focus areas:

Team Impact Overviews


Disaster Recovery Practice (DR Gamedays)
Purpose There are many reasons to test and practice our disaster recovery process for GitLab.com. …
EOC Onboarding Buddies
Introduction Engineers on-call (EOC) onboarding buddies play a crucial role in ensuring a positive …
EOC Shadow and EOC Buddy Expectations
The engineer on-call (EOC) shadowing process is designed to provide new engineers with practical, …
On-call handover
On-call handover The on-call-handovers project contains issues for each SRE’s on-call shift. …
Production Engineering Ops Team Roadmaps
Overview This section contains the roadmaps for the Production Engineering Ops team, organized by …
SRE Onboarding
Onboarding Template SRE onboarding is mostly handled by two issue templates: Machine setup Gather …
Last modified June 13, 2025: Moving Ops team pages (1cc4124d)