Production Engineering Networking and Incident Management Team

We manage both the networking platform that controls traffic into our systems, and GitLab’s incident response process

Mission

We provide protection for GitLab from two vectors:

We provide a networking platform to provide teams with the first line of defense in how traffic is allowed into our systems
We manage the response system through our incident management process and tooling for when GitLab needs to respond to any incidents.

We seek to build and evolve the networking infrastructure that powers GitLab with focus on developing innovative networking solutions that scale with GitLab’s growth. We empower teams at GitLab to feel confident in responding to incidents involving their services.

Vision

Excellence in networking infrastructure We will drive GitLab’s networking capabilities for GitLab forward by building scalable, secure, and efficient solutions. This includes evolving our edge services, load balancing, rate limiting, and network security to meet the growing demands of all GitLab platforms. Through centralized networking tooling and infrastructure, we create a foundation that supports GitLab’s continued growth and innovation.
Service ownership and standardized incident response We will enable teams to operate their own services confidently by providing teams with the frameworks and tooling to confidently respond to any problems as they arise.

Ownership and Responsibilities

The Networking and Incident Management team focuses on:

Incident Management - we are responsible for improving the processes GitLab uses for incident management.
Disaster Recovery - we are responsible for managing our disaster recovery processes with a particular focus on reducing our Recovery time objective (RTO).
Networking infrastructure - we actively work to improve and expand our services that manage traffic from the edge of our network to the application layer.

Getting Assistance

Slack: #g_networking_and_incident_management
Request for help project: New issue

Team Members

Name	Role
Steve Abrams	Engineering Manager, Infrastructure Platforms - Networking and Incident Management
Alex Hanselka	Senior Site Reliability Engineer
Devin Sylva	Senior Site Reliability Engineer
Donna Alexandra	Senior Site Reliability Engineer
Sarah Walker	Site Reliability Engineer
Shreya Shah	Junior Site Reliability Engineer

How We Work

We follow Infrastructure Platforms Project Management practices.

As a new team within Production Engineering, we are currently establishing our workflows and processes. We’ll continue to update this page as our team evolves.

Labels

For incoming requests use ~"NIM::Requests". These are requests coming from outside the team.
For keeping the lights on (KTLO) issues use ~"NIM::KTLO". These are issues related to maintaining our areas of ownership that may not be a full project.
For project work use ~"NIM::Project Work". This should be applied to any issues that are part of epics being surfaced in the top level epic.
For issues related to team processes (retros, planning, NIM team process changes) use ~"NIM::Meta".
For access requests:
- ~"NIM::Todo" - This is applied automatically on all of the baseline entitlement templates for Cloudflare access.
- ~"NIM::Doing" - [Optional] Use this if it is going to take some time to action and you want to signal to others it’s already been picked up.
- ~"NIM::Done" - Once an access request is actioned, change it to this label.

Many issues templates already apply these labels.

Recurring Task Delegation

The team manages several recurring tasks that require regular attention with a small time commitment (typically around an hour per week). Individual team members own these tasks and are responsible for finding coverage when they are unavailable.

Current Recurring Tasks

Actioning Cloudflare access requests - weekly - Requests ready to action can be found with this link. DRI: Shreya
Reminding folks to do overdue post-incident tasks in incident.io - weekly - Following up on incomplete post-incident action items. DRI: Alex
Reliability reports - monthly - Published in https://gitlab.com/gitlab-com/gl-infra/reliability-reports/-/issues. DRI: Devin / Sarah
Dealing with incident followup issues - weekly - Managing and tracking resolution of issues identified during incidents. DRI: Steve
EOC coordinator - ongoing - Rotation lead for the Tier 1 SRE oncall rotation. DRI: Sarah
IM coordinator - ongoing - Rotation lead for the Incident Manager oncall rotation. DRI: Devin
Issue triage - weekly - Currently owned by the Engineering Manager, involves triaging, delegating, and scheduling incoming issues. DRI: Steve

Task Ownership Expectations

Each task has a DRI who is accountable for its completion on a regular cadence.
Task owners must arrange coverage when they will be out of office.
Coverage arrangements should be communicated in a PTO coverage issue.

View page source - Edit this page - please contribute.

Production Engineering Networking and Incident Management Team

Mission

Vision

Ownership and Responsibilities

Getting Assistance

Team Members

How We Work

Labels

Recurring Task Delegation

Current Recurring Tasks

Task Ownership Expectations

Common Links

Disaster Recovery Practice (DR Gamedays)

EOC Onboarding Buddies

EOC Shadow and EOC Buddy Expectations

On-call handover

Production Engineering Networking and Incident Management Team AI prompts

SRE Onboarding