The Ops team is an infrastructure team under SaaS Platforms that focuses on improving processes that are vital to the successful operation of GitLab.
Vision
The Ops team's vision is to enable service owners to operate their own services using standardized processes, frameworks, architectures, and tools. Some of those processes and tools will be built by the Ops team, but many will come from other Infrastructure teams.
Ownership and Responsibilities
There are three areas that are the Ops team's primary focus:
Incident Management - Ops is responsible for improving the processes GitLab uses for incident management
Disaster Recovery - Ops is responsible for managing our disaster recovery processes, with a particular focus on reducing our Recovery Time Objective (RTO)
Patching Processes - Ops is responsible for defining and maintaining the GitLab.com patching process
Getting Assistance
Should you require assistance from the Ops team, please open an issue in the Production Engineering tracker and make sure to add the label ~"team::Ops".
We also have team handles that ping the full team:
GitLab: @gitlab-org/production-engineering/ops
How We Work - Prioritization
Project Management
The Ops team's top-level epic can be found here.
We follow the Infrastructure SaaS Platforms Project Management practices as outlined in the Handbook.
OKRs
For Objectives and Key Results, we align with Platforms guidance for creation and structure.
## Administrative
<!-- A copy paste section for creating child epics/issues, ensuring that they relate to the current epic and have the correct labels -->

```
/epic [current epic]
/labels ~"group::Production Engineering" ~"Sub-Department::SaaS Platforms" ~"team::Ops" ~"workflow-infra::Triage" ~"Production Engineering::P2"
```
## References
<!-- Links to related OKRs, Epics or issues, external resources etc -->

## Demos
| Demo Date | Demo Link | Highlights |
|-----------|-----------|------------|
## Decision log
<!-- A collapsible section to aggregate any decisions made along the way. Be sure to include "why" in addition to "what". -->

<details><summary>Log</summary><details><summary>date</summary><p>[decision taken and why]</p></details></details>
Apply any applicable service labels.
Make sure to give good context for the status and progress of the project in the weekly status update. If the epic is not on-track, please provide a plan for getting back on-track when possible.
Epic Status Updates
Project status is maintained in the description of the top-level epic so that it is visible at a glance. This is auto-generated using the epic issues summary project. You can watch a short demo of this process to see how to use status labels on the epics to make use of this automation.
Issues
Open planned work for our team is located in the Production Engineering project. Issues should be updated whenever significant work occurs. New issues are expected to:
Link to a related Epic.
Include the following Labels (update the priority as needed):
If there is a service label that is applicable, also apply that.
Processes
Monthly Availability Updates
The Ops Team is responsible for ensuring the published Monthly Availability Updates are maintained. This is currently a manual process. Items to update include:
Updates must be merged by the 7th day of each month. This is currently a scheduled event on the Reliability Ops Team’s Calendar. Contact any member of the team for more details on this process.
Monthly Review of Incident and Pager Trends
The Ops team coordinates the monthly process to identify incident and pager trends across the engineering organization. This is an async process with the following objectives:
Identify actions to address issues identified in the Reliability Team Monthly Availability Reports.
Generate action items based on the review of key metrics for incidents and pages.
Generate and delegate action items to the relevant teams based on the review process. This includes:
The process is scheduled on the Ops Team Calendar to kick off on the first Tuesday of each month.
The DRI kicking off the process and ensuring its progress is rotated among members of the Ops Team.
All are welcome to participate in the process of identifying trends. EOCs, especially, are encouraged to participate.
Monthly Review of Incident and Pager Trends: How-to guide for DRIs
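The scheduling rules above (kick-off on the first Tuesday of each month, with the DRI rotated among team members) can be sketched in a few lines of Python. This is only an illustration: the round-robin policy in `dri_for_month` and the member handles are assumptions, since the page does not say how the rotation is actually ordered.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() counts Monday as 0, so Tuesday is 1


def first_tuesday(year: int, month: int) -> date:
    """Return the first Tuesday of the given month."""
    first_day = date(year, month, 1)
    # Days to move forward from the 1st to reach a Tuesday (0 if the 1st is one).
    offset = (TUESDAY - first_day.weekday()) % 7
    return first_day + timedelta(days=offset)


def dri_for_month(members: list[str], year: int, month: int) -> str:
    """Pick a DRI by round-robin over a running month counter (assumed policy)."""
    if not members:
        raise ValueError("members must not be empty")
    # year * 12 + (month - 1) increases by one each month, including across years.
    return members[(year * 12 + (month - 1)) % len(members)]


if __name__ == "__main__":
    team = ["member-a", "member-b", "member-c"]  # hypothetical handles
    print(first_tuesday(2024, 10), dri_for_month(team, 2024, 10))
```

Using a running month counter rather than the month number alone keeps the rotation continuous from December into January.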
Add a new section to the agenda for the current month.
Announce that the process is kicking off in #infrastructure-lounge and #reliability-lounge on Slack and solicit feedback.
Week 1: Review the agenda and respond to any questions or comments.
Week 2: Reply to the announcement thread and solicit additional feedback.
Week 2: Review the agenda and respond to any questions or comments.
Week 3: Review the Identified Trends section of the agenda and coordinate the creation of any required Corrective Actions, Infradev Issues, or Infrastructure Improvement Issues.
Week 4: Reply to the announcement thread that the process is coming to a close.
Week 4: Add an item to the Reliability Leadership Sync Agenda and include a summary of action items created. Please include severity for each item.
Week 4: Send a final reply to the announcement thread indicating that the process is closed for the month.
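A DRI who wants the steps above as a dated checklist could generate one from the kick-off date. The sketch below is a minimal example under two assumptions that the page does not state: the two unlabelled steps (adding the agenda section and announcing the kick-off) are treated as week 1, and week N is taken to start 7 × (N − 1) days after kick-off. The function name `dated_checklist` is hypothetical.

```python
from datetime import date, timedelta

# (week number, task) pairs taken from the how-to steps.
# Assumption: the first two steps carry no week label and are treated as week 1.
STEPS = [
    (1, "Add a new section to the agenda for the current month"),
    (1, "Announce the kick-off in #infrastructure-lounge and #reliability-lounge"),
    (1, "Review the agenda and respond to any questions or comments"),
    (2, "Reply to the announcement thread and solicit additional feedback"),
    (2, "Review the agenda and respond to any questions or comments"),
    (3, "Review Identified Trends and coordinate creation of required issues"),
    (4, "Reply to the announcement thread that the process is coming to a close"),
    (4, "Add a summary of action items to the Reliability Leadership Sync Agenda"),
    (4, "Send a final reply indicating the process is closed for the month"),
]


def dated_checklist(kickoff: date) -> list[tuple[date, str]]:
    """Attach a week-start date to each step, counting week 1 from the kick-off."""
    return [(kickoff + timedelta(weeks=week - 1), task) for week, task in STEPS]


if __name__ == "__main__":
    for start, task in dated_checklist(date(2024, 10, 1)):
        print(f"{start.isoformat()}  {task}")
```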
Continuous Disaster Recovery Testing and Practice
The Ops team creates, manages, and coordinates regular DR Practices (or “Gamedays”) to test and measure our Disaster Recovery processes.
There are many reasons to test and practice our disaster recovery process for GitLab.com:
Ensure our processes work as expected.
Keep up with changes that could break or complicate our recovery processes.
Increase confidence and knowledge of these processes for those who participate in the on-call rotation.
Satisfy compliance requirements around validation of disaster recovery scenarios.
Overview
These practices are often referred to as DR gamedays or just gamedays.
Currently, our DR gamedays focus on zonal recovery scenarios, and each one focuses on a specific component.
During a real zonal failure, these gamedays should be capable of being executed in parallel to save time.
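To illustrate why running the per-component procedures in parallel saves time, the sketch below compares total recovery time when procedures run one after another against when they all run at once. The component names and durations are hypothetical placeholders, not measurements from any gameday, and the parallel figure assumes the procedures are independent and that enough engineers are available to run them simultaneously.

```python
def sequential_recovery_minutes(durations: dict[str, int]) -> int:
    """Total time if each component's recovery runs one after another."""
    return sum(durations.values())


def parallel_recovery_minutes(durations: dict[str, int]) -> int:
    """Total time if all recoveries start together: bounded by the slowest one."""
    return max(durations.values())


if __name__ == "__main__":
    # Hypothetical durations in minutes, for illustration only.
    durations = {"component-a": 90, "component-b": 60, "component-c": 30}
    print("sequential:", sequential_recovery_minutes(durations))
    print("parallel:  ", parallel_recovery_minutes(durations))
```

Under these assumptions the parallel total is set by the slowest component, which is why shortening the longest single procedure matters most for reducing RTO.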
Engineers on-call (EOC) onboarding buddies play a crucial role in ensuring a positive and effective onboarding experience for new engineers joining the on-call rotation. Being on call can be very stressful, particularly for engineers who are new to the role or unfamiliar with our systems and processes.
That’s why it’s essential that all new engineers joining the on-call rotation be assigned a buddy who is ready and willing to assist with the onboarding process. These buddies are experienced on-call engineers who can provide guidance and support and share their invaluable knowledge.
The engineer on-call (EOC) shadowing process is designed to provide new engineers with practical, hands-on experience in managing live incidents, responding to alerts, and ensuring system stability. Shadowing allows new team members (Shadows) to observe and gradually take on the responsibilities of an EOC under the guidance of an experienced engineer (EOC Buddy).
This document outlines the key expectations for both the EOC Shadow and EOC Buddy, ensuring a structured approach to learning and support throughout the shadowing process. By clearly defining the roles and responsibilities, we aim to ensure the Shadow gains the necessary skills, knowledge, and confidence to handle real-time incidents effectively, and that the EOC Buddy can provide the appropriate guidance and mentorship.
The on-call-handovers project contains issues for each SRE's on-call shift. The outgoing EOC records the activities of their shift using the handover issue template to indicate a handoff and assigns it to the incoming EOC. The /sre-oncall [handover] Slack command can also be used in the #production channel to simplify this process. It automatically creates a new issue and pre-populates some information, such as outgoing/incoming EOC handles, open/closed incidents, and resolved alerts.
These are assigned to the SRE when they start. They guide the SRE through different areas of the system, starting with some simple tasks, and help both the SRE and the SRE manager work through various access issues.
There is a third issue template for on-call onboarding, which should be completed after the first two and will probably take at least 3 months from the start date to complete.