Observability Team
Observability encompasses the technical elements responsible for metrics, logging, and tracing, along with the tools and processes that leverage these components.
Mission
Our mission is to deliver and maintain a world-class observability offering and frictionless operational experience for team members at GitLab.
Common Links
| Workflow | Team workflow |
| GitLab.com | @gitlab-org/production-engineering/observability |
| Issue Trackers | Observability Tracker, Tamland |
| Team Slack Channels | #g_observability (team channel), #infrastructure_platforms_social (social channel) |
| Information Slack Channels | #infrastructure-lounge (Infrastructure Group Channel), #incidents (Incident Management), #g_infra_observability_alerts (Observability Slack Alerts), #alerts-general (SLO alerting), #mech_symp_alerts (Mechanical Sympathy Alerts) |
| Documentation | Documentation Hub |
Team Members
The following people are members of the Observability team:
| Name | Role |
|---|---|
| Liam McAndrew | Engineering Manager, Observability |
| Andreas Brandl | Staff Backend Engineer, Observability |
| Bob Van Landuyt | Staff Backend Engineer, Observability |
| Hercules Lemke Merscher | Backend Engineer, Observability |
| Itay Rotman | Site Reliability Engineer |
| Calliope Gardner | Site Reliability Engineer, Observability |
| Nick Duff | Senior Site Reliability Engineer, Observability |
| Padraig O Neill | Site Reliability Engineer, Observability |
| Stephanie Jackson | Staff Site Reliability Engineer, Observability |
| Taliesin Millhouse | Site Reliability Engineer, Observability |
| Tony Ganga | Site Reliability Engineer, Observability |
The team is distributed around the world across multiple timezones.
Technical principles, goals and responsibilities
Please see the Technical Blueprint for details on our principles and goals.
The following gives an overview of our scope and ownership.
- Monitoring fundamentals
  - Metrics stack
  - Logging stack
- Error budgets
  - Ownership of concept and implementation
  - Delivery of the monthly error budget report (see the sketch after this list)
- Capacity planning
  - Triage rotation for GitLab.com
  - Operational aspects of GitLab Dedicated capacity planning
  - Developing Tamland, the forecasting tool
  - Capacity reporting for GitLab Dedicated
- Service Maturity model, which covers GitLab.com’s production services
- GitLab.com availability reporting: provide underlying data and aggregate numbers
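As a concrete illustration of the error budget reporting item above, the following minimal sketch shows how an allowed-downtime budget falls out of an SLO target. The 99.95% target and 28-day window are illustrative assumptions, not the team’s actual configuration.

```python
# Minimal sketch (illustrative only): derive the downtime allowance implied
# by an SLO target over a reporting window. The 99.95% target and 28-day
# window are assumptions, not the team's actual configuration.
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the total allowed downtime for the window at the given SLO."""
    return window * (1.0 - slo_target)


budget = error_budget(0.9995, timedelta(days=28))
print(f"Error budget: {budget.total_seconds() / 60:.1f} minutes per 28 days")
# roughly 20 minutes of allowable downtime at a 99.95% target
```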
Documentation
We recognize the need to provide technical documentation for teams using our observability services and platforms, as well as for our team’s internal use.
Historically, we’ve provided reference documentation within the projects we own or contribute to, or in the runbooks project. Because these projects are scattered across different locations, it is difficult for users to discover the relevant documentation.
As we reshape our documentation, we follow these principles:
- The Infrastructure Observability Documentation Hub is the entrypoint for any observability related documentation we provide.
- Carefully crafted documentation is a core product for the observability platform, not an afterthought.
- We think of the documentation hub as a way to communicate about observability interfaces we offer as a platform with everyone in Engineering.
- We strive to provide documentation like guides and tutorials targeted for specific audiences - in addition to reference documentation.
- We use the documentation hub also to explain concepts and architecture for our team’s internal use to create a common understanding and empower everyone to contribute in all areas.
Where do we keep different types of documentation?
There are different types of documentation, which belong in different places.
| What? | Where? |
|---|---|
| Team organisation, processes and other team level agreements | GitLab Handbook (this page) |
| Technical reference documentation for standalone projects | On the project itself, and linked to the Documentation Hub |
| How do we at GitLab make use of the projects we maintain? | Documentation Hub |
| What does our GitLab-specific architecture look like? | Documentation Hub |
| Tutorials, guides, FAQs and conceptual explanations | Documentation Hub |
Documentation outside the Documentation Hub should be linked from it (that’s why we call it a hub) and vice versa, to help increase discoverability.
This recognizes the need to ship reference documentation with the respective project, as we would expect to see for any project (whether open source or not). The benefit here is that a change in functionality can also update reference documentation in the same merge request.
On the other hand, how we make particular use of these projects in our stack is too specific to ship with the project itself. Often, we want to understand the bigger picture and how projects play together. This is out of scope for technical documentation that ships with a certain project itself and hence we put this information on the documentation hub instead.
For our internal use, we use the documentation hub to help us reason about the services we own and how we operate them. We expect this helps everyone on the team and helps us gather a common understanding as we have different roles and perspectives on the team.
A recommended read on different types of documentation and how to organize it is the Divio Documentation System.
How do we create documentation?
As we reshape and build documentation, the documentation hub benefits from any and all contributions:
- Explain existing concepts
- Link together existing documentation
- Consolidate existing documentation and move it to the right place
- Write documentation and create graphics on system architecture and operational principles
We aspire to treat documentation as a first-class citizen, in line with the Handbook First mindset. For example, instead of answering specific questions from team members individually (for example, on Slack), we can take the opportunity to write a piece of documentation and ask them to review and work with that.
Indicators
The group owns several performance indicators that roll up to the Infrastructure department indicators:
- Service Maturity model which covers GitLab.com’s production services.
- Capacity Planning uses capacity warnings to prevent incidents.
These are combined to enable us to better prioritize team projects.
An overly simplified example of how these indicators might be used, in no particular order:
- Service Maturity - indicates how trustworthy the data we receive from the observability stack is for a given service; the lower the maturity level, the more focus we need to put into improving that service’s observability
- Capacity Planning - provides a saturation forecast for a specific service (see the sketch below)
Between these different signals, we have a relatively (im)precise view into the past, present and future to help us prioritise scaling needs for GitLab.com.
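To illustrate the capacity planning signal referenced above, here is a minimal sketch of a saturation forecast: fit a trend to recent saturation samples and estimate when a threshold would be crossed. The synthetic data and the 90% threshold are assumptions for illustration; Tamland’s real forecasting models are more sophisticated than this.

```python
# Minimal sketch (illustrative only) of a saturation forecast: fit a linear
# trend to recent saturation samples and estimate when a soft limit would be
# crossed. The synthetic data and 90% threshold are assumptions; Tamland's
# real forecasting models are more sophisticated than this.
import numpy as np

days = np.arange(14)  # observation window: the last 14 days
saturation = 0.60 + 0.01 * days + np.random.normal(0, 0.005, size=days.size)

slope, intercept = np.polyfit(days, saturation, 1)  # simple linear trend
threshold = 0.90  # soft saturation limit used for warnings

if slope > 0:
    current = intercept + slope * days[-1]
    days_until_threshold = (threshold - current) / slope
    print(f"Projected to cross {threshold:.0%} saturation in ~{days_until_threshold:.0f} days")
else:
    print("No upward trend detected; no saturation warning")
```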
Provisioned Services
The team is responsible for provisioning access to the services listed below, as per the tech_stack.yml file.
- Kibana is accessed through Okta. Team members need to be in either of the following Okta groups: gl-engineering (the entire Engineering department) or okta-kibana-users. The latter group is used to manage access for team members outside of Engineering on an ad-hoc basis (context). Team members should be (de)provisioned through an Access Request (example). If the access request is approved, the provisioner should add the user to this group, which then automatically syncs to its namesake group in Okta.
- Elastic Cloud provides administrative access to our Elastic stack. The login screen is available here and access is through Google SSO. Team members should be (de)provisioned through an Access Request (example). If approved, the provisioner can add or remove members on the membership page, with appropriate permissions for the instances they require access to.
- Grafana is accessed through Okta. The login screen is available here. Any GitLab team member can access Grafana. Provisioning and deprovisioning are handled through Okta.
How we work
We default to working in line with the GitLab values and following the processes of the wider Infrastructure Platforms section. In addition, listed below are some processes that are specific, or particularly important, to how we work in Observability.
Roadmap
We transparently prioritize our Roadmap in this epic, which includes a view of how all Epics/Issues are grouped using roadmap:: labels.
Project Management
Most projects that we work on are represented by Epics that contain multiple Issues. Simple tasks can be worked on in Issues without a parent epic.
Project Sizing
Projects will be staffed by at least three engineers from the team, and preferably from both SRE and BE roles. This will allow us to share knowledge and keep projects moving forward when people are unavailable.
Labels
All Epics/Issues pertaining to our team have the ~"group::Observability" label.
We use the following roadmap:: labels to represent the current priorities of our team:
- roadmap::now - items that are currently being worked on
- roadmap::next - items that are candidates to be picked up next
- roadmap::later - items that aren’t directly in our plans to be worked on
- roadmap::perpetual - items that are always relevant, such as version upgrades
We use Category: labels to represent the functional area of a work item. Examples include Category:Logging and Category:Capacity Planning.
We use theme:: labels to group the type of work and its benefit. Labels include:
- theme::Operational Excellence - Operations processes that keep a system running in production
- theme::Performance Enablement - The ability of a system to adapt to changes in load
- theme::Reliability - The ability of a system to recover from failures and continue to function
- theme::Security - Protecting applications and data from threats
- theme::Cost Optimization - Managing costs to maximize the value delivered
- theme::Shift Left - Items that encourage other stage groups to self-serve
- theme::Enablement & Assistance - Items that require directly helping other Stage Groups. We try to minimize this work by Shifting Left.
The first five theme labels correspond to the primary pillars of Well-Architected Frameworks. You can read more about this topic here.
We also use the workflow-infra:: labels to show where in the work cycle a particular Epic or Issue is.
In addition, all issues require either a Service label or the team-tasks, discussion, or capacity planning labels.
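Because labels drive how we track work, it can help to query them programmatically. The following is a hypothetical sketch that uses the GitLab REST API to list open issues carrying both the team label and the current-roadmap label; the group path and token handling are assumptions to adjust to your environment, not team tooling.

```python
# Hypothetical sketch: list open issues that carry both the team label and
# the roadmap::now label via the GitLab REST API. The group path below and
# the GITLAB_TOKEN environment variable are assumptions, not team tooling.
import os

import requests

GITLAB_API = "https://gitlab.com/api/v4"
GROUP = "gitlab-org"  # assumed group path; adjust as needed

response = requests.get(
    f"{GITLAB_API}/groups/{GROUP}/issues",
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    params={
        # `labels` is AND-matched: issues must carry every label listed
        "labels": "group::Observability,roadmap::now",
        "state": "opened",
        "scope": "all",
        "per_page": 100,
    },
    timeout=30,
)
response.raise_for_status()

for issue in response.json():
    print(f"{issue['references']['full']}  {issue['title']}")
```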
Boards
Assignees
We use issue assignments to signal who the DRI for an issue is. We expect assignees to keep their issues updated with the current status, to be as explicit as possible about what has been done and what still needs to be done, and to drive the issue to completion. Being assigned typically means the team member is actively working on the issue or planning to come back to it relatively soon. We unassign ourselves from issues we are not actively working on or planning to revisit within a few days.
Group Reviews
The Production Engineering group review happens every Thursday. We ensure that all work items with a roadmap::now label have a status update each Wednesday. We use this epic to share updates, which are generated by this script.
To help improve visibility of all work in progress, it is expected that each team member is assigned to at least one project that is visible in the roadmap::now section of the Group Review epic - either as a DRI or participant.
Retrospectives
A team-level retrospective issue is created every 6 weeks, allowing the team to regularly reflect and to encourage a culture of continuous improvement. The creation of the retrospective issue is the responsibility of the Engineering Manager. You can find retrospectives here.
Updates in Slack
We are using GeekBot for weekly updates, which go to the #g_observability channel.
When posting updates, consider providing enough context (e.g. through links) so that interested team members are able to dive in on their own (low context).
Team Happiness
Each Thursday a reminder is sent in our Slack channel asking team members to score their happiness for the week using this form. Team members can view the results in this spreadsheet (note there is also a visualization in the ‘Graph’ sheet).
Cost Management
For details on the daily operational costs of our observability services refer to the Cost of Observability Stack documentation. This resource includes access instructions and cost breakdowns.
History and Accomplishments
This section contains notes on team history and our accomplishments, but is not meant to be exhaustive.
- 2024-02, Capacity planning: Proactive investigation of postgres CPU spike seen in saturation forecast uncovered a database design issue
- 2024-03, Capacity planning: Tamland predicted redis CPU saturation which led to Practices proactively scaling Redis (slides)
- 2024-05, Metrics: The migration from Thanos to Mimir was completed which brought significant improvements in metrics accuracy and dashboard performance.
Year-in-Review Issues