Scalability:Observability Team
Observability encompasses the technical elements responsible for metrics, logging, and tracing, along with the tools and processes that leverage these components.
Mission
Our mission is to deliver and maintain a world-class observability offering and frictionless operational experience for team members at GitLab.
Common Links
Team Members
The following people are members of the Scalability:Observability team:
The team is located all over the world in different timezones.
Technical principles, goals and responsibilities
Please see the Technical Blueprint for details on our principles and goals.
The following gives an overview of our scope and ownership.
- Monitoring fundamentals
- Metrics stack
- Logging stack
- Error budgets
- Ownership of concept and implementation
- Delivery of monthly error budget report
- Capacity planning
- Triage rotation for GitLab.com
- Operational aspects for GitLab Dedicated capacity planning
- Developing Tamland, the forecasting tool
- Capacity reporting for GitLab Dedicated
- Service Maturity model which covers GitLab.com’s production services.
- GitLab.com availability reporting: Provide underlying data and aggregate numbers
Indicators
The group owns several performance indicators that roll up to the Infrastructure department indicators:
- Service Maturity model which covers GitLab.com’s production services.
- The forecasting project named Tamland which generates capacity warnings to prevent incidents.
These are combined to enable us to better prioritize team projects.
An overly simplified example of how these indicators might be used, in no particular order:
- Service Maturity - indicates how trustworthy the data we receive from the observability stack is for a given service; the lower the maturity level, the more we need to focus on improving that service's observability
- Tamland reports - Provides a forecast for a specific service
Between these different signals, we have a relatively (im)precise view into the past, present and future to help us prioritize scaling needs for GitLab.com.
Provisioned Services
The team is responsible for provisioning access to the services listed below, as per the tech_stack.yml file.
- Kibana is accessed through Okta. Team members need to be in either of the following Okta groups: gl-engineering (the entire Engineering department) or okta-kibana-users. The latter group is used to manage access for team members outside of Engineering on an ad-hoc basis (context). Team members should be (de)provisioned through an Access Request (example). If the access request is approved, the provisioner should add the user to this group, which will then automatically sync to its namesake group in Okta.
- Elastic Cloud is for administrative access to our Elastic stack. The login screen is available here and access is through Google SSO. Team members should be (de)provisioned through an Access Request (example). If approved, the provisioner can add/remove members on the membership page with appropriate permissions.
How we work
We default to working in line with the GitLab values and by following the processes of the wider SaaS Platforms section and Scalability group. In addition, listed below are some processes that are specific, or particularly important, to how we work in Scalability:Observability.
Project Sizing
Projects are staffed by at least three engineers from the team, preferably from both SRE and Backend roles. This allows us to share knowledge and keep projects moving forward when people are unavailable.
Issue management
While we mainly operate from the scalability issue tracker, there are other projects under the gl-infra group that team members work on. Hence we strive to use group-level labels and boards to get the entire picture.
Labels
All issues pertaining to our team have the ~"team::Scalability-Observability" label.
All issues that are within scope of current work have a ~board::build or ~board::planning label.
This is a measure to cut through noise on the tracker and allows us to get a view of what’s currently important to us.
See Boards below on how this is being used.
All issues require either a Service label or the team-tasks, discussion, or capacity planning labels.
Assignees
We use issue assignments to signal who is the DRI for the issue.
We expect issue assignees to regularly update their issues with the status, and to be as explicit as possible at what has been done and what still needs to be done.
We expect the assignee of an issue to drive the issue to completion.
The assignee status typically expresses that the assigned team member is actively working on the issue or plans to return to it relatively soon.
We unassign ourselves from issues we are not actively working on or planning to revisit within a few days.
Boards
The Scalability:Observability team uses issue boards to track the progress of planned and ongoing work.
Refer to the Scalability group issue boards section for more details.
| Planning | Building |
| --- | --- |
| Planning Board | Build Board |
| Issues where we are investigating the work to be done. | Issues that will be built next, or are actively in development. |
Retrospectives
A team-level retrospective issue is created every 6 weeks, allowing the team to regularly reflect and to encourage a culture of continuous improvement. The creation of the retrospective issue is the responsibility of the Engineering Manager. You can find retrospectives here.
Updates in Slack
In order to stay informed with everyone’s immediate topics, we post regular status updates in our Slack channel.
These updates cover whatever the team member is currently working on and dealing with; consider including your current focus area, general work items, blockers, in-flight changes, learnings, side tracks, upcoming time off, and other relevant information.
There is no strict frequency for posting updates, although we strive to make updates at least once per week.
When posting updates, consider providing enough context (e.g. through links) so that interested team members are able to dive in on their own (low context).
Cost Management
For details on the daily operational costs of our observability services refer to the Cost of Observability Stack documentation. This resource includes access instructions and cost breakdowns.
History and Accomplishments
This section contains notes on team history and our accomplishments, but is not meant to be exhaustive.
Cost of Observability Stack
Elastic Cloud Costs (Snowflake)
Where to Find the Views
You can find the view here.
Access Requirements
To view the dashboard, you need access to Snowflake as a SNOWFLAKE_ANALYST. Follow these steps to gain access:
- Create an Access Request (AR): Use the provided template to submit your AR.
- Create a Merge Request (MR): Once your AR is approved, create an MR to grant yourself permissions. Use the template and instructions provided.
After gaining access, you will be able to view and execute the dashboard.
Capacity Planning
We maintain and improve the Capacity Planning process that is described in the Infrastructure Handbook. This is a controlled activity covered by SOC 2. Please see this issue for further details.
The goal of this process is to predict and prevent saturation incidents on GitLab.com.
Issues are kept in the capacity planning issue tracker. Where an issue is needed to improve metrics to support this process, we raise an issue in the Scalability group tracker with the label of Saturation Metrics.
Error Budgets
We maintain the Error Budgets process that is described in the Engineering Handbook.
Issues are kept in the Scalability group tracker with the label of Category::Error Budgets.
We maintain the metrics used to generate the Error Budgets and we ensure that the reports are published on time. The reports are available as issues in the Error Budget Reports issue tracker. They are automatically generated by the code in the same project but require manual editing before sending them out.
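As background, the arithmetic behind an error budget can be sketched as follows. This is a minimal illustration only, not the code that generates the reports, and the 99.95% target and request counts are assumed example values, not GitLab's actual SLO targets or traffic:

```python
# Minimal error-budget arithmetic (illustrative only; the SLO target and
# request counts below are assumed example values).

def error_budget(slo_target: float, total_requests: int, failed_requests: int):
    """Return (budget, spent, remaining) as request counts."""
    budget = (1 - slo_target) * total_requests  # failures the SLO allows
    spent = failed_requests                     # failures actually observed
    remaining = budget - spent                  # negative means budget blown
    return budget, spent, remaining

budget, spent, remaining = error_budget(0.9995, 1_000_000, 300)
print(f"budget={budget:.0f} spent={spent} remaining={remaining:.0f}")
# prints: budget=500 spent=300 remaining=200
```

A 99.95% target over one million requests allows 500 failures; having observed 300, this example service has 200 failures of budget left for the reporting window.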
Tamland: Saturation forecasting
Tamland is a saturation forecasting tool we use for capacity planning purposes. The Scalability:Observability team develops and releases Tamland.
This page details how we organize development efforts around Tamland.
GitLab Projects and Capacity Planning Trackers
The main development project for Tamland is gitlab.com/gitlab-com/gl-infra/tamland.
Feature requests, improvements and bug reports can be filed in Tamland’s issue tracker.
Aside from development, we operate Tamland's forecasting and reporting components through a number of operational projects, each geared toward its respective environment's needs:
The purpose of this document is to detail the goals and guidelines for the Scalability:Observability team. The focus is on our principles and our goals for the next five years.
Team Principles
We are better when we work together.
Although our main mandate is to support GitLab’s SaaS Platforms, we actively share knowledge with other teams at GitLab and other customers and build tools and frameworks that work for everyone.
Last modified October 29, 2024: Fix broken links (455376ee)