Scalability Group
Common Links
Project Management Links
Group Level
- Scalability Epic Board
- Scalability Issue Board
- Scalability Issues not in an Epic
- Scalability Issues by Team
- Scalability Issues by Team Member
Teams
The Scalability group currently comprises two teams:
Scalability:Observability
The Observability team focuses on observability, forecasting, and projection systems that enable development engineering teams to predict
system growth for their areas of responsibility.
The following people are members of the Scalability:Observability team:
Scalability:Practices
The Practices team focuses on tools and frameworks that enable the stage groups to support their features on our production systems.
The following people are members of the Scalability:Practices team:
Mission
The Scalability group is responsible for GitLab at scale, working on the highest priority scaling items related to our SaaS platforms.
We support other Engineering teams by sharing data and techniques so they can become better at scalability as well.
Vision
As its name implies, the Scalability group enhances the availability, reliability, and performance of GitLab’s SaaS platforms
by observing the application’s capabilities to operate at scale.
The Scalability group analyzes application performance on GitLab’s SaaS platforms,
recognizes bottlenecks in service availability, proposes and implements short-term improvements,
and develops long-term plans that help drive the decisions of other Engineering teams.
Short term goals for the group include:
- Refine existing, define new, and document Service Level Objectives
for each of GitLab’s services.
- Continuously expose the top 3 critical bottlenecks that threaten the stability of our SaaS platforms.
- Work on scoping, planning and defining the implementation steps of the top critical bottleneck.
- Define and track team KPIs to measure impact on our SaaS platforms.
- Work on implementing user-facing application features (such as API improvements) as a means to reduce pressure on our SaaS platforms generated by regular user interactions.
Direction for FY24
We’ve moved the group’s direction to the direction section so that it’s in the same place as the rest of our product direction.
Indicators
The Infrastructure Department is concerned with the availability
and performance of GitLab’s SaaS platforms.
GitLab.com’s service level availability is visible on the SLA Dashboard,
and we use the General GitLab Dashboard
in Grafana to observe the service level indicators (SLIs) of apdex, error ratios, requests per second, and saturation of the services.
These dashboards show lagging indicators for how the services have responded to the demand generated by the application.
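For reference, the apdex SLI follows the standard Apdex approach of classifying requests against latency thresholds; as a generic illustration (the exact thresholds vary per service and are defined in each service’s monitoring configuration, so treat this as the textbook definition rather than the precise SLI): Apdex = (satisfied requests + tolerating requests / 2) / total requests.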
Each team is responsible for separate indicators. For more information, please view the team pages linked above.
Themes
The broad nature of work undertaken by the Scalability group can make prioritization challenging as it’s tricky to compare some issues like-for-like. For example, how do we compare the benefit of an issue to address a performance concern against an issue that reduces developer toil? To help guide the direction of the group and to inform our prioritization process, we can categorize issues into the following themes, in order of priority:
- Critical Saturation Response. On occasion, saturation alerts can occur unexpectedly - for example, when caused by a sudden change in platform usage patterns - and need to be addressed with urgency. We try to avoid working reactively by proactively working on the other themes.
- Horizontal Scalability. The most obvious scaling bottlenecks in our infrastructure are those that can only be scaled vertically instead of horizontally. Horizontal scaling brings the benefit of elasticity, which increases confidence that we can meet future demand while keeping costs linear - both of these elements are strongly aligned with the vision of the Scalability group.
- Increasing Platform Capacity. Delivering foundational project work in the GitLab application and infrastructure to support service capacity needs for GitLab SaaS.
- Scalability Advocacy and Facilitation. An effective method for the Scalability group to leverage its output is by collaborating closely with other engineering teams to promote scalability best practices. This might include building tools to enable wider engagement in GitLab SaaS operations (e.g. Stage Dashboards), or serving as a point of contact to other teams for scaling questions relating to their own initiatives.
- Eliminating Toil. We want to make our output as efficient as possible by spending more time on engineering projects and less time on manual, repetitive, or automatable tasks. An effective way of achieving this is by considering how future toil can be avoided when delivering projects. However, in line with our Iteration value, we don’t want to over-optimize and we can’t consider all eventualities ahead of time. We should always be mindful of opportunities to reduce toil, which will make us more effective in the long term.
The above list is not comprehensive, nor does it outline a formal process. We should remain pragmatic when prioritizing work, while using the themes as a guideline.
Job Families
The Scalability Group consists of a Senior Engineering Manager, Engineering Managers, Backend Engineers, and Site Reliability Engineers.
The Engineering Roles section of the handbook lists the responsibilities of these roles:
Working with us
Emergency Escalation during S1/S2 incidents
Scalability leadership can be reached via PagerDuty Scalability Escalation.
From https://gitlab.pagerduty.com/incidents, click on the “New Incident” button and complete the new incident form as shown below.
How do I engage with the Scalability Group?
- Start with an issue in the Scalability tracker: Create an issue.
- You are welcome to follow this up with a Slack message in #g_scalability.
- Please don’t add any workflow labels to the issue. The team will triage the issue and apply these.
Alternatively, mention us in the issue where you’d like our input.
When issues are sent our way, we will do our best to help or find a suitable owner to move the issue forward.
We may be a development team’s first contact into the Infrastructure department and we endeavour to treat these
requests with care so that we can help to find an effective resolution for the issue.
Scalability review requests
If you’re working on a feature that has specific scaling requirements, you
can create an issue with the review request template.
Some examples are:
- Review Request - Impact on database load for enabling advanced global search
- Review Request - Assumptions about build prerequisite-related application limits
- Review Request - Throttling for Cleanup Policies Scaling Request
This template gives the Scalability group the information we need to help you, and the issue will be shown on
our build board with a high priority.
How does the Scalability Group engage with Stage Groups?
When we observe a situation on GitLab.com that needs to be addressed alongside a stage group, we first raise an issue
in the Scalability issue tracker that describes what we are seeing. We try to determine if the problem lies with the action
the code is performing, or the way in which it is running on GitLab.com. For example, with queues and workers, we will see
if the problem is in what the queue does, or how the worker should run.
If we find that the problem is in what the code is doing, then we engage with the EM/PM of that group to find the right path
forward. If work is required from that group, we will create a new issue in the gitlab-org project and use the Availability and Performance Refinement process to highlight this issue.
How we work
Handbook First
In line with the broader GitLab culture, we adopt a Handbook First approach to documenting our team’s workflow. Should you have any proposals aimed at enhancing our processes, please initiate a Merge Request (MR) to update the handbook. Assign the MR to @rnienaber
for Group-level changes and to the Scalability EMs for the respective team changes, and tag the team in a comment to solicit feedback. If there are no objections within three working days of tagging the team, the MR will be deemed ready for merging. We adhere to the principle of making two-way door decisions, meaning additional MRs can be created to suggest changes or removals of processes that are deemed inefficient.
Communication
Everything is written in epics, issues, runbooks or the handbook.
Decisions or important information in Slack must be copied into a relevant location so that the information is persisted beyond the 90-day Slack retention policy.
Slack Channels and Guidelines
We communicate in public using the following channels:
- #g_scalability - Company facing channel where other team members can reach out to us. We also use this channel for highlighting work we have done.
- #g_scalability-observability - Team channel where daily work information is shared and team coordination takes place.
- #g_scalability-practices - Team channel where daily work information is shared and team coordination takes place.
- #scalability-social - Group social channel.
In the team channels, team members are encouraged to share what they are working on and any blockers they may have.
The format is not fixed so that people share in a way that feels natural to them.
In addition to the channels above, we create a project channel per epic when appropriate.
A channel dedicated to a project helps to keep everything about the topic in one place and makes it easy to stay up to date on that topic.
It is useful for teams working across time zones or for getting back up to speed after being on leave.
The DRI for a project owns the project Slack channel.
Meetings and Scheduled Calls
We prefer to work asynchronously as far as possible but still use synchronous communication where it makes sense to do so.
Asynchronous communication is the best way to make sure everyone, regardless of timezone or availability, is included.
To keep people connected, team members are encouraged to schedule at least one coffee chat with another team member each week.
Demo Calls
There is one demo call per week.
The time of the call changes each week so that people in different timezones can each join some of the calls.
The purpose of this call is to showcase something that you have been working on during that week. It does not have to be perfect or polished.
These calls are a technical conversation and, while we might end up with guidance on what we are working on, the purpose is not to make fixed decisions on this call.
Items should be added to the agenda ahead of the meeting.
If there are no agenda items at that time, the call is cancelled for that week.
The call should be recorded and added to GitLab Unfiltered. Please use your discretion when choosing the visibility level as some screen shares contain private data. If you upload a private video, please add information in the description for why this visibility was chosen.
A playlist of the recordings is available on GitLab Unfiltered.
Communicating our work schedule to others
It is important that team members know when we are available, so we keep our calendars updated.
- We use the “working hours” settings in Google calendar.
- We indicate “async only” periods if we need them during our working hours.
- Our PTO tracker is linked to the group Google calendar using
gitlab.com_3puidsh74uhqdv9rkp3fj56af4@group.calendar.google.com
as an additional calendar in the Calendar Sync settings.
- Any team members with on-call responsibilities should share their Pager Duty calendar with their manager.
Impact
As a small team covering a wide domain, we need to make sure that
everything we do has sufficient impact. If we do something that only the
rest of the Scalability group knows about, we haven’t ‘shipped’ anything.
Our ‘users’ in this context are the infrastructure itself, SREs, and
Development Engineers.
Impact could take the form of changes like:
- Development practices that make it easier for our Development
department to ship code that works well at scale.
- Monitoring changes that mean we can detect and attribute problems sooner,
particularly focusing on utilisation.
- Code changes that reduce pressure on a part of our system that’s
feeling the strain.
- Guides for Developers and SREs to work with a given service.
Announcing Impactful Items
In order to make others aware of the work we have done, we should advertise changes in the following locations:
- Engineering Week-in-Review
- Slack Channels
  - For Backend Engineers: #development, #eng-managers, #dev_tip_of_the_day, #development-guidelines
  - For SREs: #infrastructure-lounge, #infra-staff
When collaborating on the announcement text, consider using a threaded discussion on the relevant epic, issue, or change request.
Documentation or tutorial videos should also be added to the README.md
in our team repository.
Project Management
We use epics and issues to manage our work. Our project management process describes how we work on our roadmaps, backlogs, and active projects.
Triage rotation
We have automated triage policies defined in the triage-ops project. These
perform tasks such as automatically labelling issues, asking the author to add labels, and creating weekly triage issues.
We rotate the triage ownership each month, with the current triage owner responsible for picking the next one (a
reminder is added to their last triage issue).
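To give a sense of what these policies look like, here is a minimal, hypothetical rule in the gitlab-triage YAML format that triage-ops builds on; the label names and comment text are illustrative only and do not reflect the actual policies in the project.

```yaml
# A minimal, hypothetical gitlab-triage policy rule.
# Label names and comment wording are illustrative only.
resource_rules:
  issues:
    rules:
      - name: Flag new issues for triage
        conditions:
          state: opened
          forbidden_labels:
            - workflow-infra::Triage      # skip issues that are already flagged
        actions:
          labels:
            - workflow-infra::Triage      # illustrative label
          comment: |
            Thanks for raising this! The Scalability group will triage this issue
            and apply the appropriate workflow labels.
```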
Triaging issues
When issues arrive on our backlog, we should consider how they align with our vision, mission, and current OKRs.
We also determine which of the teams would be the more appropriate owner for that task.
We need to effectively triage these issues so that they can be handled appropriately. This means:
- Critically assess the issue to understand the problem
- Determine if this impacts .com or Self-Managed instances.
- If this primarily affects Self-Managed instances, the issue can usually be redirected to the Application Performance group.
- If this is a scaling issue, add it to our backlog using workflow labels and place it on the planning board if necessary.
- If this is not a scaling issue, find the most appropriate owner in either Infrastructure or Development, or any other department.
When handing over an issue to the new owner, provide as much information as you can from your assessment of the issue.
Engagement with Incidents
The Scalability team members often have specialized knowledge that is helpful in resolving incidents. Some team members are also SREs who are part of the on-call rota. We follow the guidelines below when contributing to incidents.
For an on-call SRE:
For an Incident Manager:
If you are not EOC or an Incident Manager when an incident occurs:
- For S1 incidents
  - getting GitLab.com back up and running in a stable state takes priority over project work
  - when the system is stable, contribute to determining the root cause and writing up the corrective actions
  - the IM or Reliability EM will delegate corrective actions
  - work with the Scalability EM to prioritize any work that arises from an S1
- For all other incidents
  - if you are called into an incident, the priority is to enable others to resolve the problem
  - the expectation is to be hands-off, giving guidance where necessary, and returning to project work as soon as possible
The reason for this position is that our project work prevents future large S1 incidents from occurring.
If we try to participate in and resolve many incidents, our project work is delayed and the risk of future S1 incidents increases.
Engagement with the Infradev Process
The Infradev process aims to highlight SaaS availability and reliability improvements with the Stage Groups.
Where issues marked as infradev are found to be scaling problems, the team::Scalability label should be added.
Our commitment to this process, in line with the team’s vision, is to provide guidance and assistance to the stage groups who are responsible for resolving
these issues. We proactively assist them to determine how to resolve a problem, and then we contribute to reviewing
the changes that they make.
Weekly Issues
- Service::Unknown refinement - go through issues marked Service::Unknown and add a defined service, where possible.
- Review request processing - the goal is to move review request issues to workflow-infra::In Progress, either through picking them up directly, or asking on our team channel if anyone else is able.
Monthly Issues
- Infradev review - show issues with team::Scalability and infradev labels so we can help the stage groups move those forward.
Quarterly Issues
Every quarter, we perform a review of all issues on the backlog that are not part of any project. When reviewing issues:
- if the issue is no longer relevant then it should be closed
- if it is relevant but we are unlikely to work on it soon, it should remain open
The EM creates this issue each quarter. It is not the sole responsibility of the person on Triage Rotation and is shared among all team members.
Regarding Coding at Scale
Software development often happens on a single machine, with a single application version and almost no load.
This configuration is very different from what happens on GitLab.com and our customers’ installations.
Problems “at scale” arise because production operates at a different order of magnitude from development and testing environments.
Things like the number of servers, the number of incoming requests, the number of rows on a table or
the number of application versions will make the difference between something that works on your computer and
something that works in production.
An extra challenge, almost unique to GitLab, is that we deploy from the main branch multiple times each day, but we also have a monthly release cycle,
and zero-downtime updates are a requirement for both.
Overlooking compatibility between multiple versions of the application running at the same time
can cause a production incident.
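As a sketch of the sequencing this implies, consider the hypothetical example of renaming a database column with the expand and contract pattern: first add the new column and ship code that writes to both columns while still reading from the old one; then backfill the data and switch reads to the new column; and only in a later release, once no running version of the application references it, drop the old column. Skipping a step means that, at some point during a deployment, one version of the application will reference a column that another version has already changed.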
You can find more detailed information in the links below. If this is not enough, please reach out to the
delivery or
scalability team.
- Expand and Contract pattern
- Zero Downtime Updates
- Sidekiq Compatibility across Updates
- Avoiding downtime in migrations
- Uploads development documentation
Team History
The Scalability team was created during the fourth organizational iteration in the Infrastructure department on 2019-08-22, although it only became a reality once the first team member joined the team on 2019-11-29.
Even though it might not look like it at first glance, the Scalability team has its origin connected to the Delivery team. Namely, the first two backend engineers with an Infrastructure specialisation were part of the Delivery team, a specialisation that previously did not fit into the organizational structure. They had a focus on reliability improvements for GitLab.com, often working on features that had many scaling considerations. A milestone that would later prove the case for the Scalability team was Continuous Delivery on GitLab.com.
Throughout July, August and September 2019, GitLab.com experienced a higher than normal number of customer-facing events. Mirroring delays, slowdowns, and vertical node scaling issues (to name a few) all contributed to a general need to improve stability. This placed higher expectations on the Infrastructure department, and with the organization as it was at the time, these were harder to meet. To accelerate the timelines, the “infradev” and “rapid action” processes were created as a connection point between the Infrastructure and Development departments to help Product prioritise higher impact issues. This approach was starting to yield results, but the process existed as a reaction to an (ongoing) event, with the focus on resolving that specific need.
The background processing architectural proposal clearly illustrated the need to stay ahead of the growing needs of the platform and to approach that growth strategically as well as tactically. With a clear case and approvals in hand, the team mission, vision, and goals were set and the team buildout could commence. While that was in motion, we had another confirmation, through a performance retrospective, that the need for the team was real.
As the team was taking shape, the background processing architectural changes were the first changes delivered by the team with a large impact on GitLab.com, with many more incremental changes following throughout 2020. Measuring that impact reliably, and predicting future challenges, remains one of the team's focuses at the time of writing this history summary.
The team impact overview is logged in issues:
- Year overview for 2020
- Year overview for 2021
- Year overview for 2022
- Year overview for 2023
Project Management
The majority of our project management process is described at the Platforms level and is shared between all SaaS Platform teams.
Please read this first.
This page describes the additions to the process described on the Platforms page.
The single source of truth for all work is the Scaling GitLab SaaS Platforms epic.
We often refer to this as our top-level epic.
Epics that are added as children to the top-level epic are used to describe projects that the team undertakes.
Observability encompasses the technical elements responsible for metrics, logging, and tracing, along with the tools and processes that leverage these components.
Mission
Our mission is to deliver and maintain a world-class observability offering and frictionless operational experience for team members at GitLab.
Common Links
Team Members
The following people are members of the Scalability:Observability team:
Mission
We enable GitLab services to operate at production scale by providing paved roads for onboarding and maintaining features and services.
Common Links
Team Members
The following people are members of the Scalability:Practices team: