Scalability Group

Workflow: Team workflow
GitLab.com: @gitlab-org/scalability
Issue Trackers: Scalability
Team Slack Channels:
  • #g_scalability - Company facing channel
  • #g_scalability-observability - Team channel
  • #g_scalability-practices - Team channel
  • #scalability_social - Group social channel
Information Slack Channels:
  • #infrastructure-lounge (Infrastructure Group Channel)
  • #incident-management (Incident Management)
  • #alerts-general (SLO alerting)
  • #mech_symp_alerts (Mechanical Sympathy Alerts)

Group Level

  1. Scalability Epic Board
  2. Scalability Issue Board
  3. Scalability Issues not in an Epic
  4. Scalability Issues by Team
  5. Scalability Issues by Team Member

Teams

The Scalability group is currently formed of two teams:

Name                     Role
Rachel Nienaber          Senior Engineering Manager, Scalability:Projections
Chance Feick             Senior Backend Engineer
Devin Sylva              Senior Site Reliability Engineer
JP Toto                  Manager, Engineering
Kam Kyrala               Engineering Manager, Production Engineering
Liam McAndrew            Engineering Manager, Scalability:Frameworks
Paul John Phillips       Backend Engineering Manager, Cloud Connector
Steve Abrams             Engineering Manager, Saas Platforms:Foundations

Scalability:Observability

The Observability team focuses on observability, forecasting & projection systems that enable development engineering to predict system growth for their areas of responsibility.

The following people are members of the Scalability:Observability team:

Name                     Role
Liam McAndrew            Engineering Manager, Scalability:Frameworks
Andreas Brandl           Staff Backend Engineer, Scalability
Bob Van Landuyt          Staff Backend Engineer, Scalability
Hercules Lemke Merscher  Backend Engineer
Calliope Gardner         Site Reliability Engineer
Nick Duff                Senior Site Reliability Engineer
Stephanie Jackson        Staff Site Reliability Engineer, Scalability
Taliesin Millhouse       Site Reliability Engineer
Tony Ganga               Site Reliability Engineer

Scalability:Practices

The Practices team focuses on tools and frameworks that enable the stage groups to support their features on our production systems.

The following people are members of the Scalability:Practices team:

Name Role

Mission

The Scalability group is responsible for GitLab at scale, working on the highest priority scaling items related to our SaaS platforms. We support other Engineering teams by sharing data and techniques so they can become better at scalability as well.

Vision

As its name implies, the Scalability group enhances the availability, reliability, and performance of GitLab’s SaaS platforms by observing the application’s capabilities to operate at scale.

The Scalability group analyzes application performance on GitLab’s SaaS platforms, recognizes bottlenecks in service availability, proposes (and develops) short term improvements and develops long term plans that help drive the decisions of other Engineering teams.

Short term goals for the group include:

  • Refine existing, define new, and document Service Level Objectives for each of GitLab’s services.
  • Continuously expose the top 3 critical bottlenecks that threaten the stability of our SaaS platforms.
  • Work on scoping, planning and defining the implementation steps of the top critical bottleneck.
  • Define and track team KPIs to measure impact on our SaaS platforms.
  • Work on implementing user facing application features (such as API improvements) as a means to reduce pressure on our SaaS platforms generated by regular user interactions.

Direction for FY24

We’ve moved the direction to the direction section here so that it’s in the same place as the rest of our product direction.

Indicators

The Infrastructure Department is concerned with the availability and performance of GitLab’s SaaS platforms.

GitLab.com’s service level availability is visible on the SLA Dashboard, and we use the General GitLab Dashboard in Grafana to observe the service level indicators (SLIs) of apdex, error ratios, requests per second, and saturation of the services.

These dashboards show lagging indicators for how the services have responded to the demand generated by the application.
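To make these indicators more concrete, here is a minimal sketch of how an apdex score and an error ratio can be derived from request counts. It assumes the conventional Apdex formula (satisfied requests plus half of the tolerated requests, divided by the total). The function names and sample numbers are hypothetical, and the definitions behind the dashboards mentioned above may differ.

```python
def apdex(satisfied: int, tolerated: int, total: int) -> float:
    """Conventional Apdex score: (satisfied + tolerated / 2) / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (satisfied + tolerated / 2) / total


def error_ratio(errors: int, total: int) -> float:
    """Fraction of requests that ended in an error."""
    if total <= 0:
        raise ValueError("total must be positive")
    return errors / total


# Hypothetical sample: 1000 requests, 900 fast, 80 tolerable, 5 errors.
print(apdex(satisfied=900, tolerated=80, total=1000))  # 0.94
print(error_ratio(errors=5, total=1000))               # 0.005
```

Both values describe what has already happened to the service, which is why they act as lagging indicators.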

Each team is responsible for separate indicators. For more information, please view the team pages linked above.

Themes

The broad nature of work undertaken by the Scalability group can make prioritization challenging as it’s tricky to compare some issues like-for-like. For example, how do we compare the benefit of an issue to address a performance concern against an issue that reduces developer toil? To help guide the direction of the group and to inform our prioritization process, we can categorize issues into the following themes, in order of priority:

  1. Critical Saturation Response. On occasion, saturation alerts can occur unexpectedly - for example, when caused by a sudden change in platform usage patterns - and need to be addressed with urgency. We try to avoid working reactively by proactively working on other themes.
  2. Horizontal Scalability. The most obvious scaling bottlenecks in our infrastructure are those that can only be scaled vertically instead of horizontally. Horizontal scaling brings the benefit of elasticity, which increases confidence that we can meet future demand while keeping costs linear - both of these elements are strongly aligned with the vision of the Scalability group.
  3. Increasing Platform Capacity. Delivering foundational project work in the GitLab application and infrastructure to support service capacity needs for GitLab SaaS.
  4. Scalability Advocacy and Facilitation. An effective method for the Scalability group to leverage its output is by collaborating closely with other engineering teams to promote scalability best practices. This might include building tools to enable wider engagement in GitLab SaaS operations (e.g. Stage Dashboards), or serving as a point of contact to other teams for scaling questions relating to their own initiatives.
  5. Eliminating Toil. We want to make our output as efficient as possible by spending more time on engineering projects and less time on manual, repetitive, or automatable tasks. An effective way of achieving this is by considering how future toil can be avoided when delivering projects. However, in line with our Iteration value, we don’t want to over-optimize and we can’t consider all eventualities ahead of time. We should always be mindful of opportunities to reduce toil, which will make us more effective in the long-term.

The above list is not comprehensive, nor does it outline a formal process. We should remain pragmatic when prioritizing work, while using the themes as a guideline.

Job Families

The Scalability Group consists of a Senior Engineering Manager, Engineering Managers, Backend Engineers, and Site Reliability Engineers.

The Engineering Roles section of the handbook lists the responsibilities of these roles.

Working with us

Emergency Escalation during S1/S2 incidents

Scalability leadership can be reached via PagerDuty Scalability Escalation.

From https://gitlab.pagerduty.com/incidents, click on the “New Incident” button and complete the new incident form as shown below.

Scalability PD Incident

How do I engage with the Scalability Group?

  1. Start with an issue in the Scalability tracker: Create an issue.
  2. You are welcome to follow this up with a Slack message in #g_scalability.
  3. Please don’t add any workflow labels to the issue. The team will triage the issue and apply these.

Alternatively, mention us in the issue where you’d like our input.

When issues are sent our way, we will do our best to help or find a suitable owner to move the issue forward. We may be a development team’s first contact into the Infrastructure department and we endeavour to treat these requests with care so that we can help to find an effective resolution for the issue.

Scalability review requests

If you’re working on a feature that has specific scaling requirements, you can create an issue with the review request template. Some examples are:

  1. Review Request - Impact on database load for enabling advanced global search
  2. Review Request - Assumptions about build prerequisite-related application limits
  3. Review Request - Throttling for Cleanup Policies Scaling Request

This template gives the Scalability group the information we need to help you, and the issue will be shown on our build board with a high priority.

How does the Scalability Group engage with Stage Groups?

When we observe a situation on GitLab.com that needs to be addressed alongside a stage group, we first raise an issue in the Scalability issue tracker that describes what we are seeing. We try to determine if the problem lies with the action the code is performing, or the way in which it is running on GitLab.com. For example, with queues and workers, we will see if the problem is in what the queue does, or how the worker should run.

If we find that the problem is in what the code is doing, then we engage with the EM/PM of that group to find the right path forward. If work is required from that group, we will create a new issue in the gitlab-org project and use the Availability and Performance Refinement process to highlight this issue.

How we work

Handbook First

In line with the broader GitLab culture, we adopt a Handbook First approach to documenting our team’s workflow. Should you have any proposals aimed at enhancing our processes, please initiate a Merge Request (MR) to update the handbook. Assign the MR to @rnienaber for group-level changes, or to the Scalability EMs for changes to their respective teams, and tag the team in a comment to solicit feedback. If there are no objections within three working days of tagging the team, the MR will be deemed ready for merging. We adhere to the principle of making two-way door decisions, meaning additional MRs can be created to suggest changes or removals of processes that are deemed inefficient.

Communication

Everything is written in epics, issues, runbooks or the handbook. Decisions or important information in Slack must be copied into a relevant location so that the information is persisted beyond the 90-day Slack retention policy.

Slack Channels and Guidelines

We communicate in public using the following channels:

  1. #g_scalability - Company facing channel where other team members can reach out to us. We also use this channel for highlighting work we have done.
  2. #g_scalability-observability - Team channel where daily work information is shared and team coordination takes place.
  3. #g_scalability-practices - Team channel where daily work information is shared and team coordination takes place.
  4. #scalability_social - Group social channel.

In the team channels, team members are encouraged to share what they are working on and any blockers they may have. The format is not fixed so that people share in a way that feels natural to them.

In addition to the channels above, we create a project channel per epic when appropriate. A channel dedicated to a project helps to keep everything about the topic in one place and makes it easy to stay up to date on that topic. It is useful for teams working across time zones or for getting back up to speed after being on leave. The DRI for a project owns the project Slack channel.

Meetings and Scheduled Calls

We prefer to work asynchronously as far as possible but still use synchronous communication where it makes sense to do so. Asynchronous communication is the best way to make sure everyone, regardless of timezone or availability, is included.

To keep people connected, team members are encouraged to schedule at least one coffee-chat with another team member each week.

Demo Calls

There is one demo call per week. The time of the call differs each week so that people in different timezones can join different calls. The purpose of this call is to showcase something that you have been working on during that week. It does not have to be perfect or polished. These calls are a technical conversation and while we might end up with guidance on what we are working on, the purpose is not to make fixed decisions on this call.

Items should be added to the agenda ahead of the meeting. If there are no agenda items at that time, the call is cancelled for that week.

The call should be recorded and added to GitLab Unfiltered. Please use your discretion when choosing the visibility level as some screen shares contain private data. If you upload a private video, please add information in the description for why this visibility was chosen. A playlist of the recordings is available on GitLab Unfiltered.

Communicating our work schedule to others

It is important that team members know when we are available, so we keep our calendars updated.

  1. We use the “working hours” settings in Google Calendar.
  2. We indicate “async only” periods if we need them during our working hours.
  3. Our PTO tracker is linked to the group Google calendar using gitlab.com_3puidsh74uhqdv9rkp3fj56af4@group.calendar.google.com as an additional calendar in the Calendar Sync settings.
  4. Any team members with on-call responsibilities should share their PagerDuty calendar with their manager.

Impact

As a small team covering a wide domain, we need to make sure that everything we do has sufficient impact. If we do something that only the rest of the Scalability group knows about, we haven’t ‘shipped’ anything. Our ‘users’ in this context are the infrastructure itself, SREs, and Development Engineers.

Impact could take the form of changes like:

  1. Development practices that make it easier for our Development department to ship code that works well at scale.
  2. Monitoring changes that mean we can detect and attribute problems sooner, particularly focusing on utilisation.
  3. Code changes that reduce pressure on a part of our system that’s feeling the strain.
  4. Guides for Developers and SREs to work with a given service.

Announcing Impactful Items

In order to make others aware of the work we have done, we should advertise changes in the following locations:

  1. Engineering Week-in-Review
  2. Slack Channels
    1. For Backend Engineers
      1. #development
      2. #eng-managers
      3. #dev_tip_of_the_day
      4. #development-guidelines
    2. For SREs
      1. #infrastructure-lounge
      2. #infra-staff

When collaborating on the announcement text, consider using a threaded discussion on the relevant epic, issue, or change request.

Documentation or tutorial videos should also be added to the README.md in our team repository.

Project Management

We use epics and issues to manage our work. Our project management process describes how we work on our roadmaps, backlogs, and active projects.

Triage rotation

We have automated triage policies defined in the triage-ops project. These perform tasks such as automatically labelling issues, asking the author to add labels, and creating weekly triage issues.

We rotate the triage ownership each month, with the current triage owner responsible for picking the next one (a reminder is added to their last triage issue).

Triaging issues

When issues arrive on our backlog, we should consider how they align with our vision, mission, and current OKRs.

We also determine which of the teams would be the more appropriate owner for that task.

We need to effectively triage these issues so that they can be handled appropriately. This means:

  1. Critically assess the issue to understand the problem.
  2. Determine if this impacts .com or Self-Managed instances.
    1. If this primarily affects Self-Managed instances, the issue can usually be redirected to the Application Performance group.
  3. If this is a scaling issue, assign it into our backlog using workflow labels and place it on the planning board if necessary.
  4. If this is not a scaling issue, find the most appropriate owner in either Infrastructure or Development, or any other department.

When handing over an issue to the new owner, provide as much information as you can from your assessment of the issue.
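The triage steps above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Issue` fields and the returned strings are hypothetical, and the real process relies on human judgement (for example, step 2.1 says Self-Managed issues can "usually" be redirected, not always).

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """Hypothetical summary of what the triager learned in step 1."""
    title: str
    primarily_self_managed: bool
    is_scaling_issue: bool


def triage(issue: Issue) -> str:
    """Return a routing decision that mirrors the triage steps."""
    # Step 2.1: Self-Managed issues usually go to Application Performance.
    if issue.primarily_self_managed:
        return "redirect to the Application Performance group"
    # Step 3: scaling issues go into the Scalability backlog.
    if issue.is_scaling_issue:
        return "add workflow labels and assign to the Scalability backlog"
    # Step 4: everything else needs a more appropriate owner.
    return "find an owner in Infrastructure, Development, or another department"


print(triage(Issue("Redis CPU saturation on .com", False, True)))
```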

Engagement with Incidents

The Scalability team members often have specialized knowledge that is helpful in resolving incidents. Some team members are also SREs who are part of the on-call rota. We follow the guidelines below when contributing to incidents.

For an on-call SRE:

For an Incident Manager:

If you are not EOC or an Incident Manager when an incident occurs:

  • For S1 incidents
    • the priority is to get GitLab.com up and running; returning to a stable state takes priority over project work
    • when the system is stable, contribute to determining the root cause and writing up the corrective actions
    • the IM or Reliability EM will delegate corrective actions
    • work with the Scalability EM to prioritize any work that arises from an S1
  • For all other incidents
    • if you are called into an incident, the priority is to enable others to resolve the problem
    • the expectation is to be hands-off, giving guidance where necessary, and returning to project work as soon as possible

The reason for this position is that our project work prevents future large S1 incidents from occurring. If we try to participate in and resolve many incidents, our project work is delayed and the risk of future S1 incidents increases.

Engagement with the Infradev Process

The Infradev process aims to highlight SaaS availability and reliability improvements with the Stage Groups.

Where issues marked as infradev are found to be scaling problems, the team::Scalability label should be added.

Our commitment to this process, in line with the team’s vision, is to provide guidance and assistance to the stage groups who are responsible for resolving these issues. We proactively assist them to determine how to resolve a problem, and then we contribute to reviewing the changes that they make.

Weekly Issues

  1. Service::Unknown refinement - go through issues marked Service::Unknown and add a defined service, where possible.
  2. Review request processing - the goal is to move review request issues to workflow-infra::In Progress, either through picking them up directly, or asking on our team channel if anyone else is able.

Monthly Issues

  1. Infradev review - show issues with team::Scalability and infradev labels so we can help the stage groups move those forward.

Quarterly Issues

Every quarter, we perform a review of all issues on the backlog that are not part of any project. When reviewing issues:

  • if the issue is no longer relevant then it should be closed
  • if it is relevant but we are unlikely to work on it soon it should remain open

The EM creates this issue each quarter. It is not the sole responsibility of the person on Triage Rotation and is shared among all team members.

Regarding Coding at Scale

Software development often happens on a single machine, with a single application version and almost no load.

This configuration is very different from what happens on GitLab.com and our customers’ installations.

Problems “at scale” arise because production operates at a different order of magnitude than development and testing environments. Things like the number of servers, the number of incoming requests, the number of rows in a table or the number of application versions will make the difference between something that works on your computer and something that works in production.

An extra challenge, almost unique to GitLab, is that we deploy from the main branch multiple times each day while also maintaining a monthly release cycle, and zero-downtime updates are a requirement for both.

Overlooking the compatibility with multiple versions of the application running at the same time can induce a production incident.

You can find more detailed information in the links below. If this is not enough, please reach out to the Delivery or Scalability team.

  1. Expand and Contract pattern
  2. Zero Downtime Updates
  3. Sidekiq Compatibility across Updates
  4. Avoiding downtime in migrations
  5. Uploads development documentation
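As a rough illustration of the multi-version concern described in this section, the sketch below shows a background job handler that tolerates payloads enqueued by both an older and a newer application version. It is not taken from the GitLab codebase: the function, the `locale` field, and the default value are all hypothetical.

```python
def process_job(payload: dict) -> str:
    """Handle jobs enqueued by either the old or the new application version."""
    user_id = payload["user_id"]
    # "locale" is a hypothetical argument added by the newer version.
    # Jobs enqueued by nodes still running the old version will not
    # include it, so fall back to a default instead of failing.
    locale = payload.get("locale", "en")
    return f"notify user {user_id} in {locale}"


old_job = {"user_id": 42}                  # enqueued by the old version
new_job = {"user_id": 42, "locale": "de"}  # enqueued by the new version

print(process_job(old_job))
print(process_job(new_job))
```

During a rolling deployment both kinds of payload can be in the queue at once, so the handler has to accept both until every node runs the new version.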

Team History

The Scalability team was formed during the fourth organizational iteration in the Infrastructure department on 2019-08-22, although it only became a reality once the first team member joined the team on 2019-11-29.

Even though it might not look like it at first glance, the Scalability team has its origin connected to the Delivery team. Namely, the first two backend engineers with Infrastructure specialisation were a part of the Delivery team, a specialisation that previously did not fit into the organizational structure. They had a focus on reliability improvements for GitLab.com, often working on features that had many scaling considerations. A milestone that would prove to be a case for the Scalability team was Continuous Delivery on GitLab.com.

Throughout July, August and September 2019, GitLab.com experienced a higher than normal number of customer facing events. Mirroring delays, slowdowns, and vertical node scaling issues (to name a few) all contributed to a general need to improve stability. This placed higher expectations on the Infrastructure department, and with the organization as it was at the time, these were harder to meet. To accelerate the timelines, “infradev” and “rapid action” processes were created as a connection point between the Infrastructure and Development departments to help Product prioritise higher impact issues. This approach was starting to yield results, but the process was there as a reaction to an (ongoing) event with the focus on resolving that specific need.

The background processing architectural proposal clearly illustrated the need to stay ahead of the growing needs of the platform and approach the growth strategically as well as tactically. With a clear case and approvals in hand, the team mission, vision, and goals were set and the team buildout could commence. While that was in motion, we had another confirmation through a performance retrospective that the need for the team was real.

As the team was taking shape, the background processing architectural changes were the first changes delivered by the team with a large impact on GitLab.com, with many more incremental changes following throughout 2020. Measuring that impact reliably and predicting future challenges remain among the team’s focuses at the time of writing this history summary.

The team impact overview is logged in issues:

  1. Year overview for 2020
  2. Year overview for 2021
  3. Year overview for 2022
  4. Year overview for 2023

Observability Team

Observability encompasses the technical elements responsible for metrics, logging, and tracing, along with the tools and processes that leverage these components.

Mission

Our mission is to deliver and maintain a world-class observability offering and frictionless operational experience for team members at GitLab.

Workflow: Team workflow
GitLab.com: @gitlab-org/scalability/observability
Issue Trackers:
  • Scalability
  • Tamland
Team Slack Channels:
  • #g_observability - Team channel
  • #infrastructure_platforms_social - Social channel
Project Slack Channels:
  • #observability-tamland - Tamland development
Information Slack Channels:
  • #infrastructure-lounge (Infrastructure Group Channel)
  • #incident-management (Incident Management)
  • #alerts-general (SLO alerting)
  • #mech_symp_alerts (Mechanical Sympathy Alerts)

Team Members

The following people are members of the Observability team:

Scalability Group Project Management

Project Management

The majority of our project management process is described at the Platforms level and is shared between all SaaS Platform teams. Please read this first.

This page describes the additions to the process described on the Platforms page.

The single source of truth for all work is the Scaling GitLab SaaS Platforms epic. We often refer to this as our top-level epic.

Epics that are added as children to the top-level epic are used to describe projects that the team undertakes.

Scalability:Practices Team

Mission

We enable GitLab services to operate at production scale by providing paved roads for onboarding and maintaining features and services.

Workflow: Team workflow
GitLab.com: @gitlab-org/scalability/practices
Issue Trackers: Scalability
Team Slack Channels:
  • #g_scalability-practices - Team channel
  • #scalability_social - Group social channel
Information Slack Channels:
  • #infrastructure-lounge (Infrastructure Group Channel)
  • #incident-management (Incident Management)
  • #alerts-general (SLO alerting)

Team Members

The following people are members of the Scalability:Practices team:

Last modified January 7, 2025: Move eng images to static folder (be4d32f4)