Application Performance Group - 2020 Impact

The year 2020 marked the first full year for the Application Performance group. The group originated in early 2019, with the most recent team members joining in October 2019. This summary highlights the impact and the work of the Application Performance group in 2020. Each section below gives a brief overview of a project, describes its impact, and links to issues and epics, mostly within the GitLab.org project. The efforts are listed in approximate chronological order, and the list is not comprehensive: some isolated issues as well as confidential initiatives are not included below.

Focused Sidekiq improvements

As part of a 10x initiative to Iteratively re-architect our queue implementation to create 10x headroom, the Application Performance group focused on Sidekiq improvements to improve observability and add configurability. We wrapped up our efforts on this work at the beginning of 2020.

Impact - Observability and Sidekiq Memory Killer

We implemented tooling to improve observability with Sidekiq. This improvement, in coordination with the Infrastructure team's work to simplify Sidekiq worker pools, led to a saturation point improvement from approximately 600 jobs per second to 7,000 jobs per second. We also fixed longstanding issues with the Sidekiq memory killer. Additionally, an epic was created to identify future improvements to Improve reliability, observability, performance of background jobs. The Scalability team was able to build upon our research and prototyping to implement a solution to move sidekiq-cluster to Core, making it available to everyone.
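
To illustrate the kind of observability tooling involved, here is a minimal sketch of a Sidekiq server middleware that measures queue latency and job duration and emits them as structured log lines. This is illustrative only, not the actual GitLab instrumentation; the class name and log fields are assumptions.

```ruby
require 'sidekiq'
require 'json'

# Illustrative only: a Sidekiq server middleware that logs queue latency and
# job duration as structured JSON for each executed job.
class JobMetricsMiddleware
  def call(worker, job, queue)
    # Sidekiq stamps jobs with the (epoch seconds) time they were enqueued.
    queue_latency = job['enqueued_at'] ? Time.now.to_f - job['enqueued_at'] : nil
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    yield
  ensure
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    Sidekiq.logger.info(
      JSON.generate({
        worker: worker.class.name,
        queue: queue,
        queue_latency_s: queue_latency&.round(3),
        duration_s: duration.round(3)
      })
    )
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add JobMetricsMiddleware
  end
end
```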

Migrated to Puma from Unicorn

An extensive account of our process of moving from Unicorn to Puma on our web servers can be read in our blog post How we migrated application servers from Unicorn to Puma.

Impact - Reduced memory usage

Once we deployed Puma to our entire web fleet, we observed a drop in memory usage from 1.28TB to approximately 800GB (roughly a 37% reduction), while our request queuing, request duration, and CPU usage all remained roughly the same.
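
For context on why threads help, below is an illustrative clustered-mode `config/puma.rb`, not GitLab's actual settings: Puma forks a handful of worker processes and serves several requests per worker on threads that share memory, which is what allows similar capacity with fewer processes than single-threaded Unicorn. The worker and thread counts, environment variable names, and Rails hooks are placeholders.

```ruby
# config/puma.rb - illustrative clustered-mode settings, not GitLab's actual
# configuration. Worker and thread counts here are placeholders.
workers Integer(ENV.fetch('PUMA_WORKERS', 4))
max_threads = Integer(ENV.fetch('PUMA_THREADS', 4))
threads 1, max_threads

# Load the application once in the master process, then fork workers that
# share that memory copy-on-write.
preload_app!

port ENV.fetch('PORT', 8080)

before_fork do
  # Drop shared connections before forking; each worker reconnects on boot.
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

on_worker_boot do
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end
```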

Project Import and Export improvements

This work started as a rapid action due to failing imports and exports on large projects. During the rapid action we assisted in troubleshooting the large projects and identified areas of improvement going forward. Subsequently, as part of the 10x initiative, we identified some short-term project import/export improvements.

Impact - We fixed large imports, reduced execution time, fewer SQL calls

Not only were we able to fix broken imports for large projects, we also delivered the following improvements (numbers approximate):

  • 35-65% faster execution time (depending on the size of the data set)
  • 60% fewer SQL calls
  • 23% less garbage collection
  • Bulk inserts
  • Streaming serialization

Introducing NDJSON also allowed GitLab to process imports and exports with approximately constant memory use regardless of project size. Prior to this, memory usage for imports and exports grew as the project size increased. More details about the impact of NDJSON processing can be found in the Introduce .ndjson as a way to process import epic.
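
A minimal sketch of why NDJSON keeps memory roughly constant, assuming a hypothetical `persist` step for each record: every line is a self-contained JSON object, so records can be parsed and handled one at a time instead of loading the whole export into memory.

```ruby
require 'json'

# Hypothetical import step for a single record; stands in for the real work.
def persist(record)
  # ... write the record to the database ...
end

# NDJSON: one JSON object per line, parsed and handled one record at a time,
# so memory use stays roughly flat regardless of how many records there are.
def import_ndjson(path)
  File.foreach(path) do |line|
    persist(JSON.parse(line))
  end
end

# Contrast: a single monolithic JSON document has to be read and parsed in full
# before the first record can be handled, so memory grows with project size.
def import_monolithic(path)
  JSON.parse(File.read(path)).each { |record| persist(record) }
end
```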

Implement Shared Runner Minute Factors

In this MVC we implemented the ability to track accumulated compute minutes on shared runners. This helped to build some of the framework for a proposed feature in a future release.

Impact - Ability to track and configure limits

The MVC we built allows for configuring limits and cost factors for shared and public runners. This enabled additional shared runner types, such as Windows, and made it possible for customers to purchase additional shared runner minutes if desired.
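
As a rough illustration of how a cost factor works, consumed minutes are simply the job duration scaled by the factor for the runner type that ran the job. The runner type names and factor values below are made up, not the shipped defaults.

```ruby
# Illustrative only: runner types and cost factors below are invented.
COST_FACTORS = {
  'linux-shared'   => 1.0,
  'windows-shared' => 2.0 # hypothetical: a costlier runner type burns minutes faster
}.freeze

def consumed_minutes(duration_seconds, runner_type)
  (duration_seconds / 60.0) * COST_FACTORS.fetch(runner_type)
end

consumed_minutes(600, 'linux-shared')   # => 10.0 minutes
consumed_minutes(600, 'windows-shared') # => 20.0 minutes
```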

Improving the PipelineProcessService

We identified that the PipelineProcessService was executed multiple times, resulting in duplicate jobs and SQL queries, among other excessive resource usage. In this issue we identified and implemented steps to improve the PipelineProcessService.
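
One generic way to avoid re-running identical work, sketched below with a Sidekiq client middleware and a short-lived Redis key, is to drop a job when an identical one is already pending. This only illustrates the deduplication idea and is not the approach taken in the GitLab codebase; the middleware name and the 60-second window are assumptions.

```ruby
require 'sidekiq'
require 'digest'
require 'json'

# Generic illustration of job deduplication: if an identical job is already
# pending, pushing another copy adds no new work, so the client drops it.
class DeduplicatingClientMiddleware
  TTL_SECONDS = 60 # assumption; a real implementation clears the key when the job starts

  def call(worker_class, job, queue, _redis_pool)
    key = "dedup:#{worker_class}:#{Digest::SHA256.hexdigest(job['args'].to_json)}"
    first = Sidekiq.redis { |redis| redis.set(key, 1, nx: true, ex: TTL_SECONDS) }
    return false unless first # not yielding stops the push of the duplicate

    yield
  end
end

Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add DeduplicatingClientMiddleware
  end
end
```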

Impact - Reduction in duration and resources

The improvements to our PipelineProcessService not only reduced the duration of pipeline execution, thereby tightening the feedback loop, they also decreased the CPU load. A comprehensive overview of the positive impact of these changes can be found in the rollout issue. Included in this overview:

  • Average duration per pipeline_id for PipelineProcessWorker
  • Average duration percentile
  • Max duration percentile
  • DB duration
  • CPUs for Pipeline Workers

Improving non-performant API Endpoints

Testing identified API endpoints that were failing due to high memory usage or were too CPU intensive. Due to the high priority and severity of these issues, the Application Performance group focused on initial improvements to the BlameController and BlobController to reduce the severity of these endpoints to acceptable levels and then reassign them to feature teams.
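
A common way to bring such endpoints under control is to bound the amount of work a single request can trigger. The sketch below is purely hypothetical (the parameter names and the `MAX_BLAME_RANGE` limit are invented) and only illustrates the idea of computing blame for a capped line range rather than the whole file.

```ruby
# Hypothetical sketch: cap the number of lines a single blame request can
# touch, so memory and CPU no longer grow with the size of the file.
MAX_BLAME_RANGE = 1_000

def blame_range(params, total_lines)
  first = params.fetch(:range_start, 1).to_i.clamp(1, total_lines)
  last  = params.fetch(:range_end, first + MAX_BLAME_RANGE - 1).to_i
  last  = [last, first + MAX_BLAME_RANGE - 1, total_lines].min
  first..last
end

blame_range({}, 250_000)                                       # => 1..1000
blame_range({ range_start: 5_000, range_end: 5_200 }, 250_000) # => 5000..5200
```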

Impact - Greatly improved response times

Telemetry: Gathering operational metrics and topology data

This effort focused on the memory usage of our self-managed instances. Prior to this implementation we only had anecdotal information on how and where memory was allocated for our customers in the field. The issues and merge requests within this epic allowed us to collect and analyze memory usage of our self-managed installations.
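
As a minimal sketch of the kind of measurement involved, assuming a Linux host and an invented payload shape (not the actual Usage Ping schema), the resident set size of a process can be sampled from /proc and folded into a report the instance sends back.

```ruby
require 'json'
require 'time'

# Minimal sketch, assuming Linux: /proc/self/status has a "VmRSS:   123456 kB" line.
def current_rss_bytes
  vmrss = File.readlines('/proc/self/status').find { |l| l.start_with?('VmRSS:') }
  vmrss ? vmrss.split[1].to_i * 1024 : nil
end

# Invented payload shape and key names, for illustration only.
telemetry_payload = {
  recorded_at: Time.now.utc.iso8601,
  memory: { process_rss_bytes: current_rss_bytes }
}

puts JSON.pretty_generate(telemetry_payload)
```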

Impact - Tracking of self-managed resource usage

This allowed the Application Performance group to validate goals for their North Star Metric. We are now also able to measure multiple metrics for self-managed instances reporting telemetry on our Enablement::Memory Tableau Dashboard.

Dynamic Image Scaling

Prior to this improvement, all images displayed by the GitLab application were delivered at the size at which they were uploaded, leaving it to the browser rendering engine to scale them down as necessary. This meant serving megabytes of image data in a single page load, just so the frontend would throw most of it away. We determined that the first iteration would focus on avatars since they made up the vast majority of the use cases. The Application Performance group implemented a solution to dynamically resize avatars.
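
To illustrate the idea, here is a sketch (not the production implementation) that only produces avatar variants for an allow-listed set of widths, using MiniMagick for the resize step; the width list, file naming, and helper name are assumptions.

```ruby
require 'mini_magick' # assumption: MiniMagick is available for the resize step

# Hypothetical width allow-list; the real set of rendered avatar sizes differs.
ALLOWED_AVATAR_WIDTHS = [16, 24, 32, 48, 64, 96].freeze

# Return a path to a cached variant no wider than `width`; any width that is
# not on the allow-list falls back to the original upload.
def scaled_avatar_path(original_path, width)
  return original_path unless ALLOWED_AVATAR_WIDTHS.include?(width)

  variant_path = "#{original_path}.w#{width}"
  unless File.exist?(variant_path)
    image = MiniMagick::Image.open(original_path)
    image.resize("#{width}x#{width}") # keeps aspect ratio, caps the dimensions
    image.write(variant_path)
  end
  variant_path
end
```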

Impact - Greatly reduced image transfer sizes

We shrank image transfers by 93%. The details can be read in our blog post Scaling down: How we shrank image transfers by 93%.

Provide the means to handle N+1 CACHE SQL calls

In %13.1 the Application Performance group added instrumentation to log various SQL requests from Sidekiq. Based on the data we collected, we set about detecting potential N+1 CACHED SQL calls, documented why these queries are considered harmful from a memory perspective, reduced several cached calls, and provided guidelines to enable developers to continue these efforts going forward.
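
For illustration, Rails tags queries answered from the per-request query cache in its `sql.active_record` notifications, so repeated cached statements can be counted. The sketch below is a simplified stand-in for the instrumentation described above; the threshold mentioned in the comment is arbitrary.

```ruby
require 'active_support/notifications'

# Count queries answered from Active Record's query cache. Many repeats of the
# same cached statement usually point at an N+1 pattern that the cache is
# absorbing while still spending CPU and allocations on every call.
cached_counts = Hash.new(0)

ActiveSupport::Notifications.subscribe('sql.active_record') do |_name, _started, _finished, _id, payload|
  cached_counts[payload[:sql]] += 1 if payload[:cached]
end

# After a request or Sidekiq job finishes, flag repeated offenders
# (the threshold of 5 is arbitrary):
# cached_counts.select { |_sql, count| count > 5 }
```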

Impact - Reduction in transactions per second

We saw a 10% reduction in transactions per second (TPS) across the fleet, as measured by Postgres. More details and links to the supporting Thanos query (internally accessible) can be found in this issue.

Working Groups

Members of the Application Performance group have been involved in working groups to provide guidance and consultation on performance and memory consumption. The working groups we have participated in are listed below: