Automated Query Analysis

This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. The development, release, and timing of any products, features, or functionality may be subject to change or delay and remain at the sole discretion of GitLab Inc.
Status Authors Coach DRIs Owning Stage Created
proposed mattkasa jon_jenkins DylanGriffith rogerwoo alexives devops data stores 2023-02-08

Problem Summary

Our overarching goal is to improve the reliability and throughput of the GitLab database review process. The current process requires merge request authors to manually provide query plans and raw SQL when introducing new queries or updating existing queries. This is both time consuming and error prone.

We believe we can improve operational efficiency by automatically identifying and analyzing newly introduced SQL queries. This will reduce the risk of human error, leading to improved system stability and an overall reduction in performance regressions.

Our key success metric is a reduction in the number of manual actions required by both code contributors and database reviewers, while maintaining a consistent standard for database related code contributions.

Goals

  1. Replace the current process of the author manually obtaining SQL and query plans with an automated process.
  2. Decrease the incidence of performance regressions when poorly performing queries are missed by a manual process.
  3. Increase contributor and reviewer efficiency by automating the query testing portion of database review.

Challenges

  • Capturing the number of SQL queries generated by an application the size of gitlab-org/gitlab without causing an increase in CI time and/or resources may present a challenge.
  • Storing the number of SQL queries generated by an application the size of gitlab-org/gitlab may consume large amounts of database storage.

Opportunity

  • Automated test suites already generate a large number of SQL queries, for instance rspec test suites, that can be captured and used to perform automated analysis.
  • We already utilize postgres.ai to analyze query performance, and it has an API that will allow us to automate the creation of database clones with realistic production-like data in order to perform automated analysis.
  • For customers who do not use something like postgres.ai, but who are connecting to a test database in CI, we would use this connection to generate query plans. The accuracy of these query plans will be affected by how realistic the test data is, and can be improved by seeding the test database with production-like data.
  • By storing queries and their query plans, we can tokenize the query plan into plan components, assign a cost and weight, then match those against a machine learning model. We can build this model by generating query plans for queries in our slow query logs, and assign actual cost and weight to their plan components. This will allow us to leverage our corpus of queries and slow query logs to predict the performance of arbitrary query text for other applications and our customers.

Proposal

We plan to automate the process of identifying new and changed database queries, so that contributors and reviewers can more accurately and efficiently assess the database performance impact of a code change.

We will capture queries generated as a side effect of running tests in CI, normalize them, deduplicate them, analyze them using one or more analyzers, and store them with their analyses and other metadata for future retrieval and comparison.

We will post a comment to the originating merge request, containing a summary of the new and changed queries, with links to their analyses, and highlighting any queries that exceed established timing or other performance guidelines.

Design and implementation details

Iteration 1

In the first iteration we will focus on how we capture queries, including normalization, deduplication, and storage. We must consider the performance and resource impacts on CI pipelines during capture, and include things like partitioning and time decay for the information we are storing.

Capturing queries

We will strive to limit the time and resource impacts on our CI pipelines as much as possible. These are some of the options we will consider for capturing queries:

  • Instrumenting ActiveRecord in ruby
    • Challenges:
      • Only applies to ruby projects so it would not be applicable to projects like container-registry.
      • Has a non-zero impact on time and resources in CI pipelines (these impacts can be observed in !111638)
    • Opportunities:
      • Simple and straightforward to implement.
      • Allows access to more information (eg. stacktrace and calling locations).
  • Connection proxy with logging
    • Challenges:
      • Adds complexity and possible performance overhead.
      • Requires maintaining the code for the proxy.
    • Opportunities:
      • Allows us to customize the capture.
      • Allows us to perform normalization/deduplication at capture time.
  • Built-in logging in postgresql
    • Challenges:
      • Require adding a configuration to enable logging.
      • May be difficult to obtain the resulting logs.
    • Opportunities:
      • Doesn’t require maintaining any code.
      • Light weight in terms of performance impact.
  • Capture from pg_stat_statements
    • Challenges:
      • Requires creating the extension in the test database.
      • Requires adding a configuration to set pg_stat_statements.max to a value high enough to capture all queries.
      • Consumes shared memory proportional to pg_stat_statements.max.
    • Opportunities:
      • Requires minimal code.
      • Simple to obtain the data.
      • Data is already normalized.

We have already built a proof of concept for instrumenting ActiveRecord in ruby in !111638, so as a first step we will benchmark the other capture methods against it and select the best option.

Storing queries

For the next step of the first iteration we will use the proof of concept in !111638 as well as any data gathered from testing other capture methods to estimate the number of rows per project, and use the pipeline execution statistics for gitlab-org/gitlab to estimate throughput. These estimates will allow us to evaluate storage mechanisms that are suitable for our purpose.

Some of the storage mechanisms we plan to evaluate are:

  • In the ci database in the GitLab database instance
    • Challenges:
      • Places additional strain on this resource for GitLab.com.
    • Opportunities:
      • Allows us to utilize existing authentication and access control in the form of CI_JOB_TOKEN.
      • Allows us to leverage associations with ci_builds and ci_pipelines.
      • Simplifies deployment for self-managed.
  • In a new decomposed database in the GitLab database instance
    • Challenges:
      • Adds to required development and testing effort.
      • Adds to deployment effort for GitLab.com.
    • Opportunities:
      • Isolates database performance impacts from the existing main and ci database instances.
  • In a new external service
    • Challenges:
      • Adds to required development and testing effort.
      • Adds to deployment effort for GitLab.com and for self-managed.
    • Opportunities:
      • Isolates performance impacts from gitlab-org/gitlab.
      • Allows us to iterate faster without impacting the main application.
  • In ClickHouse
    • Challenges:
      • Not yet available for self-managed.
    • Opportunities:
      • Isolates database performance impacts from the existing main and ci database instances.

An example database schema for storing queries:

CREATE TABLE queries (
    created_at timestamp with time zone NOT NULL,
    updated_at timestamp with time zone NOT NULL,
    id bigint NOT NULL,
    project_id bigint NOT NULL,
    analysis_id bigint,
    hash text,
    sql text
);
CREATE TABLE pipeline_queries (
    id bigint NOT NULL,
    project_id bigint NOT NULL,
    pipeline_id bigint NOT NULL,
    query_id bigint NOT NULL
);
CREATE TABLE analyses (
    created_at timestamp with time zone NOT NULL,
    updated_at timestamp with time zone NOT NULL,
    id bigint NOT NULL,
    project_id bigint NOT NULL,
    query_id bigint NOT NULL,
    buffers int,
    walltime int,
    explain text,
    analysis_url text
);

One possible method of partitioning a schema like the above example would be to utilize sub-partitioning. If we partition by project_id then by some interval of updated_at, and touch the row when we see a query, we can store only queries that the codebase is still executing, and prune partitions that only contain queries the code is no longer generating.

Iteration 2

In the second iteration we plan to identify new and changed queries, and post MR comments containing a summary. We will begin soliciting feedback on the accuracy and utility of the information, and improve or filter it to maximize it’s usefulness.

Iteration 3+

In the third and following iterations we plan to automate query analysis using one or more analyzers, store these analyses, and add them to the MR comments. We also intend to re-evaluate our use of the database to store query information, and the API to retrieve it, and potentially move this to an external service.