Automated Query Analysis
Status | Authors | Coach | DRIs | Owning Stage | Created |
---|---|---|---|---|---|
proposed |
mattkasa
jon_jenkins
|
DylanGriffith
|
rogerwoo
alexives
|
devops data stores | 2023-02-08 |
Problem Summary
Our overarching goal is to improve the reliability and throughput of the GitLab database review process. The current process requires merge request authors to manually provide query plans and raw SQL when introducing new queries or updating existing queries. This is both time consuming and error prone.
We believe we can improve operational efficiency by automatically identifying and analyzing newly introduced SQL queries. This will reduce the risk of human error, leading to improved system stability and an overall reduction in performance regressions.
Our key success metric is a reduction in the number of manual actions required by both code contributors and database reviewers, while maintaining a consistent standard for database related code contributions.
Goals
- Replace the current process of the author manually obtaining SQL and query plans with an automated process.
- Decrease the incidence of performance regressions when poorly performing queries are missed by a manual process.
- Increase contributor and reviewer efficiency by automating the query testing portion of database review.
Challenges
- Capturing the number of SQL queries generated by an application the size of
gitlab-org/gitlab
without causing an increase in CI time and/or resources may present a challenge. - Storing the number of SQL queries generated by an application the size of
gitlab-org/gitlab
may consume large amounts of database storage.
Opportunity
- Automated test suites already generate a large number of SQL queries, for
instance
rspec
test suites, that can be captured and used to perform automated analysis. - We already utilize
postgres.ai
to analyze query performance, and it has an API that will allow us to automate the creation of database clones with realistic production-like data in order to perform automated analysis. - For customers who do not use something like
postgres.ai
, but who are connecting to a test database in CI, we would use this connection to generate query plans. The accuracy of these query plans will be affected by how realistic the test data is, and can be improved by seeding the test database with production-like data. - By storing queries and their query plans, we can tokenize the query plan into plan components, assign a cost and weight, then match those against a machine learning model. We can build this model by generating query plans for queries in our slow query logs, and assign actual cost and weight to their plan components. This will allow us to leverage our corpus of queries and slow query logs to predict the performance of arbitrary query text for other applications and our customers.
Proposal
We plan to automate the process of identifying new and changed database queries, so that contributors and reviewers can more accurately and efficiently assess the database performance impact of a code change.
We will capture queries generated as a side effect of running tests in CI, normalize them, deduplicate them, analyze them using one or more analyzers, and store them with their analyses and other metadata for future retrieval and comparison.
We will post a comment to the originating merge request, containing a summary of the new and changed queries, with links to their analyses, and highlighting any queries that exceed established timing or other performance guidelines.
Design and implementation details
Iteration 1
In the first iteration we will focus on how we capture queries, including normalization, deduplication, and storage. We must consider the performance and resource impacts on CI pipelines during capture, and include things like partitioning and time decay for the information we are storing.
Capturing queries
We will strive to limit the time and resource impacts on our CI pipelines as much as possible. These are some of the options we will consider for capturing queries:
- Instrumenting
ActiveRecord
inruby
- Challenges:
- Only applies to
ruby
projects so it would not be applicable to projects likecontainer-registry
. - Has a non-zero impact on time and resources in CI pipelines (these impacts can be observed in !111638)
- Only applies to
- Opportunities:
- Simple and straightforward to implement.
- Allows access to more information (eg. stacktrace and calling locations).
- Challenges:
- Connection proxy with logging
- Challenges:
- Adds complexity and possible performance overhead.
- Requires maintaining the code for the proxy.
- Opportunities:
- Allows us to customize the capture.
- Allows us to perform normalization/deduplication at capture time.
- Challenges:
- Built-in logging in
postgresql
- Challenges:
- Require adding a configuration to enable logging.
- May be difficult to obtain the resulting logs.
- Opportunities:
- Doesn’t require maintaining any code.
- Light weight in terms of performance impact.
- Challenges:
- Capture from
pg_stat_statements
- Challenges:
- Requires creating the extension in the test database.
- Requires adding a configuration to set
pg_stat_statements.max
to a value high enough to capture all queries. - Consumes shared memory proportional to
pg_stat_statements.max
.
- Opportunities:
- Requires minimal code.
- Simple to obtain the data.
- Data is already normalized.
- Challenges:
We have already built a proof of concept for instrumenting ActiveRecord
in
ruby
in
!111638, so as a
first step we will benchmark the other capture methods against it and select the
best option.
Storing queries
For the next step of the first iteration we will use the proof of concept in
!111638 as well
as any data gathered from testing other capture methods to estimate the number
of rows per project, and use the pipeline execution statistics for
gitlab-org/gitlab
to estimate throughput. These estimates will allow us to
evaluate storage mechanisms that are suitable for our purpose.
Some of the storage mechanisms we plan to evaluate are:
- In the
ci
database in the GitLab database instance- Challenges:
- Places additional strain on this resource for
GitLab.com
.
- Places additional strain on this resource for
- Opportunities:
- Allows us to utilize existing authentication and access control in the form of
CI_JOB_TOKEN
. - Allows us to leverage associations with
ci_builds
andci_pipelines
. - Simplifies deployment for self-managed.
- Allows us to utilize existing authentication and access control in the form of
- Challenges:
- In a new decomposed database in the GitLab database instance
- Challenges:
- Adds to required development and testing effort.
- Adds to deployment effort for
GitLab.com
.
- Opportunities:
- Isolates database performance impacts from the existing
main
andci
database instances.
- Isolates database performance impacts from the existing
- Challenges:
- In a new external service
- Challenges:
- Adds to required development and testing effort.
- Adds to deployment effort for
GitLab.com
and for self-managed.
- Opportunities:
- Isolates performance impacts from
gitlab-org/gitlab
. - Allows us to iterate faster without impacting the main application.
- Isolates performance impacts from
- Challenges:
- In ClickHouse
- Challenges:
- Not yet available for self-managed.
- Opportunities:
- Isolates database performance impacts from the existing
main
andci
database instances.
- Isolates database performance impacts from the existing
- Challenges:
An example database schema for storing queries:
CREATE TABLE queries (
created_at timestamp with time zone NOT NULL,
updated_at timestamp with time zone NOT NULL,
id bigint NOT NULL,
project_id bigint NOT NULL,
analysis_id bigint,
hash text,
sql text
);
CREATE TABLE pipeline_queries (
id bigint NOT NULL,
project_id bigint NOT NULL,
pipeline_id bigint NOT NULL,
query_id bigint NOT NULL
);
CREATE TABLE analyses (
created_at timestamp with time zone NOT NULL,
updated_at timestamp with time zone NOT NULL,
id bigint NOT NULL,
project_id bigint NOT NULL,
query_id bigint NOT NULL,
buffers int,
walltime int,
explain text,
analysis_url text
);
One possible method of partitioning a schema like the above example would be to
utilize
sub-partitioning.
If we partition by project_id
then by some interval of updated_at
, and touch
the row when we see a query, we can store only queries that the codebase is
still executing, and prune partitions that only contain queries the code is no
longer generating.
Iteration 2
In the second iteration we plan to identify new and changed queries, and post MR comments containing a summary. We will begin soliciting feedback on the accuracy and utility of the information, and improve or filter it to maximize it’s usefulness.
Iteration 3+
In the third and following iterations we plan to automate query analysis using one or more analyzers, store these analyses, and add them to the MR comments. We also intend to re-evaluate our use of the database to store query information, and the API to retrieve it, and potentially move this to an external service.
55741fb9
)