# Code Review database size reduction for GitLab.com
| Status | Authors | Coach | DRIs | Owning Stage | Created |
|---|---|---|---|---|---|
| proposed | zhaochen.li | francoisrose | phikai, patrickbajao, dskim_gitlab | devops create | 2026-04-15 |
## Summary
The Code Review group owns several of the largest tables on GitLab.com, including
`notes` (3,054 GB), `events` (2,371 GB), `merge_requests` (787 GB), and
`merge_request_diffs` (389 GB). Together these tables consume over 6,600 GB of
primary database storage and continue to grow, particularly as AI tooling
accelerates record creation.
This document proposes approaches to reduce the on-disk size of these tables by approximately 50% through a combination of strategies: clearing cached HTML fields for stale records, converting column types to more compact representations, removing redundant columns and indexes, decomposing position data into structured tables, enforcing retention policies, and reclaiming table bloat. The strategies are prioritized by estimated savings, implementation effort, and risk.
Related: Blueprint for Code Review database size reduction (&20233), Code Review database size reduction (#17571), Initial spike for database size reduction blueprint (#586185).
## Motivation
Large tables on GitLab.com are a major problem for both operations and development. As tables grow beyond hundreds of gigabytes, several problems compound:
- Query performance suffers. Larger tables increase index sizes, slow down sequential scans, and reduce buffer cache hit rates.
- Table maintenance becomes expensive. `VACUUM`, `ANALYZE`, and index rebuilds
  take longer and hold locks that affect application availability.
- Infrastructure costs increase. Storage, I/O, replication lag, and backup
  times all scale with on-disk size.
- Data migrations become complex. Schema changes on large tables require
  significantly more effort to implement and are more likely to cause stability
  problems on GitLab.com. For example, swapping `bigint` columns with `integer`
  columns on `merge_requests` had to be split into 3 stages and took several
  months to complete (#507695).
- Operational risk grows. Failovers and disaster recovery become slower and
  more fragile as data volume increases.
The Database Scalability blueprint (June 2021) established a target of keeping individual physical tables under 100 GB on GitLab.com. Nearly five years later, multiple Code Review tables on GitLab.com still exceed this threshold by 7x to 30x and have grown significantly since the original analysis. Without intervention, these tables will continue to grow as GitLab.com usage increases.
The tables over 100 GB on GitLab.com owned by or closely related to Code Review, as of January 2026, are:
| Table | Size |
|---|---|
| `merge_request_diff_commits` | 7,875 GB |
| `merge_request_diff_files` | 3,290 GB |
| `notes` | 3,156 GB |
| `events` | 2,371 GB |
| `merge_requests` | 787 GB |
| `merge_request_diffs` | 451 GB |
| `note_diff_files` | 170 GB |
| `approval_merge_request_rules_users` | 160 GB |
| `merge_request_metrics` | 140 GB |
This document focuses on the remaining large tables after excluding the items
listed in Non-Goals below: `notes`, `events`, `merge_requests`,
and `merge_request_diffs`.
## Goals
- Reduce the combined on-disk size of the `notes`, `merge_requests`,
  `merge_request_diffs`, and `events` tables by approximately 50%.
- Achieve the largest savings with the lowest-risk changes first (quick wins),
  then progress to larger structural changes.
- Maintain backward compatibility with existing application behavior, with no
  user-facing feature regressions.
- Establish repeatable patterns (for example, HTML cache clearing, retention
  policies) that other groups can adopt for their own large tables, such as
  `issues` and `work_items`.
- Update application and model code when needed to support larger structural
  changes (for example, table decomposition, column type conversions).
- Deliver changes incrementally across multiple milestones, with each change
  independently valuable.
## Non-Goals
This document does not cover the two largest Code Review tables,
`merge_request_diff_commits` and `merge_request_diff_files`. Those tables are
already being addressed by separate epics:

- Reduce the growth and size of the `merge_request_diff_commits` table (&16385)
- Partition and reduce size of `merge_request_diff_files` (&11272)
See Out-of-scope opportunities below for other opportunities identified during the investigation that are not initially in scope.
## Proposal
The table below summarizes all in-scope opportunities in the order we intend to pursue them, starting with the largest savings. We will iterate through these incrementally, prioritizing by effort-to-impact ratio so small-effort “quick wins” can be delivered in parallel with larger structural changes. A concrete priority order will be proposed in follow-up MRs. Detailed analysis for each opportunity follows in the per-table sections.
| Opportunity | Table | Effort | Savings |
|---|---|---|---|
| Clear `note_html` for stale MRs | `notes` | Large | 1,000 GB |
| Decompose system notes | `notes` | Large | 800 GB |
| Convert position columns to structured table | `notes` | Large | 200 GB |
| Clear `description_html` and `title_html` for stale MRs | `merge_requests` | Large | 150 GB |
| Retention policy on `merge_request_diffs`[^1] | `merge_request_diffs`, `merge_request_diff_commits`, `merge_request_diff_files` | Large | TBD (expected large) |
| Reclaim bloat (`pg_repack`) | `merge_requests` | Small | 123 GB |
| Convert SHA columns to `bytea` | `merge_request_diffs` | Small | 78 GB |
| Drop redundant noteable index | `notes` | Small | 63 GB |
| Drop `external_diff` column and index | `merge_request_diffs` | Small | 52 GB |
| Drop `updated_at` column | `events` | Small | 34 GB |
| Drop/convert `index_notes_on_line_code` | `notes` | Small | 34 GB |
| Remove `merge_params` for merged MRs | `merge_requests` | Small | 25 GB |
| Convert `index_notes_on_organization_id` to partial | `notes` | Small | 19 GB |
| Convert SHA columns to `bytea` | `merge_requests` | Small | 15 GB |
| Convert integer columns to smaller types | `merge_request_diffs` | Small | 10 GB |
| Convert `merge_status` to `smallint` | `merge_requests` | Small | 3.5 GB |
| Drop `assignee_id` column and index | `merge_requests` | Small | ~2.7 GB |
| **Total** | | | ~2,604 GB |

[^1]: Discussed in issue #594843 (comment). We expect savings to be large
    because a retention policy on `merge_request_diffs` would also reduce
    `merge_request_diff_commits` and `merge_request_diff_files`, but this
    overlaps with the separate epics already addressing those tables
    (epic &16385 and epic &11272) and needs to be coordinated there. A savings
    estimate should be produced as part of that coordination.
### Out-of-scope opportunities
The following opportunities were identified during the investigation but are not in scope for this design document. Each is documented here for visibility and future follow-up:
| Opportunity | Table | Effort | Savings |
|---|---|---|---|
| 90-day retention policy | `events` | Large | 1,800 GB |
| Partition `events` table | `events` | Medium | 0 GB (enabler) |
| Merge namespace columns | `events` | Large | 50 GB |
| Drop `st_diff` column | `notes` | Medium | 20 GB |
| Drop `confidential` column | `notes` | Small | ~0.1 GB |
These are out of scope for the following reasons:
- `events` table changes (90-day retention, partitioning, and merging
  namespace columns). These approaches are not yet mature enough to commit to
  in this document. Code Review co-owns the `events` table because it contains
  MR-related data, but many of the features that rely on this data are owned by
  other groups, so changes here require cross-team collaboration and further
  POC work to validate feasibility, impact, and retention semantics. The
  retention policy proposal (#571288) is the current starting point for that
  discussion.
- Drop `st_diff` column. Requires removing `LegacyDiffNote` handling across
  the application, which has a broader scope than a column-level change and is
  better tracked as a separate deprecation.
- Drop `confidential` column. Savings are negligible (~0.1 GB) and do not
  justify prioritizing this over larger opportunities. Can be finalized
  opportunistically alongside other `notes` changes.
The following sections provide detailed analysis for each opportunity, organized by table. Each includes context from the initial spike investigation.
## `notes` table (3,054 GB total: 2,246 GB columns + 808 GB indexes)

The `notes` table is the third largest table on GitLab.com. The top six columns
by size are `note_html`, `note`, `original_position`, `position`,
`discussion_id`, and `change_position`.
### Clear `note_html` for stale merge requests

- Estimated savings: 1,000 GB (44.5% of `notes` table)
- Effort: Large

`note_html` is a cached rendered version of the `note` field, generated by
`CacheMarkdownField`. It consumes 1,169 GB (52% of the table). For notes on
merge requests that have not been accessed recently, this cached value can be
cleared and regenerated on demand.
The approach:
- Define “stale” criteria: for merged MRs, for example `updated_at` older than
  3 months; for open MRs, `updated_at` older than 6 months.
- Run an async worker to set `note_html` to `NULL` for notes belonging to
  stale MRs.
- On read, if `note_html` is `NULL`, regenerate from `note` and persist back to
  the database. The existing `CacheMarkdownField` module already supports this
  pattern through `cached_markdown_version`.
- Benchmark the performance impact of on-the-fly regeneration for API endpoints
  that return many notes (for example, the merge request discussions API).
There have been past markdown cache version bumps (for example, 492f0853,
e7a98807), which effectively trigger the same regeneration, suggesting the
performance impact is manageable.

This pattern can later be applied to `description_html` and `title_html` on the
`merge_requests`, `issues`, and `work_items` tables. Nicolas Dular from the
Plan group has expressed support for this approach and noted it could also
benefit the `issues` table.
### Decompose system notes

- Estimated savings: 800 GB (35.6% of `notes` table)
- Effort: Large

Approximately 79% of `notes` rows are system-generated notes. Unlike
user-authored notes, system notes are mostly structured data rendered into full
text (for example, “added 3 commits”, “changed the description”, “mentioned in
!1234”). Instead of storing the full rendered text, we can store only the
structured parameters needed to reconstruct the message on the fly (for
example, action type, count, reference).
There are two possible approaches:
- Decompose into a new table. Move system notes into a dedicated
  `system_notes` table with structured columns for each action type. This
  reduces the effective size of both the original and new tables, improving
  query performance for each access pattern.
- Add structured columns to the existing table. Add columns for the structured
  parameters and clear the `note` and `note_html` text fields for system notes,
  avoiding the complexity of a table split while still reclaiming the storage.
### Convert `position`, `original_position`, and `change_position` to a structured table

- Estimated savings: 200 GB (3.9% of `notes` table)
- Effort: Large

These three columns store YAML-serialized position data for `DiffNote` records.
Currently each field is a YAML string consuming approximately 520 bytes per
row. Converting to a structured table (similar to the existing
`DiffNotePosition` model and its `diff_note_positions` table) reduces storage
to approximately 350 bytes per row.

Additionally, `position` holds exactly the same data as `original_position` for
approximately 2.28% of rows. This redundancy can be eliminated.

We investigated whether converting from YAML strings to `jsonb` would help, but
YAML strings actually use less space than `jsonb` due to TOAST compression.
The structured table approach provides the best savings.
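A rough, self-contained illustration of the per-row overhead: serializing a diff position as YAML repeats key names and hex-encoded SHAs in every row, while a structured row stores 20-byte binary SHAs and scalar columns. The field names below mirror typical diff position data, but the exact schema is an assumption:

```ruby
require "yaml"

# A representative diff-note position (hypothetical field set).
position = {
  "base_sha"      => "a" * 40,
  "start_sha"     => "b" * 40,
  "head_sha"      => "c" * 40,
  "old_path"      => "app/models/note.rb",
  "new_path"      => "app/models/note.rb",
  "position_type" => "text",
  "old_line"      => nil,
  "new_line"      => 42
}

yaml_size = YAML.dump(position).bytesize

# A structured row: three 20-byte bytea SHAs, the two paths, a smallint-like
# position type, and the line numbers (headers and padding ignored).
structured_size = (20 * 3) + position["old_path"].bytesize +
                  position["new_path"].bytesize + 2 + 8
```

The exact byte counts depend on path lengths, headers, and TOAST behavior, so this demonstrates only the direction of the saving, not the 520 → 350 figures above.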
### Drop redundant `noteable_id`/`noteable_type`/`system` index

- Estimated savings: 63 GB (2.1% of `notes` table)
- Effort: Small

The composite index `index_notes_on_noteable_id_and_noteable_type_and_system`
is 63 GB and has minimal usage. We need to evaluate whether queries can be
served by other existing indexes before removal.
### Drop or convert `index_notes_on_line_code` to partial

- Estimated savings: 34 GB (1.1% of `notes` table)
- Effort: Small

This index is 36 GB. Grafana metrics show it is seldom used (fewer than 0.8
scans per second, with occasional spikes). The only usage found is for
`LegacyDiffNote`, a legacy implementation (new diff notes are of type
`DiffNote`; mainly imports still create the legacy type). This index can be
converted to a partial index or dropped after confirming no active query paths
depend on it.
### Drop `st_diff` column and remove `LegacyDiffNote` type

- Estimated savings: 20 GB (0.7% of `notes` table)
- Effort: Medium

The `st_diff` field is only used by `LegacyDiffNote`. We can standardize the
legacy note type and remove this column, saving approximately 20 GB.
### Convert `index_notes_on_organization_id` to partial

- Estimated savings: 19 GB (0.6% of `notes` table)
- Effort: Small

This index is 19 GB but has near-zero usage over the past 90 days. The column
`organization_id` is almost entirely `NULL` (only 451 KB of actual data). There
is no actual usage in application code; the index exists solely for the
Cells/Organization sharding initiative. Converting to a partial index on
`WHERE organization_id IS NOT NULL` shrinks the index from 19 GB to near zero.
This requires confirmation from the Tenant Scale team.
### Drop `confidential` column (migrated to `internal`)

- Estimated savings: ~0.1 GB
- Effort: Small

The `confidential` column is a duplicate of `internal` after the migration in
Rename confidential column in notes tables (#367923). Finalize the migration by
dropping the column and the associated `index_notes_on_id_where_confidential`
index (22 MB).

**Total estimated savings for `notes`: ~1,700 GB (56%)**
## `merge_requests` table (787 GB total: 551 GB columns + 254 GB indexes)

The `merge_requests` table total size is 804 GB (including ~123 GB of
reclaimable bloat). The top columns by size are `description` and
`description_html` (251 GB, 45.6%), `title` and `title_html` (41 GB, 7.45%),
and `merge_params` (26 GB, 4.73%).
### Clear `description_html` and `title_html` for stale merge requests

- Estimated savings: 150 GB (19.9% of `merge_requests` table)
- Effort: Large

`description_html` consumes 160 GB and `title_html` consumes 27 GB. Both are
`CacheMarkdownField` caches, not sources of truth; only `description` and
`title` are authoritative. The same approach described for `note_html` above
applies here.
Reclaim table bloat (pg_repack)
- Estimated savings: 123 GB (18.8% of
merge_requestsphysical size) - Effort: Small (requires DB team coordination)
Analysis shows a 151 GB difference between the physical table size (551 GB) and
the actual column data (400 GB). This is attributed to: table bloat (~123 GB of
reclaimable dead tuples), row metadata (~11 GB), and alignment padding and page
headers (~17 GB). The bloat is likely caused by the bigint migration and
description updates. Running pg_repack or VACUUM FULL can reclaim this space,
coordinated with the Database team for production execution.
### Remove `merge_params` for merged merge requests

- Estimated savings: 25 GB (3.8% of `merge_requests` table)
- Effort: Small

`merge_params` contains highly repetitive data. For example,
`force_remove_source_branch: '0'` is the default behavior for any MR, yet it is
persisted for every row. After an MR is merged, `merge_params` is no longer
needed by the application. Most of this data is also available in Gitaly if
needed later.

Discussion with the Code Review backend team on Slack confirmed there is no
known usage of `merge_params` after the MR is merged. We can run an async
worker daily or weekly to clear `merge_params` for MRs merged more than 7 days
ago, reducing the column from 26 GB to under 1 GB. The `merge_params` field
could also be converted to `jsonb` if we choose not to decompose it into a
separate table.
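The clearing rule for that worker reduces to a simple predicate. A sketch, where the method name is hypothetical and the worker plumbing is omitted:

```ruby
# Hedged sketch: merge_params becomes clearable once an MR has been merged for
# more than 7 days (the grace period proposed above).
CLEAR_AFTER_DAYS = 7

def merge_params_clearable?(state:, merged_at:, now: Time.now)
  state == :merged &&
    !merged_at.nil? &&
    (now - merged_at) > CLEAR_AFTER_DAYS * 24 * 3600
end
```

In the worker this predicate would translate into a batched `WHERE` clause rather than per-row Ruby checks.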
### Convert SHA columns to `bytea`

- Estimated savings: 15 GB (1.9% of `merge_requests` table)
- Effort: Small

Three SHA fields are stored as `character varying`, which uses a hex-encoded
text representation:

- `squash_commit_sha`: the `varchar` field takes 42 bytes, `bytea` takes
  20 bytes. 65.7M rows x 22 bytes saved = 1.31 GB.
- `merge_commit_sha`: the `varchar` field takes 42 bytes, `bytea` takes
  20 bytes. 272.3M rows x 22 bytes saved = 5.99 GB.
- `merged_commit_sha`: the `varchar` field takes 82 bytes (double-encoded),
  `bytea` takes 20 bytes. 121.4M rows x 62 bytes saved = 7.53 GB.
This standardizes SHA storage. The existing `in_progress_merge_commit_sha`
column already uses `bytea`, so there is precedent in this table.
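The per-column estimates follow from simple arithmetic: a 40-character hex SHA costs roughly 42 bytes as `varchar` (including its header), versus 20 bytes of raw binary as `bytea` (ignoring the per-value `bytea` header). A quick check, using decimal gigabytes:

```ruby
# Back-of-the-envelope check of the savings figures above. Row counts and
# per-row byte costs are the ones quoted in this section; GB means decimal
# gigabytes (10^9 bytes).
def sha_savings_gb(rows:, varchar_bytes:, bytea_bytes: 20)
  (rows * (varchar_bytes - bytea_bytes)) / 1_000_000_000.0
end

sha_savings_gb(rows: 272_300_000, varchar_bytes: 42) # merge_commit_sha:  ~5.99 GB
sha_savings_gb(rows: 121_400_000, varchar_bytes: 82) # merged_commit_sha: ~7.53 GB
```

The same arithmetic applies to the `merge_request_diffs` SHA columns below, with index sizes shrinking roughly proportionally.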
### Convert `merge_status` from `varchar` to `smallint`

- Estimated savings: 3.5 GB (0.5% of `merge_requests` table)
- Effort: Small

`merge_status` is defined as `character varying(510)` but only stores 7
possible enum values (`unchecked`, `preparing`, `checking`, `can_be_merged`,
`cannot_be_merged`, `cannot_be_merged_recheck`, `cannot_be_merged_rechecking`).
Each row consumes approximately 11 bytes. Converting to `smallint` (2 bytes)
with a Rails enum mapping saves approximately 10 bytes per row, with no change
above the model layer.
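A sketch of the status mapping; the integer assignments below are hypothetical and would need to be fixed before any backfill:

```ruby
# Hypothetical smallint mapping for merge_status. In a Rails model this would
# be declared with `enum`, so code above the model layer keeps using the
# string names unchanged.
MERGE_STATUSES = {
  "unchecked"                   => 0,
  "preparing"                   => 1,
  "checking"                    => 2,
  "can_be_merged"               => 3,
  "cannot_be_merged"            => 4,
  "cannot_be_merged_recheck"    => 5,
  "cannot_be_merged_rechecking" => 6
}.freeze

def merge_status_to_smallint(status)
  MERGE_STATUSES.fetch(status)
end
```

The mapping must be append-only once deployed: reassigning integers after a backfill would corrupt existing rows.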
### Drop legacy `assignee_id` column

- Estimated savings: ~2.7 GB (column + index)
- Effort: Small

The `assignee_id` column has been replaced by the `merge_request_assignees`
association. The column itself is only 43 MB, but the associated index
`index_merge_requests_on_assignee_id` is 2.62 GB. Both can be dropped after
confirming the deprecation is complete.

**Total estimated savings for `merge_requests`: ~311.5 GB (47.6%)**
## `merge_request_diffs` table (389 GB total: 203 GB columns + 186 GB indexes)

### Drop `external_diff` column and index

- Estimated savings: 52 GB (13.4% of `merge_request_diffs` table)
- Effort: Small

The `external_diff` column is no longer populated; it was previously managed by
CarrierWave. The column and its associated index
`index_merge_request_diffs_on_external_diff` (14 GB) can be removed, saving
approximately 52 GB in total.
### Convert three SHA columns to `bytea`

- Estimated savings: 78 GB (20.1% of `merge_request_diffs` table)
- Effort: Small

`base_commit_sha`, `start_commit_sha`, and `head_commit_sha` can be converted
from `character varying` to `bytea`, following the same approach as the
`merge_requests` SHA columns. Each index on these columns also shrinks by
approximately one-third.
### Convert `real_size`, `state`, `external_diff_store`, and `commits_count` to smaller integer types

- Estimated savings: 10 GB (2.6% of `merge_request_diffs` table)
- Effort: Small

These columns currently use larger integer types than necessary. Converting to
1-byte or 2-byte integer types where the value range permits saves
approximately 10 GB.

**Total estimated savings for `merge_request_diffs`: ~140 GB (36%)**
## `events` table (2,371 GB)

The `events` table has a clean schema with limited optimization potential at
the column or index level. The table definition is well-designed, and the
content is well-structured in terms of what events we store. The primary
savings opportunities come from data lifecycle management.
### Drop `updated_at` column

- Estimated savings: 34 GB (1.4% of `events` table)
- Effort: Small

Events are append-only and immutable. Analysis shows only 0.02% of rows have
different `created_at` and `updated_at` values, and most of those differ by
only nanoseconds or milliseconds. Deeper investigation by Abdul Wadood
confirmed that rows where `updated_at` and `created_at` differ by more than 10
seconds have not occurred in the last year (the last such rows are from 2024).

There is no index on `updated_at`, which suggests it is not actively used for
queries. However, as Shane Maglangit noted, the absence of an index does not
definitively prove the column is unused (for example, `namespaces.updated_at`
is heavily used but has no index). We should double-check application code
before acting. If needed for backward compatibility, we can alias `updated_at`
to `created_at` in Rails.
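The backward-compatibility alias amounts to answering `updated_at` with `created_at`. In Rails this would be `alias_attribute :updated_at, :created_at` on the model; a plain-Ruby equivalent, for illustration only:

```ruby
# Hypothetical stand-in for the Event model, showing the alias idea without
# ActiveRecord: after the physical updated_at column is dropped, readers of
# updated_at transparently receive created_at.
class EventRecord
  attr_reader :created_at

  def initialize(created_at:)
    @created_at = created_at
  end

  alias_method :updated_at, :created_at
end
```

Serializers and API responses would keep returning an `updated_at` value, so clients that read the field see no change.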
### Merge `project_id`, `group_id`, and `personal_namespace_id` into `namespace_id`

- Estimated savings: 50 GB (2.1% of `events` table)
- Effort: Large

The `events` table stores three separate columns for the owning entity. These
could potentially be merged into a single `namespace_id` column. However,
merging these columns would require joining with the `namespaces` table to find
events (since `projects` is a different table from `namespaces`), which would
slow down already slow event queries. This needs careful benchmarking before
proceeding.
### 90-day retention policy

- Estimated savings: 1,800 GB (75.9% of `events` table)
- Effort: Large

This is the single largest opportunity across all tables. Both GitHub and Azure
DevOps offer 90-day event retention. A similar policy would dramatically reduce
the `events` table size. Christina Lohr (@lohrc) has recommended a 90-day
retention period, noting that `events` is essentially a duplicate of data that
can be found in, or reconstructed from, other tables.
This strategy requires:
- Product input on acceptable retention periods.
- Evaluation of whether historical events can be reconstructed from other data
  sources. Notably, `push_event_payloads` is a highly entangled table that may
  not be reconstructable, and we need to consider whether we can delete events
  for open issues and MRs.
- Time-based partitioning of the `events` table to enable efficient partition
  dropping rather than row-by-row deletion (the time-decay data pattern
  describes this approach in detail).
- Archiving events older than 90 days to object storage or a data warehouse for
  compliance and analytics use cases.
One additional idea: identify bot and automation actions and treat them differently (fewer fields to store, or not saving to DB at all). This could significantly reduce the table’s growth rate, especially as AI tooling increases record creation. This needs Product input first.
### Table partitioning

- Estimated savings: 0 GB direct (enables retention and improves maintenance)
- Effort: Medium

Even without a retention policy, partitioning the `events` table by
`created_at` improves query performance, vacuum efficiency, and maintenance
operations. It is a prerequisite for implementing an efficient retention
policy.
Kerri Miller noted that a blend of approaches will serve best in the long term, and that partitioning might be appropriate even with a short retention period, as we can use time-based partitions and drop the oldest one every 90 days.
**Total estimated savings for `events`: ~1,884 GB (if retention is adopted)**
## Design and implementation details
TBD
## Alternative Solutions
TBD
