Code Review database size reduction for GitLab.com

This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. The development, release, and timing of any products, features, or functionality may be subject to change or delay and remain at the sole discretion of GitLab Inc.
Status: proposed
Authors: zhaochen.li
Coach: francoisrose
DRIs: phikai, patrickbajao, dskim_gitlab
Owning Stage: devops create
Created: 2026-04-15

Summary

The Code Review group owns several of the largest tables on GitLab.com, including notes (3,054 GB), events (2,371 GB), merge_requests (787 GB), and merge_request_diffs (389 GB). Together these tables consume over 6,600 GB of primary database storage and continue to grow, particularly as AI tooling accelerates record creation.

This document proposes approaches to reduce the on-disk size of these tables by approximately 50% through a combination of strategies: clearing cached HTML fields for stale records, converting column types to more compact representations, removing redundant columns and indexes, decomposing position data into structured tables, enforcing retention policies, and reclaiming table bloat. The strategies are prioritized by estimated savings, implementation effort, and risk.

Related: Blueprint for Code Review database size reduction (&20233), Code Review database size reduction (#17571), Initial spike for database size reduction blueprint (#586185).

Motivation

Large tables on GitLab.com are a major problem for both operations and development. As tables grow beyond hundreds of gigabytes, several problems compound:

  1. Query performance suffers. Larger tables increase index sizes, slow down sequential scans, and reduce buffer cache hit rates.
  2. Table maintenance becomes expensive. VACUUM, ANALYZE, and index rebuilds take longer and hold locks that affect application availability.
  3. Infrastructure costs increase. Storage, I/O, replication lag, and backup times all scale with on-disk size.
  4. Data migrations become complex. Schema changes on large tables require significantly more effort to implement and are more likely to cause stability problems on GitLab.com. For example, swapping bigint columns with integer columns on merge_requests had to be split into 3 stages and took several months to complete (#507695).
  5. Operational risk grows. Failovers and disaster recovery become slower and more fragile as data volume increases.

The Database Scalability blueprint (June 2021) established a target of keeping individual physical tables under 100 GB on GitLab.com. Nearly five years later, multiple Code Review tables on GitLab.com still exceed this threshold by 7x to 30x and have grown significantly since the original analysis. Without intervention, these tables will continue to grow as GitLab.com usage increases.

The tables over 100 GB on GitLab.com owned by or closely related to Code Review, as of January 2026, are:

Table Size
merge_request_diff_commits 7,875 GB
merge_request_diff_files 3,290 GB
notes 3,156 GB
events 2,371 GB
merge_requests 787 GB
merge_request_diffs 451 GB
note_diff_files 170 GB
approval_merge_request_rules_users 160 GB
merge_request_metrics 140 GB

After excluding the items listed in Non-Goals below, this document focuses on the remaining large tables: notes, events, merge_requests, and merge_request_diffs.

Goals

  1. Reduce the combined on-disk size of notes, merge_requests, merge_request_diffs, and events tables by approximately 50%.
  2. Achieve the largest savings with the lowest-risk changes first (quick wins), then progress to larger structural changes.
  3. Maintain backward compatibility with existing application behavior with no user-facing feature regressions.
  4. Establish repeatable patterns (for example, HTML cache clearing, retention policies) that other groups can adopt for their own large tables such as issues and work_items.
  5. Update application and model code when needed to support larger structural changes (for example, table decomposition, column type conversions).
  6. Deliver changes incrementally across multiple milestones, with each change independently valuable.

Non-Goals

This document does not cover the two largest Code Review tables, merge_request_diff_commits and merge_request_diff_files. Those tables are already being addressed by separate epics (&16385 and &11272).

See Out-of-scope opportunities below for other opportunities identified during the investigation that are not initially in scope.

Proposal

The table below summarizes all in-scope opportunities in the order we intend to pursue them, starting with the largest savings. We will iterate through these incrementally, prioritizing by effort-to-impact ratio so small-effort “quick wins” can be delivered in parallel with larger structural changes. A concrete priority order will be proposed in follow-up MRs. Detailed analysis for each opportunity follows in the per-table sections.

Opportunity Table Effort Savings
Clear note_html for stale MRs notes Large 1,000 GB
Decompose system notes notes Large 800 GB
Convert position columns to structured table notes Large 200 GB
Clear description_html and title_html for stale MRs merge_requests Large 150 GB
Retention policy on merge_request_diffs merge_request_diffs, merge_request_diff_commits, merge_request_diff_files Large TBD (expected large)
Reclaim bloat (pg_repack) merge_requests Small 123 GB
Convert SHA columns to bytea merge_request_diffs Small 78 GB
Drop redundant noteable index notes Small 63 GB
Drop external_diff column and index merge_request_diffs Small 52 GB
Drop updated_at column events Small 34 GB
Drop/convert index_notes_on_line_code notes Small 34 GB
Remove merge_params for merged MRs merge_requests Small 25 GB
Convert index_notes_on_organization_id to partial notes Small 19 GB
Convert SHA columns to bytea merge_requests Small 15 GB
Convert integer columns to smaller types merge_request_diffs Small 10 GB
Convert merge_status to smallint merge_requests Small 3.5 GB
Drop assignee_id column and index merge_requests Small ~2.7 GB
Total ~2,609 GB

Retention policy on merge_request_diffs. Discussed in issue #594843 (comment). We expect savings to be large because a retention policy on merge_request_diffs would also reduce merge_request_diff_commits and merge_request_diff_files, but this overlaps with the separate epics already addressing those tables (epic &16385 and epic &11272) and needs to be coordinated there. A savings estimate should be produced as part of that coordination.

Out-of-scope opportunities

The following opportunities were identified during the investigation but are not in scope for this design document. Each is documented here for visibility and future follow-up:

Opportunity Table Effort Savings
90-day retention policy events Large 1,800 GB
Partition events table events Medium 0 GB (enabler)
Merge namespace columns events Large 50 GB
Drop st_diff column notes Medium 20 GB
Drop confidential column notes Small ~0.1 GB

These are out of scope for the following reasons:

  • events table changes (90-day retention, partitioning, and merging namespace columns). These approaches are not yet mature enough to commit to in this document. Code Review co-owns the events table because it contains MR-related data, but many of the features that rely on this data are owned by other groups, so changes here require cross-team collaboration and further POC work to validate feasibility, impact, and retention semantics. The retention policy proposal (#571288) is the current starting point for that discussion.
  • Drop st_diff column. Requires removing LegacyDiffNote handling across the application, which has a broader scope than a column-level change and is better tracked as a separate deprecation.
  • Drop confidential column. Savings are negligible (~0.1 GB) and do not justify prioritizing this over larger opportunities. Can be finalized opportunistically alongside other notes changes.

The following sections provide detailed analysis for each opportunity, organized by table. Each includes context from the initial spike investigation.

notes table (3,054 GB total: 2,246 GB columns + 808 GB indexes)

The notes table is the third largest table on GitLab.com. The top six columns by size are note_html, note, original_position, position, discussion_id, and change_position.

Clear note_html for stale merge requests

  • Estimated savings: 1,000 GB (44.5% of the table’s column data)
  • Effort: Large

note_html is a cached rendered version of the note field, generated by CacheMarkdownField. It consumes 1,169 GB (52% of the table’s 2,246 GB of column data). For notes on merge requests that have not been accessed recently, this cached value can be cleared and regenerated on demand.

The approach:

  1. Define “stale” criteria: for example, merged MRs with updated_at older than 3 months, and open MRs with updated_at older than 6 months.
  2. Run an async worker to set note_html to NULL for notes belonging to stale MRs.
  3. On read, if note_html is NULL, regenerate from note and persist back to the database. The existing CacheMarkdownField module already supports this pattern through cached_markdown_version.
  4. Benchmark the performance impact of on-the-fly regeneration for API endpoints that return many notes (for example, merge request discussions API).
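The clear-and-regenerate cycle above can be sketched in plain Ruby. This is a hypothetical stand-in for the real model (in the actual codebase, CacheMarkdownField and cached_markdown_version handle invalidation), intended only to show the flow:

```ruby
# Hypothetical stand-in for a notes row; the real model uses CacheMarkdownField.
class Note
  attr_reader :note, :updated_at
  attr_accessor :note_html

  def initialize(note:, updated_at:)
    @note = note
    @updated_at = updated_at
    @note_html = render(note) # today, the cache is populated at write time
  end

  # Step 2: an async worker clears the cache for stale records.
  def clear_html_cache_if_stale!(threshold)
    self.note_html = nil if updated_at < threshold
  end

  # Step 3: regenerate on demand and persist back.
  def html
    self.note_html ||= render(note)
  end

  private

  # Placeholder for the real Markdown rendering pipeline.
  def render(text)
    "<p>#{text}</p>"
  end
end

note = Note.new(note: 'LGTM', updated_at: Time.utc(2025, 1, 1))
note.clear_html_cache_if_stale!(Time.utc(2026, 1, 1))
note.note_html # => nil (storage reclaimed)
note.html      # => "<p>LGTM</p>" (regenerated lazily and cached again)
```

The key property is that clearing the cache is always safe: note remains the source of truth, and the cost of a cleared cache is a one-time re-render on the next read.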

There have been past markdown cache version bumps (for example, 492f0853, e7a98807) which essentially trigger the same regeneration, suggesting the performance impact is manageable.

This pattern can later be applied to description_html and title_html on merge_requests, issues, and work_items tables. Nicolas Dular from the Plan group has expressed support for this approach and noted it could also benefit the issues table.

Decompose system notes

  • Estimated savings: 800 GB (35.6% of the table’s column data)
  • Effort: Large

Approximately 79% of notes rows are system-generated notes. Unlike user-authored notes, system notes are mostly structured data rendered into full text (for example, “added 3 commits”, “changed the description”, “mentioned in !1234”). Instead of storing the full rendered text, we can store only the structured parameters needed to reconstruct the message on the fly (for example, action type, count, reference).

There are two possible approaches:

  • Decompose into a new table. Move system notes into a dedicated system_notes table with structured columns for each action type. This reduces the effective size of both the original and new tables, improving query performance for each access pattern.
  • Add structured columns to the existing table. Add columns for the structured parameters and clear the note and note_html text fields for system notes, avoiding the complexity of a table split while still reclaiming the storage.
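A minimal sketch of the render-on-read idea, with illustrative action names and templates (not the real system note vocabulary):

```ruby
# Store only the structured parameters for a system note and render the
# message on read. Action names and templates here are illustrative.
SYSTEM_NOTE_TEMPLATES = {
  added_commits:       ->(params) { "added #{params[:count]} commits" },
  changed_description: ->(_params) { 'changed the description' },
  mentioned_in:        ->(params) { "mentioned in !#{params[:mr_iid]}" }
}.freeze

# A system note row would then carry only (action, params), not the full
# rendered text in note and note_html.
def render_system_note(action, params = {})
  SYSTEM_NOTE_TEMPLATES.fetch(action).call(params)
end

render_system_note(:added_commits, count: 3)    # => "added 3 commits"
render_system_note(:mentioned_in, mr_iid: 1234) # => "mentioned in !1234"
```

Under either approach, the storage win comes from replacing two text fields per row with a small enum plus a handful of scalar parameters.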

Convert position, original_position, and change_position to a structured table

  • Estimated savings: 200 GB (8.9% of the table’s column data)
  • Effort: Large

These three columns store YAML-serialized position data for DiffNote records. Currently each field is a YAML string consuming approximately 520 bytes per row. Converting to a structured table (similar to the existing DiffNotePosition model and its diff_note_positions table) reduces storage to approximately 350 bytes per row.

Additionally, position holds the exact same data as original_position for approximately 2.28% of rows. This redundancy can be eliminated.

We investigated whether converting from YAML strings to jsonb would help, but YAML strings actually use less space than jsonb due to TOAST compression. The structured table approach provides the best savings.
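A rough illustration of why the structured representation wins, using a simplified position hash (the real serialized hash has more keys, so these byte counts are illustrative only):

```ruby
require 'yaml'

# Illustrative diff position hash; the real serialized position has more keys.
POSITION = {
  'base_sha' => 'a' * 40,
  'head_sha' => 'b' * 40,
  'old_path' => 'app/models/note.rb',
  'new_path' => 'app/models/note.rb',
  'old_line' => nil,
  'new_line' => 42
}.freeze

# Per-row payload when the position is stored as a YAML string.
def yaml_row_bytes
  YAML.dump(POSITION).bytesize
end

# Approximate payload of the same data in a structured table: two 20-byte
# bytea SHAs, two text paths, two 4-byte integers.
def structured_row_bytes
  20 + 20 + POSITION['old_path'].bytesize + POSITION['new_path'].bytesize + 4 + 4
end

yaml_row_bytes > structured_row_bytes # => true: YAML repeats keys and hex-encodes SHAs
```

The YAML string pays for key names and hex-encoded SHAs on every row, while a structured table stores the keys once in the schema.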

Drop redundant noteable_id/noteable_type/system index

  • Estimated savings: 63 GB (2.1% of notes table)
  • Effort: Small

The composite index index_notes_on_noteable_id_and_noteable_type_and_system is 63 GB and has minimal usage. We need to evaluate whether queries can be served by other existing indexes before removal.

Drop or convert index_notes_on_line_code to partial

  • Estimated savings: 34 GB (1.1% of notes table)
  • Effort: Small

This index is 36 GB. Grafana metrics show it is seldom used (fewer than 0.8 scans per second, with occasional spikes). The only usage found is for LegacyDiffNote, a legacy implementation: new diff notes are of type DiffNote, and the legacy type is now created mainly by imports. This index can be converted to a partial index or dropped after confirming no active query paths depend on it.

Drop st_diff column and remove LegacyDiffNote type

  • Estimated savings: 20 GB (0.7% of notes table)
  • Effort: Medium

The st_diff field is only used by LegacyDiffNote. Once legacy notes are migrated to the standard DiffNote handling, this column can be removed, saving approximately 20 GB.

Convert index_notes_on_organization_id to partial

  • Estimated savings: 19 GB (0.6% of notes table)
  • Effort: Small

This index is 19 GB but has near-zero usage over the past 90 days. The column organization_id is almost entirely NULL (only 451 KB of actual data). There is no actual usage in application code; the index exists solely for the Cells/Organization sharding initiative. Converting to a partial index on WHERE organization_id IS NOT NULL shrinks the index from 19 GB to near zero. This requires confirmation from the Tenant Scale team.

Drop confidential column (migrated to internal)

  • Estimated savings: ~0.1 GB
  • Effort: Small

The confidential column is a duplicate of internal after the migration in Rename confidential column in notes tables (#367923). Finalize the migration by dropping the column and the associated index_notes_on_id_where_confidential index (22 MB).

Total estimated savings for notes: ~1,700 GB (56%)

merge_requests table (804 GB total: 551 GB table + 254 GB indexes)

The merge_requests table total size is 804 GB (including ~123 GB of reclaimable bloat). The top columns by size are description and description_html (251 GB, 45.6%), title and title_html (41 GB, 7.45%), and merge_params (26 GB, 4.73%).

Clear description_html and title_html for stale merge requests

  • Estimated savings: 150 GB (19.9% of merge_requests table)
  • Effort: Large

description_html consumes 160 GB and title_html consumes 27 GB. Both are CacheMarkdownField caches, not sources of truth. Only title and description are the source of truth. The same approach described for note_html above applies here.

Reclaim table bloat (pg_repack)

  • Estimated savings: 123 GB (18.8% of merge_requests physical size)
  • Effort: Small (requires DB team coordination)

Analysis shows a 151 GB difference between the physical table size (551 GB) and the actual column data (400 GB). This is attributed to: table bloat (~123 GB of reclaimable dead tuples), row metadata (~11 GB), and alignment padding and page headers (~17 GB). The bloat is likely caused by the bigint migration and description updates. Running pg_repack or VACUUM FULL can reclaim this space, coordinated with the Database team for production execution.

Remove merge_params for merged merge requests

  • Estimated savings: 25 GB (3.8% of merge_requests table)
  • Effort: Small

merge_params contains highly repetitive data. For example, force_remove_source_branch: '0' is the default behavior for any MR, yet it is persisted for every row. After an MR is merged, merge_params is no longer needed by the application. Most of this data is also available in Gitaly if needed later.

Discussion with the Code Review backend team on Slack confirmed there is no known usage of merge_params after the MR is merged. We can run an async worker daily or weekly to clear merge_params for MRs merged more than 7 days ago, reducing the column from 26 GB to under 1 GB. The merge_params field could also be converted to jsonb if we choose not to decompose it into a separate table.
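A plain-Ruby sketch of such a worker, using hypothetical names rather than the real model and worker classes:

```ruby
# Hypothetical stand-in for merge request rows; the real worker would operate
# on the ActiveRecord model in batches.
MergeRequest = Struct.new(:iid, :state, :merged_at, :merge_params, keyword_init: true)

# Clear merge_params once an MR has been merged for more than 7 days.
def clear_stale_merge_params!(merge_requests, now: Time.now)
  cutoff = now - (7 * 24 * 60 * 60)
  merge_requests.each do |mr|
    next unless mr.state == 'merged' && mr.merged_at && mr.merged_at < cutoff
    mr.merge_params = nil
  end
end

mrs = [
  MergeRequest.new(iid: 1, state: 'merged', merged_at: Time.utc(2026, 1, 1),
                   merge_params: { 'force_remove_source_branch' => '0' }),
  MergeRequest.new(iid: 2, state: 'opened', merged_at: nil,
                   merge_params: { 'force_remove_source_branch' => '0' })
]
clear_stale_merge_params!(mrs, now: Time.utc(2026, 2, 1))
mrs.map { |mr| mr.merge_params.nil? } # => [true, false]
```

The 7-day grace period is the figure proposed above; open MRs and recently merged MRs are untouched, so any post-merge processing that still reads merge_params has time to complete.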

Convert SHA columns to bytea

  • Estimated savings: 15 GB (1.9% of merge_requests table)
  • Effort: Small

Three SHA fields are stored as character varying, which uses a hex-encoded text representation:

  • squash_commit_sha: varchar field takes 42 bytes, bytea takes 20 bytes. 65.7M rows x 22 bytes saved = 1.45 GB.
  • merge_commit_sha: varchar field takes 42 bytes, bytea takes 20 bytes. 272.3M rows x 22 bytes saved = 5.99 GB.
  • merged_commit_sha: varchar field takes 82 bytes (double-encoded), bytea takes 20 bytes. 121.4M rows x 62 bytes saved = 7.53 GB.

This standardizes SHA storage. The existing in_progress_merge_commit_sha column already uses bytea, so there is precedent in this table.
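The space difference between hex text and raw bytes can be illustrated directly; the 42-byte and 82-byte figures above additionally include the on-disk varlena header and, for merged_commit_sha, double encoding:

```ruby
# A SHA stored as hex text in a varchar column occupies 40 characters;
# packed into a bytea it occupies 20 raw bytes.
def sha_hex_to_bytea(sha_hex)
  [sha_hex].pack('H*')
end

sha = 'deadbeef' * 5           # 40 hex characters
sha.bytesize                   # => 40 (plus a short header on disk)
sha_hex_to_bytea(sha).bytesize # => 20
```

The conversion is lossless: unpacking the bytea with `unpack1('H*')` yields the original hex string, so the application layer can keep presenting SHAs as text.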

Convert merge_status from varchar to smallint

  • Estimated savings: 3.5 GB (0.5% of merge_requests table)
  • Effort: Small

merge_status is defined as character varying(510) but only stores 7 possible enum values (unchecked, preparing, checking, can_be_merged, cannot_be_merged, cannot_be_merged_recheck, cannot_be_merged_rechecking). Each row consumes approximately 11 bytes. Converting to smallint (2 bytes) with a Rails enum mapping saves approximately 10 bytes per row, with no change above the model layer.
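A sketch of the mapping; the concrete integer values here are hypothetical and would be fixed in the migration, not inferred:

```ruby
# Hypothetical integer mapping for the Rails enum backing a smallint column.
MERGE_STATUSES = {
  'unchecked'                   => 0,
  'preparing'                   => 1,
  'checking'                    => 2,
  'can_be_merged'               => 3,
  'cannot_be_merged'            => 4,
  'cannot_be_merged_recheck'    => 5,
  'cannot_be_merged_rechecking' => 6
}.freeze

# With an ActiveRecord enum declared from this hash, readers still see the
# string values while the column stores a 2-byte smallint.
MERGE_STATUSES.fetch('can_be_merged') # => 3
MERGE_STATUSES.key(3)                 # => "can_be_merged"
```

Because the enum translation happens in the model layer, callers of merge_status continue to receive the familiar string values.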

Drop legacy assignee_id column

  • Estimated savings: ~2.7 GB (column + index)
  • Effort: Small

The assignee_id column has been replaced by merge_request_assignees association. The column itself is only 43 MB, but the associated index index_merge_requests_on_assignee_id is 2.62 GB. Both can be dropped after confirming the deprecation is complete.

Total estimated savings for merge_requests: ~319 GB (approximately 40%)

merge_request_diffs table (389 GB total: 203 GB columns + 186 GB indexes)

Drop external_diff column and index

  • Estimated savings: 52 GB (13.4% of merge_request_diffs table)
  • Effort: Small

The external_diff column is no longer populated; its value was computed on the CarrierWave side. The column and its associated index index_merge_request_diffs_on_external_diff (14 GB) can be removed, saving approximately 52 GB in total.

Convert 3 SHA columns to bytea

  • Estimated savings: 78 GB (20.1% of merge_request_diffs table)
  • Effort: Small

base_commit_sha, start_commit_sha, and head_commit_sha can be converted from character varying to bytea, following the same approach as the merge_requests SHA columns. Each index on these columns also shrinks by approximately one-third.

Convert real_size, state, external_diff_store, and commits_count to smaller integer types

  • Estimated savings: 10 GB (2.6% of merge_request_diffs table)
  • Effort: Small

These columns currently use larger integer types than necessary. Converting to 1-byte or 2-byte integers where the value range permits saves approximately 10 GB.

Total estimated savings for merge_request_diffs: ~140 GB (36%)

events table (2,371 GB)

The events table has a clean schema with limited optimization potential at the column or index level: the table definition is well-designed, and the stored event data is already well-structured. The primary savings opportunities come from data lifecycle management.

Drop updated_at column

  • Estimated savings: 34 GB (1.4% of events table)
  • Effort: Small

Events are append-only and immutable. Analysis shows only 0.02% of rows have different created_at and updated_at values, and most of those differ by only nanoseconds or milliseconds. Deeper investigation by Abdul Wadood confirmed that rows where updated_at and created_at differ by more than 10 seconds have not occurred in the last year (the last such rows are from 2024).

There is no index on updated_at, which implies it is not actively used for queries. However, as Shane Maglangit noted, absence of an index does not definitively prove the column is unused (for example, namespaces.updated_at is heavily used but has no index). We should perform a double-check of application code before action. If needed for backward compatibility, we can alias updated_at to created_at in Rails.
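If backward compatibility is needed, the alias amounts to the following (a plain-Ruby stand-in for `alias_attribute :updated_at, :created_at` on the Rails model):

```ruby
# Sketch: after dropping the column, expose updated_at as an alias for
# created_at so existing readers keep working.
class Event
  attr_reader :created_at

  def initialize(created_at)
    @created_at = created_at
  end

  alias_method :updated_at, :created_at
end

event = Event.new(Time.utc(2026, 1, 1))
event.updated_at == event.created_at # => true
```

Given that events are append-only, the alias returns the correct value for effectively all rows.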

Merge project_id, group_id, and personal_namespace_id into namespace_id

  • Estimated savings: 50 GB (2.1% of events table)
  • Effort: Large

The events table stores three separate columns for the owning entity. These could potentially be merged into a single namespace_id column. However, merging these columns would require joining with the namespaces table to find events (since projects is a different table from namespaces), which would slow down already slow event queries. This needs careful benchmarking before proceeding.

90-day retention policy

  • Estimated savings: 1,800 GB (75.9% of events table)
  • Effort: Large

This is the single largest opportunity across all tables. Both GitHub and Azure DevOps offer 90-day event retention. A similar policy would dramatically reduce the events table size. Christina Lohr (@lohrc) has recommended a 90-day retention period, noting that events is essentially a duplicate of data that can be found in or reconstructed from other tables.

This strategy requires:

  1. Product input on acceptable retention periods.
  2. Evaluation of whether historical events can be reconstructed from other data sources. Notably, push_event_payloads is a highly-entangled table that may not be reconstructable, and we need to consider whether we can delete events for open issues and MRs.
  3. Time-based partitioning of the events table to enable efficient partition dropping rather than row-by-row deletion (the time-decay data pattern describes this approach in detail).
  4. Archiving events older than 90 days to object storage or a data warehouse for compliance and analytics use cases.

One additional idea: identify bot and automation actions and treat them differently (store fewer fields, or skip persisting them to the database entirely). This could significantly reduce the table’s growth rate, especially as AI tooling increases record creation. This needs Product input first.

Table partitioning

  • Estimated savings: 0 GB direct (enables retention and improves maintenance)
  • Effort: Medium

Even without a retention policy, partitioning the events table by created_at improves query performance, vacuum efficiency, and maintenance operations. It is a prerequisite for implementing an efficient retention policy.

Kerri Miller noted that a blend of approaches will serve best in the long term, and that partitioning might be appropriate even with a short retention period, as we can use time-based partitions and drop the oldest one every 90 days.

Total estimated savings for events: ~1,884 GB (if retention is adopted)

Design and implementation details

TBD

Alternative Solutions

TBD