# Repository Backups

| Status | Authors | Coach | DRIs | Owning Stage | Created |
|---|---|---|---|---|---|
| proposed | proglottis | DylanGriffith | | devops systems | 2023-04-26 |
## Summary

This proposal seeks to provide an out-of-the-box repository backup solution for
GitLab that gives more opportunities to apply Gitaly-specific optimisations. It
will do this by moving repository backups out of `backup.rake` into a
coordination worker that enumerates repositories and makes per-repository
decisions to trigger repository backups that are streamed directly from Gitaly
to object-storage.
The advantages of this approach are:
- The backups are only transferred once, from the Gitaly hosting the physical repository to object-storage.
- Smarter decisions can be made by leveraging specific repository access patterns.
- Distributes backup and restore load.
- Since the entire process runs within Gitaly, existing monitoring can be used.
- Provides architecture for future WAL archiving and other optimisations.
This should relieve the major pain points of the two existing strategies:

- `backup.rake` - Repository backups are streamed from outside of Gitaly using RPCs and stored in a single large tar file. Due to the amount of data transferred, these backups are limited to small installations.
- Snapshots - Cloud providers allow taking physical storage snapshots. These are not an out-of-the-box solution as they are specific to the cloud provider.
## Motivation

### Goals
- Improve time to create and restore repository backups.
- Improve monitoring of repository backups.
### Non-Goals
- Improving filesystem based snapshots.
### Filesystem based Snapshots

Snapshots rely on cloud platforms to be able to take physical snapshots of the
disks that Gitaly and Praefect use to store data. While never officially
recommended, this strategy tends to be used once creating or restoring backups
using `backup.rake` takes too long.
Gitaly and Git use lock files and fsync in order to prevent repository corruption from concurrent processes and partial writes from a crash. This generally means that if a file is written, then it will be valid. However, because Git repositories are composed of many files and many write operations may be taking place, it would be impossible to schedule a snapshot while no file operations are ongoing. This means the consistency of a snapshot cannot be guaranteed and restoring from a snapshot backup may require manual intervention.
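To make this concrete, the following minimal Go sketch shows the write-to-temporary-file, fsync, rename pattern that makes an individual file write all-or-nothing. It is illustrative only; `writeFileAtomic` is a hypothetical helper rather than Gitaly code, and the point is that the guarantee covers one file at a time, not a repository made of many files.

```go
package atomicwrite

import (
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file, flushes it to stable
// storage, and only then renames it into place. A crash mid-write leaves the
// previous file intact; readers never observe a partially written file.
func writeFileAtomic(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op after a successful rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // fsync: force the data to disk
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Rename is atomic on POSIX filesystems: the path points at either the
	// old or the new content, never a mix of both. This protects single
	// files, not the consistency of the many files making up a repository.
	return os.Rename(tmp.Name(), path)
}
```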
WAL may improve crash resistance and so improve automatic recovery from snapshots, but each repository will likely still require a majority of voting replicas in sync.
Since each node in a Gitaly Cluster is not homogeneous, depending on replication factor, in order to create a complete snapshot backup all nodes would need to have snapshots taken. This means that snapshot backups have a lot of repository data duplication.
Snapshots are heavily dependent on the cloud provider and so do not provide an out-of-the-box experience.
### Downtime

An ideal repository backup solution would allow both backup and restore operations to be done online. Specifically, we would not want to shut down or pause writes in order to ensure that each node/repository is consistent.
### Consistency
Consistency in repository backups means:
- That the Git repositories are valid after restore. There are no partially applied operations.
- That all repositories in a cluster are healthy after restore, or are made healthy automatically.
Backups without consistency may result in data-loss or require manual intervention on restore.
Both types of consistency are difficult to achieve using snapshots as this requires that snapshots of the filesystems on multiple hosts are taken synchronously and without repositories on any of those hosts currently being mutated.
### Distribute Work

We want to distribute the backup/restore work such that it isn't bottlenecked
on the machine running `backup.rake`, a single Gitaly node, or a single network
connection.
On backup, `backup.rake` aggregates all repository backups onto its local
filesystem. This means that all repository data needs to be streamed from
Gitaly (possibly via Praefect) to wherever the Rake task is being run. If this
is CNG, it also requires a large volume on Kubernetes. The resulting backup
tar file then gets transferred to object storage. A similar process happens on
restore: the entire tar file needs to be downloaded and extracted on the local
filesystem, even for a partial restore of a subset of repositories.
Effectively, all repository data gets transferred, in full, multiple times
between multiple hosts.
If each Gitaly could directly upload backups, repository data would only be transferred a single time, reducing the number of hosts involved and so the overall amount of data transferred.
### Gitaly Controlled
Gitaly is looking to become self-contained and so should own its backups.
`backup.rake` currently determines which repositories to back up and where those
backups are stored. This restricts the kind of optimisations that Gitaly could
apply and adds development/testing complexity.
### Monitoring

`backup.rake` is run in a variety of different environments. Historically,
backups from Gitaly's perspective are a series of disconnected RPC calls. This
has resulted in backups having almost zero monitoring. Ideally the process
would run within Gitaly so that it could be monitored using existing
metrics and log scraping.
### Automatic Backups

When `backup.rake` is set up on cron, it can be difficult to tell whether it has
been running successfully, whether it is still running, how long it took, and how
much space it has taken. It is also difficult to ensure that cron always has
access to the previous backup to allow for incremental backups, or to determine
whether updating the backup is required at all.
Having a coordination process running continuously will allow moving from a single-shot backup strategy to one where each repository determines its own backup schedule based on usage patterns and priority. This way each repository should be able to have a reasonably up-to-date backup without adding excess load to any Gitaly node.
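As an illustration only, a decision of this kind could look like the Go sketch below. The `RepositoryInfo` fields and the intervals are assumptions made for the example; they are not a settled Gitaly interface or policy.

```go
package coordinator

import "time"

// RepositoryInfo is a hypothetical view of a repository as seen by the
// coordination worker.
type RepositoryInfo struct {
	LastBackup   time.Time // when the last successful backup completed
	LastModified time.Time // last write to the repository
	HighPriority bool      // flagged by usage patterns or by an admin
}

// needsBackup decides whether a repository is due for a new backup. Busy or
// high-priority repositories are backed up more often; repositories whose
// last backup already covers their last write are skipped entirely.
func needsBackup(info RepositoryInfo, now time.Time) bool {
	if !info.LastModified.After(info.LastBackup) {
		return false // nothing has changed since the last backup
	}
	interval := 24 * time.Hour // illustrative default cadence
	if info.HighPriority {
		interval = time.Hour
	}
	return now.Sub(info.LastBackup) >= interval
}
```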
### Updated Repositories Only

`backup.rake` packages all repository backups into a tar file and generally has
no access to the previous backup. This makes it difficult to determine whether
the repository has changed since the last backup.
Having access to previous backups on object-storage would mean that Gitaly could more easily determine if a backup needs to be taken at all. This allows us to waste less time backing up repositories that are no longer being modified.
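A minimal sketch of such a check, assuming each backup stores a manifest containing a checksum of the repository's refs; the manifest idea and the `refsChecksum` helper are illustrative, not the current backup format.

```go
package coordinator

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// refsChecksum produces a stable digest over all refs (name -> object ID), so
// two identical ref states always hash to the same value.
func refsChecksum(refs map[string]string) string {
	names := make([]string, 0, len(refs))
	for name := range refs {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, name := range names {
		fmt.Fprintf(h, "%s %s\n", refs[name], name)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// changedSinceBackup compares the live refs against the checksum recorded in
// the previous backup's manifest; equal checksums mean no new backup is needed.
func changedSinceBackup(liveRefs map[string]string, previousChecksum string) bool {
	return refsChecksum(liveRefs) != previousChecksum
}
```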
### Point-in-time Restores

There should be a mechanism by which a set of repositories can be restored to a specific point in time. The identifier (backup ID) used should be determinable by an admin and apply to all repositories.
### WAL (write-ahead log)
We want to be able to provide infrastructure to allow continuous archiving of the WAL. This means providing a central place to stream the archives to and being able to match any full backup to a place in the log such that repositories can be restored from the full backup, and the WAL applied up to a specific point in time.
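Purely as a sketch of the restore side, assuming each archived entry carries a timestamp: restore the full backup, then replay archived entries up to the requested time. The `WALEntry` type and `applyEntry` callback are placeholders; Gitaly's actual WAL format is still being designed.

```go
package restore

import "time"

// WALEntry is a placeholder for one archived write-ahead log entry.
type WALEntry struct {
	Time    time.Time
	Payload []byte
}

// restoreToPointInTime replays archived entries newer than the full backup
// and no newer than the requested target time.
func restoreToPointInTime(fullBackupTime time.Time, entries []WALEntry, target time.Time, applyEntry func(WALEntry) error) error {
	for _, e := range entries {
		if e.Time.Before(fullBackupTime) {
			continue // already contained in the full backup
		}
		if e.Time.After(target) {
			break // stop at the requested point in time
		}
		if err := applyEntry(e); err != nil {
			return err
		}
	}
	return nil
}
```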
### WORM

Any Gitaly-accessible storage should be WORM (write once, read many) in order to prevent existing backups from being modified in the case that an attacker gains access to a node's object-storage credentials.
The pointer layout currently used by repository backups relies on being able to overwrite the pointer files, and as such would not be suitable for use on a WORM file store.
WORM is likely to be object-storage provider specific (for example, S3 Object Lock, Google Cloud Storage retention policies, or Azure immutable blob storage).
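One WORM-compatible direction, sketched below, is to write every object exactly once under a key that embeds the backup ID and to find the latest backup by listing keys instead of overwriting a pointer file. The key scheme shown is an assumption for illustration only.

```go
package layout

import "fmt"

// Existing pointer layout (not WORM-safe): <repo>/LATEST is overwritten on
// every backup run.
//
// A WORM-friendly alternative never rewrites an object:
//
//	<repo>/<backupID>/full.bundle
//	<repo>/<backupID>/manifest.json
//
// The most recent backup is discovered by listing <repo>/ and sorting the
// backup IDs, rather than by reading a mutable pointer file.
func backupKey(relativePath, backupID, file string) string {
	return fmt.Sprintf("%s/%s/%s", relativePath, backupID, file)
}
```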
### bundle-uri

Having direct access to backup data may open the door for clone/fetch transfer optimisations using bundle-uri. This allows us to point Git clients directly to a bundle file instead of transferring packs from the repository itself. The bulk repository transfer can then be faster and is offloaded to a plain HTTP server rather than the Gitaly servers.
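For example, a bundle generated alongside a backup could be uploaded and served from plain HTTP storage; the sketch below only shows creating the bundle, with client-side usage noted in a comment. The paths and URL are placeholders, and serving and advertising the bundle are out of scope here.

```go
package bundleuri

import (
	"context"
	"fmt"
	"os/exec"
)

// createBundle produces a single file containing all refs and objects of the
// repository. Once uploaded and publicly readable, clients can bootstrap a
// clone from it, for example:
//
//	git clone --bundle-uri=https://example.com/bundles/repo.bundle <repo-url>
//
// and then fetch only the newest objects from Gitaly.
func createBundle(ctx context.Context, repoPath, bundlePath string) error {
	cmd := exec.CommandContext(ctx, "git", "-C", repoPath, "bundle", "create", bundlePath, "--all")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("git bundle create: %w: %s", err, out)
	}
	return nil
}
```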
## Proposal

The proposal is broken down into an initial MVP and a per-repository coordinator.

### MVP
The goal of the MVP is to validate that moving backup processing server-side will improve the worst case, total-loss, scenario. That is, reduce the total time to create and restore a full backup.
The MVP will introduce backup and restore repository RPCs. There will be no
coordination worker. The RPCs will stream a backup directly from the
called Gitaly node to object storage. These RPCs will be called from
`backup.rake` via the `gitaly-backup` tool. `backup.rake` will no longer
package repository backups into the backup archive.
This work is already underway, tracked by the Server-side Backups MVP epic.
### Per-Repository Coordinator

Instead of taking a backup of all repositories at once via `backup.rake`, a
backup coordination worker will be created. This worker will periodically
enumerate all repositories to decide if a backup needs to be taken. These
decisions could be determined by usage patterns or priority of the repository.
When restoring, since each repository will have a different backup state, a timestamp will be provided by the user. This timestamp will be used to determine which backup to restore for each repository. Once WAL archiving is implemented, the WAL could then be replayed up to the given timestamp.
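A possible selection rule is sketched below with assumed metadata fields: for each repository, restore the newest backup created at or before the admin-supplied timestamp.

```go
package coordinator

import "time"

// BackupMeta is a hypothetical record of one backup of one repository.
type BackupMeta struct {
	ID        string
	CreatedAt time.Time
}

// pickBackup returns the newest backup taken at or before target, or false if
// the repository has no backup that old.
func pickBackup(backups []BackupMeta, target time.Time) (BackupMeta, bool) {
	var best BackupMeta
	found := false
	for _, b := range backups {
		if b.CreatedAt.After(target) {
			continue // newer than the requested point in time
		}
		if !found || b.CreatedAt.After(best.CreatedAt) {
			best, found = b, true
		}
	}
	return best, found
}
```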
This wider effort is tracked in the Server-side Backups epic.
## Design and implementation details

### MVP

There will be a pair of RPCs: `BackupRepository` and `RestoreRepository`. These
RPCs will synchronously create/restore backups directly onto object storage.
`backup.rake` will continue to use `gitaly-backup` with a new `--server-side`
flag. Each Gitaly will need a backup configuration to specify the
object-storage service to use.
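The general shape of the server side could look like the Go sketch below, which streams a backup straight from the Gitaly node to the configured bucket. The types, method name, and key layout are illustrative placeholders, not Gitaly's actual `BackupRepository` implementation.

```go
package server

import (
	"context"
	"io"

	"gocloud.dev/blob"
)

type backupServer struct {
	bucket *blob.Bucket // opened from the backup configuration at startup
}

// backupRepository writes a backup for one repository directly to object
// storage under a key derived from the repository path and the backup ID, so
// the data never passes through the host running backup.rake.
func (s *backupServer) backupRepository(ctx context.Context, relativePath, backupID string, backupData io.Reader) error {
	key := relativePath + "/" + backupID + "/full.bundle"

	w, err := s.bucket.NewWriter(ctx, key, nil)
	if err != nil {
		return err
	}
	if _, err := io.Copy(w, backupData); err != nil {
		w.Close()
		return err
	}
	return w.Close()
}
```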
Initially the structure of the backups in object-storage will be the same as the existing pointer layout.
For the MVP, the backup ID must match an exact backup ID on object-storage.
The configuration of object-storage will be controlled by a new configuration
option, `config.backup.go_cloud_url`. The Go Cloud Development Kit
tries to use a provider-specific way to configure authentication. This can be
inferred from the VM or from environment variables.
See Supported Storage Services.
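For illustration, this is roughly how a Go CDK URL resolves to a bucket; the bucket names are placeholders and credentials are picked up through the provider's standard mechanisms (instance metadata, environment variables, and so on).

```go
package backupconfig

import (
	"context"

	"gocloud.dev/blob"
	_ "gocloud.dev/blob/azureblob" // azblob:// URLs
	_ "gocloud.dev/blob/gcsblob"   // gs:// URLs
	_ "gocloud.dev/blob/s3blob"    // s3:// URLs
)

// openBackupBucket opens the bucket named by a go_cloud_url style setting,
// for example "s3://my-gitaly-backups?region=us-east-1",
// "gs://my-gitaly-backups", or "azblob://my-gitaly-backups".
func openBackupBucket(ctx context.Context, goCloudURL string) (*blob.Bucket, error) {
	return blob.OpenBucket(ctx, goCloudURL)
}
```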
## Alternative Solutions