Fault Tolerance
GitLab has to be a highly available, mission-critical system. To achieve this, we must design and deploy the system so that the following principles are met:
- Eliminate single points of failure (SPOF): A failure of a single node should not cause downtime.
- Isolate failures: If a failure happens, it should be isolated as much as possible to a particular project, user, etc. The blast radius should be minimized.
- Rollback: Errors will inevitably happen in software development. If a bug is introduced, we must be able to revert quickly without exposing the problem to a large number of users.
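One common way to limit exposure and make reverting cheap is a percentage-based rollout. The sketch below is illustrative only and is not how GitLab implements feature flags; the function name `in_rollout` and the flag name `new_ci_queue` are hypothetical.

```python
import hashlib


def in_rollout(flag_name: str, user_id: int, percentage: int) -> bool:
    """Return True if this user falls inside the rollout percentage.

    The bucket is derived from a hash of the flag name and user ID, so a
    given user always gets the same answer for a given flag.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percentage


# Hypothetical usage: expose the new code path to 10% of users first.
if in_rollout("new_ci_queue", user_id=42, percentage=10):
    pass  # new code path
else:
    pass  # existing code path
```

Under these assumptions, rolling back means setting the percentage to 0, which is a configuration change rather than a redeploy, and only the users inside the bucket ever saw the bug.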
Example Improvements to GitLab
The following concrete items would help improve GitLab's fault tolerance:
SPOF
- Eliminate use of NFS
- Use multiple Redis cache instances in Rails.cache
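To illustrate why multiple cache instances remove a SPOF, here is a minimal sketch that spreads keys across several instances and treats an unreachable instance as a cache miss. `InMemoryBackend` and `ShardedCache` are hypothetical stand-ins, not a real Redis client and not the Rails.cache API.

```python
import zlib


class InMemoryBackend:
    """Stand-in for one cache instance (hypothetical, not a Redis client)."""

    def __init__(self):
        self.data = {}
        self.healthy = True

    def get(self, key):
        if not self.healthy:
            raise ConnectionError("cache instance unavailable")
        return self.data.get(key)

    def set(self, key, value):
        if not self.healthy:
            raise ConnectionError("cache instance unavailable")
        self.data[key] = value


class ShardedCache:
    """Routes each key to one of several backends by hashing the key."""

    def __init__(self, backends):
        self.backends = list(backends)

    def _backend_for(self, key):
        # crc32 is stable across processes, unlike Python's built-in hash().
        return self.backends[zlib.crc32(key.encode()) % len(self.backends)]

    def get(self, key):
        try:
            return self._backend_for(key).get(key)
        except ConnectionError:
            return None  # a failed instance looks like a cache miss

    def set(self, key, value):
        try:
            self._backend_for(key).set(key, value)
            return True
        except ConnectionError:
            return False  # caching is best-effort; the caller keeps working
```

With this layout, losing one instance costs only the keys that hashed to it; callers recompute those values while the remaining instances keep serving hits.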
Isolation
- Allow GitLab to function if a single Gitaly node is down
- TODO
Microservices do not necessarily provide fault isolation
Note that the above list does not mention microservices as a cure-all. A
microservice architecture can help provide fault isolation, but it
does not inherently do so. For example, suppose we introduce a
UserAPI
microservice that provides an API for all services to retrieve
users in the system. Our architecture may now look like:
graph TD
  Rails --> UserAPI
  Sidekiq --> UserAPI
  Pages --> UserAPI
The UserAPI
microservice could still be a single point of failure
here; if it goes down, all the other services in the system
(e.g. Rails, Sidekiq, etc.) also stop working. We’ve introduced a new
service that can be owned by a single team, but in doing so we haven’t
necessarily improved isolation. Can the system function without this
service? Probably not, although there may be other advantages to doing
this (e.g. making it possible to shard user data across multiple servers,
improving performance). We still have to think about how to avoid a SPOF.
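As a rough illustration of what isolating a UserAPI outage could involve, the sketch below has a caller fall back to the last successfully fetched record when the service is unreachable. All names (`UserLookup`, `UserAPIUnavailable`, `fetch_user`) are hypothetical, and serving stale data is only acceptable for some use cases.

```python
class UserAPIUnavailable(Exception):
    """Raised by the (hypothetical) UserAPI client when the service is down."""


class UserLookup:
    """Wraps a UserAPI call and degrades to stale data during an outage."""

    def __init__(self, fetch_user):
        # fetch_user: callable taking a user ID and returning a dict;
        # it may raise UserAPIUnavailable.
        self._fetch_user = fetch_user
        self._last_known = {}

    def get(self, user_id):
        try:
            user = self._fetch_user(user_id)
        except UserAPIUnavailable:
            # Degrade instead of failing: return the last known record,
            # or None if this user was never fetched successfully.
            return self._last_known.get(user_id)
        self._last_known[user_id] = user
        return user
```

Even with such a fallback, users that were never fetched remain unavailable during an outage, so the service itself still needs redundancy to stop being a SPOF.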
In addition, GitLab is unusual in that every microservice we create has to be shipped to customers, so there is also overhead in managing the configuration and redundancy of these services.
That being said, microservices may be worth it if we can clearly define the engineering benefit towards maintainability, scalability, and reliability. For example, we’ve considered introducing a GitLab CI service daemon that can better handle CI queues.