Workspaces Architecture for Kubernetes setup
Overview
Workspaces is delivered as a module (remote_development) in the GitLab agentk for Kubernetes project.
The overall goal of this architecture is to ensure that the actual state of all
workspaces running in the Kubernetes clusters is reconciled with the desired state of the
workspaces as set by the user.
This is accomplished as follows:
- The desired state of the workspaces is obtained from user actions in the GitLab UI or API and persisted in the Rails database.
- There is a reconciliation loop between the agentk and Rails, in which:
  - The agentk retrieves the actual state of the workspaces from the Kubernetes clusters and sends it to Rails to be persisted.
  - Rails compares the actual state with the desired state and responds with the actions needed to bring the actual state in line with the desired state for all workspaces.
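The comparison step in this loop can be sketched as follows. This is a minimal illustration and not the actual Rails implementation: the function name plan_actions, the dictionary shapes, and the action labels are all hypothetical.

```python
def plan_actions(desired, actual):
    """Compare desired and actual state per workspace and return actions.

    Both arguments map a workspace name to a state string. The shapes and
    the action labels are illustrative assumptions, not a real schema.
    """
    actions = {}
    for name, want in desired.items():
        have = actual.get(name)  # None when the cluster has not reported it yet
        if have != want:
            actions[name] = f"transition to {want}"
    return actions


desired = {"ws-1": "Running", "ws-2": "Stopped"}
actual = {"ws-1": "Running", "ws-2": "Running"}
print(plan_actions(desired, actual))
```

Only workspaces whose actual state differs from the desired state produce an action, which is what lets the loop converge and then go quiet.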
System design
User actions to create/update/delete a workspace
GitLab Agent for Kubernetes’ reconciliation with Rails
User accessing the workspace
With GitLab Workspaces Proxy
With GitLab Agent for Workspaces(agentw)
NOTE: The diagram below only reflects the HTTP traffic flow. The SSH traffic flow needs investigation and will depend on https://gitlab.com/groups/gitlab-org/-/epics/13984 .
GitLab Agent for Kubernetes topology
- The Kubernetes API is not shown in this diagram, but it is assumed that it is managing the workspaces through the agentk.
- The numbers of components in each Kubernetes cluster are arbitrary.
High-level overview of the communication between Rails and the agentk
Types of messages between Rails and the agentk
The agentk can send different types of messages to Rails to capture different information. Depending on what type of message the agentk sends, Rails will respond accordingly.
The message types are:
- reconcile - Messages sent to Rails to persist the current state of the workspaces. There are two types of updates, specified by the update_type field, with the following possible values: full and partial. The payload schema remains the same for both update types.
  - full
    - Actions performed by the agentk:
      - Send the current state of all the workspaces in the Kubernetes cluster managed by the agentk.
      - To keep things consistent between the agentk and Rails, the agentk sends this message every time it undergoes a full reconciliation cycle, which occurs:
        - when an agentk starts or restarts
        - after a leader election
        - periodically, as set using the full reconciliation interval configuration (default: once every hour)
        - whenever the agentk configuration is updated
    - Actions performed by Rails:
      - Update Postgres with the current state and respond with all the workspaces managed by the agentk and their last resource version that Rails has persisted in Postgres.
      - Returning the persisted resource version back to the agentk gives it a confirmation that the updates for that workspace have been successfully processed on the Rails end.
      - This persisted resource version also helps with sending only the latest workspace changes from the agentk to Rails for reconcile messages with the partial update type.
  - partial
    - Actions performed by the agentk:
      - Send the latest workspace changes to Rails that are not yet persisted in Postgres. The resource version that Rails last confirmed as persisted is used to send only the latest workspace changes from the agentk to Rails.
    - Actions performed by Rails:
      - Update Postgres with the current state and respond with the workspaces to be created/updated/deleted in the Kubernetes cluster and their last resource version that Rails has persisted in Postgres.
      - The workspaces to be created/updated/deleted are roughly calculated by using the filter desired state updated at >= agentk info reported at.
      - Returning the persisted resource version back to the agentk gives it a confirmation that the updates for that workspace have been successfully processed on the Rails end.
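The Rails side of a reconcile message can be sketched roughly as below. This is an illustration under stated assumptions, not the real implementation: the Workspace class, its field names, the handle_reconcile function, and the request and response shapes are all hypothetical, and a dictionary stands in for Postgres. The sketch also assumes that the filter is evaluated against the timestamps stored before the incoming message is persisted; the description above only says the set is roughly calculated.

```python
from dataclasses import dataclass


@dataclass
class Workspace:
    """Hypothetical stand-in for a workspace row in Postgres."""
    name: str
    desired_state_updated_at: int      # integer timestamps keep the sketch simple
    agent_info_reported_at: int = 0    # 0 means the agentk has not reported it yet
    persisted_resource_version: str = ""


def handle_reconcile(store, update_type, agent_infos, now):
    """Persist the reported state, then choose the workspaces to send back."""
    if update_type not in ("full", "partial"):
        raise ValueError(f"unknown update_type: {update_type}")

    # Filter from the text: desired state updated at >= agentk info reported at.
    # Assumption: it is applied to the values stored before this message.
    changed = [
        ws.name for ws in store.values()
        if ws.desired_state_updated_at >= ws.agent_info_reported_at
    ]

    # Persist what the agentk reported, including the resource version.
    for info in agent_infos:
        ws = store.get(info["name"])
        if ws is None:
            continue  # unknown workspace; ignored in this sketch
        ws.persisted_resource_version = info["resource_version"]
        ws.agent_info_reported_at = now

    # full: every workspace managed by the agentk; partial: only changed ones.
    names = list(store) if update_type == "full" else changed
    return [
        {"name": n, "resource_version": store[n].persisted_resource_version}
        for n in names
    ]


store = {
    "a": Workspace("a", desired_state_updated_at=10, agent_info_reported_at=5),
    "b": Workspace("b", desired_state_updated_at=3, agent_info_reported_at=5),
}
partial_response = handle_reconcile(
    store, "partial", [{"name": "b", "resource_version": "7"}], now=20
)
full_response = handle_reconcile(store, "full", [], now=30)
```

In this run the partial response contains only workspace a, whose desired state changed after it was last reported, while the full response lists both workspaces together with their persisted resource versions.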
Event-driven polling vs full or partial reconciliation
It was initially considered desirable to be able to tell the agentk to poll immediately instead of waiting for the next reconciliation loop. This would have the following benefits:
- A full reconciliation could be triggered on demand, allowing on-demand recovery or resetting of module state in the agentk.
- Apart from making the architecture more event-driven and real-time, it would allow the interval between reconciliation polls to be increased, thus reducing the load on the infrastructure.
However, as the prospective solutions were evaluated, it was concluded that very few cases would merit this capability, especially given the complexity of the viable options. Eventual reconciliation of state suffices for most cases, and it can be achieved simply through a full reconciliation that is carried out periodically (at a longer interval than partial reconciliation).
You can read more in this issue and conclusion comment
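The resulting behavior, periodic full reconciliation with partial reconciliation in between, might be sketched like this on the agentk side. The function choose_update_type and its parameters are hypothetical. The one-hour interval is the default mentioned earlier, and force_full stands in for the non-periodic triggers (agentk start or restart, leader election, configuration update).

```python
FULL_INTERVAL_SECONDS = 3600  # default from the text: once every hour


def choose_update_type(now, last_full_at, full_interval=FULL_INTERVAL_SECONDS,
                       force_full=False):
    """Return "full" when a full reconciliation is due, else "partial".

    last_full_at is None before the first full reconciliation.
    Times are plain numbers of seconds to keep the sketch simple.
    """
    if force_full or last_full_at is None:
        return "full"
    if now - last_full_at >= full_interval:
        return "full"
    return "partial"


print(choose_update_type(now=0, last_full_at=None))
print(choose_update_type(now=10, last_full_at=0))
print(choose_update_type(now=3600, last_full_at=0))
```

Because a full reconciliation is guaranteed to happen eventually, any state missed by partial updates is corrected without an on-demand trigger.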
Workspace states
- CreationRequested - Initial state of a Workspace; Creation requested by user but hasn’t yet been acted on
- Starting - In the process of being ready for use
- Running - Ready for use
- Stopping - In the process of scaling down
- Stopped - Persistent storage is still available but workspace has been scaled down
- Failed - Kubernetes resources have been applied by agentk but are not ready due to various reasons (for example, crashing container)
- Error - Kubernetes resources failed to get applied by agentk
- RestartRequested - User has requested a restart of the workspace but the restart has not yet successfully happened
- Terminating - User has requested the termination of the workspace and the action has been initiated but not yet completed
- Terminated - Persistent storage has been deleted and the workspace has been scaled down
- Unknown - Not able to understand the actual state of the workspace
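One way to model these states in code is an enumeration. The sketch below is illustrative only: the class name WorkspaceState and the helper parse_state are hypothetical, and mapping unrecognized strings to Unknown is an assumption based on the description of that state.

```python
from enum import Enum


class WorkspaceState(Enum):
    """The workspace states listed above, keyed by their string values."""
    CREATION_REQUESTED = "CreationRequested"
    STARTING = "Starting"
    RUNNING = "Running"
    STOPPING = "Stopping"
    STOPPED = "Stopped"
    FAILED = "Failed"
    ERROR = "Error"
    RESTART_REQUESTED = "RestartRequested"
    TERMINATING = "Terminating"
    TERMINATED = "Terminated"
    UNKNOWN = "Unknown"


def parse_state(value):
    """Map a state string to a WorkspaceState.

    Assumption: anything unrecognized is treated as Unknown.
    """
    try:
        return WorkspaceState(value)
    except ValueError:
        return WorkspaceState.UNKNOWN


print(parse_state("Running"))
print(parse_state("SomethingElse"))
```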
Possible actual_state values
The actual_state values are determined from the status attribute in the Kubernetes deployment changes, which the agentk listens to and sends to Rails.
The following diagram represents the typical flow of the actual_state values for a Workspace record based on the status values received from the agentk. The status is parsed to derive the actual_state of the workspace based on different conditions.
However, any of these states can be skipped if there have been any transitional status updates that were not received from the agentk for some reason (a quick transition, a failure to send the event, etc).
Possible desired_state values
The desired_state values are determined from the user’s request to Rails and are sent to the agentk by Rails.
desired_state uses a subset of the actual_state values (Running, Stopped and Terminated), plus the additional RestartRequested value.
The state reconciliation logic in Rails will continually attempt to transition the actual_state to the desired_state value, unless the workspace is in an unrecoverable state.
The additional RestartRequested state is only valid for desired_state; it is not a valid value for actual_state. It is required in order for Rails to initiate a restart of a started workspace. It will only persist until a status of Stopped is received from the agentk, indicating that the restart request was successful and in progress or completed. At this point, the desired_state will be automatically changed to Running to trigger the workspace to start again.
If there is a failure to restart the workspace, and a Stopped status is never received, the desired_state will remain RestartRequested until a new desired_state is specified.
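The RestartRequested handling described above can be sketched as a small transition function. The names DESIRED_STATES and next_desired_state are hypothetical, and the sketch covers only the rule stated in the text, not the full reconciliation logic in Rails.

```python
# The desired_state values named in the text.
DESIRED_STATES = {"Running", "Stopped", "Terminated", "RestartRequested"}


def next_desired_state(desired_state, reported_actual_state):
    """Return the desired_state to keep after a report from the agentk.

    RestartRequested persists until Stopped is reported, then becomes
    Running so that the workspace is started again. Every other
    desired_state is returned unchanged.
    """
    if desired_state not in DESIRED_STATES:
        raise ValueError(f"invalid desired_state: {desired_state}")
    if desired_state == "RestartRequested" and reported_actual_state == "Stopped":
        return "Running"
    return desired_state


print(next_desired_state("RestartRequested", "Stopping"))
print(next_desired_state("RestartRequested", "Stopped"))
```

If Stopped is never reported, the function keeps returning RestartRequested, which matches the failure case described above.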