Pipeline Execution Group - Risk Map
Overview
The goal of this page is to create, share and iterate on the Risk Map for the Pipeline Execution group.
Goals
Utilise the Risk Map as a tool to:
- Understand the risks the team faces
- Increase transparency on mitigation plans
- Effectively allocate limited resources
- Collaborate strategically in improving Quality
General Risk Map
Map key
- Impact - what happens if the risk is not mitigated or eliminated
- Impact level - Rate 1 (LOW) to 5 (HIGH)
- Probability - Rate 1 (LOW) to 5 (HIGH)
- Priority - Impact x Probability. Address highest score first.
- Mitigation - what could be done to lower the impact or probability
Risk Area | Risk Description | Impact | Impact level (1 LOW to 5 HIGH) | Probability (1 LOW to 5 HIGH) | Priority | Mitigation |
---|---|---|---|---|---|---|
Team/Capacity | We have 6 BE engineers and 2 FE engineers on Pipeline Execution and a large (and growing) backlog | Burnout, missed SLO/SLA, lower team productivity | 5 | 3 | 15 | Make BE headcount more available |
Team/Capacity | Unpredictable throughput | Low Say/Do, missed SLO/SLA | | | | |
Team/Capacity | We no longer have a stable counterpart for UX | Risk to usability and increased SUS bugs. Potential burnout for the EM/PM who take over the responsibilities. | 5 | 3 | 12 | Consider scaling other counterparts if the size of the engineering team grows |
Team/Capacity | We have a shared (30%) stable counterpart for SET | Escaped regression bugs | 4 | 4 | 16 | |
Team/Escalations | Escalations such as Rapid Actions and Engineering Allocations disrupt the team's ability to focus on its priorities | Burnout, low level of autonomy, lower team productivity | 5 | 4 | 20 | Find ways to proactively mitigate urgent issues with gitlab.com, work on GraphQL to unblock FE, find a dedicated SRE for CI |
Product/Backlog | Bug and Technical Debt backlog has been accruing over the years | Missed SLO/SLA, prioritization is harder | 5 | 3 | 15 | Revisit ownership of domains to better share the gaps |
Infrastructure availability | Pipelines get stuck due to stuck sidekiq shard | Mass failure in E2E test suites and/or customer usage impacted | 4 | 3 | 12 | |
Quality/Testability | Hard to replicate production traffic to account for performance testing | | 4 | 4 | 16 | |
Quality/Test coverage | This is a mature product with many features, and some feature sets do not yet have test coverage (historical test gaps) | Escaped regression bugs | 4 | 4 | 16 | |
Product/Cost | CI pipeline inefficiencies | CI Minute usage that could potentially be avoided | 5 | 5 | 25 | Develop features to optimize pipeline runtime |
Feature/Performance | Unperformant database queries | Added load on the gitlab.com database, slow page and feature load times | 3 | 3 | 9 | Recent rapid actions have helped, and there is a continual effort to address this to ensure we don’t regress |
Team/Efficiency | Migrating more REST to GraphQL to help unblock FE | FE productivity and delivery | 5 | 3 | 15 | |
Feature/Dependencies | Depends on runner response and processing time - https://gitlab.com/gitlab-org/gitlab/-/issues/326113 - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3631 | If runners fail to process, jobs are not executed, pipeline is stuck | 5 | 3 | 15 | |
Infrastructure availability | CI/CD Data model scaling | CI/CD Data model scaling | 5 | 2 | 10 | Actively being worked on in CI/CD Data Model Blueprint MR |
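
The Priority rule from the map key (Impact level x Probability, highest score first) can be sketched in a few lines of Python. This is only an illustration, not tooling the team uses: the `Risk` class and the `rank` helper are hypothetical names, and the sample rows are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    """One row of the risk map (hypothetical structure for illustration)."""
    area: str
    description: str
    impact_level: int  # 1 (LOW) to 5 (HIGH)
    probability: int   # 1 (LOW) to 5 (HIGH)

    def __post_init__(self):
        # Both ratings must stay on the 1-5 scale defined in the map key.
        for name in ("impact_level", "probability"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def priority(self) -> int:
        # Priority = Impact level x Probability, so it ranges from 1 to 25.
        return self.impact_level * self.probability


def rank(risks):
    """Return risks ordered by priority, highest score first."""
    return sorted(risks, key=lambda risk: risk.priority, reverse=True)


# Sample rows taken from the table above.
risks = [
    Risk("Infrastructure availability", "CI/CD Data model scaling", 5, 2),
    Risk("Product/Cost", "CI pipeline inefficiencies", 5, 5),
    Risk("Infrastructure availability", "Pipelines get stuck due to stuck sidekiq shard", 4, 3),
    Risk("Team/Escalations", "Escalations disrupt team priorities", 5, 4),
]

ranked = rank(risks)
for risk in ranked:
    print(f"{risk.priority:>2} | {risk.area} | {risk.description}")
```

Recomputing the score from the two ratings, rather than storing it by hand, also makes it easy to spot rows where the recorded Priority does not match Impact level x Probability.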
Last modified June 27, 2024: Fix various vale errors (46417d02)