Infrastructure Department Performance Indicators
Executive Summary
KPI | Health | Status |
---|---|---|
GitLab.com Availability SLO | Okay | |
Corrective Action SLO | Attention | |
Master Pipeline Stability | Attention | |
Merge request pipeline duration | Attention | |
S1 Open Customer Bug Age (OCBA) | Okay | |
S2 Open Customer Bug Age (OCBA) | Attention | |
Quality Team Member Retention | Confidential | |
Infrastructure Team Member Retention | Confidential | |
Key Performance Indicators
GitLab.com Availability SLO
Percentage of time during which GitLab.com is fully operational and providing service to users within SLO parameters. The definition and historical availability are both available on the GitLab.com Service Level Availability page.
Target: equal or greater than 99.80% Health:Okay
- June 2024 100.00%
- May 2024 99.99%
- April 2024 99.95%
Chart URL(s):
- https://about.gitlab.com/handbook/engineering/monitoring/#historical-service-availability
- https://dashboards.gitlab.net/d/general-slas/general-slas?orgId=1&from=now%2FM&to=now
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/Gitlab_comAvailabilityKPI
- https://docs.google.com/spreadsheets/d/1ee7O2k0Krg9PYDta_nehRZM2JMssPm81vS8kv05fu5g/edit#gid=0
Corrective Action SLO
The Corrective Actions (CAs) SLO focuses on the number of open severity::1/severity::2 Corrective Action Issues past their due date. Corrective Actions and their due dates are defined in our Incident Review process.
Target: 0 Health:Attention
- Corrective Action SLO is at 2
Chart
Master Pipeline Stability
Measures our monolith master pipeline success rate, a key indicator of engineering productivity and the stability of our releases. We will continue to leverage Merge Trains in this effort.
Target: Above 95% Health:Attention
- April 2024 decreased to 91%
- Causes of broken master for April 2024: flaky tests (45%), infrastructure/runner issues (42%), jobs timing out (17%), various infrastructure issues (11%), failed to pull job image (9%), runner disk full (5%), merge train missing (3%), test gap (3%), dependency upgrade (3%), broken CI config (2%), GitLab.com overloaded (2%)
- We automated the test quarantine process to remove very disruptive flaky tests from the pipelines and report them in each team's weekly triage report
- More communication has been added to merge requests and Slack channels to seek earlier actions on failed pipelines
Chart URL(s):
- https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/195
- https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/30
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/MasterPipelineStabilityKPI
Merge request pipeline duration
Measures the average duration of successful monolith merge request pipelines, a key building block for improving our cycle time and efficiency. More pipeline improvements are linked below.
Target: Below 45 minutes Health:Attention
- The previous chart we were showing made assumptions about the dependencies of CI jobs that no longer hold, causing the chart to sometimes exclude child pipelines when computing pipeline duration. A fix was made on 2024-04-29 to use the pipeline duration from the GitLab database directly instead of calculating it from queries (see investigation). As a result, the average and percentile durations are higher than previously thought.
Chart URL(s):
- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/140625
- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/139473
- https://10az.online.tableau.com/#/site/gitlab/workbooks/2312755/views
S1 Open Customer Bug Age (OCBA)
S1 Open Customer Bug Age (OCBA) measures the total number of days that all S1 customer-impacting bugs are open within a month divided by the number of S1 customer-impacting bugs within that month.
Target: Below 30 days Health:Okay
- Promoted to KPI in FY24Q2
- Near target for 3 consecutive months, uptick in current month due to ongoing triaging of issues
- All S1 bugs are being reviewed for upcoming milestone planning
Chart URL(s):
- https://gitlab.com/groups/gitlab-org/quality/quality-engineering/-/epics/8
- https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/2433
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/S1OpenCustomerBugAgeKPI
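The OCBA formula above (total open days across all bugs in the month, divided by the number of bugs) can be sketched in a few lines. The function name and inputs are illustrative; this is not GitLab's actual reporting query.

```python
from datetime import date

def open_customer_bug_age(opened_dates, month_end):
    """Illustrative OCBA: total days all customer-impacting bugs have
    been open within the month, divided by the number of bugs."""
    total_days = sum((month_end - opened).days for opened in opened_dates)
    return total_days / len(opened_dates)

# Two bugs open for 20 and 10 days respectively -> OCBA of 15 days.
print(open_customer_bug_age([date(2024, 6, 1), date(2024, 6, 11)],
                            date(2024, 6, 21)))
```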
S2 Open Customer Bug Age (OCBA)
S2 Open Customer Bug Age (OCBA) measures the total number of days that all S2 customer-impacting bugs are open within a month divided by the number of S2 customer-impacting bugs within that month.
Target: Below 250 days Health:Attention
- Promoted to KPI in FY24Q2
- Above target; a significant reduction will require a focus on older customer-impacting S2 bugs
Chart URL(s):
- https://gitlab.com/groups/gitlab-org/quality/quality-engineering/-/epics/8
- https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/2421
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/S2OpenCustomerBugAgeKPI
Quality Team Member Retention
We need to be able to retain talented team members. Retention measures our ability to keep them sticking around at GitLab. Team Member Retention = (1-(Number of Team Members leaving GitLab/Average of the 12 month Total Team Member Headcount)) x 100. GitLab measures team member retention over a rolling 12 month period.
Target: at or above 84% This KPI cannot be public Health:Confidential
- Confidential metric, see notes in Key Review agenda
- Will be merged into a combined department retention metric
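The retention formula quoted above reduces to a one-liner; this sketch is illustrative, not the People Success team's actual calculation:

```python
def retention_rate(leavers, avg_headcount):
    """Team Member Retention = (1 - (Number of Team Members leaving /
    Average 12-month Total Team Member Headcount)) x 100."""
    return (1 - leavers / avg_headcount) * 100

# Example: 8 leavers against an average headcount of 50 over the
# rolling 12-month window gives a retention rate of 84%.
print(retention_rate(8, 50))
```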
Infrastructure Team Member Retention
We need to be able to retain talented team members. Retention measures our ability to keep them sticking around at GitLab. Team Member Retention = (1-(Number of Team Members leaving GitLab/Average of the 12 month Total Team Member Headcount)) x 100. GitLab measures team member retention over a rolling 12 month period.
Target: at or above 84% This KPI cannot be public Health:Confidential
- Confidential metric, see notes in Key Review agenda
- Will be merged into a combined department retention metric
Regular Performance Indicators
Mean Time To Production (MTTP)
Measures the elapsed time (in hours) from merging a change into the gitlab-org/gitlab project's master branch to deploying that change to gitlab.com. It serves as an indicator of how quickly we can deploy application changes into production. This metric is equivalent to the Lead Time for Changes metric in the Four Keys Project from the DevOps Research and Assessment. Additionally, the data for this metric also shows Deployment Frequency, another of the Four Keys metrics. The MTTP breakdown can be visualized on the Delivery Metrics page.
Target: less than 12 hours Health:Okay
- Work towards MTTP epic 280.
Chart
Review App deployment success rate
Measures the stability of the test tooling that enables end-to-end and exploratory testing feedback.
Target: Above 95% Health:Attention
- Moved to regular PI in FY24Q2
- Stabilized at 95% to 96% in the past 3 months
Chart URL(s):
- https://gitlab.com/gitlab-org/quality/engineering-productivity/team/-/issues/50
- https://gitlab.com/gitlab-org/quality/engineering-productivity/team/-/issues/56
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/ReviewAppSuccessRatePI
Time to First Failure p80
TtFF (pronounced “teuf”) measures the 80th percentile time from pipeline creation until the first actionable failed build is completed for the GitLab monorepo project. We want to run the tests that are likely to fail first and shorten the feedback cycle to R&D teams.
Target: Below 20 minutes Health:Okay
- Tracking this metric in addition to the average, starting FY23Q4
- A plan to optimize selective test execution is in place for backend and frontend tests
Chart URL(s):
- https://about.gitlab.com/handbook/engineering/quality/engineering-productivity/#test-intelligence
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/TimetoFirstFailureP80PI
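As a rough illustration of the p80 calculation, here is a nearest-rank 80th percentile over a set of pipeline durations; GitLab's dashboards may use a different percentile interpolation method:

```python
import math

def time_to_first_failure_p80(durations_minutes):
    """80th percentile (nearest-rank method) of the times from pipeline
    creation to completion of the first actionable failed build."""
    ordered = sorted(durations_minutes)
    rank = math.ceil(0.8 * len(ordered))  # nearest-rank: ceil(p * n)
    return ordered[rank - 1]

# With durations 10..50 minutes, the p80 is 40 minutes.
print(time_to_first_failure_p80([50, 10, 40, 30, 20]))
```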
Time to First Failure
TtFF (pronounced “teuf”) measures the average time from pipeline creation until the first actionable failed build is completed for the GitLab monorepo project. We want to run the tests that are likely to fail first and shorten the feedback cycle to R&D teams.
Target: Below 15 minutes Health:Okay
- Moved to regular PI in FY24Q2
- Under target of 15 mins for the past 2 months
Chart URL(s):
- https://about.gitlab.com/handbook/engineering/quality/engineering-productivity/#test-intelligence
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/TimetoFirstFailurePI
Average duration of end-to-end test suite
Measures the speed of our full QA/end-to-end test suite in the master branch. A Software Engineering in Test job-family performance indicator.
Target: at or below 90 mins Health:Okay
- Below target of 90 mins
Chart
Average age of quarantined end-to-end tests
Measures the stability and effectiveness of our QA/end-to-end tests running in the master branch. A Software Engineering in Test job-family performance indicator.
Target: TBD Health:Unknown
- The chart tracking this historical metric was broken. It has recently been fixed, but our visibility is still limited.
Chart
S1 Open Bug Age (OBA)
S1 Open Bug Age (OBA) measures the total number of days that all S1 bugs are open within a month divided by the number of S1 bugs within that month.
Target: Below 60 days Health:Okay
- Under target for the past 5 months
- Moved to regular PI in FY24Q3
Chart
S2 Open Bug Age (OBA)
S2 Open Bug Age (OBA) measures the total number of days that all S2 bugs are open within a month divided by the number of S2 bugs within that month.
Target: Below 250 days Health:Okay
- Under target for the past 11 months
- Moved to regular PI in FY24Q3
Chart URL(s):
- https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/1641
- https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/1644
Quality Handbook MR Rate
The handbook is essential to working remote successfully, to keeping up our transparency, and to recruiting successfully. Our processes are constantly evolving and we need a way to make sure the handbook is being updated at a regular cadence. This data is retrieved by querying the API with a python script for merge requests that have files matching /source/handbook/engineering/quality/** over time.
Target: Above 1 MR per person per month Health:Problem
- Declining in last 3 months
- To be combined into one handbook structure and measurement
Chart
Quality Department Promotion Rate
The total number of promotions over a rolling 12 month period divided by the month end headcount. The target promotion rate is 12% of the population. This metric definition is taken from the People Success Team Member Promotion Rate PI.
Target: 12% Health:Okay
- Under target for 4 months, which is expected after being above target for 8 months
Chart
Quality Department Discretionary Bonus Rate
The number of discretionary bonuses given divided by the total number of team members, in a given period as defined. This metric definition is taken from the People Success Discretionary Bonuses KPI.
Target: at or above 10% Health:Attention
- We have not been close to target for 10 months
- Combining into one measurement in progress
Chart
Infrastructure Handbook MR Rate
The handbook is essential to working remote successfully, to keeping up our transparency, and to recruiting successfully. Our processes are constantly evolving and we need a way to make sure the handbook is being updated at a regular cadence. This data is retrieved by querying the API with a python script for merge requests that have files matching /source/handbook/engineering/ or /source/handbook/support/ over time.
Target: 0.25 Health:Attention
- Adjusted the target to .55 to be consistent with the larger org, to reflect lower activity from managers, and because our initially suggested target was higher than many months of observed activity
- Combining into one handbook structure and measurement in progress
Chart
Infrastructure Department Discretionary Bonus Rate
The number of discretionary bonuses given divided by the total number of team members, in a given period as defined. This metric definition is taken from the People Success Discretionary Bonuses KPI.
Target: at or above 10% Health:Okay
- Combining into one department measurement in-progress
Chart
Mean Time Between Incidents (MTBI)
Measures the mean elapsed time (in hours) from the start of one production incident, to the start of the next production incident. It serves primarily as an indicator of the amount of disruption being experienced by users and by on-call engineers. This metric includes only Severity 1 & 2 incidents as these are most directly impactful to customers. This metric can be considered “MTBF of Incidents”.
Target: more than 120 hours Health:Okay
- Target at 120 hours with the intent that we should not have such incidents more than approximately weekly (hopefully less). Further iterations will increase this target when we incorporate environment (production only).
- Deployment failures (and the mean time between them) will be extracted into a separate metric to serve as a quality countermeasure for MTTP, unrelated to this metric which focuses on declared service incidents.
Chart
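The MTBI calculation above can be sketched as a mean over the gaps between consecutive incident start times. This is illustrative only; the real data comes from declared incidents in the production project.

```python
from datetime import datetime, timedelta

def mean_time_between_incidents(start_times):
    """Mean gap in hours between consecutive incident start times."""
    ordered = sorted(start_times)
    gaps_hours = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return sum(gaps_hours) / len(gaps_hours)

# Incidents starting 100h and 200h apart give an MTBI of 150 hours,
# above the 120-hour target.
base = datetime(2024, 1, 1)
print(mean_time_between_incidents(
    [base, base + timedelta(hours=100), base + timedelta(hours=300)]))
```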
Mean Time To Resolution (MTTR)
For all customer-impacting services, measures the elapsed time (in hours) it takes us to resolve an incident when one occurs. This serves as an indicator of our ability to execute such recoveries. It includes Severity 1 & Severity 2 incidents from the production project.
Target: less than 24 hours Health:Attention
- Data depends on SREs adding the incident::resolved label
- As we continue to dogfood GitLab Incident Management, we intend to improve this metric
Chart URL(s):
- https://gitlab.com/gitlab-com/gl-infra/production/-/boards/1717012?&label_name[]=incident
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/MeanTimetoResolutionPI
Mean Time To Mitigate (MTTM)
For all customer-impacting services, measures the elapsed time (in hours) it takes us to mitigate an incident when one occurs. This serves as an indicator of our ability to mitigate production incidents. It includes Severity 1 & Severity 2 incidents from the production project.
Target: less than 1 hour Health:Attention
- This metric is equivalent to the Time to Restore metric in the Four Keys Project from the DevOps Research and Assessment
- Data depends on SREs adding the incident::mitigate label
- As we continue to dogfood GitLab Incident Management, we intend to improve this metric
Chart
GitLab.com Saturation Forecasting
It is critical that we continuously observe normal growth in resource saturation as well as detect anomalies. This helps to ensure that we have the appropriate platform capacity in place. This metric uses the results of the Tamland forecasting framework for non-horizontally scalable services, which end up as issues in the Capacity Planning project. The metric counts the number of open capacity issues in that project.
Target: at or below 5 open issues Health:Attention
- Next improvements are to document the existing process for creating capacity planning issues, with a view to simplifying and automating it. Documentation and improvement of the process is a requirement for the SOC 2 Availability Criteria and is an OKR for Scalability for Q1.
- Once we have Thanos data available in Snowflake we will switch this PI to show the percentage
Chart
GitLab.com Hosting Cost / Revenue
We need to spend our investors' money wisely. As part of this we aim to follow industry-standard targets for hosting cost as a % of overall revenue. In this case revenue is measured as MRR + one-time monthly revenue from CI & Storage.
Target: TBD This KPI cannot be public Health:Confidential
- Confidential metric - See Key Review agenda
Infrastructure Department Promotion Rate
The total number of promotions over a rolling 12 month period divided by the month end headcount. The target promotion rate is 12% of the population. This metric definition is taken from the People Success Team Member Promotion Rate PI.
Target: 12% Health:Okay
- Above target
- Combining into one department measurement in-progress
Chart
Legends
Health
Value | Level | Meaning |
---|---|---|
3 | Okay | The KPI is at an acceptable level compared to the threshold |
2 | Attention | This is a blip, or we’re going to watch it, or we just need to enact a proven intervention |
1 | Problem | We'll prioritize our efforts here |
-1 | Confidential | Metric & metric health are confidential |
0 | Unknown | Unknown |
How pages like this work
Data
At the heart of pages like this are Performance Indicators data files, which are YAML files. Each - denotes a dictionary of values for a new (K)PI. The current elements (or data properties) are:
Property | Type | Description |
---|---|---|
name | Required | String value of the name of the (K)PI. For Product PIs, product hierarchy should be separated from the name by " - " (Ex. {Stage Name}:{Group Name} - {PI Type} - {PI Name}) |
base_path | Required | Relative path to the performance indicator page that this (K)PI should live on |
definition | Required | Refer to Parts of a KPI |
parent | Optional | Should be used when a (K)PI is a subset of another PI. For example, we might care about Hiring vs Plan at the company level. The child would be the division and department levels, which would have the parent flag. |
target | Required | The target or cap for the (K)PI. Please use Unknown until we reach maturity level 2 if this is not yet defined. For GMAU, the target should be quarterly. |
org | Required | The organizational grouping (Ex: Engineering Function or Development Department). For Product Sections, ensure you have the word section (Ex: Dev Section) |
section | Optional | The product section (Ex: dev) as defined in sections.yml |
stage | Optional | The product stage (Ex: release) as defined in stages.yml |
group | Optional | The product group (Ex: progressive_delivery) as defined in stages.yml |
category | Optional | The product category (Ex: feature_flags) as defined in categories.yml |
is_key | Required | Boolean value (true/false) that indicates if it is a (key) performance indicator |
health | Required | Indicates the (K)PI health and reasons as nested attributes. This should be updated monthly before Key Reviews by the DRI. |
health.level | Optional | Indicates a value between 0 and 3 (inclusive) to represent the health of the (K)PI. This should be updated monthly before Key Reviews by the DRI. |
health.reasons | Optional | Indicates the reasons behind the health level. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one reason. |
urls | Optional | List of urls associated with the (K)PI. Should be an array (indented lines starting with dashes) even if you only have one url |
funnel | Optional | Indicates there is a handbook link for a description of the funnel for this PI. Should be a URL |
public | Optional | Boolean flag that can be set to false where a (K)PI does not meet the public guidelines |
pi_type | Optional | Indicates the Product PI type (Ex: AMAU, GMAU, SMAU, Group PPI) |
product_analytics_type | Optional | Indicates if the metric is available on SaaS, SM (self-managed), or Both |
is_primary | Optional | Boolean flag that indicates if this is the Primary PI for the Product Group |
implementation | Optional | Indicates the implementation status and reasons as nested attributes. This should be updated monthly before Key Reviews by the DRI. |
implementation.status | Optional | Indicates the Implementation Status. This should be updated monthly before Key Reviews by the DRI. |
implementation.reasons | Optional | Indicates the reasons behind the implementation status. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one reason. |
lessons | Optional | Indicates lessons learned from a (K)PI as a nested attribute. This should be updated monthly before Key Reviews by the DRI. |
lessons.learned | Optional | An attribute nested under lessons that indicates lessons learned from a (K)PI. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one lesson learned |
monthly_focus | Optional | Indicates monthly focus goals from a (K)PI as a nested attribute. This should be updated monthly before Key Reviews by the DRI. |
monthly_focus.goals | Optional | Indicates monthly focus goals from a (K)PI. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one goal |
metric_name | Optional | Indicates the name of the metric in the Self-Managed implementation. The SaaS representation of the Self-Managed implementation should use the same name. |
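Putting the properties above together, a minimal entry in such a data file might look like the sketch below. All field values here are hypothetical examples for illustration, not a real PI definition from the handbook repository:

```yaml
- name: GitLab.com Availability SLO
  base_path: /handbook/engineering/infrastructure/performance-indicators/
  definition: Percentage of time during which GitLab.com is fully operational.
  target: 99.80%
  org: Infrastructure Department
  is_key: true
  health:
    level: 3
    reasons:
      - Above target for the past three months
  urls:
    - https://about.gitlab.com/handbook/engineering/monitoring/
```

Note that health.reasons and urls are arrays (indented lines starting with dashes) even when they hold a single item, per the table above.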