Infrastructure Department Performance Indicators
Executive Summary
KPI | Health | Status |
---|---|---|
GitLab.com Availability SLO | Okay | |
Corrective Action SLO | Attention | |
Master Pipeline Stability | Attention | |
Merge request pipeline duration | Attention | |
S1 Open Customer Bug Age (OCBA) | Okay | |
S2 Open Customer Bug Age (OCBA) | Attention | |
Quality Team Member Retention | Confidential | |
Infrastructure Team Member Retention | Confidential | |
Key Performance Indicators
GitLab.com Availability SLO
Percentage of time during which GitLab.com is fully operational and providing service to users within SLO parameters. The definition and historical availability are both available on the GitLab.com Service Level Availability page.
Target: equal or greater than 99.80% Health:Okay
- June 2024 100.00%
- May 2024 99.99%
- April 2024 99.95%
Chart URL(s):
- https://about.gitlab.com/handbook/engineering/monitoring/#historical-service-availability
- https://dashboards.gitlab.net/d/general-slas/general-slas?orgId=1&from=now%2FM&to=now
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/Gitlab_comAvailabilityKPI
- https://docs.google.com/spreadsheets/d/1ee7O2k0Krg9PYDta_nehRZM2JMssPm81vS8kv05fu5g/edit#gid=0
Corrective Action SLO
The Corrective Actions (CAs) SLO focuses on the number of open severity::1/severity::2 Corrective Action Issues past their due date. Corrective Actions and their due dates are defined in our Incident Review process.
Target: 0 Health:Attention
- Corrective Action SLO is at 2
Chart
Master Pipeline Stability
Measures our monolith master pipeline success rate, a key indicator of engineering productivity and the stability of our releases. We will continue to leverage Merge Trains in this effort.
Target: Above 95% Health:Attention
- April 2024 decreased to 91%
- Causes of broken master for April 2024: flaky tests (45%), infrastructure/runner issues (42%), jobs timing out (17%), various infrastructure issues (11%), failed to pull job image (9%), runner disk full (5%), merge train missing (3%), test gap (3%), dependency upgrade (3%), broken CI config (2%), GitLab.com overloaded (2%)
- We automated the test quarantine process to remove very disruptive flaky tests from the pipelines and report them in each team's weekly triage report
- More communication has been added to merge requests and Slack channels to seek earlier actions on failed pipelines
Chart URL(s):
- https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/195
- https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/30
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/MasterPipelineStabilityKPI
Merge request pipeline duration
Measures the average duration of successful monolith merge request pipelines, a key building block for improving our cycle time and efficiency. More pipeline improvements are linked below.
Target: Below 45 minutes Health:Attention
- The previous chart we were showing made assumptions about the dependencies of CI jobs that no longer hold, causing the chart to sometimes exclude child pipelines when computing pipeline duration. A fix was made on 2024-04-29 to use the pipeline duration from the GitLab database directly instead of calculating it from queries (see investigation). As a result, the average and percentile durations are higher than previously thought.
Chart URL(s):
- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/140625
- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/139473
- https://10az.online.tableau.com/#/site/gitlab/workbooks/2312755/views
S1 Open Customer Bug Age (OCBA)
S1 Open Customer Bug Age (OCBA) measures the total number of days that all S1 customer-impacting bugs are open within a month divided by the number of S1 customer-impacting bugs within that month.
Target: Below 30 days Health:Okay
- Promoted to KPI in FY24Q2
- Near target for 3 consecutive months, uptick in current month due to ongoing triaging of issues
- All S1 bugs are being reviewed for upcoming milestone planning
Chart URL(s):
- https://gitlab.com/groups/gitlab-org/quality/quality-engineering/-/epics/8
- https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/2433
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/S1OpenCustomerBugAgeKPI
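The OCBA formula above (total open days across all bugs in the month, divided by the number of bugs) can be sketched in a few lines. The function name and inputs are illustrative; this is not GitLab's actual reporting query.

```python
from datetime import date

def open_customer_bug_age(opened_dates, month_end):
    """Illustrative OCBA: total days all customer-impacting bugs have
    been open within the month, divided by the number of bugs."""
    total_days = sum((month_end - opened).days for opened in opened_dates)
    return total_days / len(opened_dates)

# Two bugs open for 20 and 10 days respectively -> OCBA of 15 days.
print(open_customer_bug_age([date(2024, 6, 1), date(2024, 6, 11)],
                            date(2024, 6, 21)))
```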
S2 Open Customer Bug Age (OCBA)
S2 Open Customer Bug Age (OCBA) measures the total number of days that all S2 customer-impacting bugs are open within a month divided by the number of S2 customer-impacting bugs within that month.
Target: Below 250 days Health:Attention
- Promoted to KPI in FY24Q2
- Above target; a significant reduction will require a focus on older customer-impacting S2 bugs
Chart URL(s):
- https://gitlab.com/groups/gitlab-org/quality/quality-engineering/-/epics/8
- https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/2421
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/S2OpenCustomerBugAgeKPI
Quality Team Member Retention
We need to be able to retain talented team members. Retention measures our ability to keep them sticking around at GitLab. Team Member Retention = (1-(Number of Team Members leaving GitLab/Average of the 12 month Total Team Member Headcount)) x 100. GitLab measures team member retention over a rolling 12 month period.
Target: at or above 84% This KPI cannot be public Health:Confidential
- Confidential metric, see notes in Key Review agenda
- Will be merged into a combined department retention metric
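The retention formula quoted above reduces to a one-liner; this sketch is illustrative, not the People Success team's actual calculation:

```python
def retention_rate(leavers, avg_headcount):
    """Team Member Retention = (1 - (Number of Team Members leaving /
    Average 12-month Total Team Member Headcount)) x 100."""
    return (1 - leavers / avg_headcount) * 100

# Example: 8 leavers against an average headcount of 50 over the
# rolling 12-month window gives a retention rate of 84%.
print(retention_rate(8, 50))
```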
Infrastructure Team Member Retention
We need to be able to retain talented team members. Retention measures our ability to keep them sticking around at GitLab. Team Member Retention = (1-(Number of Team Members leaving GitLab/Average of the 12 month Total Team Member Headcount)) x 100. GitLab measures team member retention over a rolling 12 month period.
Target: at or above 84% This KPI cannot be public Health:Confidential
- Confidential metric, see notes in Key Review agenda
- Will be merged into a combined department retention metric
Regular Performance Indicators
Mean Time To Production (MTTP)
Measures the elapsed time (in hours) from merging a change into the gitlab-org/gitlab project's master branch to deploying that change to gitlab.com. It serves as an indicator of how quickly we can deploy application changes into production. This metric is equivalent to the Lead Time for Changes metric in the Four Keys Project from the DevOps Research and Assessment. Additionally, the data for this metric also shows Deployment Frequency, another of the Four Keys metrics. The MTTP breakdown can be visualized on the Delivery Metrics page.
Target: less than 12 hours Health:Okay
- Work towards MTTP epic 280.
Chart
Review App deployment success rate
Measures the stability of the test tooling that enables end-to-end and exploratory testing feedback.
Target: Above 95% Health:Attention
- Moved to regular PI in FY24Q2
- Stabilized at 95% to 96% in the past 3 months
Chart URL(s):
- https://gitlab.com/gitlab-org/quality/engineering-productivity/team/-/issues/50
- https://gitlab.com/gitlab-org/quality/engineering-productivity/team/-/issues/56
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/ReviewAppSuccessRatePI
Time to First Failure p80
TtFF (pronounced “teuf”) measures the 80th percentile time from pipeline creation until the first actionable failed build is completed for the GitLab monorepo project. We want to run the tests that are likely to fail first and shorten the feedback cycle to R&D teams.
Target: Below 20 minutes Health:Okay
- Tracking this metric in addition to the average, starting FY23Q4
- A plan to optimize selective test execution is in place for backend and frontend tests
Chart URL(s):
- https://about.gitlab.com/handbook/engineering/quality/engineering-productivity/#test-intelligence
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/TimetoFirstFailureP80PI
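As a rough illustration of the p80 calculation, here is a nearest-rank 80th percentile over a set of pipeline durations; GitLab's dashboards may use a different percentile interpolation method:

```python
import math

def time_to_first_failure_p80(durations_minutes):
    """80th percentile (nearest-rank method) of the times from pipeline
    creation to completion of the first actionable failed build."""
    ordered = sorted(durations_minutes)
    rank = math.ceil(0.8 * len(ordered))  # nearest-rank: ceil(p * n)
    return ordered[rank - 1]

# With durations 10..50 minutes, the p80 is 40 minutes.
print(time_to_first_failure_p80([50, 10, 40, 30, 20]))
```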
Time to First Failure
TtFF (pronounced “teuf”) measures the average time from pipeline creation until the first actionable failed build is completed for the GitLab monorepo project. We want to run the tests that are likely to fail first and shorten the feedback cycle to R&D teams.
Target: Below 15 minutes Health:Okay
- Moved to regular PI in FY24Q2
- Under target of 15 mins for the past 2 months
Chart URL(s):
- https://about.gitlab.com/handbook/engineering/quality/engineering-productivity/#test-intelligence
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/TimetoFirstFailurePI
Average duration of end-to-end test suite
Measures the speed of our full QA/end-to-end test suite in the master branch. A Software Engineering in Test job-family performance indicator.
Target: at or below 90 mins Health:Okay
- Below target of 90 mins
Chart
Average age of quarantined end-to-end tests
Measures the stability and effectiveness of our QA/end-to-end tests running in the master branch. A Software Engineering in Test job-family performance indicator.
Target: TBD Health:Unknown
- The chart tracking this historical metric was broken. It has recently been fixed, but our visibility is still limited.
Chart
S1 Open Bug Age (OBA)
S1 Open Bug Age (OBA) measures the total number of days that all S1 bugs are open within a month divided by the number of S1 bugs within that month.
Target: Below 60 days Health:Okay
- Under target for the past 5 months
- Moved to regular PI in FY24Q3
Chart
S2 Open Bug Age (OBA)
S2 Open Bug Age (OBA) measures the total number of days that all S2 bugs are open within a month divided by the number of S2 bugs within that month.
Target: Below 250 days Health:Okay
- Under target for the past 11 months
- Moved to regular PI in FY24Q3
Chart URL(s):
- https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/1641
- https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/issues/1644
Quality Handbook MR Rate
The handbook is essential to working remote successfully, to keeping up our transparency, and to recruiting successfully. Our processes are constantly evolving and we need a way to make sure the handbook is being updated at a regular cadence. This data is retrieved by querying the API with a python script for merge requests that have files matching /source/handbook/engineering/quality/** over time.
Target: Above 1 MR per person per month Health:Problem
- Declining in last 3 months
- To be combined into one handbook structure and measurement
Chart
Quality Department Promotion Rate
The total number of promotions over a rolling 12 month period divided by the month end headcount. The target promotion rate is 12% of the population. This metric definition is taken from the People Success Team Member Promotion Rate PI.
Target: 12% Health:Okay
- Under target for 4 months, which is expected after being above target for 8 months
Chart
Quality Department Discretionary Bonus Rate
The number of discretionary bonuses given divided by the total number of team members, in a given period as defined. This metric definition is taken from the People Success Discretionary Bonuses KPI.
Target: at or above 10% Health:Attention
- We have not been close to target for 10 months
- Combining into one measurement in progress
Chart
Infrastructure Handbook MR Rate
The handbook is essential to working remote successfully, to keeping up our transparency, and to recruiting successfully. Our processes are constantly evolving and we need a way to make sure the handbook is being updated at a regular cadence. This data is retrieved by querying the API with a python script for merge requests that have files matching /source/handbook/engineering/ or /source/handbook/support/ over time.
Target: 0.25 Health:Attention
- Adjusted the target to .55 to be consistent with the larger org, to reflect lower activity from managers, and because our initially suggested target was higher than many months of observed activity
- Combining into one handbook structure and measurement in progress
Chart
Infrastructure Department Discretionary Bonus Rate
The number of discretionary bonuses given divided by the total number of team members, in a given period as defined. This metric definition is taken from the People Success Discretionary Bonuses KPI.
Target: at or above 10% Health:Okay
- Combining into one department measurement in-progress
Chart
Mean Time Between Incidents (MTBI)
Measures the mean elapsed time (in hours) from the start of one production incident, to the start of the next production incident. It serves primarily as an indicator of the amount of disruption being experienced by users and by on-call engineers. This metric includes only Severity 1 & 2 incidents as these are most directly impactful to customers. This metric can be considered “MTBF of Incidents”.
Target: more than 120 hours Health:Okay
- Target at 120 hours with the intent that we should not have such incidents more than approximately weekly (hopefully less). Further iterations will increase this target when we incorporate environment (production only).
- Deployment failures (and the mean time between them) will be extracted into a separate metric to serve as a quality countermeasure for MTTP, unrelated to this metric which focuses on declared service incidents.
Chart
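The MTBI calculation above can be sketched as a mean over the gaps between consecutive incident start times. This is illustrative only; the real data comes from declared incidents in the production project.

```python
from datetime import datetime, timedelta

def mean_time_between_incidents(start_times):
    """Mean gap in hours between consecutive incident start times."""
    ordered = sorted(start_times)
    gaps_hours = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return sum(gaps_hours) / len(gaps_hours)

# Incidents starting 100h and 200h apart give an MTBI of 150 hours,
# above the 120-hour target.
base = datetime(2024, 1, 1)
print(mean_time_between_incidents(
    [base, base + timedelta(hours=100), base + timedelta(hours=300)]))
```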
Mean Time To Resolution (MTTR)
For all customer-impacting services, measures the elapsed time (in hours) it takes us to resolve an incident when one occurs. This serves as an indicator of our ability to execute such recoveries. It includes Severity 1 & Severity 2 incidents from the production project.
Target: less than 24 hours Health:Attention
- Data depends on SREs adding the incident::resolved label
- As we continue to dogfood GitLab Incident Management, we intend to improve this metric
Chart URL(s):
- https://gitlab.com/gitlab-com/gl-infra/production/-/boards/1717012?&label_name[]=incident
- https://10az.online.tableau.com/#/site/gitlab/views/InfrastructureKPIs/MeanTimetoResolutionPI
Mean Time To Mitigate (MTTM)
For all customer-impacting services, measures the elapsed time (in hours) it takes us to mitigate an incident when one occurs. This serves as an indicator of our ability to mitigate production incidents. It includes Severity 1 & Severity 2 incidents from the production project.
Target: less than 1 hour Health:Attention
- This metric is equivalent to the Time to Restore metric in the Four Keys Project from the DevOps Research and Assessment
- Data depends on SREs adding the incident::mitigate label
- As we continue to dogfood GitLab Incident Management, we intend to improve this metric
Chart
GitLab.com Saturation Forecasting
It is critical that we continuously observe normal growth in resource saturation as well as detect anomalies. This helps to ensure that we have the appropriate platform capacity in place. This metric uses the results of the Tamland forecasting framework for non-horizontally scalable services, which end up as issues in the Capacity Planning project. The metric counts the number of open capacity issues in that project.
Target: at or below 5 open issues Health:Attention
- Next improvements are to document the existing process for creating capacity planning issues, with a view to simplifying and automating it. Documentation and improvement of the process is a requirement for the SOC 2 Availability Criteria and is an OKR for Scalability for Q1.
- Once we have Thanos data available in Snowflake we will switch this PI to show the percentage
Chart
GitLab.com Hosting Cost / Revenue
We need to spend our investors' money wisely. As part of this we aim to follow industry-standard targets for hosting cost as a % of overall revenue. In this case revenue is measured as MRR + one-time monthly revenue from CI & Storage.
Target: TBD This KPI cannot be public Health:Confidential
- Confidential metric - See Key Review agenda
Infrastructure Department Promotion Rate
The total number of promotions over a rolling 12 month period divided by the month end headcount. The target promotion rate is 12% of the population. This metric definition is taken from the People Success Team Member Promotion Rate PI.
Target: 12% Health:Okay
- Above target
- Combining into one department measurement in-progress
Chart
Legends
Health
Value | Level | Meaning |
---|---|---|
3 | Okay | The KPI is at an acceptable level compared to the threshold |
2 | Attention | This is a blip, or we’re going to watch it, or we just need to enact a proven intervention |
1 | Problem | We'll prioritize our efforts here |
-1 | Confidential | Metric & metric health are confidential |
0 | Unknown | Unknown |
How pages like this work
Data
At the heart of pages like this are Performance Indicators data files, which are YAML files. Each - denotes a dictionary of values for a new (K)PI. The current elements (or data properties) are:
Property | Type | Description |
---|---|---|
name | Required | String value of the name of the (K)PI. For Product PIs, product hierarchy should be separated from the name by " - " (Ex. {Stage Name}:{Group Name} - {PI Type} - {PI Name}) |
base_path | Required | Relative path to the performance indicator page that this (K)PI should live on |
definition | Required | Refer to Parts of a KPI |
parent | Optional | Should be used when a (K)PI is a subset of another PI. For example, we might care about Hiring vs Plan at the company level. The child would be the division and department levels, which would have the parent flag. |
target | Required | The target or cap for the (K)PI. Please use Unknown until we reach maturity level 2 if this is not yet defined. For GMAU, the target should be quarterly. |
org | Required | The organizational grouping (Ex: Engineering Function or Development Department). For Product Sections, ensure you have the word section (Ex: Dev Section) |
section | Optional | The product section (Ex: dev) as defined in sections.yml |
stage | Optional | The product stage (Ex: release) as defined in stages.yml |
group | Optional | The product group (Ex: progressive_delivery) as defined in stages.yml |
category | Optional | The product category (Ex: feature_flags) as defined in categories.yml |
is_key | Required | Boolean value (true/false) that indicates if it is a (key) performance indicator |
health | Required | Indicates the (K)PI health and reasons as nested attributes. This should be updated monthly before Key Reviews by the DRI. |
health.level | Optional | Indicates a value between 0 and 3 (inclusive) to represent the health of the (K)PI. This should be updated monthly before Key Reviews by the DRI. |
health.reasons | Optional | Indicates the reasons behind the health level. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one reason. |
urls | Optional | List of urls associated with the (K)PI. Should be an array (indented lines starting with dashes) even if you only have one url |
funnel | Optional | Indicates there is a handbook link for a description of the funnel for this PI. Should be a URL |
public | Optional | Boolean flag that can be set to false where a (K)PI does not meet the public guidelines |
pi_type | Optional | Indicates the Product PI type (Ex: AMAU, GMAU, SMAU, Group PPI) |
product_analytics_type | Optional | Indicates if the metric is available on SaaS, SM (self-managed), or Both |
is_primary | Optional | Boolean flag that indicates if this is the Primary PI for the Product Group |
implementation | Optional | Indicates the implementation status and reasons as nested attributes. This should be updated monthly before Key Reviews by the DRI. |
implementation.status | Optional | Indicates the Implementation Status. This should be updated monthly before Key Reviews by the DRI. |
implementation.reasons | Optional | Indicates the reasons behind the implementation status. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one reason. |
lessons | Optional | Indicates lessons learned from a (K)PI as a nested attribute. This should be updated monthly before Key Reviews by the DRI. |
lessons.learned | Optional | An attribute nested under lessons that indicates lessons learned from a (K)PI. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one lesson learned |
monthly_focus | Optional | Indicates monthly focus goals from a (K)PI as a nested attribute. This should be updated monthly before Key Reviews by the DRI. |
monthly_focus.goals | Optional | Indicates monthly focus goals from a (K)PI. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one goal |
metric_name | Optional | Indicates the name of the metric in the Self-Managed implementation. The SaaS representation of the Self-Managed implementation should use the same name. |
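Putting the properties above together, a minimal entry in such a data file might look like the sketch below. All field values here are hypothetical examples for illustration, not a real PI definition from the handbook repository:

```yaml
- name: GitLab.com Availability SLO
  base_path: /handbook/engineering/infrastructure/performance-indicators/
  definition: Percentage of time during which GitLab.com is fully operational.
  target: 99.80%
  org: Infrastructure Department
  is_key: true
  health:
    level: 3
    reasons:
      - Above target for the past three months
  urls:
    - https://about.gitlab.com/handbook/engineering/monitoring/
```

Note that health.reasons and urls are arrays (indented lines starting with dashes) even when they hold a single item, per the table above.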