Infrastructure Department Performance Indicators

Executive Summary

KPI Health Status
GitLab.com Availability SLO Okay
  • June 2024 100.00%
  • May 2024 99.99%
  • April 2024 99.95%
Corrective Action SLO Attention
  • Corrective Action SLO is at 2 open corrective actions past their due date
Master Pipeline Stability Attention
  • April 2024 decreased to 91%
  • Causes of broken `master` for April 2024 were flaky tests (45%), infrastructure/runner issues (42%), jobs timing out (17%), various infrastructure issues (11%), failed to pull job image (9%), runner disk full (5%), merge train missing (3%), test gap (3%), dependency upgrade (3%), broken CI config (2%), and GitLab.com overloaded (2%); a broken pipeline can have more than one cause, so percentages sum to over 100%
  • We automated the test quarantine process to remove highly disruptive flaky tests from the pipelines and report them in each team's weekly triage report
  • More communication has been added to merge requests and Slack channels to prompt earlier action on failed pipelines
Merge request pipeline duration Attention
  • The chart we previously showed made assumptions about CI job dependencies that no longer hold, so it sometimes failed to take child pipelines into account when computing pipeline duration. A fix was made on 2024-04-29 to use the pipeline duration from the GitLab database directly instead of calculating it from queries ([see investigation](https://gitlab.com/gitlab-org/quality/engineering-productivity/team/-/issues/378#note_1740584179)). As a result, the average and percentile durations are higher than previously reported.
S1 Open Customer Bug Age (OCBA) Okay
  • Promoted to KPI in FY24Q2
  • Near target for 3 consecutive months; the uptick in the current month is due to ongoing issue triage
  • All S1 bugs are being reviewed for upcoming milestone planning
S2 Open Customer Bug Age (OCBA) Attention
  • Promoted to KPI in FY24Q2
  • Above target; a significant reduction will require focusing on older customer-impacting S2 bugs
Quality Team Member Retention Confidential
  • Confidential metric, see notes in Key Review agenda
  • Will be merged into a combined department retention metric
Infrastructure Team Member Retention Confidential
  • Confidential metric, see notes in Key Review agenda
  • Will be merged into a combined department retention metric

Key Performance Indicators

GitLab.com Availability SLO

Percentage of time during which GitLab.com is fully operational and providing service to users within SLO parameters. The definition and historical availability data are available on the GitLab.com Service Level Availability page.

Target: equal to or greater than 99.80% Health:Okay

  • June 2024 100.00%
  • May 2024 99.99%
  • April 2024 99.95%
Chart

URL(s):


Corrective Action SLO

The Corrective Actions (CAs) SLO focuses on the number of open severity::1/severity::2 Corrective Action Issues past their due date. Corrective Actions and their due dates are defined in our Incident Review process.

Target: at 0 Health:Attention

  • Corrective Action SLO is at 2 open corrective actions past their due date
Chart

URL(s):


Master Pipeline Stability

Measures our monolith master pipeline success rate, a key indicator of engineering productivity and the stability of our releases. We will continue to leverage Merge Trains in this effort.

Target: Above 95% Health:Attention

  • April 2024 decreased to 91%
  • Causes of broken `master` for April 2024 were flaky tests (45%), infrastructure/runner issues (42%), jobs timing out (17%), various infrastructure issues (11%), failed to pull job image (9%), runner disk full (5%), merge train missing (3%), test gap (3%), dependency upgrade (3%), broken CI config (2%), and GitLab.com overloaded (2%); a broken pipeline can have more than one cause, so percentages sum to over 100%
  • We automated the test quarantine process to remove highly disruptive flaky tests from the pipelines and report them in each team's weekly triage report
  • More communication has been added to merge requests and Slack channels to prompt earlier action on failed pipelines
Chart

URL(s):


Merge request pipeline duration

Measures the average successful duration of the monolith merge request pipelines. A key building block for improving our cycle time and efficiency. More pipeline improvements.

Target: Below 45 minutes Health:Attention

  • The chart we previously showed made assumptions about CI job dependencies that no longer hold, so it sometimes failed to take child pipelines into account when computing pipeline duration. A fix was made on 2024-04-29 to use the pipeline duration from the GitLab database directly instead of calculating it from queries (see investigation). As a result, the average and percentile durations are higher than previously reported.

Chart

URL(s):


S1 Open Customer Bug Age (OCBA)

S1 Open Customer Bug Age (OCBA) measures the total number of days that all S1 customer-impacting bugs are open within a month divided by the number of S1 customer-impacting bugs within that month.
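
For concreteness, a worked example with hypothetical numbers (not actual GitLab data): if three S1 customer-impacting bugs were open during a month for 10, 25, and 40 days respectively, then

```latex
\text{OCBA}_{S1} = \frac{10 + 25 + 40}{3} = 25 \text{ days}
```

which would be under the 30-day target.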

Target: Below 30 days Health:Okay

  • Promoted to KPI in FY24Q2
  • Near target for 3 consecutive months; the uptick in the current month is due to ongoing issue triage
  • All S1 bugs are being reviewed for upcoming milestone planning
Chart

URL(s):


S2 Open Customer Bug Age (OCBA)

S2 Open Customer Bug Age (OCBA) measures the total number of days that all S2 customer-impacting bugs are open within a month divided by the number of S2 customer-impacting bugs within that month.

Target: Below 250 days Health:Attention

  • Promoted to KPI in FY24Q2
  • Above target; a significant reduction will require focusing on older customer-impacting S2 bugs
Chart

URL(s):


Quality Team Member Retention

We need to retain talented team members. Retention measures our ability to keep them at GitLab. Team Member Retention = (1 - (Number of Team Members leaving GitLab / Average of the 12-month Total Team Member Headcount)) x 100. GitLab measures team member retention over a rolling 12-month period.
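
As a worked example with hypothetical numbers (not actual GitLab data): if 5 team members left over the rolling 12-month period and the average headcount was 100, then

```latex
\text{Retention} = \left(1 - \frac{5}{100}\right) \times 100 = 95\%
```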

Target: at or above 84% This KPI cannot be public Health:Confidential

  • Confidential metric, see notes in Key Review agenda
  • Will be merged into a combined department retention metric
URL(s):


Infrastructure Team Member Retention

We need to retain talented team members. Retention measures our ability to keep them at GitLab. Team Member Retention = (1 - (Number of Team Members leaving GitLab / Average of the 12-month Total Team Member Headcount)) x 100. GitLab measures team member retention over a rolling 12-month period.

Target: at or above 84% This KPI cannot be public Health:Confidential

  • Confidential metric, see notes in Key Review agenda
  • Will be merged into a combined department retention metric
URL(s):


Regular Performance Indicators

Mean Time To Production (MTTP)

Measures the elapsed time (in hours) from merging a change into the gitlab-org/gitlab project's master branch to deploying that change to GitLab.com. It serves as an indicator of how quickly we can deploy application changes to production. This metric is equivalent to the Lead Time for Changes metric in the Four Keys project from DevOps Research and Assessment (DORA). The data for this metric also shows Deployment Frequency, another of the Four Keys metrics. The MTTP breakdown can be visualized on the Delivery Metrics page.

Target: less than 12 hours Health:Okay

  • Work towards MTTP epic 280.
Chart

URL(s):


Review App deployment success rate

Measures the stability of our test tooling that enables end-to-end and exploratory testing feedback.

Target: Above 95% Health:Attention

  • Moved to regular PI in FY24Q2
  • Stabilized between 95% and 96% over the past 3 months
Chart

URL(s):


Time to First Failure p80

TtFF (pronounced “teuf”) measures the 80th-percentile time from pipeline creation until the first actionable failed build completes for the GitLab monorepo project. We want to run the tests most likely to fail first, shortening the feedback cycle for R&D teams.

Target: Below 20 minutes Health:Okay

  • Tracking this metric in addition to the average, starting in FY23Q4
  • A plan to optimize selective test execution is in place for backend and frontend tests
Chart

URL(s):


Time to First Failure

TtFF (pronounced “teuf”) measures the average time from pipeline creation until the first actionable failed build completes for the GitLab monorepo project. We want to run the tests most likely to fail first, shortening the feedback cycle for R&D teams.

Target: Below 15 minutes Health:Okay

  • Moved to regular PI in FY24Q2
  • Under target of 15 mins for the past 2 months
Chart

URL(s):


Average duration of end-to-end test suite

Measures the speed of our full QA/end-to-end test suite in the master branch. A Software Engineering in Test job-family performance indicator.

Target: at or below 90 mins Health:Okay

  • Below target of 90 mins
Chart

URL(s):


Average age of quarantined end-to-end tests

Measures the stability and effectiveness of our QA/end-to-end tests running in the master branch. A Software Engineering in Test job-family performance indicator.

Target: TBD Health:Unknown

  • The chart tracking this metric's history was broken. It has recently been fixed, but our visibility is still limited.
Chart

URL(s):


S1 Open Bug Age (OBA)

S1 Open Bug Age (OBA) measures the total number of days that all S1 bugs are open within a month divided by the number of S1 bugs within that month.

Target: Below 60 days Health:Okay

  • Under target for the past 5 months
  • Moved to regular PI in FY24Q3
Chart


S2 Open Bug Age (OBA)

S2 Open Bug Age (OBA) measures the total number of days that all S2 bugs are open within a month divided by the number of S2 bugs within that month.

Target: Below 250 days Health:Okay

  • Under target for the past 11 months
  • Moved to regular PI in FY24Q3
Chart

URL(s):


Quality Handbook MR Rate

The handbook is essential to working remotely, to maintaining our transparency, and to recruiting successfully. Our processes are constantly evolving, and we need a way to make sure the handbook is updated at a regular cadence. This data is retrieved by a Python script that queries the API for merge requests with files matching /source/handbook/engineering/quality/** over time.

Target: Above 1 MR per person per month Health:Problem

  • Declining in last 3 months
  • To be combined into one handbook structure and measurement
Chart


Quality Department Promotion Rate

The total number of promotions over a rolling 12-month period divided by the month-end headcount. The target promotion rate is 12% of the population. This metric definition is taken from the People Success Team Member Promotion Rate PI.
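
As a worked example with hypothetical numbers: a department with a month-end headcount of 50 and 6 promotions over the rolling 12-month period would have

```latex
\text{Promotion Rate} = \frac{6}{50} = 0.12 = 12\%
```

which matches the 12% target.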

Target: 12% Health:Okay

  • Under target for 4 months, which is expected after being above target for 8 months
Chart


Quality Department Discretionary Bonus Rate

The number of discretionary bonuses given divided by the total number of team members in a given period. This metric definition is taken from the People Success Discretionary Bonuses KPI.

Target: at or above 10% Health:Attention

  • We have not been close to target for 10 months
  • Combining into one measurement is in progress
Chart


Infrastructure Handbook MR Rate

The handbook is essential to working remotely, to maintaining our transparency, and to recruiting successfully. Our processes are constantly evolving, and we need a way to make sure the handbook is updated at a regular cadence. This data is retrieved by a Python script that queries the API for merge requests with files matching /source/handbook/engineering/ or /source/handbook/support/ over time.

Target: 0.25 Health:Attention

  • Adjusted the target to 0.55 to be consistent with the larger org, to reflect lower activity from managers, and because our initially suggested target has proven higher than many months of observed activity
  • Combining into one handbook structure and measurement is in progress
Chart


Infrastructure Department Discretionary Bonus Rate

The number of discretionary bonuses given divided by the total number of team members in a given period. This metric definition is taken from the People Success Discretionary Bonuses KPI.

Target: at or above 10% Health:Okay

  • Combining into one department measurement is in progress
Chart


Mean Time Between Incidents (MTBI)

Measures the mean elapsed time (in hours) from the start of one production incident to the start of the next. It serves primarily as an indicator of the amount of disruption experienced by users and by on-call engineers. This metric includes only Severity 1 and 2 incidents, as these are the most directly impactful to customers. It can be considered the “MTBF of incidents”.

Target: more than 120 hours Health:Okay

  • The target is set at 120 hours with the intent that we should not have such incidents more than approximately weekly (hopefully less often). Further iterations will increase this target when we incorporate environment scoping (production only).
  • Deployment failures (and the mean time between them) will be extracted into a separate metric to serve as a quality countermeasure for MTTP, unrelated to this metric, which focuses on declared service incidents.
Chart

URL(s):


Mean Time To Resolution (MTTR)

For all customer-impacting services, measures the elapsed time (in hours) it takes us to resolve an incident when one occurs. This serves as an indicator of our ability to execute such recoveries. It includes Severity 1 and Severity 2 incidents from the production project.

Target: less than 24 hours Health:Attention

  • Data depends on SREs adding the incident::resolved label
  • As we continue to dogfood GitLab Incident Management, we intend to improve this metric
Chart

URL(s):


Mean Time To Mitigate (MTTM)

For all customer-impacting services, measures the elapsed time (in hours) it takes us to mitigate an incident when one occurs. This serves as an indicator of our ability to mitigate production incidents. It includes Severity 1 and Severity 2 incidents from the production project.

Target: less than 1 hour Health:Attention

  • This metric is equivalent to the Time to Restore metric in the Four Keys project from DevOps Research and Assessment (DORA)
  • Data depends on SREs adding the incident::mitigate label
  • As we continue to dogfood GitLab Incident Management, we intend to improve this metric
Chart

URL(s):


GitLab.com Saturation Forecasting

It is critical that we continuously observe normal growth in resource saturation as well as detect anomalies, to help ensure that we have the appropriate platform capacity in place. This metric uses the results of the Tamland forecasting framework for non-horizontally-scalable services, which end up as issues in the Capacity Planning project. The metric counts the number of open capacity issues in that project.

Target: at or below 5 open issues Health:Attention

  • The next improvements are to document the existing process for creating capacity planning issues, with a view to simplifying and automating it. Documenting and improving this process is a requirement for the SOC 2 Availability Criteria and is a Scalability OKR for Q1.
  • Once Thanos data is available in Snowflake, we will switch this PI to show the percentage
Chart

URL(s):


GitLab.com Hosting Cost / Revenue

We need to spend our investors’ money wisely. As part of this, we aim to follow industry-standard targets for hosting cost as a percentage of overall revenue. In this case, revenue is measured as MRR plus one-time monthly revenue from CI and Storage.

Target: TBD This KPI cannot be public Health:Confidential

  • Confidential metric - See Key Review agenda
URL(s):


Infrastructure Department Promotion Rate

The total number of promotions over a rolling 12-month period divided by the month-end headcount. The target promotion rate is 12% of the population. This metric definition is taken from the People Success Team Member Promotion Rate PI.

Target: 12% Health:Okay

  • Above target
  • Combining into one department measurement is in progress
Chart


Legends

Health

| Value | Level | Meaning |
| --- | --- | --- |
| 3 | Okay | The KPI is at an acceptable level compared to the threshold |
| 2 | Attention | This is a blip, or we’re going to watch it, or we just need to enact a proven intervention |
| 1 | Problem | We'll prioritize our efforts here |
| -1 | Confidential | Metric & metric health are confidential |
| 0 | Unknown | Unknown |

How pages like this work

Data

The heart of pages like this is a set of Performance Indicators data files, which are YAML files. Each `-` denotes a dictionary of values for a new (K)PI. The current elements (or data properties) are listed in the table below; an illustrative example entry follows the table.

| Property | Type | Description |
| --- | --- | --- |
| name | Required | String value of the name of the (K)PI. For Product PIs, the product hierarchy should be separated from the name by " - " (Ex: {Stage Name}:{Group Name} - {PI Type} - {PI Name}) |
| base_path | Required | Relative path to the performance indicator page that this (K)PI should live on |
| definition | Required | Refer to Parts of a KPI |
| parent | Optional | Should be used when a (K)PI is a subset of another PI. For example, we might care about Hiring vs Plan at the company level. The child would be the division and department levels, which would have the parent flag. |
| target | Required | The target or cap for the (K)PI. Please use Unknown until we reach maturity level 2 if this is not yet defined. For GMAU, the target should be quarterly. |
| org | Required | The organizational grouping (Ex: Engineering Function or Development Department). For Product Sections, ensure you have the word section (Ex: Dev Section) |
| section | Optional | The product section (Ex: dev) as defined in sections.yml |
| stage | Optional | The product stage (Ex: release) as defined in stages.yml |
| group | Optional | The product group (Ex: progressive_delivery) as defined in stages.yml |
| category | Optional | The product category (Ex: feature_flags) as defined in categories.yml |
| is_key | Required | Boolean value (true/false) that indicates if it is a (key) performance indicator |
| health | Required | Indicates the (K)PI health and reasons as nested attributes. This should be updated monthly before Key Reviews by the DRI. |
| health.level | Optional | Indicates a value between 0 and 3 (inclusive) to represent the health of the (K)PI. This should be updated monthly before Key Reviews by the DRI. |
| health.reasons | Optional | Indicates the reasons behind the health level. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one reason. |
| urls | Optional | List of URLs associated with the (K)PI. Should be an array (indented lines starting with dashes) even if you only have one URL. |
| funnel | Optional | Indicates there is a handbook link for a description of the funnel for this PI. Should be a URL. |
| public | Optional | Boolean flag that can be set to false where a (K)PI does not meet the public guidelines. |
| pi_type | Optional | Indicates the Product PI type (Ex: AMAU, GMAU, SMAU, Group PPI) |
| product_analytics_type | Optional | Indicates whether the metric is available on SaaS, SM (self-managed), or both |
| is_primary | Optional | Boolean flag that indicates if this is the primary PI for the product group |
| implementation | Optional | Indicates the implementation status and reasons as nested attributes. This should be updated monthly before Key Reviews by the DRI. |
| implementation.status | Optional | Indicates the implementation status. This should be updated monthly before Key Reviews by the DRI. |
| implementation.reasons | Optional | Indicates the reasons behind the implementation status. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one reason. |
| lessons | Optional | Indicates lessons learned from a (K)PI as a nested attribute. This should be updated monthly before Key Reviews by the DRI. |
| lessons.learned | Optional | An attribute nested under lessons that indicates lessons learned from a (K)PI. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one lesson learned. |
| monthly_focus | Optional | Indicates monthly focus goals for a (K)PI as a nested attribute. This should be updated monthly before Key Reviews by the DRI. |
| monthly_focus.goals | Optional | Indicates monthly focus goals for a (K)PI. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one goal. |
| metric_name | Optional | Indicates the name of the metric in the self-managed implementation. The SaaS representation of the self-managed implementation should use the same name. |
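
As an illustrative sketch only (the name, values, path, and URL below are hypothetical, not taken from a real data file), a minimal (K)PI entry using these properties might look like:

```yaml
# Hypothetical (K)PI entry; all values are illustrative.
- name: Example Pipeline Stability
  base_path: /handbook/engineering/performance-indicators/  # page this (K)PI lives on
  definition: Measures the example pipeline success rate.
  target: Above 95%
  org: Infrastructure Department
  is_key: true
  public: true
  health:
    level: 2                                # 2 = Attention (see the Health legend above)
    reasons:
      - Success rate decreased this month   # arrays use indented dashes
  urls:
    - https://example.com/chart             # hypothetical chart URL
```

Note that health.reasons and urls are written as arrays (indented lines starting with dashes) even when they contain only one item.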