Development Department Learning and Development - Reliability

Goal of this training

With a renewed engineering focus on reliability to reduce outages, we have made many changes to the handbook, production documentation, and our processes. Although we announced them through multiple channels (EWIR, Slack, email, meetings), it is likely that not everyone has seen and internalized all of the important changes.

This training gathers the crucial changes, explains why we made them, summarizes each one, and links to where you can find more information.

This material is available as a learning pathway on GitLab’s Level Up.

Introduction

Amplifying SaaS Reliability Focus

Reliability & Security Standup

The business impact of reliability

Importance of reliability to the business

Impact of reliability on users

Video (not public)

Improving SUS - slides 9 through 14 in particular

Updates to values

MR to change quality and reliability

MR around things that don’t scale

Blameless Culture

Google SRE Book: Blameless culture

Limiting the impact of far-reaching work

Limiting the impact of far-reaching work

Overview of Risk Mapping

Quality Risk Mapping

Development ops risk mapping

MR acceptance checklist

MR acceptance checklist

Updates to the definition of done

Definition of Done

Backwards Compatibility

Course on backwards compatibility

How to use the stage group dashboards to understand how a feature category performs

Stage group dashboard documentation

Error budgets

Error budgets

Feature Change Locks (FCL)

Feature change locks

Past due infradev issues added as a KPI

Past due infradev issues

Overview of Engineering Metrics Dashboards

Engineering metric dashboards

Feedback on the training

  • What did you like about the training?
  • What did you not like that we should improve?

Add your comments in this feedback issue.