Package: Container Registry Group
The Team
The Container Registry is part of the GitLab Package stage, which integrates with GitLab’s CI/CD product.
Who We Are
Team Members
The following people are permanent members of the Container Registry Group:
Stable Counterparts
The following members of other functional teams are our stable counterparts:
Name | Role |
---|---|
Greg Myers | Security Engineer, Application Security, Package (Package Registry, Container Registry), US Public Sector Services, Gitaly Cluster, Analytics (Analytics Instrumentation, Product Analytics), AI Working Group |
Jackie Porter | Director of Product Management, Verify & Package |
Tim Rizzi | Principal Product Manager, Package |
How We Work
Directly Responsible Individual (DRI)
A DRI is assigned to every substantial project or initiative the team works on. A project is considered substantial when the work involved is expected to span more than two milestones. When projects take that long to deliver, tasks such as the planning and breakdown of deliverables and regular async updates become increasingly important for the project’s success. Therefore, it makes sense to enforce the assignment of a DRI, who will be personally accountable for those tasks.
We strongly encourage everyone on the team to step forward and sign up as DRI for new projects. Ideally, all team members should experience this role over time. This promotes shared ownership, accountability and development opportunities for all team members.
In case of critical, unusually long, or highly complex projects, a specific DRI with the most experience on the subject may be assigned by the Engineering Manager. In these situations, other team members may volunteer or be assigned to shadow the assigned DRI and act as backup. This provides not only a learning opportunity for newer team members but also redundancy.
Apart from what is described in the DRI handbook page, DRIs leading projects on the team must perform the following tasks:
- Make sure the epic that serves as the single source of truth for the project is kept up to date, along with the individual sub-epics and issues under it;
- Consistently provide a weekly async update on the related epic; low-level updates on sub-epics are optional, but high-level updates on the root epic are required;
- Ensure there is at least one issue ready to be scheduled on the next milestone;
- Engage with the Product Manager to have the issue(s) ready for development scheduled in the next milestone;
- Keep the Engineering Manager and Product Manager aware of any unexpected changes to the plan;
- Consult and collaborate with other DRIs when inter-project dependencies or blockers are identified;
- Consult with other engineers when the project’s technical scope changes.
The DRI for a given project can be identified by looking at the corresponding epic’s description, where a section as follows should be added:
```md
## Owners

* Team: [Container Registry](/handbook/engineering/development/ops/package/container-registry/)
* Most appropriate slack channel to reach out to: `#g_container-registry`
* Best individual to reach out to: <!-- GitLab handle of the DRI, or "TBD" if none has been assigned yet -->
* PM: @trizzi
* EM: @crystalpoole
```
Additionally, we maintain a list of active projects and the assigned DRI on this page, in What Are We Working On.
Authors of merge requests related to a specific project should request a review from the assigned DRI or backup DRI to ensure they are aware of the changes and can provide the necessary oversight.
Alert and CI flake management
The team is responsible for monitoring the Slack channel #g_container-registry_alerts, where alerts and CI failure notifications (broken master) are displayed for the registry service and code base. Service alerts are configured in the runbooks project, and they follow the infrastructure team's process for defining them.
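For orientation, here is a minimal sketch of what such an alert definition can look like, assuming a generic Prometheus-style rule format. The actual runbooks project uses its own format and tooling, and the metric name (`registry_http_requests_total`), threshold, and labels below are illustrative assumptions only.

```yaml
# A minimal, generic Prometheus alerting-rule sketch. Not the actual runbooks
# definition; the metric, threshold, and labels are illustrative assumptions.
groups:
  - name: container-registry-example
    rules:
      - alert: ContainerRegistryHighErrorRate
        # Fire when more than 5% of registry requests return 5xx over 5 minutes.
        expr: |
          sum(rate(registry_http_requests_total{code=~"5.."}[5m]))
            / sum(rate(registry_http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: s3
        annotations:
          title: "Container Registry 5xx error rate is above 5%"
          runbook: "docs/registry/high-error-rate.md"
```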
Process for handling alerts
The team has agreed on the following process to handle alerts:
- There is no person formally on-call (unless otherwise agreed during certain periods, e.g. end of year holidays).
- Everyone is responsible for keeping an eye on #g_container-registry_alerts during their working hours.
- When there is a new alert/CI notification:
  - Add an 👀 emoji to the alert to signal it is being looked at.
  - Click on an alert for details. Each alert may contain the following:
    - Runbook - how to deal with the alert.
    - Dashboard - link to the Grafana chart related to the metric that triggered the alert.
    - Pipeline that failed - broken `master`.
    - Sentry issue - contains the stacktrace for the alert's origin.
  - Use the available resources to evaluate the problem.
  - Determine if it's safe to ignore:
    - There is an existing issue for this alert. If so, add an occurrence of this problem in the issue description following the alert occurrence template.
    - The logs/dashboards show that the issue seems to be resolved. For example, when the Pending Tasks metric for the online garbage collector is going down after a sudden peak and there are no errors in the logs.
    - The alert has been automatically resolved.
    - Open an issue if this requires attention in the future. If the alert/CI notification is due to a flake, identify the severity of the failure and add an appropriate priority label. CC `@trizzi` in the issue for prioritization and `@gitlab-org/ci-cd/package-stage/container-registry-group` so that they are aware of the issue.
    - If this is a recurring alert that was deemed safe to ignore, consider raising an issue to adjust the alert thresholds. CC `@trizzi` in the issue for prioritization and `@gitlab-org/ci-cd/package-stage/container-registry-group` so that they are aware of the issue.
    - If you raised or updated an issue, ensure that it has the correct labels. If the problem is due to a flaky test, apply the `~"failure::flaky-test"` label; `~"flaky-test::<type>"` labels are optional but recommended. If it is due to an alert, apply the `~"container registry::alert"` label. Finally, ensure that the issue has the appropriate `~"priority::N"` label (see the labeling example after this list).
  - Otherwise:
    - Review the #production channel and the #incident-management channel for existing incidents that may be related.
    - If there is an ongoing incident, consider helping or reaching out to the team for assistance.
    - Otherwise, consider reporting an incident.
  - Share details in the #g_container-registry channel to raise awareness.
    - Ping people as needed.
  - Add a comment as a thread to the alert that you reviewed.
- Once the problem has been resolved or the required short-term investigation is complete, react with a ✅ emoji to the notification.
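As a hypothetical illustration, the labels above can be applied in a single issue comment using GitLab quick actions. The priority value is a placeholder, and `<type>` should be replaced with the appropriate flaky-test type for the failure at hand:

```md
/label ~"failure::flaky-test" ~"flaky-test::<type>"
/label ~"priority::2"

cc @trizzi @gitlab-org/ci-cd/package-stage/container-registry-group for awareness
```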
Alert Occurrence Template
Add or update this template in the alert-related issue, recording the number of times the alert has been seen.
```md
## Alert Occurrence Update

- **Occurrence Count**: X (previously Y)
- **Date/Time**: [Insert timestamp of occurrence]
- **Last occurrences**: [Insert slack link]
```
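For example, a hypothetical third occurrence of an alert might be recorded as:

```md
## Alert Occurrence Update

- **Occurrence Count**: 3 (previously 2)
- **Date/Time**: 2024-05-14 09:32 UTC
- **Last occurrences**: [Slack link to the alert message]
```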
Resources
Logs
Dashboards
- Overview
- Application
- Online GC
- Database
- Redis
- Storage
- Notifications
- PgBouncer
- Patroni
- HAProxy
- Redis Cache Server
Other
📈 Measuring results
OKRs
We use quarterly Objectives and Key Results as a tool to help us plan and measure how to achieve Key Performance Indicators (KPIs).
Here is the standard, company-wide process for OKRs
Performance indicators
We measure the value we contribute by using performance indicator metrics. The primary metric used for the Container Registry group is the number of group monthly active users (GMAU).
What Are We Working On
Below is a list of projects and initiatives we are currently working on, along with the corresponding DRI. We work on issues by priority, and projects may not have active development in every milestone. DRI engineers take responsibility for planning and delivery of upcoming work; however, issues can be assigned to any team member.
What We’ve Recently Completed
Project | Milestone Completed |
---|---|
Documentation
Project documentation is available here.