Observability Based Performance Testing at GitLab

Description

Observability Based Performance Testing is a proactive approach to understanding system performance through comprehensive instrumentation and real-time data collection. Unlike traditional performance testing, which relies on specific test scenarios, observability testing provides deep visibility into application behavior under real-world conditions.

Approach

Observability testing actively uses our Observability tools to detect trends before they develop into performance issues. A couple of common approaches:

  • Have the development teams monitor the dashboards for their components and proactively pick up performance concerns
  • Build dashboards/tooling that support exploratory testing of the Observability data, looking for linkages that may not be obvious (e.g. system A causes system B to slow down); a sketch of this kind of correlation analysis follows this list
    • Tools like the Performance Bar can enable someone to notice a performance oddity and start the investigation into the root cause
  • Extend existing Observability tooling to development/test environments
    • This enables teams to get performance metrics earlier in the development process
    • This increases team familiarity with the tooling which facilitates easier adoption/use
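
A lightweight sketch of this kind of exploratory analysis, assuming a hypothetical Prometheus endpoint, placeholder metric names, and Python 3.10+ for statistics.correlation: it pulls two series over the same window and checks how strongly they move together.

```python
# Sketch: correlate two Prometheus series to surface non-obvious linkages.
# PROM_URL and the PromQL metric names are placeholders, not our real schema.
import time
import statistics
import requests

PROM_URL = "http://prometheus.example.com"  # hypothetical endpoint


def query_range(promql, hours=6, step="60s"):
    """Fetch one series from Prometheus over the last `hours` as a list of floats."""
    end = time.time()
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": promql, "start": end - hours * 3600, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return [float(v) for _, v in result[0]["values"]] if result else []


# Example question: does queue depth in "system A" track request latency in "system B"?
queue_depth = query_range('sum(sidekiq_queue_size{queue="pipeline_processing"})')
latency = query_range(
    'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="web"}[5m])) by (le))'
)

n = min(len(queue_depth), len(latency))
if n > 2:
    corr = statistics.correlation(queue_depth[:n], latency[:n])
    print(f"Pearson correlation over {n} samples: {corr:.2f}")
```

A strong correlation here is only a lead, not a root cause; it tells us where to point tools like the Performance Bar next.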

Key Components

  • Instrumentation: Embedding traces, metrics, and structured logging across the application stack (a minimal sketch follows this list).
  • Real-time Data Collection: Gathering performance data continuously during actual usage.
  • Holistic Analysis: Examining the entire system to identify bottlenecks and patterns.
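
To make the Instrumentation component concrete, here is a minimal sketch using the Python prometheus_client library plus JSON-structured log lines; the metric, label, and endpoint names are illustrative assumptions, not GitLab's actual instrumentation schema.

```python
# Minimal instrumentation sketch: one counter, one latency histogram, and a
# structured log line that shares the same identifiers so the two can be correlated.
import json
import logging
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Requests processed", ["endpoint", "status"])
LATENCY = Histogram("app_request_duration_seconds", "Request latency", ["endpoint"])

log = logging.getLogger("app")
logging.basicConfig(level=logging.INFO)


def handle_request(endpoint: str) -> None:
    start = time.monotonic()
    status = "ok"
    try:
        ...  # real request handling goes here
    except Exception:
        status = "error"
        raise
    finally:
        duration = time.monotonic() - start
        REQUESTS.labels(endpoint=endpoint, status=status).inc()
        LATENCY.labels(endpoint=endpoint).observe(duration)
        log.info(json.dumps({"event": "request", "endpoint": endpoint,
                             "status": status, "duration_s": round(duration, 4)}))


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    handle_request("/api/v4/projects")
```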

Goals

  • Enhanced Performance Visibility: Increase team awareness of their work’s performance impact.
  • Proactive Issue Detection: Identify potential problems before they affect customers.
  • Team Empowerment: Provide accessible performance analysis tools that don’t require specialized expertise.

Non-Goals

  • Replace Load Testing: Observability-based testing complements, but does not replace, well-designed load tests like those using GPT (GitLab Performance Tool).

Implementation Approach

Phase 1: Foundation Setting

  1. Education & Documentation
    • Add documentation on what this process is so teams understand it
    • Create training to help teams understand and adopt the practice
    • Create [guides](https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/work_items/3323) to enable teams to build / use performance dashboards
  2. Baseline Assessment & Infrastructure
    • There are a number of existing dashboards; identify the ones that already cover what we need (a sketch of one way to audit them follows below)
    • Identify gaps and develop plans to fill them
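
One way to start the baseline assessment is to pull the dashboard inventory from the Grafana search API and compare titles against the areas we expect to be covered. A minimal sketch, assuming a hypothetical Grafana instance, an API token in GRAFANA_TOKEN, and an illustrative keyword list:

```python
# Sketch: inventory existing Grafana dashboards and flag coverage gaps.
# GRAFANA_URL, GRAFANA_TOKEN, and EXPECTED_AREAS are placeholders for illustration.
import os
import requests

GRAFANA_URL = "https://grafana.example.com"  # hypothetical instance
HEADERS = {"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"}

# Areas we expect dashboards to cover; adjust to the components under review.
EXPECTED_AREAS = ["apdex", "sidekiq", "gitaly", "database", "error rate"]

resp = requests.get(f"{GRAFANA_URL}/api/search",
                    params={"type": "dash-db"}, headers=HEADERS, timeout=30)
resp.raise_for_status()
titles = [d["title"].lower() for d in resp.json()]

for area in EXPECTED_AREAS:
    hits = [t for t in titles if area in t]
    marker = "covered" if hits else "GAP"
    print(f"{area:12s} {marker:8s} ({len(hits)} dashboard(s))")
```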

Phase 2: Initial Implementation

  1. Team Selection
    • Select a single team to pair with
    • Work with them to identify / build dashboards for their use
    • Integrate the dashboards into their workflow
  2. Feedback Loop & Documentation
    • Bring lessons learned back into documentation for other teams

Phase 3: Practice and Culture Building

  1. Expand to Early Adopters
  2. Process Integration
    • Document how to use it in GitLab development workflows

Phase 4: Expansion and Maturity

  1. Full Scale Rollout
  2. Continuous Improvement
  3. Advanced Capabilities
    • Machine Learning Integration: Implement ML algorithms to predict performance issues based on observability data patterns (a rough sketch follows this list).
    • Enhanced Developer Tools: Create IDE plugins that provide real-time performance insights during code writing.
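
As a very rough illustration of what predicting issues from observability data patterns could look like, the sketch below flags samples that drift well outside a rolling baseline (a simple z-score check on a synthetic series). Real ML integration would be considerably more sophisticated.

```python
# Sketch: flag metric samples that sit far outside their recent rolling baseline.
from statistics import mean, stdev


def anomalies(samples, window=60, threshold=3.0):
    """Yield (index, value, z) where a sample is more than `threshold` sigmas from the rolling mean."""
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            continue
        z = (samples[i] - mu) / sigma
        if abs(z) > threshold:
            yield i, samples[i], z


# Synthetic p95 latency series (seconds) with a spike at the end.
latency_series = [0.21, 0.20, 0.22] * 30 + [0.95]
for idx, value, z in anomalies(latency_series, window=30):
    print(f"sample {idx}: {value:.2f}s is {z:.1f} sigma from its baseline")
```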

Case Studies

Case Study 1: Scrum team using Observability Based Performance Testing

Background

The team was building an ETL (Extract, Transform, Load) project with critical performance requirements. A few hours of system delay would create an insurmountable backlog in the ETL process, making performance monitoring essential to business operations. The team was small (9 developers, 2 QA, 1 Ops) and deployed to production every other week, so adding a heavyweight process would have introduced a significant burden. Instead, we implemented the approach using our existing Observability stack.

Observability implementation

We implemented a three-tiered monitoring approach:

  1. User Experience Monitoring
    • Used existing Apdex-based dashboards for both production and test environments (the Apdex calculation is sketched after this list)
    • Tracked end-user performance metrics like page load times and API response times
    • Set clear thresholds for acceptable performance levels
  2. Service-Level Monitoring
    • Created detailed dashboards for key services showing:
      • Resource utilization (CPU, memory, I/O)
      • Service dependencies and their performance
      • Queue lengths and processing rates
    • Established KPI tracking for critical ETL operations
  3. Error Tracking
    • Implemented comprehensive error logging
    • Created error pattern recognition dashboards
    • Set up alerting for error rate thresholds
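
For reference, the Apdex score behind the tier-1 dashboards reduces to a simple ratio. A minimal sketch computed from raw latency samples, assuming an illustrative target of 0.5 s and an acceptability threshold of 0.9 (neither is the team's actual SLO):

```python
# Apdex = (satisfied + tolerating / 2) / total, where "satisfied" means latency
# <= target and "tolerating" means latency <= 4 * target.
def apdex(latencies_s, target_s=0.5):
    if not latencies_s:
        return 1.0
    satisfied = sum(1 for t in latencies_s if t <= target_s)
    tolerating = sum(1 for t in latencies_s if target_s < t <= 4 * target_s)
    return (satisfied + tolerating / 2) / len(latencies_s)


samples = [0.12, 0.30, 0.45, 0.80, 1.4, 2.5]    # seconds, synthetic
score = apdex(samples)
print(f"Apdex: {score:.2f}")                    # 3 satisfied + 2 tolerating -> (3 + 1) / 6 = 0.67
print("OK" if score >= 0.9 else "Investigate")  # 0.9 = example acceptable level
```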

Daily Workflow

  1. Dashboard Review (5 minutes in daily standup)
    • Team reviewed all three dashboard tiers
    • Focused on identifying patterns rather than immediate problem-solving
    • Used a structured approach:
      • Review user experience metrics
      • Check service health indicators
      • Examine error patterns
  2. Issue Management
    • Created issues for concerning patterns
    • Prioritized performance issues alongside feature work
    • Allocated dedicated investigation time in sprint planning

Implementation Challenges

  • The small team size and fast release cycle were constraints on what we could deliver
  • Balancing performance investigation against new feature development was difficult
    • Performance improvements often competed with feature deadlines
    • Sometimes had to make trade-offs between immediate fixes and long-term solutions
  • It took some trial and error to identify the right metrics/dashboard configuration for our team
    • Metrics we initially thought were the right ones to monitor proved not to be
    • The infrastructure evolved incrementally, so we had to evolve the dashboards to keep them in sync

Outcomes

  • Met performance KPIs consistently
  • Maintained 99.999% uptime during a 10x growth period
    • Zero performance-related incidents
    • No emergency load testing needed
    • Proactively scaled based on observability data
  • Lessons Learned:
    • Early warning from observability prevented major incidents
    • Daily review habit kept performance top-of-mind
    • Small, regular improvements are better than big reactive changes

Tools and Technologies

  • Prometheus
  • Grafana
  • [Add other relevant tools used at GitLab]

Future Directions

[TBD]

Resources