Observability Based Performance Testing at GitLab
Description
Observability Based Performance Testing is a proactive approach to understanding system performance through comprehensive instrumentation and real-time data collection. Unlike traditional performance testing, which relies on specific test scenarios, observability testing provides deep visibility into application behavior under real-world conditions.
Approach
Observability testing actively uses our Observability tools to detect trends before they develop into performance issues. Some common approaches:
- Have development teams monitor the dashboards for their components and proactively pick up performance concerns
- Build dashboards/tooling that support exploratory testing on the Observability data, looking for linkages that may not be obvious (e.g. system A causes system B to slow down); a small exploratory sketch follows this list
- Tools like the Performance Bar can enable someone to notice a performance oddity and start the investigation into the root cause
- Extend existing Observability tooling to development/test environments
- This enables teams to get performance metrics earlier in the development process
- This increases team familiarity with the tooling which facilitates easier adoption/use
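As a concrete illustration of the exploratory-tooling idea above, here is a minimal Python sketch that pulls two latency series from a Prometheus HTTP API and checks whether they move together. The `PROM_URL` address and both metric queries are placeholders for illustration, not actual GitLab metrics:

```python
# Hypothetical exploratory check: do latencies of "system A" and "system B"
# move together? PROM_URL and both queries are placeholders.
import time
import requests

PROM_URL = "http://prometheus.example.com"

def fetch_series(query, hours=6, step="60s"):
    """Return the sample values of a Prometheus range query as floats."""
    end = time.time()
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": query, "start": end - hours * 3600, "end": end, "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return [float(value) for _, value in result[0]["values"]] if result else []

def pearson(xs, ys):
    """Plain Pearson correlation, enough for a first exploratory pass."""
    n = min(len(xs), len(ys))
    if n == 0:
        return 0.0
    xs, ys = xs[:n], ys[:n]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

a = fetch_series('avg(rate(system_a_request_duration_seconds_sum[5m]))')
b = fetch_series('avg(rate(system_b_request_duration_seconds_sum[5m]))')
print(f"correlation(system A, system B) = {pearson(a, b):.2f}")
```

A strong correlation is only a starting point for investigation, but surfacing it cheaply is exactly what this kind of exploratory tooling is for.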
Key Components
- Instrumentation: Embedding traces, metrics, and structured logging across the application stack (see the sketch after this list).
- Real-time Data Collection: Gathering performance data continuously during actual usage.
- Holistic Analysis: Examining the entire system to identify bottlenecks and patterns.
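To make the Instrumentation component concrete, here is a minimal sketch using the open-source `prometheus_client` Python library plus JSON-structured log lines. The metric names and the `transform` step are illustrative only, not GitLab's actual instrumentation:

```python
# Minimal instrumentation sketch: a timed, counted operation that also emits
# a structured log line. Metric names here are illustrative only.
import json
import logging
import time

from prometheus_client import Counter, Histogram, start_http_server

BATCH_LATENCY = Histogram("etl_batch_duration_seconds", "Time spent processing one ETL batch")
BATCH_ERRORS = Counter("etl_batch_errors_total", "ETL batches that raised an error")
log = logging.getLogger("etl")

def process_batch(batch_id, rows):
    """Process one batch, recording latency, errors, and a structured log entry."""
    start = time.perf_counter()
    try:
        with BATCH_LATENCY.time():        # observes the duration into the histogram
            transform(rows)               # placeholder for the real ETL work
    except Exception:
        BATCH_ERRORS.inc()
        raise
    finally:
        log.info(json.dumps({
            "event": "batch_processed",
            "batch_id": batch_id,
            "rows": len(rows),
            "duration_s": round(time.perf_counter() - start, 3),
        }))

def transform(rows):
    pass  # stand-in for the actual transform step

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    start_http_server(8000)               # exposes /metrics for Prometheus to scrape
    process_batch("demo", [1, 2, 3])
```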
Goals
- Enhanced Performance Visibility: Increase team awareness of their work’s performance impact.
- Proactive Issue Detection: Identify potential problems before they affect customers.
- Team Empowerment: Provide accessible performance analysis tools that don’t require specialized expertise.
Non-Goals
- Replace Load Testing: Observability-based testing complements, but does not replace, well-designed load tests like those using GPT (the GitLab Performance Tool).
Implementation Approach
Phase 1: Foundation Setting
- Education & Documentation
- Add documentation explaining what this process is so teams understand it
- Create training to help teams understand and adopt it
- Create [guides](https://gitlab.com/gitlab-org/quality/quality-engineering/team-tasks/-/work_items/3323) to enable teams to build and use performance dashboards
- Baseline Assessment & Infrastructure
- A number of dashboards already exist; identify those that already cover what we need
- Identify gaps and develop plans to fill them
Phase 2: Initial Implementation
- Team Selection
- Select a single team to pair with
- Work with them to identify and build dashboards for their use
- Integrate with their workflow
- Feedback Loop & Documentation
- Bring lessons learned back into documentation for other teams
Phase 3: Practice and Culture Building
- Expand to Early Adopters
- Process Integration
- Document how to use the approach in GitLab development workflows
Phase 4: Expansion and Maturity
- Full Scale Rollout
- Continuous Improvement
- Advanced Capabilities
- Machine Learning Integration: Implement ML algorithms to predict performance issues based on observability data patterns (a simple statistical baseline is sketched below).
- Enhanced Developer Tools: Create IDE plugins that provide real-time performance insights during code writing.
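As one possible starting point for the Machine Learning Integration item (a sketch only, not a committed design), a simple rolling z-score baseline over a latency series can already flag deviations worth investigating before any ML model is involved:

```python
# Illustrative only: a rolling z-score check over a latency series, the kind of
# simple statistical baseline that could precede any ML-based prediction.
from collections import deque

def anomalies(samples, window=60, threshold=3.0):
    """Yield (index, value) pairs that deviate more than `threshold` standard
    deviations from the trailing window's mean."""
    recent = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(recent) == recent.maxlen:
            mean = sum(recent) / len(recent)
            var = sum((x - mean) ** 2 for x in recent) / len(recent)
            std = var ** 0.5
            if std and abs(value - mean) / std > threshold:
                yield i, value
        recent.append(value)

# Example: a steady series with one spike gets flagged.
series = [100.0 + (i % 5) for i in range(80)] + [450.0] + [100.0 + (i % 5) for i in range(20)]
print(list(anomalies(series)))   # -> [(80, 450.0)]
```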
Case Studies
Case Study 1: Scrum team using Observability Based Performance Testing
Background
The team was building an ETL (Extract, Transform, Load) project with critical performance requirements: a few hours of system delay would create an insurmountable backlog in the ETL process, making performance monitoring essential to business operations. The team was small (9 developers, 2 QA, 1 Ops) and deployed to production every other week, so adding a heavyweight process would have introduced a significant burden. Instead, we implemented the approach using our existing Observability stack.
Observability implementation
We implemented a three-tiered monitoring approach:
- User Experience Monitoring
- Used existing Apdex-based dashboards for both production and test environments
- Tracked end-user performance metrics like page load times and API response times
- Set clear thresholds for acceptable performance levels (a worked Apdex example follows this list)
- Service-Level Monitoring
- Created detailed dashboards for key services showing:
- Resource utilization (CPU, memory, I/O)
- Service dependencies and their performance
- Queue lengths and processing rates
- Established KPI tracking for critical ETL operations
- Error Tracking
- Implemented comprehensive error logging
- Created error pattern recognition dashboards
- Set up alerting for error rate thresholds
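For reference, here is a minimal sketch of the Apdex threshold check used in the user-experience tier, following the standard Apdex formula (satisfied + tolerating/2) / total with a tolerable range of four times the target. The 0.5 s target and 0.9 acceptability threshold are illustrative values, not the team's actual configuration:

```python
# Hedged sketch: compute an Apdex score from response times and compare it
# against an acceptability threshold. Target and threshold are illustrative.
def apdex(durations_s, target_s=0.5):
    """Return the Apdex score for a list of response times in seconds."""
    if not durations_s:
        return 1.0
    satisfied = sum(1 for d in durations_s if d <= target_s)
    tolerating = sum(1 for d in durations_s if target_s < d <= 4 * target_s)
    return (satisfied + tolerating / 2) / len(durations_s)

samples = [0.2, 0.3, 0.8, 1.9, 2.5]           # response times in seconds
score = apdex(samples, target_s=0.5)
print(f"Apdex = {score:.2f}")                  # 2 satisfied, 2 tolerating -> 0.60
if score < 0.9:                                # example threshold for "acceptable"
    print("Apdex below threshold; open an investigation issue")
```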
Daily Workflow
- Dashboard Review (5 minutes in daily standup)
- Team reviewed all three dashboard tiers
- Focused on identifying patterns rather than immediate problem-solving
- Used a structured approach:
- Review user experience metrics
- Check service health indicators
- Examine error patterns
- Issue Management
- Created issues for concerning patterns
- Prioritized performance issues alongside feature work
- Allocated dedicated investigation time in sprint planning
Implementation Challenges
- The small team size and fast release cycle constrained what we could deliver
- Balancing performance investigation against new feature development was difficult
- Performance improvements often competed with feature deadlines
- Sometimes had to make trade-offs between immediate fixes and long-term solutions
- It took some trial and error to identify the right metrics/dashboard configuration for our team
- Metrics we initially thought were the right ones to monitor proved not to be
- The infrastructure evolved incrementally, so we had to evolve the dashboards to keep them in sync
Outcomes
- Met performance KPIs consistently
- Maintained 99.999% uptime during a 10x growth period
- Zero performance-related incidents
- No emergency load testing needed
- Proactively scaled based on observability data
- Lessons Learned:
- Early warning from observability prevented major incidents
- Daily review habit kept performance top-of-mind
- Small, regular improvements are better than big reactive changes
Tools and Technologies
- Prometheus
- Grafana
- [Add other relevant tools used at GitLab]
Future Directions
[TBD]
Resources
- GitLab Handbook pages
- [Link to related blog posts or external resources]
- [Information on available training or workshops]