Advanced Global Search Rollout on GitLab.com
Steps and Enhancements
-
2019-11-05: Search security rapid action started.
- Advanced Global Search had been enabled and disabled 3 times in the prior 2 months because of 3 security issues, each discovered through HackerOne only hours after the feature went online. See 12.3.3, 12.3.5, and 12.4.1/12.3.6.
- Team decided to take a systematic approach to discover and fix issues.
- Rapid action weekly agenda.
- All engineering attention (2 engineers) swarmed on the security rapid action.
-
2019-12-09: Security rapid action completed.
- All known security issues were fixed.
- A comprehensive set of test matrices was executed by the AppSec team.
-
2019-12-16:
- Advanced Global Search was re-enabled for the gitlab-org group on GitLab.com.
- Only GitLab’s own gitlab-org group was enabled, restoring the state before the security rapid action.
- The CFO requested a margin analysis, and a cost estimation issue was created.
-
2020-01-15: Cost analysis completed.
- Due to holidays, it took about one month to complete initial estimation and margin analysis.
-
2020-01-17: Financial approval completed.
-
2020-01-24: First two customers were scheduled to get Advanced Global Search enabled.
- An Elasticsearch cluster node crashed after running out of memory during initial indexing.
- The enabling process was stopped and the Elasticsearch cluster was taken offline.
- The Advanced Global Search service was completely offline.
-
2020-01-27: A retrospective of the incident was held.
- Continued troubleshooting with the Infrastructure team.
- Started engagement with an Elastic support engineer.
-
2020-01-30: Discovered the root cause of the problem with help from Elastic support.
- The out-of-memory condition was caused by the combination of fairly large bulk requests queued for ingestion and a small heap.
- Started work on scaling indexing jobs using the Elasticsearch Bulk API and Redis sorted sets.
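One way to keep individual bulk requests from growing too large is to cap each request body by size. The sketch below is illustrative only and is not the fix that was actually shipped: the function names, the index name, and the byte limit are hypothetical. It only relies on the documented Bulk API body format, where each index action is one NDJSON action line followed by one source line.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def bulk_entry(index: str, doc_id: str, doc: Dict) -> str:
    """Return the two NDJSON lines of one Bulk API index action."""
    action = json.dumps({"index": {"_index": index, "_id": doc_id}})
    source = json.dumps(doc)
    return f"{action}\n{source}\n"


def chunk_bulk_bodies(
    index: str, docs: Iterable[Tuple[str, Dict]], max_bytes: int
) -> Iterator[str]:
    """Yield bulk request bodies, each at most max_bytes when possible.

    A single entry larger than max_bytes is still emitted, alone.
    """
    batch: List[str] = []
    size = 0
    for doc_id, doc in docs:
        entry = bulk_entry(index, doc_id, doc)
        entry_size = len(entry.encode("utf-8"))
        # Flush the current batch before it would exceed the cap.
        if batch and size + entry_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(entry)
        size += entry_size
    if batch:
        yield "".join(batch)


# Hypothetical documents and limit, for illustration only.
docs = [(str(i), {"title": "x" * 50}) for i in range(10)]
bodies = list(chunk_bulk_bodies("example-index", docs, max_bytes=300))
```

Each yielded string could be sent as the body of one bulk request, so the amount of data queued per request stays bounded regardless of how many documents are pending.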
-
2020-02-04:
-
2020-02-09 ~ 2020-03-05:
- Learned iteratively how to enable new customers and monitor the production environment by indexing 7 new groups one by one, while developing tools and a playbook for batch indexing.
-
2020-02-29: Merged "Add a bulk processor for elasticsearch incremental updates".
- It processes incremental database updates in batches, which is more efficient and can lower the load on Elasticsearch when it’s busy.
- It uses a Redis sorted set, which deduplicates indexing jobs and further lowers the load on the Elasticsearch cluster.
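The deduplication property comes from sorted-set semantics: adding a member that already exists only updates its score, so a record that changes many times before the next batch runs is still queued once. The sketch below imitates that behavior with an in-memory stand-in rather than a real Redis connection. The class name, method names, and reference strings are hypothetical and do not reflect GitLab's actual implementation.

```python
import time
from typing import Dict, List, Optional


class DedupQueue:
    """In-memory stand-in for a Redis sorted set (member -> score)."""

    def __init__(self) -> None:
        self._scores: Dict[str, float] = {}

    def enqueue(self, ref: str, score: Optional[float] = None) -> None:
        # Like ZADD: re-adding an existing member only updates its score,
        # so repeated updates to one record never create duplicate jobs.
        self._scores[ref] = time.time() if score is None else score

    def pop_batch(self, limit: int) -> List[str]:
        # Like ZRANGE 0 limit-1 followed by ZREM on the returned members.
        # Ordering is by score, then by member, as in a Redis sorted set.
        ordered = sorted(self._scores, key=lambda r: (self._scores[r], r))
        batch = ordered[:limit]
        for ref in batch:
            del self._scores[ref]
        return batch

    def __len__(self) -> int:
        return len(self._scores)


queue = DedupQueue()
queue.enqueue("Issue 1", score=1.0)
queue.enqueue("Issue 2", score=2.0)
queue.enqueue("Issue 1", score=3.0)  # second update to the same record
pending = len(queue)                 # two distinct records are queued
batch = queue.pop_batch(limit=10)
```

A periodic worker could call `pop_batch` and send the whole batch to Elasticsearch in one bulk request, which is where both the batching and the deduplication reduce cluster load.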
-
2020-03-06:
- Merged "Use less expensive index_options", which reduced index size by 36.6%.
- Merged a Bulk API related Elasticsearch version compatibility fix.
-
2020-03-11: The first attempt at adding new groups in a batch succeeded. Another 30 groups were enabled, bringing the total to 39 enabled groups, including GitLab’s own groups.
-
2020-03-12: The routing feature was turned on for GitLab.com, which resulted in a 5x-6x latency improvement.
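As background on why routing can cut latency: with custom routing in Elasticsearch, documents that share a routing value are stored on the same shard, so a search that supplies that value only has to query that shard instead of fanning out to all of them. The toy model below illustrates the idea. The shard count, the routing key format, and the CRC32 hash are assumptions made for illustration; Elasticsearch uses its own internal hash function, and the routing key GitLab uses is not stated here.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Optional

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(routing_key: str, num_shards: int = NUM_SHARDS) -> int:
    # Stand-in hash for illustration; not Elasticsearch's real hash.
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards


def index_docs(docs: List[Dict]) -> Dict[int, List[Dict]]:
    """Place each document on the shard chosen by its routing key."""
    shards: Dict[int, List[Dict]] = defaultdict(list)
    for doc in docs:
        shards[shard_for(doc["routing"])].append(doc)
    return shards


def shards_to_query(routing_key: Optional[str] = None) -> List[int]:
    """Without routing, every shard is searched; with it, only one."""
    if routing_key is None:
        return list(range(NUM_SHARDS))
    return [shard_for(routing_key)]


docs = [{"routing": "project_42", "body": f"doc {i}"} for i in range(5)]
placed = index_docs(docs)
unrouted = shards_to_query()
routed = shards_to_query("project_42")
```

In this model an unrouted search touches all 8 shards while a routed one touches a single shard, which is the kind of fan-out reduction that routing provides.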
-
2020-03-26: Merged "Use prefix search instead of ngrams for sha fields", which reduced index size by another 12.3%, for a total index size reduction of 44.4%.
-
2020-04-07: Optimized re-indexing process to achieve zero downtime.
- An Elasticsearch index alias was adopted on GitLab.com to make re-indexing more flexible, efficient, and robust.
- Re-indexing no longer requires downtime on GitLab.com.
- Work related to zero-downtime re-indexing started.
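The alias approach works because clients search through the alias name while the alias is repointed from the old index to the freshly built one in a single atomic request to the Elasticsearch `_aliases` endpoint. The sketch below only builds that request body. The alias and index names are hypothetical, and the actual procedure used on GitLab.com may differ.

```python
import json
from typing import Dict


def alias_swap_actions(alias: str, old_index: str, new_index: str) -> Dict:
    """Build the body for POST /_aliases that repoints an alias.

    Both actions are applied atomically, so searches against the alias
    never see a moment where it points at no index.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names, for illustration only.
payload = alias_swap_actions(
    alias="search-current",
    old_index="search-v1",
    new_index="search-v2",
)
body = json.dumps(payload)
```

The new index can be populated in the background while the old one keeps serving searches, and the swap itself is a metadata change, which is why no downtime is needed.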
-
2020-04-08: More groups were enabled, bringing the share of enabled groups to about 3%.
-
2020-04-09:
- More groups were enabled, bringing the share of enabled groups to about 6%.
- Started investigating how to speed up initial indexing of groups.
- Increasing the number of Sidekiq workers was discussed.
- The impact on other parts of GitLab.com was also evaluated.
-
2020-04-27: More groups were enabled, bringing the share of enabled groups to about 9%.
-
2020-04-28: A blog post was published sharing the lessons learned from rolling out Advanced Global Search on GitLab.com.
-
2020-04-30: More groups were enabled, bringing the share of enabled groups to about 12%.
-
2020-05-08: Started working on moving repository indexing jobs to Redis sorted sets, which would help:
- Boost indexing performance.
- Isolate incremental index update and initial indexing to separate job queues, accelerating both indexing job types.
-
2020-05-20: More groups were enabled, bringing the share of enabled groups to about 16%.
-
2020-05-24: More groups were enabled, bringing the share of enabled groups to about 17%.
-
2020-05-27: Merged "remove partial word matching from code search", which would help reduce the storage size significantly.
-
2020-06-04: Another re-indexing was done on GitLab.com to apply the latest storage optimization change. It saved another 75% of storage on top of the previous optimizations, for a total index size reduction of 86.1%.
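The reported totals are consistent if each percentage is read as a reduction relative to the index size just before that change, so the reductions compound rather than add. That reading is an assumption, but it reproduces both the 44.4% and the 86.1% figures, as this small check shows (the function name is hypothetical):

```python
from typing import Iterable


def total_reduction(step_reductions: Iterable[float]) -> float:
    """Combine successive relative size reductions into one overall figure."""
    remaining = 1.0
    for reduction in step_reductions:
        # Each step shrinks whatever was left after the previous step.
        remaining *= 1.0 - reduction
    return 1.0 - remaining


# 36.6% (index_options), then 12.3% (prefix search for sha fields).
after_two_steps = round(total_reduction([0.366, 0.123]) * 100, 1)

# Adding the 75% saving from removing partial word matching in code search.
after_three_steps = round(total_reduction([0.366, 0.123, 0.75]) * 100, 1)
```

Under this reading, the index after all three changes is roughly 14% of its original size.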
-
2020-06-09: More groups were enabled, bringing the share of enabled groups to about 20%.
-
2020-06-10: More groups were enabled, bringing the share of enabled groups to about 21%.
-
2020-06-14: More groups were enabled, bringing the share of enabled groups to about 22%. More Sidekiq workers were added, and the impact on GitLab.com overall was closely monitored during this rollout. The conclusion was that it’s safe to double the number of Sidekiq workers for Elasticsearch jobs.
-
2020-06-15: More groups were enabled, bringing the share of enabled groups to about 25%.
-
2020-06-15: More groups were enabled, bringing the share of enabled groups to about 26%. The number of Sidekiq workers was doubled again, and monitoring data showed it’s safe to keep the number of Sidekiq workers used in this rollout. The increased processing power would allow a much faster rollout.
-
2020-06-16: More groups were enabled, bringing the share of enabled groups to about 27%.
-
2020-06-16: More groups were enabled, bringing the share of enabled groups to about 39%.
-
2020-06-17: More groups were enabled, bringing the share of enabled groups to about 64%.
-
2020-06-23: More groups were enabled, bringing the share of enabled groups to about 75%. However, a high CPU utilization issue was encountered in the Elasticsearch cluster during this batch enablement. A problematic regex in the code analyzer was believed to be the culprit.
-
2020-06-24: The problematic regex was fixed.
-
2020-06-29: Following the code analyzer regex fix, another Elasticsearch cluster re-indexing was done successfully.
-
2020-07-02: A new feature was added to ensure that newly created paid groups are indexed.
-
2020-07-06: More groups were enabled, bringing the share of enabled groups to about 87%.
-
2020-07-08: More groups were enabled, bringing the share of enabled groups to about 93%.
-
2020-07-09: More groups were enabled, bringing the share of enabled groups to about 98%.
-
2020-07-10: All the paid groups are enabled on GitLab.com!