Data Science Handbook

GitLab Data Science Team Handbook

The Enterprise Data Science Team at GitLab

The mission of the Data Science Team is to facilitate making better decisions faster using predictive analytics.

Handbook First

At GitLab we are Handbook First and promote this concept by ensuring the data science team page remains updated with the most accurate information regarding data science objectives, processes, and projects. We also strive to keep the handbook updated with useful resources and our data science toolset.

Learning About Data Science

Check out this brief overview of what data science is at GitLab:

(Corresponding slides)

AMAs:

Common Data Science Terms

  • Accuracy - share of a Data Science model’s predictions that are correct, out of all predictions made
  • Algorithm - sequence of computer-implementable instructions used to solve a specific problem
  • Classification - process of predicting a category for each observation. For example, determining if a picture is of a cat or a dog
  • Clustering - process of finding natural groupings of observations in a dataset. Often used for segmentation of users or customers
  • Data Science (DS) - interdisciplinary field that uses computer science, statistical techniques and domain expertise to extract insights from data
  • Exploratory Data Analysis (EDA) - analysis of the data that summarises its main characteristics (includes statistics and data visualisation)
  • Feature - single column in a dataset that can be used for analysis, such as country or age. Also referred to as a variable or attribute
  • Feature Engineering - process of selecting, combining and transforming data into features that can be used by machine learning algorithms
  • Imputation - process of replacing missing or incorrect data with statistical “best guesses” of the actual values
  • Machine Learning (ML) - use and development of algorithms that determine patterns in data without being explicitly programmed to do so
  • Model - a complex set of mathematical formulas that generates predictions
  • Propensity modeling - building models to predict specific events by analyzing past behaviors of a target audience.
  • Regression - a statistical method for predicting a numeric outcome. For example, predicting a person’s income, or how likely a customer is to churn
  • Scoring - process of generating predictions for a new dataset
  • Training - process of applying an algorithm to data to create a model
  • Test Dataset - observations deliberately excluded from training the model so they can be used to verify how well the model predicts
  • Weight - numerical value assigned to a feature that determines its strength
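
To make a few of these terms concrete, here is a minimal sketch in plain Python that walks through imputation, a test dataset, training, scoring, and accuracy. The data, the `monthly_logins` feature, and the one-threshold "model" are all invented for illustration; this is not how our production models are built.

```python
import random
from statistics import median

# Hypothetical observations: (monthly_logins, churned). None marks a missing value.
observations = [
    (2, 1), (3, 1), (None, 1), (1, 1), (4, 1),
    (12, 0), (15, 0), (None, 0), (11, 0), (14, 0),
]

# Imputation: replace missing feature values with the median of the observed ones.
observed = [x for x, _ in observations if x is not None]
fill_value = median(observed)
clean = [(x if x is not None else fill_value, y) for x, y in observations]

# Test dataset: hold out 30% of observations so they are never seen in training.
rng = random.Random(42)
shuffled = clean[:]
rng.shuffle(shuffled)
split = int(len(shuffled) * 0.7)
train, test = shuffled[:split], shuffled[split:]


def train_threshold_model(rows):
    """Training: pick the login threshold that best separates churners in `rows`."""
    def n_correct(threshold):
        return sum((x < threshold) == bool(y) for x, y in rows)

    candidates = sorted({x for x, _ in rows})
    return max(candidates, key=n_correct)


def score(threshold, rows):
    """Scoring: predict churn (1) when logins fall below the learned threshold."""
    return [int(x < threshold) for x, _ in rows]


threshold = train_threshold_model(train)
predictions = score(threshold, test)

# Accuracy: share of held-out observations whose prediction matches the label.
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
print(f"threshold={threshold}, accuracy={accuracy:.2f}")
```

Because the test observations are excluded before training, the accuracy printed at the end reflects how the model behaves on data it has not seen.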

Data Science Responsibilities

Of the Data Team’s Responsibilities, the Data Science Team is directly responsible for:

  • Delivering descriptive, predictive, and prescriptive solutions that promote and improve GitLab’s KPIs
  • Being a Center of Excellence for predictive analytics and supporting other teams in their data science endeavors
  • Developing tooling, processes, and best practices for data science and machine learning

Additionally, the Data Science Team supports the following responsibilities:

  • With Data Leadership:
    • Scoping and executing a data science strategy that directly impacts business KPIs
    • Broadcasting regular updates about deliverables, ongoing initiatives, and roadmap
  • With the Data Platform Team:
    • Defining and championing data quality best practices and programs for GitLab data systems
    • Deploying data science models, ensuring data quality and integrity, shaping datasets to be compatible with machine learning, and bringing new datasets online
    • Creating data science pipelines that work natively with the GitLab platform and the Data Team tech stack
  • With the Data Analytics Team:
    • Incorporating data science into analytics initiatives
    • Designing dashboards to enhance the value and impact of the data science models

How We Work

As a Center of Excellence, the data science team is focused on working collaboratively with other teams in the organization. This means our stakeholders and executive sponsors are usually in other parts of the business (e.g. Sales, Marketing). Working closely with these other teams, we craft a project plan that aligns to their business needs, objectives, and priorities. This usually involves working closely with functional analysts within those teams to understand the data, the insights from prior analyses, and implementation hurdles.

The Data Science flywheel is focused on improving business efficiency and KPIs by creating accurate and reliable predictions. This is done in collaboration with the Functional Analytics Center of Excellence to ensure the most relevant data sources are utilized, business objectives are met, and results can be quantifiably measured. As business needs change, and as the user base grows, this flywheel approach will allow the data science team to quickly adapt, iterate, and improve machine learning models.

graph BT;
   id1(Faster, More Accurate Predictions)-->id2(Increased Business understanding) & id5(Continuous Feedback)
   id2-->id3(More Revenue & Users)
   id5-->id1
   id3-->id4(More Data)
   id4-->id1

Data Science Initiatives

Examples of current Data Science initiatives include:

  • Revenue Expansion
  • Churn Reduction
  • Improved Forecasting
  • Customer Health
  • MLOps with GitLab

Please refer to the Data Science Initiatives Internal Handbook for up-to-date information on all our on-going and planned projects.

Project Structure

The Data Science Team follows the Cross-Industry Standard Process for Data Mining (CRISP-DM), which consists of 6 iterative phases:

  1. Business Understanding

    • Includes requirements gathering, stakeholder interviews, project definition, product user stories, and potential use cases in order to establish success criteria of the project.
  2. Data Understanding

    • Requires determining the breadth and scope of existing relevant data sources. Data scientists work closely with data engineers and data analysts to determine where gaps may exist and to identify any data discrepancies or risks.
  3. Data Preparation

    • Requires conducting data quality checks and exploratory data analysis (EDA) to develop a greater understanding of data and how different datapoints relate to solving the business need.
  4. Modeling

    • Machine learning techniques are used to find a solution that addresses the business need. This often takes the form of predicting why/when/how future instances of a business outcome will occur.
  5. Evaluation

    • Performance is generally measured by how accurate, powerful, and explainable the model is. Findings are presented to the stakeholders for feedback.
  6. Deployment

    • Once the model has been approved, it is deployed into the data science production pipeline. This pipeline automatically updates the model, generates predictions, and monitors the model on a regular cadence.

The GitLab approach

The Data Science Team approach to model development is centered around GitLab’s value of iteration and the CRISP-DM standard. Our process expands on some of the 6 phases outlined in CRISP-DM in order to best address the needs of our specific business objectives and data infrastructure.

Data Science Platform

Our current platform consists of:

  • the Enterprise Data Warehouse for storing raw and normalized source data as well as final model output for consumption by downstream consumers
  • JupyterLab for model training, tuning, and selection
  • GitLab for collaboration, project versioning, source code management, experiment tracking, and CI/CD
  • GitLab CI for automation and orchestration
  • Monte Carlo for drift detection
  • Tableau Server for model monitoring and on-going performance evaluation
  • Feast as an open-source Feature Store for Machine Learning models

Current State Data Flows

graph
    A[Enterprise Data Warehouse: Raw and Normalized Data Sources]
    B[JupyterLab & GitLab CI/CD: Model Training, Tuning, and Selection]
    C(GitLab CI/CD & Pipeline Schedules: Batch scoring with Papermill)
    F[Enterprise Data Warehouse: Model Output for Consumption]
    D[Salesforce/Marketo: CRM Use Cases]
    E[Tableau/Monte Carlo: Model Monitoring and Reporting]
    G[GitLab: Source Code Management]
    H[Experiment tracking]
    A --> |ODBC| B
    B --> H
    H --> B
    B --> G
    G --> B
    G --> C
    C --> |JSON| F
    F --> |CSV| D
    F --> |ODBC| E
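
As a rough illustration of the batch scoring step in the flow above, the sketch below scores a handful of rows and serializes the results as JSON, the format the diagram shows between scoring and the warehouse. It does not use Papermill or any of our actual pipeline code; the field names (`account_id`, `score`, `model_version`, `scored_at`) and the `toy_predict` function are assumptions made up for this example.

```python
import json
from datetime import datetime, timezone


def score_batch(rows, predict, model_version):
    """Score each row and return JSON-serializable records for loading downstream."""
    scored_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            "account_id": row["account_id"],
            "score": round(predict(row), 4),
            "model_version": model_version,
            "scored_at": scored_at,
        }
        for row in rows
    ]


def toy_predict(row):
    """Stand-in for a trained model: fewer logins means a higher churn propensity."""
    return 1.0 / (1.0 + row["monthly_logins"])


# Hypothetical input rows, as they might be read from the warehouse.
rows = [
    {"account_id": "A-1", "monthly_logins": 1},
    {"account_id": "A-2", "monthly_logins": 9},
]

payload = json.dumps(score_batch(rows, toy_predict, model_version="v0"), indent=2)
print(payload)
```

Stamping every record with a model version and scoring time is one way to keep downstream monitoring able to tell which model produced which prediction.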

Feast: Feature Store Implementation

We use Feast as an open-source Feature Store for our machine learning models. Configuration can be found in the Feast project repository. The feature store is updated using GitLab CI/CD, and the web UI is published on a VM on GCP.

You can use the following pages to find more details on:

  1. How to use Feast to fetch features to train and deploy Machine Learning models.
  2. Feast - Feature Store Implementation Internal handbook section.
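
For orientation, fetching features with the Feast Python SDK generally looks like the sketch below. It relies only on the public `FeatureStore`, `get_historical_features`, and `get_online_features` calls; the feature view name `account_stats`, the feature names, and the `repo_path` default are hypothetical and do not describe our actual feature repository. The pages listed above remain the source of truth.

```python
def feature_refs(view, names):
    """Build Feast feature references of the form 'feature_view:feature_name'."""
    return [f"{view}:{name}" for name in names]


def fetch_training_features(entity_df, refs, repo_path="."):
    """Return a DataFrame of point-in-time-correct feature values for training.

    `entity_df` is a pandas DataFrame holding the entity key column(s) and an
    `event_timestamp` column, as Feast expects for historical retrieval.
    """
    # Imported lazily so this module still loads where Feast is not installed.
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    return store.get_historical_features(entity_df=entity_df, features=refs).to_df()


def fetch_online_features(entity_rows, refs, repo_path="."):
    """Return the latest feature values for the given entities, for scoring."""
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    return store.get_online_features(features=refs, entity_rows=entity_rows).to_dict()


# Hypothetical feature view and feature names, for illustration only.
refs = feature_refs("account_stats", ["seat_count", "monthly_logins"])
print(refs)
```

Using the same feature references for both training and scoring helps keep the two consistent.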

CI/CD Pipelines for Data Science

We deploy all of our models using the native GitLab CI/CD capabilities. Please see Getting Started With CI/CD for Data Science Pipelines for the most up-to-date information and instructions on how to deploy models with CI/CD.

Data Science Tools at GitLab

  • Pre-configured JupyterLab Image: The data science team uses JupyterLab pre-configured with common Python modules (pandas, numpy, etc.), native Snowflake connectivity, and git support. Working from a common framework allows us to create models and derive insights faster. This setup is freely available for anyone to use. Check out our Jupyter Guide for additional information.
  • GitLab Data Science Tools for Python: Functions to help automate common data prep (dummy coding, outlier detection, variable reduction, etc.) and modeling tasks (e.g. evaluating model performance). Install directly via PyPI (pip install gitlabds), or use as part of the above JupyterLab image.
  • Modeling Templates: The data science team has created modeling templates to allow you to easily start building predictive models without writing Python code from scratch. To enable these templates, follow the instructions on the Jupyter Guide.
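
To show what two of the data prep tasks named above involve, here is a standard-library sketch of dummy coding and outlier detection. It is deliberately not the gitlabds API: the function names `dummy_code` and `iqr_outliers`, the interquartile-range rule, and the sample data are assumptions chosen for illustration, so refer to the gitlabds documentation for the real functions.

```python
from statistics import quantiles


def dummy_code(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    coded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        coded.append(new_row)
    return coded


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


# Hypothetical rows with one categorical feature and one numeric feature.
rows = [
    {"plan": "free", "seats": 3},
    {"plan": "premium", "seats": 25},
    {"plan": "free", "seats": 5},
]

print(dummy_code(rows, "plan"))
print(iqr_outliers([3, 4, 5, 5, 6, 7, 250]))
```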

Data Science Project Development Approach
GitLab Data Science Team Approach to Model Development
Last modified November 14, 2024: Fix broken external links (ac0e3d5e)