Jupyter Guide

Guidance on setting up JupyterLab

See related repository

Features

  • Full install of JupyterLab with the most useful extensions pre-installed
  • Common Python DS/ML libraries (pandas, scikit-learn, SciPy, etc.)
  • Natively connected to Snowflake using your dbt credentials. No login required!
  • Git functionality: push and pull to Git repos natively within JupyterLab (requires SSH credentials)
  • Run any Python file or notebook on your computer or in a GitLab repo
  • Linting Python code using black natively within JupyterLab
  • Need a feature you use but don’t see? Let us know on #bt-data-science

Getting Started

JupyterLab is configured to run in a virtual environment on your local machine. If you prefer not to set up a virtual environment, you can instead use the data science Docker image with CUDA support.

You have two options when setting up JupyterLab using the Data Science project. Choose one of the following:

  • Full install (Recommended): Creates a pipenv virtual environment on your local machine, installs Mambaforge and the libraries defined in this Pipfile
  • Minimal install: Creates only a pipenv virtual environment using your existing Python flavor and installs the libraries defined in this Pipfile. Use this install if you already have a Python environment on your local machine that you would like to use instead of Mambaforge. Requires Python 3.10.

Installation Instructions

  1. Prerequisites - before installing, please make sure your system is set up with the following:
    • Python3
    • Pip3 (usually aliased as pip).
    • pipenv. If it is not installed, install it from the command line: pip install pipenv
    • On certain versions of macOS, you may need to install the Xcode Command Line Tools. From the command line: xcode-select --install
  2. Clone the repo to your local machine: git clone git@gitlab.com:gitlab-data/data-science.git
  3. Navigate to the directory: cd data-science
  4. Based on which version you would like to install, run one of the following:
    • For full install: make setup-jupyter-local
    • For minimal install: make setup-jupyter-local-no-mamba
  5. Launch JupyterLab: make jupyter-local
  6. JupyterLab will launch automatically in your default browser.
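
The prerequisites in step 1 can be checked with a short script before running the make targets. This is a minimal sketch and not part of the data-science repo: the helper names missing_prerequisites and python_is are hypothetical, the tool list is an assumption drawn from the steps above, and the version check reads "Requires Python 3.10" literally (it applies to the minimal install only).

```python
import shutil
import sys


def missing_prerequisites(tools=("pip3", "pipenv", "git", "make")):
    """Return the names of command-line tools that are not on PATH.

    The default tool list is an assumption based on the steps above
    (pip3, pipenv, git clone, make); adjust it for your machine.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is(major, minor, version_info=None):
    """Check whether the interpreter is exactly the given major.minor."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) == (major, minor)


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Missing tools:", ", ".join(missing))
    if not python_is(3, 10):
        # Only relevant for the minimal install, which uses your own Python.
        print("Minimal install expects Python 3.10; found", sys.version.split()[0])
```

If the script prints nothing, the tools were found on PATH and the interpreter is a 3.10 release.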

Running from Docker

Although we recommend running JupyterLab from a virtual environment, that is not always possible. For those cases, we have created a Docker image that can be used.

  1. Pull the image registry.gitlab.com/gitlab-data/data-science/datascienceimage:latest into your container manager (we prefer Rancher Desktop)
  2. Use the docker-compose.yml to launch JupyterLab. In your terminal, navigate to the location of the data-science repository on your local machine and type make jupyter-docker
  3. You will need to manually copy and paste the URL shown in the terminal into your web browser to load JupyterLab

Linting the repository

Included in the environment setup are all of the libraries needed to lint Jupyter notebooks in the repository. When you launch JupyterLab and open a notebook you should see a new “Format Notebook” icon in the task bar of your notebook. Clicking that button will lint your entire notebook.

Alternatively, after completing the above setup instructions, run:

make lint

Run from the root of the Data Science project, this finds and corrects any issues according to the Black format.

Connecting to Snowflake

  1. Make sure you have set up a /Users/{your_user_name}/.dbt/profiles.yml file on your local machine, and that it does not include your password. profiles.yml must be placed in the .dbt directory inside your “home” directory; otherwise you will not be able to connect to Snowflake from your local machine. You can use the sample profile as a reference.
  2. Run through the auth_example notebook in the repo to confirm that you have configured everything successfully. You will get a browser redirect to authenticate your Snowflake credentials through Okta.
  3. If you get an error, Snowflake is likely not properly configured on your machine. Please refer to the Snowflake and dbt sections of the Data Onboarding Issue. The most likely cause is that your .dbt/profiles.yml is not set up correctly.
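
Before opening the auth_example notebook, the two conditions from step 1 (the file exists in the home directory and holds no password) can be checked with plain text inspection. This is a minimal sketch under stated assumptions: check_dbt_profile is a hypothetical helper, it does not parse YAML or contact Snowflake, and it only flags lines whose key is literally password.

```python
from pathlib import Path


def check_dbt_profile(profile_path=None):
    """Return a list of human-readable problems found with profiles.yml.

    Defaults to ~/.dbt/profiles.yml, the location described above.
    """
    path = Path(profile_path) if profile_path else Path.home() / ".dbt" / "profiles.yml"
    if not path.is_file():
        return [f"{path} does not exist"]

    problems = []
    for number, line in enumerate(path.read_text().splitlines(), start=1):
        stripped = line.strip()
        # Ignore commented-out lines.
        if stripped.startswith("#"):
            continue
        # Flag any key that looks like a stored password.
        if stripped.lower().startswith("password:"):
            problems.append(f"line {number}: profile should not include a password")
    return problems


if __name__ == "__main__":
    for problem in check_dbt_profile():
        print(problem)
```

An empty result means only that these two checks passed; the notebook in step 2 remains the real test of the connection.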

Mounting a local directory

By default, the local install will use the data-science folder as the root directory for JupyterLab. This is not terribly useful when your code, data, and notebooks are in other locations on your computer. To change this, you will need to create and modify a JupyterLab config file:

  1. Open a terminal and navigate to the data-science repo, e.g. cd repos/data-science. The config file must be created with the pipenv we set up in the above steps: pipenv run jupyter-lab --generate-config. This creates the file /Users/{your_user_name}/.jupyter/jupyter_lab_config.py.
  2. Browse to the file location and open it in a text editor
  3. Search for the following line in the file: #c.ServerApp.root_dir = '' and replace it with c.ServerApp.root_dir = '/the/path/to/other/folder/'. If unsure, set the value to your repo directory (e.g. c.ServerApp.root_dir = '/Users/{your_user_name}/repos'). Make sure you remove the # at the beginning of the line.
  4. Make sure you use forward slashes in your path. Backslashes can be used if the path is placed in double quotes, even if a folder name contains spaces, such as \{your_user_name}\Any Folder\More Folders\
  5. Rerun make jupyter-local from the data-science directory and your root directory should now be changed to what you specified above.
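
For a path copied with backslashes, the forward-slash form asked for in step 4 can be produced with the standard library. A minimal sketch: root_dir_config_line is a hypothetical helper, and the example folders are placeholders rather than paths from the repo.

```python
from pathlib import PureWindowsPath


def root_dir_config_line(folder):
    """Build the c.ServerApp.root_dir line for jupyter_lab_config.py.

    PureWindowsPath accepts both forward and backward slashes, so
    as_posix() returns a forward-slash path in either case.
    """
    posix_path = PureWindowsPath(folder).as_posix()
    return f"c.ServerApp.root_dir = {posix_path!r}"


if __name__ == "__main__":
    # Raw strings keep the backslashes from being read as escape sequences.
    print(root_dir_config_line(r"C:\Users\me\Any Folder\More Folders"))
    print(root_dir_config_line("/Users/me/repos"))
```

The printed line can be pasted into the config file in place of the commented-out default.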

Enabling Jupyter Templates

The data science team has created modeling templates that allow you to easily start building predictive models without writing Python code from scratch. To enable these templates:

  • In the jupyter_lab_config.py that you created as part of Mounting a local directory, add the following lines, replacing /Users/{your_user_name}/repos/ with the path to the data-science/templates repo on your local machine:
c.JupyterLabTemplates.template_dirs = ['/Users/{your_user_name}/repos/data-science/templates']
c.JupyterLabTemplates.include_default = False
  • Launch JupyterLab and you should see a new Template icon. Click the icon and select which template you would like to use.

Setting Up Jupyter Extensions

  • The data-science repo comes with many useful JupyterLab extensions pre-installed, including git, execute time, and system monitor.
  • To get the most out of these (and to avoid having to configure them every time you run the container), create the following file: /Users/{your_user_name}/.jupyter/lab/user-settings/@jupyterlab/notebook-extension/tracker.jupyterlab-settings
  • Within that file, paste the following and save:
{
    "codeCellConfig": {
        "codeFolding": true,
        "lineNumbers": true
    },
    "recordTiming": true
}
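
The settings file can also be created from Python, which takes care of the nested folders. This is a minimal sketch, assuming the default ~/.jupyter location given above; write_tracker_settings is a hypothetical helper, and it overwrites any existing tracker settings file.

```python
import json
from pathlib import Path

# The same settings as shown above, expressed as a Python dict.
TRACKER_SETTINGS = {
    "codeCellConfig": {"codeFolding": True, "lineNumbers": True},
    "recordTiming": True,
}


def write_tracker_settings(jupyter_dir=None):
    """Write the notebook tracker settings and return the file path."""
    base = Path(jupyter_dir) if jupyter_dir else Path.home() / ".jupyter"
    target = (
        base / "lab" / "user-settings" / "@jupyterlab"
        / "notebook-extension" / "tracker.jupyterlab-settings"
    )
    # Create the nested settings folders if they do not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(TRACKER_SETTINGS, indent=4) + "\n")
    return target


if __name__ == "__main__":
    print("Wrote", write_tracker_settings())
```

Pass a different jupyter_dir to try it against a scratch folder first.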

Connecting to GitLab Model Experiments (MLFlow Integration)

  1. In your GitLab profile, create a personal access token with API permissions
  2. Ensure that Model experiments is toggled on under Settings -> General -> Visibility, project features, permissions for your project in GitLab
  3. Locate the project id for your project under Settings -> General
  4. On your local machine, you will need to create two new environment variables: MLFLOW_TRACKING_TOKEN and MLFLOW_TRACKING_URI
    1. Open your shell resource file (.zshrc, for example) in the home directory of your local machine.
    2. Add the following line: export MLFLOW_TRACKING_TOKEN="your-access-token"
    3. Add the following line, substituting your project id: export MLFLOW_TRACKING_URI="https://gitlab.com/api/v4/projects/{your-project-id}/ml/mlflow". Alternatively, you can set this value directly in your notebook.
    4. Save the file
    5. Source the file (e.g. source ~/.zshrc) or exit the terminal and restart it
  5. Launch JupyterLab. You should now be able to initialize the experiment tracker with the mlflow.set_tracking_uri(os.getenv('MLFLOW_TRACKING_URI')) command in JupyterLab
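
A notebook can confirm that both variables are visible before it calls mlflow. This is a minimal sketch using only the standard library: build_tracking_uri and missing_mlflow_variables are hypothetical helpers, and the URI pattern is copied from step 4 above.

```python
import os

# URI pattern from step 4; {project_id} is filled in per project.
GITLAB_MLFLOW_URI = "https://gitlab.com/api/v4/projects/{project_id}/ml/mlflow"


def build_tracking_uri(project_id):
    """Return the tracking URI for a GitLab project id."""
    return GITLAB_MLFLOW_URI.format(project_id=project_id)


def missing_mlflow_variables(environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    required = ("MLFLOW_TRACKING_TOKEN", "MLFLOW_TRACKING_URI")
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_mlflow_variables()
    if missing:
        print("Set these variables first:", ", ".join(missing))
    else:
        # With both variables present, the call from step 5 applies:
        #   import mlflow
        #   mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI"))
        print("Tracking URI:", os.environ["MLFLOW_TRACKING_URI"])
```

If JupyterLab was already running when the shell resource file was edited, restart it from a fresh terminal so the new variables are picked up.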

Note: If you are looking to connect to Model Experiments when using CI, refer to Model Training Step-by-Step Instructions

Updating the Virtual Environment

  1. From the data science repo, pull the latest changes to your local machine: git pull
  2. Re-run one of the following installation commands:
    • For full install: make setup-jupyter-local
    • For minimal install: make setup-jupyter-local-no-mamba
  3. Launch JupyterLab: make jupyter-local

Some interesting libraries included

Data/Model Analysis

Visualisation tools

ML libraries

Last modified August 27, 2024: Update DS Projects (b1f4fab8)