Python/Tools package management and inventory
This page keeps all details about the inventory we use for the Data Platform, mainly the Python libraries and the tools listed below. The motivation is to keep a neat inventory list with details and criteria for who, why, when and what we will upgrade. The page also describes the conditions for upgrading and a command line application that gathers details about the current and latest versions of Python libraries. The inventory list contains:
- 🐍 Python versions in use
- 🛠 Tools we are using (dbt, airflow, permifrost, meltano)
- 📚 Python libraries (packages) in use (e.g. pandas, requests, matplotlib)
We strive to have a DRI for every tool or package. The DRI's responsibility is to keep that tool or package in healthy shape by advising on and initiating upgrades.
General motivation for upgrading
As the upgrade strategy is specific to each tool, it is up to the DRI to decide when the upgrade will happen and under which circumstances.
Motivation for upgrading:
- Reduce vulnerabilities
- Fix bugs and crashes
- Ensure compatibility with other updated technologies
- New functionality introduced which we intend to use
- Ensure proper support
- Easier future upgrades (not falling too far behind on version)
- Marketplace relevance
- End of life
End of life check up
At minimum we abide by an end-of-life policy. This is because anything that becomes deprecated usually no longer receives any bug fixes, even critical security ones, which is a security risk to us.
The main benchmark for updating the Python version is the end-of-life date. If a version is no longer supported, it should be a candidate for upgrading.
Initiate/Schedule a version upgrade
A DRI is an expert in the domain of their tool or package and monitors new versions and releases. If and when an upgrade is applicable, the DRI advises on and initiates the upgrade. This advice contains:
- Motivation for the upgrade
- Impact and dependencies
- Criticality of the upgrade (timeline)
The above will be documented in an issue and discussed during a weekly Data Platform Team meeting.
Planning
Upgrades are scheduled as an OKR (P2) on a quarterly basis, following the Data Planning drumbeat. If needed (e.g. in case of a security vulnerability), we can schedule an upgrade during the quarter as a P1. This will compromise P2 OKR work that was already scheduled.
Python Image/Container Inventory List
There are several images in which we are using 🐍 Python. Various versions are in use (>=3.7), for the following reasons:
- use cases vary across the different projects
- some libraries require a specific 🐍 Python version (due to dependencies)
- multiple teams are using the images and they have requirements for specific versions
Table 1: List of images we are using
| Image | Python version in use | Image version | DRI | User |
|---|---|---|---|---|
| analytics | 3.10 | N/A | TBA | Data Platform |
| airflow-image | 3.8 | python:3.8 | TBA | Data Platform |
| ci-python-image | 3.8 | python:3.8-slim-buster | TBA | Data Platform |
| data-image | 3.10.3 | python:3.10.3 | TBA | Data Platform |
| dbt-image | 3.10.3 | python:3.10.3 | TBA | Data Platform |
| gitlab-data-meltano | 3.8 | meltano/meltano:v2.16.1-python3.8 | TBA | Data Platform |
| mlfow-infra | 3.8 | python:3.8 | TBA | Data Scientists |
| ci-streamlit-image | 3.12 | python:3.12-slim | @rbacovic | Data Platform |
Dependency graph
---
title: Dependency graph for images and Python versions
---
flowchart LR
P38 --> pip
P388 --> pip
P310 --> pip
P312 --> poetry
pip --> airflow-image
pip --> ci-python-image
pip --> data-image
pip --> dbt-image
pip --> mlfow-infra
pip --> gitlab-data-meltano
pip --> permifrost
poetry --> ci-streamlit-image
data-image --Inherit--> gitlab-data-utils
data-image --Inherit--> analytics
ci-python-image --Inherit--> analytics
subgraph Python
P38[Python 3.8]
P388[Python 3.8.8]
P310[Python 3.10.3]
P312[Python 3.12]
end
subgraph Package manager
pip[pip]
poetry[poetry]
end
subgraph Images
airflow-image
ci-python-image
data-image
data-science
dbt-image
gitlab-data-utils
mlfow-infra
gitlab-data-meltano
permifrost
ci-streamlit-image
end
subgraph DEV [Development environment]
analytics
end
Approach to update Python version
This section is a guideline for how and when to upgrade the 🐍 Python version in a particular image. There is no unified condition for when to upgrade the Python version; most of the listed items are recommendations and best practices. The main reason is the wide variety of use cases for the images we are using.
Another vital benchmark for an unscheduled upgrade is security vulnerabilities. Source for checking potential Python vulnerabilities:
If there are any known and confirmed Python vulnerabilities, we should start the process of upgrading the Python version as soon as possible.
Generally, the following text offers a set of recommendations. If there is a particular reason NOT to upgrade the 🐍 Python version, it is good to document the explanation.
To maintain and upgrade 🐍 Python versions and to check when a particular version will become obsolete, check the Python supported timeline, or alternatively endoflife.date. The inventory list for the images we are using is in Table 1.
Python version upgrades will be decided case by case, as the impact on the image can be huge. A good option is to consider the end-of-life date for the particular version.
Example for the end-of-life policy:
Regarding the upgrade policy, we think we should at minimum abide by the end-of-life policy.
This is because anything that becomes deprecated usually no longer receives any bug fixes, even critical security ones, which is a security risk to us.
- Python life cycle: we are currently using the following Python versions, which are either at end of life or close to it:
  - 3.7 already at end of life
  - 3.8 at end of life at the end of 2024
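The end-of-life check described above can be sketched in a few lines of Python. This is only an illustration: the `PYTHON_EOL` table, the `eol_status` function and the 180-day warning window are assumptions made for this example, and the dates are approximate values that should be verified against the Python supported timeline or endoflife.date before use.

```python
from datetime import date

# Approximate end-of-life dates (assumption: verify against the official
# Python supported timeline or endoflife.date before relying on them).
PYTHON_EOL = {
    "3.7": date(2023, 6, 27),
    "3.8": date(2024, 10, 7),
    "3.10": date(2026, 10, 31),
    "3.12": date(2028, 10, 31),
}


def major_minor(version: str) -> str:
    """Reduce a version such as '3.10.3' to its 'major.minor' part."""
    return ".".join(version.split(".")[:2])


def eol_status(version: str, today: date, warn_days: int = 180) -> str:
    """Classify a Python version relative to its end-of-life date."""
    eol = PYTHON_EOL.get(major_minor(version))
    if eol is None:
        return "unknown"
    if today >= eol:
        return "end-of-life"
    if (eol - today).days <= warn_days:
        return "approaching end-of-life"
    return "supported"
```

Running `eol_status` for every row of Table 1 with `date.today()` would give a quick list of images that are candidates for a Python upgrade.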
Tools inventory list
Table 2: List of tools GitLab Data team is using
| Tool name | Version in use | Version supported timeline | How to upgrade | DRI | Users | Upgrade Policy |
|---|---|---|---|---|---|---|
| dbt | 1.9.2 | link | dbt best practices for upgrading, Upgrading dbt version | TBA | Data Platform, Analytics Engineers, Data Scientists | Not more than 2 versions behind (beta release excluded) and minimum support level critical |
| airflow | 2.5.3 | link | Upgrade Plan for Airflow | TBA | Data Platform | Current version released more than 1 year ago |
| permifrost | 0.15.4 | link | Upgrading permifrost version | @rbacovic | Data Platform | Not more than 2 versions behind (beta release excluded) |
| meltano | 2.16.1 | link | Upgrade Meltano version | TBA | Data Platform | Current version released more than 1 year ago |
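The upgrade policies in Table 2 could be evaluated mechanically along the following lines. This is a minimal sketch under stated assumptions: "versions behind" is interpreted as the number of distinct stable major.minor releases newer than the one in use, any version string with a non-numeric suffix (for example `2.3.0b1`) is treated as a pre-release and ignored, and all function names are hypothetical.

```python
from __future__ import annotations

import re
from datetime import date

# Assumption: a stable release consists of dot-separated numbers only.
STABLE = re.compile(r"^\d+(\.\d+)*$")


def parse(version: str) -> tuple[int, ...]:
    """Turn '2.1.0' into (2, 1, 0); only valid for stable versions."""
    return tuple(int(part) for part in version.split("."))


def minor_versions_behind(current: str, releases: list[str]) -> int:
    """Count distinct stable major.minor releases newer than `current`."""
    current_minor = parse(current)[:2]
    newer = {
        parse(release)[:2]
        for release in releases
        if STABLE.match(release) and parse(release)[:2] > current_minor
    }
    return len(newer)


def violates_versions_behind(current: str, releases: list[str], max_behind: int = 2) -> bool:
    """Policy: not more than `max_behind` versions behind (pre-releases excluded)."""
    return minor_versions_behind(current, releases) > max_behind


def released_more_than_a_year_ago(release_date: date, today: date) -> bool:
    """Policy: the version in use was released more than 1 year ago."""
    return (today - release_date).days > 365
```

For a made-up release list `["2.1.0", "2.2.0", "2.3.0b1", "2.3.0", "3.0.0"]`, a tool pinned at `2.1.0` is three stable versions behind and would be flagged, while one pinned at `2.2.0` is two behind and would not.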
dbt packages inventory
| Package name | Version in use | DRI | User |
|---|---|---|---|
| snowflake_spend | 1.1 | N/A | Data Engineers, Analytics Engineers |
| data-tests | N/A | N/A | Data Engineers, Analytics Engineers |
| dbt-labs/audit_helper | 0.9.0 | N/A | Data Engineers, Analytics Engineers |
| dbt-labs/dbt_utils | 1.1.1 | N/A | Data Engineers, Analytics Engineers |
| dbt-labs/snowplow | 0.15.1 | N/A | Data Engineers, Analytics Engineers |
| dbt-labs/dbt_external_tables | 0.8.7 | N/A | Data Engineers, Analytics Engineers |
| brooklyn-data/dbt_artifacts | 2.8.0 | N/A | Data Engineers, Analytics Engineers |
Approach to update tools version
Example for the end-of-life policy:
Regarding the upgrade policy, we think we should at minimum abide by the end-of-life policy.
This is because anything that becomes deprecated usually no longer receives any bug fixes, even critical security ones, which is a security risk to us.
- the airflow life cycle: end of life for 1.10.15 was June 17, 2021, so going forward we should try to upgrade much earlier if following this end-of-life rule
- dbt latest releases: v1.2 and v1.3 are already at their end of life and will be completely deprecated at the end of 2023
- dbt describes best practices for upgrading, one of them being to at least upgrade to new patch versions, which are "bugfix" or "security" releases
Libraries inventory list
The libraries inventory list is a partly automated process: a command line application gathers all the information we need for upgrading libraries. Besides gathering information about libraries, the application generates 2 reports that are useful for the upgrade process:
- Get inventory list for each image - A utility to check whether a library we are using is outdated and how far it is from the latest version.
- Check duplicated versions among images - To check whether a library is in use in more than one image with versions that are not in sync. If library versions are not in sync, there is possibly a specific reason why they are mismatched. Generally, it is handy to keep a single version across images (if possible and not a blocker).
To run this program, check out the /package_inventory repo and run the following code:
- Run program
# run file
python3 gitnventory.py [--dry-run] [--logging [print/logging]] [--report_folder DEFINE_FOLDER] [--log_file DEFINE_LOG_FILE] [--help]
For more details on how the program runs, refer to the source code.
Note: Keep in mind that we rely on PyPI and the GitLab Data group as the primary sources for the latest version of a library.
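To illustrate the idea behind the "check duplicated versions among images" report, here is a minimal sketch. It is not the code of the command line application; the `find_version_mismatches` function and the shape of the `inventory` mapping (image name to a dictionary of pinned packages) are assumptions made for this example, and the pinned versions shown are hypothetical.

```python
from collections import defaultdict


def find_version_mismatches(inventory: dict) -> dict:
    """Return packages that are pinned to different versions in different images.

    `inventory` maps an image name to a {package: version} dictionary.
    The result maps each mismatched package to {version: [images using it]}.
    """
    usage = defaultdict(dict)
    for image, packages in inventory.items():
        for package, version in packages.items():
            usage[package].setdefault(version, []).append(image)
    # Keep only packages that appear with more than one distinct version.
    return {package: versions for package, versions in usage.items() if len(versions) > 1}


# Hypothetical pins, for illustration only.
example_inventory = {
    "data-image": {"pandas": "1.5.3", "requests": "2.31.0"},
    "airflow-image": {"pandas": "1.3.5", "requests": "2.31.0"},
}
mismatches = find_version_mismatches(example_inventory)
```

In this example only `pandas` is reported, because `requests` is pinned to the same version in both images.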
Table 3: DRI for Python libraries
| Tool name | DRI |
|---|---|
| Python libraries | @rbacovic |
Approach to updating libraries
Proposed criteria for creating a list of upgrade candidates:
Table 4: Example criteria for upgrading Python, tools or libraries version
| Criteria | Example | Risk for the implementation (1.0 low, 5.0 high) |
|---|---|---|
| The current version is outdated and not supported anymore (the end-of-life policy) | Python 2.7 | 4.0 ⭐⭐⭐⭐☆ |
| The current version is vulnerable | Article | N/A |
| Major version released | Current version: 2.1.0, latest version: 3.0.0 | 3.0 ⭐⭐⭐☆☆ |
| Minor version released | Current version: 2.1.0, latest version: 2.2.0 | 1.0 ⭐☆☆☆☆ |
| Patch version released | Current version: 2.1.0, latest version: 2.1.8 | 1.0 ⭐☆☆☆☆ |
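The major, minor and patch rows of Table 4 can be turned into a small classifier. The sketch below assumes plain numeric versions of the form major.minor.patch; the function names and the `RISK_BY_UPGRADE_TYPE` mapping are hypothetical, with the risk scores copied from the table.

```python
from __future__ import annotations

# Risk scores taken from Table 4 (1.0 low, 5.0 high).
RISK_BY_UPGRADE_TYPE = {"major": 3.0, "minor": 1.0, "patch": 1.0}


def parse(version: str) -> tuple[int, ...]:
    """Parse '2.1' or '2.1.0' into a three-part tuple such as (2, 1, 0)."""
    parts = [int(part) for part in version.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))[:3]


def upgrade_type(current: str, latest: str) -> str | None:
    """Return 'major', 'minor' or 'patch', or None if there is nothing newer."""
    cur, new = parse(current), parse(latest)
    if new <= cur:
        return None
    if new[0] != cur[0]:
        return "major"
    if new[1] != cur[1]:
        return "minor"
    return "patch"


def upgrade_risk(current: str, latest: str) -> float | None:
    """Look up the Table 4 risk score for the upgrade from current to latest."""
    kind = upgrade_type(current, latest)
    return RISK_BY_UPGRADE_TYPE[kind] if kind else None
```

With the examples from the table, going from 2.1.0 to 3.0.0 is classified as a major upgrade with risk 3.0, while 2.1.0 to 2.2.0 and 2.1.0 to 2.1.8 are minor and patch upgrades with risk 1.0.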
As it can be difficult to determine when and why to upgrade a library, a few considerations can assist us here. In addition to the end-of-life criteria, we should consider an upgrade when:
- We need a new feature from a particular version - if there is a significant feature we need or want to use, we should go for the upgrade
- There is a serious vulnerability in the package - this is always a red flag and we should start upgrading immediately
- We are behind by more than x major/minor versions - probably a good reason to move on with the upgrade, with serious consideration of the impact. For instance, jumping to a newer major version in some cases requires a new version of Python. If the risk is assessed properly, we should move on
Dependency check up
The reason for upgrading can be a dependent tool or package. For instance, upgrading dbt sometimes requires upgrading 🐍 Python as well (for the dbt-image image). In that case Python is upgraded because another tool requires it (dbt in this example).
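A dependency check of this kind can be sketched as a comparison between the Python version of each image and the minimum Python version a tool release requires. The sketch is deliberately simplified: it only handles a single lower bound, compares at the major.minor level, and the function names and the minimum version used in the usage example are hypothetical.

```python
from __future__ import annotations


def major_minor(version: str) -> tuple[int, int]:
    """Reduce '3.10.3' to (3, 10) so versions compare numerically."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def needs_python_upgrade(image_python: str, minimum_python: str) -> bool:
    """True if the image's Python is older than the tool's required minimum."""
    return major_minor(image_python) < major_minor(minimum_python)


def blocked_images(images: dict[str, str], minimum_python: str) -> list[str]:
    """List images whose Python must be upgraded before the tool can be."""
    return sorted(
        name
        for name, python_version in images.items()
        if needs_python_upgrade(python_version, minimum_python)
    )


# Python versions as listed in Table 1; the minimum of 3.9 is a made-up requirement.
images = {"dbt-image": "3.10.3", "airflow-image": "3.8"}
blocked = blocked_images(images, "3.9")
```

Comparing integer tuples rather than strings matters here: as plain strings, "3.10" would sort before "3.9" and the check would give the wrong answer.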
Tips and tricks for upgrading
✅ Do's:
- Do an extraordinary upgrade when there is a known issue that can impact our platform
- When upgrading a 🐍 Python package to a new major version, check compatibility with the Python version active in the image, as there can be a conflict with it
- Check that a 🐍 Python package you want to install/upgrade is safe and does not contain malicious code
- Always check backward compatibility with other packages (pipelines in the project will help you here during the image build).
🛑 Don'ts:
- Do not upgrade to a pre-release version of the software, always use a stable release version
- Do not use non-trusted sources. We prefer PyPI or packages from the GitLab Data group as the source of the installation
Upgrade logging
Table 5: Log activities regarding the upgrade
| Quarter (in which quarter we do upgrade planning) | Issue for check upgrades (link to the issue used for the upgrade planning) | Upgrade execution (in which quarter we do upgrade execution) | Epic for planned upgrade (link to the epic used for the upgrade execution) | Type of upgrading: Python, Tool or Libraries (what type of object you plan to upgrade) | DRI (the individual who will check what is required for the upgrade planning) |
|---|---|---|---|---|---|
| FY24Q4 | #19248 | FY25Q1 | | | @rbacovic |
| FY25Q1 | #20233 | FY25Q2 | | | @rbacovic |
| FY25Q2 | #21082 | FY25Q3 | | | @rbacovic |
| FY25Q3 | | FY25Q4 | | | |
| FY25Q4 | | FY26Q1 | | | |
