Python Guide
Motivation
It is the Data team’s collective responsibility to promote, respect, and improve our Python Style Guide, since not everything can be caught by the tools we use. This deliberate attention makes a difference and does a lot to ensure high-quality code. The main motivation for having comprehensive and useful Python guidelines is to maintain a high standard of code quality in our work. Because we use this guideline to support daily work that is always changing, the guide itself is always subject to iteration. In the long run, this guide will help us achieve world-class code quality we can be proud of. All changes are driven by our values.
Values
Campsite rule: These guidelines are themselves a constant Work in Progress (WIP). If you work with any code style, hint, or guideline which does not currently adhere to the style guide, please submit a merge request with the relevant changes and tag the Data Platform team to update the guide.
Technology Standardisation
Starting in January 2022, all new custom Python extracts should adhere to the Singer standard.
High-level guidelines
This section describes, at a high level, how to follow best practices when writing Python code. Following these recommendations will ensure we fully understand the advantages of high code quality and can leverage our code base.
Zen of Python
It is difficult to write a Python guide without mentioning the Zen of Python, a cornerstone for ensuring code quality at a high level.
It is a helpful mental exercise when you want to write outstanding Python without overlooking basic ideas.
╰─$ python3
Python 3.8.6 (v3.8.6:db455296be, Sep 23 2020, 13:31:39)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
PEP 8
PEP stands for Python Enhancement Proposal, and there are several of them. A PEP is a document that describes new features proposed for Python and documents aspects of Python, like design and style, for the community.
As per the definition:
PEP 8, sometimes spelled PEP8 or PEP-8, is a document that provides guidelines and best practices on how to write Python code. It was written in 2001 by Guido van Rossum, Barry Warsaw, and Nick Coghlan. The primary focus of PEP 8 is to improve the readability and consistency of Python code.
Why Do We Need PEP 8?
Readability counts.
— The Zen of Python
Code is much more often read than it is written.
— Guido van Rossum
Among many other things, it is vital to underscore the need for clean, clear, and consistent code in order to maintain and scale the codebase.
GitLab Zen of Python
As a supplement to the Zen of Python, we want to underscore a couple more viewpoints that keep our code base in good shape. This is half a joke, half a truth, but it provides a good general overview of the high standards we want to uphold.
Here is our GitLab Zen of Python proposal:
- Gratitude and respect for PEP 8
- Insist on writing well-structured code
- Trust the Pythonic way of thinking and coding, and introduce a good habit of using it on a daily basis
- Leverage and promote proper comments and documentation
- Always have the Zen of Python on the top of your mind
- Boost usage of a modular code style over a script-like approach
Probably a couple more of them count:
- Advocate for proper naming of variables, classes, functions and modules
- Favor a modular code style over script-like approach
- Prefer using a virtual environment over the existing interpreter
Specific guidelines
With a good high-level framework for the proper architectural approach to designing and structuring Python code, it is now time to dive into the details that make the difference between good and outstanding code.
Project setup - Poetry
For project setup, we are using Poetry.
Poetry is a Python dependency management tool that manages the dependencies, packages, and libraries in your Python project. It helps keep your project simple by resolving dependency complexities and handling installation and updates for you.
For the set of commands we use to initiate the project and install Python (in our case 3.10.3), refer to the file onboarding_script.zsh.
One good example of how we define the pyproject.toml file for the poetry setup is shown 👇 below:
[tool.poetry]
name = "app"
version = "0.1.0"
description = "Automated Service Ping Metrics Check"
authors = ["{}"]
readme = "README.md"
homepage = "https://gitlab.com/gitlab-data/service-ping-metrics-check/README.md"
repository = "https://gitlab.com/gitlab-data/service-ping-metrics-check"
[tool.poetry.dependencies]
python = "^3.10"
fastapi = "^0.85.0"
uvicorn = "^0.18.3"
black = "^22.8.0"
pylint = "^2.15.3"
flake8 = "^5.0.4"
mypy = "^0.971"
vulture = "^2.6"
pytest = "^7.1.3"
python-decouple = "^3.6"
flatten-dict = "^0.4.2"
sqlparse = "^0.4.3"
snowflake-connector-python = "^2.7.12"
[tool.poetry.dev-dependencies]
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
[tool.poetry.scripts]
start = "app.sharehouse.main:start"
The full file is exposed in the file service-ping-metrics-check/pyproject.toml as a part of Service ping metrics check project.
Idioms
A programming idiom, put simply, is nothing else but a way to write code. Idiomatic Python code is often referred to as being Pythonic. Although there is usually one — and preferably only one — obvious way to do it, the way to write idiomatic Python code can be non-obvious to Python beginners.
So, good idioms must be consciously acquired.
Explicit code
While (almost) any kind of magic is possible with Python, the most explicit and straightforward manner is preferred. Keep it simple and smart.
## Bad
def foo(*args):
    x, y = args
    return dict(**locals())

## Good
def bar(x, y):
    return {'x': x, 'y': y}
Function arguments
Arguments can be passed to routines in four different ways:
- Positional arguments - for instance, foo(message, recipient)
- Keyword arguments - for instance, foo(message, to, cc=None, bcc=None). Here cc and bcc are optional, and evaluate to None when they are not passed another value.
- Arbitrary argument list (*args)
- Arbitrary keyword argument dictionary (**kwargs)
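Here is a minimal sketch of all four styles (the function names are illustrative):
## Positional and keyword arguments
def send(message, recipient, cc=None, bcc=None):
    print(message, recipient, cc, bcc)

send("Hello", "team")             # positional
send("Hello", "team", cc="lead")  # keyword

## Arbitrary argument list and arbitrary keyword argument dictionary
def log_call(*args, **kwargs):
    print(f"called with {args} and {kwargs}")

log_call(1, 2, debug=True)
## called with (1, 2) and {'debug': True}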
It is up to the engineer writing the function to determine which arguments are positional arguments and which are optional keyword arguments, and to decide whether to use the advanced techniques of arbitrary argument passing. If the advice above is followed wisely, it is possible and enjoyable to write Python
functions that are:
- easy to read (the name and arguments need no explanations)
- easy to change (adding a new keyword argument does not break other parts of the code)
Returning values
When a function grows in complexity, it is not uncommon to use multiple return statements inside the function’s body. However, in order to keep a clear intent and sustainable readability, it is preferable to avoid returning meaningful values from many output points in the body. When a function has multiple main exit points for its normal course, it becomes difficult to debug the returned result, so it may be preferable to keep a single exit point.
## Bad! Probably too complex and difficult to read
def foo(a, b, c):
    if not a:
        return None  # Raising an exception might be better
    if not b:
        return None  # Raising an exception might be better
    # Some complex code trying to compute x from a, b and c
    # Resist the temptation to return x here if it succeeded
    if not x:
        # Some Plan-B computation of x
        ...
    return x

## Good
def bar(a, b, c):
    res = None
    if a and b:
        # Some complex code trying to compute res from a, b and c
        # Resist the temptation to return res here if it succeeded
        if not res:
            # Some Plan-B computation of res
            res = 42
    return res  # One single exit point for the returned value will help
                # when maintaining the code.
Unpacking
If you know the length of a list or tuple, you can assign names to its elements with unpacking. For example, since enumerate()
will provide a tuple of two elements for each item in the list:
## Bad! - can be difficult to read and maintain
for index in range(len(foo_list)):
    ...  # do something with foo_list[index]

## Good! - the optimized way to use a for loop
for index, item in enumerate(foo_list):
    ...  # do something with index and item
You can use this to swap variables as well:
## Good
a, b = b, a
Nested unpacking works fine as well:
## Good
a, (b, c) = 1, (2, 3)
A new method of extended unpacking was introduced by PEP 3132:
## Good
a, *rest = [1, 2, 3]
## a = 1, rest = [2, 3]
a, *middle, c = [1, 2, 3, 4]
## a = 1, middle = [2, 3], c = 4
Ignored variable _
can be part of unpacking as well:
## Bad
a, _, c = [1, 2, 3, 4]  # This will raise a ValueError
## Good! This will work (* goes before _)
a, *_, c = [1, 2, 3, 4]
## Good
_, *rest = [1, 2, 3]
## rest = [2, 3]
Note: it is bad practice to unpack more than 3 values. Yes, you are allowed to do that, but it will rapidly decrease code readability.
## Bad! (if you have more than 3 values)
a, b, c, d = 1, 2, 3, 4
## Good
a, b, c = 1, 2, 3
d = 4
## Better
a = 1
b = 2
c = 3
d = 4
Conventions
This section exposes effective techniques for dealing with conventions and integrating them into your toolbox.
Check if a variable equals a constant
There is no need to explicitly compare a value to True, None, or 0; you can just use it in the if statement. See Truth Value Testing for a list of what is considered false.
## Bad
if attr == True:
    print('True!')

if attr == None:
    print('attr is None!')

## Good
## Just check the value
if attr:
    print('attr is truthy!')

## or check for the opposite
if not attr:
    print('attr is falsey!')

## or, since None is considered false, explicitly check for it
if attr is None:
    print('attr is None!')

## the same goes for dicts, lists, and sets
check_list = []
if check_list:
    print('This is not an empty list.')
else:
    print('The list is empty.')
String concatenation
There are many ways to do string concatenation. Here is just a short exercise on how to do that in an efficient way.
string1 = 'Python'
string2 = 'Guideline'
## Bad
print(string1 + " " + string2)
## Python Guideline
## Good
print('{} {}'.format(string1, string2))
## Python Guideline
## Better
print(f"{string1} {string2}")
## Python Guideline
As you noticed here, the result is the same, but the details make a difference. In the example above we dealt with strings, but what happens when we introduce more data types into the same code?
string1 = 'Python'
int1 = 2 # now, this is int
## Bad
## print(string1 + " " + int1)
## TypeError: can only concatenate str (not "int") to str
## Good
print('{} {}'.format(string1, int1))
## Python 2
## Better
print(f"{string1} {int1}")
## Python 2
See why it is better to use placeholders instead of simple string concatenation (i.e. a + b).
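When concatenating many strings, e.g. when they are built up in a loop, str.join is the idiomatic and efficient choice because it avoids creating a new intermediate string on every step:
words = ['Python', 'Guideline', 'Data', 'Team']

## Bad - builds a brand-new string on every iteration
result = ''
for word in words:
    result = result + ' ' + word

## Good
result = ' '.join(words)
## 'Python Guideline Data Team'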
Short Ways to Manipulate Lists
List comprehensions provide a powerful, concise way to work with lists. Generator expressions follow almost the same syntax as list comprehensions but return a generator instead of a list. This is crucial to remember: performance and memory matter, and it is worth understanding where generators are the appropriate choice.
## Bad
## returns a list first and then does the max calculation; [] creates the whole list in memory
foo = max([(student.id, student.name) for student in graduates])

## Good
## returns a generator object and computes the max as the generator is consumed; () creates a generator
bar = max((student.id, student.name) for student in graduates)
Note: performance and memory resources matter!
Comprehensions
Comprehensions in Python provide us with a short and concise way to construct new sequences (such as lists, sets, dictionaries, etc.) using sequences which have already been defined. Python supports the following 4 types of comprehensions:
- List comprehensions
- Dictionary comprehensions
- Set comprehensions
- Generator comprehensions - kindly consider spending some time understanding generators and how they are implemented in Python, as they are memory optimized and an efficient choice for processing massive sets of data.

In other words, any iterable can be part of a comprehension.
Comprehensions are a great example of the Pythonic way of thinking, as they provide clean and neat coding standards.
output_list = [output_exp for var in input_list if (var satisfies this condition)]
Note: nested comprehensions are allowed as well, but it is not best practice to have more than 2 comprehensions in one statement.
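For reference, here is a minimal example of each of the four comprehension types:
numbers = [1, 2, 3, 4]

squares_list = [n ** 2 for n in numbers]     # List comprehension
squares_dict = {n: n ** 2 for n in numbers}  # Dictionary comprehension
squares_set = {n ** 2 for n in numbers}      # Set comprehension
squares_gen = (n ** 2 for n in numbers)      # Generator comprehension (lazy)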
Filtering an iterable
There are plenty of ways to filter an iterable. Let's see some of them and how they fit high coding standards.
## Bad
## Never remove items from a list while you are iterating through it
## Filter elements greater than 4
foo = [3, 4, 5]
for i in foo:
    if i > 4:
        foo.remove(i)

## Bad
## Don't make multiple passes through the list
while i in foo:
    foo.remove(i)

## Good
## Use a list comprehension or generator expression
## comprehensions create a new list object
filtered_values = [value for value in sequence if value != x]

## generators don't create another list
filtered_values = (value for value in sequence if value != x)

## Good
## you can use a function as a filter
sequence = [1, 2, 3]

def dummy_filter(member: int) -> bool:
    return member != 2

filtered_values = [value for value in sequence if dummy_filter(value)]
## [1, 3]
As an alternative, the map, filter, and reduce functions can be used for this purpose. A handy link is Map, Filter and Reduce.
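For completeness, here is the same kind of filtering expressed with these functions; note that reduce lives in functools:
from functools import reduce

sequence = [1, 2, 3]

## filter keeps the members for which the function returns True
filtered_values = list(filter(lambda member: member != 2, sequence))
## [1, 3]

## map applies a function to every member
doubled_values = list(map(lambda member: member * 2, sequence))
## [2, 4, 6]

## reduce folds the sequence into a single value
total = reduce(lambda acc, member: acc + member, sequence)
## 6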
Modifying the values in a list
Remember that assignment never creates a new object. If two or more variables refer to the same list, changing one of them changes them all.
## Add three to all list members
list_a = [3, 4, 5]
list_b = list_a  # list_a and list_b refer to the same list object
for i in range(len(list_a)):
    list_a[i] += 3  # list_b[i] also changes

## for copying a list, use the .copy() method
list_a = [1, 2, 3]
list_b = list_a.copy()

## extend list_b; list_a will stay in its original shape
list_b.extend(list_b)
print(f"list_a: {list_a}")
print(f"list_b: {list_b}")
## list_a: [1, 2, 3]
## list_b: [1, 2, 3, 1, 2, 3]
It’s safer to create a new list object and leave the original alone.
list_a = [3, 4, 5]
list_b = list_a
## assign the variable "list_a" to a new list without changing "list_b"
list_a = [i + 3 for i in list_a]
Use enumerate() to keep a count of your place in the list.
## Good
foo = [3, 4, 5]
for i, item in enumerate(foo):
    print(i, item)
## prints
## 0 3
## 1 4
## 2 5
Note: The
enumerate()
function has better readability than handling a counter manually. Moreover, it is better optimized for iterators.
Read From a File
It is always good advice to use a context manager rather than plain assignment when loading data from a file. This approach will automatically close the file for you.
## Bad
f = open('file.txt')
a = f.read()
print(a)
f.close()  # we always forget something like this.

## Good
with open('file.txt') as f:
    for line in f:
        print(line)
## This approach will close the file for you
Line Continuations/Line Length
When a logical line of code is longer than the accepted limit, you need to split it over multiple physical lines. The Python interpreter will join consecutive lines if the last character of the line is a backslash. This is helpful in some cases, but should usually be avoided because of its fragility: a white space added to the end of the line, after the backslash, will break the code and may have unexpected results.
## Bad
my_very_big_string = """When a logical line of code is longer than the accepted limit, \
you need to split it over multiple physical lines. \
The Python interpreter will join consecutive lines if the last character of the line is a backslash."""

from some.deep.module.inside.a.module import a_nice_function, another_nice_function, \
    yet_another_nice_function

## Good
my_very_big_string = (
    "When a logical line of code is longer than the accepted limit, "
    "you need to split it over multiple physical lines. "
    "The Python interpreter will join consecutive lines if the last character of the line is a backslash."
)

from some.deep.module.inside.a.module import (
    a_nice_function, another_nice_function, yet_another_nice_function)
Spacing
Following PEP8 we recommend you put blank lines around logical sections of code.
When starting a for
loop or if/else
block, add a new line above the section to give the code some breathing room. Newlines are cheap - brain time is expensive.
## Bad
def foo(input_number: int) -> int:
    """
    Do some simple comparing
    """
    res = input_number
    if res == 2:
        return res
    else:
        return res ** 2

## Good
def bar(input_number: int) -> int:
    """
    Do some simple comparing
    """
    res = input_number

    if res == 2:
        return res
    else:
        return res ** 2
Type Hints
All function signatures should contain type hints, including for the return type, even if it is None
.
This is good documentation and can also be used with mypy
for type checking and error checking.
## Bad
def foo(x, y):
    """
    Add two numbers together and return.
    """
    return x + y

## Good
def foo(x: int, y: int) -> int:
    """
    Add two numbers together and return.
    """
    return x + y

## Good! (for None as return type)
def bar(some_str: str) -> None:
    """
    Print a string.
    """
    print(some_str)
    return
Import Order
Imports should follow the PEP8 rules and furthermore should be ordered with any import ... statements coming before from ... import ... statements.
## Bad
from os import environ
import logging
import some_local_module
from requests import get
import pandas as pd
from another_local_module import something
import sys
## Good
import logging
import sys
from os import environ

import pandas as pd
from requests import get

import some_local_module
from another_local_module import something
Also, linters should help you with this issue: isort, mypy, flake8, pylint.
Docstrings
- Docstrings should be used in every single file.
- Docstrings should be used in every single function. Since we are using type hints in the function signature there is no requirement to describe each parameter.
- Docstrings should use triple double-quotes and use complete sentences with punctuation.
## Good
def foo(x: int, y: int) -> int:
    """
    Add two numbers together and return the result.
    """
    return x + y

## Good! (for None as return type)
def bar(some_str: str) -> None:
    """
    Print a string.
    This is another proper sentence.
    """
    print(some_str)
    return

## Better! Have a Docstring on the module level
"""
This is a Docstring on a module level.
It should be handy to describe the purpose of your module.
"""

def bar(some_str: str) -> None:
    """
    Print a string.
    This is another proper sentence.
    """
    print(some_str)
    return
How to integrate Environment Variables
To make functions as reusable as possible, using environment variables directly inside functions is highly discouraged (unless there is a “very” good reason; there is an example of this below). Instead, the best practice is to either pass in the specific variable you want to use or pass all of the environment variables in as a dictionary. This allows you to pass in any dictionary and have it be compatible, while also not requiring the variables to be defined at the environment level.
import os
from typing import Dict

## Bad
def foo(x: int) -> int:
    """
    Add two numbers together and return.
    """
    return x + os.environ["y"]

foo(1)

## Good
env_vars = os.environ.copy()  # The copy method returns a normal dict of the env vars.

def bar(some_str: str, another_string: str) -> None:
    """
    Print two strings concatenated together.
    """
    print(f"{some_str} {another_string}")
    return

bar("foo", env_vars["bar"])

## Better
def bar(some_str: str, env_vars: Dict[str, str]) -> None:
    """
    Print two strings concatenated together.
    """
    print(f"{some_str} {env_vars['another_string']}")
    return

bar("foo", env_vars)
Parsing Dates
Ideally, never hardcode the date format using datetime.strptime unless absolutely necessary in cases where the format is very unusual. A better solution is to use the generic date parser in the dateutil library, as it handles a large variety of formats very reliably:
## Bad
from datetime import datetime
datevar = datetime.strptime(tstamp, "%Y-%m-%dT%H:%M:%S%z")

## Good
from dateutil import parser as date_parser
...
datevar = date_parser.parse(tstamp)
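To illustrate why the generic parser is preferable, it handles several common formats without any format string (the dates below are illustrative):
from dateutil import parser as date_parser

## all of these parse without specifying a format
date_parser.parse("2023-01-15T10:30:00+00:00")
date_parser.parse("Jan 15, 2023 10:30 AM")
date_parser.parse("15/01/2023")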
Package Aliases
We use a few standard aliases for common third-party packages. They are as follows:
import pandas as pd
import numpy as np
Variable Naming Conventions
Adding the type to the name is good self-documenting code. When possible, always use descriptive naming for variables, especially with regards to data type. Here are some examples:
- data_df is a dataframe
- params_dict is a dictionary
- retries_int is an int
- bash_command_str is a string
If passing a constant through to a function, name each variable that is being passed so that it is clear what each thing is.
Lastly, try and avoid redundant variable naming.
def bar(some_str: str, another_string: str) -> None:
    """
    Print two strings concatenated together.
    """
    print(some_str + another_string)
    return

## Good
bar(some_str="foo", another_string="bar")

## Better
some_str = "foo"
another_string = "bar"
bar(some_str, another_string)

## But Bad
bar(some_str=some_str, another_string=another_string)
Making your script executable
If your script cannot be run even though you’ve just created it, it most likely needs to be made executable. Run the following:
chmod 755 yourscript.py
For an explanation of chmod 755 read this askubuntu page.
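Note that the executable bit alone is not enough; the script also needs a shebang as its very first line so the shell knows to run it with Python:
#!/usr/bin/env python3
"""
Without the shebang line above, ./yourscript.py would be interpreted by the shell instead of Python.
"""
print("Hello from an executable script.")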
Mutable default function arguments
Using mutable data structures as default arguments in functions can introduce bugs into your code. This is because a new mutable data structure is created once when the function is defined, and the data structure is used in each successive call.
def append_to(element, to=[]):
    to.append(element)
    return to

my_list = append_to(12)
print(my_list)

my_other_list = append_to(42)
print(my_other_list)
Output:
[12]
[12, 42]
Note: Handy link for this topic: Python gotchas
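The common fix is to default to None and create a fresh object inside the function body:
## Good
def append_to(element, to=None):
    if to is None:
        to = []  # a new list is created on every call
    to.append(element)
    return to

print(append_to(12))
print(append_to(42))
## [12]
## [42]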
Exception handling
When writing a Python class to extract data from an API, it is the responsibility of that class to highlight any errors in the API process. Data modelling, source freshness, and formatting issues should be highlighted using dbt tests.
Avoid use of a general try/except block, as it is too broad and will make it difficult to find the real error:
## Bad
try:
    print("Do something")
except:
    print("Caught every type of exception")

## Good
while maximum_backoff_sec > (2 ** n):
    try:
        print("Do something")
    except APIError as gspread_error:
        if gspread_error.response.status_code in (429, 500, 502, 503):
            self.wait_exponential_backoff(n)
            n = n + 1
        else:
            raise
else:
    error(f"Max retries exceeded, giving up on {file_name}")

## Better! finer error granularity
while maximum_backoff_sec > (2 ** n):
    try:
        print("Do something")
    except APIError as gspread_error:
        if gspread_error.response.status_code in (429, 500, 502, 503):
            self.wait_exponential_backoff(n)
            n = n + 1
        else:
            raise
    except AttributeError as attribute_error:
        raise
    except KeyError as key_error:
        print('Caught this error: ' + repr(key_error))
Folder structure for new extracts
- All client-specific logic should be stored in the /extract folder; any API clients which may be reused should be stored in the /orchestration folder under the /analytics repo.
- Pipeline-specific operations should be stored in /extract.
- The folder structure in extract should include a file called extract_{source}_{dataset_name}, like extract_qualtrics_mailingsends, or extract_qualtrics if the script extracts multiple datasets. This script can be considered the main function of the extract, and is the file which gets run as the starting point of the extract DAG.
Unit Testing with pytest
Pytest is used to run unit tests in the /analytics
project. The tests are executed from the root directory of the project with the python_pytest
CI pipeline job. The job produces a JUnit
report of test results which is then processed by GitLab
and displayed on merge requests.
Most functional testing frameworks, pytest included, follow the Arrange-Act-Assert model:
- Arrange - set up the conditions for the test
- Act - call some function or method
- Assert - check that some end condition is True (the test will pass) or False (the test will fail)
pytest simplifies the testing workflow by allowing you to use Python’s assert keyword directly, without any boilerplate code.
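Here is a minimal sketch of the Arrange-Act-Assert model in a pytest test (the function under test is illustrative):
def multiply(x: int, y: int) -> int:
    """
    Multiply two numbers together and return the result.
    """
    return x * y

def test_multiply():
    # Arrange - set up the conditions for the test
    x, y = 3, 4
    # Act - call the function under test
    result = multiply(x, y)
    # Assert - a plain assert, no boilerplate needed
    assert result == 12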
Writing New Tests
New testing file names should follow the pattern test_*.py
so they are found by pytest
and easily recognizable in the repository.
New testing files should be placed in a directory named test
, usually under the current working folder. The test directory should share the same parent directory as the file that is being tested. For instance, if you are working on integration xyz, you should probably have a folder structure like the following:
## a typical folder structure for the xyz integration
|-- xyz
|   |-- src
|   |   |-- __init__.py
|   |   |-- execute.py
|   |-- test          # here you should put your test files
|   |   |-- test_xyz.py
A testing file should contain one or more tests.
Test functions should follow the test_* naming pattern.
An individual test is created by defining a function that has one or many plain Python assert
statements.
- If the asserts are all True, the test passes.
- If any assert is False, the test will fail.
Note: When writing imports, it is important to remember that tests are executed from the root directory.
In the future, additional directories may be added to the PythonPath
for ease of testing as need allows.
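For example, following the hypothetical xyz layout above, a test file would import the module under test by its path from the root directory (module and function names are illustrative):
## xyz/test/test_xyz.py
from xyz.src.execute import some_function  # hypothetical module and function

def test_some_function():
    assert some_function() is not None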
Basic Pytest Usage
When creating a test case, keep it simple and clear. The main point is to have small test cases that are consistent and easy to maintain.
## example when the test passes
import pytest

def test_example_pass():
    assert 1 == 1
## test.py::test_example_pass PASSED

## example when the test fails
import pytest

def test_example_failed():
    assert 1 == 2
## test.py::test_example_failed FAILED
Using Pytest Fixtures
pytest fixtures are a way of providing data, configs, or state setup to tests. Fixtures are functions, marked with the @pytest.fixture decorator, that can return a wide range of values and are especially useful for repeatable tasks and config items. Each test function that depends on a fixture must explicitly accept that fixture as an argument.
import pytest

@pytest.fixture()
def myfixture():
    # define some boring repeatable task needed for test cases
    return "This is my fixture"

## this will pass
def test_example(myfixture):
    assert myfixture == "This is my fixture"
## test.py::test_example PASSED

## this will also pass as myfixture is reused
def test_example_additional(myfixture):
    assert type(myfixture) == str
## test.py::test_example_additional PASSED
Parametrized Test Functions
A great way to avoid code duplication is to use Parametrizing tests and for that purpose, the magic happens behind the @pytest.mark.parametrize
decorator.
The builtin pytest.mark.parametrize
decorator enables parametrization of arguments for a test function.
This enables us to test different scenarios, all in one function. We make use of the @pytest.mark.parametrize
decorator, where we are able to specify the names of the arguments that will be passed to the test function, and a list of arguments corresponding to the names.
import pytest

## here is the magic word to parametrize more than one scenario
@pytest.mark.parametrize("test_value, expected_value", [("1+1", 2), ("2+3", 5), ("6*9", 54)])
def test_eval(test_value, expected_value):
    assert eval(test_value) == expected_value

## test.py::test_eval[1+1-2] PASSED [ 33%]
## test.py::test_eval[2+3-5] PASSED [ 66%]
## test.py::test_eval[6*9-54] PASSED [100%]
In other words, you can think of this decorator as behaving like the zip function: it pairs the comma-delimited parameter names with each entry in the list of values, producing one tuple per scenario.
Note: for the decorator
@pytest.mark.parametrize
the first argument to parametrize() is a comma-delimited string of parameter names. The second argument is a list of either tuples or single values that represent the parameter value(s).
Categorizing Tests using marks
In any large test suite, some of the tests will inevitably be slow. They might test timeout behavior, for example, or they might exercise a broad area of the code. Whatever the reason, it would be nice to avoid running all the slow tests when you’re trying to iterate quickly on a new feature. pytest enables you to define categories for your tests and provides options for including or excluding categories when you run your suite. You can mark a test with any number of categories.
Marking tests is useful for categorizing tests by subsystem or dependencies. If some of your tests require access to a network, for example, then you could create a @pytest.mark.network_access
mark for them.
- First, you need to define the markers in the pytest.ini file:
[pytest]
markers =
    network_access: requires network access
    local_test: can run locally
- Create a test file
import pytest

@pytest.mark.network_access
def test_network():
    assert 1 == 2

@pytest.mark.local_test
def test_local():
    assert 1 == 1
- run just
network_access
test(s):
## will fail, just to recognize what we run
╰─$ pytest test.py -m network_access
...
collected 2 items / 1 deselected / 1 selected
test.py F
- run just
local_test
test(s):
## this will pass
╰─$ pytest test.py -m local_test
...
collected 2 items / 1 deselected / 1 selected
test.py .
Duration Report
If you plan to improve the speed of your tests, then it’s useful to know which tests might offer the biggest improvements. pytest
can automatically record test durations for you and report the top offenders.
Use the --durations
option to the pytest command to include a duration report in your test results. --durations
expects an integer value n and will report the slowest n number of tests.
import pytest
from time import sleep

def test_slow():
    sleep(1)  # make it sleep 1s
    assert 1 == 1

def test_slower():
    sleep(2)  # make it sleep 2s
    assert 1 == 1

def test_slowest():
    sleep(3)  # make it sleep 3s
    assert 1 == 1
- Calling
pytest
:
╰─$ pytest test.py --durations=1
= test session starts =
...
collected 3 items
test.py ... [100%]
= slowest 1 durations =
3.00s call test.py::test_slowest
= 3 passed in 6.03s =
Using pytest with exceptions
Sometimes you want to assert that your code raises an expected exception in a pytest test. The solution is fairly simple; here is an example:
def test_namespace_file_error(usage_ping):
    """
    Test file loading
    """
    with pytest.raises(FileNotFoundError):
        get_meta_data_from_file(file_name="THIS_DOES_NOT_EXIST.json")
## in case this file doesn't exist, the function will raise an exception, and the test will pass

## Also, a handy option is to enrich pytest.raises with a match statement
with pytest.raises(ValueError, match="Raising error to.*"):
    run_metric_checks()
Using pytest with fake RESTful API
Use the unittest.mock and responses libraries when you need to test your code against a RESTful API. They allow you to define fake API calls and responses, enabling you to test such behaviors within a test environment.
Usage is pretty simple:
from unittest import mock

import pytest
import requests
import responses

## define a fake response, using the fixture mechanism
@pytest.fixture(name="fake_response")
def mocked_responses():
    """
    Mock routine to create a fake response
    """
    with responses.RequestsMock() as rsps:
        yield rsps

## test the fake response
def test_convert_response_to_json(fake_response):
    """
    Test function: convert_response_to_json
    """
    expected = {"test1": "pro", "test2": "1"}

    fake_response.get(
        "https://some_gitlab_api_url/test",
        body='{"test1": "pro", "test2": "1"}',
        status=200,
        content_type="application/json",
    )

    resp = requests.get("https://some_gitlab_api_url/test")
    assert resp.json() == expected

## Let's combine unittest.mock and pytest.raises
def test_get_response(utils):
    """
    Force a fake url and raise a ConnectionError
    """
    with pytest.raises(ConnectionError):
        _ = utils.get_response("https://fake_url/test")
Beyond pytest: Useful pytest Plugins
When pytest alone cannot answer your needs in more complicated scenarios, a handy plugin may be found. So far, we have not needed anything beyond plain pytest in the /analytics repo, but it is good to know there are useful tools that can help you do your work.
- pytest-randomly - pytest plugin to randomly order tests and control random.seed.
- pytest-cov - this plugin produces coverage reports. Compared to just using coverage run, this plugin does some extras: subprocess support, xdist support, and consistent pytest behavior.
- Full plugin list - a list of pytest plugins.
Tools for supporting coding quality
The Python community provides a comprehensive set of tools to ensure code quality and high standards for the codebase. The idea is to list a couple of tools from our code quality toolbox and expose some interesting items worth considering as potential options for future use.
Black
Our main linter is Black. We use its default configuration.
There is a manual CI job in the review
stage (Python
section) that will lint the entire repo and return a non-zero exit code if files need to be formatted. It is up to both the MR author and the reviewer to make sure that this job passes before the MR is merged. To lint the entire repo, run the command:
jump analytics
black .
If you want to check all files (without formatting) in the entire repo, or in any particular folder or file, using black, run:
## for the entire repo, run
$ jump analytics
$ run black --check .

## for a particular folder, run
$ run black --check extract/

## for a particular file, run
$ run black --check extract/saas_usage_ping/usage_ping.py
mypy
Mypy is an optional static type checker for Python that aims to combine the benefits of dynamic (or duck) typing and static typing. Mypy combines the expressive power and convenience of Python with a powerful type system and compile-time type checking. Mypy type checks standard Python programs.
## Run mypy for extract/ folder
mypy extract/ --ignore-missing-imports
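As a quick illustration of what mypy catches without running the code (the error message below is abbreviated):
def add(x: int, y: int) -> int:
    """
    Add two numbers together and return the result.
    """
    return x + y

add(1, "2")
## mypy: error: Argument 2 to "add" has incompatible type "str"; expected "int"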
flake8
Your tool for style guide enforcement: Flake8 is a popular lint wrapper for Python. Under the hood, it runs three other tools and combines their results: pycodestyle (formerly pep8) for checking style, pyflakes for checking syntax, and mccabe for checking complexity.
flake8 . --ignore=E203,E501,W503,W605
pylint
Pylint is a static code analyser for Python.
Pylint analyses your code without actually running it. It checks for errors, enforces a coding standard, looks for code smells, and can make suggestions about how the code could be refactored. Pylint can infer actual values from your code using its internal code representation (astroid). If your code is import logging as argparse, Pylint will know that argparse.error(…) is in fact a logging call and not an argparse call.
Pylint is highly configurable and permits writing plugins to add your own checks (for example, for internal libraries or an internal rule). Pylint also has an ecosystem of existing plugins for popular frameworks and third-party libraries.
## we use pylint and exclude some checks to reduce the noise (i.e. line-too-long)
pylint extract/ --ignore=analytics/dags --disable=line-too-long,E0401,E0611,W1203,W1202,C0103,R0801,R0902,W0212
xenon
Xenon is a monitoring tool based on Radon. It monitors your code’s complexity. Ideally, Xenon is run every time you commit code. Through command line options, you can set various thresholds for the complexity of your code. It will fail (i.e. it will exit with a non-zero exit code) when any of these requirements is not met.
run xenon --max-absolute B --max-modules B --max-average A . -i transform,shared_modules
vulture
Vulture finds unused code in Python programs. This is useful for cleaning up and finding errors in large code bases. If you run Vulture on both your library and test suite you can find untested code.
Due to 🐍Python’s dynamic nature, static code analyzers like Vulture are likely to miss some dead code. Also, code that is only called implicitly may be reported as unused. Nonetheless, Vulture can be a very helpful tool for higher code quality.
run vulture . --min-confidence 100
A few more handy libraries
For more elements on how we automated linter to keep code quality on the elevated level, refer to 🐍 Python CI jobs. All of these linters can be tested automatically and for that purpose, we created a comprehensive set of commands using Makefile.
In addition, we recommend checking out, exploring, and considering:
- pycodestyle
- yapf
- autopep8
- HowDOI - a good tool for a quick search.
- iSort - isort is a Python utility/library to sort imports alphabetically and automatically separate them into sections and by type.
Tools for automating our coding quality standards
For automating code quality and testing, we are using our own product GitLab CI/CD pipeline. Details of pipelines we use for python should be found on the page CI jobs (Python).
When not to use Python
Since this style guide is for the entire data team, it is important to remember that there is a time and place for using Python
and it is usually outside of the data modeling phase.
Stick to SQL
for data manipulation tasks where possible.