SQL Style Guide
SQL Style Guide
This guide establishes our standards for SQL and are enforced by the SQLFluff linter and by code review. The target code changes that this stile guide apply to are those made using dbt.
If you are not on the Data Team or you are developing SQL outside of dbt, keep in mind that the linting tools may be more difficult to apply, but you are welcome to follow the guidance given within this guide.
Usage
We expect that people will use the style presented in this guide during there development. Enforcement is still expected to happen at the time of review as the CI pipeline is only run manually. At a latter date the pipeline will run with every change and eventually set to be mandatory for the pipeline to pass.
The previous style guide used a mixed indentation style that the linter does not support so every model will likely fail the first time it is checked. The intention is that models will be updated to the new style as they are worked on normally as not all of the rules can automatically enforced (such as explicit column aliasing). The intention of working updating them in the course of other development is to keep the changes manageable.
SQLFluff
SQLFLuff is a SQL linter that works with templating tools like dbt. We use it to define the basic structure and style of the SQL that we write and move the review of that structure and style into the hands of the authors. SQLFluff is included within the dbt development environment and it uses the dbt templating engine during the linting process. It can be used with the following command:
sqlfluff lint models/path/to/file/file-to-lint.sql
If you are in the pytohn virtual environemtn for dbt then a make
command with a dbt model name can be used:
make lint-models MODEL="model_name"
A dbt command can also be used to get a list of files to lint:
sqlfluff lint $(dbt list --model model_name --output path)
If you are writing SQL that is not templated using dbt then you can install and use SQLFluff directly as it is a stand alone python package.
pip install sqlfluff
sqlfluff lint path/to/file/file-to-lint.sql
SQLFluff includes a fix
command that will apply fixes to rule violations when possible. Not all rule violations are automatically fixable; therefore, you are encouraged to run the lint
command after using the fix
command to ensure that all rule violations have been resolved.
Changes from the default configuration
- selecting the dialect to snowflake
- selecting the templater to be dbt
- excluding the following rules:
- L016 Line length
- L031 No table aliases in FROM or JOIN clauses
- L034 Order of simple targets and calcuations in SELECT statment
- aligning column aliases within the SELECT statement
- excluding the use of hanging indents
- selecting the tab size to be 2 spaces
- selecting Key words, Data Types and Functions to always be upper case
- require table aliases to be a minimum of four characters
The configuration file that the Data Team uses can be found in the GitLab Data Team repository.
General Guidance
-
Do not optimize for fewer lines of code, new lines are cheap but brain time is expensive.
-
Familiarize yourself with the DRY Principal. Leverage CTEs, jinja and macros in dbt, and snippets in Sisense. If you type the same line twice, it needs to be maintained in two places.
-
Be consistent. Even if you are not sure of the best way to do something do it the same way throughout your code, it will be easier to read and make changes if they are needed.
-
Be explicit. Defining something explicitly will ensure that it works the way you expect and it is easier for the next person, which may be you, when you are explicit in SQL.
Best Practices
-
No tabs should be used - only spaces. Your editor should be setup to convert tabs to spaces - see our onboarding template for more details.
-
Wrap long lines of code, between 80 and 100, to a new line.
-
Do not use the
USING
command in joins because it produces inaccurate results in Snowflake. Create an account to view the forum discussion on this topic. -
Understand the difference between the following related statements and use appropriately:
UNION ALL
andUNION
LIKE
andILIKE
NOT
and!
and<>
DATE_PART()
andDATE_TRUNC()
-
Use the
AS
operator when aliasing a column or table. -
Prefer
DATEDIFF
to inline additionsdate_column + interval_column
. The function is more explicit and will work for a wider variety of date parts. -
Prefer
!=
to<>
. This is because!=
is more common in other programming languages and reads like “not equal” which is how we’re more likely to speak. -
Prefer
LOWER(column) LIKE '%match%'
tocolumn ILIKE '%Match%'
. This lowers the chance of stray capital letters leading to an unexpected result. -
Prefer
WHERE
toHAVING
when either would suffice.
Commenting
- When making single line comments in a model use the
--
syntax - When making multi-line comments in a model use the
/* */
syntax - Respect the character line limit when making comments. Move to a new line or to the model documentation if the comment is too long
- Utilize the dbt model documentation when it is available
- Calculations made in SQL should have a brief description of what’s going on and if available, a link to the handbook defining the metric (and how it’s calculated)
- Instead of leaving
TODO
comments, create new issues for improvement
Naming Conventions
-
An ambiguous field name such as
id
,name
, ortype
should always be prefixed by what it is identifying or naming:-- Preferred SELECT id AS account_id, name AS account_name, type AS account_type, ... -- vs -- Not Preferred SELECT id, name, type, ...
-
All field names should be snake-cased:
-- Preferred SELECT dvcecreatedtstamp AS device_created_timestamp ... -- vs -- Not Preferred SELECT dvcecreatedtstamp AS DeviceCreatedTimestamp ...
-
Boolean field names should start with
has_
,is_
, ordoes_
:-- Preferred SELECT deleted AS is_deleted, sla AS has_sla ... -- vs -- Not Preferred SELECT deleted, sla, ...
-
Timestamps should end with
_at
and should always be in UTC. -
Dates should end with
_date
. -
When truncating dates name the column in accordance with the truncation.
SELECT original_at, -- 2020-01-15 12:15:00.00 original_date, -- 2020-01-15 DATE_TRUNC('month',original_date) AS original_month -- 2020-01-01 ...
-
Avoid key words like
date
ormonth
as a column name.
Reference Conventions
-
When joining tables and referencing columns from both tables consider the following:
-
reference the full table name instead of an alias when the table name is shorter, maybe less than 20 characters. (try to rename the CTE if possible, and lastly consider aliasing to something descriptive)
-
always qualify each column in the SELECT statement with the table name / alias for easy navigation
-- Preferred SELECT budget_forecast.account_id, date_details.fiscal_year, date_details.fiscal_quarter, date_details.fiscal_quarter_name, cost_category.cost_category_level_1, cost_category.cost_category_level_2 FROM budget_forecast_cogs_opex AS budget_forecast LEFT JOIN date_details ON date_details.first_day_of_month = budget_forecast.accounting_period LEFT JOIN cost_category ON budget_forecast.unique_account_name = cost_category.unique_account_name -- vs -- Not Preferred SELECT a.account_id, b.fiscal_year, b.fiscal_quarter, b.fiscal_quarter_name, c.cost_category_level_1, c.cost_category_level_2 FROM budget_forecast_cogs_opex a LEFT JOIN date_details b ON b.first_day_of_month = a.accounting_period LEFT JOIN cost_category c ON b.unique_account_name = c.unique_account_name
-
-
Only use double quotes when necessary, such as columns that contain special characters or are case sensitive.
-- Preferred SELECT "First_Name_&_" AS first_name, ... -- vs -- Not Preferred SELECT FIRST_NAME AS first_name, ...
-
Prefer accessing JSON using the bracket syntax.
-- Preferred SELECT data_by_row['id']::bigint as id_value ... -- vs -- Not Preferred SELECT data_by_row:"id"::bigint as id_value ...
-
Prefer explicit join statements.
-- Preferred SELECT * FROM first_table INNER JOIN second_table ... -- vs -- Not Preferred SELECT * FROM first_table, second_table ...
-
Also prefer explicit join statements for
LATERAL FLATTEN
. However, it should be noted that the current code-base does not consistently adhere to this practice.-- Preferred SELECT data.value, source.uploaded_at FROM source INNER JOIN LATERAL FLATTEN(input => source.jsontext['data']) AS data ... -- vs -- Not Preferred SELECT data.value, source.uploaded_at FROM source, LATERAL FLATTEN(input => source.jsontext['data']) AS data ...
Common Table Expressions (CTEs)
-
Prefer CTEs over sub-queries as CTEs make SQL more readable and are more performant:
-- Preferred WITH important_list AS ( SELECT DISTINCT specific_column FROM other_table WHERE specific_column != 'foo' ) SELECT primary_table.column_1, primary_table.column_2 FROM primary_table INNER JOIN important_list ON primary_table.column_3 = important_list.specific_column -- vs -- Not Preferred SELECT primary_table.column_1, primary_table.column_2 FROM primary_table WHERE primary_table.column_3 IN ( SELECT DISTINCT specific_column FROM other_table WHERE specific_column != 'foo')
-
Use CTEs to reference other tables.
-
CTEs should be placed at the top of the query.
-
Where performance permits, CTEs should perform a single, logical unit of work.
-
CTE names should be as concise as possible while still being clear.
- Avoid long names like
replace_sfdc_account_id_with_master_record_id
and prefer a shorter name with a comment in the CTE. This will help avoid table aliasing in joins.
- Avoid long names like
-
CTEs with confusing or notable logic should be commented in file and documented in dbt docs.
-
CTEs that are duplicated across models should be pulled out into their own models.
Data Types
- Use default data types and not aliases. Review the Snowflake summary of data types for more details. The defaults are:
-
NUMBER
instead ofDECIMAL
,NUMERIC
,INTEGER
,BIGINT
, etc. -
FLOAT
instead ofDOUBLE
,REAL
, etc. -
VARCHAR
instead ofSTRING
,TEXT
, etc. -
TIMESTAMP
instead ofDATETIME
The exception to this is for timestamps. Prefer
TIMESTAMP
toTIME
. Note that the default forTIMESTAMP
isTIMESTAMP_NTZ
which does not include a time zone.
-
Functions
-
Prefer
IFNULL
toNVL
. -
Prefer
IFF
to a single lineCASE
statement:-- Preferred SELECT IFF(column_1 = 'foo', column_2,column_3) AS logic_switch, ... -- vs -- Not Preferred SELECT CASE WHEN column_1 = 'foo' THEN column_2 ELSE column_3 END AS logic_switch, ...
-
Prefer
IFF
to selecting a boolean statement:-- Preferred SELECT IFF(amount < 10,TRUE,FALSE) AS is_less_than_ten, ... -- vs -- Not Preferred SELECT (amount < 10) AS is_less_than_ten, ...
-
Prefer simplifying repetitive
CASE
statements where possible:-- Preferred SELECT CASE field_id WHEN 1 THEN 'date' WHEN 2 THEN 'integer' WHEN 3 THEN 'currency' WHEN 4 THEN 'boolean' WHEN 5 THEN 'variant' WHEN 6 THEN 'text' END AS field_type, ... -- vs -- Not Preferred SELECT CASE WHEN field_id = 1 THEN 'date' WHEN field_id = 2 THEN 'integer' WHEN field_id = 3 THEN 'currency' WHEN field_id = 4 THEN 'boolean' WHEN field_id = 5 THEN 'variant' WHEN field_id = 6 THEN 'text' END AS field_type, ...
-
Prefer the explicit date function over
date_part
, but preferdate_part
overextract
:DAYOFWEEK(created_at) > DATE_PART(dayofweek, 'created_at') > EXTRACT(dow FROM created_at)
-
Be mindful of date part interval when using the
DATEDIFF
function as the function will only return whole interval results.
Example Code
This example code has been processed though SQLFluff linter and had the style guide applied.
WITH my_data AS (
SELECT my_data.*
FROM prod.my_data_with_a_long_table_name AS my_data
INNER JOIN prod.other_thing
WHERE my_data.filter = 'my_filter'
),
some_cte AS (
SELECT DISTINCT
id AS other_id,
other_field_1,
other_field_2,
date_field_at,
data_by_row,
field_4,
field_5,
LAG(
other_field_2
) OVER (PARTITION BY other_id, other_field_1 ORDER BY 5) AS previous_other_field_2
FROM prod.my_other_data
),
/*
This is a very long comment: It is good practice to leave comments in code to
explain complex logic in CTEs or business logic which may not be intuitive to
someone who does not have intimate knowledge of the data source. This can help
new users familiarize themselves with the code quickly.
*/
final AS (
SELECT
-- This is a singel line comment
my_data.field_1 AS detailed_field_1,
my_data.field_2 AS detailed_field_2,
my_data.detailed_field_3,
DATE_TRUNC('month', some_cte.date_field_at) AS date_field_month,
some_cte.data_by_row['id']::NUMBER AS id_field,
IFF(my_data.detailed_field_3 > my_data.field_2, TRUE, FALSE) AS is_boolean,
CASE
WHEN my_data.cancellation_date IS NULL
AND my_data.expiration_date IS NOT NULL
THEN my_data.expiration_date
WHEN my_data.cancellation_date IS NULL
THEN my_data.start_date + 7 -- There is a reason for this number
ELSE my_data.cancellation_date
END AS adjusted_cancellation_date,
COUNT(*) AS number_of_records,
SUM(some_cte.field_4) AS field_4_sum,
MAX(some_cte.field_5) AS field_5_max
FROM my_data
LEFT JOIN some_cte
ON my_data.id = some_cte.id
WHERE my_data.field_1 = 'abc'
AND (my_data.field_2 = 'def' OR my_data.field_2 = 'ghi')
GROUP BY 1, 2, 3, 4, 5, 6
HAVING COUNT(*) > 1
ORDER BY 8 DESC
)
SELECT *
FROM final
Other SQL Style Guides
e353870b
)