Data Team Platform

GitLab Data Team Platform

Our Data Platform Vision

These ambitions serve as the guiding vision for GitLab’s data platform.

Makes it Easier to Contribute

Contributing to GitLab’s Data Platform is easy and using the platform is intuitive

Documentation is complete and relevant for users and contributors
All data transformations are implemented in dbt
CI/CD is seamless, intuitive, and automated for contributors and reviewers
Data state is derived from sources and transformations
Data pipelines are idempotent
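
As a rough illustration of what an idempotent pipeline step means in practice, the sketch below replaces a whole partition on every run, so re-running the same batch leaves the target unchanged. The in-memory `warehouse` dict and the function names are hypothetical stand-ins for illustration, not part of the platform.

```python
def load_partition(warehouse: dict, partition_key: str, rows: list[dict]) -> None:
    """Replace the partition wholesale so a rerun cannot duplicate rows."""
    warehouse[partition_key] = [dict(row) for row in rows]


def total_rows(warehouse: dict) -> int:
    """Count the rows across all partitions."""
    return sum(len(rows) for rows in warehouse.values())


# Hypothetical in-memory stand-in for a target table, keyed by load date.
warehouse: dict = {}
batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]

load_partition(warehouse, "2024-01-01", batch)
load_partition(warehouse, "2024-01-01", batch)  # rerun of the same batch
```

An append-only load would have doubled the row count on the second run; replacing by partition key is one common way to avoid that.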

Is Reliable

The data platform along with the data it delivers is consistent in availability and accuracy

All breaking changes are testable in Dev and/or Staging Environments
Automated tests are implemented at every stage of the data delivery process
Every component of the platform can and should be defined in code and version controlled

Is Secure

The Data Platform doesn’t put people at risk

Data is only accessible to those authorized by documented approvals
The GitLab data team follows the Principle of Least Privilege for authorization and authentication

Is Maintainable

Data Platform components are built following good engineering practices so that they are easy to maintain. Tracking maintainability is intended to help reduce or reverse a system’s tendency toward “code entropy” or degraded integrity.

Benefits a Larger Community

GitLab’s Data Platform is relevant to a community larger than GitLab’s and depends on a larger community of engineers.

Relevant Platform code is open sourced
Platform enhancements are contributed back into community projects
We prefer generalizable specifications and standards over idiosyncratic custom development

Purpose

The Data Platform is used for data analytics purposes. This document describes, conceptually and at a high level, the components that together make up the Data Platform.

Scope

This document is limited to describing the Data Platform conceptually. Other resources describe it in more detail (for example, the data pipelines and the infrastructure).

Roles & Responsibilities

Role	Responsibility
GitLab Team Members	Responsible for taking notice of the standard that forms the Data Platform
Data Platform Team Members	Responsible for implementing and executing data use cases based on this standard
Data Management Team	Responsible for approving significant changes and exceptions to this standard

Standards

Our Data Stack

Enterprise Data Platform

We use GitLab to operate and manage the analytics function. Everything starts with an issue. Changes are implemented via merge requests, including changes to our pipelines, extraction, loading, transformations, and parts of our analytics.

Stage	Tools
Extraction	Stitch, Fivetran, Tableau Prep, and Custom Code
Loading	Stitch, Fivetran, Tableau Prep, and Custom Code
Orchestration	Airflow, Tableau Prep
Data Warehouse	Snowflake Enterprise Edition
Transformations	dbt and Python scripts
Data Visualization	Tableau
Advanced Analytics	Jupyter

Extract and Load

We currently use Stitch and Fivetran for some of our data sources. These are off-the-shelf ELT tools that remove the responsibility of building, maintaining, or orchestrating the movement of data from some data sources into our Snowflake data warehouse.

Stitch and Fivetran handle the start of the data pipeline themselves. This means that Airflow does not play a role in orchestrating the Stitch and Fivetran schedules.

Other solutions we use to extract data are:

Meltano
Custom pipelines built in Python and orchestrated via Airflow
Flows built in Tableau Prep and orchestrated by Tableau Cloud
Snowflake data share

For source ownership please see the Tech Stack Applications data file.

Data Sources

The following table indexes all of the RAW data sources we are loading into the data warehouse from external locations. We manage the development backlog and priorities in the New Data Source/Pipeline Project Management sheet, with links to GitLab issues for up-to-date status and progress management. The new data source handbook page describes how the Data Team handles any request for new data sources.

The link in the pipeline column in the table below will bring you to the detailed page of the specific data pipeline if applicable.

Key

Pipeline: The technology we use to replicate data.
RF (Replication Frequency): How often we load new and updated data.
Raw Schema: The schema in the RAW database where data is stored.
Prep Schema: The schema in the PREP database where source models are materialized.
Audience: The primary users of the data.
SLO: Service Level Objective. Our SLO is the time between real-time and the data made available for consumption.
- Technically, this means the time between when an entry is made in an upstream system and when the data is available in the Snowflake PROD layer (which includes transformations in dbt).
x: Undefined or not run.
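
To make the SLO definition concrete, here is a minimal sketch that parses an `RF / SLO` cell from the table and checks whether one record met its SLO. The function names are hypothetical. Only the `h`, `d`, and `w` suffixes are parsed; `m` is left unparsed because the table appears to use it for both minutes and months, and `x` or any other value yields `None`.

```python
import re
from datetime import datetime, timedelta

# Suffixes that are unambiguous in the table ("m" is deliberately left out).
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}


def parse_duration(text: str):
    """Parse values such as '24h', '7d' or '1w'; return None for 'x' or anything else."""
    match = re.fullmatch(r"(\d+)\s*([hdw])", text.strip())
    if match is None:
        return None
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def parse_rf_slo(cell: str):
    """Split an 'RF / SLO' cell into (replication_frequency, slo)."""
    rf_text, _, slo_text = cell.partition("/")
    return parse_duration(rf_text), parse_duration(slo_text)


def meets_slo(entered_upstream: datetime, available_in_prod: datetime, slo: timedelta) -> bool:
    """True when the data reached the PROD layer within the SLO window."""
    return available_in_prod - entered_upstream <= slo
```

For example, `parse_rf_slo("24h / 36h")` yields a 24-hour replication frequency and a 36-hour SLO, and a record entered at midnight that lands in PROD 30 hours later meets that SLO.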

Name	Pipeline	Raw Schema	Prep Schema	Audience	RF / SLO	MNPI	Tier
Adaptive	Airflow	`adaptive_custom`	`x`	Finance		Yes	Tier 2
Adobe / Bizible	Airflow	`bizible`	`sensitive`	Marketing	24h / 36h	No	Tier 2
Airflow	Stitch	`airflow_stitch`	`airflow`	Data Team	24h / 24h	No	Tier 3
AWS Billing	Snowflake external tables	`aws_billing`	`aws_billing`	Engineering	24h / 24h	No	Tier 2
Clari	Airflow	`clari`	`clari`	Sales	24h / 24h	Yes	Tier 2
Clearbit	x	`x`	`x`		x / x	No	Tier 3
ClickHouse billing	Airflow	`clickhouse_billing`	`clickhouse_billing`	Engineering	24h / 24h	No	Tier 2
Common Room	Snowflake task	`commonroom`	`commonroom`	DevRels/Developer Advocates		No	Tier 3
Cornerstone	Fivetran	`cornerstone`	`cornerstone`	People	6h / 12h	No	Tier 2
Coupa Production	Fivetran	`coupa`	`coupa`	Marketing	24h / 48h	No	Tier 2
Coupa Sandbox	Fivetran	`coupa_sandbox`	`coupa_sandbox`	Marketing	Ad-hoc	No	Tier 3
CustomersDot	pgp	`tap_postgres`	`customers`	Product	24h / x	No	Tier 1
Demandbase	Snowflake task	`demandbase`	`demandbase`	Marketing	24h / x	No	Tier 2
Demo Architecture Portal	Stitch	`demo_architecture_portal`	`demo_architecture_portal`	Sales and marketing	7d / 7d	No	Tier 3
Elastic Search Billing	Airflow	`elasticsearch_billing`	`elastic_billing`	Engineering	24h / 24h	No	Tier 2
End to End test metrics	Snowflake tasks	`e2e_metrics`	`e2e_metrics`	Engineering	24h / 48h	No	Tier 2
Ecosystems BVA	Airflow	`ecosystems`	`ecosystems`	Sales	24h / 48h	No	Tier 3
Facebook_ads	Fivetran	`facebook_ads`	`facebook_ads`	Marketing	24h / 48h	No	Tier 3
Fivetran_Logs	Fivetran	`N/A`	`N/A`	Data	24h / 48h	No	Tier 3
Flaky test Metrics	Snowflake tasks	`flaky_tests`	`flaky_tests`	Engineering	24h / 48h	No	Tier 2
Gainsight Customer Success	Fivetran	`gainsight_customer_success`	`gainsight_customer_success`	Customer Success	24h / 48h	No	Tier 3
GitLab Availability	Snowflake tasks	`gitlab_availability`	`gitlab_availability`	Product, Engineering	24h / 48h	No	Tier 2
GitLab.com	pgp	`tap_postgres`	`gitlab_dotcom`	Product, Engineering	12h / 55h	No	Tier 1
GitLab Profiler DB	x	`x`	`x`	x	x / x	No	Tier 3
GitLab Container Registry Logs	Airflow	`Container Registry`	`Container Registry`	Engineering	x	No	Tier 2
Google Ads	Fivetran	`google_ads`	`google_ads`	Marketing	24h / 48h	No	Tier 2
Google Analytics 360	Fivetran	`google_analytics_360_fivetran`	`google_analytics_360`	Marketing	6h / 32h	No	Tier 2
Google Analytics 4	BigQuery Exporter	`google_analytics_4_bigquery`	`google_analytics_4`	Marketing	24h / 48h	No	Tier 2
Google Cloud Billing	BigQuery Exporter	`gcp_billing`	`gcp_billing`	Engineering	24h / x	No	Tier 1
Google Search Console	Fivetran	`google_search_console`	`google_search_console`	Marketing	24h / 48h	No	Tier 2
Graphite API	Airflow	`engineering_extracts`	`x`	Engineering	24h / 48h	No	Tier 3
Greenhouse	Sheetload	`greenhouse`	`greenhouse`	People	24h / 48h	No	Tier 2
Hackerone	Airflow	`hackerone`	`x`	Security/Engineering	24h / 48h	No	Tier 2
Handbook YAML Files	Airflow	`gitlab_data_yaml`	`gitlab_data_yaml`	Multiple	8h / 24h	No	Tier 2
Handbook MR Data	Airflow	`handbook`	`handbook`	Multiple	24h / 24h	No	Tier 2
Handbook Git Log Data	Airflow	`handbook`	`handbook`	Multiple	1w / 1m	No	Tier 2
Iterable	Fivetran	`iterable`	`n/a`	Multiple	24h / 48h	No	Tier 3
Just Global Campaigns	Snowflake task	`just_global_campaigns`	`just_global_campaigns`	Marketing	7d / 14d	No	Tier 3
Kantata	Airflow	`kantata`	`kantata`	Customer Success	24h / 48h	Yes	Tier 3
Level Up/Thought Industries	Airflow	`level_up`	`level_up`	People	24h / 24h	No	Tier 3
LinkedIn ads	Fivetran	`linkedin_ads`	`n/a`	Marketing	24h / 48h	No	Tier 3
MailGun	Airflow	`mailgun`	`sensitive`	Sales, Marketing, Customer Success, Digital Success	24h / 24h	No	Tier 3
Marketo	Fivetran	`marketo`	`x`	Marketing	24h / 24h	No	Tier 2
ModernLoop	Airflow	`modernLoop`	`modernLoop`	People	24h / 48h	No	Tier 3
Monte Carlo	Snowflake Share	`n/a`	`prep_legacy`	Data	12h / 24h	No	Tier 3
Netsuite	Fivetran	`netsuite_fivetran`	`netsuite`	Finance	6h / 24h	Yes	Tier 2
Omamori	Airflow	`omamori`	`omamori`	Engineering	1h / 24h	No	Tier 2
Pajamas Adoption Scanner	Airflow	`pajamas_adoption_scanner`	`pajamas_adoption_scanner`	Engineering	24h / 48h	No	Tier 3
PMG	x	`pmg`	`pmg`	x	x / x	No	Tier 3
Time Off by Deel	Snowpipe	`pto`	`gitlab_pto`	Engineering Productivity / People	7d / x	No	Tier 3
Qualtrics	Airflow	`qualitrics`	`qualtrics`	Marketing	12h / 48h	No	Tier 2
Rally	Stitch Webhook	`rally_webhook_stitch`	`sensitive`	UX	24h / 48h	No	Tier 3
SaaS Service Ping	Airflow	`saas_usage_ping`	`saas_usage_ping`	Product	1w / 24h	No	Tier 1
Salesforce	Stitch	`salesforce_v2_stitch`	`sfdc`	Sales	6h / 24h	Yes	Tier 1
Salesforce (for Gong Data)	Fivetran	`RAW.gong_salesforce`	`prep.gong_salesforce`	Sales	6h / 24h	Yes	Tier 1
Salesforce Sandbox	Stitch	`salesforce_stitch_sandbox_v2`	`TBC`	Sales	24h / 48h	Yes	Tier 3
Salesforce Sandbox Test 2	Stitch	`salesforce_stitch_sandbox_test2`	`TBC`	Sales	24h / 48h	Yes	Tier 3
SheetLoad	SheetLoad	`sheetload`	`sheetload`	Multiple	24h / 48h	Yes	Tier 1
SIRT Alertapp	Snowflake task	`sirt_alertapp`	`sirt_alertapp`	Engineering	24h / 48h	No	Tier 3
Snowplow	Snowpipe	`snowplow`	`snowplow`	Product	15m / 24h	No	Tier 1
Tableau Cloud	Tableau Prep	`tableau_cloud`	`tableau_cloud`	Data Team	24h / 24h	No	Tier 3
Tableau Back-end Data	Fivetran	`tableau_fivetran`	`N/A`	Data Team	24h / 48h	No	Tier 3
Thanos	Snowflake Task	`prometheus`	`prometheus`	Engineering	24h / x	No	Tier 3
Version DB	Automatic Process	`version_db`	`version_db`	Product	24h / 48h	No	Tier 1
Workday	Fivetran	`workday`	`workday`	People	6h / 24h	No	Tier 2
Xactly	Meltano	`tap_xactly`	`N/A`	Sales	24h / N/A	Yes	Tier 2
Zendesk	Meltano	`tap_zendesk`	`zendesk`	Support	24h / 48h	No	Tier 2
Zuora	Stitch	`zuora_stitch`	`zuora`	Finance	6h / 24h	Yes	Tier 1
Zuora API Sandbox	Stitch	`zuora_api_sandbox_stitch`	`Legacy`	Finance	24h / 24h	Yes	Tier 3
Zuora Central Sandbox	Fivetran	`zuora_central_sandbox_fivetran`	`zuora_central_sandbox`	Finance Sandbox	-	Yes	Tier 3
Zuora Central Sandbox 2	Fivetran	`zuora_central_sandbox_2`	`zuora_central_sandbox_2`	Finance Sandbox	-	Yes	Tier 3
Zuora Developer Sandbox	Fivetran	`zuora_dev_sandbox_fivetran`	`TBD`	Finance Sandbox	-	Yes	Tier 3
Zuora Data Query	Airflow	`zuora_query_api`	`zuora_query_api`	Finance	24h / 48h	Yes	Tier 1
Zuora Revenue	Airflow	`zuora_revenue`	`zuora_revenue`	Finance	24h / 48h	Yes	Tier 1
Integrate DAP	Fivetran	`integrate_dap`	`integrate_dap`	Marketing	6h / 12h	No	Tier 3

Source contacts

See the source contact spreadsheet for who to contact if there are any external errors.

Tier definition

Aspect	Tier 1	Tier 2	Tier 3
Description	- Trusted Data solutions that are most important and business critical. - Components need to be available and refreshed to ensure day-to-day operation	- Data solutions that are important and beneficial for gathering insights. - Components should be available and refreshed to support day-to-day operation	- Data solutions that are important for ad-hoc, periodic, or one-time analysis. - Components could be unavailable or data not refreshed.
Criteria	- Any data, process, or related service that would result in a $100k or higher business impact if unavailable for 24 hours - Affecting more than 15 business users	- Any data, process, or related service that would result in less than $100k business impact if unavailable for 24 hours - Affecting between 5 and 15 business users	- Any data, process, or related service that would not result in an immediate business impact if unavailable for more than 5 working days - Affecting fewer than 5 business users
Impact due to outage	Severe	Lenient	Negligible
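
The criteria above can be read as a simple decision rule. The sketch below is one possible interpretation, not the official classification logic. It assumes that either criterion is enough to place a source in the more critical tier, and that "no immediate business impact" can be modelled as an impact of 0; the table states neither. The function and parameter names are hypothetical.

```python
def classify_tier(impact_usd_24h: float, business_users: int) -> str:
    """Return the most critical tier whose criteria are met (assumed OR semantics)."""
    # Tier 1: $100k or higher impact if unavailable for 24 hours, or more than 15 users.
    if impact_usd_24h >= 100_000 or business_users > 15:
        return "Tier 1"
    # Tier 2: some impact below $100k, or between 5 and 15 users.
    if impact_usd_24h > 0 or business_users >= 5:
        return "Tier 2"
    # Tier 3: no immediate impact and fewer than 5 users.
    return "Tier 3"
```

Under these assumptions, a source with a $20k daily impact and 8 business users lands in Tier 2.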

Data Team Access to Data Sources

In order to integrate new data sources into the data warehouse, specific members of the Data team will need admin-level access to data sources, both in the UI and through the API. We need admin-level access through the API to pull all the data needed to build the appropriate analyses, and through the UI to compare the results of prepared analyses with what the UI shows.

Access to sensitive data sources can be limited to no fewer than 1 data engineer and 1 data analyst, who need it to build the required reporting. In some cases, it may only be 2 data engineers. We will likely request an additional account for the automated extraction process.

Sensitive data is locked down through the security paradigms listed below.

Data Source Overviews

Customer Success Dashboards
Netsuite
- Netsuite and Campaign Data
Version (pings)
- Note that up until October 2019, the data team referred to the entire version data source as “pings”. However, usage ping is only one subset of the version data source, which is why we now use “version” or “version app” to refer to the version.gitlab.com data source and “usage data”, “usage pings”, or “pings” to refer to the specific usage data feature of the version data source. For Service Ping data ingestion, specific details can be found in the Service Ping page or in the Service Ping README.md.
Salesforce
Zendesk

Snowplow Infrastructure

Refer to the Snowplow Infrastructure page for more information on our setup.

AI

We use AI to develop data products and we use AI in our data products.

Using AI to Develop Data Products

We leverage AI to enhance our data development process by connecting GitLab Duo Agentic Platform (more details in the internal handbook page), Tableau data sources, and Claude to our Snowflake data warehouse via a Model Context Protocol (MCP) connection. This integration enables us to accelerate development workflows and improve data product quality.

Building AI Products

Our AI product development is categorized into two main areas:

1. Snowflake Cortex AI

We utilize Snowflake Cortex AI within our Data Platform for native AI capabilities directly in our data warehouse. For detailed implementation guidance and best practices, refer to our internal handbook page on AI to Data.

2. Federated AI Tools

We implement federated AI tools including:

Weaviate: A vector database that enables semantic search and AI-powered data retrieval by storing and querying high-dimensional vector representations of our data
External AI Integration: We connect external Large Language Models (LLMs) to our vector database infrastructure to enable advanced AI capabilities across our data products

Orchestration

We use Airflow on Kubernetes for our orchestration. Our specific setup/implementation can be found here. Also see the Data Infrastructure page for more information.

Data Warehouse

We currently use Snowflake as our data warehouse. The Enterprise Data Warehouse (EDW) is the single source of truth for GitLab’s corporate data, performance analytics, and enterprise-wide data such as Key Performance Indicators. The EDW supports GitLab’s data-driven initiatives by providing all teams a common platform and framework for reporting, dashboarding, and analytics. With the exception of point-to-point application integrations all current and future data projects will be driven from the EDW. As a recipient of data from a variety of GitLab source systems, the EDW will also help inform and drive Data Quality best-practices, measures, and remediation to help ensure all decisions are made using the best data possible.

Snowplow updating columns

Snowplow nullify geo columns

Issue: Snowflake documentation

In order not to extract geo data into Snowplow, the following columns were nullified:

geo_zipcode
geo_latitude
geo_longitude
user_ipaddress

This nullification has been applied in Snowplow since 2023-02-01. The files keep the same structure; only the column values are set to NULL. The Data Team also updated the older files to set these columns to NULL and set the same columns to NULL in Snowflake. This applies to the RAW, PREP and PROD layers in Snowflake.
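
As an illustration of "same structure, values set to NULL", the sketch below nullifies the four listed columns in a single event record while keeping every key in place. It is a simplified stand-in for illustration, not the script the Data Team used; the function name and sample event are hypothetical.

```python
NULLIFIED_COLUMNS = ("geo_zipcode", "geo_latitude", "geo_longitude", "user_ipaddress")


def nullify_columns(event: dict) -> dict:
    """Return a copy of the event with the listed columns set to None (NULL)."""
    return {
        key: (None if key in NULLIFIED_COLUMNS else value)
        for key, value in event.items()
    }


# Hypothetical sample event, for illustration only.
sample_event = {
    "event_id": "abc-123",
    "geo_zipcode": "94107",
    "geo_latitude": 37.77,
    "geo_longitude": -122.39,
    "user_ipaddress": "203.0.113.7",
}
cleaned = nullify_columns(sample_event)
```

The cleaned record has the same keys as the input, so downstream loads that rely on the file layout are unaffected.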

To avoid a duplicate load of the updated files in the S3 bucket, as described in the Snowflake documentation, the folder structure was changed from:

- gitlab-com-snowplow-events/
    output/ <---- all files are located here
        2019/01/01
        ...
        (present day)

to the new structure:

- gitlab-com-snowplow-events/
    output_nullified_columns/ <---- all files are nullified and updated
        2019/01/01
        ...
        2023/01/31
    output/ <---- new files will land here and will be loaded by Snowpipe
        2023/02/01
        ...
        (present day)

Snowplow pseudonymize `page_url_path` column

Issue: s3: Pseudonymize page_url_path in Snowflake and s3 bucket

To keep Snowplow data compliant, the following column was pseudonymized:

page_url_path

This pseudonymization is applied to Snowplow data for the period 2022-10-26 - 2024-12-01. The files keep the same structure; only the column values are pseudonymized. The Data Team updated the old files to pseudonymize the page_url_path column and also pseudonymized the page_url_path column in Snowflake. This applies to the RAW, PREP and PROD layers in Snowflake.

To avoid a duplicate load of the updated files in the S3 bucket, as described in s3: Pseudonymize page_url_path in Snowflake and s3 bucket, the folder structure was changed from:

- gitlab-com-snowplow-events/
    output_nullified_columns/ <---- all files are nullified and updated (in the previous iteration)
        2022/10/26
        ...
        2023/
            02/
    output/
        2023/
            02/
            03/

to the new structure:

- gitlab-com-snowplow-events/
    output_nullified_columns/
        2019/01/01
        ...
        2022/10/25
    output_mask_page_url_path/ <---- all files are pseudonymized
        2022/10/26
        ...
        2023/12/01
    output/ <---- new files will land here and will be loaded by Snowpipe
        2023/12/02
        ...
        (present day)

Note: All new loads in the S3 bucket will go into the same folder as before, gitlab-com-snowplow-events/output.
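
The final folder layout can be summarised as a mapping from event date to S3 prefix. The sketch below is illustrative only. The date boundaries are copied from the folder listing above, and the helper name `prefix_for` is hypothetical.

```python
from datetime import date

BUCKET = "gitlab-com-snowplow-events"

# (first day, last day, folder) as shown in the folder listing above.
PREFIX_RANGES = [
    (date(2019, 1, 1), date(2022, 10, 25), "output_nullified_columns"),
    (date(2022, 10, 26), date(2023, 12, 1), "output_mask_page_url_path"),
]


def prefix_for(day: date) -> str:
    """Return the prefix under which files for the given day are expected."""
    for start, end, folder in PREFIX_RANGES:
        if start <= day <= end:
            return f"{BUCKET}/{folder}/{day:%Y/%m/%d}"
    if day > PREFIX_RANGES[-1][1]:
        # New files keep landing in output/, where Snowpipe picks them up.
        return f"{BUCKET}/output/{day:%Y/%m/%d}"
    raise ValueError(f"No folder is listed for {day.isoformat()}")
```

For instance, `prefix_for(date(2023, 12, 2))` resolves to `gitlab-com-snowplow-events/output/2023/12/02`.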

Snowflake support portal access

Prerequisites:

Team member must have a verified email address in Snowflake
Email verification can be completed via Settings → My Profile in Snowsight

Access Options:

There are three privilege levels for support portal access:

MANAGE ORGANIZATION SUPPORT CASES - View and manage all support cases across the organization
MANAGE ACCOUNT SUPPORT CASES - View and manage all support cases for the account
MANAGE USER SUPPORT CASES - View and manage cases opened by the user themselves

Access Policy:

Always start with MANAGE USER SUPPORT CASES for individual requests. Higher access levels (MANAGE ACCOUNT or MANAGE ORGANIZATION SUPPORT CASES) require a business justification showing why individual-level access is insufficient.

Granting Access:

Only ACCOUNTADMIN can grant MANAGE ACCOUNT SUPPORT CASES or MANAGE USER SUPPORT CASES privileges. Only ORGADMIN can grant MANAGE ORGANIZATION SUPPORT CASES privilege.

Execute the appropriate grant statement:

-- MANAGE USER SUPPORT CASES

USE ROLE ACCOUNTADMIN;
GRANT MANAGE USER SUPPORT CASES ON ACCOUNT TO ROLE <role_name>;

-- MANAGE ACCOUNT SUPPORT CASES

USE ROLE ACCOUNTADMIN;
GRANT MANAGE ACCOUNT SUPPORT CASES ON ACCOUNT TO ROLE <role_name>;

-- MANAGE ORGANIZATION SUPPORT CASES

USE ROLE ORGADMIN;
GRANT MANAGE ORGANIZATION SUPPORT CASES ON ACCOUNT TO ROLE <role_name>;

<role_name>: The individual (user) role. Note that users provisioned in Snowflake via Okta don’t have a user role.
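
The granting rules above (which admin role may grant which privilege) can be captured in a small helper that renders the matching statements. This is a sketch for illustration only. `build_grant` is a hypothetical name, and the identifier check is a conservative assumption rather than Snowflake's full identifier grammar.

```python
import re

# Which role may grant each privilege, as described in "Granting Access".
GRANTING_ROLE = {
    "MANAGE USER SUPPORT CASES": "ACCOUNTADMIN",
    "MANAGE ACCOUNT SUPPORT CASES": "ACCOUNTADMIN",
    "MANAGE ORGANIZATION SUPPORT CASES": "ORGADMIN",
}


def build_grant(privilege: str, role_name: str) -> str:
    """Render the USE ROLE and GRANT statements for one support-case privilege."""
    admin_role = GRANTING_ROLE[privilege]  # raises KeyError for an unknown privilege
    # Accept only simple unquoted identifiers (an assumption made for this sketch).
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_$]*", role_name):
        raise ValueError(f"Unexpected role name: {role_name!r}")
    return (
        f"USE ROLE {admin_role};\n"
        f"GRANT {privilege} ON ACCOUNT TO ROLE {role_name};"
    )
```

Calling `build_grant("MANAGE USER SUPPORT CASES", "jdoe")` would render the first pair of statements for a hypothetical `jdoe` role.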

First-Time Access: When accessing Support for the first time, users must select “Enable Support” in Snowsight.

To access the Snowflake support portal, follow the steps below.

When you are in your Snowsight instance, open your account (bottom-left corner) and go to the Support option

On the panel, you can see the already open cases

To open a new case, press the + Support Case button in the top-right corner
Fill in the data to describe your issue and the Snowflake team will handle it

For each update on your case, you will be informed by email

Warehouse Access

To gain access to Snowflake:

In order to be granted access to a default snowflake_analyst role, utilize a Lumos Access Request. A new user will be created with access to query the PROD database. There are 2 levels of default data access:

General data –> Lumos adds the Snowflake snowflake_analyst role to their account.
SAFE data (you must be or will become a designated insider) –> Lumos adds the Snowflake snowflake_analyst_safe role to their account. See the SAFE Guide for the needed approvals.

All users will have access to the dev_xs and reporting (size M) warehouses. When the user is created, dev_xs is set as the default warehouse.

Snowflake can be used to perform analyses on the available data by writing SQL code. Anything created, and any outcome of these analyses, is considered ad-hoc analysis. It is important to know that anything created in Snowflake (for example, worksheets and dashboards) is not version controlled and is not supported or managed by the Central Data Team. For example, when a team member off-boards from GitLab, their worksheets and dashboards are no longer accessible. To persist analyses, team members can build Tableau workbooks, store code snippets in a GitLab project, or commit code to the Data Team’s dbt project.

Additional Access

If a user was created in Snowflake before the implementation of SCIM via Lumos and Okta, or if access beyond the default snowflake_analyst and snowflake_analyst_safe roles is needed (including the creation of a dev database needed for contributing to our dbt project), please use an Access Request documenting the level of access required.

Provisioning Additional Access

When a user needs a role beyond the standard snowflake_analyst or snowflake_analyst_safe roles (for example, the analyst_core role), the change needs to go through permifrost, so roles.yml must be updated. Please refer to our Snowflake Permissions Paradigm for more information about how we use permifrost.

Instructions

Step 1: Check current access state:

Check to see if the user is already provisioned via permifrost by searching for the username in roles.yml. If the username is not present in roles.yml, you will likely need to create the role and maybe the user in Snowflake.

This state can be verified in Snowflake using the following query, which will return a row for the user and the user role in Snowflake if they exist.

-- Set this to the username to look up, in lowercase (it is compared against LOWER(name)).
SET username = 'username';

SELECT
  'user' AS record,
  name, created_on, deleted_on,
  disabled, owner
FROM SNOWFLAKE.ACCOUNT_USAGE.USERS
WHERE LOWER(name) IN ($username)
UNION
SELECT
  'role' AS record,
  name, created_on, deleted_on,
  NULL AS disabled, owner
FROM SNOWFLAKE.ACCOUNT_USAGE.ROLES
WHERE LOWER(name) IN ($username);

Step 2: Handle accordingly based on user’s current state:

State: User already exists in roles.yml:
- Update roles.yml with the additional roles and/or permissions needed in accordance with our Snowflake Permissions Paradigm
State: User exists in Snowflake, but needs a user role created for roles.yml:
- Add the user role to snowflake-infrastructure/infra/roles.tf
- Update roles.yml with the additional roles and/or permissions needed in accordance with our Snowflake Permissions Paradigm
State: User does not exist in Snowflake:
- Create user in Snowflake
  - via CI job 👥 users_snowflake_provisioning_snowflake
- Update roles.yml with the additional roles and/or permissions needed in accordance with our Snowflake Permissions Paradigm
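
The three states above amount to a small decision table. The sketch below restates it in code purely as an illustration. The function name and the two boolean inputs are hypothetical, and the returned strings are just labels for the steps listed above.

```python
UPDATE_ROLES_YML = "update roles.yml with the additional roles and/or permissions"


def provisioning_steps(in_roles_yml: bool, user_in_snowflake: bool) -> list[str]:
    """Return the ordered steps for a user, based on the state found in Step 1."""
    if in_roles_yml:
        # The user is already provisioned via permifrost.
        return [UPDATE_ROLES_YML]
    if user_in_snowflake:
        # The user exists but still needs a user role for roles.yml.
        return [
            "add the user role to snowflake-infrastructure/infra/roles.tf",
            UPDATE_ROLES_YML,
        ]
    # The user does not exist in Snowflake yet.
    return [
        "create the user via CI job users_snowflake_provisioning_snowflake",
        UPDATE_ROLES_YML,
    ]
```

In every state the last step is the same roles.yml update; only the preparation before it differs.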

Additional Access Reminders

We loosely follow the paradigm explained in this blog post around permissioning users.
When asking to mirror an existing account, please note that access to restricted SAFE data will not be provisioned/mirrored (currently provided via restricted_safe role).
Snowflake is part of the Access Review Procedure, and managers are asked on a quarterly basis to review the access their team members have in Snowflake. Managers are expected to understand the available roles (and their structure) in Snowflake when approving an AR or reviewing their team members’ access.
- In the access review, only the first level of Snowflake roles is reported (the ones that are directly attached to the user). For example, if a team member has the analyst_marketing role, only analyst_marketing is reported; the roles inherited through analyst_marketing are not.
  - Roles could be distinguished between functional roles and object roles (See permissions paradigm below)
    - See this list of functional roles in Snowflake and object roles.
    - Object roles are directly related to systems and give Team Members access to all of the data we extract from those upstream source systems.
    - To know in all detail what a role entails check the roles.yml file.
    - If unsure, during AR process or Access Review, please reach out to a Data Platform Team Member to understand in detail what a specific role entails.
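
To show why the first-level view in an access review can understate what a user can reach, the sketch below contrasts directly attached roles with all roles reachable through inheritance. The hierarchy is hypothetical: the user `jdoe` and the object roles granted to `analyst_marketing` are invented for illustration, and roles.yml remains the authoritative source.

```python
# Hypothetical grants: each key holds the roles granted directly to it.
ROLE_GRANTS = {
    "jdoe": ["analyst_marketing"],
    "analyst_marketing": ["marketo", "google_ads"],
}


def direct_roles(role: str) -> list[str]:
    """Roles attached directly, which is what the access review reports."""
    return sorted(ROLE_GRANTS.get(role, []))


def effective_roles(role: str) -> list[str]:
    """All roles reachable through inheritance, direct or indirect."""
    seen: set[str] = set()
    stack = list(ROLE_GRANTS.get(role, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(ROLE_GRANTS.get(current, []))
    return sorted(seen)
```

Here the review would list only `analyst_marketing` for `jdoe`, while the effective set also includes the two inherited object roles.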

Historical Provisioning steps

If for whatever reason a user needs to be provisioned outside of Okta and Lumos, we have historically used the following process:

Manually Managing Users and Roles for Snowflake

Make sure we have an issue in the GitLab Data Team project linking the original access request with the Provisioning label applied
Log in to Snowflake and switch to the securityadmin role
- All roles should be under securityadmin ownership
Copy the user_provision.sql script and replace the email, firstname, and lastname values in the initial block
Document in Snowflake roles.yml permifrost config file (this file is automatically loaded every day at 12:00a.m. UTC)
- Add the user and user role you created
- Assign the user role to new user
- Assign any additional roles to user
Ensure the user is assigned the application in Okta
Ensure the user is assigned to the okta-snowflake-users Google Group

Finally, these are the proper steps for deprovisioning existing users who are not managed by Okta or who have been given permissions beyond the defaults:

Snowflake deprovision should be done via an offboarding issue or access request issue.
Make sure we have an issue in the GitLab Data Team project linking the original source request with the Deprovisioning label applied.
Log in to Snowflake and switch to the securityadmin role
- All roles should be under securityadmin ownership.
Copy the user_deprovision.sql script and replace the USER_NAME. The user is left in Snowflake with disabled = TRUE, rather than removed, so that there is a record of when the user lost access.
Remove the user from okta-snowflake-users Google Group
Remove the user records in Snowflake roles.yml permifrost config file (this file is automatically loaded every day at 12:00a.m. UTC)

For more information, watch this recorded pairing session (must be viewed as GitLab Unfiltered).

Snowflake Permissions Paradigm

We use Permifrost to help manage permissions for Snowflake. Our configuration file for our Snowflake instance is stored in this roles.yml file. Also available is our handbook page on Permifrost.

We follow this general strategy for role management:

Every user has an associated user role
Functional roles exist to represent common privilege sets (analyst_finance, data_manager, product_manager)
Logical groups of data have their own object roles
Object roles are assigned primarily to functional roles
Higher privilege roles (accountadmin, securityadmin, useradmin, sysadmin) are assigned directly to users
Service accounts have an identically named role
Additional roles can be assigned either to the service account role or the service account itself, depending on usage and needs
Individual privileges can be granted at the granularity of the table & view
Warehouse usage can be granted to any role as needed, but granting to functional roles is recommended

User Roles

Every user will have their own user role that should match their user name. Object level permissions (database, schemas, tables) in Snowflake can only be granted to roles. Roles can be granted to users or to other roles. We strive to have all privileges flow through the user role so that a user only has to use one role to interact with the database. Exceptions are privileged roles such as accountadmin, securityadmin, useradmin, and sysadmin. These roles grant higher access and should be intentionally selected when using.

Functional Roles

Functional roles represent a group of privileges and role grants that typically map to a job family. The major exception is the analyst roles. There are several variants of the analyst role which map to different areas of the organization. These include analyst_core, analyst_finance, analyst_people, and more. Analysts are assigned to relevant roles and are explicitly granted access to the schemas they need.

Functional roles can be created at any time. It makes the most sense when there are multiple people who have very similar job families and permissions.

Functional Role Assignment

This list of functional roles gives a high-level understanding of what each role entails. If a role is missing, or to see in full detail what a role entails, check this YAML file.

Functional Role	Description	SAFE Data Y/N
`data_team_analyst`	Access to all `PROD` data, sensitive marketing data, Data Platform metadata and some sources.	Yes
`analyst_core`	Access to all `PROD` data and metadata in the Data Platform	No
`analyst_engineering`	Access to all `PROD` data, metadata in the Data Platform and Engineering related data sources.	Yes
`analyst_growth`	Access to all `PROD` data, metadata in the Data Platform and various data sources.	Yes
`analyst_finance`	Access to all `PROD` data, metadata in the Data Platform and finance related data sources.	Yes
`analyst_marketing`	Access to all `PROD` data, metadata in the Data Platform and marketing related data sources.	Yes
`analyst_people`	Access to all `PROD` data, metadata in the Data Platform and various related data sources, including sensitive people data.	Yes
`analyst_sales`	Access to all `PROD` data, metadata in the Data Platform and various related data sources	Yes
`analyst_support`	Access to `PROD` data, metadata in the Data Platform and `raw` / `prep` Zendesk data, including sensitive Zendesk data	No
`analytics_engineer_core`	A combination of `analyst_core`, `data_team_analyst` role with some additions	Yes
`data_manager`	Extended access to Snowflake data	Yes
`engineer`	Extended access to Snowflake data to perform data operation tasks in Snowflake	Yes
`snowflake_analyst`	Access to `PROD` data in Snowflake, EDM schema and workspaces	No
`snowflake_analyst_safe`	Access to `PROD` data in Snowflake, EDM schema and workspaces including SAFE data	Yes
`sensitive_pii_data_viewer`	Access to all sensitive fields in person and contact data mastery models.	No

Object Roles

Object roles are for managing access to a set of data. Typically these represent all of the data for a given source. The zuora object role is an example. This role grants access to the raw Zuora data coming from Stitch, and also to the source models in the prep.zuora schema. When a user needs access to Zuora data, granting the zuora role to that user’s user role is the easiest solution. If for some reason access to the object role doesn’t make sense, individual privileges can be granted at the granularity of a table.

Masking Roles

Masking roles manage how users interact with masked data. Masking is applied at the column level, and the source system owner decides which columns are masked. Masking is applied to a column in a schema.yml file within the dbt code base when a data object is created via dbt. Because some users need access to unmasked data, the masking role allows permission to see the unmasked data to be granted at the functional or object role level. For example, if the masking role people_data_masking is applied to the column locality, then the functional role analyst_people can be made a member of the people_data_masking role so that those analysts can see unmasked people data.

When a masking policy is created, it is created based on the masking roles and only one masking policy can be applied to a column.
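
As a rough illustration of the membership logic described above, the sketch below models the masking decision in Python. It is not the actual Snowflake policy, which is defined in SQL and applied through dbt. The `MASKING_MEMBERS` and `COLUMN_MASKING_ROLE` mappings, the `read_column` function and the `***MASKED***` placeholder are assumptions made for this example.

```python
# Hypothetical membership map: masking role -> roles that may see unmasked data.
MASKING_MEMBERS = {
    "people_data_masking": {"analyst_people"},
}

# Hypothetical column assignment; only one masking policy applies per column.
COLUMN_MASKING_ROLE = {
    "locality": "people_data_masking",
}


def read_column(column: str, value: str, current_role: str) -> str:
    """Return the value as the given role would see it."""
    masking_role = COLUMN_MASKING_ROLE.get(column)
    if masking_role is None:
        return value  # the column is not masked at all
    if current_role in MASKING_MEMBERS.get(masking_role, set()):
        return value  # the role is a member of the masking role
    return "***MASKED***"  # placeholder chosen for this sketch
```

With these mappings, `analyst_people` reads `locality` unmasked, while any other role receives the placeholder.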

Examples

This is an example role hierarchy for a Data Analyst, Core:

graph LR
    A([User: datwood]) -->|Member of| B[User Role: datwood]
    B -->|Member of| C[Functional Role: analyst_core]
    C -->|Member of| D[Object Role: workday]
    C -->|Member of| H[Object Role: dbt_analytics]
    C -->|Member of| E[Object Role: netsuite]
    C -->|Member of| F[Object Role: zuora]
    G{{Privileges: analytics_sensitive}} -->|Granted to| C

This is an example role hierarchy for a Data Engineer and Account Administrator:

graph LR
    A([User: tmurphy]) -->|Member of| B[User Role: tmurphy]
    B -->|Member of| C[Functional Role: engineer]
    C -->|Member of| F[Functional Role: loader]
    C -->|Member of| H[Functional Role: transformer]
    G{{ Privileges: Read/Write Raw}} -->|Granted to| C
    A -->|Member of| D[Privileged Role: sysadmin]
    A -->|Member of| E[Privileged Role: securityadmin]

This is an example role hierarchy for a Security Operations Engineer:

graph LR
    A([User: ssichak]) -->|Member of| B[User Role: ssichak]
    A -->|Member of| C[Privileged Role: securityadmin]
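
The hierarchies above can be read as a directed graph of "member of" edges. The following minimal sketch, which assumes the edges of the Data Analyst, Core example, collects every role a user ends up with through transitive membership. Snowflake resolves this itself; the `MEMBER_OF` mapping and the `effective_roles` helper are hypothetical names used only for illustration.

```python
from collections import deque

# Edges copied from the Data Analyst, Core example: node -> roles it is a member of.
MEMBER_OF = {
    "user:datwood": ["datwood"],
    "datwood": ["analyst_core"],
    "analyst_core": ["workday", "dbt_analytics", "netsuite", "zuora"],
}


def effective_roles(start: str, member_of: dict[str, list[str]]) -> set[str]:
    """Collect every role reachable from `start` through membership edges."""
    seen: set[str] = set()
    queue = deque(member_of.get(start, []))
    while queue:
        role = queue.popleft()
        if role in seen:
            continue  # already visited, avoids loops
        seen.add(role)
        queue.extend(member_of.get(role, []))
    return seen
```

For `user:datwood` this yields the user role, the functional role `analyst_core`, and the four object roles.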

Snowflake CI jobs

In FY25-Q1, we are moving towards semi-automating the above Managing Roles for Snowflake process, OKR epic.

The main driver for this change was an anticipated increase in access requests from Engineering teams, which required a process for provisioning multiple members at once. Furthermore, this will enable all GitLab Team Members to create a Snowflake user themselves with minimal support from the Data Platform Team. This will speed up provisioning and shorten the time it takes a GitLab Team Member to get access to Snowflake.

All GitLab Team Members are encouraged to open an MR following this runbook if they need access to Snowflake.

High-level description of the process:

Open an Access Request and get the approvals in place
Open an MR
Run CI pipeline
Review from Data Platform Team codeowner.

The rest of the section is meant to describe the automated process in more detail.

The main processes that have been automated are:

create/remove users from Snowflake platform
update roles.yml which is used by Permifrost to update Snowflake role/user permissions

Both of these processes will be made accessible via CI jobs so that the user can potentially self-serve, requiring just MR review/approval from a data engineer.

Both CI jobs follow a common pattern: the end user simply adds or removes users in the snowflake_usernames.yml file, and the CI job runs based on the changes to that file.

1) Automate creating users/roles in Snowflake platform

Prior to running Permifrost, the users/roles need to be first created in Snowflake.

The snowflake_provisioning_snowflake_users CI job allows the user to create these users/roles in Snowflake.

See the CI jobs page for more information on the available arguments and default values.

2) Automating roles.yml

Once the users/roles have been created in Snowflake, roles.yml needs to be updated to reflect the desired permissions.

The snowflake_provisioning_roles_yaml CI job allows the end user to automatically update roles.yml with the desired permissions.

See the CI jobs page for more information on the available arguments and default values.

Furthermore, the next section provides additional details on the optional templated arguments of the snowflake_provisioning_roles_yaml CI job:

Optional Templated Arguments

Custom Templates

This is useful if you have many users that need a value different from the default. One option would be to run with the default values, and then manually update the MR, but depending on the number of users to update, a potentially better option is to pass in a custom values template.

The rest of the section will do two things:

Explain how templates work
For convenience, provide custom templates that represent common values currently used in roles.yml

To illustrate how templates work, let’s start with an example. This is the default roles template:

{
  "{{ username }}": {
    "member_of": [
      "snowflake_analyst"
    ],
    "warehouses": [
      "dev_xs"
    ]
  }
}

This is valid JSON, but note that it is templated. That is, {{ username }} is a Jinja template, and the template will be later rendered to an actual value within the script.

Now consider a case where we want to override the default value above. What if we want the next batch of users to also have the dev_m warehouse?

Within the CI job, we could pass in a custom template to override the default value like so:

ROLES_TEMPLATE: {"{{username}}": {"member_of": ["snowflake_analyst"],"warehouses": ["dev_xs", "dev_m"]}}

Currently, these are the available template-able values that will be rendered:

{{ username }}
{{ prod_db }}
{{ prep_db }}
{{ prod_schemas }}
{{ prep_schemas }}
{{ prod_tables }}
{{ prep_tables }}
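
To make the rendering step concrete, here is a minimal sketch of how a templated value becomes a roles.yml entry. It uses a small regular expression from the standard library as a stand-in for Jinja, so it is not how the actual script is implemented. The `render_template` helper and the username `jdoe` are hypothetical.

```python
import json
import re

# The custom template from the example above, with dev_m added.
ROLES_TEMPLATE = (
    '{"{{ username }}": {"member_of": ["snowflake_analyst"],'
    ' "warehouses": ["dev_xs", "dev_m"]}}'
)

# Matches placeholders such as {{ username }} or {{username}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, values: dict[str, str]) -> dict:
    """Substitute {{ name }} placeholders, then parse the result as JSON."""
    rendered = PLACEHOLDER.sub(lambda m: values[m.group(1)], template)
    return json.loads(rendered)


roles_entry = render_template(ROLES_TEMPLATE, {"username": "jdoe"})
```

The rendered entry is keyed by the username and carries both warehouses, which is the shape the default template produces with only `dev_xs`.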

Common Custom Templates

This section provides custom templates (non-default values) for commonly occurring values in roles.yml that can be copied and pasted for use.

Default denotes that this is the template used if not explicitly overridden.
Common denotes that while the template is not used by default, these values are still commonly used within roles.yml

Databases

Default: None, no databases are added

Common: CI job argument to create a personal prep/prod database for each user:

DATABASES_TEMPLATE: [{"{{ prod_database }}": {"shared": false}}, {"{{ prep_database }}": {"shared": false}}]

Roles

Default:

ROLES_TEMPLATE: {"{{ username }}": {"member_of": ["snowflake_analyst"], "warehouses": ["dev_xs"]}}

Common: CI job argument to create a role for a data engineer:

ROLES_TEMPLATE: {"{{ username }}": {"member_of": ["engineer","restricted_safe"],"warehouses": ["dev_xs","dev_m","loading","reporting"],"owns": {"databases": ["{{ prep_database }}","{{ prod_database }}"],"schemas": ["{{ prep_schemas }}","{{ prod_schemas }}"],"tables": ["{{ prep_tables }}","{{ prod_tables }}"]},"privileges": {"databases": {"read": ["{{ prep_database }}","{{ prod_database }}"],"write": ["{{ prep_database }}","{{ prod_database }}"]},"schemas": {"read": ["{{ prep_schemas }}","{{ prod_schemas }}"],"write": ["{{ prep_schemas }}","{{ prod_schemas }}"]},"tables": {"read": ["{{ prep_tables }}","{{ prod_tables }}"],"write": ["{{ prep_tables }}","{{ prod_tables }}"]}}}}

Users

Default:

USERS_TEMPLATE: {"{{ username }}": {"can_login": true, "member_of": ["{{ username }}"]}}

Common: N/A. There are no other templates that we currently use for users

Automating roles.yml: Project Access Token

The snowflake_provisioning_roles_yaml CI job runs update_roles_yaml.py which updates roles.yml file.

The changes to roles.yml within the CI job are pushed back to the branch/MR.

In order to push to the repo from within the CI pipeline, a Project Access Token (PAT) is needed. More information on pushing to the remote repo can be found in this StackOverflow answer.

The PAT is named snowflake_provisioning_automation and was created in the ‘GitLab Data Team’ project, using the analyticsapi@gitlab.com account.

The PAT value is saved within 1Pass, and also as a CI environment variable so that it can be used by the GitLab runner.

snowflake_users.yml - end of file issue

Adding a user to the snowflake_users.yml file by appending to the bottom of the file causes unexpected behavior when done using the GitLab Single File Editor; more information is in this issue.

As a workaround, the bottom of snowflake_users.yml contains this comment:

#### do not insert users below this line ####

Local Testing

Both update_roles_yaml and provision_users can be run locally for faster testing compared to CI jobs.

Setup for provision_users:

Export required environment variables:

export EMAIL_DOMAIN='gitlab.com'
export PERMISSION_BOT_USER="bot_user_123"
export PERMISSION_BOT_PASSWORD="random_pass_456"
export SNOWFLAKE_ACCOUNT="xy12345.us-east-1"
export PERMISSION_BOT_WAREHOUSE="COMPUTE_WH"

Python test run (no Snowflake user creation):

python provision_users.py --users-to-add some-user1 user2 --test-run

Snowflake Deprovisioning Users

Inactive Snowflake users will be deprovisioned weekly via snowflake_cleanup DAG, implemented in this issue.

All active Snowflake users/roles are declared within roles.yml. Therefore, if any users in Snowflake are missing within roles.yml, they are considered inactive and the process will drop them.

These users will be dropped by running the following deprovision_user.sql script.

This process is not exposed via CI job due to its sensitive nature and because it is less time sensitive. Therefore, a weekly ‘cleanup’ task via Airflow will be run instead.
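
The rule described above is essentially a set difference. The sketch below illustrates it under the assumption that usernames are compared case-insensitively; the actual logic lives in the snowflake_cleanup DAG and deprovision_user.sql, and the `users_to_deprovision` function is a hypothetical name.

```python
def users_to_deprovision(snowflake_users: set[str], roles_yml_users: set[str]) -> list[str]:
    """Return users that exist in Snowflake but are not declared in roles.yml."""
    # Case-insensitive comparison is an assumption of this sketch.
    declared = {user.lower() for user in roles_yml_users}
    return sorted(user for user in snowflake_users if user.lower() not in declared)
```

Any user returned by this function would be considered inactive and therefore a candidate for the weekly cleanup.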

Snowflake user/service account

The permifrost_bot_user is used to run both Snowflake provisioning and deprovisioning processes. This is for 2 reasons:

permifrost_bot_user already has the proper permissions to run provisioning/deprovisioning as the same perms are needed to run existing Permifrost jobs.
The permifrost_bot_user already runs existing Permifrost jobs using both Airflow and GitLab CI, so the applied NSP IP addresses will not be redundant when added for both provisioning (run via CI) /deprovisioning (run via Airflow).

Provisioning permissions to external tables to user roles

Provisioning USAGE permissions for external tables to user roles in Snowflake is not handled by Permifrost at the moment. If you have to give a user role access to an external table, the permission must be granted manually with a GRANT command (see the Snowflake docs) using the securityadmin role. This assumes that the user role already has access to the schema and the database in which the external table is located; if not, add them to roles.yml.

Logging in and using the correct role

When you apply for privileges in Snowflake via an AR and access is provisioned, the change takes effect after 3:00 AM UTC, because a script runs daily to provision access in Snowflake. Once access is provisioned, you can log in via Okta. After logging in via Okta, select the role attached to your account. By default this role has the same name as your account, which follows the convention of your email address minus @gitlab.com.

When you don’t select the right role in Snowflake, you only see the following Snowflake objects:

(Screenshot: object_list)

Selecting the right role can be done via the GUI. On the Snowsight home screen, use the menu in the upper-left corner:

(Screenshot: select_role)

Click on the arrow near your name
Select Switch Role
Select your role

In a Snowsight worksheet, use the role selector in the upper-right corner:

(Screenshot: select_role)

Click on public
Select your role

You can set this to your default by running the following:

ALTER USER <YOUR_USER_NAME> SET DEFAULT_ROLE = '<YOUR_ROLE>'

Compute Resources

Compute resources in Snowflake are known as “warehouses”. To use our credit consumption effectively, we try to minimize the number of warehouses. For development purposes (executing dbt jobs locally, running MR pipelines and querying in Snowflake) we use the dev_x warehouses, where x is the size. Warehouse names are suffixed with their size (for example, dev_xs for extra small).

warehouse	purpose	max query (minutes)
`admin`	This is for permission bot and other admin tasks	10
`data_classification`	This is for running the data classification and labelling process in Snowflake	60
`dev_xs/m/l/xl`	This is used for development purposes, to be used when using the Snowflake UI and in CI-pipelines	180
`gainsight_xs`	This is used for gainsight data pump	30
`gitlab_postgres`	This is for extraction jobs that pull from GitLab internal Postgres databases	10
`grafana`	This is exclusively for Grafana to use	60
`loading`	This is for our Extract and Load jobs and testing new Meltano loaders	120
`reporting`	This is for the BI tool for querying.	30*
`transforming_xs`	These are for production dbt jobs	180
`transforming_s`	These are for production dbt jobs	180
`transforming_l`	These are for production dbt jobs	240
`transforming_xl`	These are for production dbt jobs	180
`transforming_2xl`	For refreshing Snowplow models	120
`transforming_4xl`	This is for the Airflow dag: `dbt_full_refresh`	180
`usage_ping`	This is used for the service_ping and service_ping_backfill load.	120

If you’re running into query time limits, please check whether your query can be optimised. A poorly performing query in development will also perform poorly in production, where it has an impact every day. Always use the right size of warehouse. Ground rules for selecting a warehouse:

Warehouses are set as t-shirt sizes. Larger warehouses are more costly for GitLab
Consider using a running warehouse
- If you resume a paused warehouse, there is an initial start cost
- Every warehouse suspends after a set period, but while it is idle (the time between the query result and the suspend time), we still consume Snowflake credits
- In general we don’t spend more money if we run concurrent queries.
The query timeout in Snowflake is set to 30 minutes for the REPORTING warehouse.

Data Storage

We use three primary databases: raw, prep, and prod. The raw database is where data is first loaded into Snowflake; the other databases are for data that is ready for analysis (or getting there).

All tables and views in prep and prod are controlled (created, updated) via dbt. Every quarter, the Data Platform Team checks for tables and views that are not related to a dbt model; these are removed.

The following schemas are exceptions and are not checked:

SNOWPLOW_%
DOTCOM_USAGE_EVENTS_%
INFORMATION_SCHEMA
BONEYARD
TDF
CONTAINER_REGISTRY
FULL_TABLE_CLONES
QUALTRICS_MAILING_LIST
NETSUITE_FIVETRAN
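
As an illustration of how these exceptions could be applied during the quarterly check, the sketch below matches a schema name against the list. It assumes that `%` is meant as the SQL LIKE wildcard and that underscores are literal characters; the `is_exempt` helper is hypothetical and not part of the actual check.

```python
from fnmatch import fnmatchcase

# Exception patterns copied from the list above.
EXCEPTION_PATTERNS = [
    "SNOWPLOW_%",
    "DOTCOM_USAGE_EVENTS_%",
    "INFORMATION_SCHEMA",
    "BONEYARD",
    "TDF",
    "CONTAINER_REGISTRY",
    "FULL_TABLE_CLONES",
    "QUALTRICS_MAILING_LIST",
    "NETSUITE_FIVETRAN",
]


def is_exempt(schema_name: str) -> bool:
    """True if the schema matches one of the exception patterns."""
    name = schema_name.upper()
    # Translate the LIKE wildcard % into the glob wildcard *.
    return any(fnmatchcase(name, p.replace("%", "*")) for p in EXCEPTION_PATTERNS)
```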

There is a snowflake database, which contains information about the entire GitLab instance. This includes all tables, views, queries, users, etc.

There is a covid19 database, which is a shared database managed through the Snowflake Data Exchange.

There is a testing_db database, which is used for testing Permifrost.

There is a bi_tool_eval database, which is used for testing BI tooling. Users are able to create their own test sets manually.

All databases not defined in our roles.yml Permifrost file are removed on a weekly basis.

Database	Suitable to use in Tableau
raw	No
prep	No
prod	Yes

Only the prod database should be used in Tableau, as this data has been transformed and modeled for business use. Using the raw and prep databases in Tableau could result in incorrect data and/or broken queries and dashboards, now or in the future. Keep in mind that data transformations are checked and tested only for the prod database results. This means that dashboards connected directly to the raw or prep database could break or report wrong data.

Raw

No dbt models exist for this data and so it may be the case that the data needs review or transformation in order to be useful or accurate. This review, documentation, and transformation all happens downstream in dbt for PREP and PROD. This database should not be used in Tableau.

Raw may contain sensitive data, so permissions need to be carefully controlled
RAW will contain data that isn’t ready for business use.
Data is stored in different schemas based on the source
User access can be controlled by schema and tables

Snowflake data sharing enables sharing various Snowflake objects like databases, tables, secure views, and a couple more from one Snowflake account to another. Snowflake shares can be inbound or outbound. Inbound shares, which are used at GitLab, give access to third-party data sources like Zuora Revenue; the mechanism here is a direct share, where data providers share specific database objects with our Snowflake account. Outbound sharing is when we want to share our data with a third party. This involves creating an outbound share of a Snowflake object in their account and granting access to the Snowflake object (table, view, database, etc.) that needs to be shared to an external account using either a web interface or SQL.

Snowflake Data Shares can be seen as an extension of the raw layer, but sharded (and) in different accounts. We don’t see Snowflake Data Shares as a source from which data needs to be copied, but rather we connect directly to Snowflake Data Shares as we do to the raw layer (i.e., with dbt). This approach helps avoid creating extra processes and makes the pipeline more efficient.

Prep

This is the first layer of verification and transformation in the warehouse, but is not yet ready for general business use. This database should not be used in Tableau.

Source models are built in logical schemas corresponding to the data source (e.g. sfdc, zuora)
PREPARATION - this is the default schema where dbt models are built
SENSITIVE

Prod

This database and all schemas and tables in it are queryable by Tableau. This data has been transformed and modeled for business use.

With the exception of public and boneyard, all schemas are controlled by dbt. See the dbt guide for more information.

Folder Structure in Analytics Project

The table below shows a mapping of how models stored within folders in the models/ directory in the analytics project will be materialized in the data warehouse.

The source of truth for this is in the dbt_project.yml configuration file.

Folder in snowflake-dbt/models/	db.schema	Details	Queryable in Tableau
common/	prod.common	Top-level folder for facts and dimensions. Do not put models here.	Yes
common/bridge	prod.common	Sub-folder for creating many-to-many mappings between data that come from different sources.	Yes
common/dimensions_local	prod.common	Sub-folder with directories containing dimensions for each analysis area.	Yes
common/dimensions_shared	prod.common	Sub-folder with dimensions that relate to every analysis area.	Yes
common/facts_financial	prod.common	Sub-folder with facts for the financial analysis area.	Yes
common/facts_product_and_engineering	prod.common	Sub-folder with facts for the product and engineering analysis area.	Yes
common/facts_sales_and_marketing	prod.common	Sub-folder with facts for the sales and marketing analysis area.	Yes
common/sensitive/	prep.sensitive	Facts/dims that contain sensitive data.	No
common_mapping/	prod.common_mapping	Used for creating one-to-one mappings between data that come from different sources.	Yes
common_mart/	prod.common_mart	Joined dims and facts that are relevant to all analysis areas.	Yes
common_mart_finance/	prod.common_mart	Joined dims and facts that are relevant to finance.	Yes
common_mart_marketing/	prod.common_mart	Joined dims and facts that are relevant to marketing.	Yes
common_mart_product/	prod.common_mart	Joined dims and facts that are relevant to product.	Yes
common_mart_sales/	prod.common_mart	Joined dims and facts that are relevant to sales.	Yes
common_prep/	prod.common_prep	Preparation tables for mapping, bridge, dims, and facts.	Yes
marts/	varies	Contains mart-level data and data pumps that send data to third party sources.	Yes
legacy/	prod.legacy	Contains models built in a non-dimensional manner	Yes
sources/	prep.`source`	Contains source models. Schema is based on data source	No
workspaces/	prod.workspace_`workspace`	Contains workspace models that aren’t subject to SQL or dbt standards.	Yes
common/restricted	prod.restricted_`domain`_common	Top-level folder for restricted facts and dimensions. Equivalent of the regular common schema, but for restricted data.	Yes
common_mapping/restricted	prod.restricted_`domain`_common_mapping	Contains restricted mapping, bridge, or look-up tables. Equivalent of the regular common mapping schema, but for restricted data.	Yes
marts/restricted	prod.restricted_`domain`_common_marts		Yes
legacy/restricted	prod.restricted_`domain`_legacy	Contains restricted models built in a non-dimensional manner. Equivalent of the normal legacy schema, but for restricted data.	Yes
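
The mapping in the table can be thought of as a longest-prefix lookup on the model path. The sketch below illustrates this with a subset of the rows; the source of truth remains dbt_project.yml, and the `FOLDER_TO_SCHEMA` mapping, the `target_schema` helper and the model file names in the usage are hypothetical.

```python
from typing import Optional

# A subset of the table above: folder prefix -> db.schema.
FOLDER_TO_SCHEMA = {
    "common/": "prod.common",
    "common/sensitive/": "prep.sensitive",
    "common_mapping/": "prod.common_mapping",
    "common_mart/": "prod.common_mart",
    "common_prep/": "prod.common_prep",
    "legacy/": "prod.legacy",
}


def target_schema(model_path: str) -> Optional[str]:
    """Return the db.schema for a model path, preferring the longest matching prefix."""
    matches = [prefix for prefix in FOLDER_TO_SCHEMA if model_path.startswith(prefix)]
    if not matches:
        return None
    return FOLDER_TO_SCHEMA[max(matches, key=len)]
```

The longest prefix wins, so a model under `common/sensitive/` resolves to `prep.sensitive` rather than `prod.common`, while a model under `common/bridge/` falls back to `prod.common`.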

Static

For data warehouse use cases that require us to store data for our users without updating it automatically with dbt we use the STATIC database. This also allows for analysts and other users to create their own data resources (tables, views, temporary tables). There is a sensitive schema for sensitive data within the static database. If your use case for static requires the use or storage of sensitive data please create an issue for the data engineers.

Scenarios in which we have used the STATIC database:

A request comes in to upload a set of data into one of our data sources. This set of data is going to be uploaded once and never updated again.

In this case we have created a new table in the STATIC database and loaded the data there via BULK UPLOAD / COPY command. Then this model has been exposed to the PREP layer. The final model reads from this table via a UNION statement.

This way we have the data in the STATIC database and even if we perform a full-refresh of the data source, we will be able to include this manually uploaded set of records.

Examples of this implementation can be found below:

Qualtrics, Link to the MR
Clari, Link to the MR

Data Masking

We use data masking to obfuscate private or sensitive information within our data warehouse. Masking can be applied in a dynamic or static manner depending on the particular data needs. Masking can be applied at the request of the data source system owner or at the discretion of the Data Team. As our current data masking methods are applied procedurally using dbt, they can only be applied in the PREP and PROD databases. If masking is required in the RAW database, alternative methods of masking should be investigated.

Static Masking

Static data masking is applied during the transformation of the data and the masked result is materialized into the table or view. This will mask the data for all users regardless of role or access permission. This is accomplished in the code with tools such as the hash_sensitive_columns macro within dbt.

Dynamic Masking

Dynamic masking is currently applied on tables or views in the prep and prod layers at query run time, based on assigned policies and user roles, using the Dynamic Data Masking capabilities of Snowflake. Dynamic masking allows data to be unmasked for selected users while remaining masked for all other users. This is accomplished by creating masking policies that are then applied to the column at the time of table or view creation. Masking policies are maintained within the data warehouse source code repository. Please see the dbt guide to set up dynamic masking.

Note: Dynamic masking is not yet applied on the raw database.

Timezones

All timestamp data in the warehouse should be stored in UTC. The default timezone for a Snowflake session is PT, but we have overridden this so that UTC is the default. This means that when current_timestamp() is queried, the result is returned in UTC.

Stitch explicitly converts timestamps to UTC. Fivetran does this as well (confirmed via support chat).

The only exception to this rule is the use of pacific time to create date_id in fact tables, which should always be created by the get_date_pt_id dbt macro and labeled with the _pt_id suffix.
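
To show why the Pacific-time exception matters, the sketch below converts a UTC timestamp into a Pacific-time date id. It is an illustration in Python, not the get_date_pt_id macro itself. The YYYYMMDD integer format and the `date_pt_id` function name are assumptions of this example.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def date_pt_id(ts_utc: datetime) -> int:
    """Convert a UTC timestamp to a Pacific-time date id (YYYYMMDD format assumed)."""
    if ts_utc.tzinfo is None:
        # Warehouse timestamps are stored in UTC, so treat naive values as UTC.
        ts_utc = ts_utc.replace(tzinfo=timezone.utc)
    return int(ts_utc.astimezone(PACIFIC).strftime("%Y%m%d"))
```

A timestamp shortly after midnight UTC still falls on the previous calendar day in Pacific time, so the date id of an event can differ from its UTC date.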

Snapshots

We use the term snapshots in multiple places throughout the data team handbook, and the term can be confusing depending on the context. A snapshot, as defined by the dictionary, is “a record of the contents of a storage location or data file at a given time”. We strive to use this definition whenever we use the word.

dbt

The most common usage is in reference to dbt snapshots. When dbt snapshots is run, it takes the current state of the source data and updates the corresponding snapshot table, which is a table that contains the full history of the source table. It has valid_to and valid_from fields indicating the time period for which that particular snapshot is valid. See the dbt snapshots section in our dbt guide for more technical information.

The tables generated and maintained by dbt snapshots are the raw historical snapshot tables. We build downstream models on top of these raw historical snapshots for further querying. The snapshots folder is where we store the dbt models. One common model we may build is one that generates a single entry (i.e. a single snapshot) for a given day; this is useful when there are multiple snapshots taken in a 24-hour period. We also build models that return the most current snapshot from the raw historical table.
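
The two downstream patterns mentioned above can be sketched in a few lines of Python. The rows are hypothetical, and picking the last snapshot taken on a given day is one possible way to define the daily entry; the actual models are written in dbt and may use a different rule.

```python
from datetime import datetime

# Hypothetical rows of a raw historical snapshot table:
# (id, value, valid_from, valid_to)
rows = [
    (1, "a", datetime(2024, 1, 1, 8), datetime(2024, 1, 1, 20)),
    (1, "b", datetime(2024, 1, 1, 20), datetime(2024, 1, 2, 9)),
    (1, "c", datetime(2024, 1, 2, 9), None),
]


def latest(snapshot_rows):
    """Rows that are still current, i.e. valid_to is null."""
    return [r for r in snapshot_rows if r[3] is None]


def daily(snapshot_rows):
    """Keep one row per id and calendar day: the last snapshot taken that day."""
    picked = {}
    for r in sorted(snapshot_rows, key=lambda r: r[2]):
        picked[(r[0], r[2].date())] = r  # a later valid_from overwrites an earlier one
    return list(picked.values())
```

Here `latest(rows)` returns only the row with value `c`, and `daily(rows)` collapses the two snapshots taken on 1 January into the later one.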

Other uses

Our Greenhouse data can be thought of as a snapshot. We get a daily database dump provided by Greenhouse that we load into Snowflake. If we start taking dbt snapshots of these tables then we would be creating historical snapshots of the Greenhouse data.

The extracts we do for some yaml files can also be thought of as snapshots. This extraction works by taking the full file/table and storing it in its own, timestamped row in the warehouse. This means we have historical snapshots for these files/tables but these are not the same kind of snapshot as dbt. We’d have to do additional transformations to get the same valid_to and valid_from behavior.

Language

Snapshot - The state of data at a specific point in time
Take a snapshot - Run the job that takes the state of the data currently and stores it. Can be used in the dbt context. Not recommended to reference our yaml extract jobs - these would be “run the extract”.
Historical snapshots - A table that contains data for a given source table at multiple points in time. Most commonly used to reference dbt-generated snapshot tables. Can also be used to reference the yaml extract tables.
Latest snapshot - The most current state of the data we have stored. For dbt snapshots these are the records that have null for the valid_to. For yaml extracts this corresponds to the last time the extraction job was run. For Greenhouse raw, this represents the data as it is in the warehouse. Were we to start taking snapshots of the Greenhouse data, the speaker would have to clarify whether they mean the raw table or the latest record in the historical snapshots table.

Backups

The scope of data backups at Data Platform level is to ensure data continuity and availability for reporting and analytics purposes. In case of an unforeseen circumstance happening with our data in Snowflake or with our Snowflake platform, the GitLab data team is able to recover and restore data to the desired state. In our backup policy we tried to find a balance between the risk of an unforeseen event and the impact of the mitigated solution.

Note: the (Snowflake) Data Platform doesn’t act as a data archival solution for upstream source systems i.e. for compliance reasons. The Data Platform relies on data that was and is made available in upstream source systems.

Unforeseen circumstances

We have currently identified two types of unforeseen circumstances:

Incorrect events happening inside the data platform.
Unavailability of the Snowflake environment.

Incorrect events happening inside the data platform

This can be data manipulation action done by a GitLab Team member or by services with access to the data in Snowflake. Some examples are accidentally dropping/truncating a table or running incorrect logic in a transformation.

The vast majority of data in Snowflake is copied or derived from copies of our data sources, all of which is managed idempotently with dbt. The most common procedure for data restoration or recovery is therefore recreating or refreshing objects using dbt Full Refresh. For data in the RAW database, which comes from our extraction pipelines, we follow the appropriate Data refresh procedure.

However, there are some exceptions to this. Any data in Snowflake which is not a result of idempotent processes, or that cannot be refreshed in a practical amount of time, should be backed up. For this we use Snowflake Time Travel, which requires:

Storage in permanent (not transient) tables.
A data retention period of 30 days.

The data retention period is set via dbt. This should be implemented in code via a dbt post-hook (example).

The following set of rules and guidelines applies to backing up data/using time travel:

It is the responsibility of the CODEOWNER to ensure that the backup process has been correctly implemented for the data that their code builds or maintains.
Backups (via Time Travel) need not be applied on dbt models by default since these are idempotent and this would result in a huge increase of the storage costs in Snowflake.
The retention period is set to 30 days.

At the moment the following snowflake objects are considered in scope for Time Travel recovery:

RAW.SNAPSHOTS.*

Once a table is permanent with a retention period we are able to use Time Travel (internal runbook) in the event we need to recover one of these tables.

Unavailability of the Snowflake environment

In the unlikely event that Snowflake becomes unavailable for an undetermined amount of time, we additionally back up any business-critical data for which Snowflake is the primary source to Google Cloud Storage (GCS). We execute these backup jobs using dbt’s run-operation capabilities. Currently, we back up all of our snapshots daily and retain them for a period of 60 days (per the GCS retention policy). If a table should be added to this GCS backup procedure, it should be added via the backup manifest.

Snowflake Admin tasks

In order to keep Snowflake up and running, we perform administrative work.

Create new Snowflake external stage for GCS storage bucket

In order for Snowflake to access the files in a GCS bucket, the files must be copied into a Snowflake external stage.

To create the external stage, the new path to the bucket must be appended to the existing list of storage locations in the STORAGE_ALLOWED_LOCATIONS attribute. If the attribute is overwritten rather than appended to, all existing storage locations are erased, which stops many pipelines from running.

Follow these instructions to append the new external stage: (Note: The GCS_INTEGRATION is Snowflake storage integration for gitlab-analysis project in GCP. If the bucket is in a different project, a new integration would need to be created.)

Get all current storage locations by running this:
```
DESC INTEGRATION GCS_INTEGRATION;
```
From the output, copy the value under property_value where property=STORAGE_ALLOWED_LOCATIONS. It will look something like: gcs://postgres_pipeline/,gcs://snowflake_backups/,...

Update the Claude prompt below with your specific values and paste it into Claude:

Replace <paste full output here> with the DESC INTEGRATION output from step 1
Replace <your new gcs://bucket-path/> with your new bucket path
Replace <database.schema.stage_name> with your desired stage name

Prompt to paste into Claude:

Paste output of DESC INTEGRATION GCS_INTEGRATION: <paste full output here>

New bucket path to add: <your new gcs://bucket-path/>
Stage name: <database.schema.stage_name>

Above are existing storage locations, can you please output the correct ALTER STORAGE INTEGRATION and CREATE STAGE commands?

It should be done like:
1. Update the Storage Integration instructions:
    * Take the 'current_paths' that you just copied and combine it with the 'new_path' that you want to add.
        * Each path needs to be separated by a `,`
        * Each path needs to have its own pair of `''`, these need to be added manually
    * ALTER statement template:

        ```sql
        ALTER STORAGE INTEGRATION GCS_INTEGRATION
        SET STORAGE_ALLOWED_LOCATIONS = ('current_path1','current_path2','new_path');
        ```

    * ALTER statement example:

        ```sql
        ALTER STORAGE INTEGRATION GCS_INTEGRATION
        SET STORAGE_ALLOWED_LOCATIONS = ('gcs://postgres_pipeline/','gcs://snowflake_backups/','gcs://snowflake_exports/');
        ```

2. After you run the ALTER statement, the new stage can now be created, like so:

    ```sql
    CREATE STAGE "RAW"."PTO".pto_load
    STORAGE_INTEGRATION = GCS_INTEGRATION URL = 'bucket location';
    ```

Take the output from Claude and run the ALTER STORAGE INTEGRATION command using the ACCOUNTADMIN role
Take the output from Claude and run the CREATE STAGE command using the LOADER role
If COPY INTO fails later, go to the GCS bucket in GCP Console, navigate to the Permissions tab, and grant the service account dxglbtbppc@sfc-prod1-1-lu5.iam.gserviceaccount.com the Storage Object Viewer role
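
As a cross-check on the generated ALTER statement, the append step can also be reproduced deterministically. The following is a minimal sketch, not part of the documented procedure: the helper name `build_alter_statement` is hypothetical, and it assumes the `property_value` copied from DESC INTEGRATION is a plain comma-separated list of paths.

```python
def build_alter_statement(current_locations: str, new_path: str,
                          integration: str = "GCS_INTEGRATION") -> str:
    """Return an ALTER statement that appends new_path to the existing locations.

    current_locations is the comma-separated property_value copied from
    DESC INTEGRATION (assumed format; hypothetical helper).
    """
    # Split the existing value and drop empty entries or stray whitespace.
    paths = [p.strip() for p in current_locations.split(",") if p.strip()]
    # Append rather than overwrite, and avoid listing the same path twice.
    if new_path not in paths:
        paths.append(new_path)
    # Each path needs its own pair of single quotes, separated by commas.
    quoted = ",".join(f"'{p}'" for p in paths)
    return (
        f"ALTER STORAGE INTEGRATION {integration}\n"
        f"SET STORAGE_ALLOWED_LOCATIONS = ({quoted});"
    )


print(build_alter_statement(
    "gcs://postgres_pipeline/,gcs://snowflake_backups/",
    "gcs://snowflake_exports/",
))
```

Comparing this output with the generated statement before running it as ACCOUNTADMIN helps confirm that no existing location was dropped.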

Create new Snowflake external stage for AWS S3 storage bucket

This guide explains how to grant Snowflake access to a new S3 bucket using the existing Snowflake storage integration.

Overview

The process involves:

Creating a new S3 bucket using terraform
Updating the IAM policy to allow Snowflake access to this bucket
Updating the Snowflake storage integration configuration

Prerequisites

Access to config-mgmt repo, specifically the aws-gitlab-analysis environment.
Snowflake account access with ACCOUNTADMIN role

Detailed Steps


1. Create the S3 Bucket

In the repository: gitlab-com/gl-infra/config-mgmt

Create a new S3 bucket via Terraform in the aws-gitlab-analysis environment:

resource "aws_s3_bucket" "some_new_bucket" {
  bucket = "your-new-bucket-name"
  # Add other configuration as needed
}

2. Update the IAM Policy

In the same repo as the previous step, navigate to the policy file in GitLab:
- File path: environments/aws-gitlab-analysis/templates/iam_policy_snowflake_s3_integration.json

Add the new bucket path to the Resource array, following the same pattern as the existing buckets.

{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:GetObjectVersion",
    "s3:PutObject",
    "s3:ListBucket"
  ],
  "Resource": [
    "arn:aws:s3:::your-new-bucket-name/*",
    "arn:aws:s3:::your-new-bucket-name"
  ]
}

As with any change in the config-mgmt repo, get approvals and then run atlantis apply to deploy the change.
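
The append pattern for the Resource array can be illustrated with a short Python sketch. This is only an illustration of the pattern, not how the policy file is edited in practice (that happens by hand in the repo); the helper name `add_bucket_to_statement` and the `existing-bucket-1` name are hypothetical.

```python
import copy
import json


def add_bucket_to_statement(statement: dict, bucket: str) -> dict:
    """Return a copy of an IAM statement with the bucket ARNs appended to Resource."""
    updated = copy.deepcopy(statement)  # leave the caller's statement untouched
    resources = updated.setdefault("Resource", [])
    # Both the object-level ARN (/*) and the bucket-level ARN are listed,
    # matching the pattern of the existing entries.
    for arn in (f"arn:aws:s3:::{bucket}/*", f"arn:aws:s3:::{bucket}"):
        if arn not in resources:
            resources.append(arn)
    return updated


existing = {
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:GetObjectVersion", "s3:PutObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::existing-bucket-1/*", "arn:aws:s3:::existing-bucket-1"],
}
print(json.dumps(add_bucket_to_statement(existing, "your-new-bucket-name"), indent=2))
```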

3. Update the Snowflake Storage Integration

Add the new bucket to the allowed storage locations in Snowflake:

Use ACCOUNTADMIN role

Update the Snowflake storage integration. Be sure to append the new bucket to the existing list of buckets:

ALTER STORAGE INTEGRATION S3_DATA_PUMP
SET STORAGE_ALLOWED_LOCATIONS = ('s3://existing-bucket-1/', 's3://existing-bucket-2/', 's3://your-new-bucket-name/');

Verify the integration settings:
```
DESC INTEGRATION S3_DATA_PUMP;
```

Note: We treat the S3_DATA_PUMP Snowflake storage integration as the generic integration responsible for connecting to S3 in the main AWS project, where the Snowplow instance runs. If a new bucket lives in a different AWS project, such as a customer-provided one, we would need to create a new Snowflake storage integration for that AWS project (see the Snowflake docs).

4. Verification

To verify everything is working correctly:

In Snowflake, attempt to create an external stage using the new bucket
Test reading from and writing to the bucket using Snowflake queries

Transformation

We use dbt for all of our transformations. See our dbt guide for more details on why and how we use this tool.

Trusted Data Framework

Data Customers expect Data Teams to provide data they can trust to make their important decisions. And Data Teams need to be confident in the quality of data they deliver. But this is a hard problem to solve: the Enterprise Data Platform is complex and involves multiple stages of data processing and transformation, with tens to hundreds of developers and end-users actively changing and querying data 24 hours a day. The Trusted Data Framework (TDF) supports these quality and trust needs by defining a standard framework for data testing and monitoring across data processing stages, accessible by technical teams and business teams. Implemented as a stand-alone module separate from existing data processing technology, the TDF fulfills the need for an independent data monitoring solution.

Enable everyone to contribute to trusted data, not just analysts and engineers
Enable data validations from top to bottom and across all stages of data processing
Validate data from source system data pipelines
Validate data transforms into dimensional models
Validate critical company data
Deployable independently from central data processing technology

Key Terms

Assertion or Test Case - An individual test and the smallest unit of a test that can be performed. In TDF the test case is expressed either as a SQL statement or via a YAML configuration within the SQL-compilation tool dbt.
Data Schema - The tables, columns, views, and other structural elements that make up a data subject area, created using SQL Data Definition Language (DDL).
Monitoring - Tracking the results of test cases to help ensure data is ready for use.

Trusted Data Components

The primary elements of the TDF include:

A Virtuous Test Cycle that embeds quality as a normal part of daily data development, ranging from new data solutions to break-fix issue resolution.
Test Cases Expressed As SQL and YAML which can be developed by anyone.
The Trusted Data Schema saves test results for monitoring and alerting, and long-term analysis towards the path of developing wisdom around business processes and data platform performance.
Schema-to-Golden Record Coverage to provide broad coverage of the data warehouse domain, ranging from schema to critical “Golden” data.
The Trusted Data Dashboard, a business-friendly dashboard to visualize overall test coverage, successes, and failures.
The Test Run, which is when Test Cases are executed.
The Row Count test, which reconciles the number of rows between the source system and Snowflake.

Virtuous Test Cycle

The TDF embraces business users as the most important participant in establishing trusted data and uses a simple and accessible testing model. With SQL and YAML as a test agent, a broad group of people can contribute test cases. The test format is straightforward with simple PASS/FAIL results and just four test case types. Adoption grows quickly as TDF demonstrates value:

Data Customers and Business Users learn the testing framework and create tests themselves
Teams embrace testing as a valuable activity to include at all times, not as a last-minute activity
The Data Team learns to add new tests as part of production-down retrospectives to more rapidly identify issues before they become large problems
Teams develop operational rhythms to continually develop new tests and expand test coverage

Over time, it is not uncommon to develop hundreds of test cases which are run on a daily basis, continually validating data quality.

Test Cases Expressed As SQL and YAML

SQL is the universal language in databases and nearly everyone who works with data has some level of SQL competency. However, not everyone may be familiar with SQL and we don’t want that to limit who can contribute. We use dbt to support the TDF which enables the defining of tests via SQL and YAML.

Trusted Data Schema

With all tests being run via dbt, storing test results is simple. We store the results of every test run in the data warehouse. Storing test results enables a variety of valuable features, including:

data visualization and pattern analysis of test results (total tests run by date, PASS/FAIL rate by subject area, and so on)
measurement of test coverage over a data subject or schema (number of tests by area)
measurement of system quality improvements over time (an increase in the PASS rate)
development of an alerting system based on test results
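
To illustrate the kind of analysis that stored results make possible, here is a minimal Python sketch that computes the PASS rate by subject area. The record layout (`subject_area` and `status` keys) is an assumption made for this example and does not reflect the actual structure of the stored test results.

```python
from collections import defaultdict


def pass_rate_by_area(results):
    """Return {subject_area: fraction of tests with status PASS}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        area = result["subject_area"]  # hypothetical field name
        totals[area] += 1
        if result["status"] == "PASS":  # hypothetical field name
            passes[area] += 1
    return {area: passes[area] / totals[area] for area in totals}


sample_results = [
    {"subject_area": "sales", "status": "PASS"},
    {"subject_area": "sales", "status": "FAIL"},
    {"subject_area": "product", "status": "PASS"},
]
print(pass_rate_by_area(sample_results))
```

Tracking this ratio per day would show the quality trend over time, and a drop below an agreed threshold could feed an alert.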

These test results are parsed and are available for querying in Tableau.

The schema in which we store all test results is WORKSPACE_DATA.
Note: This schema only contains views.

Schema To Golden Record Coverage

The Data Warehouse environment can change quickly and the TDF supports predictability, stability, and quality with test coverage of the areas in the Data Warehouse that are most likely to change:

Schema tests to validate the integrity of a schema
Column Value tests to determine if the data value in a column matches pre-defined thresholds or literals
Rowcount tests to determine if the number of rows in a table over a pre-defined period of time match pre-defined thresholds or literals

The implementation details of these tests are documented in our dbt guide.

Trusted Data Dashboard

The data team is working on either a dashboard or the use of collections to organize trusted data dashboards, as well as published Tableau Data Sources which are certified as trusted data.

Test Run

More to come.

Row Count Test

The row count test reconciles the number of rows between the source database and the target database by extracting statistics from the source DB tables, loading them into a Snowflake table, extracting similar statistics from Snowflake, and comparing source and target. Getting an exact match between source and target is a challenge because:

There is a timing difference.
The data warehouse might keep history.
Deletions take place on the source database.

Depending on the scenario, it’s advisable to check row counts not at the highest (table) level but at a more granular level. This could be one or more fields with a logical distribution, still at an aggregated level, for example an insert or update date in a table.

Based on the row counts from source and row counts on the target (Snowflake data warehouse), a reconciliation can take place to determine if all rows are loaded into the data warehouse.
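
A reconciliation at a more granular level can be sketched as follows. This is a minimal illustration rather than the framework's actual code: it assumes the counts have already been aggregated per key (for example per update date) on both sides, and the function name `reconcile_counts` is hypothetical.

```python
def reconcile_counts(source_counts, target_counts):
    """Return {key: (source_count, target_count)} for every key whose counts differ.

    Both arguments map an aggregation key (e.g. an update date) to a row count.
    A key missing on one side is treated as a count of 0.
    """
    keys = set(source_counts) | set(target_counts)
    return {
        key: (source_counts.get(key, 0), target_counts.get(key, 0))
        for key in sorted(keys)
        if source_counts.get(key, 0) != target_counts.get(key, 0)
    }


source = {"2024-01-01": 100, "2024-01-02": 120, "2024-01-03": 90}
target = {"2024-01-01": 100, "2024-01-02": 118}
print(reconcile_counts(source, target))
```

Checking per key makes it possible to see which slice of the table differs, and to ignore slices that are expected to differ because of timing, retained history, or source deletions.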

Row Count Tests PGP

The framework is designed to execute any kind of query to perform a test. In the current architecture, every query creates one Kubernetes pod, so grouping tables into one query reduces the number of Kubernetes pods created. For record-count and actual-data tests between the Postgres DB and Snowflake, low-volume source tables are grouped together and each large-volume source table runs as an individual task.

A new YAML file was created to hold all types of reconciliation (so it’s not incorporated into the existing YAML extraction manifest). The manifest file combines groups of low-volume tables into single tasks and lists each large-volume table as an individual task. Row count comparisons between Postgres and Snowflake are stored in the Snowflake table "PROD"."WORKSPACE_DATA"."PGP_SNOWFLAKE_COUNTS".
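
The grouping idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the manifest logic itself: the threshold, the table names, and the function `plan_tasks` are hypothetical, and the real grouping is defined in the YAML manifest.

```python
def plan_tasks(table_row_counts, large_threshold):
    """Group tables into tasks: one task per large table, one shared task for small ones.

    table_row_counts maps table name -> approximate row count.
    Each returned task is a list of table names that run in a single query (one pod).
    """
    large = sorted(t for t, n in table_row_counts.items() if n >= large_threshold)
    small = sorted(t for t, n in table_row_counts.items() if n < large_threshold)
    tasks = [[table] for table in large]  # every large table gets its own task
    if small:
        tasks.append(small)  # all small tables share one task
    return tasks


example_tables = {"labels": 10_000, "ci_builds": 50_000_000, "namespaces": 20_000}
print(plan_tasks(example_tables, large_threshold=1_000_000))
```

With this split, three tables need two pods instead of three, and the saving grows with the number of low-volume tables.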

Data Pump

graph LR

yml>pumps.yml]

dataModel[(data model)] --> o{{Airflow DAG}}

yml --> o
o --> S3

S3 --> workato{{Workato recipe}} --> target[(Target)]

In order to make it easy for anyone to send data from Snowflake to other applications in the GitLab tech stack we have partnered with the Enterprise Applications Integration Engineering team to create this data integration framework, which we are calling Data Pump.

This is all orchestrated in the Data Pump Airflow DAG, which runs the pump, and is set to run once daily at 05:00 UTC.

Adding a Data Pump

Step 1: Create a data model using dbt in /marts/pumps (or /marts/pumps_sensitive if the model contains RED or ORANGE Data), following our SQL and dbt style and documentation standards. Create an MR using dbt model changes template. Once this is merged and appears in Snowflake in PROD.PUMPS or PROD.PUMPS_SENSITIVE you are ready for steps two and three.

Step 2: Add Model to pumps.yml using the ‘Pump Changes’ MR template with the following attributes:

model - the name of the model in dbt and snowflake
timestamp_column - the name of the column that should be used to batch the data (or null if there is none and the table is small)
sensitive - True if this model contains sensitive data and is in the pumps_sensitive directory and schema
single - True if you want to create a single file in the target location. False if multiple files can be written
stage - The name of the snowflake stage you’d like to use for the target location
owner - your (or the business DRI’s) GitLab handle

Step 3: Create an issue in the platypus project using the ‘change’ issue template so that the Integration team can map and integrate the data into the target application.
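
Before opening the ‘Pump Changes’ MR, the attributes from Step 2 can be sanity-checked. The sketch below represents one pumps.yml entry as a Python dict and checks it against the attribute list above. The entry values (`pump_example_model`, `example_stage`, `@example_handle`) and the function `validate_pump` are hypothetical, and the real file layout may differ.

```python
REQUIRED_ATTRIBUTES = {"model", "timestamp_column", "sensitive", "single", "stage", "owner"}


def validate_pump(entry):
    """Return a list of problems found in a single pump entry (empty if none)."""
    errors = []
    missing = REQUIRED_ATTRIBUTES - entry.keys()
    if missing:
        errors.append(f"missing attributes: {sorted(missing)}")
    # sensitive and single are described as True/False flags.
    for flag in ("sensitive", "single"):
        if flag in entry and not isinstance(entry[flag], bool):
            errors.append(f"{flag} must be True or False")
    return errors


example_entry = {
    "model": "pump_example_model",      # hypothetical model name
    "timestamp_column": None,           # null: small table, no batching column
    "sensitive": False,
    "single": True,
    "stage": "example_stage",           # hypothetical stage name
    "owner": "@example_handle",         # hypothetical GitLab handle
}
print(validate_pump(example_entry))
```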

Operational Data Pumps

Model	Target system	RF	MNPI
pump_hash_marketing_contact	Marketo	24h	No
pump_marketing_contact	Marketo	24h	No
pump_marketing_premium_to_ultimate	Marketo	24h	No
pump_subscription_product_usage	Salesforce	24h	No
pump_product_usage_free_user_metrics_monthly	Salesforce	24h	No
pump_daily_data_science_scores	Salesforce	24h	Yes
pump_churn_forecasting_scores	Salesforce	24h	Yes

Data Science Data Pumps

The Daily Data Science Scores Pump and the Pump Churn Forecasting Scores Pump are two specific use-cases of the data pump, used to bring data science related data from Snowflake into S3, so that it can be picked up by Openprise and loaded into Salesforce.

The source model for the Daily Data Science Scores pump called mart_crm_account_id contains a combination of PtE and PtC scores, while the Churn Forecasting Scores pump source model mart_crm_subscription_id contains scores strictly related to the Churn Forecasting model.

Marketing Data Mart to Marketo

The Email Data Mart is designed to automatically power updates to Marketo to enable creation of structured and targeted communications.

Trusted Data Model to Gainsight

The Data Model to Gainsight Pump is designed to automatically power updates to Gainsight to enable creation of visualizations, action plans, and strategies for Customer Success to help our customers succeed in their use of GitLab.

Qualtrics Mailing List Data Pump / Qualtrics SheetLoad

The Qualtrics mailing list data pump process, also known in code as Qualtrics SheetLoad, enables emails to be uploaded to Qualtrics from the data warehouse without having to be downloaded onto a team member’s machine first. This process shares its name with SheetLoad because it looks through Google Sheets for files with names starting with qualtrics_mailing_list. For each of the files it finds with an id column as the first column, it uploads that file to Snowflake. The resulting table is then joined with the GitLab user table to retrieve email addresses. The result is then uploaded to Qualtrics as a new mailing list.

During the process, the Google Sheet is updated to reflect the process’ status. The first column’s name is set to processing when the process begins, and then is set to processed when the mailing list and contacts have been uploaded to Qualtrics. Changing the column name informs the requester of the process’ status, assists in debugging, and ensures that a mailing list is only created once for each spreadsheet.
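
The status handling described above amounts to a small state machine on the first column’s name. The following is a minimal sketch of that idea, not the actual SheetLoad code: the function names are hypothetical, and it assumes a sheet is only picked up while its first column is still named id.

```python
STATUSES = ("processing", "processed")


def should_process(header_row):
    """A sheet is picked up only while its first column is still named 'id'."""
    return bool(header_row) and header_row[0] == "id"


def mark(header_row, status):
    """Return a new header row with the first column renamed to the given status."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    return [status] + list(header_row[1:])


header = ["id", "first_name"]
if should_process(header):
    header = mark(header, "processing")  # written when the run starts
    # ... upload to Snowflake, join with users, create the mailing list ...
    header = mark(header, "processed")   # written after the upload to Qualtrics
print(header, should_process(header))
```

Because a renamed first column no longer matches id, a second run skips the sheet, which is what keeps a mailing list from being created twice.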

The end user experience is described on the UX Qualtrics page.

Debugging to Qualtrics Processes

Attempting to reprocess a spreadsheet should usually be the first course of action when a spreadsheet has an error and there is no apparent issue with the request file itself. Reprocessing has been necessary in the past when new GitLab plan names have been added to the gitlab_api_formatted_contacts dbt model, as well as when the Airflow task hangs when processing a file. This process should only be performed with coordination or under request from the owner of the spreadsheet, to ensure that they are not using any partial mailing list created by the process, as well as not making any additional changes to the spreadsheet.

To reprocess a Qualtrics Mailing List request file:

1. Disable the Qualtrics Sheetload DAG in Airflow.
2. Delete any mailing lists in Qualtrics that have been created from the erroring spreadsheet. You should be able to log into Qualtrics using the Qualtrics - API user credentials and delete the mailing list. The mailing list’s name corresponds to the name of the spreadsheet file after qualtrics_mailing_list., which should also be the same as the name of the tab in the spreadsheet file.
3. Edit cell A1 of the erroring file to be id.
4. Enable the Qualtrics Sheetload DAG in Airflow again and let it run, closely monitoring the Airflow task log.

Data Spigot

A Data Spigot is a concept/methodology for giving external systems access to Snowflake data in a controlled manner. To give external systems access to Snowflake, the following controls are in place:

A dedicated service account with key-pair or OAuth authentication.
A dedicated view (or views) only exposing the minimum required data. No Personally Identifiable Information (PII) may be disclosed.
A dedicated role (or equivalent) with access to only the specified tables/views.
A dedicated XS warehouse to limit and monitor costs.
A network security policy to limit network traffic to a specific IP (range)

The process for setting up a new Data Spigot is as follows:

Comply with the controls that are in place, as described above.
Add new Data Spigots to the table below:

Current Data Spigots

Connected system	Data scope	Database table/view	MNPI
Grafana	Snowplow loading times	`prod.legacy.snowplow_page_views_all_grafana_spigot`	No
Gainsight		`prod.common_prep.prep_usage_ping_no_license_key`	No
Gainsight		`prod.common_mart_product.mart_product_usage_wave_1_3_metrics_latest`	No
Gainsight		`prod.common_mart_product.mart_product_usage_wave_1_3_metrics_monthly`	No
Gainsight		`prod.common_mart_product.mart_product_usage_wave_1_3_metrics_monthly_diff`	No
Gainsight		`prod.common_mart_product.mart_saas_product_usage_metrics_monthly`	No
Gainsight		`prod.common_mart_product.mart_product_usage_paid_user_metrics_monthly`	No
Gainsight		`prod.common_mart_product.mart_product_usage_free_user_metrics_monthly`	No
Gainsight		`prod.restricted_safe_common_mart_sales.mart_arr`	Yes
Salesforce		`mart_product_usage_paid_user_metrics_monthly`, `mart_product_usage_paid_user_metrics_monthly_report_view`	No
Zapier	t.b.d.	`prod.workspace_customer_success.mart_product_usage_health_score`	No

Sales Systems Use-Case: Using the Snowflake API

Data Deduplication

Data deduplication is essential for ensuring data quality and reducing storage and compute costs in Snowflake. The current GitLab.com pipeline is designed to execute a full data extract for specific tables where incremental extraction is not feasible, as well as for tables intended for Slowly Changing Dimensions (SCD) modeling. To check for any missing transactions in the source system, incremental extraction tables consistently overlap by 30 minutes.
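
The overlap is also why incremental extracts produce duplicates: each window re-reads the last 30 minutes of the previous one. A minimal sketch of such a window calculation follows; the function name and the way the previous run's end time is tracked are assumptions, not the pipeline's actual implementation.

```python
from datetime import datetime, timedelta

OVERLAP = timedelta(minutes=30)


def extraction_window(previous_window_end, now):
    """Return (start, end) for the next incremental extract.

    The start is moved back by the overlap so that transactions committed late
    in the previous window are picked up again rather than missed.
    """
    return previous_window_end - OVERLAP, now


start, end = extraction_window(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 18, 0))
print(start, end)
```

Rows updated between 11:30 and 12:00 in this example are extracted by both runs, so they appear twice in the raw table until deduplication removes them.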

Additionally, all data sourced from another application, CustomersDot, is extracted in full twice a day, as each extract plays a role in building the SCD downstream.

To address our need for reduced Service Level Objectives (SLO) and Service Level Agreements (SLA), we have shifted towards more frequent extracts for both CustomersDot and GitLab.com. This adjustment has resulted in an increase in duplicate records and higher storage requirements in Snowflake for tables associated with both full and incremental extracts. The growing number of duplicates has adversely affected the results of the dbt model and dbt tests on these data sources over time.

To decrease dbt runtime and enhance the efficiency of Snowflake’s computing and storage, we developed a deduplication framework specifically targeting these data sources. This framework can be easily extended to other data sources in Snowflake where duplicate records may accumulate.

Deduplication Framework

The deduplication framework consists of two main components:

Airflow: Airflow contains 3 deduplication DAGs:
- Deduplication DAG for the gitlab.com incremental extract: t_deduplication_gitlab_com_incremental
- Deduplication Staging DAG for the gitlab.com SCD (full) extract: t_deduplication_gitlab_db_scd
- Deduplication SCD DAG for the CustomersDot SCD extract: t_gitlab_customers_db_dbt
The list of tables we extract is maintained in the manifest file as part of the gitlab_data_extract pipeline, and Airflow relies on that same source of truth to get the list of tables for which it has to run the deduplication logic. The DAG is scheduled to run weekly.
Snowflake: In Snowflake, the following activities are carried out:
- Backup tables are created using Snowflake clone command with timestamp suffixes in the TAP_POSTGRES_BKP schema inside of the RAW database.
- A temporary table is created with a deduplicated dataset using a GROUP BY clause to eliminate duplicates while retaining the most recent records and managing special columns like _uploaded_at and _task_instance. The deduplication logic selects all unique rows from the table.
- The temporary tables are swapped with the original tables, while maintaining current grants and permissions.
- Temporary tables are dropped after a successful swap.
- Backup tables older than 7 days are deleted.
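
The deduplication rule itself can be illustrated in plain Python. This sketch mirrors the idea of grouping on all business columns while treating `_uploaded_at` and `_task_instance` as metadata and keeping the most recent record. It is not the SQL that runs in Snowflake, and the sample rows are invented for the example.

```python
METADATA_COLUMNS = ("_uploaded_at", "_task_instance")


def deduplicate(rows):
    """Keep one row per distinct set of business values: the most recently uploaded."""
    latest = {}
    for row in rows:
        # The grouping key is every column except the load metadata.
        key = tuple(sorted((k, v) for k, v in row.items() if k not in METADATA_COLUMNS))
        kept = latest.get(key)
        if kept is None or row["_uploaded_at"] > kept["_uploaded_at"]:
            latest[key] = row
    return list(latest.values())


sample_rows = [
    {"id": 1, "name": "a", "_uploaded_at": 100, "_task_instance": "run_1"},
    {"id": 1, "name": "a", "_uploaded_at": 200, "_task_instance": "run_2"},
    {"id": 2, "name": "b", "_uploaded_at": 150, "_task_instance": "run_1"},
]
print(deduplicate(sample_rows))
```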

Visualization

We use Tableau as our Data Visualization and Business Intelligence tool. To request access, please submit an access request. Use the template Tableau_Rquest for Tableau access requests.

Meta Analyses for the Data Team

Tableau Usage! 📈 - coming soon
Tableau Account Optimization 💪 - coming soon
Tableau Account Maintenance 🗑️ - coming soon
dbt Event Logging - coming soon
Snowflake Spend ️❄

Security

Passwords

Per GitLab’s password policy, we rotate service accounts that authenticate only via passwords every 90 days. A record of systems changed and where those passwords were updated is kept in this Google Sheet.

We also rotate Snowflake user passwords the first Sunday of every 3rd month of the year (January, April, July, October) via the Snowflake Password Reset DAG.
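
For planning purposes, the rotation dates for a given year can be computed with the standard library. This is a small sketch of the calendar rule only; it does not reflect how the Snowflake Password Reset DAG is actually scheduled.

```python
import calendar
from datetime import date

ROTATION_MONTHS = (1, 4, 7, 10)  # January, April, July, October


def first_sunday(year, month):
    """Return the date of the first Sunday in the given month."""
    first_weekday, _ = calendar.monthrange(year, month)  # Monday is 0, Sunday is 6
    offset = (calendar.SUNDAY - first_weekday) % 7
    return date(year, month, 1 + offset)


def rotation_dates(year):
    """Return the four password rotation dates for the year."""
    return [first_sunday(year, month) for month in ROTATION_MONTHS]


print(rotation_dates(2024))
```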

Software User Provisioning

The data team is responsible for provisioning users within the tools managed by the Data Team. This includes tools like Tableau, MonteCarlo, Fivetran, Stitch, and Snowflake.

For Snowflake, we have a robust process documented in the Snowflake Permissions Paradigm section of this page.

For other tools, add users via the UI and in the appropriate Google Group if one exists.

Stitch provisioning

A new user in Stitch should by default be added to the General role. This role gives sufficient access to Stitch to create new extractions, change existing ones, and troubleshoot running ones. Stitch provisioning is a two-step process. First, the IT operations team adds the team member to the app.stitch Okta group by fulfilling the Access Request. Second, the user’s email is added to the Stitch application.

Google Data Studio

Much like Google Drive all GitLab team members have access to Google’s Data Studio which can be used to build dashboards with data from Google Sheets or other Google data sources. Hence there is no access request needed to get access provisioned to Google Data Studio. Google Data Studio is especially popular with Marketing with their use of Google Analytics. Though this resides outside of the platform described above, any data managed within Google’s Data Studio must adhere to the same Data Categorization and Management Policies as we do in the rest of our platform.

There are 3 types of objects available in Google Data Studio:

Data Sources - This is a connection to data sources. Currently there is no connection available/supported towards our Snowflake data warehouse.
Reports - This is for creating reports based on any connected data set.
Explorer - This is a tool to quickly explore data sets and find detailed insights.

The sharing and access process in Data Studio is comparable to sharing in Google Drive / Google Docs. Data Studio objects can be shared with individuals in our GitLab organization account or with the Organization as a whole. There are no group or role level permissions available. Given the decentralized nature of managing dashboards and data sources in Data Studio, it is advised that business-critical data and reporting eventually be migrated to Snowflake and Tableau. This is made easy with the use of sheetload or Fivetran, which has a BigQuery connector.

A GitLab Team Member who creates any artifact in Google Data Studio holds the owner permissions for that object. With ownership, the GitLab Team Member is responsible for keeping data SAFE within GitLab and outside the organization. Google Data Studio currently doesn’t provide an admin interface that can take over ownership. Upon off-boarding, the object owner should transfer ownership of existing objects to ensure business continuity. Note that Red Data should never be stored or transmitted within Google Data Studio.

Sales Analytics Notebooks

The Sales Analytics team has a small (but expanding) list of regular update processes that benefit from running automatically without human intervention.

Some of those are:

X-Ray fitted curves calculation: Quarterly process that creates a table with fitted curves to historical coverage ratios. This data is used within the X-Ray dashboard.
QTD Pre-Aggregated data for X-Ray and SAE Heatmap: Daily process to precalculate data aggregations at different levels. This process is much easier to run in Python than with SQL and we will be able to upload the data directly into Snowflake.

For this, we have implemented a solution consisting of multiple Airflow DAGs, one per schedule.

The process, explained

As of right now (subject to further iterations and changes), the steps are the following:

A Sales Analyst works on a Python Notebook (example notebook) and makes it ready for production (making sure the cell execution results are cleared, no local variables/secrets are lying around, etc.)
The Sales Analyst uploads the notebook and its respective query in the corresponding folder, depending on what schedule the notebook should run on. The available schedules (and therefore folders) under https://gitlab.com/gitlab-data/analytics/-/tree/master/sales_analytics_notebooks are:
- daily - daily at 6AM
- weekly - every Monday at 6AM
- monthly - every 7th day of the month, at 6AM
- quarterly - every 7th day of the quarter, at 6AM

This has been implemented by creating 4 main DAGs (one per schedule) consisting of as many tasks as there are notebooks for that schedule. New tasks are dynamically added to the DAG as notebooks are committed to the repository.
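
The folder-to-schedule mapping and the dynamic task discovery can be sketched as follows. This is an illustration, not the code of the actual DAGs: the cron expressions are assumptions derived from the schedule descriptions above (the quarterly months in particular are a guess at calendar quarters and may not match the real schedule), and the file paths are invented.

```python
from pathlib import PurePosixPath

# Assumed cron equivalents of the documented schedules (not taken from the DAG code).
SCHEDULES = {
    "daily": "0 6 * * *",
    "weekly": "0 6 * * 1",
    "monthly": "0 6 7 * *",
    "quarterly": "0 6 7 1,4,7,10 *",  # assumption: calendar quarters
}


def tasks_per_schedule(paths):
    """Map each schedule folder to the notebook names found directly inside it."""
    tasks = {name: [] for name in SCHEDULES}
    for raw in paths:
        path = PurePosixPath(raw)
        # Only notebooks directly inside a schedule folder become tasks.
        if path.suffix == ".ipynb" and path.parent.name in SCHEDULES:
            tasks[path.parent.name].append(path.stem)
    return tasks


repo_files = [  # hypothetical repository listing
    "sales_analytics_notebooks/daily/example_scores.ipynb",
    "sales_analytics_notebooks/daily/example_scores.sql",
    "sales_analytics_notebooks/quarterly/example_curves.ipynb",
]
print(tasks_per_schedule(repo_files))
```

Because the task list is derived from the repository contents on each parse, committing a new notebook to a folder is enough to add a task to that folder's DAG.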

The code for the dags can be found in the Sales Analytics Dags in the gitlab-data/analytics project.

Example

Currently, under the /daily/ notebooks we have one sample notebook and its corresponding query.

This notebook runs daily and the scores produced during execution are loaded into Snowflake in the RAW.SALES_ANALYTICS schema.

For this data to be available on Tableau, dbt models will have to be written to expose them as views in the production database, under the PROD.WORKSPACE_SALES schema.

For that the Sales Analyst can either open an MR directly into the gitlab-data/analytics project, or create an issue on this project and a data platform engineer will implement the necessary dbt models.

In order to change the desired day of the week/time of these schedules, the Sales Analyst can open an issue on the gitlab-data/analytics project.

Failure notifications

DAG failure alerts are sent from Airflow to the #sales-analytics-pipelines channel, so the Sales Analysts can monitor errors with the notebooks
If the errors seem to be platform-related, the Sales Analyst can reach out to the data platform engineers either via Slack (via the #data-engineering channel), or by opening an issue on the gitlab-data/analytics project

GSheets & Jupyter Notebooks

A couple of new functions have been added to the GitLabdata library (Link to PyPi, Link to the source code) to allow reading from and writing to GSheets files.

Reading from GSheets within Jupyter Notebooks

The function is called read_from_gsheets (link to function source code). It accepts a spreadsheet_id and a sheet_name as parameters and returns a dataframe.

⚠️ The specific sheet should be shared with the relevant gCloud SERVICE ACCOUNT user’s email account (See System Set Up).

Writing to GSheets from Jupyter Notebooks

The function is called write_to_gsheets (link to function source code). It accepts a spreadsheet_id, a sheet_name and a dataframe as parameters.

⚠️ The specific sheet should be shared with the relevant gCloud SERVICE ACCOUNT user’s email account (See System Set Up).

System set up - Remote execution

For production use-cases, a service user has been provided and the credentials are stored in the Data Team Secure Vault under GCP Service Account for Exporting to GSheets.

⚠️ The specific sheet should be shared with the service account user’s email (data-team-sheets-sa@gitlab-analysis.iam.gserviceaccount.com) prior to calling this function, otherwise the account won’t be able to write to or read from the sheet.

System set up - Local / Team development

For local development, you need to set the GSHEETS_SERVICE_ACCOUNT_CREDENTIALS environment variable with the value of your team’s gCloud SERVICE ACCOUNT Credential (The actual JSON content should be the value of this environment variable, not the path.)

This can be done by running the following command in the terminal of your choice (note that the shell does not allow spaces around the = sign): export GSHEETS_SERVICE_ACCOUNT_CREDENTIALS='JSON_CREDENTIAL_CONTENTS'
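
A quick way to confirm that the variable holds the JSON content rather than a file path is to parse it. The helper below is a hypothetical check for local use and is not part of the GitLabdata library.

```python
import json
import os


def load_gsheets_credentials(env_var="GSHEETS_SERVICE_ACCOUNT_CREDENTIALS"):
    """Parse the service account credentials stored as JSON in an environment variable."""
    raw = os.environ.get(env_var)
    if not raw:
        raise RuntimeError(f"{env_var} is not set")
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # A file path (or any other non-JSON value) ends up here.
        raise RuntimeError(
            f"{env_var} must contain the JSON content itself, not a path to the file"
        ) from exc
```

Calling `load_gsheets_credentials()` in a notebook before using the GSheets functions surfaces a misconfigured variable early, with a clearer message than a failed API call.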

To maintain our high standard in security and avoid any potential breaches, it is required that each team requests and manages their own gCloud SERVICE ACCOUNT.

The GCP team can support with the creation of the user / GCP project. Here is an example of the issue to create the Service Account for the Revenue Strategy and Analytics team.

The gCloud SERVICE ACCOUNT requires Google Workspace Delegated Admin permissions.

⚠️ The specific sheet should be shared with the gCloud SERVICE ACCOUNT user’s email prior to calling this function, otherwise the account won’t be able to write to or read from the sheet.

Remaining work

Update the repository URL for the sales analytics notebooks (link to issue)

Sales Systems Use-Case: Using the Snowflake API

The Sales Systems team needs to run the same query several times per day against Snowflake and load that data into Salesforce.

The data team provided an API user so the Sales Analytics team can automate this process, instead of manually downloading the data and uploading it into Salesforce. More detail on this use-case can be found in the original issue #15456.

The data pulled from the database is encapsulated in a view that strictly exposes only the requested data, and the Sales Systems team queries this view directly via the Snowflake API. A new role called SALES_SYSTEMS_SNOWFLAKE_API_ROLE was created in Snowflake specifically for this use-case, and it has been configured to have read access only on the underlying view.

The Snowflake API user has been created following the steps in the official Snowflake documentation on Using Key Pair Authentication and the credential is stored in our Data Team Secure vault and is to be shared with the Sales Systems team.

We created a runbook with a step-by-step guide on how to create the user and role for this purpose - link to the Snowflake API User runbook.

Exceptions

Exceptions to this standard will be tracked as per the Information Security Policy Exception Management Process.

References

The platform infrastructure

AWS Data Team Guide

AWS data team guide

Data Infrastructure

Overview Data Infrastructure pages are available in our Internal GitLab Handbook. Quick Links …

Data Pipelines

This page describes the ways we extract this data via data pipelines.

Data Platform Security

Data platform security involves implementing measures to protect the confidentiality, integrity, and availability of data within our platform. This encompasses a range of strategies and technologies aimed at safeguarding sensitive information from unauthorised access, data breaches, and other security threats. These measures often include access controls, encryption, authentication mechanisms, monitoring tools, and compliance frameworks to ensure that data remains secure throughout its lifecycle within the platform. By prioritising data platform security, we can mitigate risks, maintain regulatory compliance and build trust with our stakeholders.

