Enterprise Data Warehouse

Enterprise Data Warehouse Overview

Architectural Overview

GitLab’s Enterprise Data Warehouse serves as our central repository for analytics and reporting, transforming raw data into actionable insights. Through the combination of Snowflake’s powerful data platform and dbt’s transformation capabilities, we’ve implemented a robust ELT (Extract, Load, Transform) framework that adheres to the Kimball methodology.

Our warehouse architecture is organized into four distinct schemas:

  1. COMMON Schema: Houses our Enterprise Dimensional Model (EDM), serving as the heart of our integrated application data. This schema implements the Kimball methodology to ensure the highest data quality standards.
  2. SPECIFIC Schema: Maintains independent application data that doesn’t require integration with other systems, following our Trusted Data Development process while avoiding unnecessary complexity.
  3. WORKSPACE Schema: Provides a flexible environment for experimentation and prototyping, serving as a staging area for future EDM solutions.
  4. LEGACY Schema: Maintains historical modeling approaches as we strategically deprecate and migrate critical systems to our modern architecture.

Dimensional Modeling Fundamentals

Dimensional modeling is part of the Business Dimensional Lifecycle methodology developed by Ralph Kimball. It presents data in a standard, intuitive framework that allows for high-performance access while maintaining business process orientation.

Core Concepts

Dimensional modeling uses two primary components:

  • Facts (Measures): The numerical values in your data. These are the “how many” and “how much” values you want to analyze, like sales amount, quantity sold, or customer count. While most facts can be summed or averaged, some (like ratios or percentages) cannot be meaningfully aggregated.
  • Dimensions (Context): The descriptive attributes that give meaning to your facts. Dimensions answer the “who, what, when, where, why” of your data. Dimensions contain hierarchies (like Date -> Month -> Quarter -> Year) and descriptive details (like product name or category), letting you analyze your facts from multiple perspectives.

This approach creates several schema patterns:

  • Star Schema: Central fact table linked to dimension tables
  • Snowflake Schema: Dimensional tables linking to other dimension tables
  • Galaxy Schema: Multiple interconnected fact tables
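
To make the star schema pattern concrete, the hedged sketch below shows how a query might join a central fact table to two dimension tables. All table and column names (fct_sales, dim_product, sales_amount, and so on) are illustrative placeholders rather than actual EDM models.

-- Illustrative star schema query: one fact table joined to its dimensions.
-- All object names here are hypothetical examples, not actual EDM models.
SELECT
  dim_date.fiscal_quarter_name,
  dim_product.product_category,
  SUM(fct_sales.sales_amount) AS total_sales_amount,
  COUNT(*)                    AS sales_transaction_count
FROM fct_sales
INNER JOIN dim_date
  ON fct_sales.date_id = dim_date.date_id
INNER JOIN dim_product
  ON fct_sales.product_id = dim_product.product_id
GROUP BY 1, 2

The dimensions supply the grouping and filtering context, while the fact table supplies the measures being aggregated.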

Development Process

Dimensional models are built in four key steps:

  1. Choose the business process (e.g., track annual revenue)
  2. Declare the grain (e.g., per customer)
  3. Identify the dimensions
  4. Identify the facts

Benefits of Dimensional Modeling

Dimensional modeling offers a proven, industry-standard approach that makes data easily accessible to business teams while maintaining consistent definitions. Its flexible architecture allows seamless integration of new subject areas, and its analytical power grows as dimensions are added, creating an increasingly robust foundation for enterprise analytics.

Schemas

Our schema design reflects the natural progression of data through our warehouse, from initial ingestion to final consumption. Raw data enters through dedicated schemas, undergoes transformation in preparation schemas, and ultimately resides in consumption-ready schemas that support our dimensional model. This layered approach allows us to maintain clean handoffs between different stages of data processing while ensuring data quality and consistency throughout the pipeline. Each schema serves as a checkpoint in our data’s journey, with clear responsibilities and governance rules that help maintain the integrity of our data warehouse.

Common Prep Schema

The Common Prep schema serves as a crucial intermediate layer in our data architecture, guided by four fundamental principles that ensure data quality, maintainability, and usability.

Core Principles

Four fundamental principles guide development and maintenance:

  1. Single Source of Truth: Maintain one prep model per dimensional entity. For example, use a single prep_charge model rather than separate variations, pushing specialized versions downstream as FACT, MART, or REPORT models. This approach:
  • Streamlines data lineage
  • Keeps code DRY
  • Reduces maintenance overhead
  • Prevents redundant implementations
  2. Lowest Grain Preservation: Keep prep models at the lowest possible grain of the dimensional entity. This:
  • Establishes them as the Single Source of Truth (SSOT)
  • Enables flexible downstream modeling
  • Supports various model types (DIM, FACT, MART, MAPPING, BDG, REPORT)
  • Avoids limiting future analysis capabilities
  3. Comprehensive Data Retention: Avoid filtering records in the COMMON_PREP schema. Instead:
  • Implement filtering in the COMMON schema and downstream
  • Maintain data availability for various use cases
  • Prevent premature data exclusion
  • Support diverse analytical needs
  4. Pragmatic Model Creation: Skip the prep layer when direct transformation to the Common schema is more efficient:
  • Avoid pattern-based modeling
  • Eliminate unnecessary complexity
  • Focus on value addition
  • Maintain model efficiency

Use Cases

The schema serves six essential functions:

  1. Surrogate Key Generation: Create and manage keys used throughout the Common Schema (see the sketch after this list).
  2. Data Cleansing: Standardize data types and handle NULL values consistently.
  3. Business Logic Application: Implement transformations needed before data combination.
  4. Reference Data Integration: Incorporate foreign keys and identifier fields for proper joining.
  5. Source Unification: Combine multiple data sources with consistent formatting.
  6. Performance Optimization: Break down large datasets for efficient processing.
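
As a hedged illustration of the first two use cases, a prep model might generate a surrogate key and standardize types and NULL handling in one pass. The sketch below assumes the dbt_utils package is installed; the upstream model prep_charge_source and its columns are hypothetical, not actual EDM models.

-- Minimal sketch of a COMMON_PREP model (hypothetical names throughout)
SELECT
  -- surrogate key generation via the dbt_utils package
  {{ dbt_utils.generate_surrogate_key(['charge_id']) }} AS dim_charge_sk,
  charge_id,
  -- data cleansing: standardize data types and NULL handling
  COALESCE(charge_amount, 0)::NUMBER(38, 2)             AS charge_amount,
  UPPER(TRIM(charge_status))                            AS charge_status
FROM {{ ref('prep_charge_source') }}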

While the COMMON_PREP schema is optional, it provides significant value when used appropriately. These principles establish a foundation for clean, maintainable, and efficient data modeling that supports diverse analytical needs while preventing unnecessary complexity.

Common Mapping Schema

Mapping/look-up (map_) tables to support dimension tables should be created in the common_mapping schema.

Common Schema

The Common schema is where all of the facts and dimensions that compose the Enterprise Dimensional Model are stored. The Common schema contains a variety of different types of dimensions and facts that create multiple star schemas. The models in this schema are robust and provide the basis for analyzing GitLab’s businesses processes.

Our dimensional model revolves around two core components: dimensions (providing context) and facts (measuring events). Understanding each component’s role and characteristics is crucial for effective data modeling.

Dimension Tables

Dimension tables supply the descriptive attributes that give context to our business events. These are the “who, what, where, when, why, and how” of our data. A dimension table typically represents a business entity like a customer, product, or location.

Characteristics of Dimensions

  • Contain descriptive attributes (text-based or categorical)
  • Usually have fewer rows than fact tables but more columns
  • Change relatively slowly over time
  • Provide the entry points for querying and filtering
  • Include hierarchical relationships

Common Types of Dimensions

Conformed Dimensions

Conformed dimensions enable facts and measures to be consistently categorized and described across multiple fact tables and data marts, maintaining consistent meaning throughout the warehouse. This standardization ensures reliable analytical reporting and promotes reusability across the data warehouse. When properly implemented, each subject area can be analyzed independently while retaining the ability to combine insights with related areas. However, this cross-area analysis capability breaks down if dimensions have even slight variations between subject areas. Kimball refers to this standardized set of shared dimensions as the “conformance bus”, which facilitates seamless integration of common dimensions and enables comprehensive reporting across multiple subject areas. Examples include:

  • Date dimensions used across various business processes
  • Customer dimensions used in sales and marketing

Local Dimensions

Specific to a single business process or fact table, these dimensions provide context for particular events or metrics. Examples include:

  • Support ticket status
  • Order types
  • Campaign attributes

Slowly Changing Dimensions & Snapshots

Understanding Time Perspectives

Data analysis typically requires two viewpoints: current and historical. The current view uses up-to-date dimension values, while historical analysis needs to understand how things looked at specific points in time. For example:

  • Analyzing sales with a previous product catalog
  • Tracking customer location changes over time
  • Understanding organizational structure changes

Types of Dimensions

We implement three approaches to handle time-based changes:

Type 1 Dimensions

  • Overwrite values when they change
  • Maintain only current state
  • Provide simplest implementation
  • Lose historical context

Type 2 Dimensions (SCD)

  • Add new records for changes
  • Track validity periods with valid_from and valid_to dates
  • Enable historical analysis
  • Maintain complete change history

Type 3 Dimensions

  • Maintain current and alternate values
  • Enable multiple analytical perspectives
  • Support dual categorization needs
  • Not currently implemented in our EDM

SCD in Practice

Snapshot tables form the backbone of our historical tracking system, capturing the complete lifecycle of business objects from creation through every modification to the present state. These tables maintain a detailed audit trail of changes while enabling efficient historical analysis.

In our implementation, Slowly Changing Dimensions are created using dbt’s snapshot functionality. dbt snapshots provide a simple yet powerful way to track historical changes in our data. When a snapshot is run, dbt compares the current state of the data with the previous snapshot and automatically tracks any changes through valid_from and valid_to dates.
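
A minimal sketch of how such a snapshot might be defined is shown below. The source reference and column names are hypothetical, and note that dbt itself names the tracking columns dbt_valid_from and dbt_valid_to.

{% snapshot ticket_snapshots %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- Hypothetical source; on each run dbt compares this result against the
-- prior snapshot and records any changes with validity timestamps.
SELECT *
FROM {{ source('example_source', 'tickets') }}

{% endsnapshot %}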

A snapshot tracks changes through validity periods, marking each state with these timestamps. For example, a simple state change might look like:

id | attribute | valid_from_date | valid_to_date
1  | 'open'    | 2022-01-01      | 2022-01-02
1  | 'closed'  | 2022-01-02      | NULL

While this format efficiently stores historical data, it can be challenging for business users to analyze. To improve usability, we transform these snapshots into daily grain records in the COMMON schema. These _daily_snapshot models expand the validity periods into individual day records, making time-based analysis more intuitive while maintaining consistency with our dimensional model.
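
A hedged sketch of that expansion is shown below: the snapshot's validity ranges are joined to a date spine so that every day in each range becomes its own row. The model and column names (ticket_snapshots, dim_date, date_actual) are illustrative assumptions.

-- Expand validity ranges into one row per day (illustrative names)
SELECT
  dim_date.date_actual AS snapshot_date,
  ticket_snapshots.id,
  ticket_snapshots.attribute
FROM ticket_snapshots
INNER JOIN dim_date
  ON dim_date.date_actual >= ticket_snapshots.valid_from_date
 AND dim_date.date_actual < COALESCE(ticket_snapshots.valid_to_date, CURRENT_DATE())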

During model development, we start with staging models in COMMON_PREP to implement business logic, then use dbt’s snapshot functionality to track changes. This foundation allows us to create daily snapshots when needed while managing performance impacts.

Best practices for snapshot implementation include adding clear current record indicators, maintaining consistent grain across related models, and thoroughly documenting validity periods. These practices ensure our historical tracking remains accurate and performant while serving diverse analytical needs.

Fact Tables

Fact tables record the business events we want to analyze. They contain the quantitative metrics (measures) of our business processes, along with references to related dimensions.

Key Characteristics

Fact tables capture business events with numeric measures and typically grow continuously as new events occur. They’re characterized by many rows (each representing a single event) but relatively few columns.

While most measures in these tables can be aggregated, some can only be partially aggregated (like averages), and others shouldn’t be aggregated at all (like percentages). Each fact table represents a specific business process and includes foreign keys that link to dimension tables, providing context to the measures.

Types of Facts

Atomic Facts

Atomic fact tables form the foundation of fact-based analysis by capturing business events at their most granular level. Each row represents an individual business event with complete, unfiltered data, preserving the maximum level of detail. By maintaining this granularity, atomic facts enable flexible aggregation options for various analytical needs.

Derived Facts

Derived facts are specialized views built on top of atomic facts to serve specific analytical needs while maintaining clear data lineage. They improve performance by creating focused subsets of large atomic fact tables. For instance, when analysts typically work with just 10% of a large event table, a derived fact can extract just that portion, improving query speed. These tables also standardize metrics by precomputing commonly used aggregations.

Additionally, through “drill across facts”, we can combine multiple fact tables through their shared conformed dimensions to create unified analytical views. By using full outer joins on common dimensions, this process creates derived fact tables that link related business processes while maintaining dimensional consistency, enabling analysis across multiple business areas.
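
For example, a drill-across derived fact might combine bookings and billings through a shared date dimension, roughly as sketched below. The fact tables fct_booking and fct_billing and their columns are hypothetical placeholders, not actual EDM models.

-- Drill across two fact tables via a conformed date dimension (illustrative)
WITH bookings AS (
  SELECT dim_date_id, SUM(booking_amount) AS total_bookings
  FROM fct_booking
  GROUP BY dim_date_id
), billings AS (
  SELECT dim_date_id, SUM(billing_amount) AS total_billings
  FROM fct_billing
  GROUP BY dim_date_id
)
SELECT
  COALESCE(bookings.dim_date_id, billings.dim_date_id) AS dim_date_id,
  bookings.total_bookings,
  billings.total_billings
FROM bookings
FULL OUTER JOIN billings
  ON bookings.dim_date_id = billings.dim_date_id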

Special Purpose Tables

Bridge Tables

Bridge (bdg_) tables reside in the common schema and serve a crucial role in our dimensional model. These intermediate tables resolve many-to-many relationships between tables, maintaining data model flexibility while ensuring proper relationship management.
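
As a hedged example of how a bridge table is typically used, the query sketch below resolves a many-to-many relationship between accounts and users. dim_crm_account appears elsewhere on this page; the bridge table bdg_account_user and its columns are hypothetical.

-- Joining a dimension through a bridge table (bridge name is illustrative)
SELECT
  dim_crm_account.dim_crm_account_id,
  COUNT(DISTINCT bdg_account_user.dim_crm_user_id) AS assigned_user_count
FROM dim_crm_account
INNER JOIN bdg_account_user
  ON dim_crm_account.dim_crm_account_id = bdg_account_user.dim_crm_account_id
GROUP BY 1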

Common Mart Schema

The Common Mart schema combines dimensions and facts into business-ready analytics models, serving as the primary access point for business users and analytics tools.

Purpose and Structure

The mart layer transforms our dimensional model into business-specific datasets that combine relevant facts and dimensions. By pre-joining common attributes and applying standard business rules, marts optimize data for specific analytical needs within each business domain.

Organization By Business Domain

Mart models are organized by business function, separating data into focused areas like Finance, Marketing, People, Product, and Sales. This functional organization allows each department to work with data that’s specifically tailored to their analytical needs while still maintaining connections to the broader data warehouse through conformed dimensions.

Key Characteristics

Mart models are built directly on fact and dimension tables from the Enterprise Data Model, ensuring consistent business definitions and standardized relationships. To maintain data lineage and prevent complexity, they never reference other mart models as sources.

Designed with analytics in mind, these models feature pre-joined tables for common query patterns and include frequently used calculations. The data grain is carefully considered to balance user needs with performance requirements.

The design puts business users first by using familiar business terms for tables and fields, rather than technical language. This business-oriented structure and documentation supports self-service analytics by aligning with common analysis patterns, making the data more accessible and intuitive.

Best Practices

  • Keep models focused on specific business domains
  • Document assumptions and limitations
  • Maintain consistent naming across related marts
  • Regular testing of business logic
  • Monitor usage patterns for optimization
  • Maintain clear lineage to source models

Special Purpose Tables

Scaffold Tables

Scaffold tables serve as a foundational structure between fact tables, ensuring comprehensive coverage of all possible dimensional combinations in visualizations and analyses. When working with visualization tools like Tableau, these tables become particularly valuable by filling in gaps and enabling consistent comparisons between actuals and targets. They’re especially useful for time-based analysis, where you need to maintain a continuous view even when data points are missing. By providing this complete dimensional framework, scaffold tables help prevent misleading gaps in reports and dashboards, ensuring that analysts can see the full picture, including periods or combinations where no data exists. These scaffold tables reside in the common_mart schema with the rpt_scaffold_ prefix, building on top of fact tables while maintaining complete dimensional combinations.

In our targets vs actuals reporting, we use a scaffold table to ensure data completeness and consistency. This table maintains a comprehensive structure that covers all time periods and attribute combinations, whether or not sales activity occurred.
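
A minimal sketch of how such a scaffold might be built, assuming a dim_date model and a hypothetical dimension of sales segments, is a cross join that enumerates every period and segment combination regardless of whether any activity occurred:

-- rpt_scaffold sketch: every month x sales segment combination (illustrative)
SELECT DISTINCT
  dim_date.first_day_of_month,
  segments.sales_segment
FROM dim_date
CROSS JOIN (
  SELECT DISTINCT sales_segment
  FROM dim_sales_segment   -- hypothetical dimension
) AS segments
WHERE dim_date.date_actual <= CURRENT_DATE()

Fact tables can then be left joined onto this scaffold so that periods or segments with no activity still appear in reports.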

Specific Schema

The SPECIFIC schema is to be used for tables that perform a reporting function and act as a source of truth but do not conform to the dimensional modeling structure of the Enterprise Dimensional Model.

No Transformation Views

No Transformation Views are direct views of raw source data that are needed for reporting without further transformation. They should not be used to build additional tables, since a table upstream in the RAW or PREP database will provide better lineage documentation for further transformations. They should always be created as views with no additional transformation or filtering and should be prefixed with ntv_.

Entitlement

To facilitate the use of row-level security in both Snowflake and Tableau, a dedicated schema of entitlement tables is used. An entitlement table is a mapping between a user or role and the records they are allowed to see. The tables in this schema follow a standard form but are not limited to an exact structure. The purpose of these tables is to be joined to other tables so that, at query time, the second table is limited to the appropriate records for the runner of the query.

Naming

The name of an entitlement table should indicate both the table or tables it is meant to be used with and the application it is intended for. Documentation for exactly which tables an entitlement table should be used with can be found in the data warehouse model documentation. For example, an entitlement table used with the mart_team_member_directory table in Tableau would be named ent_team_member_directory_tableau.

Form

Each entitlement table must have at least two columns: a join key that connects to another table and represents a subset of records, and a column representing a Tableau user or Snowflake role. The join key column should be named the same as it is in the table it joins to, making it easier to use the correct table, and its values should match the values of that column in the corresponding table. The column or columns that represent Tableau users or Snowflake roles should be named to match.

Every combination of user and join key must be explicitly included as a row in the table.

Example

If row-level security is to be implemented on the mart_team_member_directory table directly in Snowflake, then the following table would be created:

ent_team_member_directory_snowflake

cost_center   | snowflake_role
Cost of Sales | TMEMBER1
G&A           | TMEMBER1
R&D           | ANALYST_GROUP
Marketing     | ANALYST_GROUP

Then a row access policy would be applied to the mart_team_member_directory table.
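
A minimal sketch of what that policy could look like is shown below, assuming cost_center is the join key on the mart table and CURRENT_ROLE() identifies the querying role; the policy name is an illustrative assumption.

-- Sketch of a Snowflake row access policy backed by the entitlement table
CREATE OR REPLACE ROW ACCESS POLICY team_member_directory_policy
AS (cost_center VARCHAR) RETURNS BOOLEAN ->
  EXISTS (
    SELECT 1
    FROM ent_team_member_directory_snowflake ent
    WHERE ent.cost_center = cost_center
      AND ent.snowflake_role = CURRENT_ROLE()
  );

ALTER TABLE mart_team_member_directory
  ADD ROW ACCESS POLICY team_member_directory_policy ON (cost_center);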

If row-level security is to be implemented on the mart_team_member_directory table in a Data Source in Tableau, then the following table would be created:

ent_team_member_directory_tableau

cost_center   | tableau_user
Cost of Sales | team_member1@gitlab.com
Cost of Sales | team_member2@gitlab.com
Cost of Sales | team_member3@gitlab.com
G&A           | team_member1@gitlab.com
G&A           | team_member2@gitlab.com
Sales         | team_member1@gitlab.com
R&D           | team_member3@gitlab.com
Marketing     | team_member1@gitlab.com

Then, using the guidelines outlined in the Tableau Developers Guide, a Data Source and filters would be created.

Technical Implementation Details

Naming Standards

It is critical to be intentional when organizing a self-service data environment, starting with naming conventions. The goal is to make navigating the data warehouse easy for beginner, intermediate, and advanced users. We make this possible by following these best practices:

  1. PREP TABLES: prep_<subject> = Used to clean raw data and prepare it for dimensional analysis.
  2. FACT TABLES: fct_<verb> = Fact table. Facts represent events or real-world processes that occur. Facts can often be identified because they represent the action or ‘verb’ (e.g. session, transaction).
  3. DIMENSION TABLES: dim_<noun> = Dimension table. Dimensions provide descriptive context to the fact records. Dimensions can often be identified because they are ‘nouns’ such as a person, place, or thing (e.g. customer, employee). The dimension attributes act as ‘adjectives’ (e.g. customer type, employee division).
  4. MART TABLES: mart_<subject> = Join dimension and fact tables together with minimal filters and aggregations. Because they have minimal filters and aggregations, mart tables can be used for a wide variety of reporting purposes.
  5. REPORT TABLES: rpt_<subject> = Can be built on top of dim, fact, and mart tables. Very specific filters are applied that make report tables applicable to a narrow subset of reporting purposes.
  6. PUMP TABLES: pump_<subject> = Can be built on top of dim, fact, mart, and report tables. Used for models that will be piped into a third party tool.
  7. MAP TABLES: map_<subjects> = Used to maintain one-to-one relationships between data that come from different sources.
  8. BRIDGE TABLES: bdg_<subjects> = Used to maintain many-to-many relationships between data that come from different sources. See the Kimball Group’s documentation for tips on how to build bridge tables.
  9. SCAFFOLD TABLES: rpt_scaffold_<subject> = Used to support the visualization layer by creating a template / blueprint with all the combinations of common dimensions between the desired fact tables.
  10. Singular naming should be used, e.g. dim_customer, not dim_customers.
  11. Use prefixes in table and column names to group like data. Data will remain logically grouped when sorted alphabetically, e.g. dim_geo_location, dim_geo_region, dim_geo_sub_region.
  12. Use dimension table names in primary and foreign key naming. This makes it clear to the user what table will need to be joined to pull in additional attributes. For example, the primary key for dim_crm_account is dim_crm_account_id. If this field appears in fct_subscription, it will be named dim_crm_account_id to make it clear the user will need to join to dim_crm_account to get additional account details, as illustrated in the query sketch after this list.
  13. Dimension, fact, and mart tables are not to contain references to operational systems. We abstract the name away from the source system the data is produced into a name that describes the business entity or semantic significance of the data. For example, data from Salesforce is described as crm in the dimensional model and not sfdc or salesforce.
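
For instance, convention 12 above implies a join like the one sketched here. fct_subscription and dim_crm_account are the tables named in the convention, while the measure and attribute columns are illustrative assumptions.

-- Foreign key named after the dimension it joins to (convention 12)
SELECT
  dim_crm_account.dim_crm_account_id,
  dim_crm_account.crm_account_name,   -- illustrative attribute
  SUM(fct_subscription.arr) AS total_arr   -- illustrative measure
FROM fct_subscription
INNER JOIN dim_crm_account
  ON fct_subscription.dim_crm_account_id = dim_crm_account.dim_crm_account_id
GROUP BY 1, 2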

File-Based Data Sources

When handling non-CSV data sources, we prefer direct extraction from source systems. However, temporary solutions using seed files, Sheetload, or Driveload may be acceptable with a clear deprecation plan. For CSV-type source data, we have three options:

  1. dbt seed
  • Version controlled via GitLab
  • Easy updates via MR
  • Best for < 1,000 rows
  • Preferred for small datasets
  2. GCP Driveload
  • Stable and predictable
  • No unexpected changes
  • Requires manual file updates
  • Preferred for larger datasets
  3. Sheetload
  • Enables team data entry
  • Less stable implementation
  • Difficult SOX compliance
  • Last resort for Tier 1 assets

Testing Framework

Models require testing and documentation via schema.yml files, following the Trusted Data Framework (TDF).

Time Standards

We standardize on Monday as the first day of week across all systems:

CASE WHEN day_name = 'Mon' THEN date_day
    ELSE DATE_TRUNC('week', date_day)
END AS first_day_of_week

This ensures:

  • Consistent weekly reporting
  • Alignment with ISO 8601
  • Accurate metric calculations
  • Standard implementation across models

Entity Relationship Diagram (ERD) Library

These diagrams illustrate the relationships between data objects in the Enterprise Dimensional Model across the major business process flywheels.

Lead to Cash ERDs

ERD Library

Product Release to Adoption ERDs

ERD Library

Team Member ERDs

ERD Library

Create Entity Relationship (ER) Diagrams using Lucidchart

Lucidchart is a web-based diagramming application that allows users to visually collaborate on drawing, revising and sharing charts and diagrams, and improve processes, systems, and organizational structures.

An Entity Relationship (ER) Diagram is a type of flowchart that illustrates how entities or objects relate to each other within a system. ER diagrams are used to model and design relational databases, in terms of logic and business rules (in a logical data model) and in terms of the specific technology to be implemented (in a physical data model.)

The below steps illustrate how to create an ER Diagram (Logical & Physical Data Model) using Lucidchart's ERD import functionality. This process of creating a data model from a database or a script is also known as reverse engineering.

Step 1: Create a blank Lucid document from the Lucidchart app (available via the Okta dashboard if the user has been assigned to it).

create-lucid-chart.png

Step 2: Click 'Import Data' that appears below the 'Shape Library' located at the bottom left hand side of the page.

import-data.png

Step 3: Select 'Entity Relationship (ERD)' from 'All Data Sources'. And choose 'Import from SQL Database' option from the dropdown list menu of 'Import your Data'.

import-sql-database.png

Step 4: Select 'MySQL' as the DBMS source to import the data from, to create the ERD.

sql-script.png

Step 5: Run the below script in Snowflake after updating the 'table_schema' filter criteria to select the tables for which the ERD needs to be created.
The below query can be run in the Prod database in Snowflake to get all the models/tables from the COMMON and COMMON_PREP schemas:

SELECT 'mysql' dbms,
        t.TABLE_SCHEMA,
        t.TABLE_NAME,
        c.COLUMN_NAME,
        c.ORDINAL_POSITION,
        c.DATA_TYPE,
        c.CHARACTER_MAXIMUM_LENGTH,
        '' CONSTRAINT_TYPE,
        '' REFERENCED_TABLE_SCHEMA,
        '' REFERENCED_TABLE_NAME,
        '' REFERENCED_COLUMN_NAME
FROM INFORMATION_SCHEMA.TABLES t
LEFT JOIN INFORMATION_SCHEMA.COLUMNS c
ON t.TABLE_SCHEMA=c.TABLE_SCHEMA
AND t.TABLE_NAME=c.TABLE_NAME
WHERE t.TABLE_SCHEMA NOT IN('INFORMATION_SCHEMA')
AND t.TABLE_SCHEMA IN ('COMMON', 'COMMON_PREP');

Step 6: Download and export the results/output of the above query from Snowflake into a csv file.

export-results.png

Step 7: Navigate back to Lucidchart app and select the result.csv file to be uploaded and click 'Import'.

import-tables.png

Step 8: All the tables/models from the selected 'table_schema' list will now appear under 'ERD Import'. The schemas can be expanded and scrolled through to view all the tables.

schemas.png

tables.png

Step 9: The required tables/entities of interest can be dragged onto the canvas, and relationships between the entities can be defined from the ribbon to create the ER Diagram.
Note: The number of fields shown for each entity can easily be modified from the 'Advanced Options' section that appears towards the right-hand side of the page.

ERD.png


Big Data

Big Data is a concept that we use to understand the limits of data offerings. Generically, Big Data is anything that exceeds or strains our current technical capacity for processing and delivery. Dealing with Big Data may be less efficient and more costly as new or creative solutions need to be developed and deployed to expand the capabilities of the data offerings.

Big Data and the Enterprise Data Warehouse

Big Data in the Enterprise Data Warehouse is categorized by three general topics:

  1. Volume - how much data there is for a relevant source, concept, or model
  2. Velocity - how quickly the data changes, is ingested, or consumed
  3. Variety - how the structure and format of a data source compares to other sources

The Enterprise Data Warehouse is limited in each of these conceptual areas and any data that approaches these limits would be considered Big Data.

Volume

While there is no practical limit to the amount of data that can be stored in the Enterprise Data Warehouse, there are limits to how much data can be transformed in a performant and cost-effective way. The current limit is being able to create a single table within a 3-hour timeframe using an XL Snowflake warehouse. As an example, all of the collected Snowplow data, over 6TB and 30 billion records, cannot be processed at once and would be considered Big Data.

Velocity

As the Enterprise Data Warehouse is designed to process and transform the data to present a curated set of tables, there are limits to how quickly those transformations can be processed in a cost-effective way. The Enterprise Data Warehouse is designed to process the data every 24 hours, and data that needs to be processed more quickly than that to accommodate business needs would be considered Big Data. As an example, evaluating website behavior in near real time would be considered Big Data.

Variety

The current design of the Enterprise Data Warehouse is built on the Snowflake cloud database. This limits the formats and structures of data that can be processed to those that fit into structured tables. While some processing of semi-structured data, such as JSON, is possible directly in the Enterprise Data Warehouse, it is limited, and the data must first be ingested as a column in a table, decreasing efficiency. Generally, any data input or output from the Enterprise Data Warehouse that is not in a structured table would be considered Big Data. As an example, the Service Ping payload from the Version database is JSON that requires extensive manipulation before it can be analyzed and would be considered Big Data.

Analytics Performance Policy Framework

Problem Statement

Due to increasing data volumes and business logic complexity in the Enterprise Data Warehouse, the data model transformations built in the EDW have become increasingly non-performant over time, with the daily dbt model production run taking over 12 hours to complete. Query runtimes on some of our largest Snowplow, Service Ping, and GitLab.com data sets can take longer than several minutes on L and XL size warehouses. Our daily Snowplow event data volumes are expected to increase by 2.5x within the next 12+ months. Therefore, we need an Analytics Performance Policy that provides guidelines on how to architect performant data models in the transformation layer of the EDW, balancing technical considerations with business needs and requirements.

We think about dbt model runs along 3 major dimensions: performance, efficiency, and cost.

  1. Performance relates to how long it takes a model to build
  2. Efficiency relates to how well a model uses local storage, remote storage, and partition pruning.
  3. Cost relates to how many Snowflake credits are required to run a model and is impacted by both the performance and efficiency of the model.

The scope of this Analytics Performance Policy at this time is specifically focused on the performance of models. In the future, we will consider adding a separate efficiency and cost policy that would roll-up to an overall Analytics Scalability Policy.

The Analytics Performance Policy is only considering the data transformations and does not consider retention of data that is extracted and loaded towards the RAW database of the EDW. For the time being, the policy assumes we will keep all data in the RAW database and we will not delete data. After a data retention policy is implemented in the future, we would reevaluate and iterate on this Analytics Performance Policy that focuses on the Transformation layer of the EDW. The alignment that is reached with the Functional Teams in this Analytics Performance Policy will be used to influence a data retention policy on the RAW database in Snowflake.

Performance Targets

These initial performance targets were created to allow the daily dbt model production run to finish within an 8-hour working day and provide for the run to be triaged within a working day. The Snowflake query time targets were created to make incremental improvements from query times of several minutes down to 1 minute, providing a more productive and delightful querying experience in Snowflake. These targets are subject to change in the future as we continue to improve performance and receive new business requirements.

  1. Maintain a production dbt DAG run time of less than 8 hours. This assumes we do not scale up warehouse size for existing models and keep using an XL warehouse as a maximum, and that we can scale out by using more concurrent threads running at the same time.
  2. An individual dbt model’s run time is consistently less than 1 hour, and its design accounts for forecasted data volume increases.
  3. A simple query of the Snowplow, Service Ping, and GitLab.com big data sets in Snowflake finishes in under 1 minute on a L or XL warehouse.

Architectural Approaches to Improve Performance

  1. Transform and Surface Smaller Amounts of Data in the Data Models.
    1. Limit the amount of rows and/or columns surfaced in the atomic fact tables (lowest grain that captures ALL transactions for a business process), based on business needs weighed against technical and performance constraints. Ex. Limit the data in a model to the last 13 months when that is both performant and sufficient for business analysis.
    2. Shard monolithic atomic fact tables (lowest grain that captures ALL transactions for a business process) that model the higher levels of abstraction of a business process into smaller data models that model sub-business processes. Ex. Model the SDLC (Software Development Lifecycle) into its component parts like SCM, CI, CD, Security, etc. versus having the end-to-end SDLC modeled in one monolithic, super large data model.
    3. Create aggregated data models at a level of detail purpose built for Business Analytics use cases. Ex. Consider aggregating product usage data to the user, namespace, installation, metric grains by week, month, quarter, and year timeframes as required by the business use case.
  2. Consider adding a clustering key to the data model that is aligned to commonly queried use cases in the model.
  3. Evaluate the use of the simple_cte macro in the data model and ensure only the required columns that are needed for the model are selected.
  4. Consider setting the data model to be incremental (see the config sketch after this list).
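
A minimal dbt config sketch combining points 2 and 4, assuming the dbt-snowflake adapter; the event model, its upstream reference, and all column names are illustrative assumptions.

-- Sketch of an incremental model with a clustering key (illustrative names)
{{
    config(
      materialized='incremental',
      unique_key='event_id',
      cluster_by=['event_date']
    )
}}

SELECT
  event_id,
  event_date,
  namespace_id,
  event_action
FROM {{ ref('prep_event_source') }}   -- hypothetical upstream model

{% if is_incremental() %}
  -- only process rows newer than what is already in the target table
  WHERE event_date > (SELECT MAX(event_date) FROM {{ this }})
{% endif %}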

Historical Archiving Process

  1. For non-idempotent data, which is data that cannot be recreated or otherwise surfaced in the data model due to a performance policy consideration, leverage a data platform archiving methodology to create an historical archive of the data.

For example, if atomic fact tables expose only 13 months of product usage data and an aggregated table at the month, metric, and namespace grain likewise provides only the past 13 months, a historical archive table would be able to provide insights from 2 or 3 years in the past for the aggregated table, while the live data model would only provide the last 13 months of data.
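
One possible shape for such an archive, sketched under the assumption of a monthly aggregated usage model, is an append-only incremental table that keeps rows even after they roll out of the 13-month live window; all object and column names here are illustrative.

-- Append-only archive of a monthly aggregate (illustrative names)
{{
    config(
      materialized='incremental',
      unique_key=['reporting_month', 'metric_name', 'namespace_id']
    )
}}

SELECT
  reporting_month,
  metric_name,
  namespace_id,
  metric_value,
  CURRENT_TIMESTAMP() AS archived_at
FROM {{ ref('mart_usage_monthly') }}   -- hypothetical live model (last 13 months)

{% if is_incremental() %}
  -- append new months; previously archived months are retained
  WHERE reporting_month > (SELECT MAX(reporting_month) FROM {{ this }})
{% endif %}

Rows written while the data is still in the live window are retained after the live model is trimmed, preserving the longer history for analysis.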
