New Data Source

How to add a new data source

This page describes in detail the process of adding new data sources to the data warehouse. The Data Team welcomes any initiative to add new data sources to the data warehouse. We believe that data can lead to value, but adding new data to the data warehouse does not come without cost.

Both the development (assigning resources from the Data Team, from other teams involved to support, and from you as the requestor) and keeping the data pipeline up and running (storage, compute resources, incident management, monitoring, etc.) cost time and/or money.

Please take the following into account before requesting a new data source:

  • Is there a valid business case?
    • Sometimes the business case is to comply with regulatory requirements.
    • Sometimes the business case is straightforward because the potential value clearly exceeds the investment costs.
    • Most of the time it’s hard to quantify, because either the value or the costs are uncertain. Feel free to have an open and honest discussion with the Data Team. We have the experience and can help with the justification, and this does not necessarily need to be a scientific calculation.
  • The work is not done when data lands in the data warehouse. The data will ‘just’ land in the raw data layer, which is not accessible by default for Sisense. Data needs to be loaded downstream into the Enterprise Dimensional Model (EDM) via dbt. This follow-up work comes on top of the process described on this page. There are 3 ways to follow up on the downstream data development:
    • Downstream modelling could be handled by the business team, because we embrace contribution on our data platform. Please note that extensive dbt, SQL, and data modelling knowledge is needed.
    • Downstream modelling could be handled by the Data Team, ideally by a Data Fusion Team. Planning needs to take place and priorities are set in line with company priorities. This does not fall within the scope of adding a new data source to the data warehouse, so it needs to be arranged subsequently.
    • If data is extracted in (complex) JSON format, a Data Engineer can provide support or flatten the data to a tabular format and load it into the PREP database (see the dbt sketch after this list).
  • Adding a new data source to the data warehouse is not a one-off exercise. As soon as the data is extracted to the data warehouse, it will be refreshed on a regular cadence (once a week, once a day, multiple times per day, etc.). This means something can happen or go wrong after the implementation. We will need a DRI from the source side (business and technical) to support this process when needed.
  • Data could be used outside of the EDM, e.g. by a Function Analyst in the raw data layer. Raise an Access Request (AR) to get access to the raw data.
  • When data ends up in the EDM, work still has to be performed in Sisense to create a dashboard. This also requires some technical knowledge.
  • The data platform needs to get access to the source system. Although this sounds obvious, it is not always straightforward.
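
As a purely illustrative sketch of what downstream modelling can look like, the hypothetical dbt staging model below flattens a raw JSON payload into a tabular form. The source, table, and field names are assumptions for the example, not actual GitLab models.

-- Hypothetical dbt staging model that reads a raw table with a VARIANT
-- column called jsontext and flattens selected JSON fields into typed columns.
WITH source AS (

    SELECT jsontext
    FROM {{ source('some_raw_source', 'some_table') }}

)

SELECT
    jsontext['id']::NUMBER            AS record_id,
    jsontext['name']::VARCHAR         AS record_name,
    jsontext['updated_at']::TIMESTAMP AS updated_at
FROM source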

New data source

For adding a new data source, the workflow looks as follows (note: this is the existing workflow we use for all our work, made explicit here in the context of adding a new data source).

Process for adding a new data source:

| Stage (Label) | Summary for new data source |
| ------------- | --------------------------- |
| workflow::1 - triage | All initial information will be provided by the requestor and assessed by the Data Team |
| workflow::2 - validation | A solution design and complete work breakdown for the upcoming development will be created |
| workflow::3 - scheduling | Once a quarter the Data Team OKRs are defined. Based on workload and priorities, new data sources could be incorporated into the next quarter's OKRs |
| workflow::4 - scheduled | Adding the new data source is on the list of next quarter's OKRs. An epic will be created with all the work breakdown activities attached |
| workflow::5 - development | Development is in progress |
| workflow::6 - review | Development is under review |
| workflow::X - blocked | Development is blocked |

workflow::1 - triage

Every new data source request starts with an issue. This applies both to data sources that are already extracted but need to be extended and to completely new data sources. Please use the New Data Source template. Assign the new issue to the Manager, Data Platform so that the Data Team can triage the issue in a timely manner. All initial information will be provided by the requestor and assessed by the Data Team. A cross-check will be performed to verify that the data doesn’t already exist in the data warehouse.

workflow::2 - validation

All details regarding the new data source will be fleshed out, with the goal of creating a full work breakdown.

Data scope

Based on the requirements, the data points that need to be extracted (i.e. which tables) will be determined. A complete data scope must be given.

Data minimisation

In the Data Platform we adopt the principle of data minimisation when extracting data, which means we only extract data that is needed to fulfill the scope/need. We do this for the following reasons:

  • Less data uses fewer resources, reducing pressure on source systems, pipelines, and our data platform.
  • Less data means less maintenance: if a data source changes and we don’t extract that particular data(set), we likely don’t need to do anything.

This is all in the spirit of GitLab's value of Efficiency.

Data minimisation is applied on the following levels:

  • Data sources. We only extract data sources that are needed.
  • Tables. We only extract the tables within an extraction that are needed.
  • Columns. We only extract the columns in the tables that are needed.

Data minimisation is not applied on the following levels:

  • Rows. We don’t apply filtering on rows by default, unless there is a good reason (technical or functional). This is to avoid confusion about which data our data consumers are actually seeing.
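
As a simple illustration of column-level minimisation, an extraction query selects only the agreed columns instead of SELECT *. The table and column names below are hypothetical:

-- Hypothetical source-side extraction query: only the columns that are
-- in scope are selected, never SELECT *.
SELECT
    id,
    status,
    updated_at
FROM some_source_schema.some_source_table;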

Data minimisation use case

When you want to apply data minimisation and add one or more columns to the JSON data while loading it, you can use the function object_insert. Example usage:

WITH base AS (
    SELECT try_parse_json('{"id": 1, "name": "ABC"}') AS json_data
)
SELECT object_insert(json_data, 'email', 'test@gitlab.com')
  FROM base;

-- {
--   "email": "test@gitlab.com",
--   "id": 1,
--   "name": "ABC"
-- }

When you want to discard one or more columns from the JSON data while loading it, you can use the function object_delete. Example usage:

WITH base AS (
    SELECT try_parse_json('{"id": 1,
                            "name": "ABC",
                            "address": {"add1": "1234", "add2": "XYZ", "city": "TN", "apt": 1}}') AS json_data
)
SELECT object_delete(json_data, 'id', 'address')
  FROM base;

-- {"name": "ABC"}

In this way, you can exclude columns that shouldn’t be processed for various reasons (Red data, PII data, no value in the data, or other minimisation principles).
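
To show how this can fit into a load step, the sketch below is a hypothetical example of stripping sensitive keys from a raw JSON payload before loading it downstream. The database, schema, table, and key names are assumptions:

-- Hypothetical load step: drop sensitive keys from the raw payload before
-- inserting the cleaned records into a downstream table.
INSERT INTO prep.some_source.some_table_cleaned (payload)
SELECT object_delete(payload, 'email', 'phone_number')
  FROM raw.some_source.some_table;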

Extraction solution

The Data Team has different instruments available to extract data out of source systems (randomly ordered): Fivetran, Stitch, Singer taps, Snowflake data shares, Snowpipe, Snowflake tasks, and custom development.

The decision on which instrument to use is always based on a combination of:

  1. Implementation effort.
  2. Maintenance effort.
  3. Ability to extend and adjust.

It's the Data Team that determines which instrument is used. The following decision diagram is used to determine the right extraction solution:

graph LR

%%decisions
    api_available{API available?}
    data_bucket{Data available in a GCP or S3 bucket?}
    fivetran_connector_available{Fivetran connector viable?}
    stitch_connector_available{Stitch connector viable?}
    data_is_ext_snowflake_sources{Data is in a different Snowflake account?}
    singer_option{Singer tap viable?}
    streaming{Low-latency load needed?}
%%end solutions
    Custom([Custom development])
    Fivetran([Fivetran])
    Singer([Singer])
    Stitch([Stitch])
    Snowflake_datashare([Snowflake data share])
    Snowpipe([Snowpipe])
    Snowflake_task([Snowflake task])
%%flow
    ds_request[New Request]-->data_bucket
    data_bucket-->|No|api_available
    data_bucket-->|Yes|streaming
    streaming-->|Yes|Snowpipe
    streaming-->|No|Snowflake_task
    api_available-->|No|data_is_ext_snowflake_sources
    api_available-->|Yes|fivetran_connector_available
    stitch_connector_available-->|Yes|Stitch
    fivetran_connector_available-->|Yes|Fivetran
    fivetran_connector_available-->|No|stitch_connector_available
    stitch_connector_available-->|No|singer_option
    singer_option-->|Yes|Singer
    singer_option-->|No|Custom
    data_is_ext_snowflake_sources-->|Yes|Snowflake_datashare
    data_is_ext_snowflake_sources-->|No|Custom

Custom development is a solution designed and developed by the GitLab Data Team. Examples of this are the current PGP and Zuora Rev Pro extractions.

Access request

Although it could be helpful to already provide the Data Team with access to the source system, it's not mandatory to raise an Access Request at this point.

Work breakdown

The end goal of the validation is to have a solution design and a complete work breakdown, with DRIs and size estimations attached. All the work that needs to be performed is written down. We aim to have it at a level where every work item can be converted into an issue no larger than T-shirt size M / 5-8 issue points.

MNPI Data

It needs to be determined whether there is MNPI (Material Non-Public Information) data in the data source and whether this data will be extracted to the Data Warehouse. If the data source contains MNPI and this data is extracted, change the label of the issue to new data source MNPI.

workflow::3 - scheduling

Based on the business case, the effort to implement, the workload in the Data Team, and Data Team priorities, the implementation will be scheduled. For scheduling we follow the GitLab Data Team Planning Drumbeat. This means that every quarter, the Data Team determines when a new data source request will be picked up. When a new data source request remains in scheduling, it does not mean that it isn't on the radar of the Data Team. It means that it hasn't been scheduled yet, because:

  1. Defining the OKRs for the upcoming quarter hasn't taken place yet. Once a quarter the Data Team sets the OKRs for the next quarter.
  2. It didn't fit in the OKRs for the next quarter because of the business case, effort to implement, workload in the Data Team, and Data Team priorities.

workflow::4 - scheduled

When a request for implementing a new data source is in scope for the next quarter's OKRs, the request is scheduled. An epic will be created with the issue attached to it. Based on the work breakdown, all issues are created with the corresponding issue weight, the label workflow::4 - scheduled, the right milestone (if known), and attached to the epic.

workflow::5 - development

When the execution of the work starts, the issue is in development and follows the regular development life cycle.

During development, the Data Engineer aligns with all stakeholders about access to the data. Access can be provided to the raw schema, depending on the data and the use case.
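
As a hypothetical illustration of what granting read access to a raw schema can look like in Snowflake (the role and schema names below are assumptions, not the actual GitLab role model):

-- Hypothetical grants giving an analyst role read access to a new raw schema.
GRANT USAGE ON DATABASE raw TO ROLE some_analyst_role;
GRANT USAGE ON SCHEMA raw.new_data_source TO ROLE some_analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA raw.new_data_source TO ROLE some_analyst_role;
GRANT SELECT ON FUTURE TABLES IN SCHEMA raw.new_data_source TO ROLE some_analyst_role;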

Creating labels

When it is a completely new data pipeline that extracts data from an upstream system that is not yet extracted, create a new label in the GitLab Data group, e.g.: https://gitlab.com/groups/gitlab-data/-/labels/23453371/edit

workflow::6 - review

When the execution of the work is finished, the issue is in review and follows the regular development life cycle.

workflow::X - blocked

When execution cannot continue because external intervention is needed, the issue is blocked. A clear problem statement must be given and the right people need to be assigned at the shortest possible notice.

Red Data

Red data (according to the GitLab Data Classification Policy) is not allowed to be stored in our Data Platform (Snowflake). Therefore we will not bring in/connect new data sources that are listed in the tech stack with data_classification: Red unless there is a mission-critical business reason. There is an exception process available which enables us to evaluate the needs on a case-by-case basis; this process requires approval from BT/Data VP-level, Security, and Privacy. Evaluating the business reason and obtaining approvals are part of the triage process and are governed via the new data source template.

Note: The exception process must be fulfilled to either connect a system with Red data and/or to extract Red data (fields) from that system. However, the business case to extract Red data (fields) under the exception process will necessitate a higher standard of review than a business case that only requires connecting a Red data system without extraction of Red data (fields). Where extraction of Red data (fields) is approved under the exception process, masking will be applied in the Data Platform (Snowflake) as described in the following section.
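
For illustration only, masking in Snowflake is typically implemented with a masking policy. The sketch below uses hypothetical policy, role, schema, and column names and is not the Data Team's actual masking setup:

-- Hypothetical masking policy: only a privileged role sees the raw value,
-- everyone else sees a masked placeholder.
CREATE MASKING POLICY IF NOT EXISTS prep.sensitive.mask_red_data AS (val STRING)
  RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('SOME_PRIVILEGED_ROLE') THEN val
    ELSE '**MASKED**'
  END;

-- Attach the policy to a hypothetical Red data column.
ALTER TABLE raw.some_source.some_table
  MODIFY COLUMN red_field SET MASKING POLICY prep.sensitive.mask_red_data;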

Exception process - connecting an unlisted source or a Red data source

When extracting new data to Snowflake and the data source is not listed in the tech stack or is listed with data_classification: Red:

  • The team member must state a business case to connect the system to the Data Platform.
  • The team member that requests the connection will perform an analysis of the data and confirm the data that is extracted is not Red data.
    • For instances where Red data (fields) must be ingested into Snowflake, masking will be applied upon extraction.
  • BT/Data VP-level, Legal and Security must sign off to start the implementation.

Personal data

Extracting Personal Data into the Data Platform (Snowflake) is allowed, but it requires additional review from our Legal Privacy team and, where applicable, the People Team. When requesting to add a new data source, the team member that requests the upstream system/data source is responsible for indicating whether the data source contains personal data and which data elements it concerns. If a team member doesn't have enough knowledge to determine whether certain information is Personal Data, they should tag a subject matter expert and, if needed, a data engineer.

Please note that “pseudonymized” data is still Personal Data under privacy legislation.

If the upstream system (data source) contains Personal Data, the Data Team will either:

  • exclude the indicated Personal Data elements from the extract if the extract can fulfill the requester’s purpose without retaining the Personal Data; or
  • obtain Legal Privacy review and approval to action the extraction to and processing of Personal Data in our Data Platform.

Sensitive Personal Data

Certain Personal Data elements are deemed “sensitive” under various privacy legislation. Sensitive Personal Data generally includes racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data, data related to health, data related to sex life or sexual orientation, criminal offenses, and citizenship/immigration status.

Privacy legislation prohibits the processing of these types of data elements, except in limited circumstances. If a request is made to extract data that contains sensitive Personal Data, it will generally be denied unless true consent has been obtained to the processing, and Legal Privacy and, where applicable, the People Team, have approved this processing.

Monte Carlo observability

When a new data source is extracted to the raw layer, in many cases a new separate schema will be created for that data source. To make sure the new data source is observed by Monte Carlo, follow the steps outlined in the Monte Carlo permission handbook section.

Documentation

The new data source will be documented in the handbook, following this template.