Data For Product Managers
This page is intended to help Product Managers at GitLab understand what data is available to them and how they can use it to understand how their product is used. This page primarily covers two topics: how to consume data, and what data is available.
How to Consume Data at GitLab
The user-facing end of GitLab's data stack consists of our BI tool, Tableau, which is connected to our Snowflake data warehouse. The Tableau page of the Data team handbook has general information about Tableau aimed at a wider GitLab audience.
Useful links for Product Managers
Here are some useful links that we recommend for you to bookmark:
- Data Catalog: Helpful guides on how to use our data
- DBT documentation: Data model documentation
- Service Ping Metrics Dictionary: Service Ping metric definitions and metadata
- Service Ping documentation
- Snowplow Event Dictionary: Snowplow event definitions and metadata
- Product Data Insights handbook: Information about the Product Data Insights team
- Tableau documentation: Information and guides on how to use Tableau
- Analytics Instrumentation Quick Links
Getting Tableau Access
In order to gain access to Tableau, you will need to follow the instructions here and open an access request.
- To create your own charts and dashboards, you need to have a Creator or Explorer license. You can read more about the Tableau license types here.
Published Data Sources
Published Data Sources in Tableau are a great way for Tableau users to build charts without writing any SQL or doing any modeling. The Data team has created several Published Data Sources in Tableau that carry the official "Certified" badge.
For example, `Mart Ping Instance` can be used to look at Service Ping ping-level details.
How do I know what tables are available?
The Data team uses a tool called dbt for our data transformation layer. A nice feature of dbt is dbt docs, which automatically creates documentation for all of the models in our schema. Our dbt docs instance can be found here.
- Tableau will show a list of all tables available for querying when you form a connection to Snowflake in the Data Source pane.
- Each data source has a distinct naming convention for tables (see more about Key Data Sources below).
  - Service Ping models contain `ping_instance` in the name (ex: `dim_ping_instance`, `mart_ping_instance_metric_monthly`)
  - GitLab.com Postgres db event models start with `fct_event` or `mart_event` (ex: `fct_event_user_daily`, `mart_event_namespace_monthly`)
  - GitLab.com Postgres db table replicas are named after the source table/content (ex: `dim_merge_request`, `dim_namespace`)
  - Snowplow models contain `behavior` in the name (ex: `mart_behavior_structured_event`, `fct_behavior_website_page_view`)
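If you prefer to explore the available models directly in Snowflake, you can also query the information schema by naming convention. This is a minimal sketch, not an official workflow; the schema and filter shown here are assumptions, so check the dbt docs for the authoritative model list:

```sql
-- Illustrative only: list models matching the Service Ping naming convention.
-- The database/schema you query against depends on your Snowflake role;
-- dbt docs remain the source of truth for what each model contains.
SELECT table_schema,
       table_name
FROM information_schema.tables
WHERE table_name ILIKE '%PING_INSTANCE%'
ORDER BY table_schema, table_name;
```

The same `ILIKE` pattern works for the other conventions above (`%EVENT%`, `%BEHAVIOR%`).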
How can I update or add more information to the dbt docs?
You will need to locate the file you wish to update or create in the gitlab-data analytics project. Please be sure to read and follow the SQL style guide when making your changes. If you only wish to update the descriptions or information about tables, you will be looking for a `schema.yml` file. If you wish to actually change the structure of tables, it will be a `*.sql` file.
Next, create a branch and submit an MR to the gitlab-data analytics project using the `dbt Model Changes` template. When creating your branch and MR, please follow the Data team workflow and use the appropriate Data team labels.
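As a sketch, a description-only update lives entirely in `schema.yml` and might look like the following. The model and column names here are hypothetical; use the names from the actual file you located:

```yaml
# Hypothetical schema.yml fragment: only descriptions are edited,
# so no *.sql change is needed for this kind of MR.
version: 2

models:
  - name: dim_example_model
    description: "One row per example entity, sourced from the GitLab.com Postgres db."
    columns:
      - name: dim_example_id
        description: "Surrogate key for the example entity."
```

Because dbt docs are generated from these YAML files, the new descriptions appear automatically once the MR is merged and docs are rebuilt.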
How can I get help?
If you ever get stuck or have a question, please ask for help in the #data Slack channel. We also recommend cross-posting questions in your #g_, #s_, or #product channels, because many PMs have data-related expertise and can quickly help with common product data questions.
- Remember, it's helpful for us to know the context behind your question. Don't just say what you want to know; also share why, so others can point you to a more efficient way to get your answer.
- This document is meant to serve as a guide to best practices. When you get help, please add what you learn back into this content.
If needed, you may create an issue in the Product Data Insights project and assign it to a product data analyst. You can read more about working with the PDI team here.
Key Data Sources for Product Managers at GitLab
We have three primary data sources for product usage data:
- Service Ping (for Self-Managed, Dedicated, and GitLab.com)
- GitLab.com Postgres Database (for GitLab.com)
- Snowplow (for GitLab.com and the AI Gateway)
Each data source comes with its own caveats, capabilities, and limitations. The first question we on the Data or PDI teams ask product managers is usually “are you interested in knowing this for Self-Managed or GitLab.com?” Our approach to answering your question and the data source(s) available differ greatly between the two. Although our Self-Managed offering has many more active customers, our GitLab.com offering has much more granular data available to analyze.
Service Ping (Version App)
Service Ping is a custom tool that GitLab built to collect weekly aggregated information from our customers across various deployment options:
- Self-Managed: Customers who host our product on their own hardware.
- GitLab Dedicated: Our fully-managed, single-tenant SaaS offering where each customer gets their own isolated instance of GitLab, hosted and managed by GitLab’s team.
- GitLab.com: Our multi-tenant SaaS offering.
Key Concepts
- Service Ping collects and reports data at an installation level. For Self-Managed and GitLab Dedicated customers, this means one ping per installation per week. For GitLab.com, which is a single, large-scale installation, Service Ping reporting still occurs as a single ping within our Service Ping framework, representing the entire GitLab.com ecosystem. (In other words, despite the deployment type or installation size, we still receive data at the installation level).
- Service Ping provides pre-aggregated counts of specific events/actions (aka metrics). We are unable to do any analysis at a more granular level (ex: user-level, project-level, etc) since the metrics are already aggregated.
- Sending Service Ping is optional but defaults to being on.
- You can see the percent of paid subscriptions that successfully send a ping every month on this chart. We do not know the opt-in rate of unlicensed (Core/Free) users, but we assume it is similar. Here is more information on why some installations block data from being sent.
- Since we do not receive pings from every customer, we apply an estimation in monthly reporting to fill in the gaps. You can read more about the estimation methodology in the internal handbook here.
- Customers must adopt a version of GitLab with a metric instrumented in order to report the metric. For example, if a metric is added in 17.3, only customers on versions >= 17.3 will report the metric. This means that it can take months to have a sufficient number of customers reporting the metric.
- You can track GitLab version adoption on this dashboard.
- Pings are added to Snowflake daily. By the 2nd of the month, all data from the previous month should be available.
Key Columns
- `ping_deployment_type` is the best column to use to differentiate Self-Managed, Dedicated, and GitLab.com usage.
  - If using `ping_delivery_type`, know that both Dedicated and GitLab.com are included in `SaaS`.
- Use the `metrics_path` column to filter to the metric of interest. You can find more metric-level details and metadata in the Service Ping Metrics Dictionary.
- For monthly reporting, we limit to the last ping we receive each month per installation. You can filter your queries to `is_last_ping_of_month = TRUE`.
- We often talk about "installations" or refer to the `dim_installation_id` column. An "installation" is the unique combination of `dim_instance_id`/`uuid` and `dim_host_id`.
  - We use "installations" since a single "instance" (`dim_instance_id`/`uuid`) can have multiple hosts.
Examples
Here is an example of a query that provides ping-level details, filters out GitLab.com, and limits to the last ping of the month:
Self-Managed and Dedicated Ping-Level Details
SELECT *
FROM common_mart.mart_ping_instance
WHERE ping_created_at >= CURRENT_DATE-30
AND ping_deployment_type != 'GitLab.com'
AND is_last_ping_of_month = TRUE
LIMIT 1000
;
Here is an example of a query that provides metric-level reporting by month and deployment type:
Service Ping Metric by Month and Deployment Type
SELECT
ping_created_date_month,
ping_deployment_type,
metrics_path,
SUM(monthly_metric_value) AS monthly_metric_value,
COUNT(DISTINCT IFF(monthly_metric_value > 0, dim_installation_id, NULL)) AS installation_count --count of installations reporting usage that month
FROM common_mart.mart_ping_instance_metric_monthly --this model already filters to the last ping of the month
WHERE ping_created_date_month >= '2024-06-01'
AND metrics_path = 'usage_activity_by_stage_monthly.secure.sast_scans' --arbitrary metric, switch this out for metric of interest
GROUP BY 1,2,3
ORDER BY 1,2
;
GitLab.com Postgres Database
Because GitLab.com is a GitLab instance hosted by GitLab, we have access to the instance’s postgres database and can load parts of it into our Snowflake data warehouse. This means we can get a very detailed look into how our product is used on GitLab.com.
- Any part of the product that creates a table on the backend (see the schema file) can be added to the ELT job, which syncs to the warehouse three times a day. From there, all we need to do is build a dbt base model to make the tables accessible in Tableau.
What if the table or column I want isn’t in the data warehouse?
Our ELT process works by explicitly stating which columns and tables we want to import into the data warehouse. This means we might be missing a column or whole table that you want to have in the data warehouse for analysis. When this is the case, please create a Data issue letting us know what you want us to import using the New Data Source template. Before doing so, please confirm that the table/column is truly part of the production schema.
Replicating Service Ping using GitLab.com Data
- Because Service Ping only aggregates data at an `installation` level, it is not super useful for GitLab.com, since we often want to see information at the `namespace` level. For example, knowing that 40K people used your stage on GitLab.com is somewhat useful, but you'll probably want more context (Are they free or paid? What plan are they on? Do I have any power users, or is usage evenly distributed?)
- But since we have access to the GitLab.com Postgres database, we can replicate many Service Ping metrics at the namespace level or even the user level.
- This model exemplifies how some Service Ping metrics can be replicated for GitLab.com at the namespace level. It is available as a Published Data Source in Tableau as `Mart Event Valid`.
Examples
Here is an example of a query that will generate GitLab.com UMAU by day:
GitLab.com UMAU by Day
SELECT
event_date,
COUNT(DISTINCT dim_user_id) AS umau
FROM common_mart.mart_event_user_daily
WHERE event_date >= CURRENT_DATE-30
AND is_umau = TRUE
GROUP BY 1
ORDER BY 1
;
Here is an example of a query that will generate paid GitLab.com GMAU by month:
Paid GitLab.com GMAU by Month
SELECT
event_calendar_month,
group_name,
user_count
FROM common_mart_product.rpt_event_xmau_metric_monthly
WHERE event_calendar_month >= '2024-06-01'
AND is_gmau = TRUE
AND user_group = 'paid'
ORDER BY 2,1
;
Snowplow
Snowplow Analytics is an open-source enterprise event-level analytics platform that enables data collection from multiple platforms for advanced data analysis.
- We pseudonymize `user_id` on all Snowplow events, meaning that we are unable to connect an event to a specific user (or to the GitLab.com Postgres db).
  - We also pseudonymize page URLs to remove any potential PII or RED data.
- Self-Managed and Dedicated installations do not send Snowplow data to GitLab.
- The one exception is the AI Gateway where we receive events from all deployment types (Self-Managed, Dedicated, and GitLab.com).
Key Concepts
- Because Snowplow does not rely on Service Ping, we do not need to wait for a version of GitLab to be adopted to start receiving data. We can collect and visualize data as soon as the instrumentation is deployed.
- Even though the pseudonymization of `user_id` on Snowplow events is a limitation, the fast feedback makes Snowplow an effective source of data for measuring feature adoption and usage.
  - Note: We are still able to count the number of users who engage with a feature, which is sufficient for most use cases. We just do not know who those users are.
- Snowplow events can be blocked by the user.
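For instance, pseudonymized user IDs can still be counted to measure adoption. This is a hedged sketch: the `event_action` value is a placeholder, and the pseudonymized-user column name is an assumption, so verify both against the Snowplow Event Dictionary and dbt docs before relying on it:

```sql
-- Sketch: distinct (pseudonymized) users performing an action per day.
-- 'some_click_action' is a placeholder; the user-id column name may differ
-- in the current model -- check dbt docs.
SELECT DATE_TRUNC('day', behavior_at) AS event_day,
       COUNT(DISTINCT gsc_pseudonymized_user_id) AS user_count
FROM common_mart.mart_behavior_structured_event
WHERE behavior_at >= CURRENT_DATE-30
  AND event_action = 'some_click_action'
GROUP BY 1
ORDER BY 1;
```

We cannot say who those users are, but counts like this are usually enough for adoption tracking.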
Instrumentation
Analytics Instrumentation has built Internal Event tracking, which will guide you on how to instrument Snowplow events. To get started, use the Quick Start Guide to Internal Event Tracking.
Once your Snowplow events have been instrumented, as part of the validation process, the newly instrumented event should be tested to ensure it is working properly. While you as the PM probably won’t be doing the validation yourself every time, it is nice to know how it works. You can learn more about testing Snowplow events in the Internal Event documentation here.
Visualize your events in Tableau
The data you have instrumented is most useful if it can be visualized in a chart. Refer to the Tableau section of the handbook for information on creating charts.
- First, check if they are correctly stored in Snowflake in the Snowplow Event Exploration Dashboard (note: the data volume is quite large, so please be patient with dashboard load time). You can use the filters to find your events. If you are not sure of the value of the different attributes, they should have been captured in the event definition. If not, check with your engineering manager.
- Once you have verified that your events are properly stored, you are ready to query and visualize the data! Please be aware that we collect several million events (page views, structured events) per day, so the whole dataset is quite slow to query. To make it easier to explore this data source, we have created several smaller tables:
  - `common_mart.mart_behavior_structured_event`: contains ALL structured events (these are what is instrumented with Internal Event tracking)
  - `common.fct_behavior_website_page_view`: contains ALL page views
  - `common.fct_behavior_unstructured_event`: contains ALL unstructured events (including click events, form submissions, etc.)
Examples
The Snowplow models are quite large and can be slow to query. To make your query faster, use a date (`behavior_at`) filter in your `WHERE` clause.
Here is an example of a query that will look at the top 100 Snowplow events from the last several days:
Top 100 Snowplow Events
SELECT
event_action,
COUNT(*) AS event_count
FROM common_mart.mart_behavior_structured_event
WHERE behavior_at >= CURRENT_DATE-3
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
;
Here is an example of a query that will look at top 100 pages viewed in the last several days (note how the URLs are pseudonymized):
Top 100 Pages Viewed
SELECT
page_url,
COUNT(*) AS page_view_count
FROM common.fct_behavior_website_page_view
WHERE behavior_at >= CURRENT_DATE-3
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
;
Other Data Sources
- Zuora
- Zuora is our SSOT for subscription information and key metrics like ARR and Seat Count.
- Customers Dot (CDot) database
- CDot is our SSOT for Fulfillment-related data, including trials.
- Sheetload
- You can load your own Google Sheets into the data warehouse. Read more here.
Analytics Instrumentation
Analytics Instrumentation is part of the product org and is completely separate from the Data and Product Data Insights teams. However, these teams collaborate closely as the Customer Product Adoption pod.
- The Analytics Instrumentation team members are the DRIs for data collection across both GitLab.com and Self-Managed. They own Service Ping and Snowplow. They are the ones to go to for questions like:
- How do I instrument a new metric for Self-Managed?
- What are the best practices for adding to Service Ping?
- How can I use Snowplow/Internal Events to track a frontend interaction on GitLab.com?
- Can I utilize Snowplow/Internal Events to track events on the server-side?