Monte Carlo Guide

What and why

Monte Carlo (MC) is our Data Observability tool and helps us deliver better results more efficiently.

The Data Team default for observing the status of the data is Monte Carlo. Creating tests (called monitors in Monte Carlo) is done via the Monte Carlo UI, and results are reported according to the notification strategy. In a future iteration we plan to implement Monitors as Code, so that these tests are also version controlled. dbt is still used for existing tests; there is no roadmap in place to migrate these to Monte Carlo.

How We Operate Monte Carlo

We use the #data-pipelines Slack channel for MC platform-related alerts. We are planning to use the #data-analytics Slack channel in the near future for model-related alerts, as soon as we have implemented the full notification strategy for Monte Carlo. This work is planned under this epic for F23Q3: Onboard Analytics Engineers to the Monte Carlo Tool.

Monte Carlo is an integral part of our Daily Data Triage and will replace the TD Trusted Data Dashboards.

graph TD

mc(Monte Carlo)
sf(Snowflake Data Warehouse)
de(GitLab Team Member)

mc --> |alerts| de
de --> |improves| sf
sf --> |observes| mc

The whole body of work covering the Monte Carlo rollout at GitLab falls under the epic Rollout Data Observability Tool with 100% coverage of Tier 1 Tables to improve Trusted Data, Data Quality, and Data Team member efficiency. The work breakdown has been done there and issues have been created to reflect the necessary steps until we are up and running with Monte Carlo in production.

Logging In

Login to Monte Carlo is done via Okta. Go to https://getmontecarlo.com/signin. The following screen appears; after providing your email and clicking “Sign in with SSO”, you should be redirected to your Okta login. Please note, you need to log in via SSO and not via username/password.

(Screenshot: Monte Carlo SSO sign-in screen)

A runbook of how everything is technically set up can be found in the Monte Carlo Runbook.

The gist of it is that there is an Okta group called okta-montecarlo-users that is maintained by the Data team and has the Monte Carlo app assigned to it. To access Monte Carlo via Okta, your user should be part of the okta-montecarlo-users group. For that, submit an AR (similar ARs: Example AR 1, Example AR 2) and assign it to Rigerta Demiri (@rigerta), or ping the #data channel linking the AR.

Once logged in, you should be able to see the Monte Carlo Monitors dashboard with details on the objects being monitored and several custom monitors that have already been set up.

(Screenshot: Monte Carlo Monitors dashboard)

You can create a new monitor or view an existing monitor's details, such as its definition, schedule, and any anomalies related to it. Alternatively, you can list all incidents by clicking the Incidents item on the top menu bar, search for a specific model by querying the Catalog view, or check Pipelines for detailed lineage information on how the data flows from the source to the production model.

Depending on the role assigned to your user (by default, every user logging in via SSO is assigned a Viewer role), you might be able to see Settings and check existing users and integrations (such as the Slack, Snowflake, and dbt integrations).

If you need your role to be updated, you can reach out to anyone on the data platform team and they will be able to modify your existing role.

More information on navigating the UI can be found in the official Monte Carlo documentation.

Adding a New Monitor

Monte Carlo will be running volume, freshness, and schema change monitors by default on all the objects it has access to. However, these checks are based on update patterns the tool learns from the data; if you need a specific custom check that runs on a certain schedule, you can add a custom monitor for that.

The official Monte Carlo documentation on monitors can be found in the Monitors Overview guide.

We have one Monte Carlo Snowflake Integration in place, which has two separate connections to Snowflake. The first connection is called snowflake and operates on DATA_OBS_WH_1, an XS Snowflake warehouse. The second connection is called snowflake large and operates on DATA_OBS_WH_L, an L Snowflake warehouse.

Please choose the connection that makes the most sense when adding a new custom monitor. Only run the monitor on the large warehouse if this is really necessary for your custom SQL query to finish in a reasonable amount of time and to prevent it from timing out.
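
If you are unsure which connection to pick, it can help to time the candidate query yourself before creating the monitor. The snippet below is a minimal sketch (not part of the Monte Carlo setup itself), assuming the Snowflake Python connector and placeholder account, user, and table names:

```python
# Minimal sketch: time a candidate custom-SQL query on the XS warehouse before
# deciding whether the "snowflake large" connection is really needed.
# Account, user, and the query below are placeholder assumptions.
import time

import snowflake.connector

CANDIDATE_QUERY = """
    SELECT COUNT(*) AS row_count
    FROM prod.common.fct_example          -- placeholder table
    WHERE updated_at >= DATEADD('day', -7, CURRENT_DATE)
"""

conn = snowflake.connector.connect(
    account="your_account",               # placeholder
    user="your_user",                     # placeholder
    authenticator="externalbrowser",      # SSO login via the browser
    warehouse="DATA_OBS_WH_1",            # start with the XS warehouse
)

try:
    cur = conn.cursor()
    start = time.time()
    cur.execute(CANDIDATE_QUERY)
    cur.fetchall()
    print(f"Query finished in {time.time() - start:.1f}s on DATA_OBS_WH_1")
    # If this regularly takes several minutes or times out, only then pick the
    # "snowflake large" connection (DATA_OBS_WH_L) for the monitor.
finally:
    conn.close()
```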

Fine-Tuning an Existing Monitor

If you want to modify an existing monitor, depending on the type of monitor, you can adjust different parts of it, such as the schedule, the timestamp field to be taken into account, and the alert condition.

Responding To A Slack Alert

Currently, when we get notifications in different Slack channels, we can already triage the issue via Slack by assigning a status to it, choosing from: Fixed, Expected, Investigating, No action needed, and False positive (No Status is the default status set by Monte Carlo). Once we start investigating and have a finding, writing a comment in the same Slack notification thread automatically adds that comment to the incident in Monte Carlo.

Our goal is to be able to integrate Monte Carlo with GitLab so that whenever we get an alert on Slack, a triage issue would automatically be opened on GitLab and we’d follow the same Data Triage procedure as usual.
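
This integration is not in place yet; the sketch below only illustrates what such an automation could look like, assuming the GitLab Issues API (POST /projects/:id/issues), a hypothetical triage project ID, and a made-up alert payload shape:

```python
# Hedged sketch of the planned automation: turn a Monte Carlo alert into a GitLab
# triage issue. Project ID, token handling, and the alert fields are assumptions.
import os

import requests

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT_ID = 123                          # placeholder: numeric ID of the triage project
TOKEN = os.environ["GITLAB_TOKEN"]        # access token with the `api` scope


def open_triage_issue(alert: dict) -> str:
    """Create a triage issue from a (hypothetical) Monte Carlo alert payload."""
    response = requests.post(
        f"{GITLAB_API}/projects/{PROJECT_ID}/issues",
        headers={"PRIVATE-TOKEN": TOKEN},
        data={
            "title": f"Monte Carlo incident: {alert['incident_type']} on {alert['table']}",
            "description": (
                f"Incident URL: {alert['url']}\n\n"
                "Follow the standard Data Triage procedure and update the incident "
                "status in Monte Carlo (or in the Slack thread) once resolved."
            ),
            "labels": "Data Triage",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["web_url"]
```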

There is detailed information including a video section in the official Monte Carlo documentation on how to respond to an alert.

Incident status

Each Monte Carlo incident always has a status. See the following list for when to use which status:

| Monte Carlo status | Context | Actions done or to do |
|--------------------|---------|-----------------------|
| Fixed | Incident is not active anymore. | Actively worked on resolving the incident, or the incident is normalized automatically. |
| Expected | Incident was flagged by Monte Carlo correctly. We knew that this was coming, like a batch update or a schema change that was in the works. | None |
| Investigating | Actively working on the incident. | Root cause investigation; resolve if needed. |
| No action needed | Incident was flagged by Monte Carlo correctly, but it's not a breaking change. | None |
| False positive | Incident was flagged by Monte Carlo wrongly. | None |
| No Status | Default status by Monte Carlo. | Start investigating and update the status. |

Note on DWH Permissions

In order for Monte Carlo to be integrated with Snowflake, we had to run the permissions script specified in the official docs once for each database we need to monitor (in our case raw, prep, and prod), with the correct value for the $database_to_monitor variable. The script covers new tables being added to existing schemas; in case of a new schema, the script has to be executed again for the database the schema resides in. The data observability user is stored in our internal data vault.
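
The real script lives in the official Monte Carlo docs; the sketch below (with a placeholder role name and connection values, and illustrative grants only) just shows why it has to be run once per database and again whenever a new schema appears:

```python
# Illustrative sketch only: the actual grant statements come from the official
# Monte Carlo docs. The role name and grants below are placeholder assumptions.
import snowflake.connector

DATABASES_TO_MONITOR = ["RAW", "PREP", "PROD"]   # values of $database_to_monitor
MONITOR_ROLE = "DATA_OBSERVABILITY"              # placeholder role name

conn = snowflake.connector.connect(
    account="your_account", user="your_user",    # placeholders
    authenticator="externalbrowser", role="SECURITYADMIN",
)
cur = conn.cursor()

for db in DATABASES_TO_MONITOR:
    cur.execute(f"GRANT USAGE ON DATABASE {db} TO ROLE {MONITOR_ROLE}")
    # The script works from the schemas that exist *today*; FUTURE grants then cover
    # new tables inside those schemas, but a brand-new schema is not covered, which
    # is why the script must be executed again when a schema is added.
    cur.execute(f"SHOW SCHEMAS IN DATABASE {db}")
    schemas = [row[1] for row in cur.fetchall() if row[1] != "INFORMATION_SCHEMA"]
    for schema in schemas:
        cur.execute(f"GRANT USAGE ON SCHEMA {db}.{schema} TO ROLE {MONITOR_ROLE}")
        cur.execute(f"GRANT SELECT ON ALL TABLES IN SCHEMA {db}.{schema} TO ROLE {MONITOR_ROLE}")
        cur.execute(f"GRANT SELECT ON FUTURE TABLES IN SCHEMA {db}.{schema} TO ROLE {MONITOR_ROLE}")

conn.close()
```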

Please note this is an exception to our usual permission-handling procedure, where we rely on Permifrost, because observability permissions are an edge case for Permifrost and not yet supported by the tool. There is an ongoing feature request on Permifrost for adding granularity to the way permissions are set, but no solution has been agreed on yet.

Monitoring strategy

By default, we monitor all tables in the RAW, PREP, and PROD databases in Monte Carlo, unless there is a specific reason not to, or if we reach the limits specified in our contract. Tables or schemas excluded from monitoring are documented below.

Exclude sandbox schemas

Sandbox environments are generally created for testing purposes. We normally don't take any action on them even if alerts come through in our triage Slack channels. For this reason we exclude schemas whose names contain sandbox from monitoring, to avoid getting alerts from them. This has been set via an exclude rule in Monte Carlo.
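
Conceptually, the exclude rule acts as a pattern filter over schema names. The snippet below is purely illustrative (the actual rule is configured in the Monte Carlo UI) and uses made-up schema names:

```python
# Purely illustrative: the real exclusion is an exclude rule in the Monte Carlo UI.
import re

SANDBOX_PATTERN = re.compile(r"sandbox", re.IGNORECASE)

schemas = ["PROD.COMMON", "PREP.SANDBOX_JDOE", "RAW.TAP_POSTGRES"]  # made-up names
monitored = [s for s in schemas if not SANDBOX_PATTERN.search(s)]
print(monitored)  # ['PROD.COMMON', 'RAW.TAP_POSTGRES']
```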

Notification strategy

All incidents are reported in the Monte Carlo incident portal. For triage purposes, the most important ones (those which require action) are routed to Slack. The following matrix shows, per data area, which types of monitors are routed and towards which channel:

| Database | Data Scope | Volume | Freshness | Schema changes | Custom monitors |
|----------|------------|--------|-----------|----------------|-----------------|
| RAW | TIER1 | #data-pipelines | #data-pipelines | #analytics-pipelines (once per day) | #data-pipelines |
| RAW | TIER2 | #data-pipelines | #data-pipelines | #analytics-pipelines (once per day) | #data-pipelines |
| RAW | TIER3 | #data-pipelines | #data-pipelines | #analytics-pipelines (once per day) | #data-pipelines |
| PREP | n/a | - | - | - | - |
| PROD | COMMON * | #analytics-pipelines | #analytics-pipelines | - | #analytics-pipelines |
| PROD | WORKSPACE ** | - | - | - | - |
| PROD | WORKSPACE-DATA-SCIENCE | #data-science-pipelines | #data-science-pipelines | - | #data-science-pipelines |
| PROD | LEGACY *** | - | - | - | - |

* COMMON is also the COMMON_RESTRICTED equivalent. It excludes COMMON_PREP and COMMON_MAPPING.
** WORKSPACE-DATA-SCIENCE is the only workspace schema we are including in the notification strategy.
*** Only two models (snowplow_structured_events_400 and snowplow_structured_events_all) of the LEGACY schema have been included temporarily, as per MR !7049.
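
Read as a lookup, the matrix maps a combination of database, data scope, and monitor type to a Slack channel. The sketch below restates a few rows of it for illustration only; the actual routing is configured inside Monte Carlo's notification settings (now Audiences), not in our code:

```python
# Illustrative restatement of part of the routing matrix above; routing itself is
# configured in Monte Carlo, not in our codebase.
from typing import Optional

ROUTING: dict[tuple[str, str, str], Optional[str]] = {
    # (database, data scope, monitor type): Slack channel (None = not routed)
    ("RAW", "TIER1", "volume"): "#data-pipelines",
    ("RAW", "TIER1", "freshness"): "#data-pipelines",
    ("RAW", "TIER1", "schema_change"): "#analytics-pipelines",  # batched once per day
    ("PROD", "COMMON", "freshness"): "#analytics-pipelines",
    ("PROD", "WORKSPACE-DATA-SCIENCE", "volume"): "#data-science-pipelines",
    ("PROD", "LEGACY", "volume"): None,
}


def channel_for(database: str, scope: str, monitor_type: str) -> Optional[str]:
    """Return the Slack channel an alert would be routed to, or None if not routed."""
    return ROUTING.get((database, scope, monitor_type))


print(channel_for("RAW", "TIER1", "freshness"))  # #data-pipelines
```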

This notification strategy is the basis for any alert being sent from Monte Carlo to Slack. However, as of Notifications 2.0, Monte Carlo has introduced Audiences. This means the above notification strategy has now been migrated to Audiences, and we have the following Audiences in place, sending alerts to the Slack channels specified below:

| Audience | Slack Channel |
|----------|---------------|
| Analytics Engineers | #analytics-pipelines |
| Analytics Instrumentation | #g_analyze_analytics_instrumentation |
| Data Engineers | #data-pipelines |
| Data Science | #data-science-pipelines |
| Sales Analytics | #sales-analytics-pipelines |

Domains

We have the ability to use domains in our Monte Carlo environment. Currently, domains can be used to create separate environments for separate team members; domains automatically filter monitors and incidents by projects and datasets. We have a limited number of domains available.

| Domain | Description | Data Scope |
|--------|-------------|------------|
| Data Platform Team | Domain for the Data Platform Team - scope: raw data layer in Snowflake | Snowflake raw layer |

Use domains

In the Monte Carlo UI, there is a dropdown in the top right corner from which you can select a particular domain or all domains.

(Screenshot: domain selection dropdown in the Monte Carlo UI)

BI Integrations

When we initially deployed Monte Carlo at GitLab, we defaulted to Sisense as a BI tool, as it was what we were using at the time. However, recently we have started migrating to Tableau and therefore we have added the Tableau integration to our Monte Carlo instance.

It is now possible to check table and field lineage from our raw models to Tableau objects, such as Tableau Views, Tableau Live Data Sources or Tableau Extract Data Sources.

The Sisense integration and Tableau integration coexist on Monte Carlo and all related Sisense charts as well as Tableau objects can be seen on the lineage charts.

Internal Monte Carlo handbook page

Additional internal information is available in our Internal GitLab Handbook.
