Cells: Data Pipeline Ingestion
1. Definition
The Cells architecture will have a significant impact on the current data pipeline, which exports data from Postgres to Snowflake for data analytics. This data pipeline fulfils many use cases (e.g. SaaS Service Ping, Gainsight metrics, and Reporting and Analytics of the SaaS platform).
2. Data flow
The current data pipeline is limited by the lack of a CDC (change data capture) mechanism, which leads to data quality issues. Instead, it works by polling the Postgres database for new and updated records, or by fully extracting certain tables, which causes a lot of overhead; polling also cannot observe deleted rows at all.
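To make that limitation concrete, here is a minimal sketch of the polling approach, assuming a hypothetical merge_requests table with an updated_at column and an externally stored watermark; the table name, DSN, and watermark handling are illustrative assumptions, not the pipeline's actual implementation.

```python
# Minimal sketch of watermark-based polling (illustrative only).
import psycopg2

def poll_changed_rows(dsn: str, last_watermark: str) -> list[tuple]:
    """Fetch rows created or updated since the previous poll.

    Note the weaknesses described above: hard deletes never show up in
    this query, and tables without a reliable `updated_at` column have
    to be fully re-extracted instead.
    """
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT * FROM merge_requests"
                " WHERE updated_at > %s ORDER BY updated_at",
                (last_watermark,),
            )
            return cur.fetchall()
```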
At the moment the data pipeline runs against two instances that are created from a snapshot of the main and ci databases. This is done to avoid adding workload to the production databases.
In the Cells architecture there will be many more Postgres instances, and the current pipeline cannot scale to pull data from all of them. The requirements for the data pipeline moving forward are as follows:
- We need a process that captures all CDC events (inserts, updates, and deletes) from all Cells and scales automatically with N number of Cells (see the CDC consumer sketch after this list).
- We need (direct or indirect) access to the database instances, so that the pipeline can catch up on data after a major failure and support root cause analysis of data anomalies.
- We need monitoring in place that alerts on any incident that could delay data ingestion (see the freshness-check sketch after this list).
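To illustrate the first requirement, below is a minimal sketch of a logical-replication consumer that is started once per Cell, so the fan-out grows with the number of Cells. It assumes psycopg2's replication support, a pre-created replication slot per Cell (for example with the wal2json output plugin), and hypothetical ship_to_snowflake_stage and Cell-registry pieces; it shows the scaling shape only, not the proposed design.

```python
# Minimal sketch: one logical-replication consumer per Cell (illustrative).
import threading

import psycopg2
import psycopg2.extras

def ship_to_snowflake_stage(change_json: str) -> None:
    """Hypothetical downstream loader; prints instead of loading."""
    print(change_json)

def consume_cell(dsn: str, slot_name: str) -> None:
    """Stream decoded insert/update/delete events from a single Cell."""
    conn = psycopg2.connect(
        dsn, connection_factory=psycopg2.extras.LogicalReplicationConnection
    )
    cur = conn.cursor()
    # The slot must already exist, e.g. created with the wal2json plugin.
    cur.start_replication(slot_name=slot_name, decode=True)

    def handle(msg):
        ship_to_snowflake_stage(msg.payload)
        # Acknowledge the LSN so the slot does not retain WAL forever.
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

    cur.consume_stream(handle)

def run(cell_dsns: dict[str, str]) -> None:
    """Start one consumer per Cell. Adding a Cell to the registry is the
    only change needed, which is what scaling with N Cells requires."""
    for cell_name, dsn in cell_dsns.items():
        threading.Thread(
            target=consume_cell,
            args=(dsn, f"snowflake_{cell_name}"),
            daemon=True,
        ).start()
```

In practice a managed CDC tool would likely replace a hand-rolled consumer; the point of the sketch is only that consumers are derived from a Cell registry rather than hard-coded per instance.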
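For the monitoring requirement, a minimal freshness-check sketch, assuming a hypothetical per-Cell "last successfully loaded change" timestamp and an illustrative page_on_call hook; the threshold and all names are assumptions.

```python
# Minimal sketch of a per-Cell ingestion freshness check (illustrative).
from datetime import datetime, timedelta, timezone

FRESHNESS_THRESHOLD = timedelta(hours=1)

def page_on_call(message: str) -> None:
    """Hypothetical alerting hook (a pager or chat integration in practice)."""
    print(f"ALERT: {message}")

def check_freshness(last_loaded_at: dict[str, datetime]) -> None:
    """Alert for every Cell whose newest ingested change is too old."""
    now = datetime.now(timezone.utc)
    for cell, loaded_at in last_loaded_at.items():
        lag = now - loaded_at
        if lag > FRESHNESS_THRESHOLD:
            page_on_call(f"ingestion for {cell} is {lag} behind")
```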
3. Proposal
4. Evaluation
4.1. Pros
4.2. Cons