The two main challenges in building data pipelines

ELLEN KÖNIG @ELLEN_KOENIG THE TWO MAIN CHALLENGES IN BUILDING DATA
PIPELINES

-Identifying what type of data pipeline you need -Making sure
your data pipeline does not distort the data 🥁

1. Decide on the scope of the project 2. Decide
on the architecture pattern 3. Identify the infrastructure options and constraints IDENTIFYING THE TYPE OF DATA PIPELINE YOU NEED Source: https://www.reddit.com/r/dataengineering/comments/suvukx/rdataengineering_buzzwords_details_in_comments/

- Pipeline for collecting sales data in a Nigerian supermarket
chain EXAMPLE Source: https://commons.wikimedia.org/wiki/File:Cashier_stand_in_a_Nigerian_Grocery_store.jpg

1. What is the goal of creating the pipeline? -
Aggregate sales data from all the stores in the supermarket chain 2. Who are the end users of your data? 1. National sales managers 2. Regional data warehouse managers 3. Which use cases do you want to address? 1. Daily revenue summary 2. Re-order soon-to-be out of stock products before they run out 4. What should be the functional focus of the pipeline? 1. First use case: Daily revenue summary for the national sales managers 2. Sales data from Nigeria 3. Start with the data from October 1, 2022 1. DECIDE ON THE SCOPE OF THE PROJECT

- Freshness: How often does the data need to be
updated? (Real- time vs. hourly or less) - Streaming vs. Batch pipeline - We need: Daily (Batch) - Volume: Does the data fi t into the memory of one machine? - Single machine vs. distributed machines architecture - 30 supermarkets with ~500 transactions per day, 50 bytes per transaction => less than 1GB per day => fi ts into one machine - Source & Destination Connectors: - Type of data access? (API, storage, stream, …) - Data format? (JSON, CSV, Parquet, binary, …) 2. DECIDE ON AN ARCHITECTURE PATTERN

- Identify infrastructure constraints (cloud vs. on-premise, certain vendors) and
what data infrastructure is available there - Cloud: GCP - Identify security needs (private networks, data encryption at rest and in transit, access controls for the data, …) - VPN from the point of sale to GCP needed - Data encryption at rest and in transit needed - Only the C-level, and the fi nance and analytics departments may access the data IDENTIFY THE INFRASTRUCTURE OPTIONS AND CONSTRAINTS

1. Test- fi rst pipeline development 2. Only write simple
transformations MAKING SURE YOUR DATA PIPELINE DOES NOT DISTORT THE DATA

1. Set up an end-to-end test without any transformation -
End-to-end test: Compares source test data to destination test data - Simplest test case: Empty transaction fi le—> Aggregate to zero revenue per day 2. Set up your infrastructure to make the integration test case work 3. De fi ne your unit tests fi rst before you code each of your transformations - Unit test: Compares fi xed, fake input data to fi xed output data - Start with the „Happy Path“, then edge cases - Happy path: One transaction per day of 1 Naira -> Metric result: 1 NGN revenue/day - Edge cases: No sales, invalid products, sales returned the next day, negative sales price, discounts,… 1. TEST-FIRST PIPELINE DEVELOPMENT ?

The simpler the transformation, the more likely it is correct
and easy to debug One transformation should only do one thing! (No „AND“ in the description) 1. Recommendation: Design transformations separately for each use case 2. Recommendation: Design small modular transformations and chain them 2. ONLY WRITE SIMPLE TRANSFORMATIONS

Identifying what type of data pipeline you need Making sure
your data pipeline does not distort the data THE TWO MAIN CHALLENGES IN BUILDING DATA PIPELINES

1. Decide on the scope of the project 2. Decide
on the architecture pattern 3. Identify the infrastructure options and constraints IDENTIFYING THE TYPE OF DATA PIPELINE YOU NEED 1. Test- fi rst pipeline development 2. Only write simple transformations MAKING SURE YOUR DATA PIPELINE DOES NOT DISTORT THE DATA

ENJOY THE HACKATHON! 👩🔧🧑💻👩🔬 ELLEN KÖNIG @ELLEN_KOENIG

The two main challenges in building data pipelines

The two main challenges in building data pipelines

ellenkoenig

More Decks by ellenkoenig

Other Decks in Technology

Featured

Transcript

ELLEN KÖNIG @ELLEN_KOENIG THE TWO MAIN CHALLENGES IN BUILDING DATA

-Identifying what type of data pipeline you need -Making sure

1. Decide on the scope of the project 2. Decide

- Pipeline for collecting sales data in a Nigerian supermarket

1. What is the goal of creating the pipeline? -

- Freshness: How often does the data need to be

- Identify infrastructure constraints (cloud vs. on-premise, certain vendors) and

1. Test- fi rst pipeline development 2. Only write simple

1. Set up an end-to-end test without any transformation -

The simpler the transformation, the more likely it is correct

Identifying what type of data pipeline you need Making sure

1. Decide on the scope of the project 2. Decide

ENJOY THE HACKATHON! 👩🔧🧑💻👩🔬 ELLEN KÖNIG @ELLEN_KOENIG