Slide 1

Slide 1 text

THE TWO MAIN CHALLENGES IN BUILDING DATA PIPELINES
ELLEN KÖNIG @ELLEN_KOENIG

Slide 2

Slide 2 text

- Identifying what type of data pipeline you need
- Making sure your data pipeline does not distort the data 🥁

Slide 3

Slide 3 text

IDENTIFYING THE TYPE OF DATA PIPELINE YOU NEED

1. Decide on the scope of the project
2. Decide on the architecture pattern
3. Identify the infrastructure options and constraints

Source: https://www.reddit.com/r/dataengineering/comments/suvukx/rdataengineering_buzzwords_details_in_comments/

Slide 4

Slide 4 text

EXAMPLE

- Pipeline for collecting sales data in a Nigerian supermarket chain

Source: https://commons.wikimedia.org/wiki/File:Cashier_stand_in_a_Nigerian_Grocery_store.jpg

Slide 5

Slide 5 text

1. DECIDE ON THE SCOPE OF THE PROJECT

1. What is the goal of creating the pipeline?
   - Aggregate sales data from all the stores in the supermarket chain
2. Who are the end users of your data?
   1. National sales managers
   2. Regional data warehouse managers
3. Which use cases do you want to address?
   1. Daily revenue summary
   2. Re-order soon-to-be out-of-stock products before they run out
4. What should be the functional focus of the pipeline?
   1. First use case: Daily revenue summary for the national sales managers
   2. Sales data from Nigeria
   3. Start with the data from October 1, 2022

Slide 6

Slide 6 text

2. DECIDE ON AN ARCHITECTURE PATTERN

- Freshness: How often does the data need to be updated? (Real-time vs. hourly or less often)
  - Streaming vs. batch pipeline
  - We need: Daily (batch)
- Volume: Does the data fit into the memory of one machine?
  - Single machine vs. distributed machines architecture
  - 30 supermarkets with ~500 transactions per day each, 50 bytes per transaction => about 750 KB per day, far less than 1 GB => fits into one machine
- Source & destination connectors:
  - Type of data access? (API, storage, stream, …)
  - Data format? (JSON, CSV, Parquet, binary, …)
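
The volume estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the ~500 transactions per day are per supermarket (the slide does not say so explicitly); the function and constant names are made up for illustration.

```python
# Figures taken from the slide; reading "~500 transactions per day"
# as a per-store number is an assumption.
STORES = 30
TRANSACTIONS_PER_STORE_PER_DAY = 500
BYTES_PER_TRANSACTION = 50


def daily_volume_bytes(stores, transactions_per_store, bytes_per_transaction):
    """Return the estimated raw data volume of one day in bytes."""
    return stores * transactions_per_store * bytes_per_transaction


daily_bytes = daily_volume_bytes(
    STORES, TRANSACTIONS_PER_STORE_PER_DAY, BYTES_PER_TRANSACTION
)
yearly_bytes = daily_bytes * 365

print(f"per day:  {daily_bytes / 1_000:.0f} KB")
print(f"per year: {yearly_bytes / 1_000_000:.0f} MB")
```

Under these assumptions even a full year of raw transactions stays below 1 GB, which supports the single-machine, daily-batch choice.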

Slide 7

Slide 7 text

3. IDENTIFY THE INFRASTRUCTURE OPTIONS AND CONSTRAINTS

- Identify infrastructure constraints (cloud vs. on-premise, certain vendors) and what data infrastructure is available there
  - Cloud: GCP
- Identify security needs (private networks, data encryption at rest and in transit, access controls for the data, …)
  - VPN from the point of sale to GCP needed
  - Data encryption at rest and in transit needed
  - Only the C-level, and the finance and analytics departments may access the data

Slide 8

Slide 8 text

MAKING SURE YOUR DATA PIPELINE DOES NOT DISTORT THE DATA

1. Test-first pipeline development
2. Only write simple transformations

Slide 9

Slide 9 text

1. TEST-FIRST PIPELINE DEVELOPMENT

1. Set up an end-to-end test without any transformation
   - End-to-end test: Compares source test data to destination test data
   - Simplest test case: Empty transaction file -> Aggregate to zero revenue per day
2. Set up your infrastructure to make the integration test case work
3. Define your unit tests first before you code each of your transformations
   - Unit test: Compares fixed, fake input data to fixed output data
   - Start with the "happy path", then edge cases
   - Happy path: One transaction per day of 1 Naira -> Metric result: 1 NGN revenue/day
   - Edge cases: No sales, invalid products, sales returned the next day, negative sales price, discounts, …
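
The test cases named above could look roughly like this. It is only a sketch: `daily_revenue`, the `date` and `amount_ngn` field names, and the in-memory list of dictionaries are assumptions made for illustration, not part of the talk. In a test-first workflow the three test functions would be written before the transformation itself.

```python
from datetime import date
from decimal import Decimal


def daily_revenue(transactions, day):
    """Sum the amounts of all transactions that happened on `day`."""
    return sum(
        (t["amount_ngn"] for t in transactions if t["date"] == day),
        Decimal("0"),
    )


def test_empty_transaction_file_gives_zero_revenue():
    # Simplest case: no transactions at all.
    assert daily_revenue([], date(2022, 10, 1)) == Decimal("0")


def test_happy_path_one_transaction_of_one_naira():
    transactions = [{"date": date(2022, 10, 1), "amount_ngn": Decimal("1")}]
    assert daily_revenue(transactions, date(2022, 10, 1)) == Decimal("1")


def test_edge_case_no_sales_on_requested_day():
    # There are sales, but none on the day we ask for.
    transactions = [{"date": date(2022, 10, 2), "amount_ngn": Decimal("5")}]
    assert daily_revenue(transactions, date(2022, 10, 1)) == Decimal("0")
```

The functions follow the `test_*` naming convention, so a test runner such as pytest would discover them. Further edge cases from the list (returns, negative prices, discounts) would each get their own fixed input and fixed expected output.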

Slide 10

Slide 10 text

2. ONLY WRITE SIMPLE TRANSFORMATIONS

The simpler the transformation, the more likely it is to be correct and easy to debug.
One transformation should only do one thing! (No "AND" in the description)

1. Recommendation: Design transformations separately for each use case
2. Recommendation: Design small modular transformations and chain them
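
One possible way to chain small, single-purpose transformations is sketched below. All function names, field names, and sample rows are hypothetical; a real pipeline would more likely use a dataframe or SQL layer, but the structure is the same.

```python
def drop_invalid_products(rows):
    """Keep only rows that reference a product."""
    return [row for row in rows if row["product_id"] is not None]


def keep_day(day):
    """Build a step that keeps only the rows of one day."""
    def step(rows):
        return [row for row in rows if row["date"] == day]
    return step


def sum_amounts(rows):
    """Add up the sales amounts of all rows."""
    return sum(row["amount_ngn"] for row in rows)


def chain(*steps):
    """Combine single-purpose steps into one pipeline, applied left to right."""
    def run(data):
        for step in steps:
            data = step(data)
        return data
    return run


# Each step does one thing; the pipeline is just their composition.
daily_revenue_pipeline = chain(
    drop_invalid_products,
    keep_day("2022-10-01"),
    sum_amounts,
)

rows = [
    {"product_id": "A1", "date": "2022-10-01", "amount_ngn": 100},
    {"product_id": None, "date": "2022-10-01", "amount_ngn": 50},
    {"product_id": "B2", "date": "2022-10-02", "amount_ngn": 70},
]
print(daily_revenue_pipeline(rows))  # prints 100
```

Because each step has a description without "and", it can be unit-tested on its own and swapped out without touching the others.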

Slide 11

Slide 11 text

THE TWO MAIN CHALLENGES IN BUILDING DATA PIPELINES

- Identifying what type of data pipeline you need
- Making sure your data pipeline does not distort the data

Slide 12

Slide 12 text

IDENTIFYING THE TYPE OF DATA PIPELINE YOU NEED

1. Decide on the scope of the project
2. Decide on the architecture pattern
3. Identify the infrastructure options and constraints

MAKING SURE YOUR DATA PIPELINE DOES NOT DISTORT THE DATA

1. Test-first pipeline development
2. Only write simple transformations

Slide 13

Slide 13 text

ENJOY THE HACKATHON! 👩‍🔧🧑‍💻👩‍🔬
ELLEN KÖNIG @ELLEN_KOENIG