Slide 1

Best Practices for Building Maintainable Data Pipelines
Aditya Satrya
Data Engineering Tech Lead at Mekari
linkedin.com/in/asatrya
15 November 2020

Slide 2

Introduction
● Aditya Satrya
● Data Engineering Tech Lead at Mekari
● Lives in Bandung
● linkedin.com/in/asatrya/

Slide 3

What is a data pipeline?
A series of steps or actions to move and combine data from various sources for analysis or visualization

Slide 4

What is Airflow?
Airflow is a platform to programmatically author, schedule and monitor workflows (https://airflow.apache.org/)

Slide 5

What does maintainable mean?
● Easy to understand
● Easy to change
● Easy to recover from failure
● Easy to debug

Slide 6

An opinionated list of best practices to begin with
#1 Ingest data as-raw-as-possible
#2 Create idempotent and deterministic processes
#3 Rest data between tasks
#4 Validate data after each step
#5 Create workflows as code

Slide 7

#1 Ingest data as-raw-as-possible
Why?
● Prevents losing data because of process failures
● Enables reprocessing the data when business rules change
How?
● Use formats that have minimal metadata definition (e.g., CSV, JSON), as sketched below
● Don’t use formats like Parquet for ingestion
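A minimal sketch of what raw ingestion can look like; the API endpoint, the orders dataset, and the /data/raw path are hypothetical examples, not part of the original deck:

```python
# Land the source payload untouched as JSON, partitioned by run date.
import json
import pathlib

import requests


def ingest_raw(ds: str) -> None:
    """Fetch the source data for date `ds` and store it as-is."""
    response = requests.get("https://example.com/api/orders", params={"date": ds})
    response.raise_for_status()

    target = pathlib.Path(f"/data/raw/orders/ds={ds}/orders.json")
    target.parent.mkdir(parents=True, exist_ok=True)
    # No parsing, casting, or filtering here: keep the data as raw as possible
    # so it can be reprocessed later if business rules change.
    target.write_text(json.dumps(response.json()))
```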

Slide 8

#2 Create idempotent & deterministic processes
What does it mean?
● Idempotent: running the process multiple times doesn’t change the result
● Deterministic: the result of a transformation depends only on its declared input parameters
○ i.e., the result should NOT depend on the time it runs

Slide 9

#2 Create idempotent & deterministic processes
Why?
● Reproducible
○ When something breaks, just fix the underlying issue, then rerun the process without worrying about data consistency
○ When the schema or business logic changes, just rerun the process
● Easier to debug
● Enables a dev/staging environment that mirrors production

Slide 10

#2 Create idempotent & deterministic processes
How?
● Don’t append; overwrite the partition instead
● Don’t alter data; write new data instead
● Don’t produce side effects; treat your process as a function
● Define all factors that influence the result as input parameters (see the sketch below)
○ i.e., don’t use date.today(); use Airflow’s execution_date instead
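A minimal sketch of an idempotent, deterministic transform under these rules; the orders data, column names, and /data paths are hypothetical, and the run date arrives as Airflow's `ds` parameter rather than being read from the clock:

```python
import pathlib

import pandas as pd


def transform_orders(ds: str) -> None:
    # Deterministic: everything that influences the result is an explicit
    # input parameter (no date.today(), no hidden state).
    raw = pd.read_json(f"/data/raw/orders/ds={ds}/orders.json")

    daily_totals = raw.groupby("customer_id", as_index=False)["amount"].sum()
    daily_totals["order_date"] = ds

    out_dir = pathlib.Path(f"/data/clean/daily_totals/ds={ds}")
    out_dir.mkdir(parents=True, exist_ok=True)
    # Idempotent: rerunning for the same `ds` overwrites the same partition
    # instead of appending to it, so reruns never duplicate data.
    daily_totals.to_parquet(out_dir / "part.parquet", index=False)
```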

Slide 11

#3 Rest data between tasks
Why?
● Enables the workflow to run on a cluster, which is a good foundation for scaling
How?
● Write the result of a task to storage
● Have the next task read from that storage (see the sketch below)
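A minimal sketch of resting data between tasks; local paths stand in for whatever shared storage is used in practice (S3, GCS, HDFS), and the dataset, column names, and `fetch_from_source` helper are hypothetical. Each function would run as its own Airflow task:

```python
import json
import pathlib

RAW_PATH = "/data/raw/orders/ds={ds}/orders.json"
CLEAN_PATH = "/data/clean/orders/ds={ds}/orders.json"


def fetch_from_source(ds: str) -> list:
    """Hypothetical stand-in for the real source query or API call."""
    return [{"order_id": 1, "customer_id": 42, "amount": 10.0, "order_date": ds}]


def extract(ds: str) -> None:
    # Task 1 writes its result to shared storage...
    records = fetch_from_source(ds)
    out = pathlib.Path(RAW_PATH.format(ds=ds))
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records))


def transform(ds: str) -> None:
    # ...and task 2 reads that result back from storage, not from the
    # previous task's memory, so the two tasks can run on different workers.
    records = json.loads(pathlib.Path(RAW_PATH.format(ds=ds)).read_text())
    cleaned = [r for r in records if r.get("amount") is not None]
    out = pathlib.Path(CLEAN_PATH.format(ds=ds))
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(cleaned))
```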

Slide 12

#3 Rest data between tasks (illustration)

Slide 13

#4 Validate data after each step
Why?
● Prevent silent data errors
● Never publish wrong data

Slide 14

#4 Validate data after each step
How? Use the write-audit-publish (WAP) pattern (see the sketch below)
Image from https://docs.greatexpectations.io/en/latest/guides/workflows_patterns/deployment_airflow.html
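A minimal sketch of the write-audit-publish pattern using plain pandas assertions for the audit step (the linked Great Expectations guide shows a richer way to express such checks); the paths, table, and column names are hypothetical:

```python
import pathlib
import shutil

import pandas as pd

STAGING = "/data/staging/daily_totals/ds={ds}"
PUBLISHED = "/data/published/daily_totals/ds={ds}"


def write(ds: str, df: pd.DataFrame) -> None:
    # Write: land the new partition in a staging area first.
    out = pathlib.Path(STAGING.format(ds=ds)) / "part.parquet"
    out.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out, index=False)


def audit(ds: str) -> None:
    # Audit: fail loudly instead of silently passing bad data downstream.
    df = pd.read_parquet(pathlib.Path(STAGING.format(ds=ds)) / "part.parquet")
    assert len(df) > 0, "no rows for this partition"
    assert df["customer_id"].notna().all(), "null customer_id found"
    assert (df["amount"] >= 0).all(), "negative amount found"


def publish(ds: str) -> None:
    # Publish: only runs if the audit task succeeded, so consumers never
    # see unvalidated data. Overwriting keeps this step idempotent.
    src = pathlib.Path(STAGING.format(ds=ds))
    dst = pathlib.Path(PUBLISHED.format(ds=ds))
    if dst.exists():
        shutil.rmtree(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dst)
```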

Slide 15

#5 Create workflows as code (as opposed to using drag & drop tools)
Why?
● Reproducible
● Flexible
● Versioned in Git
● Leverage software engineering best practices:
○ unit & integration tests, code review, CI/CD, containerization
How? Using Airflow can help (see the minimal DAG sketch below)
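A minimal sketch of a workflow defined as code: an Airflow DAG wiring tasks like the ones sketched above. The DAG id, schedule, and task bodies are hypothetical; the point is that the whole pipeline lives in a Python file that can be reviewed, tested, and deployed through CI/CD:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract(ds):
    ...  # e.g., the raw-ingestion sketch from practice #1


def transform(ds):
    ...  # e.g., the idempotent overwrite sketch from practice #2


with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2020, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # "{{ ds }}" templates in Airflow's execution date, so every run is
    # parameterized by its logical date rather than by date.today().
    extract_task = PythonOperator(
        task_id="extract", python_callable=extract, op_kwargs={"ds": "{{ ds }}"}
    )
    transform_task = PythonOperator(
        task_id="transform", python_callable=transform, op_kwargs={"ds": "{{ ds }}"}
    )
    extract_task >> transform_task
```

Because the DAG is plain Python, it can be linted, unit-tested, code-reviewed, and versioned in Git like any other piece of software.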

Slide 16

Thank you! Hope you find it useful
Code sample: https://github.com/asatrya/mkr-data-pipeline