
Best Practices for Building Maintainable Data Pipeline

Aditya Satrya
November 15, 2020


An opinionated list of best practices to begin with:
#1 Ingest data as-raw-as-possible
#2 Create idempotent and deterministic processes
#3 Rest data between tasks
#4 Validate data after each step
#5 Create workflow as code


Transcript

  1. Best Practices for Building Maintainable Data Pipeline Aditya Satrya Data

    Engineering Tech Lead at Mekari linkedin.com/in/asatrya 15 November 2020
  2. Introduction 2 • Aditya Satrya • Data Engineering Tech Lead

    at Mekari • Lives in Bandung • linkedin.com/in/asatrya/
  3. What is a data pipeline? 3 Series of steps or actions

    to move and combine data from various sources for analysis or visualization
  4. What is Airflow? 4 Airflow is a platform to programmatically

    author, schedule and monitor workflows (https://airflow.apache.org/)
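As a rough illustration of what "programmatically author" looks like, here is a minimal Airflow DAG sketch. The DAG id, schedule, and task are made up for this example, and the import path assumes Airflow 2.x:

```python
# Minimal Airflow DAG sketch; dag_id, schedule, and task are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def say_hello():
    print("Hello from Airflow")


with DAG(
    dag_id="example_pipeline",           # hypothetical pipeline name
    start_date=datetime(2020, 11, 15),
    schedule_interval="@daily",          # run once per day
    catchup=False,
) as dag:
    PythonOperator(task_id="say_hello", python_callable=say_hello)
```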
  5. What does maintainable mean? 5 • Easy to understand •

    Easy to change • Easy to recover from failure • Easy to debug
  6. 6 #1 Ingest data as-raw-as-possible #2 Create idempotent and deterministic

    processes #3 Rest data between tasks #4 Validate data after each step #5 Create workflow as code An opinionated list of best practices to begin with
  7. #1 Ingest data as-raw-as-possible 7 Why? • Prevent losing data

    because of process failure • Enables reprocessing data when business rules change How? • Use formats with minimal metadata definition (e.g., CSV, JSON) • Don't use formats like Parquet for ingestion
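For instance, an ingestion task can persist the source's payload unchanged as JSON, partitioned by run date, with no parsing or schema applied. This is only a sketch; the endpoint, paths, and file names are hypothetical:

```python
# Sketch: store the raw API payload unchanged so it can be reprocessed later.
# The endpoint, parameters, and raw-zone path are hypothetical.
import json
import pathlib

import requests

RAW_DIR = pathlib.Path("/data/raw/orders")  # hypothetical raw zone


def ingest_orders(ds: str) -> None:
    """Fetch one day of orders and write the response as-is (no transformation)."""
    response = requests.get("https://api.example.com/orders", params={"date": ds})
    response.raise_for_status()

    out_dir = RAW_DIR / f"dt={ds}"
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "orders.json").write_text(json.dumps(response.json()))
```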
  8. #2 Create idempotent & deterministic processes 8 What it means:

    • Idempotent: running it multiple times doesn't change the result • Deterministic: the result of a transformation depends only on its defined input parameters ◦ e.g., the result should NOT depend on when it runs
  9. #2 Create idempotent & deterministic processes 9 Why? • Reproducible

    ◦ When something breaks, just fix the underlying issue, then rerun the process without worrying about data consistency ◦ When the schema or business logic changes, just rerun the process • Easier to debug • Enables creating a dev/staging environment that mirrors production
  10. #2 Create idempotent & deterministic processes 10 How? • Don’t

    append, overwrite the partition instead • Don't alter data, write a new copy instead • Don't produce side effects, treat your process as a function • Define all factors that influence the result as input parameters ◦ e.g., don't use date.today(), use Airflow's execution_date instead
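A sketch of what this can look like: the run date arrives as a parameter (Airflow's templated execution date) and the output partition is overwritten rather than appended, so reruns produce the same result. Paths, column names, and the pandas-based transformation are hypothetical:

```python
# Sketch of an idempotent, deterministic transformation.
# Output depends only on `ds` and overwrites the partition for that date.
import pathlib

import pandas as pd

RAW_DIR = pathlib.Path("/data/raw/orders")
CLEAN_DIR = pathlib.Path("/data/clean/orders")


def transform_orders(ds: str) -> None:
    # Assumes the raw file is a JSON array of order records (hypothetical schema).
    df = pd.read_json(RAW_DIR / f"dt={ds}" / "orders.json")
    df["amount"] = df["amount"].astype(float)

    out_dir = CLEAN_DIR / f"dt={ds}"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Overwrite the whole partition: rerunning never appends duplicate rows.
    df.to_parquet(out_dir / "orders.parquet", index=False)
```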
  11. #3 Rest data between tasks 11 Why? • Enables the workflow

    to run on a cluster, which is a good foundation for scaling How? • Write the result of a task to storage • The next task reads it back from storage
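A sketch of the hand-off: the extract task writes a file to a templated path, and the downstream task reads that same path back from storage instead of receiving data in memory. The DAG id, paths, and callable bodies are hypothetical; the import path and the automatic passing of the templated ds value assume Airflow 2.x:

```python
# Sketch: tasks hand data off through storage, not in memory.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

STAGING = "/data/staging/orders/dt={ds}/orders.json"  # hypothetical staging path


def extract(ds, **_):
    path = STAGING.format(ds=ds)
    ...  # fetch source data and write it to `path`


def transform(ds, **_):
    path = STAGING.format(ds=ds)
    ...  # read `path` back from storage, write the transformed result elsewhere


with DAG(
    dag_id="rest_data_example",
    start_date=datetime(2020, 11, 15),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task
```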
  12. #3 Rest data between tasks 12

  13. #4 Validate data after each step 13 Why? • Prevent

    silent data errors • Never publish wrong data
  14. #4 Validate data after each step 14 How? Use the write-audit-publish

    (WAP) pattern Image from https://docs.greatexpectations.io/en/latest/guides/workflows_patterns/deployment_airflow.html
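A minimal sketch of the three steps: write to a staging area, audit, and publish only if the audit passes (the audit task fails loudly, which stops the DAG before publishing). The checks, paths, and table layout are hypothetical; in practice a tool like Great Expectations can replace the hand-rolled assertions:

```python
# Sketch of write-audit-publish: the publish step runs only if the audit succeeds.
import pathlib
import shutil

import pandas as pd

STAGING = pathlib.Path("/data/staging/orders")
PUBLISHED = pathlib.Path("/data/published/orders")


def audit_orders(ds: str) -> None:
    """Fail the task (and therefore the DAG run) instead of publishing wrong data."""
    df = pd.read_parquet(STAGING / f"dt={ds}" / "orders.parquet")
    assert not df.empty, "no rows written for this partition"
    assert df["order_id"].is_unique, "duplicate order ids"
    assert (df["amount"] >= 0).all(), "negative amounts"


def publish_orders(ds: str) -> None:
    """Set downstream of the audit task, so it only runs after validation passes."""
    src, dst = STAGING / f"dt={ds}", PUBLISHED / f"dt={ds}"
    if dst.exists():
        shutil.rmtree(dst)  # overwrite for idempotency
    shutil.copytree(src, dst)
```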
  15. #5 Create workflow as code (as opposed to using drag &

    drop tools) 15 Why? • Reproducible • Flexible • Versioned in Git • Leverage software engineering best practices: ◦ unit & integration tests, code review, CI/CD, containerization How? Using Airflow can help you
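One concrete payoff of keeping the workflow as code is that it can be unit-tested in CI. The sketch below checks that every DAG file imports cleanly and that an expected dependency exists; the dag_id and task ids refer to the hypothetical examples above:

```python
# Sketch of DAG unit tests that can run in CI (e.g. with pytest).
from airflow.models import DagBag


def test_dags_import_without_errors():
    dag_bag = DagBag(include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import failures: {dag_bag.import_errors}"


def test_transform_runs_after_extract():
    dag = DagBag(include_examples=False).get_dag("rest_data_example")  # hypothetical dag_id
    downstream_ids = {t.task_id for t in dag.get_task("extract").downstream_list}
    assert "transform" in downstream_ids
```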
  16. Thank you! Hope you find it useful Code sample: https://github.com/asatrya/mkr-data-pipeline

    16