
Best Practices for Building Maintainable Data Pipeline

Aditya Satrya
November 15, 2020

An opinionated list of best practices to begin with:
#1 Ingest data as-raw-as-possible
#2 Create idempotent and deterministic processes
#3 Rest data between tasks
#4 Validate data after each step
#5 Create workflow as code

Recorded talk: https://youtu.be/PKgDjGCYKTE?t=454

Transcript

  1. Best Practices for Building Maintainable Data Pipeline
     Aditya Satrya, Data Engineering Tech Lead at Mekari
     linkedin.com/in/asatrya
     15 November 2020
  2. Introduction
     • Aditya Satrya
     • Data Engineering Tech Lead at Mekari
     • Lives in Bandung
     • linkedin.com/in/asatrya/
  3. What is a data pipeline?
     A series of steps or actions to move and combine data from various sources for analysis or visualization.
  4. What is Airflow?
     Airflow is a platform to programmatically author, schedule and monitor workflows (https://airflow.apache.org/).
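
A minimal sketch of what "programmatically authoring" a workflow looks like in Airflow (illustrative only; the DAG id, schedule, and task bodies are made up, and Airflow 2.x import paths are assumed):

```python
# Minimal Airflow DAG sketch: two placeholder tasks, scheduled daily.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Placeholder: pull data from a source system for this run's date.
    print("extracting data for", context["ds"])


def load(**context):
    # Placeholder: write the extracted data to storage.
    print("loading data for", context["ds"])


with DAG(
    dag_id="example_pipeline",           # hypothetical name
    start_date=datetime(2020, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task            # extract runs before load
```
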
  5. What does maintainable mean?
     • Easy to understand
     • Easy to change
     • Easy to recover from failure
     • Easy to debug
  6. An opinionated list of best practices to begin with:
     #1 Ingest data as-raw-as-possible
     #2 Create idempotent and deterministic processes
     #3 Rest data between tasks
     #4 Validate data after each step
     #5 Create workflow as code
  7. #1 Ingest data as-raw-as-possible
     Why?
     • Prevent losing data because of process failures
     • Enable reprocessing of data when business rules change
     How?
     • Use formats with minimal metadata definition (e.g., CSV, JSON)
     • Don't use formats like Parquet for ingestion
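
A hedged sketch of the idea: land the source payload as-is before any transformation, so it can always be replayed when business rules change. The API URL, paths, and field layout here are assumptions for illustration:

```python
# Sketch: store the raw API response untouched (JSON lines),
# partitioned by run date, so it can be reprocessed later.
import json
import os

import requests


def ingest_raw(api_url: str, ds: str, raw_dir: str = "/data/raw/orders") -> str:
    """Fetch the source payload and land it without any transformation."""
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()
    records = response.json()

    os.makedirs(raw_dir, exist_ok=True)
    out_path = f"{raw_dir}/{ds}.json"    # e.g. /data/raw/orders/2020-11-15.json
    with open(out_path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return out_path
```
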
  8. #2 Create idempotent & deterministic processes
     What does it mean?
     • Idempotent: can run multiple times without changing the result
     • Deterministic: the result of a transformation depends only on its defined input parameters
       ◦ e.g., the result should NOT depend on when it runs
  9. #2 Create idempotent & deterministic processes
     Why?
     • Reproducible
       ◦ When something breaks, just fix the underlying issue, then rerun the process without worrying about data consistency
       ◦ When the schema or business logic changes, just rerun the process
     • Easier to debug
     • Enables dev/staging environments that mirror production
  10. #2 Create idempotent & deterministic processes
      How?
      • Don't append; overwrite the partition instead
      • Don't alter data; write new data instead
      • Don't produce side effects; treat your process as a function
      • Define all factors that influence the result as input parameters
        ◦ e.g., don't use date.today(), use Airflow's execution_date instead
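
One hedged way this can look in code (the paths and column names are assumptions): the run date arrives as a parameter (Airflow passes it as ds) instead of date.today(), and the output partition is overwritten so reruns produce the same result:

```python
# Sketch: deterministic, idempotent transform.
# The run date is an input parameter, and the output partition for that
# date is overwritten, so running the task twice is safe.
import os

import pandas as pd


def transform_orders(ds: str,
                     raw_dir: str = "/data/raw/orders",
                     out_dir: str = "/data/clean/orders") -> None:
    raw_path = f"{raw_dir}/{ds}.json"                 # input partition for this run date
    df = pd.read_json(raw_path, lines=True)

    daily = df.groupby("customer_id", as_index=False)["amount"].sum()
    daily["order_date"] = ds                          # from the parameter, not date.today()

    # Overwrite (not append) this date's partition.
    out_path = f"{out_dir}/order_date={ds}"
    os.makedirs(out_path, exist_ok=True)
    daily.to_parquet(f"{out_path}/part-0.parquet", index=False)
```
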
  11. #3 Rest data between tasks
      Why?
      • Enables the workflow to run on a cluster, which is a good foundation for scaling
      How?
      • Write the result of a task to storage
      • The next task reads from that storage
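
A sketch of this hand-off (local paths stand in for the object store or data lake you would typically use; all names are assumptions): each task persists its output and the next task reads it back, instead of passing data in memory:

```python
# Sketch: tasks exchange data through storage rather than memory,
# so each task can run on a different worker in a cluster.
import os

import pandas as pd

STAGING = "/data/staging/orders"   # assumed staging location (often S3/GCS/HDFS)


def extract(ds: str) -> str:
    """First task: write its result to storage and hand off only the path."""
    df = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 20.0]})  # stand-in for real extraction
    os.makedirs(f"{STAGING}/{ds}", exist_ok=True)
    path = f"{STAGING}/{ds}/extracted.csv"
    df.to_csv(path, index=False)
    return path


def transform(ds: str) -> str:
    """Next task: read the previous task's output back from storage."""
    df = pd.read_csv(f"{STAGING}/{ds}/extracted.csv")
    summary = df.groupby("customer_id", as_index=False)["amount"].sum()
    out_path = f"{STAGING}/{ds}/transformed.csv"
    summary.to_csv(out_path, index=False)
    return out_path
```
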
  12. #4 Validate data after each step
      Why?
      • Prevent silent data errors
      • Never publish wrong data
  13. #4 Validate data after each step
      How? Use the write-audit-publish (WAP) pattern
      (Image from https://docs.greatexpectations.io/en/latest/guides/workflows_patterns/deployment_airflow.html)
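
A rough sketch of the write-audit-publish idea (the checks, paths, and column names are assumptions; in practice the audit step is often a tool such as Great Expectations): write to staging, validate, and publish only if validation passes:

```python
# Sketch of write-audit-publish (WAP):
# 1) write to staging, 2) audit it, 3) publish only if the audit passes.
import os
import shutil

import pandas as pd


def audit(df: pd.DataFrame) -> None:
    """Hypothetical data checks; a failure raises and stops the pipeline."""
    assert not df.empty, "no rows produced"
    assert df["customer_id"].notna().all(), "missing customer_id"
    assert df["amount"].ge(0).all(), "negative amounts found"


def write_audit_publish(df: pd.DataFrame, ds: str,
                        staging_dir: str = "/data/staging/orders",
                        published_dir: str = "/data/published/orders") -> None:
    # Write: land the data where consumers do not read from.
    os.makedirs(f"{staging_dir}/{ds}", exist_ok=True)
    staged = f"{staging_dir}/{ds}/orders.parquet"
    df.to_parquet(staged, index=False)

    # Audit: validate the staged data; wrong data never gets past this point.
    audit(pd.read_parquet(staged))

    # Publish: promote the audited partition (overwrite keeps it idempotent).
    os.makedirs(f"{published_dir}/{ds}", exist_ok=True)
    shutil.copy(staged, f"{published_dir}/{ds}/orders.parquet")
```
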
  14. #5 Create workflow as code (as opposed to using drag-and-drop tools)
      Why?
      • Reproducible
      • Flexible
      • Versioned in Git
      • Leverages software engineering best practices:
        ◦ unit & integration tests, code review, CI/CD, containerization
      How? Using Airflow can help you
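
Because the workflow is code, it can be tested like code. A small hedged example of the kind of unit test that fits a CI pipeline (assumes pytest and a dags/ folder; not taken from the deck):

```python
# Sketch: a CI unit test that fails if any DAG file cannot be imported
# (syntax errors, missing dependencies, broken imports, ...).
from airflow.models import DagBag


def test_dags_load_without_errors():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"
    assert len(dag_bag.dags) > 0, "no DAGs were loaded"
```
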