
Best Practices for Building Maintainable Data Pipeline

Aditya Satrya
November 15, 2020


An opinionated list of best practices to begin with:
#1 Ingest data as-raw-as-possible
#2 Create idempotent and deterministic processes
#3 Rest data between tasks
#4 Validate data after each step
#5 Create workflow as code


Transcript

  1. Best Practices for Building Maintainable Data Pipeline Aditya Satrya Data

    Engineering Tech Lead at Mekari linkedin.com/in/asatrya 15 November 2020
  2. Introduction 2 • Aditya Satrya • Data Engineering Tech Lead

    at Mekari • Lives in Bandung • linkedin.com/in/asatrya/
  3. What is a data pipeline? 3 Series of steps or actions

    to move and combine data from various sources for analysis or visualization
  4. What is Airflow? 4 Airflow is a platform to programmatically

    author, schedule and monitor workflows (https://airflow.apache.org/)
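As a rough illustration of what "programmatically author" looks like, here is a minimal Airflow DAG sketch. The DAG id, schedule, and task are made up for this example, and the import path assumes Airflow 2.x:

```python
# Minimal Airflow DAG sketch; dag_id, schedule, and task are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def say_hello():
    print("Hello from Airflow")


with DAG(
    dag_id="example_pipeline",           # hypothetical pipeline name
    start_date=datetime(2020, 11, 15),
    schedule_interval="@daily",          # run once per day
    catchup=False,
) as dag:
    PythonOperator(task_id="say_hello", python_callable=say_hello)
```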
  5. What does maintainable mean? 5 • Easy to understand •

    Easy to change • Easy to recover from failure • Easy to debug
  6. 6 #1 Ingest data as-raw-as-possible #2 Create idempotent and deterministic

    processes #3 Rest data between tasks #4 Validate data after each step #5 Create workflow as code An opinionated list of best practices to begin with
  7. #1 Ingest data as-raw-as-possible 7 Why? • Prevent losing data

    because of process failure • Enables reprocessing data when business rules change How? • Use formats with minimal metadata definition (e.g., CSV, JSON) • Don't use formats like Parquet for ingestion
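For instance, an ingestion task can persist the source's payload unchanged as JSON, partitioned by run date, with no parsing or schema applied. This is only a sketch; the endpoint, paths, and file names are hypothetical:

```python
# Sketch: store the raw API payload unchanged so it can be reprocessed later.
# The endpoint, parameters, and raw-zone path are hypothetical.
import json
import pathlib

import requests

RAW_DIR = pathlib.Path("/data/raw/orders")  # hypothetical raw zone


def ingest_orders(ds: str) -> None:
    """Fetch one day of orders and write the response as-is (no transformation)."""
    response = requests.get("https://api.example.com/orders", params={"date": ds})
    response.raise_for_status()

    out_dir = RAW_DIR / f"dt={ds}"
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "orders.json").write_text(json.dumps(response.json()))
```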
  8. #2 Create idempotent & deterministic processes 8 What it means:

    • Idempotent: running it multiple times doesn't change the result • Deterministic: the result of a transformation depends only on its defined input parameters ◦ e.g., the result should NOT depend on when it runs
  9. #2 Create idempotent & deterministic processes 9 Why? • Reproducible

    ◦ When something breaks, just fix the underlying issue, then rerun the process without worrying about data consistency ◦ When the schema or business logic changes, just rerun the process • Easier to debug • Enables creating a dev/staging environment that mirrors production
  10. #2 Create idempotent & deterministic processes 10 How? • Don’t

    append, overwrite the partition instead • Don't alter data, write a new copy instead • Don't produce side effects, treat your process as a function • Define all factors that influence the result as input parameters ◦ e.g., don't use date.today(), use Airflow's execution_date instead
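A sketch of what this can look like: the run date arrives as a parameter (Airflow's templated execution date) and the output partition is overwritten rather than appended, so reruns produce the same result. Paths, column names, and the pandas-based transformation are hypothetical:

```python
# Sketch of an idempotent, deterministic transformation.
# Output depends only on `ds` and overwrites the partition for that date.
import pathlib

import pandas as pd

RAW_DIR = pathlib.Path("/data/raw/orders")
CLEAN_DIR = pathlib.Path("/data/clean/orders")


def transform_orders(ds: str) -> None:
    # Assumes the raw file is a JSON array of order records (hypothetical schema).
    df = pd.read_json(RAW_DIR / f"dt={ds}" / "orders.json")
    df["amount"] = df["amount"].astype(float)

    out_dir = CLEAN_DIR / f"dt={ds}"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Overwrite the whole partition: rerunning never appends duplicate rows.
    df.to_parquet(out_dir / "orders.parquet", index=False)
```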
  11. #3 Rest data between tasks 11 Why? • Enables the workflow

    to run on a cluster, which is a good foundation for scaling How? • Write the result of a task to storage • The next task reads it back from storage
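A sketch of the hand-off: the extract task writes a file to a templated path, and the downstream task reads that same path back from storage instead of receiving data in memory. The DAG id, paths, and callable bodies are hypothetical; the import path and the automatic passing of the templated ds value assume Airflow 2.x:

```python
# Sketch: tasks hand data off through storage, not in memory.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

STAGING = "/data/staging/orders/dt={ds}/orders.json"  # hypothetical staging path


def extract(ds, **_):
    path = STAGING.format(ds=ds)
    ...  # fetch source data and write it to `path`


def transform(ds, **_):
    path = STAGING.format(ds=ds)
    ...  # read `path` back from storage, write the transformed result elsewhere


with DAG(
    dag_id="rest_data_example",
    start_date=datetime(2020, 11, 15),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task
```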
  12. #3 Rest data between tasks 12

  13. #4 Validate data after each step 13 Why? • Prevent

    silent data errors • Never publish wrong data
  14. #4 Validate data after each step 14 How? Use the write-audit-publish

    (WAP) pattern Image from https://docs.greatexpectations.io/en/latest/guides/workflows_patterns/deployment_airflow.html
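A minimal sketch of the three steps: write to a staging area, audit, and publish only if the audit passes (the audit task fails loudly, which stops the DAG before publishing). The checks, paths, and table layout are hypothetical; in practice a tool like Great Expectations can replace the hand-rolled assertions:

```python
# Sketch of write-audit-publish: the publish step runs only if the audit succeeds.
import pathlib
import shutil

import pandas as pd

STAGING = pathlib.Path("/data/staging/orders")
PUBLISHED = pathlib.Path("/data/published/orders")


def audit_orders(ds: str) -> None:
    """Fail the task (and therefore the DAG run) instead of publishing wrong data."""
    df = pd.read_parquet(STAGING / f"dt={ds}" / "orders.parquet")
    assert not df.empty, "no rows written for this partition"
    assert df["order_id"].is_unique, "duplicate order ids"
    assert (df["amount"] >= 0).all(), "negative amounts"


def publish_orders(ds: str) -> None:
    """Set downstream of the audit task, so it only runs after validation passes."""
    src, dst = STAGING / f"dt={ds}", PUBLISHED / f"dt={ds}"
    if dst.exists():
        shutil.rmtree(dst)  # overwrite for idempotency
    shutil.copytree(src, dst)
```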
  15. #5 Create workflow as code (as opposed to using drag &

    drop tools) 15 Why? • Reproducible • Flexible • Versioned in Git • Leverage software engineering best practices: ◦ unit & integration tests, code review, CI/CD, containerization How? Using Airflow can help you
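One concrete payoff of keeping the workflow as code is that it can be unit-tested in CI. The sketch below checks that every DAG file imports cleanly and that an expected dependency exists; the dag_id and task ids refer to the hypothetical examples above:

```python
# Sketch of DAG unit tests that can run in CI (e.g. with pytest).
from airflow.models import DagBag


def test_dags_import_without_errors():
    dag_bag = DagBag(include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import failures: {dag_bag.import_errors}"


def test_transform_runs_after_extract():
    dag = DagBag(include_examples=False).get_dag("rest_data_example")  # hypothetical dag_id
    downstream_ids = {t.task_id for t in dag.get_task("extract").downstream_list}
    assert "transform" in downstream_ids
```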
  16. Thank you! Hope you find it useful Code sample: https://github.com/asatrya/mkr-data-pipeline

    16