Building Data Workflows with Luigi & Python

A demo and presentation given at Python Stuttgart Meetup (April 2019) about building data workflows with Luigi.

A demo Luigi project is available on GitHub: https://github.com/chhantyal/luigi-example


Nar Kumar Chhantyal

April 26, 2019

Transcript

  1. 2.

    - Data Lake/Platform @ Breuninger.com
    - Python/Luigi with Kubernetes on Google Cloud
    - Web dev in a past life (Flask/Django/NodeJS)
    - Twitter/GitHub: @chhantyal
  2. 3.

    - Workflow/pipeline tool for batch jobs
    - Open-sourced by Spotify Engineering
    - Written entirely in Python; jobs are just normal Python code
    - Lightweight, comes with a Web UI
    - Has no built-in scheduler; usually crontab is used
    - Has tons of contrib packages, e.g. Hadoop, BigQuery, AWS
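
    A Luigi job really is just a Python class. A minimal sketch (the task name and output path are made up for illustration; a real job would write somewhere durable):

    ```python
    import os
    import tempfile

    import luigi

    # Illustrative output location only.
    OUT_PATH = os.path.join(tempfile.gettempdir(), "luigi_hello.txt")

    class HelloTask(luigi.Task):
        """A task declares its target in output() and produces it in run()."""

        def output(self):
            return luigi.LocalTarget(OUT_PATH)

        def run(self):
            # LocalTarget.open("w") writes atomically: temp file, then rename.
            with self.output().open("w") as f:
                f.write("hello from luigi\n")

    if __name__ == "__main__":
        # local_scheduler=True runs without the central scheduler daemon.
        luigi.build([HelloTask()], local_scheduler=True)
    ```

    Running the module executes the task once; on a second run Luigi sees the target already exists and does nothing.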
  3. 4.

    Daily Sales Report
    Create a daily revenue report from sales transactions. We need to do a few things first to build the final report:
    - Dump sales data from the prod database
    - Move it to a backup drive
    - Ingest it into the analytics database, etc.
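
    Those steps map naturally onto chained Luigi tasks. A sketch with made-up paths (a temp dir so it runs anywhere) and trivial stand-ins for the real dump and ingest logic:

    ```python
    import datetime
    import os
    import tempfile

    import luigi

    BASE = tempfile.gettempdir()  # stand-in for real storage locations

    class DumpSalesData(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget(
                os.path.join(BASE, f"sales_{self.date:%Y-%m-%d}.csv"))

        def run(self):
            # Stand-in for the real prod-database dump.
            with self.output().open("w") as f:
                f.write("order_id,amount\n")

    class IngestToAnalyticsDB(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            # Declares the dependency: Luigi runs DumpSalesData first.
            return DumpSalesData(date=self.date)

        def output(self):
            return luigi.LocalTarget(
                os.path.join(BASE, f"ingested_{self.date:%Y-%m-%d}.marker"))

        def run(self):
            # self.input() is the output target of the required task.
            with self.input().open() as src, self.output().open("w") as marker:
                marker.write(f"ingested {len(src.read())} bytes\n")

    if __name__ == "__main__":
        # Asking for the last task pulls in the whole chain.
        luigi.build([IngestToAnalyticsDB(date=datetime.date(2019, 4, 1))],
                    local_scheduler=True)
    ```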
  4. 5.

    Daily Sales Report
    I will just write modular Python scripts; what could possibly go wrong?
    1. 0 10 * * * dump_sales_data.py
    2. 0 10 * * * backup_to_safe_drive.py
    3. 0 11 * * * ingest_to_analyticsdb.py
    4. 0 12 * * * aggregate_data.py
    5. Profit?
  5. 6.

    Daily Sales Report
    A few issues:
    1. What happens when the first one fails?
    2. What if the first one takes longer than one hour?
    3. What if you have to do the same thing for the last five days? Or even worse, your batch job had a bug and you have to redo the whole month?
    4. How do I see whether these jobs ran successfully or not?
    5. What happens if a job somehow runs twice? Duplicate data?
  6. 7.

    Daily Sales Report
    - Demo with Luigi
    - Final task diagram
    - Source code: https://github.com/chhantyal/luigi-example
  7. 8.

    Customer Analytics: Luigi to the rescue
    1. What happens when the first one fails? - The rest of the tasks will stay in a pending state.
    2. What if the first one takes longer than one hour? - The rest of the pipeline will wait for its upstream dependencies to finish first.
    3. What if you have to do the same thing for the last five days? - You can easily backfill.
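
    Backfilling with Luigi just means building the same date-parameterised task once per missing day; any date whose target already exists is skipped. A sketch (temp-dir output and a fixed example date, both made up):

    ```python
    import datetime
    import os
    import tempfile

    import luigi

    class DailyReport(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            # Illustrative location; one target file per day.
            return luigi.LocalTarget(os.path.join(
                tempfile.gettempdir(), f"report_{self.date:%Y-%m-%d}.txt"))

        def run(self):
            with self.output().open("w") as f:
                f.write(f"report for {self.date}\n")

    # Backfill the last five days: one task instance per date.
    end = datetime.date(2019, 4, 26)  # example date
    tasks = [DailyReport(date=end - datetime.timedelta(days=n)) for n in range(5)]
    luigi.build(tasks, local_scheduler=True)
    ```

    Luigi also ships range helpers (e.g. `RangeDaily` in `luigi.tools.range`) built for exactly this pattern.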
  8. 9.

    Customer Analytics: Luigi to the rescue (continued)
    4. How do I see whether these jobs ran successfully or not? - Luigi has a decent Web UI to track job runs; target files can be used as well.
    5. What happens if a job somehow runs twice? Duplicate data? - Running jobs again does nothing if they already succeeded; jobs are idempotent.
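
    The idempotency comes from targets: by default a task counts as complete when its output() exists, so running it again is a no-op. A small sketch demonstrating that (temp-dir path made up):

    ```python
    import os
    import tempfile

    import luigi

    PATH = os.path.join(tempfile.gettempdir(), "write_once.txt")

    class WriteOnce(luigi.Task):
        def output(self):
            return luigi.LocalTarget(PATH)

        def run(self):
            with self.output().open("w") as f:
                f.write("done\n")

    luigi.build([WriteOnce()], local_scheduler=True)  # runs the task if needed
    luigi.build([WriteOnce()], local_scheduler=True)  # no-op: target exists
    ```

    `Task.complete()` defaults to checking `output().exists()`; deleting the target file is how you force a re-run.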
  9. 10.

    Luigi is a lightweight Python-based framework for managing batch jobs.
    It gives you:
    - workflow management
    - dependency resolution
    - job visualisation (Web UI)
    - plain Python code as jobs (thus very flexible to extend)
    - easy unit testing
    - works great with Kubernetes
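
    On the unit-testing point: because run() is plain Python, a test can call it directly without any scheduler. A hypothetical example task and test:

    ```python
    import os
    import tempfile

    import luigi

    class Squares(luigi.Task):
        """Hypothetical task: writes the first n square numbers, one per line."""
        n = luigi.IntParameter()

        def output(self):
            return luigi.LocalTarget(
                os.path.join(tempfile.gettempdir(), f"squares_{self.n}.txt"))

        def run(self):
            with self.output().open("w") as f:
                for i in range(self.n):
                    f.write(f"{i * i}\n")

    def test_squares_task():
        task = Squares(n=4)
        if task.output().exists():
            os.remove(task.output().path)
        task.run()  # call directly: no scheduler needed in tests
        with task.output().open() as f:
            assert f.read().split() == ["0", "1", "4", "9"]

    test_squares_task()
    ```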
  10. 12.

    Contact: kumar.chhantyal@breuninger.de
    - Data (big or small)
    - Python
    - Docker/Kubernetes
    - Google Cloud
    - Table Tennis
    - Running/Biking
    - Cakes
    - Cool Team