Slide 1

Slide 1 text

Or how to manage hundreds of batch jobs better, with Luigi!

Slide 2

Slide 2 text

- Data Lake/Platform @ Breuninger.com
- Python/Luigi with Kubernetes on Google Cloud
- Web dev in a past life (Flask/Django/NodeJS)
- Twitter/GitHub: @chhantyal

Slide 3

Slide 3 text

❏ Workflow/pipeline tool for batch jobs
❏ Open sourced by Spotify Engineering
❏ Written entirely in Python; jobs are just normal Python code
❏ Lightweight, comes with a Web UI
❏ Has no built-in way to trigger jobs on a schedule; crontab is usually used
❏ Has tons of contrib packages, e.g. Hadoop, BigQuery, AWS

Slide 4

Slide 4 text

Daily Sales Report
Create a daily revenue report from sales transactions. We need to do a few things first to build the final report:
- Dump sales data from the prod database
- Move it to a backup drive
- Ingest it into the analytics database, etc.

Slide 5

Slide 5 text

Daily Sales Report
I will just write modular Python scripts, what could possibly go wrong?
1. 0 10 * * * dump_sales_data.py
2. 0 10 * * * backup_to_safe_drive.py
3. 0 11 * * * ingest_to_analyticsdb.py
4. 0 12 * * * aggregate_data.py
5. Profit?

Slide 6

Slide 6 text

Daily Sales Report
A few issues:
1. What happens when the first one fails?
2. What if the first one takes longer than one hour?
3. What if you have to do the same thing for the last five days? Or even worse, your batch job had a bug and you have to redo the whole month?
4. How do I see whether these jobs ran successfully or not?
5. What happens if a job somehow runs twice? Duplicate data?

Slide 7

Slide 7 text

Daily Sales Report
- Demo with Luigi
- Final task diagram
- Source code: https://github.com/chhantyal/luigi-example

Slide 8

Slide 8 text

Customer Analytics
Luigi to the rescue:
1. What happens when the first one fails?
 - The rest of the tasks stay in the pending state.
2. What if the first one takes longer than one hour?
 - The rest of the pipeline waits for its upstream dependencies to finish first.
3. What if you have to do the same thing for the last five days?
 - Backfilling is easy.
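The first and third answers can be illustrated with a small plain-Python sketch. This is not Luigi's API or implementation: `PIPELINE`, `run_pipeline`, and `last_n_days` are hypothetical names, and the strictly linear chain of four steps is an assumption based on the script names from the cron example.

```python
from datetime import date, timedelta

# Hypothetical pipeline: each step names the step it depends on (None = no dependency).
PIPELINE = {
    "dump_sales_data": None,
    "backup_to_safe_drive": "dump_sales_data",
    "ingest_to_analyticsdb": "backup_to_safe_drive",
    "aggregate_data": "ingest_to_analyticsdb",
}


def run_pipeline(day, failing=frozenset()):
    """Walk the steps in dependency order; steps behind a failure stay pending."""
    status = {}
    for step, upstream in PIPELINE.items():
        if upstream is not None and status[upstream] != "done":
            status[step] = "pending"  # upstream is not done, so do not start
        elif (step, day) in failing:
            status[step] = "failed"  # simulated failure for this step and day
        else:
            status[step] = "done"
    return status


def last_n_days(end, n):
    """Dates to backfill, oldest first, ending at `end` inclusive."""
    return [end - timedelta(days=offset) for offset in range(n - 1, -1, -1)]


if __name__ == "__main__":
    # Backfill five days, pretending the dump fails on one of them.
    failing = {("dump_sales_data", date(2019, 1, 3))}
    for day in last_n_days(date(2019, 1, 5), 5):
        print(day, run_pipeline(day, failing))
```

Because every run is parameterised by its date, a backfill is just the same pipeline invoked once per missing day, and a failed step blocks only the steps that depend on it for that day.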

Slide 9

Slide 9 text

Customer Analytics
Luigi to the rescue (continued):
4. How do I see whether these jobs ran successfully or not?
 - The Web UI tracks job runs. Target files can be checked as well.
5. What happens if it somehow runs twice? Duplicate data?
 - Running a job again does nothing if its output target already exists. Jobs are idempotent.
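The idea behind the fifth answer, that a job whose target already exists is treated as done, can be sketched in plain Python. This is an illustration of the principle only, not Luigi's code; `run_once` and its parameters are hypothetical names.

```python
import os
import tempfile


def run_once(target_path, produce):
    """Run `produce` only if the target file is missing; return True if it ran."""
    if os.path.exists(target_path):
        return False  # target exists, so the job counts as already done
    directory = os.path.dirname(target_path) or "."
    # Write to a temporary file first, so the target only appears once complete.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(produce())
    os.replace(tmp_path, target_path)  # atomic rename within one filesystem
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "report_2019-01-01.txt")
        print(run_once(target, lambda: "revenue,15.00\n"))  # first run does the work
        print(run_once(target, lambda: "revenue,15.00\n"))  # second run is a no-op
```

The rename step matters: if the job crashed halfway through writing directly to the target, the next run would see a file and wrongly skip the work.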

Slide 10

Slide 10 text

Luigi is a lightweight, Python-based framework for managing batch jobs. It gives you:
- workflow management
- dependency resolution
- job visualisation (Web UI)
- plain Python code as jobs (thus very flexible to extend)
- easy unit testing
- works great with Kubernetes

Slide 11

Slide 11 text

QUESTIONS?

Slide 12

Slide 12 text

Contact: [email protected]
- Data (big or small)
- Python
- Docker/Kubernetes
- Google Cloud
- Table Tennis
- Running/Biking
- Cakes
- Cool Team