Spotify Engineering
❏ Written entirely in Python. Jobs are just normal Python code
❏ Lightweight, comes with a Web UI
❏ Has no built-in scheduler; crontab is usually used
❏ Has tons of contrib packages, e.g. Hadoop, BigQuery, AWS
transactions. We need to do a few things first to build the final report:
❏ Dump sales data from the prod database
❏ Move it to a backup drive
❏ Ingest it into the analytics database, etc.
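The steps above form a dependency chain, which can be sketched in plain Python as follows. This is only an illustration of dependency resolution, not Luigi's API; the step names and the `resolve_order` helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical step names mirroring the slide; each step lists the steps it depends on.
DEPENDENCIES: Dict[str, List[str]] = {
    "dump_sales_data": [],
    "move_to_backup": ["dump_sales_data"],
    "ingest_into_analytics": ["move_to_backup"],
    "build_report": ["ingest_into_analytics"],
}


def resolve_order(target: str, deps: Dict[str, List[str]]) -> List[str]:
    """Return every step needed for `target`, with dependencies listed first."""
    order: List[str] = []

    def visit(step: str) -> None:
        if step in order:
            return  # already scheduled
        for upstream in deps[step]:
            visit(upstream)  # schedule upstream steps before this one
        order.append(step)

    visit(target)
    return order


print(resolve_order("build_report", DEPENDENCIES))
```

Asking for the final report is enough: the earlier steps are pulled in automatically, in the order they must run.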
one fails?
2. What if the first one takes longer than one hour?
3. What if you have to do the same thing for the last five days? Or even worse, your batch job had a bug and you have to redo the whole month?
4. How do I see if these jobs ran successfully or not?
5. What happens if a job somehow runs twice? Duplicate data?
first one fails?
   - The rest of the tasks will stay in a pending state.
2. What if the first one takes longer than one hour?
   - The rest of the pipeline will wait for upstream dependencies to finish first.
3. What if you have to do the same thing for the last five days?
   - Backfilling is easy.
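The first and third answers can be illustrated with a small plain-Python sketch. It is not Luigi's API: `run_pipeline`, `backfill`, the step names, and the simulated failure date are all assumptions made for the example. A failed step stops the run and leaves later steps pending, and backfilling is just running the same pipeline once per date.

```python
from datetime import date, timedelta
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[date], None]]


def run_pipeline(steps: List[Step], day: date) -> Dict[str, str]:
    """Run the steps in order for one day and stop at the first failure."""
    status = {name: "PENDING" for name, _ in steps}
    for name, action in steps:
        try:
            action(day)
        except Exception:
            status[name] = "FAILED"
            break  # later steps are never started and stay PENDING
        status[name] = "DONE"
    return status


def backfill(steps: List[Step], end: date, days: int) -> Dict[date, Dict[str, str]]:
    """Run the pipeline once per day for the `days` days ending at `end`."""
    results: Dict[date, Dict[str, str]] = {}
    for offset in reversed(range(days)):
        day = end - timedelta(days=offset)
        results[day] = run_pipeline(steps, day)
    return results


def dump(day: date) -> None:
    # Hypothetical failure: pretend the prod database was down on one day.
    if day == date(2024, 1, 3):
        raise RuntimeError("prod database unavailable")


def ingest(day: date) -> None:
    pass  # placeholder for the real ingest step


steps = [("dump", dump), ("ingest", ingest)]
results = backfill(steps, end=date(2024, 1, 5), days=5)
for day, status in results.items():
    print(day, status)
```

Because every run is parameterised by its date, rerunning a single bad day or a whole month uses the same code path.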
I see if these jobs ran successfully or not?
   - It has a decent Web UI to track job runs. Target files can be used as well.
5. What happens if it somehow runs twice? Duplicate data?
   - Running jobs again does nothing if they already succeeded. Jobs are idempotent.
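The role of target files in making reruns harmless can be sketched as below. This is a minimal plain-Python illustration, not Luigi's implementation; `run_once`, `dump_sales`, the file name, and the sample rows are hypothetical. A job counts as complete when its target file exists, so a second run does no work.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, List


def run_once(target: Path, produce: Callable[[], str]) -> bool:
    """Run `produce` only if `target` is missing; return True if work was done."""
    if target.exists():
        return False  # target already there: the job is considered complete
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(produce())
    os.replace(tmp, target)  # rename at the end so a crash never leaves a partial target
    return True


calls: List[int] = []


def dump_sales() -> str:
    calls.append(1)  # count how many times the real work actually runs
    return "order_id,amount\n1,9.99\n"  # made-up sample rows


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "sales_2024-01-01.csv"
    first = run_once(target, dump_sales)
    second = run_once(target, dump_sales)

print(first, second, len(calls))
```

Writing to a temporary name and renaming at the end matters: the target only appears once the data is complete, so its existence can double as the success marker.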
It gives you:
❏ workflow management
❏ dependency resolution
❏ jobs visualisation (Web UI)
❏ plain Python code as jobs (thus very flexible to extend)
❏ easy unit testing
❏ works great with Kubernetes