Building Data Workflows with Luigi and Kubernetes

Slides from talk at EuroPython 2019, Basel

This talk will focus on how one can build complex data pipelines in Python. I will introduce Luigi and show how it solves the problems that arise when running chains of batch jobs: dependency resolution, workflow management, visualisation, failure handling, etc.

After that, I will present how to package Luigi pipelines as a Docker image for easier testing and deployment. Finally, I will go through ways to deploy them on a Kubernetes cluster, making it possible to scale Big Data pipelines on demand and reduce infrastructure costs. I will also give tips and tricks to make the Luigi scheduler play well with Kubernetes' batch execution features.

This talk will be accompanied by a demo project. It will be most beneficial for audience members who have some experience running batch jobs (not necessarily in Python), typically people who work in the Big Data sphere, such as data scientists, data engineers, BI developers and software developers. Familiarity with Python is helpful but not required.


Nar Kumar Chhantyal

July 11, 2019

Transcript

  1. 2.

    Nar Kumar Chhantyal
    • Data Lake @ Breuninger.com
    • Python/Luigi with Kubernetes on Google Cloud
    • Web Dev in a past life (Flask/Django/NodeJS)
    • Twitter/GitHub: @chhantyal
    • Web: http://chhantyal.net
  2. 3.

    • Workflow/pipeline tool for batch jobs
    • Open sourced by Spotify Engineering
    • Written entirely in Python; jobs are just normal Python code
    • Lightweight, comes with a Web UI
    • Has tons of contrib packages, e.g. Hadoop, BigQuery, AWS
    • Has no built-in scheduler; usually crontab is used
  3. 4.

    Daily Sales Report
    Create a daily revenue report from sales transactions. We need to do a few things first to build the final report:
    • Dump sales data from the prod database
    • Ingest it into the analytics database
    • Run aggregation & update the dashboard
  4. 5.

    Daily Sales Report
    I will just write modular Python scripts, what could possibly go wrong?
    1. 0 10 * * * dump_sales_data.py
    2. 0 11 * * * ingest_to_analyticsdb.py
    3. 0 12 * * * aggregate_data.py
    4. Profit?
  5. 6.

    Daily Sales Report
    A few issues:
    1. What happens when the first one fails?
    2. What if the first one takes longer than one hour?
    3. What if you have to do the same thing for the last five days?
    4. How do I see whether these jobs ran successfully or not?
    5. What happens if a job somehow runs twice? Duplicate data?
  6. 7.

    Daily Sales Report
    • Luigi implementation
    • Source code: https://github.com/chhantyal/luigi-kubernetes
    • Run from CLI: luigi --module example SalesReport --date=2019-07-11
  7. 8.

    Luigi has no built-in scheduler. Usually, crontab is used:
    • 0 08 * * * luigi --module example SalesReport --date=2019-07-11
  8. 10.

    A Job creates one or more Pods to carry out a specific task. It ensures the Pods complete successfully and reschedules them on failure (aka "run to completion"). A CronJob creates Jobs on a time-based schedule.
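A sketch of what a CronJob manifest for the pipeline could look like (the image name, schedule, and retry settings here are illustrative assumptions, not the manifests from the demo repo; `batch/v1beta1` was the CronJob API version current at the time of the talk):

```yaml
apiVersion: batch/v1beta1        # use batch/v1 on Kubernetes >= 1.21
kind: CronJob
metadata:
  name: sales-report
spec:
  schedule: "0 8 * * *"          # every day at 08:00
  concurrencyPolicy: Forbid      # don't start a new Job while one is still running
  jobTemplate:
    spec:
      backoffLimit: 3            # retry failed Pods up to 3 times
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: sales-report
              image: chhantyal/luigi-kubernetes:latest   # illustrative tag
              args: ["luigi", "--module", "example", "SalesReport",
                     "--date=2019-07-11"]
```

`concurrencyPolicy: Forbid` is one of the tricks for making Luigi and Kubernetes play well together: it prevents the overlapping-runs problem that plain crontab has when a job takes longer than its interval.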
  9. 11.

    Daily Sales Report
    • Run on Kubernetes (Minikube)
      • Deploy luigid
      • Build Docker images & upload them to a registry
      • Deploy the pipeline on K8s
    • CronJob → Job → Pod
    • Source code: https://github.com/chhantyal/luigi-kubernetes
    • Docker images: https://hub.docker.com/u/chhantyal
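The image-build step above can be sketched as a minimal Dockerfile (the Python version, module file name, and entrypoint are assumptions based on the slide's `luigi --module example` CLI example, not copied from the repo):

```dockerfile
FROM python:3.7-slim

WORKDIR /app

# Luigi plus whatever DB drivers the tasks actually need
RUN pip install --no-cache-dir luigi

# The pipeline module from the demo (hypothetical file name)
COPY example.py .

# The Kubernetes CronJob supplies the task name and --date argument
ENTRYPOINT ["luigi", "--module", "example"]
```

Building and pushing it would then follow the usual `docker build -t <registry>/<image>:<tag> .` and `docker push` flow before the CronJob can pull it.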
  10. 12.

    Because Luigi is lightweight, it makes a great tool to containerize and run on a Kubernetes cluster. As a result, you can manage complex batch processes and scale them seamlessly on demand.
    Kubernetes
    • Horizontal scaling
    • Flexible deployment
    • Continuous integration & delivery
    Luigi
    • Workflow management
    • Dependency resolution
    • Easy testing & containerization
  11. 13.

    Contact: kumar.chhantyal@breuninger.de | twitter.com/chhantyal
    • Data (big & small)
    • Python
    • Docker/Kubernetes
    • Google Cloud
    • Table tennis / running / biking / cakes
    • Cool team
    • Stuttgart, Germany (ca. 2h train ride from Basel)