Let Luigi do the plumbing for you @ PyData London meetup

Lightning talk given at the PyData London November meetup about building data pipelines in Python using Luigi.

Marco Bonzanini

November 03, 2015

Transcript

  1.

    Let Luigi do the plumbing for you
    Building Data Pipelines in Python
    Marco Bonzanini @ PyData London, 3rd November 2015
  2.

    Data pipelines:
    • steps to extract, clean, augment, join data
    • every non-trivial project has one

    From prototype to production:
    • as a Data Scientist, the focus is on R&D
    • automation and replicability matter
  3.

    Luigi: GNU make + Unix pipelines + steroids
    • Workflow manager in Python
    • Dependency management
    • Error control, checkpoints, failure recovery
    • Minimal boilerplate code
    • Dependency graph visualisation

    $ pip install luigi
    https://github.com/spotify/luigi
  4.

    Task: unit of execution

        import luigi

        class MyTask(luigi.Task):

            def requires(self):
                pass  # list of dependencies

            def output(self):
                pass  # task output

            def run(self):
                pass  # task logic
  5.

    Target: output of a task

        import luigi

        class MyTarget(luigi.Target):

            def exists(self):
                pass  # return bool

    Off-the-shelf support for local filesystem, S3, RDBMS, Elasticsearch, …
  6.

    Suggestions to Ease Deployment
    • Don’t re-invent the wheel
    • Develop Python packages (setup.py)
    • Parameterise everything (env variables: good)
    • Use decent logging mechanism
    • Docker: probably good idea
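Two of these suggestions, parameterising through environment variables and using proper logging, can be sketched with the standard library alone. The variable names (`PIPELINE_INPUT_DIR`, `PIPELINE_OUTPUT_DIR`, `PIPELINE_LOG_LEVEL`), the default paths and the helper functions are hypothetical; treat this as one possible arrangement rather than the approach used in the talk.

```python
import logging
import os


def load_settings(environ=None):
    """Read pipeline settings from environment variables, with defaults."""
    env = os.environ if environ is None else environ
    return {
        "input_dir": env.get("PIPELINE_INPUT_DIR", "data/raw"),
        "output_dir": env.get("PIPELINE_OUTPUT_DIR", "data/processed"),
        "log_level": env.get("PIPELINE_LOG_LEVEL", "INFO").upper(),
    }


def get_logger(name, level_name):
    """Return a logger with a single stream handler; unknown levels fall back to INFO."""
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    settings = load_settings()
    log = get_logger("pipeline", settings["log_level"])
    log.info(
        "Reading from %s, writing to %s",
        settings["input_dir"],
        settings["output_dir"],
    )
```

Keeping configuration in the environment means the same package can run unchanged on a laptop, on a server or inside a Docker container, with only the variables differing.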