

Let Luigi do the plumbing for you @ PyData London meetup

Lightning talk given at the PyData London November meetup about building data pipelines in Python using Luigi.

Marco Bonzanini

November 03, 2015


Transcript

  1. Let Luigi do the plumbing for you: Building Data Pipelines in Python

    Marco Bonzanini @ PyData London, 3rd November 2015
  2. Data pipelines:

    • steps to extract, clean, augment, join data
    • every non-trivial project has one

    From prototype to production:
    • as a Data Scientist, the focus is on R&D
    • automation and replicability matter
  3. Luigi: GNU make + Unix pipelines + steroids

    • Workflow manager in Python
    • Dependency management
    • Error control, checkpoints, failure recovery
    • Minimal boilerplate code
    • Dependency graph visualisation

    $ pip install luigi
    https://github.com/spotify/luigi
  4. Task: unit of execution

    class MyTask(luigi.Task):

        def requires(self):
            pass  # list of dependencies

        def output(self):
            pass  # task output

        def run(self):
            pass  # task logic
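
To make the requires/output/run contract from slide 4 concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not Luigi itself: the `Task` base class, the `build` function, and the `Extract` and `Sort` tasks are all hypothetical names invented for illustration. The sketch assumes that a task counts as complete once its output file exists, which is the make-style idea behind checkpoints and failure recovery.

```python
import os
import tempfile


class Task:
    """Toy stand-in for a pipeline task (hypothetical, not luigi.Task)."""

    def requires(self):
        return []  # list of upstream Task instances

    def output(self):
        raise NotImplementedError  # path of the file this task writes

    def run(self):
        raise NotImplementedError  # task logic

    def complete(self):
        # Assumption: a task is done once its output file exists.
        return os.path.exists(self.output())


def build(task, log):
    """Run `task` after its dependencies, skipping anything already complete."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


WORKDIR = tempfile.mkdtemp()


class Extract(Task):
    def output(self):
        return os.path.join(WORKDIR, "raw.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("3\n1\n2\n")


class Sort(Task):
    def requires(self):
        return [Extract()]

    def output(self):
        return os.path.join(WORKDIR, "sorted.txt")

    def run(self):
        with open(Extract().output()) as f:
            lines = sorted(f.read().split())
        with open(self.output(), "w") as f:
            f.write("\n".join(lines) + "\n")


first_run, second_run = [], []
build(Sort(), first_run)   # runs Extract, then Sort
build(Sort(), second_run)  # outputs already exist, so nothing runs
```

The second call does no work because both output files are already on disk. If `Sort` had failed halfway, a rerun would pick up from the existing `raw.txt` instead of starting over.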
  5. Target: output of a task

    class MyTarget(luigi.Target):

        def exists(self):
            pass  # return bool

    Off-the-shelf support for local filesystem, S3, RDBMS, Elasticsearch, …
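
The `exists()` contract from slide 5 can be sketched without Luigi as well. The classes below are hypothetical illustrations in plain Python, not Luigi's own targets. The example assumes a job that writes several files into a directory and only counts as finished once a `_SUCCESS` marker file has been written, so `exists()` expresses "the output is complete" instead of merely "something is on disk".

```python
import os
import tempfile


class Target:
    """Toy stand-in mirroring the slide's interface (hypothetical, not luigi.Target)."""

    def exists(self):
        raise NotImplementedError  # must return a bool


class SuccessFlagTarget(Target):
    """A directory output that is complete only once a _SUCCESS marker exists."""

    def __init__(self, directory):
        self.directory = directory

    def exists(self):
        return os.path.isfile(os.path.join(self.directory, "_SUCCESS"))


workdir = tempfile.mkdtemp()
target = SuccessFlagTarget(workdir)

before = target.exists()  # the directory is there, but the marker is not

# Simulate the job finishing: write the marker file last.
open(os.path.join(workdir, "_SUCCESS"), "w").close()

after = target.exists()
```

Writing the marker as the final step means a crashed job leaves the target "missing", so a scheduler that checks `exists()` would rerun it.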
  6. Suggestions to Ease Deployment

    • Don’t re-invent the wheel
    • Develop Python packages (setup.py)
    • Parameterise everything (env variables: good)
    • Use a decent logging mechanism
    • Docker: probably a good idea
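
Two of the suggestions on slide 6, parameterising through environment variables and using proper logging, can be sketched with the standard library alone. The variable names (`PIPELINE_INPUT_DIR`, `PIPELINE_OUTPUT_DIR`, `PIPELINE_LOG_LEVEL`), their defaults, and the helper functions are assumptions made up for this example; they do not come from the talk.

```python
import logging
import os


def load_settings(env=None):
    """Read pipeline settings from environment variables, with defaults."""
    env = os.environ if env is None else env
    return {
        # Hypothetical variable names and default paths, for illustration only.
        "input_dir": env.get("PIPELINE_INPUT_DIR", "/tmp/input"),
        "output_dir": env.get("PIPELINE_OUTPUT_DIR", "/tmp/output"),
        "log_level": env.get("PIPELINE_LOG_LEVEL", "INFO").upper(),
    }


def configure_logging(level_name):
    """Set up root logging once and return a named logger for the pipeline."""
    logging.basicConfig(
        level=getattr(logging, level_name, logging.INFO),
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    return logging.getLogger("pipeline")


settings = load_settings()
logger = configure_logging(settings["log_level"])
logger.info(
    "Reading from %s, writing to %s",
    settings["input_dir"],
    settings["output_dir"],
)

# The same code can be pointed at different locations without edits,
# here by passing a dict in place of the real environment.
custom = load_settings(
    {"PIPELINE_INPUT_DIR": "/data/in", "PIPELINE_LOG_LEVEL": "debug"}
)
```

Keeping paths and log levels out of the code means the same package can run unchanged on a laptop, in a scheduled job, or inside a Docker container where the variables are set at launch.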