Lightning talk given at the PyData London November meetup about building data pipelines in Python using Luigi.
Let Luigi do the
plumbing for you
Building Data Pipelines in Python
Marco Bonzanini @ PyData London
3rd November 2015
• steps to extract, clean, augment, and join data
• every non-trivial project has one

From prototype to production:
• as a Data Scientist, the focus is on R&D
• automation and replicability matter
Luigi: GNU make + Unix pipelines + steroids
• Workflow manager in Python
• Dependency management
• Error control, checkpoints, failure recovery
• Minimal boilerplate code
• Dependency graph visualisation
$ pip install luigi
Task: unit of execution

class MyTask(luigi.Task):

    def requires(self):
        pass  # list of dependencies

    def output(self):
        pass  # task output

    def run(self):
        pass  # task logic
Target: output of a task

class MyTarget(luigi.Target):

    def exists(self):
        pass  # return bool
Off-the-shelf support for local filesystem, S3, RDBMS, Elasticsearch, …
Suggestions to Ease Deployment
• Don’t re-invent the wheel
• Develop Python packages (setup.py)
• Parameterise everything (environment variables are good for this)
• Use a decent logging mechanism
• Docker: probably a good idea