Building Data Pipelines in Python
Marco Bonzanini
PyData London 2016
Slide 2
Slide 2 text
Nice to meet you
• @MarcoBonzanini
• “Type B” Data Scientist
• PhD in Information Retrieval
• Book with PacktPub (July 2016)
Slide 3
Slide 3 text
R&D ≠ Engineering
R&D results in production = high value
Slide 4
Slide 4 text
No content
Slide 5
Slide 5 text
Big Data Problems
vs
Big Data Problems
Slide 6
Slide 6 text
Data Pipelines
Data → ETL → Analytics
• Many components in a data pipeline:
• Extract, Clean, Augment, Join data
Slide 7
Slide 7 text
Good Data Pipelines
Easy to reproduce
Easy to productise
Slide 8
Slide 8 text
Towards Good Pipelines
• Transform your data, don’t overwrite
• Break it down into components
• Different packages (e.g. setup.py)
• Unit tests vs end-to-end tests
Good = Replicable and Productisable
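A minimal sketch of "transform, don't overwrite" with small components. The function names, file names and CSV columns are hypothetical, chosen only for illustration: each step reads one file and writes a new one, so every intermediate result can be inspected and each step can be unit tested on its own.

```python
import csv
import tempfile
from pathlib import Path


def clean(in_path, out_path):
    """Drop rows with an empty 'name' field; write the result to a new file."""
    with in_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["name"].strip():
                writer.writerow(row)
    return out_path


def augment(in_path, out_path):
    """Add a derived 'name_length' column; the input file is left untouched."""
    with in_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames) + ["name_length"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            row["name_length"] = len(row["name"])
            writer.writerow(row)
    return out_path


if __name__ == "__main__":
    # Hypothetical sample data in a throwaway directory
    workdir = Path(tempfile.mkdtemp())
    raw = workdir / "raw.csv"
    raw.write_text("id,name\n1,Ada\n2,\n3,Grace\n")
    cleaned = clean(raw, workdir / "cleaned.csv")
    augmented = augment(cleaned, workdir / "augmented.csv")
    print(augmented.read_text())
```

Because raw.csv and cleaned.csv are never modified, any step can be re-run in isolation when its logic changes.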
Slide 9
Slide 9 text
Anti-Patterns
• Bunch of scripts
• Single run-everything script
• Hacky homemade dependency control
• Don’t reinvent the wheel
Slide 10
Slide 10 text
Intermezzo
Let me rant about testing
Icon by Freepik from flaticon.com
Slide 11
Slide 11 text
(Unit) Testing
• Unit tests in three easy steps:
• import unittest
• Write your tests
• Quit complaining about lack of time to write tests
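The three steps above could look like the sketch below. The function under test, normalise, is a hypothetical stand-in for a small pipeline component; only the unittest calls come from the standard library.

```python
import unittest


def normalise(text):
    """Hypothetical function under test: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


class NormaliseTest(unittest.TestCase):

    def test_lowercases_input(self):
        self.assertEqual(normalise("PyData London"), "pydata london")

    def test_collapses_whitespace(self):
        self.assertEqual(normalise("  data \t pipelines\n"), "data pipelines")

    def test_empty_string_stays_empty(self):
        self.assertEqual(normalise(""), "")
```

Saved as, say, test_normalise.py, this runs with: $ python -m unittest test_normalise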
Slide 12
Slide 12 text
Benefits of (unit) testing
• Safety net for refactoring
• Safety net for lib upgrades
• Validate your assumptions
• Document code / communicate your intentions
• You’re forced to think
Slide 13
Slide 13 text
Testing: not convinced yet?
Slide 14
Slide 14 text
Testing: not convinced yet?
Slide 15
Slide 15 text
Testing: not convinced yet?
f1 = fscore(p, r)
min_bound, max_bound = sorted([p, r])
assert min_bound <= f1 <= max_bound
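A runnable version of the check above, as a sketch. The slide does not show fscore itself, so the definition here is an assumption: the usual balanced F-score (harmonic mean of precision and recall), with 0.0 returned when both inputs are zero. The test encodes the property from the slide: F1 always lies between precision and recall.

```python
import unittest


def fscore(p, r):
    """Assumed definition: balanced F-score (F1) of precision p and recall r."""
    if p + r == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * p * r / (p + r)


class FscoreBoundsTest(unittest.TestCase):

    def test_f1_lies_between_precision_and_recall(self):
        cases = [(0.5, 0.5), (0.9, 0.1), (1.0, 0.0), (0.0, 0.0), (0.3, 0.7)]
        for p, r in cases:
            with self.subTest(p=p, r=r):
                f1 = fscore(p, r)
                min_bound, max_bound = sorted([p, r])
                self.assertLessEqual(min_bound, f1)
                self.assertLessEqual(f1, max_bound)
```

A test like this validates an assumption about the metric rather than a single hard-coded value, so it keeps holding if the implementation is refactored.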
Slide 16
Slide 16 text
Testing: I’m almost done
• Unit tests vs Defensive Programming
• Say no to tautologies
• Say no to vanity tests
• Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …
Slide 17
Slide 17 text
Slide 18
Slide 18 text
Intro to Luigi
GNU Make + Unix pipes + Steroids
• Workflow manager in Python, by Spotify
• Dependency management
• Error control, checkpoints, failure recovery
• Minimal boilerplate
• Dependency graph visualisation
$ pip install luigi
Slide 19
Slide 19 text
Luigi Task: unit of execution
import luigi

class MyTask(luigi.Task):

    def requires(self):
        pass  # list of dependencies

    def output(self):
        pass  # task output

    def run(self):
        pass  # task logic
Slide 20
Slide 20 text
Luigi Target: output of a task
import luigi

class MyTarget(luigi.Target):

    def exists(self):
        pass  # return bool
Off-the-shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
Slide 21
Slide 21 text
Not only Luigi
• More Python-based workflow managers:
• Airflow by Airbnb
• mrjob by Yelp
• Pinball by Pinterest
Slide 22
Slide 22 text
When things go wrong
• import logging
• Say no to print() for debugging
• Custom log format / extensive info
• Different levels of severity
• Easy to switch off or change level
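A minimal sketch of the points above using the standard logging module. The helper name get_logger, the logger name and the format string are illustrative choices, not a prescribed setup.

```python
import logging

# Custom format: timestamp, severity, logger name, line number, message
LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s:%(lineno)d %(message)s"


def get_logger(name, level=logging.INFO):
    """Return a logger with a custom format; the level is easy to change."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


if __name__ == "__main__":
    log = get_logger("pipeline.clean")
    log.debug("hidden at INFO level")
    log.info("cleaned %d rows", 42)
    log.warning("dropped %d malformed rows", 3)
    log.setLevel(logging.DEBUG)  # switch verbosity without touching call sites
    log.debug("visible after lowering the level")
```

Unlike print(), the debug calls can stay in the code: raising the level silences them, lowering it brings them back.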
Slide 23
Slide 23 text
Who reads the logs?
You’re not going to read the logs, unless…
• E-mail notifications
• built into Luigi
• Slack notifications
$ pip install luigi_slack # WIP
Slide 24
Slide 24 text
Summary
• R&D is not Engineering: can we meet halfway?
• Prototypes vs. Products
• Automation and replicability matter
• You need a workflow manager
• Good engineering principles help:
• Testing, logging, packaging, …