Building Data Pipelines in Python

Marco Bonzanini
PyCon Italia, Florence 2016

Nice to meet you
• @MarcoBonzanini
• “Type B” Data Scientist
• PhD in Information Retrieval
• Book with PacktPub (July 2016)
• Usually at PyData London

R&D ≠ Engineering
R&D results in production = high value

Big Data Problems vs Big Data Problems

Data Pipelines
Data → ETL → Analytics
• Many components in a data pipeline:
  • Extract, Clean, Augment, Join data

Good Data Pipelines
• Easy to reproduce
• Easy to productise

Towards Good Pipelines
• Transform your data, don’t overwrite
• Break it down into components
• Different packages (e.g. setup.py)
• Unit tests vs end-to-end tests
Good = Replicable and Productisable
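A minimal sketch of "transform, don’t overwrite": each step reads one file and writes a new one, so any stage can be re-run from its input. The function name `clean_step`, the `name` column and the cleaning rule are hypothetical, chosen only for illustration.

```python
import csv
import tempfile
from pathlib import Path


def clean_step(src: Path, dst: Path) -> Path:
    """Read src, write a cleaned copy to dst; src is never modified."""
    if src.resolve() == dst.resolve():
        raise ValueError("refusing to overwrite the input file")
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Hypothetical cleaning rule: strip whitespace, drop rows with no name.
            row = {key: value.strip() for key, value in row.items()}
            if row["name"]:
                writer.writerow(row)
    return dst


# Usage: the raw file stays on disk next to the cleaned one.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.csv"
raw.write_text("name,city\n Ada ,London\n,Florence\n")
cleaned = clean_step(raw, workdir / "cleaned.csv")
```

Because the raw file is left untouched, a bug in the cleaning rule can be fixed and the step re-run without going back to the original source.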

Anti-Patterns
• Bunch of scripts
• Single run-everything script
• Hacky homemade dependency control
• Don’t reinvent the wheel

Intermezzo
Let me rant about testing

(Unit) Testing
• Unit tests in three easy steps:
  • import unittest
  • Write your tests
  • Quit complaining about lack of time to write tests
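The first two steps might look like the sketch below. The helper `normalise` is a hypothetical pipeline function invented for this example; the suite is run explicitly so the snippet works the same in a script or a notebook.

```python
import unittest


def normalise(text):
    """Hypothetical pipeline helper: lower-case and collapse whitespace."""
    return " ".join(text.lower().split())


class TestNormalise(unittest.TestCase):
    def test_collapses_whitespace(self):
        self.assertEqual(normalise("  Big   Data "), "big data")

    def test_empty_string(self):
        self.assertEqual(normalise(""), "")


# Load and run the test case without relying on command-line arguments.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNormalise)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the tests would normally live in their own module and be run with `python -m unittest`.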

Benefits of (unit) testing
• Safety net for refactoring
• Safety net for lib upgrades
• Validate your assumptions
• Document code / communicate your intentions
• You’re forced to think

Testing: not convinced yet?

    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    assert min_bound <= f1 <= max_bound
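The slide does not show `fscore` itself. Assuming it is the usual F1 score (harmonic mean of precision `p` and recall `r`), a runnable version of the check could be sketched as follows; the zero-division guard and the sample values are additions for illustration.

```python
def fscore(p, r):
    """Harmonic mean of precision p and recall r (assumed F1 definition)."""
    if p + r == 0:
        return 0.0  # assumption: define F1 as 0 when both inputs are 0
    return 2 * p * r / (p + r)


# The property from the slide: F1 always lies between precision and recall.
for p, r in [(0.5, 0.5), (0.9, 0.1), (1.0, 0.25), (0.0, 0.0)]:
    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    assert min_bound <= f1 <= max_bound
```

The check never states the expected value of `f1`; it only asserts a property that any correct implementation must satisfy.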

Testing: I’m almost done
• Unit tests vs Defensive Programming
• Say no to tautologies
• Say no to vanity tests
• Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …

</rant>

Intro to Luigi
GNU Make + Unix pipes + Steroids
• Workflow manager in Python, by Spotify
• Dependency management
• Error control, checkpoints, failure recovery
• Minimal boilerplate
• Dependency graph visualisation

    $ pip install luigi

Luigi Task: unit of execution

    import luigi

    class MyTask(luigi.Task):

        def requires(self):
            pass  # list of dependencies

        def output(self):
            pass  # task output

        def run(self):
            pass  # task logic

Luigi Target: output of a task

    class MyTarget(luigi.Target):

        def exists(self):
            pass  # return bool

Off-the-shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
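To make the "GNU Make" idea concrete, here is a toy scheduler in plain Python. It is not Luigi and does not use Luigi's API: `Task`, `FileTarget`, `build`, `Extract` and `Sort` are hypothetical stand-ins that only borrow the `requires` / `output` / `run` / `exists` method names from the slides. The output file acts as the checkpoint.

```python
import tempfile
from pathlib import Path


class FileTarget:
    """Toy target: the output exists if the file exists."""

    def __init__(self, path):
        self.path = Path(path)

    def exists(self):
        return self.path.exists()


class Task:
    """Toy task with the three hooks named on the slides."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run dependencies first; skip any task whose output already exists."""
    if task.output().exists():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


workdir = Path(tempfile.mkdtemp())


class Extract(Task):
    def output(self):
        return FileTarget(workdir / "raw.txt")

    def run(self):
        self.output().path.write_text("3\n1\n2\n")


class Sort(Task):
    def requires(self):
        return [Extract()]

    def output(self):
        return FileTarget(workdir / "sorted.txt")

    def run(self):
        lines = self.requires()[0].output().path.read_text().split()
        self.output().path.write_text("\n".join(sorted(lines)) + "\n")


first_run, second_run = [], []
build(Sort(), first_run)   # runs Extract, then Sort
build(Sort(), second_run)  # nothing to do: sorted.txt already exists
```

A real workflow manager adds what this sketch leaves out: parameters, scheduling, failure recovery and graph visualisation.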

Not only Luigi
• More Python-based workflow managers:
  • Airflow by Airbnb
  • Mrjob by Yelp
  • Pinball by Pinterest

When things go wrong
• import logging
• Say no to print() for debugging
• Custom log format / extensive info
• Different levels of severity
• Easy to switch off or change level
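A short sketch of those points using only the standard library. The logger name `pipeline.extract`, the format string and the messages are hypothetical; the `StringIO` stream stands in for a log file so the output can be inspected.

```python
import io
import logging

stream = io.StringIO()  # stands in for a log file or stderr
handler = logging.StreamHandler(stream)
# Custom format: severity, logger name, message.
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("pipeline.extract")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example independent of the root logger

logger.debug("raw row: %r", {"id": 1})  # suppressed at INFO level
logger.info("extracted %d rows", 42)
logger.warning("skipped %d malformed rows", 3)

logger.setLevel(logging.WARNING)  # one line to quieten the pipeline
logger.info("this message is dropped")

captured = stream.getvalue()
```

Unlike scattered `print()` calls, the verbosity is controlled in one place and every record carries its severity and origin.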

Who reads the logs?
You’re not going to read the logs, unless…
• E-mail notifications
  • built into Luigi
• Slack notifications

    $ pip install luigi_slack  # WIP

Summary
• R&D is not Engineering: can we meet halfway?
• Prototypes vs. Products
• Automation and replicability matter
• You need a workflow manager
• Good engineering principles help:
  • Testing, logging, packaging, …

Vanity Slide
• speakerdeck.com/marcobonzanini
• github.com/bonzanini
• marcobonzanini.com
• @MarcoBonzanini
