Slide 1

Slide 1 text

Building Data Pipelines in Python • Marco Bonzanini • PyData London 2016

Slide 2

Slide 2 text

Nice to meet you • @MarcoBonzanini • “Type B” Data Scientist • PhD in Information Retrieval • Book with PacktPub (July 2016)

Slide 3

Slide 3 text

R&D ≠ Engineering • R&D results in production = high value

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Big Data Problems vs Big Data Problems

Slide 6

Slide 6 text

Data Pipelines: Data → ETL → Analytics • Many components in a data pipeline: Extract, Clean, Augment, Join data

Slide 7

Slide 7 text

Good Data Pipelines • Easy to reproduce • Easy to productise

Slide 8

Slide 8 text

Towards Good Pipelines • Transform your data, don’t overwrite • Break it down into components • Different packages (e.g. setup.py) • Unit tests vs end-to-end tests
Good = Replicable and Productisable
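A minimal sketch of the first two bullets, assuming a JSON file of records as input. The step names (`clean`, `augment`), the file names and the record fields are hypothetical, chosen only for illustration. Each step is a small function that reads one file and writes a new one, so the raw data is never overwritten and any step can be re-run or tested on its own.

```python
import json
import tempfile
from pathlib import Path


def clean(in_path, out_path):
    """Drop records with empty text and write the result to a NEW file."""
    records = json.loads(Path(in_path).read_text())
    cleaned = [r for r in records if r.get("text")]
    Path(out_path).write_text(json.dumps(cleaned))


def augment(in_path, out_path):
    """Add a derived field; the input file is again left untouched."""
    records = json.loads(Path(in_path).read_text())
    for r in records:
        r["n_tokens"] = len(r["text"].split())
    Path(out_path).write_text(json.dumps(records))


# Hypothetical raw input, written to a temporary working directory
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.json"
raw.write_text(json.dumps([{"text": "hello world"}, {"text": ""}]))

# Each stage produces its own intermediate file
clean(raw, workdir / "clean.json")
augment(workdir / "clean.json", workdir / "augmented.json")
```

Because every intermediate file is kept, a failure in `augment` does not require re-running `clean`.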

Slide 9

Slide 9 text

Anti-Patterns • Bunch of scripts • Single run-everything script • Hacky homemade dependency control • Don’t reinvent the wheel

Slide 10

Slide 10 text

Intermezzo: let me rant about testing (Icon by Freepik from flaticon.com)

Slide 11

Slide 11 text

(Unit) Testing • Unit tests in three easy steps: • import unittest • Write your tests • Quit complaining about lack of time to write tests
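The three steps can be sketched with the standard library alone. The `tokenize` function is a hypothetical pipeline helper, invented here so the tests have something to exercise.

```python
import unittest


def tokenize(text):
    """Hypothetical pipeline helper: lowercase and split on whitespace."""
    return text.lower().split()


class TokenizeTest(unittest.TestCase):

    def test_lowercases_and_splits(self):
        self.assertEqual(tokenize("Hello PyData London"),
                         ["hello", "pydata", "london"])

    def test_empty_string_gives_no_tokens(self):
        self.assertEqual(tokenize(""), [])
```

Saved in a module, these tests can be run with `python -m unittest`.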

Slide 12

Slide 12 text

Benefits of (unit) testing • Safety net for refactoring • Safety net for lib upgrades • Validate your assumptions • Document code / communicate your intentions • You’re forced to think

Slide 13

Slide 13 text

Testing: not convinced yet?

Slide 14

Slide 14 text

Testing: not convinced yet?

Slide 15

Slide 15 text

Testing: not convinced yet? 
f1 = fscore(p, r)
min_bound, max_bound = sorted([p, r])
assert min_bound <= f1 <= max_bound
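The slide does not show `fscore` itself, so the following is a sketch that assumes it is the usual F1 score, the harmonic mean of precision and recall. The helper `check_f1_is_bounded` and the tolerance `tol` are additions for illustration: with floats, rounding may nudge the result marginally outside the interval, so the check allows a tiny margin.

```python
def fscore(precision, recall):
    """F1 score: harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def check_f1_is_bounded(p, r, tol=1e-12):
    """The property from the slide, with a small tolerance for float rounding."""
    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    assert min_bound - tol <= f1 <= max_bound + tol, (p, r, f1)


# Sweep a grid of precision/recall pairs in [0, 1]
grid = [i / 10 for i in range(11)]
for p in grid:
    for r in grid:
        check_f1_is_bounded(p, r)
```

A test like this states a property that must hold for any input, rather than a single hand-picked example.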

Slide 16

Slide 16 text

Testing: I’m almost done • Unit tests vs Defensive Programming • Say no to tautologies • Say no to vanity tests • Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …
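One possible reading of the first two bullets, as a sketch: the function `normalise` and both tests are hypothetical. The guard inside the function is defensive programming, checked at runtime. The first test is a tautology because it repeats the implementation. The second compares against an expectation worked out independently.

```python
def normalise(values):
    """Scale values so that they sum to 1."""
    total = sum(values)
    if total == 0:
        # Defensive programming: a runtime guard inside the function
        raise ValueError("cannot normalise values that sum to zero")
    return [v / total for v in values]


def test_tautology():
    # Repeats the implementation, so it can never disagree with it
    values = [1, 2, 3]
    assert normalise(values) == [v / sum(values) for v in values]


def test_known_answer():
    # Compares against an independently computed expectation
    assert normalise([1, 1, 2]) == [0.25, 0.25, 0.5]
```

Both tests pass today, but only the second would catch a bug introduced into the formula.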

Slide 17

Slide 17 text

Slide 18

Slide 18 text

Intro to Luigi: GNU Make + Unix pipes + Steroids • Workflow manager in Python, by Spotify • Dependency management • Error control, checkpoints, failure recovery • Minimal boilerplate • Dependency graph visualisation
$ pip install luigi
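To make the "GNU Make" analogy concrete, here is a toy, standard-library-only sketch of dependency management with checkpoints. It is not Luigi's implementation: `ToyTask` and `build` are hypothetical names, and the only idea borrowed is that a task counts as done when its output file exists.

```python
import tempfile
from pathlib import Path


class ToyTask:
    """Hypothetical stand-in for a workflow task (not a Luigi class)."""

    def __init__(self, path, deps=(), text=""):
        self.path = Path(path)
        self.deps = list(deps)
        self.text = text

    def complete(self):
        # The checkpoint: a task is done if its output file exists
        return self.path.exists()

    def run(self):
        # Concatenate upstream outputs, then append this task's own text
        upstream = "".join(d.path.read_text() for d in self.deps)
        self.path.write_text(upstream + self.text)


def build(task, log):
    """Run dependencies first, skipping any task whose output exists."""
    if task.complete():
        return
    for dep in task.deps:
        build(dep, log)
    task.run()
    log.append(task.path.name)


workdir = Path(tempfile.mkdtemp())
extract_task = ToyTask(workdir / "extract.txt", text="raw;")
clean_task = ToyTask(workdir / "clean.txt", deps=[extract_task], text="clean;")

first_run, second_run = [], []
build(clean_task, first_run)    # runs extract, then clean
build(clean_task, second_run)   # nothing to do: outputs already exist
```

After a failure, re-running the same command resumes from the last completed output, which is the recovery behaviour the slide refers to.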

Slide 19

Slide 19 text

Luigi Task: unit of execution

class MyTask(luigi.Task):

    def requires(self):
        pass  # list of dependencies

    def output(self):
        pass  # task output

    def run(self):
        pass  # task logic

Slide 20

Slide 20 text

Luigi Target: output of a task

class MyTarget(luigi.Target):

    def exists(self):
        pass  # return bool

Off-the-shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)

Slide 21

Slide 21 text

Not only Luigi • More Python-based workflow managers: • Airflow by Airbnb • Mrjob by Yelp • Pinball by Pinterest

Slide 22

Slide 22 text

When things go wrong • import logging • Say no to print() for debugging • Custom log format / extensive info • Different levels of severity • Easy to switch off or change level
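A minimal sketch of these points using the standard `logging` module. The logger name, the format string and the `clean_records` function are hypothetical choices for illustration.

```python
import logging

logger = logging.getLogger("pipeline.clean")  # hypothetical logger name

# Custom format: timestamp, logger name and severity on every line
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def clean_records(records):
    """Drop empty records, logging at different levels of severity."""
    logger.debug("raw records: %r", records)
    kept = [r for r in records if r]
    logger.info("kept %d of %d records", len(kept), len(records))
    if not kept:
        logger.warning("no records survived cleaning")
    return kept


clean_records(["a", "", "b"])     # emits the INFO line
logger.setLevel(logging.WARNING)  # one line changes the level
clean_records(["a", "", "b"])     # now silent: nothing at WARNING or above
```

Unlike scattered `print()` calls, the messages stay in the code and are switched on or off by changing the level.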

Slide 23

Slide 23 text

Who reads the logs? You’re not going to read the logs, unless… • E-mail notifications (built into Luigi) • Slack notifications
$ pip install luigi_slack # WIP

Slide 24

Slide 24 text

Summary • R&D is not Engineering: can we meet halfway? • Prototypes vs. Products • Automation and replicability matter • You need a workflow manager • Good engineering principles help: • Testing, logging, packaging, …

Slide 25

Slide 25 text

Vanity Slide • speakerdeck.com/marcobonzanini • github.com/bonzanini • marcobonzanini.com • @MarcoBonzanini