Slide 1

Slide 1 text

Building Data Pipelines in Python Marco Bonzanini QCon London 2017

Slide 2

Slide 2 text

Nice to meet you

Slide 3

Slide 3 text

R&D ≠ Engineering

Slide 4

Slide 4 text

R&D ≠ Engineering R&D results in production = high value

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Big Data Problems vs Big Data Problems

Slide 7

Slide 7 text

Data Pipelines (from 30,000ft) Data ETL Analytics

Slide 8

Slide 8 text

Data Pipelines (zooming in)
ETL: Extract, Transform, Load
Transform: Clean, Augment, Join

Slide 9

Slide 9 text

Good Data Pipelines
Easy to: Reproduce, Productise

Slide 10

Slide 10 text

Towards Good Data Pipelines

Slide 11

Slide 11 text

Towards Good Data Pipelines (a) Your Data is Dirty unless proven otherwise “It’s in the database, so it’s already good”

Slide 12

Slide 12 text

Towards Good Data Pipelines (b) All Your Data is Important unless proven otherwise

Slide 13

Slide 13 text

Towards Good Data Pipelines (b) All Your Data is Important unless proven otherwise Keep it. Transform it. Don’t overwrite it.
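One way to read "Keep it. Transform it. Don't overwrite it." is that each step writes a new file and leaves its input untouched. The following is a minimal sketch under that reading; the file names, the `clean_rows` helper and the cleaning rule are hypothetical and not taken from the talk.

```python
import csv
import tempfile
from pathlib import Path


def clean_rows(raw_path: Path, clean_path: Path) -> int:
    """Read raw_path, write cleaned rows to clean_path, return rows written.

    The raw file is opened read-only and is never modified.
    """
    if raw_path.resolve() == clean_path.resolve():
        raise ValueError("refusing to overwrite the raw data")
    written = 0
    with raw_path.open(newline="") as src, clean_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Hypothetical cleaning rule: strip whitespace, drop rows with no name.
            row = {key: value.strip() for key, value in row.items()}
            if row["name"]:
                writer.writerow(row)
                written += 1
    return written


# Demo on a throwaway directory: the raw file survives the transformation.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "users.raw.csv"
clean = workdir / "users.clean.csv"
raw.write_text("name,city\n Ada ,London\n,Paris\n")
raw_before = raw.read_text()
rows_written = clean_rows(raw, clean)
```

Because the raw file is kept, the cleaning rule can be changed later and the step re-run from the original data.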

Slide 14

Slide 14 text

Towards Good Data Pipelines (c) Pipelines vs Script Soups

Slide 15

Slide 15 text

Tasty, but not a pipeline Pic: Romanian potato soup from Wikipedia

Slide 16

Slide 16 text

$ ./do_something.sh
$ ./do_something_else.sh
$ ./extract_some_data.sh
$ ./join_some_other_data.sh
...
Anti-pattern: the script soup

Slide 17

Slide 17 text

Script soups kill replicability

Slide 18

Slide 18 text

$ cat ./run_everything.sh
./do_something.sh
./do_something_else.sh
./extract_some_data.sh
./join_some_other_data.sh
$ ./run_everything.sh
Anti-pattern: the master script

Slide 19

Slide 19 text

Towards Good Data Pipelines (d) Break it Down setup.py and conda

Slide 20

Slide 20 text

Towards Good Data Pipelines (e) Automated Testing i.e. why scientists don’t write unit tests

Slide 21

Slide 21 text

Intermezzo Let me rant about testing Icon by Freepik from flaticon.com

Slide 22

Slide 22 text

(Unit) Testing
Unit tests in three easy steps:
• import unittest
• Write your tests
• Quit complaining about lack of time to write tests
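The first two steps can be sketched as follows. The function under test, `normalise_name`, is a hypothetical pipeline helper invented for this illustration; only the `unittest` calls come from the standard library.

```python
import unittest


def normalise_name(name: str) -> str:
    """Hypothetical pipeline helper: trim and lower-case a name."""
    return name.strip().lower()


class NormaliseNameTest(unittest.TestCase):
    def test_strips_whitespace(self):
        self.assertEqual(normalise_name("  Ada "), "ada")

    def test_is_idempotent(self):
        # Cleaning already-clean data should change nothing.
        once = normalise_name(" Ada ")
        self.assertEqual(normalise_name(once), once)


# Run the tests programmatically; from a shell, `python -m unittest` also works.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormaliseNameTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```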

Slide 23

Slide 23 text

Benefits of (unit) testing
• Safety net for refactoring
• Safety net for lib upgrades
• Validate your assumptions
• Document code / communicate your intentions
• You’re forced to think

Slide 24

Slide 24 text

Testing: not convinced yet?

Slide 25

Slide 25 text

Testing: not convinced yet?

Slide 26

Slide 26 text

Testing: not convinced yet?
f1 = fscore(p, r)
min_bound, max_bound = sorted([p, r])
assert min_bound <= f1 <= max_bound
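The assertion above states a property: the F-score always lies between precision and recall. A minimal sketch of checking that property over many random inputs follows. It assumes `fscore` is the usual harmonic mean of precision and recall (the slide does not show its definition), and `check_f1_bounds` is a hypothetical helper; a library such as hypothesis could generate the inputs instead.

```python
import random


def fscore(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def check_f1_bounds(trials: int = 1000, seed: int = 0) -> int:
    """Check min(p, r) <= F1 <= max(p, r) on random inputs; return trials run."""
    rng = random.Random(seed)
    for _ in range(trials):
        p, r = rng.random(), rng.random()
        f1 = fscore(p, r)
        min_bound, max_bound = sorted([p, r])
        # Small tolerance to absorb floating-point rounding.
        assert min_bound - 1e-12 <= f1 <= max_bound + 1e-12
    return trials


checked = check_f1_bounds()
```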

Slide 27

Slide 27 text

Testing: I’m almost done
• Unit tests vs Defensive Programming
• Say no to tautologies
• Say no to vanity tests
• The Python ecosystem is rich: py.test, nosetests, hypothesis, coverage.py, …

Slide 28

Slide 28 text

Slide 29

Slide 29 text

Towards Good Data Pipelines (f) Orchestration Don’t re-invent the wheel

Slide 30

Slide 30 text

You need a workflow manager
Think: GNU Make + Unix pipes + Steroids

Slide 31

Slide 31 text

Intro to Luigi
• Task dependency management
• Error control, checkpoints, failure recovery
• Minimal boilerplate
• Dependency graph visualisation
$ pip install luigi
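To make "task dependency management" and "checkpoints" concrete, here is a toy sketch using only the standard library. It is not Luigi's API or implementation: `Task`, `Extract`, `Sort` and `build` are hypothetical names, and the only idea illustrated is that a task is skipped when its output already exists.

```python
import tempfile
from pathlib import Path


class Task:
    """Hypothetical stand-in for a workflow task (not the Luigi API)."""

    def __init__(self, workdir: Path):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self) -> Path:
        raise NotImplementedError

    def run(self) -> None:
        raise NotImplementedError


class Extract(Task):
    def output(self):
        return self.workdir / "extracted.txt"

    def run(self):
        self.output().write_text("3\n1\n2\n")


class Sort(Task):
    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return self.workdir / "sorted.txt"

    def run(self):
        lines = Extract(self.workdir).output().read_text().split()
        self.output().write_text("\n".join(sorted(lines)) + "\n")


def build(task: Task, executed: list) -> None:
    """Run dependencies first; skip any task whose output already exists."""
    if task.output().exists():
        return  # checkpoint: nothing to do
    for dependency in task.requires():
        build(dependency, executed)
    task.run()
    executed.append(type(task).__name__)


workdir = Path(tempfile.mkdtemp())
first_run, second_run = [], []
build(Sort(workdir), first_run)   # runs Extract, then Sort
build(Sort(workdir), second_run)  # output exists, so nothing runs
```

A real workflow manager adds what this sketch leaves out: scheduling, failure recovery, notifications and a view of the dependency graph.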

Slide 32

Slide 32 text

Luigi Task: unit of execution
class MyTask(luigi.Task):
    def requires(self):
        return [SomeTask()]
    def output(self):
        return luigi.LocalTarget(…)
    def run(self):
        mylib.run()

Slide 33

Slide 33 text

Luigi Target: output of a task
class MyTarget(luigi.Target):
    def exists(self):
        ... # return bool
Great off the shelf support: local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

Intro to Airflow
• Like Luigi, just younger
• Nicer (?) GUI
• Scheduling
• Apache Project

Slide 36

Slide 36 text

Towards Good Data Pipelines (g) When things go wrong The Joy of debugging

Slide 37

Slide 37 text

import logging
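A minimal sketch of what that import buys you in a pipeline step. The logger name `pipeline.extract` and the `extract` function are hypothetical; the handler writes to an in-memory buffer only to keep the example self-contained, where a real pipeline would more likely log to a file or stderr.

```python
import io
import logging

# In-memory stream so the example can inspect what was logged.
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)

logger = logging.getLogger("pipeline.extract")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger


def extract(rows):
    """Hypothetical step: drop empty rows and log what happened."""
    logger.info("extract started: %d rows", len(rows))
    kept = [row for row in rows if row]
    if len(kept) < len(rows):
        logger.warning("dropped %d empty rows", len(rows) - len(kept))
    return kept


kept_rows = extract(["a", "", "b"])
log_output = log_stream.getvalue()
```

Unlike `print`, each record carries a timestamp, a level and the name of the step that emitted it, so log output can be filtered after the fact.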

Slide 38

Slide 38 text

Who reads the logs?
You’re not going to read the logs, unless…
• E-mail notifications (built into Luigi)
• Slack notifications
$ pip install luigi_slack # WIP

Slide 39

Slide 39 text

Towards Good Data Pipelines (h) Static Analysis The Joy of Duck Typing

Slide 40

Slide 40 text

If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck. — somebody on the Web

Slide 41

Slide 41 text

>>> 1.0 == 1 == True
True
>>> 1 + True
2

Slide 42

Slide 42 text

>>> '1' * 2
'11'
>>> '1' + 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object to str implicitly

Slide 43

Slide 43 text

def do_stuff(a: int, b: int) -> str:
    ...
    return something
PEP 3107: Function Annotations (since Python 3.0)
(annotations are stored on the function but not enforced by the interpreter)
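A short sketch of what the missing enforcement means in practice. The body of `do_stuff` here is invented for illustration (the slide elides it): the function is deliberately called with arguments that contradict its annotations, and nothing complains at runtime.

```python
def do_stuff(a: int, b: int) -> str:
    # Illustrative body only; the annotations play no part in execution.
    return a + b


# The annotations are recorded on the function object...
annotations = do_stuff.__annotations__

# ...but never checked: strings are accepted where int is declared,
# and an int is returned where str is declared.
result = do_stuff("x", "y")
wrong_return = do_stuff(1, 2)
```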

Slide 44

Slide 44 text

typing module: semantically coherent
PEP 484: Type Hints (since Python 3.5)
(still not enforced by the interpreter)
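A minimal sketch of hints built from the typing module. The functions `count_words` and `most_common` are hypothetical examples, not code from the talk; the hints describe container contents and optional results that plain annotations such as `list` or `dict` would leave unstated, and a static checker can verify them without running the code.

```python
from typing import Dict, List, Optional


def count_words(lines: List[str]) -> Dict[str, int]:
    """Count word occurrences across all lines."""
    counts: Dict[str, int] = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def most_common(counts: Dict[str, int]) -> Optional[str]:
    """Return the most frequent word, or None when there are no words."""
    if not counts:
        return None
    return max(counts, key=lambda word: counts[word])


counts = count_words(["a b a", "b a"])
top = most_common(counts)
```

The `Optional[str]` return type makes the empty case explicit, so callers are reminded to handle `None`.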

Slide 45

Slide 45 text

pip install mypy

Slide 46

Slide 46 text

• Add optional types
• Run: mypy --follow-imports silent mylib
• Refine gradual typing (e.g. Any)

Slide 47

Slide 47 text

Summary
Basic engineering principles help (packaging, testing, orchestration, logging, static analysis, ...)

Slide 48

Slide 48 text

Summary
R&D is not Engineering: can we meet halfway?

Slide 49

Slide 49 text

Vanity Slide
• speakerdeck.com/marcobonzanini
• github.com/bonzanini
• marcobonzanini.com
• @MarcoBonzanini