Building Data Pipelines in Python @ QCon London 2017

Slides for my talk at QCon London 2017:

This talk discusses the process of building data pipelines (e.g. extraction, cleaning, integration and pre-processing of data): in general, all the steps needed to prepare your data for a data-driven product. In particular, the focus is on data plumbing and on the practice of going from prototype to production.

Starting from some common anti-patterns, we'll highlight the need for a workflow manager for any non-trivial project.

We'll make the case for Luigi as an interesting option, and look at where it fits in the bigger picture of deploying a data product.

Marco Bonzanini

March 06, 2017


  1. Building Data Pipelines in Python Marco Bonzanini QCon London 2017

  2. Nice to meet you

  3. R&D ≠ Engineering

  4. R&D ≠ Engineering R&D results in production = high value

  5. None
  6. Big Data Problems vs Big Data Problems

  7. Data Pipelines (from 30,000ft) Data ETL Analytics

  8. Data Pipelines (zooming in) ETL {Extract Transform Load {Clean Augment

  9. Good Data Pipelines Easy to Reproduce Productise {

  10. Towards Good Data Pipelines

  11. Towards Good Data Pipelines (a) Your Data is Dirty unless proven otherwise: “It’s in the database, so it’s already good”

  12. Towards Good Data Pipelines (b) All Your Data is Important unless proven otherwise

  13. Towards Good Data Pipelines (b) All Your Data is Important unless proven otherwise. Keep it. Transform it. Don’t overwrite it.
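
One way to read "Keep it. Transform it. Don't overwrite it." is that each cleaning step writes to a new file and leaves the raw input untouched. The sketch below assumes CSV input; the function name `clean_file` and the file names are hypothetical, chosen only for illustration.

```python
import csv
import os
import tempfile


def clean_file(raw_path, cleaned_path):
    """Read raw rows and write stripped, non-empty rows to a *new* file."""
    if os.path.abspath(raw_path) == os.path.abspath(cleaned_path):
        raise ValueError("refusing to overwrite the raw data")
    with open(raw_path, newline="") as src, open(cleaned_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            row = [cell.strip() for cell in row]
            if any(row):  # skip rows where every cell is empty
                writer.writerow(row)


# Hypothetical file names in a scratch directory.
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "users.raw.csv")
cleaned_path = os.path.join(workdir, "users.cleaned.csv")

with open(raw_path, "w", newline="") as handle:
    handle.write(" alice ,30\n,\nbob, 25 \n")

clean_file(raw_path, cleaned_path)  # the raw file is still there afterwards
```

Because the raw file survives, a bug in the cleaning logic can be fixed and the step re-run without going back to the original data source.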
  14. Towards Good Data Pipelines (c) Pipelines vs Script Soups

  15. Tasty, but not a pipeline Pic: Romanian potato soup from

  16. Anti-pattern: the script soup

        $ ./do_something.sh
        $ ./do_something_else.sh
        $ ./extract_some_data.sh
        $ ./join_some_other_data.sh
        ...

  17. Script soups kill replicability

  18. Anti-pattern: the master script

        $ cat ./run_everything.sh
        ./do_something.sh
        ./do_something_else.sh
        ./extract_some_data.sh
        ./join_some_other_data.sh
        $ ./run_everything.sh

  19. Towards Good Data Pipelines (d) Break it Down setup.py and

  20. Towards Good Data Pipelines (e) Automated Testing, i.e. why scientists don’t write unit tests

  21. Intermezzo Let me rant about testing Icon by Freepik from

  22. (Unit) Testing: Unit tests in three easy steps:
      • import unittest
      • Write your tests
      • Quit complaining about lack of time to write tests

  23. Benefits of (unit) testing
      • Safety net for refactoring
      • Safety net for lib upgrades
      • Validate your assumptions
      • Document code / communicate your intentions
      • You’re forced to think

  24. Testing: not convinced yet?

  25. Testing: not convinced yet?

  26. Testing: not convinced yet?

        f1 = fscore(p, r)
        min_bound, max_bound = sorted([p, r])
        assert min_bound <= f1 <= max_bound

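
The bound check on this slide can be turned into a small unit test. The sketch below assumes `fscore` is the usual F1 (harmonic mean of precision and recall); the slide does not show its body, so the implementation and the test case values here are illustrative assumptions.

```python
import unittest


def fscore(precision: float, recall: float) -> float:
    """Assumed definition: harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)


class FscoreBoundsTest(unittest.TestCase):
    def test_f1_lies_between_precision_and_recall(self):
        # Hand-picked (precision, recall) pairs, including edge cases.
        cases = [(0.0, 0.0), (0.5, 0.5), (0.9, 0.1), (1.0, 0.25), (0.0, 1.0)]
        for p, r in cases:
            with self.subTest(p=p, r=r):
                f1 = fscore(p, r)
                min_bound, max_bound = sorted([p, r])
                self.assertGreaterEqual(f1, min_bound)
                self.assertLessEqual(f1, max_bound)
```

Saved in a test module, this runs with `python -m unittest`. The test checks a property of the result rather than a hard-coded value, so it would also catch an implementation that accidentally returns the arithmetic mean of something else entirely, or a value outside the valid range.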
  27. Testing: I’m almost done
      • Unit tests vs Defensive Programming
      • Say no to tautologies
      • Say no to vanity tests
      • The Python ecosystem is rich: py.test, nosetests, hypothesis, coverage.py, …
  28. </rant>

  29. Towards Good Data Pipelines (f) Orchestration Don’t re-invent the wheel

  30. You need a workflow manager. Think: GNU Make + Unix pipes + Steroids

  31. Intro to Luigi
      • Task dependency management
      • Error control, checkpoints, failure recovery
      • Minimal boilerplate
      • Dependency graph visualisation

        $ pip install luigi

  32. Luigi Task: unit of execution

        class MyTask(luigi.Task):
            def requires(self):
                return [SomeTask()]

            def output(self):
                return luigi.LocalTarget(…)

            def run(self):
                mylib.run()

  33. Luigi Target: output of a task

        class MyTarget(luigi.Target):
            def exists(self):
                ...  # return bool

      Great off-the-shelf support: local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)

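
To make the requires/output/run contract and the "checkpoint" idea concrete without depending on Luigi, here is a standard-library-only sketch. It is not Luigi's implementation: `FileTarget`, `Task`, `build`, `Extract` and `SortNumbers` are hypothetical names invented for this illustration, and the runner is deliberately naive (no scheduling, no atomic writes, no failure recovery).

```python
import os
import tempfile


class FileTarget:
    """Hypothetical stand-in for a target: something that exists or does not."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class Task:
    """Hypothetical base class mirroring the requires/output/run shape."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run dependencies first; skip any task whose output already exists."""
    if task.output().exists():
        log.append(f"skip {type(task).__name__}")
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(f"run {type(task).__name__}")


WORKDIR = tempfile.mkdtemp()


class Extract(Task):
    def output(self):
        return FileTarget(os.path.join(WORKDIR, "raw.txt"))

    def run(self):
        with open(self.output().path, "w") as handle:
            handle.write("3\n1\n2\n")


class SortNumbers(Task):
    def requires(self):
        return [Extract()]

    def output(self):
        return FileTarget(os.path.join(WORKDIR, "sorted.txt"))

    def run(self):
        with open(self.requires()[0].output().path) as handle:
            numbers = sorted(int(line) for line in handle)
        with open(self.output().path, "w") as handle:
            handle.write("\n".join(map(str, numbers)) + "\n")


first_run, second_run = [], []
build(SortNumbers(), first_run)   # runs Extract, then SortNumbers
build(SortNumbers(), second_run)  # output already exists, so nothing is re-run
```

The second call does no work because the target acts as a checkpoint, which is the Make-like behaviour the earlier slide alludes to. A real workflow manager adds what this sketch leaves out, such as the error control and dependency graph visualisation listed on the Luigi intro slide.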
  34. None
  35. Intro to Airflow
      • Like Luigi, just younger
      • Nicer (?) GUI
      • Scheduling
      • Apache Project
  36. Towards Good Data Pipelines (g) When things go wrong: The Joy of debugging
  37. import logging
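
A minimal sketch of what `import logging` can look like inside a pipeline step. The logger name `mypipeline`, the function `clean_records` and the log format are assumptions made for illustration, not something shown on the slide.

```python
import logging

logger = logging.getLogger("mypipeline")  # hypothetical logger name


def clean_records(records):
    """Drop blank records, logging what was discarded instead of failing silently."""
    cleaned = [record for record in records if record.strip()]
    dropped = len(records) - len(cleaned)
    if dropped:
        logger.warning("Dropped %d empty record(s) out of %d", dropped, len(records))
    logger.info("Kept %d record(s)", len(cleaned))
    return cleaned


if __name__ == "__main__":
    # Configure logging once, at the entry point, not inside library code.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    clean_records(["a", "", "b", "   "])
```

Keeping the configuration at the entry point means the same function can be reused by an orchestrator that sets up its own handlers.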

  38. Who reads the logs? You’re not going to read the logs, unless…
      • E-mail notifications (built into Luigi)
      • Slack notifications

        $ pip install luigi_slack # WIP

  39. Towards Good Data Pipelines (h) Static Analysis: The Joy of Duck Typing

  40. “If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.” (somebody on the Web)
  41.   >>> 1.0 == 1 == True
        True
        >>> 1 + True
        2

  42.   >>> '1' * 2
        '11'
        >>> '1' + 2
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: Can't convert 'int' object to str implicitly

  43.   def do_stuff(a: int, b: int) -> str:
            ...
            return something

      PEP 3107: Function Annotations (since Python 3.0) (annotations are ignored by the interpreter)

  44. typing module: semantically coherent. PEP 484: Type Hints (since Python 3.5) (still ignored by the interpreter)
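
A short demonstration of "ignored by the interpreter": the annotations are stored as metadata, but nothing checks them at call time. The body of `do_stuff` below is an assumption, since the slide elides it.

```python
from typing import get_type_hints


def do_stuff(a: int, b: int) -> str:
    # Illustrative body only; the slide does not show one.
    return str(a) + str(b)


# The annotations are available as metadata on the function...
hints = get_type_hints(do_stuff)

# ...but they are not enforced: this call passes strings where ints are
# annotated, and it still runs without raising.
result = do_stuff("1", "2")
```

A static checker such as mypy would flag the `do_stuff("1", "2")` call before the code runs, which is the gap static analysis is meant to close.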
  45. pip install mypy

  46. • Add optional types
      • Run: mypy --follow-imports silent mylib
      • Refine gradual typing (e.g. Any)
  47. Summary: Basic engineering principles help (packaging, testing, orchestration, logging, static analysis, ...)

  48. Summary: R&D is not Engineering: can we meet halfway?

  49. Vanity Slide
      • speakerdeck.com/marcobonzanini
      • github.com/bonzanini
      • marcobonzanini.com
      • @MarcoBonzanini