Building Data Pipelines in Python

Building Data Pipelines in Python

Slides of my talk at PyCon7 in Florence (April 2016)

Aa38bb7a9c35bc414da6ec7dcd8d7339?s=128

Marco Bonzanini

April 16, 2016
Tweet

Transcript

  1. Building Data Pipelines in Python Marco Bonzanini ! PyCon Italia

    - Florence 2016
  2. Nice to meet you • @MarcoBonzanini • “Type B” Data

    Scientist • PhD in Information Retrieval • Book with PacktPub (July 2016) • Usually at PyData London
  3. R&D ≠ Engineering R&D results in production = high value

  4. None
  5. Big Data Problems vs Big Data Problems

  6. Data Pipelines Data ETL Analytics • Many components in a

    data pipeline: • Extract, Clean, Augment, Join data
  7. Good Data Pipelines Easy to reproduce Easy to productise

  8. Towards Good Pipelines • Transform your data, don’t overwrite •

    Break it down into components • Different packages (e.g. setup.py) • Unit tests vs end-to-end tests Good = Replicable and Productisable
  9. Anti-Patterns • Bunch of scripts • Single run-everything script •

    Hacky homemade dependency control • Don’t reinvent the wheel
  10. Intermezzo Let me rant about testing Icon by Freepik from

    flaticon.com
  11. (Unit) Testing • Unit tests in three easy steps: •

    import unittest • Write your tests • Quit complaining about lack of time to write tests
  12. Benefits of (unit) testing • Safety net for refactoring •

    Safety net for lib upgrades • Validate your assumptions • Document code / communicate your intentions • You’re forced to think
  13. Testing: not convinced yet?

  14. Testing: not convinced yet?

  15. Testing: not convinced yet? 
 f1 = fscore(p, r) min_bound,

    max_bound = sorted([p, r]) assert min_bound <= f1 <= max_bound
  16. Testing: I’m almost done • Unit tests vs Defensive Programming

    • Say no to tautologies • Say no to vanity tests • Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …
  17. </rant>

  18. Intro to Luigi GNU Make + Unix pipes + Steroids

    • Workflow manager in Python, by Spotify • Dependency management • Error control, checkpoints, failure recovery • Minimal boilerplate • Dependency graph visualisation $ pip install luigi
  19. Luigi Task: unit of execution class MyTask(luigi.Task): ! def requires(self):

    pass # list of dependencies def output(self): pass # task output def run(self): pass # task logic
  20. Luigi Target: output of a task class MyTarget(luigi.Target): ! def

    exists(self): pass # return bool Off the shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
  21. Not only Luigi • More Python-based workflow managers: • Airflow

    by Airbnb • Mrjob by Yelp • Pinball by Pinterest
  22. When things go wrong • import logging • Say no

    to print() for debugging • Custom log format / extensive info • Different levels of severity • Easy to switch off or change level
  23. Who reads the logs? You’re not going to read the

    logs, unless… • E-mail notifications • built-in in Luigi • Slack notifications $ pip install luigi_slack # WIP
  24. Summary • R&D is not Engineering: can we meet halfway?

    • Prototypes vs. Products • Automation and replicability matter • You need a workflow manager • Good engineering principles help: • Testing, logging, packaging, …
  25. Vanity Slide • speakerdeck.com/marcobonzanini • github.com/bonzanini • marcobonzanini.com • @MarcoBonzanini