
Building Data Pipelines in Python


Slides of my talk at PyCon7 in Florence (April 2016)

Marco Bonzanini

April 16, 2016

Transcript

  1. Building Data Pipelines in Python
    Marco Bonzanini
    PyCon Italia, Florence 2016
  2. Nice to meet you
    • @MarcoBonzanini
    • “Type B” Data Scientist
    • PhD in Information Retrieval
    • Book with PacktPub (July 2016)
    • Usually at PyData London
  3. R&D ≠ Engineering
    R&D results in production = high value

  4. (image only, no transcript text)
  5. Big Data Problems vs Big Data Problems

  6. Data Pipelines
    Data, ETL, Analytics
    • Many components in a data pipeline:
    • Extract, Clean, Augment, Join data
  7. Good Data Pipelines
    • Easy to reproduce
    • Easy to productise

  8. Towards Good Pipelines
    • Transform your data, don’t overwrite
    • Break it down into components
    • Different packages (e.g. setup.py)
    • Unit tests vs end-to-end tests
    Good = Replicable and Productisable
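One way to read the first two bullets of slide 8: each step is a small function that returns new data instead of modifying its input, so every intermediate result can be inspected and tested separately. A minimal sketch, in which the record layout and the function names (`clean`, `augment`, `run_pipeline`) are illustrative assumptions and not taken from the talk:

```python
from typing import Dict, List

Record = Dict[str, str]


def clean(records: List[Record]) -> List[Record]:
    """Return new records with whitespace stripped; the input is left untouched."""
    return [{key: value.strip() for key, value in rec.items()} for rec in records]


def augment(records: List[Record]) -> List[Record]:
    """Return new records with an extra 'name_length' field."""
    return [dict(rec, name_length=str(len(rec["name"]))) for rec in records]


def run_pipeline(records: List[Record]) -> List[Record]:
    """Chain the components; each one can also be unit-tested on its own."""
    return augment(clean(records))


raw = [{"name": "  Ada "}, {"name": "Guido"}]  # made-up sample data
result = run_pipeline(raw)
```

Because `raw` is never overwritten, the pipeline can be re-run from the original data at any time.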
  9. Anti-Patterns
    • Bunch of scripts
    • Single run-everything script
    • Hacky homemade dependency control
    • Don’t reinvent the wheel
  10. Intermezzo
    Let me rant about testing
    Icon by Freepik from flaticon.com
  11. (Unit) Testing
    • Unit tests in three easy steps:
    • import unittest
    • Write your tests
    • Quit complaining about lack of time to write tests
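The first two steps on slide 11 can be as small as the sketch below. The function under test, `normalise`, is a hypothetical example chosen for illustration; it does not come from the talk.

```python
import unittest


def normalise(text):
    """Hypothetical function under test: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


class NormaliseTest(unittest.TestCase):

    def test_lowercases_input(self):
        self.assertEqual(normalise("PyCon"), "pycon")

    def test_collapses_whitespace(self):
        self.assertEqual(normalise("  data   pipelines "), "data pipelines")

    def test_empty_string(self):
        self.assertEqual(normalise(""), "")
```

Saved as, say, `test_normalise.py`, the tests can be run with `python -m unittest test_normalise`.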
  12. Benefits of (unit) testing
    • Safety net for refactoring
    • Safety net for lib upgrades
    • Validate your assumptions
    • Document code / communicate your intentions
    • You’re forced to think
  13. Testing: not convinced yet?

  14. Testing: not convinced yet?

  15. Testing: not convinced yet?
        f1 = fscore(p, r)
        min_bound, max_bound = sorted([p, r])
        assert min_bound <= f1 <= max_bound
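The slide does not show `fscore` itself. Assuming it is the usual F1 score (the harmonic mean of precision `p` and recall `r`), the check can be made runnable as below. Returning 0.0 when both inputs are zero and the small `EPSILON` tolerance for floating-point rounding are choices made for this sketch, not part of the slide.

```python
def fscore(p, r):
    """F1 score: harmonic mean of precision p and recall r (assumed definition)."""
    if p + r == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 2 * p * r / (p + r)


EPSILON = 1e-12  # tolerance for floating-point rounding (an addition to the slide's check)

# Check the property on a grid of precision/recall values between 0 and 1.
values = [i / 10 for i in range(11)]
for p in values:
    for r in values:
        f1 = fscore(p, r)
        min_bound, max_bound = sorted([p, r])
        assert min_bound - EPSILON <= f1 <= max_bound + EPSILON, (p, r, f1)
```

A harmonic mean always lies between its two inputs, so an F-score outside those bounds points to a bug in the implementation.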
  16. Testing: I’m almost done
    • Unit tests vs Defensive Programming
    • Say no to tautologies
    • Say no to vanity tests
    • Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …
  17. </rant>

  18. Intro to Luigi
    GNU Make + Unix pipes + Steroids
    • Workflow manager in Python, by Spotify
    • Dependency management
    • Error control, checkpoints, failure recovery
    • Minimal boilerplate
    • Dependency graph visualisation
        $ pip install luigi
  19. Luigi Task: unit of execution
        import luigi

        class MyTask(luigi.Task):
            def requires(self):
                pass  # list of dependencies
            def output(self):
                pass  # task output
            def run(self):
                pass  # task logic
  20. Luigi Target: output of a task
        class MyTarget(luigi.Target):
            def exists(self):
                pass  # return bool
    Off-the-shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
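The contract on slides 19 and 20 (a task declares its dependencies and its output, and an output that already exists acts as a checkpoint) can be illustrated without installing anything. The sketch below is a toy written with the standard library only. `ToyTask`, `build`, `Extract` and `Sort` are hypothetical names, and none of this is Luigi's actual API or implementation.

```python
import os
import tempfile


class ToyTask:
    """Toy stand-in for the Task idea; this is not Luigi's API or implementation."""

    def requires(self):
        return []  # tasks that must be complete first

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError  # task logic

    def complete(self):
        return os.path.exists(self.output())


def build(task, log):
    """Run dependencies first, then the task, skipping anything already done."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


workdir = tempfile.mkdtemp()


class Extract(ToyTask):
    def output(self):
        return os.path.join(workdir, "raw.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("3\n1\n2\n")


class Sort(ToyTask):
    def requires(self):
        return [Extract()]

    def output(self):
        return os.path.join(workdir, "sorted.txt")

    def run(self):
        with open(Extract().output()) as f:
            lines = sorted(f)
        with open(self.output(), "w") as f:
            f.writelines(lines)


first_run, second_run = [], []
build(Sort(), first_run)   # runs Extract, then Sort
build(Sort(), second_run)  # nothing to do: the output already exists
```

The second call does no work because `sorted.txt` is already on disk, which is the Make-like behaviour the slides describe. A real workflow manager adds scheduling, failure recovery and visualisation on top of this idea.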
  21. Not only Luigi
    • More Python-based workflow managers:
    • Airflow by Airbnb
    • Mrjob by Yelp
    • Pinball by Pinterest
  22. When things go wrong
    • import logging
    • Say no to print() for debugging
    • Custom log format / extensive info
    • Different levels of severity
    • Easy to switch off or change level
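The bullets on slide 22 map directly onto the standard `logging` module. In the sketch below, the logger name `pipeline.clean`, the format string and the messages are illustrative assumptions. The in-memory stream stands in for a log file or stderr so that the output can be inspected.

```python
import io
import logging

stream = io.StringIO()  # stand-in for a log file or stderr
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("pipeline.clean")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False  # keep the example's output confined to our handler
logger.setLevel(logging.INFO)

logger.debug("row-level detail")  # suppressed at INFO level
logger.info("cleaned %d records", 42)
logger.warning("dropped %d malformed rows", 3)

logger.setLevel(logging.WARNING)  # one line to quieten the pipeline
logger.info("this message is now suppressed")

log_output = stream.getvalue()
```

Unlike `print()` calls, these messages carry a severity and a source, and they can be silenced or redirected without touching the pipeline code. Adding `%(asctime)s` to the format string would include a timestamp.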
  23. Who reads the logs?
    You’re not going to read the logs, unless…
    • E-mail notifications
    • built-in in Luigi
    • Slack notifications
        $ pip install luigi_slack # WIP
  24. Summary
    • R&D is not Engineering: can we meet halfway?
    • Prototypes vs. Products
    • Automation and replicability matter
    • You need a workflow manager
    • Good engineering principles help:
    • Testing, logging, packaging, …
  25. Vanity Slide
    • speakerdeck.com/marcobonzanini
    • github.com/bonzanini
    • marcobonzanini.com
    • @MarcoBonzanini