Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Building Data Pipelines in Python

Building Data Pipelines in Python

Slides of my talk at PyCon7 in Florence (April 2016)

Marco Bonzanini

April 16, 2016
Tweet

More Decks by Marco Bonzanini

Other Decks in Programming

Transcript

  1. Nice to meet you • @MarcoBonzanini • “Type B” Data

    Scientist • PhD in Information Retrieval • Book with PacktPub (July 2016) • Usually at PyData London
  2. Data Pipelines Data ETL Analytics • Many components in a

    data pipeline: • Extract, Clean, Augment, Join data
  3. Towards Good Pipelines • Transform your data, don’t overwrite •

    Break it down into components • Different packages (e.g. setup.py) • Unit tests vs end-to-end tests Good = Replicable and Productisable
  4. Anti-Patterns • Bunch of scripts • Single run-everything script •

    Hacky homemade dependency control • Don’t reinvent the wheel
  5. (Unit) Testing • Unit tests in three easy steps: •

    import unittest • Write your tests • Quit complaining about lack of time to write tests
  6. Benefits of (unit) testing • Safety net for refactoring •

    Safety net for lib upgrades • Validate your assumptions • Document code / communicate your intentions • You’re forced to think
  7. Testing: not convinced yet? 
 f1 = fscore(p, r) min_bound,

    max_bound = sorted([p, r]) assert min_bound <= f1 <= max_bound
  8. Testing: I’m almost done • Unit tests vs Defensive Programming

    • Say no to tautologies • Say no to vanity tests • Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …
  9. Intro to Luigi GNU Make + Unix pipes + Steroids

    • Workflow manager in Python, by Spotify • Dependency management • Error control, checkpoints, failure recovery • Minimal boilerplate • Dependency graph visualisation $ pip install luigi
  10. Luigi Task: unit of execution class MyTask(luigi.Task): ! def requires(self):

    pass # list of dependencies def output(self): pass # task output def run(self): pass # task logic
  11. Luigi Target: output of a task class MyTarget(luigi.Target): ! def

    exists(self): pass # return bool Off the shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
  12. Not only Luigi • More Python-based workflow managers: • Airflow

    by Airbnb • Mrjob by Yelp • Pinball by Pinterest
  13. When things go wrong • import logging • Say no

    to print() for debugging • Custom log format / extensive info • Different levels of severity • Easy to switch off or change level
  14. Who reads the logs? You’re not going to read the

    logs, unless… • E-mail notifications • built-in in Luigi • Slack notifications $ pip install luigi_slack # WIP
  15. Summary • R&D is not Engineering: can we meet halfway?

    • Prototypes vs. Products • Automation and replicability matter • You need a workflow manager • Good engineering principles help: • Testing, logging, packaging, …