Building Data Pipelines in Python - PyData London 2016

Slides for my talk at PyData London 2016 on Building Data Pipelines with Python: http://pydata.org/london2016/schedule/presentation/7/

Abstract:
This talk discusses the process of building data pipelines, e.g. extraction, cleaning, integration, pre-processing of data, in general all the steps that are necessary to prepare your data for your data-driven product. In particular, the focus is on data plumbing and on the practice of going from prototype to production.

Starting from some common anti-patterns, we'll highlight the need for a workflow manager for any non-trivial project.

We'll make the case for Luigi as an option worth considering, and we'll look at where it fits in the bigger picture of deploying a data product.

Marco Bonzanini

May 07, 2016

Transcript

  1. Nice to meet you • @MarcoBonzanini • “Type B” Data Scientist • PhD in Information Retrieval • Book with PacktPub (July 2016)
  2. Data Pipelines Data ETL Analytics • Many components in a data pipeline: • Extract, Clean, Augment, Join data
  3. Towards Good Pipelines • Transform your data, don’t overwrite • Break it down into components • Different packages (e.g. setup.py) • Unit tests vs end-to-end tests • Good = Replicable and Productisable
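A minimal sketch of the "transform, don't overwrite" point from slide 3, assuming a CSV input. The function name `clean_rows` and the file names are hypothetical, chosen only for illustration. The cleaning step reads the raw file and writes a new one, so the raw data stays available for re-runs and debugging.

```python
import csv
import tempfile
from pathlib import Path


def clean_rows(raw_path: Path, clean_path: Path) -> int:
    """Write stripped, non-empty rows to clean_path; raw_path is never modified."""
    kept = 0
    with raw_path.open(newline="") as src, clean_path.open("w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            cells = [cell.strip() for cell in row]
            if any(cells):  # drop rows where every cell is empty
                writer.writerow(cells)
                kept += 1
    return kept


def demo() -> tuple:
    """Run the step in a temporary directory and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.csv"      # hypothetical input file
        clean = Path(tmp) / "clean.csv"  # separate output file
        raw_text = " a , b \n,\n c ,d\n"
        raw.write_text(raw_text)
        kept = clean_rows(raw, clean)
        raw_untouched = raw.read_text() == raw_text
        return kept, raw_untouched, clean.read_text()
```

Because each stage writes to its own output, a later stage can be re-run or tested without repeating the extraction.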
  4. Anti-Patterns • Bunch of scripts • Single run-everything script • Hacky homemade dependency control • Don’t reinvent the wheel
  5. (Unit) Testing • Unit tests in three easy steps: • import unittest • Write your tests • Quit complaining about lack of time to write tests
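The first two steps on slide 5 might look like the sketch below. The helper `normalise_name` is a hypothetical cleaning function invented for this example; only `unittest` itself comes from the slide.

```python
import unittest


def normalise_name(name: str) -> str:
    """Hypothetical cleaning step: trim, collapse inner whitespace, title-case."""
    return " ".join(name.split()).title()


class TestNormaliseName(unittest.TestCase):
    def test_strips_and_collapses_whitespace(self):
        self.assertEqual(normalise_name("  ada   lovelace "), "Ada Lovelace")

    def test_empty_string_stays_empty(self):
        self.assertEqual(normalise_name(""), "")


def run_tests() -> unittest.TestResult:
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNormaliseName)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Saved in a file such as `test_cleaning.py` (name assumed), the same tests can be run from the command line with `python -m unittest`.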
  6. Benefits of (unit) testing • Safety net for refactoring • Safety net for lib upgrades • Validate your assumptions • Document code / communicate your intentions • You’re forced to think
  7. Testing: not convinced yet?
    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    assert min_bound <= f1 <= max_bound
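The check on slide 7 can be turned into a runnable test. The slide does not show `fscore`, so the definition below is an assumption: the usual F1 score, i.e. the harmonic mean of precision `p` and recall `r`, with 0 returned when both are 0. The small `TOLERANCE` is also an addition, there to absorb floating-point rounding when `p` and `r` are equal or very close.

```python
import itertools
import unittest

TOLERANCE = 1e-12  # assumed slack for floating-point rounding


def fscore(p: float, r: float) -> float:
    """Assumed definition: F1, the harmonic mean of precision and recall."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


class TestFscoreBounds(unittest.TestCase):
    def test_f1_lies_between_precision_and_recall(self):
        grid = [0.0, 0.1, 0.25, 0.5, 0.9, 1.0]
        for p, r in itertools.product(grid, repeat=2):
            with self.subTest(p=p, r=r):
                f1 = fscore(p, r)
                min_bound, max_bound = sorted([p, r])
                self.assertGreaterEqual(f1, min_bound - TOLERANCE)
                self.assertLessEqual(f1, max_bound + TOLERANCE)


def run_tests() -> unittest.TestResult:
    """Run the property check over the whole grid."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFscoreBounds)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A test like this states a property of the result instead of a single expected value, which is one way to validate an assumption about the metric.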
  8. Testing: I’m almost done • Unit tests vs Defensive Programming • Say no to tautologies • Say no to vanity tests • Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …
  9. Intro to Luigi • GNU Make + Unix pipes + Steroids • Workflow manager in Python, by Spotify • Dependency management • Error control, checkpoints, failure recovery • Minimal boilerplate • Dependency graph visualisation • $ pip install luigi
  10. Luigi Task: unit of execution
    class MyTask(luigi.Task):
        def requires(self):
            pass  # list of dependencies
        def output(self):
            pass  # task output
        def run(self):
            pass  # task logic
  11. Luigi Target: output of a task
    class MyTarget(luigi.Target):
        def exists(self):
            pass  # return bool
    Off-the-shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
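To illustrate how the Task and Target hooks from slides 10 and 11 fit together, here is a toy, standard-library-only sketch. It is not Luigi's code or scheduler: `FileTarget`, `Task`, `build`, `Extract` and `Sort` are hypothetical names written for this example. It only mimics the idea that a task is skipped when its output already exists and that dependencies run first.

```python
import tempfile
from pathlib import Path


class FileTarget:
    """Toy target: the task counts as done when the file exists."""
    def __init__(self, path: Path):
        self.path = path

    def exists(self) -> bool:
        return self.path.exists()


class Task:
    """Toy task exposing the three hooks named on the slide."""
    def __init__(self, workdir: Path):
        self.workdir = workdir

    def requires(self) -> list:
        return []

    def output(self) -> FileTarget:
        raise NotImplementedError

    def run(self) -> None:
        raise NotImplementedError


def build(task: Task, log: list) -> None:
    """Skip finished tasks; otherwise run dependencies first, then the task."""
    if task.output().exists():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


class Extract(Task):
    def output(self) -> FileTarget:
        return FileTarget(self.workdir / "raw.txt")

    def run(self) -> None:
        self.output().path.write_text("3\n1\n2\n")


class Sort(Task):
    def requires(self) -> list:
        return [Extract(self.workdir)]

    def output(self) -> FileTarget:
        return FileTarget(self.workdir / "sorted.txt")

    def run(self) -> None:
        lines = self.requires()[0].output().path.read_text().splitlines()
        self.output().path.write_text("\n".join(sorted(lines)) + "\n")


def demo() -> tuple:
    """Build the pipeline twice; the second build should find nothing to do."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        first_run, second_run = [], []
        build(Sort(workdir), first_run)
        build(Sort(workdir), second_run)
        return first_run, second_run, (workdir / "sorted.txt").read_text()
```

The existing output file acts as a checkpoint: after a failure, a re-run only executes the tasks whose outputs are still missing. A real workflow manager adds scheduling, parameters and error reporting on top of this idea.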
  12. Not only Luigi • More Python-based workflow managers: • Airflow by Airbnb • Mrjob by Yelp • Pinball by Pinterest
  13. When things go wrong • import logging • Say no to print() for debugging • Custom log format / extensive info • Different levels of severity • Easy to switch off or change level
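The logging points on slide 13 can be sketched with the standard `logging` module. The logger name `pipeline.clean`, the format string and the helper `make_logger` are assumptions made for this example. A real setup would probably add `%(asctime)s` to the format.

```python
import io
import logging


def make_logger(name: str, stream, level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes to `stream` with a custom format."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False   # keep output out of the root logger
    logger.handlers.clear()    # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger


def demo() -> str:
    """Emit messages at several severities while changing the level."""
    buffer = io.StringIO()
    log = make_logger("pipeline.clean", buffer)
    log.debug("row 17 has 3 empty cells")  # hidden: level is INFO
    log.info("cleaned %d rows", 120)
    log.setLevel(logging.DEBUG)            # change level without touching call sites
    log.debug("now visible")
    log.setLevel(logging.CRITICAL)         # effectively switch most output off
    log.error("suppressed")
    return buffer.getvalue()
```

Unlike `print()` calls, these messages carry a severity and a source name, and their verbosity is controlled in one place.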
  14. Who reads the logs? You’re not going to read the logs, unless… • E-mail notifications (built into Luigi) • Slack notifications • $ pip install luigi_slack # WIP
  15. Summary • R&D is not Engineering: can we meet halfway? • Prototypes vs. Products • Automation and replicability matter • You need a workflow manager • Good engineering principles help: • Testing, logging, packaging, …