Building Data Pipelines in Python - PyData London 2016

Slides for my talk at PyData London 2016 on Building Data Pipelines in Python: http://pydata.org/london2016/schedule/presentation/7/

Abstract:
This talk discusses the process of building data pipelines (e.g. extraction, cleaning, integration and pre-processing of data): in general, all the steps that are necessary to prepare your data for your data-driven product. In particular, the focus is on data plumbing and on the practice of going from prototype to production.

Starting from some common anti-patterns, we'll highlight the need for a workflow manager for any non-trivial project.

We'll discuss the case for Luigi as an interesting option, and we'll consider where it fits in the bigger picture of deploying a data product.

Marco Bonzanini

May 07, 2016

Transcript

  1. 2.

    Nice to meet you • @MarcoBonzanini • “Type B” Data Scientist • PhD in Information Retrieval • Book with PacktPub (July 2016)
  2. 4.
  3. 6.

    Data Pipelines • Data → ETL → Analytics • Many components in a data pipeline: • Extract, Clean, Augment, Join data
  4. 8.

    Towards Good Pipelines • Transform your data, don’t overwrite • Break it down into components • Different packages (e.g. setup.py) • Unit tests vs end-to-end tests • Good = Replicable and Productisable
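One way to read "transform your data, don't overwrite" and "break it down into components" is as small functions that each return a new dataset and leave their input untouched. The sketch below is only an illustration of that idea; the function names `clean` and `augment` and the record layout are invented here and do not come from the talk.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def clean(records: List[Record]) -> List[Record]:
    """Return new records with whitespace stripped from the name field."""
    return [{**r, "name": r["name"].strip()} for r in records]


def augment(records: List[Record]) -> List[Record]:
    """Return new records with a derived field added."""
    return [{**r, "name_length": len(r["name"])} for r in records]


# Each step produces a new list, so every intermediate stage stays inspectable.
raw = [{"name": "  Ada "}, {"name": "Grace"}]
cleaned = clean(raw)
augmented = augment(cleaned)
```

Because no step mutates its input, any stage can be re-run or tested on its own, which is what makes the pipeline replicable.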
  5. 9.

    Anti-Patterns • Bunch of scripts • Single run-everything script • Hacky homemade dependency control • Don’t reinvent the wheel
  6. 11.

    (Unit) Testing • Unit tests in three easy steps: • import unittest • Write your tests • Quit complaining about lack of time to write tests
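The first two of those steps might look like the following minimal sketch. The function under test, `normalise`, is a hypothetical pipeline step made up for this example; only `unittest` itself is taken from the slide.

```python
import unittest


def normalise(text):
    """Hypothetical pipeline step: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


class NormaliseTest(unittest.TestCase):

    def test_lowercases(self):
        self.assertEqual(normalise("PyData London"), "pydata london")

    def test_collapses_whitespace(self):
        self.assertEqual(normalise("  data \t pipelines\n"), "data pipelines")

    def test_empty_string(self):
        self.assertEqual(normalise(""), "")
```

Saved in a module, these tests can be run with `python -m unittest`.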
  7. 12.

    Benefits of (unit) testing • Safety net for refactoring • Safety net for lib upgrades • Validate your assumptions • Document code / communicate your intentions • You’re forced to think
  8. 15.

    Testing: not convinced yet?

        f1 = fscore(p, r)
        min_bound, max_bound = sorted([p, r])
        assert min_bound <= f1 <= max_bound
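The slide does not show `fscore` itself. A self-contained version of the check could look like the sketch below, which assumes `fscore` is the usual F1 measure (the harmonic mean of precision and recall). The helper `check_fscore_bounds` is invented for this example, and it uses exact fractions so that floating-point rounding does not blur the comparison at the boundaries.

```python
from fractions import Fraction


def fscore(precision, recall):
    """Assumed definition: harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0  # convention chosen here for the degenerate case
    return 2 * precision * recall / (precision + recall)


def check_fscore_bounds(steps=10):
    """Assert that F1 lies between precision and recall on a grid of values."""
    for i in range(steps + 1):
        for j in range(steps + 1):
            p, r = Fraction(i, steps), Fraction(j, steps)
            f1 = fscore(p, r)
            min_bound, max_bound = sorted([p, r])
            assert min_bound <= f1 <= max_bound, (p, r, f1)


check_fscore_bounds()
```

A test like this documents an assumption about the metric (it can never exceed the better of the two inputs) and fails loudly if a refactoring breaks it.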
  9. 16.

    Testing: I’m almost done • Unit tests vs Defensive Programming • Say no to tautologies • Say no to vanity tests • Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …
  10. 17.
  11. 18.

    Intro to Luigi • GNU Make + Unix pipes + Steroids • Workflow manager in Python, by Spotify • Dependency management • Error control, checkpoints, failure recovery • Minimal boilerplate • Dependency graph visualisation

        $ pip install luigi
  12. 19.

    Luigi Task: unit of execution

        import luigi

        class MyTask(luigi.Task):

            def requires(self):
                pass  # list of dependencies

            def output(self):
                pass  # task output

            def run(self):
                pass  # task logic
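To show how `requires`, `output` and `run` fit together, here is a toy runner written with the standard library only, so that it works without Luigi installed. It is not Luigi's code or API: `ToyTask`, `build`, `Extract` and `Sort` are names invented for this sketch, and a real workflow manager adds scheduling, failure recovery and visualisation on top of this basic idea.

```python
import os
import tempfile


class ToyTask:
    """Toy stand-in for a workflow task (not Luigi's API)."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []  # upstream tasks, none by default

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError  # task logic


def build(task, log):
    """Build dependencies first; skip any task whose output already exists."""
    for dep in task.requires():
        build(dep, log)
    if os.path.exists(task.output()):
        log.append("skip " + type(task).__name__)
        return
    task.run()
    log.append("run " + type(task).__name__)


class Extract(ToyTask):
    def output(self):
        return os.path.join(self.workdir, "raw.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("3\n1\n2\n")


class Sort(ToyTask):
    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "sorted.txt")

    def run(self):
        with open(self.requires()[0].output()) as f:
            lines = sorted(f.read().split())
        with open(self.output(), "w") as f:
            f.write("\n".join(lines) + "\n")


workdir = tempfile.mkdtemp()
log = []
build(Sort(workdir), log)  # first call: both tasks run
build(Sort(workdir), log)  # second call: outputs exist, nothing is recomputed
```

The output files act as checkpoints: after a failure, a re-run picks up from the last task whose output is missing instead of starting from scratch.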
  13. 20.

    Luigi Target: output of a task

        import luigi

        class MyTarget(luigi.Target):

            def exists(self):
                pass  # return bool

    Off the shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
  14. 21.

    Not only Luigi • More Python-based workflow managers: • Airflow by Airbnb • Mrjob by Yelp • Pinball by Pinterest
  15. 22.

    When things go wrong • import logging • Say no to print() for debugging • Custom log format / extensive info • Different levels of severity • Easy to switch off or change level
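The points on this slide map onto a few lines of the standard `logging` module. In the sketch below the logger name `pipeline.clean`, the log format and the messages are made up for illustration, and the output goes to an in-memory stream only to keep the example self-contained.

```python
import io
import logging

stream = io.StringIO()  # stand-in for a log file or stderr
handler = logging.StreamHandler(stream)
# Custom format: severity, logger name, message.
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("pipeline.clean")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False  # keep the example independent of root logger config
logger.setLevel(logging.INFO)

logger.debug("row-level detail")  # below INFO, so not emitted
logger.info("cleaned %d records", 42)
logger.warning("dropped %d malformed records", 3)

# Changing the level silences the chatter without touching the call sites.
logger.setLevel(logging.ERROR)
logger.info("not emitted either")
```

In a real pipeline the format string would typically also include `%(asctime)s`, and the handler would write to a file or to stderr.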
  16. 23.

    Who reads the logs? You’re not going to read the logs, unless… • E-mail notifications • built-in in Luigi • Slack notifications

        $ pip install luigi_slack # WIP
  17. 24.

    Summary • R&D is not Engineering: can we meet halfway? • Prototypes vs. Products • Automation and replicability matter • You need a workflow manager • Good engineering principles help: • Testing, logging, packaging, …