Building Data Pipelines in Python - PyData London 2016

Slides for my talk at PyData London 2016, "Building Data Pipelines in Python": http://pydata.org/london2016/schedule/presentation/7/

Abstract:
This talk discusses the process of building data pipelines: extraction, cleaning, integration and pre-processing of data, and more generally all the steps needed to prepare your data for a data-driven product. The focus is on data plumbing and on the practice of going from prototype to production.

Starting from some common anti-patterns, we'll highlight the need for a workflow manager for any non-trivial project.

We'll make the case for Luigi as an option worth considering, and look at where it fits in the bigger picture of deploying a data product.

Marco Bonzanini

May 07, 2016

Transcript

  1. Building Data Pipelines in Python
    Marco Bonzanini
    PyData London 2016

  2. Nice to meet you
    • @MarcoBonzanini
    • “Type B” Data Scientist
    • PhD in Information Retrieval
    • Book with PacktPub (July 2016)

  3. R&D ≠ Engineering
    R&D results in production = high value

  4. [Slide with no transcribed text]

  5. Big Data Problems
    vs
    Big Data Problems

  6. Data Pipelines
    Data → ETL → Analytics
    • Many components in a data pipeline:
      • Extract, Clean, Augment, Join data

  7. Good Data Pipelines
    Easy to reproduce
    Easy to productise

  8. Towards Good Pipelines
    • Transform your data, don’t overwrite
    • Break it down into components
    • Different packages (e.g. setup.py)
    • Unit tests vs end-to-end tests
    Good = Replicable and Productisable
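
The first two points on this slide can be sketched with the standard library alone. This is a minimal illustration, not code from the talk: the step names `clean` and `augment`, the file names and the CSV columns are all hypothetical. Each step is a small function that reads one file and writes a new one, so the raw input is never overwritten and any step can be re-run or tested on its own.

```python
import csv
import tempfile
from pathlib import Path


def clean(raw_path: Path, clean_path: Path) -> Path:
    """Copy rows with a non-empty 'name' into a new file (hypothetical step)."""
    with raw_path.open(newline="") as src, clean_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["name"].strip():
                writer.writerow(row)
    return clean_path


def augment(clean_path: Path, augmented_path: Path) -> Path:
    """Add a derived 'name_length' column; the cleaned file is left untouched."""
    with clean_path.open(newline="") as src, augmented_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["name_length"])
        writer.writeheader()
        for row in reader:
            row["name_length"] = len(row["name"])
            writer.writerow(row)
    return augmented_path


# Illustrative data in a temporary directory (assumed layout, not from the talk).
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.csv"
raw.write_text("id,name\n1,Ada\n2,\n3,Grace\n")
raw_before = raw.read_text()

augmented = augment(clean(raw, workdir / "clean.csv"), workdir / "augmented.csv")
```

Because every intermediate file is kept, a failure in `augment` does not force `clean` to run again, which is the property a workflow manager later automates.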


  9. Anti-Patterns
    • Bunch of scripts
    • Single run-everything script
    • Hacky homemade dependency control
    • Don’t reinvent the wheel

  10. Intermezzo
    Let me rant about testing
    Icon by Freepik from flaticon.com

  11. (Unit) Testing
    • Unit tests in three easy steps:
      • import unittest
      • Write your tests
      • Quit complaining about lack of time to write tests
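
As a rough idea of how little the first two steps involve, here is a minimal sketch using only `unittest`. The function under test, `normalise_whitespace`, is a made-up example and does not come from the talk.

```python
import unittest


def normalise_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())


class NormaliseWhitespaceTest(unittest.TestCase):

    def test_collapses_inner_runs(self):
        self.assertEqual(normalise_whitespace("a  b\t\nc"), "a b c")

    def test_strips_ends(self):
        self.assertEqual(normalise_whitespace("  hello  "), "hello")

    def test_empty_string(self):
        self.assertEqual(normalise_whitespace(""), "")


# Run the tests programmatically so the snippet works when pasted anywhere.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormaliseWhitespaceTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the test class would normally live in its own module and be run with `python -m unittest` or one of the runners mentioned later in the talk.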


  12. Benefits of (unit) testing
    • Safety net for refactoring
    • Safety net for lib upgrades
    • Validate your assumptions
    • Document code / communicate your intentions
    • You’re forced to think

  13. Testing: not convinced yet?

  14. Testing: not convinced yet?

  15. Testing: not convinced yet?

    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    assert min_bound <= f1 <= max_bound
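
The slide does not define `fscore`, so the sketch below assumes it is the usual F1 score, the harmonic mean of precision `p` and recall `r`. Under that assumption the assertion states a property that must hold for every input: F1 always lies between the smaller and the larger of the two values. The helper `check_f1_bounds`, the grid of inputs and the small floating-point tolerance are additions for illustration.

```python
import itertools


def fscore(p, r):
    """F1 score: harmonic mean of precision and recall (assumed definition)."""
    if p + r == 0:
        return 0.0  # convention chosen here to avoid division by zero
    return 2 * p * r / (p + r)


def check_f1_bounds(p, r, tol=1e-12):
    """Assert the property from the slide, allowing for rounding error."""
    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    assert min_bound - tol <= f1 <= max_bound + tol, (p, r, f1)
    return f1


# Check the property over a grid of precision/recall values in [0, 1].
grid = [i / 10 for i in range(11)]
for p, r in itertools.product(grid, repeat=2):
    check_f1_bounds(p, r)
```

A check like this catches a whole class of mistakes, such as accidentally computing an arithmetic mean or swapping a numerator, without needing hand-computed expected values.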


  16. Testing: I’m almost done
    • Unit tests vs Defensive Programming
    • Say no to tautologies
    • Say no to vanity tests
    • Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …


  17. [Slide with no transcribed text]

  18. Intro to Luigi
    GNU Make + Unix pipes + Steroids
    • Workflow manager in Python, by Spotify
    • Dependency management
    • Error control, checkpoints, failure recovery
    • Minimal boilerplate
    • Dependency graph visualisation
    $ pip install luigi

  19. Luigi Task: unit of execution
    import luigi

    class MyTask(luigi.Task):

        def requires(self):
            pass  # list of dependencies

        def output(self):
            pass  # task output

        def run(self):
            pass  # task logic

  20. Luigi Target: output of a task
    import luigi

    class MyTarget(luigi.Target):

        def exists(self):
            pass  # return bool

    Off-the-shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
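
To show why the task/target split gives dependency management and checkpoints, here is a toy, standard-library-only sketch. It is not Luigi and not how Luigi is implemented: `ToyFileTarget`, `ToyTask` and `run_with_dependencies` are hypothetical names that only mimic the `requires` / `output` / `run` / `exists` shape from the slides.

```python
import tempfile
from pathlib import Path


class ToyFileTarget:
    """Toy stand-in for a target: it only knows whether its file exists."""

    def __init__(self, path):
        self.path = Path(path)

    def exists(self):
        return self.path.exists()


class ToyTask:
    """Toy stand-in for a task with the requires/output/run shape."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


def run_with_dependencies(task, log):
    """Run dependencies first; skip any task whose output already exists."""
    if task.output().exists():
        return  # checkpoint: the work was already done
    for dependency in task.requires():
        run_with_dependencies(dependency, log)
    task.run()
    log.append(type(task).__name__)


WORKDIR = Path(tempfile.mkdtemp())


class Extract(ToyTask):
    def output(self):
        return ToyFileTarget(WORKDIR / "raw.txt")

    def run(self):
        self.output().path.write_text("3\n1\n2\n")


class SortNumbers(ToyTask):
    def requires(self):
        return [Extract()]

    def output(self):
        return ToyFileTarget(WORKDIR / "sorted.txt")

    def run(self):
        source = self.requires()[0].output().path
        numbers = sorted(int(line) for line in source.read_text().split())
        self.output().path.write_text("\n".join(map(str, numbers)) + "\n")


first_run, second_run = [], []
run_with_dependencies(SortNumbers(), first_run)   # runs Extract, then SortNumbers
run_with_dependencies(SortNumbers(), second_run)  # nothing to do: outputs exist
```

The second call does no work because each output acts as a checkpoint. A real workflow manager adds scheduling, failure handling and visualisation on top of this idea, which is the reason the talk advises against a homemade version.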


  21. Not only Luigi
    • More Python-based workflow managers:
      • Airflow by Airbnb
      • Mrjob by Yelp
      • Pinball by Pinterest

  22. When things go wrong
    • import logging
    • Say no to print() for debugging
    • Custom log format / extensive info
    • Different levels of severity
    • Easy to switch off or change level
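
A minimal sketch of these points with the standard `logging` module. The logger name `pipeline.clean`, the format string and the `clean_rows` function are illustrative choices, and an in-memory stream stands in for a log file or stderr.

```python
import io
import logging

stream = io.StringIO()  # stands in for a log file or stderr
handler = logging.StreamHandler(stream)
# Custom format: timestamp, logger name, severity, message.
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)

logger = logging.getLogger("pipeline.clean")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep this example's output out of the root logger


def clean_rows(rows):
    """Drop blank rows, logging at different levels of severity."""
    kept = [row for row in rows if row.strip()]
    logger.debug("input rows: %r", rows)  # hidden unless level is DEBUG
    logger.info("kept %d of %d rows", len(kept), len(rows))
    if len(kept) < len(rows):
        logger.warning("dropped %d empty rows", len(rows) - len(kept))
    return kept


cleaned = clean_rows(["a", "", "b"])

# Changing verbosity is one line; no print() calls to hunt down and delete.
logger.setLevel(logging.WARNING)
clean_rows(["c", " "])
```

With `print()` the debugging output would have to be removed by hand before going to production; here the same statements stay in place and the level decides what is emitted.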


  23. Who reads the logs?
    You’re not going to read the logs, unless…
    • E-mail notifications
      • built into Luigi
    • Slack notifications
      $ pip install luigi_slack # WIP

  24. Summary
    • R&D is not Engineering: can we meet halfway?
    • Prototypes vs. Products
    • Automation and replicability matter
    • You need a workflow manager
    • Good engineering principles help:
      • Testing, logging, packaging, …

  25. Vanity Slide
    • speakerdeck.com/marcobonzanini
    • github.com/bonzanini
    • marcobonzanini.com
    • @MarcoBonzanini
