Towards Good Data Pipelines (d)
Break it Down
setup.py and conda
Slide 20
Towards Good Data Pipelines (e)
Automated Testing
i.e. why scientists don’t write unit tests
Slide 21
Intermezzo
Let me rant about testing
Icon by Freepik from flaticon.com
Slide 22
(Unit) Testing
Unit tests in three easy steps:
• import unittest
• Write your tests
• Quit complaining about lack of time to write tests
Slide 23
Benefits of (unit) testing
• Safety net for refactoring
• Safety net for lib upgrades
• Validate your assumptions
• Document code / communicate your intentions
• You’re forced to think
Slide 26
Testing: not convinced yet?
f1 = fscore(p, r)
min_bound, max_bound = sorted([p, r])
assert min_bound <= f1 <= max_bound
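The check above tests a property rather than a single value: the F-score is the harmonic mean of precision and recall, so it must lie between the two. A minimal sketch of the same idea as a unittest case follows. The slide does not show `fscore`, so the implementation here is an assumption (standard F1), and the test values are hand-picked for illustration.

```python
import unittest


def fscore(p, r):
    """Assumed implementation: F1 as the harmonic mean of precision and recall."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


class FScoreTest(unittest.TestCase):
    def test_f1_lies_between_precision_and_recall(self):
        # Hand-picked (precision, recall) pairs, including edge cases.
        cases = [(0.5, 0.5), (0.9, 0.1), (1.0, 0.25), (0.0, 0.0), (0.0, 1.0)]
        for p, r in cases:
            with self.subTest(p=p, r=r):
                f1 = fscore(p, r)
                min_bound, max_bound = sorted([p, r])
                self.assertLessEqual(min_bound, f1)
                self.assertLessEqual(f1, max_bound)
```

Saved in a test module, this can be run with `python -m unittest`.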
Slide 27
Testing: I’m almost done
• Unit tests vs Defensive Programming
• Say no to tautologies
• Say no to vanity tests
• The Python ecosystem is rich:
py.test, nosetests, hypothesis, coverage.py, …
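One possible reading of "say no to tautologies", sketched with the standard library only: a test that compares the code with itself can never fail, while a test with an independently stated expected value can. The `normalise` helper is hypothetical and exists only for this illustration.

```python
import unittest


def normalise(text):
    """Hypothetical pipeline helper: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


class NormaliseTest(unittest.TestCase):
    def test_tautology(self):
        # Tautology: compares the function with itself, so it always passes
        # and says nothing about correctness.
        self.assertEqual(normalise("Hello  World"), normalise("Hello  World"))

    def test_expected_value(self):
        # Meaningful: the expected value is written down independently of the code.
        self.assertEqual(normalise("  Hello \t World\n"), "hello world")
```

Both tests pass today, but only the second one would catch a regression in `normalise`.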
Slide 29
Towards Good Data Pipelines (f)
Orchestration
Don’t re-invent the wheel
Slide 30
You need a workflow manager
Think:
GNU Make + Unix pipes + Steroids
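To make the "Make + pipes" analogy concrete, here is a toy sketch of what a workflow manager does at its core: resolve dependencies, run each task once, and skip tasks whose output already exists. It uses only the standard library and is not how any particular tool is implemented. `Task`, `build`, `Extract` and `Sort` are hypothetical names, and unlike Make the sketch checks only for file existence, not timestamps.

```python
import os
import tempfile


class Task:
    """Hypothetical base class: a unit of work with dependencies and one output file."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run `task` after its dependencies, skipping anything whose output exists."""
    if os.path.exists(task.output()):
        return  # target already there, nothing to do
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


class Extract(Task):
    def output(self):
        return os.path.join(self.workdir, "raw.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("3\n1\n2\n")


class Sort(Task):
    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "sorted.txt")

    def run(self):
        # Read the upstream task's output, write our own.
        with open(self.requires()[0].output()) as f:
            lines = sorted(f)
        with open(self.output(), "w") as f:
            f.writelines(lines)


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        first_run, second_run = [], []
        build(Sort(workdir), first_run)   # runs Extract, then Sort
        build(Sort(workdir), second_run)  # outputs exist: nothing is re-run
        with open(Sort(workdir).output()) as f:
            return first_run, second_run, f.read()
```

Calling `demo()` shows the checkpointing behaviour: the second `build` finds the output file and does no work. A real workflow manager adds scheduling, failure recovery and visualisation on top of this idea.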
Slide 31
Intro to Luigi
• Task dependency management
• Error control, checkpoints, failure recovery
• Minimal boilerplate
• Dependency graph visualisation
$ pip install luigi
Slide 32
Luigi Task: unit of execution
class MyTask(luigi.Task):
    def requires(self):
        return [SomeTask()]  # tasks that must complete first

    def output(self):
        return luigi.LocalTarget(…)  # where the result is written

    def run(self):
        mylib.run()  # the actual work
Slide 33
Luigi Target: output of a task
class MyTarget(luigi.Target):
    def exists(self):
        ...  # return bool
Great off the shelf support
local file system, S3, Elasticsearch, RDBMS
(also via luigi.contrib)
Slide 35
Intro to Airflow
• Like Luigi, just younger
• Nicer (?) GUI
• Scheduling
• Apache Project
Slide 36
Towards Good Data Pipelines (g)
When things go wrong
The Joy of debugging
Slide 37
import logging
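A minimal sketch of what `import logging` might look like inside a pipeline step, as opposed to scattering `print` calls. The logger name and the `clean` step are hypothetical. The point is that the step only emits records, while the entry point decides level and format.

```python
import logging

logger = logging.getLogger("pipeline.clean")  # hypothetical logger name


def clean(records):
    """Hypothetical pipeline step: drop empty records and log what happened."""
    logger.info("cleaning %d records", len(records))
    kept = [r for r in records if r.strip()]
    dropped = len(records) - len(kept)
    if dropped:
        logger.warning("dropped %d empty records", dropped)
    return kept


if __name__ == "__main__":
    # Configure logging once, at the entry point, not inside library code.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    clean(["a", " ", "b"])
```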
Slide 38
Who reads the logs?
You’re not going to read the logs, unless…
• E-mail notifications (built into Luigi)
• Slack notifications
$ pip install luigi_slack # WIP
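Independently of any particular tool, the "logs nobody reads" problem can be approached with a custom `logging.Handler` that pushes serious records to wherever people actually look. The sketch below is an illustration using only the standard library: `NotifyHandler` and `make_logger` are hypothetical names, and the `send` callable is a stand-in for a real e-mail or chat client.

```python
import logging


class NotifyHandler(logging.Handler):
    """Hypothetical handler: forwards formatted records to a `send` callable."""

    def __init__(self, send, level=logging.ERROR):
        super().__init__(level=level)
        self.send = send

    def emit(self, record):
        try:
            self.send(self.format(record))
        except Exception:
            # A failed notification should not crash the pipeline.
            self.handleError(record)


def make_logger(name, send):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(NotifyHandler(send))
    return logger


sent = []  # stand-in for an e-mail or chat client
log = make_logger("pipeline.demo", sent.append)
log.info("step finished")  # below ERROR: not forwarded
log.error("step failed: %s", "no input file")  # forwarded to `send`
```

Only the ERROR record reaches `sent`. Routine INFO messages stay in the ordinary log.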
Slide 39
Towards Good Data Pipelines (h)
Static Analysis
The Joy of Duck Typing
Slide 40
If it looks like a duck,
swims like a duck,
and quacks like a duck,
then it probably is a duck.
— somebody on the Web
Slide 41
>>> 1.0 == 1 == True
True
>>> 1 + True
2
Slide 42
>>> '1' * 2
'11'
>>> '1' + 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object to str implicitly
Slide 43
def do_stuff(a: int,
             b: int) -> str:
    ...
    return something
PEP 3107 — Function Annotations
(since Python 3.0)
(annotations are ignored by the interpreter)
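A small sketch of what "ignored by the interpreter" means in practice: annotations are stored as metadata on the function, and nothing checks them at call time. The function body here is made up for the illustration.

```python
def do_stuff(a: int, b: int) -> str:
    # The annotations say "int, int -> str", but nothing enforces them.
    return a * b


# Annotations are kept as plain metadata on the function object...
annotations = do_stuff.__annotations__

# ...and calls that contradict them still run.
result_ok = do_stuff(2, 3)      # returns an int, despite "-> str"
result_odd = do_stuff("ab", 2)  # a str argument, despite "a: int"
```

Neither call raises. Catching the mismatch is left to external tools that read the annotations.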
Slide 44
typing module: semantically coherent
PEP 484 — Type Hints
(since Python 3.5)
(still ignored by the interpreter)
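A short sketch of type hints written with the `typing` module. `count_tokens` and `most_common` are hypothetical helpers. The hints describe container contents (`List[str]`, `Dict[str, int]`) and optional results, which plain annotations like `list` or `dict` could not express at the time.

```python
from typing import Dict, List, Optional


def count_tokens(docs: List[str]) -> Dict[str, int]:
    """Hypothetical helper: count whitespace-separated tokens across documents."""
    counts = {}
    for doc in docs:
        for token in doc.split():
            counts[token] = counts.get(token, 0) + 1
    return counts


def most_common(counts: Dict[str, int]) -> Optional[str]:
    """Return the most frequent token, or None for an empty mapping."""
    if not counts:
        return None
    return max(counts, key=counts.get)


counts = count_tokens(["a b a", "b a"])
top = most_common(counts)
```

A static checker such as mypy can report a call like `count_tokens(42)` before the code runs. The interpreter itself only fails once the loop tries to iterate over the integer.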