Building Data Pipelines in Python
Marco Bonzanini
April 16, 2016
Slides of my talk at PyCon7 in Florence (April 2016)
Transcript
Building Data Pipelines in Python
Marco Bonzanini
PyCon Italia - Florence 2016

Nice to meet you
• @MarcoBonzanini
• “Type B” Data Scientist
• PhD in Information Retrieval
• Book with PacktPub (July 2016)
• Usually at PyData London

R&D ≠ Engineering
R&D results in production = high value

Big Data Problems vs Big Data Problems

Data Pipelines
Data, ETL, Analytics
• Many components in a data pipeline:
  • Extract, Clean, Augment, Join data

Good Data Pipelines
• Easy to reproduce
• Easy to productise

Towards Good Pipelines
• Transform your data, don’t overwrite
• Break it down into components
• Different packages (e.g. setup.py)
• Unit tests vs end-to-end tests
Good = Replicable and Productisable

Anti-Patterns
• Bunch of scripts
• Single run-everything script
• Hacky homemade dependency control
• Don’t reinvent the wheel

Intermezzo
Let me rant about testing
(Icon by Freepik from flaticon.com)

(Unit) Testing
• Unit tests in three easy steps:
  • import unittest
  • Write your tests
  • Quit complaining about lack of time to write tests

Benefits of (unit) testing
• Safety net for refactoring
• Safety net for lib upgrades
• Validate your assumptions
• Document code / communicate your intentions
• You’re forced to think

Testing: not convinced yet?
    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    assert min_bound <= f1 <= max_bound

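The assertion on this slide can be turned into a small unittest case. The sketch below is only an illustration: fscore is not defined in the slides, so it is assumed here to be the usual F1 score (harmonic mean of precision and recall), and the test class name, the sample values and the tolerance are made up for the example.

```python
import itertools
import unittest

# Tolerance for floating-point rounding; an assumption, not from the slides.
EPS = 1e-9


def fscore(p, r):
    """Assumed definition: F1 as the harmonic mean of precision and recall."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


class FscoreBoundsTest(unittest.TestCase):
    """Checks the property from the slide: min(p, r) <= F1 <= max(p, r)."""

    def test_f1_lies_between_precision_and_recall(self):
        values = [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]
        for p, r in itertools.product(values, repeat=2):
            with self.subTest(p=p, r=r):
                f1 = fscore(p, r)
                min_bound, max_bound = sorted([p, r])
                self.assertLessEqual(min_bound - EPS, f1)
                self.assertLessEqual(f1, max_bound + EPS)
```

Saved in a test module, this runs with python -m unittest. The test validates an assumption about the metric (a property that must always hold) rather than restating the implementation.
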
Testing: I’m almost done
• Unit tests vs Defensive Programming
• Say no to tautologies
• Say no to vanity tests
• Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …

</rant>
Intro to Luigi
GNU Make + Unix pipes + Steroids
• Workflow manager in Python, by Spotify
• Dependency management
• Error control, checkpoints, failure recovery
• Minimal boilerplate
• Dependency graph visualisation
$ pip install luigi

Luigi Task: unit of execution
    import luigi

    class MyTask(luigi.Task):

        def requires(self):
            pass  # list of dependencies

        def output(self):
            pass  # task output

        def run(self):
            pass  # task logic

Luigi Target: output of a task
    import luigi

    class MyTarget(luigi.Target):

        def exists(self):
            pass  # return bool

Off the shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)

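To see how the requires / output / run methods of a task and the exists method of a target fit together, here is a toy sketch in plain Python. It does not use Luigi and is not how Luigi is implemented: the classes FileTarget, Task, Extract and Clean and the function build are all hypothetical names invented for this illustration. The idea it shows is the checkpoint behaviour: a task is skipped when its output already exists, otherwise its dependencies are built first.

```python
import os
import tempfile


class FileTarget:
    """Hypothetical file-based target: complete when the file exists."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class Task:
    """Toy base class mirroring the requires/output/run contract."""

    def requires(self):
        return []  # list of dependencies

    def output(self):
        raise NotImplementedError  # task output

    def run(self):
        raise NotImplementedError  # task logic


def build(task, log):
    """Build dependencies first; skip any task whose output exists."""
    if task.output().exists():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


class Extract(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def output(self):
        return FileTarget(os.path.join(self.workdir, "raw.txt"))

    def run(self):
        with open(self.output().path, "w") as fout:
            fout.write(" Hello \n world \n")


class Clean(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        # Writes a new file instead of overwriting the raw data.
        return FileTarget(os.path.join(self.workdir, "clean.txt"))

    def run(self):
        with open(self.requires()[0].output().path) as fin:
            lines = [line.strip().lower() for line in fin]
        with open(self.output().path, "w") as fout:
            fout.write("\n".join(lines) + "\n")


workdir = tempfile.mkdtemp()
first_run, second_run = [], []
build(Clean(workdir), first_run)   # runs Extract, then Clean
build(Clean(workdir), second_run)  # nothing to do: outputs already exist
```

After the first call both files exist, so the second call runs nothing. A real workflow manager adds scheduling, error handling and visualisation on top of this basic idea.
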
Not only Luigi
• More Python-based workflow managers:
  • Airflow by Airbnb
  • Mrjob by Yelp
  • Pinball by Pinterest

When things go wrong
• import logging
• Say no to print() for debugging
• Custom log format / extensive info
• Different levels of severity
• Easy to switch off or change level

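A minimal sketch of these points with the standard logging module. The logger name, the log format and the messages are assumptions chosen for the example, and the handler writes to an in-memory buffer only so that the output is easy to inspect; a real pipeline would more likely log to a file or to stderr.

```python
import io
import logging

# Hypothetical logger name for one pipeline component.
logger = logging.getLogger("pipeline.clean")
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the example independent of the root logger

# In-memory stream for illustration; use a file or sys.stderr in practice.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
# Custom format; fields such as %(asctime)s can be added for more detail.
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)

# Different levels of severity.
logger.debug("read %d rows", 120)
logger.warning("dropped %d rows with missing values", 3)

# Changing the level silences debug output without touching the call sites.
logger.setLevel(logging.WARNING)
logger.debug("this message is filtered out")
logger.error("could not write output")

log_lines = stream.getvalue().splitlines()
```

Unlike print() calls, the debug messages stay in the code and are switched off with a single setLevel call.
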
Who reads the logs?
You’re not going to read the logs, unless…
• E-mail notifications
  • built-in in Luigi
• Slack notifications
  $ pip install luigi_slack # WIP

Summary
• R&D is not Engineering: can we meet halfway?
• Prototypes vs. Products
• Automation and replicability matter
• You need a workflow manager
• Good engineering principles help:
  • Testing, logging, packaging, …

Vanity Slide
• speakerdeck.com/marcobonzanini
• github.com/bonzanini
• marcobonzanini.com
• @MarcoBonzanini