Building Data Pipelines in Python
Marco Bonzanini
April 16, 2016
Slides of my talk at PyCon7 in Florence (April 2016)
Transcript
Building Data Pipelines in Python
Marco Bonzanini
PyCon Italia - Florence 2016
Nice to meet you
• @MarcoBonzanini
• “Type B” Data Scientist
• PhD in Information Retrieval
• Book with PacktPub (July 2016)
• Usually at PyData London
R&D ≠ Engineering
R&D results in production = high value
Big Data Problems vs Big Data Problems
Data Pipelines
(diagram: Data, ETL, Analytics)
• Many components in a data pipeline:
  • Extract, Clean, Augment, Join data
Good Data Pipelines
• Easy to reproduce
• Easy to productise
Towards Good Pipelines
• Transform your data, don’t overwrite
• Break it down into components
• Different packages (e.g. setup.py)
• Unit tests vs end-to-end tests
Good = Replicable and Productisable
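One way to read "Transform your data, don't overwrite" is that each step writes a new file and leaves its input untouched, so any step can be re-run from the previous output. A minimal sketch, assuming a line-oriented text file; the function name `clean_step` and the `.clean` naming scheme are illustrative and not from the talk:

```python
from pathlib import Path


def clean_step(raw_path, out_dir):
    """Hypothetical pipeline step: read raw_path, write a cleaned copy to out_dir.

    The input file is only read, never modified.
    """
    raw_path = Path(raw_path)
    # e.g. raw.txt -> <out_dir>/raw.clean.txt (naming scheme is an assumption)
    out_path = Path(out_dir) / f"{raw_path.stem}.clean{raw_path.suffix}"

    # Drop empty lines, trim whitespace, lowercase the rest.
    lines = [
        line.strip().lower()
        for line in raw_path.read_text().splitlines()
        if line.strip()
    ]

    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("\n".join(lines) + "\n")
    return out_path
```

Because the raw file survives, a later fix to the cleaning logic only requires re-running this step, not re-extracting the data.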
Anti-Patterns
• Bunch of scripts
• Single run-everything script
• Hacky homemade dependency control
• Don’t reinvent the wheel
Intermezzo
Let me rant about testing
(Icon by Freepik from flaticon.com)
(Unit) Testing
• Unit tests in three easy steps:
  • import unittest
  • Write your tests
  • Quit complaining about lack of time to write tests
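The first two steps can be sketched as follows. The function under test, `clean_text`, is a made-up pipeline step used only for illustration; it does not appear in the talk:

```python
import unittest


def clean_text(text):
    """Hypothetical pipeline step: collapse whitespace and lowercase."""
    return " ".join(text.split()).lower()


class CleanTextTest(unittest.TestCase):

    def test_collapses_whitespace_and_lowercases(self):
        self.assertEqual(clean_text("  Hello   World "), "hello world")

    def test_empty_string_stays_empty(self):
        self.assertEqual(clean_text(""), "")
```

Saved as, say, `test_clean.py`, this should run with `python -m unittest test_clean`.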
Benefits of (unit) testing
• Safety net for refactoring
• Safety net for lib upgrades
• Validate your assumptions
• Document code / communicate your intentions
• You’re forced to think
Testing: not convinced yet?
    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    assert min_bound <= f1 <= max_bound
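The slide does not show `fscore` itself. A runnable sketch of the same check, assuming `fscore` is the usual F1 score (the harmonic mean of precision `p` and recall `r`); the small tolerance `eps` is an addition here, since floating-point rounding can push the result a hair past the bound when `p == r`:

```python
def fscore(p, r):
    """F1 score: harmonic mean of precision p and recall r (assumed definition)."""
    if p + r == 0:
        return 0.0  # convention chosen here to avoid division by zero
    return 2 * p * r / (p + r)


def check_bounds(p, r, eps=1e-12):
    """The slide's assertion: F1 lies between the smaller and larger input."""
    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    # eps is not on the slide; it absorbs floating-point rounding error
    assert min_bound - eps <= f1 <= max_bound + eps


# Check the property over a grid of precision/recall values in [0, 1].
for i in range(11):
    for j in range(11):
        check_bounds(i / 10, j / 10)
```

A check like this tests a property of the result rather than a single hard-coded value, which is the kind of test that libraries such as hypothesis generalise.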
Testing: I’m almost done
• Unit tests vs Defensive Programming
• Say no to tautologies
• Say no to vanity tests
• Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …
</rant>
Intro to Luigi
GNU Make + Unix pipes + Steroids
• Workflow manager in Python, by Spotify
• Dependency management
• Error control, checkpoints, failure recovery
• Minimal boilerplate
• Dependency graph visualisation
    $ pip install luigi
Luigi Task: unit of execution
    import luigi

    class MyTask(luigi.Task):

        def requires(self):
            pass  # list of dependencies

        def output(self):
            pass  # task output

        def run(self):
            pass  # task logic
Luigi Target: output of a task
    import luigi

    class MyTarget(luigi.Target):

        def exists(self):
            pass  # return bool
Off the shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
Not only Luigi
• More Python-based workflow managers:
  • Airflow by Airbnb
  • Mrjob by Yelp
  • Pinball by Pinterest
When things go wrong
• import logging
• Say no to print() for debugging
• Custom log format / extensive info
• Different levels of severity
• Easy to switch off or change level
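The points above can be sketched with the standard library alone. The helper `get_logger`, the format string and the `clean_rows` step are illustrative choices, not taken from the talk:

```python
import logging

# Custom format: timestamp, severity, logger name and source line.
LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s:%(lineno)d %(message)s"


def get_logger(name, level=logging.INFO):
    """Return a logger with one stream handler and the custom format."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    return logger


def clean_rows(rows, logger):
    """Hypothetical pipeline step that logs instead of calling print()."""
    kept = []
    for i, row in enumerate(rows):
        if not row.strip():
            logger.debug("skipping empty row %d", i)  # detail, off by default
            continue
        kept.append(row.strip())
    logger.info("kept %d of %d rows", len(kept), len(rows))
    return kept
```

Switching between quiet and verbose runs is then a one-argument change, for example `get_logger("pipeline", logging.DEBUG)` while investigating a problem and `logging.WARNING` in production.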
Who reads the logs?
You’re not going to read the logs, unless…
• E-mail notifications
  • built-in in Luigi
• Slack notifications
    $ pip install luigi_slack  # WIP
Summary
• R&D is not Engineering: can we meet halfway?
• Prototypes vs. Products
• Automation and replicability matter
• You need a workflow manager
• Good engineering principles help:
  • Testing, logging, packaging, …
Vanity Slide
• speakerdeck.com/marcobonzanini
• github.com/bonzanini
• marcobonzanini.com
• @MarcoBonzanini