Building Data Pipelines in Python
Marco Bonzanini
April 16, 2016
Slides of my talk at PyCon7 in Florence (April 2016)
Transcript

Building Data Pipelines in Python
Marco Bonzanini
PyCon Italia - Florence 2016

Nice to meet you
• @MarcoBonzanini
• “Type B” Data Scientist
• PhD in Information Retrieval
• Book with PacktPub (July 2016)
• Usually at PyData London

R&D ≠ Engineering
R&D results in production = high value

Big Data Problems vs Big Data Problems

Data Pipelines
Data → ETL → Analytics
• Many components in a data pipeline:
• Extract, Clean, Augment, Join data

Good Data Pipelines
• Easy to reproduce
• Easy to productise

Towards Good Pipelines
• Transform your data, don’t overwrite
• Break it down into components
• Different packages (e.g. setup.py)
• Unit tests vs end-to-end tests
Good = Replicable and Productisable
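
The "transform, don't overwrite" point can be sketched as a step that always writes to a new file and leaves its input untouched. This is only an illustration: `normalise_lines` and the file names are hypothetical, not taken from the talk.

```python
from pathlib import Path


def normalise_lines(src, dst):
    """Hypothetical step: read src, write cleaned lines to a separate dst."""
    src, dst = Path(src), Path(dst)
    if src.resolve() == dst.resolve():
        # Guard against in-place edits: the raw data must stay reproducible.
        raise ValueError("refusing to overwrite the input file")
    cleaned = [
        line.strip().lower()
        for line in src.read_text().splitlines()
        if line.strip()  # drop empty lines
    ]
    dst.write_text("\n".join(cleaned) + "\n")
    return dst
```

Because the raw file is never modified, any later step can be re-run from it, which is what makes the pipeline replicable.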

Anti-Patterns
• Bunch of scripts
• Single run-everything script
• Hacky homemade dependency control
• Don’t reinvent the wheel

Intermezzo
Let me rant about testing
(Icon by Freepik from flaticon.com)

(Unit) Testing
• Unit tests in three easy steps:
• import unittest
• Write your tests
• Quit complaining about lack of time to write tests
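
A minimal sketch of the first two steps, assuming a small cleaning function as the unit under test. `clean_record` is a hypothetical pipeline step made up for this example.

```python
import unittest


def clean_record(record):
    """Hypothetical cleaning step: normalise keys, strip whitespace in values."""
    return {key.strip().lower(): value.strip() for key, value in record.items()}


class CleanRecordTest(unittest.TestCase):

    def test_keys_are_normalised(self):
        result = clean_record({" Name ": "Ada"})
        self.assertEqual(result, {"name": "Ada"})

    def test_values_are_stripped(self):
        result = clean_record({"city": "  Florence "})
        self.assertEqual(result["city"], "Florence")

    def test_input_is_not_modified(self):
        record = {" Name ": " Ada "}
        clean_record(record)
        self.assertEqual(record, {" Name ": " Ada "})
```

Saved in a module such as `test_clean.py` (an assumed file name), the tests run with `python -m unittest test_clean`.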

Benefits of (unit) testing
• Safety net for refactoring
• Safety net for lib upgrades
• Validate your assumptions
• Document code / communicate your intentions
• You’re forced to think

Testing: not convinced yet?
    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    assert min_bound <= f1 <= max_bound
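
The snippet above does not define `fscore`. A runnable sketch, assuming it is the usual F1 score (the harmonic mean of precision `p` and recall `r`). That definition, the 0.0 result when both inputs are zero, and the small tolerance are assumptions made for this example.

```python
import itertools
import unittest


def fscore(p, r):
    """F1 score: harmonic mean of precision and recall (assumed definition)."""
    if p + r == 0:
        return 0.0  # assumed convention to avoid dividing by zero
    return 2 * p * r / (p + r)


class FscoreBoundsTest(unittest.TestCase):

    # Floating-point rounding can push f1 marginally outside the bounds
    # (for example when p == r), so the check allows a tiny tolerance.
    TOLERANCE = 1e-12

    def test_f1_lies_between_precision_and_recall(self):
        values = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
        for p, r in itertools.product(values, repeat=2):
            f1 = fscore(p, r)
            min_bound, max_bound = sorted([p, r])
            with self.subTest(p=p, r=r):
                self.assertLessEqual(min_bound - self.TOLERANCE, f1)
                self.assertLessEqual(f1, max_bound + self.TOLERANCE)
```

The test checks a property of the function over a grid of inputs rather than a handful of hand-picked values, so it validates the assumption itself and not just one example of it.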

Testing: I’m almost done
• Unit tests vs Defensive Programming
• Say no to tautologies
• Say no to vanity tests
• Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …

</rant>

Intro to Luigi
GNU Make + Unix pipes + Steroids
• Workflow manager in Python, by Spotify
• Dependency management
• Error control, checkpoints, failure recovery
• Minimal boilerplate
• Dependency graph visualisation
    $ pip install luigi

Luigi Task: unit of execution
    import luigi

    class MyTask(luigi.Task):

        def requires(self):
            pass  # list of dependencies

        def output(self):
            pass  # task output

        def run(self):
            pass  # task logic

Luigi Target: output of a task
    import luigi

    class MyTarget(luigi.Target):

        def exists(self):
            pass  # return bool
Off-the-shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)

Not only Luigi
• More Python-based workflow managers:
• Airflow by Airbnb
• Mrjob by Yelp
• Pinball by Pinterest

When things go wrong
• import logging
• Say no to print() for debugging
• Custom log format / extensive info
• Different levels of severity
• Easy to switch off or change level
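
A minimal sketch of these points with the standard `logging` module. The logger name, the format string and the `parse_amount` step are hypothetical choices made for this example.

```python
import logging

logger = logging.getLogger("pipeline.clean")  # hypothetical logger name


def configure_logging(level=logging.INFO):
    """Set a custom format once, at the entry point of the pipeline."""
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s:%(lineno)d %(message)s"
    ))
    root = logging.getLogger()
    # Replace existing handlers so repeated calls do not duplicate output.
    root.handlers = [handler]
    root.setLevel(level)


def parse_amount(raw):
    """Hypothetical step: parse a number, logging instead of printing."""
    logger.debug("parsing %r", raw)
    try:
        return float(raw)
    except ValueError:
        logger.warning("skipping malformed amount: %r", raw)
        return None
```

Calling `configure_logging(logging.DEBUG)` shows every parsing step, while `configure_logging(logging.ERROR)` silences the warnings, without touching the code that emits them.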

Who reads the logs?
You’re not going to read the logs, unless…
• E-mail notifications
  • built-in in Luigi
• Slack notifications
    $ pip install luigi_slack  # WIP

Summary
• R&D is not Engineering: can we meet halfway?
• Prototypes vs. Products
• Automation and replicability matter
• You need a workflow manager
• Good engineering principles help:
  • Testing, logging, packaging, …

Vanity Slide
• speakerdeck.com/marcobonzanini
• github.com/bonzanini
• marcobonzanini.com
• @MarcoBonzanini