Building Data Pipelines in Python
Marco Bonzanini
April 16, 2016
Slides of my talk at PyCon7 in Florence (April 2016)
Transcript
Building Data Pipelines in Python
Marco Bonzanini
PyCon Italia - Florence 2016
Nice to meet you
• @MarcoBonzanini
• “Type B” Data Scientist
• PhD in Information Retrieval
• Book with PacktPub (July 2016)
• Usually at PyData London
R&D ≠ Engineering
R&D results in production = high value
Big Data Problems vs Big Data Problems
Data Pipelines
Data / ETL / Analytics
• Many components in a data pipeline:
  • Extract, Clean, Augment, Join data
Good Data Pipelines
• Easy to reproduce
• Easy to productise
Towards Good Pipelines
• Transform your data, don’t overwrite
• Break it down into components
• Different packages (e.g. setup.py)
• Unit tests vs end-to-end tests
Good = Replicable and Productisable
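
The "transform, don't overwrite" advice can be sketched as a step that reads one file and writes a separate one. This is only an illustration: the function name `clean_rows`, the file names and the cleaning rule (trim and lower-case each cell) are assumptions, not taken from the talk.

```python
import csv
import tempfile
from pathlib import Path


def clean_rows(src, dst):
    """Read raw CSV rows from src and write cleaned rows to a new file dst.

    The raw input is never modified, so the step can be re-run at any time.
    Returns the number of rows written.
    """
    src, dst = Path(src), Path(dst)
    if src.resolve() == dst.resolve():
        raise ValueError("refusing to overwrite the input file")
    written = 0
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            cleaned = [cell.strip().lower() for cell in row]
            if any(cleaned):  # skip rows that are empty after cleaning
                writer.writerow(cleaned)
                written += 1
    return written


def demo():
    """Run the step in a temporary directory; return (raw, cleaned, row count)."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.csv"
        clean = Path(tmp) / "clean.csv"
        raw.write_text(" Alice ,LONDON\n\n Bob , Florence \n")
        n = clean_rows(raw, clean)
        return raw.read_text(), clean.read_text(), n
```

Because the raw file is left untouched, a bug in the cleaning rule can be fixed and the step re-run without fetching the data again.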
Anti-Patterns
• Bunch of scripts
• Single run-everything script
• Hacky homemade dependency control
• Don’t reinvent the wheel
Intermezzo
Let me rant about testing
(Unit) Testing
• Unit tests in three easy steps:
  • import unittest
  • Write your tests
  • Quit complaining about lack of time to write tests
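
The first two of the three steps might look like the sketch below. The function under test, `normalise_name`, is a hypothetical cleaning step invented for this example, and `run_tests` is just a convenience wrapper.

```python
import unittest


def normalise_name(raw):
    """Hypothetical cleaning step: collapse whitespace and title-case a name."""
    return " ".join(raw.split()).title()


class NormaliseNameTest(unittest.TestCase):

    def test_strips_and_collapses_whitespace(self):
        self.assertEqual(normalise_name("  marco   bonzanini "), "Marco Bonzanini")

    def test_empty_string_stays_empty(self):
        self.assertEqual(normalise_name(""), "")


def run_tests():
    """Run the suite programmatically and return the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormaliseNameTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the test class would live in its own module and be run with `python -m unittest`.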
Benefits of (unit) testing
• Safety net for refactoring
• Safety net for lib upgrades
• Validate your assumptions
• Document code / communicate your intentions
• You’re forced to think
Testing: not convinced yet?
    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    assert min_bound <= f1 <= max_bound
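
The slide does not show how `fscore` is defined. Assuming it is the standard F1 score (the harmonic mean of precision `p` and recall `r`), the assertion can be checked over a grid of inputs as sketched below; the helper `check_f1_bounds` and the small floating-point tolerance are additions for this example.

```python
import itertools


def fscore(p, r):
    """F1 score: harmonic mean of precision p and recall r (0.0 if both are 0)."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


def check_f1_bounds(steps=10):
    """Assert the slide's property on a grid of (p, r) pairs; return the count."""
    values = [i / steps for i in range(steps + 1)]
    checked = 0
    for p, r in itertools.product(values, repeat=2):
        f1 = fscore(p, r)
        min_bound, max_bound = sorted([p, r])
        # small tolerance for floating-point rounding
        assert min_bound - 1e-12 <= f1 <= max_bound + 1e-12, (p, r, f1)
        checked += 1
    return checked
```

A harmonic mean always lies between its two inputs, so a value outside that range points to a bug in the metric rather than in the model being evaluated.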
Testing: I’m almost done
• Unit tests vs Defensive Programming
• Say no to tautologies
• Say no to vanity tests
• Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …
</rant>
Intro to Luigi
GNU Make + Unix pipes + Steroids
• Workflow manager in Python, by Spotify
• Dependency management
• Error control, checkpoints, failure recovery
• Minimal boilerplate
• Dependency graph visualisation
$ pip install luigi
Luigi Task: unit of execution
    class MyTask(luigi.Task):
        def requires(self):
            pass  # list of dependencies
        def output(self):
            pass  # task output
        def run(self):
            pass  # task logic
Luigi Target: output of a task
    class MyTarget(luigi.Target):
        def exists(self):
            pass  # return bool
Off the shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
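
To show how `requires`, `output`, `run` and `exists` fit together, here is a toy, standard-library-only sketch of the idea. It is not Luigi code and makes no claim about Luigi's internals: `FileTarget`, `Task`, `Extract`, `SortNumbers` and `build` are all invented for this illustration.

```python
import tempfile
from pathlib import Path


class FileTarget:
    """Toy stand-in for a Target: the output exists if the file exists."""

    def __init__(self, path):
        self.path = Path(path)

    def exists(self):
        return self.path.exists()


class Task:
    """Toy stand-in for a Task, using the three methods from the slide."""

    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


class Extract(Task):
    def output(self):
        return FileTarget(self.workdir / "raw.txt")

    def run(self):
        self.output().path.write_text("3\n1\n2\n")


class SortNumbers(Task):
    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return FileTarget(self.workdir / "sorted.txt")

    def run(self):
        raw = self.requires()[0].output().path.read_text().split()
        self.output().path.write_text("\n".join(sorted(raw, key=int)) + "\n")


def build(task, executed=None):
    """Run dependencies first; skip any task whose output already exists."""
    executed = [] if executed is None else executed
    if task.output().exists():
        return executed  # checkpoint: nothing to do
    for dep in task.requires():
        build(dep, executed)
    task.run()
    executed.append(type(task).__name__)
    return executed
```

Calling `build(SortNumbers(some_dir))` twice runs both tasks the first time and nothing the second time, which is the checkpoint behaviour the Intro to Luigi slide lists.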
Not only Luigi
• More Python-based workflow managers:
  • Airflow by Airbnb
  • Mrjob by Yelp
  • Pinball by Pinterest
When things go wrong
• import logging
• Say no to print() for debugging
• Custom log format / extensive info
• Different levels of severity
• Easy to switch off or change level
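
The logging points above can be sketched with the standard `logging` module. The logger name `pipeline`, the format string and the helper names `make_logger` and `demo` are illustrative choices, not something shown in the talk.

```python
import io
import logging


def make_logger(name="pipeline", level=logging.INFO, stream=None):
    """Build a logger with a custom format; name and format are illustrative."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)  # stream=None means sys.stderr
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False  # keep output out of the root logger
    return logger


def demo():
    """Log at several severities to an in-memory buffer; return its contents."""
    buffer = io.StringIO()
    log = make_logger(level=logging.INFO, stream=buffer)
    log.debug("row-level detail")  # below INFO, so dropped
    log.info("loaded %d rows", 42)
    log.warning("3 rows had missing values")
    log.setLevel(logging.ERROR)  # one line to quieten the pipeline
    log.info("this is now suppressed")
    return buffer.getvalue()
```

Unlike scattered `print()` calls, the severity threshold is changed in one place and the format carries a timestamp and level on every line.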
Who reads the logs?
You’re not going to read the logs, unless…
• E-mail notifications
  • built-in in Luigi
• Slack notifications
  $ pip install luigi_slack # WIP
Summary
• R&D is not Engineering: can we meet halfway?
• Prototypes vs. Products
• Automation and replicability matter
• You need a workflow manager
• Good engineering principles help:
  • Testing, logging, packaging, …
Vanity Slide
• speakerdeck.com/marcobonzanini
• github.com/bonzanini
• marcobonzanini.com
• @MarcoBonzanini