Building Data Pipelines in Python
Marco Bonzanini
April 16, 2016
Slides of my talk at PyCon7 in Florence (April 2016)
Transcript
Building Data Pipelines in Python
Marco Bonzanini
PyCon Italia - Florence 2016
Nice to meet you
• @MarcoBonzanini
• “Type B” Data Scientist
• PhD in Information Retrieval
• Book with PacktPub (July 2016)
• Usually at PyData London
R&D ≠ Engineering
R&D results in production = high value
Big Data Problems vs Big Data Problems
Data Pipelines
Data → ETL → Analytics
• Many components in a data pipeline:
  • Extract, Clean, Augment, Join data
Good Data Pipelines
• Easy to reproduce
• Easy to productise
Towards Good Pipelines
• Transform your data, don’t overwrite
• Break it down into components
• Different packages (e.g. setup.py)
• Unit tests vs end-to-end tests
Good = Replicable and Productisable
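
A minimal sketch of "transform, don't overwrite" combined with "break it down into components". The function names, the file names and the sample CSV are made up for illustration and are not from the talk. Each stage is a small function that can be unit-tested on its own, and each result is written to a new file so the raw input stays intact and the run can be reproduced.

```python
import csv
import tempfile
from pathlib import Path


def extract(raw_path: Path) -> list[dict]:
    """Read a CSV file into a list of rows; the file itself is never modified."""
    with raw_path.open(newline="") as f:
        return list(csv.DictReader(f))


def clean(rows: list[dict]) -> list[dict]:
    """Return new rows with whitespace stripped and nameless rows dropped."""
    cleaned = [{key: value.strip() for key, value in row.items()} for row in rows]
    return [row for row in cleaned if row["name"]]


def write_stage(rows: list[dict], out_path: Path) -> Path:
    """Persist a stage's result to its own file instead of overwriting the input."""
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return out_path


# Hypothetical raw input, created in a temporary directory for the example.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.csv"
raw.write_text("name,city\n Ada ,London\n,Paris\nGrace , New York\n")

# raw.csv is left untouched; the cleaned data lands in cleaned.csv.
cleaned_path = write_stage(clean(extract(raw)), workdir / "cleaned.csv")
```

Because `clean` takes rows and returns rows, a unit test can exercise it without touching the file system, while an end-to-end test can run the whole chain on a small fixture file.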
Anti-Patterns
• Bunch of scripts
• Single run-everything script
• Hacky homemade dependency control
• Don’t reinvent the wheel
Intermezzo
Let me rant about testing
(Icon by Freepik from flaticon.com)
(Unit) Testing
• Unit tests in three easy steps:
  • import unittest
  • Write your tests
  • Quit complaining about lack of time to write tests
Benefits of (unit) testing
• Safety net for refactoring
• Safety net for lib upgrades
• Validate your assumptions
• Document code / communicate your intentions
• You’re forced to think
Testing: not convinced yet?
    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    assert min_bound <= f1 <= max_bound
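
The check on the slide can be turned into a runnable unit test. The sketch below assumes `fscore` is the usual F1 score (the harmonic mean of precision and recall); the slide does not show its implementation, so this body and the sample values are illustrative only.

```python
import unittest


def fscore(precision: float, recall: float) -> float:
    """Assumed definition: harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


class FscoreTest(unittest.TestCase):
    def test_f1_lies_between_precision_and_recall(self):
        # Illustrative (precision, recall) pairs, including edge cases.
        cases = [(0.5, 0.5), (0.9, 0.1), (0.0, 1.0), (1.0, 1.0), (0.3, 0.7)]
        for p, r in cases:
            with self.subTest(p=p, r=r):
                f1 = fscore(p, r)
                min_bound, max_bound = sorted([p, r])
                self.assertLessEqual(min_bound, f1)
                self.assertLessEqual(f1, max_bound)
```

Saved in a test module, this runs with `python -m unittest`. The test does not restate the formula; it checks a property that any correct implementation must satisfy, which is one way to avoid tautological tests.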
Testing: I’m almost done
• Unit tests vs Defensive Programming
• Say no to tautologies
• Say no to vanity tests
• Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …
</rant>
Intro to Luigi
GNU Make + Unix pipes + Steroids
• Workflow manager in Python, by Spotify
• Dependency management
• Error control, checkpoints, failure recovery
• Minimal boilerplate
• Dependency graph visualisation
$ pip install luigi
Luigi Task: unit of execution
    import luigi

    class MyTask(luigi.Task):
        def requires(self):
            pass  # list of dependencies
        def output(self):
            pass  # task output
        def run(self):
            pass  # task logic
Luigi Target: output of a task
    class MyTarget(luigi.Target):
        def exists(self):
            pass  # return bool
Off the shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
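
To show how the requires / output / run contract fits together with targets, here is a toy scheduler in plain Python. It is deliberately not Luigi: `FileTarget`, `Task`, `build` and the two example tasks are hypothetical stand-ins written for this sketch, and the real library adds scheduling, parameters, error handling and much more.

```python
import tempfile
from pathlib import Path


class FileTarget:
    """Hypothetical stand-in for a target: an output that either exists or not."""

    def __init__(self, path: Path):
        self.path = path

    def exists(self) -> bool:
        return self.path.exists()


class Task:
    """Hypothetical base class mirroring the requires/output/run contract."""

    def requires(self):
        return []  # list of upstream tasks

    def output(self) -> FileTarget:
        raise NotImplementedError

    def run(self) -> None:
        raise NotImplementedError


def build(task: Task, log: list) -> None:
    """Run dependencies first; skip any task whose output already exists."""
    if task.output().exists():
        return  # checkpoint: nothing to redo
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)  # record execution order


WORKDIR = Path(tempfile.mkdtemp())


class WriteNumbers(Task):
    def output(self):
        return FileTarget(WORKDIR / "numbers.txt")

    def run(self):
        self.output().path.write_text("1\n2\n3\n")


class SumNumbers(Task):
    def requires(self):
        return [WriteNumbers()]

    def output(self):
        return FileTarget(WORKDIR / "total.txt")

    def run(self):
        numbers = self.requires()[0].output().path.read_text().split()
        self.output().path.write_text(str(sum(int(n) for n in numbers)))


first_run, second_run = [], []
build(SumNumbers(), first_run)   # runs WriteNumbers, then SumNumbers
build(SumNumbers(), second_run)  # output already exists, so nothing runs
```

The second call does no work because the target already exists. That is the checkpoint idea: after a failure, only the tasks whose outputs are missing need to run again.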
Not only Luigi
• More Python-based workflow managers:
  • Airflow by Airbnb
  • Mrjob by Yelp
  • Pinball by Pinterest
When things go wrong
• import logging
• Say no to print() for debugging
• Custom log format / extensive info
• Different levels of severity
• Easy to switch off or change level
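
A short sketch of these points with the standard `logging` module. The logger name, the messages and the in-memory stream are invented for the example; in a real pipeline the handler would usually write to a file or to stderr.

```python
import io
import logging

log_stream = io.StringIO()  # stand-in for a log file or stderr
handler = logging.StreamHandler(log_stream)
# Custom format: timestamp, logger name, severity, message.
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)

logger = logging.getLogger("pipeline.clean")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root logger
logger.setLevel(logging.INFO)

logger.debug("row 17 has trailing whitespace")  # below INFO: dropped
logger.info("cleaned %d rows", 1200)
logger.warning("dropped %d rows with no name", 3)

logger.setLevel(logging.WARNING)  # one line changes the verbosity
logger.info("this message is now suppressed")

lines = log_stream.getvalue().splitlines()
```

Unlike scattered `print()` calls, the debug statements can stay in the code: changing the level silences them without editing every call site.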
Who reads the logs?
You’re not going to read the logs, unless…
• E-mail notifications
  • built-in in Luigi
• Slack notifications
$ pip install luigi_slack # WIP
Summary
• R&D is not Engineering: can we meet halfway?
• Prototypes vs. Products
• Automation and replicability matter
• You need a workflow manager
• Good engineering principles help:
  • Testing, logging, packaging, …
Vanity Slide
• speakerdeck.com/marcobonzanini
• github.com/bonzanini
• marcobonzanini.com
• @MarcoBonzanini