Building Data Pipelines in Python
Marco Bonzanini
April 16, 2016
Slides of my talk at PyCon7 in Florence (April 2016)
Transcript
Building Data Pipelines in Python
Marco Bonzanini
PyCon Italia - Florence 2016

Nice to meet you
• @MarcoBonzanini
• “Type B” Data Scientist
• PhD in Information Retrieval
• Book with PacktPub (July 2016)
• Usually at PyData London

R&D ≠ Engineering
R&D results in production = high value

Big Data Problems vs Big Data Problems

Data Pipelines
Data ETL Analytics
• Many components in a data pipeline:
• Extract, Clean, Augment, Join data

Good Data Pipelines
Easy to reproduce
Easy to productise

Towards Good Pipelines
• Transform your data, don’t overwrite
• Break it down into components
• Different packages (e.g. setup.py)
• Unit tests vs end-to-end tests
Good = Replicable and Productisable

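One way to read the first two bullets: each step is a small function that reads its input and writes a new file, so the original data is never touched and any step can be rerun on its own. The sketch below is illustrative only; the step names (`clean`, `augment`) and the two-column CSV layout are assumptions, not taken from the talk.

```python
import csv
import tempfile
from pathlib import Path


def clean(src: Path, dst: Path) -> Path:
    """Write a cleaned copy of src to dst; src is never modified."""
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            # Drop rows with an empty field; strip whitespace from the rest.
            if all(cell.strip() for cell in row):
                writer.writerow([cell.strip() for cell in row])
    return dst


def augment(src: Path, dst: Path) -> Path:
    """Write a copy of src with an extra column holding the name length."""
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        writer = csv.writer(fout)
        for name, city in csv.reader(fin):
            writer.writerow([name, city, len(name)])
    return dst


# Hypothetical sample data in a throwaway directory.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.csv"
raw.write_text("alice , Florence\nbob,\ncarol,London\n")

# Each step produces a new file; raw.csv stays as it was.
cleaned = clean(raw, workdir / "cleaned.csv")
augmented = augment(cleaned, workdir / "augmented.csv")
```

Because every intermediate file is kept, a failure in `augment` can be investigated and rerun without repeating `clean`.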
Anti-Patterns
• Bunch of scripts
• Single run-everything script
• Hacky homemade dependency control
• Don’t reinvent the wheel

Intermezzo
Let me rant about testing
(Icon by Freepik from flaticon.com)

(Unit) Testing
• Unit tests in three easy steps:
• import unittest
• Write your tests
• Quit complaining about lack of time to write tests

Benefits of (unit) testing
• Safety net for refactoring
• Safety net for lib upgrades
• Validate your assumptions
• Document code / communicate your intentions
• You’re forced to think

Testing: not convinced yet?
    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    assert min_bound <= f1 <= max_bound

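The slide's assertion checks a property of the F-score: it must lie between precision and recall. A minimal sketch of that check as a `unittest` test case follows. The slide does not show `fscore` itself, so the harmonic-mean definition here is an assumption, as are the test values.

```python
import unittest


def fscore(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall (0.0 if both are 0).

    Assumed definition; the talk does not show the implementation.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


class FScoreBoundsTest(unittest.TestCase):
    def test_f1_lies_between_precision_and_recall(self):
        # Values are exact binary fractions, so the bounds are not blurred
        # by floating-point rounding (0.1, for instance, would be).
        values = [0.0, 0.125, 0.25, 0.5, 0.75, 1.0]
        for p in values:
            for r in values:
                with self.subTest(p=p, r=r):
                    f1 = fscore(p, r)
                    min_bound, max_bound = sorted([p, r])
                    self.assertLessEqual(min_bound, f1)
                    self.assertLessEqual(f1, max_bound)


# Run the test case programmatically and keep the result object.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(FScoreBoundsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A test like this validates an assumption about the metric rather than a single hard-coded value, which is the kind of check that catches a swapped numerator or a mistyped formula.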
Testing: I’m almost done
• Unit tests vs Defensive Programming
• Say no to tautologies
• Say no to vanity tests
• Know the ecosystem: py.test, nosetests, hypothesis, coverage.py, …

</rant>

Intro to Luigi
GNU Make + Unix pipes + Steroids
• Workflow manager in Python, by Spotify
• Dependency management
• Error control, checkpoints, failure recovery
• Minimal boilerplate
• Dependency graph visualisation
$ pip install luigi

Luigi Task: unit of execution
    import luigi

    class MyTask(luigi.Task):

        def requires(self):
            pass  # list of dependencies

        def output(self):
            pass  # task output

        def run(self):
            pass  # task logic

Luigi Target: output of a task
    class MyTarget(luigi.Target):

        def exists(self):
            pass  # return bool
Off the shelf support for local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)

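The requires/output/run contract and the role of `exists()` as a checkpoint can be illustrated without Luigi itself. The sketch below is not Luigi's implementation and uses none of its API: `FileTarget`, `Task`, and `build` are hypothetical stand-ins written with the standard library only, to show how "output already exists" lets a rerun skip finished work.

```python
import tempfile
from pathlib import Path


class FileTarget:
    """Hypothetical target: 'done' simply means the file exists."""

    def __init__(self, path: Path):
        self.path = path

    def exists(self) -> bool:
        return self.path.exists()


class Task:
    """Hypothetical base class mirroring the requires/output/run contract."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run dependencies first; skip any task whose output already exists."""
    if task.output().exists():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


WORKDIR = Path(tempfile.mkdtemp())


class Extract(Task):
    def output(self):
        return FileTarget(WORKDIR / "raw.txt")

    def run(self):
        self.output().path.write_text("3\n1\n2\n")


class Sort(Task):
    def requires(self):
        return [Extract()]

    def output(self):
        return FileTarget(WORKDIR / "sorted.txt")

    def run(self):
        lines = self.requires()[0].output().path.read_text().splitlines()
        self.output().path.write_text("\n".join(sorted(lines)) + "\n")


first_run, second_run = [], []
build(Sort(), first_run)   # runs Extract, then Sort
build(Sort(), second_run)  # nothing to do: both outputs already exist
```

This is exactly the kind of homemade dependency control the talk warns against using in production; a workflow manager adds scheduling, failure recovery, and visualisation on top of the same idea.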
Not only Luigi
• More Python-based workflow managers:
• Airflow by Airbnb
• Mrjob by Yelp
• Pinball by Pinterest

When things go wrong
• import logging
• Say no to print() for debugging
• Custom log format / extensive info
• Different levels of severity
• Easy to switch off or change level

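The bullets above map directly onto the standard `logging` module. A minimal sketch follows; the logger name, the format string, and the messages are made up for illustration, and an in-memory stream stands in for a log file or stderr.

```python
import io
import logging

# Custom format: timestamp, severity, logger name, and the message.
LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s: %(message)s"

stream = io.StringIO()  # stands in for a file or stderr in this sketch
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(LOG_FORMAT))

logger = logging.getLogger("pipeline.clean")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False  # keep the output confined to our handler
logger.setLevel(logging.INFO)

logger.debug("row 17 has 3 fields")  # below INFO: dropped
logger.info("cleaned %d rows", 120)
logger.warning("skipped %d malformed rows", 4)

# Changing verbosity is one line, with no edits to the call sites.
logger.setLevel(logging.ERROR)
logger.info("this message is suppressed")

captured = stream.getvalue().splitlines()
```

Unlike `print()` calls, the debug statements can stay in the code: they cost almost nothing when the level is raised and come back by lowering it again.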
Who reads the logs?
You’re not going to read the logs, unless…
• E-mail notifications
  • built-in in Luigi
• Slack notifications
$ pip install luigi_slack # WIP

Summary
• R&D is not Engineering: can we meet halfway?
• Prototypes vs. Products
• Automation and replicability matter
• You need a workflow manager
• Good engineering principles help:
• Testing, logging, packaging, …

Vanity Slide
• speakerdeck.com/marcobonzanini
• github.com/bonzanini
• marcobonzanini.com
• @MarcoBonzanini