Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
PyConDE 2016 - Building Data Pipelines with P...
Search
Miguel Cabrera
October 31, 2016
Technology
0
330
PyConDE 2016 - Building Data Pipelines with Python
Miguel Cabrera
October 31, 2016
Tweet
Share
More Decks by Miguel Cabrera
See All by Miguel Cabrera
Machine Learning for Time Series Forecasting
mfcabrera
0
320
Data Science in Fashion - Exploring Demand Forecasting
mfcabrera
0
140
Helping Travellers Make Better Hotel Choices 500 Million Times a Month
mfcabrera
1
180
Europython 2016 - Things I wish I knew before using Python for Data Processing
mfcabrera
1
1.3k
PyData Berlin Meetup Nov 2015 - (Some of the) things I wish I knew before starting using Python for Data Science
mfcabrera
0
210
Python and Life Hacking with Emacs
mfcabrera
2
360
PyData Berlin 2015 - Processing Hotel Reviews with Python
mfcabrera
4
2k
Munich Datageeks - Introduction to SVM using Python
mfcabrera
2
310
Dictionary Learning for Music Genre Recognition
mfcabrera
0
260
Other Decks in Technology
See All in Technology
Claude_CodeでSEOを最適化する_AI_Ops_Community_Vol.2__マーケティングx_AIはここまで進化した.pdf
riku_423
2
530
20260208_第66回 コンピュータビジョン勉強会
keiichiito1978
0
100
usermode linux without MMU - fosdem2026 kernel devroom
thehajime
0
230
セキュリティについて学ぶ会 / 2026 01 25 Takamatsu WordPress Meetup
rocketmartue
1
300
Bill One 開発エンジニア 紹介資料
sansan33
PRO
4
17k
GitHub Issue Templates + Coding Agentで簡単みんなでIaC/Easy IaC for Everyone with GitHub Issue Templates + Coding Agent
aeonpeople
1
200
OWASP Top 10:2025 リリースと 少しの日本語化にまつわる裏話
okdt
PRO
3
590
ClickHouseはどのように大規模データを活用したAIエージェントを全社展開しているのか
mikimatsumoto
0
210
2026年、サーバーレスの現在地 -「制約と戦う技術」から「当たり前の実行基盤」へ- /serverless2026
slsops
2
220
レガシー共有バッチ基盤への挑戦 - SREドリブンなリアーキテクチャリングの取り組み
tatsukoni
0
210
[CV勉強会@関東 World Model 読み会] Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models (Mousakhan+, NeurIPS 2025)
abemii
0
110
広告の効果検証を題材にした因果推論の精度検証について
zozotech
PRO
0
150
Featured
See All Featured
Gemini Prompt Engineering: Practical Techniques for Tangible AI Outcomes
mfonobong
2
280
A Guide to Academic Writing Using Generative AI - A Workshop
ks91
PRO
0
200
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
141
34k
SERP Conf. Vienna - Web Accessibility: Optimizing for Inclusivity and SEO
sarafernandez
1
1.3k
Claude Code どこまでも/ Claude Code Everywhere
nwiizo
61
52k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
254
22k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
35
2.4k
Marketing Yourself as an Engineer | Alaka | Gurzu
gurzu
0
130
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
55
3.2k
Java REST API Framework Comparison - PWX 2021
mraible
34
9.1k
Building Flexible Design Systems
yeseniaperezcruz
330
40k
30 Presentation Tips
portentint
PRO
1
210
Transcript
Building Data Pipelines with Python Data Engineer @ TY
@mfcabrera
[email protected]
Miguel Cabrera PyCon Deutschland 30.10.2016
Agenda
Agenda Context Data Pipelines with Luigi Tips and
Tricks Examples
Data Processing Pipelines
cat file.txt | wc -‐ l | mail -‐s
“hello”
[email protected]
ETL
ETL • Extract data from a data source •
Transform the data • Load into a sink
None
Feature Extraction Parameter Estimation Model Training Feature Extraction
Model Predict Visualize/ Format
Steps in different technologies
Steps can be run in parallel
Steps have complex dependencies among them
Workflows • Repeat • Parametrize •
Resume • Schedule it
None
None
“A Python framework for data flow definition and execution” Luigi
Concepts
Concepts Tasks Parameters Targets Scheduler & Workers
Tasks
None
1
2
3
4
WordCountTask file.txt wc.txt
WordCountTask file.txt wc.txt ToJsonTask wc.json
None
Parameters
None
Parameters Used to idenNfy the task From arguments
or from configuraNon Many types of Parameters (int, date, boolean, date range, Nme delta, dict, enum)
Targets
Targets Resources produced by a Task Typically Local files
or files distributed file system (HDFS) Must implement the method exists() Many targets available
None
Scheduler & Workers
None
Source: h@p:/ /www.arashrouhani.com/luigid-‐basics-‐jun-‐2015
BaVeries Included
Batteries Included Package contrib filled with goodies Good support
for Hadoop Different Targets Extensible
Task Types Task -‐ Local Hadoop MR, Pig, Spark,
etc SalesForce, ElasNcsearch, etc. ExternalProgram check luigi.contrib !
Target LocalTarget HDFS, S3, FTP, SSH, WebHDFS, etc.
ESTarget, MySQLTarget, MSQL, Hive, SQLAlchemy, etc.
None
Tips & Tricks
Separate pipeline and logic
Extend to avoid boilerplate code
DRY
Conclusion Luigi is a mature, baVeries-‐included alternaNve for building
data pipelines Lacks of powerful visualizaNon of the pipelines Requires a external way of launching jobs (i.e. cron). Hard to debug MR Jobs
Lear More hVps:/ /github.com/spoNfy/luigi hVp:/ /luigi.readthedocs.io/en/stable/
Thanks!
Credits • pipe icon by Oliviu Stoian from the Noun
Project • Photo Credit: (CC) h@ps:/ /www.flickr.com/photos/ 47244853@N03/29988510886 from hb.s via Compfight • Concrete Mixer: (CC) h@ps:/ /www.flickr.com/photos/ 145708285@N03/30138453986 by MasLabor via Compfight