PyConDE 2016 - Building Data Pipelines with Python
Miguel Cabrera
October 31, 2016
Transcript
Building Data Pipelines with Python
Miguel Cabrera
Data Engineer @ TY
@mfcabrera
[email protected]
PyCon Deutschland, 30.10.2016
Agenda
• Context
• Data Pipelines with Luigi
• Tips and Tricks
• Examples
Data Processing Pipelines
cat file.txt | wc -l | mail -s "hello" [email protected]
ETL
• Extract data from a data source
• Transform the data
• Load into a sink
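The three ETL steps above can be sketched with the standard library alone. This is a minimal illustration, not code from the talk: the function names, the hotel/rating sample data, and the in-memory source and sink are all assumptions made for the example.

```python
import csv
import io
import json


def extract(source):
    """Extract: read rows from a CSV data source as dictionaries."""
    return list(csv.DictReader(source))


def transform(rows):
    """Transform: keep rows that have a rating and cast it to a number."""
    return [
        {"hotel": row["hotel"], "rating": float(row["rating"])}
        for row in rows
        if row["rating"]
    ]


def load(rows, sink):
    """Load: write one JSON document per line into the sink."""
    for row in rows:
        sink.write(json.dumps(row) + "\n")
    return len(rows)


# Hypothetical sample data standing in for a real source and sink.
source = io.StringIO("hotel,rating\nAlpha,4.5\nBeta,\nGamma,3.0\n")
sink = io.StringIO()
loaded = load(transform(extract(source)), sink)
```

Keeping each step a plain function makes it possible to swap the source or the sink without touching the transformation.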
[Diagram: Feature Extraction, Parameter Estimation, Model Training; Feature Extraction, Model Predict, Visualize/Format]
Steps in different technologies
Steps can be run in parallel
Steps have complex dependencies among them
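One way to picture steps that depend on each other yet can partly run in parallel is a dependency graph executed in batches. The sketch below uses only the standard library (graphlib requires Python 3.9 or later) and is not how Luigi schedules work. The step names are borrowed from the diagram earlier in the deck, while the edges between them and the run_step placeholder are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each step maps to the steps it depends on.
graph = {
    "extract_features": set(),
    "estimate_parameters": set(),
    "train_model": {"extract_features", "estimate_parameters"},
    "predict": {"train_model"},
    "visualize": {"predict"},
}


def run_step(name):
    """Placeholder for real work; returns the step name when done."""
    return name


def run_pipeline(graph, max_workers=2):
    """Run steps in dependency order, each ready batch in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Steps whose dependencies have all finished.
            ready = sorted(sorter.get_ready())
            finished = list(pool.map(run_step, ready))
            sorter.done(*finished)
            batches.append(finished)
    return batches


batches = run_pipeline(graph)
```

Here the two independent steps land in the same batch, and every later step waits for the batch it depends on.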
Workflows
• Repeat
• Parametrize
• Resume
• Schedule it
Luigi: “A Python framework for data flow definition and execution”
Concepts
• Tasks
• Parameters
• Targets
• Scheduler & Workers
Tasks
file.txt → WordCountTask → wc.txt → ToJsonTask → wc.json
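The code from the slide did not survive in the transcript, so what follows is a stand-in rather than the original. Luigi tasks conventionally define requires(), output() and run(); this plain-Python sketch borrows those method names to mimic the WordCountTask and ToJsonTask chain, but SketchTask and build() are hypothetical helpers, not Luigi classes or functions.

```python
import json
import os
import tempfile


class SketchTask:
    """Hypothetical stand-in for a pipeline task (not the Luigi class)."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []  # upstream tasks this task depends on

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError  # the actual work


class WordCountTask(SketchTask):
    def output(self):
        return os.path.join(self.workdir, "wc.txt")

    def run(self):
        with open(os.path.join(self.workdir, "file.txt")) as src:
            count = len(src.read().split())
        with open(self.output(), "w") as dst:
            dst.write(str(count))


class ToJsonTask(SketchTask):
    def requires(self):
        return [WordCountTask(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "wc.json")

    def run(self):
        with open(self.requires()[0].output()) as src:
            count = int(src.read())
        with open(self.output(), "w") as dst:
            json.dump({"words": count}, dst)


def build(task):
    """Run upstream tasks first; skip any task whose output exists."""
    for upstream in task.requires():
        build(upstream)
    if not os.path.exists(task.output()):
        task.run()


workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "file.txt"), "w") as f:
    f.write("building data pipelines with python")
build(ToJsonTask(workdir))
with open(os.path.join(workdir, "wc.json")) as f:
    result = json.load(f)
```

Because build() skips tasks whose output already exists, calling it again after a failure picks up where the previous run stopped.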
Parameters
• Used to identify the task
• From arguments or from configuration
• Many types of parameters (int, date, boolean, date range, time delta, dict, enum)
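The idea that parameters identify a task can be illustrated without Luigi: two task instances with equal parameter values are the same unit of work. In this sketch ReportTask, task_id() and from_arguments() are hypothetical names, and the config dictionary stands in for a configuration file.

```python
import argparse
import datetime
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportTask:
    """Hypothetical task whose identity is its parameter values."""

    date: datetime.date
    limit: int = 10

    def task_id(self):
        return f"ReportTask(date={self.date.isoformat()}, limit={self.limit})"


def from_arguments(argv, config):
    """Read parameters from arguments, falling back to configuration."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--date", type=datetime.date.fromisoformat, required=True)
    parser.add_argument("--limit", type=int, default=config.get("limit", 10))
    args = parser.parse_args(argv)
    return ReportTask(date=args.date, limit=args.limit)


config = {"limit": 25}  # stands in for a configuration file
a = from_arguments(["--date", "2016-10-30"], config)
b = from_arguments(["--date", "2016-10-30", "--limit", "25"], config)
c = from_arguments(["--date", "2016-10-31"], config)
```

Tasks a and b compare equal because their parameters match, so a scheduler would need to run that work only once; c differs by date and is a separate task.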
Targets
• Resources produced by a Task
• Typically local files or files on a distributed file system (HDFS)
• Must implement the method exists()
• Many targets available
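The exists() contract is what lets a pipeline decide whether work is still pending. A minimal sketch of that idea, assuming hypothetical class names (SketchTarget, FileTarget, KeyValueTarget) that are not Luigi's own target classes:

```python
import os
import tempfile
from abc import ABC, abstractmethod


class SketchTarget(ABC):
    """Hypothetical base class: a target only has to answer exists()."""

    @abstractmethod
    def exists(self):
        ...


class FileTarget(SketchTarget):
    """Target backed by a path on the local file system."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class KeyValueTarget(SketchTarget):
    """Target backed by a dict that stands in for a remote store."""

    def __init__(self, store, key):
        self.store = store
        self.key = key

    def exists(self):
        return self.key in self.store


def pending(targets):
    """Return the targets that still have to be produced."""
    return [target for target in targets if not target.exists()]


workdir = tempfile.mkdtemp()
done_path = os.path.join(workdir, "done.txt")
with open(done_path, "w") as f:
    f.write("ok")

store = {"report-2016-10-30": "ready"}
targets = [
    FileTarget(done_path),
    FileTarget(os.path.join(workdir, "missing.txt")),
    KeyValueTarget(store, "report-2016-10-30"),
    KeyValueTarget(store, "report-2016-10-31"),
]
todo = pending(targets)
```

Any storage backend can take part in the pipeline as long as it can answer that one question.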
Scheduler & Workers
Source: http://www.arashrouhani.com/luigid-basics-jun-2015
Batteries Included
• Package contrib filled with goodies
• Good support for Hadoop
• Different Targets
• Extensible
Task Types
• Task: local
• Hadoop MR, Pig, Spark, etc.
• SalesForce, Elasticsearch, etc.
• ExternalProgram
• Check luigi.contrib!
Target
• LocalTarget
• HDFS, S3, FTP, SSH, WebHDFS, etc.
• ESTarget, MySQLTarget, MSSQL, Hive, SQLAlchemy, etc.
Tips & Tricks
Separate pipeline and logic
Extend to avoid boilerplate code
DRY
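The tips above can be read as follows: keep the business logic in plain functions, and put the repeated read/write plumbing into a base class that concrete steps extend. This is one possible interpretation sketched in plain Python; TextStep, WordCountStep and UpperCaseStep are hypothetical names, not classes from the talk or from Luigi.

```python
import io


def count_words(text):
    """Pure logic: testable without any pipeline machinery."""
    return len(text.split())


class TextStep:
    """Hypothetical base class that owns the read/write boilerplate."""

    def transform(self, text):
        raise NotImplementedError

    def run(self, source, sink):
        # Shared plumbing lives here once instead of in every step.
        sink.write(str(self.transform(source.read())))


class WordCountStep(TextStep):
    def transform(self, text):
        return count_words(text)


class UpperCaseStep(TextStep):
    def transform(self, text):
        return text.upper()


count_sink = io.StringIO()
WordCountStep().run(io.StringIO("separate pipeline and logic"), count_sink)

upper_sink = io.StringIO()
UpperCaseStep().run(io.StringIO("dry"), upper_sink)
```

Each new step then only states what it computes, and count_words can be unit tested on its own.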
Conclusion
• Luigi is a mature, batteries-included alternative for building data pipelines
• Lacks powerful visualization of the pipelines
• Requires an external way of launching jobs (e.g. cron)
• Hard to debug MR jobs
Learn More
• https://github.com/spotify/luigi
• http://luigi.readthedocs.io/en/stable/
Thanks!
Credits
• Pipe icon by Oliviu Stoian from the Noun Project
• Photo Credit: (CC) https://www.flickr.com/photos/47244853@N03/29988510886 from hb.s via Compfight
• Concrete Mixer: (CC) https://www.flickr.com/photos/145708285@N03/30138453986 by MasLabor via Compfight