PyConDE 2016 - Building Data Pipelines with Python

Miguel Cabrera

October 31, 2016

Transcript

  1. Building Data Pipelines with Python
    Miguel Cabrera
    Data Engineer @ TY
    @mfcabrera
    [email protected]
    PyCon Deutschland 30.10.2016

  2. Agenda
    Context
    Data Pipelines with Luigi
    Tips and Tricks
    Examples

  3. Data Processing Pipelines

  4. cat file.txt | wc -l | mail -s "hello" [email protected]

  5. ETL
    • Extract data from a data source
    • Transform the data
    • Load into a sink
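
    A minimal sketch of the ETL idea in plain Python (the file names and the CSV-to-JSON shape are assumptions for illustration, not from the talk):

        import csv
        import json

        def extract(path):
            # Extract: read rows from a CSV data source.
            with open(path, newline="") as f:
                return list(csv.DictReader(f))

        def transform(rows):
            # Transform: keep and normalize only the fields we care about.
            return [{"word": r["word"], "count": int(r["count"])} for r in rows]

        def load(records, path):
            # Load: write the result into a sink, here a JSON file.
            with open(path, "w") as f:
                json.dump(records, f, indent=2)

        if __name__ == "__main__":
            load(transform(extract("input.csv")), "output.json")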

  6. [Pipeline diagram] Feature Extraction → Parameter Estimation → Model Training;
    Feature Extraction → Model Predict → Visualize/Format

  7. Steps in different technologies

  8. Steps can be run in parallel

  9. Steps have complex dependencies among them

  10. Workflows
    • Repeat
    • Parametrize
    • Resume
    • Schedule it

  11. “A Python framework for data flow definition and execution”
    Luigi

  12. Concepts
    Tasks
    Parameters
    Targets
    Scheduler & Workers

  13. [Diagram] file.txt → WordCountTask → wc.txt

  14. [Diagram] file.txt → WordCountTask → wc.txt → ToJsonTask → wc.json
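
    A minimal sketch of how the two tasks in these diagrams could look in Luigi, showing the four concepts from slide 12 (Task, Parameter, Target, and the dependency the scheduler resolves); the parameter name and the exact file contents are assumptions:

        import json
        import luigi

        class WordCountTask(luigi.Task):
            # Parameter identifying this task instance (name assumed).
            input_path = luigi.Parameter(default="file.txt")

            def output(self):
                # Target: the file this task produces.
                return luigi.LocalTarget("wc.txt")

            def run(self):
                with open(self.input_path) as infile, self.output().open("w") as outfile:
                    outfile.write(str(len(infile.read().split())))

        class ToJsonTask(luigi.Task):
            input_path = luigi.Parameter(default="file.txt")

            def requires(self):
                # Dependency: WordCountTask runs first; its output becomes our input.
                return WordCountTask(input_path=self.input_path)

            def output(self):
                return luigi.LocalTarget("wc.json")

            def run(self):
                with self.input().open() as infile, self.output().open("w") as outfile:
                    json.dump({"word_count": int(infile.read().strip())}, outfile)

        if __name__ == "__main__":
            # Run locally without the central scheduler.
            luigi.build([ToJsonTask()], local_scheduler=True)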

  15. Parameters
    Used to identify the task
    From arguments or from configuration
    Many types of Parameters (int, date, boolean, date range, time delta, dict, enum)
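
    A short illustration of a few of these parameter types (the task and its parameters are invented for the example):

        import datetime
        import luigi

        class ReportTask(luigi.Task):
            # Hypothetical task showing several built-in parameter types.
            date = luigi.DateParameter(default=datetime.date.today())
            top_n = luigi.IntParameter(default=10)
            dry_run = luigi.BoolParameter(default=False)
            weights = luigi.DictParameter(default={})

    Together, the parameter values identify the task instance; they can be passed as command-line flags (e.g. --date 2016-10-30 --top-n 5) or supplied via the configuration file.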

  16. Targets
    Resources produced by a Task
    Typically local files or files on a distributed file system (HDFS)
    Must implement the method exists()
    Many targets available
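
    As a rough sketch of that contract, a custom target only has to answer whether its resource is already there (this marker-file target is invented for illustration):

        import os
        import luigi

        class MarkerFileTarget(luigi.Target):
            # Hypothetical target: a step counts as done once its marker file exists.
            def __init__(self, path):
                self.path = path

            def exists(self):
                # The one method every Luigi target must implement.
                return os.path.exists(self.path)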

  17. Scheduler & Workers

  18. Source: http://www.arashrouhani.com/luigid-basics-jun-2015
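
    For local experiments the whole thing can also be driven from Python without the central scheduler daemon (luigid); a sketch, assuming the ToJsonTask from the earlier example lives in a module called wordcount_tasks (a made-up name):

        import luigi

        # Hypothetical module holding the WordCountTask/ToJsonTask sketch above.
        from wordcount_tasks import ToJsonTask

        # Resolve the dependency graph and run it with two worker processes.
        # With a running luigid you would drop local_scheduler=True so the
        # workers report to the central scheduler instead.
        luigi.build([ToJsonTask()], workers=2, local_scheduler=True)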

  19. Batteries Included

  20. Batteries Included
    Package contrib filled with goodies
    Good support for Hadoop
    Different Targets
    Extensible

  21. Task Types
    Task - Local
    Hadoop MR, Pig, Spark, etc.
    SalesForce, Elasticsearch, etc.
    ExternalProgram
    Check luigi.contrib!
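
    Wrapping a non-Python step, for example, is a matter of subclassing the contrib ExternalProgramTask and returning the argument list (the script and file names below are placeholders):

        import luigi
        from luigi.contrib.external_program import ExternalProgramTask

        class TrainModelTask(ExternalProgramTask):
            # Hypothetical wrapper around an external training script.
            def program_args(self):
                return ["bash", "train_model.sh", "--input", "features.csv"]

            def output(self):
                return luigi.LocalTarget("model.bin")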

  22. Target
    LocalTarget
    HDFS, S3, FTP, SSH, WebHDFS, etc.
    ESTarget, MySQLTarget, MSSQL, Hive, SQLAlchemy, etc.
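
    Switching a task from local disk to one of these stores is then mostly a change in output(); a sketch assuming the HDFS contrib target (import paths can vary slightly between Luigi versions, and the paths are made up):

        import luigi
        from luigi.contrib.hdfs import HdfsTarget

        class ExportTask(luigi.Task):
            def output(self):
                # Local variant would be: luigi.LocalTarget("export/wc.json")
                return HdfsTarget("/data/export/wc.json")

            def run(self):
                with self.output().open("w") as out:
                    out.write("{}")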

  23. Tips & Tricks

  24. Separate pipeline and logic
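
    One way to read this tip: keep the task as thin glue around inputs and outputs, and put the actual computation in a plain function that can be unit-tested without Luigi (a sketch reworking the word-count example):

        import luigi

        def count_words(text):
            # Pure logic: trivially testable without Luigi, targets, or files.
            return len(text.split())

        class WordCountTask(luigi.Task):
            input_path = luigi.Parameter()

            def output(self):
                return luigi.LocalTarget("wc.txt")

            def run(self):
                # Pipeline glue only: read the input, call the logic, write the target.
                with open(self.input_path) as infile, self.output().open("w") as outfile:
                    outfile.write(str(count_words(infile.read())))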

  25. Extend to avoid boilerplate code
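
    And one way to read this one: move shared parameters and output conventions into a base class that concrete tasks extend (everything below is an invented example):

        import luigi

        class BaseReportTask(luigi.Task):
            # Shared parameters and output convention for a whole family of tasks.
            date = luigi.DateParameter()
            output_dir = luigi.Parameter(default="reports")

            def output(self):
                # task_family is the task's name, so each subclass gets its own file.
                name = "{}_{}.csv".format(self.task_family, self.date)
                return luigi.LocalTarget("{}/{}".format(self.output_dir, name))

        class SalesReportTask(BaseReportTask):
            def run(self):
                with self.output().open("w") as out:
                    out.write("col_a,col_b\n")  # placeholder for the real report logic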

  26. Conclusion
    Luigi is a mature, batteries-included alternative for building data pipelines
    Lacks powerful visualization of the pipelines
    Requires an external way of launching jobs (e.g. cron)
    Hard to debug MR jobs

  27. Learn More
    https://github.com/spotify/luigi
    http://luigi.readthedocs.io/en/stable/

  28. Credits
    • pipe icon by Oliviu Stoian from the Noun Project
    • Photo Credit: (CC) https://www.flickr.com/photos/47244853@N03/29988510886 from hb.s via Compfight
    • Concrete Mixer: (CC) https://www.flickr.com/photos/145708285@N03/30138453986 by MasLabor via Compfight
