PyConDE 2016 - Building Data Pipelines with Python

Miguel Cabrera

October 31, 2016

Transcript

  1. Building Data Pipelines with Python
     Miguel Cabrera, Data Engineer @ TY
     @mfcabrera
     [email protected]
     PyCon Deutschland 30.10.2016
  2. ETL

  3. ETL
     • Extract data from a data source
     • Transform the data
     • Load into a sink
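
The three ETL steps above can be sketched in plain Python. This is a minimal illustration using only the standard library, not code from the talk: the in-memory CSV, the per-name totals, and the function names `extract`, `transform`, and `load` are all assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical input: a small CSV held in memory so the sketch is self-contained.
RAW_CSV = "name,amount\nalice,10\nbob,5\nalice,7\n"


def extract(source):
    """Extract: read rows from the data source (a CSV file-like object)."""
    return list(csv.DictReader(source))


def transform(rows):
    """Transform: sum the amounts per name."""
    totals = {}
    for row in rows:
        totals[row["name"]] = totals.get(row["name"], 0) + int(row["amount"])
    return totals


def load(totals, sink):
    """Load: write the result into the sink as JSON."""
    json.dump(totals, sink, sort_keys=True)


sink = io.StringIO()
load(transform(extract(io.StringIO(RAW_CSV))), sink)
print(sink.getvalue())  # {"alice": 17, "bob": 5}
```

Keeping each step a separate function makes it easy to swap the source or the sink without touching the transformation.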
  4. 1

  5. 2

  6. 3

  7. 4

  8. Parameters
     • Used to identify the task
     • From arguments or from configuration
     • Many types of Parameters (int, date, boolean, date range, time delta, dict, enum)
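
The idea that parameter values identify a task can be sketched without Luigi itself. The following is a conceptual stand-in built on a frozen dataclass, not Luigi's API (Luigi provides its own parameter classes such as `luigi.IntParameter` and `luigi.DateParameter`). The `DailyReport` task, its fields, and the `from_config` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DailyReport:
    """Hypothetical task whose parameter values make up its identity."""

    day: date
    top_n: int = 10  # default used when no value is supplied

    @property
    def task_id(self):
        # A readable identifier derived only from the class name and parameters.
        return f"{type(self).__name__}(day={self.day.isoformat()}, top_n={self.top_n})"

    @classmethod
    def from_config(cls, config):
        """Build the task from string configuration values instead of arguments."""
        return cls(
            day=date.fromisoformat(config["day"]),
            top_n=int(config.get("top_n", 10)),
        )


a = DailyReport(day=date(2016, 10, 30))             # parameters from arguments
b = DailyReport.from_config({"day": "2016-10-30"})  # parameters from configuration
c = DailyReport(day=date(2016, 10, 31))

print(a.task_id)       # DailyReport(day=2016-10-30, top_n=10)
print(a == b, a == c)  # True False
```

Two tasks built with the same parameter values compare equal, so a scheduler could treat them as the same unit of work regardless of where the values came from.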
  9. Targets
     • Resources produced by a Task
     • Typically local files or files on a distributed file system (HDFS)
     • Must implement the method exists()
     • Many targets available
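
The role of exists() can be illustrated with a small stand-in. This sketch does not subclass Luigi's own `luigi.Target`; `FileTarget`, `run_if_needed`, and `write_greeting` are hypothetical names, and the skip-if-present logic is a simplified assumption about how a scheduler might use the method.

```python
import os
import tempfile


class FileTarget:
    """Hypothetical target wrapping a local file path."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        # The one method a target must provide: is the resource already there?
        return os.path.exists(self.path)


def run_if_needed(target, produce):
    """Run `produce` only when the target is missing; return True if it ran."""
    if target.exists():
        return False
    produce(target)
    return True


def write_greeting(target):
    """Produce the resource by writing a small file at the target's path."""
    with open(target.path, "w") as handle:
        handle.write("hello\n")


with tempfile.TemporaryDirectory() as workdir:
    target = FileTarget(os.path.join(workdir, "greeting.txt"))
    first = run_if_needed(target, write_greeting)   # target missing, so work runs
    second = run_if_needed(target, write_greeting)  # target present, so work is skipped
    print(first, second)  # True False
```

Because completion is decided by the target alone, re-running a pipeline only redoes the work whose output is missing.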
  10. Batteries Included
     • Package contrib filled with goodies
     • Good support for Hadoop
     • Different Targets
     • Extensible
  11. Task Types
     • Task - Local
     • Hadoop MR, Pig, Spark, etc.
     • SalesForce, Elasticsearch, etc.
     • ExternalProgram
     • check luigi.contrib!
  12. Target
     • LocalTarget
     • HDFS, S3, FTP, SSH, WebHDFS, etc.
     • ESTarget, MySQLTarget, MSSQL, Hive, SQLAlchemy, etc.
  13. DRY

  14. Conclusion
     • Luigi is a mature, batteries-included alternative for building data pipelines
     • Lacks powerful visualization of the pipelines
     • Requires an external way of launching jobs (e.g. cron)
     • Hard to debug MR jobs
  15. Credits
     • Pipe icon by Oliviu Stoian from the Noun Project
     • Photo Credit: (CC) https://www.flickr.com/photos/47244853@N03/29988510886 from hb.s via Compfight
     • Concrete Mixer: (CC) https://www.flickr.com/photos/145708285@N03/30138453986 by MasLabor via Compfight