
PyConDE 2016 - Building Data Pipelines with Python



Miguel Cabrera

October 31, 2016



  1. Building Data Pipelines with Python
     Miguel Cabrera, Data Engineer @ TY, mfcabrera@gmail.com
     PyCon Deutschland, 30.10.2016
  2. Agenda

  3. Agenda: Context, Data Pipelines with Luigi, Tips and Tricks, Examples
  4. Data Processing Pipelines

  5. cat file.txt | wc -l | mail -s "hello" me@mail.org
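
The shell one-liner on slide 5 is already a small pipeline: each command consumes the output of the previous one. A rough Python equivalent might look like the sketch below. The function names are illustrative, the sample file contents are made up, and the message is only built, not sent, since the slide does not say which mail server would deliver it.

```python
from email.message import EmailMessage
from pathlib import Path
import tempfile


def count_lines(path):
    """Rough equivalent of `cat file.txt | wc -l`: count the lines in a file."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def build_mail(body, subject, recipient):
    """Rough equivalent of `mail -s SUBJECT RECIPIENT`, minus the delivery."""
    message = EmailMessage()
    message["Subject"] = subject
    message["To"] = recipient
    message.set_content(str(body))
    return message


with tempfile.TemporaryDirectory() as workdir:
    sample = Path(workdir) / "file.txt"  # stand-in for the slide's file.txt
    sample.write_text("first line\nsecond line\nthird line\n")
    # Chain the two steps: the output of one is the input of the next.
    message = build_mail(count_lines(sample), "hello", "me@mail.org")
```

Each step is a plain function, so the steps can be tested separately and recombined, which is the property the rest of the talk builds on.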
  6. ETL

  7. ETL
     • Extract data from a data source
     • Transform the data
     • Load into a sink
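
The three ETL stages can be sketched with the standard library alone. This is a minimal illustration under assumptions: the CSV content, the `payments` table, and the function names are invented for the example, with an in-memory CSV as the source and an in-memory SQLite database as the sink.

```python
import csv
import io
import sqlite3

# Made-up sample data; the second row is deliberately malformed.
RAW_CSV = "name,amount\nalice,10\nbob,not-a-number\ncarol,32\n"


def extract(source):
    """Extract: read rows from a data source (here an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(source)))


def transform(rows):
    """Transform: capitalise names and drop rows whose amount is not an integer."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((row["name"].title(), int(row["amount"])))
        except ValueError:
            continue  # skip malformed records
    return cleaned


def load(records, connection):
    """Load: write the cleaned records into a sink (here a SQLite table)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments (name TEXT, amount INTEGER)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
total = connection.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
row_count = connection.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```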
  8. None
  9. Diagram labels: Feature Extraction, Parameter Estimation, Model Training, Feature, Model, Predict, Visualize/Format
  10. Steps in different technologies

  11. Steps can be run in parallel

  12. Steps have complex dependencies among them
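
One way to picture dependencies and parallelism together is a topological sort of the step graph. The sketch below uses the standard-library `graphlib` module (Python 3.9+). The step names are borrowed from the diagram on slide 9, but the edges between them are assumptions made for illustration, not taken from the talk.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps it depends on (the edges are assumed).
dependencies = {
    "feature_extraction": set(),
    "parameter_estimation": {"feature_extraction"},
    "model_training": {"parameter_estimation"},
    "predict": {"model_training", "feature_extraction"},
    "visualize": {"predict"},
    "format": {"predict"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # also raises CycleError if the graph has a cycle

batches = []
while sorter.is_active():
    # Steps whose dependencies have all been marked done.
    ready = sorted(sorter.get_ready())
    # Steps in the same batch do not depend on each other,
    # so they could be handed to parallel workers.
    batches.append(ready)
    sorter.done(*ready)
```

With these assumed edges, `visualize` and `format` end up in the same batch, which is the kind of parallelism a workflow scheduler can exploit.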

  13. Workflows
     • Repeat
     • Parametrize
     • Resume
     • Schedule it
  14. None
  15. None
  16. “A Python framework for data flow definition and execution” Luigi

  17. Concepts

  18. Concepts: Tasks, Parameters, Targets, Scheduler & Workers

  19. Tasks

  20. None
  21. 1

  22. 2

  23. 3

  24. 4

  25. file.txt -> WordCountTask -> wc.txt

  26. file.txt -> WordCountTask -> wc.txt -> ToJsonTask -> wc.json

  27. None
  28. Parameters

  29. None
  30. Parameters
     Used to identify the task
     From arguments or from configuration
     Many types of Parameters (int, date, boolean, date range, time delta, dict, enum)
  31. Targets

  32. Targets
     Resources produced by a Task
     Typically local files or files on a distributed file system (HDFS)
     Must implement the method exists()
     Many targets available
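
Why `exists()` matters can be shown without Luigi at all. The sketch below is a library-free illustration of the idea, not Luigi's implementation: `FileTarget`, `run_if_missing`, and `write_report` are invented names. The point is that a step whose target already exists can be skipped, which is what makes a pipeline resumable.

```python
import os
import tempfile


class FileTarget:
    """Illustrative target: the resource is a file, and exists() checks for it."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


def run_if_missing(target, produce):
    """Run `produce` only when the target is absent; return True if it ran."""
    if target.exists():
        return False
    produce(target.path)
    return True


def write_report(path):
    with open(path, "w") as handle:
        handle.write("done\n")


workdir = tempfile.mkdtemp()
target = FileTarget(os.path.join(workdir, "report.txt"))
first_run = run_if_missing(target, write_report)   # target missing, so it runs
second_run = run_if_missing(target, write_report)  # target present, so it is skipped
```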
  33. None
  34. Scheduler & Workers

  35. None
  36. Source: http://www.arashrouhani.com/luigid-basics-jun-2015

  37. Batteries Included

  38. Batteries Included
     Package contrib filled with goodies
     Good support for Hadoop
     Different Targets
     Extensible
  39. Task Types
     Task - Local
     Hadoop MR, Pig, Spark, etc.
     SalesForce, Elasticsearch, etc.
     ExternalProgram
     Check luigi.contrib!
  40. Target
     LocalTarget
     HDFS, S3, FTP, SSH, WebHDFS, etc.
     ESTarget, MySQLTarget, MSSQL, Hive, SQLAlchemy, etc.
  41. None
  42. Tips & Tricks

  43. Separate pipeline and logic

  44. Extend to avoid boilerplate code

  45. DRY

  46. Conclusion
     Luigi is a mature, batteries-included alternative for building data pipelines
     Lacks powerful visualization of the pipelines
     Requires an external way of launching jobs (e.g. cron)
     Hard to debug MR jobs
  47. Learn More
     https://github.com/spotify/luigi
     http://luigi.readthedocs.io/en/stable/

  48. Thanks!

  49. Credits
     • Pipe icon by Oliviu Stoian from the Noun Project
     • Photo Credit: (CC) https://www.flickr.com/photos/47244853@N03/29988510886 from hb.s via Compfight
     • Concrete Mixer: (CC) https://www.flickr.com/photos/145708285@N03/30138453986 by MasLabor via Compfight