PyConDE 2016 - Building Data Pipelines with Python

Miguel Cabrera

October 31, 2016

Transcript

  1. Building Data Pipelines with Python
    Miguel Cabrera
    Data Engineer @ TY
    @mfcabrera
    [email protected]
    PyCon Deutschland 30.10.2016

  2. Agenda
    Context
    Data Pipelines with Luigi
    Tips and Tricks
    Examples

  3. Data Processing Pipelines

  4. cat file.txt | wc -l | mail -s "hello" [email protected]

  5. ETL
    • Extract data from a data source
    • Transform the data
    • Load into a sink
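
    A minimal sketch of the ETL idea in plain Python (the file names and the CSV-to-JSON shape are assumptions for illustration, not from the talk):

        import csv
        import json

        def extract(path):
            # Extract: read rows from a CSV data source.
            with open(path, newline="") as f:
                return list(csv.DictReader(f))

        def transform(rows):
            # Transform: keep and normalize only the fields we care about.
            return [{"word": r["word"], "count": int(r["count"])} for r in rows]

        def load(records, path):
            # Load: write the result into a sink, here a JSON file.
            with open(path, "w") as f:
                json.dump(records, f, indent=2)

        if __name__ == "__main__":
            load(transform(extract("input.csv")), "output.json")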

  6. [Pipeline diagram] Feature Extraction → Parameter Estimation → Model Training;
    Feature Extraction → Model Predict → Visualize/Format

  7. Steps in different technologies

  8. Steps can be run in parallel

  9. Steps have complex dependencies among them

  10. Workflows
    • Repeat
    • Parametrize
    • Resume
    • Schedule it

  11. “A Python framework for data flow definition and execution”
    Luigi

  12. Concepts
    Tasks
    Parameters
    Targets
    Scheduler & Workers

  13. [Diagram] file.txt → WordCountTask → wc.txt

  14. [Diagram] file.txt → WordCountTask → wc.txt → ToJsonTask → wc.json
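
    A minimal sketch of how the two tasks in these diagrams could look in Luigi, showing the four concepts from slide 12 (Task, Parameter, Target, and the dependency the scheduler resolves); the parameter name and the exact file contents are assumptions:

        import json
        import luigi

        class WordCountTask(luigi.Task):
            # Parameter identifying this task instance (name assumed).
            input_path = luigi.Parameter(default="file.txt")

            def output(self):
                # Target: the file this task produces.
                return luigi.LocalTarget("wc.txt")

            def run(self):
                with open(self.input_path) as infile, self.output().open("w") as outfile:
                    outfile.write(str(len(infile.read().split())))

        class ToJsonTask(luigi.Task):
            input_path = luigi.Parameter(default="file.txt")

            def requires(self):
                # Dependency: WordCountTask runs first; its output becomes our input.
                return WordCountTask(input_path=self.input_path)

            def output(self):
                return luigi.LocalTarget("wc.json")

            def run(self):
                with self.input().open() as infile, self.output().open("w") as outfile:
                    json.dump({"word_count": int(infile.read().strip())}, outfile)

        if __name__ == "__main__":
            # Run locally without the central scheduler.
            luigi.build([ToJsonTask()], local_scheduler=True)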

  15. Parameters
    Used to identify the task
    From arguments or from configuration
    Many types of Parameters (int, date, boolean, date range, time delta, dict, enum)
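
    A short illustration of a few of these parameter types (the task and its parameters are invented for the example):

        import datetime
        import luigi

        class ReportTask(luigi.Task):
            # Hypothetical task showing several built-in parameter types.
            date = luigi.DateParameter(default=datetime.date.today())
            top_n = luigi.IntParameter(default=10)
            dry_run = luigi.BoolParameter(default=False)
            weights = luigi.DictParameter(default={})

    Together, the parameter values identify the task instance; they can be passed as command-line flags (e.g. --date 2016-10-30 --top-n 5) or supplied via the configuration file.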

  16. Targets
    Resources produced by a Task
    Typically local files or files on a distributed file system (HDFS)
    Must implement the method exists()
    Many targets available
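
    As a rough sketch of that contract, a custom target only has to answer whether its resource is already there (this marker-file target is invented for illustration):

        import os
        import luigi

        class MarkerFileTarget(luigi.Target):
            # Hypothetical target: a step counts as done once its marker file exists.
            def __init__(self, path):
                self.path = path

            def exists(self):
                # The one method every Luigi target must implement.
                return os.path.exists(self.path)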

  17. Scheduler & Workers

  18. Source: http://www.arashrouhani.com/luigid-basics-jun-2015
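
    For local experiments the whole thing can also be driven from Python without the central scheduler daemon (luigid); a sketch, assuming the ToJsonTask from the earlier example lives in a module called wordcount_tasks (a made-up name):

        import luigi

        # Hypothetical module holding the WordCountTask/ToJsonTask sketch above.
        from wordcount_tasks import ToJsonTask

        # Resolve the dependency graph and run it with two worker processes.
        # With a running luigid you would drop local_scheduler=True so the
        # workers report to the central scheduler instead.
        luigi.build([ToJsonTask()], workers=2, local_scheduler=True)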

  19. Batteries Included

  20. Batteries Included
    Package contrib filled with goodies
    Good support for Hadoop
    Different Targets
    Extensible

  21. Task Types
    Task - Local
    Hadoop MR, Pig, Spark, etc.
    SalesForce, Elasticsearch, etc.
    ExternalProgram
    Check luigi.contrib!
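
    Wrapping a non-Python step, for example, is a matter of subclassing the contrib ExternalProgramTask and returning the argument list (the script and file names below are placeholders):

        import luigi
        from luigi.contrib.external_program import ExternalProgramTask

        class TrainModelTask(ExternalProgramTask):
            # Hypothetical wrapper around an external training script.
            def program_args(self):
                return ["bash", "train_model.sh", "--input", "features.csv"]

            def output(self):
                return luigi.LocalTarget("model.bin")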

  22. Target
    LocalTarget
    HDFS, S3, FTP, SSH, WebHDFS, etc.
    ESTarget, MySQLTarget, MSSQL, Hive, SQLAlchemy, etc.
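
    Switching a task from local disk to one of these stores is then mostly a change in output(); a sketch assuming the HDFS contrib target (import paths can vary slightly between Luigi versions, and the paths are made up):

        import luigi
        from luigi.contrib.hdfs import HdfsTarget

        class ExportTask(luigi.Task):
            def output(self):
                # Local variant would be: luigi.LocalTarget("export/wc.json")
                return HdfsTarget("/data/export/wc.json")

            def run(self):
                with self.output().open("w") as out:
                    out.write("{}")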

  23. Tips & Tricks

  24. Separate pipeline and logic
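
    One way to read this tip: keep the task as thin glue around inputs and outputs, and put the actual computation in a plain function that can be unit-tested without Luigi (a sketch reworking the word-count example):

        import luigi

        def count_words(text):
            # Pure logic: trivially testable without Luigi, targets, or files.
            return len(text.split())

        class WordCountTask(luigi.Task):
            input_path = luigi.Parameter()

            def output(self):
                return luigi.LocalTarget("wc.txt")

            def run(self):
                # Pipeline glue only: read the input, call the logic, write the target.
                with open(self.input_path) as infile, self.output().open("w") as outfile:
                    outfile.write(str(count_words(infile.read())))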

  25. Extend to avoid boilerplate code
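
    And one way to read this one: move shared parameters and output conventions into a base class that concrete tasks extend (everything below is an invented example):

        import luigi

        class BaseReportTask(luigi.Task):
            # Shared parameters and output convention for a whole family of tasks.
            date = luigi.DateParameter()
            output_dir = luigi.Parameter(default="reports")

            def output(self):
                # task_family is the task's name, so each subclass gets its own file.
                name = "{}_{}.csv".format(self.task_family, self.date)
                return luigi.LocalTarget("{}/{}".format(self.output_dir, name))

        class SalesReportTask(BaseReportTask):
            def run(self):
                with self.output().open("w") as out:
                    out.write("col_a,col_b\n")  # placeholder for the real report logic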

  26. Conclusion
    Luigi is a mature, batteries-included alternative for building data pipelines
    Lacks powerful visualization of the pipelines
    Requires an external way of launching jobs (e.g. cron)
    Hard to debug MR jobs

  27. Learn More
    https://github.com/spotify/luigi
    http://luigi.readthedocs.io/en/stable/

  28. Credits
    • pipe icon by Oliviu Stoian from the Noun Project
    • Photo Credit: (CC) https://www.flickr.com/photos/47244853@N03/29988510886 from hb.s via Compfight
    • Concrete Mixer: (CC) https://www.flickr.com/photos/145708285@N03/30138453986 by MasLabor via Compfight
