PyConDE 2016 - Building Data Pipelines with Python

Miguel Cabrera

October 31, 2016

Transcript

  1. Building Data Pipelines with Python
    Data Engineer @ TY
    @mfcabrera
    [email protected]
    Miguel Cabrera
    PyCon Deutschland 30.10.2016

  2. Agenda

  3. Agenda
    Context
    Data Pipelines with Luigi
    Tips and Tricks
    Examples

  4. Data Processing Pipelines

  5. cat file.txt | wc -l | mail -s "hello" [email protected]

  6. ETL

  7. ETL
    • Extract data from a data source
    • Transform the data
    • Load into a sink
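
The three ETL steps can be sketched with nothing but the standard library. This is a minimal illustration, not code from the talk: the CSV layout, the `name` and `visits` columns, and the JSON-lines sink are all assumptions made for the example.

```python
import csv
import io
import json


def extract(source):
    """Extract: read rows from a CSV data source (any text file object)."""
    return list(csv.DictReader(source))


def transform(rows):
    """Transform: tidy names, cast counts to int, drop rows with no count."""
    return [
        {"name": row["name"].strip().title(), "visits": int(row["visits"])}
        for row in rows
        if row["visits"].strip()
    ]


def load(rows, sink):
    """Load: write one JSON document per line into the sink."""
    for row in rows:
        sink.write(json.dumps(row) + "\n")
    return len(rows)


if __name__ == "__main__":
    # Hypothetical input data, kept in memory so the sketch is self-contained.
    source = io.StringIO("name,visits\n alice ,3\nbob,\ncarol,5\n")
    sink = io.StringIO()
    load(transform(extract(source)), sink)
    print(sink.getvalue(), end="")
```

Keeping each step a separate function is what later makes it possible to turn every step into its own task.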

  9. Feature Extraction
    Parameter Estimation
    Model Training
    Feature Extraction
    Model Predict
    Visualize/Format

  10. Steps in different technologies

  11. Steps can be run in parallel

  12. Steps have complex dependencies among them
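
To make the last two points concrete, here is a small sketch that groups steps into "waves" that could run in parallel once their dependencies are done. It uses the standard library `graphlib` module (Python 3.9+), not Luigi. The step names echo the machine-learning diagram earlier in the deck, but the exact dependency edges are an assumption, since the transcript does not preserve the arrows.

```python
from graphlib import TopologicalSorter

# Assumed dependency graph: each key depends on the steps in its set.
DEPENDENCIES = {
    "parameter_estimation": {"feature_extraction_train"},
    "model_training": {"parameter_estimation"},
    "model_predict": {"model_training", "feature_extraction_predict"},
    "visualize": {"model_predict"},
}


def execution_waves(dependencies):
    """Return lists of steps; all steps in one list can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as finished to unlock the next one
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(DEPENDENCIES), start=1):
        print(number, wave)
```

A workflow tool does this bookkeeping for you and additionally tracks which steps have already produced their output.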

  13. Workflows
    • Repeat
    • Parametrize
    • Resume
    • Schedule it
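
"Repeat" and "Resume" both rest on one idea: a step whose output already exists does not need to run again. The sketch below shows that idea with the standard library only. The `run_step` helper is hypothetical and is not part of any framework.

```python
import os
import tempfile


def run_step(output_path, produce):
    """Run `produce` only if `output_path` is missing; return True if it ran."""
    if os.path.exists(output_path):
        return False  # already done, so a re-run resumes after this step
    # Write to a temporary file first so a crash never leaves a partial output.
    directory = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(produce())
        os.replace(tmp_path, output_path)  # atomic rename into place
    except BaseException:
        os.remove(tmp_path)
        raise
    return True
```

Parametrizing then amounts to deriving `output_path` from the parameters (for example a date), so that each parameter combination has its own output and its own "done" state.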

  16. “A Python framework for data flow definition and execution”
    Luigi

  17. Concepts

  18. Concepts
    Tasks
    Parameters
    Targets
    Scheduler & Workers

  19. Tasks

  25. file.txt → WordCountTask → wc.txt

  26. file.txt → WordCountTask → wc.txt → ToJsonTask → wc.json

  28. Parameters

  30. Parameters
    Used to identify the task
    From arguments or from configuration
    Many types of Parameters (int, date, boolean, date range, time delta, dict, enum)

  31. Targets

  32. Targets
    Resources produced by a Task
    Typically local files or files on a distributed file system (HDFS)
    Must implement the method exists()
    Many targets available

  34. Scheduler & Workers

  36. Source: www.arashrouhani.com/luigid-basics-jun-2015

  37. Batteries Included

  38. Batteries Included
    Package contrib filled with goodies
    Good support for Hadoop
    Different Targets
    Extensible

  39. Task Types
    Task - Local
    Hadoop MR, Pig, Spark, etc.
    Salesforce, Elasticsearch, etc.
    ExternalProgram
    Check luigi.contrib!

  40. Target
    LocalTarget
    HDFS, S3, FTP, SSH, WebHDFS, etc.
    ESTarget, MySQLTarget, MSSQL, Hive, SQLAlchemy, etc.

  42. Tips & Tricks

  43. Separate pipeline and logic

  44. Extend to avoid boilerplate code

  45. DRY

  46. Conclusion
    Luigi is a mature, batteries-included alternative for building data pipelines
    Lacks powerful visualization of the pipelines
    Requires an external way of launching jobs (e.g. cron)
    Hard to debug MR jobs

  47. Learn More
    https://github.com/spotify/luigi
    http://luigi.readthedocs.io/en/stable/

  48. Thanks!

  49. Credits
    • Pipe icon by Oliviu Stoian from the Noun Project
    • Photo credit: (CC) www.flickr.com/photos/[email protected]/29988510886 from hb.s via Compfight
    • Concrete Mixer: (CC) www.flickr.com/photos/[email protected]/30138453986 by MasLabor via Compfight