Machine Learning and Hadoop: Present and Future

Talk by Josh Wills, Director of Data Science at Cloudera, given at Data Science London on September 6, 2012.

Data Science London

September 07, 2012

Transcript

  1. Machine Learning and Hadoop
    Present and Future
    Josh Wills
    Cloudera Data Science Team
    September 6th, 2012

  2. About Me
    Copyright 2012 Cloudera Inc. All rights reserved

  3. Outline
    • Part 1: Industrial Machine Learning
    • Part 2: ML and Hadoop: The State of the World
    • Part 3: ML and Hadoop: Where Things are Headed

  4. (Academic) ML vs. (Academic) Statistics
    “Machine learning is statistics minus any checking of models and assumptions.”
        -- Brian Ripley, UseR! 2004 (provocatively paraphrased)

  5. Industrial Machine Learning: Truth #1
    The thing that we are trying to predict is rarely the thing that we are trying to optimize.

  6. Industrial Machine Learning: Truth #2
    Systems precede algorithms.

  7. Industrial Machine Learning: Truth #3
    Practice Over Theory Blog

  8. Implication
    Data science requires prediction-oriented machine learning models AND classical, rigorous statistical analysis.

  9. Outline
    • Part 1: Industrial Machine Learning
    • Part 2: ML and Hadoop: The State of the World
    • Part 3: ML and Hadoop: Where Things are Headed

  10. “Hadoop. It’s Where The Data Is.”

  11. Hadoop Platform: Substrate
    • Commodity servers
    • Open source operating system
    • Open source configuration management
    • Open source coordination service
    • Open source file system API
    • Open source efficient and extensible file formats
    • Open source efficient and extensible RPC libraries

  12. Hadoop Platform: MapReduce Frameworks
    • Languages/Environments
        • Pig Latin (Apache)
        • HiveQL (Apache)
        • Jaql (IBM)
    • Java/Scala APIs
        • Crunch (Apache Incubator)
        • Scoobi (NICTA)
        • Cascading (Concurrent)
        • Pangool

  13. ML and Hadoop: The State of the World

  14. MapReduce
    • Great for:
        • Data Preparation
        • Feature Engineering
        • Model Validation/Evaluation
    • Works Well For Certain Model Fitting Problems
        • Collaborative Filtering Algorithms
        • Expectation Maximization
        • Decision Trees (PLANET; Gradient Boosted Decision Trees)
    • Not A Practical Option for Many Kinds of Problems
    • Way More Detail in the KDD 2011 Talk
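
The "works well for certain model fitting problems" point can be made concrete with a toy example. The sketch below is not from the talk: it uses k-means (a hard-assignment relative of Expectation Maximization, chosen here as an assumption for illustration) and plain Python generators in place of a real Hadoop job. The names `map_phase`, `reduce_phase`, and `kmeans` are hypothetical.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit (centroid index, point) for every input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each centroid index."""
    groups = defaultdict(list)
    for key, point in pairs:  # stands in for the shuffle/sort step
        groups[key].append(point)
    new_centroids = list(centroids)  # a centroid with no points is kept as-is
    for key, members in groups.items():
        new_centroids[key] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of map + reduce rounds."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
result = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(result)
```

On a real cluster, each pass through the loop in `kmeans` would be a separate MapReduce job that rereads the input, which is one reason iterative fitting is a less natural match for MapReduce than one-pass data preparation.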

  15. Apache Mahout
    • The starting place for MapReduce-based machine learning algorithms
    • Not machine-learning-in-a-box
        • Custom tweaks/modifications are the rule
    • A disparate collection of algorithms for:
        • Recommendations
        • Clustering
        • Classification
        • Frequent Itemset Mining

  16. Apache Mahout (cont.)
    • Best Library: Taste Recommender
        • Oldest project, most widely deployed in production
        • SVD implementation is particularly active
    • Good Libraries: Online SGD
        • Does not use MapReduce
        • Vowpal Wabbit is faster, has L-BFGS option
    • Roll Your Own Instead: Naïve Bayes
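
For readers unfamiliar with online SGD, here is a minimal sketch of the idea: the model is updated after every single example, so no MapReduce-style batch pass is needed. This is plain Python written for illustration, not Mahout's or Vowpal Wabbit's implementation; the function names, the learning rate, and the tiny dataset are all assumptions.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """Predicted probability that the label is 1."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def sgd_logistic(examples, n_features, learning_rate=0.5, epochs=200):
    """Logistic regression trained one example at a time."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for features, label in examples:
            # Gradient step for the log-loss on this one example only.
            error = label - predict(weights, features)
            for i, x in enumerate(features):
                weights[i] += learning_rate * error * x
    return weights


# Each example is ((bias term, feature value), label); the data is made up.
examples = [((1.0, 0.0), 0), ((1.0, 1.0), 0), ((1.0, 3.0), 1), ((1.0, 4.0), 1)]
weights = sgd_logistic(examples, n_features=2)
print(predict(weights, (1.0, 0.0)), predict(weights, (1.0, 4.0)))
```

Because each update touches only one example and the current weight vector, the same loop can consume a stream of data on a single machine.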

  17. The Ominous Challenges

  18. 1. The Secret Sauce Effect

  19. 2. Delta Between Mahout and the Cutting Edge

  20. ML and Hadoop: Where Things are Headed

  21. Moving Beyond MapReduce

  22. The Contenders

  23. AllReduce
    • Developed at Yahoo! Research
    • Defines the allreduce operation
        • N machines each have a number => each machine has the sum of the numbers
    • At the heart of Vowpal Wabbit’s performance
    • Implemented in C++
    • Can be patched into Apache Hadoop and used today
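
The allreduce contract described on this slide (every machine contributes a number, every machine ends up with the sum) can be simulated in a few lines. The sketch below runs in one process and models the machines as list positions arranged in a binary tree; the tree layout and the function name `allreduce` are assumptions made for illustration, not a description of the actual C++ implementation.

```python
def allreduce(values):
    """Simulate allreduce: values[i] is machine i's number.

    Returns a list in which every machine holds the global sum.
    """
    n = len(values)
    if n == 0:
        return []
    partial = list(values)
    # Reduce phase: walk from the leaves toward the root (machine 0).
    # Machine i sends its partial sum to its parent (i - 1) // 2.
    for node in range(n - 1, 0, -1):
        parent = (node - 1) // 2
        partial[parent] += partial[node]
    # Broadcast phase: each machine copies the total from its parent.
    result = [None] * n
    result[0] = partial[0]
    for node in range(1, n):
        result[node] = result[(node - 1) // 2]
    return result


print(allreduce([1, 2, 3, 4]))
```

In a gradient-based learner the "number" would typically be a gradient vector, so after one allreduce every machine can apply the same update without a separate job.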

  24. Spark
    • Developed at Berkeley’s AMP Lab
    • Defines operations on distributed in-memory collections
    • Written in Scala
    • Supports reading from and writing to HDFS
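
To give a feel for "operations on distributed in-memory collections", here is a toy, single-process stand-in written in Python. It is not Spark's API: the class `InMemoryCollection` and its methods are hypothetical, and the partitions are ordinary lists rather than data spread across machines. The point it tries to show is that an iterative algorithm can reuse the same in-memory data on every pass.

```python
from functools import reduce as fold


class InMemoryCollection:
    """Toy partitioned collection held in memory (hypothetical, not Spark)."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    def map(self, fn):
        """Apply fn to every element, partition by partition."""
        return InMemoryCollection([[fn(x) for x in part] for part in self.partitions])

    def filter(self, predicate):
        """Keep only the elements for which predicate is true."""
        return InMemoryCollection(
            [[x for x in part if predicate(x)] for part in self.partitions]
        )

    def reduce(self, fn):
        """Combine within each partition, then across partitions.

        Raises TypeError if the collection has no elements.
        """
        partials = [fold(fn, part) for part in self.partitions if part]
        return fold(fn, partials)


data = InMemoryCollection([[1.0, 2.0], [3.0, 4.0], [5.0]])
count = data.map(lambda x: 1).reduce(lambda a, b: a + b)

# Gradient descent toward the mean, reusing `data` on every iteration.
estimate = 0.0
for _ in range(50):
    gradient = data.map(lambda x: estimate - x).reduce(lambda a, b: a + b) / count
    estimate -= 0.5 * gradient
print(count, estimate)
```

With MapReduce, each of those 50 passes would reread the input from disk; keeping the collection in memory avoids that cost.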

  25. GraphLab
    • Developed at CMU
    • Lower-level primitives
        • (but higher than MPI)
    • Map/Reduce => Update/Sync
    • Flexible, allows for asynchronous computations
    • Reads from HDFS
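
The asynchronous, vertex-centric style can be sketched with a toy PageRank. The code below is a single-threaded Python illustration, not GraphLab's API: a per-vertex update function reads the current values of its neighbours and reschedules downstream vertices only when its own value changes, and a separate global aggregation step (GraphLab calls this a sync) computes a summary over all vertices. The function names, the damping factor, and the graph are assumptions.

```python
from collections import deque


def pagerank_async(out_links, damping=0.85, tolerance=1e-6):
    """Asynchronous PageRank; every link target must be a key of out_links."""
    vertices = list(out_links)
    in_links = {v: [] for v in vertices}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)

    rank = {v: 1.0 for v in vertices}
    queue = deque(vertices)  # scheduler: vertices waiting for an update
    queued = set(vertices)
    while queue:
        v = queue.popleft()
        queued.discard(v)
        # Update: recompute v from the latest values of its in-neighbours.
        new_rank = (1 - damping) + damping * sum(
            rank[u] / len(out_links[u]) for u in in_links[v]
        )
        changed = abs(new_rank - rank[v]) > tolerance
        rank[v] = new_rank
        if changed:
            # Only vertices that depend on v need to run again.
            for w in out_links[v]:
                if w not in queued:
                    queue.append(w)
                    queued.add(w)
    return rank


def sync_total(rank):
    """Sync: a global aggregate over all vertex values."""
    return sum(rank.values())


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank_async(graph)
print(rank, sync_total(rank))
```

Unlike a MapReduce round, there is no barrier between iterations here: a vertex is recomputed as soon as it is scheduled, using whatever neighbour values are current.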

  26. How Things Measure Up

  27. Speed vs. Reliability

  28. Memory vs. Disk

  29. C++ vs. JVM

  30. Questions?
    (Ask Anything. Anything At All.)
    [email protected]