Machine Learning and Hadoop: Present and Future

Talk by Josh Wills, Data Science Director at Cloudera, at Data Science London on 06/09/12

Data Science London

September 07, 2012

Transcript

  1. Machine Learning and Hadoop: Present and Future
     Josh Wills, Cloudera Data Science Team
     September 6th, 2012
  2. Outline
     • Part 1: Industrial Machine Learning
     • Part 2: ML and Hadoop: The State of the World
     • Part 3: ML and Hadoop: Where Things are Headed
     Copyright 2012 Cloudera Inc. All rights reserved
  3. (Academic) ML vs. (Academic) Statistics
     “Machine learning is statistics minus any checking of models and assumptions.”
     -- Brian Ripley, UseR! 2004 (provocatively paraphrased)
  4. Industrial Machine Learning: Truth #1
     The thing that we are trying to predict is rarely the thing that we are trying to optimize.
  5. Industrial Machine Learning: Truth #2
     Systems precede algorithms.
  6. Industrial Machine Learning: Truth #3
     Practice Over Theory Blog
  7. Implication
     Data science requires prediction-oriented machine learning models AND classical, rigorous statistical analysis.
  8. Outline
     • Part 1: Industrial Machine Learning
     • Part 2: ML and Hadoop: The State of the World
     • Part 3: ML and Hadoop: Where Things are Headed
  9. Hadoop Platform: Substrate
     • Commodity servers
     • Open source operating system
     • Open source configuration management
     • Open source coordination service
     • Open source file system API
     • Open source efficient and extensible file formats
     • Open source efficient and extensible RPC libraries
  10. Hadoop Platform: MapReduce Frameworks
     • Languages/Environments
        • Pig Latin (Apache)
        • HiveQL (Apache)
        • Jaql (IBM)
     • Java/Scala APIs
        • Crunch (Apache Incubator)
        • Scoobi (NICTA)
        • Cascading (Concurrent)
        • Pangool
  11. ML and Hadoop: The State of the World
  12. MapReduce
     • Great for:
        • Data Preparation
        • Feature Engineering
        • Model Validation/Evaluation
     • Works Well For Certain Model Fitting Problems
        • Collaborative Filtering Algorithms
        • Expectation Maximization
        • Decision Trees (PLANET; Gradient Boosted Decision Trees)
     • Not A Practical Option for Many Kinds of Problems
     • Way More Detail in the KDD 2011 Talk
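
To make the "great for data preparation and feature engineering" point on slide 12 concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps. It is not Hadoop code: the record layout `(user_id, clicked)` and the names `click_mapper`, `ctr_reducer`, and `run_job` are hypothetical, chosen only to show how a per-user click-through-rate feature falls out of the pattern.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def click_mapper(record):
    # Hypothetical record layout: (user_id, clicked).
    user_id, clicked = record
    # Emit (impressions, clicks) so the reducer can sum both counts.
    yield user_id, (1, int(clicked))


def ctr_reducer(user_id, values):
    impressions = sum(v[0] for v in values)
    clicks = sum(v[1] for v in values)
    return clicks / impressions


def run_job(records):
    """Compute a per-user click-through-rate feature."""
    pairs = map_phase(records, click_mapper)
    return reduce_phase(shuffle(pairs), ctr_reducer)
```

Calling `run_job` on a list of `(user_id, clicked)` tuples returns a dict mapping each user to a click-through rate. Each key is reduced independently of the others, which is why this kind of aggregation parallelizes well.
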
  13. Apache Mahout
     • The starting place for MapReduce-based machine learning algorithms
     • Not machine-learning-in-a-box
     • Custom tweaks/modifications are the rule
     • A disparate collection of algorithms for:
        • Recommendations
        • Clustering
        • Classification
        • Frequent Itemset Mining
  14. Apache Mahout (cont.)
     • Best Library: Taste Recommender
        • Oldest project, most widely-deployed in production
        • SVD implementation is particularly active
     • Good Libraries: Online SGD
        • Does not use MapReduce
        • Vowpal Wabbit is faster, has L-BFGS option
     • Roll Your Own Instead: Naïve Bayes
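
Slide 14 mentions online SGD, which runs as a sequential pass over examples and not as a MapReduce job. The following is a minimal sketch of online stochastic gradient descent for logistic regression, written from the textbook update rule. It is not Mahout's or Vowpal Wabbit's implementation, and the names `sgd_update`, `train`, and `predict`, the sparse-dict feature layout, and the default learning rate are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, features):
    """Probability of label 1; features is a dict of name -> value."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(z)


def sgd_update(weights, features, label, learning_rate=0.1):
    """Update the weights in place from a single (features, label) example."""
    error = predict(weights, features) - label  # gradient of log loss w.r.t. z
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) - learning_rate * error * value
    return weights


def train(stream, learning_rate=0.1):
    """Consume examples one at a time; memory use depends only on the features seen."""
    weights = {}
    for features, label in stream:
        sgd_update(weights, features, label, learning_rate)
    return weights
```

Each example is seen once and then discarded, so the learner can consume a stream without holding the dataset in memory.
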
  15. 2. Delta Between Mahout and the Cutting Edge
  16. ML and Hadoop: Where Things are Headed
  17. AllReduce
     • Developed at Yahoo! Research
     • Defines the allreduce operation
        • N machines each have a number => each machine has the sum of the numbers
     • At the heart of Vowpal Wabbit’s performance
     • Implemented in C++
     • Can be patched into Apache Hadoop and used today
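
The allreduce contract on slide 17 (N machines each start with a number, and each ends with the sum) can be illustrated with a single-process simulation. This is only a sketch, not the C++ implementation the slide refers to: the function name `allreduce_sum` and the binary-tree layout of the nodes are assumptions, and the list positions stand in for machines.

```python
def allreduce_sum(values):
    """Simulate allreduce over len(values) nodes arranged as a binary tree.

    Node i has children 2*i + 1 and 2*i + 2. The reduce phase sums partial
    results up to the root (node 0); the broadcast phase hands the total
    back to every node.
    """
    n = len(values)
    partial = list(values)

    # Reduce phase: walk from the highest index toward the root. Children
    # always have larger indices than their parent, so by the time node i is
    # visited its own subtree sum is complete.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] += partial[i]

    # Broadcast phase: every node receives the root's total.
    total = partial[0] if n else 0
    return [total] * n
```

In a real cluster the two phases are network messages between machines, not list updates, but the result is the same: every node ends up holding the global sum, for example a summed gradient.
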
  18. Spark
     • Developed at Berkeley’s AMP Lab
     • Defines operations on distributed in-memory collections
     • Written in Scala
     • Supports reading from and writing to HDFS
  19. GraphLab
     • Developed at CMU
     • Lower-level primitives
        • (but higher than MPI)
     • Map/Reduce => Update/Sort
     • Flexible, allows for asynchronous computations
     • Reads from HDFS