Super-Fast Clustering Report in MapR

Presentation by Ted Dunning, core Mahout committer, architect at MapR, and author of Mahout in Action, at Data Science London, 23/05/12.

Data Science London

July 03, 2012

Transcript

  1. ©MapR Technologies - Confidential

     - Contact:
       - [email protected]
       - @ted_dunning
     - Twitter for this talk:
       - #mapr_uk
     - Slides and such:
       - http://info.mapr.com/ted-uk-05-2012
  2. Company Background

     - MapR provides the industry's best Hadoop distribution
       - Combines the best of the Hadoop community contributions with significant internally financed infrastructure development
     - Background of team
       - Deep management bench with extensive analytic, storage, virtualization, and open source experience
       - Google, EMC, Cisco, VMware, Network Appliance, IBM, Microsoft, Apache Foundation, Aster Data, Brio, ParAccel
     - Proven
       - MapR used across industries (Financial Services, Media, Telecom, Health Care, Internet Services, Government)
       - Strategic OEM relationship with EMC and Cisco
       - Over 1,000 installs
  3. We Also Do …

     - Open source development
       - Zookeeper
       - Hadoop
       - Mahout
       - Stuff
     - Partner workshops
       - Machine learning
       - Information architecture
       - Cluster design
  5. The Problem

     - A certain bank
       - had lots of customers
       - had lots of prospective customers
       - had a non-trivial number of fraudulent customers
       - had a non-trivial number of fraudulent merchants
     - They also
       - collected data
       - built models
       - collected more data
       - built more models
  6. But …

     - These models were arduous to build
     - And hard to test
     - So people suggested something simpler
     - Like k-nearest neighbor
  7. What's that?

     - Find the k nearest training examples
     - Use the average value of the target variable from them
     - This is easy … but hard
       - easy because it is so conceptually simple and you don't have knobs to turn or models to build
       - hard because of the stunning amount of math
       - also hard because we need the top 50,000 results
     - Initial prototype was far too slow
       - 3K queries x 200K examples takes hours
       - needed 20M x 25M in the same time
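As a rough illustration of the two steps listed on this slide, the sketch below does brute-force k-nearest-neighbor prediction in Python: it measures the distance from a query to every training example, keeps the k closest, and averages their target values. The function name `knn_predict`, the Euclidean metric, and the toy data are assumptions made for the example; the talk does not say which distance measure or data layout was used.

```python
import heapq
import math


def knn_predict(query, examples, targets, k):
    """Average the target values of the k training examples nearest to query."""
    # Distance from the query to every training example (the expensive part).
    scored = ((math.dist(query, x), y) for x, y in zip(examples, targets))
    # Keep only the k smallest distances without sorting the full list.
    nearest = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
    return sum(y for _, y in nearest) / len(nearest)


# Toy data (hypothetical): three points near the origin and one far away.
examples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
targets = [1.0, 2.0, 3.0, 100.0]
prediction = knn_predict((0.1, 0.1), examples, targets, k=3)
print(prediction)
```

Every query touches every example, so the cost grows with the product of the two counts: 3K x 200K is 6 x 10^8 distance computations, while 20M x 25M is 5 x 10^14, roughly 800,000 times more. That gap is why a plain scan like this one cannot meet the target.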
  8. What We Did

     - Mechanism for extending Mahout Vectors
       - DelegatingVector, WeightedVector, Centroid
     - Searcher interface
       - ProjectionSearch, KmeansSearch, LshSearch, Brute
     - Super-fast clustering
       - Kmeans, StreamingKmeans
  9. Projection Search

     [Figure: plot with axes labeled "X Axis" and "Y Axis"]
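The slide shows only a plot, so what follows is a guess at the general idea and not the Mahout implementation. One common way to search by projection is to project every point onto a few random directions, keep each projection sorted, and, for a query, examine only the points that land near it in those sorted orders before ranking them by true distance. The class name `ProjectionSearchSketch` and the parameters `num_projections` and `window` are hypothetical.

```python
import bisect
import math
import random


class ProjectionSearchSketch:
    """Approximate nearest-neighbor search via random 1-D projections."""

    def __init__(self, points, num_projections=4, window=8, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        self.window = window
        # Random directions; each yields one sorted list of (projection, index).
        self.directions = [
            [rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_projections)
        ]
        self.sorted_projections = [
            sorted((self._dot(d, p), i) for i, p in enumerate(points))
            for d in self.directions
        ]

    @staticmethod
    def _dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def search(self, query, k):
        """Return indices of up to k candidate points, closest first."""
        candidates = set()
        for d, proj in zip(self.directions, self.sorted_projections):
            # Find where the query falls in this projection and take the
            # points within `window` positions on either side.
            pos = bisect.bisect_left(proj, (self._dot(d, query), -1))
            lo = max(0, pos - self.window)
            hi = min(len(proj), pos + self.window)
            candidates.update(i for _, i in proj[lo:hi])
        # Only the candidates get a true distance computation.
        ranked = sorted(candidates, key=lambda i: math.dist(query, self.points[i]))
        return ranked[:k]


# Toy usage (hypothetical data): 1,000 random 2-D points.
rng = random.Random(42)
points = [(rng.uniform(-3, 3), rng.uniform(-2, 2)) for _ in range(1000)]
searcher = ProjectionSearchSketch(points)
print(searcher.search((0.0, 0.0), k=5))
```

The result is approximate: a true neighbor is missed if it falls outside the window in every projection. Widening `window` or adding projections trades speed for recall.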
  10. K-means Search

      [Figure: plot with axes labeled "X Axis" and "Y Axis"]
  11. But These Require k-means!

      - Need a new k-means algorithm to get speed
      - Streaming k-means is
        - One pass (through the original data)
        - Very fast (20 μs per data point with threads)
        - Very parallelizable
  12. How It Works

      - For each point
        - Find approximately nearest centroid (distance = d)
        - If d > threshold, new centroid
        - Else possibly new cluster
        - Else add to nearest centroid
      - If number of centroids > K ≈ C log N
        - Recursively cluster centroids with higher threshold
      - Result is large set of centroids
        - these provide an approximation of the original distribution
        - we can cluster the centroids to get a close approximation of clustering the original data
        - or we can just use the result directly
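The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Mahout code: "possibly new cluster" is read as "start a new centroid with probability d / threshold", the threshold grows by a fixed factor `growth` on each collapse, the nearest centroid is found by an exact scan where the slide uses an approximate search, and `max_centroids` stands in for K ≈ C log N. All names here are hypothetical.

```python
import math
import random


def streaming_kmeans(points, max_centroids, threshold, growth=1.5, seed=0):
    """One-pass clustering sketch; returns a list of (centroid, weight) pairs.

    Requires threshold > 0 and max_centroids >= 1.
    """
    rng = random.Random(seed)
    centroids = []  # each entry is [coordinates, weight]

    def add(point, weight):
        if centroids:
            # Exact nearest-centroid scan (assumption: the talk uses a faster
            # approximate searcher at this step).
            nearest = min(centroids, key=lambda c: math.dist(c[0], point))
            d = math.dist(nearest[0], point)
            # Merge unless the point is far away, or unless the random draw
            # (probability d / threshold, an assumed rule) starts a new centroid.
            if d <= threshold and rng.random() >= d / threshold:
                total = nearest[1] + weight
                nearest[0] = [
                    (a * nearest[1] + b * weight) / total
                    for a, b in zip(nearest[0], point)
                ]
                nearest[1] = total
                return
        centroids.append([list(point), weight])

    for p in points:
        add(p, 1.0)
        while len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster the
            # centroids themselves, carrying their weights along.
            threshold *= growth
            old = list(centroids)
            centroids.clear()
            rng.shuffle(old)
            for coords, weight in old:
                add(coords, weight)
    return [(tuple(c), w) for c, w in centroids]


# Toy usage (hypothetical data): two well-separated blobs of 200 points each.
rng = random.Random(1)
points = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(200)]
points += [(rng.gauss(10, 0.1), rng.gauss(10, 0.1)) for _ in range(200)]
sketch = streaming_kmeans(points, max_centroids=20, threshold=0.5)
print(len(sketch), sum(w for _, w in sketch))
```

The collapse step is written as a loop where the slide says "recursively", but the effect is the same: the centroids are treated as weighted points and clustered again with a larger threshold. Because each centroid carries the weight of the points merged into it, the output can be fed to an ordinary k-means as a compact stand-in for the original data.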
  13. Parallel Speedup?

      [Figure: time per point (μs) versus number of threads; series: threaded version, non-threaded, perfect scaling]
  14. Warning, Recursive Descent

      - Inner loop requires finding nearest centroid
      - With lots of centroids, this is slow
      - But wait, we have classes to accelerate that!
  15. Warning, Recursive Descent

      - Inner loop requires finding nearest centroid
      - With lots of centroids, this is slow
      - But wait, we have classes to accelerate that!
        (Let's not use k-means searcher, though)
  16. Contact

      - [email protected]
      - @ted_dunning
      - Slides and such:
        - http://info.mapr.com/ted-uk-05-2012