
MLlib Decision Trees at SF Scala-BAML Meetup

jkbradley
September 22, 2014


Talk from SF Scala / Bay Area Machine Learning Meetup on 9-22-2014.

This talk covers learning decision trees on a distributed computing cluster using MLlib, the machine learning library built on top of Spark. Decision trees are a powerful machine learning method used in many applications, and Spark is an open-source project for large-scale data analytics. The talk explains how trees are implemented on Spark, discusses how best to use MLlib trees in practice, and gives a number of examples.




Transcript

1. Decision Trees
   Example: Spam detection
   "Get Viagra real cheap! Send money now to get…"
   Instance/example features (word counts): get: 2, Viagra: 1, real: 1, cheap: 1, send: 1, …; addressKnown: False
   Goal: given an instance, examine its features and predict a label (Spam / Not spam).

2–8. Decision Trees
   Given: an instance (feature vector), e.g. word counts get: 2, Viagra: 1, real: 1, cheap: 1, send: 1, …; addressKnown: False.
   The tree routes the instance through feature tests:
     addressKnown == True?
       Yes → Not spam
       No  → Count("Viagra") > 0?
               Yes → Spam
               No  → Not spam
   An internal tree node tests a feature; a leaf node predicts a label.

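   To make the traversal concrete, here is a minimal prediction sketch in Scala (illustrative only; the node types and the spamTree value are constructed for this walkthrough, not MLlib's representation):

      // An internal node tests a feature; a leaf predicts a label.
      sealed trait Tree
      case class Leaf(label: String) extends Tree
      case class Node(test: Map[String, Double] => Boolean,
                      ifTrue: Tree, ifFalse: Tree) extends Tree

      // Route the feature vector down the tree until a leaf is reached.
      def predict(tree: Tree, features: Map[String, Double]): String = tree match {
        case Leaf(label)      => label
        case Node(test, t, f) => predict(if (test(features)) t else f, features)
      }

      // The spam tree from the slides (addressKnown encoded as 0.0/1.0):
      val spamTree: Tree =
        Node(f => f.getOrElse("addressKnown", 0.0) == 1.0,
          Leaf("Not spam"),
          Node(f => f.getOrElse("Viagra", 0.0) > 0,
            Leaf("Spam"),
            Leaf("Not spam")))

      // predict(spamTree, Map("Viagra" -> 1.0, "addressKnown" -> 0.0))  // "Spam"
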
9. Decision Trees
   Labels:
   • Categorical (classification)
   • Continuous (regression)
   Features:
   • Categorical
   • Continuous
   Properties:
   • Interpretable models
   • Models can be simple (small) or expressive (big)
   These are industry workhorses!

10–11. Outline
   • Decision Trees & Spark
   • Learning Trees on Spark
   • Using MLlib Trees in Practice
     – Model selection
     – Accuracy–communication trade-offs
   • Active Development

12. Learning Decision Trees
   Training data (labeled emails):
   "Get Viagra real cheap! Send money now to…" → Spam
   "Hi! I haven't seen you since the party last…" → Not spam
   "Here's the info I promised to send you for…" → Not spam
   "Your password has been compromised! Go to…" → Spam

13–14. Learning Decision Trees
   Try a split at the root: addressKnown == True? (No / Yes). The split partitions the training labels into two groups, but both groups still mix Spam and Not spam, so further splits are needed.

15–16. Learning Decision Trees
   addressKnown == True?
     Yes → Not spam
     No  → Count("Viagra") > 0?
             Yes → Spam
             No  → Not spam
   Recursively partition the training instances.

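   The recursive partitioning can be sketched as follows (a single-machine illustration over binary features; the stopping rules and the simple error-count split score are stand-ins, not MLlib's algorithm):

      // Greedy recursive partitioning over binary features (illustrative).
      case class Inst(features: Map[String, Boolean], label: String)

      sealed trait Tree
      case class Leaf(label: String) extends Tree
      case class Split(feature: String, yes: Tree, no: Tree) extends Tree

      // Instances a group would misclassify under its majority label.
      def errors(group: Seq[Inst]): Int =
        if (group.isEmpty) 0
        else group.size - group.groupBy(_.label).values.map(_.size).max

      // Assumes data is non-empty.
      def build(data: Seq[Inst], feats: Set[String], maxDepth: Int): Tree = {
        val majority = data.groupBy(_.label).maxBy(_._2.size)._1
        if (maxDepth == 0 || feats.isEmpty || errors(data) == 0) Leaf(majority)
        else {
          // Choose the feature whose split leaves the fewest misclassified
          // instances, then recurse on each side.
          val f = feats.minBy { f =>
            val (yes, no) = data.partition(_.features.getOrElse(f, false))
            errors(yes) + errors(no)
          }
          val (yes, no) = data.partition(_.features.getOrElse(f, false))
          if (yes.isEmpty || no.isEmpty) Leaf(majority)
          else Split(f, build(yes, feats - f, maxDepth - 1),
                        build(no,  feats - f, maxDepth - 1))
        }
      }
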
17–18. Distributed Learning of Trees
   Distribute the data: partition by rows (instances). Each worker holds a subset of the labeled emails:
   "Get Viagra real cheap! Send money now to…" → Spam
   "Hi! I haven't seen you since the party last…" → Not spam
   "Here's the info I promised to send you for…" → Not spam
   "Your password has been compromised! Go to…" → Spam
   The same split (addressKnown == True?) applies to the data on every worker.

19. Distributed Learning of Trees
   Distribute the data: partition by rows (instances). After a split, each instance is routed left or right (e.g. right, left, right, right, left, right, right, right). Traditional algorithm: shuffle the entire dataset across the network many times (bad!).

20. Spark
   Fast & general engine for large-scale data processing
   • Started by UC Berkeley AMPLab in 2009
   • Big open-source community
   • One of the fastest-growing Apache projects
   • 250+ developers from 50 companies
   • Included in major Hadoop distributions
   MLlib
   • Classification
   • Regression
   • Recommendation
   • Clustering
   • Statistics
   • Linear algebra

21–22. Spark RDDs
   Resilient Distributed Datasets (RDDs) hold the partitioned training data. Computation is expressed as operations on the RDD, such as map (applied to each record in parallel) followed by aggregate (combining partial results across partitions).

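   A minimal example of the map + aggregate pattern (assumes a running SparkContext named sc; the tiny in-memory dataset is illustrative, and aggregate fuses the per-record map step with the cross-partition combine step):

      // Count spam vs. not-spam labels across all partitions.
      val labels = sc.parallelize(Seq("spam", "not spam", "not spam", "spam"))
      val (spam, notSpam) = labels.aggregate((0L, 0L))(
        // seqOp: fold one record into a partition-local partial result
        (acc, l) => if (l == "spam") (acc._1 + 1, acc._2) else (acc._1, acc._2 + 1),
        // combOp: merge partial results from different partitions
        (a, b) => (a._1 + b._1, a._2 + b._2))
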
23. Iterative Computation on Spark
   Each iteration aggregates partial results across the cluster. RDDs are:
   • In-memory
   • Resilient to failures

24–25. Outline
   • Decision Trees & Spark
   • Learning Trees on Spark
   • Using MLlib Trees in Practice
     – Model selection
     – Accuracy–communication trade-offs
   • Active Development

26. Train Level-by-Level
   addressKnown == True? (No / Yes); at the next level each child node gets its own test (feature x == True?), and so on down the tree: nodes are trained one level at a time.

27. Choosing How to Split
   Choose one feature xj to test. For a binary feature the test is xj == True/False. For each side of the candidate split, tally the label counts (# Not spam, # Spam): these are the sufficient statistics for the split. From them, compute the impurity (information gain) to measure how good the split is.

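   The split score can be computed directly from those counts. A sketch using entropy-based information gain (one common impurity measure; the counts in the example call are made up):

      // Entropy of a label distribution given per-class counts.
      def entropy(counts: Seq[Double]): Double = {
        val n = counts.sum
        counts.filter(_ > 0).map { c => val p = c / n; -p * math.log(p) }.sum
      }

      // Information gain of a binary split: parent impurity minus the
      // size-weighted impurity of the two children.
      def infoGain(left: Seq[Double], right: Seq[Double]): Double = {
        val (nl, nr) = (left.sum, right.sum)
        val parent = left.zip(right).map { case (a, b) => a + b }
        entropy(parent) - (nl / (nl + nr)) * entropy(left) -
          (nr / (nl + nr)) * entropy(right)
      }

      // e.g. infoGain(Seq(8, 1), Seq(1, 6))  // (not spam, spam) counts per side
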
28–29. High-Level View
   On each iteration (one tree level): broadcast the model built so far to the workers; each worker computes sufficient statistics over its partition; aggregate the statistics to choose the best split for every node at that level. Only 1 pass over the data per level ⇒ running time scales linearly with dataset size.

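   A toy of this per-level map + aggregate step (illustrative; BinnedInstance, nodeOf, and the label-count statistics are stand-ins for MLlib's internals in org.apache.spark.mllib.tree, not its actual code):

      import org.apache.spark.SparkContext._   // pair-RDD operations (Spark 1.x)
      import org.apache.spark.rdd.RDD

      // An instance whose features are already discretized into bin indices.
      case class BinnedInstance(label: Int, bins: Array[Int])

      // One pass over the data: per (node, feature, bin), count 0/1 labels.
      // `nodeOf` routes an instance through the tree built so far; Spark ships
      // the current tree to workers inside this closure (MLlib broadcasts it).
      def levelStats(data: RDD[BinnedInstance],
                     nodeOf: BinnedInstance => Int): RDD[((Int, Int, Int), (Long, Long))] =
        data.flatMap { inst =>
          val node = nodeOf(inst)
          inst.bins.zipWithIndex.map { case (bin, feature) =>
            ((node, feature, bin),
             if (inst.label == 1) (0L, 1L) else (1L, 0L))
          }
        }.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
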
30. Scaling with Dataset Size
   [Chart "Spark 1.1: Scaling # features": time (seconds, 0–700) vs. # features (0–4000), one curve per # instances: 16,000; 160,000; 1,600,000; 8,000,000 (≈ 224 GB at the largest).]
   Binary classification, 16-node EC2 cluster, 6-level trees. Runtime scales linearly with # features.

31. Scaling with Dataset Size
   [Chart "Spark 1.1: Scaling # instances": time (seconds, 0–700) vs. # training instances (0–8,000,000), one curve per # features: 100; 500; 1500; 3500.]
   Binary classification, 16-node EC2 cluster, 6-level trees. Runtime scales linearly with # instances.

32–33. Outline
   • Decision Trees & Spark
   • Learning Trees on Spark
   • Using MLlib Trees in Practice
     – Model selection
     – Accuracy–communication trade-offs
   • Active Development

34–41. MLlib Trees in Practice

   def trainClassifier(
       input: RDD[LabeledPoint],
       numClassesForClassification: Int,
       categoricalFeaturesInfo: Map[Int, Int],
       impurity: String,
       maxDepth: Int,
       maxBins: Int): DecisionTreeModel

   • impurity: measures how good a split is (information gain)
   • maxDepth: max # levels in the tree (more levels = more expressive model)

  35. MLlib  Trees  in  Prac<ce   def  trainClassifier( input: RDD[LabeledPoint], numClassesForClassification:

    Int, categoricalFeaturesInfo: Map[Int, Int], impurity: String, maxDepth: Int, maxBins: Int):      DecisionTreeModel  
  36. MLlib  Trees  in  Prac<ce   def  trainClassifier( input: RDD[LabeledPoint], numClassesForClassification:

    Int, categoricalFeaturesInfo: Map[Int, Int], impurity: String, maxDepth: Int, maxBins: Int):      DecisionTreeModel  
  37. MLlib  Trees  in  Prac<ce   def  trainClassifier( input: RDD[LabeledPoint], numClassesForClassification:

    Int, categoricalFeaturesInfo: Map[Int, Int], impurity: String, maxDepth: Int, maxBins: Int):      DecisionTreeModel  
  38. MLlib  Trees  in  Prac<ce   def  trainClassifier( input: RDD[LabeledPoint], numClassesForClassification:

    Int, categoricalFeaturesInfo: Map[Int, Int], impurity: String, maxDepth: Int, maxBins: Int):      DecisionTreeModel   Measures  how  good  a  split  is.      (informa<on  gain)  
  39. MLlib  Trees  in  Prac<ce   def  trainClassifier( input: RDD[LabeledPoint], numClassesForClassification:

    Int, categoricalFeaturesInfo: Map[Int, Int], impurity: String, maxDepth: Int, maxBins: Int):      DecisionTreeModel   Max  #  levels  in  tree    (more  levels  =  more  expressive  model)  
  40. MLlib  Trees  in  Prac<ce   def  trainClassifier( input: RDD[LabeledPoint], numClassesForClassification:

    Int, categoricalFeaturesInfo: Map[Int, Int], impurity: String, maxDepth: Int, maxBins: Int):      DecisionTreeModel  
  41. MLlib  Trees  in  Prac<ce   def  trainClassifier( input: RDD[LabeledPoint], numClassesForClassification:

    Int, categoricalFeaturesInfo: Map[Int, Int], impurity: String, maxDepth: Int, maxBins: Int):      DecisionTreeModel  
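   A minimal usage sketch of this API in Scala (Spark 1.1; assumes a SparkContext named sc, and the file path and parameter values are illustrative):

      import org.apache.spark.SparkContext._   // enables .mean() on RDD[Double] (Spark 1.x)
      import org.apache.spark.mllib.tree.DecisionTree
      import org.apache.spark.mllib.util.MLUtils

      // Load labeled data in LIBSVM format (path is illustrative).
      val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

      val model = DecisionTree.trainClassifier(
        input = data,
        numClassesForClassification = 2,
        categoricalFeaturesInfo = Map[Int, Int](),  // empty: all features continuous
        impurity = "gini",                          // or "entropy"
        maxDepth = 5,
        maxBins = 32)

      // Predict on the training data and measure error.
      val trainErr = data
        .map(p => if (model.predict(p.features) == p.label) 0.0 else 1.0)
        .mean()
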
42. Choosing How to Split
   Binary feature: xj == True/False ⇒ 2 bins per feature. Each bin keeps its own sufficient statistics (# Not spam, # Spam).

43–44. Binning Features
   Continuous feature: the test is xj < (value). Naively, # possible bins ≈ # instances (bad!). Solution: discretize the data into a bounded number of bins (e.g. Bin 1 … Bin 5), each with its own (# Not spam, # Spam) statistics.
   maxBins: larger = higher accuracy; smaller = less communication.

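   One simple way to pick bin boundaries is from quantiles of the observed values, sketched below (an illustration of the idea, not MLlib's exact procedure):

      // Discretize one continuous feature into at most maxBins bins.
      // Assumes values is non-empty.
      def binThresholds(values: Seq[Double], maxBins: Int): Seq[Double] = {
        val sorted = values.sorted
        val n = sorted.size
        // thresholds at the i/maxBins quantiles, deduplicated
        (1 until maxBins).map(i => sorted((i * n) / maxBins)).distinct
      }

      // Map a raw value to its bin index.
      def binOf(x: Double, thresholds: Seq[Double]): Int = {
        val i = thresholds.indexWhere(x < _)
        if (i == -1) thresholds.size else i
      }
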
45. Communication
   On each iteration (level):
     for each tree node,
       for each feature,
         for each bin: a set of sufficient statistics.
   # sets of statistics = (# nodes) × (# features) × (# bins/feature)
   (# bins/feature is set via the maxBins parameter.)

46. Communication
   Example: 2 million instances, 3500 features, ~70 bins/feature.
   First iteration: 1 node. Second iteration: 2 nodes.

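   Plugging into the formula above: the first iteration aggregates 1 × 3500 × 70 = 245,000 sets of statistics, the second 2 × 3500 × 70 = 490,000, and the count keeps doubling as each level doubles the number of nodes.
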
48. MLlib Trees in Practice
   MLlib supports:
   • Classification (binary & multiclass labels) & regression (continuous labels)
   • Features: binary, k-category, continuous
   • Various impurity measures & other settings
   • Python, Scala & Java APIs
   Good practices:
   • maxDepth → tune with data (model selection)
   • maxBins → set low, increase if needed
   • # RDD partitions → set to # compute cores

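   Model selection for maxDepth can be as simple as a holdout sweep. A sketch (the split ratio and depth grid are arbitrary illustrative choices, not recommendations):

      import org.apache.spark.SparkContext._   // DoubleRDDFunctions (Spark 1.x)
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.tree.DecisionTree
      import org.apache.spark.rdd.RDD

      def tuneMaxDepth(data: RDD[LabeledPoint]): Int = {
        val Array(train, valid) = data.randomSplit(Array(0.8, 0.2), seed = 42)
        train.cache(); valid.cache()
        // Train at several depths; keep the depth with the best holdout accuracy.
        Seq(2, 4, 6, 8).maxBy { depth =>
          val model = DecisionTree.trainClassifier(
            train, 2, Map[Int, Int](), "gini", depth, 32)
          valid.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0)
               .mean()
        }
      }
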
49. Performance Improvements: Spark 1.0 → 1.1
   [Chart: running time (sec, 0–800) vs. # instances (20,000; 200,000; 2,000,000), one curve each for v1.0 and v1.1.]
   16-node EC2 cluster, 6-level trees, 3500 features. 4–5x faster.

50. Performance Improvements: Spark 1.0 → 1.1
   [Chart: running time (sec, 0–800) vs. # features (100; 500; 1500; 3500), one curve each for v1.0 and v1.1.]
   16-node EC2 cluster, 6-level trees, 2 million instances. 2–4x faster.

51. MLlib Trees: Active Development
   Ensembles: random forests & boosting
   → PR for random forests
   → Alpine Labs Sequoia Forests: coordinating merge
   → boosting under development
   Model selection pipelines
   → design doc published on JIRA
   More internal optimizations

52. Where to Go from Here?
   • Apache Spark: http://spark.apache.org/
     – Download & try it out
     – Learn with videos, exercises, docs
     – Contribute via GitHub!
   • Databricks: http://databricks.com/
     – Learn about Databricks Cloud!
     – Spark training resources

53. Summary
   • Decision Trees & Spark
   • Learning Trees on Spark
   • Using MLlib Trees in Practice
     – Model selection
     – Accuracy–communication trade-offs
   • Active Development
     – Ensembles (forests & boosting)
     – Model selection
     – More optimizations
   Many collaborators: Manish Amde, Hirakendu Das, Evan Sparks, Ameet Talwalkar, Xiangrui Meng, Qiping Li, Sung Chung, Lee Yang, …
   Thanks!