
Machine Learning on Spark @ Strata Conference

Reynold Xin
February 26, 2013

Talk by Shivaram Venkataraman

Transcript

  1. Implementing Machine Learning
     §  Machine learning algorithms are
        -  Complex, multi-stage
        -  Iterative
     §  MapReduce/Hadoop unsuitable
     §  Need efficient primitives for data sharing
  2. Machine Learning using Spark
     §  Spark RDDs → efficient data sharing
     §  In-memory caching accelerates performance
        -  Up to 20x faster than Hadoop
     §  Easy-to-use high-level programming interface
        -  Express complex algorithms in ~100 lines
  3. K-Means: preliminaries
     [Scatter plot: Feature 1 vs. Feature 2]
     Data: collection of values

     data = lines.map(line => parseVector(line))
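The parseVector helper itself is not shown on the slide. A plain-Python sketch of what it might do — the comma-separated format and the name parse_line are assumptions of this sketch, not the talk's code:

```python
def parse_line(line):
    """Parse one text line into a feature vector (list of floats).

    Assumes comma-separated numeric fields; the slide's parseVector
    helper isn't shown, so the input format here is a guess.
    """
    return [float(x) for x in line.strip().split(",")]

# Each input line becomes one data point, mirroring
# data = lines.map(line => parseVector(line)).
lines = ["1.0,2.0", "3.5,4.5"]
data = [parse_line(line) for line in lines]
```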
  4. K-Means: preliminaries
     [Scatter plot: Feature 1 vs. Feature 2]
     K = number of clusters
     Data assignments to clusters: S1, S2, ..., SK
  6. K-Means Algorithm
     [Scatter plot: Feature 1 vs. Feature 2]
     • Initialize K cluster centers
     • Repeat until convergence:
       Assign each data point to the cluster with the closest center.
       Assign each cluster center to be the mean of its cluster's data points.
  8. K-Means Algorithm
     [Scatter plot: Feature 1 vs. Feature 2]
     • Initialize K cluster centers
     • Repeat until convergence:
       Assign each data point to the cluster with the closest center.
       Assign each cluster center to be the mean of its cluster's data points.

     centers = data.takeSample(false, K, seed)
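takeSample(false, K, seed) draws K data points without replacement as the initial centers. A plain-Python sketch of that initialization step, with random.sample standing in for Spark's takeSample:

```python
import random

def init_centers(data, k, seed):
    """Pick K distinct data points as initial cluster centers,
    sampled without replacement -- the role played by
    centers = data.takeSample(false, K, seed) on the slide."""
    rng = random.Random(seed)  # seeded, so the run is reproducible
    return rng.sample(data, k)

data = [[0.0, 0.0], [1.0, 1.0], [8.0, 8.0], [9.0, 9.0]]
centers = init_centers(data, 2, seed=42)
```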
  11. K-Means Algorithm
     [Scatter plot: Feature 1 vs. Feature 2]
     • Initialize K cluster centers
     • Repeat until convergence:
       Assign each cluster center to be the mean of its cluster's data points.

     centers = data.takeSample(false, K, seed)
     closest = data.map(p => (closestPoint(p, centers), p))
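The map step tags every point with the index of its nearest center. A plain-Python sketch of the closestPoint helper assumed by the slide (squared Euclidean distance is this sketch's choice of metric):

```python
def closest_point(p, centers):
    """Return the index of the center nearest to point p, using
    squared Euclidean distance -- the role of the slide's
    closestPoint(p, centers) helper."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))

# Mirrors closest = data.map(p => (closestPoint(p, centers), p)):
centers = [[0.0, 0.0], [10.0, 10.0]]
closest = [(closest_point(p, centers), p) for p in [[1.0, 1.0], [9.0, 9.0]]]
```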
  14. K-Means Algorithm
     [Scatter plot: Feature 1 vs. Feature 2]
     • Initialize K cluster centers
     • Repeat until convergence:

     centers = data.takeSample(false, K, seed)
     closest = data.map(p => (closestPoint(p, centers), p))
     pointsGroup = closest.groupByKey()
  15. K-Means Algorithm
     [Scatter plot: Feature 1 vs. Feature 2]
     • Initialize K cluster centers
     • Repeat until convergence:

     centers = data.takeSample(false, K, seed)
     closest = data.map(p => (closestPoint(p, centers), p))
     pointsGroup = closest.groupByKey()
     newCenters = pointsGroup.mapValues(ps => average(ps))
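groupByKey gathers points by their assigned cluster index, and mapValues replaces each group with its mean. A plain-Python sketch of those two steps (the average helper is this sketch's stand-in for the slide's):

```python
from collections import defaultdict

def average(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(xs) / n for xs in zip(*points)]

# groupByKey: gather points under their assigned cluster index.
pairs = [(0, [1.0, 1.0]), (0, [3.0, 3.0]), (1, [9.0, 9.0])]
points_group = defaultdict(list)
for idx, p in pairs:
    points_group[idx].append(p)

# mapValues(average): each new center is the mean of its cluster.
new_centers = {idx: average(ps) for idx, ps in points_group.items()}
```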
  18. K-Means Algorithm
     [Scatter plot: Feature 1 vs. Feature 2]
     • Initialize K cluster centers
     • Repeat until convergence:

     centers = data.takeSample(false, K, seed)
     closest = data.map(p => (closestPoint(p, centers), p))
     pointsGroup = closest.groupByKey()
     newCenters = pointsGroup.mapValues(ps => average(ps))
     while (dist(centers, newCenters) > ε)
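The loop runs until the centers stop moving. The slides don't define dist(centers, newCenters); one plausible definition — total squared movement of the centers, an assumption of this sketch — in plain Python:

```python
def dist(centers, new_centers):
    """Total squared movement between old and new center sets.
    One possible definition of the slide's dist(centers, newCenters)
    convergence test; the talk doesn't show the exact metric."""
    return sum(
        sum((x - y) ** 2 for x, y in zip(c, n))
        for c, n in zip(centers, new_centers)
    )

moved = dist([[0.0, 0.0]], [[0.1, 0.0]])  # small movement, near convergence
```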
  20. K-Means Source

     centers = data.takeSample(false, K, seed)
     do {
       closest = data.map(p => (closestPoint(p, centers), p))
       pointsGroup = closest.groupByKey()
       newCenters = pointsGroup.mapValues(ps => average(ps))
       d = distance(centers, newCenters)
       centers = newCenters
     } while (d > ε)
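Putting the slide's pieces together: a self-contained plain-Python sketch of the same loop, without Spark. Helper names mirror the slides; epsilon, the squared-distance metric, and the empty-cluster fallback are choices of this sketch:

```python
import random

def closest_point(p, centers):
    # Index of the nearest center (squared Euclidean distance).
    return min(range(len(centers)),
               key=lambda i: sum((x - y) ** 2 for x, y in zip(p, centers[i])))

def average(points):
    # Component-wise mean of a non-empty list of points.
    return [sum(xs) / len(points) for xs in zip(*points)]

def kmeans(data, k, seed=0, epsilon=1e-6):
    # takeSample(false, K, seed): K distinct points as initial centers.
    centers = random.Random(seed).sample(data, k)
    while True:
        # map + groupByKey: points gathered by assigned cluster index.
        groups = {i: [] for i in range(k)}
        for p in data:
            groups[closest_point(p, centers)].append(p)
        # mapValues(average); keep the old center if a cluster is empty.
        new_centers = [average(ps) if ps else centers[i]
                       for i, ps in groups.items()]
        # dist(centers, newCenters): total squared center movement.
        d = sum(sum((x - y) ** 2 for x, y in zip(c, n))
                for c, n in zip(centers, new_centers))
        centers = new_centers
        if d <= epsilon:
            return centers

data = [[0.0, 0.0], [0.5, 0.5], [9.0, 9.0], [9.5, 9.5]]
centers = kmeans(data, k=2, seed=1)
```

On this tiny, well-separated dataset the loop settles on the two cluster means regardless of which points the seed picks first.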
  21. Ease of use
     §  Interactive shell: useful for featurization and pre-processing data
     §  Lines of code for K-Means:
        -  Spark: ~90 lines (part of the hands-on tutorial!)
        -  Hadoop/Mahout: 4 files, > 300 lines
  22. Performance [Zaharia et al., NSDI '12]
     [Bar charts: iteration time (s) vs. number of machines (25, 50, 100)
     for Hadoop, HadoopBinMem, and Spark, on K-Means (left) and Logistic
     Regression (right); Spark has the lowest iteration time at every scale.]
  23. Conclusion
     §  Spark: framework for cluster computing
     §  Fast and easy machine learning programs
     §  K-means clustering using Spark
     §  Hands-on exercise this afternoon!
     Examples and more: www.spark-project.org