
Spark Machine Learning 101 @HadoopCon

Chu-Yu Hsu
September 19, 2015


Transcript

  1. About Me
     • Chu-Yu Hsu (許儲羽)
     • Software Engineer
     • Machine Learning Practitioner
     • Uses Spark ML and Python in daily work and Kaggle competitions
     • http://blog.chuyuhsu.ml
  2. Apache Spark MLlib
     • Goal: make practical machine learning easy and scalable
     • spark.mllib - the primary API
     • spark.ml - a higher-level API for constructing ML workflows
     [Diagram labels: Apache Spark, spark.mllib, spark.ml]
  3. What’s in MLlib
     • Utilities: data types, basic statistics
     • Classification and regression: SVM, logistic regression, linear regression, naive Bayes, decision trees, ensembles of trees, isotonic regression
     • Collaborative filtering: alternating least squares (ALS)
     • Clustering: k-means, Gaussian mixture, power iteration clustering, latent Dirichlet allocation, streaming k-means
     • Dimensionality reduction: SVD, PCA
     • Frequent pattern mining: FP-growth
     • Optimization: stochastic gradient descent, limited-memory BFGS
     • https://spark.apache.org/docs/latest/mllib-guide.html
  4. Types of Recommenders
     • Editorial and hand-curated
     • Simple aggregates
     • Tailored to individual users
  5. Collaborative Filtering
     • One of the best-known recommendation algorithms
     • Widely used in e-commerce applications
     • The data size can be enormous
     • Recommendations need to be delivered as quickly as possible
  6. Collaborative Filtering
     • Main idea: find a set N of other users whose ratings are “similar” to user X’s ratings
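The main idea on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the speaker's code: the deck does not say which similarity measure it has in mind, so cosine similarity is an assumption here, and the names `cosine_similarity`, `nearest_users`, and the toy `ratings` data are hypothetical.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two users' {item: rating} dicts.

    Items a user has not rated count as 0, so only co-rated items
    contribute to the dot product.
    """
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def nearest_users(ratings, target, n=2):
    """Return the n users whose ratings are most similar to target's."""
    scores = [
        (cosine_similarity(ratings[target], other), user)
        for user, other in ratings.items()
        if user != target
    ]
    scores.sort(reverse=True)  # highest similarity first
    return [user for _, user in scores[:n]]

# Toy data (hypothetical): user "x" and three other users.
ratings = {
    "x":  {"a": 5, "b": 4, "c": 1},
    "u1": {"a": 4, "b": 5, "c": 1},
    "u2": {"a": 1, "b": 1, "c": 5},
    "u3": {"a": 5, "b": 5},
}
neighbours = nearest_users(ratings, "x", n=2)
```

On this toy data the neighbour set N for user "x" is u3 and u1, while u2, whose tastes are roughly the opposite, is left out. A recommender would then predict x's unknown ratings from the ratings of the users in N.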
  7. User Preferences
     • This is a baby example
     • Users: > 2M
     • Items: > 30M
     • Sparsity: > 2%
  8. Low Rank Assumption
     • The rating matrix can be approximated by the product of low-rank matrices
     • The low-rank dimensions are also known as “latent factors”
     • We assume that a small number of latent factors can represent the hidden factors we do not know
     • Example factors: Action, Romance, Thriller
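The low-rank assumption can be written out as follows. This is a sketch in the usual matrix factorization notation; the symbols R, m, n, k, p_u and q_i are assumptions chosen to agree with the P and Q named on the next slide, not notation recovered from the slide itself.

```latex
R \approx P Q^{\top}, \qquad
R \in \mathbb{R}^{m \times n},\quad
P \in \mathbb{R}^{m \times k},\quad
Q \in \mathbb{R}^{n \times k},\quad
k \ll \min(m, n)

\hat{r}_{ui} = p_u^{\top} q_i = \sum_{f=1}^{k} p_{uf}\, q_{if}
```

Here m is the number of users, n the number of items, and k the number of latent factors (for example, how much a movie is "Action", "Romance" or "Thriller"). Row p_u of P describes how strongly user u likes each factor, and row q_i of Q describes how strongly item i expresses it.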
  9. • Our goal is to find P and Q that minimize the sum of squared errors (SSE) over the known ratings
     • Model quality is measured with root mean square error (RMSE)
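The equations on this slide were not captured in the transcript. The following is the common textbook form of both quantities, given as an assumption rather than a reconstruction of the slide; in particular, the original may have included a regularization term. Here K denotes the set of (user, item) pairs with a known rating r_ui, and p_u, q_i are the rows of P and Q.

```latex
\min_{P,\,Q} \; \mathrm{SSE}(P, Q)
  = \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - p_u^{\top} q_i \right)^2

\mathrm{RMSE}
  = \sqrt{ \frac{1}{|\mathcal{K}|} \sum_{(u,i) \in \mathcal{K}}
           \left( r_{ui} - p_u^{\top} q_i \right)^2 }
```

RMSE is the square root of the SSE divided by the number of known ratings, so it is expressed in the same units as the ratings themselves.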
 
 
 

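  10. Alternating Least Squares
      • Because p and q are both unknown, the objective function is not convex
      • If we fix one of the unknowns, solving for the other becomes an ordinary least squares problem
      • ALS alternates between the two: fix Q and solve for P, then fix P and solve for Q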
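To make the alternation concrete, here is a small pure-Python sketch of ALS on a toy rating list. It is an illustration under stated assumptions, not Spark MLlib's implementation: it adds a plain L2 penalty `lam` (an assumption; the slide shows no regularization), runs on a single machine, and all names (`solve`, `update`, `objective`, `rmse`, `als`) and the toy data are hypothetical.

```python
import random

def solve(a, b):
    """Solve the small dense system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def update(fixed, ratings_by_row, rank, lam):
    """With the other side fixed, each row is a ridge regression:
    (sum_j q_j q_j^T + lam * I) p = sum_j r_j q_j."""
    out = {}
    for row, entries in ratings_by_row.items():
        a = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, r in entries:
            q = fixed[other]
            for i in range(rank):
                b[i] += r * q[i]
                for j in range(rank):
                    a[i][j] += q[i] * q[j]
        out[row] = solve(a, b)
    return out

def objective(ratings, p, q, lam):
    """Regularized sum of squared errors over the known ratings."""
    sse = sum((r - dot(p[u], q[i])) ** 2 for u, i, r in ratings)
    reg = sum(dot(v, v) for v in p.values()) + sum(dot(v, v) for v in q.values())
    return sse + lam * reg

def rmse(ratings, p, q):
    se = sum((r - dot(p[u], q[i])) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5

def als(ratings, rank=2, lam=0.1, iters=20, seed=0):
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    q = {i: [rng.random() for _ in range(rank)] for i in by_item}  # random start
    history = []
    for _ in range(iters):
        p = update(q, by_user, rank, lam)  # fix Q, solve for P
        q = update(p, by_item, rank, lam)  # fix P, solve for Q
        history.append(objective(ratings, p, q, lam))
    return p, q, history

# Toy (user, item, rating) triples, hypothetical data.
ratings = [
    ("alice", "matrix", 5), ("alice", "inception", 4), ("alice", "titanic", 1),
    ("bob", "matrix", 4), ("bob", "inception", 5), ("bob", "notebook", 1),
    ("carol", "titanic", 5), ("carol", "notebook", 4), ("carol", "matrix", 1),
    ("dave", "titanic", 4), ("dave", "notebook", 5), ("dave", "inception", 2),
]
p, q, history = als(ratings)
final_rmse = rmse(ratings, p, q)
```

Because each half-step solves its least squares problem exactly, the regularized objective recorded in `history` never increases from one sweep to the next, even though the joint problem is not convex. The per-row updates are also independent of each other, which is the property that lets a distributed implementation spread them across a cluster.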
  11. Amazon Reviews Dataset
      • 35 million ratings, 6.6 million users, 2.4 million products
      • Run on 16 nodes (m3.2xlarge)
      • https://github.com/apache/spark/pull/3720
  12. Conclusion
      • Spark MLlib is growing fast, but it still needs some time to mature
      • Spark MLlib is a powerful tool if you use it correctly
      • Sharpening your ML skills is the first priority
  13. References
      • https://spark.apache.org/docs/latest/mllib-guide.html
      • http://www.slideshare.net/jeykottalam/mllib
      • http://www.slideshare.net/PetrZapletal1/mllib-and-machine-learning-on-spark
      • https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
      • https://github.com/apache/spark/pull/3720
      • https://www.hakkalabs.co/articles/spark-mllib-making-practical-machine-learning-easy-and-scalable
      • http://www.slideshare.net/databricks/practical-machine-learning-pipelines-with-mllib