Spark Machine Learning 101 @HadoopCon

Spark Machine Learning 101 Chu-Yu Hsu @ HadoopCon 2015

About Me Chu-Yu Hsu, 許儲羽 • Software Engineer • Machine Learning Practitioner • Uses Spark ML and Python in daily work and Kaggle competitions • http://blog.chuyuhsu.ml

Outline • Introduction to Spark ML • Alternating Least Squares (ALS) • Hands-on example

Apache Spark MLlib • Makes practical machine learning easy and scalable • spark.mllib - the primary API • spark.ml - a higher-level API for constructing ML workflows

What’s in MLlib
• Utilities: Data types, Basic statistics
• Classification and regression: SVM, Logistic regression, Linear regression, Naive Bayes, Decision trees, Ensembles of trees, Isotonic regression
• Collaborative filtering: Alternating least squares (ALS)
• Clustering: K-means, Gaussian mixture, Power iteration clustering, Latent Dirichlet allocation, Streaming k-means
• Dimensionality reduction: SVD, PCA
• Frequent pattern mining: FP-growth
• Optimization: Stochastic gradient descent, Limited-memory BFGS
https://spark.apache.org/docs/latest/mllib-guide.html

ML Workflow can be VERY complex

Types of Recommenders • Editorial and hand-curated • Simple aggregates • Tailored to individual users

Who Uses Recommenders

Approaches • Content-based method • Item-based method • Model-based method

Collaborative Filtering • One of the best-known recommendation algorithms • Widely used in e-commerce applications • The data size can be enormous • Recommendations need to be delivered as quickly as possible

Collaborative Filtering • Main idea: find the set N of other users whose ratings are “similar” to X’s ratings
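One way to make “similar” concrete is cosine similarity between two users' rating vectors. The sketch below is only an illustration: the choice of cosine similarity, the function names, and the toy ratings are assumptions, not something the slides specify.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two users' ratings (dicts of item -> rating)."""
    common = set(a) & set(b)  # only co-rated items contribute to the dot product
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_users(target, ratings, k=1):
    """Return the k users whose ratings are most similar to the target user's."""
    scored = [
        (cosine_similarity(ratings[target], other), user)
        for user, other in ratings.items()
        if user != target
    ]
    scored.sort(reverse=True)
    return [user for _, user in scored[:k]]

# Hypothetical toy data: user -> {item: rating}
ratings = {
    "x": {"A": 5, "B": 1},
    "u1": {"A": 5, "B": 1, "C": 4},
    "u2": {"A": 1, "B": 5},
}
neighbours = nearest_users("x", ratings, k=1)
```

Here `neighbours` plays the role of the set N; items rated by those users but not yet by X (such as "C") become recommendation candidates.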

User Preferences • This is a toy example • Users: > 2M • Items: > 30M • Sparsity: > 2%

Low Rank Assumption • The rating matrix can be approximated by the product of low-rank matrices • The dimensions of those matrices can be understood as “latent factors” • We assume that a small number of factors can represent the hidden properties we do not know (e.g. Action, Romance, Thriller)

Low Rank Assumption [Figure: factor matrices labelled Action, Romance, Thriller]

Matrix Factorization

• Our goal is to find P and Q that minimize the sum of squared errors over the observed ratings • Model quality is measured with the root mean square error (RMSE)
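A common way to write these two quantities is sketched below. The notation is an assumption rather than the slide's own: r_{ui} is the observed rating of item i by user u, K is the set of observed (user, item) pairs, and p_u and q_i are the factor vectors for user u and item i taken from P and Q.

```latex
\min_{P,Q} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2
\qquad
\mathrm{RMSE} = \sqrt{ \frac{1}{|K|} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2 }
```

In practice a regularization term on the norms of p_u and q_i is usually added to the objective to limit overfitting.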

Alternating Least Squares • Because p and q are both unknown, the objective function is not convex • If we fix one of the unknowns, the problem for the other can be solved as a least squares problem
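The alternation can be sketched in plain Python as follows. This is a minimal illustration and not Spark's implementation: the function names, the toy ratings, and the plain L2 regularization weight `lam` are all assumptions, and the small Gaussian-elimination solver stands in for a linear algebra library.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def update(fixed, observed_by_row, rank, lam):
    """With one side fixed, solve a regularized least squares problem per row."""
    out = {}
    for row, observed in observed_by_row.items():
        # Normal equations: (sum f f^T + lam I) x = sum r f
        A = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for col, r in observed:
            f = fixed[col]
            for i in range(rank):
                b[i] += r * f[i]
                for j in range(rank):
                    A[i][j] += f[i] * f[j]
        out[row] = solve(A, b)
    return out

def objective(ratings, P, Q, lam):
    """Sum of squared errors plus L2 penalty on all factors."""
    sse = sum((r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2 for u, i, r in ratings)
    penalty = sum(v * v for vec in list(P.values()) + list(Q.values()) for v in vec)
    return sse + lam * penalty

def als(ratings, rank=2, lam=0.1, iterations=10, seed=0):
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    Q = {i: [rng.random() for _ in range(rank)] for i in by_item}
    P, history = {}, []
    for _ in range(iterations):
        P = update(Q, by_user, rank, lam)  # fix Q, solve for P
        Q = update(P, by_item, rank, lam)  # fix P, solve for Q
        history.append(objective(ratings, P, Q, lam))
    return P, Q, history

# Hypothetical toy ratings: (user, item, rating)
ratings = [
    ("alice", "m1", 5.0), ("alice", "m2", 4.0), ("alice", "m3", 1.0),
    ("bob", "m1", 4.0), ("bob", "m3", 1.0),
    ("carol", "m2", 1.0), ("carol", "m3", 5.0),
    ("dave", "m1", 1.0), ("dave", "m3", 4.0),
]
P, Q, history = als(ratings, rank=2, lam=0.1, iterations=10)
```

Because each half-step exactly minimizes the regularized objective over one set of factors, the values recorded in `history` never increase, even though the joint problem is not convex.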

Amazon Reviews Dataset • 35 million ratings, 6.6 million users, 2.4 million products • Run on a 16-node (m3.2xlarge) cluster • https://github.com/apache/spark/pull/3720

Resources

And More Resources • Source code examples: https://github.com/apache/spark/tree/master/examples • Apache Spark JIRA: https://issues.apache.org/jira/browse/spark

Dataset • MovieLens Dataset: http://grouplens.org/datasets/movielens/ • “ratings.dat”: UserID::MovieID::Rating::Timestamp • “movies.dat”: MovieID::Title::Genres
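The two record layouts above can be parsed with a few lines of Python. This is a sketch under assumptions: the `Rating` and `Movie` types and the function names are hypothetical, the sample lines are illustrative, and splitting genres on "|" assumes the convention used in the MovieLens files.

```python
from typing import List, NamedTuple

class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    timestamp: int

class Movie(NamedTuple):
    movie_id: int
    title: str
    genres: List[str]

def parse_rating(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))

def parse_movie(line):
    """Parse one 'MovieID::Title::Genres' line; genres assumed to be '|'-separated."""
    movie_id, title, genres = line.strip().split("::")
    return Movie(int(movie_id), title, genres.split("|"))

# Illustrative sample lines in the formats described above
sample_rating = parse_rating("1::1193::5::978300760\n")
sample_movie = parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy\n")
```

The resulting (user_id, movie_id, rating) triples are the shape of input a collaborative filtering model needs; the timestamp is typically used only for splitting data into training and test sets.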

Conclusion • Spark MLlib is growing fast but still needs some time to mature • Spark MLlib is a strong tool if you use it right • Sharpening your ML skills is the first priority

Q&A • Visit me on: http://blog.chuyuhsu.ml • GitHub: http://github.com/ChuyuHsu • Thanks

References
• https://spark.apache.org/docs/latest/mllib-guide.html
• http://www.slideshare.net/jeykottalam/mllib
• http://www.slideshare.net/PetrZapletal1/mllib-and-machine-learning-on-spark
• https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
• https://github.com/apache/spark/pull/3720
• https://www.hakkalabs.co/articles/spark-mllib-making-practical-machine-learning-easy-and-scalable
• http://www.slideshare.net/databricks/practical-machine-learning-pipelines-with-mllib
