Spark Machine Learning 101 @HadoopCon

Slide 1

Spark Machine Learning 101 Chu-Yu Hsu @ HadoopCon 2015

Slide 2

About Me Chu-Yu Hsu, 許儲羽 • Software Engineer • Machine Learning Practitioner • Uses Spark ML and Python in daily work and Kaggle competitions • http://blog.chuyuhsu.ml

Slide 3

Outline • Introduction to Spark ML • Alternating Least Squares (ALS) • Hands-on example

Slide 4

No content

Slide 5

Apache Spark MLlib • Goal: make practical machine learning easy and scalable • spark.mllib - the primary API • spark.ml - a higher-level API for constructing ML workflows

Slide 6

What’s in MLlib
• Utilities
• Data types
• Basic statistics
• Classification and regression: SVM, logistic regression, linear regression, naive Bayes, decision trees, ensembles of trees, isotonic regression
• Collaborative filtering: alternating least squares (ALS)
• Clustering: k-means, Gaussian mixture, power iteration clustering, latent Dirichlet allocation, streaming k-means
• Dimensionality reduction: SVD, PCA
• Frequent pattern mining: FP-growth
• Optimization: stochastic gradient descent, limited-memory BFGS
https://spark.apache.org/docs/latest/mllib-guide.html

Slide 7

ML Workflow can be VERY complex

Slide 8

Types of Recommenders • Editorial and hand curated • Simple aggregates • Tailored to individual users

Slide 9

Who Uses Recommenders

Slide 10

Approaches • Content-based method • Item-based method • Model-based method

Slide 11

Collaborative Filtering • One of the best-known recommendation algorithms • Widely used in e-commerce applications • The data size can be enormous • Recommendations need to be delivered as quickly as possible

Slide 12

Collaborative Filtering • Main idea: find a set N of other users whose ratings are “similar” to user X’s ratings
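The main idea above can be sketched in a few lines of plain Python. This is only an illustration, not the method used in the talk: the toy ratings, the choice of cosine similarity, and the helper names (`cosine_similarity`, `nearest_neighbors`, `predict`) are all assumptions made for the example.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}. "X" is the target user.
ratings = {
    "X": {"a": 5, "b": 4, "c": 1},
    "u1": {"a": 5, "b": 5, "c": 1, "d": 4},
    "u2": {"a": 1, "b": 2, "c": 5, "d": 1},
    "u3": {"a": 4, "b": 4, "d": 5},
}


def cosine_similarity(r1, r2):
    """Cosine similarity between two users; unrated items count as 0."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    norm1 = sqrt(sum(v * v for v in r1.values()))
    norm2 = sqrt(sum(v * v for v in r2.values()))
    return dot / (norm1 * norm2)


def nearest_neighbors(target, ratings, n):
    """Return the set N: the n users most similar to `target`."""
    scores = [
        (cosine_similarity(ratings[target], r), user)
        for user, r in ratings.items()
        if user != target
    ]
    scores.sort(reverse=True)
    return [user for _, user in scores[:n]]


def predict(target, item, ratings, n=2):
    """Similarity-weighted average of the neighbors' ratings for `item`."""
    weighted, total = 0.0, 0.0
    for user in nearest_neighbors(target, ratings, n):
        if item in ratings[user]:
            sim = cosine_similarity(ratings[target], ratings[user])
            weighted += sim * ratings[user][item]
            total += sim
    return weighted / total if total else None


print(nearest_neighbors("X", ratings, 2))
print(predict("X", "d", ratings))
```

Scanning every other user for every prediction is fine for a toy example but does not scale to millions of users, which is one reason model-based methods such as matrix factorization are attractive.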

Slide 13

Users Preferences • This is a baby example • Users: > 2M • Items: > 30M • Sparsity: > 2%

Slide 14

Low Rank Assumption • The ratings matrix can be approximated by the product of low-rank matrices • The dimensions of these matrices are understood as “latent factors” • We assume that the low-rank factors represent hidden factors we do not observe directly, for example Action, Romance, Thriller
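As a rough illustration of the low-rank assumption, the sketch below builds a small ratings matrix from two factor tables with three latent factors each. The factor names come from the slide; the users, movies, and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Latent factors, named as on the slide.
factors = ["Action", "Romance", "Thriller"]

# Hypothetical user factors: how much each user likes each latent factor.
user_factors = {
    "alice": [1.0, 0.0, 0.5],
    "bob": [0.0, 1.0, 0.0],
}

# Hypothetical item factors: how strongly each movie expresses each factor.
item_factors = {
    "movie_a": [4.0, 0.0, 2.0],
    "movie_b": [0.0, 4.0, 0.0],
    "movie_c": [2.0, 1.0, 4.0],
}


def predicted_rating(user, item):
    """Dot product of the user's and the item's latent factor vectors."""
    return sum(p * q for p, q in zip(user_factors[user], item_factors[item]))


# The full ratings matrix is the product of the two small factor tables.
matrix = {
    user: {item: predicted_rating(user, item) for item in item_factors}
    for user in user_factors
}

for user, row in matrix.items():
    print(user, row)
```

In a real recommender the factors are not labeled in advance; they are learned from the observed ratings, and names such as "Action" are only an interpretation.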

Slide 15

Low Rank Assumption [Figure: latent factors labeled Action, Romance, Thriller]

Slide 16

Matrix Factorization

Slide 17

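• Our goal is to find P and Q that minimize the sum of squared errors over the observed ratings:

$$\min_{P,Q} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2$$

where K is the set of observed (user, item) pairs, r_{ui} is the rating user u gave item i, and p_u and q_i are the factor vectors of user u and item i (rows of P and Q).

• Root Mean Square Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{|K|} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2}$$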
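Both error measures are easy to compute directly: for every observed rating, take the difference between the rating and the dot product of the corresponding user and item factor vectors, then square and sum. The sketch below assumes ratings are stored in a dict keyed by (user index, item index); that layout and the toy values are assumptions for illustration only.

```python
from math import sqrt


def dot(p, q):
    """Dot product of two equal-length factor vectors."""
    return sum(a * b for a, b in zip(p, q))


def sum_squared_error(observed, P, Q):
    """Sum of squared errors over the observed ratings only."""
    return sum((r - dot(P[u], Q[i])) ** 2 for (u, i), r in observed.items())


def rmse(observed, P, Q):
    """Root mean square error over the observed ratings."""
    return sqrt(sum_squared_error(observed, P, Q) / len(observed))


# Hypothetical toy data: (user index, item index) -> rating.
observed = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 5.0, (1, 1): 2.0}
P = [[1.0, 2.0], [1.0, 1.0]]  # one factor vector per user
Q = [[1.0, 2.0], [1.0, 1.0]]  # one factor vector per item

print(sum_squared_error(observed, P, Q))
print(rmse(observed, P, Q))
```

Only observed entries contribute to the error; missing ratings are not treated as zeros.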

Slide 18

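Alternating Least Squares • Because p and q are both unknown, the objective function is not convex • If we fix one of the unknowns, the problem becomes a least squares problem that can be solved directly • ALS alternates between fixing p to solve for q and fixing q to solve for p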
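The alternation can be sketched in plain Python as follows. This is a minimal single-machine illustration, not Spark's implementation: it adds a plain L2 regularization term `lam` (an assumption, used here to keep each small linear system solvable), solves the normal equations with a hand-written Gaussian elimination, and uses hypothetical toy ratings.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def update_factors(fixed, ratings_by_row, k, lam):
    """With `fixed` held constant, solve one least squares problem per row:
    (sum of q q^T + lam * I) p = sum of r * q."""
    updated = []
    for row in ratings_by_row:  # row: list of (index into fixed, rating)
        A = [[lam if a == b else 0.0 for b in range(k)] for a in range(k)]
        rhs = [0.0] * k
        for j, r in row:
            q = fixed[j]
            for a in range(k):
                rhs[a] += r * q[a]
                for b in range(k):
                    A[a][b] += q[a] * q[b]
        updated.append(solve(A, rhs))
    return updated


def regularized_objective(observed, P, Q, lam):
    """Squared error on observed ratings plus the L2 penalty on all factors."""
    err = sum(
        (r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
        for (u, i), r in observed.items()
    )
    penalty = sum(v * v for row in P + Q for v in row)
    return err + lam * penalty


def als(observed, n_users, n_items, k=2, lam=0.1, iterations=10, seed=0):
    """Alternate between solving for P (Q fixed) and for Q (P fixed)."""
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for (u, i), r in observed.items():
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(iterations):
        P = update_factors(Q, by_user, k, lam)
        Q = update_factors(P, by_item, k, lam)
    return P, Q


# Hypothetical toy ratings: (user index, item index) -> rating.
observed = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 2.5, (1, 0): 2.0, (1, 1): 4.0}
P, Q = als(observed, n_users=2, n_items=3)
print(regularized_objective(observed, P, Q, 0.1))
```

Each half-step solves its subproblem exactly, so the regularized objective never increases from one iteration to the next. The per-user and per-item solves are also independent of one another, which is what makes the algorithm easy to parallelize.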

Slide 19

Amazon Reviews Dataset • 35 million ratings, 6.6 million users, 2.4 million products • On a 16-node (m3.2xlarge) cluster • https://github.com/apache/spark/pull/3720

Slide 20

Resources

Slide 21

Resources

Slide 22

And More Resources • Source code examples: https://github.com/apache/spark/tree/master/examples • Apache Spark JIRA: https://issues.apache.org/jira/browse/spark

Slide 23

Dataset • MovieLens Dataset: http://grouplens.org/datasets/movielens/ • “ratings.dat”: UserID::MovieID::Rating::Timestamp • “movies.dat”: MovieID::Title::Genres
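The two file formats above can be parsed with a few lines of plain Python. The sketch below is an illustration only: the `Rating` and `Movie` tuple names and the sample lines are made up for the example, and splitting genres on "|" is an assumption about the file contents that is not stated on the slide.

```python
from collections import namedtuple

Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])
Movie = namedtuple("Movie", ["movie_id", "title", "genres"])


def parse_rating(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


def parse_movie(line):
    """Parse one 'MovieID::Title::Genres' line.

    Assumption: multiple genres are separated by '|'.
    """
    movie_id, title, genres = line.strip().split("::")
    return Movie(int(movie_id), title, genres.split("|"))


# Illustrative sample lines in the documented format.
rating = parse_rating("1::1193::5::978300760\n")
movie = parse_movie("1::Toy Story (1995)::Animation|Comedy\n")

print(rating)
print(movie)
```

The same parsing functions could be passed to a map over the lines of each file, whether the lines come from a local file or from a distributed collection.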

Slide 24

Conclusion • Spark MLlib is growing fast, but still needs some time to mature • Spark MLlib is a strong tool if you use it right • Sharpening your ML skills is the first priority

Slide 25

Q&A Visit me on:  http://blog.chuyuhsu.ml Github:  http://github.com/ChuyuHsu Thanks

Slide 26

References
• https://spark.apache.org/docs/latest/mllib-guide.html
• http://www.slideshare.net/jeykottalam/mllib
• http://www.slideshare.net/PetrZapletal1/mllib-and-machine-learning-on-spark
• https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
• https://github.com/apache/spark/pull/3720
• https://www.hakkalabs.co/articles/spark-mllib-making-practical-machine-learning-easy-and-scalable
• http://www.slideshare.net/databricks/practical-machine-learning-pipelines-with-mllib