Slide 1

Slide 1 text

Spark Machine Learning 101 Chu-Yu Hsu @ HadoopCon 2015

Slide 2

Slide 2 text

About Me Chu-Yu Hsu, 許儲羽 • Software Engineer • Machine Learning Practitioner • Uses Spark ML and Python in daily work and Kaggle competitions • http://blog.chuyuhsu.ml

Slide 3

Slide 3 text

Outline • Introduction to Spark ML • Alternating Least Squares (ALS) • Hands-on example

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Apache Spark MLlib • Makes practical machine learning easy and scalable • spark.mllib - the primary API • spark.ml - a higher-level API for constructing ML workflows

Slide 6

Slide 6 text

What’s in MLlib
 • Utilities: data types, basic statistics
 • Classification and regression: SVM, logistic regression, linear regression, naive Bayes, decision trees, ensembles of trees, isotonic regression
 • Collaborative filtering: alternating least squares (ALS)
 • Clustering: k-means, Gaussian mixture, power iteration clustering, latent Dirichlet allocation, streaming k-means
 • Dimensionality reduction: SVD, PCA
 • Frequent pattern mining: FP-growth
 • Optimization: stochastic gradient descent, limited-memory BFGS
 https://spark.apache.org/docs/latest/mllib-guide.html

Slide 7

Slide 7 text

ML Workflow can be VERY complex

Slide 8

Slide 8 text

Types of Recommenders • Editorial and hand curated • Simple aggregates • Tailored to individual users

Slide 9

Slide 9 text

Who Uses Recommenders

Slide 10

Slide 10 text

Approaches • Content based method • Item based method • Model based method

Slide 11

Slide 11 text

Collaborative Filtering • One of the best-known recommendation algorithms • Widely used in e-commerce applications • The data size can be enormous • Recommendations need to be delivered as quickly as possible

Slide 12

Slide 12 text

Collaborative Filtering Main idea: For a user X, find the set N of other users whose ratings are “similar” to X’s ratings
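The main idea above can be sketched in a few lines of plain Python. This is only an illustration, not the method used by Spark MLlib: it assumes cosine similarity as the "similar" measure, a similarity-weighted average for the prediction, and a made-up `ratings` dictionary with hypothetical user and item names.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}. Missing items are unrated.
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob": {"m1": 5, "m2": 5, "m4": 4},
    "carol": {"m1": 1, "m3": 5, "m4": 2},
    "dave": {"m2": 4, "m3": 1, "m4": 5},
}


def cosine_similarity(a, b):
    """Cosine similarity of two rating dicts, treating unrated items as 0."""
    dot = sum(a[i] * b[i] for i in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_users(ratings, x, n):
    """Return the set N: the n users whose ratings are most similar to x's."""
    scored = [
        (cosine_similarity(ratings[x], other), user)
        for user, other in ratings.items()
        if user != x
    ]
    scored.sort(reverse=True)
    return [user for _, user in scored[:n]]


def predict(ratings, x, item, n=2):
    """Estimate x's rating of item as a similarity-weighted average over N."""
    weighted = [
        (cosine_similarity(ratings[x], ratings[u]), ratings[u][item])
        for u in nearest_users(ratings, x, n)
        if item in ratings[u]
    ]
    total = sum(sim for sim, _ in weighted)
    if total == 0:
        return None  # no similar user has rated this item
    return sum(sim * r for sim, r in weighted) / total


neighbors = nearest_users(ratings, "alice", 2)
estimate = predict(ratings, "alice", "m4")
```

Comparing every pair of users this way does not scale to millions of users, which is one reason to turn to model-based approaches such as matrix factorization.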

Slide 13

Slide 13 text

User Preferences • This is a baby example • Users: > 2M • Items: > 30M • Sparsity: > 2%

Slide 14

Slide 14 text

Low Rank Assumption • The rating matrix can be approximated by the product of low-rank matrices • The low-rank dimensions are also understood as “latent factors” • We assume these factors represent hidden properties we cannot observe directly (Figure: latent factors labeled Action, Romance, Thriller)
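As a rough illustration of the low-rank assumption, the sketch below scores a user-item pair as the dot product of two short factor vectors. The three dimensions borrow the Action, Romance, and Thriller labels from the slide; the users, items, and numbers are invented for the example.

```python
# Latent dimensions, borrowed from the slide's figure labels.
GENRES = ("Action", "Romance", "Thriller")

# Hypothetical factors: how much each user likes each genre ...
user_factors = {
    "alice": [0.9, 0.1, 0.5],
    "bob": [0.1, 0.8, 0.2],
}
# ... and how much of each genre each item contains.
item_factors = {
    "action_movie": [1.0, 0.0, 0.6],
    "romance_movie": [0.0, 1.0, 0.1],
}


def score(user_vec, item_vec):
    """Predicted preference: dot product of the two factor vectors."""
    return sum(p * q for p, q in zip(user_vec, item_vec))


def reconstruct(users, items):
    """Rebuild the full user x item matrix from the two factor tables."""
    return {
        u: {i: score(p, q) for i, q in items.items()}
        for u, p in users.items()
    }


approx = reconstruct(user_factors, item_factors)
```

With m users, n items, and k factors, the two tables hold (m + n) * k numbers instead of m * n, which is what makes the factorized form practical for very large, sparse rating matrices.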

Slide 15

Slide 15 text

Low Rank Assumption (Figure: latent factors labeled Action, Romance, Thriller)

Slide 16

Slide 16 text

Matrix Factorization

Slide 17

Slide 17 text

• Our goal is to find P and Q that minimize the sum of squared errors (SSE) over the observed ratings
 • Prediction quality is measured by the root mean square error (RMSE)
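The formulas on this slide did not survive extraction. A common way to write the two quantities is sketched below. The notation is an assumption rather than the slide's own: K is the set of observed (user, item) pairs, r_{ui} is an observed rating, and p_u and q_i are the rows of P and Q for user u and item i. Practical implementations usually add a regularization term to the first expression.

```latex
% Sum of squared errors over the observed ratings K
\min_{P,\,Q} \; \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2

% Root mean square error of the resulting predictions
\mathrm{RMSE} = \sqrt{ \frac{1}{|K|} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2 }
```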
 
 
 


Slide 18

Slide 18 text

Alternating Least Squares • Because p and q are both unknown, the objective function is not convex • If we fix one of the unknowns, solving for the other becomes an ordinary least squares problem, so ALS alternates between fixing p and fixing q
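The alternation can be sketched in plain Python as follows. This is a minimal single-machine illustration, not Spark MLlib's distributed implementation. It assumes an L2 regularization weight `lam`, solves each small system with a hand-written Gaussian elimination, and uses a made-up `sample_ratings` list of (user, item, rating) triples.

```python
import random


def solve(a, b):
    """Solve the small dense system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def update(fixed, observed, k, lam):
    """Regularized least squares for one factor vector, the other side held fixed.

    observed is a list of (index into fixed, rating) pairs.
    """
    a = [[lam if r == c else 0.0 for c in range(k)] for r in range(k)]
    b = [0.0] * k
    for j, rating in observed:
        v = fixed[j]
        for r in range(k):
            b[r] += rating * v[r]
            for c in range(k):
                a[r][c] += v[r] * v[c]
    return solve(a, b)


def rmse(ratings, p, q):
    """Root mean square error over the observed ratings."""
    se = sum((r - sum(x * y for x, y in zip(p[u], q[i]))) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5


def als(ratings, n_users, n_items, k=2, lam=0.1, iterations=10, seed=0):
    """Alternate between solving for user factors p and item factors q."""
    rng = random.Random(seed)
    p = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    q = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    history = [rmse(ratings, p, q)]  # RMSE of the random starting point
    for _ in range(iterations):
        p = [update(q, by_user[u], k, lam) for u in range(n_users)]  # q fixed
        q = [update(p, by_item[i], k, lam) for i in range(n_items)]  # p fixed
        history.append(rmse(ratings, p, q))
    return p, q, history


# Hypothetical toy data: two groups of users with opposite tastes.
sample_ratings = [
    (0, 0, 5), (0, 1, 5), (0, 2, 1),
    (1, 0, 5), (1, 2, 1), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 5),
    (3, 1, 1), (3, 2, 5), (3, 3, 5),
]
p, q, history = als(sample_ratings, n_users=4, n_items=4, iterations=20)
```

Each user's update depends only on the fixed item factors and that user's own ratings (and vice versa), so the updates within one half-step are independent of each other. That independence is what makes the algorithm a natural fit for parallel execution.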

Slide 19

Slide 19 text

Amazon Reviews Dataset • 35 million ratings, 6.6 million users, 2.4 million products • Run on 16 nodes (m3.2xlarge)
 https://github.com/apache/spark/pull/3720

Slide 20

Slide 20 text

Resources

Slide 21

Slide 21 text

Resources

Slide 22

Slide 22 text

And More Resources
 • Source code examples: https://github.com/apache/spark/tree/master/examples
 • Apache Spark JIRA: https://issues.apache.org/jira/browse/spark

Slide 23

Slide 23 text

Dataset
 • MovieLens Dataset: http://grouplens.org/datasets/movielens/
 • “ratings.dat”: UserID::MovieID::Rating::Timestamp
 • “movies.dat”: MovieID::Title::Genres
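The two record layouts above can be parsed with a few lines of Python. This is a sketch that assumes well-formed lines and that multiple genres are separated by "|". The `Rating` and `Movie` tuple names and the sample lines are chosen for illustration only.

```python
from collections import namedtuple

# Illustrative record types; the field names mirror the layouts on the slide.
Rating = namedtuple("Rating", "user_id movie_id rating timestamp")
Movie = namedtuple("Movie", "movie_id title genres")


def parse_rating(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


def parse_movie(line):
    """Parse one 'MovieID::Title::Genres' line; genres assumed '|'-separated."""
    movie_id, title, genres = line.strip().split("::")
    return Movie(int(movie_id), title, genres.split("|"))


# Sample lines in the documented format.
sample_rating = parse_rating("1::1193::5::978300760\n")
sample_movie = parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy\n")
```

In a Spark job the same functions could be mapped over the lines of each file, for example with `sc.textFile(path).map(parse_rating)`, where `sc` is an existing SparkContext and `path` is wherever the dataset was unpacked.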

Slide 24

Slide 24 text

Conclusion • Spark MLlib is growing fast, but still needs time to mature • Spark MLlib is a strong tool if you use it right • Sharpening your ML skills is the first priority

Slide 25

Slide 25 text

Q&A
 • Visit me on: http://blog.chuyuhsu.ml
 • GitHub: http://github.com/ChuyuHsu
 Thanks

Slide 26

Slide 26 text

References
 • https://spark.apache.org/docs/latest/mllib-guide.html
 • http://www.slideshare.net/jeykottalam/mllib
 • http://www.slideshare.net/PetrZapletal1/mllib-and-machine-learning-on-spark
 • https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
 • https://github.com/apache/spark/pull/3720
 • https://www.hakkalabs.co/articles/spark-mllib-making-practical-machine-learning-easy-and-scalable
 • http://www.slideshare.net/databricks/practical-machine-learning-pipelines-with-mllib