About Me
Chu-Yu Hsu, 許儲羽
• Software Engineer
• Machine Learning Practitioner
• Uses Spark ML and Python in
daily work and Kaggle
competitions
• http://blog.chuyuhsu.ml
Slide 3
Outline
• Introduction to Spark ML
• Alternating Least Squares (ALS)
• Hands-on example
Slide 5
Apache Spark MLlib
• Goal: make practical machine
learning easy and scalable
• spark.mllib - the primary API
• spark.ml - a higher-level API
for constructing ML workflows
Slide 6
What’s in MLlib
• Utilities: data types, basic statistics
• Classification and regression: SVM, logistic regression, linear regression, naive Bayes, decision trees, ensembles of trees, isotonic regression
• Collaborative filtering: alternating least squares (ALS)
• Clustering: k-means, Gaussian mixture, power iteration clustering, latent Dirichlet allocation, streaming k-means
• Dimensionality reduction: SVD, PCA
• Frequent pattern mining: FP-growth
• Optimization: stochastic gradient descent, limited-memory BFGS
https://spark.apache.org/docs/latest/mllib-guide.html
Slide 7
ML Workflow
can be VERY complex
Slide 8
Types of Recommenders
• Editorial and hand curated
• Simple aggregates
• Tailored to individual users
Slide 9
Who Uses
Recommenders
Slide 10
Approaches
• Content-based method
• Item-based method
• Model-based method
Slide 11
Collaborative Filtering
• One of the best-known recommendation algorithms
• Widely used in e-commerce applications
• The data size can be enormous
• Recommendations need to be delivered as quickly as possible
Slide 12
Collaborative Filtering
Main idea: find a set N of other users whose ratings are
“similar” to user X’s ratings
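A minimal sketch of the "find similar users" idea in plain Python (not Spark). It assumes cosine similarity as the similarity measure, which the slide does not specify; the `ratings` data and the function names are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating dicts {item: rating}."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(ratings, target, k):
    """Return the k users most similar to `target`, most similar first."""
    scores = [
        (cosine_similarity(ratings[target], other), user)
        for user, other in ratings.items()
        if user != target
    ]
    scores.sort(reverse=True)
    return [user for _, user in scores[:k]]


# Hypothetical toy data: user -> {item: rating}
ratings = {
    "X": {"a": 5, "b": 4, "c": 1},
    "u1": {"a": 5, "b": 5, "c": 1},
    "u2": {"a": 1, "b": 1, "c": 5},
    "u3": {"a": 4, "b": 4},
}

# The set N of users whose ratings look most like X's
neighbors = nearest_neighbors(ratings, "X", k=2)
```

Here `neighbors` plays the role of the set N. Ratings from those users could then be combined to estimate X's rating for items X has not seen.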
Slide 13
User Preferences
• This is a toy example
• Users: > 2M
• Items: > 30M
• Sparsity: > 2%
Slide 14
Low Rank Assumption
• The rating matrix can be approximated by the
product of low-rank matrices
• The low-rank dimensions are also known as
“latent factors”
• We assume that these few factors
can represent the hidden
factors we do not observe
(Figure: example latent factors: Action, Romance, Thriller)
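A small sketch of the low-rank idea in plain Python. The factor values are made up for illustration. Each user row in `P` and each item row in `Q` holds one weight per latent factor (action, romance, thriller), and a predicted rating is their dot product.

```python
def matmul_transpose(P, Q):
    """Return P times Q-transpose: entry [u][i] is the dot product of P[u] and Q[i]."""
    return [
        [sum(p * q for p, q in zip(p_row, q_row)) for q_row in Q]
        for p_row in P
    ]


# Hypothetical factors over (action, romance, thriller)
P = [  # one row per user: how much the user likes each factor
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
Q = [  # one row per item: how strongly the item expresses each factor
    [4.0, 0.0, 2.0],
    [0.0, 5.0, 0.0],
    [1.0, 1.0, 4.0],
]

# Reconstructed 2 x 3 rating matrix
R_hat = matmul_transpose(P, Q)
```

With these values the first user, who likes action and some thriller, gets a high predicted rating for the first item and a zero for the pure romance item.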
• Our goal is to find P and Q that minimize the sum of
squared errors (SSE) over the known ratings
• The fit is measured by root mean square error (RMSE)
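One common way to write these two quantities, given here as a sketch. The notation is an assumption and is not taken from the slides: r_{ui} is the known rating of item i by user u, p_u and q_i are the corresponding rows of P and Q, and K is the set of (u, i) pairs with a known rating.

```latex
\min_{P,\,Q} \; \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2

\mathrm{RMSE} = \sqrt{ \frac{1}{|K|} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2 }
```

In practice a regularization term on P and Q is often added to the first expression to limit overfitting.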
Slide 18
Alternating Least Squares
• Because P and Q are both unknown, the objective
function is not convex
• If we fix one of the unknowns, the other can be solved as a
least squares problem
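The alternation described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not Spark's implementation. It adds an L2 regularization term `reg` (an assumption) so every small linear system is solvable, and the toy `ratings` and all function names are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def update_factors(fixed, observed, rank, reg):
    """Hold `fixed` constant and solve one regularized least squares problem per row.

    observed[row] is a list of (index into fixed, rating) pairs.
    """
    updated = []
    for pairs in observed:
        # Normal equations: (sum f f^T + reg * I) x = sum rating * f
        A = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for idx, rating in pairs:
            f = fixed[idx]
            for i in range(rank):
                b[i] += rating * f[i]
                for j in range(rank):
                    A[i][j] += f[i] * f[j]
        updated.append(solve(A, b))
    return updated


def regularized_loss(ratings, P, Q, reg):
    sse = sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings)
    penalty = reg * (sum(dot(p, p) for p in P) + sum(dot(q, q) for q in Q))
    return sse + penalty


def rmse(ratings, P, Q):
    sse = sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings)
    return (sse / len(ratings)) ** 0.5


def als(ratings, n_users, n_items, rank=2, reg=0.1, iterations=10, seed=0):
    rng = random.Random(seed)
    Q = [[rng.random() for _ in range(rank)] for _ in range(n_items)]
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    losses = []
    for _ in range(iterations):
        P = update_factors(Q, by_user, rank, reg)  # fix Q, solve for P
        Q = update_factors(P, by_item, rank, reg)  # fix P, solve for Q
        losses.append(regularized_loss(ratings, P, Q, reg))
    return P, Q, losses


# Hypothetical toy data: (user, item, rating) triples
ratings = [
    (0, 0, 5.0), (0, 1, 4.0), (0, 2, 1.0),
    (1, 0, 4.0), (1, 1, 5.0), (1, 3, 1.0),
    (2, 2, 5.0), (2, 3, 4.0), (2, 0, 1.0),
]
P, Q, losses = als(ratings, n_users=3, n_items=4)
```

Each half-step solves its subproblem exactly, so the regularized loss in `losses` does not increase from one iteration to the next. In Spark, the `pyspark.ml.recommendation.ALS` estimator exposes comparable knobs as `rank`, `maxIter` and `regParam`.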
Slide 19
Amazon Reviews Dataset
35 million ratings, 6.6 million users, 2.4 million products
on a 16-node cluster (m3.2xlarge)
https://github.com/apache/spark/pull/3720
Slide 20
Resources
Slide 22
And More Resources
• Source code examples
https://github.com/apache/spark/tree/master/examples
• Apache Spark JIRA
https://issues.apache.org/jira/browse/spark
Conclusion
• Spark MLlib is growing fast, but still needs some time to mature
• Spark MLlib is a strong tool if you use it right
• Sharpening your ML skills is the first priority
Slide 25
Q&A
Visit me on:
http://blog.chuyuhsu.ml
Github:
http://github.com/ChuyuHsu
Thanks