Spark Machine Learning 101 @HadoopCon
Chu-Yu Hsu
September 19, 2015
Transcript
Spark Machine Learning 101 Chu-Yu Hsu @ HadoopCon 2015
About Me
Chu-Yu Hsu, 許儲羽
• Software Engineer
• Machine Learning Practitioner
• Uses Spark ML and Python in daily work and Kaggle competitions
• http://blog.chuyuhsu.ml
Outline
• Introduction to Spark ML
• Alternating Least Squares (ALS)
• Hands-on example
Apache Spark MLlib
• Goal: make practical machine learning easy and scalable
• spark.mllib - the primary API
• spark.ml - a higher-level API for constructing ML workflows
What’s in MLlib
• Utilities: data types, basic statistics
• Classification and regression: SVM, logistic regression, linear regression, naive Bayes, decision trees, ensembles of trees, isotonic regression
• Collaborative filtering: alternating least squares (ALS)
• Clustering: k-means, Gaussian mixture, power iteration clustering, latent Dirichlet allocation, streaming k-means
• Dimensionality reduction: SVD, PCA
• Frequent pattern mining: FP-growth
• Optimization: stochastic gradient descent, limited-memory BFGS
https://spark.apache.org/docs/latest/mllib-guide.html
ML Workflow can be VERY complex
Types of Recommenders
• Editorial and hand-curated
• Simple aggregates
• Tailored to individual users
Who Uses Recommenders
Approaches
• Content-based method
• Item-based method
• Model-based method
Collaborative Filtering
• One of the best-known recommendation algorithms
• Widely used in e-commerce applications
• The data size can be enormous
• Results need to be delivered as quickly as possible
Collaborative Filtering
Main idea: find a set N of other users whose ratings are “similar” to X’s ratings
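The main idea above can be sketched in a few lines of plain Python. This is only an illustration, not the method used in the talk or in Spark: the choice of cosine similarity, the function names `cosine_similarity` and `nearest_users`, and the tiny `ratings` table are all assumptions made for the example.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity of two users' rating dicts (missing items count as 0)."""
    dot = sum(a[i] * b[i] for i in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_users(ratings, x, n=2):
    """Return the n users whose ratings are most similar to user x's."""
    scores = [
        (cosine_similarity(ratings[x], other), user)
        for user, other in ratings.items()
        if user != x
    ]
    scores.sort(reverse=True)  # highest similarity first
    return [user for _, user in scores[:n]]

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "X": {"m1": 5, "m2": 4, "m3": 1},
    "A": {"m1": 5, "m2": 5, "m3": 1},
    "B": {"m1": 1, "m2": 1, "m3": 5},
    "C": {"m1": 4, "m2": 4},
}

neighbours = nearest_users(ratings, "X", n=2)
```

Here users A and C rate the same items highly as X does, so they form the set N, while B has opposite tastes and is left out.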
User Preferences
• This is a toy example
• Users: > 2M
• Items: > 30M
• Sparsity: > 2%
Low Rank Assumption
• The matrix can be reduced to the product of low-rank matrices
• The low-rank dimensions are also understood as “latent factors”
• We assume that a small number of factors can represent the hidden factors we do not know
[Figure: latent factors labeled Action, Romance, Thriller]
Low Rank Assumption
[Figure: two panels, each labeled Action, Romance, Thriller]
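As a rough illustration of the low-rank idea, a rating can be modeled as the dot product of a user's affinity for each latent factor and an item's loading on the same factors. The user names, item names, and numbers below are hypothetical; only the factor labels come from the slide.

```python
FACTORS = ["Action", "Romance", "Thriller"]

# Hypothetical user factors: how much each user likes each latent factor.
P = {
    "alice": [1.0, 0.0, 0.5],
    "bob": [0.0, 1.0, 0.0],
}

# Hypothetical item factors: how strongly each item expresses each factor.
Q = {
    "action_movie": [4.0, 0.0, 2.0],
    "romance_movie": [1.0, 4.0, 0.0],
}

def predict(user, item):
    """Predicted rating = dot product of user and item factor vectors."""
    return sum(p * q for p, q in zip(P[user], Q[item]))

# The full (users x items) matrix is just the product of the two small ones.
R = {user: {item: predict(user, item) for item in Q} for user in P}
```

Two users and two items are described with only three numbers each, which is the compression the low-rank assumption relies on when the real matrix has millions of rows and columns.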
Matrix Factorization
• Our goal is to find P and Q that minimize the sum of squared errors (SSE) over the observed ratings
• Root mean square error (RMSE)
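The formulas on this slide did not survive in the transcript. In the standard textbook formulation, which is assumed here rather than taken from the slide, they read as follows. The symbols are assumptions: K is the set of observed (user, item) pairs, r_ui is the observed rating, and p_u and q_i are the rows of P and Q for user u and item i.

```latex
% Sum of squared errors minimized over P and Q
\min_{P,\,Q} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2

% Root mean square error, commonly used to evaluate the fit
\mathrm{RMSE} = \sqrt{ \frac{1}{|K|} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2 }
```

In practice a regularization term such as \lambda (\sum_u \|p_u\|^2 + \sum_i \|q_i\|^2) is usually added to the objective to limit overfitting.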
Alternating Least Squares
• Because p and q are both unknown, the objective function is not convex
• If we fix one of the unknowns, the other can be solved for as a least squares problem
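The alternation described above can be sketched in plain Python. This is a minimal teaching sketch and not Spark's implementation: the L2 regularization term `lam`, the helper names (`solve`, `update`, `objective`, `als`), and the toy ratings are all assumptions made for the example.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [A[r][:] + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[r][n] / M[r][r] for r in range(n)]

def update(fixed, observed, rank, lam):
    """With one side fixed, solve a regularized least squares problem per row."""
    new = {}
    for row, obs in observed.items():
        # Normal equations: (F^T F + lam * I) x = F^T r
        A = [[lam if a == b else 0.0 for b in range(rank)] for a in range(rank)]
        rhs = [0.0] * rank
        for col, r in obs.items():
            f = fixed[col]
            for a in range(rank):
                rhs[a] += r * f[a]
                for b in range(rank):
                    A[a][b] += f[a] * f[b]
        new[row] = solve(A, rhs)
    return new

def objective(ratings, P, Q, lam):
    """Sum of squared errors plus the L2 penalty on all factors."""
    err = sum((r - dot(P[u], Q[i])) ** 2 for (u, i), r in ratings.items())
    reg = sum(dot(p, p) for p in P.values()) + sum(dot(q, q) for q in Q.values())
    return err + lam * reg

def als(ratings, rank=2, lam=0.1, iterations=10, seed=0):
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for (u, i), r in ratings.items():
        by_user.setdefault(u, {})[i] = r
        by_item.setdefault(i, {})[u] = r
    Q = {i: [rng.random() for _ in range(rank)] for i in by_item}
    history = []
    for _ in range(iterations):
        P = update(Q, by_user, rank, lam)  # fix Q, solve for P
        Q = update(P, by_item, rank, lam)  # fix P, solve for Q
        history.append(objective(ratings, P, Q, lam))
    return P, Q, history

# Hypothetical toy data: (user, item) -> rating
ratings = {
    ("u1", "m1"): 5.0, ("u1", "m2"): 4.0,
    ("u2", "m1"): 4.0, ("u2", "m3"): 1.0,
    ("u3", "m2"): 1.0, ("u3", "m3"): 5.0,
    ("u4", "m1"): 5.0, ("u4", "m3"): 2.0,
}
P, Q, history = als(ratings)
```

Each half-step is an exact least squares solve, so the regularized objective recorded in `history` never increases from one iteration to the next, even though the joint problem is not convex.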
Amazon Reviews Dataset
• 35 million ratings, 6.6 million users, 2.4 million products
• Run on a 16-node (m3.2xlarge) cluster
• https://github.com/apache/spark/pull/3720
Resources
And More Resources
• Source code examples: https://github.com/apache/spark/tree/master/examples
• Apache Spark JIRA: https://issues.apache.org/jira/browse/spark
Dataset
• MovieLens Dataset: http://grouplens.org/datasets/movielens/
• “ratings.dat”: UserID::MovieID::Rating::Timestamp
• “movies.dat”: MovieID::Title::Genres
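The two file layouts above can be parsed with a few lines of plain Python. This is a sketch under stated assumptions: the `Rating` type and the `parse_rating` and `parse_movie` helpers are hypothetical names, the sample lines are illustrative, and splitting genres on "|" is an assumption about the file contents that the slide does not state.

```python
from typing import NamedTuple

class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    timestamp: int

def parse_rating(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))

def parse_movie(line):
    """Parse one 'MovieID::Title::Genres' line into (id, title, genre list)."""
    movie_id, title, genres = line.strip().split("::")
    # Assumption: multiple genres are separated by '|'.
    return int(movie_id), title, genres.split("|")

# Illustrative sample lines in the documented layout
rating = parse_rating("1::1193::5::978300760\n")
movie = parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy\n")
```

The parsed `(user_id, movie_id, rating)` triples are the usual input shape for a collaborative filtering model such as ALS.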
Conclusion
• Spark MLlib is growing fast, but it still needs some time to mature
• Spark MLlib is a strong tool if you use it right
• Sharpening your ML skills is the first priority
Q&A
• Visit me on: http://blog.chuyuhsu.ml
• GitHub: http://github.com/ChuyuHsu
Thanks
References
• https://spark.apache.org/docs/latest/mllib-guide.html
• http://www.slideshare.net/jeykottalam/mllib
• http://www.slideshare.net/PetrZapletal1/mllib-and-machine-learning-on-spark
• https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
• https://github.com/apache/spark/pull/3720
• https://www.hakkalabs.co/articles/spark-mllib-making-practical-machine-learning-easy-and-scalable
• http://www.slideshare.net/databricks/practical-machine-learning-pipelines-with-mllib