Lightning-fast Machine Learning with Spark
Probst Ludwine
November 11, 2014
Programming
Transcript
@nivdul #DV14 #MLwithSpark
Lightning-fast Machine Learning with Spark
Ludwine Probst

Me
Data engineer; leader of Duchess France

Machine Learning

Lay of the land: MapReduce

MapReduce

HDFS with iterative algorithms

Spark is a fast and general engine for large-scale data processing

• big data analytics in memory / on disk
• complements Hadoop
• faster and more flexible
• Resilient Distributed Datasets (RDD)
• shared variables

Shared variables

broadcast variables

    val broadcastVar = sc.broadcast(Array(1, 2, 3))

accumulators

    val acc = sc.accumulator(0, "MyAccumulator")
    sc.parallelize(Array(1, 2, 3)).foreach(x => acc += x)
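
The slide only creates the shared variables. A minimal sketch of how they might be read back is given here, assuming a local master and Spark 2.0 or later, where `sc.longAccumulator` is the replacement for the now-deprecated `sc.accumulator` shown on the slide. The object name `SharedVariablesSketch`, the `run` helper and the lookup map are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  // Returns the names resolved through the broadcast and the accumulated sum.
  def run(sc: SparkContext): (Seq[String], Long) = {
    // Broadcast: a read-only value shipped once to each executor.
    val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))
    val names = sc.parallelize(Seq(1, 2, 3))
      .map(x => lookup.value(x)) // tasks read the broadcast through .value
      .collect()
      .toSeq

    // Accumulator: tasks can only add to it; the driver reads the total.
    val acc = sc.longAccumulator("MyAccumulator")
    sc.parallelize(Seq(1, 2, 3)).foreach(x => acc.add(x.toLong))

    (names, acc.sum)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shared variables").setMaster("local")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Accumulator updates made inside an action such as `foreach` are the ones the driver can rely on; updates made inside transformations may be applied more than once if a task is re-executed.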

RDD (Resilient Distributed Datasets)
• processed in parallel
• controllable persistence (memory, disk…)
• higher-level operations (transformations & actions)
• rebuilt automatically using lineage
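
The RDD properties listed above can be sketched in a few lines. This is an illustrative example rather than code from the talk: the object name `RddSketch`, the `run` helper and the input range are assumptions, and it presumes a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddSketch {
  // Returns (number of elements, their sum, textual lineage of the RDD).
  def run(sc: SparkContext): (Long, Int, String) = {
    // Distribute a local collection across 4 partitions (parallel processing).
    val numbers = sc.parallelize(1 to 100, numSlices = 4)

    // Transformations are lazy: nothing is computed at this point.
    val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

    // Controllable persistence: keep in memory, spill to disk if needed.
    evenSquares.persist(StorageLevel.MEMORY_AND_DISK)

    // Actions trigger the computation; the second one reuses the persisted data.
    val howMany = evenSquares.count()
    val total = evenSquares.reduce(_ + _)

    // The lineage Spark would replay to rebuild a lost partition.
    (howMany, total, evenSquares.toDebugString)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (howMany, total, lineage) = run(sc)
      println(lineage)
      println(s"count = $howMany, sum = $total")
    } finally sc.stop()
  }
}
```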

Data storage: InputFormat, Cassandra

Spark data flow

Languages
• interactive shell (Scala & Python)
• lambdas (Java 8)

WordCount example (Scala)

    import org.apache.spark.{SparkConf, SparkContext}
    // needed for pair-RDD operations such as reduceByKey before Spark 1.3
    import org.apache.spark.SparkContext._

    val conf = new SparkConf()
      .setAppName("Spark word count")
      .setMaster("local")

    val sc = new SparkContext(conf)

    // load the data
    val data = sc.textFile("filepath/wordcount.txt")

    // map then reduce step
    val wordCounts = data.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // persist the data
    wordCounts.cache()

    // keep words which appear at least 3 times
    val filteredWordCount = wordCounts.filter { case (key, value) => value > 2 }

    filteredWordCount.count()

Spark ecosystem

Spark Streaming makes it easy to build scalable fault-tolerant streaming applications

Spark SQL unifies access to structured data

GraphX is Apache Spark's API for graphs and graph-parallel computation

MLlib is Apache Spark's scalable machine learning library

Machine learning with Spark / MLlib

Machine learning libraries: scikit

Example: build a movie recommender system

Collaborative filtering with Alternating Least Squares (ALS)
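
For reference, the textbook matrix-factorization objective that ALS minimizes can be written as below. The notation is not from the slides: $r_{ui}$ is the rating user $u$ gave movie $i$, $\mathcal{K}$ is the set of observed ratings, $x_u, y_i \in \mathbb{R}^k$ are the latent factor vectors, $k$ is the rank and $\lambda$ is the regularization weight. MLlib's implementation may scale the regularization term differently, so this is a sketch of the general idea rather than its exact loss.

```latex
\min_{X,\,Y} \;
\sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^{\top} y_i \right)^2
\; + \; \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)
```

With $Y$ held fixed the problem is an ordinary regularized least-squares fit for each $x_u$, and with $X$ held fixed the same holds for each $y_i$; ALS alternates between the two for a fixed number of iterations.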

userID  movieID  rating
1       3        5
1       28       4
2       18       3
2       5        5

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Load and parse the data
    val data = sc.textFile("movies.txt")

    // create an RDD[Rating]
    val ratings = data.map(_.split("\\s+") match {
      case Array(user, movie, rate) =>
        Rating(user.toInt, movie.toInt, rate.toDouble)
    })

    // split the data into training set and test set
    val splits = ratings.randomSplit(Array(0.8, 0.2))

    // persist the training set
    val training = splits(0).cache()
    val test = splits(1)

    // Build the recommendation model using ALS
    // (rank = number of latent factors, lambda = regularization parameter)
    val model = ALS.train(training, rank = 10, iterations = 20, lambda = 1.0)

    // Evaluate the model
    val userMovies = test.map { case Rating(user, movie, rate) => (user, movie) }

    val predictions = model.predict(userMovies).map {
      case Rating(user, movie, rate) => ((user, movie), rate)
    }

    val ratesAndPreds = test.map {
      case Rating(user, movie, rate) => ((user, movie), rate)
    }.join(predictions)

    // measure the Mean Squared Error of the rating predictions
    val MSE = ratesAndPreds.map { case ((user, movie), (r1, r2)) =>
      val err = r1 - r2
      err * err
    }.mean()
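
The evaluation step joins actual and predicted ratings on the (user, movie) key and averages the squared differences. The same arithmetic can be checked on plain Scala collections without a cluster; in this sketch the object name `MseSketch` and the sample ratings are hypothetical, and the local `flatMap` over `Map.get` stands in for the inner join that `RDD.join` performs.

```scala
object MseSketch {
  // Mean squared error over (actual, predicted) pairs.
  def mse(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (actual, predicted) =>
      val err = actual - predicted
      err * err
    }.sum / pairs.size

  // Inner join on the key: pairs without a match on both sides are dropped.
  def joinRatings(
      actual: Map[(Int, Int), Double],
      predicted: Map[(Int, Int), Double]): Seq[(Double, Double)] =
    actual.toSeq.flatMap { case (key, r1) =>
      predicted.get(key).map(r2 => (r1, r2))
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical ratings keyed by (userID, movieID).
    val actual = Map((1, 3) -> 5.0, (1, 28) -> 4.0, (2, 18) -> 3.0)
    val predicted = Map((1, 3) -> 4.5, (1, 28) -> 4.0, (2, 5) -> 2.0)

    val error = mse(joinRatings(actual, predicted))
    println(s"MSE = $error")
    println(s"RMSE = ${math.sqrt(error)}")
  }
}
```

The root of the MSE (RMSE) is often reported instead because it is expressed in the same unit as the ratings.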

    // recommending movies: top 10 products for user 2, best rating first
    val recommendations = model.recommendProducts(2, 10)
      .sortBy(- _.rating)

    var i = 1
    recommendations.foreach { r =>
      println(r.product + " with rating " + r.rating)
      i += 1
    }

Performance: Spark core vs. Hadoop MapReduce
How fast can a system sort 100 TB of data on disk?
http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html

Performance: Spark / MLlib
Collaborative filtering with MLlib vs. Mahout
https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html

Why should I care?
• fast and easy machine learning with MLlib
• fast & flexible
• in-memory / on-disk
• SQL, Streaming, MLlib