Apache Spark & MLlib


2 PARTS:
1. APACHE SPARK: what is it?
2. MLlib: how do I use it on Spark?

PART I: APACHE SPARK, what is it?

DATA IS EVERYWHERE

DATA IS UNLIMITED

DATA IS BIG

AND BIG DATA IS HARD

CHALLENGES WITH BIG DATA
1. Capturing
2. Storing
3. Analyzing

REAL WORLD EXAMPLE

1 TWEET
@addamh: Tacos are the best . Tom Brady is the best. So, Tom Brady is .

1 TWEET = ~200 BYTES

~200 BYTES / TWEET X 6000 TWEETS / SECOND = 1.2 MB / SECOND

1.2 MB / SECOND X 24 HOURS = ~100 GB / DAY

~100 GB / DAY X 30 DAYS / MONTH = ~3 TB / MONTH
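The estimate can be checked in a few lines of Python. This is a rough sketch that assumes the slide's inputs (about 200 bytes per tweet, 6000 tweets per second) and decimal units (1 MB = 10^6 bytes); the constant names are illustrative only.

```python
# Inputs taken from the slides; both are rough averages.
BYTES_PER_TWEET = 200
TWEETS_PER_SECOND = 6000

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30

bytes_per_second = BYTES_PER_TWEET * TWEETS_PER_SECOND
bytes_per_day = bytes_per_second * SECONDS_PER_DAY
bytes_per_month = bytes_per_day * DAYS_PER_MONTH

# Decimal units: 1 MB = 1e6 bytes, 1 GB = 1e9 bytes, 1 TB = 1e12 bytes.
print(f"{bytes_per_second / 1e6:.1f} MB/second")
print(f"{bytes_per_day / 1e9:.1f} GB/day")
print(f"{bytes_per_month / 1e12:.2f} TB/month")
```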

…and that’s just Twitter…on an average day.

BIG DATA IS NOT BULLSHIT

CHALLENGES WITH BIG DATA
1. Capturing
2. Storing
3. Analyzing

SPARK IS A DISTRIBUTED COMPUTING PLATFORM

SPARK IS NOT A BETTER HADOOP.

WORKS IN MEMORY. NOT ON DISK.

USES RESILIENT DISTRIBUTED DATASETS.

WHAT IS AN RDD?

[Diagram: a data array becomes an RDD that is split into partitions; each partition is held in RAM on one of the cluster hosts (Host 01, Host 02, ... Host n).]
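The partitioning idea can be illustrated without a cluster. The sketch below is plain Python, not Spark: `split_into_partitions` is a hypothetical helper that cuts a list into chunks, each chunk computes a partial result on its own, and the partial results are combined at the end. On a real cluster each chunk would sit in the RAM of a different host.

```python
def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


data = list(range(1, 11))
partitions = split_into_partitions(data, 4)

# Each partition is processed independently (here: a sum of squares) ...
partial_sums = [sum(x * x for x in part) for part in partitions]

# ... and the partial results are combined into the final answer.
total = sum(partial_sums)
print(partitions)
print(total)
```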

MACHINE LEARNING ON SPARK? MLlib

WHY MLlib?

BUILT ON SPARK. FOR SPARK.

ML ALGORITHMS WITH UP TO 100X SPEED INCREASES OVER MAPREDUCE

PART II: MLlib, how do I use it on Spark?

MLlib is a machine learning toolbox that is tightly integrated into Spark and has an RDD API. This allows MLlib’s algorithms to run on a distributed Spark cluster.

Classification: Logistic Regression, Naive Bayes
Decision Trees: Random Forests, Gradient-Boosted Trees
Clustering: K-means
Topic Modeling: Latent Dirichlet Allocation
Recommendation: Alternating Least Squares
Regression: Linear Regression, Isotonic Regression

COLLABORATIVE FILTERING

[Figure: a grid of known ratings, with "?" marking the ratings to predict]

Our data as a dense matrix:

ω =
3 5 ? 5
2 5 1 4
? 4 2 3
5 2 ? 4

Minimization Problem: Non-Convex Problem

WHERE IS THE GLOBAL MINIMUM? Non-Convex: not easy to find.

LOW-RANK MATRIX FACTORIZATION IS A NON-CONVEX PROBLEM. THIS MAKES FINDING THE GLOBAL MINIMUM VERY COSTLY. THE ALTERNATING LEAST SQUARES ALGORITHM FIXES ONE FACTOR AT A TIME, SO EACH STEP BECOMES A CONVEX LEAST-SQUARES PROBLEM.

WHERE IS THE GLOBAL MINIMUM? Convex: much easier.
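To make the alternating idea concrete, here is a minimal plain-Python sketch of rank-1 alternating least squares. It is an illustration under simplifying assumptions, not MLlib's implementation: one latent factor per user and per movie, a plain L2 penalty `lam`, and a made-up toy rating set. With the movie factors held fixed, each user factor has a closed-form least-squares solution, and vice versa.

```python
def objective(ratings, u, v, lam):
    """Squared error on observed ratings plus an L2 penalty on the factors."""
    fit = sum((r - u[i] * v[j]) ** 2 for i, j, r in ratings)
    reg = lam * (sum(x * x for x in u) + sum(x * x for x in v))
    return fit + reg


def als_rank1(ratings, n_users, n_items, lam=0.1, iterations=10):
    u = [1.0] * n_users   # one factor per user
    v = [1.0] * n_items   # one factor per movie
    history = []
    for _ in range(iterations):
        # Hold v fixed: each u[i] is a one-variable least-squares problem.
        for i in range(n_users):
            num = sum(r * v[j] for a, j, r in ratings if a == i)
            den = lam + sum(v[j] ** 2 for a, j, r in ratings if a == i)
            u[i] = num / den
        # Hold u fixed: same closed form for each v[j].
        for j in range(n_items):
            num = sum(r * u[i] for i, b, r in ratings if b == j)
            den = lam + sum(u[i] ** 2 for i, b, r in ratings if b == j)
            v[j] = num / den
        history.append(objective(ratings, u, v, lam))
    return u, v, history


# Toy data (hypothetical): (user, movie, rating); missing pairs are unknown.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 1.0), (3, 0, 5.0), (3, 2, 2.0)]
u, v, history = als_rank1(ratings, n_users=4, n_items=3)

# Predicted rating for a pair that was never observed, user 0 and movie 2.
print(u[0] * v[2])
```

Because every half-step solves its subproblem exactly, the objective recorded in `history` never increases from one iteration to the next.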

IT’S NOT THIS HARD I PROMISE

MLlib HAS ALREADY DONE THE HARD WORK.

LET’S LOOK AT SOME CODE

COLLECT YOUR DATA

Ratings
userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
1,1172,4.0,1260759205
1,1263,2.0,1260759151
1,1287,2.0,1260759187
1,1293,2.0,1260759148
1,1339,3.5,1260759125
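The code shown on this slide is not part of the transcript. As a stand-in, the sketch below parses rows in the format above into (userId, movieId, rating) tuples using only the Python standard library. The `parse_ratings` helper and the inline sample are assumptions for illustration; the timestamp column is dropped because the later steps do not use it.

```python
import csv
import io

# A few rows copied from the ratings file shown above.
RAW = """userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
"""


def parse_ratings(text):
    """Turn CSV text into a list of (userId, movieId, rating) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(row["userId"]), int(row["movieId"]), float(row["rating"]))
            for row in reader]


ratings = parse_ratings(RAW)
print(ratings)
```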

PULL IN MLlib

SPLIT OUR DATA
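A seeded random split into training, validation and test sets might look like the sketch below. It is plain Python rather than Spark (PySpark RDDs offer `randomSplit` for the same job), and the 6/2/2 weights are an assumption, since the slide's actual proportions are not in the transcript.

```python
import random


def random_split(records, weights, seed=0):
    """Randomly assign each record to one of len(weights) buckets."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds in (0, 1], one per bucket.
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for rec in records:
        x = rng.random()
        for k, bound in enumerate(bounds):
            # The last bucket also catches any floating-point leftovers.
            if x < bound or k == len(bounds) - 1:
                splits[k].append(rec)
                break
    return splits


records = list(range(100))  # stand-in for the rating tuples
training, validation, test = random_split(records, [6, 2, 2], seed=42)
print(len(training), len(validation), len(test))
```

Fixing the seed keeps the split reproducible, so different rank settings are compared on the same validation set.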

TRAIN THE MODEL AND TUNE THE PARAMETERS

Train multiple models to find best rank parameter
Compute Root-Mean-Square Error
Lowest wins
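The selection step can be sketched as follows. The candidate ranks (4, 8, 12) and the per-rank predictions are made-up placeholders standing in for the output of trained models; only the RMSE computation and the "lowest wins" rule are the point here.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error over the (user, movie) keys both dicts share."""
    keys = predicted.keys() & actual.keys()
    return math.sqrt(sum((predicted[k] - actual[k]) ** 2 for k in keys) / len(keys))


# Held-out validation ratings, keyed by (userId, movieId).
validation = {(1, 31): 2.5, (1, 1029): 3.0, (1, 1061): 3.0}

# Placeholder predictions, one dict per candidate rank (hypothetical values).
candidates = {
    4: {(1, 31): 3.0, (1, 1029): 3.0, (1, 1061): 2.0},
    8: {(1, 31): 2.5, (1, 1029): 3.5, (1, 1061): 3.0},
    12: {(1, 31): 4.0, (1, 1029): 2.0, (1, 1061): 3.0},
}

errors = {rank: rmse(preds, validation) for rank, preds in candidates.items()}
best_rank = min(errors, key=errors.get)  # lowest validation RMSE wins
print(errors)
print(best_rank)
```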

INPUT YOUR NEW RATINGS

(userId, movieId, rating)
Distribute RDD to the Spark cluster

TRAIN A NEW MODEL BASED ON YOUR RATINGS

Merge new ratings with full rating set
Train an updated model with the new ratings
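In plain Python the merge is a list concatenation (on Spark RDDs the equivalent operation is `union`). The sketch assumes the new user gets an ID that no existing user has; `NEW_USER_ID = 0` and all rating values are hypothetical.

```python
# Existing ratings as (userId, movieId, rating) tuples.
existing_ratings = [(1, 31, 2.5), (1, 1029, 3.0), (2, 31, 4.0)]

# Ratings entered by the new user; 0 is assumed to be an unused ID.
NEW_USER_ID = 0
new_user_ratings = [(NEW_USER_ID, 31, 4.5), (NEW_USER_ID, 1061, 1.0)]

# Guard against accidentally reusing an existing user's ID.
if any(user == NEW_USER_ID for user, _movie, _rating in existing_ratings):
    raise ValueError("NEW_USER_ID is already in use")

# The merged set is what the updated model would be trained on.
merged_ratings = existing_ratings + new_user_ratings
print(len(merged_ratings))
```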

GET PREDICTIONS WITH THE UPDATED MODEL

Get IDs of movies just rated
Remove just rated movies from ratings RDD
Make predictions for unrated movies from full dataset
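The filtering step can be sketched in plain Python as below. `unrated_pairs` is a hypothetical helper and the IDs are sample values; the resulting (userId, movieId) pairs are what would be handed to the trained model for scoring.

```python
def unrated_pairs(user_id, all_movie_ids, user_ratings):
    """Return (user_id, movieId) pairs for movies the user has not rated."""
    rated = {movie for user, movie, _rating in user_ratings if user == user_id}
    return [(user_id, movie) for movie in sorted(all_movie_ids) if movie not in rated]


all_movie_ids = {31, 1029, 1061, 1129}           # every movie in the dataset
new_user_ratings = [(0, 31, 4.5), (0, 1061, 1.0)]  # what user 0 just rated

pairs_to_score = unrated_pairs(0, all_movie_ids, new_user_ratings)
print(pairs_to_score)
```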

FIND THE MOVIES YOU DIDN’T KNOW YOU ❤

Get top 10 recommendations
Print them out
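Picking the ten highest predictions is a small job for `heapq.nlargest`. The titles and scores below are generated placeholders, not real model output.

```python
import heapq

# Twelve placeholder (title, predicted rating) pairs with distinct scores.
predictions = [(f"movie-{i}", round(1.0 + (i * 7 % 12) * 0.3, 1)) for i in range(12)]

# Keep the 10 entries with the highest predicted rating, best first.
top_10 = heapq.nlargest(10, predictions, key=lambda pair: pair[1])

for title, score in top_10:
    print(f"{title}: {score}")
```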

YOU DON’T HAVE TO BE A DATA SCIENTIST TO START GETTING VALUE FROM THE TOOLS.

ADDAM HARDY @addamh Come talk to us about your data
