Slide 1

Lightning fast Machine Learning with Spark, by Ludwine Probst (@nivdul) #DV14 #MLwithSpark

Slide 2

About me: Data engineer at …, leader of Duchess France

Slide 3

Machine Learning

Slide 4

MapReduce: lay of the land

Slide 5

MapReduce

Slide 6

HDFS with iterative algorithms: each MapReduce iteration reads its input from and writes its output back to HDFS, so iterative ML algorithms pay the disk I/O cost at every step.

Slide 7

No content

Slide 8

Spark is a fast and general engine for large-scale data processing

Slide 9

• big data analytics in memory/disk
• complements Hadoop
• fast and more flexible
• Resilient Distributed Datasets (RDD)
• shared variables

Slide 10

Shared variables: broadcast variables & accumulators

// read-only variable, cached on each worker node
val broadcastVar = sc.broadcast(Array(1, 2, 3))

// counter that tasks can only add to
val acc = sc.accumulator(0, "MyAccumulator")
sc.parallelize(Array(1, 2, 3)).foreach(x => acc += x)
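
Both kinds of shared variables are read back on the driver through .value; a minimal follow-up to the snippet above:

// on the driver: read the broadcast contents and the accumulated total
println(broadcastVar.value.mkString(", "))   // 1, 2, 3
println(acc.value)                           // 6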

Slide 11

RDD (Resilient Distributed Datasets)
• processed in parallel
• controllable persistence (memory, disk…)
• higher-level operations (transformations & actions)
• rebuilt automatically using lineage
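
To make the transformations/actions distinction concrete, a minimal sketch (the file path is hypothetical):

// transformations are lazy: nothing runs yet
val lines = sc.textFile("filepath/data.txt")
val lengths = lines.map(_.length)

// keep the intermediate RDD in memory across actions
lengths.cache()

// actions trigger the actual computation
val totalChars = lengths.reduce(_ + _)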

Slide 12

Data storage: any Hadoop InputFormat (HDFS, Cassandra, …)

Slide 13

Spark data flow

Slide 14

Languages: interactive shell (Scala & Python), lambdas (Java 8)
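
As an illustration, the Scala shell ships with the Spark distribution and pre-defines a SparkContext named sc; a minimal hypothetical session:

$ ./bin/spark-shell
scala> sc.parallelize(1 to 10).sum()
res0: Double = 55.0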

Slide 15

WordCount example (Scala)

import org.apache.spark.{SparkConf, SparkContext}

// configure the application and create the Spark context
val conf = new SparkConf()
  .setAppName("Spark word count")
  .setMaster("local")

val sc = new SparkContext(conf)

Slide 16

// load the data
val data = sc.textFile("filepath/wordcount.txt")

// map then reduce step
val wordCounts = data.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// persist the data
wordCounts.cache()

Slide 17

// keep words which appear at least 3 times
val filteredWordCount = wordCounts.filter { case (key, value) => value > 2 }

filteredWordCount.count()

Slide 18

Spark ecosystem

Slide 19

Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications
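
A minimal Spark Streaming sketch, assuming some process writes text lines to localhost:9999; the port and batch interval are illustrative:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// micro-batches of 1 second on top of the existing SparkContext
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// the same word count as before, now over a live stream
val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()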

Slide 20

Spark SQL unifies access to structured data
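
A minimal sketch with the Spark 1.x SQLContext; people.txt and its name,age layout are hypothetical:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

// turn an RDD of case classes into a table that SQL can query
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")

val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)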

Slide 21

GraphX is Apache Spark's API for graphs and graph-parallel computation
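
A minimal GraphX sketch with a hypothetical three-user follower graph:

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// vertices carry a name, edges a relationship label
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Array((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))

val graph = Graph(vertices, edges)

// count incoming edges (followers) per vertex
graph.inDegrees.collect().foreach(println)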

Slide 22

MLlib is Apache Spark's scalable machine learning library

Slide 23

Machine learning with Spark / MLlib

Slide 24

Machine learning libraries (scikit-learn, …)

Slide 25

Example: build a movie recommender system

Slide 26

Collaborative filtering with Alternating Least Squares (ALS)
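
Sketching the idea behind ALS: approximate the sparse user × movie rating matrix by a product of two low-rank factor matrices, minimizing the regularized squared error over the observed ratings (u_i and m_j are the rank-k factor vectors, lambda the regularization weight):

\min_{U,M} \sum_{(i,j)\,\text{observed}} \left( r_{ij} - u_i^\top m_j \right)^2 + \lambda \Big( \sum_i \|u_i\|^2 + \sum_j \|m_j\|^2 \Big)

Fixing M turns the problem into an ordinary least-squares solve for U, and vice versa; ALS alternates the two until the error stops improving.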

Slide 27

userID  movieID  rating
1       3        5
1       28       4
2       18       3
2       5        5

Slide 28

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// load and parse the data
val data = sc.textFile("movies.txt")

// create an RDD[Rating] out of the (userID, movieID, rating) lines
val ratings = data.map(_.split("\\s+") match {
  case Array(user, movie, rate) =>
    Rating(user.toInt, movie.toInt, rate.toDouble)
})

Slide 29

// split the data into training set and test set
val splits = ratings.randomSplit(Array(0.8, 0.2))

// persist the training set
val training = splits(0).cache()
val test = splits(1)

Slide 30

// build the recommendation model using ALS
// rank = number of latent factors, lambda = regularization parameter
val model = ALS.train(training, rank = 10, iterations = 20, lambda = 1)

Slide 31

// evaluate the model on the test set
val userMovies = test.map { case Rating(user, movie, rate) =>
  (user, movie)
}

val predictions = model.predict(userMovies).map { case Rating(user, movie, rate) =>
  ((user, movie), rate)
}

val ratesAndPreds = test.map { case Rating(user, movie, rate) =>
  ((user, movie), rate)
}.join(predictions)

// measure the Mean Squared Error of the rating predictions
val MSE = ratesAndPreds.map { case ((user, movie), (r1, r2)) =>
  val err = r1 - r2
  err * err
}.mean()

Slide 32

// recommend the top 10 movies for user 2
val recommendations = model.recommendProducts(2, 10)
  .sortBy(- _.rating)

var i = 1
recommendations.foreach { r =>
  println(i + ": movie " + r.product + " with rating " + r.rating)
  i += 1
}

Slide 33

Performance: Spark core vs Hadoop MapReduce
How fast can a system sort 100 TB of data on disk?
http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html

Slide 34

Performance: Spark / MLlib
Collaborative filtering with MLlib vs Mahout
https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html

Slide 35

Why should I care?
• fast and easy Machine Learning with MLlib
• fast & flexible: in-memory / on-disk
• SQL, Streaming, MLlib

Slide 36

No content