
Lightning-fast Machine Learning with Spark

Probst Ludwine

November 11, 2014


Transcript

  1. @nivdul #DV14 #MLwithSpark — Spark:
     • big data analytics in memory/disk
     • complements Hadoop
     • fast and more flexible
     • Resilient Distributed Datasets (RDD)
     • shared variables
  2. Shared variables: broadcast variables and accumulators

     // broadcast variable: read-only value shipped once to each node
     val broadcastVar = sc.broadcast(Array(1, 2, 3))

     // accumulator: workers can only add to it; the driver reads the result
     val acc = sc.accumulator(0, "MyAccumulator")
     sc.parallelize(Array(1, 2, 3)).foreach(x => acc += x)
  3. RDD (Resilient Distributed Datasets):
     • processed in parallel
     • controllable persistence (memory, disk…)
     • higher-level operations (transformations & actions)
     • rebuilt automatically using lineage
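The transformation/action distinction can be illustrated without a cluster: a plain-Scala analogy (not Spark code) is a lazy collection view, where `map`/`filter` build a pipeline and only a terminal operation like `sum` forces computation — much as `count()` or `collect()` triggers an RDD's lineage. The `LazyAnalogy` object and its `pipeline` helper are illustrative names, not part of any API:

```scala
object LazyAnalogy {
  // transformations on a view are lazy, like RDD transformations;
  // nothing is computed when map/filter are applied
  def pipeline(xs: Seq[Int]): Int =
    xs.view.map(_ * 2).filter(_ > 10).sum  // sum is the "action" that runs the pipeline

  def main(args: Array[String]): Unit =
    println(pipeline(1 to 10))  // doubled values > 10 are 12,14,16,18,20 → prints 80
}
```

The analogy is loose — Spark additionally distributes the data and records lineage for fault recovery — but the laziness model is the same.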
  4. WordCount example (Scala)

     val conf = new SparkConf()
       .setAppName("Spark word count")
       .setMaster("local")

     val sc = new SparkContext(conf)
  5. // load the data
     val data = sc.textFile("filepath/wordcount.txt")

     // map then reduce step
     val wordCounts = data.flatMap(line => line.split("\\s+"))
       .map(word => (word, 1))
       .reduceByKey(_ + _)

     // persist the data
     wordCounts.cache()
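The same map/reduce logic can be traced with plain Scala collections (no Spark needed). This sketch assumes Scala 2.13+ for `groupMapReduce`, which plays the role of `reduceByKey`; `WordCountLocal` is an illustrative name:

```scala
object WordCountLocal {
  // same shape as the Spark pipeline: split lines, pair words with 1, sum per key
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))            // split each line into words
      .map(word => (word, 1))              // pair each word with a count of 1
      .groupMapReduce(_._1)(_._2)(_ + _)   // local equivalent of reduceByKey(_ + _)

  def main(args: Array[String]): Unit =
    println(wordCounts(Seq("to be or not to be")))
}
```

The difference in Spark is only that each stage runs partition-by-partition across the cluster.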
  6. // keep words which appear at least 3 times
     val filteredWordCount = wordCounts.filter { case (key, value) => value > 2 }

     filteredWordCount.count()
  7. Sample ratings data:

     userID  movieID  rating
     1       3        5
     1       28       4
     2       18       3
     2       5        5
  8. // load and parse the data
     val data = sc.textFile("movies.txt")

     // create an RDD[Rating]
     val ratings = data.map(_.split("\\s+") match {
       case Array(user, movie, rate) =>
         Rating(user.toInt, movie.toInt, rate.toDouble)
     })
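The pattern match above can be tried on a single line of the sample data in plain Scala; here a local case class stands in for MLlib's `Rating`, and `ParseRatings` is an illustrative name:

```scala
object ParseRatings {
  // stand-in for org.apache.spark.mllib.recommendation.Rating
  case class Rating(user: Int, product: Int, rating: Double)

  // mirrors the map over data: split on whitespace, destructure, convert types
  def parse(line: String): Rating = line.split("\\s+") match {
    case Array(user, movie, rate) =>
      Rating(user.toInt, movie.toInt, rate.toDouble)
  }

  def main(args: Array[String]): Unit =
    println(parse("1 3 5"))  // prints Rating(1,3,5.0)
}
```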
  9. // split the data into a training set and a test set
     val splits = ratings.randomSplit(Array(0.8, 0.2))

     // persist the training set
     val training = splits(0).cache()
     val test = splits(1)
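What an 80/20 split computes can be sketched locally. Note this is a simplification: Spark's `randomSplit` assigns each element independently by probability (so sizes vary slightly), whereas this sketch shuffles and cuts at exactly 80%. `SplitSketch` and `split80_20` are illustrative names:

```scala
object SplitSketch {
  // shuffle with a fixed seed for reproducibility, then cut at 80%
  def split80_20[A](xs: Seq[A], seed: Long = 42L): (Seq[A], Seq[A]) = {
    val shuffled = new scala.util.Random(seed).shuffle(xs)
    val cut = (xs.size * 0.8).toInt
    (shuffled.take(cut), shuffled.drop(cut))
  }

  def main(args: Array[String]): Unit = {
    val (training, test) = split80_20(1 to 10)
    println(s"training=${training.size}, test=${test.size}")  // 8 and 2
  }
}
```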
  10. // build the recommendation model using ALS
      // (rank = number of latent factors, lambda = regularization parameter)
      val model = ALS.train(training, rank = 10, iterations = 20, lambda = 1.0)
  11. // evaluate the model on the test set
      val userMovies = test.map { case Rating(user, movie, rate) => (user, movie) }

      val predictions = model.predict(userMovies).map {
        case Rating(user, movie, rate) => ((user, movie), rate)
      }

      val ratesAndPreds = test.map {
        case Rating(user, movie, rate) => ((user, movie), rate)
      }.join(predictions)

      // measure the Mean Squared Error of rating prediction
      val MSE = ratesAndPreds.map { case ((user, movie), (r1, r2)) =>
        val err = r1 - r2
        err * err
      }.mean()
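The Mean Squared Error computed over `ratesAndPreds` reduces to simple arithmetic over (actual, predicted) pairs, which can be checked locally; `MseLocal` is an illustrative name:

```scala
object MseLocal {
  // same computation as ratesAndPreds.map { ... err * err }.mean()
  def mse(pairs: Seq[(Double, Double)]): Double = {
    val errors = pairs.map { case (r1, r2) =>
      val err = r1 - r2
      err * err
    }
    errors.sum / errors.size
  }

  def main(args: Array[String]): Unit =
    println(mse(Seq((5.0, 4.0), (3.0, 3.0), (4.0, 2.0))))  // (1 + 0 + 4) / 3
}
```

A lower MSE means predicted ratings sit closer to the held-out test ratings; taking the square root gives the RMSE in the original rating units.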
  12. // recommend the 10 best movies for user 2
      val recommendations = model.recommendProducts(2, 10)
        .sortBy(-_.rating)

      var i = 1
      recommendations.foreach { r =>
        println(i + ": movie " + r.product + " with rating " + r.rating)
        i += 1
      }
  13. Performance of Spark/MLlib: collaborative filtering with MLlib vs Mahout
      https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
  14. Why should I care? Fast and easy machine learning with MLlib:
      Spark is fast & flexible, works in-memory/on-disk, and ships SQL, Streaming and MLlib libraries.