
Lightning-fast Machine Learning with Spark

Today in the Big Data world, Hadoop and MapReduce dominate large-scale data processing. However, the MapReduce model shows its limits for many kinds of workloads, especially highly iterative algorithms such as those used in Machine Learning. Spark is an in-memory data processing framework that, unlike Hadoop, enables interactive and real-time analysis of large datasets. Furthermore, Spark has a more flexible programming model and delivers better performance than Hadoop.

This talk introduces Spark and MLlib, showing through a Machine Learning example how Spark differentiates itself from Hadoop with regard to its API and performance. As a closing note we will quickly explore the growing Spark ecosystem with projects like Spark Streaming, MLlib, GraphX and Spark SQL.


Ludwine Probst

October 28, 2014

Transcript

  1. Lightning fast Machine Learning with Spark — Ludwine Probst @nivdul, ScalaIO Paris, 2014-10-23
  2. me Data engineer at @nivdul Leader of

  3. Machine Learning

  4. Lay of the land

  5. Hadoop & MapReduce & HDFS

  6. MapReduce

  7. HDFS with iterative algorithms

  8. None
  9. Spark is a fast and general engine for large-scale data processing

  10. • big data analytics in memory/disk • complements Hadoop • fast and more flexible • Resilient Distributed Datasets (RDD) • shared variables
  11. Shared variables: broadcast variables & accumulators

    // broadcast variable: read-only data shipped once to each worker
    val broadcastVar = sc.broadcast(Array(1, 2, 3))

    // accumulator: a shared counter that tasks can only add to
    val accum = sc.accumulator(0, "MyAccumulator")
    sc.parallelize(Array(1, 2, 3)).foreach(x => accum += x)
  12. RDD: fault-tolerant, immutable, distributed collections • processed in parallel • higher-level operations (transformations & actions) • controllable persistence (memory, disk…) • rebuilt automatically using lineage (see the sketch below)
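
    A minimal sketch of the transformations/actions distinction; the data and names here are illustrative, not from the talk:

    // transformations are lazy: nothing is computed yet
    val numbers = sc.parallelize(1 to 1000)
    val evenSquares = numbers.map(n => n * n).filter(_ % 2 == 0)

    // persist the RDD so later actions reuse it instead of recomputing the lineage
    evenSquares.cache()

    // actions trigger the actual computation
    val total = evenSquares.count()
    val sample = evenSquares.take(5)
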
  13. Data flow

  14. Data storage: InputFormat

  15. Languages: interactive shell (Scala & Python)

  16. Deployment: standalone, Mesos, YARN

  17. import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("Spark word count")
      .setMaster("local")

    val sc = new SparkContext(conf)

    val data = sc.textFile("filepath/wordcount.txt")

    // count the occurrences of each word
    val wordCounts = data.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // keep the counts in memory: they are reused below
    wordCounts.cache()

    val filteredWordCount = wordCounts.filter { case (key, value) => value > 2 }

    filteredWordCount.count()
  18. Spark ecosystem

  19. Spark Streaming makes it easy to build scalable fault-tolerant streaming applications (a minimal sketch below)
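
    A minimal word-count sketch with the Spark 1.x streaming API; the socket source, port and batch interval are illustrative choices, not from the slides:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // process the stream in one-second micro-batches
    val conf = new SparkConf().setAppName("streaming word count").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // text lines arriving on a TCP socket
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // block until the stream is stopped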

  20. Spark SQL unifies access to structured data (a minimal sketch below)
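
    A minimal sketch with the SQLContext API as it stood in Spark 1.1; the Person case class, data and table name are illustrative:

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit conversion RDD[Person] -> SchemaRDD

    // expose an ordinary RDD as a queryable table
    val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))
    people.registerTempTable("people")

    // run plain SQL over it; the result is again an RDD
    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
    adults.map(row => row.getString(0)).collect().foreach(println)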

  21. GraphX is Apache Spark's API for graphs and graph-parallel computation (a minimal sketch below)
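
    A minimal GraphX sketch; the toy vertices and edges are illustrative:

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // a tiny property graph: users as vertices, "follows" relations as edges
    val users: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows: RDD[Edge[String]] =
      sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))

    val graph = Graph(users, follows)

    // graph-parallel operations: count each user's followers, run PageRank
    graph.inDegrees.collect().foreach { case (id, deg) => println(s"$id has $deg followers") }
    val ranks = graph.pageRank(tol = 0.001).vertices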

  22. MLlib is Apache Spark's scalable machine learning library

  23. Machine learning with Spark/MLlib

  24. Machine learning libraries scikits

  25. Classification: classify mail into spam or non-spam with a logistic regression
  26. import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // parse each CSV line into a LabeledPoint (label first, then the features)
    val parsedData = data.map { line =>
      val parts = line.split(",").map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    // split into training and test sets
    val splits = parsedData.randomSplit(Array(0.8, 0.2))
    val training = splits(0).cache()
    val test = splits(1)

    // train a logistic regression model with 100 iterations of SGD
    val model = LogisticRegressionWithSGD.train(training, 100)

    // validation: compare predictions with the known labels
    val prediction = test.map(p => (model.predict(p.features), p.label))
    val accuracy = 1.0 * prediction.filter(x => x._1 == x._2).count() / test.count()

    [diagram: input data → training/test split → model → validation]
  27. Collaborative filtering: build a recommender system with Alternating Least Squares (ALS)
  28. import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // parse "user item rating" lines
    val ratings = data.map(_.split("\\s+") match {
      case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
    })

    val splits = ratings.randomSplit(Array(0.8, 0.2))
    val training = splits(0).cache()
    val test = splits(1)

    // factorize the rating matrix: rank 10, 20 iterations, regularization 0.01
    val model = ALS.train(training, rank = 10, iterations = 20, lambda = 0.01)

    // validation: predict ratings for the held-out (user, movie) pairs
    val userMovies = test.map { case Rating(user, movie, rate) => (user, movie) }
    val predictions = model.predict(userMovies).map {
      case Rating(user, movie, rate) => ((user, movie), rate)
    }

    [diagram: input data → training/test split → model → validation]
    https://github.com/nivdul/spark-ml-scalaio
  29. Clustering with K-means

  30. import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // parse whitespace-separated numeric features into dense vectors
    val parsedData = data.map(s => Vectors.dense(s.split("\\s+").map(_.toDouble)))

    val splits = parsedData.randomSplit(Array(0.8, 0.2))
    val training = splits(0).cache()
    val test = splits(1)

    // cluster the training set into 4 groups
    val clusters = KMeans.train(training, k = 4, maxIterations = 20)

    // evaluate the clustering by computing the Within Set Sum of Squared Errors
    val WSSSE = clusters.computeCost(parsedData)

    [diagram: input data → training/test split → model → validation]
    https://github.com/nivdul/spark-ml-scalaio
  31. Performance: Spark core vs MapReduce http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html

  32. Performance: collaborative filtering with Spark/MLlib vs Mahout https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html

  33. Why should I care?

  34. https://github.com/nivdul/spark-ml-scalaio http://spark.apache.org/ http://spark.apache.org/mllib/