Lightning-fast Machine Learning with Spark

Today in the Big Data world, Hadoop and MapReduce dominate large-scale data processing. However, the MapReduce model shows its limits for several kinds of workloads, especially for highly iterative algorithms such as those found in Machine Learning. Spark is an in-memory data processing framework that, unlike Hadoop, enables interactive and real-time analysis of large datasets. Furthermore, Spark offers a more flexible programming model and better performance than Hadoop.

This talk introduces Spark and MLlib, showing through a Machine Learning example how Spark differentiates itself from Hadoop in terms of API and performance. As a closing note, we will quickly explore the growing Spark ecosystem, with projects like Spark Streaming, MLlib, GraphX and Spark SQL.

Probst Ludwine

October 28, 2014

Transcript

  1. 8.
  2. 10.

    • big data analytics in memory/disk
    • complements Hadoop
    • fast and more flexible
    • Resilient Distributed Datasets (RDD)
    • shared variables
  3. 11.

    Shared variables: broadcast variables, accumulators

    val broadcastVar = sc.broadcast(Array(1, 2, 3))
    val accum = sc.accumulator(0, "MyAccumulator")
    sc.parallelize(Array(1, 2, 3)).foreach(x => accum += x)
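    The accumulator on this slide is only updated inside tasks on the executors; its value is read back on the driver. A minimal sketch of the full round trip, assuming the same local SparkContext `sc` as on the other slides (the combined use of the broadcast value inside the task is an illustration, not part of the slide):

    ```scala
    // Broadcast a read-only array to all executors once,
    // instead of shipping it with every task.
    val broadcastVar = sc.broadcast(Array(1, 2, 3))

    // Accumulators are write-only on executors, readable on the driver.
    val accum = sc.accumulator(0, "MyAccumulator")

    sc.parallelize(Array(1, 2, 3)).foreach { x =>
      // Each task reads the broadcast value and adds to the accumulator.
      accum += x + broadcastVar.value(0)
    }

    // Back on the driver: (1+1) + (2+1) + (3+1) = 9
    println(accum.value)
    ```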
  4. 12.

    RDD: fault-tolerant, immutable, distributed collections
    • processed in parallel
    • higher-level operations (transformations & actions)
    • controllable persistence (memory, disk…)
    • rebuilt automatically using lineage
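    The persistence and lineage points above can be sketched as follows (a minimal example, assuming an existing SparkContext `sc`; the data and names are made up for illustration):

    ```scala
    import org.apache.spark.storage.StorageLevel

    val numbers = sc.parallelize(1 to 1000)        // distributed, immutable collection
    val squares = numbers.map(x => x * x)          // transformation: lazy, recorded in lineage
    squares.persist(StorageLevel.MEMORY_AND_DISK)  // controllable persistence level
    val total = squares.reduce(_ + _)              // action: triggers the actual computation
    // If a partition of `squares` is lost, Spark rebuilds it from
    // `numbers` using the recorded lineage, not from a replica.
    ```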
  5. 17.

    val conf = new SparkConf()
      .setAppName("Spark word count")
      .setMaster("local")
    val sc = new SparkContext(conf)

    val data = sc.textFile("filepath/wordcount.txt")

    val wordCounts = data.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    wordCounts.cache()

    val filteredWordCount = wordCounts.filter { case (key, value) => value > 2 }
    filteredWordCount.count()
  6. 26.

    input data → training / test → model → validation

    val parsedData = data.map { line =>
      val parts = line.split(",").map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }
    val splits = parsedData.randomSplit(Array(0.8, 0.2))
    val training = splits(0).cache()
    val test = splits(1)

    val model = LogisticRegressionWithSGD.train(training, 100)

    val prediction = test.map(p => (model.predict(p.features), p.label))
    val accuracy = 1.0 * prediction.filter(x => x._1 == x._2).count() / test.count()
  7. 28.

    input data → training / test → model → validation

    val ratings = data.map(_.split("\\s+") match {
      case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
    })
    val splits = ratings.randomSplit(Array(0.8, 0.2))
    val training = splits(0).cache()
    val test = splits(1)

    val model = ALS.train(training, rank = 10, iterations = 20, 0.01)

    val userMovies = test.map { case Rating(user, movie, rate) => (user, movie) }
    val predictions = model.predict(userMovies).map {
      case Rating(user, movie, rate) => ((user, movie), rate)
    }

    https://github.com/nivdul/spark-ml-scalaio
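    The slide stops at the predictions. A common way to finish the validation step (not shown on the slide) is to key the actual test ratings the same way, join them with the predictions, and compute the mean squared error:

    ```scala
    // Pair the actual ratings by the same (user, movie) key
    // and join with the predicted ratings.
    val ratesAndPreds = test.map {
      case Rating(user, movie, rate) => ((user, movie), rate)
    }.join(predictions)

    // Average the squared differences between actual and predicted ratings.
    val MSE = ratesAndPreds.map {
      case ((user, movie), (actual, predicted)) =>
        val err = actual - predicted
        err * err
    }.mean()
    ```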
  8. 30.

    input data → training / test → model → validation

    val parsedData = data.map(s => Vectors.dense(s.split("\\s+").map(_.toDouble)))
    val splits = parsedData.randomSplit(Array(0.8, 0.2))
    val training = splits(0).cache()
    val test = splits(1)

    val clusters = KMeans.train(training, k = 4, maxIterations = 20)

    // Evaluate clustering by computing Within Set Sum of Squared Errors
    val WSSSE = clusters.computeCost(parsedData)

    https://github.com/nivdul/spark-ml-scalaio
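    Beyond the WSSSE cost, the trained model can also assign a cluster to any vector, which is how the held-out test set would be used here. A short sketch (the single sample point is made up for illustration and must match the training data's dimensionality):

    ```scala
    // Cluster index (0 until k) for each held-out point.
    val assignments = test.map(point => clusters.predict(point))

    // Or for a single, hypothetical new observation:
    val newPoint = Vectors.dense(1.0, 2.0)
    val clusterId = clusters.predict(newPoint)
    ```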