
Lightning-fast Machine Learning with Spark


Today in the Big Data world, Hadoop and MapReduce dominate large-scale data processing. However, the MapReduce model shows its limits for many kinds of workloads, especially highly iterative algorithms such as those used in Machine Learning. Spark is an in-memory data processing framework that, unlike Hadoop, provides interactive and real-time analysis on large datasets. Furthermore, Spark offers a more flexible programming model and better performance than Hadoop.

This talk introduces Spark and MLlib, showing through a Machine Learning example how Spark differentiates itself from Hadoop in terms of API and performance. As a closing note, we will quickly explore the growing Spark ecosystem, with projects such as Spark Streaming, MLlib, GraphX, and Spark SQL.

Ludwine Probst

October 28, 2014

Transcript

  1. Lightning fast Machine
    Learning with Spark
    Ludwine Probst @nivdul
    ScalaIO, Paris 2014-10-23


  2. me
    Data engineer at
    @nivdul
    Leader of


  3. Machine Learning


  4. Lay of the land


  5. MapReduce & HDFS


  6. MapReduce


  7. HDFS
    with iterative algorithms


  8.

  9. Spark is a fast and general engine for large-scale data processing


  10. • big data analytics in memory/disk
    • complements Hadoop
    • faster and more flexible than Hadoop MapReduce
    • Resilient Distributed Datasets (RDD)
    • shared variables


  11. Shared variables
    broadcast variables
    val broadcastVar = sc.broadcast(Array(1, 2, 3))
    accumulators
    val accum = sc.accumulator(0, "MyAccumulator")
    sc.parallelize(Array(1, 2, 3)).foreach(x => accum += x)
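
    On the driver, once the action has run, the results can be read back (a small note following from the values above):

    accum.value          // 6: the sum accumulated across the tasks
    broadcastVar.value   // Array(1, 2, 3), available read-only to every task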


  12. RDD
    fault-tolerant immutable distributed collections
    • processed in parallel
    • higher-level operations (transformations & actions, see the sketch below)
    • controllable persistence (memory, disk…)
    • rebuilt automatically using lineage
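
    A minimal sketch of the transformation/action distinction and explicit persistence (assuming an existing SparkContext sc; the file path is illustrative):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("filepath/wordcount.txt")   // RDD built from storage
    val longLines = lines.filter(_.length > 80)         // transformation: lazy, returns a new RDD
    longLines.persist(StorageLevel.MEMORY_ONLY)         // controllable persistence: keep it in memory
    val n = longLines.count()                           // action: triggers the actual computation
    val sample = longLines.take(5)                      // another action reuses the persisted RDD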


  13. Data flow


  14. InputFormat
    Data storage


  15. Languages
    interactive shell (Scala & Python)
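
    Both shells ship with the Spark distribution, and a SparkContext named sc is created for you at startup:

    ./bin/spark-shell    # Scala REPL with Spark on the classpath
    ./bin/pyspark        # Python shell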


  16. Deployment
    standalone
    Mesos
    YARN
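
    One common way to submit an application to each of these cluster managers is spark-submit (a sketch; the class and jar names are hypothetical):

    ./bin/spark-submit --class com.example.WordCount --master local[4] wordcount.jar
    ./bin/spark-submit --class com.example.WordCount --master spark://master:7077 wordcount.jar   # standalone
    ./bin/spark-submit --class com.example.WordCount --master mesos://master:5050 wordcount.jar   # Mesos
    ./bin/spark-submit --class com.example.WordCount --master yarn-cluster wordcount.jar          # YARN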


  17. // configure and create the Spark context
    val conf = new SparkConf()
      .setAppName("Spark word count")
      .setMaster("local")

    val sc = new SparkContext(conf)

    // load the text file as an RDD of lines
    val data = sc.textFile("filepath/wordcount.txt")

    // split each line into words, then count the occurrences of each word
    val wordCounts = data.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // keep the counts in memory for reuse across actions
    wordCounts.cache()

    // keep only the words appearing more than twice
    val filteredWordCount = wordCounts.filter {
      case (key, value) => value > 2
    }

    // action: triggers the whole computation
    filteredWordCount.count()


  18. Spark ecosystem


  19. Spark Streaming
    makes it easy to build scalable, fault-tolerant streaming applications
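
    A minimal sketch of a streaming word count over 10-second batches (assuming the SparkConf from the word-count example and a text source listening on localhost:9999):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(conf, Seconds(10))      // one batch every 10 seconds
    val lines = ssc.socketTextStream("localhost", 9999)    // DStream of incoming lines
    val counts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()                                         // print the first results of each batch

    ssc.start()                                            // start receiving and processing
    ssc.awaitTermination()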


  20. Spark SQL
    unifies access to structured data
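
    A sketch using the SchemaRDD API of Spark 1.1, current at the time of this talk (the Mail case class and the file path are illustrative, and an existing SparkContext sc is assumed):

    import org.apache.spark.sql.SQLContext

    case class Mail(id: Int, length: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD        // implicit conversion from RDD[Mail] to SchemaRDD

    val mails = sc.textFile("filepath/mails.csv")
      .map(_.split(","))
      .map(p => Mail(p(0).toInt, p(1).toInt))

    mails.registerTempTable("mails")
    val longMails = sqlContext.sql("SELECT id FROM mails WHERE length > 100")
    longMails.collect().foreach(println)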


  21. GraphX is Apache Spark's API for graphs and graph-parallel computation
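
    A small sketch of building a property graph and running PageRank on it (the vertices and edges are made-up example data; an existing SparkContext sc is assumed):

    import org.apache.spark.graphx.{Edge, Graph}

    // vertices are (id, attribute) pairs; edges carry an attribute as well
    val vertices = sc.parallelize(Array((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
    val edges = sc.parallelize(Array(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

    val graph = Graph(vertices, edges)
    val ranks = graph.pageRank(0.001).vertices   // (id, rank) pairs
    ranks.collect().foreach(println)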


  22. MLlib
    is Apache Spark's scalable machine learning library


  23. Machine learning with
    Spark/MLlib


  24. Machine learning libraries
    scikits


  25. Classification
    classify mail into spam or non-spam
    with a logistic regression


  26. [diagram: input data → training/test split → model → validation]

    // parse each CSV line into a LabeledPoint (label, feature vector)
    val parsedData = data.map { line =>
      val parts = line.split(",").map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    // 80/20 split into training and test sets
    val splits = parsedData.randomSplit(Array(0.8, 0.2))
    val training = splits(0).cache()
    val test = splits(1)

    // train a logistic regression model with 100 iterations of SGD
    val model = LogisticRegressionWithSGD.train(training, 100)

    // validation: compare predictions against the true labels
    val prediction = test.map(p => (model.predict(p.features), p.label))
    val accuracy = 1.0 * prediction.filter(x => x._1 == x._2).count() / test.count()


  27. Collaborative filtering
    make a recommender system
    with
    Alternating Least Squares (ALS)


  28. [diagram: input data → training/test split → model → validation]
    https://github.com/nivdul/spark-ml-scalaio

    // parse each "user item rating" line into an MLlib Rating
    val ratings = data.map(_.split("\\s+") match {
      case Array(user, item, rate) =>
        Rating(user.toInt, item.toInt, rate.toDouble)
    })

    // 80/20 split into training and test sets
    val splits = ratings.randomSplit(Array(0.8, 0.2))
    val training = splits(0).cache()
    val test = splits(1)

    // train the recommender with ALS: rank 10, 20 iterations, regularization 0.01
    val model = ALS.train(training, rank = 10, iterations = 20, lambda = 0.01)

    // predict a rating for every (user, movie) pair of the test set
    val userMovies = test.map {
      case Rating(user, movie, rate) => (user, movie)
    }
    val predictions = model.predict(userMovies).map {
      case Rating(user, movie, rate) => ((user, movie), rate)
    }
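
    To close the validation loop, a common follow-up (a sketch along the lines of the MLlib collaborative-filtering guide) is to join the predictions back to the held-out ratings and compute the mean squared error:

    val ratesAndPreds = test.map {
      case Rating(user, movie, rate) => ((user, movie), rate)
    }.join(predictions)

    val MSE = ratesAndPreds.map {
      case ((user, movie), (r1, r2)) => math.pow(r1 - r2, 2)
    }.mean()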


  29. Clustering
    with K-means


  30. [diagram: input data → training/test split → model → validation]
    https://github.com/nivdul/spark-ml-scalaio

    // parse each line of whitespace-separated numbers into a dense vector
    val parsedData = data.map(s =>
      Vectors.dense(s.split("\\s+").map(_.toDouble)))

    // 80/20 split into training and test sets
    val splits = parsedData.randomSplit(Array(0.8, 0.2))
    val training = splits(0).cache()
    val test = splits(1)

    // cluster the training set into k = 4 clusters, with at most 20 iterations
    val clusters = KMeans.train(training, k = 4, maxIterations = 20)

    // evaluate the clustering by computing the Within Set Sum of Squared Errors
    val WSSSE = clusters.computeCost(parsedData)


  31. Performance: Spark core vs MapReduce
    http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html


  32. Performance: Spark/MLlib
    Collaborative filtering with MLlib vs Mahout
    https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html


  33. Why should I care?


  34. https://github.com/nivdul/spark-ml-scalaio
    http://spark.apache.org/
    http://spark.apache.org/mllib/
