
Lightning-fast Machine Learning with Spark

Ludwine Probst

November 11, 2014

Transcript

  1. @nivdul
    #DV14 #MLwithSpark
    Lightning fast Machine
    Learning with Spark
    Ludwine Probst

  2. me
     Data engineer at …
     Leader of Duchess France

  3. Machine Learning

  4. MapReduce
     Lay of the land

  5. MapReduce

  6. HDFS
     with iterative algorithms

  7. (image: Spark logo)

  8. Spark is a fast and general engine for large-scale data processing

  9.
    •big data analytics in memory/disk
    •complements Hadoop
    •fast and more flexible than MapReduce
    •Resilient Distributed Datasets (RDD)
    •shared variables

  10. Shared variables
      broadcast variables:
      val broadcastVar = sc.broadcast(Array(1, 2, 3))

      accumulators:
      val acc = sc.accumulator(0, "MyAccumulator")
      sc.parallelize(Array(1, 2, 3)).foreach(x => acc += x)

  11. RDD (Resilient Distributed Datasets)
    •process in parallel
    •controllable persistence (memory, disk…)
    •higher-level operations (transformations & actions)
    •rebuilt automatically using lineage
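
      RDD transformations are lazy: nothing is computed until an action runs. A minimal sketch of the idea using a plain Scala view as a stand-in for an RDD (an analogy only, no SparkContext needed; `LazySketch` and `demo` are illustrative names, not Spark API):

      ```scala
      // Lazy "transformations" vs forcing "actions", sketched with a Scala view.
      object LazySketch {
        // returns (evaluations seen before the action, after it, and the result)
        def demo(): (Int, Int, Int) = {
          var evaluated = 0
          val doubled = (1 to 4).view.map { x => evaluated += 1; x * 2 } // deferred, like rdd.map
          val before = evaluated                                         // still 0: nothing ran yet
          val total = doubled.sum                                        // forces evaluation, like an action
          (before, evaluated, total)
        }

        def main(args: Array[String]): Unit = println(demo())
      }
      ```

      Because the `map` is deferred, only the `sum` triggers the four evaluations; Spark exploits the same laziness to fuse transformations and to rebuild lost partitions from lineage.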

  12. Data Storage: Cassandra, … (via InputFormat)

  13. Spark data flow

  14. Languages
      interactive shell (Scala & Python)
      lambdas (Java 8)

  15. WordCount example (Scala)

      import org.apache.spark.{SparkConf, SparkContext}

      val conf = new SparkConf()
        .setAppName("Spark word count")
        .setMaster("local")

      val sc = new SparkContext(conf)

  16.
      // load the data
      val data = sc.textFile("filepath/wordcount.txt")

      // map then reduce step
      val wordCounts = data.flatMap(line => line.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // persist the data
      wordCounts.cache()
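
      The same flatMap → map → reduceByKey pipeline can be checked locally on a plain Scala collection (a sketch of the logic only; `groupBy` plus a local sum plays the role of `reduceByKey`, and `WordCountSketch` is an illustrative name):

      ```scala
      // Word count on a local Seq[String] instead of an RDD (no cluster needed).
      object WordCountSketch {
        def wordCounts(lines: Seq[String]): Map[String, Int] =
          lines
            .flatMap(_.split("\\s+"))   // one element per word
            .map(word => (word, 1))     // (word, 1) pairs
            .groupBy(_._1)              // local stand-in for reduceByKey
            .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

        def main(args: Array[String]): Unit =
          println(wordCounts(Seq("to be or not", "to be")))
      }
      ```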

  17.
      // keep words which appear at least 3 times
      val filteredWordCount = wordCounts.filter {
        case (key, value) => value > 2
      }

      filteredWordCount.count()

  18. Spark ecosystem

  19. Spark Streaming makes it easy to build scalable fault-tolerant streaming applications

  20. Spark SQL unifies access to structured data

  21. GraphX is Apache Spark's API for graphs and graph-parallel computation

  22. MLlib is Apache Spark's scalable machine learning library

  23. Machine learning with Spark / MLlib

  24. Machine learning libraries: scikit-learn, …

  25. Example: make a movie recommender system

  26. Collaborative filtering with Alternating Least Squares (ALS)

  27. userID  movieID  rating
      1       3        5
      1       28       4
      2       18       3
      2       5        5

  28.
      import org.apache.spark.mllib.recommendation.Rating

      // Load and parse the data
      val data = sc.textFile("movies.txt")

      // create an RDD[Rating]
      val ratings = data.map(_.split("\\s+") match {
        case Array(user, movie, rate) =>
          Rating(user.toInt, movie.toInt, rate.toDouble)
      })
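
      The pattern-match parsing step can be exercised without Spark; in this sketch the local Rating case class merely mirrors the shape of MLlib's `Rating(user: Int, product: Int, rating: Double)`, and `ParseSketch` is an illustrative name:

      ```scala
      // Local sketch of the line-parsing step, one ratings line at a time.
      case class Rating(user: Int, product: Int, rating: Double)

      object ParseSketch {
        def parse(line: String): Rating = line.split("\\s+") match {
          case Array(user, movie, rate) =>
            Rating(user.toInt, movie.toInt, rate.toDouble)
        }

        def main(args: Array[String]): Unit =
          println(parse("1 3 5"))
      }
      ```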

  29.
      // split the data into training set and test set
      val splits = ratings.randomSplit(Array(0.8, 0.2))

      // persist the training set
      val training = splits(0).cache()
      val test = splits(1)

  30.
      import org.apache.spark.mllib.recommendation.ALS

      // Build the recommendation model using ALS
      // (rank: number of latent factors, lambda: regularization parameter)
      val model = ALS.train(training, rank = 10,
        iterations = 20, lambda = 1.0)

  31.
      // Evaluate the model on the test set
      val userMovies = test.map {
        case Rating(user, movie, rate) => (user, movie)
      }
      val predictions = model.predict(userMovies).map {
        case Rating(user, movie, rate) => ((user, movie), rate)
      }

      val ratesAndPreds = test.map {
        case Rating(user, movie, rate) => ((user, movie), rate)
      }.join(predictions)

      // measure the Mean Squared Error of rating prediction
      val MSE = ratesAndPreds.map { case ((user, movie), (r1, r2)) =>
        val err = r1 - r2
        err * err
      }.mean()
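
      The join-then-average step is easy to sanity-check on plain Scala maps. This sketch mirrors the RDD version, pairing actual and predicted ratings by their (user, movie) key (`MseSketch` and `mse` are illustrative names, not Spark API):

      ```scala
      // Mean Squared Error on local maps keyed by (user, movie),
      // mirroring the RDD join + mean (a sketch, not Spark code).
      object MseSketch {
        def mse(actual: Map[(Int, Int), Double],
                predicted: Map[(Int, Int), Double]): Double = {
          // inner join on the (user, movie) key, then average squared errors
          val errs = for ((key, r1) <- actual.toSeq; r2 <- predicted.get(key))
            yield { val e = r1 - r2; e * e }
          errs.sum / errs.size
        }

        def main(args: Array[String]): Unit =
          println(mse(Map((1, 3) -> 5.0, (1, 28) -> 4.0),
                      Map((1, 3) -> 4.0, (1, 28) -> 4.0)))
      }
      ```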

  32.
      // recommend the 10 best movies for user 2
      val recommendations = model.recommendProducts(2, 10)
        .sortBy(- _.rating)

      recommendations.zipWithIndex.foreach { case (r, i) =>
        println(s"${i + 1}. movie ${r.product} with rating ${r.rating}")
      }

  33. Performance: Spark core vs Hadoop MapReduce
      How fast can a system sort 100 TB of data on disk?
      http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html

  34. Performance: Spark / MLlib
      Collaborative filtering with MLlib vs Mahout
      https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html

  35. Why should I care?
      fast & flexible, in-memory / on-disk processing
      fast and easy Machine Learning with MLlib
      ecosystem: SQL, Streaming, MLlib
