Slide 1

Lightning fast Machine Learning with Spark
Ludwine Probst (@nivdul)
ScalaIO, Paris, 2014-10-23

Slide 2

me (@nivdul)
Data engineer at …
Leader of …

Slide 3

Machine Learning

Slide 4

Lay of the land

Slide 5

Hadoop & MapReduce & HDFS

Slide 6

MapReduce

Slide 7

HDFS with iterative algorithms: each MapReduce pass writes its intermediate results back to HDFS, so iterative algorithms (like most machine learning) pay disk I/O on every iteration.

Slide 8

Slide 9

Spark is a fast and general engine for large-scale data processing

Slide 10

• big data analytics in memory/disk
• complements Hadoop
• fast and more flexible
• Resilient Distributed Datasets (RDD)
• shared variables

Slide 11

Shared variables: broadcast variables & accumulators

// broadcast variable: a read-only value shipped once to each worker
val broadcastVar = sc.broadcast(Array(1, 2, 3))

// accumulator: workers can only add to it
val accum = sc.accumulator(0, "MyAccumulator")
sc.parallelize(Array(1, 2, 3)).foreach(x => accum += x)
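
A hedged follow-up to the snippet above: workers read a broadcast variable through .value, and only the driver can read an accumulator's value.

println(broadcastVar.value.mkString(", "))  // 1, 2, 3
println(accum.value)                        // 6, once the foreach above has run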

Slide 12

RDD: fault-tolerant, immutable, distributed collections
• processed in parallel
• higher-level operations (transformations & actions)
• controllable persistence (memory, disk…)
• rebuilt automatically using lineage
(see the sketch below)
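
A minimal sketch of these properties, assuming a SparkContext sc is in scope: transformations are lazy, actions trigger the computation, and the persistence level is under our control.

import org.apache.spark.storage.StorageLevel

val numbers = sc.parallelize(1 to 100)           // distributed collection
val squares = numbers.map(x => x * x)            // transformation (lazy)
squares.persist(StorageLevel.MEMORY_AND_DISK)    // controllable persistence
val total = squares.reduce(_ + _)                // action: triggers the computation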

Slide 13

Data flow

Slide 14

Data storage (InputFormat)

Slide 15

Languages: interactive shell (Scala & Python)

Slide 16

Deployment: standalone, Mesos, YARN

Slide 17

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Spark word count")
  .setMaster("local")

val sc = new SparkContext(conf)

// load the text file as an RDD of lines
val data = sc.textFile("filepath/wordcount.txt")

// split into words, emit (word, 1) pairs, and sum the counts per word
val wordCounts = data.flatMap(line => line.split("\\s+"))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)

wordCounts.cache()

// keep only the words seen more than twice
val filteredWordCount = wordCounts.filter { case (key, value) => value > 2 }

filteredWordCount.count()
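
A hedged usage example, not on the slide: an action like take() materializes a few results on the driver so we can inspect them.

filteredWordCount.take(10).foreach { case (word, count) => println(s"$word: $count") }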

Slide 18

Spark ecosystem

Slide 19

Spark Streaming makes it easy to build scalable fault-tolerant streaming applications
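
A minimal sketch of the Spark 1.x streaming API, assuming a hypothetical text source on localhost:9999; it reuses the word-count logic over 1-second micro-batches.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()   // print a few counts for each batch
ssc.start()
ssc.awaitTermination()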

Slide 20

Spark SQL unifies access to structured data
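
A minimal sketch of the 2014-era SQLContext API, assuming a hypothetical people.json file with name and age fields.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("people.json")   // infer the schema from JSON
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)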

Slide 21

GraphX is Apache Spark's API for graphs and graph-parallel computation
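
A minimal sketch, assuming sc is in scope: two hypothetical vertices, one edge, and a degree count.

import org.apache.spark.graphx._

val vertices = sc.parallelize(Array((1L, "Alice"), (2L, "Bob")))
val edges = sc.parallelize(Array(Edge(1L, 2L, "follows")))
val graph = Graph(vertices, edges)
graph.inDegrees.collect().foreach(println)   // (2,1): Bob has one follower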

Slide 22

MLlib is Apache Spark's scalable machine learning library

Slide 23

Machine learning with Spark/MLlib

Slide 24

Machine learning libraries: scikits, …

Slide 25

Classification: classify mail as spam or non-spam with logistic regression

Slide 26

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// input data: parse each CSV line, first column is the label, the rest are features
val parsedData = data.map { line =>
  val parts = line.split(",").map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}

// split into training (80%) and test (20%) sets
val splits = parsedData.randomSplit(Array(0.8, 0.2))
val training = splits(0).cache()
val test = splits(1)

// model: train a logistic regression with 100 iterations of SGD
val model = LogisticRegressionWithSGD.train(training, 100)

// validation: compare predictions against the true labels
val prediction = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * prediction.filter(x => x._1 == x._2).count() / test.count()

Slide 27

Collaborative filtering: make a recommender system with Alternating Least Squares (ALS)

Slide 28

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// input data: parse each "user item rate" line into a Rating
val ratings = data.map(_.split("\\s+") match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})

// split into training (80%) and test (20%) sets
val splits = ratings.randomSplit(Array(0.8, 0.2))
val training = splits(0).cache()
val test = splits(1)

// model: train ALS (rank 10, 20 iterations, lambda 0.01)
val model = ALS.train(training, rank = 10, iterations = 20, lambda = 0.01)

// validation: predict a rating for every (user, movie) pair in the test set
val userMovies = test.map { case Rating(user, movie, rate) => (user, movie) }
val predictions = model.predict(userMovies).map {
  case Rating(user, movie, rate) => ((user, movie), rate)
}

https://github.com/nivdul/spark-ml-scalaio
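
A hedged follow-up, not on the slide: join the predictions with the held-out ratings to compute a mean squared error (in Spark 1.x the pair-RDD implicits come from SparkContext._).

import org.apache.spark.SparkContext._

val ratesAndPreds = test.map { case Rating(user, movie, rate) => ((user, movie), rate) }
                        .join(predictions)
val MSE = ratesAndPreds.map { case (_, (actual, predicted)) =>
  math.pow(actual - predicted, 2)
}.mean()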

Slide 29

Clustering with K-means

Slide 30

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// input data: parse each line of whitespace-separated numbers into a dense vector
val parsedData = data.map(s => Vectors.dense(s.split("\\s+").map(_.toDouble)))

// split into training (80%) and test (20%) sets
val splits = parsedData.randomSplit(Array(0.8, 0.2))
val training = splits(0).cache()
val test = splits(1)

// model: train K-means with 4 clusters and at most 20 iterations
val clusters = KMeans.train(training, k = 4, maxIterations = 20)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)

https://github.com/nivdul/spark-ml-scalaio
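
A hedged usage example, not on the slide: the trained KMeansModel can assign new points to their nearest centroid.

// assign each held-out vector to a cluster
val assignments = test.map(v => (v, clusters.predict(v)))
assignments.take(5).foreach(println)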

Slide 31

Performance: Spark core vs MapReduce
http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html

Slide 32

Performance: Spark/MLlib, collaborative filtering with MLlib vs Mahout
https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html

Slide 33

Why should I care?

Slide 34

https://github.com/nivdul/spark-ml-scalaio
http://spark.apache.org/
http://spark.apache.org/mllib/