Slide 1

Big Data Analytics with Scala
Sam Bessalah (@samklr)

Slide 2

What is Big Data Analytics? It’s about doing aggregations and running complex models on large datasets, offline, in real time or both.

Slide 3

Lambda Architecture: a blueprint for a Big Data analytics architecture

Slide 4

No content

Slide 5

No content

Slide 6

No content

Slide 7

No content

Slide 8

MapReduce redux

map: (Km, Vm) → List(Km, Vm)           in Scala: T => List[(K, V)]
reduce: (Km, List(Vm)) → List(Kr, Vr)  in Scala: (K, List[V]) => List[(K, V)]
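
To make these signatures concrete, here is a minimal in-memory sketch of the two phases over plain Scala collections; mapPhase and reducePhase are illustrative names, not part of any framework:

def mapPhase[T, K, V](input: List[T])(mapper: T => List[(K, V)]): List[(K, V)] =
  input.flatMap(mapper)

def reducePhase[K, V, R](pairs: List[(K, V)])(reducer: (K, List[V]) => List[(K, R)]): List[(K, R)] =
  pairs
    .groupBy(_._1)  // the "shuffle": gather all values sharing a key
    .toList
    .flatMap { case (k, kvs) => reducer(k, kvs.map(_._2)) }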

Slide 9

No content

Slide 10

Big Data "Hello World": Word Count

Slide 11

Enter Cascading

Slide 12

No content

Slide 13

Word Count Redux: (Flat)Map-Reduce

Slide 14

SCALDING

class WordCount(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { group => group.size }
    .write(Tsv(args("output")))
}

Slide 15

SCALDING: Clustering with Mahout

lazy val clust = new StreamingKMeans(
  new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
  args("sloppyclusters").toInt,
  (10e-6).asInstanceOf[Float])

var count = 0  // mutable counter used to number the centroids

val sloppyClusters = TextLine(args("input"))
  .map { str =>
    val vec = str.split("\t").map(_.toDouble)
    val cent = new Centroid(count, new DenseVector(vec))
    count += 1
    cent
  }
  .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
    cl.cluster(cent); cl
  }
  .flatMap(c => c.iterator.asScala.toIterable)

Slide 16

SCALDING: Clustering with Mahout

val finalClusters = sloppyClusters.groupAll
  .mapValueStream { centList =>
    lazy val bclusterer = new BallKMeans(
      new BruteSearch(new EuclideanDistanceMeasure),
      args("numclusters").toInt, 100)
    bclusterer.cluster(centList.toList.asJava)
    bclusterer.iterator.asScala
  }
  .values

Slide 17

Scalding
- Two APIs: a field-based API and a typed API
- Field-based API: project, map, discard, groupBy, …
- Typed API: TypedPipe[T], works like scala.collection.Iterator[T] (see the sketch after this list)
- Matrix library
- ALGEBIRD: abstract algebra library … we'll talk about it later
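
As a rough sketch, the word count from the previous slide might look like this in the typed API, assuming Scalding's standard TypedPipe, TextLine, and TypedTsv:

import com.twitter.scalding._

class TypedWordCount(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(_.split("\\s+"))   // one record per word
    .map(word => (word, 1L))
    .sumByKey                   // add up the 1s per word
    .toTypedPipe
    .write(TypedTsv[(String, Long)](args("output")))
}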

Slide 18

No content

Slide 19

STORM

Slide 20

- Distributed, fault-tolerant, real-time stream computation engine.
- Four concepts (a wiring sketch follows this list):
  - Streams: infinite sequences of tuples
  - Spouts: sources of streams
  - Bolts: process and produce streams; can do filtering, aggregations, joins, …
  - Topologies: define a flow or network of spouts and bolts.
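
A minimal sketch of how the four concepts wire together, using Storm's Java TopologyBuilder from Scala; RandomSentenceSpout, SplitSentence, and WordCountBolt are hypothetical components:

import backtype.storm.{Config, LocalCluster}
import backtype.storm.topology.TopologyBuilder
import backtype.storm.tuple.Fields

val builder = new TopologyBuilder
builder.setSpout("sentences", new RandomSentenceSpout, 2)  // spout: source of the stream
builder.setBolt("split", new SplitSentence, 4)
  .shuffleGrouping("sentences")                            // bolt consuming the spout's tuples
builder.setBolt("count", new WordCountBolt, 4)
  .fieldsGrouping("split", new Fields("word"))             // same word always goes to the same task
new LocalCluster().submitTopology("word-count", new Config, builder.createTopology())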

Slide 21

No content

Slide 22

Streaming Word Count

Slide 23

Trident

TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
  .each(new Fields("sentence"), new Split(), new Fields("word"))
  .groupBy(new Fields("word"))
  .persistentAggregate(new Factory(), new Count(), new Fields("count"))
  .parallelismHint(6);

Slide 24

ScalaStorm, by Evan Chan

class SplitSentence extends StormBolt(outputFields = List("word")) {
  def execute(t: Tuple) = t matchSeq {
    case Seq(line: String) =>
      line.split(" ").foreach { word => using anchor t emit (word) }
      t ack
  }
}

Slide 25

No content

Slide 26

SummingBird: write your job once and run it on both Storm and Hadoop.

Slide 27

def wordCount[P <: Platform[P]](
    source: Producer[P, String],
    store: P#Store[String, Long]) =
  source
    .flatMap { line => line.split("\\s+").map(_ -> 1L) }
    .sumByKey(store)

Slide 28

SummingBird

trait Platform[P <: Platform[P]] {
  type Source[+T]
  type Store[-K, V]
  type Sink[-T]
  type Service[-K, +V]
  type Plan[T]
}

Slide 29

On Storm
- Source[+T]: Spout[(Long, T)]
- Store[-K, V]: StormStore[K, V]
- Sink[-T]: (T => Future[Unit])
- Service[-K, +V]: StormService[K, V]
- Plan[T]: StormTopology

Slide 30

Type Safety

Slide 31

SummingBird dependencies
• Storehaus
• Chill
• Scalding
• Algebird
• Tormenta

Slide 32

But
- Can only aggregate values that are associative: Monoids!

trait Monoid[V] {
  def zero: V
  def aggregate(left: V, right: V): V
}
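
Associativity is what lets partial aggregates computed on different workers be merged in any grouping; a minimal sketch using the trait above:

val intSum = new Monoid[Int] {
  def zero = 0
  def aggregate(left: Int, right: Int) = left + right
}

// (a + b) + c == a + (b + c), so each worker can pre-aggregate locally:
val worker1 = intSum.aggregate(1, 2)             // partial sum on one node
val worker2 = intSum.aggregate(3, 4)             // partial sum on another
val total   = intSum.aggregate(worker1, worker2) // merged result: 10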

Slide 33

No content

Slide 34

Clustering with Mahout redux

def streamClustering[P <: Platform[P]](source: Producer[P, String], store: P#Store[_, _]) = {
  lazy val clust = new StreamingKMeans(
    new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
    args("sloppyclusters").toInt,
    (10e-6).asInstanceOf[Float])

  var count = 0  // mutable counter used to number the centroids

  val sloppyClusters = source
    .map { str =>
      val vec = str.split("\t").map(_.toDouble)
      val cent = new Centroid(count, new DenseVector(vec))
      count += 1
      cent
    }
    .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
      cl.cluster(cent); cl
    }
    .flatMap(c => c.iterator.asScala.toIterable)

Slide 35

Clustering with Mahout redux (cont.)

  val finalClusters = sloppyClusters.groupAll
    .mapValueStream { centList =>
      lazy val bclusterer = new BallKMeans(
        new BruteSearch(new EuclideanDistanceMeasure),
        args("numclusters").toInt, 100)
      bclusterer.cluster(centList.toList.asJava)
      bclusterer.iterator.asScala
    }
    .values
    .saveTo(store)
}

Slide 36

APACHE SPARK

Slide 37

What is Spark?
• Fast and expressive cluster computing system compatible with Apache Hadoop, but an order of magnitude faster
• Improves efficiency through:
  - General execution graphs
  - In-memory storage
• Improves usability through:
  - Rich APIs in Java, Scala, Python
  - Interactive shell

Slide 38

Key idea
• Write programs in terms of transformations on distributed datasets
• Concept: resilient distributed datasets (RDDs)
  - Collections of objects spread across a cluster
  - Built through parallel transformations (map, filter, etc.)
  - Automatically rebuilt on failure
  - Controllable persistence (e.g. caching in RAM)

Slide 39

Example: Word Count
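
A minimal sketch of the classic Spark word count, assuming an existing SparkContext sc and hypothetical HDFS paths:

// older Spark versions also need: import org.apache.spark.SparkContext._
val counts = sc.textFile("hdfs://.../input")
  .flatMap(_.split("\\s+"))   // split lines into words
  .map(word => (word, 1))
  .reduceByKey(_ + _)         // sum the counts per word
counts.saveAsTextFile("hdfs://.../output")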

Slide 40

Other RDD Operators (a combined sketch follows this list)
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
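
A small combined sketch of a few of these operators, assuming an existing SparkContext sc and two made-up pair RDDs:

// older Spark versions also need: import org.apache.spark.SparkContext._
val users  = sc.parallelize(Seq((1, "ann"), (2, "bob")))
val clicks = sc.parallelize(Seq((1, "home"), (1, "search"), (3, "home")))

users.join(clicks).collect()          // inner join: only ids present on both sides
users.leftOuterJoin(clicks).collect() // keeps (2, ("bob", None))
users.union(clicks).collect()         // concatenation of the two RDDs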

Slide 41

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")               // base RDD
errors = lines.filter(s => s.startsWith("ERROR"))  // transformed RDD
messages = errors.map(s => s.split("\t"))
messages.cache()

messages.filter(s => s.contains("foo")).count()    // action
messages.filter(s => s.contains("bar")).count()
. . .

[Diagram: the driver ships tasks to three workers; each worker reads one block of the file and serves results from its in-memory cache of messages]

Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5 sec (vs 180 sec for on-disk data)

Slide 42

Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data.

Ex:
msgs = textFile.filter(_.startsWith("ERROR"))
               .map(_.split("\t"))

[Diagram: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]

Slide 43

Spark Streaming
- Extends Spark's capabilities to large-scale stream processing
- Scales to 100s of nodes and achieves second-scale latencies
- Efficient and fault-tolerant stateful stream processing
- Simple batch-like API for implementing complex algorithms

Slide 44

Discretized Stream Processing
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- Finally, the processed results of the RDD operations are returned in batches

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
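
A minimal sketch of this batching model with Spark Streaming's StreamingContext, assuming a local master and a hypothetical socket source on port 9999:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "BatchingDemo", Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999) // the live data stream
lines.count().print()                               // each 5-second batch becomes an RDD
ssc.start()
ssc.awaitTermination()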

Slide 45

Discretized Stream Processing
- Batch sizes as low as ½ second, latency of about 1 second
- Potential for combining batch processing and streaming processing in the same system

Slide 46

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()

DStream: a sequence of RDDs representing a stream of data

[Diagram: the Twitter Streaming API feeds the tweets DStream; the batches @ t, t+1, t+2 are each stored in memory as an RDD (immutable, distributed)]

Slide 47

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram: each batch of the tweets DStream is flatMapped into a new RDD of the hashTags DStream, e.g. [#cat, #dog, …]; new RDDs are created for every batch]

Slide 48

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreach(hashTagRDD => { ... })

foreach: do whatever you want with the processed data (write to a database, update an analytics UI, …)

[Diagram: a foreach runs on each batch of the hashTags DStream]

Slide 49

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage; every batch is saved to HDFS

Slide 50

Window-based Transformations

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

sliding window operation over the DStream of data: a window length of 1 minute, recomputed at a sliding interval of 5 seconds

Slide 51

Compute Top-K IP addresses

val ssc = new StreamingContext(master, "AlgebirdCMS", Seconds(10), …)
val stream = ssc.kafkaStream(None, filters, StorageLevel.MEMORY, …)
val addresses = stream.map(ipAddress => ipAddress.getText)

val cms = new CountMinSketchMonoid(EPS, DELTA, SEED, PERC)
var globalCMS = cms.zero  // mutable: merged with each batch's partial sketch below
val mm = new MapMonoid[Long, Int]()

// init
val topAddresses = addresses.mapPartitions(ids => ids.map(id => cms.create(id)))
  .reduce(_ ++ _)

Slide 52

topAddresses.foreach(rdd => {
  if (rdd.count() != 0) {
    val partial = rdd.first()
    val partialTopK = partial.heavyHitters
      .map(id => (id, partial.frequency(id).estimate))
      .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
    globalCMS ++= partial
    val globalTopK = globalCMS.heavyHitters
      .map(id => (id, globalCMS.frequency(id).estimate))
      .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
    globalTopK.mkString("[", ",", "]")
  }
})

Slide 53

Multi-purpose analytics stack

[Diagram: Spark + Shark + Spark Streaming cover ad-hoc queries, batch processing, and stream processing, alongside MLbase, GraphX, BlinkDB, and Tachyon]

Slide 54

SPARK / SPARK STREAMING
- Almost identical APIs for batch and streaming
- Single platform with fewer moving parts
- Order of magnitude faster

Slide 55

References

- Sam Ritchie, SummingBird: https://speakerdeck.com/sritchie/summingbird-streaming-mapreduce-at-twitter
- Chris Severs and Vitaly Gordon, Scalable Machine Learning with Scala: http://slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
- Apache Spark: http://spark.incubator.apache.org
- Matei Zaharia, Parallel Programming with Spark

Slide 56

No content