Big Data Analytics with Scala

Sam Bessalah
October 25, 2013

Transcript

  1. Big Data Analytics
    with Scala
    Sam BESSALAH
    @samklr

  2. What is Big Data Analytics?
It’s about computing aggregations and running
    complex models on large datasets, offline, in
    real time, or both.

  3. Lambda Architecture
A blueprint for a Big Data analytics
    architecture

  4. [Image slide]

  5. [Image slide]

  6. [Image slide]

  7. [Image slide]

  8. MapReduce redux
    map : (Km, Vm) → List(Km, Vm)
    in Scala : T => List[(K, V)]
    reduce : (Km, List(Vm)) → List(Kr, Vr)
    in Scala : (K, List[V]) => List[(K, V)]
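
    A minimal Scala sketch of these two signatures (the function names
    and the word-count framing are illustrative, not from the deck):

    // Word count expressed as the two MapReduce phases above.
    def mapPhase(line: String): List[(String, Int)] =
      line.split("\\s+").toList.map(word => (word, 1))

    def reducePhase(key: String, values: List[Int]): List[(String, Int)] =
      List((key, values.sum))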

  9. [Image slide]

  10. Big Data “Hello World” : Word count
    [Image slide: word-count code]

  11. Enter Cascading

  12. [Image slide]

  13. Word Count Redux
(Flat)Map-Reduce

  14. SCALDING
    import com.twitter.scalding._

    class WordCount(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }
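
    To run it, a typical invocation (assuming the job is packaged in an
    assembled jar; file names are illustrative):

    hadoop jar wordcount-assembly.jar com.twitter.scalding.Tool \
      WordCount --hdfs --input input.txt --output output.tsv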

  15. SCALDING : Clustering with Mahout
    import scala.collection.JavaConverters._

    lazy val clust = new StreamingKMeans(
      new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
      args("sloppyclusters").toInt, (10e-6).asInstanceOf[Float])

    var count = 0  // was `val`, but it is mutated below
    val sloppyClusters = TextLine(args("input"))
      .map { str =>
        val vec = str.split("\t").map(_.toDouble)
        val cent = new Centroid(count, new DenseVector(vec))
        count += 1
        cent
      }
      .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
        cl.cluster(cent)
        cl
      }
      .flatMap(c => c.iterator.asScala.toIterable)

  16. SCALDING : Clustering with Mahout
    val finalClusters = sloppyClusters.groupAll
      .mapValueStream { centList =>
        lazy val bclusterer = new BallKMeans(
          new BruteSearch(new EuclideanDistanceMeasure),
          args("numclusters").toInt, 100)
        bclusterer.cluster(centList.toList.asJava)
        bclusterer.iterator.asScala
      }
      .values

  17. Scalding
    - Two APIs : a field-based API and a typed API
    - Field-based API : project, map, discard, groupBy, …
    - Typed API : TypedPipe[T], which works like
    scala.collection.Iterator[T] (see the sketch after this list)
    - Matrix library
    - ALGEBIRD : an abstract algebra library … we’ll
    talk about it later
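
    A minimal sketch of the same word count in the Typed API (assuming
    Scalding's standard TextLine and TypedTsv sources):

    import com.twitter.scalding._

    class TypedWordCount(args: Args) extends Job(args) {
      TypedPipe.from(TextLine(args("input")))
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1L))
        .sumByKey                 // an Algebird Semigroup[Long] sums the counts
        .write(TypedTsv[(String, Long)](args("output")))
    }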

  18. [Image slide]

  19. STORM

  20. - Distributed, fault-tolerant, real-time stream
    computation engine.
    - Four concepts:
    - Streams : infinite sequences of tuples
    - Spouts : sources of streams
    - Bolts : process and produce streams;
    can do filtering, aggregations, joins, …
    - Topologies : define a flow or network of
    spouts and bolts (a wiring sketch follows)
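
    A minimal wiring sketch (assuming Storm's 2013-era Java API called
    from Scala; SentenceSpout is a hypothetical spout, SplitSentence is
    the bolt shown a few slides below):

    import backtype.storm.{Config, LocalCluster}
    import backtype.storm.topology.TopologyBuilder

    val builder = new TopologyBuilder
    builder.setSpout("sentences", new SentenceSpout, 1)  // hypothetical stream source
    builder.setBolt("split", new SplitSentence, 4)       // splits sentences into words
           .shuffleGrouping("sentences")

    val cluster = new LocalCluster
    cluster.submitTopology("wordcount", new Config, builder.createTopology())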

  21. [Image slide]

  22. Streaming Word Count

  23. Trident
    TridentTopology topology = new TridentTopology();
    TridentState wordCounts =
      topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new Factory(), new Count(), new Fields("count"))
        .parallelismHint(6);

  24. ScalaStorm by Evan Chan
    class SplitSentence extends StormBolt(outputFields = List("word")) {
      def execute(t: Tuple) = t matchSeq {
        case Seq(line: String) =>
          line.split(" ").foreach { word => using anchor t emit (word) }
          t ack
      }
    }

  25. [Image slide]

  26. SummingBird
    Write your job once and run it on Storm and
    Hadoop

  27. def wordCount[P <: Platform[P]](source: Producer[P, String], store: P#Store[String, Long]) =
      source.flatMap { line =>
        line.split("\\s+").map(_ -> 1L)
      }
      .sumByKey(store)

  28. SummingBird
    trait Platform[P <: Platform[P]] {
      type Source[+T]
      type Store[-K, V]
      type Sink[-T]
      type Service[-K, +V]
      type Plan[T]
    }

  29. On Storm
    - Source[+T] : Spout[(Long, T)]
    - Store[-K, V] : StormStore[K, V]
    - Sink[-T] : T => Future[Unit]
    - Service[-K, +V] : StormService[K, V]
    - Plan[T] : StormTopology

  30. Type Safety

  31. SummingBird dependencies
    • Storehaus
    • Chill
    • Scalding
    • Algebird
    • Tormenta

  32. But
    - It can only aggregate values whose combining operation
    is associative (with an identity element): monoids!
    trait Monoid[V] {
      def zero: V
      def aggregate(left: V, right: V): V
    }
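
    For instance, a minimal sketch of an instance of the trait above
    (note: Algebird's actual Monoid names this operation `plus`):

    object IntAddition extends Monoid[Int] {
      def zero: Int = 0                                         // identity element
      def aggregate(left: Int, right: Int): Int = left + right  // associative
    }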

  33. [Image slide]

  34. Clustering with Mahout redux
    def streamClustering[P <: Platform[P]](source: Producer[P, String], store: P#Store[_, _]) = {
      lazy val clust = new StreamingKMeans(
        new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
        args("sloppyclusters").toInt, (10e-6).asInstanceOf[Float])

      var count = 0  // was `val`, but it is mutated below
      val sloppyClusters = source
        .map { str =>
          val vec = str.split("\t").map(_.toDouble)
          val cent = new Centroid(count, new DenseVector(vec))
          count += 1
          cent
        }
        .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
          cl.cluster(cent)
          cl
        }
        .flatMap(c => c.iterator.asScala.toIterable)

  35. Clustering with Mahout redux (continued)
      val finalClusters = sloppyClusters.groupAll
        .mapValueStream { centList =>
          lazy val bclusterer = new BallKMeans(
            new BruteSearch(new EuclideanDistanceMeasure),
            args("numclusters").toInt, 100)
          bclusterer.cluster(centList.toList.asJava)
          bclusterer.iterator.asScala
        }
        .values
        .saveTo(store)
    }

  36. APACHE SPARK

  37. What is Spark?
    • A fast and expressive cluster computing system,
    compatible with Apache Hadoop but an order of
    magnitude faster
    • Improves efficiency through:
    - General execution graphs
    - In-memory storage
    • Improves usability through:
    - Rich APIs in Java, Scala, and Python
    - An interactive shell

  38. Key idea
    • Write programs in terms of transformations on distributed
    datasets
    • Concept: resilient distributed datasets (RDDs)
    - Collections of objects spread across a cluster
    - Built through parallel transformations (map, filter, etc.)
    - Automatically rebuilt on failure
    - Controllable persistence (e.g. caching in RAM)

  39. Example: Word Count
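    The code on this slide is an image in the original deck; a minimal
    reconstruction against the Spark RDD API of that era (paths are
    illustrative) might look like:

    val sc = new SparkContext("local", "WordCount")
    val counts = sc.textFile("hdfs://...")
      .flatMap(line => line.split("\\s+"))   // split lines into words
      .map(word => (word, 1))                // pair each word with a count of 1
      .reduceByKey(_ + _)                    // sum the counts per word
    counts.saveAsTextFile("hdfs://...")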

  40. Other RDD Operators
    • map
    • filter
    • groupBy
    • sort
    • union
    • join
    • leftOuterJoin
    • rightOuterJoin
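
    A toy sketch exercising two of these operators (the data and the
    `sc` context are illustrative):

    val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
                                    ("about.html", "3.4.5.6")))
    val pageNames = sc.parallelize(Seq(("index.html", "Home")))
    visits.join(pageNames).collect()
    // Array(("index.html", ("1.2.3.4", "Home")))
    visits.leftOuterJoin(pageNames).collect()
    // Array(("index.html", ("1.2.3.4", Some("Home"))),
    //       ("about.html", ("3.4.5.6", None)))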

  41. Example: Log Mining
    Load error messages from a log into memory,
    then interactively search for various patterns:

    val lines = spark.textFile("hdfs://...")               // base RDD
    val errors = lines.filter(s => s.startsWith("ERROR"))  // transformed RDD
    val messages = errors.map(s => s.split("\t"))
    messages.cache()

    messages.filter(s => s.contains("foo")).count()        // action
    messages.filter(s => s.contains("bar")).count()
    . . .

    [Diagram: the driver ships tasks to three workers; each reads a
    block of the file, caches its partition of messages in memory, and
    returns results]

    Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for
    on-disk data); scaled to 1 TB of data in 5 sec (vs 180 sec for
    on-disk data).

  42. Fault Recovery
    RDDs track lineage information that can be
    used to efficiently recompute lost data.
    Ex: val msgs = textFile.filter(_.startsWith("ERROR"))
                           .map(_.split("\t"))

    [Lineage diagram: HDFS File --filter(_.startsWith(...))-->
    Filtered RDD --map(_.split(...))--> Mapped RDD]

  43. Spark Streaming
    - Extends Spark's capabilities to large-scale stream
    processing
    - Scales to hundreds of nodes and achieves second-scale
    latencies
    - Efficient and fault-tolerant stateful stream processing
    - Simple, batch-like API for implementing complex
    algorithms (a minimal setup sketch follows)
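
    A minimal setup sketch (assuming the Spark Streaming API of that
    era; the master, host, and port are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)  // a DStream of text lines
    val counts = lines.flatMap(_.split("\\s+"))
                      .map((_, 1))
                      .reduceByKey(_ + _)                // per-batch word counts
    counts.print()
    ssc.start()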

  44. Discretized Stream Processing
    [Diagram: live data stream → Spark Streaming → batches of X
    seconds → Spark → processed results]
    - Chop up the live stream into batches of X seconds
    - Spark treats each batch of data as an RDD and processes
    it using RDD operations
    - Finally, the processed results of the RDD operations are
    returned in batches

  45. Discretized Stream Processing
    - Batch sizes as low as half a second, with a latency
    of about one second
    - Potential for combining batch processing and stream
    processing in the same system

  46. Example: Get hashtags from Twitter
    val tweets = ssc.twitterStream()
    DStream: a sequence of RDDs representing a stream of data
    [Diagram: the Twitter Streaming API feeds the tweets DStream, one
    RDD per batch (@ t, t+1, t+2), each stored in memory as an
    immutable, distributed RDD]

  47. Example: Get hashtags from Twitter
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    transformation: modify the data in one DStream to create
    another DStream
    [Diagram: flatMap applied to each batch of the tweets DStream
    yields the hashTags DStream, with new RDDs ([#cat, #dog, …])
    created for every batch]

  48. Example: Get hashtags from Twitter
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.foreach(hashTagRDD => { ... })
    foreach: do whatever you want with the processed data
    (write to a database, update an analytics UI, ...)
    [Diagram: foreach applied to each batch of the hashTags DStream]

  49. Example: Get hashtags from Twitter
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")
    output operation: push data to external storage; every
    batch is saved to HDFS
    [Diagram: save applied to each batch of the hashTags DStream]

  50. Window-based Transformations
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
    sliding window operation over the DStream: a window length
    of one minute, sliding at five-second intervals

  51. Compute the top-K IP addresses
    val ssc = new StreamingContext(master, "AlgebirdCMS", Seconds(10), …)
    val stream = ssc.kafkaStream(None, filters, StorageLevel.MEMORY_ONLY, …)
    val addresses = stream.map(ipAddress => ipAddress.getText)

    val cms = new CountMinSketchMonoid(EPS, DELTA, SEED, PERC)
    var globalCMS = cms.zero  // was `val`, but it is updated below
    val mm = new MapMonoid[Long, Int]()

    val topAddresses = addresses.mapPartitions(ids => ids.map(id => cms.create(id)))
      .reduce(_ ++ _)

  52. topAddresses.foreach(rdd => {
      if (rdd.count() != 0) {
        val partial = rdd.first()
        val partialTopK = partial.heavyHitters.map(id =>
            (id, partial.frequency(id).estimate))
          .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
        globalCMS ++= partial
        val globalTopK = globalCMS.heavyHitters.map(id =>
            (id, globalCMS.frequency(id).estimate))
          .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
        println(globalTopK.mkString("[", ",", "]"))  // assumption: the slide's truncated wrapper call
      }
    })

  53. Multi-purpose analytics stack
    [Diagram: ad-hoc queries, batch processing, and stream processing
    all served by Spark + Shark + Spark Streaming, alongside MLbase,
    GraphX, BlinkDB, and Tachyon]

  54. Spark + Spark Streaming
    - Nearly the same API for batch and streaming
    - A single platform with fewer moving parts
    - An order of magnitude faster

  55. References
    Sam Ritchie, SummingBird:
    https://speakerdeck.com/sritchie/summingbird-streaming-mapreduce-at-twitter
    Chris Severs and Vitaly Gordon, Scalable Machine Learning with Scala:
    http://slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
    Apache Spark: http://spark.incubator.apache.org
    Matei Zaharia, Parallel Programming with Spark

  56. [Image slide]