
Big Data Analytics with Scala

Sam Bessalah
October 25, 2013

Transcript

  1. What is Big Data Analytics? It’s about doing aggregations and running complex models on large datasets, offline, in real time, or both.
  2. Map Reduce redux

     map : (Km, Vm) → List[(Km, Vm)]            in Scala : T => List[(K, V)]
     reduce : (Km, List[Vm]) → List[(Kr, Vr)]   in Scala : (K, List[V]) => List[(K, V)]
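     To make those signatures concrete, here is a minimal word count on plain Scala collections (an illustrative sketch, not from the deck; the names mapper and reducer are hypothetical):

     // Sketch: MapReduce-style word count on local Scala collections.
     // map phase, T => List[(K, V)]: emit (word, 1) for every word in a line
     def mapper(line: String): List[(String, Int)] =
       line.split("\\s+").toList.map(word => (word, 1))

     // reduce phase, (K, List[V]) => List[(K, V)]: sum the counts for a word
     def reducer(word: String, counts: List[Int]): List[(String, Int)] =
       List((word, counts.sum))

     val lines = List("a b a", "b c")
     val counts = lines
       .flatMap(mapper)                                        // map
       .groupBy(_._1)                                          // shuffle by key
       .flatMap { case (w, kvs) => reducer(w, kvs.map(_._2)) } // reduce
     // counts: Map(a -> 2, b -> 2, c -> 1)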
  3. SCALDING

     class WordCount(args : Args) extends Job(args) {
       TextLine(args("input"))
         .flatMap('line -> 'word) { line: String => line.split("\\s+") }
         .groupBy('word) { group => group.size }
         .write(Tsv(args("output")))
     }
  4. SCALDING : Clustering with Mahout

     lazy val clust = new StreamingKMeans(
       new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
       args("sloppyclusters").toInt,
       (10e-6).asInstanceOf[Float])

     var count = 0  // mutable counter used to assign centroid ids
     val sloppyClusters =
       TextLine(args("input"))
         .map { str =>
           val vec = str.split("\t").map(_.toDouble)
           val cent = new Centroid(count, new DenseVector(vec))
           count += 1
           cent
         }
         .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
           cl.cluster(cent); cl
         }
         .flatMap(c => c.iterator.asScala.toIterable)
  5. SCALDING : Clustering with Mahout

     val finalClusters = sloppyClusters.groupAll
       .mapValueStream { centList =>
         lazy val bclusterer = new BallKMeans(
           new BruteSearch(new EuclideanDistanceMeasure),
           args("numclusters").toInt, 100)
         bclusterer.cluster(centList.toList.asJava)
         bclusterer.iterator.asScala
       }
       .values
  6. Scalding - Two APIs

     - Field-based API : project, map, discard, groupBy…
     - Typed API : TypedPipe[T], works like scala.collection.Iterator[T] (see the sketch below)
     - Matrix library
     - ALGEBIRD : abstract algebra library … we’ll talk about it later
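     For comparison, the word count from slide 3 might look like this in the Typed API (a sketch assuming Scalding's TypedPipe and TypedTsv, not code from the deck):

     // Sketch: the same word count using the Typed API.
     class TypedWordCount(args : Args) extends Job(args) {
       TypedPipe.from(TextLine(args("input")))    // TypedPipe[String]
         .flatMap { line => line.split("\\s+") }  // one word per record
         .groupBy(identity)                       // group records by the word itself
         .size                                    // count per word
         .write(TypedTsv[(String, Long)](args("output")))
     }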
  7. STORM

     - Distributed, fault-tolerant, real-time stream computation engine.
     - Four concepts:
       - Streams : infinite sequences of tuples
       - Spouts : sources of streams
       - Bolts : process and produce streams; can do filtering, aggregations, joins, …
       - Topologies : define a flow or network of spouts and bolts.
  8. Trident

     TridentTopology topology = new TridentTopology();
     TridentState wordCounts = topology.newStream("spout1", spout)
       .each(new Fields("sentence"), new Split(), new Fields("word"))
       .groupBy(new Fields("word"))
       .persistentAggregate(new Factory(), new Count(), new Fields("count"))
       .parallelismHint(6);
  9. ScalaStorm by Evan Chan

     class SplitSentence extends StormBolt(outputFields = List("word")) {
       def execute(t: Tuple) = t matchSeq {
         case Seq(line: String) =>
           line.split(" ").foreach { word => using anchor t emit (word) }
           t ack
       }
     }
  10. def wordCount[P <: Platform[P]](source: Producer[P, String],
                                      store: P#Store[String, Long]) =
        source.flatMap { line => line.split("\\s+").map(_ -> 1L) }
              .sumByKey(store)
  11. SummingBird

      trait Platform[P <: Platform[P]] {
        type Source[+T]
        type Store[-K, V]
        type Sink[-T]
        type Service[-K, +V]
        type Plan[T]
      }
  12. On Storm

      - Source[+T]      : Spout[(Long, T)]
      - Store[-K, V]    : StormStore[K, V]
      - Sink[-T]        : (T => Future[Unit])
      - Service[-K, +V] : StormService[K, V]
      - Plan[T]         : StormTopology
  13. But: aggregation only works for values whose combination is associative, i.e. Monoids!

      trait Monoid[V] {
        def zero : V
        def aggregate(left : V, right : V): V
      }
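      For instance, addition on Long forms a monoid under this trait; associativity is what lets partial aggregates be merged in any order (a minimal sketch against the trait above; note Algebird's real Monoid names its combining method plus):

      // Sketch: a Monoid[Long] instance for the trait above.
      val longSum = new Monoid[Long] {
        def zero = 0L
        def aggregate(left: Long, right: Long) = left + right
      }

      // Associativity: (1 + 2) + 3 == 1 + (2 + 3), so partial sums
      // computed on different machines can be combined in any order.
      val total = List(1L, 2L, 3L, 4L).foldLeft(longSum.zero)(longSum.aggregate)  // 10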
  14. Clustering with Mahout redux

      def StreamClustering(source : Producer[P, String], store : P#Store[_, _]) = {
        lazy val clust = new StreamingKMeans(
          new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
          args("sloppyclusters").toInt,
          (10e-6).asInstanceOf[Float])

        var count = 0
        val sloppyClusters = source
          .map { str =>
            val vec = str.split("\t").map(_.toDouble)
            val cent = new Centroid(count, new DenseVector(vec))
            count += 1
            cent
          }
          .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
            cl.cluster(cent); cl
          }
          .flatMap(c => c.iterator.asScala.toIterable)
  15. SummingBird : Clustering with Mahout

        val finalClusters = sloppyClusters.groupAll
          .mapValueStream { centList =>
            lazy val bclusterer = new BallKMeans(
              new BruteSearch(new EuclideanDistanceMeasure),
              args("numclusters").toInt, 100)
            bclusterer.cluster(centList.toList.asJava)
            bclusterer.iterator.asScala
          }
          .values
          .saveTo(store)
      }
  16. What is Spark?

      • Fast and expressive cluster computing system, compatible with Apache Hadoop but an order of magnitude faster
      • Improves efficiency through:
        - General execution graphs
        - In-memory storage
      • Improves usability through:
        - Rich APIs in Java, Scala, Python
        - Interactive shell
  17. Key idea

      • Write programs in terms of transformations on distributed datasets
      • Concept: resilient distributed datasets (RDDs)
        - Collections of objects spread across a cluster
        - Built through parallel transformations (map, filter, etc.)
        - Automatically rebuilt on failure
        - Controllable persistence (e.g. caching in RAM)
  18. Other RDD Operators

      • map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin
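      A quick sketch of the join operators on pair RDDs (hypothetical data; sc is an assumed SparkContext):

      // Sketch: join and leftOuterJoin on pair RDDs.
      val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
                                      ("about.html", "3.4.5.6")))
      val pageNames = sc.parallelize(Seq(("index.html", "Home")))

      visits.join(pageNames).collect()
      // Array((index.html, (1.2.3.4, Home)))

      visits.leftOuterJoin(pageNames).collect()
      // Array((index.html, (1.2.3.4, Some(Home))),
      //       (about.html, (3.4.5.6, None)))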
  19. Example: Log Mining

      Load error messages from a log into memory, then interactively search for various patterns:

      val lines = spark.textFile("hdfs://...")               // base RDD
      val errors = lines.filter(s => s.startsWith("ERROR"))  // transformed RDD
      val messages = errors.map(s => s.split("\t"))
      messages.cache()

      messages.filter(s => s.contains("foo")).count()        // action
      messages.filter(s => s.contains("bar")).count()

      [Diagram: the driver ships tasks to workers; each worker caches its block of messages in memory and returns results]

      Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5 sec (vs 180 sec on disk).
  20. Fault Recovery

      RDDs track lineage information that can be used to efficiently recompute lost data. Ex:

      val msgs = textFile.filter(_.startsWith("ERROR"))
                         .map(_.split("\t"))

      Lineage: HDFS File → (filter) → Filtered RDD → (map) → Mapped RDD
  21. Spark Streaming

      - Extends Spark capabilities to large-scale stream processing
      - Scales to 100s of nodes and achieves second-scale latencies
      - Efficient and fault-tolerant stateful stream processing
      - Simple batch-like API for implementing complex algorithms
  22. Discretized Stream Processing

      [Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

      - Chop up the live stream into batches of X seconds
      - Spark treats each batch of data as an RDD and processes it using RDD operations
      - Finally, the processed results of the RDD operations are returned in batches
  23. Discretized Stream Processing

      - Batch sizes as low as ½ second, latency of about 1 second
      - Potential for combining batch processing and streaming processing in the same system
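      The batch size is fixed when the StreamingContext is created; a minimal sketch with the 2013-era constructor (the socket source and app name are illustrative):

      import org.apache.spark.streaming.{Seconds, StreamingContext}

      // Sketch: 1-second batches; the live stream is chopped into
      // one RDD per second and processed by the Spark engine.
      val ssc = new StreamingContext("local[2]", "BatchDemo", Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999)
      lines.count().print()   // prints the size of each 1-second batch
      ssc.start()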
  24. Example – Get hashtags from Twitter

      val tweets = ssc.twitterStream()

      DStream: a sequence of RDDs representing a stream of data.
      [Diagram: the Twitter Streaming API feeds the tweets DStream, stored in memory as RDDs (immutable, distributed), one per interval: batch @ t, t+1, t+2, …]
  25. Example – Get hashtags from Twitter

      val tweets = ssc.twitterStream()
      val hashTags = tweets.flatMap(status => getTags(status))

      transformation: modify data in one DStream to create another DStream; new RDDs are created for every batch.
      [Diagram: flatMap maps each tweets RDD to a hashTags RDD of [#cat, #dog, …], batch by batch]
  26. Example – Get hashtags from Twitter

      val tweets = ssc.twitterStream()
      val hashTags = tweets.flatMap(status => getTags(status))
      hashTags.foreach(hashTagRDD => { ... })

      foreach: do whatever you want with the processed data of each batch (write to a database, update an analytics UI, …)
  27. Example – Get hashtags from Twitter

      val tweets = ssc.twitterStream()
      val hashTags = tweets.flatMap(status => getTags(status))
      hashTags.saveAsHadoopFiles("hdfs://...")

      output operation: push data to external storage; every batch is saved to HDFS.
  28. Window-based Transformations

      val tweets = ssc.twitterStream()
      val hashTags = tweets.flatMap(status => getTags(status))
      val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

      sliding window operation over the DStream of data: window length of 1 minute, sliding interval of 5 seconds.
  29. Compute TopK IP addresses

      val ssc = new StreamingContext(master, "AlgebirdCMS", Seconds(10), …)
      val stream = ssc.kafkaStream(None, filters, StorageLevel.MEMORY, ..)
      val addresses = stream.map(ipAddress => ipAddress.getText)

      val cms = new CountMinSketchMonoid(EPS, DELTA, SEED, PERC)
      var globalCMS = cms.zero
      val mm = new MapMonoid[Long, Int]()  // init

      val topAddresses = addresses.mapPartitions(ids => {
          ids.map(id => cms.create(id))
        })
        .reduce(_ ++ _)
  30. topAddresses.foreach(rdd => {
        if (rdd.count() != 0) {
          val partial = rdd.first()
          val partialTopK = partial.heavyHitters.map(id =>
              (id, partial.frequency(id).estimate))
            .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
          globalCMS ++= partial
          val globalTopK = globalCMS.heavyHitters.map(id =>
              (id, globalCMS.frequency(id).estimate))
            .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
          println(globalTopK.mkString("[", ",", "]"))
        }
      })
  31. Multi-purpose analytics stack

      Ad-hoc queries, batch processing, and stream processing in one stack:
      Spark + Shark + Spark Streaming, with MLBASE, GraphX, BlinkDB, and TACHYON
  32. SPARK + SPARK STREAMING

      - Nearly identical API for batch and streaming
      - Single platform with fewer moving parts
      - Order of magnitude faster
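      To see how similar the two APIs are, here is the same word count against an RDD and a DStream (a sketch; sc and ssc are an assumed SparkContext and StreamingContext):

      // Batch: word count on an RDD
      val counts = sc.textFile("hdfs://...")
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // Streaming: the identical transformations on a DStream,
      // applied to every batch of the stream
      val streamCounts = ssc.socketTextStream("localhost", 9999)
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)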
  33. References

      - Sam Ritchie, SummingBird: https://speakerdeck.com/sritchie/summingbird-streaming-mapreduce-at-twitter
      - Chris Severs, Vitaly Gordon, Scalable Machine Learning with Scala: http://slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
      - Apache Spark: http://spark.incubator.apache.org
      - Matei Zaharia, Parallel Programming with Spark