
Big Data Analytics with Scala

Sam Bessalah
October 25, 2013

Transcript

  1. Big Data Analytics with Scala Sam BESSALAH @samklr

  2. What is Big Data Analytics? It's about doing aggregations and running complex models on large datasets, offline, in real time, or both.
  3. Lambda Architecture: a blueprint for a Big Data analytics architecture

  4. (image-only slide)
  5. (image-only slide)
  6. (image-only slide)
  7. (image-only slide)
  8. MapReduce redux
     map : (Km, Vm) → List(Km, Vm), in Scala: T => List[(K, V)]
     reduce : (Km, List(Vm)) → List(Kr, Vr), in Scala: (K, List[V]) => List[(K, V)]
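
     A minimal sketch of those two signatures in plain Scala, specialized to word count (the helper names are illustrative, not from the deck):

       // map: one input record => a list of (key, value) pairs
       def mapper(line: String): List[(String, Int)] =
         line.split("\\s+").toList.map(word => (word, 1))

       // reduce: one key and all its collected values => a list of (key, value) pairs
       def reducer(key: String, values: List[Int]): List[(String, Int)] =
         List((key, values.sum))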
  9. (image-only slide)
  10. The Big Data "Hello World": Word Count

  11. Enter Cascading

  12. (image-only slide)
  13. Word Count redux: (Flat)Map-Reduce

  14. SCALDING

      class WordCount(args: Args) extends Job(args) {
        TextLine(args("input"))
          .flatMap('line -> 'word) { line: String => line.split("\\s+") }
          .groupBy('word) { group => group.size }
          .write(Tsv(args("output")))
      }
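
      One common way to run a job like this, assuming a standard Scalding checkout (the paths are placeholders, not from the deck), is the scald.rb driver script in local mode:

        scald.rb --local WordCount.scala --input input.txt --output output.tsv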
  15. SCALDING: Clustering with Mahout

      lazy val clust = new StreamingKMeans(
        new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
        args("sloppyclusters").toInt,
        (10e-6).asInstanceOf[Float])

      var count = 0  // mutable: incremented once per centroid below
      val sloppyClusters =
        TextLine(args("input"))
          .map { str =>
            val vec = str.split("\t").map(_.toDouble)
            val cent = new Centroid(count, new DenseVector(vec))
            count += 1
            cent
          }
          .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
            cl.cluster(cent); cl
          }
          .flatMap(c => c.iterator.asScala.toIterable)
  16. SCALDING: Clustering with Mahout

      val finalClusters = sloppyClusters.groupAll
        .mapValueStream { centList =>
          lazy val bclusterer = new BallKMeans(
            new BruteSearch(new EuclideanDistanceMeasure),
            args("numclusters").toInt, 100)
          bclusterer.cluster(centList.toList.asJava)
          bclusterer.iterator.asScala
        }
        .values
  17. Scalding: two APIs
     - Field-based API: project, map, discard, groupBy, ...
     - Typed API: TypedPipe[T], works like scala.collection.Iterator[T] (sketched below)
     - Matrix library
     - ALGEBIRD: abstract algebra library; we'll talk about it later
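
     A sketch of the word count from slide 14 against the Typed API, assuming a reasonably recent Scalding version; a TypedPipe reads like ordinary collection code:

       class TypedWordCount(args: Args) extends Job(args) {
         TypedPipe.from(TextLine(args("input")))
           .flatMap(_.split("\\s+"))   // line => words
           .map(word => (word, 1L))    // pair each word with a count
           .sumByKey                   // monoid sum per key
           .toTypedPipe
           .write(TypedTsv[(String, Long)](args("output")))
       }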
  18. (image-only slide)
  19. STORM

  20. - Distributed, fault-tolerant, real-time stream computation engine
      - Four concepts:
        - Streams: infinite sequences of tuples
        - Spouts: sources of streams
        - Bolts: process and produce streams; can do filtering, aggregations, joins, ...
        - Topologies: define a flow, or network, of spouts and bolts (wired up as sketched below)
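
      A minimal wiring sketch using Storm's Java API from Scala; the spout and bolt classes here are assumptions, not from the deck:

        import backtype.storm.topology.TopologyBuilder
        import backtype.storm.tuple.Fields

        val builder = new TopologyBuilder
        builder.setSpout("sentences", new SentenceSpout, 2)   // source of the stream
        builder.setBolt("split", new SplitSentence, 4)        // bolt: line => words
               .shuffleGrouping("sentences")
        builder.setBolt("count", new WordCountBolt, 4)        // bolt: running counts
               .fieldsGrouping("split", new Fields("word"))   // route tuples by word
        val topology = builder.createTopology()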
  21. (image-only slide)
  22. Streaming Word Count

  23. Trident

      TridentTopology topology = new TridentTopology();
      TridentState wordCounts = topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                             new Fields("count"))
        .parallelismHint(6);
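
      The usual companion to this example (from the Storm Trident tutorial, lightly recast in Scala syntax) queries the persisted counts over DRPC:

        topology.newDRPCStream("words")
          .each(new Fields("args"), new Split(), new Fields("word"))
          .groupBy(new Fields("word"))
          .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
          .each(new Fields("count"), new FilterNull())
          .aggregate(new Fields("count"), new Sum(), new Fields("sum"))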
  24. ScalaStorm by Evan Chan

      class SplitSentence extends StormBolt(outputFields = List("word")) {
        def execute(t: Tuple) = t matchSeq {
          case Seq(line: String) =>
            line.split(" ").foreach { word => using anchor t emit (word) }
            t ack
        }
      }
  25. (image-only slide)
  26. SummingBird: write your job once and run it on Storm and Hadoop.
  27. def wordCount[P <: Platform[P]]
          (source: Producer[P, String], store: P#Store[String, Long]) =
        source.flatMap { line =>
          line.split("\\s+").map(_ -> 1L)
        }
        .sumByKey(store)
  28. SummingBird

      trait Platform[P <: Platform[P]] {
        type Source[+T]
        type Store[-K, V]
        type Sink[-T]
        type Service[-K, +V]
        type Plan[T]
      }
  29. On Storm:
      - Source[+T]      : Spout[(Long, T)]
      - Store[-K, V]    : StormStore[K, V]
      - Sink[-T]        : (T => Future[Unit])
      - Service[-K, +V] : StormService[K, V]
      - Plan[T]         : StormTopology
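
      A hedged sketch of what that buys you; the platform, source, and store values here are assumptions, not the exact SummingBird API. Planning the generic wordCount from slide 27 against the Storm bindings yields a runnable StormTopology:

        // stormPlatform, stormSource, stormStore: hypothetical Storm bindings
        val topology: StormTopology =
          stormPlatform.plan(wordCount[Storm](stormSource, stormStore))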
  30. Type safety

  31. SummingBird dependencies
      • Storehaus
      • Chill
      • Scalding
      • Algebird
      • Tormenta
  32. But: it can only aggregate values whose combining operation is associative: monoids!

      trait Monoid[V] {
        def zero: V
        def aggregate(left: V, right: V): V
      }
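
      A minimal instance of that trait, for Long addition (note that Algebird's real Monoid names the combining operation plus, not aggregate):

        object LongSum extends Monoid[Long] {
          def zero: Long = 0L
          def aggregate(left: Long, right: Long): Long = left + right
        }

      Associativity is what lets each mapper pre-aggregate its share and lets reducers merge partials in any order: aggregate(aggregate(a, b), c) == aggregate(a, aggregate(b, c)).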
  33. (image-only slide)
  34. Clustering with Mahout redux

      def streamClustering[P <: Platform[P]]
          (source: Producer[P, String], store: P#Store[_, _]) = {
        lazy val clust = new StreamingKMeans(
          new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
          args("sloppyclusters").toInt,
          (10e-6).asInstanceOf[Float])

        var count = 0
        val sloppyClusters = source
          .map { str =>
            val vec = str.split("\t").map(_.toDouble)
            val cent = new Centroid(count, new DenseVector(vec))
            count += 1
            cent
          }
          .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
            cl.cluster(cent); cl
          }
          .flatMap(c => c.iterator.asScala.toIterable)
  35. Clustering with Mahout redux (continued)

        val finalClusters = sloppyClusters.groupAll
          .mapValueStream { centList =>
            lazy val bclusterer = new BallKMeans(
              new BruteSearch(new EuclideanDistanceMeasure),
              args("numclusters").toInt, 100)
            bclusterer.cluster(centList.toList.asJava)
            bclusterer.iterator.asScala
          }
          .values
          .saveTo(store)
      }
  36. APACHE SPARK

  37. What is Spark?
      • A fast and expressive cluster computing system, compatible with Apache Hadoop but an order of magnitude faster
      • Improves efficiency through:
        - general execution graphs
        - in-memory storage
      • Improves usability through:
        - rich APIs in Java, Scala, Python
        - an interactive shell
  38. Key idea
      • Write programs in terms of transformations on distributed datasets
      • Concept: resilient distributed datasets (RDDs)
        - collections of objects spread across a cluster
        - built through parallel transformations (map, filter, etc.)
        - automatically rebuilt on failure
        - controllable persistence (e.g. caching in RAM)
  39. Example: Word Count
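
      The slide's code is an image; a standard Spark word count along the lines it shows (spark is the SparkContext, the paths are placeholders):

        val file = spark.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split("\\s+"))  // lines => words
                         .map(word => (word, 1))               // pair with a count
                         .reduceByKey(_ + _)                   // sum counts per word
        counts.saveAsTextFile("hdfs://...")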

  40. Other RDD operators
      • map • filter • groupBy • sort
      • union • join • leftOuterJoin • rightOuterJoin
  41. Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.

      lines = spark.textFile("hdfs://...")               // base RDD
      errors = lines.filter(s => s.startsWith("ERROR"))  // transformed RDD
      messages = errors.map(s => s.split("\t"))
      messages.cache()

      messages.filter(s => s.contains("foo")).count()    // action
      messages.filter(s => s.contains("bar")).count()

      (diagram: the driver ships tasks to three workers, each caching one block of the file and returning results)

      Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5 sec (vs 180 sec for on-disk data).
  42. Fault recovery: RDDs track lineage information that can be used to efficiently recompute lost data. Example:

      msgs = textFile.filter(_.startsWith("ERROR"))
                     .map(_.split("\t"))

      Lineage: HDFS file → [filter(_.startsWith(...))] → filtered RDD → [map(_.split(...))] → mapped RDD
  43. Spark Streaming
      - Extends Spark's capabilities to large-scale stream processing
      - Scales to hundreds of nodes and achieves second-scale latencies
      - Efficient and fault-tolerant stateful stream processing
      - Simple, batch-like API for implementing complex algorithms
  44. Discretized stream processing
      (diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results)
      - Chop up the live stream into batches of X seconds
      - Spark treats each batch of data as an RDD and processes it using RDD operations
      - Finally, the processed results of the RDD operations are returned in batches
  45. Discretized stream processing
      - Batch sizes as low as half a second, latencies of about one second
      - Potential for combining batch processing and stream processing in the same system
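
      A sketch of the batch-interval knob, using the same constructor shape that appears on slide 51 (the names and source are placeholders):

        val ssc = new StreamingContext(master, "StreamApp", Seconds(1))  // 1-second batches
        val lines = ssc.socketTextStream("localhost", 9999)              // one possible source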
  46. Example: get hashtags from Twitter

      val tweets = ssc.twitterStream()

      DStream: a sequence of RDDs representing a stream of data. Each batch (@ t, @ t+1, @ t+2, ...) of the tweets DStream is stored in memory as an RDD (immutable, distributed), fed by the Twitter Streaming API.
  47. Example: get hashtags from Twitter

      val tweets = ssc.twitterStream()
      val hashTags = tweets.flatMap(status => getTags(status))

      Transformation: modify data in one DStream to create another DStream. flatMap runs on every batch of the tweets DStream, creating the new RDDs of the hashTags DStream (e.g. [#cat, #dog, ...]).
  48. Example: get hashtags from Twitter

      val tweets = ssc.twitterStream()
      val hashTags = tweets.flatMap(status => getTags(status))
      hashTags.foreach(hashTagRDD => { ... })

      foreach: do whatever you want with the processed data of each batch: write to a database, update an analytics UI, ...
  49. Example: get hashtags from Twitter

      val tweets = ssc.twitterStream()
      val hashTags = tweets.flatMap(status => getTags(status))
      hashTags.saveAsHadoopFiles("hdfs://...")

      Output operation: push data to external storage; every batch is saved to HDFS.
  50. Window-based transformations

      val tweets = ssc.twitterStream()
      val hashTags = tweets.flatMap(status => getTags(status))
      val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

      A sliding window operation over the DStream: a window length of one minute, sliding at an interval of five seconds.
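
      Each windowed batch is itself an RDD of (tag, count) pairs, so it can be inspected with the same foreach shown on slide 48; a minimal sketch:

        tagCounts.foreach(rdd => println(rdd.take(10).mkString(", ")))  // peek at each window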
  51. Compute top-K IP addresses

      val ssc = new StreamingContext(master, "AlgebirdCMS", Seconds(10), …)
      val stream = ssc.kafkaStream(None, filters, StorageLevel.MEMORY, ..)
      val addresses = stream.map(ipAddress => ipAddress.getText)

      val cms = new CountMinSketchMonoid(EPS, DELTA, SEED, PERC)
      var globalCMS = cms.zero  // mutable: merged with each partial sketch below
      val mm = new MapMonoid[Long, Int]()  // init

      val topAddresses = addresses
        .mapPartitions(ids => ids.map(id => cms.create(id)))
        .reduce(_ ++ _)
  52. topAddresses.foreach { rdd =>
        if (rdd.count() != 0) {
          val partial = rdd.first()
          val partialTopK = partial.heavyHitters
            .map(id => (id, partial.frequency(id).estimate))
            .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
          globalCMS ++= partial
          val globalTopK = globalCMS.heavyHitters
            .map(id => (id, globalCMS.frequency(id).estimate))
            .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
          println(globalTopK.mkString("[", ",", "]"))
        }
      }
  53. Multi-purpose analytics stack: ad-hoc queries, batch processing, and stream processing on one stack (Spark + Shark + Spark Streaming), alongside MLbase, GraphX, BlinkDB, and Tachyon.
  54. SPARK + SPARK STREAMING
      - Nearly the same API for batch and streaming
      - A single platform with fewer moving parts
      - An order of magnitude faster
  55. References
      - Sam Ritchie, SummingBird: Streaming MapReduce at Twitter. https://speakerdeck.com/sritchie/summingbird-streaming-mapreduce-at-twitter
      - Chris Severs, Vitaly Gordon, Scalable and Flexible Machine Learning with Scala. http://slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
      - Apache Spark: http://spark.incubator.apache.org
      - Matei Zaharia, Parallel Programming with Spark.
  56. (image-only slide)