Big Data Analytics with Scala

Sam Bessalah
October 25, 2013

Transcript

  1. Big Data Analytics
    with Scala
    Sam BESSALAH
    @samklr

  2. What is Big Data Analytics?
It’s about computing aggregations and running
    complex models on large datasets, offline, in
    real time, or both.

  3. Lambda Architecture
A blueprint for a Big Data analytics
    architecture

  4. [Image slide]

  5. [Image slide]

  6. [Image slide]

  7. [Image slide]

  8. MapReduce redux
    map : (Km, Vm) → List(Km, Vm)
    in Scala : T => List[(K, V)]
    reduce : (Km, List(Vm)) → List(Kr, Vr)
    in Scala : (K, List[V]) => List[(K, V)]
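
    A minimal Scala sketch of these two signatures (the function names
    and the word-count framing are illustrative, not from the deck):

    // Word count expressed as the two MapReduce phases above.
    def mapPhase(line: String): List[(String, Int)] =
      line.split("\\s+").toList.map(word => (word, 1))

    def reducePhase(key: String, values: List[Int]): List[(String, Int)] =
      List((key, values.sum))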

  9. [Image slide]

  10. Big Data “Hello World” : Word count
    [Image slide: word-count code]

  11. Enter Cascading

  12. [Image slide]

  13. Word Count Redux
(Flat)Map-Reduce

  14. SCALDING
    import com.twitter.scalding._

    class WordCount(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }
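
    To run it, a typical invocation (assuming the job is packaged in an
    assembled jar; file names are illustrative):

    hadoop jar wordcount-assembly.jar com.twitter.scalding.Tool \
      WordCount --hdfs --input input.txt --output output.tsv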

  15. SCALDING : Clustering with Mahout
    import scala.collection.JavaConverters._

    lazy val clust = new StreamingKMeans(
      new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
      args("sloppyclusters").toInt, (10e-6).asInstanceOf[Float])

    var count = 0  // was `val`, but it is mutated below
    val sloppyClusters = TextLine(args("input"))
      .map { str =>
        val vec = str.split("\t").map(_.toDouble)
        val cent = new Centroid(count, new DenseVector(vec))
        count += 1
        cent
      }
      .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
        cl.cluster(cent)
        cl
      }
      .flatMap(c => c.iterator.asScala.toIterable)

  16. SCALDING : Clustering with Mahout
    val finalClusters = sloppyClusters.groupAll
      .mapValueStream { centList =>
        lazy val bclusterer = new BallKMeans(
          new BruteSearch(new EuclideanDistanceMeasure),
          args("numclusters").toInt, 100)
        bclusterer.cluster(centList.toList.asJava)
        bclusterer.iterator.asScala
      }
      .values

  17. Scalding
    - Two APIs : a field-based API and a typed API
    - Field-based API : project, map, discard, groupBy, …
    - Typed API : TypedPipe[T], which works like
    scala.collection.Iterator[T] (see the sketch after this list)
    - Matrix library
    - ALGEBIRD : an abstract algebra library … we’ll
    talk about it later
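
    A minimal sketch of the same word count in the Typed API (assuming
    Scalding's standard TextLine and TypedTsv sources):

    import com.twitter.scalding._

    class TypedWordCount(args: Args) extends Job(args) {
      TypedPipe.from(TextLine(args("input")))
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1L))
        .sumByKey                 // an Algebird Semigroup[Long] sums the counts
        .write(TypedTsv[(String, Long)](args("output")))
    }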

  18. [Image slide]

  19. STORM

  20. - Distributed, fault-tolerant, real-time stream
    computation engine.
    - Four concepts:
    - Streams : infinite sequences of tuples
    - Spouts : sources of streams
    - Bolts : process and produce streams;
    can do filtering, aggregations, joins, …
    - Topologies : define a flow or network of
    spouts and bolts (a wiring sketch follows)
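
    A minimal wiring sketch (assuming Storm's 2013-era Java API called
    from Scala; SentenceSpout is a hypothetical spout, SplitSentence is
    the bolt shown a few slides below):

    import backtype.storm.{Config, LocalCluster}
    import backtype.storm.topology.TopologyBuilder

    val builder = new TopologyBuilder
    builder.setSpout("sentences", new SentenceSpout, 1)  // hypothetical stream source
    builder.setBolt("split", new SplitSentence, 4)       // splits sentences into words
           .shuffleGrouping("sentences")

    val cluster = new LocalCluster
    cluster.submitTopology("wordcount", new Config, builder.createTopology())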

  21. [Image slide]

  22. Streaming Word Count

  23. Trident
    TridentTopology topology = new TridentTopology();
    TridentState wordCounts =
      topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new Factory(), new Count(), new Fields("count"))
        .parallelismHint(6);

  24. ScalaStorm by Evan Chan
    class SplitSentence extends StormBolt(outputFields = List("word")) {
      def execute(t: Tuple) = t matchSeq {
        case Seq(line: String) =>
          line.split(" ").foreach { word => using anchor t emit (word) }
          t ack
      }
    }

  25. [Image slide]

  26. SummingBird
    Write your job once and run it on Storm and
    Hadoop

  27. def wordCount[P <: Platform[P]](source: Producer[P, String], store: P#Store[String, Long]) =
      source.flatMap { line =>
        line.split("\\s+").map(_ -> 1L)
      }
      .sumByKey(store)

  28. SummingBird
    trait Platform[P <: Platform[P]] {
      type Source[+T]
      type Store[-K, V]
      type Sink[-T]
      type Service[-K, +V]
      type Plan[T]
    }

  29. On Storm
    - Source[+T] : Spout[(Long, T)]
    - Store[-K, V] : StormStore[K, V]
    - Sink[-T] : T => Future[Unit]
    - Service[-K, +V] : StormService[K, V]
    - Plan[T] : StormTopology

  30. Type Safety

  31. SummingBird dependencies
    • Storehaus
    • Chill
    • Scalding
    • Algebird
    • Tormenta

  32. But
    - It can only aggregate values whose combining operation
    is associative (with an identity element): monoids!
    trait Monoid[V] {
      def zero: V
      def aggregate(left: V, right: V): V
    }
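
    For instance, a minimal sketch of an instance of the trait above
    (note: Algebird's actual Monoid names this operation `plus`):

    object IntAddition extends Monoid[Int] {
      def zero: Int = 0                                         // identity element
      def aggregate(left: Int, right: Int): Int = left + right  // associative
    }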

  33. [Image slide]

  34. Clustering with Mahout redux
    def streamClustering[P <: Platform[P]](source: Producer[P, String], store: P#Store[_, _]) = {
      lazy val clust = new StreamingKMeans(
        new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
        args("sloppyclusters").toInt, (10e-6).asInstanceOf[Float])

      var count = 0  // was `val`, but it is mutated below
      val sloppyClusters = source
        .map { str =>
          val vec = str.split("\t").map(_.toDouble)
          val cent = new Centroid(count, new DenseVector(vec))
          count += 1
          cent
        }
        .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
          cl.cluster(cent)
          cl
        }
        .flatMap(c => c.iterator.asScala.toIterable)

  35. Clustering with Mahout redux (continued)
      val finalClusters = sloppyClusters.groupAll
        .mapValueStream { centList =>
          lazy val bclusterer = new BallKMeans(
            new BruteSearch(new EuclideanDistanceMeasure),
            args("numclusters").toInt, 100)
          bclusterer.cluster(centList.toList.asJava)
          bclusterer.iterator.asScala
        }
        .values
        .saveTo(store)
    }

  36. APACHE SPARK

  37. What is Spark?
    • A fast and expressive cluster computing system,
    compatible with Apache Hadoop but an order of
    magnitude faster
    • Improves efficiency through:
    - General execution graphs
    - In-memory storage
    • Improves usability through:
    - Rich APIs in Java, Scala, and Python
    - An interactive shell

  38. Key idea
    • Write programs in terms of transformations on distributed
    datasets
    • Concept: resilient distributed datasets (RDDs)
    - Collections of objects spread across a cluster
    - Built through parallel transformations (map, filter, etc.)
    - Automatically rebuilt on failure
    - Controllable persistence (e.g. caching in RAM)

  39. Example: Word Count
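    The code on this slide is an image in the original deck; a minimal
    reconstruction against the Spark RDD API of that era (paths are
    illustrative) might look like:

    val sc = new SparkContext("local", "WordCount")
    val counts = sc.textFile("hdfs://...")
      .flatMap(line => line.split("\\s+"))   // split lines into words
      .map(word => (word, 1))                // pair each word with a count of 1
      .reduceByKey(_ + _)                    // sum the counts per word
    counts.saveAsTextFile("hdfs://...")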

  40. Other RDD Operators
    • map
    • filter
    • groupBy
    • sort
    • union
    • join
    • leftOuterJoin
    • rightOuterJoin
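
    A toy sketch exercising two of these operators (the data and the
    `sc` context are illustrative):

    val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
                                    ("about.html", "3.4.5.6")))
    val pageNames = sc.parallelize(Seq(("index.html", "Home")))
    visits.join(pageNames).collect()
    // Array(("index.html", ("1.2.3.4", "Home")))
    visits.leftOuterJoin(pageNames).collect()
    // Array(("index.html", ("1.2.3.4", Some("Home"))),
    //       ("about.html", ("3.4.5.6", None)))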

  41. Example: Log Mining
    Load error messages from a log into memory,
    then interactively search for various patterns:

    val lines = spark.textFile("hdfs://...")               // base RDD
    val errors = lines.filter(s => s.startsWith("ERROR"))  // transformed RDD
    val messages = errors.map(s => s.split("\t"))
    messages.cache()

    messages.filter(s => s.contains("foo")).count()        // action
    messages.filter(s => s.contains("bar")).count()
    . . .

    [Diagram: the driver ships tasks to three workers; each reads a
    block of the file, caches its partition of messages in memory, and
    returns results]

    Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for
    on-disk data); scaled to 1 TB of data in 5 sec (vs 180 sec for
    on-disk data).

  42. Fault Recovery
    RDDs track lineage information that can be
    used to efficiently recompute lost data.
    Ex: val msgs = textFile.filter(_.startsWith("ERROR"))
                           .map(_.split("\t"))

    [Lineage diagram: HDFS File --filter(_.startsWith(...))-->
    Filtered RDD --map(_.split(...))--> Mapped RDD]

  43. Spark Streaming
    - Extends Spark's capabilities to large-scale stream
    processing
    - Scales to hundreds of nodes and achieves second-scale
    latencies
    - Efficient and fault-tolerant stateful stream processing
    - Simple, batch-like API for implementing complex
    algorithms (a minimal setup sketch follows)
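
    A minimal setup sketch (assuming the Spark Streaming API of that
    era; the master, host, and port are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)  // a DStream of text lines
    val counts = lines.flatMap(_.split("\\s+"))
                      .map((_, 1))
                      .reduceByKey(_ + _)                // per-batch word counts
    counts.print()
    ssc.start()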

  44. Discretized Stream Processing
    [Diagram: live data stream → Spark Streaming → batches of X
    seconds → Spark → processed results]
    - Chop up the live stream into batches of X seconds
    - Spark treats each batch of data as an RDD and processes
    it using RDD operations
    - Finally, the processed results of the RDD operations are
    returned in batches

  45. Discretized Stream Processing
    - Batch sizes as low as half a second, with a latency
    of about one second
    - Potential for combining batch processing and stream
    processing in the same system

  46. Example: Get hashtags from Twitter
    val tweets = ssc.twitterStream()
    DStream: a sequence of RDDs representing a stream of data
    [Diagram: the Twitter Streaming API feeds the tweets DStream, one
    RDD per batch (@ t, t+1, t+2), each stored in memory as an
    immutable, distributed RDD]

  47. Example: Get hashtags from Twitter
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    transformation: modify the data in one DStream to create
    another DStream
    [Diagram: flatMap applied to each batch of the tweets DStream
    yields the hashTags DStream, with new RDDs ([#cat, #dog, …])
    created for every batch]

  48. Example: Get hashtags from Twitter
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.foreach(hashTagRDD => { ... })
    foreach: do whatever you want with the processed data
    (write to a database, update an analytics UI, ...)
    [Diagram: foreach applied to each batch of the hashTags DStream]

  49. Example: Get hashtags from Twitter
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")
    output operation: push data to external storage; every
    batch is saved to HDFS
    [Diagram: save applied to each batch of the hashTags DStream]

  50. Window-based Transformations
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
    sliding window operation over the DStream: a window length
    of one minute, sliding at five-second intervals

  51. Compute the top-K IP addresses
    val ssc = new StreamingContext(master, "AlgebirdCMS", Seconds(10), …)
    val stream = ssc.kafkaStream(None, filters, StorageLevel.MEMORY_ONLY, …)
    val addresses = stream.map(ipAddress => ipAddress.getText)

    val cms = new CountMinSketchMonoid(EPS, DELTA, SEED, PERC)
    var globalCMS = cms.zero  // was `val`, but it is updated below
    val mm = new MapMonoid[Long, Int]()

    val topAddresses = addresses.mapPartitions(ids => ids.map(id => cms.create(id)))
      .reduce(_ ++ _)

  52. topAddresses.foreach(rdd => {
      if (rdd.count() != 0) {
        val partial = rdd.first()
        val partialTopK = partial.heavyHitters.map(id =>
            (id, partial.frequency(id).estimate))
          .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
        globalCMS ++= partial
        val globalTopK = globalCMS.heavyHitters.map(id =>
            (id, globalCMS.frequency(id).estimate))
          .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
        println(globalTopK.mkString("[", ",", "]"))  // assumption: the slide's truncated wrapper call
      }
    })

  53. Multi-purpose analytics stack
    [Diagram: ad-hoc queries, batch processing, and stream processing
    all served by Spark + Shark + Spark Streaming, alongside MLbase,
    GraphX, BlinkDB, and Tachyon]

  54. Spark + Spark Streaming
    - Nearly the same API for batch and streaming
    - A single platform with fewer moving parts
    - An order of magnitude faster

  55. References
    Sam Ritchie, SummingBird:
    https://speakerdeck.com/sritchie/summingbird-streaming-mapreduce-at-twitter
    Chris Severs and Vitaly Gordon, Scalable Machine Learning with Scala:
    http://slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
    Apache Spark: http://spark.incubator.apache.org
    Matei Zaharia, Parallel Programming with Spark

  56. [Image slide]