Big Data using Scala

Sam Bessalah
May 29, 2013

Paris Scala Meetup May 29th 2013

Transcript

  1. 2.

    Outline
    - Scala in the Hadoop World
        - Hadoop and MapReduce Basics
        - Scalding
        - A word about other Scala DSLs: Scrunch and Scoobi
    - Spark and Co.
        - Spark
        - Spark Streaming
    - More projects using Scala for Data Analysis
  2. 5.
  3. 6.

    Hadoop
    - Redundant, fault-tolerant data storage
    - Parallel computation framework
    - Job coordination
  4. 7.

    MapReduce
    - A programming model for expressing distributed computations at massive scale
    - An execution framework for organizing and performing those computations in an efficient and fault-tolerant way
    - Bundled within the Hadoop framework
  5. 8.

    MapReduce redux
    - Implements two functions at a high level:
        Map(k1, v1) → List(k2, v2)
        Reduce(k2, List(v2)) → List(k3, v3)
    - The framework takes care of all the plumbing: distribution, sorting, shuffling ...
    - Values with the same key flow to the same reducer
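
    The two signatures above can be sketched on local Scala collections. This is only an illustration: `mapper`, `reducer` and `run` are hypothetical names, and `run` stands in for the shuffle that the Hadoop framework performs across machines.

    ```scala
    object MapReduceSketch {
      // Map(k1, v1) -> List(k2, v2): emit (word, 1) for every word in a line.
      def mapper(offset: Long, line: String): List[(String, Int)] =
        line.toLowerCase.split("\\W+").filter(_.nonEmpty).map(w => (w, 1)).toList

      // Reduce(k2, List(v2)) -> List(k3, v3): sum the counts of one word.
      def reducer(word: String, counts: List[Int]): List[(String, Int)] =
        List((word, counts.sum))

      // Stand-in for the framework: run mappers, group values by key, run reducers.
      def run(lines: Seq[String]): Map[String, Int] = {
        val mapped = lines.zipWithIndex.flatMap { case (line, i) => mapper(i.toLong, line) }
        val shuffled = mapped.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).toList) }
        shuffled.toSeq.flatMap { case (k, vs) => reducer(k, vs) }.toMap
      }

      def main(args: Array[String]): Unit =
        println(run(Seq("to be or not to be", "to do")))
    }
    ```

    All values for one key reach a single `reducer` call, which is the local equivalent of values with the same key flowing to the same reducer.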
  6. 9.
  7. 10.
  8. 11.
  9. 12.

    - Way too long for simple word counting
    - This gave birth to new tools like Hive or Pig
    - Pig: a scripting language for data flows
        text = LOAD 'text' USING TextLoader();
        tokens = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) AS word;
        wordcount = FOREACH (GROUP tokens BY word) GENERATE group AS word, COUNT_STAR($1) AS ct;
  10. 13.

    Cascading
    - Open source, created by Chris Wensel, now developed at @Concurrent
    - Written in Java; revolves around the concept of pipes, or data flows, that are eventually transformed into MapReduce jobs
  11. 14.

    - Cascading changes the MR programming model to a generic, data-flow-oriented programming model
    - A Flow is composed of a Source, a Sink and a Pipe to connect them
    - A Pipe is a set of transformations over the input data
    - Pipes can be combined to create more complex workflows
    - Contains a flow optimizer that converts a user data flow into an optimized data flow, which can in turn be converted into an efficient MapReduce job
    - We can think of pipes as distributed collections
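
    The Source / Pipe / Sink vocabulary can be sketched with plain Scala functions. This is not the Cascading API: `Pipe`, `combine` and `runFlow` are hypothetical names, and everything runs locally on an `Iterator`.

    ```scala
    object FlowSketch {
      // A pipe is a transformation over a stream of input records.
      type Pipe[A, B] = Iterator[A] => Iterator[B]

      // Pipes combine into larger pipes by function composition.
      def combine[A, B, C](first: Pipe[A, B], second: Pipe[B, C]): Pipe[A, C] =
        first.andThen(second)

      // A flow connects a source to a sink through a pipe.
      def runFlow[A, B](source: Iterator[A], pipe: Pipe[A, B], sink: B => Unit): Unit =
        pipe(source).foreach(sink)

      def main(args: Array[String]): Unit = {
        val tokenize: Pipe[String, String] = _.flatMap(_.split(" ").toList)
        val lengths: Pipe[String, Int] = _.map(_.length)
        runFlow(Iterator("big data", "scala"), combine(tokenize, lengths), (n: Int) => println(n))
      }
    }
    ```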
  12. 16.
  13. 17.

    But ...
    - Cascading makes use of FP idioms
    - Functions are wrapped in objects
    - Constructors (new) define composition between pipes
    - The MapReduce paradigm itself derives from FP
    Why not use functional programming?
  14. 18.

    SCALDING
    - A Scala DSL on top of Cascading
    - Open source project developed at Twitter by Avi Bryant (@avibryant), Oscar Boykin (@posco) and Argyris Zymnis (@argyris)
    - http://github.twitter.com/twitter/scalding
  15. 19.

    Scalding
    - Two APIs:
        * Fields API: the primary API, using Cascading Fields; dynamic, with errors at runtime
        * Type-safe API: uses Scala types, with errors at compile time. We’ll focus on this one
    - Both can be joined using pipe.Typed and TypedPipe.from
  16. 20.
  17. 22.

    In reality:
        val countedWords = groupedWords.size
        val countedWords = groupedWords.mapValues(x => 1L).sum
        // with an implicit Monoid[Long] in scope
        val countedWords = groupedWords.mapValues(x => 1L)
                                       .reduce((l, r) => implicitly[Monoid[Long]].plus(l, r))
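
    The equivalence between counting and summing ones with a monoid can be checked on local collections. In this sketch `Monoid` is a hypothetical minimal trait written for the example, not the definition Scalding uses, and a `Map` stands in for a grouped pipe.

    ```scala
    object MonoidCountSketch {
      // Hypothetical minimal Monoid type class, defined only for this sketch.
      trait Monoid[T] {
        def zero: T
        def plus(l: T, r: T): T
      }

      implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
        val zero = 0L
        def plus(l: Long, r: Long): Long = l + r
      }

      // Local stand-in for a grouped pipe: key -> values.
      def grouped(words: Seq[String]): Map[String, Seq[String]] = words.groupBy(identity)

      // size: count the values under each key.
      def bySize(words: Seq[String]): Map[String, Long] =
        grouped(words).map { case (k, vs) => (k, vs.size.toLong) }

      // mapValues(x => 1L).sum: replace every value by 1 and fold with the Monoid.
      def byMonoid(words: Seq[String])(implicit mon: Monoid[Long]): Map[String, Long] =
        grouped(words).map { case (k, vs) => (k, vs.map(_ => 1L).foldLeft(mon.zero)(mon.plus)) }
    }
    ```

    Because `plus` is associative, the fold can be split across machines and combined in any grouping, which is what makes the monoid formulation attractive for distributed aggregation.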
  18. 23.

    Fields-based API
        pipe.flatMap(existingFields -> additionalFields){function}
        pipe.map(existingFields -> additionalFields){function}
        pipe.project(fields)
        pipe.discard(fields)
        pipe.mapTo(existingFields -> additionalFields){function}
        pipe.groupBy(fields){ group => ... }
        group.reduce(field){function}
        group.foldLeft(field){function}
        …
    https://github.com/twitter/scalding/wiki/Fields-based-API-Reference
  19. 24.

    Grouping and Mapping
    - GroupBuilder: a builder-pattern object that operates over groups of rows in a pipe.
        - Helps build several parallel aggregations (counting, summing, ...) in one pass. Awesome for stream aggregation.
        - Used for GroupBy; adds fields which are reductions of existing ones.
    - MapReduceMap: map-side aggregation, derived from Cascading, using combiners instead of reducers.
        - Gotcha: doesn’t work with FoldLeft, which is pushed to reducers
  20. 25.

    Type Safe API
    Two concepts:
    - TypedPipe[T]
        - Wraps a Cascading Pipe object. Instances are distributed on the cluster, and transformations occur on top of them.
        - Interface similar to scala.collection.Iterator[T]
    - KeyedList[K,V]: sharding of key-value objects. Two implementations:
        - Grouped[K,V]: the usual grouping on key K
        - CoGrouped[K,V,W,Result]: a cogroup over two grouped pipes, used for joins.
  21. 26.

    Optimized Joins
    - JoinWithTiny: map-side join. Left-side asymmetric join with a smaller pipe. Uses Cascading HashJoin, a non-blocking asymmetric join where the smaller side fits in memory.
    - BlockJoinWithSmaller: performs a block join by replicating data.
    - SkewJoinWithSmaller|Larger: deals with skewed pipes
    - CrossWithTiny: a cross product with a moderately sized pipe; can create a huge output.
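
    The idea behind a map-side hash join can be sketched locally: the small side is loaded into an in-memory hash map and the large side is streamed past it, so the large side is never shuffled. `joinWithTinySketch` is a hypothetical helper on plain collections, not Scalding's or Cascading's implementation.

    ```scala
    object HashJoinSketch {
      // Inner join of a large stream with a small, in-memory side.
      def joinWithTinySketch[K, V, W](large: Iterator[(K, V)], tiny: Seq[(K, W)]): Iterator[(K, (V, W))] = {
        // Build the lookup table once from the small side.
        val lookup: Map[K, Seq[W]] = tiny.groupBy(_._1).map { case (k, kws) => (k, kws.map(_._2)) }
        // Stream the large side; keys missing from the small side produce nothing.
        large.flatMap { case (k, v) =>
          lookup.getOrElse(k, Seq.empty[W]).iterator.map(w => (k, (v, w)))
        }
      }
    }
    ```

    The asymmetry is visible in the types: the large side is an `Iterator` that is consumed once, while the small side must be fully materialized.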
  22. 27.
  23. 28.
  24. 29.

    MATRIX API
    - Generic Matrix API built using abstract algebra (Monoids, Rings, ...)
    - Value operations: mapValues, filterValues, binarizeAs
    - Vector operations: getRow, reduceRowVectors, …, mapRows, rowL2Normalize, rowMeanCentering, ...
    - Usual matrix operations: transpose, product, …
    - Pipe.toMatrix, pipe.flatMaptoMatrix(fields) mapping function, ...
  25. 30.
  26. 31.

    Scalding is not the only Scala DSL for MR
    - Scrunch: built on top of Crunch, a MR pipelining library in Java developed by Cloudera.
    - Scoobi: built at NICTA. Same idea as Crunch, except fully written in Scala; uses distributed lists (DList) to mimic pipelines.
  27. 32.
  28. 33.

    Scrunch Style
        object WordCount extends PipelineApp {
          // Count the words of one text file.
          def countWords(file: String) = {
            read(from.textFile(file))
              .flatMap(_.split("\\W+").filter(!_.isEmpty()))
              .count
          }
          val counts = join(countWords(args(0)), countWords(args(1)))
          write(counts, to.textFile(args(2)))
        }
  29. 34.

    Spark
    In-Memory Interactive and Real-Time Analytics for Large Datasets
    Sam Bessalah (@samklr)
    Slides adapted from Matei Zaharia, UC Berkeley
  30. 35.

    What is Spark?
    - Fast, expressive cluster computing system compatible with Apache Hadoop
        - Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)
    - Improves efficiency through:
        - In-memory computing primitives
        - General computation graphs
        - Up to 100× faster
    - Improves usability through:
        - Rich APIs in Java, Scala, Python
        - Interactive shell
        - Often 2-10× less code
  31. 36.

    Key Idea
    - Work with distributed collections as if they were local
    - Concept: Resilient Distributed Datasets (RDDs)
        - Immutable collections of objects spread across a cluster
        - Built through parallel transformations (map, filter, etc.)
        - Automatically rebuilt on failure
        - Controllable persistence (like caching in RAM)
  32. 37.

    Example: Log Mining
    Load error messages from a log into memory, then interactively search for various patterns
        val lines = spark.textFile("hdfs://...")           // base RDD
        val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
        val messages = errors.map(_.split('\t')(2))
        val cachedMsgs = messages.cache()
        cachedMsgs.filter(_.contains("foo")).count         // action
        cachedMsgs.filter(_.contains("bar")).count
        . . .
    (Figure: the driver sends tasks to three workers and collects results; each worker reads one block (Block 1 to 3) and keeps one cache partition (Cache 1 to 3).)
    Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
    Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
  33. 38.

    Fault Tolerance
    RDDs track lineage information that can be used to efficiently reconstruct lost partitions
    Ex:
        messages = textFile(...).filter(_.startsWith("ERROR"))
                                .map(_.split('\t')(2))
    (Figure: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD)
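
    Lineage can be sketched as a dataset that stores no data of its own, only a parent and the function that derives it. The types below (`Dataset`, `Source`, `Filtered`, `Mapped`) and the sample log lines are hypothetical and much simpler than Spark's RDD classes; the point is that any partition can be recomputed on demand from the source blocks.

    ```scala
    object LineageSketch {
      // A dataset knows how to (re)compute any one of its partitions.
      sealed trait Dataset[T] { def compute(partition: Int): Seq[T] }

      // The fault-tolerant input: one block per partition.
      final case class Source[T](blocks: Vector[Seq[T]]) extends Dataset[T] {
        def compute(partition: Int): Seq[T] = blocks(partition)
      }
      final case class Filtered[T](parent: Dataset[T], p: T => Boolean) extends Dataset[T] {
        def compute(partition: Int): Seq[T] = parent.compute(partition).filter(p)
      }
      final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
        def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
      }

      // Hypothetical log data in two blocks, mirroring the pipeline on the slide.
      val source: Dataset[String] =
        Source(Vector(Seq("ERROR\tdisk\tfull", "INFO\tboot\tok"), Seq("ERROR\tnet\tdown")))
      val messages: Dataset[String] =
        Mapped(Filtered(source, (s: String) => s.startsWith("ERROR")), (s: String) => s.split('\t')(2))
    }
    ```

    If the second partition of `messages` is lost, calling `messages.compute(1)` rebuilds just that partition by replaying filter and map over the second source block.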
  34. 39.

    Spark in Java and Scala
    Java API:
        JavaRDD<String> lines = spark.textFile(…);
        JavaRDD<String> errors = lines.filter(
          new Function<String, Boolean>() {
            public Boolean call(String s) {
              return s.contains("ERROR");
            }
          });
        errors.count();
    Scala API:
        val lines = spark.textFile(…)
        val errors = lines.filter(s => s.contains("ERROR"))
        // can also write filter(_.contains("ERROR"))
        errors.count
  35. 40.

    Which Language Should I Use?
    - Standalone programs can be written in any of them, but the console is only Python & Scala
    - Python developers: can stay with Python for both
    - Java developers: consider using Scala for the console (to learn the API)
    - Performance: Java / Scala will be faster (statically typed), but Python can do well for numerical work with NumPy
  36. 41.

    Scala Cheat Sheet
    Variables:
        var x: Int = 7
        var x = 7       // type inferred
        val y = "hi"    // read-only
    Functions:
        def square(x: Int): Int = x*x
        def square(x: Int): Int = {
          x*x   // last line returned
        }
    Collections and closures:
        val nums = Array(1, 2, 3)
        nums.map((x: Int) => x + 2)    // => Array(3, 4, 5)
        nums.map(x => x + 2)           // => same
        nums.map(_ + 2)                // => same
        nums.reduce((x, y) => x + y)   // => 6
        nums.reduce(_ + _)             // => 6
  37. 42.

    Learning Spark
    - Easiest way: Spark interpreter (spark-shell or pyspark)
        - Special Scala and Python consoles for cluster use
    - Runs in local mode on 1 thread by default, but this can be controlled with the MASTER environment variable:
        MASTER=local ./spark-shell               # local, 1 thread
        MASTER=local[2] ./spark-shell            # local, 2 threads
        MASTER=spark://host:port ./spark-shell   # Spark standalone cluster
  38. 43.

    First Stop: SparkContext
    - Main entry point to Spark functionality
    - Created for you in Spark shells as variable sc
    - In standalone programs, you’d make your own (see later for details)
  39. 44.

    Creating RDDs
        # Turn a local collection into an RDD
        sc.parallelize([1, 2, 3])
        # Load text file from local FS, HDFS, or S3
        sc.textFile("file.txt")
        sc.textFile("directory/*.txt")
        sc.textFile("hdfs://namenode:9000/path/file")
        # Use any existing Hadoop InputFormat
        sc.hadoopFile(keyClass, valClass, inputFmt, conf)
  40. 45.

    Basic Transformations
        nums = sc.parallelize([1, 2, 3])
        # Pass each element through a function
        squares = nums.map(lambda x: x*x)   # => {1, 4, 9}
        # Keep elements passing a predicate
        even = squares.filter(lambda x: x % 2 == 0)   # => {4}
        # Map each element to zero or more others
        nums.flatMap(lambda x: range(0, x))   # => {0, 0, 1, 0, 1, 2}
        # range(0, x) is a Range object (sequence of numbers 0, 1, …, x-1)
  41. 46.

    Basic Actions
        nums = sc.parallelize([1, 2, 3])
        # Retrieve RDD contents as a local collection
        nums.collect()   # => [1, 2, 3]
        # Return first K elements
        nums.take(2)   # => [1, 2]
        # Count number of elements
        nums.count()   # => 3
        # Merge elements with an associative function
        nums.reduce(lambda x, y: x + y)   # => 6
        # Write elements to a text file
        nums.saveAsTextFile("hdfs://file.txt")
  42. 47.

    Working with Key-Value Pairs
    Spark’s “distributed reduce” transformations act on RDDs of key-value pairs
    Python:
        pair = (a, b)
        pair[0]   # => a
        pair[1]   # => b
    Scala:
        val pair = (a, b)
        pair._1   // => a
        pair._2   // => b
    Java:
        Tuple2 pair = new Tuple2(a, b);   // class scala.Tuple2
        pair._1   // => a
        pair._2   // => b
  43. 48.

    Some Key-Value Operations
        pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
        pets.reduceByKey(lambda x, y: x + y)   # => {(cat, 3), (dog, 1)}
        pets.groupByKey()   # => {(cat, Seq(1, 2)), (dog, Seq(1))}
        pets.sortByKey()    # => {(cat, 1), (cat, 2), (dog, 1)}
    reduceByKey also automatically implements combiners on the map side
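
    What a map-side combiner buys can be sketched on local collections. Here a `Seq` of `Seq`s stands in for the partitions of a pair RDD; `combineLocally` and `reduceByKey` are hypothetical helpers, not Spark's implementation. Each partition is first reduced on its own, so only one value per key and partition would need to cross the network.

    ```scala
    object ReduceByKeySketch {
      // Combine values per key inside one partition (the map-side combiner).
      def combineLocally[K, V](partition: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
        partition.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }

      // Merge the per-partition results (the work left for the reduce side).
      def reduceByKey[K, V](partitions: Seq[Seq[(K, V)]])(f: (V, V) => V): Map[K, V] =
        combineLocally(partitions.flatMap(p => combineLocally(p)(f).toSeq))(f)
    }
    ```

    This only gives the right answer when `f` is associative, which is the same requirement the slides state for `reduce`.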
  44. 49.

    Multiple Datasets
        visits = sc.parallelize([("index.html", "1.2.3.4"),
                                 ("about.html", "3.4.5.6"),
                                 ("index.html", "1.3.3.1")])
        pageNames = sc.parallelize([("index.html", "Home"), ("about.html", "About")])
        visits.join(pageNames)
        # ("index.html", ("1.2.3.4", "Home"))
        # ("index.html", ("1.3.3.1", "Home"))
        # ("about.html", ("3.4.5.6", "About"))
        visits.cogroup(pageNames)
        # ("index.html", (Seq("1.2.3.4", "1.3.3.1"), Seq("Home")))
        # ("about.html", (Seq("3.4.5.6"), Seq("About")))
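
    The relationship between cogroup and join can be sketched on local Scala collections: cogroup gathers the values of both sides under each key, and an inner join is the per-key cross product of those two groups. `cogroup` and `join` below are hypothetical local helpers, not the RDD methods.

    ```scala
    object CogroupSketch {
      // For every key present on either side, collect the values of both sides.
      def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
        val l = left.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
        val r = right.groupBy(_._1).map { case (k, kws) => (k, kws.map(_._2)) }
        (l.keySet ++ r.keySet)
          .map(k => (k, (l.getOrElse(k, Seq.empty[V]), r.getOrElse(k, Seq.empty[W]))))
          .toMap
      }

      // Inner join: pair every left value with every right value under the same key.
      def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
        cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
          for (v <- vs; w <- ws) yield (k, (v, w))
        }
    }
    ```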
  45. 50.

    Closure Mishap Example
        class MyCoolRddApp {
          val param = 3.14
          val log = new Log(...)
          ...
          def work(rdd: RDD[Int]) {
            rdd.map(x => x + param)
               .reduce(...)
          }
        }
    NotSerializableException: MyCoolRddApp (or Log)
    How to get around it:
        class MyCoolRddApp {
          ...
          def work(rdd: RDD[Int]) {
            val param_ = param
            rdd.map(x => x + param_)
               .reduce(...)
          }
        }
    References only local variable instead of this.param
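
    The mishap can be reproduced without a cluster by pushing the two closures through plain Java serialization, the mechanism that raises NotSerializableException. This is a sketch under the assumption of a recent Scala version (2.12 or later), where function literals are serializable as long as everything they capture is; `MyApp` and `isSerializable` are hypothetical names.

    ```scala
    import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

    object ClosureSketch {
      // Deliberately not Serializable, like MyCoolRddApp on the slide.
      class MyApp {
        val param = 3.14

        // Reads this.param, so the function holds a reference to the whole MyApp.
        def capturesThis: Int => Double = x => x + param

        // Copies the field into a local first, so the function holds only a Double.
        def capturesLocal: Int => Double = {
          val param_ = param
          x => x + param_
        }
      }

      // Returns false when Java serialization rejects the object graph.
      def isSerializable(obj: AnyRef): Boolean =
        try {
          new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
          true
        } catch {
          case _: NotSerializableException => false
        }
    }
    ```

    Under that assumption, `isSerializable(new MyApp().capturesThis)` is expected to be false and `isSerializable(new MyApp().capturesLocal)` true, matching the slide's workaround.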
  46. 52.

    Components
        sc = new SparkContext
        f = sc.textFile("…")
        f.filter(…).count()
        ...
    (Figure: your program runs in the Spark client (app master), which holds the RDD graph, scheduler, block tracker and shuffle tracker; a cluster manager connects it to the Spark workers, each with task threads and a block manager; storage is HDFS, HBase, …)
  47. 53.

    Example Job
        val sc = new SparkContext("spark://...", "MyJob", home, jars)
        val file = sc.textFile("hdfs://...")             // resilient distributed dataset (RDD)
        val errors = file.filter(_.contains("ERROR"))    // resilient distributed dataset (RDD)
        errors.cache()
        errors.count()                                   // action
  48. 54.

    RDD Graph
    Dataset-level view:
        file:   HadoopRDD    path = hdfs://...
        errors: FilteredRDD  func = _.contains(…), shouldCache = true
    Partition-level view:
        Task 1, Task 2, ...
  49. 55.

    Data Locality
    - First run: data not in cache, so use HadoopRDD’s locality prefs (from HDFS)
    - Second run: FilteredRDD is in cache, so use its locations
    - If something falls out of cache, go back to HDFS
  50. 56.

    Broadcast Variables
    When one creates a broadcast variable b with a value v, v is saved to a file in a shared file system. The serialized form of b is a path to this file. When b’s value is queried on a worker node, Spark first checks whether v is in a local cache, and reads it from the file system if it isn’t.
  51. 57.

    Accumulators
    Each accumulator is given a unique ID when it is created. When the accumulator is saved, its serialized form contains its ID and the “zero” value for its type. On the workers, a separate copy of the accumulator is created for each thread that runs a task using thread-local variables, and is reset to zero when a task begins. After each task runs, the worker sends a message to the driver program containing the updates it made to various accumulators.
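
    The protocol described above can be sketched in a single process: every task starts from the zero value, counts locally, and reports an (ID, update) pair that the driver merges. `Accumulator`, `runTask` and `countErrors` are hypothetical names for this sketch and do not mirror Spark's classes.

    ```scala
    object AccumulatorSketch {
      // Driver-side accumulator: a unique ID, a zero value and a running total.
      final class Accumulator(val id: Int, val zero: Long) {
        private var total: Long = zero
        def merge(update: Long): Unit = total += update
        def value: Long = total
      }

      // A task works on its own copy, reset to zero, and returns its update
      // as the "message" for the driver.
      def runTask(acc: Accumulator, partition: Seq[String]): (Int, Long) = {
        var local = acc.zero
        partition.foreach(line => if (line.startsWith("ERROR")) local += 1)
        (acc.id, local)
      }

      // Driver: run one task per partition, then merge the reported updates.
      def countErrors(partitions: Seq[Seq[String]]): Long = {
        val acc = new Accumulator(id = 0, zero = 0L)
        partitions.map(p => runTask(acc, p)).foreach { case (_, update) => acc.merge(update) }
        acc.value
      }
    }
    ```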
  52. 58.

    Scheduling Process
    - RDD Objects: build the operator DAG, e.g. rdd1.join(rdd2).groupBy(…).filter(…)
    - DAGScheduler (receives the DAG): splits the graph into stages of tasks and submits each stage as ready. Agnostic to operators!
    - TaskScheduler (receives TaskSets): launches tasks via the cluster manager and retries failed or straggling tasks; reports back when a stage failed. Doesn’t know about stages.
    - Worker (receives Tasks): threads execute tasks; the block manager stores and serves blocks
  53. 59.

    Example: HadoopRDD
        partitions = one per HDFS block
        dependencies = none
        compute(partition) = read corresponding block
        preferredLocations(part) = HDFS block location
        partitioner = none
  54. 60.

    Example: FilteredRDD
        partitions = same as parent RDD
        dependencies = “one-to-one” on parent
        compute(partition) = compute parent and filter it
        preferredLocations(part) = none (ask parent)
        partitioner = none
  55. 61.

    Example: JoinedRDD
        partitions = one per reduce task
        dependencies = “shuffle” on each parent
        compute(partition) = read and join shuffled data
        preferredLocations(part) = none
        partitioner = HashPartitioner(numTasks)
    Spark will now know this data is hashed!
  56. 62.

    Dependency Types
    “Narrow” deps:
    - map, filter
    - union
    - join with inputs co-partitioned
    “Wide” (shuffle) deps:
    - groupByKey
    - join with inputs not co-partitioned
  57. 63.

    DAG Scheduler
    - Interface: receives a “target” RDD, a function to run on each partition, and a listener for results
    - Roles:
        - Build stages of Task objects (code + preferred loc.)
        - Submit them to TaskScheduler as ready
        - Resubmit failed stages if outputs are lost
  58. 64.

    Scheduler Optimizations
    - Pipelines narrow ops. within a stage
    - Picks join algorithms based on partitioning (minimize shuffles)
    - Reuses previously cached data
    (Figure: RDDs A to G connected by groupBy, map, union and join, split into Stage 1, Stage 2 and Stage 3; the legend marks previously computed partitions and tasks.)
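
    How narrow operations are pipelined and wide ones start a new stage can be sketched for the simplest case, a linear chain of operators. The real scheduler works on a graph of RDDs; `Op`, `Narrow`, `Wide` and `stages` are hypothetical names for this simplified version.

    ```scala
    object StageSketch {
      sealed trait Dep
      case object Narrow extends Dep
      case object Wide extends Dep
      final case class Op(name: String, dep: Dep)

      // Cut the chain at every wide (shuffle) dependency; narrow ops join the current stage.
      def stages(ops: List[Op]): List[List[String]] =
        ops.foldLeft(List.empty[List[String]]) {
          case (Nil, op)               => List(List(op.name))
          case (acc, Op(name, Wide))   => acc :+ List(name)
          case (acc, Op(name, Narrow)) => acc.init :+ (acc.last :+ name)
        }
    }
    ```

    For a chain such as textFile, map, groupByKey, filter, join (with the last join treated as wide), this yields three stages, with map pipelined behind textFile and filter behind groupByKey.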
  59. 69.
  60. 71.

    K-Means: preliminaries
    - K = number of clusters
    - Data assignments to clusters: S_1, S_2, ..., S_K
    (Figure: scatter plot of the data; axes: Feature 1, Feature 2.)
  62. 73.

    K-Means Algorithm
    - Initialize K cluster centers
    - Repeat until convergence:
        - Assign each data point to the cluster with the closest center.
        - Assign each cluster center to be the mean of its cluster’s data points.
    (Figure: scatter plot; axes: Feature 1, Feature 2.)
  64. 75.

    K-Means Algorithm
    - Initialize K cluster centers:
        centers = data.takeSample(false, K, seed)
    - Repeat until convergence:
        - Assign each data point to the cluster with the closest center.
        - Assign each cluster center to be the mean of its cluster’s data points.
  67. 78.

    K-Means Algorithm
    - Initialize K cluster centers:
        centers = data.takeSample(false, K, seed)
    - Repeat until convergence:
        closest = data.map(p => (closestPoint(p, centers), p))
        - Assign each cluster center to be the mean of its cluster’s data points.
  70. 81.

    K-Means Algorithm
    - Initialize K cluster centers:
        centers = data.takeSample(false, K, seed)
    - Repeat until convergence:
        closest = data.map(p => (closestPoint(p, centers), p))
        pointsGroup = closest.groupByKey()
  71. 82.

    K-Means Algorithm
    - Initialize K cluster centers:
        centers = data.takeSample(false, K, seed)
    - Repeat until convergence:
        closest = data.map(p => (closestPoint(p, centers), p))
        pointsGroup = closest.groupByKey()
        newCenters = pointsGroup.mapValues(ps => average(ps))
  74. 85.

    K-Means Algorithm
    - Initialize K cluster centers:
        centers = data.takeSample(false, K, seed)
    - Repeat until convergence:
        closest = data.map(p => (closestPoint(p, centers), p))
        pointsGroup = closest.groupByKey()
        newCenters = pointsGroup.mapValues(ps => average(ps))
        while (dist(centers, newCenters) > ɛ)
  76. 87.

    K-Means Source
        centers = data.takeSample(false, K, seed)
        while (d > ɛ) {
          closest = data.map(p => (closestPoint(p, centers), p))
          pointsGroup = closest.groupByKey()
          newCenters = pointsGroup.mapValues(ps => average(ps))
          d = distance(centers, newCenters)
          centers = newCenters.map(_._2)   // keep only the new center of each cluster
        }
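
    The same loop can be run on local Scala collections to see it converge. This is a sketch, not the Spark program from the tutorial: `Point` is a plain `Vector[Double]`, the initial centers are passed in instead of sampled, `closestPoint` returns the index of the nearest center, and the choice to leave an empty cluster's center where it was is an assumption of this example.

    ```scala
    object KMeansSketch {
      type Point = Vector[Double]

      def squaredDist(a: Point, b: Point): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      // Index of the center closest to p.
      def closestPoint(p: Point, centers: Seq[Point]): Int =
        centers.indices.minBy(i => squaredDist(p, centers(i)))

      // Component-wise mean of a non-empty group of points.
      def average(ps: Seq[Point]): Point =
        ps.transpose.map(col => col.sum / ps.size).toVector

      def kmeans(data: Seq[Point], initial: Seq[Point], epsilon: Double): Seq[Point] = {
        var centers = initial
        var d = Double.MaxValue
        while (d > epsilon) {
          val closest = data.map(p => (closestPoint(p, centers), p))
          val pointsGroup = closest.groupBy(_._1).map { case (i, ps) => (i, ps.map(_._2)) }
          // A center that attracted no points keeps its old position.
          val newCenters = centers.indices.map(i => pointsGroup.get(i).map(average).getOrElse(centers(i)))
          d = centers.zip(newCenters).map { case (a, b) => squaredDist(a, b) }.sum
          centers = newCenters
        }
        centers
      }
    }
    ```

    In the distributed version only `data` is an RDD; the centers are small enough to live in the driver and be shipped to the workers on every iteration.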
  77. 88.

    Ease of use
    - Interactive shell: useful for featurization and pre-processing data
    - Lines of code for K-Means:
        - Spark: ~90 lines (part of the hands-on tutorial!)
        - Hadoop/Mahout: 4 files, > 300 lines
  78. 90.

    Why PageRank?
    - Good example of a more complex algorithm
        - Multiple stages of map & reduce
    - Benefits from Spark’s in-memory caching
        - Multiple iterations over the same data
  79. 91.

    Basic Idea
    Give pages ranks (scores) based on links to them
    - Links from many pages → high rank
    - Link from a high-rank page → high rank
    Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
  80. 92.

    Algorithm
    1. Start each page at a rank of 1
    2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
    3. Set each page’s rank to 0.15 + 0.85 × contribs
    (Figure: four linked pages, each with rank 1.0.)
  81. 93.

    Algorithm (same three steps)
    (Figure: every page has rank 1.0; contributions sent along the links: 1, 0.5, 0.5, 0.5, 1, 0.5.)
  82. 94.

    Algorithm (same three steps)
    (Figure: page ranks are now 0.58, 1.0, 1.85, 0.58.)
  83. 95.

    Algorithm (same three steps)
    (Figure: page ranks 0.58, 1.0, 1.85, 0.58; contributions sent along the links: 0.58, 0.29, 0.29, 0.5, 1.85, 0.5.)
  84. 96.

    Algorithm (same three steps)
    (Figure: page ranks are now 0.39, 1.72, 1.31, 0.58, . . .)
  85. 97.

    Algorithm (same three steps)
    Final state: page ranks 0.46, 1.37, 1.44, 0.73
  86. 98.

    Scala Implementation
        val links = // RDD of (url, neighbors) pairs
        var ranks = // RDD of (url, rank) pairs
        for (i <- 1 to ITERATIONS) {
          val contribs = links.join(ranks).flatMap {
            case (url, (links, rank)) =>
              links.map(dest => (dest, rank/links.size))
          }
          ranks = contribs.reduceByKey(_ + _)
                          .mapValues(0.15 + 0.85 * _)
        }
        ranks.saveAsTextFile(...)
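
    The same iteration can be tried on local Scala collections. In this sketch a `Map` stands in for each pair RDD, and `pageRank` is a hypothetical helper; like the inner join on the slide, a page that receives no contributions drops out of `ranks` and stops contributing.

    ```scala
    object PageRankSketch {
      def pageRank(links: Map[String, Seq[String]], iterations: Int): Map[String, Double] = {
        // Start each page at a rank of 1.
        var ranks: Map[String, Double] = links.keys.map(url => (url, 1.0)).toMap
        for (_ <- 1 to iterations) {
          // links.join(ranks).flatMap: each page splits its rank among its neighbors.
          val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (url, neighbors) =>
            ranks.get(url).toSeq.flatMap(rank => neighbors.map(dest => (dest, rank / neighbors.size)))
          }
          // reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
          ranks = contribs.groupBy(_._1).map { case (url, cs) => (url, 0.15 + 0.85 * cs.map(_._2).sum) }
        }
        ranks
      }
    }
    ```

    `links` is read on every iteration and never changes, which is why the slides point to in-memory caching as the main benefit for this algorithm.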
  87. 101.

    What is Spark Streaming?
    - Framework for large-scale stream processing
        - Scales to 100s of nodes
        - Can achieve second-scale latencies
        - Integrates with Spark’s batch and interactive processing
        - Provides a simple batch-like API for implementing complex algorithms
        - Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
  88. 104.

    Stateful Stream Processing
    - Traditional streaming systems have an event-driven, record-at-a-time processing model
        - Each node has mutable state
        - For each record, update state & send new records
    - State is lost if node dies!
    - Making stateful stream processing fault-tolerant is challenging
    (Figure: input records flow into node 1 and node 2 and on to node 3; each node holds mutable state.)
  89. 105.

    Existing Streaming Systems
    - Storm
        - Replays record if not processed by a node
        - Processes each record at least once
        - May update mutable state twice!
        - Mutable state can be lost due to failure!
    - Trident: uses transactions to update state
        - Processes each record exactly once
        - Per-state transaction updates are slow
  90. 106.

    Requirements
    - Scalable to large clusters
    - Second-scale latencies
    - Simple programming model
    - Integrated with batch & interactive processing
    - Efficient fault-tolerance in stateful computations
  91. 107.

    Discretized Stream Processing
    Run a streaming computation as a series of very small, deterministic batch jobs
    - Chop up the live stream into batches of X seconds
    - Spark treats each batch of data as RDDs and processes them using RDD operations
    - Finally, the processed results of the RDD operations are returned in batches
    (Figure: live data stream → Spark Streaming → batches of X seconds → Spark → processed results)
  92. 108.

    Discretized Stream Processing
    Run a streaming computation as a series of very small, deterministic batch jobs
    - Batch sizes as low as ½ second, latency ~ 1 second
    - Potential for combining batch processing and streaming processing in the same system
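
    The discretization step can be sketched on a finite list of timestamped records. `Record`, `discretize` and `process` are hypothetical names for this example; a real stream is unbounded, and this sketch produces no batch at all for an interval without records.

    ```scala
    object MicroBatchSketch {
      final case class Record(timestampMs: Long, value: String)

      // Chop the stream into batches of batchMs milliseconds, ordered by time.
      def discretize(stream: Seq[Record], batchMs: Long): Seq[Seq[Record]] =
        stream.groupBy(_.timestampMs / batchMs).toSeq.sortBy(_._1).map(_._2)

      // Run the same deterministic batch job on every batch.
      def process[R](stream: Seq[Record], batchMs: Long)(job: Seq[String] => R): Seq[R] =
        discretize(stream, batchMs).map(batch => job(batch.map(_.value)))
    }
    ```

    Because each batch is processed by an ordinary, deterministic function, a lost result can be recomputed by running the job on that batch again.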
  93. 109.

    Example – Get hashtags from Twitter
        val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    DStream: a sequence of RDDs representing a stream of data
    (Figure: the Twitter Streaming API feeds the tweets DStream, drawn as batch @ t, batch @ t+1, batch @ t+2; each batch is stored in memory as an RDD (immutable, distributed).)
  94. 110.

    Example – Get hashtags from Twitter
        val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
        val hashTags = tweets.flatMap(status => getTags(status))
    transformation: modify data in one DStream to create another DStream (hashTags is the new DStream)
    (Figure: flatMap is applied to batch @ t, batch @ t+1 and batch @ t+2 of the tweets DStream to produce the hashTags DStream [#cat, #dog, …]; new RDDs are created for every batch.)
  95. 111.

    Example – Get hashtags from Twitter
        val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
        val hashTags = tweets.flatMap(status => getTags(status))
        hashTags.saveAsHadoopFiles("hdfs://...")
    output operation: to push data to external storage
    (Figure: for batch @ t, batch @ t+1 and batch @ t+2, the tweets DStream goes through flatMap to the hashTags DStream and then save; every batch is saved to HDFS.)
  96. 112.

    Java Example
    Scala
        val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
        val hashTags = tweets.flatMap(status => getTags(status))
        hashTags.saveAsHadoopFiles("hdfs://...")
    Java
        JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
        JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })   // Function object to define the transformation
        hashTags.saveAsHadoopFiles("hdfs://...")
  97. 113.

    Fault-tolerance
    - RDDs remember the sequence of operations that created them from the original fault-tolerant input data
    - Batches of input data are replicated in memory of multiple worker nodes, and are therefore fault-tolerant
    - Data lost due to worker failure can be recomputed from input data
    (Figure: tweets RDD (input data replicated in memory) → flatMap → hashTags RDD (lost partitions recomputed on other workers))
  98. 114.

    Key concepts
    - DStream: sequence of RDDs representing a stream of data
        - Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
    - Transformations: modify data from one DStream to another
        - Standard RDD operations: map, countByValue, reduce, join, …
        - Stateful operations: window, countByValueAndWindow, …
    - Output operations: send data to an external entity
        - saveAsHadoopFiles: saves to HDFS
        - foreach: do anything with each batch of results
  99. 115.

    Example 2 – Count the hashtags
        val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
        val hashTags = tweets.flatMap(status => getTags(status))
        val tagCounts = hashTags.countByValue()
    (Figure: for batch @ t, batch @ t+1 and batch @ t+2: tweets → flatMap → hashTags → map → reduceByKey → tagCounts [(#cat, 10), (#dog, 25), ... ])
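
    The per-batch pipeline can be sketched on local collections, with a `Seq` of batches standing in for a DStream. The slides do not show `getTags`, so the version below (every whitespace-separated token starting with `#`) is an assumption, and tweets are plain strings rather than Status objects.

    ```scala
    object HashTagCountSketch {
      // Hypothetical getTags: every whitespace-separated token that starts with '#'.
      def getTags(status: String): Seq[String] =
        status.split("\\s+").filter(_.startsWith("#")).toSeq

      // flatMap followed by countByValue, applied to one batch.
      def tagCounts(batch: Seq[String]): Map[String, Long] =
        batch.flatMap(getTags).groupBy(identity).map { case (tag, ts) => (tag, ts.size.toLong) }

      // DStream stand-in: one result per batch.
      def tagCountsPerBatch(batches: Seq[Seq[String]]): Seq[Map[String, Long]] =
        batches.map(tagCounts)
    }
    ```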
  100. 116.

    Fault-tolerant Stateful Processing
    All intermediate data are RDDs, hence can be recomputed if lost
    (Figure: hashTags and tagCounts RDDs at times t-1, t, t+1, t+2, t+3.)
  101. 117.

    Fault-tolerant Stateful Processing
    - State data not lost even if a worker node dies
        - Does not change the value of your result
    - Exactly-once semantics for all transformations
        - No double counting!
  102. 118.

    Other Interesting Operations
    - Maintaining arbitrary state, tracking sessions
        - Maintain per-user mood as state, and update it with his/her tweets
            tweets.updateStateByKey(tweet => updateMood(tweet))
    - Do arbitrary Spark RDD computation within a DStream
        - Join incoming tweets with a spam file to filter out bad tweets
            tweets.transform(tweetsRDD => {
              tweetsRDD.join(spamHDFSFile).filter(...)
            })
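
    Keyed state carried across batches can be sketched as a fold over the batches. The local `updateStateByKey` below is a hypothetical helper on plain collections; its update function receives the new values for a key and the previous state, and may return None to drop the key. A running count stands in for the slide's per-user mood.

    ```scala
    object StateSketch {
      // Fold one batch into the state, key by key.
      def updateStateByKey[K, V, S](state: Map[K, S], batch: Seq[(K, V)])(
          update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
        val grouped = batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
        // Visit keys seen before as well as keys new in this batch.
        (state.keySet ++ grouped.keySet).toSeq.flatMap { k =>
          update(grouped.getOrElse(k, Seq.empty[V]), state.get(k)).map(s => (k, s))
        }.toMap
      }

      // Running count per key across all batches.
      def runningCounts(batches: Seq[Seq[(String, Int)]]): Map[String, Int] =
        batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
          updateStateByKey(state, batch)((vs: Seq[Int], old: Option[Int]) => Some(old.getOrElse(0) + vs.sum))
        }
    }
    ```

    The state after each batch is an immutable value derived from the previous state and the batch, so it can be recomputed in the same way as any other intermediate result.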
  103. 119.

    Performance
    Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency
    - Tested with 100 streams of data on 100 EC2 instances with 4 cores each
  104. 121.
  105. 122.

    Bibliography
    Slides for Scalding shamelessly inspired by:
    - Mario Pastorelli: http://fr.slideshare.net/melrief/scalding-programming-model-for-hadoop
    - Dean Wampler (@deanwampler), Scalding workshop code: https://github.com/ThinkBigAnalytics/scalding-workshop
      Slides: http://polyglotprogramming.com/papers/ScaldingForHadoop.pdf
    Spark: http://spark-project.org/documentation/