Big Data using Scala

Paris Scala Meetup, May 29th 2013

Sam Bessalah

Transcript

  1. SCALA + BIG DATA
    PARIS SCALA MEETUP, 05/29/2013
    Sam BESSALAH

  2. Outline

    Scala in the Hadoop World
    Hadoop and Map Reduce Basics
    Scalding
    A word about other Scala DSLs: Scrunch and Scoobi

    Spark and Co.
    Spark
    Spark Streaming

    More projects using Scala for Data Analysis

  3. SCALA and HADOOP

  4. The new darling of data crunchers at scale

  6. Hadoop

    Redundant, fault-tolerant data storage

    Parallel computation framework

    Job coordination

  7. MapReduce

    A programming model for expressing distributed computations
    at massive scale

    An execution framework for organizing and performing those
    computations in an efficient and fault-tolerant way

    Bundled within the Hadoop framework

  8. MapReduce redux ..

    At a high level, implements two functions
    Map(k1, v1) → List(k2, v2)
    Reduce(k2, List(v2)) → List(k3, v3)

    The framework takes care of all the plumbing and the
    distribution, sorting, shuffling ...

    Values with the same key flow to the same reducer
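
    To make the model concrete, here is a minimal sketch (not from the deck) of
    word count phrased as these two functions:

    // Word count in the MapReduce model (illustrative Scala sketch)
    // map: (docId, text) -> List((word, 1))
    def wcMap(k1: String, v1: String): List[(String, Int)] =
      v1.split("\\W+").filter(_.nonEmpty).map(word => (word, 1)).toList

    // reduce: (word, counts) -> List((word, total))
    def wcReduce(k2: String, v2: List[Int]): List[(String, Int)] =
      List((k2, v2.sum))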


  12. The plain MapReduce code is way too long for a simple word count

    This gave birth to new tools like Hive and Pig

    Pig: a scripting language for dataflows
    text = LOAD 'text' USING TextLoader();
    tokens = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) as word;
    wordcount = FOREACH (GROUP tokens BY word) GENERATE
                  group as word,
                  COUNT_STAR($1) as ct;

  13. Cascading

    Open source, created by Chris Wensel, now developed at
    @Concurrent.

    Written in Java; revolves around the concept of Pipes, or data
    flows, that are eventually transformed into MapReduce jobs

  14. Cascading changes the MR programming model into a generic,
    data-flow-oriented programming model

    A Flow is composed of a Source, a Sink and a Pipe to connect
    them

    A Pipe is a set of transformations over the input data

    Pipes can be combined to create more complex workflows

    Contains a flow optimizer that converts a user data flow into an
    optimized data flow, which is in turn converted into an efficient
    MapReduce job

    We can think of pipes as distributed collections, as in the toy sketch below
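
    A toy, in-memory analogue of the Source → Pipe → Sink idea (plain Scala,
    not the Cascading API), just to make the composition concrete:

    object FlowSketch {
      type Source[A]  = () => Iterator[A]              // where data comes from
      type Pipe[A, B] = Iterator[A] => Iterator[B]     // a set of transformations
      type Sink[B]    = Iterator[B] => Unit            // where results go

      // A Flow connects a Source to a Sink through a Pipe
      def flow[A, B](src: Source[A], pipe: Pipe[A, B], sink: Sink[B]): Unit =
        sink(pipe(src()))

      def main(args: Array[String]): Unit = {
        val source: Source[String] = () => Iterator("a b", "b c")
        val tokenize: Pipe[String, String] = _.flatMap(_.split("\\s+"))
        val count: Pipe[String, (String, Int)] =
          words => words.toSeq.groupBy(identity).map { case (w, ws) => (w, ws.size) }.iterator
        val sink: Sink[(String, Int)] = _.foreach(println)
        flow(source, tokenize andThen count, sink)     // pipes compose like functions
      }
    }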

  15. Word Count redux ..

  17. But ...
    - Cascading makes use of FP idioms
    - Functions are wrapped in objects
    - Constructors (new) define composition
    between pipes
    - The MapReduce paradigm itself derives from FP
    Why not use functional programming?

  18. SCALDING
    - A Scala DSL on top of Cascading
    - Open Source project developed at Twitter
    By Avi Bryant (@avibryant)
    Oscar Boykin (@posco)
    Argyris Zymnis (@argyris)
    - https://github.com/twitter/scalding

  19. Scalding
    - Two APIs:
    * Fields API: the primary API, using Cascading
    Fields; dynamic, with errors at runtime
    * Type-safe API: uses Scala types, errors at
    compile time. We'll focus on this one
    - Both can be bridged using pipe.Typed and
    TypedPipe.from

  21. Scalding word count
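
    The word count on this slide is an image in the original deck; with the
    typed API it looks roughly like this (assuming Scalding on the classpath):

    import com.twitter.scalding._

    class WordCountJob(args: Args) extends Job(args) {
      TypedPipe.from(TextLine(args("input")))
        .flatMap(_.split("\\s+").filter(_.nonEmpty))
        .groupBy(identity)
        .size
        .write(TypedTsv[(String, Long)](args("output")))
    }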

  22. In reality:
    val countedWords = groupedWords.size
    // is shorthand for
    val countedWords = groupedWords.mapValues(x => 1L).sum
    // which, given an implicit mon: Monoid[Long], expands to
    val countedWords = groupedWords.mapValues(x => 1L)
                                   .reduce((l, r) => mon.plus(l, r))

  23. Fields Based API
    # pipe.flatMap(existingFields -> additionalFields){function}
    # pipe.map(existingFields -> additionalFields){function}
    # pipe.project(fields)
    # pipe.discard(fields)
    # pipe.mapTo(existingFields -> additionalFields){function}
    # pipe.groupBy(fields){ group => ... }
    # group.reduce(field){function}
    # group.foldLeft(field){function}

    https://github.com/twitter/scalding/wiki/Fields-based-API-Reference
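
    A sketch of how these primitives compose (essentially the classic word
    count from the Scalding README):

    import com.twitter.scalding._

    class FieldsWordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }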

  24. Grouping and Mapping
    GroupBuilder:
    Builder-pattern object that operates over groups of rows in a
    pipe.
    Helps build several parallel aggregations (counting,
    summing, ...) in one pass. Awesome for stream aggregation.
    Used for groupBy; adds fields which are reductions of existing
    ones.
    MapReduceMap: map-side aggregation, derived from Cascading,
    using combiners instead of reducers.
    Gotcha: doesn't work with foldLeft, which is pushed to the reducers
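
    A sketch of two aggregations done in one pass over each group; the 'user
    and 'score fields and the input file are made up for the example:

    import com.twitter.scalding._

    class UserStatsJob(args: Args) extends Job(args) {
      Tsv(args("input"), ('user, 'score))
        .groupBy('user) { group =>
          group.size('events)                     // count rows per user
               .reduce('score -> 'totalScore) {   // and sum scores in the same pass
                 (a: Double, b: Double) => a + b
               }
        }
        .write(Tsv(args("output")))
    }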

  25. Type-Safe API
    Two concepts:
    TypedPipe[T]
    - Wraps a Cascading Pipe object. Instances are distributed on the cluster,
    and transformations happen on top of them.
    - Similar interface to scala.collection.Iterator[T]
    KeyedList[K,V]
    - Sharding of key-value objects. Two implementations:
    Grouped[K,V]: the usual grouping on key K
    CoGrouped[K,V,W,Result]: a co-group over two grouped
    pipes, used for joins.

  26. Optimized Joins
    JoinWithTiny: map-side joins.
    A left-side asymmetric join with a smaller pipe.
    Uses Cascading's HashJoin, a non-blocking asymmetric join where
    the smaller side fits in memory.
    BlockJoinWithSmaller: performs a block join, by replicating data.
    SkewJoinWithSmaller|Larger: deals with skewed pipes.
    CrossWithTiny: does a cross product with a moderately sized
    pipe; can create a huge output.

  29. MATRIX API
    Generic Matrix API built using abstract algebra (Monoids, Rings, ...)
    Value operations: mapValues, filterValues, binarizeAs
    Vector operations: getRow, reduceRowVectors, ...
    mapRows, rowL2Normalize, rowMeanCentering, ...
    Usual matrix operations: transpose, product, ...
    pipe.toMatrix, pipe.flatMapToMatrix(fields) { mapping function }, ...

  31. Scalding is not the only Scala DSL for MR
    - Scrunch
    Built on top of Crunch, an MR pipelining
    library in Java developed at Cloudera.
    - Scoobi, built at NICTA
    Same idea as Crunch, except fully written in
    Scala; uses distributed lists (DList) to mimic
    pipelines.

  33. Scrunch Style
    object WordCount extends PipelineApp {
      def countWords(file: String) = {
        read(from.textFile(file))
          .flatMap(_.split("\\W+").filter(!_.isEmpty))
          .count
      }
      val counts = join(countWords(args(0)), countWords(args(1)))
      write(counts, to.textFile(args(2)))
    }

  34. Spark
    In-Memory Interactive and Real time
    Analytics for Large DataSets
    Sam Bessalah
    @samklr
    Adapted Slides from Matei Zaharia, UC Berkeley

  35. What is Spark?
    Fast, expressive cluster computing system compatible with Apache
    Hadoop
    Works with any Hadoop-supported storage system (HDFS, S3,
    Avro, …)
    Improves efficiency through:
    - In-memory computing primitives
    - General computation graphs
    (up to 100× faster)
    Improves usability through:
    - Rich APIs in Java, Scala, Python
    - Interactive shell
    (often 2-10× less code)

  36. Key Idea
    Work with distributed collections as if they were local
    Concept: Resilient Distributed Datasets (RDDs)
    - Immutable collections of objects spread across a
    cluster
    - Built through parallel transformations (map, filter, etc.)
    - Automatically rebuilt on failure
    - Controllable persistence (like caching in RAM)

  37. Example: Log Mining
    Load error messages from a log into memory,
    then interactively search for various patterns

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(_.startsWith("ERROR"))
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count
    cachedMsgs.filter(_.contains("bar")).count
    . . .

    (Diagram: the driver ships tasks to workers; each worker reads a block of
    the file, caches its partition of messages, and returns results.
    Base RDD → transformed RDD → action.)

    Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
    Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

  38. Fault Tolerance
    RDDs track lineage information that can be used
    to efficiently reconstruct lost partitions
    Ex:
    messages = textFile(...).filter(_.startsWith("ERROR"))
    .map(_.split('\t')(2))

    (Lineage: HDFS File → filter(func = _.contains(...)) → Filtered RDD
    → map(func = _.split(...)) → Mapped RDD)

  39. Spark in Java and Scala
    Java API:
    JavaRDD<String> lines =
    spark.textFile(…);
    JavaRDD<String> errors = lines.filter(
    new Function<String, Boolean>() {
    public Boolean call(String s) {
    return s.contains("ERROR");
    }
    });
    errors.count()
    Scala API:
    val lines = spark.textFile(…)
    val errors = lines.filter(s =>
    s.contains("ERROR"))
    // can also write
    // filter(_.contains("ERROR"))
    errors.count

  40. Which Language Should I Use?
    Standalone programs can be written in any of them, but the
    console is only Python & Scala
    Python developers: can stay with Python for both
    Java developers: consider using Scala for console
    (to learn the API)
    Performance: Java / Scala will be faster (statically
    typed), but Python can do well for numerical work
    with NumPy

  41. Scala Cheat Sheet
    Variables:
    var x: Int = 7
    var x = 7 // type inferred
    val y = "hi" // read-only
    Functions:
    def square(x: Int): Int = x*x
    def square(x: Int): Int = {
    x*x // last line returned
    }
    Collections and closures:
    val nums = Array(1, 2, 3)
    nums.map((x: Int) => x + 2) // => Array(3, 4, 5)
    nums.map(x => x + 2) // => same
    nums.map(_ + 2) // => same
    nums.reduce((x, y) => x + y) // => 6
    nums.reduce(_ + _) // => 6

  42. Learning Spark
    Easiest way: Spark interpreter (spark-shell or
    pyspark)
    Special Scala and Python consoles for cluster use
    Runs in local mode on 1 thread by default, but can
    control with MASTER environment var:
    MASTER=local ./spark-shell # local, 1 thread
    MASTER=local[2] ./spark-shell # local, 2 threads
    MASTER=spark://host:port ./spark-shell # Spark standalone cluster

  43. First Stop: SparkContext
    Main entry point to Spark functionality
    Created for you in Spark shells as the variable sc
    In standalone programs, you'd make your own
    (see later for details)

  44. Creating RDDs
    # Turn a local collection into an RDD
    sc.parallelize([1, 2, 3])
    # Load text file from local FS, HDFS, or S3
    sc.textFile(“file.txt”)
    sc.textFile(“directory/*.txt”)
    sc.textFile(“hdfs://namenode:9000/path/file”)
    # Use any existing Hadoop InputFormat
    sc.hadoopFile(keyClass, valClass, inputFmt, conf)
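
    The same calls from the Scala shell (a sketch; sc is provided by spark-shell):

    val nums = sc.parallelize(Seq(1, 2, 3))              // local collection -> RDD
    val f1   = sc.textFile("file.txt")                   // local FS, HDFS, or S3
    val f2   = sc.textFile("directory/*.txt")
    val f3   = sc.textFile("hdfs://namenode:9000/path/file")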

  45. Basic Transformations
    nums = sc.parallelize([1, 2, 3])
    # Pass each element through a function
    squares = nums.map(lambda x: x*x) # => {1, 4, 9}
    # Keep elements passing a predicate
    even = squares.filter(lambda x: x % 2 == 0) # => {4}
    # Map each element to zero or more others
    nums.flatMap(lambda x: range(0, x)) # => {0, 0, 1, 0, 1, 2}
    # (range(0, x) is a Range object: the sequence of numbers 0, 1, …, x-1)
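
    The same transformations in Scala (sketch):

    val nums    = sc.parallelize(Seq(1, 2, 3))
    val squares = nums.map(x => x * x)             // => 1, 4, 9
    val even    = squares.filter(_ % 2 == 0)       // => 4
    val ranges  = nums.flatMap(x => 0 until x)     // => 0, 0, 1, 0, 1, 2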

  46. Basic Actions
    nums = sc.parallelize([1, 2, 3])
    # Retrieve RDD contents as a local collection
    nums.collect() # => [1, 2, 3]
    # Return first K elements
    nums.take(2) # => [1, 2]
    # Count number of elements
    nums.count() # => 3
    # Merge elements with an associative function
    nums.reduce(lambda x, y: x + y) # => 6
    # Write elements to a text file
    nums.saveAsTextFile("hdfs://file.txt")

  47. Working with Key-Value Pairs
    Spark's "distributed reduce" transformations act on RDDs
    of key-value pairs
    Python: pair = (a, b)
    pair[0] # => a
    pair[1] # => b
    Scala: val pair = (a, b)
    pair._1 // => a
    pair._2 // => b
    Java: Tuple2 pair = new Tuple2(a, b); // class scala.Tuple2
    pair._1 // => a
    pair._2 // => b

  48. Some Key-Value Operations
    pets = sc.parallelize([(“cat”, 1), (“dog”, 1), (“cat”, 2)])
    pets.reduceByKey(lambda x, y: x + y)
    # => {(cat, 3), (dog, 1)}
    pets.groupByKey()
    # => {(cat, Seq(1, 2)), (dog, Seq(1))}
    pets.sortByKey()
    # => {(cat, 1), (cat, 2), (dog, 1)}
    reduceByKey also automatically implements combiners on the map
    side
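
    The same operations in Scala (sketch; in the 2013-era shell,
    import SparkContext._ brings in the pair-RDD functions):

    val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
    pets.reduceByKey(_ + _)   // => (cat, 3), (dog, 1)
    pets.groupByKey()         // => (cat, Seq(1, 2)), (dog, Seq(1))
    pets.sortByKey()          // => (cat, 1), (cat, 2), (dog, 1)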

  49. Multiple Datasets
    visits = sc.parallelize([("index.html", "1.2.3.4"),
                             ("about.html", "3.4.5.6"),
                             ("index.html", "1.3.3.1")])
    pageNames = sc.parallelize([("index.html", "Home"),
                                ("about.html", "About")])
    visits.join(pageNames)
    # ("index.html", ("1.2.3.4", "Home"))
    # ("index.html", ("1.3.3.1", "Home"))
    # ("about.html", ("3.4.5.6", "About"))
    visits.cogroup(pageNames)
    # ("index.html", (Seq("1.2.3.4", "1.3.3.1"), Seq("Home")))
    # ("about.html", (Seq("3.4.5.6"), Seq("About")))

  50. Closure Mishap Example
    class MyCoolRddApp {
      val param = 3.14
      val log = new Log(...)
      ...
      def work(rdd: RDD[Int]) {
        rdd.map(x => x + param)   // references this.param, so the closure
           .reduce(...)           // drags in the whole object
      }
    }
    // => NotSerializableException: MyCoolRddApp (or Log)

    How to get around it:
    class MyCoolRddApp {
      ...
      def work(rdd: RDD[Int]) {
        val param_ = param        // copy into a local variable
        rdd.map(x => x + param_)  // references only the local variable,
           .reduce(...)           // not this.param
      }
    }

  51. SPARK INTERNALS

  52. Components
    Your program:
    sc = new SparkContext
    f = sc.textFile(…)
    f.filter(…)
     .count()
    ...

    (Diagram: the Spark client (app master) builds the RDD graph and runs the
    scheduler, block tracker and shuffle tracker; a cluster manager launches
    Spark workers, each with task threads and a block manager, on top of
    HDFS, HBase, …)

  53. Example Job
    val sc = new SparkContext(
      "spark://...", "MyJob", home, jars)
    val file = sc.textFile("hdfs://...")            // resilient distributed
    val errors = file.filter(_.contains("ERROR"))   // datasets (RDDs)
    errors.cache()
    errors.count()                                  // action

  54. RDD Graph
    Dataset-level view:
    file:   HadoopRDD   (path = hdfs://...)
    errors: FilteredRDD (func = _.contains(…), shouldCache = true)
    Partition-level view: one task per partition (Task 1, Task 2, ...)

  55. Data Locality
    First run: data not in cache, so use HadoopRDD’s
    locality prefs (from HDFS)
    Second run: FilteredRDD is in cache, so use its
    locations
    If something falls out of cache, go back to HDFS

  56. Broadcast Variables
    When one creates a broadcast variable b with a
    value v, v is saved to a file in a shared file
    system. The serialized form of b is a path to this
    file. When b’s value is queried on a worker
    node, Spark first checks whether v is in a local
    cache, and reads it from the file system if it isn’t.
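
    A small Scala sketch of the usual pattern (the lookup table and variable
    names are made up):

    // Ship a small lookup table to every worker once, instead of with every task
    val pageNames  = Map("index.html" -> "Home", "about.html" -> "About")
    val bPageNames = sc.broadcast(pageNames)
    val visits     = sc.parallelize(Seq("index.html", "about.html", "index.html"))
    val titled     = visits.map(url => (url, bPageNames.value.getOrElse(url, "Unknown")))
    titled.collect().foreach(println)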

  57. Accumulators
    Each accumulator is given a unique ID when it is
    created. When the accumulator is saved, its
    serialized form contains its ID and the “zero” value
    for its type.
    On the workers, a separate copy of the accumulator
    is created for each thread that runs a task using
    thread-local variables, and is reset to zero when a
    task begins. After each task runs, the worker
    sends a message to the driver program containing
    the updates it made to various accumulators.
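
    A typical use, sketched with the accumulator API of this era
    (sc.accumulator; the file path is a placeholder):

    // Count malformed lines on the side while doing the main computation
    val badLines = sc.accumulator(0)
    val firstFields = sc.textFile("hdfs://...").flatMap { line =>
      if (line.contains("\t")) Some(line.split("\t")(0))
      else { badLines += 1; None }
    }
    firstFields.count()       // an action forces evaluation, so updates are sent
    println(badLines.value)   // the driver reads the merged total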

  58. Scheduling Process
    rdd1.join(rdd2)
        .groupBy(…)
        .filter(…)

    RDD Objects: build the operator DAG
    DAGScheduler: splits the graph into stages of tasks and submits each
    stage as it becomes ready (agnostic to operators)
    TaskScheduler: launches the tasks of a TaskSet via the cluster manager
    and retries failed or straggling tasks (doesn't know about stages);
    failed stages go back to the DAGScheduler
    Worker: executes tasks in threads, stores and serves blocks through its
    block manager

  59. Example: HadoopRDD
    partitions = one per HDFS block
    dependencies = none
    compute(partition) = read corresponding block
    preferredLocations(part) = HDFS block location
    partitioner = none
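
    Slides 59-61 are instances of the (informal) RDD interface; a sketch of
    that interface in Scala, not Spark's exact signatures:

    trait Partition
    trait Dependency
    trait Partitioner

    trait RDDLike[T] {
      def partitions: Seq[Partition]                      // how the dataset is split
      def dependencies: Seq[Dependency]                   // parents (narrow or shuffle)
      def compute(p: Partition): Iterator[T]              // materialize one partition
      def preferredLocations(p: Partition): Seq[String]   // locality hints (e.g. HDFS hosts)
      def partitioner: Option[Partitioner]                // key -> partition mapping, if known
    }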

  60. Example: FilteredRDD
    partitions = same as parent RDD
    dependencies = “one-to-one” on parent
    compute(partition) = compute parent and filter it
    preferredLocations(part) = none (ask parent)
    partitioner = none

  61. Example: JoinedRDD
    partitions = one per reduce task
    dependencies = "shuffle" on each parent
    compute(partition) = read and join shuffled data
    preferredLocations(part) = none
    partitioner = HashPartitioner(numTasks)
    (Spark will now know this data is hashed!)

  62. Dependency Types
    "Narrow" deps: map, filter; union; join with inputs co-partitioned
    "Wide" (shuffle) deps: groupByKey; join with inputs not co-partitioned
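
    A quick way to see the difference (sketch): map/filter keep a narrow,
    one-to-one dependency, while groupByKey introduces a shuffle:

    val narrow = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)      // stays in one stage
    val wide   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
                   .groupByKey()                                       // forces a stage boundary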

  63. DAG Scheduler
    Interface: receives a “target” RDD, a function to
    run on each partition, and a listener for results
    Roles:
    Build stages of Task objects (code + preferred loc.)
    Submit them to TaskScheduler as ready
    Resubmit failed stages if outputs are lost

  64. Scheduler Optimizations
    Pipelines narrow ops. within a stage
    Picks join algorithms based on partitioning
    (minimize shuffles)
    Reuses previously cached data

    (Diagram: a DAG of RDDs A-G linked by map, union, groupBy and join, cut
    into Stages 1-3 at shuffle boundaries; previously computed partitions
    are skipped.)

  65. Example: K-Means Clustering using
    Spark

  66. Clustering
    Grouping data according to similarity
    (Scatter plot: Distance East vs. Distance North; e.g. an archaeological dig)

  68. K-Means Algorithm
    Benefits
    • Popular
    • Fast
    • Conceptually straightforward

  69. K-Means: preliminaries
    Data: collection of values
    data = lines.map(line => parseVector(line))

  70. K-Means: preliminaries
    Dissimilarity: squared Euclidean distance
    dist = p.squaredDist(q)

  71. K-Means: preliminaries
    K = number of clusters
    Data assignments to clusters: S_1, S_2, ..., S_K

  73. K-Means Algorithm
    • Initialize K cluster centers
    • Repeat until convergence:
      Assign each data point to the cluster with the closest center.
      Assign each cluster center to be the mean of its cluster's data points.

  75-86. K-Means Algorithm (the code built up one step per slide)
    centers = data.takeSample(false, K, seed)
    closest = data.map(p => (closestPoint(p, centers), p))
    pointsGroup = closest.groupByKey()
    newCenters = pointsGroup.mapValues(ps => average(ps))
    while (dist(centers, newCenters) > ɛ)

  87. K-Means Source
    centers = data.takeSample(false, K, seed)
    while (d > ɛ) {
      closest = data.map(p => (closestPoint(p, centers), p))
      pointsGroup = closest.groupByKey()
      newCenters = pointsGroup.mapValues(ps => average(ps))
      d = distance(centers, newCenters)
      centers = newCenters.map(…)
    }
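
    For reference, a compact, purely local version of the same loop in plain
    Scala collections (closestPoint and average mirror the helpers assumed on
    the slides; the Spark version works on an RDD instead of a Seq):

    object KMeansSketch {
      type Point = Array[Double]

      def dist2(a: Point, b: Point): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      def closestPoint(p: Point, centers: Seq[Point]): Int =
        centers.indices.minBy(i => dist2(p, centers(i)))

      def average(ps: Seq[Point]): Point =
        ps.transpose.map(xs => xs.sum / xs.size).toArray

      def kmeans(data: Seq[Point], k: Int, epsilon: Double): Seq[Point] = {
        var centers = scala.util.Random.shuffle(data).take(k)
        var d = Double.PositiveInfinity
        while (d > epsilon) {
          val assigned   = data.groupBy(p => closestPoint(p, centers))   // cluster -> its points
          val newCenters = centers.indices.map(i => assigned.get(i).map(average).getOrElse(centers(i)))
          d = centers.zip(newCenters).map { case (a, b) => dist2(a, b) }.sum
          centers = newCenters
        }
        centers
      }

      def main(args: Array[String]): Unit = {
        val data = Seq(Array(0.0, 0.0), Array(0.1, 0.2), Array(9.0, 9.0), Array(9.2, 8.8))
        kmeans(data, k = 2, epsilon = 1e-6).foreach(c => println(c.mkString("(", ", ", ")")))
      }
    }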

  88. Ease of use
    - Interactive shell:
    useful for featurization, pre-processing data
    - Lines of code for K-Means:
    Spark ~ 90 lines (part of the hands-on tutorial!)
    Hadoop/Mahout ~ 4 files, > 300 lines

  89. Example: PageRank

  90. Why PageRank?
    Good example of a more complex algorithm
    Multiple stages of map & reduce
    Benefits from Spark’s in-memory caching
    Multiple iterations over the same data

  91. Basic Idea
    Give pages ranks (scores) based on links to them
    Links from many pages → high rank
    Link from a high-rank page → high rank
    Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

  92. Algorithm
    1. Start each page at a rank of 1
    2. On each iteration, have page p contribute
       rank_p / |neighbors_p| to its neighbors
    3. Set each page's rank to 0.15 + 0.85 × contribs

    (Diagram, repeated on slides 93-97: a four-page link graph; starting from
    a rank of 1.0 everywhere, the ranks converge to a final state of
    0.46, 1.37, 1.44 and 0.73.)

  98. Scala Implementation
    val links = // RDD of (url, neighbors) pairs
    var ranks = // RDD of (url, rank) pairs
    for (i <- 1 to ITERATIONS) {
      val contribs = links.join(ranks).flatMap {
        case (url, (links, rank)) =>
          links.map(dest => (dest, rank / links.size))
      }
      ranks = contribs.reduceByKey(_ + _)
                      .mapValues(0.15 + 0.85 * _)
    }
    ranks.saveAsTextFile(...)

  99. PageRank Performance

  100. SPARK STREAMING

  101. What is Spark Streaming?
    Framework for large scale stream processing
    Scales to 100s of nodes
    Can achieve second scale latencies
    Integrates with Spark’s batch and interactive processing
    Provides a simple batch-like API for implementing complex algorithms
    Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

  102. Requirements
    Scalable to large clusters
    Second-scale latencies
    Simple programming model

  103. Requirements
    Scalable to large clusters
    Second-scale latencies
    Simple programming model
    Integrated with batch & interactive processing

  104. Stateful Stream Processing
    Traditional streaming systems have an
    event-driven, record-at-a-time
    processing model
    Each node has mutable state
    For each record, update state & send
    new records
    State is lost if a node dies!
    Making stateful stream processing
    fault-tolerant is challenging

    (Diagram: input records flow into nodes 1-3, each holding mutable state.)

  105. Existing Streaming Systems
    Storm
    Replays a record if it is not processed by a node
    Processes each record at least once
    May update mutable state twice!
    Mutable state can be lost due to failure!
    Trident – uses transactions to update state
    Processes each record exactly once
    Per-state transaction updates are slow

  106. Requirements
    Scalable to large clusters
    Second-scale latencies
    Simple programming model
    Integrated with batch & interactive processing
    Efficient fault-tolerance in stateful computations

  107. Discretized Stream Processing
    Run a streaming computation as a series
    of very small, deterministic batch jobs
    - Chop up the live stream into batches of X seconds
    - Spark treats each batch of data as RDDs
    and processes them using RDD operations
    - Finally, the processed results of the RDD
    operations are returned in batches

    (Diagram: live data stream → Spark Streaming → batches of X seconds
    → Spark → processed results)

  108. Discretized Stream Processing
    Run a streaming computation as a series
    of very small, deterministic batch jobs
    - Batch sizes as low as ½ second, latency ~ 1 second
    - Potential for combining batch processing
    and streaming processing in the same
    system
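
    A minimal, self-contained streaming word count, assuming the 2013-era
    spark.streaming API (package and constructor names changed in later
    releases):

    import spark.streaming.{Seconds, StreamingContext}
    import spark.streaming.StreamingContext._

    object NetworkWordCount {
      def main(args: Array[String]): Unit = {
        // local[2]: one thread for receiving, one for processing; 1-second batches
        val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
        val lines = ssc.socketTextStream("localhost", 9999)   // text arriving on a TCP socket
        val counts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
        counts.print()
        ssc.start()
      }
    }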

  109. Example – Get hashtags from Twitter
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

    DStream: a sequence of RDDs representing a stream of data

    (Diagram: the tweets DStream is a series of batches @ t, t+1, t+2, ...,
    each stored in memory as an RDD (immutable, distributed), fed by the
    Twitter Streaming API.)

  110. Example – Get hashtags from Twitter
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))

    transformation: modify data in one DStream to create another DStream

    (Diagram: flatMap applied to every batch of the tweets DStream yields a
    new hashTags DStream, with new RDDs created for every batch,
    e.g. [#cat, #dog, … ].)

  111. Example – Get hashtags from Twitter
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")

    output operation: push data to external storage

    (Diagram: every batch of the hashTags DStream is saved to HDFS.)

  112. Java Example
    Scala
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")
    Java
    JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })
    hashTags.saveAsHadoopFiles("hdfs://...")
    // Function object used to define the transformation

  113. Fault-tolerance
    RDDs remember the sequence
    of operations that created them from
    the original fault-tolerant input data
    Batches of input data are
    replicated in memory of multiple
    worker nodes, and are therefore fault-tolerant
    Data lost due to worker failure can
    be recomputed from the input data

    (Diagram: input data replicated in memory → flatMap → hashTags RDD;
    lost partitions of the tweets RDD are recomputed on other workers.)

  114. Key concepts
    DStream – sequence of RDDs representing a stream of data
    Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
    Transformations – modify data from one DStream to another
    Standard RDD operations – map, countByValue, reduce, join, …
    Stateful operations – window, countByValueAndWindow, …
    Output Operations – send data to an external entity
    saveAsHadoopFiles – saves to HDFS
    foreach – do anything with each batch of results

  115. Example 2 – Count the hashtags
    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.countByValue()

    (Diagram: for each batch @ t, t+1, t+2, tweets → flatMap → hashTags
    → map + reduceByKey → tagCounts, e.g. [(#cat, 10), (#dog, 25), ... ])

  116. Fault-tolerant Stateful Processing
    All intermediate data are RDDs, hence they can be recomputed if lost

    (Diagram: the tagCounts DStream at times t-1 … t+3 is derived from the
    corresponding hashTags RDDs.)

  117. Fault-tolerant Stateful Processing
    State data not lost even if a worker node dies
    Does not change the value of your result
    Exactly-once semantics for all transformations
    No double counting!

  118. Other Interesting Operations
    Maintaining arbitrary state, tracking sessions
    Maintain per-user mood as state, and update it with his/her tweets
    tweets.updateStateByKey(tweet => updateMood(tweet))
    Do arbitrary Spark RDD computation within a DStream
    Join incoming tweets with a spam file to filter out bad tweets
    tweets.transform(tweetsRDD => {
      tweetsRDD.join(spamHDFSFile).filter(...)
    })

  119. Performance
    Can process 6 GB/sec (60M records/sec) of data on 100 nodes at
    sub-second latency
    Tested with 100 streams of data on 100 EC2 instances with 4 cores each

  120. OTHER PROJECTS
    Scoobi
    Scrunch
    Scala NLP
    Breeze
    Saddle
    Factorie …
    **Scala Notebook

  121. THANKS

  122. Bibliography
    Slides for Scalding shamelessly inspired by:
    - Mario Pastorelli
    http://fr.slideshare.net/melrief/scalding-programming-model-for-hadoop
    - Dean Wampler (@deanwampler)
    Scalding workshop code:
    https://github.com/ThinkBigAnalytics/scalding-workshop
    Slides: http://polyglotprogramming.com/papers/ScaldingForHadoop.pdf
    Spark: http://spark-project.org/documentation/
