The Road to Summingbird: Stream Processing at (Every) Scale

Sam Ritchie
January 11, 2014

Twitter's Summingbird library allows developers and data scientists to build massive streaming MapReduce pipelines without worrying about the usual mess of systems issues that come with realtime systems at scale.

But what if your project is not quite at "scale" yet? Should you ignore scale until it becomes a problem, or swallow the pill ahead of time? Is using Summingbird overkill for small projects? I argue that it's not. This talk will discuss the ideas and components of Summingbird that you could, and SHOULD, use in your startup's code from day one. You'll come away with a new appreciation for monoids and semigroups and a thirst for abstract algebra.

Transcript

  1. THE ROAD TO SUMMINGBIRD
    Stream Processing at (Every) Scale
    Sam Ritchie :: @sritchie :: Data Day Texas 2014

  2. @summingbird

  3. https:// / /summingbird

  8. AGENDA
    • Logging and Monitoring in the Small
    • Scaling toward Summingbird - Tooling Overview
    • What breaks at full scale?
    • Summingbird’s Constraints, how they can help
    • Lessons Learned

  13. WHAT TO MONITOR?
    • Application “Events”
    • on certain events or patterns
    • Extract metrics from the event stream
    • Dashboards?

  14. PREPPING FOR SCALE

  17. LOG STATEMENTS
    (defn create-user! [username]
      (log/info "User Created: " username)
      (db/create {:type :user
                  :name username
                  :timestamp (System/currentTimeMillis)}))

  18. [Diagram: Your App, Heroku Logs]

  19. CENTRALIZED LOGGING

  23. WHAT DO YOU GET?
    • Ability to REACT to system events
    • Long-term storage via S3
    • Searchable Logs

  28. WHAT’S MISSING?
    • How many users per day?
    • How many times did this exception show up vs that?
    • Was this the first time I’ve seen that error?
    • Pattern Analysis requires Aggregations
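
Each question on this slide is an aggregation over the event stream. A minimal Python sketch, assuming events already arrive as structured records; the field names (`event`, `name`, `timestamp` in epoch milliseconds) and the sample data are illustrative assumptions, not taken from the talk:

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical structured events; field names and values are assumptions.
events = [
    {"event": "user_creation", "name": "a", "timestamp": 1389398400000},
    {"event": "user_creation", "name": "b", "timestamp": 1389402000000},
    {"event": "exception", "name": "TimeoutError", "timestamp": 1389484800000},
    {"event": "exception", "name": "TimeoutError", "timestamp": 1389488400000},
    {"event": "exception", "name": "KeyError", "timestamp": 1389488500000},
]

def day_of(ts_millis):
    """Bucket an epoch-millisecond timestamp into a UTC calendar day."""
    return datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc).date().isoformat()

def users_per_day(events):
    """How many users per day?"""
    return Counter(day_of(e["timestamp"]) for e in events if e["event"] == "user_creation")

def exception_counts(events):
    """How many times did this exception show up vs that?"""
    return Counter(e["name"] for e in events if e["event"] == "exception")

def first_seen(events):
    """Earliest timestamp at which each exception name appeared."""
    first = {}
    for e in events:
        if e["event"] == "exception":
            first[e["name"]] = min(e["timestamp"], first.get(e["name"], e["timestamp"]))
    return first
```

None of these can be answered by searching free-form log lines alone; each needs a key to group by and a value to combine.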

  29. STRUCTURED LOGGING

  30. IMPOSE STRUCTURE
    ;; Before: free-form string
    (log/info "User Created: " username)
    ;; After: structured event map
    (log/info {:event "user_creation"
               :name "sritchie"
               :timestamp (now)
               :request-id request-id})
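
The same idea can be sketched in Python, assuming one JSON object per log line. The helper names `make_event` and `log_line` are hypothetical; only the fields (`event`, `name`, `timestamp`, `request-id`) come from the slide:

```python
import json
import time
import uuid

def make_event(event, **fields):
    """Build a structured log record; every record carries a timestamp."""
    record = {"event": event, "timestamp": int(time.time() * 1000)}
    record.update(fields)
    return record

def log_line(record):
    """Serialize one record as a single JSON line, ready for a log collector."""
    return json.dumps(record, sort_keys=True)

# Hypothetical request id; in a real app this would come from the request context.
request_id = str(uuid.uuid4())
line = log_line(make_event("user_creation", name="sritchie", request_id=request_id))

# A downstream event processor can parse the line back without regexes.
parsed = json.loads(line)
```

Because every line parses back into the same shape, downstream processors can filter and aggregate on fields instead of pattern-matching strings.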

  31. EVENT PROCESSING

  32. [Diagram: Your App, Github, Mandrill, S3]

  39. EVENT PROCESSORS
    • FluentD (http://fluentd.org/)
    • Riemann (http://riemann.io/)
    • Splunk (http://www.splunk.com/)
    • Simmer (https://github.com/avibryant/simmer)
    • StatsD + CollectD (https://github.com/etsy/statsd/)
    • Esper (http://esper.codehaus.org/)

  40. What Breaks at Scale?

  45. SERIALIZATION
    • Thrift (http://thrift.apache.org/)
    • Protocol Buffers (https://code.google.com/p/protobuf/)
    • Avro (http://avro.apache.org/)
    • Kryo (https://github.com/EsotericSoftware/kryo)

  51. LOG COLLECTION
    • Kafka (https://kafka.apache.org/)
    • LogStash (http://logstash.net/)
    • Flume (http://flume.apache.org/)
    • Kinesis (http://aws.amazon.com/kinesis/)
    • Scribe (https://github.com/facebook/scribe)

  52. EVENT PROCESSING

  53. @summingbird

  54. What is Summingbird?
    - Declarative Streaming Map/Reduce DSL
    - Realtime platform that runs on Storm
    - Batch platform that runs on Hadoop
    - Batch / Realtime Hybrid platform

  55. val impressionCounts =
      impressionHose.flatMap(extractCounts(_))
    val engagementCounts =
      engagementHose.filter(_.isValid)
        .flatMap(engagementCounts(_))
    val totalCounts =
      (impressionCounts ++ engagementCounts)
        .flatMap(fanoutByTime(_))
        .sumByKey(onlineStore)
    val stormTopology =
      Storm.remote("stormName").plan(totalCounts)
    val hadoopJob =
      Scalding("scaldingName").plan(totalCounts)
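
To see what the pipeline above computes, here is a small in-memory Python sketch of the same shape: flatMap into (key, value) pairs, merge the two streams, fan out by time, then sum by key. It is not the Summingbird API; the event fields and the bodies of `extract_counts` and `fanout_by_time` are assumptions made for illustration:

```python
from collections import defaultdict

def flat_map(stream, fn):
    """Apply fn to each item and flatten the resulting lists of (key, value) pairs."""
    return [pair for item in stream for pair in fn(item)]

def sum_by_key(pairs, store=None):
    """Add each value into the store under its key (integer addition here)."""
    store = defaultdict(int) if store is None else store
    for key, value in pairs:
        store[key] += value
    return store

# Hypothetical events standing in for the impression and engagement hoses.
impressions = [{"tweet": 1, "hour": 0}, {"tweet": 1, "hour": 1}, {"tweet": 2, "hour": 1}]
engagements = [{"tweet": 1, "hour": 1, "valid": True}, {"tweet": 2, "hour": 1, "valid": False}]

def extract_counts(e):
    """Assumed extractor: one count per event, keyed by (tweet, hour)."""
    return [((e["tweet"], e["hour"]), 1)]

def fanout_by_time(pair):
    """Assumed fanout: emit an hourly key and an all-time key for each count."""
    (tweet, hour), n = pair
    return [((tweet, ("hour", hour)), n), ((tweet, "all"), n)]

impression_counts = flat_map(impressions, extract_counts)
engagement_counts = flat_map([e for e in engagements if e["valid"]], extract_counts)
total_counts = sum_by_key(flat_map(impression_counts + engagement_counts, fanout_by_time))
```

The job description never mentions where it runs, which is what lets the same logic be planned for either a realtime or a batch platform.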

  56. MAP/REDUCE
    [Diagram: Event Stream 1 and Event Stream 2 → FlatMappers (f1, f2) → Reducers (+) → Storage (Memcache / ElephantDB)]

  58. The Four Ss!
    - Source[+T]
    - Service[-K, +V]
    - Store[-K, V]
    - Sink[-T]

  59. Store[-K, V]:
    What values are allowed?

  60. trait Monoid[T] {
      def zero: T
      def plus(l: T, r: T): T
    }
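
A rough Python analogue of this trait, as a sketch rather than a port of Algebird: a monoid is a `zero` value plus an associative `plus`. The three instances below are illustrative choices:

```python
from functools import reduce

class Monoid:
    """A zero element plus an associative plus(l, r)."""

    def __init__(self, zero, plus):
        self.zero = zero
        self.plus = plus

    def sum(self, values):
        """Fold any number of values, starting from zero."""
        return reduce(self.plus, values, self.zero)

# Illustrative instances: integer addition, set union, and max.
int_add = Monoid(0, lambda l, r: l + r)
set_union = Monoid(frozenset(), lambda l, r: l | r)
max_monoid = Monoid(float("-inf"), max)
```

Any value type with such an instance can be summed by the same aggregation code, including the empty case, which returns `zero`.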

  61. Tons O’Monoids:
    CMS,
    HyperLogLog,
    ExponentialMA,
    BloomFilter,
    Moments,
    MinHash,
    TopK

  62. Algebird at Scale

  63. MONOID COMPOSITION

  67. // Views per URL Tweeted
    (URL, Int)
    // Unique Users per URL Tweeted
    (URL, Set[UserID])
    // Views AND Unique Users per URL
    (URL, (Int, Set[UserID]))
    // Views, Unique Users + Top-K Users
    (URL, (Int, Set[UserID], TopK[(User, Count)]))
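
Composition works because a tuple of monoids is itself a monoid (combine field by field), and so is a map whose values form a monoid (combine per key). A minimal Python sketch under those assumptions; the helper names `product` and `map_plus` and the sample URLs and user IDs are hypothetical:

```python
from functools import reduce

def int_plus(l, r):
    return l + r

def set_plus(l, r):
    return l | r

def product(*pluses):
    """Combine per-field plus functions into one plus over tuples."""
    def plus(l, r):
        return tuple(p(a, b) for p, a, b in zip(pluses, l, r))
    return plus

def map_plus(value_plus):
    """Lift a plus on values to a plus on dicts, merging per key."""
    def plus(l, r):
        out = dict(l)
        for k, v in r.items():
            out[k] = value_plus(out[k], v) if k in out else v
        return out
    return plus

# URL -> (views, unique users), built from the two smaller monoids.
url_stats_plus = map_plus(product(int_plus, set_plus))

# Each event contributes one view by one user to one URL.
events = [
    {"http://a": (1, frozenset({"u1"}))},
    {"http://a": (1, frozenset({"u2"}))},
    {"http://b": (1, frozenset({"u1"}))},
    {"http://a": (1, frozenset({"u1"}))},
]
totals = reduce(url_stats_plus, events, {})
```

Adding a third statistic means adding a third field to the tuple and a third plus function; the aggregation itself does not change.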

  68. ASSOCIATIVITY

  69. ;; 7 steps
    a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7

  70. ;; 7 steps
    (+ a0 a1 a2 a3 a4 a5 a6 a7)

  71. ;; 5 steps
    (+ (+ a0 a1)
       (+ a2 a3)
       (+ a4 a5)
       (+ a6 a7))

  72. ;; 3 steps
    (+ (+ (+ a0 a1)
          (+ a2 a3))
       (+ (+ a4 a5)
          (+ a6 a7)))

  73. PARALLELISM
    ASSOCIATIVITY
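
The "3 steps" grouping can be sketched as a pairwise tree reduction in Python. The function name `tree_reduce` is hypothetical, and the sketch runs sequentially; the point is that the additions within one level do not depend on each other, so they could run in parallel:

```python
def tree_reduce(plus, values):
    """Reduce a non-empty list by combining adjacent pairs level by level.

    Returns (result, levels). Only associativity is required, so the
    order of the operands is preserved at every level.
    """
    level = list(values)
    steps = 0
    while len(level) > 1:
        # Every addition in this comprehension is independent of the others.
        paired = [plus(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd element carries over unchanged
        level = paired
        steps += 1
    return level[0], steps

values = list(range(8))  # stands in for a0 .. a7
total, steps = tree_reduce(lambda l, r: l + r, values)
```

Eight values take three levels instead of seven sequential additions, and the result matches a left-to-right sum for any associative `plus`, including non-commutative ones such as string concatenation.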

  74. BATCH / REALTIME
    [Diagram: a timeline of BatchIDs 0 to 3, each with its own Log segment. A fault-tolerant row of Hadoop jobs and a noisy row of realtime (RT) jobs both consume the logs.]
    • Realtime sums from 0, each batch
    • Hadoop keeps a total sum (reliably)
    • Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size
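
A minimal Python sketch of the merge rule "RT Batch(i) + Hadoop Batch(i-1)", assuming integer batch IDs starting at 0 and two plain dicts standing in for the batch and realtime stores. The function name `merged_read` and the store layout are assumptions; the sketch also covers the case where Hadoop lags by more than one batch by adding every realtime batch it has not yet covered:

```python
def merged_read(key, batch_id, batch_store, realtime_store,
                plus=lambda l, r: l + r, zero=0):
    """Total for `key` as of `batch_id`.

    batch_store[b][key]    : reliable running total through the end of batch b
    realtime_store[b][key] : noisy sum of only the events inside batch b
    """
    # Latest batch the reliable side has finished before the requested one.
    last_batch = max((b for b in batch_store if b < batch_id), default=None)
    total = batch_store[last_batch].get(key, zero) if last_batch is not None else zero
    start = 0 if last_batch is None else last_batch + 1
    # Add the realtime sums for every batch the reliable side has not covered.
    for b in range(start, batch_id + 1):
        total = plus(total, realtime_store.get(b, {}).get(key, zero))
    return total

# Hypothetical data: Hadoop has finished batches 0 and 1; realtime is slightly off.
batch_store = {0: {"url": 10}, 1: {"url": 25}}
realtime_store = {0: {"url": 11}, 1: {"url": 14}, 2: {"url": 7}, 3: {"url": 2}}
```

Realtime error from batches 0 and 1 never reaches the answer for batch 2, since the reliable total replaces it; only the uncovered realtime batches contribute noise.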

  77. [Diagram: Your App, Github, Mandrill, S3, ElephantDB, Memcached, Clients]

  78. TWEET EMBED COUNTS

  79. Approximate Maps
    - We would probably be okay if for each Key we could get an approximate Value.
    - We might not need to enumerate all resulting keys; perhaps only keys with large values would do.

  80. [Diagram: table of counters, W wide and D deep]
    for (K,V) => add V to (i, h_i(K))

  81. [Diagram: table of counters, W wide and D deep]
    To read, for each h_i(K), take the min.

  82. Count-Min Sketch is an Approximate Map
    - Each K is hashed to d values from [0 to w-1]
    - sum into those buckets
    - Result is min of all buckets.
    - Result is an upper bound on true value (for non-negative counts).
    - With prob > (1 - delta), error is at most eps * Total Count
    - w = ceil(e / eps), d = ceil(ln(1/delta))
    - total cost in memory O(w * d)
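
A compact Python sketch of a Count-Min Sketch with a monoid-style `plus`, using the standard sizing w = ceil(e / eps) and d = ceil(ln(1 / delta)). This is not Algebird's implementation; the class layout is an assumption, and a salted SHA-256 stands in for a proper pairwise-independent hash family:

```python
import hashlib
import math

class CountMinSketch:
    def __init__(self, eps, delta, table=None):
        self.eps, self.delta = eps, delta
        self.w = math.ceil(math.e / eps)         # buckets per row
        self.d = math.ceil(math.log(1 / delta))  # number of rows (hash functions)
        self.table = table or [[0] * self.w for _ in range(self.d)]

    def _bucket(self, row, key):
        # Assumption: a per-row salted SHA-256 approximates independent hashes.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, key, value=1):
        """Add value into one bucket per row."""
        for i in range(self.d):
            self.table[i][self._bucket(i, key)] += value

    def estimate(self, key):
        """Min over rows; never below the true count for non-negative values."""
        return min(self.table[i][self._bucket(i, key)] for i in range(self.d))

    def plus(self, other):
        """Monoid plus: cell-wise sum of two sketches with the same shape."""
        assert (self.w, self.d) == (other.w, other.d)
        table = [[a + b for a, b in zip(r1, r2)]
                 for r1, r2 in zip(self.table, other.table)]
        return CountMinSketch(self.eps, self.delta, table)

# Two partial sketches, e.g. from two workers, merged into one.
a = CountMinSketch(eps=0.01, delta=0.01)
b = CountMinSketch(eps=0.01, delta=0.01)
a.add("x", 3)
a.add("y")
b.add("x", 2)
merged = a.plus(b)
```

Because `plus` is a cell-wise sum, sketches built on separate workers or in separate batches merge into the same table a single worker would have built.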

  83. [Diagram: Tweets, HDFS/Queue, (Flat)Mappers (f), Reducers (+), HDFS/Queue]
    groupBy TweetID
    (TweetID, Map[URL, Long])
    reduce: (x,y) => MapMonoid

  84. Brief Explanation
    This job creates two types of keys:
    1: ((TweetId, TimeBucket) => CMS[URL, Impressions])
    2: TimeBucket => CMS[TweetId, Impressions]

  85. WHAT’S NEXT?

  86. Future Plans
    - Akka, Spark, Tez Platforms
    - More Monoids
    - Pluggable graph optimizations
    - Auto-tuning Realtime Topologies

  90. TAKEAWAYS
    • Scale - Fake it ‘til you Make It
    • Structured Logging
    • Include timestamps EVERYWHERE
    • Record your Schemas

  91. Sam Ritchie :: @sritchie :: Data Day Texas 2014
    Questions?
