
The Road to Summingbird: Stream Processing at (Every) Scale

Sam Ritchie
January 11, 2014

Twitter's Summingbird library allows developers and data scientists to build massive streaming MapReduce pipelines without worrying about the usual mess of systems issues that come with realtime systems at scale.

But what if your project is not quite at "scale" yet? Should you ignore scale until it becomes a problem, or swallow the pill ahead of time? Is using Summingbird overkill for small projects? I argue that it's not. This talk will discuss the ideas and components of Summingbird that you could, and SHOULD, use in your startup's code from day one. You'll come away with a new appreciation for monoids and semigroups and a thirst for abstract algebra.


Transcript

  1. THE ROAD TO SUMMINGBIRD
    Sam Ritchie :: @sritchie :: Data Day Texas 2014
    Stream Processing at (Every) Scale

  2. @summingbird

  3. https://github.com/twitter/summingbird

  4. AGENDA

  5-9. AGENDA
    • Logging and Monitoring in the Small
    • Scaling toward Summingbird - Tooling Overview
    • What breaks at full scale?
    • Summingbird’s Constraints, how they can help
    • Lessons Learned

  10-14. WHAT TO MONITOR?
    • Application “Events”
    • Alert on certain events or patterns
    • Extract metrics from the event stream
    • Dashboards?

  18-20. PREPPING FOR SCALE

  22. LOG STATEMENTS
    (defn create-user! [username]
      (log/info "User Created: " username)
      (db/create {:type :user
                  :name username
                  :timestamp (System/currentTimeMillis)}))

  24. [Diagram: Your App → Heroku Logs]

  25. CENTRALIZED LOGGING

  28. [Diagram: Your App → S3]

  29. WHAT DO YOU GET?

    View Slide

  30. • Ability to REACT to system events
    WHAT DO YOU GET?

    View Slide

  31. • Ability to REACT to system events
    • Long-term storage via S3
    WHAT DO YOU GET?

    View Slide

  32. • Ability to REACT to system events
    • Long-term storage via S3
    • Searchable Logs
    WHAT DO YOU GET?

    View Slide

  33-37. WHAT’S MISSING?
    • How many users per day?
    • How many times did this exception show up vs that?
    • Was this the first time I’ve seen that error?
    • Pattern Analysis requires Aggregations

  39. STRUCTURED LOGGING

  40. IMPOSE STRUCTURE
    (log/info "User Created: " username)

    (log/info {:event "user_creation"
               :name "sritchie"
               :timestamp (now)
               :request-id request-id})
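The same idea carries to any language: emit one machine-parseable map per event rather than a free-form string. A minimal sketch in Python for illustration (the deck's example is Clojure; field names mirror the slide, and the `request_id` plumbing is an assumption):

```python
import json
import time
import uuid

def log_event(event, **fields):
    """Emit one structured, machine-parseable log line per event."""
    record = {"event": event,
              "timestamp": int(time.time() * 1000),  # timestamps EVERYWHERE
              **fields}
    print(json.dumps(record, sort_keys=True))
    return record

# Instead of log.info("User Created: " + username):
log_event("user_creation", name="sritchie", request_id=str(uuid.uuid4()))
```

Because every line is a JSON map, downstream aggregators can filter and count on `event` instead of regex-matching log strings.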

  43. EVENT PROCESSING

  48. [Diagram: Your App, Github, Mandrill → S3]

  49. EVENT PROCESSORS

    View Slide

  50. • FluentD (http://fluentd.org/)
    EVENT PROCESSORS

    View Slide

  51. • FluentD (http://fluentd.org/)
    • Riemann (http://riemann.io/)
    EVENT PROCESSORS

    View Slide

  52. • FluentD (http://fluentd.org/)
    • Riemann (http://riemann.io/)
    • Splunk (http://www.splunk.com/)
    EVENT PROCESSORS

    View Slide

  53. • FluentD (http://fluentd.org/)
    • Riemann (http://riemann.io/)
    • Splunk (http://www.splunk.com/)
    • Simmer (https://github.com/avibryant/simmer)
    EVENT PROCESSORS

    View Slide

  54. • FluentD (http://fluentd.org/)
    • Riemann (http://riemann.io/)
    • Splunk (http://www.splunk.com/)
    • Simmer (https://github.com/avibryant/simmer)
    • StatsD + CollectD (https://github.com/etsy/statsd/)
    EVENT PROCESSORS

    View Slide

  55. • FluentD (http://fluentd.org/)
    • Riemann (http://riemann.io/)
    • Splunk (http://www.splunk.com/)
    • Simmer (https://github.com/avibryant/simmer)
    • StatsD + CollectD (https://github.com/etsy/statsd/)
    • Esper (http://esper.codehaus.org/)
    EVENT PROCESSORS

    View Slide

  56. View Slide

  57. What Breaks at Scale?


  58. SERIALIZATION

    View Slide

  59. • Thrift (http://thrift.apache.org/)
    SERIALIZATION

    View Slide

  60. • Thrift (http://thrift.apache.org/)
    • Protocol Buffers (https://code.google.com/p/protobuf/)
    SERIALIZATION

    View Slide

  61. • Thrift (http://thrift.apache.org/)
    • Protocol Buffers (https://code.google.com/p/protobuf/)
    • Avro (http://avro.apache.org/)
    SERIALIZATION

    View Slide

  62. • Thrift (http://thrift.apache.org/)
    • Protocol Buffers (https://code.google.com/p/protobuf/)
    • Avro (http://avro.apache.org/)
    • Kryo (https://github.com/EsotericSoftware/kryo)
    SERIALIZATION

    View Slide

  63-68. LOG COLLECTION
    • Kafka (https://kafka.apache.org/)
    • LogStash (http://logstash.net/)
    • Flume (http://flume.apache.org/)
    • Kinesis (http://aws.amazon.com/kinesis/)
    • Scribe (https://github.com/facebook/scribe)

  69. EVENT PROCESSING

  70. @summingbird

  71. What is Summingbird?
    - Declarative Streaming Map/Reduce DSL
    - Realtime platform that runs on Storm.
    - Batch platform that runs on Hadoop.
    - Batch / Realtime Hybrid platform

  73. val impressionCounts =
      impressionHose.flatMap(extractCounts(_))

    val engagementCounts =
      engagementHose.filter(_.isValid)
        .flatMap(engagementCounts(_))

    val totalCounts =
      (impressionCounts ++ engagementCounts)
        .flatMap(fanoutByTime(_))
        .sumByKey(onlineStore)

    val stormTopology =
      Storm.remote("stormName").plan(totalCounts)

    val hadoopJob =
      Scalding("scaldingName").plan(totalCounts)

  74. MAP/REDUCE
    [Diagram: Event Stream 1 and Event Stream 2 feed FlatMappers (f1, f2),
    whose output is combined by Reducers (+) into Storage
    (Memcache / ElephantDB)]

  75-76. The Four Ss!
    - Source[+T]
    - Service[-K, +V]
    - Store[-K, V]
    - Sink[-T]

  77. Store[-K, V]:
    What values are allowed?


  78. trait Monoid[T] {
      def zero: T
      def plus(l: T, r: T): T
    }
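The Monoid trait above boils down to an identity value plus an associative combine. A toy sketch in Python for illustration (hypothetical names, not Algebird's API):

```python
class Monoid:
    """A type T with an identity `zero` and an associative `plus`."""
    def __init__(self, zero, plus):
        self.zero = zero
        self.plus = plus

    def sum(self, xs):
        """Fold any iterable of T down to a single T."""
        total = self.zero
        for x in xs:
            total = self.plus(total, x)
        return total

int_sum = Monoid(0, lambda l, r: l + r)
set_union = Monoid(frozenset(), lambda l, r: l | r)

int_sum.sum([1, 2, 3])                              # 6
set_union.sum([frozenset({1}), frozenset({1, 2})])  # frozenset({1, 2})
```

The point of the abstraction: any aggregation expressed this way can be summed incrementally, in any grouping, on any platform.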

  79. Tons O’Monoids: CMS, HyperLogLog, ExponentialMA, BloomFilter,
    Moments, MinHash, TopK

  82. Algebird at Scale

  84. MONOID COMPOSITION

  85-88. // Views per URL Tweeted
    (URL, Int)

    // Unique Users per URL Tweeted
    (URL, Set[UserID])

    // Views AND Unique Users per URL
    (URL, (Int, Set[UserID]))

    // Views, Unique Users + Top-K Users
    (URL, (Int, Set[UserID], TopK[(User, Count)]))
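Monoids compose: a tuple of monoids is itself a monoid, merged component-wise, which is why the value types above can keep growing without changing the pipeline. A sketch in Python for illustration (the `tuple_plus` helper is hypothetical, not Algebird's API):

```python
from functools import reduce

def tuple_plus(plus_fns):
    """Given a plus per component, return a plus for tuples of components."""
    def plus(l, r):
        return tuple(p(a, b) for p, a, b in zip(plus_fns, l, r))
    return plus

# Views AND unique users per URL: (Int, Set[UserID]) merged component-wise.
views_and_users = tuple_plus([lambda a, b: a + b,   # views: Int sum
                              lambda a, b: a | b])  # users: Set union

merged = reduce(views_and_users,
                [(3, {"alice"}), (2, {"bob"}), (1, {"alice"})])
# merged == (6, {"alice", "bob"})
```

Adding a third aggregate (e.g. a top-k structure) is just one more component in the tuple and one more plus function.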

  89. ASSOCIATIVITY

  90. ;; 7 steps
    a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7

  91. ;; 7 steps
    (+ a0 a1 a2 a3 a4 a5 a6 a7)

  92. ;; 5 steps
    (+ (+ a0 a1)
       (+ a2 a3)
       (+ a4 a5)
       (+ a6 a7))

  93. ;; 3 steps
    (+ (+ (+ a0 a1)
          (+ a2 a3))
       (+ (+ a4 a5)
          (+ a6 a7)))
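The regrouping on the slides can be sketched directly: because plus is associative, the sum can be computed pairwise in rounds, and each round's combines are independent, so they can run in parallel across reducers. A Python sketch for illustration:

```python
def tree_sum(xs, plus):
    """Pairwise-combine in rounds: 8 values -> 4 -> 2 -> 1 (3 rounds)."""
    rounds = 0
    while len(xs) > 1:
        # Each pair in a round is independent, hence parallelizable.
        xs = [plus(xs[i], xs[i + 1]) if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
        rounds += 1
    return xs[0], rounds

a = [1, 2, 3, 4, 5, 6, 7, 8]
total, rounds = tree_sum(a, lambda l, r: l + r)
# total == 36 (same as the sequential sum), rounds == 3
```

Associativity guarantees the tree-shaped grouping gives the same answer as the left-to-right one; that is the whole license for parallelism.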

  94. PARALLELISM
    ASSOCIATIVITY


  95. BATCH / REALTIME
    [Diagram: BatchID 0, 1, 2, 3. Each batch's Log feeds a Hadoop job
    (fault tolerant) and an RT layer (Noisy). Realtime sums from 0 within
    each batch.]

  96. BATCH / REALTIME
    [Same diagram: Hadoop keeps a total sum (reliably).]

  97. BATCH / REALTIME
    [Same diagram: Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded
    noise and bounded read/write size.]

  98. [Diagram: Your App, Github, Mandrill → S3 → ElephantDB / Memcached
    → Clients]
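The client-side merge those diagrams describe can be sketched as: serve Hadoop's reliable total through the last closed batch, plus the realtime layer's noisy sum for the current batch only. A simplified Python model for illustration (the real merge logic lives in Summingbird's client libraries; store shapes here are assumptions):

```python
def merged_value(key, batch_id, hadoop_store, rt_store, plus, zero):
    """Hadoop total through batch_id - 1, plus noisy RT delta for batch_id."""
    batch_sum = hadoop_store.get((key, batch_id - 1), zero)  # fault tolerant
    rt_sum = rt_store.get((key, batch_id), zero)             # noisy, bounded
    return plus(batch_sum, rt_sum)

hadoop = {("views", 2): 1000}  # total for batches 0..2, written by Hadoop
rt = {("views", 3): 17}        # partial sum for the current batch 3
merged_value("views", 3, hadoop, rt, lambda l, r: l + r, 0)  # 1017
```

Any realtime error is confined to the current batch: once Hadoop reprocesses batch 3's log, its reliable sum replaces the RT contribution.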

  99. TWEET EMBED COUNTS

  102. Approximate Maps
    - We would probably be okay if for each Key we could get an
    approximate Value.
    - We might not need to enumerate all resulting keys; perhaps only
    keys with large values would do.

  103. [Diagram: a D × W array of counters]

  104. [Same diagram] To write (K, V): for each row i, add V to bucket
    (i, h_i(K)).

  105. [Same diagram] To read K: for each h_i(K), take the min.

  106. Count-Min Sketch is an Approximate Map
    - Each K is hashed to d values in [0, w-1]
    - sum into those buckets
    - Result is min of all buckets.
    - Result is an upper bound on the true value.
    - With prob > (1 - delta), error is at most eps * Total Count
    - w = 1 / eps, d = log(1/delta)
    - total cost in memory O(w * d)
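A minimal Count-Min Sketch in Python for illustration, sized per the slide (w = 1/eps, d = log(1/delta)). The salted-SHA hash rows are an assumption; production implementations use cheap pairwise-independent hash families:

```python
import hashlib
import math

class CountMinSketch:
    def __init__(self, eps, delta):
        self.w = math.ceil(1 / eps)              # width: buckets per row
        self.d = math.ceil(math.log(1 / delta))  # depth: number of hash rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _bucket(self, i, key):
        # Row-salted hash: each row i acts as an independent hash function.
        h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
        return int(h, 16) % self.w

    def add(self, key, count=1):
        # Write: for each row i, add count to bucket (i, h_i(key)).
        for i in range(self.d):
            self.table[i][self._bucket(i, key)] += count

    def estimate(self, key):
        # Read: min over rows; always an upper bound on the true count.
        return min(self.table[i][self._bucket(i, key)]
                   for i in range(self.d))
```

Collisions only ever inflate a bucket, so each row over-counts; taking the min across rows keeps the estimate an upper bound while shrinking the expected error.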

  107. [Diagram: Tweets from an HDFS/Queue feed (Flat)Mappers (f), then a
    groupBy TweetID into Reducers (+) with reduce: (x, y) => MapMonoid,
    emitting (TweetID, Map[URL, Long]) to HDFS/Queue]

  108. Brief Explanation
    This job creates two types of keys:
    1. (TweetId, TimeBucket) => CMS[URL, Impressions]
    2. TimeBucket => CMS[TweetId, Impressions]

  109. WHAT ELSE?

  111. WHAT’S NEXT?

  112. Future Plans
    - Akka, Spark, Tez Platforms
    - More Monoids
    - Pluggable graph optimizations
    - Auto-tuning Realtime Topologies

  113-117. TAKEAWAYS
    • Scale - Fake it ‘til you Make It
    • Structured Logging
    • Include timestamps EVERYWHERE
    • Record your Schemas

  118. Sam Ritchie :: @sritchie :: Data Day Texas 2014
    Questions?
