The Road to Summingbird: Stream Processing at (Every) Scale

Sam Ritchie
January 11, 2014

Twitter's Summingbird library allows developers and data scientists to build massive streaming MapReduce pipelines without worrying about the usual mess of systems issues that come with realtime systems at scale.

But what if your project is not quite at "scale" yet? Should you ignore scale until it becomes a problem, or swallow the pill ahead of time? Is using Summingbird overkill for small projects? I argue that it's not. This talk will discuss the ideas and components of Summingbird that you could, and SHOULD, use in your startup's code from day one. You'll come away with a new appreciation for monoids and semigroups and a thirst for abstract algebra.


Transcript

  1. THE ROAD TO SUMMINGBIRD :: Stream Processing at (Every) Scale
     Sam Ritchie :: @sritchie :: Data Day Texas 2014
  2. @summingbird

  3. https:// / /summingbird

  4–9. AGENDA
     • Logging and Monitoring in the Small
     • Scaling toward Summingbird - Tooling Overview
     • What breaks at full scale?
     • Summingbird’s Constraints, how they can help
     • Lessons Learned
  10–14. WHAT TO MONITOR?
     • Application “Events”
     • Alert on certain events or patterns
     • Extract metrics from the event stream
     • Dashboards?
  18. PREPPING FOR SCALE

  22. LOG STATEMENTS

     (defn create-user! [username]
       (log/info "User Created: " username)
       (db/create {:type :user
                   :name username
                   :timestamp (System/currentTimeMillis)}))
  24. (diagram: Your App → Heroku Logs)

  25. CENTRALIZED LOGGING

  28. (diagram: Your App → S3)

  29–32. WHAT DO YOU GET?
     • Ability to REACT to system events
     • Long-term storage via S3
     • Searchable Logs
  33–37. WHAT’S MISSING?
     • How many users per day?
     • How many times did this exception show up vs that?
     • Was this the first time I’ve seen that error?
     • Pattern Analysis requires Aggregations
  39. STRUCTURED LOGGING

  40. IMPOSE STRUCTURE

     ;; Before: free-text, hard to aggregate
     (log/info "User Created: " username)

     ;; After: a structured event with a timestamp and request id
     (log/info {:event "user_creation"
                :name "sritchie"
                :timestamp (now)
                :request-id request-id})
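The same before/after can be sketched in Scala (a hypothetical `Event` type; field and function names are illustrative, not an API from the talk):

```scala
// A structured log event: every field is machine-readable, and a
// timestamp is attached at creation time (hypothetical type).
case class Event(
  event: String,                  // e.g. "user_creation"
  fields: Map[String, String],    // structured payload
  timestampMs: Long = System.currentTimeMillis()
)

// Instead of log.info("User Created: " + name), emit a record that
// downstream aggregators can group, count, and join on request-id.
def userCreated(name: String, requestId: String): Event =
  Event("user_creation", Map("name" -> name, "request-id" -> requestId))
```

Counting "how many user creations per day" then becomes a groupBy over `event` and `timestampMs`, not a regex over log text.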
  43. EVENT PROCESSING

  48. (diagram: Your App, GitHub, Mandrill → S3)

  49–55. EVENT PROCESSORS
     • FluentD (http://fluentd.org/)
     • Riemann (http://riemann.io/)
     • Splunk (http://www.splunk.com/)
     • Simmer (https://github.com/avibryant/simmer)
     • StatsD + CollectD (https://github.com/etsy/statsd/)
     • Esper (http://esper.codehaus.org/)
  57. What Breaks at Scale?

  58. SERIALIZATION

  59. • Thrift (http://thrift.apache.org/) SERIALIZATION

  60. • Thrift (http://thrift.apache.org/) • Protocol Buffers (https://code.google.com/p/protobuf/) SERIALIZATION

  61. • Thrift (http://thrift.apache.org/) • Protocol Buffers (https://code.google.com/p/protobuf/) • Avro (http://avro.apache.org/)

    SERIALIZATION
  62. • Thrift (http://thrift.apache.org/) • Protocol Buffers (https://code.google.com/p/protobuf/) • Avro (http://avro.apache.org/)

    • Kryo (https://github.com/EsotericSoftware/kryo) SERIALIZATION
  63–68. LOG COLLECTION
     • Kafka (https://kafka.apache.org/)
     • LogStash (http://logstash.net/)
     • Flume (http://flume.apache.org/)
     • Kinesis (http://aws.amazon.com/kinesis/)
     • Scribe (https://github.com/facebook/scribe)
  69. EVENT PROCESSING

  70. @summingbird

  71. What is Summingbird?
     - Declarative Streaming Map/Reduce DSL
     - Realtime platform that runs on Storm.
     - Batch platform that runs on Hadoop.
     - Batch / Realtime Hybrid platform
  73. val impressionCounts =
       impressionHose.flatMap(extractCounts(_))

     val engagementCounts =
       engagementHose.filter(_.isValid)
         .flatMap(engagementCounts(_))

     val totalCounts =
       (impressionCounts ++ engagementCounts)
         .flatMap(fanoutByTime(_))
         .sumByKey(onlineStore)

     val stormTopology = Storm.remote("stormName").plan(totalCounts)
     val hadoopJob = Scalding("scaldingName").plan(totalCounts)
  74. MAP/REDUCE (diagram: Event Stream 1 and Event Stream 2 feed FlatMappers, then Reducers, then Storage such as Memcache / ElephantDB)
  75–76. The Four Ss!
     - Source[+T]
     - Service[-K, +V]
     - Store[-K, V]
     - Sink[-T]
  77. Store[-K, V]: What values are allowed?

  78. trait Monoid[T] {
       def zero: T
       def plus(l: T, r: T): T
     }
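A minimal sketch of the trait above with two instances: plain integer addition, and the pointwise map merge that lets a reducer sum `(URL -> count)` maps. Instance names are mine, not Algebird's:

```scala
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

// Counts under addition: the simplest useful monoid.
val intMonoid: Monoid[Int] = new Monoid[Int] {
  def zero = 0
  def plus(l: Int, r: Int) = l + r
}

// Maps compose: merge values pointwise using the value monoid.
def mapMonoid[K, V](m: Monoid[V]): Monoid[Map[K, V]] =
  new Monoid[Map[K, V]] {
    def zero = Map.empty
    def plus(l: Map[K, V], r: Map[K, V]): Map[K, V] =
      r.foldLeft(l) { case (acc, (k, v)) =>
        acc.updated(k, m.plus(acc.getOrElse(k, m.zero), v))
      }
  }
```

Because `plus` is all a store needs to merge a new partial result into an existing value, any of the structures on the next slide can sit behind the same `sumByKey`.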
  79. Tons O’Monoids: CMS, HyperLogLog, ExponentialMA, BloomFilter, Moments, MinHash, TopK

  82. Algebird at Scale

  84. MONOID COMPOSITION

  85–88. // Views per URL Tweeted
     (URL, Int)

     // Unique Users per URL Tweeted
     (URL, Set[UserID])

     // Views AND Unique Users per URL
     (URL, (Int, Set[UserID]))

     // Views, Unique Users + Top-K Users
     (URL, (Int, Set[UserID], TopK[(User, Count)]))
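The composition the slides build up can be sketched as a tuple monoid: if `A` and `B` are monoids, so is `(A, B)`, componentwise. This is a sketch of the idea (Algebird ships these instances), showing why "Views AND Unique Users" needs no new aggregation code:

```scala
trait Monoid[T] { def zero: T; def plus(l: T, r: T): T }

val intMonoid: Monoid[Int] = new Monoid[Int] {
  def zero = 0; def plus(l: Int, r: Int) = l + r
}

def setMonoid[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
  def zero = Set.empty; def plus(l: Set[A], r: Set[A]) = l ++ r
}

// Monoids compose pairwise: zero and plus work component by component.
def tupleMonoid[A, B](ma: Monoid[A], mb: Monoid[B]): Monoid[(A, B)] =
  new Monoid[(A, B)] {
    def zero = (ma.zero, mb.zero)
    def plus(l: (A, B), r: (A, B)) =
      (ma.plus(l._1, r._1), mb.plus(l._2, r._2))
  }

// Views AND unique users per URL, summed in a single pass:
val viewsAndUsers: Monoid[(Int, Set[String])] =
  tupleMonoid(intMonoid, setMonoid[String])
```

Bolting on a third component (like the Top-K on the slide) is just another nesting of the same construction.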
  89. ASSOCIATIVITY

  90. ;; 7 steps
     a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7

  91. ;; 7 steps
     (+ a0 a1 a2 a3 a4 a5 a6 a7)

  92. ;; 5 steps
     (+ (+ a0 a1) (+ a2 a3) (+ a4 a5) (+ a6 a7))

  93. ;; 3 steps
     (+ (+ (+ a0 a1) (+ a2 a3)) (+ (+ a4 a5) (+ a6 a7)))
  94. PARALLELISM ASSOCIATIVITY
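The re-parenthesization on the previous slides can be checked directly: associativity guarantees the sequential chain and the parallelizable tree agree.

```scala
// 8 inputs: a chain of 7 dependent additions vs. a tree of depth 3
// whose pairs could run in parallel.
val xs = (0 to 7).toVector

val sequential = xs.reduce(_ + _)  // ((((((a0+a1)+a2)+...)+a7)

// One tree level: sum adjacent pairs (each pair is independent).
def treeStep(v: Vector[Int]): Vector[Int] =
  v.grouped(2).map(_.sum).toVector

val treewise = treeStep(treeStep(treeStep(xs))).head  // 3 levels for 8 inputs
```

Only associativity is used here; commutativity is not required, which is why the tree preserves input order.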

  95–97. BATCH / REALTIME
     (diagram: a log stream split into BatchIDs 0 1 2 3, each batch feeding both Hadoop and a realtime (RT) layer)
     - Noisy: Realtime sums from 0, each batch
     - Fault tolerant: Hadoop keeps a total sum (reliably)
     - Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size
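The merge a client performs at read time can be sketched with in-memory stores (store shapes and names here are hypothetical stand-ins for the real batch and realtime stores):

```scala
// Hybrid read: Hadoop holds exact sums through batch i-1; the realtime
// layer holds noisy sums for batch i only. The client adds the two, so
// error is bounded by at most one batch of realtime noise.
def hybridRead(
  hadoopStore: Map[(String, Int), Long],  // exact sums through batch i-1
  rtStore: Map[(String, Int), Long],      // noisy sums for batch i only
  key: String,
  currentBatch: Int
): Long = {
  val exact = hadoopStore.getOrElse((key, currentBatch - 1), 0L)
  val noisy = rtStore.getOrElse((key, currentBatch), 0L)
  exact + noisy
}
```

When batch i completes on Hadoop, the realtime partial sum for batch i is simply discarded, which is what keeps read/write sizes bounded.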
  98. (diagram: Your App, GitHub, Mandrill → S3 → ElephantDB / Memcached → Clients)

  99. TWEET EMBED COUNTS

  102. Approximate Maps
     - We would probably be okay if for each Key we could get an approximate Value.
     - We might not need to enumerate all resulting keys; perhaps only keys with large values would do.
  103–105. (diagram: a W × D array of counters)
     - for (K,V) => add V to (i, h_i(K))
     - To read, for each h_i(K), take the min.
  106. Count-Min Sketch is an Approximate Map
     - Each K is hashed to d values from [0 to w-1]
     - sum into those buckets
     - Result is min of all buckets.
     - Result is an upper bound on true value.
     - With prob > (1 - delta), error is at most eps * Total Count
     - w = 1 / eps, d = log(1/delta)
     - total cost in memory O(w * d)
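A minimal Count-Min Sketch matching that description (an illustrative sketch, not Algebird's production `CMS`; the per-row hashing scheme is my choice):

```scala
import scala.util.hashing.MurmurHash3

// d rows of w counters. Each key hashes to one bucket per row; adds
// increment all d buckets, reads take the min across rows, which is
// always an upper bound on the true count.
class CountMinSketch(w: Int, d: Int) {
  private val table = Array.ofDim[Long](d, w)

  // The row index doubles as the hash seed, giving d distinct hashes.
  private def bucket(key: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(key, row) % w
    if (h < 0) h + w else h
  }

  def add(key: String, count: Long = 1L): Unit =
    (0 until d).foreach(row => table(row)(bucket(key, row)) += count)

  // With prob > 1 - delta, overestimates by at most eps * totalCount,
  // where w = ceil(1/eps) and d = ceil(ln(1/delta)).
  def estimate(key: String): Long =
    (0 until d).map(row => table(row)(bucket(key, row))).min
}
```

Crucially, two sketches with the same (w, d) merge by summing their tables entry-wise, which is exactly why CMS is one of the monoids listed earlier.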
  107. (diagram: Tweets → (Flat)Mappers → groupBy TweetID → Reducers, reading/writing HDFS/Queue; reduce: (x,y) => MapMonoid; output (TweetID, Map[URL, Long]))
  108. Brief Explanation
     This job creates two types of keys:
     1: ((TweetId, TimeBucket) => CMS[URL, Impressions])
     2: TimeBucket => CMS[TweetId, Impressions]
  109. WHAT ELSE?

  111. WHAT’S NEXT?

  112. Future Plans
     - Akka, Spark, Tez Platforms
     - More Monoids
     - Pluggable graph optimizations
     - Auto-tuning Realtime Topologies
  113–117. TAKEAWAYS
     • Scale - Fake it ‘til you Make It
     • Structured Logging
     • Include timestamps EVERYWHERE
     • Record your Schemas
  118. Sam Ritchie :: @sritchie :: Data Day Texas 2014 Questions?