The Road to Summingbird: Stream Processing at (Every) Scale

Twitter's Summingbird library allows developers and data scientists to build massive streaming MapReduce pipelines without worrying about the usual mess of systems issues that come with realtime systems at scale.

But what if your project is not quite at "scale" yet? Should you ignore scale until it becomes a problem, or swallow the pill ahead of time? Is using Summingbird overkill for small projects? I argue that it's not. This talk will discuss the ideas and components of Summingbird that you could, and SHOULD, use in your startup's code from day one. You'll come away with a new appreciation for monoids and semigroups and a thirst for abstract algebra.

Sam Ritchie

January 11, 2014

Transcript

  1. 1.

    THE ROAD TO SUMMINGBIRD
    Stream Processing at (Every) Scale
    Sam Ritchie :: @sritchie :: Data Day Texas 2014
  6. 9.

    AGENDA
    • Logging and Monitoring in the Small
    • Scaling toward Summingbird - Tooling Overview
    • What breaks at full scale?
    • Summingbird’s Constraints, how they can help
    • Lessons Learned
  8. 14.

    WHAT TO MONITOR?
    • Application “Events”
    • on certain events or patterns
    • Extract metrics from the event stream
    • Dashboards?
  13. 22.

    LOG STATEMENTS

    (defn create-user! [username]
      (log/info "User Created: " username)
      (db/create {:type :user
                  :name username
                  :timestamp (System/currentTimeMillis)}))
  17. 32.

    WHAT DO YOU GET?
    • Ability to REACT to system events
    • Long-term storage via S3
    • Searchable Logs
  20. 37.

    WHAT’S MISSING?
    • How many users per day?
    • How many times did this exception show up vs that?
    • Was this the first time I’ve seen that error?
    • Pattern Analysis requires Aggregations
  22. 40.

    IMPOSE STRUCTURE

    ;; Before: a free-form string
    (log/info "User Created: " username)

    ;; After: a structured event
    (log/info {:event "user_creation"
               :name "sritchie"
               :timestamp (now)
               :request-id request-id})
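
    Once events are structured records rather than free-form strings, questions such as "how many users per day?" become simple aggregations. The following is a minimal sketch in Scala, assuming a hypothetical `Event` case class whose fields mirror the structured log map above. None of these names come from the talk.

    ```scala
    import java.time.{Instant, LocalDate, ZoneOffset}

    object StructuredLogExample {
      // Hypothetical event record mirroring the structured log map on the slide.
      final case class Event(event: String, name: String, timestampMillis: Long, requestId: String)

      // Bucket a millisecond timestamp into its UTC calendar day.
      def dayOf(timestampMillis: Long): LocalDate =
        Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC).toLocalDate

      // Count "user_creation" events per UTC day.
      def usersPerDay(events: Seq[Event]): Map[LocalDate, Int] =
        events
          .filter(_.event == "user_creation")
          .groupBy(e => dayOf(e.timestampMillis))
          .map { case (day, es) => day -> es.size }

      def main(args: Array[String]): Unit = {
        val events = Seq(
          Event("user_creation", "sritchie", 0L, "req-1"),
          Event("user_creation", "alice", 1000L, "req-2"),
          Event("login", "alice", 2000L, "req-3"),
          Event("user_creation", "bob", 86400000L, "req-4")
        )
        usersPerDay(events).toSeq.sortBy(_._1.toEpochDay).foreach(println)
      }
    }
    ```

    The same aggregation over raw log strings would need a parser for every message format, which is the cost that structured events avoid.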
  30. 55.

    EVENT PROCESSORS
    • FluentD (http://fluentd.org/)
    • Riemann (http://riemann.io/)
    • Splunk (http://www.splunk.com/)
    • Simmer (https://github.com/avibryant/simmer)
    • StatsD + CollectD (https://github.com/etsy/statsd/)
    • Esper (http://esper.codehaus.org/)
  32. 68.

    LOG COLLECTION
    • Kafka (https://kafka.apache.org/)
    • LogStash (http://logstash.net/)
    • Flume (http://flume.apache.org/)
    • Kinesis (http://aws.amazon.com/kinesis/)
    • Scribe (https://github.com/facebook/scribe)
  33. 71.

    What is Summingbird?
    - Declarative Streaming Map/Reduce DSL
    - Realtime platform that runs on Storm.
    - Batch platform that runs on Hadoop.
    - Batch / Realtime Hybrid platform
  35. 73.

    val impressionCounts =
      impressionHose.flatMap(extractCounts(_))

    val engagementCounts =
      engagementHose.filter(_.isValid)
        .flatMap(engagementCounts(_))

    val totalCounts =
      (impressionCounts ++ engagementCounts)
        .flatMap(fanoutByTime(_))
        .sumByKey(onlineStore)

    val stormTopology = Storm.remote("stormName").plan(totalCounts)
    val hadoopJob = Scalding("scaldingName").plan(totalCounts)
  36. 74.

    MAP/REDUCE
    [Diagram: Event Stream 1 and Event Stream 2 feed FlatMappers (f1, f2); their output is summed by Reducers (+) and written to Storage (Memcache / ElephantDB)]
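
    The shape of this diagram (flatMappers feeding per-key reducers) can be imitated on in-memory collections. The following is a rough sketch that assumes plain Scala collections in place of Summingbird's sources and stores. `extractCounts` here is a hypothetical stand-in and not the function used in the talk.

    ```scala
    object MapReduceSketch {
      // Hypothetical flatMapper: one (key, 1) pair per whitespace-separated word.
      def extractCounts(line: String): Seq[(String, Long)] =
        line.split("\\s+").filter(_.nonEmpty).toSeq.map(word => (word, 1L))

      // Reducer: combine all values that share a key with an associative +.
      def sumByKey(pairs: Seq[(String, Long)]): Map[String, Long] =
        pairs.foldLeft(Map.empty[String, Long]) { case (acc, (k, v)) =>
          acc.updated(k, acc.getOrElse(k, 0L) + v)
        }

      def main(args: Array[String]): Unit = {
        val stream1 = Seq("a b a")
        val stream2 = Seq("b c")
        // Merge the two event streams, flatMap each event, then reduce per key.
        val totals = sumByKey((stream1 ++ stream2).flatMap(extractCounts))
        println(totals)
      }
    }
    ```

    Because the reducer only ever calls `+`, the same logic can run over a finite batch or one event at a time, which is the property the rest of the deck builds on.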
  42. 88.

    // Views per URL Tweeted
    (URL, Int)

    // Unique Users per URL Tweeted
    (URL, Set[UserID])

    // Views AND Unique Users per URL
    (URL, (Int, Set[UserID]))

    // Views, Unique Users + Top-K Users
    (URL, (Int, Set[UserID], TopK[(User, Count)]))
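
    Each value type on this slide can be summed, and a tuple of summable values is itself summable component by component. The following is a minimal sketch of that idea. It uses a hand-rolled `Monoid` trait as a stand-in, so it should not be read as Algebird's actual API, and the names are assumptions made for illustration.

    ```scala
    object TupleMonoidSketch {
      // Stand-in abstraction: an associative plus with an identity element.
      trait Monoid[T] {
        def zero: T
        def plus(a: T, b: T): T
      }

      implicit val intMonoid: Monoid[Int] = new Monoid[Int] {
        def zero: Int = 0
        def plus(a: Int, b: Int): Int = a + b
      }

      implicit def setMonoid[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
        def zero: Set[A] = Set.empty[A]
        def plus(a: Set[A], b: Set[A]): Set[A] = a union b
      }

      // A pair of monoids is a monoid: combine each component independently.
      implicit def pairMonoid[A, B](implicit ma: Monoid[A], mb: Monoid[B]): Monoid[(A, B)] =
        new Monoid[(A, B)] {
          def zero: (A, B) = (ma.zero, mb.zero)
          def plus(x: (A, B), y: (A, B)): (A, B) =
            (ma.plus(x._1, y._1), mb.plus(x._2, y._2))
        }

      def sum[T](values: Seq[T])(implicit m: Monoid[T]): T =
        values.foldLeft(m.zero)(m.plus)

      def main(args: Array[String]): Unit = {
        // (views, unique user IDs) for one URL, one element per view event.
        val events: Seq[(Int, Set[String])] =
          Seq((1, Set("u1")), (1, Set("u2")), (1, Set("u1")))
        println(sum(events))
      }
    }
    ```

    Adding a third metric then means adding a third component with its own monoid, while the job that does the summing stays unchanged.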
  43. 90.

    ;; 7 steps
    a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7
  44. 92.

    ;; 5 steps
    (+ (+ a0 a1)
       (+ a2 a3)
       (+ a4 a5)
       (+ a6 a7))
  45. 93.

    ;; 3 steps
    (+ (+ (+ a0 a1) (+ a2 a3))
       (+ (+ a4 a5) (+ a6 a7)))
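
    The regrouping above is only legal because + is associative. The following is a small sketch of a pairwise (tree) reduction that also counts how many rounds it needs. `treeReduce` is a hypothetical helper written for this illustration, and the additions within a round are run sequentially here even though they are independent of one another.

    ```scala
    object TreeReduceSketch {
      // Combine neighbours pairwise, round by round. The additions inside one
      // round do not depend on each other, so they could run in parallel.
      // Returns (total, number of rounds). Only valid when `plus` is associative.
      def treeReduce[T](values: Vector[T])(plus: (T, T) => T): (T, Int) = {
        require(values.nonEmpty, "need at least one value")
        var level = values
        var rounds = 0
        while (level.size > 1) {
          level = level.grouped(2).map {
            case Vector(a, b) => plus(a, b)
            case Vector(a)    => a // an odd element carries over unchanged
          }.toVector
          rounds += 1
        }
        (level.head, rounds)
      }

      def main(args: Array[String]): Unit = {
        val (total, rounds) = treeReduce(Vector(1, 2, 3, 4, 5, 6, 7, 8))(_ + _)
        println(s"total=$total rounds=$rounds")
      }
    }
    ```

    For eight values this takes three rounds, matching the "3 steps" grouping, where a strictly left-to-right sum takes seven dependent additions.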
  46. 95.

    BATCH / REALTIME
    [Diagram: BatchIDs 0 to 3; a "fault tolerant" row of Log and Hadoop stages and a "Noisy" row of RT (realtime) stages]
    Realtime sums from 0, each batch
  47. 96.

    BATCH / REALTIME
    [Diagram: BatchIDs 0 to 3; a "fault tolerant" row of Log and Hadoop stages and a "Noisy" row of RT (realtime) stages]
    Hadoop keeps a total sum (reliably)
  48. 97.

    BATCH / REALTIME
    [Diagram: BatchIDs 0 to 3; a "fault tolerant" row of Log and Hadoop stages and a "Noisy" row of RT (realtime) stages]
    Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size
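
    The read path described on these slides can be sketched as a lookup that adds the reliable batch total to the realtime sum for the one batch that follows it. This is a simplified illustration under assumed data shapes (two in-memory maps) and not Summingbird's client code. `hybridLookup` and its parameters are hypothetical names.

    ```scala
    object HybridReadSketch {
      // batchTotals:  per key, the reliable running total through `lastBatchId`
      //               (the "Hadoop keeps a total sum" side).
      // realtimeSums: per (key, batchId), a possibly noisy sum that starts
      //               from zero at the beginning of that batch.
      def hybridLookup(
          key: String,
          lastBatchId: Int,
          batchTotals: Map[String, Long],
          realtimeSums: Map[(String, Int), Long]
      ): Long = {
        val reliable = batchTotals.getOrElse(key, 0L)
        // Only the realtime batch after the last completed Hadoop batch is read.
        val recent = realtimeSums.getOrElse((key, lastBatchId + 1), 0L)
        reliable + recent
      }

      def main(args: Array[String]): Unit = {
        val batchTotals = Map("url" -> 100L) // total through batch 2
        val realtimeSums = Map(("url", 2) -> 40L, ("url", 3) -> 7L)
        println(hybridLookup("url", 2, batchTotals, realtimeSums))
      }
    }
    ```

    Any noise in the realtime layer is confined to a single batch, because older realtime sums are superseded as soon as the batch job covers them.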
  51. 102.

    Approximate Maps
    - We would probably be okay if for each Key we could get an approximate Value.
    - We might not need to enumerate all resulting keys; perhaps only keys with large values would do.
  52. 103.

    [Figure: table of counters, W columns wide and D rows deep]

  53. 106.

    Count-Min Sketch is an Approximate Map
    - Each K is hashed to d values from [0 to w-1]
    - sum into those buckets
    - Result is min of all buckets.
    - Result is an upper bound on true value.
    - With prob > (1 - delta), error is at most eps * Total Count
    - w = ceil(e / eps), d = ceil(ln(1/delta))
    - total cost in memory O(w * d)
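
    The bullets above translate fairly directly into code. The following is a minimal, non-production sketch of a Count-Min Sketch over string keys. It is sized with the standard w = ceil(e / eps), d = ceil(ln(1/delta)) and uses seeded MurmurHash3 as an assumed hash family. It is not Algebird's implementation, and the class and method names are made up for illustration.

    ```scala
    import scala.util.hashing.MurmurHash3

    object CountMinSketchExample {
      // `depth` rows of `width` counters; row i hashes keys with seed i.
      final class CountMinSketch(val width: Int, val depth: Int) {
        private val table = Array.ofDim[Long](depth, width)

        private def bucket(key: String, row: Int): Int =
          java.lang.Math.floorMod(MurmurHash3.stringHash(key, row), width)

        // Add `count` to one bucket in every row.
        def add(key: String, count: Long = 1L): Unit =
          for (row <- 0 until depth) table(row)(bucket(key, row)) += count

        // Minimum over rows. Collisions only ever add, so this never
        // falls below the true count.
        def estimate(key: String): Long =
          (0 until depth).map(row => table(row)(bucket(key, row))).min
      }

      // Standard sizing: w = ceil(e / eps), d = ceil(ln(1 / delta)).
      def sized(eps: Double, delta: Double): CountMinSketch =
        new CountMinSketch(
          math.ceil(math.E / eps).toInt,
          math.ceil(math.log(1.0 / delta)).toInt
        )

      def main(args: Array[String]): Unit = {
        val cms = sized(eps = 0.01, delta = 0.01)
        cms.add("http://example.com/a", 5L)
        cms.add("http://example.com/b")
        println(cms.estimate("http://example.com/a"))
      }
    }
    ```

    Two sketches with the same dimensions and hash seeds can be merged by adding their tables cell by cell, which is what lets a sketch be summed like any other value.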
  54. 107.

    [Diagram: Tweets flow into (Flat)Mappers (f), through an HDFS/Queue into Reducers (+), and out to an HDFS/Queue]
    groupBy TweetID: (TweetID, Map[URL, Long])
    reduce: (x,y) => MapMonoid
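
    The reduce step here merges per-tweet maps by adding the counts of URLs that appear in both. A minimal sketch of that map merge, assuming `Long` tweet IDs and `String` URLs for illustration (the talk does not specify these types, and the function names are hypothetical):

    ```scala
    object MapMonoidSketch {
      // Merge two maps by adding the values of keys that appear in both.
      def plus(x: Map[String, Long], y: Map[String, Long]): Map[String, Long] =
        y.foldLeft(x) { case (acc, (url, n)) =>
          acc.updated(url, acc.getOrElse(url, 0L) + n)
        }

      // Group (tweetId, perUrlCounts) pairs by tweet ID, then reduce each group.
      def sumByTweet(pairs: Seq[(Long, Map[String, Long])]): Map[Long, Map[String, Long]] =
        pairs.groupBy(_._1).map { case (tweetId, group) =>
          tweetId -> group.map(_._2).foldLeft(Map.empty[String, Long])(plus)
        }

      def main(args: Array[String]): Unit = {
        val pairs = Seq(
          (1L, Map("a.com" -> 1L)),
          (1L, Map("a.com" -> 2L, "b.com" -> 1L)),
          (2L, Map("b.com" -> 4L))
        )
        println(sumByTweet(pairs))
      }
    }
    ```

    An exact map like this grows with the number of distinct URLs per tweet, which is the cost an approximate structure such as a Count-Min Sketch is meant to bound.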
  55. 108.

    Brief Explanation
    This job creates two types of keys:
    1: ((TweetId, TimeBucket) => CMS[URL, Impressions])
    2: TimeBucket => CMS[TweetId, Impressions]
  58. 112.

    Future Plans
    - Akka, Spark, Tez Platforms
    - More Monoids
    - Pluggable graph optimizations
    - Auto-tuning Realtime Topologies
  61. 117.

    TAKEAWAYS
    • Scale - Fake it ‘til you Make It
    • Structured Logging
    • Include timestamps EVERYWHERE
    • Record your Schemas