Slide 1

THE ROAD TO SUMMINGBIRD: Stream Processing at (Every) Scale
Sam Ritchie :: @sritchie :: Data Day Texas 2014

Slide 2

@summingbird

Slide 3

https://github.com/twitter/summingbird

Slide 4

AGENDA

Slide 5

AGENDA
• Logging and Monitoring in the Small

Slide 6

AGENDA
• Logging and Monitoring in the Small
• Scaling toward Summingbird - Tooling Overview

Slide 7

AGENDA
• Logging and Monitoring in the Small
• Scaling toward Summingbird - Tooling Overview
• What breaks at full scale?

Slide 8

AGENDA
• Logging and Monitoring in the Small
• Scaling toward Summingbird - Tooling Overview
• What breaks at full scale?
• Summingbird’s Constraints, and how they can help

Slide 9

AGENDA
• Logging and Monitoring in the Small
• Scaling toward Summingbird - Tooling Overview
• What breaks at full scale?
• Summingbird’s Constraints, and how they can help
• Lessons Learned

Slide 10

WHAT TO MONITOR?

Slide 11

WHAT TO MONITOR?
• Application “Events”

Slide 12

WHAT TO MONITOR?
• Application “Events”
• Alert on certain events or patterns

Slide 13

WHAT TO MONITOR?
• Application “Events”
• Alert on certain events or patterns
• Extract metrics from the event stream

Slide 14

WHAT TO MONITOR?
• Application “Events”
• Alert on certain events or patterns
• Extract metrics from the event stream
• Dashboards?

Slide 15

No content

Slide 16

No content

Slide 17

No content

Slide 18

PREPPING FOR SCALE

Slide 19

PREPPING FOR SCALE

Slide 20

PREPPING FOR SCALE

Slide 21

No content

Slide 22

LOG STATEMENTS

(defn create-user! [username]
  (log/info "User Created: " username)
  (db/create {:type :user
              :name username
              :timestamp (System/currentTimeMillis)}))

Slide 23

No content

Slide 24

[diagram: Your App → Heroku Logs]

Slide 25

CENTRALIZED LOGGING

Slide 26

No content

Slide 27

No content

Slide 28

[diagram: Your App → S3]

Slide 29

WHAT DO YOU GET?

Slide 30

WHAT DO YOU GET?
• Ability to REACT to system events

Slide 31

WHAT DO YOU GET?
• Ability to REACT to system events
• Long-term storage via S3

Slide 32

WHAT DO YOU GET?
• Ability to REACT to system events
• Long-term storage via S3
• Searchable Logs

Slide 33

WHAT’S MISSING?

Slide 34

WHAT’S MISSING?
• How many users per day?

Slide 35

WHAT’S MISSING?
• How many users per day?
• How many times did this exception show up vs. that one?

Slide 36

WHAT’S MISSING?
• How many users per day?
• How many times did this exception show up vs. that one?
• Was this the first time I’ve seen that error?

Slide 37

WHAT’S MISSING?
• How many users per day?
• How many times did this exception show up vs. that one?
• Was this the first time I’ve seen that error?
• Pattern Analysis requires Aggregations

Slide 38

No content

Slide 39

STRUCTURED LOGGING

Slide 40

IMPOSE STRUCTURE

;; Before: a free-form text message
(log/info "User Created: " username)

;; After: a structured event map
(log/info {:event "user_creation"
           :name "sritchie"
           :timestamp (now)
           :request-id request-id})

Slide 41

No content

Slide 42

No content

Slide 43

EVENT PROCESSING

Slide 44

No content

Slide 45

No content

Slide 46

No content

Slide 47

No content

Slide 48

[diagram: Your App, GitHub, Mandrill → S3]

Slide 49

EVENT PROCESSORS

Slide 50

EVENT PROCESSORS
• FluentD (http://fluentd.org/)

Slide 51

EVENT PROCESSORS
• FluentD (http://fluentd.org/)
• Riemann (http://riemann.io/)

Slide 52

EVENT PROCESSORS
• FluentD (http://fluentd.org/)
• Riemann (http://riemann.io/)
• Splunk (http://www.splunk.com/)

Slide 53

EVENT PROCESSORS
• FluentD (http://fluentd.org/)
• Riemann (http://riemann.io/)
• Splunk (http://www.splunk.com/)
• Simmer (https://github.com/avibryant/simmer)

Slide 54

EVENT PROCESSORS
• FluentD (http://fluentd.org/)
• Riemann (http://riemann.io/)
• Splunk (http://www.splunk.com/)
• Simmer (https://github.com/avibryant/simmer)
• StatsD + CollectD (https://github.com/etsy/statsd/)

Slide 55

EVENT PROCESSORS
• FluentD (http://fluentd.org/)
• Riemann (http://riemann.io/)
• Splunk (http://www.splunk.com/)
• Simmer (https://github.com/avibryant/simmer)
• StatsD + CollectD (https://github.com/etsy/statsd/)
• Esper (http://esper.codehaus.org/)

Slide 56

No content

Slide 57

What Breaks at Scale?

Slide 58

SERIALIZATION

Slide 59

SERIALIZATION
• Thrift (http://thrift.apache.org/)

Slide 60

SERIALIZATION
• Thrift (http://thrift.apache.org/)
• Protocol Buffers (https://code.google.com/p/protobuf/)

Slide 61

SERIALIZATION
• Thrift (http://thrift.apache.org/)
• Protocol Buffers (https://code.google.com/p/protobuf/)
• Avro (http://avro.apache.org/)

Slide 62

SERIALIZATION
• Thrift (http://thrift.apache.org/)
• Protocol Buffers (https://code.google.com/p/protobuf/)
• Avro (http://avro.apache.org/)
• Kryo (https://github.com/EsotericSoftware/kryo)

Slide 63

LOG COLLECTION

Slide 64

LOG COLLECTION
• Kafka (https://kafka.apache.org/)

Slide 65

LOG COLLECTION
• Kafka (https://kafka.apache.org/)
• LogStash (http://logstash.net/)

Slide 66

LOG COLLECTION
• Kafka (https://kafka.apache.org/)
• LogStash (http://logstash.net/)
• Flume (http://flume.apache.org/)

Slide 67

LOG COLLECTION
• Kafka (https://kafka.apache.org/)
• LogStash (http://logstash.net/)
• Flume (http://flume.apache.org/)
• Kinesis (http://aws.amazon.com/kinesis/)

Slide 68

LOG COLLECTION
• Kafka (https://kafka.apache.org/)
• LogStash (http://logstash.net/)
• Flume (http://flume.apache.org/)
• Kinesis (http://aws.amazon.com/kinesis/)
• Scribe (https://github.com/facebook/scribe)

Slide 69

EVENT PROCESSING

Slide 70

@summingbird

Slide 71

What is Summingbird?
- Declarative Streaming Map/Reduce DSL
- Realtime platform that runs on Storm.
- Batch platform that runs on Hadoop.
- Batch / Realtime Hybrid platform

Slide 72

No content

Slide 73

val impressionCounts =
  impressionHose.flatMap(extractCounts(_))

val engagementCounts =
  engagementHose
    .filter(_.isValid)
    .flatMap(engagementCounts(_))

val totalCounts =
  (impressionCounts ++ engagementCounts)
    .flatMap(fanoutByTime(_))
    .sumByKey(onlineStore)

val stormTopology = Storm.remote("stormName").plan(totalCounts)
val hadoopJob = Scalding("scaldingName").plan(totalCounts)

Slide 74

MAP/REDUCE
[diagram: Event Stream 1 and Event Stream 2 → FlatMappers (f1, f2) → Reducers (+) → Storage (Memcache / ElephantDB)]

Slide 75

- Source[+T]
- Service[-K, +V]
- Store[-K, V]
- Sink[-T]

Slide 76

The Four Ss!
- Source[+T]
- Service[-K, +V]
- Store[-K, V]
- Sink[-T]
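Roughly: a Source produces events of type T, a Service is a keyed lookup joined against the stream, a Store holds the aggregated value per key, and a Sink writes events out as a side effect. As a rough Scala illustration of those shapes (an assumption for exposition; Summingbird's actual traits are richer and platform-specific):

trait Source[+T]            // produces events of type T
trait Service[-K, +V] {     // keyed lookup joined against the stream
  def lookup(k: K): Option[V]
}
trait Store[-K, V] {        // aggregated value per key
  def merge(kv: (K, V)): Unit
}
trait Sink[-T] {            // side-effecting write of each event
  def write(t: T): Unit
}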

Slide 77

Store[-K, V]: What values are allowed?

Slide 78

trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}
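Integer addition is the simplest instance: zero is the identity and plus is associative. A minimal hand-rolled example (Algebird ships these instances; this is only for illustration):

val intAddition: Monoid[Int] = new Monoid[Int] {
  def zero: Int = 0
  def plus(l: Int, r: Int): Int = l + r
}

// The monoid laws that make parallel aggregation safe:
assert(intAddition.plus(intAddition.plus(1, 2), 3) ==
       intAddition.plus(1, intAddition.plus(2, 3)))   // associativity
assert(intAddition.plus(intAddition.zero, 42) == 42)  // identity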

Slide 79

Tons O’Monoids: CMS, HyperLogLog, ExponentialMA, BloomFilter, Moments, MinHash, TopK
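HyperLogLog, for example, makes approximate distinct-counting a monoid, so unique-user sketches can be merged across machines in any order. A hedged sketch assuming Algebird's HyperLogLogMonoid API (method names vary across versions):

import com.twitter.algebird.HyperLogLogMonoid

val hll = new HyperLogLogMonoid(bits = 12) // more bits => lower error

val users = Seq("alice", "bob", "alice")
val sketch = users
  .map(u => hll.create(u.getBytes("UTF-8"))) // one-element sketch per user
  .reduce(hll.plus)                          // monoid merge: any order works

println(sketch.estimatedSize) // approximately 2.0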

Slide 80

No content

Slide 81

No content

Slide 82

Algebird at Scale

Slide 83

No content

Slide 84

MONOID COMPOSITION

Slide 85

// Views per URL Tweeted
(URL, Int)

Slide 86

// Views per URL Tweeted
(URL, Int)

// Unique Users per URL Tweeted
(URL, Set[UserID])

Slide 87

// Views per URL Tweeted
(URL, Int)

// Unique Users per URL Tweeted
(URL, Set[UserID])

// Views AND Unique Users per URL
(URL, (Int, Set[UserID]))

Slide 88

// Views per URL Tweeted
(URL, Int)

// Unique Users per URL Tweeted
(URL, Set[UserID])

// Views AND Unique Users per URL
(URL, (Int, Set[UserID]))

// Views, Unique Users + Top-K Users
(URL, (Int, Set[UserID], TopK[(User, Count)]))
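This works because a tuple of monoids is itself a monoid: plus applies component-wise (Int addition, Set union, and so on), so one aggregation can carry all of these statistics at once. A small sketch assuming Algebird's derived tuple monoids:

import com.twitter.algebird.Monoid

val a: (Int, Set[String]) = (3, Set("alice"))
val b: (Int, Set[String]) = (2, Set("bob", "alice"))

// Component-wise merge: counts add, sets union.
val merged = Monoid.plus(a, b) // (5, Set("alice", "bob"))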

Slide 89

ASSOCIATIVITY

Slide 90

;; 7 steps
a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7

Slide 91

;; 7 steps
(+ a0 a1 a2 a3 a4 a5 a6 a7)

Slide 92

;; 5 steps
(+ (+ a0 a1) (+ a2 a3) (+ a4 a5) (+ a6 a7))

Slide 93

;; 3 steps
(+ (+ (+ a0 a1) (+ a2 a3)) (+ (+ a4 a5) (+ a6 a7)))

Slide 94

ASSOCIATIVITY ⇒ PARALLELISM
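Because plus is associative, the runtime is free to regroup the sum: chunks can be reduced on separate workers and the partial results combined, without changing the answer. A minimal sketch using only the Scala standard library:

val xs: Seq[Long] = 0L until 8L

// Left-to-right: a chain of 7 dependent steps.
val sequential = xs.reduce(_ + _)

// Regrouped: each chunk can be summed independently (in parallel),
// then the partial sums are combined.
val regrouped = xs.grouped(2).map(_.sum).sum

assert(sequential == regrouped) // any grouping gives the same answer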

Slide 95

BATCH / REALTIME
[diagram: BatchID 0 1 2 3; each batch’s Log feeds both a Hadoop job and an RT (realtime) job]
Noisy: realtime sums from 0, each batch

Slide 96

BATCH / REALTIME
[diagram: BatchID 0 1 2 3; each batch’s Log feeds both a Hadoop job and an RT (realtime) job]
Fault tolerant: Hadoop keeps a total sum (reliably)

Slide 97

BATCH / REALTIME
[diagram: BatchID 0 1 2 3; each batch’s Log feeds both a Hadoop job and an RT (realtime) job]
Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size
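Concretely, a client read merges the reliable Hadoop sum through batch i-1 with the noisy realtime sum for the current batch. A hypothetical sketch of that merge (the Map-based stores here are illustrative, not Summingbird's client API):

// batch: Hadoop sum through BatchID i-1 (reliable, complete)
// realtime: the current batch's sum only (fresh, possibly noisy)
def mergedRead(batch: Map[String, Long],
               realtime: Map[String, Long],
               key: String): Long =
  batch.getOrElse(key, 0L) + realtime.getOrElse(key, 0L)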

Slide 98

[diagram: Your App, GitHub, and Mandrill events → S3; ElephantDB and Memcached serve Clients]

Slide 99

TWEET EMBED COUNTS

Slide 100

No content

Slide 101

No content

Slide 102

Approximate Maps
- We would probably be okay if for each Key we could get an approximate Value.
- We might not need to enumerate all resulting keys; perhaps only keys with large values would do.

Slide 103

[diagram: a counter table of width w and depth d]

Slide 104

[w × d counter table] To write: for (K, V), add V to bucket (i, h_i(K)) in each row i.

Slide 105

[w × d counter table] To read: for each h_i(K), take the min.

Slide 106

Count-Min Sketch is an Approximate Map
- Each K is hashed to d values in [0, w-1]; sum V into those buckets.
- Result is the min of all d buckets.
- Result is an upper bound on the true value.
- With prob > (1 - delta), error is at most eps * Total Count.
- w = 1 / eps, d = log(1/delta)
- Total cost in memory: O(w * d)
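To make those mechanics concrete, here is a from-scratch sketch of the write and read paths described above (illustrative only; Algebird's CMS implementation is the production-grade version):

import scala.util.hashing.MurmurHash3

class CountMinSketch(w: Int, d: Int) {
  // d rows of w counters; row i hashes with seed i.
  private val table = Array.ofDim[Long](d, w)

  private def bucket(key: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(key, row)
    ((h % w) + w) % w // non-negative bucket index
  }

  // Write: add value to bucket (i, h_i(K)) in every row.
  def add(key: String, value: Long): Unit =
    for (i <- 0 until d) table(i)(bucket(key, i)) += value

  // Read: min over all rows, an upper bound on the true count.
  def estimate(key: String): Long =
    (0 until d).map(i => table(i)(bucket(key, i))).min
}

val cms = new CountMinSketch(w = 100, d = 5)
cms.add("https://example.com/a", 3L)
cms.add("https://example.com/a", 2L)
assert(cms.estimate("https://example.com/a") >= 5L) // usually exactly 5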

Slide 107

[diagram: Tweets → (Flat)Mappers (f) → groupBy TweetID → Reducers (+); reduce: (x, y) => MapMonoid; HDFS/Queue on both ends; output (TweetID, Map[URL, Long])]

Slide 108

Brief Explanation
This job creates two types of keys:
1. (TweetId, TimeBucket) => CMS[URL, Impressions]
2. TimeBucket => CMS[TweetId, Impressions]
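A hypothetical sketch of that fan-out, with plain Maps standing in for the CMS values so the example stays dependency-free (Impression, TimeBucket, and hourBucket are illustrative names, not the production job's):

case class Impression(tweetId: Long, url: String, timestampMs: Long)

type TimeBucket = Long
def hourBucket(ts: Long): TimeBucket = ts / (60L * 60 * 1000)

// One impression fans out to both key types.
def fanout(i: Impression)
    : Seq[(Either[(Long, TimeBucket), TimeBucket], Map[String, Long])] = {
  val bucket = hourBucket(i.timestampMs)
  Seq(
    Left((i.tweetId, bucket)) -> Map(i.url -> 1L),  // 1: ((TweetId, TimeBucket), URL counts)
    Right(bucket) -> Map(i.tweetId.toString -> 1L)  // 2: (TimeBucket, TweetId counts)
  )
}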

Slide 109

WHAT ELSE?

Slide 110

No content

Slide 111

WHAT’S NEXT?

Slide 112

Future Plans
- Akka, Spark, Tez platforms
- More Monoids
- Pluggable graph optimizations
- Auto-tuning Realtime Topologies

Slide 113

TAKEAWAYS

Slide 114

TAKEAWAYS
• Scale - Fake it ‘til you Make It

Slide 115

TAKEAWAYS
• Scale - Fake it ‘til you Make It
• Structured Logging

Slide 116

TAKEAWAYS
• Scale - Fake it ‘til you Make It
• Structured Logging
• Include timestamps EVERYWHERE

Slide 117

TAKEAWAYS
• Scale - Fake it ‘til you Make It
• Structured Logging
• Include timestamps EVERYWHERE
• Record your Schemas

Slide 118

Questions?
Sam Ritchie :: @sritchie :: Data Day Texas 2014