Slide 1

Slide 1 text

@summingbird Sunday, September 22, 13

Slide 2

Slide 2 text


Slide 3

Slide 3 text

- Oscar Boykin - @posco
- Sam Ritchie - @sritchie
- Ashu Singhal - @daashu

Slide 4

Slide 4 text

- What is Summingbird?
- What can it do today?
- Batch / Realtime Hybrids
- Currently deployed systems
- Upcoming Features

Slide 5

Slide 5 text

Vision

Slide 6

Slide 6 text

Write your logic once.

Slide 7

Slide 7 text

Twitter’s Scale
- 200M+ Active Monthly Users
- 500M Tweets / Day
- Several 1K+ node Hadoop clusters

Slide 8

Slide 8 text

Solve systems problems once.

Slide 9

Slide 9 text

Make non-trivial realtime compute as accessible as Scalding.

Slide 10

Slide 10 text

What is Summingbird?
- Declarative Streaming Map/Reduce DSL
- Realtime platform that runs on Storm.
- Batch platform that runs on Hadoop.
- Batch / Realtime Hybrid platform

Slide 11

Slide 11 text

val impressionCounts =
  impressionHose.flatMap(extractCounts(_))

val engagementCounts =
  engagementHose.filter(_.isValid)
    .flatMap(engagementCounts(_))

val totalCounts =
  (impressionCounts ++ engagementCounts)
    .flatMap(fanoutByTime(_))
    .sumByKey(onlineStore)

val stormTopology = Storm.remote("stormName").plan(totalCounts)
val hadoopJob = Scalding("scaldingName").plan(totalCounts)

Slide 12

Slide 12 text

Map/Reduce
[Diagram: Event Stream 1 and Event Stream 2 feed FlatMappers (f1, f2); their output goes to Reducers (+), which write to Storage (Memcache / ElephantDB).]

Slide 13

Slide 13 text

FlatMap
flatMap: T => TraversableOnce[U]

// g: T => U
map(g) = flatMap(x => List(g(x)))

// pred: T => Boolean
filter(pred) = flatMap { x => if (pred(x)) List(x) else Nil }
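The two identities above can be checked on ordinary Scala collections. This is a minimal sketch that uses `List` in place of a Summingbird producer; the names `FlatMapDerived`, `mapVia` and `filterVia` are made up for illustration and are not part of any library.

```scala
object FlatMapDerived {
  // map expressed through flatMap: wrap each result in a one-element list.
  def mapVia[T, U](xs: List[T])(g: T => U): List[U] =
    xs.flatMap(x => List(g(x)))

  // filter expressed through flatMap: emit the element, or nothing at all.
  def filterVia[T](xs: List[T])(pred: T => Boolean): List[T] =
    xs.flatMap(x => if (pred(x)) List(x) else Nil)

  def main(args: Array[String]): Unit = {
    val xs = List(1, 2, 3, 4)
    println(mapVia(xs)(_ * 10))        // same result as xs.map(_ * 10)
    println(filterVia(xs)(_ % 2 == 0)) // same result as xs.filter(_ % 2 == 0)
  }
}
```

Because both operations reduce to `flatMap`, a platform only has to know how to run one kind of mapper stage.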

Slide 14

Slide 14 text

- Source[+T]
- Store[-K, V]
- Sink[-T]
- Service[-K, +V]

Slide 15

Slide 15 text

- Source[+T]
- Store[-K, V]
- Sink[-T]
- Service[-K, +V]
The Four Ss!

Slide 16

Slide 16 text

Store[-K, V]: What values are allowed?

Slide 17

Slide 17 text

trait Monoid[V] {
  def zero: V
  def plus(l: V, r: V): V
}
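To make the trait concrete, here is a small sketch with two instances and a generic `sum`. The trait is repeated so the block is self-contained; the object name `MonoidSketch`, the instance names and the `sum` helper are assumptions for illustration, not the definitions Summingbird itself ships.

```scala
object MonoidSketch {
  trait Monoid[V] {
    def zero: V
    def plus(l: V, r: V): V
  }

  // Long addition: zero is 0, plus is +.
  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Maps form a monoid whenever their values do: merge key by key.
  implicit def mapMonoid[K, V](implicit vm: Monoid[V]): Monoid[Map[K, V]] =
    new Monoid[Map[K, V]] {
      def zero: Map[K, V] = Map.empty
      def plus(l: Map[K, V], r: Map[K, V]): Map[K, V] =
        r.foldLeft(l) { case (acc, (k, v)) =>
          acc.updated(k, vm.plus(acc.getOrElse(k, vm.zero), v))
        }
    }

  // Fold any collection down to one value using whichever monoid is in scope.
  def sum[V](vs: Iterable[V])(implicit m: Monoid[V]): V =
    vs.foldLeft(m.zero)(m.plus)

  def main(args: Array[String]): Unit = {
    println(sum(List(1L, 2L, 3L)))
    println(sum(List(Map("a" -> 1L), Map("a" -> 2L, "b" -> 5L))))
  }
}
```

Any value type with such an instance can be summed the same way, which is why the allowed values of a store are described in terms of a monoid.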

Slide 18

Slide 18 text

• Tons O’Monoids:
• CMS, HyperLogLog, ExponentialMA, BloomFilter, Moments, MinHash, TopK

Slide 19

Slide 19 text


Slide 20

Slide 20 text


Slide 21

Slide 21 text

Associativity

Slide 22

Slide 22 text

;; 7 steps
a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7

Slide 23

Slide 23 text

;; 7 steps
(+ a0 a1 a2 a3 a4 a5 a6 a7)

Slide 24

Slide 24 text

;; 5 steps
(+ (+ a0 a1)
   (+ a2 a3)
   (+ a4 a5)
   (+ a6 a7))

Slide 25

Slide 25 text

;; 3 steps
(+ (+ (+ a0 a1) (+ a2 a3))
   (+ (+ a4 a5) (+ a6 a7)))

Slide 26

Slide 26 text

Parallelism
Associativity
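The step counts on the preceding slides can be reproduced with a short sketch. It compares a sequential left-to-right sum with a balanced-tree sum over the same values; because `+` is associative, both give the same answer, but the tree only needs about log2(n) dependent rounds. The object and function names here are illustrative assumptions.

```scala
object TreeSum {
  // Sequential reduction: each step depends on the previous one (n - 1 steps).
  def sequential(xs: Vector[Long]): Long = {
    require(xs.nonEmpty, "need at least one element")
    xs.reduceLeft(_ + _)
  }

  // Balanced tree: split in half, sum each half, then combine.
  // The two halves are independent, so a real system could compute them in parallel.
  def tree(xs: Vector[Long]): Long = {
    require(xs.nonEmpty, "need at least one element")
    if (xs.length == 1) xs.head
    else {
      val (left, right) = xs.splitAt(xs.length / 2)
      tree(left) + tree(right)
    }
  }

  // Number of dependent rounds in the balanced tree for n inputs.
  def depth(n: Int): Int = if (n <= 1) 0 else 1 + depth((n + 1) / 2)
}
```

For eight inputs, `TreeSum.depth(8)` is 3, matching the fully nested form on the slide, while the sequential version needs 7 additions in a row.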

Slide 27

Slide 27 text

Batch / Realtime
[Diagram: BatchID 0, 1, 2, 3. Fault tolerant row: Log and Hadoop for each batch. Noisy row: RT for each batch.]
Realtime sums from 0, each batch

Slide 28

Slide 28 text

Batch / Realtime
[Diagram: BatchID 0, 1, 2, 3. Fault tolerant row: Log and Hadoop for each batch. Noisy row: RT for each batch.]
Hadoop keeps a total sum (reliably)

Slide 29

Slide 29 text

Batch / Realtime
[Diagram: BatchID 0, 1, 2, 3. Fault tolerant row: Log and Hadoop for each batch. Noisy row: RT for each batch.]
Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size
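The read path described here can be sketched in a few lines. This is a simplified model under stated assumptions, not Summingbird's actual client code: the two stores are plain in-memory maps, values are `Long` counts, and the names `Stores`, `batchTotals`, `realtimeDeltas` and `read` are hypothetical.

```scala
object HybridMerge {
  // Hypothetical in-memory stand-ins for the two stores.
  // batchTotals(b): reliable running total through the end of batch b (Hadoop side).
  // realtimeDeltas(b): noisy sum for batch b only, restarted from 0 each batch (realtime side).
  final case class Stores(batchTotals: Map[Int, Long], realtimeDeltas: Map[Int, Long])

  // Value served while batch i is still in flight:
  // Hadoop total through batch i - 1, plus the realtime sum for batch i.
  def read(stores: Stores, i: Int): Long = {
    val reliable = stores.batchTotals.getOrElse(i - 1, 0L)
    val noisy = stores.realtimeDeltas.getOrElse(i, 0L)
    reliable + noisy
  }
}
```

Only one batch's worth of realtime data ever contributes to an answer, which is where the bounded noise and bounded read size come from in this model.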

Slide 30

Slide 30 text

Tweet Embed Counts

Slide 31

Slide 31 text


Slide 32

Slide 32 text


Slide 33

Slide 33 text

[Diagram: Tweets from HDFS/Queue feed (Flat)Mappers (f), which emit (TweetID, Map[URL, Long]); groupBy TweetID sends them to Reducers (+), which write to HDFS/Queue.]
reduce: (x,y) => MapMonoid
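A rough sketch of this pipeline on plain Scala collections, assuming a simplified event shape: each mapper output is a `(TweetID, Map[URL, Long])` pair, and the reducer merges maps by adding counts per URL. The `Embed` case class and the function names are hypothetical stand-ins, not the production types.

```scala
object EmbedCounts {
  // Merge two per-URL count maps by adding the counts of matching URLs.
  def plus(x: Map[String, Long], y: Map[String, Long]): Map[String, Long] =
    y.foldLeft(x) { case (acc, (url, n)) =>
      acc.updated(url, acc.getOrElse(url, 0L) + n)
    }

  // Hypothetical event shape: one embed impression of a tweet on a page.
  final case class Embed(tweetId: Long, url: String)

  // Mapper: emit (TweetID, Map(url -> 1)).
  // Reducer: group by TweetID and merge the maps with plus.
  def countByTweet(events: Seq[Embed]): Map[Long, Map[String, Long]] =
    events
      .map(e => (e.tweetId, Map(e.url -> 1L)))
      .groupBy(_._1)
      .map { case (id, pairs) =>
        id -> pairs.map(_._2).foldLeft(Map.empty[String, Long])(plus)
      }
}
```

Since merging maps this way is associative, the reducers are free to combine partial maps in any grouping.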

Slide 34

Slide 34 text

object OuroborosJob {
  // The store parameter is the one passed to sumByKey below.
  def apply[P <: Platform[P]](source: Producer[P, ClientEvent],
                              store: P#Store[OuroborosKey, OuroborosValue]) =
    source.filter(filterEvents(_))
      .flatMap { event =>
        val widgetDetails = event.getWidget_details
        val referUrl: String = widgetDetails.getWidget_origin
        val timestamp: Long = event.getLog_base.getTimestamp
        val widgetFrameUrlOpt: Option[String] = Option(widgetDetails.getWidget_frame)
        for {
          tweetId: java.lang.Long <- javaToScalaSafe(event.getEvent_details.getItem_ids)
          timeBucketOption: Option[TimeBucket] <- timeBucketsForTimestamp(timestamp)
        } yield {
          val urlHllOption = canonicalUrl(referUrl).map(hllMonoid.create(_))
          val widgetFrameUrlsOption = widgetFrameUrlOpt map { widgetUrl: String =>
            widgetFrameUrlsSmMonoid.create((referUrl, (widgetFrameUrlSetSmMonoid.create((widgetUrl, 1L)), 1L)))
          }
          val impressionsValue: OuroborosValue = RawImpressions(
            impressions = 1L,
            approxUniqueUrls = urlHllOption,
            urlCounts = Some(embedCountSmMonoid.create((referUrl, 1L))),
            urlDates = Some(embedDateSmMonoid.create((referUrl, timestamp))),
            frameUrls = widgetFrameUrlsOption
          ).as[OuroborosValue]
          // topTweetsValue is not defined in the slide excerpt.
          Seq(
            (OuroborosKey.ImpressionsKey(ImpressionsKey(tweetId.longValue, timeBucketOption)), impressionsValue),
            (OuroborosKey.TopTweetsKey(TopTweetsKey(timeBucketOption)), topTweetsValue)
          )
        }
      }.sumByKey(store)
      .set(MonoidIsCommutative(true))
}

Slide 35

Slide 35 text

[Same OuroborosJob code as on Slide 34, annotated with:]
- Filter Events

Slide 36

Slide 36 text

[Same OuroborosJob code as on Slide 34, annotated with:]
- Filter Events
- Generate KV Pairs

Slide 37

Slide 37 text

[Same OuroborosJob code as on Slide 34, annotated with:]
- Filter Events
- Generate KV Pairs
- Sum into Store

Slide 38

Slide 38 text

Brief Explanation
This job creates two types of keys:
1: ((TweetId, TimeBucket) => [URL, Impressions])
2: TimeBucket => Map[TweetId, Impressions]

Slide 39

Slide 39 text

What Else?

Slide 40

Slide 40 text


Slide 41

Slide 41 text

What’s Next?

Slide 42

Slide 42 text

Future Plans
- Akka, Spark, Tez Platforms
- Pluggable graph optimizations
- Metadata publishing via HCatalog
- More tutorials!

Slide 43

Slide 43 text

Open Source!

Slide 44

Slide 44 text

Summary
• Summingbird is appropriate for the majority of the real-time apps we have.
• It’s all about the Monoid.
• Data scientists who are not familiar with systems can deploy realtime systems.
• Systems engineers can reuse 90% of the code (batch/realtime merging).

Slide 45

Slide 45 text

Follow me at @sritchie
Thank You!