
Summingbird at CUFP

Sam Ritchie
September 22, 2013


Twitter's Summingbird library allows developers and data scientists to build massive streaming MapReduce pipelines without worrying about the usual mess of systems issues that come with scale. This talk will discuss our experience applying functional programming ideas and techniques to the development of the Summingbird library, the power of clean, mathematical abstractions and the massive creative leverage that functional design constraints can give to a project.

(Based on slides stolen from https://twitter.com/posco :)


Transcript

  1. @summingbird Sunday, September 22, 13

  2.

  3. Oscar Boykin - @posco, Sam Ritchie - @sritchie, Ashu Singhal - @daashu
  4. - What is Summingbird? - What can it do today? - Batch / Realtime Hybrids - Currently deployed systems - Upcoming Features
  5. Vision

  6. Write your logic once.

  7. Twitter’s Scale: - 200M+ Active Monthly Users - 500M Tweets / Day - Several 1K+ node Hadoop clusters
  8. Solve systems problems once.

  9. Make non-trivial realtime compute as accessible as Scalding.
  10. What is Summingbird? - Declarative Streaming Map/Reduce DSL - Realtime platform that runs on Storm. - Batch platform that runs on Hadoop. - Batch / Realtime Hybrid platform
  11. val impressionCounts = impressionHose.flatMap(extractCounts(_))
      val engagementCounts = engagementHose.filter(_.isValid)
        .flatMap(engagementCounts(_))
      val totalCounts = (impressionCounts ++ engagementCounts)
        .flatMap(fanoutByTime(_))
        .sumByKey(onlineStore)
      val stormTopology = Storm.remote("stormName").plan(totalCounts)
      val hadoopJob = Scalding("scaldingName").plan(totalCounts)
  12. Map/Reduce (diagram): Event Stream 1 + Event Stream 2 → FlatMappers (f1, f2) → Reducers (+) → Storage (Memcache / ElephantDB)
  13. FlatMap
      flatMap: T => TraversableOnce[U]
      // g: T => U
      map(x) = flatMap(x => List(g(x)))
      // pred: T => Boolean
      filter(x) = flatMap { x => if (pred(x)) List(x) else Nil }
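Slide 13's identities can be checked directly on ordinary Scala collections (a standalone sketch, no Summingbird required):

```scala
// Slide 13's claim: map and filter are both special cases of flatMap.
def mapViaFlatMap[T, U](xs: List[T])(g: T => U): List[U] =
  xs.flatMap(x => List(g(x)))

def filterViaFlatMap[T](xs: List[T])(pred: T => Boolean): List[T] =
  xs.flatMap(x => if (pred(x)) List(x) else Nil)

val xs = List(1, 2, 3, 4)
assert(mapViaFlatMap(xs)(_ * 2) == xs.map(_ * 2))                 // List(2, 4, 6, 8)
assert(filterViaFlatMap(xs)(_ % 2 == 0) == xs.filter(_ % 2 == 0)) // List(2, 4)
```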
  14. - Source[+T] - Store[-K, V] - Sink[-T] - Service[-K, +V]
  15. The Four Ss! - Source[+T] - Store[-K, V] - Sink[-T] - Service[-K, +V]
  16. Store[-K, V]: What values are allowed?

  17. trait Monoid[V] {
        def zero: V
        def plus(l: V, r: V): V
      }
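A concrete instance of slide 17's trait, in the spirit of the values Summingbird sums into stores, is a pointwise-summed map of counts (a hypothetical standalone sketch; Twitter's real instances live in Algebird):

```scala
trait Monoid[V] {
  def zero: V
  def plus(l: V, r: V): V
}

// Hypothetical instance: (URL -> count) maps, merged by summing counts pointwise.
val countMapMonoid = new Monoid[Map[String, Long]] {
  def zero = Map.empty[String, Long]
  def plus(l: Map[String, Long], r: Map[String, Long]) =
    r.foldLeft(l) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }
}

val a = Map("x.com" -> 2L)
val b = Map("x.com" -> 3L, "y.com" -> 1L)
assert(countMapMonoid.plus(a, b) == Map("x.com" -> 5L, "y.com" -> 1L))
```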
  18. • Tons O’Monoids: CMS, HyperLogLog, ExponentialMA, BloomFilter, Moments, MinHash, TopK
  19.

  20.

  21. Associativity

  22. ;; 7 steps
      a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7
  23. ;; 7 steps
      (+ a0 a1 a2 a3 a4 a5 a6 a7)
  24. ;; 5 steps
      (+ (+ a0 a1) (+ a2 a3) (+ a4 a5) (+ a6 a7))
  25. ;; 3 steps
      (+ (+ (+ a0 a1) (+ a2 a3)) (+ (+ a4 a5) (+ a6 a7)))
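The regroupings above only work because plus is associative. That same property is what lets a balanced, tree-shaped reduction (depth log2(n) with enough parallelism) return exactly the same answer as a left-to-right sum; here's a minimal sketch of that idea, not Summingbird's actual scheduler:

```scala
// Balanced (tree-shaped) reduction: log2(n) levels of pairwise plus.
def treeSum(xs: Vector[Long]): Long =
  if (xs.length <= 1) xs.headOption.getOrElse(0L)
  else {
    val (l, r) = xs.splitAt(xs.length / 2)
    treeSum(l) + treeSum(r) // associativity makes the grouping irrelevant
  }

val as = Vector(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L)
assert(treeSum(as) == as.sum) // same total as the sequential left-to-right sum
```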
  26. Associativity ⇒ Parallelism

  27. Batch / Realtime (diagram): BatchID 0, 1, 2, 3; each batch feeds Log → Hadoop and Log → RT. Hadoop is fault tolerant; Realtime is noisy and sums from 0 each batch.
  28. Batch / Realtime (diagram, continued): Hadoop keeps a total sum (reliably).
  29. Batch / Realtime (diagram, continued): the sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise and bounded read/write size.
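Slide 29's read path amounts to a client-side merge: serve the reliable Hadoop total through batch i-1 plus the noisy realtime sum for batch i alone. A minimal sketch with hypothetical in-memory stores (this is the idea, not Summingbird's actual client API):

```scala
// hadoopTotals(i) = reliable running total through batch i (inclusive).
// rtByBatch(i)    = noisy realtime sum for batch i only (starts from 0 each batch).
def mergedTotal(hadoopTotals: Map[Int, Long],
                rtByBatch: Map[Int, Long],
                currentBatch: Int): Long =
  hadoopTotals.getOrElse(currentBatch - 1, 0L) +
    rtByBatch.getOrElse(currentBatch, 0L)

val hadoop = Map(0 -> 10L, 1 -> 25L) // batch 1's total already includes batch 0
val rt     = Map(2 -> 7L)            // realtime counts seen so far in batch 2
assert(mergedTotal(hadoop, rt, currentBatch = 2) == 32L) // 25 through batch 1, + 7 RT
```

Only one batch of realtime noise is ever in the served value, which is why the error stays bounded.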
  30. Tweet Embed Counts

  31.

  32.

  33. (diagram) Tweets → (Flat)Mappers (f) → Reducers (+) → HDFS/Queue; groupBy TweetID; key-value type: (TweetID, Map[URL, Long]); reduce: (x, y) => MapMonoid
  34. object OuroborosJob {
        def apply[P <: Platform[P]](source: Producer[P, ClientEvent],
                                    sink: P#Store[OuroborosKey, OuroborosValue]) =
          source.filter(filterEvents(_))
            .flatMap { event =>
              val widgetDetails = event.getWidget_details
              val referUrl: String = widgetDetails.getWidget_origin
              val timestamp: Long = event.getLog_base.getTimestamp
              val widgetFrameUrlOpt: Option[String] = Option(widgetDetails.getWidget_frame)
              for {
                tweetId: java.lang.Long <- javaToScalaSafe(event.getEvent_details.getItem_ids)
                timeBucketOption: Option[TimeBucket] <- timeBucketsForTimestamp(timestamp)
              } yield {
                val urlHllOption = canonicalUrl(referUrl).map(hllMonoid.create(_))
                val widgetFrameUrlsOption = widgetFrameUrlOpt map { widgetUrl: String =>
                  widgetFrameUrlsSmMonoid.create(
                    (referUrl, (widgetFrameUrlSetSmMonoid.create((widgetUrl, 1L)), 1L)))
                }
                val impressionsValue: OuroborosValue = RawImpressions(
                  impressions = 1L,
                  approxUniqueUrls = urlHllOption,
                  urlCounts = Some(embedCountSmMonoid.create((referUrl, 1L))),
                  urlDates = Some(embedDateSmMonoid.create((referUrl, timestamp))),
                  frameUrls = widgetFrameUrlsOption
                ).as[OuroborosValue]
                Seq(
                  (OuroborosKey.ImpressionsKey(ImpressionsKey(tweetId.longValue, timeBucketOption)), impressionsValue),
                  (OuroborosKey.TopTweetsKey(TopTweetsKey(timeBucketOption)), topTweetsValue)
                )
              }
            }.sumByKey(store)
            .set(MonoidIsCommutative(true))
      }
  35. (same code as slide 34, highlighting) Filter Events

  36. (same code, highlighting) Filter Events → Generate KV Pairs

  37. (same code, highlighting) Filter Events → Generate KV Pairs → Sum into Store
  38. Brief Explanation: this job creates two types of keys. 1: (TweetId, TimeBucket) => Map[URL, Impressions]; 2: TimeBucket => Map[TweetId, Impressions]
  39. What Else?

  40.

  41. What’s Next?

  42. Future Plans: - Akka, Spark, Tez Platforms - Pluggable graph optimizations - Metadata publishing via HCatalog - More tutorials!
  43. Open Source!

  44. Summary: • Summingbird is appropriate for the majority of the real-time apps we have. • It’s all about the Monoid. • Data scientists who are not familiar with systems can deploy realtime systems. • Systems engineers can reuse 90% of the code (batch/realtime merging).
  45. Follow me at @sritchie. Thank You!