Summingbird at CUFP

Sam Ritchie
September 22, 2013

Twitter's Summingbird library allows developers and data scientists to build massive streaming MapReduce pipelines without worrying about the usual mess of systems issues that come with scale. This talk will discuss our experience applying functional programming ideas and techniques to the development of the Summingbird library, the power of clean, mathematical abstractions and the massive creative leverage that functional design constraints can give to a project.

(Based on slides stolen from https://twitter.com/posco :)

Transcript

  1. @summingbird
    Sunday, September 22, 13

  2. [No transcribed text]

  3. Oscar Boykin - @posco
    Sam Ritchie - @sritchie
    Ashu Singhal - @daashu

  4. - What is Summingbird?
    - What can it do today?
    - Batch / Realtime Hybrids
    - Currently deployed systems
    - Upcoming Features

  5. Vision

  6. Write your logic once.

  7. Twitter’s Scale
    - 200M+ Active Monthly Users
    - 500M Tweets / Day
    - Several 1K+ node Hadoop clusters

  8. Solve systems problems once.

  9. Make non-trivial realtime compute as accessible as Scalding.

  10. What is Summingbird?
    - Declarative Streaming Map/Reduce DSL
    - Realtime platform that runs on Storm
    - Batch platform that runs on Hadoop
    - Batch / Realtime Hybrid platform

  11. val impressionCounts =
      impressionHose.flatMap(extractCounts(_))
    val engagementCounts =
      engagementHose.filter(_.isValid)
        .flatMap(engagementCounts(_))
    val totalCounts =
      (impressionCounts ++ engagementCounts)
        .flatMap(fanoutByTime(_))
        .sumByKey(onlineStore)
    val stormTopology =
      Storm.remote("stormName").plan(totalCounts)
    val hadoopJob =
      Scalding("scaldingName").plan(totalCounts)
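
The pipeline on slide 11 can be imitated on ordinary Scala collections to see what the `flatMap`, `++` and `sumByKey` steps compute. This is only a sketch under stated assumptions: it does not use Summingbird, the `Event` shape is hypothetical, the bodies of `extractCounts` and `fanoutByTime` are invented for illustration (only their names come from the slide), and engagements reuse `extractCounts` for simplicity.

```scala
object PipelineSketch {
  // Hypothetical event shape; the talk does not show the real one.
  final case class Event(item: String, hour: Int, isValid: Boolean)

  // Hypothetical body: one count per event, keyed by (item, hour).
  def extractCounts(e: Event): Seq[((String, Int), Long)] =
    Seq((e.item, e.hour) -> 1L)

  // Hypothetical body: emit each count under an hourly key and a daily key.
  def fanoutByTime(kv: ((String, Int), Long)): Seq[(String, Long)] = {
    val ((item, hour), n) = kv
    Seq(s"$item/hour=$hour" -> n, s"$item/day" -> n)
  }

  // In-memory stand-in for sumByKey: group by key, then add the values.
  def sumByKey(pairs: Seq[(String, Long)]): Map[String, Long] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  def totalCounts(impressions: Seq[Event], engagements: Seq[Event]): Map[String, Long] = {
    val impressionCounts = impressions.flatMap(extractCounts(_))
    val engagementCounts = engagements.filter(_.isValid).flatMap(extractCounts(_))
    sumByKey((impressionCounts ++ engagementCounts).flatMap(fanoutByTime(_)))
  }
}
```

The point of the slide is that the same `totalCounts` description is handed to two planners; here a single in-memory evaluation stands in for both.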

  12. Map/Reduce
    [Diagram: Event Stream 1 and Event Stream 2 feed FlatMappers (f1, f2), which feed Reducers (+), which write to Storage (Memcache / ElephantDB).]

  13. FlatMap
    flatMap: T => TraversableOnce[U]
    // g: T => U
    map(g) = flatMap(x => List(g(x)))
    // pred: T => Boolean
    filter(pred) = flatMap { x =>
      if (pred(x)) List(x) else Nil
    }
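
The two identities on slide 13 can be checked on plain Scala lists. This is a minimal sketch; `mapVia` and `filterVia` are names chosen here for illustration and are not part of Summingbird.

```scala
object FlatMapSketch {
  // map expressed through flatMap: wrap each result in a one-element List.
  def mapVia[T, U](xs: List[T])(g: T => U): List[U] =
    xs.flatMap(x => List(g(x)))

  // filter expressed through flatMap: emit the element itself or nothing.
  def filterVia[T](xs: List[T])(pred: T => Boolean): List[T] =
    xs.flatMap(x => if (pred(x)) List(x) else Nil)
}
```

Because both operations reduce to `flatMap`, a platform only has to know how to run flatMappers in order to support all three.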

  15. The Four Ss!
    - Source[+T]
    - Store[-K, V]
    - Sink[-T]
    - Service[-K, +V]

  16. Store[-K, V]:
    What values are allowed?

  17. trait Monoid[V] {
      def zero: V
      def plus(l: V, r: V): V
    }
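
To make the trait on slide 17 concrete, here is a small sketch with two instances over `Long` and a helper that folds a sequence with a given monoid. The instance names `longSum` and `longMax` and the helper `sumAll` are assumptions made for this example; production-grade instances live in libraries such as Algebird.

```scala
object MonoidSketch {
  trait Monoid[V] {
    def zero: V
    def plus(l: V, r: V): V
  }

  // Addition: the identity element is 0.
  val longSum: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Maximum: the identity element is the smallest Long.
  val longMax: Monoid[Long] = new Monoid[Long] {
    def zero: Long = Long.MinValue
    def plus(l: Long, r: Long): Long = math.max(l, r)
  }

  // Fold any sequence with the supplied monoid, starting from zero.
  def sumAll[V](values: Seq[V])(m: Monoid[V]): V =
    values.foldLeft(m.zero)(m.plus)
}
```

The same `sumAll` works for counts, maxima, or any other value type with a lawful `zero` and `plus`, which is why a store only needs a monoid on its values.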

  18. Tons O’Monoids:
    CMS, HyperLogLog, ExponentialMA, BloomFilter, Moments, MinHash, TopK

  19. [No transcribed text]

  20. [No transcribed text]

  21. Associativity

  22. ;; 7 steps
    a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7

  23. ;; 7 steps
    (+ a0 a1 a2 a3 a4 a5 a6 a7)

  24. ;; 5 steps
    (+ (+ a0 a1)
       (+ a2 a3)
       (+ a4 a5)
       (+ a6 a7))

  25. ;; 3 steps
    (+ (+ (+ a0 a1)
          (+ a2 a3))
       (+ (+ a4 a5)
          (+ a6 a7)))

  26. Parallelism
    Associativity
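
The regrouping shown on the preceding slides can be sketched as a pairwise tree reduction. The function below combines neighbours level by level and reports how many levels were needed; with eight inputs it takes 3 levels, where a left-to-right fold needs 7 sequential additions. The name `treeReduce` is chosen for this example, and the levels are evaluated sequentially here; the point is only that associativity makes each level's additions independent of one another.

```scala
object AssociativitySketch {
  // Combine neighbours pairwise, one level at a time.
  // Returns the result and the number of levels (potential parallel steps).
  @annotation.tailrec
  def treeReduce[A](xs: Vector[A], steps: Int = 0)(plus: (A, A) => A): (A, Int) = {
    require(xs.nonEmpty, "need at least one value")
    if (xs.size == 1) (xs.head, steps)
    else {
      val next = xs.grouped(2).map {
        case Seq(a, b) => plus(a, b) // a full pair
        case Seq(a)    => a          // odd element carried to the next level
      }.toVector
      treeReduce(next, steps + 1)(plus)
    }
  }
}
```

Only associativity is required: neighbours are combined in their original order, so the result matches a left-to-right fold even for non-commutative operations such as string concatenation.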

  27. Batch / Realtime
    [Diagram: BatchIDs 0 to 3 along a timeline. Fault-tolerant row: a Log feeding Hadoop for each batch. Noisy row: RT for each batch.]
    Realtime sums from 0, each batch.

  28. Batch / Realtime
    [Diagram: BatchIDs 0 to 3 along a timeline. Fault-tolerant row: a Log feeding Hadoop for each batch. Noisy row: RT for each batch.]
    Hadoop keeps a total sum (reliably).

  29. Batch / Realtime
    [Diagram: BatchIDs 0 to 3 along a timeline. Fault-tolerant row: a Log feeding Hadoop for each batch. Noisy row: RT for each batch.]
    Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size.
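
The merge rule on this slide can be sketched with two in-memory maps. This is a simplified illustration, not Summingbird's client code: `hadoopTotals`, `realtimeDeltas` and `lookup` are hypothetical names, values are plain `Long` counts, and a missing batch is treated as zero.

```scala
object HybridSketch {
  type BatchID = Int

  // hadoopTotals(b):   reliable running total through the end of batch b.
  // realtimeDeltas(b): sum of only the events seen during batch b, from zero.
  def lookup(
      hadoopTotals: Map[BatchID, Long],
      realtimeDeltas: Map[BatchID, Long],
      current: BatchID
  ): Long = {
    val reliable = hadoopTotals.getOrElse(current - 1, 0L) // Hadoop Batch(i-1)
    val noisy = realtimeDeltas.getOrElse(current, 0L)      // RT Batch(i)
    reliable + noisy
  }
}
```

Only the current batch's realtime value enters the answer, so any error in earlier realtime batches is replaced by the Hadoop total, and each lookup reads exactly two values.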

  30. Tweet Embed Counts

  31. [No transcribed text]

  32. [No transcribed text]

  33. [Diagram labels: Tweets, (Flat)Mappers (f), Reducers (+), HDFS/Queue (twice)]
    Pair type: (TweetID, Map[URL, Long])
    groupBy TweetID
    reduce: (x, y) => MapMonoid
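
The reduce step on this slide merges `Map[URL, Long]` values that share a TweetID. A minimal in-memory sketch of that merge follows; `plus` and `embedCounts` are names chosen for this example, and the real job relies on a map monoid from a library rather than hand-written code.

```scala
object EmbedCountSketch {
  type TweetID = Long
  type URL = String

  // Map "plus": add the counts of URLs that appear on both sides.
  def plus(x: Map[URL, Long], y: Map[URL, Long]): Map[URL, Long] =
    y.foldLeft(x) { case (acc, (url, n)) =>
      acc.updated(url, acc.getOrElse(url, 0L) + n)
    }

  // groupBy TweetID, then reduce each group's maps with plus.
  def embedCounts(pairs: Seq[(TweetID, Map[URL, Long])]): Map[TweetID, Map[URL, Long]] =
    pairs.groupBy(_._1).map { case (id, group) =>
      id -> group.map(_._2).foldLeft(Map.empty[URL, Long])(plus)
    }
}
```

Each mapper can emit a one-entry map per embed; the merge then yields per-tweet counts for every embedding URL.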

  34. object OuroborosJob {
      def apply[P <: Platform[P]](source: Producer[P, ClientEvent], store: P#Store[OuroborosKey, OuroborosValue]) =
        // Keep only the events this job cares about.
        source.filter(filterEvents(_))
          .flatMap { event =>
            val widgetDetails = event.getWidget_details
            val referUrl: String = widgetDetails.getWidget_origin
            val timestamp: Long = event.getLog_base.getTimestamp
            val widgetFrameUrlOpt: Option[String] = Option(widgetDetails.getWidget_frame)
            for {
              tweetId: java.lang.Long <- javaToScalaSafe(event.getEvent_details.getItem_ids)
              timeBucketOption: Option[TimeBucket] <- timeBucketsForTimestamp(timestamp)
            } yield {
              val urlHllOption = canonicalUrl(referUrl).map(hllMonoid.create(_))
              val widgetFrameUrlsOption = widgetFrameUrlOpt map { widgetUrl: String =>
                widgetFrameUrlsSmMonoid.create((referUrl, (widgetFrameUrlSetSmMonoid.create((widgetUrl, 1L)), 1L)))
              }
              val impressionsValue: OuroborosValue = RawImpressions(
                impressions = 1L,
                approxUniqueUrls = urlHllOption,
                urlCounts = Some(embedCountSmMonoid.create((referUrl, 1L))),
                urlDates = Some(embedDateSmMonoid.create((referUrl, timestamp))),
                frameUrls = widgetFrameUrlsOption
              ).as[OuroborosValue]
              // topTweetsValue is not defined on the slide.
              Seq(
                (OuroborosKey.ImpressionsKey(ImpressionsKey(tweetId.longValue, timeBucketOption)), impressionsValue),
                (OuroborosKey.TopTweetsKey(TopTweetsKey(timeBucketOption)), topTweetsValue)
              )
            }
          }.sumByKey(store) // merge values per key using the value monoid
          .set(MonoidIsCommutative(true))
    }

  35. Same code as slide 34, annotated:
    - Filter Events

  36. Same code as slide 34, annotated:
    - Filter Events
    - Generate KV Pairs

  37. Same code as slide 34, annotated:
    - Filter Events
    - Generate KV Pairs
    - Sum into Store

  38. Brief Explanation
    This job creates two types of keys:
    1: (TweetId, TimeBucket) => Map[URL, Impressions]
    2: TimeBucket => Map[TweetId, Impressions]

  39. What Else?

  40. [No transcribed text]

  41. What’s Next?

  42. Future Plans
    - Akka, Spark, Tez Platforms
    - Pluggable graph optimizations
    - Metadata publishing via HCatalog
    - More tutorials!

  43. Open Source!

  44. Summary
    • Summingbird is appropriate for the majority of the real-time apps we have.
    • It’s all about the Monoid.
    • Data scientists who are not familiar with systems can deploy realtime systems.
    • Systems engineers can reuse 90% of the code (batch/realtime merging).

  45. Thank You!
    Follow me at @sritchie