Slide 1

@summingbird

Slide 2



Slide 3

Oscar Boykin - @posco
Sam Ritchie - @sritchie
Ashu Singhal - @daashu

Slide 4

- What is Summingbird?
- What can it do today?
- Why you should use it to write Storm!
- Currently deployed systems
- Upcoming Features

Slide 5

Vision

Slide 6

Twitter's Scale

- 200M+ Active Monthly Users
- 500M Tweets / Day
- Several 1K+ node Hadoop clusters
- THE Realtime Company (ostensibly)

Slide 7

Write your logic once.

Slide 8

Solve systems problems once.

Slide 9

Make non-trivial realtime compute as accessible as Scalding.

Slide 10

What is Summingbird?

- Declarative Streaming Map/Reduce DSL
- Realtime platform that runs on Storm
- Batch platform that runs on Hadoop
- Batch / Realtime Hybrid platform

Slide 11

Why does Storm Care?

Slide 12

public class WordCountTopology {
  // Splits sentences into words by shelling out to a Python script.
  public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
      super("python", "");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
      return null;
    }
  }

  // Keeps a running count per word in local memory and emits (word, count).
  public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getString(0);
      Integer count = counts.get(word);
      if (count == null) count = 0;
      count++;
      counts.put(word, count);
      collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8)
           .shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12)
           .fieldsGrouping("split", new Fields("word"));

    Config conf = new Config();
    conf.setDebug(true);

    if (args != null && args.length > 0) {
      // Submit to a real cluster under the name given on the command line.
      conf.setNumWorkers(3);
      StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    } else {
      // Run in-process for ten seconds, then shut down.
      conf.setMaxTaskParallelism(3);
      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("word-count", conf, builder.createTopology());
      Thread.sleep(10000);
      cluster.shutdown();
    }
  }
}

Slide 13

def tokenize(text: String): TraversableOnce[String] =
  text.toLowerCase
    .replaceAll("[^a-zA-Z0-9\\s]", "")
    .split("\\s+")

def wordCount[P <: Platform[P]](
  source: Producer[P, Status],
  store: P#Store[String, Long]) =
  source
    .filter(_.getText != null)
    .flatMap { tweet: Status => tokenize(tweet.getText).map(_ -> 1L) }
    .sumByKey(store)
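To see what the job on this slide computes without any Storm or Hadoop dependency, here is a minimal sketch in plain Scala collections. The names `LocalWordCount` and `sumByKey` are hypothetical stand-ins introduced for illustration; they are not Summingbird's API, and the empty-token filter is an addition that the slide does not have.

```scala
object LocalWordCount {
  // Same tokenizer as on the slide, plus a filter that drops the empty
  // token produced when the text starts with whitespace (an addition).
  def tokenize(text: String): Seq[String] =
    text.toLowerCase
      .replaceAll("[^a-zA-Z0-9\\s]", "")
      .split("\\s+")
      .toSeq
      .filter(_.nonEmpty)

  // Hypothetical stand-in for sumByKey: adds the values of equal keys.
  def sumByKey[K](pairs: Seq[(K, Long)]): Map[K, Long] =
    pairs.foldLeft(Map.empty[K, Long]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }

  // filter -> flatMap -> sumByKey, mirroring the shape of the slide's job.
  def wordCount(texts: Seq[String]): Map[String, Long] =
    sumByKey(texts.filter(_ != null).flatMap(t => tokenize(t).map(_ -> 1L)))
}
```

The pipeline has the same three stages as the slide: drop null text, emit a (word, 1L) pair per token, and sum the pairs by key.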

Slide 14

Map/Reduce

[Diagram: Event Stream 1 and Event Stream 2 -> FlatMappers (f1, f2) -> Reducers (+) -> Storage (Memcache / ElephantDB)]

Slide 15

Functions, not Bolts!

flatMap: T => TraversableOnce[U]

// g: (x: T => U)
map(x) = flatMap(x => List(g(x)))

// pred: T => Boolean
filter(x) = flatMap { x => if (pred(x)) List(x) else Nil }
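The two identities above can be checked on an ordinary Scala List. This is only a sketch: `mapVia` and `filterVia` are hypothetical helper names, and List stands in for a stream of events.

```scala
object FlatMapOnly {
  // map expressed with flatMap: wrap each result in a one-element list.
  def mapVia[T, U](xs: List[T])(g: T => U): List[U] =
    xs.flatMap(x => List(g(x)))

  // filter expressed with flatMap: emit the element or nothing.
  def filterVia[T](xs: List[T])(pred: T => Boolean): List[T] =
    xs.flatMap(x => if (pred(x)) List(x) else Nil)
}
```

Because both operations reduce to flatMap, a platform only needs to know how to run one kind of function per event.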

Slide 16

- Source[+T]
- Store[-K, V]
- Sink[-T]
- Service[-K, +V]

Slide 17

The Four Ss!

- Source[+T]
- Store[-K, V]
- Sink[-T]
- Service[-K, +V]

Slide 18

The Storm Platform

Source[+T] = Spout[(Long, T)]
Store[-K, V] = StormStore[K, V]
Sink[-T] = () => (T => Future[Unit])
Service[-K, +V] = StormService[K, V]
Plan[T] = StormTopology

Slide 19

Source[+T] = Spout[(Long, T)]
Store[-K, V] = StormStore[K, V]

Slide 20

Source[+T] = Spout[(Long, T)]

Slide 21

Type Safety

Slide 22



Slide 23

trait Spout[+T] {
  def getSpout: IRichSpout
  def flatMap[U](fn: T => TraversableOnce[U]): Spout[U]
}

Slide 24

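val spout: Spout[String] = Spout.fromTraversable {
  List("call", "me", "ishmael")
}

// Right-hand side partly lost in the source; reconstructed from the name and type (assumption).
val evenCountSpout: Spout[Int] =
  spout.map(_.size).filter(_ % 2 == 0)

// Body lost in the source; assumed to split each word into one-character strings.
val characterSpout: Spout[String] =
  spout.flatMap { s => s.toSeq.map(_.toString) }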

Slide 25

// Body lost in the source; assumed to split each word into one-character strings.
val characterSpout: Spout[String] =
  spout.flatMap { s => s.toSeq.map(_.toString) }

val builder = new TopologyBuilder
builder.setSpout("1", characterSpout.getSpout, 1)
builder.setBolt("2", new TestGlobalCount())
  .globalGrouping("1")

val topo: StormTopology = builder.createTopology

Slide 26

Store[-K, V] = StormStore[K, V]

Slide 27



Slide 28

trait ReadableStore[-K, +V] extends Closeable {
  def get(k: K): Future[Option[V]]
  def multiGet[K1 <: K](ks: Set[K1]): Map[K1, Future[Option[V]]]
  override def close { }
}

trait Store[-K, V] extends ReadableStore[K, V] {
  def put(kv: (K, Option[V])): Future[Unit]
  def multiPut[K1 <: K](kvs: Map[K1, Option[V]]): Map[K1, Future[Unit]]
}

Slide 29

trait MergeableStore[-K, V] extends Store[K, V] {
  def monoid: Monoid[V]
  def merge(kv: (K, V)): Future[Unit]
  def multiMerge[K1 <: K](kvs: Map[K1, V]): Map[K1, Future[Unit]]
}
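To make the get / put / merge contract concrete, here is a minimal in-memory sketch. It is not Storehaus code: `MemoryStore` is a hypothetical class, the Monoid trait is a local copy of the shape shown in the slides, and `scala.concurrent.Future` is assumed here in place of whatever Future type the real traits use.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.Future

object InMemoryMerge {
  // Local copy of the Monoid shape from the slides.
  trait Monoid[V] { def zero: V; def plus(l: V, r: V): V }

  val longMonoid: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Hypothetical in-memory store; all futures are already completed.
  class MemoryStore[K, V](val monoid: Monoid[V]) {
    private val data = TrieMap.empty[K, V]

    def get(k: K): Future[Option[V]] = Future.successful(data.get(k))

    // put overwrites; a None value deletes the key.
    def put(kv: (K, Option[V])): Future[Unit] = {
      kv match {
        case (k, Some(v)) => data.put(k, v)
        case (k, None)    => data.remove(k)
      }
      Future.successful(())
    }

    // merge combines the stored value with the new one using the monoid.
    def merge(kv: (K, V)): Future[Unit] = synchronized {
      val (k, v) = kv
      data.put(k, monoid.plus(data.getOrElse(k, monoid.zero), v))
      Future.successful(())
    }
  }
}
```

The difference between put and merge is the point: put replaces a value, while merge lets the store itself apply the monoid's plus, so callers never need a read-modify-write cycle.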

Slide 30

What values are allowed?

Slide 31

trait Monoid[V] {
  def zero: V
  def plus(l: V, r: V): V
}
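As an illustration of which values qualify, the sketch below defines two instances of the trait above: Long under addition, and Map whose values are themselves a monoid. The object name `MonoidSketch` and the helper `sum` are hypothetical; a production library such as Algebird ships its own, more complete versions.

```scala
object MonoidSketch {
  trait Monoid[V] { def zero: V; def plus(l: V, r: V): V }

  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Maps form a monoid whenever their values do: merge key by key.
  implicit def mapMonoid[K, V](implicit m: Monoid[V]): Monoid[Map[K, V]] =
    new Monoid[Map[K, V]] {
      def zero: Map[K, V] = Map.empty
      def plus(l: Map[K, V], r: Map[K, V]): Map[K, V] =
        r.foldLeft(l) { case (acc, (k, v)) =>
          acc.updated(k, m.plus(acc.getOrElse(k, m.zero), v))
        }
    }

  // Sums any sequence whose element type has a Monoid in scope.
  def sum[V](vs: Seq[V])(implicit m: Monoid[V]): V =
    vs.foldLeft(m.zero)(m.plus)
}
```

The zero element is what makes the empty sum well defined, and the Map instance shows how monoids compose: a count per key comes for free once Long is a monoid.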

Slide 32

• Tons O'Monoids:
• CMS, HyperLogLog, ExponentialMA, BloomFilter, Moments, MinHash, TopK

Slide 33



Slide 34



Slide 35

Associativity

Slide 36

;; 7 steps
a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7

Slide 37

;; 7 steps
(+ a0 a1 a2 a3 a4 a5 a6 a7)

Slide 38

;; 5 steps
(+ (+ a0 a1) (+ a2 a3) (+ a4 a5) (+ a6 a7))

Slide 39

;; 3 steps
(+ (+ (+ a0 a1) (+ a2 a3))
   (+ (+ a4 a5) (+ a6 a7)))
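The fully nested grouping can be sketched in Scala as a level-by-level pairwise reduction. `TreeReduce` and `reduceLevels` are hypothetical names; the code runs sequentially and only counts how many levels a parallel executor would need, assuming every pair within a level is combined at the same time.

```scala
object TreeReduce {
  // Combines neighbours pairwise, level by level.
  // Returns the total and the number of levels (parallel steps).
  def reduceLevels[V](vs: Vector[V])(plus: (V, V) => V): (V, Int) = {
    require(vs.nonEmpty, "need at least one value")
    var level = vs
    var steps = 0
    while (level.size > 1) {
      // Each group of two becomes one value; an odd leftover passes through.
      level = level.grouped(2).map(_.reduce(plus)).toVector
      steps += 1
    }
    (level.head, steps)
  }
}
```

Regrouping like this is only valid because plus is associative: neighbours are combined in their original order, so the result matches the left-to-right sum even for an operation that is not commutative, such as string concatenation.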

Slide 40

Parallelism Associativity

Slide 41

Clients Storehaus

Slide 42

Tweet Embed Counts

Slide 43



Slide 44



Slide 45

[Diagram: Tweets -> (Flat)Mappers (f) -> Reducers (+), with an HDFS/Queue at each end. Labels: groupBy TweetID; reduce: (x,y) => MapMonoid; (TweetID, Map[URL, Long])]

Slide 46

object OuroborosJob {
  def apply[P <: Platform[P]](source: Producer[P, ClientEvent], store: P#Store[OuroborosKey, OuroborosValue]) =
    source.filter(filterEvents(_))
      .flatMap { event =>
        val widgetDetails = event.getWidget_details
        val referUrl: String = widgetDetails.getWidget_origin
        val timestamp: Long = event.getLog_base.getTimestamp
        val widgetFrameUrlOpt: Option[String] = Option(widgetDetails.getWidget_frame)
        for {
          tweetId: java.lang.Long <- javaToScalaSafe(event.getEvent_details.getItem_ids)
          timeBucketOption: Option[TimeBucket] <- timeBucketsForTimestamp(timestamp)
        } yield {
          val urlHllOption = canonicalUrl(referUrl).map(hllMonoid.create(_))
          val widgetFrameUrlsOption = widgetFrameUrlOpt map { widgetUrl: String =>
            widgetFrameUrlsSmMonoid.create((referUrl, (widgetFrameUrlSetSmMonoid.create((widgetUrl, 1L)), 1L)))
          }
          val impressionsValue: OuroborosValue = RawImpressions(
            impressions = 1L,
            approxUniqueUrls = urlHllOption,
            urlCounts = Some(embedCountSmMonoid.create((referUrl, 1L))),
            urlDates = Some(embedDateSmMonoid.create((referUrl, timestamp))),
            frameUrls = widgetFrameUrlsOption
          ).as[OuroborosValue]
          // topTweetsValue is not defined on the slide.
          Seq(
            (OuroborosKey.ImpressionsKey(ImpressionsKey(tweetId.longValue, timeBucketOption)), impressionsValue),
            (OuroborosKey.TopTweetsKey(TopTweetsKey(timeBucketOption)), topTweetsValue)
          )
        }
      }.sumByKey(store)
      .set(MonoidIsCommutative(true))
}

Slide 47

OuroborosJob (code as on Slide 46). Highlighted step: Filter Events

Slide 48

OuroborosJob (code as on Slide 46). Highlighted steps: Filter Events, Generate KV Pairs

Friday, September 27, 13

Slide 49

OuroborosJob (code as on Slide 46). Highlighted steps: Filter Events, Generate KV Pairs, Sum into Store

Slide 50

Brief Explanation

This job creates two types of keys:
1: (TweetId, TimeBucket) => Map[URL, Impressions]
2: TimeBucket => Map[TweetId, Impressions]
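The two key types can be illustrated with a much-simplified sketch in plain Scala. Everything here is hypothetical: `Impression`, the `Long` time bucket, the string labels, and plain counts stand in for the job's real Thrift events, `TimeBucket` type, and sketch-based values.

```scala
object EmbedCounts {
  // Hypothetical simplified event: one tweet embedded on one URL.
  case class Impression(tweetId: Long, url: String, timeBucket: Long)

  sealed trait Key
  case class ImpressionsKey(tweetId: Long, timeBucket: Long) extends Key
  case class TopTweetsKey(timeBucket: Long) extends Key

  // Value: label (a URL, or a tweet ID rendered as text) -> impression count.
  type Counts = Map[String, Long]

  // Key-wise addition, i.e. the map monoid's plus for Long counts.
  def plus(l: Counts, r: Counts): Counts =
    r.foldLeft(l) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  // Each event yields one pair per key type.
  def keyValues(e: Impression): Seq[(Key, Counts)] = Seq(
    (ImpressionsKey(e.tweetId, e.timeBucket), Map(e.url -> 1L)),
    (TopTweetsKey(e.timeBucket), Map(e.tweetId.toString -> 1L))
  )

  // Stand-in for sumByKey: fold every pair into an in-memory map.
  def run(events: Seq[Impression]): Map[Key, Counts] =
    events.flatMap(keyValues).foldLeft(Map.empty[Key, Counts]) {
      case (store, (k, v)) =>
        store.updated(k, plus(store.getOrElse(k, Map.empty[String, Long]), v))
    }
}
```

Looking up an ImpressionsKey answers "where was this tweet embedded in this time bucket?", while a TopTweetsKey answers "which tweets were embedded most in this time bucket?".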

Slide 51

What Else?

Slide 52


Slide 53

What’s Next?

Slide 54

Future Plans

- Akka, Spark, Tez Platforms
- More Spouts, Stores and Monoids
- Pluggable graph optimizations
- More tutorials!

Slide 55

Open Source!

Slide 56

Summary

• Summingbird is appropriate for the majority of the real-time apps we have.
• Data scientists who are not familiar with systems can deploy realtime systems.
• Systems engineers can reuse 90% of the code (batch/realtime merging).

Slide 57

Thank You!

Follow me at @sritchie