Slide 1

Slide 1 text

@summingbird Friday, September 27, 13

Slide 2

Slide 2 text

Slide 3

Slide 3 text

Oscar Boykin - @posco
Sam Ritchie - @sritchie
Ashu Singhal - @daashu

Slide 4

Slide 4 text

- What is Summingbird?
- What can it do today?
- Why you should use it to write Storm!
- Currently deployed systems
- Upcoming Features

Slide 5

Slide 5 text

Vision

Slide 6

Slide 6 text

Twitter’s Scale
- 200M+ Active Monthly Users
- 500M Tweets / Day
- Several 1K+ node Hadoop clusters
- THE Realtime Company (ostensibly)

Slide 7

Slide 7 text

Write your logic once.

Slide 8

Slide 8 text

Solve systems problems once.

Slide 9

Slide 9 text

Make non-trivial realtime compute as accessible as Scalding.

Slide 10

Slide 10 text

What is Summingbird?
- Declarative Streaming Map/Reduce DSL
- Realtime platform that runs on Storm
- Batch platform that runs on Hadoop
- Batch / Realtime Hybrid platform

Slide 11

Slide 11 text

Why does Storm Care?

Slide 12

Slide 12 text

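    // Imports as in the storm-starter example of the time (backtype.storm packages).
    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.StormSubmitter;
    import backtype.storm.task.ShellBolt;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.IRichBolt;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import storm.starter.spout.RandomSentenceSpout;
    import java.util.HashMap;
    import java.util.Map;

    public class WordCountTopology {
      // Splits each sentence into words by shelling out to a Python script.
      public static class SplitSentence extends ShellBolt implements IRichBolt {
        public SplitSentence() {
          super("python", "splitsentence.py");
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("word"));
        }

        @Override
        public Map<String, Object> getComponentConfiguration() {
          return null;
        }
      }

      // Keeps a running count per word in task-local memory.
      public static class WordCount extends BaseBasicBolt {
        Map<String, Integer> counts = new HashMap<String, Integer>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
          String word = tuple.getString(0);
          Integer count = counts.get(word);
          if (count == null) count = 0;
          count++;
          counts.put(word, count);
          collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("word", "count"));
        }
      }

      public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        builder.setBolt("split", new SplitSentence(), 8)
               .shuffleGrouping("spout");
        // fieldsGrouping routes every occurrence of a word to the same counting task.
        builder.setBolt("count", new WordCount(), 12)
               .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setDebug(true);

        if (args != null && args.length > 0) {
          // Submit to a real cluster under the name given on the command line.
          conf.setNumWorkers(3);
          StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
          // Otherwise run in-process for ten seconds.
          conf.setMaxTaskParallelism(3);
          LocalCluster cluster = new LocalCluster();
          cluster.submitTopology("word-count", conf, builder.createTopology());
          Thread.sleep(10000);
          cluster.shutdown();
        }
      }
    }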

Slide 13

Slide 13 text

    import com.twitter.summingbird._
    import twitter4j.Status

    // Lowercase, strip punctuation, split on whitespace.
    def tokenize(text: String): TraversableOnce[String] =
      text.toLowerCase
        .replaceAll("[^a-zA-Z0-9\\s]", "")
        .split("\\s+")

    // Platform-independent word count: the same job runs on Storm or Hadoop.
    def wordCount[P <: Platform[P]](
      source: Producer[P, Status],
      store: P#Store[String, Long]) =
      source
        .filter(_.getText != null)
        .flatMap { tweet: Status => tokenize(tweet.getText).map(_ -> 1L) }
        .sumByKey(store)
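To make the shape of the job concrete, here is a minimal sketch that runs the same tokenize, flatMap, and sum-by-key steps over an in-memory list of strings rather than a Summingbird `Producer`. The names `WordCountSketch` and `wordCountLocal` are hypothetical and exist only for this illustration; the empty-token filter is an extra guard that is not on the slide.

```scala
object WordCountSketch {
  // Same tokenizer as on the slide, plus a guard that drops empty tokens
  // (for example, the one produced by splitting an empty string).
  def tokenize(text: String): Seq[String] =
    text.toLowerCase
      .replaceAll("[^a-zA-Z0-9\\s]", "")
      .split("\\s+")
      .filter(_.nonEmpty)
      .toSeq

  // Local stand-in for source.filter(...).flatMap(...).sumByKey(store):
  // emit (word, 1L) pairs, then add up the values for each key.
  def wordCountLocal(texts: Seq[String]): Map[String, Long] =
    texts
      .filter(_ != null)
      .flatMap(text => tokenize(text).map(_ -> 1L))
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```

For example, `WordCountSketch.wordCountLocal(Seq("Call me Ishmael.", "call me"))` counts "call" and "me" twice and "ishmael" once.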

Slide 14

Slide 14 text

[Diagram: Map/Reduce. Event Stream 1 and Event Stream 2 feed the FlatMappers (f1, f2); their output goes to the Reducers (+), which write to Storage (Memcache / ElephantDB).]

Slide 15

Slide 15 text

Functions, not Bolts!
    flatMap: T => TraversableOnce[U]

    // g: (x: T => U)
    map(x) = flatMap(x => List(g(x)))

    // pred: T => Boolean
    filter(x) = flatMap { x => if (pred(x)) List(x) else Nil }
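The identities above can be checked on ordinary Scala lists. This is a small sketch, not Summingbird code; `FlatMapSketch`, `mapVia`, and `filterVia` are hypothetical names chosen for the example.

```scala
object FlatMapSketch {
  // map expressed through flatMap: wrap each result in a one-element list.
  def mapVia[T, U](xs: List[T])(g: T => U): List[U] =
    xs.flatMap(x => List(g(x)))

  // filter expressed through flatMap: emit the element or nothing.
  def filterVia[T](xs: List[T])(pred: T => Boolean): List[T] =
    xs.flatMap(x => if (pred(x)) List(x) else Nil)
}
```

Because both operations reduce to flatMap, a platform only has to know how to run one kind of function per stage.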

Slide 16

Slide 16 text

- Source[+T]
- Store[-K, V]
- Sink[-T]
- Service[-K, +V]

Slide 17

Slide 17 text

The Four Ss!
- Source[+T]
- Store[-K, V]
- Sink[-T]
- Service[-K, +V]

Slide 18

Slide 18 text

The Storm Platform
    Source[+T] = Spout[(Long, T)]
    Store[-K, V] = StormStore[K, V]
    Sink[-T] = () => (T => Future[Unit])
    Service[-K, +V] = StormService[K, V]
    Plan[T] = StormTopology

Slide 19

Slide 19 text

    Source[+T] = Spout[(Long, T)]
    Store[-K, V] = StormStore[K, V]

Slide 20

Slide 20 text

    Source[+T] = Spout[(Long, T)]

Slide 21

Slide 21 text

Type Safety

Slide 22

Slide 22 text

Slide 23

Slide 23 text

    trait Spout[+T] {
      def getSpout: IRichSpout
      def flatMap[U](fn: T => TraversableOnce[U]): Spout[U]
    }

Slide 24

Slide 24 text

    val spout: Spout[String] = Spout.fromTraversable {
      List("call", "me", "ishmael")
    }

    val evenCountSpout: Spout[Int] =
      spout.map(_.size).filter(_ % 2 == 0)

    val characterSpout: Spout[String] =
      spout.flatMap { s => s.toSeq.map(_.toString) }
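To see which values these transformations would produce, the same chain can be run on a plain `List`. This sketch only mirrors the collection operations; it does not use the `Spout` type, and the object name `SpoutValuesSketch` is hypothetical.

```scala
object SpoutValuesSketch {
  val words: List[String] = List("call", "me", "ishmael")

  // Word lengths are 4, 2 and 7; only the even ones survive the filter.
  val evenCounts: List[Int] = words.map(_.size).filter(_ % 2 == 0)

  // Each word is expanded into its individual characters, as strings.
  val characters: List[String] = words.flatMap(s => s.toSeq.map(_.toString))
}
```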

Slide 25

Slide 25 text

    val characterSpout: Spout[String] =
      spout.flatMap { s => s.toSeq.map(_.toString) }

    val builder = new TopologyBuilder
    builder.setSpout("1", characterSpout.getSpout, 1)
    builder.setBolt("2", new TestGlobalCount())
      .globalGrouping("1")

    val topo: StormTopology = builder.createTopology

Slide 26

Slide 26 text

    Store[-K, V] = StormStore[K, V]

Slide 27

Slide 27 text

Slide 28

Slide 28 text

    trait ReadableStore[-K, +V] extends Closeable {
      def get(k: K): Future[Option[V]]
      def multiGet[K1 <: K](ks: Set[K1]): Map[K1, Future[Option[V]]]
      override def close { }
    }

    trait Store[-K, V] extends ReadableStore[K, V] {
      def put(kv: (K, Option[V])): Future[Unit]
      def multiPut[K1 <: K](kvs: Map[K1, Option[V]]): Map[K1, Future[Unit]]
    }

Slide 29

Slide 29 text

    trait MergeableStore[-K, V] extends Store[K, V] {
      def monoid: Monoid[V]
      def merge(kv: (K, V)): Future[Unit]
      def multiMerge[K1 <: K](kvs: Map[K1, V]): Map[K1, Future[Unit]]
    }
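A rough sketch of what `merge` means, under stated assumptions: this in-memory class is hypothetical (it is not a Storehaus class), it uses `scala.concurrent.Future` in place of the `Future` type on the slides so that it has no external dependencies, and it declares its own small `Monoid` trait with the same two methods. The point is only that `merge` combines the incoming value with whatever is already stored, using the monoid's `plus`.

```scala
import scala.collection.mutable
import scala.concurrent.Future

object MergeableStoreSketch {
  // Local copy of the two-method Monoid shape, for self-containment.
  trait Monoid[V] {
    def zero: V
    def plus(l: V, r: V): V
  }

  val longSum: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Hypothetical in-memory store illustrating get / put / merge.
  class InMemoryMergeableStore[K, V](val monoid: Monoid[V]) {
    private val data = mutable.Map.empty[K, V]

    def get(k: K): Future[Option[V]] =
      Future.successful(synchronized { data.get(k) })

    // put overwrites; a None value deletes the key.
    def put(kv: (K, Option[V])): Future[Unit] = synchronized {
      kv match {
        case (k, Some(v)) => data.update(k, v)
        case (k, None)    => data.remove(k)
      }
      Future.successful(())
    }

    // merge combines the new value with the stored one (or with zero).
    def merge(kv: (K, V)): Future[Unit] = synchronized {
      val (k, v) = kv
      data.update(k, monoid.plus(data.getOrElse(k, monoid.zero), v))
      Future.successful(())
    }
  }
}
```

Merging `"a" -> 1L` and then `"a" -> 2L` into a store built with `longSum` leaves 3 under key "a", whereas two `put` calls would leave only the last value.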

Slide 30

Slide 30 text

What values are allowed?

Slide 31

Slide 31 text

    trait Monoid[V] {
      def zero: V
      def plus(l: V, r: V): V
    }
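Any value type with an associative `plus` and an identity `zero` qualifies. As an illustration, here are two instances of the trait above and a helper that folds a sequence with them. The names `MonoidSketch`, `longSum`, `setUnion`, and `sum` are made up for this sketch.

```scala
object MonoidSketch {
  trait Monoid[V] {
    def zero: V
    def plus(l: V, r: V): V
  }

  // Numbers under addition: zero is 0.
  val longSum: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Sets under union: zero is the empty set.
  def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    def zero: Set[A] = Set.empty[A]
    def plus(l: Set[A], r: Set[A]): Set[A] = l ++ r
  }

  // Combine any number of values; an empty input yields zero.
  def sum[V](values: Seq[V], m: Monoid[V]): V =
    values.foldLeft(m.zero)(m.plus)
}
```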

Slide 32

Slide 32 text

• Tons O’Monoids:
• CMS, HyperLogLog, ExponentialMA, BloomFilter, Moments, MinHash, TopK
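As a toy illustration of how a structure like TopK fits the monoid shape, the sketch below keeps the k largest values seen so far. It is deliberately simplistic and is not the implementation used by any library; `TopKSketch` and `topK` are hypothetical names.

```scala
object TopKSketch {
  trait Monoid[V] {
    def zero: V
    def plus(l: V, r: V): V
  }

  // Values are lists of at most k numbers, largest first.
  def topK(k: Int): Monoid[List[Long]] = new Monoid[List[Long]] {
    def zero: List[Long] = Nil
    // Merge both sides, sort descending, keep the first k.
    def plus(l: List[Long], r: List[Long]): List[Long] =
      (l ++ r).sorted(Ordering[Long].reverse).take(k)
  }
}
```

Grouping the merges differently gives the same answer, which is what lets partial results be computed on separate machines and combined later.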

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Slide 35

Slide 35 text

Associativity

Slide 36

Slide 36 text

    ;; 7 steps
    a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7

Slide 37

Slide 37 text

    ;; 7 steps
    (+ a0 a1 a2 a3 a4 a5 a6 a7)

Slide 38

Slide 38 text

    ;; 5 steps
    (+ (+ a0 a1)
       (+ a2 a3)
       (+ a4 a5)
       (+ a6 a7))

Slide 39

Slide 39 text

    ;; 3 steps
    (+ (+ (+ a0 a1)
          (+ a2 a3))
       (+ (+ a4 a5)
          (+ a6 a7)))

Slide 40

Slide 40 text

Parallelism Associativity
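The link between the two can be sketched as a tree reduction: because `plus` is associative, adjacent pairs can be combined independently in each round, and eight values need three rounds instead of seven sequential additions. The code below runs the rounds sequentially and only counts them; `AssociativitySketch`, `round`, and `treeReduce` are hypothetical names.

```scala
object AssociativitySketch {
  // One round: combine adjacent pairs. Each pair is independent of the
  // others, so on a cluster they could all run at the same time.
  def round[V](xs: Vector[V])(plus: (V, V) => V): Vector[V] =
    xs.grouped(2).map(_.reduce(plus)).toVector

  // Repeat rounds until one value is left; return it with the round count.
  def treeReduce[V](xs: Vector[V])(plus: (V, V) => V): (V, Int) = {
    require(xs.nonEmpty, "treeReduce needs at least one value")
    var current = xs
    var rounds = 0
    while (current.size > 1) {
      current = round(current)(plus)
      rounds += 1
    }
    (current.head, rounds)
  }
}
```

The order of elements is preserved, so this works for any associative operation, including non-commutative ones such as string concatenation.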

Slide 41

Slide 41 text

Clients Storehaus

Slide 42

Slide 42 text

Tweet Embed Counts

Slide 43

Slide 43 text

Slide 44

Slide 44 text

Slide 45

Slide 45 text

[Diagram: Tweets are read from HDFS/Queue by the (Flat)Mappers (f), which emit (TweetID, Map[URL, Long]) pairs. After groupBy TweetID, the Reducers (+) combine them with reduce: (x,y) => MapMonoid and write to HDFS/Queue.]

Slide 46

Slide 46 text

    object OuroborosJob {
      def apply[P <: Platform[P]](
        source: Producer[P, ClientEvent],
        store: P#Store[OuroborosKey, OuroborosValue]) =
        source
          .filter(filterEvents(_))
          .flatMap { event =>
            val widgetDetails = event.getWidget_details
            val referUrl: String = widgetDetails.getWidget_origin
            val timestamp: Long = event.getLog_base.getTimestamp
            val widgetFrameUrlOpt: Option[String] = Option(widgetDetails.getWidget_frame)
            for {
              tweetId: java.lang.Long <- javaToScalaSafe(event.getEvent_details.getItem_ids)
              timeBucketOption: Option[TimeBucket] <- timeBucketsForTimestamp(timestamp)
            } yield {
              val urlHllOption = canonicalUrl(referUrl).map(hllMonoid.create(_))
              val widgetFrameUrlsOption = widgetFrameUrlOpt map { widgetUrl: String =>
                widgetFrameUrlsSmMonoid.create(
                  (referUrl, (widgetFrameUrlSetSmMonoid.create((widgetUrl, 1L)), 1L)))
              }
              val impressionsValue: OuroborosValue = RawImpressions(
                impressions = 1L,
                approxUniqueUrls = urlHllOption,
                urlCounts = Some(embedCountSmMonoid.create((referUrl, 1L))),
                urlDates = Some(embedDateSmMonoid.create((referUrl, timestamp))),
                frameUrls = widgetFrameUrlsOption
              ).as[OuroborosValue]
              // topTweetsValue and the helper functions are not defined in this excerpt.
              Seq(
                (OuroborosKey.ImpressionsKey(ImpressionsKey(tweetId.longValue, timeBucketOption)), impressionsValue),
                (OuroborosKey.TopTweetsKey(TopTweetsKey(timeBucketOption)), topTweetsValue)
              )
            }
          }
          .sumByKey(store)
          .set(MonoidIsCommutative(true))
    }

Slide 47

Slide 47 text

(Same code as Slide 46.)
Callout: Filter Events

Slide 48

Slide 48 text

(Same code as Slide 46.)
Callouts: Filter Events, Generate KV Pairs

Slide 49

Slide 49 text

(Same code as Slide 46.)
Callouts: Filter Events, Generate KV Pairs, Sum into Store

Slide 50

Slide 50 text

Brief Explanation
This job creates two types of keys:
1: (TweetId, TimeBucket) => Map[URL, Impressions]
2: TimeBucket => Map[TweetId, Impressions]
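A simplified sketch of the two aggregations, assuming plain `Long` counts for impressions and ordinary Scala maps in place of the sketch-based structures in the job. All names here (`EmbedCountSketch`, `pairsFor`, `sumByKey`, the type aliases) are hypothetical stand-ins, not the production types.

```scala
object EmbedCountSketch {
  type TweetId = Long
  type TimeBucket = String
  type Url = String
  type ByTweet = ((TweetId, TimeBucket), Map[Url, Long])
  type ByBucket = (TimeBucket, Map[TweetId, Long])

  // Map "plus": add the counts of matching keys, keep the rest.
  def plusCounts[K](l: Map[K, Long], r: Map[K, Long]): Map[K, Long] =
    r.foldLeft(l) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }

  // One impression yields one pair for each of the two key types.
  def pairsFor(tweetId: TweetId, bucket: TimeBucket, url: Url): (ByTweet, ByBucket) =
    (((tweetId, bucket), Map(url -> 1L)), (bucket, Map(tweetId -> 1L)))

  // Local stand-in for sumByKey: merge the map values that share a key.
  def sumByKey[K, I](pairs: Seq[(K, Map[I, Long])]): Map[K, Map[I, Long]] =
    pairs.foldLeft(Map.empty[K, Map[I, Long]]) { case (acc, (k, v)) =>
      acc.updated(k, plusCounts(acc.getOrElse(k, Map.empty[I, Long]), v))
    }
}
```

Summing the first kind of pair gives per-tweet, per-bucket URL counts; summing the second gives per-bucket totals for each tweet.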

Slide 51

Slide 51 text

What Else?

Slide 52

Slide 52 text

Slide 53

Slide 53 text

What’s Next?

Slide 54

Slide 54 text

Future Plans
- Akka, Spark, Tez Platforms
- More Spouts, Stores and Monoids
- Pluggable graph optimizations
- More tutorials!

Slide 55

Slide 55 text

Open Source!

Slide 56

Slide 56 text

Summary
• Summingbird is appropriate for the majority of the real-time apps we have.
• Data scientists who are not familiar with systems can deploy realtime systems.
• Systems engineers can reuse 90% of the code (batch/realtime merging).

Slide 57

Slide 57 text

Thank You!
Follow me at @sritchie