Monoids, Store, and Dependency Injection - Abstractions for Spark Streaming Jobs

Ryan Weald
January 16, 2014

Talk I gave at a Spark Meetup on 01/16/2014

Abstract:
One of the most difficult aspects of deploying Spark Streaming as part of your technology stack is maintaining all the code associated with stream processing jobs. In this talk I will discuss the tools and techniques that Sharethrough has found most useful for maintaining a large number of Spark Streaming jobs. We will look in detail at the way monoids and Twitter's Algebird library can be used to create generic aggregations, as well as the way we can create generic interfaces for writing the results of streaming jobs to multiple data stores. Finally we will look at the way dependency injection can be used to tie all the pieces together, enabling rapid development of new streaming jobs.

Transcript

  1. @rweald What We’re Going to Cover
     • What we do and why we chose Spark
     • Common patterns in Spark Streaming jobs
     • Monoids as an abstraction for aggregation
     • Abstractions for saving the results of jobs
     • Using dependency injection for improved testability and developer happiness

  2. @rweald Why Spark Streaming
     • Liked the theoretical foundation of mini-batch
     • Scala codebase + functional API
     • Young project with opportunities to contribute
     • Batch model for iterative ML algorithms

  3. @rweald Real World Example
     Which publisher pages has an ad unit appeared on?

  4. @rweald Mapping Data

     inputData.map { rawRequest =>
       val params = QueryParams.parse(rawRequest)
       val pubPage = params.getOrElse(
         "pub_page_location", "http://example.com")
       val creative = params.getOrElse(
         "creative_key", "unknown")
       val uri = new java.net.URI(pubPage)
       val cleanPubPage = uri.getHost + "/" + uri.getPath
       (creative, cleanPubPage)
     }

  5. @rweald Basic Aggregation

     val sum: (Set[String], Set[String]) => Set[String] = _ ++ _

     creativePubPages.map { case (ckey, pubPage) =>
       (ckey, Set(pubPage))
     }.reduceByKey(sum)

  6. @rweald WTF is a Monoid?

     trait Monoid[T] {
       def zero: T
       def plus(r: T, l: T): T
     }

     * Just need to make sure plus is associative:
       (1 + 5) + 2 == 1 + (5 + 2)

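     To make the law concrete, here is a toy integer instance of the trait above with the associativity check spelled out (purely illustrative, not from the talk):

        object IntMonoid extends Monoid[Int] {
          def zero = 0
          def plus(r: Int, l: Int) = r + l
        }

        // Associativity: the grouping of operations doesn't matter.
        assert(IntMonoid.plus(IntMonoid.plus(1, 5), 2) ==
               IntMonoid.plus(1, IntMonoid.plus(5, 2)))
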
  7. @rweald Monoid Example

     object SetMonoid extends Monoid[Set[String]] {
       def zero = Set.empty[String]
       def plus(l: Set[String], r: Set[String]) = l ++ r
     }

     SetMonoid.plus(Set("a"), Set("b")) // returns Set("a", "b")
     SetMonoid.plus(Set("a"), Set("a")) // returns Set("a")

  8. @rweald Algebird Based Aggregation

     import com.twitter.algebird._

     val bfMonoid = BloomFilter(500000, 0.01)

     creativePubPages.map { case (ckey, pubPage) =>
       (ckey, bfMonoid.create(pubPage))
     }.reduceByKey(bfMonoid.plus(_, _))

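     For context, a sketch of how the merged Bloom filter could be queried downstream. Algebird's membership test is approximate, so it returns an ApproximateBoolean rather than a plain Boolean (the page values here are made up):

        val merged = bfMonoid.plus(
          bfMonoid.create("example.com/a"),
          bfMonoid.create("example.com/b"))

        // May report false positives, never false negatives.
        merged.contains("example.com/a").isTrue // true
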
  9. @rweald Add set of users who have seen creative to same job

  10. @rweald Algebird Based Aggregation

     val aggregator = new Monoid[(BF, BF)] {
       def zero = (bfMonoid.zero, bfMonoid.zero)
       def plus(l: (BF, BF), r: (BF, BF)) = {
         (bfMonoid.plus(l._1, r._1), bfMonoid.plus(l._2, r._2))
       }
     }

     creativePubPages.map { case (ckey, pubPage, userId) =>
       (ckey, (bfMonoid.create(pubPage), bfMonoid.create(userId)))
     }.reduceByKey(aggregator.plus(_, _))

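     Worth noting: Algebird can derive the pair monoid when the component monoids are in implicit scope, which would make the hand-rolled aggregator above unnecessary. A hedged sketch, assuming Algebird's generated tuple-monoid implicits:

        implicit val bfm: BloomFilterMonoid = BloomFilter(500000, 0.01)

        // Monoid[(BF, BF)] is derived from the implicit Monoid[BF].
        val aggregator = implicitly[Monoid[(BF, BF)]]
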
  11. @rweald Storage API Requirements
     • Incremental updates (preferably associative)
     • Pluggable to support “big data” stores
     • Allow for testing jobs

  12. @rweald Storage API

     trait MergeableStore[K, V] {
       def get(key: K): V
       def put(kv: (K, V)): V
       /*
        * Should follow same associative property
        * as our Monoid from earlier
        */
       def merge(kv: (K, V)): V
     }

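     A minimal in-memory sketch of this trait, assuming values combine via a Monoid[V] as defined earlier (the class name is illustrative, not Sharethrough's actual implementation):

        import scala.collection.mutable

        class InMemoryMergeableStore[K, V](monoid: Monoid[V])
            extends MergeableStore[K, V] {
          private val data = mutable.Map.empty[K, V]

          def get(key: K): V = data.getOrElse(key, monoid.zero)

          def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }

          // Combine the incoming value with whatever is already
          // stored, using the monoid's associative plus.
          def merge(kv: (K, V)): V = {
            val merged = monoid.plus(get(kv._1), kv._2)
            data(kv._1) = merged
            merged
          }
        }

     A store like this is also what makes jobs testable: the same merge semantics as a production store, with no external dependencies.
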
  13. @rweald Storing Spark Results

     def saveResults(result: DStream[(String, BF)],
                     store: HBaseStore[String, BF]) = {
       result.foreach { rdd =>
         rdd.foreach { element =>
           val (key, value) = element
           store.merge((key, value))
         }
       }
     }

  14. @rweald Generic Storage Method

     def saveResults(result: DStream[(String, BF)],
                     storeFactory: StorageFactory) = {
       val store = storeFactory.create
       result.foreach { rdd =>
         rdd.foreach { element =>
           val (key, value) = element
           store.merge((key, value))
         }
       }
     }

  15. @rweald DI the Store You Need!

     trait StorageFactory {
       def create: Store[String, BF]
     }

     class DevModule extends ScalaModule {
       def configure() {
         bind[StorageFactory].to[InMemoryStorageFactory]
       }
     }

     class ProdModule extends ScalaModule {
       def configure() {
         bind[StorageFactory].to[HBaseStorageFactory]
       }
     }

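     A sketch of how these modules might be wired up at job startup, assuming Guice plus the scala-guice extensions (the environment check is illustrative):

        import com.google.inject.Guice
        import net.codingwell.scalaguice.InjectorExtensions._

        // Swap modules per environment; job code only sees StorageFactory.
        val module =
          if (sys.env.getOrElse("ENV", "") == "production") new ProdModule
          else new DevModule
        val store = Guice.createInjector(module).instance[StorageFactory].create
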
  16. @rweald Potential API additions?

     class PairDStreamFunctions[K, V] {
       def aggregateByKey(aggregator: Monoid[V])
       def store(store: MergeableStore[K, V])
     }
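
     If additions like these landed, the whole job could read as a single pipeline. A hypothetical sketch using the proposed methods above and the in-memory store sketched earlier:

        creativePubPages
          .map { case (ckey, pubPage) => (ckey, bfMonoid.create(pubPage)) }
          .aggregateByKey(bfMonoid)
          .store(new InMemoryMergeableStore[String, BF](bfMonoid))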