Productionalizing Spark Streaming

Ryan Weald
December 02, 2013

Spark Summit 2013 Talk:

At Sharethrough we have deployed Spark to our production environment to support several user-facing product features. While building these features we uncovered a consistent set of challenges across multiple streaming jobs. Addressing these challenges up front speeds up development of future streaming jobs. In this talk we will discuss the three major challenges we encountered while developing production streaming jobs and how we overcame them.

First, we will look at how to write jobs to ensure fault tolerance, since streaming jobs need to run 24/7 even under failure conditions. Second, we will look at the programming abstractions we created using functional programming and existing libraries. Finally, we will look at the way we test all the pieces of a job, from manipulating data through writing to external databases, to give us confidence in our code before we deploy to production.


Transcript

  1. Productionalizing Spark Streaming, Spark Summit 2013, Ryan Weald (@rweald)
  2. What We’re Going to Cover
      • What we do and why we choose Spark
      • Fault tolerance for long-lived streaming jobs
      • Common patterns and functional abstractions
      • Testing before we “do it live”
  3. Special focus on common patterns and their solutions
  4. What is Sharethrough? Advertising for the Modern Internet. Function / Form
  5. What is Sharethrough?

  6. Why Spark Streaming?

  7. Why Spark Streaming
      • Liked theoretical foundation of mini-batch
      • Scala codebase + functional API
      • Young project with opportunities to contribute
      • Batch model for iterative ML algorithms
  8. Great... Now productionalize it

  9. Fault Tolerance

  10. Keys to Fault Tolerance
      1. Receiver fault tolerance
      2. Monitoring job progress
  11. Receiver Fault Tolerance
      • Use Actors with supervisors
      • Use self-healing connection pools
  12. Use Actors

      class RabbitMQStreamReceiver(uri: String, exchangeName: String, routingKey: String)
        extends Actor with Receiver with Logging {

        implicit val system = ActorSystem()

        override def preStart() = {
          // Your code to set up connections and actors
          // Include inner class to process messages
        }

        def receive: Receive = {
          case _ => logInfo("unknown message")
        }
      }
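
      A minimal sketch (not from the talk) of the "Actors with supervisors" idea from the previous slide, using plain Akka: a parent actor restarts the receiver whenever it throws. The ReceiverSupervisor name and the restart policy are illustrative assumptions, not Sharethrough's code.

      import akka.actor._
      import akka.actor.SupervisorStrategy._
      import scala.concurrent.duration._

      class ReceiverSupervisor(receiverProps: Props) extends Actor {
        // Restart a failed receiver up to 10 times per minute before giving up
        override val supervisorStrategy =
          OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
            case _: Exception => Restart
          }

        // The supervised receiver (e.g. the RabbitMQStreamReceiver above) is created as a child
        private val receiver = context.actorOf(receiverProps, "stream-receiver")

        // Forward any messages down to the supervised receiver
        def receive: Receive = { case msg => receiver forward msg }
      }
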
  13. Track All Outputs
      • Low watermarks (Google MillWheel)
      • Database updated_at (see the sketch below)
      • Expected output file size alerting
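
      A minimal sketch (not from the talk) of the "Database updated_at" check, assuming a Postgres-style store reachable over JDBC; the table and column names are illustrative.

      import java.sql.DriverManager

      // Returns false (i.e. should alert) when the newest row is older than maxLagSeconds
      def outputIsFresh(jdbcUrl: String, table: String, maxLagSeconds: Long): Boolean = {
        val conn = DriverManager.getConnection(jdbcUrl)
        try {
          val rs = conn.createStatement().executeQuery(
            s"SELECT EXTRACT(EPOCH FROM (NOW() - MAX(updated_at))) FROM $table")
          rs.next()
          val lagSeconds = rs.getLong(1)
          !rs.wasNull() && lagSeconds <= maxLagSeconds
        } finally {
          conn.close()
        }
      }
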
  14. Common Patterns & Functional Programming

  15. Common Job Pattern: Map -> Aggregate -> Store
  16. Mapping Data

      inputData.map { rawRequest =>
        val params = QueryParams.parse(rawRequest)
        (params.getOrElse("beaconType", "unknown"), 1L)
      }
  17. Aggregation

  18. Basic Aggregation

      // beacons is a DStream[(String, Long)]
      // example: Seq(("click", 1L), ("click", 1L))
      val sum: (Long, Long) => Long = _ + _
      beacons.reduceByKey(sum)
  19. What happens when we want to sum multiple things?
  20. Long Basic Aggregation

      val inputData = Seq(
        ("user_1", (1L, 1L, 1L)),
        ("user_1", (2L, 2L, 2L))
      )

      def sum(l: (Long, Long, Long), r: (Long, Long, Long)) = {
        (l._1 + r._1, l._2 + r._2, l._3 + r._3)
      }

      inputData.reduceByKey(sum)
  21. Now sum 4 Ints instead
  22. Monoids to the Rescue

  23. WTF is a Monoid?

      trait Monoid[T] {
        def zero: T
        def plus(r: T, l: T): T
      }

      * Just need to make sure plus is associative: (1 + 5) + 2 == 1 + (5 + 2)
  24. Monoid Based Aggregation

      object LongMonoid extends Monoid[(Long, Long, Long)] {
        def zero = (0L, 0L, 0L)
        def plus(r: (Long, Long, Long), l: (Long, Long, Long)) = {
          (l._1 + r._1, l._2 + r._2, l._3 + r._3)
        }
      }

      inputData.reduceByKey(LongMonoid.plus(_, _))
  25. Twitter Algebird http://github.com/twitter/algebird

  26. Algebird Based Aggregation

      import com.twitter.algebird._

      val aggregator = implicitly[Monoid[(Long, Long, Long)]]
      inputData.reduceByKey(aggregator.plus(_, _))
  27. How many unique users per publisher?
  28. Too big for a naive in-memory Map
  29. HyperLogLog FTW

  30. HLL Aggregation

      import com.twitter.algebird._

      val aggregator = new HyperLogLogMonoid(12)
      inputData.reduceByKey(aggregator.plus(_, _))
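
      A minimal sketch (not from the talk) of the step the slide leaves implicit: the raw values have to be turned into HLL sketches before they can be merged. pageViews is an assumed DStream[(String, String)] of (publisherId, userId), and building a one-element sketch with hll(bytes) is Algebird's HyperLogLogMonoid API of that era.

      import com.twitter.algebird._

      val hll = new HyperLogLogMonoid(12) // 12 bits of precision: size vs. accuracy trade-off

      val uniquesPerPublisher =
        pageViews
          .map { case (publisherId, userId) => (publisherId, hll(userId.getBytes("UTF-8"))) }
          .reduceByKey(hll.plus(_, _))       // merge the sketches, never the raw user ids
          .mapValues(_.estimatedSize.toLong) // approximate unique users per publisher
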
  31. Monoids == Reusable Aggregation

  32. Common Job Pattern: Map -> Aggregate -> Store
  33. Store

  34. How do we store the results?
  35. Storage API Requirements
      • Incremental updates (preferably associative)
      • Pluggable to support “big data” stores
      • Allow for testing jobs
  36. Storage API

      trait MergeableStore[K, V] {
        def get(key: K): V
        def put(kv: (K, V)): V

        /*
         * Should follow same associative property
         * as our Monoid from earlier
         */
        def merge(kv: (K, V)): V
      }
  37. Twitter Storehaus http://github.com/twitter/storehaus

  38. Storing Spark Results

      def saveResults(result: DStream[(String, Long)], store: RedisStore[String, Long]) = {
        result.foreach { rdd =>
          rdd.foreach { element =>
            val (key, value) = element
            store.merge((key, value))
          }
        }
      }
  39. Everyone can benefit

  40. Potential API additions?

      class PairDStreamFunctions[K, V] {
        def aggregateByKey(aggregator: Monoid[V])
        def store(store: MergeableStore[K, V])
      }
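
      A minimal sketch (not from the talk) of how a job might read if these hypothetical methods existed; aggregateByKey and store are the proposed additions above, not real Spark Streaming API, and redisStore stands in for any MergeableStore[String, Long].

      inputData
        .map { rawRequest =>
          val params = QueryParams.parse(rawRequest)
          (params.getOrElse("beaconType", "unknown"), 1L)
        }
        .aggregateByKey(implicitly[Monoid[Long]]) // Algebird provides Monoid[Long]
        .store(redisStore)
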
  41. Twitter Summingbird http://github.com/twitter/summingbird (*https://github.com/twitter/summingbird/issues/387)

  42. Testing Your Jobs

  43. Testing Best Practices
      • Try to avoid full integration tests
      • Use in-memory stores for testing (see the sketch below)
      • Keep logic outside of Spark
      • Use Summingbird in-memory platform???
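
      A minimal sketch (not from the talk) of testing the aggregation logic without Spark, assuming ScalaTest and reusing the Monoid trait, LongMonoid, and the simplified MergeableStore from the earlier slides; InMemoryMergeableStore is an illustrative stand-in for a Storehaus in-memory store.

      import org.scalatest.FunSuite
      import scala.collection.mutable

      // In-memory MergeableStore: merge() folds new values in with the monoid's plus
      class InMemoryMergeableStore[K, V](monoid: Monoid[V]) extends MergeableStore[K, V] {
        private val data = mutable.Map[K, V]()
        def get(key: K): V = data.getOrElse(key, monoid.zero)
        def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }
        def merge(kv: (K, V)): V = put((kv._1, monoid.plus(get(kv._1), kv._2)))
      }

      class AggregationSpec extends FunSuite {
        test("merging counts is incremental") {
          val store = new InMemoryMergeableStore[String, (Long, Long, Long)](LongMonoid)
          store.merge(("user_1", (1L, 1L, 1L)))
          store.merge(("user_1", (2L, 2L, 2L)))
          val expected = (3L, 3L, 3L)
          assert(store.get("user_1") === expected)
        }
      }
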
  44. Thank You! Ryan Weald (@rweald)