Productionalizing Spark Streaming

Ryan Weald
December 02, 2013

Spark Summit 2013 Talk:

At Sharethrough we have deployed Spark to our production environment to support several user-facing product features. While building these features we uncovered a consistent set of challenges across multiple streaming jobs. By addressing these challenges you can speed up development of future streaming jobs. In this talk we will discuss the three major challenges we encountered while developing production streaming jobs and how we overcame them.

First, we will look at how to write jobs to ensure fault tolerance, since streaming jobs need to run 24/7 even under failure conditions. Second, we will look at the programming abstractions we created using functional programming and existing libraries. Finally, we will look at the way we test all the pieces of a job, from manipulating data through writing to external databases, to give us confidence in our code before we deploy to production.

Transcript

  1. @rweald What We’re Going to Cover
     • What we do and why we chose Spark
     • Fault tolerance for long-lived streaming jobs
     • Common patterns and functional abstractions
     • Testing before we “do it live”
  2. @rweald Why Spark Streaming
     • Liked the theoretical foundation of mini-batch
     • Scala codebase + functional API
     • Young project with opportunities to contribute
     • Batch model for iterative ML algorithms
  3. @rweald Receiver Fault Tolerance
     • Use Actors with supervisors
     • Use self-healing connection pools
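
     A minimal sketch of the “Actors with supervisors” idea, assuming plain Akka
     supervision; the class and actor names here are illustrative, not from the talk:

       import akka.actor.{Actor, OneForOneStrategy, Props}
       import akka.actor.SupervisorStrategy.Restart
       import scala.concurrent.duration._

       // Stand-in for the RabbitMQStreamReceiver shown on the next slide
       class SomeReceiverActor extends Actor {
         def receive = { case _ => () }
       }

       class ReceiverSupervisor extends Actor {
         // Restart the receiver on any exception, up to 10 times per minute
         override val supervisorStrategy =
           OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
             case _: Exception => Restart
           }

         val receiver = context.actorOf(Props[SomeReceiverActor], "receiver")

         def receive = {
           case msg => receiver forward msg
         }
       }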
  4. @rweald Use Actors

     class RabbitMQStreamReceiver(uri: String, exchangeName: String, routingKey: String)
       extends Actor with Receiver with Logging {

       implicit val system = ActorSystem()

       override def preStart() = {
         //Your code to setup connections and actors
         //Include inner class to process messages
       }

       def receive: Receive = {
         case _ => logInfo("unknown message")
       }
     }
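
     A hedged sketch of how this receiver might be wired into a job, assuming the
     Spark 0.8-era actorStream API (the connection URI, exchange, and routing key
     below are placeholder values):

       import akka.actor.Props
       import org.apache.spark.streaming.{Seconds, StreamingContext}

       val ssc = new StreamingContext("local[2]", "beacon-job", Seconds(10))

       // actorStream creates an input DStream backed by the receiver actor
       val beacons = ssc.actorStream[String](
         Props(new RabbitMQStreamReceiver("amqp://localhost:5672", "beacons", "beacon.click")),
         "rabbitmq-receiver")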
  5. @rweald Track All Outputs
     • Low watermarks - Google MillWheel
     • Database updated_at
     • Expected output file size alerting
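
     One hypothetical way to act on the “Database updated_at” idea (not code from
     the talk; the table and column names are made up and the SQL assumes Postgres):
     alert when the newest row in the output table lags behind the batch cadence.

       import java.sql.DriverManager

       def outputIsFresh(jdbcUrl: String, maxLagSeconds: Long): Boolean = {
         val conn = DriverManager.getConnection(jdbcUrl)
         try {
           val rs = conn.createStatement().executeQuery(
             "SELECT EXTRACT(EPOCH FROM now() - MAX(updated_at)) FROM beacon_counts")
           rs.next() && rs.getLong(1) <= maxLagSeconds
         } finally {
           conn.close()
         }
       }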
  6. @rweald Mapping Data

     inputData.map { rawRequest =>
       val params = QueryParams.parse(rawRequest)
       (params.getOrElse("beaconType", "unknown"), 1L)
     }
  7. @rweald Basic Aggregation

     //beacons is a DStream[(String, Long)]
     //example: Seq(("click", 1L), ("click", 1L))
     val sum: (Long, Long) => Long = _ + _
     beacons.reduceByKey(sum)
  8. @rweald Long Basic Aggregation

     val inputData = Seq(
       ("user_1", (1L, 1L, 1L)),
       ("user_1", (2L, 2L, 2L))
     )

     def sum(l: (Long, Long, Long), r: (Long, Long, Long)) = {
       (l._1 + r._1, l._2 + r._2, l._3 + r._3)
     }

     inputData.reduceByKey(sum)
  9. @rweald WTF is a Monoid?

     trait Monoid[T] {
       def zero: T
       def plus(r: T, l: T): T
     }

     * Just need to make sure plus is associative: (1 + 5) + 2 == 1 + (5 + 2)
  10. @rweald Monoid Based Aggregation

      object LongMonoid extends Monoid[(Long, Long, Long)] {
        def zero = (0L, 0L, 0L)
        def plus(r: (Long, Long, Long), l: (Long, Long, Long)) = {
          (l._1 + r._1, l._2 + r._2, l._3 + r._3)
        }
      }

      inputData.reduceByKey(LongMonoid.plus(_, _))
  11. @rweald Algebird Based Aggregation

      import com.twitter.algebird._

      val aggregator = implicitly[Monoid[(Long, Long, Long)]]
      inputData.reduceByKey(aggregator.plus(_, _))
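
      A quick illustration (not from the slides) of what the implicitly summoned
      tuple monoid does: plus adds component-wise and zero is a tuple of zeros.

        import com.twitter.algebird._

        val m = implicitly[Monoid[(Long, Long, Long)]]
        m.plus((1L, 2L, 3L), (10L, 20L, 30L)) // (11L, 22L, 33L)
        m.zero                                // (0L, 0L, 0L)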
  12. @rweald HLL Aggregation

      import com.twitter.algebird._

      val aggregator = new HyperLogLogMonoid(12)
      inputData.reduceByKey(aggregator.plus(_, _))
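
      For the HLL monoid the values being reduced must already be HLL sketches; a
      hedged sketch of that lifting step (the keys and user ids are made up, and the
      Algebird calls assume the 2013-era HyperLogLogMonoid API):

        import com.twitter.algebird._

        val hll = new HyperLogLogMonoid(12) // 12 bits of precision

        // Lift raw user ids into HLL sketches keyed by beacon type
        val sketchesByKey = Seq(("click", "user_1"), ("click", "user_2")).map {
          case (beaconType, userId) => (beaconType, hll.create(userId.getBytes("UTF-8")))
        }

        // In the job this would be a DStream[(String, HLL)] reduced with hll.plus;
        // the approximate distinct count is then read from the merged sketch.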
  13. @rweald Storage API Requirements
      • Incremental updates (preferably associative)
      • Pluggable to support “big data” stores
      • Allow for testing jobs
  14. @rweald Storage API

      trait MergeableStore[K, V] {
        def get(key: K): V
        def put(kv: (K, V)): V

        /*
         * Should follow same associative property
         * as our Monoid from earlier
         */
        def merge(kv: (K, V)): V
      }
  15. @rweald Storing Spark Results

      def saveResults(result: DStream[(String, Long)],
                      store: RedisStore[String, Long]) = {
        result.foreach { rdd =>
          rdd.foreach { element =>
            val (key, value) = element
            store.merge((key, value))
          }
        }
      }
  16. @rweald Potential API additions?

      class PairDStreamFunctions[K, V] {
        def aggregateByKey(aggregator: Monoid[V])
        def store(store: MergeableStore[K, V])
      }
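
      A rough sketch, not the speaker’s implementation, of how aggregateByKey could
      be bolted on with an implicit enrichment (the object and class names are made
      up, and the imports assume a Spark version whose DStream companion provides
      the pair-DStream implicits):

        import scala.reflect.ClassTag
        import org.apache.spark.streaming.dstream.DStream
        import com.twitter.algebird.Monoid

        object MonoidAggregation {
          implicit class RichPairDStream[K: ClassTag, V: ClassTag](self: DStream[(K, V)]) {
            // Reduce each batch with the monoid's associative plus
            def aggregateByKey(aggregator: Monoid[V]): DStream[(K, V)] =
              self.reduceByKey(aggregator.plus(_, _))
          }
        }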
  17. @rweald Testing Best Practices
      • Try to avoid full integration tests
      • Use in-memory stores for testing
      • Keep logic outside of Spark
      • Use Summingbird in-memory platform???
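
      A minimal in-memory implementation of the MergeableStore trait from slide 14
      (hypothetical code, not from the talk) that lets the write path be unit tested
      without a real Redis:

        import scala.collection.mutable
        import com.twitter.algebird.Monoid

        class InMemoryStore[K, V](implicit monoid: Monoid[V]) extends MergeableStore[K, V] {
          private val data = mutable.Map.empty[K, V]

          def get(key: K): V = data.getOrElse(key, monoid.zero)

          def put(kv: (K, V)): V = {
            data(kv._1) = kv._2
            kv._2
          }

          // Merge uses the same associative plus as the aggregation step
          def merge(kv: (K, V)): V = {
            val updated = monoid.plus(get(kv._1), kv._2)
            data(kv._1) = updated
            updated
          }
        }

      With stores injected this way, the mapping and monoid logic stays outside of
      Spark and can be tested without a SparkContext at all.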