Connect S3 with Kafka leveraging Akka Streams

A brief introduction to Akka Streams, followed by an example application that connects S3 and Kafka with high throughput by leveraging SQS.

Saint1991

May 13, 2017

Transcript

  1. Connect S3 with Kafka leveraging Akka Streams

  2. Who am I?

    - Seiya Mizuno (@Saint1991)
    - Developing a data processing platform like the one pictured on the slide
  3. Agenda

    - Introduction to Akka Streams (HERE!)
        - Components of Akka Streams
        - Glance at GraphStage
    - Connect S3 with Kafka using Alpakka
  4. Introduction to Akka Streams

  5. What is Akka Streams?

    - The toolkit to process data streams on Akka actors
    - Describes a processing pipeline as a graph
        - Easy to define complex pipelines

    Diagram labels: Source, Flow, Sink, Broadcast, Flow, Merge

    - Input (Source): generating stream elements; fetching stream elements from outside
    - Processing (Flow): processing stream elements sent from upstream one by one
    - Output (Sink): to a file; to outer resources
  6. Sample code!

        import akka.Done
        import akka.actor.ActorSystem
        import akka.stream.{ActorMaterializer, ClosedShape}
        import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
        import scala.concurrent.{Await, Future}
        import scala.concurrent.duration._

        implicit val system = ActorSystem()
        implicit val dispatcher = system.dispatcher
        implicit val mat = ActorMaterializer()

        // stream elements
        val s3Keys = List("key1", "key2")

        // a sink that prints received stream elements
        val sinkForeach = Sink.foreach(println)

        val blueprint: RunnableGraph[Future[Done]] = RunnableGraph.fromGraph(GraphDSL.create(sinkForeach) {
          implicit builder: GraphDSL.Builder[Future[Done]] =>
          sink: Sink[String, Future[Done]]#Shape =>

            import GraphDSL.Implicits._

            // a source that sends the elements defined above
            val src = Source(s3Keys)

            // a flow that maps a received element to the URL of Bucket A
            val flowA = Flow[String].map(key => s"s3://bucketA/$key")

            // a flow that maps a received element to the URL of Bucket B
            val flowB = Flow[String].map(key => s"s3://bucketB/$key")

            // a junction that broadcasts received elements to 2 outlets
            val broadcast = builder.add(Broadcast[String](2))

            // a junction that merges received elements from 2 inlets
            val merge = builder.add(Merge[String](2))

            // THIS IS GREAT FUNCTIONALITY OF GraphDSL:
            // easy to describe a graph
            src ~> broadcast ~> flowA ~> merge ~> sink
                   broadcast ~> flowB ~> merge

            ClosedShape
        })

        // Run the graph!!!
        blueprint.run() onComplete { _ =>
          // terminate the actor system when the graph is completed
          Await.ready(system.terminate(), 10.seconds)
        }
  7. GOOD!

    - Easy to use without knowing the details of Akka Actor
  8. Akka Streams implicitly does everything

    Annotations on the sample code from slide 6:

        implicit val dispatcher = system.dispatcher  // dispatches threads to actors
        implicit val mat = ActorMaterializer()       // creates actors

    When RunnableGraph#run is called, the materializer creates Akka actors based on the blueprint, and processing starts!!!
  9. Conclusion

    - Build a graph with Source, Flow, Sink, etc.
    - Declare the materializer as an implicit

    RunnableGraph → ActorMaterializer → Actors

    It works with actors almost automatically!!!
  10. Tips

    Notes on the sample code from slide 6:

    - To return a materialized value using GraphDSL, the graph component that creates that materialized value has to be passed to GraphDSL#create, so it must be defined outside the GraphDSL builder… orz (in the sample, sinkForeach is passed as GraphDSL.create(sinkForeach)).
    - If no materialized value is defined, the blueprint does not return a completion future…
    - The process will not complete until the ActorSystem is terminated. Don't forget to terminate it!!! (In the sample, system.terminate() is called in onComplete.)
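The materialized-value tip also shows up in the fluent DSL. The following is a minimal sketch, assuming akka-stream 2.5.x is on the classpath; the object name, the `runOnce` helper, and the sample keys are made up for illustration. `to` keeps the source's materialized value (`NotUsed` here), while `toMat(...)(Keep.right)` keeps the sink's completion future.

```scala
import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, RunnableGraph, Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object MaterializedValueSketch {
  // Hypothetical sample elements, mirroring the keys used on the slides.
  val keys = List("key1", "key2")

  // `to` keeps the left (source) materialized value: no completion future.
  val withoutFuture: RunnableGraph[NotUsed] =
    Source(keys).to(Sink.foreach[String](println))

  // `toMat` with Keep.right keeps the sink's Future[Done].
  val withFuture: RunnableGraph[Future[Done]] =
    Source(keys).toMat(Sink.foreach[String](println))(Keep.right)

  // Runs the graph once, waits for completion, then terminates the system.
  def runOnce(): Done = {
    implicit val system: ActorSystem = ActorSystem("materialized-value-sketch")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try Await.result(withFuture.run(), 10.seconds)
    finally Await.ready(system.terminate(), 10.seconds)
  }
}
```

Only `withFuture` gives the caller something to wait on before terminating the ActorSystem, which is why the sample on the slides passes the sink into the graph builder.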
  11. Glance at GraphStage

  12. Remarkable features of Akka Streams are…

    - Asynchronous message passing
        - Efficient use of CPU
    - Back pressure

    Source ⇄ Sink: ① Request the next element ② Send an element

    Upstreams send elements only when they have received requests from downstream, so the downstream's buffer will not overflow.
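The request-then-send protocol can be modelled in a few lines of plain Scala. This is only a toy sketch of the idea, not how Akka Streams is implemented; `PullSource`, `BoundedSink`, and the capacity of 4 are hypothetical names and values chosen for illustration.

```scala
import scala.collection.mutable

object BackPressureSketch {
  // Upstream that hands out an element only when asked for one.
  final class PullSource[A](elements: Iterator[A]) {
    var emitted = 0
    def request(): Option[A] =
      if (elements.hasNext) { emitted += 1; Some(elements.next()) } else None
  }

  // Downstream with a bounded buffer; it requests only while it has room.
  final class BoundedSink[A](capacity: Int) {
    val buffer = mutable.Queue.empty[A]
    var maxBuffered = 0

    def fillFrom(source: PullSource[A]): Unit = {
      var exhausted = false
      while (buffer.size < capacity && !exhausted) {
        source.request() match {
          case Some(a) =>
            buffer.enqueue(a)
            maxBuffered = math.max(maxBuffered, buffer.size)
          case None =>
            exhausted = true
        }
      }
    }

    def consumeOne(): Option[A] =
      if (buffer.isEmpty) None else Some(buffer.dequeue())
  }

  // Returns (emitted after first fill, emitted after refill, largest buffer size seen).
  def demo(): (Int, Int, Int) = {
    val source = new PullSource((1 to 100).iterator)
    val sink = new BoundedSink[Int](capacity = 4)
    sink.fillFrom(source)          // the sink asks until its buffer is full
    val afterFill = source.emitted // only 4 of the 100 elements have left the source
    sink.consumeOne()              // consuming frees one slot...
    sink.fillFrom(source)          // ...so exactly one more element is requested
    (afterFill, source.emitted, sink.maxBuffered)
  }
}
```

Because the source never emits without a request, the buffer size is bounded by the sink's capacity regardless of how many elements the source could produce.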
  13. What is GraphStage?

    Source ⇄ Sink: ① Request the next element ② Send an element

    Every graph component is a GraphStage!!

    Not found in the Akka Streams standard library? But want back pressure??? Implement custom GraphStages!!!
  14. SourceStage that emits Fibonacci

        import akka.stream.{Attributes, Outlet, SourceShape}
        import akka.stream.stage.{GraphStage, GraphStageLogic, OutHandler}

        class FibonacciSource(to: Int) extends GraphStage[SourceShape[Int]] {

          // Define the shape of the graph:
          // a SourceShape that has an outlet that emits Int elements
          val out: Outlet[Int] = Outlet("Fibonacci.out")
          override val shape = SourceShape(out)

          // A new instance is created every time RunnableGraph#run is called,
          // so mutable state must be initialized within the GraphStageLogic
          override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
            new GraphStageLogic(shape) {
              var fn_2 = 0
              var fn_1 = 0
              var n = 0

              setHandler(out, new OutHandler {
                // called every time a request is received from downstream (backpressure)
                override def onPull(): Unit = {
                  val fn =
                    if (n == 0) 0
                    else if (n == 1) 1
                    else fn_2 + fn_1
                  if (fn >= to) completeStage() // terminate this stage with completion
                  else push(out, fn)            // send an element to the downstream
                  fn_2 = fn_1
                  fn_1 = fn
                  n += 1
                }
              })
            }
        }
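The state updates inside `onPull` can be checked without Akka. The sketch below replays the same bookkeeping as a plain Scala `Iterator`, where each `next()` call plays the role of one pull from downstream; `FibonacciCheck` and `fibonacciBelow` are hypothetical names, and this is a stand-in for the stage's logic rather than a GraphStage.

```scala
object FibonacciCheck {
  // Mirrors the state updates in onPull: each next() stands for one pull.
  def fibonacciBelow(to: Int): Iterator[Int] = new Iterator[Int] {
    private var fn_2 = 0
    private var fn_1 = 0
    private var n = 0

    // The value the stage would compute on the next pull.
    private def current: Int =
      if (n == 0) 0 else if (n == 1) 1 else fn_2 + fn_1

    // The stage completes once the computed value reaches `to`.
    def hasNext: Boolean = current < to

    def next(): Int = {
      val fn = current
      fn_2 = fn_1
      fn_1 = fn
      n += 1
      fn
    }
  }
}
```

For example, `FibonacciCheck.fibonacciBelow(100).toList` yields `List(0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89)`. With Akka on the classpath, the stage itself would typically be wrapped with `Source.fromGraph(new FibonacciSource(100))` and connected to a sink.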
  15. Connect S3 with Kafka

  16. Connect S3 with Kafka

    Diagram labels: Docker Container, Direct connect, Put 2.5TB/day!!!

    Must be scalable
  17. Our architecture

    Diagram labels: Direct connect

    ① Notify Created Events
    ② Receive object keys to ingest …
    ③ Download
    ④ Produce

    Distribute object keys to containers (works as a load balancer)
  18. Amazon SQS

    - At least once
        - = Sometimes duplicates
    - Once an event is read, it becomes invisible, and basically no consumer receives the same event until the visibility timeout has passed
        - Load balancing
    - Elements are not deleted until an Ack is sent
        - It is retriable: simply do not send an Ack when a failure occurs
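The visibility-timeout and Ack behaviour described above can be illustrated with a small in-memory model. This is a sketch in plain Scala under simplifying assumptions (a logical clock passed in as `now`, message bodies used as identifiers); `SketchQueue` is a hypothetical class and not the AWS SDK or the Alpakka API.

```scala
import scala.collection.mutable

object VisibilityTimeoutSketch {
  // In-memory stand-in for a queue with SQS-like semantics (not the AWS API).
  final class SketchQueue(visibilityTimeout: Long) {
    // message body -> time from which the message is visible (again)
    private val visibleAt = mutable.LinkedHashMap.empty[String, Long]

    def send(body: String): Unit = visibleAt.update(body, 0L)

    // Returns one visible message and hides it for `visibilityTimeout` ticks.
    def receive(now: Long): Option[String] = {
      val found = visibleAt.collectFirst { case (body, at) if at <= now => body }
      found.foreach(body => visibleAt.update(body, now + visibilityTimeout))
      found
    }

    // Ack: only now is the message actually deleted.
    def ack(body: String): Unit = visibleAt -= body
  }

  def demo(): List[Option[String]] = {
    val queue = new SketchQueue(visibilityTimeout = 30L)
    queue.send("s3://bucket/key1")
    val first  = queue.receive(now = 0L)   // delivered, then hidden
    val hidden = queue.receive(now = 10L)  // other consumers see nothing
    val retry  = queue.receive(now = 30L)  // no Ack was sent: delivered again
    retry.foreach(queue.ack)               // processing succeeded this time
    val gone   = queue.receive(now = 100L) // deleted after the Ack
    List(first, hidden, retry, gone)
  }
}
```

The second delivery in `demo` is the "sometimes duplicates" case: a consumer that fails, or is merely slow, before sending its Ack causes the same event to be handed out again.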
  19. Various connector libraries!!

    - Alpakka (implementations of GraphStages)
        - SQS Connector
            - Read events from SQS
            - Ack
        - S3 Connector
            - Download the content of an S3 object
    - Reactive Kafka
        - Produce content to Kafka

    https://github.com/akka/alpakka/tree/master/sqs
    https://github.com/akka/alpakka/tree/master/s3
    https://github.com/akka/reactive-kafka
  20. S3 → Kafka

        // Alpakka S3 connector
        val src: Source[ByteString, NotUsed] = S3Client().download(bucket, key)

        // a built-in flow to decompress gzipped content
        val decompress: Flow[ByteString, ByteString, NotUsed] = Compression.gunzip()

        // a built-in flow to divide file content into lines
        val lineFraming: Flow[ByteString, ByteString, NotUsed] =
          Framing.delimiter(delimiter = ByteString("\n"), maximumFrameLength = 65536, allowTruncation = false)

        // Reactive Kafka producer sink
        val sink: Sink[ProducerMessage.Message[Array[Byte], Array[Byte], Any], Future[Done]] =
          Producer.plainSink(producerSettings)

        val blueprint: RunnableGraph[Future[String]] = src
          .via(decompress)
          .via(lineFraming)
          .via(Flow[ByteString]
            .map(_.toArray)
            // convert binary to a ProducerRecord of Kafka
            .map { record =>
              ProducerMessage.Message[Array[Byte], Array[Byte], Null](
                new ProducerRecord[Array[Byte], Array[Byte]](conf.topic, record),
                null
              )
            })
          .toMat(sink)(Keep.right)
          // to return a future of the completed object key when blueprint.run() is called
          .mapMaterializedValue { done => done.map(_ => objectLocation) }
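The decompress and line-framing steps can be tried in isolation on in-memory data. The sketch below assumes akka-stream 2.5.x on the classpath; the object name and the sample payload are made up, and `Compression.gzip` is used only as a stand-in for a gzipped S3 object, so no S3 or Kafka access is involved.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Compression, Framing, Sink, Source}
import akka.util.ByteString
import scala.concurrent.Await
import scala.concurrent.duration._

object GunzipFramingSketch {
  def lines(): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("framing-sketch")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try {
      val result = Source.single(ByteString("line1\nline2\nline3\n"))
        .via(Compression.gzip)      // stand-in for a gzipped S3 object
        .via(Compression.gunzip())  // the same decompress step as on the slide
        .via(Framing.delimiter(
          delimiter = ByteString("\n"),
          maximumFrameLength = 65536,
          allowTruncation = false))
        .map(_.utf8String)          // one element per line, delimiter removed
        .runWith(Sink.seq)
      Await.result(result, 10.seconds)
    } finally {
      Await.ready(system.terminate(), 10.seconds)
    }
  }
}
```

With `allowTruncation = false`, the framing stage fails the stream if the input ends in the middle of a line, so the sample payload ends with a delimiter.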
  21. Overall

        implicit val mat: Materializer = ActorMaterializer(
          ActorMaterializerSettings(system).withSupervisionStrategy(ex =>
            ex match {
              case ex: Throwable =>
                system.log.error(ex, "an error occurs - skip and resume")
                Supervision.Resume
            })
        )

        val src = SqsSource(queueUrl)    // Alpakka SqsSource
        val sink = SqsAckSink(queueUrl)  // Alpakka SqsAckSink

        val blueprint: RunnableGraph[Future[Done]] = src
          .via(Flow[Message]
            // parse an SQS message into the keys of the S3 objects to consume
            .map(parse)
            .mapAsyncUnordered(concurrency) { case (msg, events) =>
              Future.sequence(
                events.collect { case event: S3Created =>
                  // run the S3 -> Kafka graph
                  S3KafkaGraph(event.location).run() map { completedLocation =>
                    // delete the successfully produced file
                    s3.deleteObject(completedLocation.bucket, completedLocation.key)
                  }
                }
              ) map (_ => msg -> Ack()) // Ack a successfully handled message
            })
          .toMat(sink)(Keep.right)

    - Workaround for duplication in SQS
    - With supervision Resume, the app keeps going and ignores failed messages (such messages become visible again after the visibility timeout, but are deleted after the retention period)
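The skip-and-resume behaviour can be observed on a toy stream. This is a minimal sketch assuming akka-stream 2.5.x on the classpath; the object name, the element range, and the artificial failure on element 3 are hypothetical. Instead of configuring the materializer as on the slide, it sets the supervision strategy through stream attributes, using the built-in `Supervision.resumingDecider`.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorAttributes, ActorMaterializer, Supervision}
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object ResumeSketch {
  def survivors(): Seq[Int] = {
    implicit val system: ActorSystem = ActorSystem("resume-sketch")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    import system.dispatcher // execution context for the Futures below
    try {
      val result = Source(1 to 5)
        .mapAsyncUnordered(parallelism = 2) { n =>
          Future {
            // Artificial failure standing in for a failed S3 -> Kafka run.
            if (n == 3) throw new IllegalStateException(s"failed on $n")
            n
          }
        }
        // Resume: drop the failed element and keep the stream running.
        .withAttributes(ActorAttributes.supervisionStrategy(Supervision.resumingDecider))
        .runWith(Sink.seq)
      // mapAsyncUnordered does not preserve order, so sort before returning.
      Await.result(result, 10.seconds).sorted
    } finally {
      Await.ready(system.terminate(), 10.seconds)
    }
  }
}
```

The failed element never reaches the sink. In the pipeline on the slide, that corresponds to a message for which no Ack is emitted, so it stays in the queue and becomes visible again later.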
  22. Efficiency

    Handle … TB/day of data with … cores

    Diagram labels: Direct connect

    ① Notify Created Events
    ② Receive object locations to ingest …
    ③ Download
    ④ Produce
  23. Conclusion

    Easily implement stream processing with high resource efficiency and back pressure, even if you are not familiar with Akka Actor!
  24. Conclusion

    Easy to connect to outer resources thanks to the Alpakka connectors!!!

  25. gists

    - Sample code of GraphDSL (first example)
    - FibonacciSource
    - FlowStage with Buffer (not in these slides)

    https://gist.github.com/Saint1991/d2737721551bc908f48b08e15f0b12d4
    https://gist.github.com/Saint1991/2aa5841eea5669e8b86a5eb2df8ecb15
    https://gist.github.com/Saint1991/29d097f83942d52b598cda20372ad671