Shannon
November 28, 2018

Introduction to Kafka Streaming by Lindsey Dew & Omnia Ibrahim

Over the last few years we've witnessed the rise of data-driven projects, with Apache Kafka leading the charge for streaming large volumes of data. In this session Omnia and Lindsey introduce Kafka Streams. They share Scala code examples, demonstrate use cases they have at Deliveroo, the problems they faced, how they solved them, and the lessons they've learned.


Transcript

  1. Agenda • Challenges tracking multiple consumers • Solving it with the Streams API: ◦ Stream processor topology ◦ Store/query • Headaches @DeliverooEng
  2. Offset Management (diagram): Consumer Group 1 and Consumer Group 2 read Topic Records (Key, Value) from Partition 1 and Partition 2; Kafka stores an offset per (Topic, Partition, Consumer Group), e.g. (Topic, Partition 1, Consumer Group 1) -> lastOffset
  3. val props = new Properties()
     props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "<kafka-ip>")
     props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-validator")
     props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
     props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[ByteArrayDeserializer].getName)
     val consumer = new KafkaConsumer[String, Array[Byte]](props)
     consumer.subscribe(List("orders", "restaurants").asJava)
     val records = consumer.poll(1000)
  4. @DeliverooEng ./kafka-consumer-groups --group orders-validator --describe

     TOPIC    PARTITION   CURRENT-OFFSET   CONSUMER-ID        CLIENT-ID
     orders   0           29384            consumer-1-cb872   consumer-1
     orders   1           29089            consumer-1-cb872   consumer-1
     orders   2           29853            consumer-1-cb872   consumer-1
  5. Constraints not enforced by Kafka • Client id ◦ libraries have their own defaults • Group id ◦ must not be reused between applications (a sketch of setting both explicitly follows below)
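
     A minimal sketch of guarding against both pitfalls by setting the ids explicitly (the broker address is a placeholder and the client.id value is hypothetical):

     import java.util.Properties
     import org.apache.kafka.clients.consumer.ConsumerConfig

     val props = new Properties()
     props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "<kafka-ip>")
     // group.id identifies the application: reusing it across applications
     // makes them share partitions and each other's committed offsets
     props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-validator")
     // client.id is free-form and only used for logging, metrics and quotas,
     // so an explicit value keeps consumers traceable
     props.put(ConsumerConfig.CLIENT_ID_CONFIG, "orders-validator-1")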
  6. Problem: Debugging Production • How can we track which configuration settings are being used by our consumers? • Which applications are actually active right now?
  7. What lives in __consumer_offsets @DeliverooEng 1. Current offsets, e.g. (group1, orders, 1) -> 112, commitTimestamp, ... 2. Group metadata, e.g. group1 -> (Consumer1-x, 127.0.x.x, orders, 1), (Consumer1-x, 127.0.x.x, orders, 2)
  8. Idea: Use the Streams API (diagram): bind the two kinds of data in __consumer_offsets (offsets and group metadata) into (timeWindow, ActiveGroup) records that a dashboard can query for the latest committed offsets and group metadata of the active groups (group1/client1, group2/client2) @DeliverooEng
  9. Streams API • Real-time stream processing library • Consumes from and produces back to Kafka • Supports stateless and stateful processing • Scalable and fault tolerant • Available for Java, or for Scala via lightbend/kafka-streams-scala
     (diagram: Instance 1 and Instance 2 each run the Streams API and share the partitions p0-p3 of __consumer_offsets)
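
     For orientation, a minimal sketch of a Streams application skeleton, written against the plain Java API from Scala (the talk itself uses the lightbend/kafka-streams-scala wrappers; the application id is hypothetical):

     import java.util.Properties
     import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

     val props = new Properties()
     props.put(StreamsConfig.APPLICATION_ID_CONFIG, "consumer-offsets-dashboard")
     props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "<kafka-ip>")

     val builder = new StreamsBuilder()
     // ... topology goes here: branch, map/filter, aggregate, join, window ...

     val streams = new KafkaStreams(builder.build(), props)
     streams.start()
     sys.addShutdownHook(streams.close())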
  10. Stream Processor Topology @DeliverooEng (diagram): offsetStream branches in two ◦ groupMetadataKeyStream: map (Key: GroupMetadataKey, Value: GroupMetadataValue), then aggregate into latestGroupMetadata (Key: GroupId, Value: ClientDetails) ◦ offsetKeyStream: map/filter (Key: OffsetKey, Value: OffsetValue) into offsetCommitsLastTenMins (Key: GroupId, Value: ConsumerOffsetDetails) ◦ the two branches are then joined into (GroupId, ActiveGroup)
  11. Streams API Concepts - KStream • Unbounded sequence of structured data • Perform computations on it
      val builder = new StreamsBuilderS()
      val offsetStream: KStreamS[Array[Byte], Array[Byte]] = builder.stream("__consumer_offsets")
  12. KStream Methods - Branch
      val Array(offsetKeyStream, groupMetadataKeyStream) = offsetStream.branch(isOffset, isGroupMetadata)

      def isOffset(key: Array[Byte], value: Array[Byte]): Boolean =
        GroupMetadataManager.readMessageKey(ByteBuffer.wrap(key)) match {
          case _: OffsetKey        => true
          case _: GroupMetadataKey => false
        }

      def isGroupMetadata(key: Array[Byte], value: Array[Byte]): Boolean = !isOffset(key, value)
  13. KStream Methods - Map & Filter
      final case class ConsumerOffsetDetails(
        topic: TopicName,
        partition: Int,
        group: GroupId,
        offset: Long,
        commitTimestamp: Long
      )

      offsetKeyStream
        .map[GroupId, ConsumerOffsetDetails]((k: Array[Byte], v: Array[Byte]) => {
          ...
          (groupId, ConsumerOffsetDetails(...))
        })
        .filter(isCommittedLastTenMins)

      def isCommittedLastTenMins(k: GroupId, v: ConsumerOffsetDetails): Boolean = ...

      @DeliverooEng Side effect: map re-partitions the topic
  14. Stream Processor Topology - branch 2 @DeliverooEng (same diagram as slide 10, now highlighting the group-metadata branch: map, then aggregate into the latestGroupMetadata KTable of (GroupId, ClientDetails), joined with offsetCommitsLastTenMins into (GroupId, ActiveGroup))
  15. Streams API Concepts - KTable • KTable ◦ "The present": each key holds its latest value (diagram: the update stream (group1, metaV1), (group2, metaV1), (group1, metaV2) is aggregated into a KTable holding group1 -> metaV2, group2 -> metaV1) @DeliverooEng latestGroupMetadata comes from the groupMetadataKeyStream
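
      A small sketch of that semantics, assuming a hypothetical KStreamS[GroupId, GroupMetadata] named metadataStream and a hypothetical metadataSerde for its values; reducing with "keep the newest value" collapses the update stream into a table of current values:

      val latestPerKey: KTableS[GroupId, GroupMetadata] =
        metadataStream
          .groupByKey(Serialized.`with`(Serdes.String(), metadataSerde))
          .reduce((_, newest) => newest) // table ends up as group1 -> metaV2, group2 -> metaV1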
  16. KTable - Aggregate
      val latestGroupMetadata: KTableS[GroupId, ClientDetails] = groupMetadataKeyStream
        .map[GroupId, ClientDetails]((k: Array[Byte], v: Array[Byte]) => { ... })
        .groupByKey( ??? ) // which Serdes? (answered on the next slides)
        .aggregate(
          () => ClientDetails("", Set.empty[ConsumerInstanceDetails], "", 0),
          (k: String, v: ClientDetails, agg: ClientDetails) => ... // aggregator
        )

      final case class ClientDetails(
        clientId: ClientId,
        members: Set[ConsumerInstanceDetails],
        group: GroupId,
        generationId: Long
      )
  17. SerDes • Built-in: String, Int, Double, Bytes, ... • Custom SerDes? @DeliverooEng
      object ClientDetailsSerDer extends Serializer[ClientDetails] with Deserializer[ClientDetails] {
        override def serialize(topic: String, data: ClientDetails): Array[Byte] = ...
        override def deserialize(topic: String, data: Array[Byte]): ClientDetails = ...
      }

      object CustomSerdes {
        val clientDetailsSerde: Serde[ClientDetails] = Serdes.serdeFrom(ClientDetailsSerDer, ClientDetailsSerDer)
        ...
      }
  18. KTable - Aggregate (with Serdes)
      val latestGroupMetadata: KTableS[GroupId, ClientDetails] = groupMetadataKeyStream
        .map[GroupId, ClientDetails](fromArrayBytesToClientDetails)
        .groupByKey(Serialized.`with`(Serdes.String(), CustomSerdes.clientDetailsSerde))
        .aggregate(
          () => ClientDetails("", Set.empty[ConsumerInstanceDetails], "", 0),
          (k: String, v: ClientDetails, agg: ClientDetails) => ... // aggregator
        )

      Side effect: map re-partitions the topic
  19. KTable - Join @DeliverooEng
      KStream offsetCommitsLastTenMins: (consumer-group-1, ConsumerOffsetDetails(topic: orders, partition: 8, offset: 166, ...)), (consumer-group-2, ConsumerOffsetDetails(topic: orders, partition: 8, offset: 166, ...))
      KTable latestGroupMetadata: (consumer-group-1, ClientDetails(clientId: consumer-1, members: Set(...), ...)), (consumer-group-2, ClientDetails(clientId: consumer-2, members: Set(...), ...))
      JOIN: (consumer-group-1, ActiveGroup(clientDetails: ..., consumerOffsets: ...)), (consumer-group-2, ActiveGroup(clientDetails: ..., consumerOffsets: ...)), ...
  20. KTable - Join
      val joined: KStreamS[GroupId, ActiveGroup] = offsetCommitsLastTenMins
        .join(
          latestGroupMetadata,
          (offsetCommit: ConsumerOffsetDetails, groupMetadata: ClientDetails) =>
            ActiveGroup(groupMetadata, offsetCommit)
        )

      final case class ActiveGroup(clientDetails: ClientDetails, consumerOffsets: ConsumerOffsetDetails)

      implicit val joinedSerde = joinedFromKVOSerde(
        Serdes.String(),                          // key serde
        CustomSerdes.consumerOffsetDetailsSerde,  // KStream value serde
        CustomSerdes.clientDetailsSerde           // KTable value serde
      )
  21. Streams API: Querying (diagram): the joined records are windowed into (timeWindow, ActiveGroup) entries, where ActiveGroup = (ClientDetails, Offsets) per GroupId; the dashboard queries the store to show the currently active groups (group1/client1, group2/client2) per time window @DeliverooEng
  22. Windowing • Group events by event time (diagram: events from Group 1 and Group 2 are bucketed into Time Interval 1, Time Interval 2 and Time Interval 3)
  23. Windowing @DeliverooEng
      val windowedActiveGroup: TimeWindowedKStreamS[GroupId, ActiveGroup] = joined
        .groupByKey(Serialized.`with`(Serdes.String(), CustomSerdes.activeGroup))
        .windowedBy(TimeWindows.of(60000)) // 1-minute windows
  24. State Store • Store and query data • Ephemeral view of the data ◦ changelog topic for fault tolerance
      (diagram: Instance 1 on host1:9090 and Instance 2 on host2:9000 each run the Streams API with a local state store per instance) @DeliverooEng
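
      A sketch of how instances can reach each other's local stores, assuming each instance advertises its own endpoint via the application.server property (the host/port values are illustrative):

      import org.apache.kafka.streams.StreamsConfig
      import org.apache.kafka.streams.state.StreamsMetadata
      import scala.collection.JavaConverters._

      // each instance advertises where its local shards can be queried
      props.put(StreamsConfig.APPLICATION_SERVER_CONFIG, "host1:9090")

      // any instance can then discover which hosts own shards of a store
      // and fan the query out over, e.g., HTTP
      val shards: Iterable[StreamsMetadata] =
        streams.allMetadataForStore("active-group").asScala
      shards.foreach(m => println(s"${m.host}:${m.port}"))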
  25. WindowStore - Store
      windowedActiveGroup
        .reduce(
          (a1: ActiveGroup, a2: ActiveGroup) => a2,
          Materialized
            .as[GroupId, ActiveGroup, WindowStore[Bytes, Array[Byte]]]("active-group")
            .withKeySerde(Serdes.String())
            .withValueSerde(CustomSerdes.activeGroup)
        )
  26. WindowStore - Query
      val activeGroupStore = streams.store(
        "active-group",
        QueryableStoreTypes.windowStore[GroupId, ActiveGroup]()
      )

      // Get all data from the store
      activeGroupStore.all().asScala.toList

      // Get data for a time range
      activeGroupStore.fetchAll(1542621679037L, 1542621979037L)

      // Example record
      // KeyValue([groupId@1543151820000/1543151880000], ActiveGroup)
  27. Under the hood @DeliverooEng
      val builder = new StreamsBuilderS()
      builder.build().describe()

      // Topologies:
      //   Sub-topology: 0
      //     Source: KSTREAM-SOURCE-0000000000 (topics: [__consumer_offsets])
      //       --> KSTREAM-BRANCH-0000000001
      //     Processor: KSTREAM-BRANCH-0000000001 (stores: [])
      //       --> KSTREAM-BRANCHCHILD-0000000003, KSTREAM-BRANCHCHILD-0000000002
      //       <-- KSTREAM-SOURCE-0000000000
      //     ...
      //   Sub-topology: 1
      //     ...
  28. Under the hood (zz85/kafka-streams-viz) @DeliverooEng (topology diagram: the INPUT topic feeds tasks whose state-store and repartition topics are created on the cluster; which of these are a side effect on the cluster?)
  29. Under the hood (zz85/kafka-streams-viz) @DeliverooEng (topology diagram, continued: on the OUTPUT side, state stores are a side effect on the app instances, while changelog topics are a side effect on the cluster)
  30. Headaches that can be managed: how to configure internal topics? • Retention / cleanup config @DeliverooEng
      val streamProperties = new Properties()
      ...
      streamProperties.put("topic.retention.ms", "3600000")
      streamProperties.put("topic.cleanup.policy", "delete")
      • Prefer mapValues / flatMapValues over map to avoid re-partitioning (see the sketch below)
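
      A sketch of the mapValues point, with hypothetical stream, newKey and enrich names: map may change the key, so any stateful operation downstream forces an internal *-repartition topic, while mapValues keeps the key and avoids it:

      val rekeyed = stream.map((k, v) => (newKey(v), enrich(v))) // marks the stream for repartitioning
      val sameKey = stream.mapValues(v => enrich(v))             // key untouched: no repartition topic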
  31. Headaches that can't be managed: internal topics configuration • Internal topics inherit the 50 partitions of __consumer_offsets: KSTREAM-FILTER-x-repartition (50 partitions), KSTREAM-AGGREGATE-x-repartition (50 partitions), KSTREAM-AGGREGATE-x-changelog (50 partitions), plus the active-groups-store state store and its active-groups-store-changelog topic (50 partitions) • They can overload the Kafka cluster • Administration ◦ no garbage collection for topics
  32. Sources • deliveroo/roowhoo-dashborad • "Looking at Kafka's consumers' offsets" • "Why Kafka Streams didn't work for us?" by Alexis Seigneurin • Tools: zz85/kafka-streams-viz
  33. Conclusion • Streams API ◦ Powerful abstractions with lots of extensibility opportunities ◦ Important to understand what gets created • Use with caution