
Introduction to Kafka Streaming by Lindsey Dew & Omnia Ibrahim

Shannon
November 28, 2018


Introduction to Kafka Streaming

Over the last few years we've witnessed the rise of data-driven projects, with Apache Kafka leading the charge in streaming large volumes of data. In this session Omnia and Lindsey introduce Kafka Streams, sharing Scala code examples, demonstrations of use cases at Deliveroo, the problems they faced, how they solved them, and the lessons they've learned.


Transcript

  1. Introduction to Kafka Streaming
    Omnia Ibrahim and Lindsey Dew


  2. Who are we
    @DeliverooEng


  3. Real-time analytics
    Order Status
    Dispatcher plans
    Monolith
    Riders
    Payments
    ...


  4. Agenda
    ● Challenges Tracking Multiple Consumers
    ● Solve it with Streams API:
    ○ Stream processor topology
    ○ Store/Query
    ● Headaches
    @DeliverooEng


  5. Offset Management
    [Diagram: a topic with Partition 1 and Partition 2, consumed by Consumer Group 1 and Consumer Group 2. Each record is (Key, Value) plus an offset, and Kafka tracks (Topic, Partition 1, Consumer Group 1) -> lastOffset.]

  6. val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "...")
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-validator")
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
      classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
      classOf[ByteArrayDeserializer].getName)

    val consumer = new KafkaConsumer[String, Array[Byte]](props)
    consumer.subscribe(java.util.Arrays.asList("orders", "restaurants"))
    val records = consumer.poll(100) // timeout in ms

  7. @DeliverooEng
    ./kafka-consumer-groups --group orders-validator --describe
    TOPIC   PARTITION   CURRENT-OFFSET   CONSUMER-ID        CLIENT-ID
    orders  0           29384            consumer-1-cb872   consumer-1
    orders  2           29853            consumer-1-cb872   consumer-1
    orders  1           29089            consumer-1-cb872   consumer-1

  8. Constraints not enforced by Kafka
    ● Client Id
    ○ Libraries have their own default
    ● Group Id
    ○ Must not be reused between applications
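    Since Kafka doesn't enforce either constraint, a minimal sketch (not from the deck; the values are hypothetical) is to set both ids explicitly rather than rely on library defaults:
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-validator")    // one group id per application
    props.put(ConsumerConfig.CLIENT_ID_CONFIG, "orders-validator-1") // identifies this client in broker logs and metrics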


  9. Problem: Debugging Production
    How can we track which configuration settings our consumers are using?
    Which applications are actually active right now?

  10. @DeliverooEng


  11. __consumer_offsets
    [Diagram: the __consumer_offsets topic holds two kinds of records:
    1. Current offsets, keyed by (group, topic, partition) with the committed offset and commit timestamp, e.g. (group1, orders, 1) -> 112, commitTimestamp, ...
    2. Group metadata, listing each group's members with consumer id, host, and assigned topic-partitions, e.g. (Consumer1-xxx, 127.0.x.x, orders, 1).]
    @DeliverooEng
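    As a sketch of what decoding these records can look like (an assumption, not shown in the deck; Kafka's internal GroupMetadataManager parser is version-sensitive, and record is a hypothetical ConsumerRecord[Array[Byte], Array[Byte]]):
    import java.nio.ByteBuffer
    import kafka.coordinator.group.{GroupMetadataManager, OffsetKey}

    GroupMetadataManager.readMessageKey(ByteBuffer.wrap(record.key)) match {
      case offsetKey: OffsetKey =>
        // Offset-commit record: the key is (group, topic, partition)
        val value = GroupMetadataManager.readOffsetMessageValue(ByteBuffer.wrap(record.value))
        println(s"${offsetKey.key} -> offset ${value.offset} at ${value.commitTimestamp}")
      case _ =>
        // Group-metadata record: members, client ids, hosts, assignments
    }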


  12. Idea: Use Streams API
    [Diagram: consume the binary __consumer_offsets data (offsets and group metadata) with the Streams API into (timeWindow, ActiveGroup) records that a dashboard can query from a store; active groups map each group (group1 -> client1, group2 -> client2) to its latest client details and consumer offsets.]
    @DeliverooEng

  13. Streams API
    ● Real-time stream processing library
    ● Consumes from and produces back to Kafka
    ● Supports stateless and stateful processing
    ● Scalable and fault tolerant
    ● Available for Java, or for Scala using lightbend/kafka-streams-scala
    [Diagram: the partitions p0-p3 of __consumer_offsets split between two application instances, each running the Streams API.]
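    To situate the topology code on the next slides, a minimal runnable skeleton (a sketch; the application id and bootstrap servers are assumptions):
    import java.util.Properties
    import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
    import com.lightbend.kafka.scala.streams.StreamsBuilderS

    val config = new Properties()
    config.put(StreamsConfig.APPLICATION_ID_CONFIG, "consumer-offsets-dashboard")
    config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    val builder = new StreamsBuilderS()
    // ... build the stream processor topology here (slides 14-30) ...

    val streams = new KafkaStreams(builder.build(), config)
    streams.start()
    sys.addShutdownHook(streams.close())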


  14. Stream Processor Topology
    @DeliverooEng
    [Diagram: offsetStream branches in two. Branch 1: offsetKeyStream (Key: OffsetKey, Value: OffsetValue) -> Map/Filter -> offsetCommitsLastTenMins (Key: GroupId, Value: ConsumerOffsetDetails). Branch 2: groupMetadataKeyStream (Key: GroupMetadataKey, Value: GroupMetadataValue) -> Aggregate -> latestGroupMetadata (Key: GroupId, Value: ClientDetails). The two branches Join into (GroupId, ActiveGroup).]

  15. ● Unbounded sequence of structured data
    ● Perform computations on a KStream
    val builder = new StreamsBuilderS()
    val offsetStream: KStreamS[Array[Byte], Array[Byte]] =
      builder.stream("__consumer_offsets")

  16. KStream Methods - Branch
    val Array(offsetKeyStream, groupMetadataKeyStream) =
      offsetStream.branch(isOffset, isGroupMetadata)

    def isOffset(key: Array[Byte], value: Array[Byte]): Boolean =
      GroupMetadataManager.readMessageKey(ByteBuffer.wrap(key)) match {
        case _: OffsetKey => true
        case _: GroupMetadataKey => false
      }

    def isGroupMetadata(key: Array[Byte], value: Array[Byte]): Boolean = !isOffset(key, value)

  17. KStream Methods - Map & Filter
    final case class ConsumerOffsetDetails(
      topic: TopicName,
      partition: Int,
      group: GroupId,
      offset: Long,
      commitTimestamp: Long
    )

    offsetKeyStream
      .map[GroupId, ConsumerOffsetDetails]((k: Array[Byte], v: Array[Byte]) => {
        ...
        (groupId, ConsumerOffsetDetails( …. ))
      })
      .filter(isCommittedLastTenMins)

    def isCommittedLastTenMins(k: GroupId, v: ConsumerOffsetDetails): Boolean = ...
    @DeliverooEng
    Side effect: changing the key re-partitions the topic
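    The deck elides the filter's body; one plausible implementation (an assumption) keeps only offsets committed within the last ten minutes, judged by the record's commit timestamp:
    def isCommittedLastTenMins(k: GroupId, v: ConsumerOffsetDetails): Boolean =
      v.commitTimestamp >= System.currentTimeMillis() - 10 * 60 * 1000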


  18. Stream Processor Topology - branch 2
    @DeliverooEng
    [Diagram: the same topology as slide 14, highlighting branch 2: groupMetadataKeyStream (Key: GroupMetadataKey, Value: GroupMetadataValue) -> Aggregate -> latestGroupMetadata (Key: GroupId, Value: ClientDetails).]

  19. Streams API Concepts - KTable
    ● KTable
    ○ The present
    [Diagram: the KStream events (group1, metaV1), (group2, metaV1), (group1, metaV2) aggregated into a KTable holding the latest value per key: group1 -> metaV2, group2 -> metaV1.]
    @DeliverooEng

  20. KTable - Aggregate
    val latestGroupMetadata: KTableS[GroupId, ClientDetails] =
      groupMetadataKeyStream
        .map[GroupId, ClientDetails]((k: Array[Byte], v: Array[Byte]) => { … })
        .groupByKey( )
        .aggregate(
          () => ClientDetails("", Set.empty[ConsumerInstanceDetails], "", 0),
          (k: String, v: ClientDetails, agg: ClientDetails) => // aggregator
        )

    final case class ClientDetails(
      clientId: ClientId,
      members: Set[ConsumerInstanceDetails],
      group: GroupId,
      generationId: Long
    )
    ???

  21. SerDes
    ● Built-in: String, Int, Double, Bytes, …
    ● Custom SerDes?
    @DeliverooEng
    object ClientDetailsSerDer extends Serializer[ClientDetails]
      with Deserializer[ClientDetails] {
      override def serialize(topic: String, data: ClientDetails): Array[Byte] = ...
      override def deserialize(topic: String, data: Array[Byte]): ClientDetails = ...
    }

    object CustomSerdes {
      val clientDetailsSerde: Serde[ClientDetails] =
        Serdes.serdeFrom(ClientDetailsSerDer, ClientDetailsSerDer)
      ….
    }

  22. KTable - Aggregate
    val latestGroupMetadata: KTableS[GroupId, ClientDetails] =
      groupMetadataKeyStream
        .map[GroupId, ClientDetails](fromArrayBytesToClientDetails)
        .groupByKey(
          Serialized.`with`(Serdes.String(), CustomSerdes.clientDetailsSerde)
        )
        .aggregate(
          () => ClientDetails("", Set.empty[ConsumerInstanceDetails], "", 0),
          (k: String, v: ClientDetails, agg: ClientDetails) => // aggregator
        )
    Side effect: re-partitions the topic
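    The aggregator body is elided on the slide; the simplest version consistent with "latest group metadata" (an assumption about the intended semantics) keeps the newest ClientDetails per group:
    (k: String, v: ClientDetails, agg: ClientDetails) => v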


  23. KTable - Join
    [Diagram: joining the KStream offsetCommitsLastTenMins, e.g. (consumer-group-1, ConsumerOffsetDetails(topic: orders, partition: 8, offset: 166, ...)), with the KTable latestGroupMetadata, e.g. (consumer-group-1, ClientDetails(clientId: consumer-1, members: Set(...), ...)), yields (consumer-group-1, ActiveGroup(clientDetails: ..., consumerOffsets: ...)), and likewise for each other group.]
    @DeliverooEng

  24. KTable - Join
    val joined: KStreamS[GroupId, ActiveGroup] = offsetCommitsLastTenMins
      .join(
        latestGroupMetadata,
        (offsetCommit: ConsumerOffsetDetails, groupMetadata: ClientDetails) =>
          ActiveGroup(groupMetadata, offsetCommit)
      )

    final case class ActiveGroup(clientDetails: ClientDetails,
      consumerOffsets: ConsumerOffsetDetails)

    implicit val joinedSerde = joinedFromKVOSerde(
      Serdes.String(),                          // Key Serde
      CustomSerdes.consumerOffsetDetailsSerde,  // KStream Value Serde
      CustomSerdes.clientDetailsSerde           // KTable Value Serde
    )

  25. Streams API: Querying
    [Diagram: the joined stream is windowed by time and stored as (timeWindow, ActiveGroup) records; the dashboard queries the store to list the active groups per time window, each ActiveGroup holding (ClientDetails, Offsets).]
    @DeliverooEng

  26. Windowing
    ● Group events by event time
    [Diagram: events from Group 1 and Group 2 bucketed into Time Interval 1, Time Interval 2, and Time Interval 3.]

  27. Windowing
    val windowedActiveGroup: TimeWindowedKStreamS[GroupId, ActiveGroup] =
      joined
        .groupByKey(
          Serialized.`with`(Serdes.String(), CustomSerdes.activeGroup)
        )
        .windowedBy(TimeWindows.of(60000)) // one-minute windows
    @DeliverooEng

  28. State Store
    ● Store and query data
    ● Ephemeral view of the data
    ○ Changelog for fault tolerance
    [Diagram: two app instances, each running the Streams API with its own local state store, reachable at host1:9090 and host2:9000.]
    @DeliverooEng
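    Since each instance only holds part of the state, a sketch of discovering which instance hosts which store partitions (an assumption based on the Kafka Streams interactive-queries API; the host and store names follow the diagram and later slides):
    import scala.collection.JavaConverters._
    import org.apache.kafka.streams.StreamsConfig

    // Advertise where this instance can be reached
    config.put(StreamsConfig.APPLICATION_SERVER_CONFIG, "host1:9090")

    // Find which instance hosts which partitions of the "active-group" store
    streams.allMetadataForStore("active-group").asScala.foreach { m =>
      println(s"${m.host}:${m.port} -> ${m.topicPartitions}")
    }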


  29. WindowStore - Store
    windowedActiveGroup
      .reduce(
        (a1: ActiveGroup, a2: ActiveGroup) => a2, // keep the latest value per window
        Materialized
          .as[GroupId, ActiveGroup, WindowStore[Bytes, Array[Byte]]]("active-group")
          .withKeySerde(Serdes.String())
          .withValueSerde(CustomSerdes.activeGroup)
      )

  30. WindowStore - Query
    val activeGroupStore = streams.store(
      "active-group",
      QueryableStoreTypes.windowStore[GroupId, ActiveGroup]()
    )

    // Get all data from the store
    activeGroupStore.all().asScala.toList

    // Get data between two timestamps
    activeGroupStore.fetchAll(1542621679037L, 1542621979037L)

    // Example record: the key carries the window's start/end epoch millis
    KeyValue([groupId@1543151820000/1543151880000], ActiveGroup)

  31. Under the hood - Topologies
    @DeliverooEng
    val builder = new StreamsBuilderS()
    builder.build().describe()
    // Topologies:
    // Sub-topology: 0
    //   Source: KSTREAM-SOURCE-0000000000 (topics: [__consumer_offsets])
    //     --> KSTREAM-BRANCH-0000000001
    //   Processor: KSTREAM-BRANCH-0000000001 (stores: [])
    //     --> KSTREAM-BRANCHCHILD-0000000003, KSTREAM-BRANCHCHILD-0000000002
    //     <-- KSTREAM-SOURCE-0000000000
    //   ……
    // Sub-topology: 1
    //   …...

  32. Under the hood
    zz85/kafka-streams-viz
    @DeliverooEng
    [Diagram: the input side of the topology rendered with zz85/kafka-streams-viz. Legend: input topic, repartition topic, state store, task. Annotations ask which parts are a side effect on the cluster.]

  33. Under the hood
    zz85/kafka-streams-viz
    @DeliverooEng
    [Diagram: the output side of the topology: changelog topics are a side effect on the cluster; state stores are a side effect on the app instances.]

  34. HOW MANY SIDE-EFFECT TOPICS?
    @DeliverooEng
    4: 2 repartition topics (one per key-changing map feeding a stateful operation) and 2 changelog topics (one per state store).

  35. Headaches that can be managed
    How to manage internal topics?
    ● Retention / cleanup config
    ● mapValues / flatMapValues (see the sketch below)
    @DeliverooEng
    val streamProperties = new Properties()
    ….
    streamProperties.put("topic.retention.ms", "3600000")
    streamProperties.put("topic.cleanup.policy", "delete")
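    A sketch of the second point (keyedOffsetStream is a hypothetical stream already keyed by GroupId, as on slide 17): mapValues leaves the key untouched, so, unlike map, it does not mark the stream for re-partitioning:
    // No repartition topic is created: the key is unchanged
    keyedOffsetStream.mapValues((v: ConsumerOffsetDetails) => v.offset)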


  36. Headaches that can't be managed
    Internal topics configuration
    [Diagram: __consumer_offsets has 50 partitions, so every internal topic inherits 50 partitions: KSTREAM-FILTER-x-repartition, KSTREAM-AGGREGATE-x-repartition, KSTREAM-AGGREGATE-x-changelog, and the active-groups-store changelog.]
    ● Overload Kafka cluster
    ● Administration
    ○ No garbage collection for topics

  37. Sources
    ● deliveroo/roowhoo-dashborad
    ● Looking at Kafka's consumers' offsets
    ● Why Kafka Streams didn't work for us? by Alexis Seigneurin
    ● zz85/kafka-streams-viz tool

  38. Conclusion
    ● Streams API
    ○ Powerful abstractions with lots of extensibility opportunities
    ○ Important to understand what gets created
    Use with Caution