Stream processing with ksqlDB and Apache Kafka

Kafka delivers real-time events at scale, and with libraries like KStreams, Java developers can transform those events. In this talk we introduce ksqlDB, which offers a SQL interface on top of KStreams, enabling continuous, interactive queries without requiring any Java or Python knowledge. ksqlDB lets all Apache Kafka consumers route messages and perform both stateful and stateless transformations to unlock new data insights. With ksqlDB, your data in motion is as accessible as the stale records traditionally locked away in a relational database.

In this session, after a brief introduction to Apache Kafka, we'll dive into using ksqlDB to manage data streams pulled, in real time, from the Minneapolis air traffic control system. Along the way, you'll learn the ins and outs of how ksqlDB works and see patterns that apply more broadly to common high-volume use cases like log monitoring, insurance, financial services, and consumer retail.

Keith Resar

May 20, 2021

Transcript

  1. Agenda:
     1. Data Integration (a primer!)
     2. Kafka + data transformation (another primer)
     3. ksqlDB - SQL interface to transform data in motion
  2. Data Integration: moving data from a source (A) to a target (B). Sources and targets include relational databases, NoSQL/HBase, application logs, and custom data.
  3. Direct Data Integration, e.g. MySQL → Salesforce: (1) a data source, (2) a data target, and (3) dead letters for records that fail along the way.
  4. Custom Data Integration FAIL, 1 of 3: Ephemeral isn't useful, and stateful is hard. Stateless transformations are easy, but they limit your long-term capability. Stateful transformations add dependencies and a lot of complicated scaffolding to build and maintain. You will get these wrong - consider how to recover from failure, or how to restart your integration after rolling out a new version.
  5. Custom Data Integration FAIL, 2 of 3: Point-to-point scales like bags of rocks. The first one sucks, and every one after that is even worse. Tight coupling slows development velocity and adds operational risk. ➤ A → B, sure, that's doable. But what about A → C? A → D? When is it too much? ➤ How do you manage different encodings, transformations, and schemas?
  6. Custom Data Integration FAIL, 3 of 3: Design for now, but fail with scale. Frequent advice recommends avoiding premature scaling, so where will you invest development, testing, and ops time? ➤ Build for scale from day one, or rebuild later? ➤ What drives scale - traffic volume, or longer (synchronous) processing requirements? ➤ How do you support scale-out, coordination/clustering, and work delegation?
  7. Instantly connect popular data sources & sinks (e.g. Data Diode): 120+ pre-built connectors - 90+ Confluent developed & supported, 30+ partner supported and Confluent verified.
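
     Connectors can be configured through Connect's REST API, or - keeping with this talk's theme - directly from ksqlDB. A minimal sketch: the connector class is real, but the connection details, credentials, and topic prefix below are placeholder assumptions.

     -- Minimal sketch: create a JDBC source connector from ksqlDB.
     -- Connection URL, credentials, column, and prefix are illustrative.
     CREATE SOURCE CONNECTOR flights_jdbc_source WITH (
       'connector.class'          = 'io.confluent.connect.jdbc.JdbcSourceConnector',
       'connection.url'           = 'jdbc:mysql://localhost:3306/demo',
       'connection.user'          = 'demo',
       'connection.password'      = 'demo-secret',
       'mode'                     = 'incrementing',
       'incrementing.column.name' = 'id',
       'topic.prefix'             = 'jdbc-'
     );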
  8. Kafka Data Transformation (A → B): Single Message Transforms (SMTs) in Connect offer basic stateless routing, key changes, and column changes.
  9. Kafka Data Transformation (A → B): Single Message Transforms (SMTs) in Connect offer basic stateless routing, key changes, and column changes.

     // Example Single Message Transform (SMT) //
     {
       "name": "exampleSMTRouter",
       "config": {
         ……
         "transforms": "routeRecords",
         "transforms.routeRecords.type": "org.apache.kafka.connect.transforms.RegexRouter",
         "transforms.routeRecords.regex": "(.*)",
         "transforms.routeRecords.replacement": "$1-test"
         ……
       }
     }
  10. Kafka Data Transformation (A → B) with a KStreams app: advanced message transforms in Java, reading from a source topic and writing to a destination topic.

      // Simple word count app using the KStreams library (Scala DSL) //
      import org.apache.kafka.streams.KafkaStreams
      import org.apache.kafka.streams.scala.StreamsBuilder
      import org.apache.kafka.streams.scala.ImplicitConversions._
      import org.apache.kafka.streams.scala.kstream.{KStream, KTable}
      import org.apache.kafka.streams.scala.serialization.Serdes._

      val builder = new StreamsBuilder()

      // Read each line of text from the input topic
      val textLines: KStream[String, String] =
        builder.stream[String, String]("streams-plaintext-input")

      // Split lines into words, group by word, and keep a running count
      val wordCounts: KTable[String, Long] = textLines
        .flatMapValues(textLine => textLine.toLowerCase.split("\\W+"))
        .groupBy((_, word) => word)
        .count()

      // Emit the continuously updated counts to the output topic
      wordCounts.toStream.to("streams-wordcount-output")

      // `config` holds the usual application.id / bootstrap.servers properties
      val streams: KafkaStreams = new KafkaStreams(builder.build(), config)
      streams.start()
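
      For contrast, the same word count needs no Java at all in ksqlDB. A minimal sketch, assuming the same two topics and that each record value is a plain text line; the stream and table names are illustrative, and the split is simplified to spaces rather than \W+.

      -- Declare the input topic as a stream of text lines
      CREATE STREAM text_lines (line VARCHAR)
        WITH (kafka_topic = 'streams-plaintext-input', value_format = 'KAFKA');

      -- Split each line into one row per word
      CREATE STREAM words AS
        SELECT EXPLODE(SPLIT(LCASE(line), ' ')) AS word
        FROM text_lines
        EMIT CHANGES;

      -- Continuously count occurrences of each word
      CREATE TABLE word_counts AS
        SELECT word, COUNT(*) AS total
        FROM words
        GROUP BY word
        EMIT CHANGES;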
  11. Demo Data Schema - three topics:
      ➤ airlines: code (KEY), name
      ➤ flights: flight (KEY), code, flight_num, takeoff_time, landing_time
      ➤ positions: flight (KEY), timestamp_ms, altitude, lat, lon
  12. Demo Data Schema - airlines: static data mapping airline codes to airline names, used to dereference encoded flight names into something more human readable.

      code (KEY) | name
      DAL        | Delta Air Lines
      SWA        | Southwest Airlines
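
      In ksqlDB, reference data like this is naturally modeled as a table. A minimal sketch, assuming the records land on an airlines topic in JSON (both topic name and format are assumptions):

      -- Model the static airline lookup data as a ksqlDB table
      CREATE TABLE airlines (
        code VARCHAR PRIMARY KEY,
        name VARCHAR
      ) WITH (kafka_topic = 'airlines', value_format = 'JSON');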
  13. Demo Data Schema - flights: streaming data, updated as each new flight plan is registered and whenever the flight status changes.

      flight (KEY) | code | flight_num | takeoff_time  | landing_time
      DAL1232      | DAL  | 1232       | 1620337680000 | null
      SWA345       | SWA  | 345        | 1620335280000 | 1620338340000
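
      A minimal ksqlDB sketch for this topic, plus a join that dereferences the airline code against the airlines table above (topic name and JSON format are assumptions):

      -- Model flight plan updates as a stream
      CREATE STREAM flights (
        flight VARCHAR KEY,
        code VARCHAR,
        flight_num INT,
        takeoff_time BIGINT,
        landing_time BIGINT
      ) WITH (kafka_topic = 'flights', value_format = 'JSON');

      -- Enrich each flight with the human-readable airline name
      CREATE STREAM flights_enriched AS
        SELECT f.code, f.flight, a.name AS airline,
               f.flight_num, f.takeoff_time, f.landing_time
        FROM flights f
        LEFT JOIN airlines a ON f.code = a.code
        EMIT CHANGES;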
  14. Demo Data Schema - positions: streaming data, updated with each new position report throughout the flight.

      flight (KEY) | timestamp_ms  | altitude | lat    | lon
      SKW3984      | 1620314506000 | 35000    | 45.701 | -104.29
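
      The corresponding ksqlDB sketch, again assuming a JSON-formatted positions topic, along with a push query that continuously tails matching position reports as they arrive:

      -- Model position reports as a stream
      CREATE STREAM positions (
        flight VARCHAR KEY,
        timestamp_ms BIGINT,
        altitude INT,
        lat DOUBLE,
        lon DOUBLE
      ) WITH (kafka_topic = 'positions', value_format = 'JSON');

      -- Push query: continuously emit reports from high-altitude flights
      SELECT flight, altitude, lat, lon
      FROM positions
      WHERE altitude > 30000
      EMIT CHANGES;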
  15. Windowed Aggregation (grouping events over time):
      ➤ Tumbling - fixed-size, non-overlapping windows (window n, n+1, n+2)
      ➤ Hopping - fixed-size windows that overlap (window n, n+1, n+2)
      ➤ Session - windows bounded by inactivity; a new window starts when Δt > inactivity gap
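
      In ksqlDB these windows are declared directly in SQL. A minimal sketch over the positions stream above; the one-minute and five-minute sizes are illustrative.

      -- Tumbling window: count position reports per flight per minute
      CREATE TABLE reports_per_minute AS
        SELECT flight, COUNT(*) AS reports
        FROM positions
        WINDOW TUMBLING (SIZE 1 MINUTE)
        GROUP BY flight
        EMIT CHANGES;

      -- Session window: reports separated by < 5 minutes form one session
      SELECT flight, COUNT(*) AS reports
      FROM positions
      WINDOW SESSION (5 MINUTES)
      GROUP BY flight
      EMIT CHANGES;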
  16. Recap:
      ➤ Integration - getting data from A → B; custom integrations are evil
      ➤ Kafka + Transformation - loosely coupled integration via Kafka; transformation via Connect SMTs or KStreams
      ➤ ksqlDB - SQL interface to streaming data; approachable and viable at production scale
  17. Where to go from here? 5 resources - look in the chat for links to each (including swag!)