Slide 1

Slide 1 text

REAL-TIME DATA TRANSFORMATION * BY EXAMPLE
@KeithResar, Kafka Developer, confluent.io

Slide 2

Slide 2 text

TABLE OF CONTENTS
01 THE INTRO: Level set, background, what is stream processing
02 THE EVAL: Dimensions used for evaluating tools
03 THE EXAMPLE: Define a common use case that we apply to each tool
04 THE TOOLS: Solving for ksqlDB, Kafka Streams, Apache Flink, and Apache Spark

Slide 3

Slide 3 text

THE INTRO

Slide 4

Slide 4 text

2010 Apache Kafka created at LinkedIn 2023 Most fortune 100 companies trust and use Kafka THE RISE OF EVENT STREAMING

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

ANATOMY OF A KAFKA TOPIC
Diagram: producers write to the new end of each partition (Partition 0, Partition 1, Partition 2); each partition is an ordered log of records, old to new. Consumers read independently at their own positions (Consumer A at offset=4, Consumer B at offset=7).
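The partition and offset mechanics above can be modeled in a few lines of plain Python. This is an illustrative sketch only, not the real Kafka client API: partitions are append-only logs, and every consumer advances its own offset independently.

```python
# An in-memory model of a Kafka topic (illustrative sketch, not the real
# client API): each partition is an append-only log, and each consumer
# tracks its own read offset per partition.
class Topic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def write(self, key, value):
        # Producers choose a partition, commonly by hashing the key.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        # Offsets are per consumer, so readers never disturb each other.
        self.offsets = [0] * len(topic.partitions)

    def poll(self, partition):
        log = self.topic.partitions[partition]
        if self.offsets[partition] >= len(log):
            return None  # caught up; the log is unbounded, more may arrive
        value = log[self.offsets[partition]]
        self.offsets[partition] += 1
        return value
```

Two consumers polling the same partition each see every record, because each keeps its own offset, which is exactly why Consumer A and Consumer B in the diagram can sit at different positions.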

Slide 7

Slide 7 text

EVENTS HAPPENED
{
  "timestamp": 1684773207,
  "sensor_id": "sh291",
  "rpm": 3846,
  "vibration": 348,
  "temperature": 81
}
● Past tense
● They become business data
● Unbounded

Slide 8

Slide 8 text

This is a robot named Robot. DATA PROCESSING PERSPECTIVES

Slide 9

Slide 9 text

These are Robot’s robot eggs. (as you know, robots do not give live birth) DATA PROCESSING PERSPECTIVES * fun fact - Robot needs to robot Robot’s robot eggs to activate baby robots (known as roblets) * expected fact - when using a word and repeating it as proper noun, noun, adjective, and adverb the word itself starts to look really weird.

Slide 10

Slide 10 text

Robot works 24 hours a day. (Look at all those eggs!) DATA PROCESSING PERSPECTIVES * Robot went a bit “robot” last year, so we don’t fully trust his work. We have a federal decree to measure the egg output and alert on issues.

Slide 11

Slide 11 text

Divide output by day DATA PROCESSING PERSPECTIVES

Slide 12

Slide 12 text

DATA PROCESSING PERSPECTIVES
Day 0: 10 eggs, Day 1: 20 eggs, Day 2: 18 eggs
Read daily eggs, run analysis, report results.
This is batch processing, AKA ETL.
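A minimal Python sketch of this batch flow, using the daily egg counts from the slide: read a bounded set of records, run the analysis, report the results.

```python
# Batch processing: the input is bounded, so we can read it all,
# analyze it in one pass, and report once.
daily_output = {"day0": 10, "day1": 20, "day2": 18}  # read daily eggs

total = sum(daily_output.values())            # run analysis
average = total / len(daily_output)

report = f"total={total} avg={average:.1f}"   # report results
print(report)  # total=48 avg=16.0
```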

Slide 13

Slide 13 text

DATA PROCESSING PERSPECTIVES
Read each egg output, run windowed analysis, report windowed results (Day 0, Day 1, Day 2).
This is stream processing, AKA Consume / Transform / Produce.
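The same analysis as a stream, sketched in Python: consume one event at a time and emit a result per sliding window. The window here is sized in events, a stand-in for the time windows the slides describe.

```python
from collections import deque

# Stream processing: consume each event, update windowed state,
# produce a windowed result immediately.
def windowed_counts(events, window_size):
    window = deque(maxlen=window_size)  # sliding window of recent events
    for event in events:                # consume
        window.append(event)            # transform: update window state
        yield sum(window)               # produce: emit the windowed result

# 1 = an egg arrived this tick, 0 = no egg
counts = list(windowed_counts([1, 1, 0, 1, 1], window_size=3))
# counts == [1, 2, 2, 2, 2]
```

Unlike the batch version, results are available continuously as events arrive rather than once per day.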

Slide 14

Slide 14 text

DATA PROCESSING PERSPECTIVES Introducing our Robot roblet* analysis transformer * All robot eggs (Robot’s included) are commonly referred to as “roblets”

Slide 15

Slide 15 text

COMMON PATTERNS, ILLUSTRATED

Slide 16

Slide 16 text

COMMON PATTERNS 01 FILTERING Stream Event Filter Filtered Stream
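A stateless filter in plain Python (sensor field names taken from the earlier event example; the values are invented): each event is kept or dropped on its own, with no state carried between events.

```python
# Filtering: a pure per-event predicate, no state between events.
events = [
    {"sensor_id": "sh291", "rads": 0.4},
    {"sensor_id": "sh292", "rads": 1.7},
]
filtered = [e for e in events if e["rads"] > 1]
```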

Slide 17

Slide 17 text

COMMON PATTERNS 02 GROUPING/ AGGREGATING Stream Event Grouper Grouped Stream
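Grouping and aggregation sketched in Python (sample events invented): a running aggregate is kept per key and updated as each event arrives.

```python
from collections import defaultdict

# Grouping/aggregating: state (a count) is maintained per key.
counts = defaultdict(int)
for event in [{"color": "red"}, {"color": "blue"}, {"color": "red"}]:
    counts[event["color"]] += 1
```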

Slide 18

Slide 18 text

COMMON PATTERNS 03 JOINING Table Key-based data Stream Unbounded data Event Joiner Enriched Stream
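A stream-table join sketched in Python (the orders data is invented): the table is key-based state, the latest value per key, and each stream event is enriched by a lookup on its key.

```python
# Stream-table join: the table holds the latest value per key.
orders = {"sh291": {"ship_location": "fargo"}}    # table: key-based data

def enrich(event):
    order = orders.get(event["serial_num"], {})   # join on the event's key
    return {**event, **order}                     # emit the enriched event

enriched = enrich({"serial_num": "sh291", "color": "red"})
```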

Slide 19

Slide 19 text

COMMON PATTERNS 03 JOINING Stream Unbounded data Stream Unbounded data Event Joiner Enriched Stream

Slide 20

Slide 20 text

THE EVAL

Slide 21

Slide 21 text

STREAM PROCESSING TOOLS ksqlDB 01 Kafka Streams 02 Apache Flink 03 Apache Spark 04

Slide 22

Slide 22 text

EVAL DIMENSIONS API OPS 01 03 INTEGRATIONS ET AL 02 04

Slide 23

Slide 23 text

THE EXAMPLE

Slide 24

Slide 24 text

You remember Robot, he’s our robot. SCENARIO AND DATA

Slide 25

Slide 25 text

SCENARIO AND DATA and our roblet eggs

Slide 26

Slide 26 text

DATA SCHEMA

// roblet order data
serial_num
ship_location
name

// raw roblet sensor data
timestamp
serial_num
weight
color
temperature
rads
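As hypothetical records (all values invented for illustration), the two schemas share serial_num, the key used for joins later.

```python
# Invented sample records matching the two schemas above.
order = {"serial_num": "sh291", "ship_location": "fargo", "name": "roblet-1"}

sensor = {"timestamp": 1684773207, "serial_num": "sh291", "weight": 12.5,
          "color": "red", "temperature": 81.0, "rads": 0.4}

# The two record types share serial_num, the join key.
assert order["serial_num"] == sensor["serial_num"]
```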

Slide 27

Slide 27 text

CALCULATIONS

// Instant Safety (stateless filter)
rads > 1

// Productivity (window / grouping)
avg(units) per hour

// Ship Logistics (join / enrich)
color by location
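All three calculations can be sketched against in-memory sample data in plain Python (records invented; the productivity window here is a single tumbling hour rather than the hopping or sliding windows the tools below use).

```python
from collections import defaultdict

# Invented sample data in the shape of the schemas above.
sensors = [
    {"timestamp": 0,    "serial_num": "a", "color": "red",  "rads": 0.2},
    {"timestamp": 1800, "serial_num": "b", "color": "blue", "rads": 1.5},
    {"timestamp": 3000, "serial_num": "c", "color": "red",  "rads": 0.1},
]
orders = {"a": "fargo", "b": "reno", "c": "fargo"}  # serial_num -> location

# 1. Instant safety: stateless filter on rads > 1
alerts = [s for s in sensors if s["rads"] > 1]

# 2. Productivity: units per hour (one tumbling 1-hour window)
per_hour = defaultdict(int)
for s in sensors:
    per_hour[s["timestamp"] // 3600] += 1

# 3. Ship logistics: join each reading to its order's location
color_by_location = defaultdict(set)
for s in sensors:
    color_by_location[orders[s["serial_num"]]].add(s["color"])
```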

Slide 28

Slide 28 text

THE TOOLS

Slide 29

Slide 29 text

ABOUT KSQLDB
ksqlDB: Dynamic tables, high-level API
Kafka Streams: Streams and windows, stream data processing
Kafka Clients: Events, consume / produce
Kafka Broker

Slide 30

Slide 30 text

ABOUT KSQLDB ksqlDB is a real-time, event streaming database purpose-built for stream processing applications built on top of KStreams. It simplifies stream processing development by exposing the power of KStreams with the approachable feel of a traditional database. “ ” Expressive Opinionated Low Touch Hi Toil Simple Infra Extensive Flexibility Operations Infrastructure Language Interfaces Data Integrations Run model SQL Kafka (built with Kafka Streams) Managed (single process runs your queries)

Slide 31

Slide 31 text

CODING KSQLDB

-- Define structures to access our data
CREATE TABLE orders (
  serial_num VARCHAR PRIMARY KEY,
  ship_location VARCHAR,
  name VARCHAR
) WITH (KAFKA_TOPIC='orders');

CREATE STREAM sensors (
  timestamp TIMESTAMP,
  serial_num VARCHAR,
  weight DOUBLE,
  color VARCHAR,
  temperature DOUBLE,
  rads DOUBLE
) WITH (KAFKA_TOPIC='sensors');

Slide 32

Slide 32 text

CODING KSQLDB

-- Instant safety (stateless filter)
SELECT * FROM sensors
WHERE rads > 1
EMIT CHANGES;

Slide 33

Slide 33 text

CODING KSQLDB

-- Productivity (units per hour, hopping window)
SELECT COUNT(*) AS count
FROM sensors
WINDOW HOPPING (SIZE 1 HOUR, ADVANCE BY 1 MINUTE)
EMIT CHANGES;

Slide 34

Slide 34 text

CODING KSQLDB

-- Ship logistics (join / enrich)
-- ksqlDB stream-table joins put the stream on the left
SELECT o.serial_num, s.color
FROM sensors s
INNER JOIN orders o ON s.serial_num = o.serial_num
EMIT CHANGES;

Slide 35

Slide 35 text

ABOUT KSTREAMS
ksqlDB: Dynamic tables, high-level API
Kafka Streams: Streams and windows, stream data processing
Kafka Clients: Events, consume / produce
Kafka Broker

Slide 36

Slide 36 text

ABOUT KSTREAMS Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in Kafka clusters. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology. “ ” Expressive Opinionated Low Touch Hi Toil Simple Infra Extensive Flexibility Operations Infrastructure Language Interfaces Data Integrations Run model Java Kafka (library as part of Apache Kafka project) Self-hosted (bring your own scheduler)

Slide 37

Slide 37 text

CODING KSTREAMS

// Instant safety (stateless filter)
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("sensors");

KStream<String, String> filteredStream = source.filter((key, value) -> {
    JsonObject sensorEvent = new JsonParser().parse(value).getAsJsonObject();
    return sensorEvent.get("rads").getAsDouble() > 1;
});

Slide 38

Slide 38 text

CODING KSTREAMS

// Productivity (units per hour, hopping window)
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("sensors");

KTable<Windowed<String>, Long> countTable = source
    .groupBy((key, value) -> key,
             Grouped.with(Serdes.String(), Serdes.String()))
    .windowedBy(TimeWindows.of(Duration.ofHours(1))
                           .advanceBy(Duration.ofMinutes(1)))
    .count();

Slide 39

Slide 39 text

ABOUT FLINK Apache Flink is a scalable, high-performance stream processing framework that supports both batch and real-time data processing tasks, ensuring fault-tolerance, state management, and event-time processing. Flink supports Kafka and other integrations. “ ” Expressive Opinionated Low Touch Hi Toil Simple Infra Extensive Flexibility Operations Infrastructure Language Interfaces Data Integrations Run model SQL, Java, Python Many Connectors (Kafka, file, object storage, JDBC, HBase, Kinesis, Mongo) Multi-Process Scheduler

Slide 40

Slide 40 text

ABOUT FLINK
SQL / Table API: Dynamic tables, high-level analytics API
DataStream API: Streams and windows, stream and batch data processing
ProcessFunction: Events, state, time; stateful event-driven applications

Slide 41

Slide 41 text

ABOUT FLINK

Slide 42

Slide 42 text

CODING FLINK

-- Define schema
CREATE TABLE sensors (
  ts TIMESTAMP(3),
  serial_num STRING,
  weight DOUBLE,
  color STRING,
  temperature DOUBLE,
  rads DOUBLE,
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'sensors',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);

Slide 43

Slide 43 text

CODING FLINK

-- Instant safety (stateless filter)
SELECT * FROM sensors WHERE rads > 1;

Slide 44

Slide 44 text

CODING FLINK

// Productivity (units per hour, sliding window)
sensorDataStream
    .map(new MapFunction<String, Tuple2<String, Integer>>() {
        public Tuple2<String, Integer> map(String value) throws Exception {
            ObjectMapper objectMapper = new ObjectMapper();
            Sensor sensor = objectMapper.readValue(value, Sensor.class);
            return new Tuple2<>(sensor.serial_num, 1);
        }
    })
    .keyBy(0)
    .window(SlidingProcessingTimeWindows
        .of(Time.hours(1), Time.minutes(1)))
    .sum(1);

Slide 45

Slide 45 text

CODING FLINK

-- Ship logistics (join / enrich)
SELECT o.serial_num, s.color
FROM orders AS o
JOIN sensors AS s ON o.serial_num = s.serial_num;

Slide 46

Slide 46 text

ABOUT SPARK Apache Spark is a distributed computing system used for data processing and analytics. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. “ ” Expressive Opinionated Low Touch Hi Toil Simple Infra Extensive Flexibility Operations Infrastructure Language Interfaces Data Integrations Run model SQL, Java, Python, R Limited for Streaming, many for batch Multi-Process Scheduler

Slide 47

Slide 47 text

ABOUT SPARK

Slide 48

Slide 48 text

CODING SPARK

# Instant safety (stateless filter)
sensors_df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "") \
    .option("subscribe", "sensors") \
    .load()

# Filter the data where rads > 1
sensors_filtered_df = sensors_df.filter(col("rads") > 1)

Slide 49

Slide 49 text

CODING SPARK

# Productivity (units per hour, sliding window)
windowed_counts = sensors_df \
    .withWatermark("timestamp", "1 hour") \
    .groupBy(window(col("timestamp"), "1 hour", "1 minute")) \
    .agg(countDistinct("serial_num").alias("serial_count"))

Slide 50

Slide 50 text

CODING SPARK

# Ship logistics (join / enrich)
# Define consumers for orders and sensors
# Read and deserialize data from each
joined_df = sensors_df.join(orders_df, "serial_num")

Slide 51

Slide 51 text

SUMMARY AT A GLANCE
ksqlDB, Kafka Streams, Apache Flink, and Apache Spark, each charted on the three eval dimensions:
Flexibility: Expressive vs. Opinionated
Operations: Low Touch vs. Hi Toil
Infrastructure: Simple Infra vs. Extensive

Slide 52

Slide 52 text

RESOURCES

Slide 53

Slide 53 text

Where to go from here? 3 Resources

Slide 54

Slide 54 text

Everything Kafka docs, animations, and video on demand developer.confluent.io

Slide 55

Slide 55 text

O’Reilly Books free e-book bundle cnfl.io/book-bundle

Slide 56

Slide 56 text

Confluent Cloud Free access with new accounts

Slide 57

Slide 57 text

THE END Thanks! @keithresar Keith Resar