
Real-Time Data Transformation by Example

Learn what tooling works best for your next data transformation project. Compare the most common real-time engines: ksqlDB, Spark, Flink, and Kafka Streams. See them side by side, each chewing through a common data set with working code examples. In this talk we demonstrate the unique approach each of these powerful tools takes to solving data filtering, enrichment, and windowed aggregation.

Keith Resar

May 24, 2023

Transcript

1. TABLE OF CONTENTS
01 THE INTRO: Level set, background, what is stream processing
02 THE EVAL: Dimensions used for evaluating tools
03 THE EXAMPLE: Define a common use case that we apply to each tool
04 THE TOOLS: Solving for ksqlDB, Kafka Streams, Apache Flink, and Apache Spark
2. THE RISE OF EVENT STREAMING
2010: Apache Kafka created at LinkedIn
2023: Most Fortune 100 companies trust and use Kafka
3. ANATOMY OF A KAFKA TOPIC
[Diagram: a topic split into partitions 0, 1, and 2; producers append writes at the new end of each partition while records age toward the old end; Consumer A reads at offset=4 and Consumer B at offset=7, each tracking its own position independently]
4. EVENTS HAPPENED
{ 'timestamp': 1684773207, 'sensor_id': 'sh291', 'rpm': 3846, 'vibration': 348, 'temperature': 81 }
• Past tense
• They become business data
• Unbounded
5. DATA PROCESSING PERSPECTIVES
These are Robot's robot eggs. (As you know, robots do not give live birth.)
* Fun fact: Robot needs to robot Robot's robot eggs to activate baby robots (known as roblets).
* Expected fact: when you use a word and repeat it as a proper noun, noun, adjective, and verb, the word itself starts to look really weird.
6. DATA PROCESSING PERSPECTIVES
Robot works 24 hours a day. (Look at all those eggs!)
* Robot went a bit "robot" last year, so we don't fully trust his work. We have a federal decree to measure the egg output and alert on issues.
7. DATA PROCESSING PERSPECTIVES
Read daily eggs → run analysis → report results
Day 0: 10 eggs, Day 1: 20 eggs, Day 2: 18 eggs
This is batch processing, AKA ETL.
8. DATA PROCESSING PERSPECTIVES
Read each egg output → run windowed analysis → report windowed results, continuously across Day 0, Day 1, Day 2
This is stream processing, AKA consume / transform / produce.
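(A minimal sketch of that consume / transform / produce loop using the plain Kafka Java clients. The topic names, group id, localhost broker, and the toy uppercase transform are assumptions for illustration, not from the deck.)

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ConsumeTransformProduce {
    public static void main(String[] args) {
        Properties props = new Properties();  // shared by both clients for brevity
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker
        props.put("group.id", "egg-analyzer");             // assumed group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            consumer.subscribe(List.of("sensors"));
            while (true) {
                // Consume: poll the next batch of events from the topic
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // Transform: per-event logic goes here (filter, enrich, aggregate, ...)
                    String transformed = record.value().toUpperCase();
                    // Produce: write the result to a downstream topic
                    producer.send(new ProducerRecord<>("sensors-transformed", record.key(), transformed));
                }
            }
        }
    }
}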
9. DATA PROCESSING PERSPECTIVES
Introducing our Robot roblet* analysis transformer
* All robot eggs (Robot's included) are commonly referred to as "roblets"
10. DATA SCHEMA
// roblet order data: serial_num, ship_location, name
// raw roblet sensor data: timestamp, serial_num, weight, color, temperature, rads
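(For the Java examples that follow, the two record types could be modeled as plain POJOs. This is a hypothetical sketch matching the schema above; the class names mirror the field list but are not part of the deck.)

// Hypothetical POJOs for JSON deserialization of the two topics
public class Order {
    public String serial_num;
    public String ship_location;
    public String name;
}

public class Sensor {
    public long timestamp;
    public String serial_num;
    public double weight;
    public String color;
    public double temperature;
    public double rads;
}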
11. CALCULATIONS
// Instant Safety (stateless filter): rads > 1
// Productivity (window / grouping): avg(units) per hour
// Ship Logistics (join / enrich): color by location
12. ABOUT KSQLDB
Layered stack on the Kafka Broker:
• ksqlDB: dynamic tables; high-level API
• Kafka Streams: streams and windows; stream data processing
• Kafka Clients: events; consume / produce
13. ABOUT KSQLDB
“ksqlDB is a real-time, event streaming database purpose-built for stream processing applications, built on top of Kafka Streams. It simplifies stream processing development by exposing the power of Kafka Streams with the approachable feel of a traditional database.”
• Flexibility: Expressive ↔ Opinionated
• Operations: Low Touch ↔ Hi Toil
• Infrastructure: Simple ↔ Extensive
• Language interfaces: SQL
• Data integrations: Kafka (built with Kafka Streams)
• Run model: Managed (single process runs your queries)
14. CODING KSQLDB
-- Define structures to access our data
CREATE TABLE orders (
    serial_num VARCHAR PRIMARY KEY,
    ship_location VARCHAR,
    name VARCHAR
) WITH (KAFKA_TOPIC='orders');

CREATE STREAM sensors (
    timestamp TIMESTAMP,
    serial_num VARCHAR,
    weight DOUBLE,
    color VARCHAR,
    temperature DOUBLE,
    rads DOUBLE
) WITH (KAFKA_TOPIC='sensors');
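(The deck shows the instant-safety filter for Kafka Streams and Spark but not for ksqlDB. A minimal sketch of the equivalent push query against the sensors stream defined above, issued here through the ksqlDB Java client; the localhost:8088 server address is an assumption.)

import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;
import io.confluent.ksql.api.client.Row;
import io.confluent.ksql.api.client.StreamedQueryResult;

public class InstantSafety {
    public static void main(String[] args) throws Exception {
        // Assumed ksqlDB server location
        Client client = Client.create(
            ClientOptions.create().setHost("localhost").setPort(8088));

        // Instant safety (stateless filter): push query emitting every event with rads > 1
        StreamedQueryResult result =
            client.streamQuery("SELECT * FROM sensors WHERE rads > 1 EMIT CHANGES;").get();

        Row row;
        while ((row = result.poll()) != null) {
            System.out.println("Unsafe roblet: " + row.values());
        }
        client.close();
    }
}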
15. CODING KSQLDB
-- Productivity (units per hour, updated each minute)
SELECT serial_num, COUNT(*) AS count
FROM sensors
WINDOW HOPPING (SIZE 1 HOUR, ADVANCE BY 1 MINUTE)
GROUP BY serial_num
EMIT CHANGES;
16. CODING KSQLDB
-- Ship logistics (join / enrich)
SELECT s.serial_num, s.color, o.ship_location
FROM sensors s
INNER JOIN orders o ON s.serial_num = o.serial_num
EMIT CHANGES;
17. ABOUT KSTREAMS
Layered stack on the Kafka Broker:
• ksqlDB: dynamic tables; high-level API
• Kafka Streams: streams and windows; stream data processing
• Kafka Clients: events; consume / produce
18. ABOUT KSTREAMS
“Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in Kafka clusters. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology.”
• Flexibility: Expressive ↔ Opinionated
• Operations: Low Touch ↔ Hi Toil
• Infrastructure: Simple ↔ Extensive
• Language interfaces: Java
• Data integrations: Kafka (library as part of the Apache Kafka project)
• Run model: Self-hosted (bring your own scheduler)
19. CODING KSTREAMS
// Instant safety (stateless filter)
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("sensors");
KStream<String, String> filteredStream = source.filter((key, value) -> {
    JsonObject sensorEvent = JsonParser.parseString(value).getAsJsonObject();
    return sensorEvent.get("rads").getAsDouble() > 1;
});
20. CODING KSTREAMS
// Productivity (units per hour, updated each minute)
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("sensors");
KTable<Windowed<String>, Long> countTable = source
    .groupBy((key, value) -> key, Grouped.with(Serdes.String(), Serdes.String()))
    .windowedBy(TimeWindows.of(Duration.ofHours(1)).advanceBy(Duration.ofMinutes(1)))
    .count();
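(The deck has no Kafka Streams version of the ship-logistics join. A minimal sketch, assuming both topics are keyed by serial_num and carry string values; the app id, broker address, and output topic are invented for illustration.)

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class ShipLogistics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ship-logistics");     // assumed
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> sensors = builder.stream("sensors");
        KTable<String, String> orders = builder.table("orders");

        // Ship logistics (join / enrich): stream-table join on the record key,
        // attaching the matching order to each sensor event (combined value is illustrative)
        sensors.join(orders, (sensor, order) -> sensor + " | " + order)
               .to("sensors-enriched");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}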
21. ABOUT FLINK
“Apache Flink is a scalable, high-performance stream processing framework that supports both batch and real-time data processing tasks, ensuring fault tolerance, state management, and event-time processing. Flink supports Kafka and other integrations.”
• Flexibility: Expressive ↔ Opinionated
• Operations: Low Touch ↔ Hi Toil
• Infrastructure: Simple ↔ Extensive
• Language interfaces: SQL, Java, Python
• Data integrations: many connectors (Kafka, file, object storage, JDBC, HBase, Kinesis, Mongo)
• Run model: multi-process scheduler
22. ABOUT FLINK
• SQL / Table API: dynamic tables; high-level analytics API
• DataStream API: streams and windows; stream and batch data processing
• ProcessFunction: events, state, time; stateful event-driven applications
23. CODING FLINK
-- Define schema
CREATE TABLE sensors (
    ts TIMESTAMP(3),
    serial_num STRING,
    weight DOUBLE,
    color STRING,
    temperature DOUBLE,
    rads DOUBLE,
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'sensors',
    'properties.bootstrap.servers' = 'localhost:9092',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json'
);
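(The deck covers the Flink productivity window and join but not the instant-safety filter. A minimal DataStream sketch in the same style as the map example that follows, reusing the hypothetical Sensor POJO from the DATA SCHEMA note; sensorDataStream is assumed to carry the raw JSON strings.)

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;

// Instant safety (stateless filter): keep only readings with rads > 1
DataStream<String> unsafeReadings = sensorDataStream
    .filter(new FilterFunction<String>() {
        public boolean filter(String value) throws Exception {
            Sensor sensor = new ObjectMapper().readValue(value, Sensor.class);
            return sensor.rads > 1;
        }
    });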
24. CODING FLINK
// Productivity (units per hour, updated each minute)
sensorDataStream
    .map(new MapFunction<String, Tuple2<String, Integer>>() {
        public Tuple2<String, Integer> map(String value) throws Exception {
            ObjectMapper objectMapper = new ObjectMapper();
            Sensor sensor = objectMapper.readValue(value, Sensor.class);
            return new Tuple2<>(sensor.serial_num, 1);
        }
    })
    .keyBy(0)
    .window(SlidingProcessingTimeWindows.of(Time.hours(1), Time.minutes(1)))
    .sum(1);
25. CODING FLINK
-- Ship logistics (join / enrich)
SELECT o.serial_num, s.color
FROM orders AS o
JOIN sensors AS s ON o.serial_num = s.serial_num;
26. ABOUT SPARK
“Apache Spark is a distributed computing system used for data processing and analytics. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.”
• Flexibility: Expressive ↔ Opinionated
• Operations: Low Touch ↔ Hi Toil
• Infrastructure: Simple ↔ Extensive
• Language interfaces: SQL, Java, Python, R
• Data integrations: limited for streaming, many for batch
• Run model: multi-process scheduler
27. CODING SPARK
# Instant safety (stateless filter)
from pyspark.sql.functions import col

sensors_df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "<bootstrap>") \
    .option("subscribe", "sensors") \
    .load()

# Filter the data where rads > 1
sensors_filtered_df = sensors_df.filter(col("rads") > 1)
28. CODING SPARK
# Productivity (units per hour, updated each minute)
from pyspark.sql.functions import approx_count_distinct, col, window

# approx_count_distinct is used because exact distinct aggregations
# are not supported on streaming DataFrames
windowed_counts = sensors_df \
    .withWatermark("timestamp", "1 hour") \
    .groupBy(window(col("timestamp"), "1 hour", "1 minute")) \
    .agg(approx_count_distinct("serial_num").alias("serial_count"))
29. CODING SPARK
# Ship logistics (join / enrich)
# Define consumers for orders and sensors,
# then read and deserialize data from each
joined_df = sensors_df.join(orders_df, "serial_num")
30. SUMMARY AT A GLANCE
[Chart: ksqlDB, Kafka Streams, Apache Flink, and Apache Spark each plotted on the three evaluation dimensions: Flexibility (Expressive ↔ Opinionated), Operations (Low Touch ↔ Hi Toil), Infrastructure (Simple ↔ Extensive)]