
Apache Kafka and Flink: Stateful Streaming Data Pipelines made easy with SQL

FTisiot
September 04, 2021


A stateful streaming data pipeline needs both a solid base and an engine to drive the data. Apache Kafka is an excellent choice for storing and transmitting high-throughput, low-latency messages. Apache Flink adds the cherry on top with a distributed stateful compute engine available in a variety of languages, including SQL.

In this session we'll explore how Apache Flink operates in conjunction with Apache Kafka to build stateful streaming data pipelines, and the problems we can solve with this combination. We will explore Flink's SQL client, showing how to define connections and transformations with the best-known and most beloved language in the data industry.
This session is aimed at data professionals who want to lower the barrier to streaming data pipelines by making them configurable as a set of simple SQL commands.


Transcript

  1. Apache Kafka and Flink: Stateful Streaming Data Pipelines made easy with SQL

     Francesco Tisiot, Developer Advocate (@ftisiot)
  2. @ftisiot

  3. @ftisiot

  4. @ftisiot

  5. @ftisiot

  6. @ftisiot What is Apache Kafka?

  7. @ftisiot [Diagram: Topic A and Topic B as partitioned logs with numbered offsets (0, 1, 2, …); Producers append to the topics, Consumers read from them]
  8. @ftisiot [Diagram: Kafka brokers, topics with replication factor 3 and 2]

  9. @ftisiot Integrating Apache Kafka

  10. @ftisiot Kafka Connect Source Kafka Connect Sink

  11. @ftisiot kafka-python

      from kafka import KafkaProducer

      producer = KafkaProducer(
          bootstrap_servers=['broker1:1234']
      )
      producer.send('my-topic-name', b'my-message')
      producer.flush()
  12. @ftisiot

      {
        "id": 1,
        "shop": "Mario's Pizza",
        "name": "Arsenio Pisaroni-Boccaccio",
        "phoneNumber": "+39 51 0290746",
        "address": "Via Ugo 01, Montegrotto, 85639 Padova(PD)",
        "pizzas": [
          { "pizzaName": "Margherita", "additionalToppings": ["ham"] },
          { "pizzaName": "Diavola", "additionalToppings": ["mozzarella", "banana", "onion"] }
        ]
      }

      https://github.com/aiven/kafka-python-fake-data-producer
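A minimal sketch of how an order event like the one above could be built and serialized before being handed to the producer. The order dict mirrors the slide's payload; the serialization step is an assumption (the linked fake-data producer may do it differently), shown here because Kafka message values are sent as bytes.

```python
import json

# Build one pizza-order event shaped like the slide's payload
# (hypothetical sketch; field values copied from the slide).
order = {
    "id": 1,
    "shop": "Mario's Pizza",
    "name": "Arsenio Pisaroni-Boccaccio",
    "phoneNumber": "+39 51 0290746",
    "address": "Via Ugo 01, Montegrotto, 85639 Padova(PD)",
    "pizzas": [
        {"pizzaName": "Margherita", "additionalToppings": ["ham"]},
        {"pizzaName": "Diavola",
         "additionalToppings": ["mozzarella", "banana", "onion"]},
    ],
}

# Kafka message values are bytes, so the dict is serialized to UTF-8 JSON;
# this payload is what producer.send('pizza-orders', payload) would ship.
payload = json.dumps(order).encode("utf-8")
```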
  13. @ftisiot

  14. @ftisiot Database Apache Kafka

  15. @ftisiot Table vs Log: The Fridge Dilemma

  16. @ftisiot [Diagram: Table vs Log — the table holds current values, the log records the sequence of changes (+7, +3, +6, -5, -2, +5) that produced them]
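The table/log duality on the slide can be sketched in a few lines of plain Python: the log is the sequence of changes, and the table is the state you get by replaying them. The change values are taken from the slide; treating the table value as a running sum is my reading of the diagram.

```python
# The changelog from the slide: each entry is a change event.
changes = [+7, +3, +6, -5, -2, +5]

def materialize(log):
    """Replay a changelog into the current table state (a running sum)."""
    state = 0
    for delta in log:
        state += delta
    return state

# Replaying the full log yields the table's current value: 14.
current_value = materialize(changes)
```

This is exactly the duality stateful stream processors exploit: a Kafka topic is the log, and Flink's state is the materialized table.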
  17. @ftisiot Create Stateful Apps

  18. @ftisiot Kafka Streams, Faust, KSQL, Apache Flink

  19. @ftisiot

  20. @ftisiot

  21. @ftisiot SQL, Table API, DataStream API

  22. @ftisiot Filter, Join, Aggregate, Explode, Detect, Change Shape

  23. @ftisiot

  24. @ftisiot Connect Flink

  25. @ftisiot

      {
        "id": 1,
        "shop": "Mario's Pizza",
        "name": "Arsenio Pisaroni-Boccaccio",
        "phoneNumber": "+39 51 0290746",
        "address": "Via Ugo 01, Montegrotto, 85639 Padova(PD)",
        "pizzas": [
          { "pizzaName": "Margherita", "additionalToppings": ["ham"] }
        ]
      }

      pizza_name    base_price
      Marinara      4
      Diavola       6
      Mari & Monti  8
      Salami        7
      Peperoni      8
      Margherita    5
  26. @ftisiot Kafka Source

      CREATE TABLE pizza_orders (
        id INT,
        shop VARCHAR,
        name VARCHAR,
        phoneNumber VARCHAR,
        address VARCHAR,
        pizzas ARRAY<ROW(
          pizzaName VARCHAR,
          additionalToppings ARRAY<VARCHAR>)>
      ) WITH (
        'connector' = 'kafka',
        'properties.bootstrap.servers' = 'kafka:13041',
        'topic' = 'pizza-orders',
        'scan.startup.mode' = 'earliest-offset',
        …
      );
  27. @ftisiot Pg Source

      CREATE TEMPORARY TABLE pizza_prices (
        pizza_name VARCHAR,
        base_price INT,
        PRIMARY KEY (pizza_name) NOT ENFORCED
      ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://pghost:13039/db',
        'username' = 'avnadmin',
        'password' = 'verysecurepassword123',
        'table-name' = 'pizza_price'
      );
  28. @ftisiot Pg Tgt

      CREATE TABLE order_price (
        id INT,
        pizza_name VARCHAR,
        base_price INT,
        nr_pizzas BIGINT NOT NULL,
        PRIMARY KEY (id, pizza_name) NOT ENFORCED
      ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://pghost:13039/db',
        'username' = 'avnadmin',
        'password' = 'verysecurepassword123',
        'table-name' = 'order_price'
      );
  29. @ftisiot Create Pipeline

      insert into order_price
      select id, b.pizzaName, base_price, count(*) nr_pizzas
      from pizza_orders
      cross join UNNEST(pizzas) b
      LEFT OUTER JOIN pizza_prices FOR SYSTEM_TIME AS OF orderProctime AS pp
        ON b.pizzaName = pp.pizza_name
      group by id, b.pizzaName, base_price;
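To make the pipeline's semantics concrete, here is a hypothetical plain-Python sketch of what the statement computes for a single order: explode the pizzas array (the UNNEST), look up each pizza in the price table (the left temporal join), and count pizzas per (order id, pizza name) (the GROUP BY). The sample data and dict names are illustrative, not part of the talk.

```python
from collections import Counter

# Assumed price table, mirroring two rows from slide 25.
pizza_prices = {"Margherita": 5, "Diavola": 6}

# One incoming order with a nested pizzas array (toppings omitted for brevity).
order = {
    "id": 1,
    "pizzas": [
        {"pizzaName": "Margherita"},
        {"pizzaName": "Diavola"},
        {"pizzaName": "Margherita"},
    ],
}

# UNNEST + GROUP BY: explode the array, count per (id, pizzaName).
counts = Counter((order["id"], p["pizzaName"]) for p in order["pizzas"])

# LEFT JOIN against the price table: unknown pizzas get base_price None,
# matching the behavior of a LEFT OUTER JOIN with no matching row.
order_price = [
    {"id": oid, "pizza_name": name,
     "base_price": pizza_prices.get(name),
     "nr_pizzas": nr}
    for (oid, name), nr in counts.items()
]
```

The real pipeline differs in one important way: Flink evaluates the join `FOR SYSTEM_TIME AS OF` the order's processing time, so each order sees the price table as it was when the order arrived, not the latest snapshot.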
  30. @ftisiot

  31. @ftisiot

  32. @ftisiot Resources

      https://aiven.io
      https://kafka.apache.org/
      http://flink.apache.org/
      https://aiven.io/blog/create-your-own-data-stream-for-kafka-with-python-and-faker
      https://github.com/aiven/flink-sql-cli-docker