
Build an Open Source Data Pipeline

FTisiot
February 04, 2022


Any conversation about Big Data would be incomplete without talking about Apache Kafka and Apache Flink: the winning open source combination for high-volume streaming data pipelines.

In this talk we'll explore how moving from long-running batches to streaming data changes the game completely. We'll show how to build a streaming data pipeline, starting with Apache Kafka for storing and transmitting high-throughput, low-latency messages. Then we'll add Apache Flink, a distributed stateful compute engine, to create complex streaming transformations using familiar SQL statements.


Transcript

  1. Build an Open Source Streaming Data Pipeline. Olena Kutsenko and Francesco Tisiot, Dev Advocates (@OlenaKutsenko, @ftisiot)
  2. (image slide)

  3. (image slide)

  4. (image slide)

  5. What is Apache Kafka?

  6. (image slide)

  7. Topics and partitions diagram: Topic A (offsets 0 1 2 3 4), Topic B (offsets 0 1 2 3), with Producers and Consumers on either side
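The partition diagram above can be sketched as a toy in-memory model (purely illustrative, with made-up class and message names, not a real Kafka client): a partition is an append-only log, and a record's offset is simply its position in that log.

```python
# Toy in-memory model of the slide's diagram. Illustrative only.
class ToyPartition:
    def __init__(self):
        self.log = []

    def produce(self, record):
        self.log.append(record)          # append-only, like a partition log
        return len(self.log) - 1         # offset of the new record

    def consume_from(self, offset):
        return self.log[offset:]         # consumers track their own offset

topic_a = ToyPartition()
for msg in ["m0", "m1", "m2", "m3", "m4"]:
    topic_a.produce(msg)                 # offsets 0..4, as on the slide

remaining = topic_a.consume_from(3)      # a consumer resuming at offset 3
```

This also shows why several consumers can read the same topic independently: each one just remembers its own offset.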
  8. Brokers diagram: Replication Factor 3, 2

  9. Brokers, Producer, Consumer (diagram)

  10. Integrating Apache Kafka

  11. Kafka Connect Source / Kafka Connect Sink
  12. kafka-python

      from kafka import KafkaProducer

      producer = KafkaProducer(
          bootstrap_servers=['broker1:1234']
      )
      producer.send('my-topic-name', b'my-message')
      producer.flush()
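Since kafka-python's `send()` takes raw bytes, a JSON order like the one on the next slide has to be serialized first. A minimal sketch (the helper name, the trimmed sample fields, and the broker/topic in the comments are illustrative, not from the deck):

```python
import json

# Serialize an order dict into the UTF-8 JSON bytes producer.send() expects.
def encode_order(order: dict) -> bytes:
    return json.dumps(order).encode("utf-8")

order = {
    "id": 1,
    "shop": "Mario's Pizza",
    "pizzas": [{"pizzaName": "Margherita", "additionalToppings": ["ham"]}],
}
payload = encode_order(order)

# With a reachable broker (address is a placeholder) the send would be:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers=['broker1:1234'])
# producer.send('pizza-orders', payload)
# producer.flush()
```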
  13.

      {
        "id": 1,
        "shop": "Mario's Pizza",
        "name": "Arsenio Pisaroni-Boccaccio",
        "phoneNumber": "+39 51 0290746",
        "address": "Via Ugo 01, Montegrotto, 85639 Padova(PD)",
        "pizzas": [
          { "pizzaName": "Margherita", "additionalToppings": ["ham"] },
          { "pizzaName": "Diavola", "additionalToppings": ["mozzarella", "banana", "onion"] }
        ]
      }

      https://github.com/aiven/python-fake-data-producer-for-apache-kafka
  14. Compute / State

  15. Apache Flink

  16. (image slide)

  17. Filter, Join, Aggregate, Explode, Detect, Transform
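A rough plain-Python sketch of what these operator types do to a stream of orders (the sample orders are made up; in Flink these would be SQL operators running continuously over unbounded streams):

```python
orders = [
    {"id": 1, "shop": "Mario's Pizza", "pizzas": ["Margherita", "Diavola"]},
    {"id": 2, "shop": "Luigi's", "pizzas": ["Marinara"]},
    {"id": 3, "shop": "Mario's Pizza", "pizzas": ["Salami"]},
]

# Filter: keep only orders from one shop
marios = [o for o in orders if o["shop"] == "Mario's Pizza"]

# Explode: one row per pizza (what UNNEST does in Flink SQL)
exploded = [(o["id"], p) for o in marios for p in o["pizzas"]]

# Aggregate: pizzas per order id
counts = {}
for oid, _ in exploded:
    counts[oid] = counts.get(oid, 0) + 1
```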
  18. (image slide)

  19. Flink in Action

  20.

      {
        "id": 1,
        "shop": "Mario's Pizza",
        "name": "Arsenio Pisaroni-Boccaccio",
        "phoneNumber": "+39 51 0290746",
        "address": "Via Ugo 01, Montegrotto, 85639 Padova(PD)",
        "pizzas": [
          { "pizzaName": "Margherita", "additionalToppings": ["ham"] }
        ]
      }

      pizza_name    base_price
      Marinara      4
      Diavola       6
      Mari & Monti  8
      Salami        7
      Peperoni      8
      Margherita    5
  21. Kafka Source

      CREATE TABLE pizza_orders (
          id INT,
          shop VARCHAR,
          name VARCHAR,
          phoneNumber VARCHAR,
          address VARCHAR,
          pizzas ARRAY<ROW(
              pizzaName VARCHAR,
              additionalToppings ARRAY<VARCHAR>)>
      ) WITH (
          'connector' = 'kafka',
          'properties.bootstrap.servers' = 'kafka:13041',
          'topic' = 'pizza-orders',
          'scan.startup.mode' = 'earliest-offset',
  22. Pg Source

      CREATE TEMPORARY TABLE pizza_prices (
          pizza_name VARCHAR,
          base_price INT,
          PRIMARY KEY (pizza_name) NOT ENFORCED
      ) WITH (
          'connector' = 'jdbc',
          'url' = 'jdbc:postgresql://pghost:13039/db',
          'username' = 'avnadmin',
          'password' = 'verysecurepassword123',
          'table-name' = 'pizza_price'
      );
  23. Pg Target

      CREATE TABLE order_price (
          id INT,
          total_price BIGINT,
          PRIMARY KEY (id) NOT ENFORCED
      ) WITH (
          'connector' = 'jdbc',
          'url' = 'jdbc:postgresql://pghost:13039/db',
          'username' = 'avnadmin',
          'password' = 'verysecurepassword123',
          'table-name' = 'order_price'
      );
  24. Create Pipeline

      insert into order_price
      select id, sum(base_price) total_price
      group by id;

      insert into order_price
      select id, sum(base_price) total_price
      from pizza_orders
      cross join UNNEST(pizzas) b
      group by id;

      insert into order_price
      select id, sum(base_price) total_price
      from pizza_orders
      cross join UNNEST(pizzas) b
      LEFT OUTER JOIN pizza_prices
        FOR SYSTEM_TIME AS OF orderProctime AS pp
        ON b.pizzaName = pp.pizza_name
      group by id;
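The final statement's logic can be checked in plain Python: explode the pizzas array, look up each pizza's price, and sum per order id. A sketch using the sample order from slide 13 and the prices from slide 20 (the dict-based join is an illustration of the semantics, not how Flink executes it):

```python
# UNNEST the pizzas array, join each pizza to its base_price, GROUP BY order id.
pizza_prices = {"Marinara": 4, "Diavola": 6, "Mari & Monti": 8,
                "Salami": 7, "Peperoni": 8, "Margherita": 5}

orders = [{"id": 1,
           "pizzas": [{"pizzaName": "Margherita"},
                      {"pizzaName": "Diavola"}]}]

order_price = {}
for order in orders:
    for pizza in order["pizzas"]:                        # cross join UNNEST(pizzas)
        price = pizza_prices.get(pizza["pizzaName"], 0)  # left join on pizza_name
        order_price[order["id"]] = order_price.get(order["id"], 0) + price
```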
  25. (image slide)

  26. References

      https://aiven.io
      https://kafka.apache.org/
      http://flink.apache.org/
      https://aiven.io/blog/create-your-own-data-stream-for-kafka-with-python-and-faker
      https://github.com/aiven/sql-cli-for-apache-flink-docker