
Build an Open Source Data Pipeline

FTisiot
February 04, 2022


Any conversation about Big Data would be incomplete without talking about Apache Kafka and Apache Flink: the winning open source combination for high-volume streaming data pipelines.

In this talk we'll explore how moving from long-running batches to streaming data changes the game completely. We'll show how to build a streaming data pipeline, starting with Apache Kafka for storing and transmitting high-throughput, low-latency messages. Then we'll add Apache Flink, a distributed stateful compute engine, to create complex streaming transformations using familiar SQL statements.


Transcript

  1. Olena Kutsenko - Francesco Tisiot - Dev Advocates @OlenaKutsenko -

    @ftisiot Build an Open Source Streaming Data Pipeline
  2. @OlenaKutsenko | @ftisiot | @aiven_io

  3. @OlenaKutsenko | @ftisiot | @aiven_io

  4. @OlenaKutsenko | @ftisiot | @aiven_io

  5. @OlenaKutsenko | @ftisiot | @aiven_io What is Apache Kafka?

  6. @OlenaKutsenko | @ftisiot | @aiven_io

  7. @OlenaKutsenko | @ftisiot | @aiven_io Topic A Topic B 0

    1 2 3 4 0 1 2 3 Producer Consumer Producer Consumer Consumer
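The offsets in the diagram hint at how keyed messages are spread across a topic's partitions. A minimal sketch of that idea in Python (illustrative only: Kafka's real default partitioner murmur2-hashes the message key; `crc32` below is a stand-in to show the mechanism):

```python
import zlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    # Hash the key and take it modulo the partition count: the same key
    # always lands on the same partition, which preserves per-key ordering.
    # Kafka's actual default partitioner uses murmur2, not crc32.
    return zlib.crc32(key) % num_partitions

p1 = pick_partition(b'order-42', 3)
p2 = pick_partition(b'order-42', 3)
# p1 == p2: repeated sends with the same key go to the same partition
```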
  8. @OlenaKutsenko | @ftisiot | @aiven_io Brokers Replication Factor 3 2

  9. @OlenaKutsenko | @ftisiot | @aiven_io Brokers Producer Consumer

  10. @OlenaKutsenko | @ftisiot | @aiven_io Integrating Apache Kafka

  11. @OlenaKutsenko | @ftisiot | @aiven_io Kafka Connect Source Kafka Connect

    Sink
  12. @OlenaKutsenko | @ftisiot | @aiven_io kafka-python

    from kafka import KafkaProducer

    producer = KafkaProducer(
      bootstrap_servers=['broker1:1234']
    )
    producer.send('my-topic-name', b'my-message')
    producer.flush()
  13. @OlenaKutsenko | @ftisiot | @aiven_io

    {
      "id": 1,
      "shop": "Mario's Pizza",
      "name": "Arsenio Pisaroni-Boccaccio",
      "phoneNumber": "+39 51 0290746",
      "address": "Via Ugo 01, Montegrotto, 85639 Padova(PD)",
      "pizzas": [
        { "pizzaName": "Margherita", "additionalToppings": ["ham"] },
        { "pizzaName": "Diavola", "additionalToppings": ["mozzarella", "banana", "onion"] }
      ]
    }

    https://github.com/aiven/python-fake-data-producer-for-apache-kafka
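An order like this is sent to Kafka by serialising the dict to JSON bytes for the message value; a minimal sketch of that step (the broker address and topic name are placeholders, and the full fake-order generator lives in the linked repo):

```python
import json

def to_kafka_value(order: dict) -> bytes:
    # Serialise the order dict to the JSON bytes used as the Kafka message value.
    return json.dumps(order).encode('utf-8')

order = {
    "id": 1,
    "shop": "Mario's Pizza",
    "pizzas": [{"pizzaName": "Margherita", "additionalToppings": ["ham"]}],
}
value = to_kafka_value(order)

# With a reachable broker, the value could then be sent as on the
# kafka-python slide:
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers=['broker1:1234'])
#   producer.send('pizza-orders', value)
#   producer.flush()
```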
  14. @OlenaKutsenko | @ftisiot | @aiven_io Compute State

  15. @OlenaKutsenko | @ftisiot | @aiven_io Apache Flink

  16. @OlenaKutsenko | @ftisiot | @aiven_io

  17. @OlenaKutsenko | @ftisiot | @aiven_io Filter Join Aggregate Explode Detect

    Transform
  18. @OlenaKutsenko | @ftisiot | @aiven_io

  19. @OlenaKutsenko | @ftisiot | @aiven_io Flink in Action

  20. @OlenaKutsenko | @ftisiot | @aiven_io

    {
      "id": 1,
      "shop": "Mario's Pizza",
      "name": "Arsenio Pisaroni-Boccaccio",
      "phoneNumber": "+39 51 0290746",
      "address": "Via Ugo 01, Montegrotto, 85639 Padova(PD)",
      "pizzas": [
        { "pizzaName": "Margherita", "additionalToppings": ["ham"] }
      ]
    }

    pizza_name    base_price
    Marinara      4
    Diavola       6
    Mari & Monti  8
    Salami        7
    Peperoni      8
    Margherita    5
  21. @OlenaKutsenko | @ftisiot | @aiven_io Kafka Source

    CREATE TABLE pizza_orders (
      id INT,
      shop VARCHAR,
      name VARCHAR,
      phoneNumber VARCHAR,
      address VARCHAR,
      pizzas ARRAY<ROW(
        pizzaName VARCHAR,
        additionalToppings ARRAY<VARCHAR>)>
    ) WITH (
      'connector' = 'kafka',
      'properties.bootstrap.servers' = 'kafka:13041',
      'topic' = 'pizza-orders',
      'scan.startup.mode' = 'earliest-offset',
  22. @OlenaKutsenko | @ftisiot | @aiven_io Pg Source

    CREATE TEMPORARY TABLE pizza_prices (
      pizza_name VARCHAR,
      base_price INT,
      PRIMARY KEY (pizza_name) NOT ENFORCED
    ) WITH (
      'connector' = 'jdbc',
      'url' = 'jdbc:postgresql://pghost:13039/db',
      'username' = 'avnadmin',
      'password' = 'verysecurepassword123',
      'table-name' = 'pizza_price'
    );
  23. @OlenaKutsenko | @ftisiot | @aiven_io Pg Target

    CREATE TABLE order_price (
      id INT,
      total_price BIGINT,
      PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
      'connector' = 'jdbc',
      'url' = 'jdbc:postgresql://pghost:13039/db',
      'username' = 'avnadmin',
      'password' = 'verysecurepassword123',
      'table-name' = 'order_price'
    );
  24. @OlenaKutsenko | @ftisiot | @aiven_io Create Pipeline

    insert into order_price
    select id, sum(base_price) total_price
    from pizza_orders
    cross join UNNEST(pizzas) b
    LEFT OUTER JOIN pizza_prices
      FOR SYSTEM_TIME AS OF orderProctime AS pp
      ON b.pizzaName = pp.pizza_name
    group by id;
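As a mental model for what the pipeline query computes (not how Flink executes it), the same logic can be sketched in plain Python: explode each order into one row per pizza, look each pizza up in the price table, and sum per order id. Pizza names and prices come from the slides above.

```python
# Price lookup table, mirroring the pizza_prices table on the slides.
pizza_prices = {"Marinara": 4, "Diavola": 6, "Mari & Monti": 8,
                "Salami": 7, "Peperoni": 8, "Margherita": 5}

def order_total(order: dict, prices: dict) -> tuple:
    # CROSS JOIN UNNEST(pizzas): one row per pizza in the order.
    # The join to pizza_prices becomes a dict lookup, and
    # GROUP BY id with sum(base_price) becomes a per-order sum.
    return order["id"], sum(prices[p["pizzaName"]] for p in order["pizzas"])

order = {"id": 1, "pizzas": [{"pizzaName": "Margherita"},
                             {"pizzaName": "Diavola"}]}
order_total(order, pizza_prices)  # (1, 11): Margherita 5 + Diavola 6
```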
  25. @OlenaKutsenko | @ftisiot | @aiven_io

  26. @OlenaKutsenko | @ftisiot | @aiven_io References

    https://aiven.io
    https://kafka.apache.org/
    http://flink.apache.org/
    https://aiven.io/blog/create-your-own-data-stream-for-kafka-with-python-and-faker
    https://github.com/aiven/sql-cli-for-apache-flink-docker