
Build an Open Source Data Pipeline

FTisiot
February 04, 2022


Any conversation about Big Data would be incomplete without talking about Apache Kafka and Apache Flink: the winning open source combination for high-volume streaming data pipelines.

In this talk we'll explore how moving from long-running batches to streaming data changes the game completely. We'll show how to build a streaming data pipeline, starting with Apache Kafka for storing and transmitting messages at high throughput and low latency. Then we'll add Apache Flink, a distributed stateful compute engine, to create complex streaming transformations using familiar SQL statements.



Transcript

  1. Olena Kutsenko - Francesco Tisiot - Dev Advocates
     @OlenaKutsenko - @ftisiot
     Build an Open Source Streaming Data Pipeline

  2. @OlenaKutsenko | @ftisiot | @aiven_io

  5. What is Apache Kafka?


  7. [Diagram: Topic A with partitions 0-4 and Topic B with partitions 0-3; producers append records to the topics, consumers read from them]

  8. [Diagram: Kafka brokers and replication factor (3, 2)]

  9. [Diagram: producer and consumer connected to a cluster of Kafka brokers]

  10. Integrating Apache Kafka

  11. Kafka Connect Source / Kafka Connect Sink
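Kafka Connect connectors are configured declaratively and submitted to the Connect REST API rather than written as code. As a rough sketch, here is a JDBC sink configuration expressed as a Python dict; the property names and connector class follow the Confluent JDBC sink connector, and all host names and credentials are placeholders, not values from the talk:

```python
# Illustrative Kafka Connect JDBC sink configuration (placeholder values).
# Property names follow the Confluent JDBC sink connector.
jdbc_sink_config = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "pizza-orders",                       # Kafka topic(s) to drain
    "connection.url": "jdbc:postgresql://pghost:13039/db",
    "connection.user": "avnadmin",
    "connection.password": "verysecurepassword123",
    "auto.create": "true",                          # create the target table if missing
}

# Submitting it is an HTTP call to the Connect REST API, e.g.:
# PUT http://connect:8083/connectors/pg-sink/config  (body: jdbc_sink_config as JSON)
```

The sink then continuously moves records from the topic into the database with no custom consumer code.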

  12. kafka-python
      from kafka import KafkaProducer

      producer = KafkaProducer(
          bootstrap_servers=['broker1:1234']
      )
      producer.send(
          'my-topic-name',
          b'my-message'
      )
      producer.flush()
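The producer above sends raw bytes; in practice you usually send structured records such as the JSON orders on the next slide. A minimal sketch of JSON serialization using kafka-python's `value_serializer` parameter, with the serializer kept as a standalone function so it can be exercised without a running broker:

```python
import json

def serialize_order(order: dict) -> bytes:
    # KafkaProducer(value_serializer=...) calls this for every record,
    # turning the Python dict into the UTF-8 JSON bytes stored on the topic.
    return json.dumps(order).encode("utf-8")

# Wiring it up would look like this (requires a reachable broker):
# producer = KafkaProducer(
#     bootstrap_servers=['broker1:1234'],
#     value_serializer=serialize_order,
# )
# producer.send('pizza-orders', {'id': 1, 'shop': "Mario's Pizza"})
```

With a serializer configured, `send()` accepts plain dicts instead of pre-encoded bytes.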

  13. {
        "id": 1,
        "shop": "Mario's Pizza",
        "name": "Arsenio Pisaroni-Boccaccio",
        "phoneNumber": "+39 51 0290746",
        "address": "Via Ugo 01, Montegrotto, 85639 Padova(PD)",
        "pizzas": [
          {
            "pizzaName": "Margherita",
            "additionalToppings": ["ham"]
          },
          {
            "pizzaName": "Diavola",
            "additionalToppings": ["mozzarella","banana","onion"]
          }
        ]
      }
      https://github.com/aiven/python-fake-data-producer-for-apache-kafka

  14. Compute / State

  15. Apache Flink


  17. Filter · Join · Aggregate · Explode · Detect · Transform
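Each of these is a streaming transformation applied record by record as data flows through Flink. As a rough Python analogy (a generator standing in for an unbounded stream; the function and field names are illustrative, not from the talk), three of the operations look like this:

```python
def transform(stream):
    """Filter, explode, and transform a stream of order dicts."""
    for order in stream:
        if order["pizzas"]:                       # filter: drop empty orders
            for pizza in order["pizzas"]:         # explode: one element per pizza
                yield pizza["pizzaName"].upper()  # transform: project + uppercase

orders = [{"pizzas": []}, {"pizzas": [{"pizzaName": "Diavola"}]}]
# list(transform(orders)) -> ["DIAVOLA"]
```

The difference in Flink is that the stream never ends and the engine keeps any state (e.g. aggregates) fault-tolerant across the cluster.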


  19. Flink in Action

  20. {
        "id": 1,
        "shop": "Mario's Pizza",
        "name": "Arsenio Pisaroni-Boccaccio",
        "phoneNumber": "+39 51 0290746",
        "address": "Via Ugo 01, Montegrotto, 85639 Padova(PD)",
        "pizzas": [
          {
            "pizzaName": "Margherita",
            "additionalToppings": ["ham"]
          }
        ]
      }

      pizza_name    base_price
      Marinara      4
      Diavola       6
      Mari & Monti  8
      Salami        7
      Peperoni      8
      Margherita    5

  21. Kafka Source
      CREATE TABLE pizza_orders (
          id INT,
          shop VARCHAR,
          name VARCHAR,
          phoneNumber VARCHAR,
          address VARCHAR,
          pizzas ARRAY<ROW(
              pizzaName VARCHAR,
              additionalToppings ARRAY<VARCHAR>
          )>
      ) WITH (
          'connector' = 'kafka',
          'properties.bootstrap.servers' = 'kafka:13041',
          'topic' = 'pizza-orders',
          'scan.startup.mode' = 'earliest-offset',
          ...
      )

  22. Pg Source
      CREATE TEMPORARY TABLE pizza_prices (
          pizza_name VARCHAR,
          base_price INT,
          PRIMARY KEY (pizza_name) NOT ENFORCED
      ) WITH (
          'connector' = 'jdbc',
          'url' = 'jdbc:postgresql://pghost:13039/db',
          'username' = 'avnadmin',
          'password' = 'verysecurepassword123',
          'table-name' = 'pizza_price'
      );

  23. Pg Target
      CREATE TABLE order_price (
          id INT,
          total_price BIGINT,
          PRIMARY KEY (id) NOT ENFORCED
      ) WITH (
          'connector' = 'jdbc',
          'url' = 'jdbc:postgresql://pghost:13039/db',
          'username' = 'avnadmin',
          'password' = 'verysecurepassword123',
          'table-name' = 'order_price'
      );

  24. Create Pipeline
      INSERT INTO order_price
      SELECT id,
             sum(base_price) total_price
      FROM pizza_orders CROSS JOIN UNNEST(pizzas) b
      LEFT OUTER JOIN pizza_prices
          FOR SYSTEM_TIME AS OF orderProctime AS pp
          ON b.pizzaName = pp.pizza_name
      GROUP BY id;
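The streaming SQL packs three steps into one statement: explode the pizzas array (CROSS JOIN UNNEST), look up each pizza's base price (the temporal join against pizza_prices), and sum per order (GROUP BY id). The same per-order computation in plain Python, as a sketch (function names and the in-memory price dict are illustrative stand-ins for the Postgres table):

```python
# Static lookup dict standing in for the pizza_prices Postgres table.
PIZZA_PRICES = {"Margherita": 5, "Diavola": 6, "Marinara": 4}

def order_total(order: dict, prices: dict = PIZZA_PRICES) -> dict:
    # CROSS JOIN UNNEST(pizzas): iterate one row per pizza in the order.
    # LEFT OUTER JOIN pizza_prices: look up the base price (.get keeps
    # unknown pizzas instead of dropping the row, here priced at 0).
    # GROUP BY id + sum(base_price): aggregate back to one row per order.
    total = sum(prices.get(p["pizzaName"], 0) for p in order["pizzas"])
    return {"id": order["id"], "total_price": total}

order = {"id": 1, "pizzas": [{"pizzaName": "Margherita"}, {"pizzaName": "Diavola"}]}
# order_total(order) -> {"id": 1, "total_price": 11}
```

Flink runs this continuously over the unbounded Kafka topic and upserts the running result into order_price, which is what the primary key on id in the target table is for.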


  26. References
      https://aiven.io
      http://flink.apache.org/
      https://kafka.apache.org/
      https://aiven.io/blog/create-your-own-data-stream-for-kafka-with-python-and-faker
      https://github.com/aiven/sql-cli-for-apache-flink-docker