Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL

Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL

Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again!

Companies new and old are all recognising the importance of a low-latency, scalable, fault-tolerant data backbone, in the form of the Apache Kafka® streaming platform. With Kafka, developers can integrate multiple sources and systems, which enables low latency analytics, event driven architectures and the population of multiple downstream systems. These data pipelines can be built using configuration alone.

In this talk, we’ll see how easy it is to stream data from a database such as PostgreSQL into Kafka using CDC and Kafka Connect. In addition, we’ll use KSQL to filter, aggregate and join it to other data, and then stream this from Kafka out into multiple targets such as Elasticsearch and S3. All of this can be accomplished without a single line of code!

Why should Java geeks have all the fun?

Robin Moffatt

July 03, 2018
Tweet

More Decks by Robin Moffatt

Other Decks in Technology

Transcript

  1. Streaming ETL in Practice
    with PostgreSQL, Apache
    Kafka, and KSQL
    PGConf UK 2018
    3 Jul 2018 / Robin Moffatt
    @rmoff [email protected]
    https://speakerdeck.com/rmoff/

    View Slide

  2. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 2
    • Developer Advocate @ Confluent
    • Working in data & analytics since 2001
    • Oracle ACE Director & Dev Champion
    • Blogging : http://rmoff.net & http://cnfl.io/rmoff
    • Twitter: @rmoff
    • Geek stuff
    • Beer & Fried Breakfasts
    $ whoami
    https://speakerdeck.com/rmoff/

    View Slide

  3. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 3
    App App App App
    search
    Hadoop
    DWH
    monitoring security
    MQ MQ
    cache
    cache
    A bit of a mess…

    View Slide

  4. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 4
    The Streaming Platform
    KAFKA
    DWH Hadoop
    App
    App App App App
    App
    App
    App
    request-response
    messaging
    OR
    stream
    processing
    streaming data pipelines
    changelogs

    View Slide

  5. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 5
    Database offload → Analytics
    HDFS / S3 /
    BigQuery etc
    RDBMS
    CDC

    View Slide

  6. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 6
    Streaming ETL with Apache Kafka and KSQL
    order events
    customer
    customer orders
    Stream
    Processing
    RDBMS CDC

    View Slide

  7. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 7
    Real-time Event Stream Enrichment
    order events
    customer
    Stream
    Processing
    customer orders
    RDBMS

    CDC

    View Slide

  8. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 8
    Transform Once, Use Many
    order events
    customer
    Stream
    Processing
    customer orders
    RDBMS

    New App

    CDC

    View Slide

  9. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 9
    Transform Once, Use Many
    order events
    customer
    Stream
    Processing
    customer orders
    RDBMS

    HDFS / S3 / etc
    New App

    CDC

    View Slide

  10. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 10
    KSQL
    Streaming ETL with Apache Kafka

    View Slide

  11. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 11
    Streaming Integration with Kafka Connect
    Kafka Brokers
    Kafka Connect
    Tasks Workers
    Sources Sinks
    Amazon S3
    syslog
    flat file
    CSV
    JSON

    View Slide

  12. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 12
    The Connect API of Apache Kafka®
    ✓ Fault tolerant and automatically load balanced
    ✓ Extensible API
    ✓ Single Message Transforms
    ✓ Part of Apache Kafka, included in

    Confluent Open Source
    Reliable and scalable integration of Kafka
    with other systems – no coding required.
    {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
    "table.whitelist": "sales,orders,customers"
    }
    https://docs.confluent.io/current/connect/
    ✓ Centralized management and configuration
    ✓ Support for hundreds of technologies
    including RDBMS, Elasticsearch, HDFS, S3
    ✓ Supports CDC ingest of events from RDBMS
    ✓ Preserves data schema

    View Slide

  13. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 13
    Integrating Postgres with Kafka
    Kafka Connect
    & Debezium
    Kafka Connect
    & JDBC Sink

    View Slide

  14. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 14
    Confluent Hub
    hub.confluent.io
    • Launched June 2018
    • One-stop place to discover and
    download :
    • Connectors
    • Transformations
    • Converters

    View Slide

  15. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 15
    KSQL
    Streaming ETL with Apache Kafka

    View Slide

  16. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018
    Declarative
    Stream
    Language
    Processing
    KSQL
    is a

    View Slide

  17. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018
    KSQL
    is the
    Streaming
    SQL Engine
    for
    Apache Kafka

    View Slide

  18. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018
    KSQL for Streaming ETL
    CREATE STREAM vip_actions AS 

    SELECT userid, page, action
    FROM clickstream c
    LEFT JOIN users u
    ON c.userid = u.user_id 

    WHERE u.level = 'Platinum';
    Joining, filtering, and aggregating streams of event data

    View Slide

  19. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018
    KSQL for Anomaly Detection
    CREATE TABLE possible_fraud AS

    SELECT card_number, count(*)

    FROM authorization_attempts 

    WINDOW TUMBLING (SIZE 5 SECONDS)

    GROUP BY card_number

    HAVING count(*) > 3;
    Identifying patterns or anomalies in real-time data,
    surfaced in milliseconds

    View Slide

  20. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018
    KSQL for Real-Time Monitoring
    • Log data monitoring, tracking and alerting
    • syslog data
    • Sensor / IoT data
    CREATE STREAM SYSLOG_INVALID_USERS AS
    SELECT HOST, MESSAGE
    FROM SYSLOG
    WHERE MESSAGE LIKE '%Invalid user%';
    http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting

    View Slide

  21. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 21
    KSQL
    Streaming ETL with Apache Kafka

    View Slide

  22. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 22
    Kafka Connect
    Producer API
    Elasticsearch
    Kafka Connect
    {
    "rating_id": 5313,
    "user_id": 3,
    "stars": 4,
    "route_id": 6975,
    "rating_time": 1519304105213,
    "channel": "web",
    "message": "worst. flight. ever. #neveragain"
    }
    {
    "id": 3,
    "first_name": "Merilyn",
    "last_name": "Doughartie",
    "email": "[email protected]",
    "gender": "Female",
    "club_status": "platinum",
    "comments": "none"
    }
    Postgres
    Demo Time!
    Kafka Connect
    Postgres

    View Slide

  23. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 23
    Producer API
    {
    "rating_id": 5313,
    "user_id": 3,
    "stars": 4,
    "route_id": 6975,
    "rating_time": 1519304105213,
    "channel": "web",
    "message": "worst. flight. ever. #neveragain"
    }
    POOR_RATINGS
    Filter all ratings where STARS<3
    CREATE STREAM POOR_RATINGS AS
    SELECT * FROM ratings WHERE STARS <3

    View Slide

  24. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 24
    Kafka Connect
    Producer API
    {
    "rating_id": 5313,
    "user_id": 3,
    "stars": 4,
    "route_id": 6975,
    "rating_time": 1519304105213,
    "channel": "web",
    "message": "worst. flight. ever. #neveragain"
    }
    {
    "id": 3,
    "first_name": "Merilyn",
    "last_name": "Doughartie",
    "email": "[email protected]",
    "gender": "Female",
    "club_status": "platinum",
    "comments": "none"
    }
    RATINGS_WITH_CUSTOMER_DATA
    Join each rating to customer data
    UNHAPPY_PLATINUM_CUSTOMERS
    Filter for just PLATINUM customers
    CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS
    SELECT * FROM RATINGS_WITH_CUSTOMER_DATA
    WHERE STARS < 3

    View Slide

  25. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 25
    Kafka Connect
    Producer API
    {
    "rating_id": 5313,
    "user_id": 3,
    "stars": 4,
    "route_id": 6975,
    "rating_time": 1519304105213,
    "channel": "web",
    "message": "worst. flight. ever. #neveragain"
    }
    {
    "id": 3,
    "first_name": "Merilyn",
    "last_name": "Doughartie",
    "email": "[email protected]",
    "gender": "Female",
    "club_status": "platinum",
    "comments": "none"
    }
    RATINGS_WITH_CUSTOMER_DATA
    Join each rating to customer data
    RATINGS_BY_CLUB_STATUS_1MIN
    Aggregate per-minute by CLUB_STATUS
    CREATE TABLE RATINGS_BY_CLUB_STATUS AS
    SELECT CLUB_STATUS, COUNT(*)
    FROM RATINGS_WITH_CUSTOMER_DATA
    WINDOW TUMBLING (SIZE 1 MINUTES)
    GROUP BY CLUB_STATUS;

    View Slide

  26. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 26
    Confluent Open Source :
    Apache Kafka with a bunch of cool stuff! For free!
    Database Changes Log Events loT Data Web Events …
    CRM
    Data Warehouse
    Database
    Hadoop
    Data

    Integration

    Monitoring
    Analytics
    Custom Apps
    Transformations
    Real-time Applications

    Apache Open Source Confluent Open Source Confluent Enterprise
    Confluent Platform
    Confluent Platform
    Apache Kafka®
    Core | Connect API | Streams API
    Data Compatibility
    Schema Registry
    Monitoring & Administration
    Confluent Control Center | Security
    Operations
    Replicator | Auto Data Balancing
    Development and Connectivity
    Clients | Connectors | REST Proxy | CLI
    Apache Open Source Confluent Open Source Confluent Enterprise
    SQL Stream Processing
    KSQL

    View Slide

  27. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 27
    Free Books!
    https://www.confluent.io/apache-kafka-stream-processing-book-bundle

    View Slide

  28. @rmoff [email protected]
    http://cnfl.io/slack
    https://www.confluent.io/download/

    View Slide

  29. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 29
    • Postgres integration into Kafka
    • http://debezium.io/docs/connectors/postgresql/
    • https://www.simple.com/engineering/a-change-data-capture-pipeline-from-postgresql-to-kafka
    • https://www.slideshare.net/JeffKlukas/postgresql-kafka-the-delight-of-change-data-capture
    • https://blog.insightdatascience.com/from-postgresql-to-redshift-with-kafka-connect-111c44954a6a
    • Streaming ETL
    • Embrace the Anarchy : Apache Kafka's Role in Modern Data Architectures Recording & Slides
    • Look Ma, no Code! Building Streaming Data Pipelines with Apache Kafka and KSQL
    • Steps to Building a Streaming ETL Pipeline with Apache Kafka and KSQL Recording & Slides
    • https://www.confluent.io/blog/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data
    • https://github.com/confluentinc/ksql/
    Useful links

    View Slide

  30. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL - PGConf UK 2018 30
    • CDC Spreadsheet
    • Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC
    • #partner-engineering on Slack for questions
    • BD team (#partners / partne[email protected]) can help with introductions on a given sales op
    Resources
    #EOF

    View Slide