JavaZone Workshop - Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Robin Moffatt
September 11, 2018

Code: https://cnfl.io/ksql-workshop

Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again! Apache Kafka is a distributed, scalable, and fault-tolerant streaming platform, providing low-latency pub-sub messaging coupled with native storage and stream processing capabilities. Integrating Kafka with RDBMS, NoSQL, and object stores is simple with the Kafka Connect API, which is part of Apache Kafka. KSQL is the open-source SQL streaming engine for Apache Kafka, and makes it possible to build stream processing applications at scale, written using a familiar SQL interface.

In this talk we’ll explain the architectural reasoning for Apache Kafka and the benefits of real-time integration, and we’ll build a streaming data pipeline using nothing but our bare hands, the Kafka Connect API, and KSQL.

Gasp as we filter events in real time! Be amazed at how we can enrich streams of data with data from RDBMS! Be astonished at the power of streaming aggregates for anomaly detection!

Transcript

  1. Apache Kafka and KSQL in Action :
    Let’s Build a Streaming Data Pipeline!
    @rmoff [email protected]
    confluent.io/ksql

  2. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 2
    https://cnfl.io/ksql-workshop-prereqs
    • Make sure you allocate Docker >=8GB memory
      docker system info | grep Memory
    • Clone the repo
      • Should default to branch 5.0.0-post
    • Pull the Docker images as instructed in the doc

  3. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 3
    https://cnfl.io/ksql-workshop
    3: Start up the Stack

  4. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 4
    $ whoami
    • Developer Advocate @ Confluent
    • Working in data & analytics since 2001
    • Oracle ACE Director & Dev Champion
    • Blogging : http://rmoff.net & http://cnfl.io/rmoff
    • Twitter: @rmoff
    • Geek stuff
    • Beer & Fried Breakfasts
    https://speakerdeck.com/rmoff/

  5. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 5
    A bit of a mess…
    [Diagram: App ×4, search, Hadoop, DWH, monitoring, security, MQ ×2, cache ×2]

  6. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 6
    Kafka is a Streaming Platform
    [Diagram: KAFKA, DWH, Hadoop, multiple Apps; labels: request-response, messaging OR stream processing, streaming data pipelines, changelogs]

  7. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 7
    Analytics - Database Offload
    [Diagram: RDBMS, CDC, HDFS / S3 / BigQuery etc]

  8. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 8
    Stream Processing with Apache Kafka and KSQL
    [Diagram: RDBMS, CDC, customer, order events, Stream Processing, customer orders]

  9. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 9
    Real-time Event Stream Enrichment
    [Diagram: RDBMS, CDC, customer, order events, Stream Processing, customer orders]

  10. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 10
    Transform Once, Use Many
    [Diagram: RDBMS, CDC, customer, order events, Stream Processing, customer orders, New App]

  11. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 11
    Transform Once, Use Many
    [Diagram: RDBMS, CDC, customer, order events, Stream Processing, customer orders, HDFS / S3 / etc, New App]

  12. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 12
    Let’s Build It!
    [Diagram: Rating events; User data; Join events to users, and filter; Push notification to Slack; Operational Dashboard; Data Lake]

  13. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 13
    Let’s Build It!
    [Diagram: Rating events (App, Producer API); User data (RDBMS); Join events to users, and filter; Push notification to Slack (App, Consumer API); Operational Dashboard (Elasticsearch); Data Lake (S3/HDFS/SnowflakeDB etc)]

  14. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 14
    Confluent Open Source : Apache Kafka with a bunch of cool stuff! For free!
    Inputs: Database Changes, Log Events, IoT Data, Web Events, …
    Systems and uses: CRM, Data Warehouse, Database, Hadoop, Data Integration, Monitoring, Analytics, Custom Apps, Transformations, Real-time Applications
    Confluent Platform:
    • Apache Kafka®: Core | Connect API | Streams API
    • Data Compatibility: Schema Registry
    • Monitoring & Administration: Confluent Control Center | Security
    • Operations: Replicator | Auto Data Balancing
    • Development and Connectivity: Clients | Connectors | REST Proxy | CLI
    • SQL Stream Processing: KSQL
    Legend: Apache Open Source, Confluent Open Source, Confluent Enterprise

  15. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 15
    Kafka Connect
    [Diagram: Rating events (App, Producer API); User data (RDBMS); Join events to users, and filter; Push notification to Slack (App, Consumer API); Operational Dashboard (Elasticsearch); Data Lake (S3/HDFS/SnowflakeDB etc); Kafka Connect ×3]

  16. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 16
    Streaming Integration with Kafka Connect
    [Diagram: Sources, Kafka Connect (Tasks, Workers), Kafka Brokers, Sinks; example systems: Amazon S3, syslog, flat file, CSV, JSON, MQTT]

  17. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 17
    Kafka Connect
    Reliable and scalable integration of Kafka with other systems – no coding required.
    {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
    "table.whitelist": "sales,orders,customers"
    }
    https://docs.confluent.io/current/connect/
    ✓ Centralized management and configuration
    ✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3
    ✓ Supports CDC ingest of events from RDBMS
    ✓ Preserves data schema
    ✓ Fault tolerant and automatically load balanced
    ✓ Extensible API
    ✓ Single Message Transforms
    ✓ Part of Apache Kafka, included in Confluent Open Source
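
A connector configuration like the JDBC one on this slide is typically handed to a Kafka Connect worker through its REST interface. The sketch below only builds the request and does not send it. The worker address (`localhost:8083`) and the connector name `jdbc-source-demo` are assumptions for illustration, not taken from the workshop, and a working JDBC source would normally need further settings (for example an offset-tracking mode and a topic prefix).

```python
import json
import urllib.request

# Assumed worker address; 8083 is the usual default port for the Connect REST API.
CONNECT_URL = "http://localhost:8083/connectors"


def build_connector_request(name, config, url=CONNECT_URL):
    """Build (but do not send) a POST request that registers a connector."""
    payload = {"name": name, "config": config}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The three settings shown on the slide.
jdbc_config = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
    "table.whitelist": "sales,orders,customers",
}

# "jdbc-source-demo" is a hypothetical connector name.
request = build_connector_request("jdbc-source-demo", jdbc_config)

# Against a running worker, this would submit it:
# urllib.request.urlopen(request)
```

Keeping the request construction separate from the network call makes the payload easy to inspect before anything is sent to a worker.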

  18. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 18
    Kafka Connect + Schema Registry = WIN
    [Diagram: RDBMS, Kafka Connect, Avro Message, Schema Registry, Avro Schema, Kafka Connect, Elasticsearch]

  19. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 19
    Kafka Connect + Schema Registry = WIN
    [Diagram: RDBMS, Kafka Connect, Avro Message, Schema Registry, Avro Schema, Kafka Connect, Elasticsearch]

  20. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 20
    Confluent Hub
    hub.confluent.io
    • One-stop place to discover and download:
      • Connectors
      • Transformations
      • Converters

  21. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 21
    Demo Time!
    [Diagram: MySQL, Debezium, Kafka Connect, Producer API]

  22. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 22
    https://cnfl.io/ksql-workshop
    4 & 5: Setup & Inspect source data

  23. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 23
    Let’s Build It!
    [Diagram: Rating events (App, Producer API); User data (RDBMS); Join events to users, and filter; Push notification to Slack (App, Consumer API); Operational Dashboard (Elasticsearch); Data Lake (S3/HDFS/SnowflakeDB etc); Kafka Connect ×3]

  24. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 24
    KSQL
    [Diagram: Rating events (App, Producer API); User data (RDBMS); Join events to users, and filter (KSQL); Push notification to Slack (App, Consumer API); Operational Dashboard (Elasticsearch); Data Lake (S3/HDFS/SnowflakeDB etc); Kafka Connect ×3]

  25. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql
    KSQL is a Declarative Stream Processing Language

  26. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql
    KSQL is the Streaming SQL Engine for Apache Kafka

  27. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql
    KSQL in Development and Production
    • Interactive KSQL (REST) for development and testing: “Hmm, let me try out this idea...”
    • Headless KSQL for Production: Desired KSQL queries have been identified
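
In interactive mode, statements reach the KSQL server over its REST interface. As a minimal sketch, the request for a single statement could be built as below. The server address (`localhost:8088`) is an assumption, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Assumed KSQL server address; 8088 is the usual default listener port.
KSQL_URL = "http://localhost:8088/ksql"


def build_ksql_request(statement, url=KSQL_URL):
    """Build (but do not send) a POST request carrying one KSQL statement."""
    payload = {"ksql": statement, "streamsProperties": {}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
        method="POST",
    )


request = build_ksql_request("LIST STREAMS;")

# Against a running server, this would submit it:
# urllib.request.urlopen(request)
```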

  28. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 28
    Producer API
    {
    "rating_id": 5313,
    "user_id": 3,
    "stars": 4,
    "route_id": 6975,
    "rating_time": 1519304105213,
    "channel": "web",
    "message": "worst. flight. ever. #neveragain"
    }
    POOR_RATINGS
    Filter all ratings where STARS<3
    CREATE STREAM POOR_RATINGS AS
    SELECT * FROM ratings WHERE STARS < 3;
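
To make the semantics of the filter concrete, here is a small Python sketch of the same idea: each event is tested as it arrives and only the matching ones are passed on. It is an illustration of the behaviour, not how KSQL is implemented. The second and third events are made up for the example.

```python
def poor_ratings(ratings, threshold=3):
    """Yield each rating whose star count is below the threshold, as it arrives."""
    for rating in ratings:
        if rating["stars"] < threshold:
            yield rating


ratings = [
    {"rating_id": 5313, "user_id": 3, "stars": 4, "channel": "web"},
    {"rating_id": 5314, "user_id": 7, "stars": 1, "channel": "ios"},  # made-up event
    {"rating_id": 5315, "user_id": 3, "stars": 2, "channel": "web"},  # made-up event
]

result = list(poor_ratings(ratings))
print([r["rating_id"] for r in result])
```

Because `poor_ratings` is a generator, it never needs the whole input at once, which mirrors how a continuous query keeps running over an unbounded topic.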

  29. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 29
    https://cnfl.io/ksql-workshop
    6: KSQL CLI
    7: Querying the Ratings topic
    8. Populating a Kafka topic with KSQL

  30. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 30
    Do you think that’s a table
    you are querying?

  31. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 31
    The Table Stream Duality
    Stream (Account ID, Amount) → Table (Account ID, Balance), over Time:
    • 12345, + €50 → 12345, €50
    • 12345, + €25 → 12345, €75
    • 12345, -€60 → 12345, €15
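
The duality can be sketched in a few lines of Python: replaying the stream of changes for account 12345 rebuilds the table, and every intermediate state of the table corresponds to one point in the stream. The function name `replay` is just an illustrative choice.

```python
def replay(changes):
    """Apply a stream of (account_id, amount) changes to an initially empty table.

    Returns the final table plus a snapshot of the table after each change.
    """
    table = {}
    snapshots = []
    for account_id, amount in changes:
        table[account_id] = table.get(account_id, 0) + amount
        snapshots.append(dict(table))  # copy, so later updates do not alter it
    return table, snapshots


# The three events from the slide, amounts in euros.
stream = [("12345", 50), ("12345", 25), ("12345", -60)]

table, snapshots = replay(stream)
print(snapshots)  # [{'12345': 50}, {'12345': 75}, {'12345': 15}]
print(table)      # {'12345': 15}
```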

  32. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 32
    The truth is the log.
    The database is a cache
    of a subset of the log.
    —Pat Helland
    Immutability Changes Everything
    http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
    Photo by Bobby Burch on Unsplash

  33. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 33
    Kafka Connect
    Producer API
    {
    "rating_id": 5313,
    "user_id": 3,
    "stars": 4,
    "route_id": 6975,
    "rating_time": 1519304105213,
    "channel": "web",
    "message": "worst. flight. ever. #neveragain"
    }
    {
    "id": 3,
    "first_name": "Merilyn",
    "last_name": "Doughartie",
    "email": "[email protected]",
    "gender": "Female",
    "club_status": "platinum",
    "comments": "none"
    }
    RATINGS_WITH_CUSTOMER_DATA
    Join each rating to customer data
    CREATE STREAM RATINGS_WITH_CUSTOMER_DATA AS
    SELECT * FROM RATINGS R LEFT JOIN CUSTOMERS C
    ON R.USER_ID = C.ID;
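
Conceptually, this join looks up each rating's `user_id` in the customers table as the rating arrives, and a left join keeps ratings even when no customer matches. The Python sketch below illustrates that behaviour with an in-memory dict standing in for the table. It is not how KSQL executes the join, and the second rating is made up to show the no-match case.

```python
def enrich_ratings(ratings, customers):
    """Left-join each rating to the customers table on user_id = id."""
    for rating in ratings:
        customer = customers.get(rating["user_id"])  # None if no matching customer
        enriched = dict(rating)
        for field in ("first_name", "last_name", "club_status"):
            enriched[field] = customer[field] if customer else None
        yield enriched


# Customers table keyed by id, using the record from the slide.
customers = {
    3: {"id": 3, "first_name": "Merilyn", "last_name": "Doughartie",
        "club_status": "platinum"},
}

ratings = [
    {"rating_id": 5313, "user_id": 3, "stars": 4},
    {"rating_id": 5314, "user_id": 99, "stars": 1},  # made-up event, unknown user
]

enriched = list(enrich_ratings(ratings, customers))
print(enriched[0]["club_status"])  # platinum
print(enriched[1]["club_status"])  # None
```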

  34. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 34
    Kafka Connect
    Producer API
    {
    "rating_id": 5313,
    "user_id": 3,
    "stars": 4,
    "route_id": 6975,
    "rating_time": 1519304105213,
    "channel": "web",
    "message": "worst. flight. ever. #neveragain"
    }
    {
    "id": 3,
    "first_name": "Merilyn",
    "last_name": "Doughartie",
    "email": "[email protected]",
    "gender": "Female",
    "club_status": "platinum",
    "comments": "none"
    }
    RATINGS_WITH_CUSTOMER_DATA
    Join each rating to customer data
    UNHAPPY_PLATINUM_CUSTOMERS
    Filter for just PLATINUM customers
    CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS
    SELECT * FROM RATINGS_WITH_CUSTOMER_DATA
    WHERE STARS < 3 AND CLUB_STATUS = 'platinum';

  35. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 35
    Kafka Connect
    Producer API
    {
    "rating_id": 5313,
    "user_id": 3,
    "stars": 4,
    "route_id": 6975,
    "rating_time": 1519304105213,
    "channel": "web",
    "message": "worst. flight. ever. #neveragain"
    }
    {
    "id": 3,
    "first_name": "Merilyn",
    "last_name": "Doughartie",
    "email": "[email protected]",
    "gender": "Female",
    "club_status": "platinum",
    "comments": "none"
    }
    RATINGS_WITH_CUSTOMER_DATA
    Join each rating to customer data
    RATINGS_BY_CLUB_STATUS_1MIN
    Aggregate per-minute by CLUB_STATUS
    CREATE TABLE RATINGS_BY_CLUB_STATUS AS
    SELECT CLUB_STATUS, COUNT(*)
    FROM RATINGS_WITH_CUSTOMER_DATA
    WINDOW TUMBLING (SIZE 1 MINUTES)
    GROUP BY CLUB_STATUS;
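
A tumbling window splits time into fixed, non-overlapping buckets, and the aggregate is kept per bucket and per group key. The sketch below shows the idea in Python, assuming windows aligned to the epoch and events that already carry a `club_status` field. The sample events are made up apart from the first timestamp, which is the one on the slide.

```python
from collections import Counter

WINDOW_MS = 60_000  # one minute, in milliseconds


def count_by_club_status(events, window_ms=WINDOW_MS):
    """Count events per (window start, club_status) using tumbling windows."""
    counts = Counter()
    for event in events:
        # Round the event time down to the start of its window.
        window_start = event["rating_time"] - event["rating_time"] % window_ms
        counts[(window_start, event["club_status"])] += 1
    return counts


events = [
    {"rating_time": 1519304105213, "club_status": "platinum"},
    {"rating_time": 1519304110000, "club_status": "platinum"},
    {"rating_time": 1519304159999, "club_status": "bronze"},
    {"rating_time": 1519304160000, "club_status": "platinum"},  # first ms of the next window
]

counts = count_by_club_status(events)
for (window_start, status), n in sorted(counts.items()):
    print(window_start, status, n)
```

The last event lands in a new window because its timestamp is exactly one window length after the start of the previous one, so it is counted separately from the earlier platinum ratings.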

  36. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 36
    Stream to Elasticsearch

  37. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 37
    https://cnfl.io/ksql-workshop
    9. Joining Data in KSQL
    10. Daisy-chaining derived streams
    11. Streaming Aggregates

  38. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 38
    Free Books!
    https://www.confluent.io/apache-kafka-stream-processing-book-bundle

  39. @rmoff
    [email protected]
    https://www.confluent.io/ksql
    http://cnfl.io/slack

  40. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 40
    Useful links
    • Embrace the Anarchy : Apache Kafka's Role in Modern Data Architectures (Recording & Slides)
    • Look Ma, no Code! Building Streaming Data Pipelines with Apache Kafka and KSQL
    • Steps to Building a Streaming ETL Pipeline with Apache Kafka and KSQL (Recording & Slides)
    • https://www.confluent.io/blog/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data
    • https://github.com/confluentinc/ksql/

  41. @rmoff / Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! http://cnfl.io/ksql 41
    Resources
    • CDC Spreadsheet
    • Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC
    • #partner-engineering on Slack for questions
    • BD team (#partners / [email protected]) can help with introductions on a given sales op
    #EOF
