LISA18: Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Robin Moffatt
October 31, 2018

Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again!

Apache Kafka is a distributed, scalable, and fault-tolerant streaming platform, providing low-latency pub-sub messaging coupled with native storage and stream processing capabilities. Integrating Kafka with RDBMS, NoSQL, and object stores is simple with the Kafka Connect API, which is part of Apache Kafka. KSQL is the open-source SQL streaming engine for Apache Kafka, and makes it possible to build stream processing applications at scale, written using a familiar SQL interface.

In this talk we’ll explain the architectural reasoning for Apache Kafka and the benefits of real-time integration, and we’ll build a streaming data pipeline using nothing but our bare hands, the Kafka Connect API, and KSQL.

Gasp as we filter events in real time! Be amazed at how we can enrich streams of data with data from RDBMS! Be astonished at the power of streaming aggregates for anomaly detection!

This will be a practical talk, after which attendees will have a clear idea of the power of stream processing, and how to get started with it using the open-source Apache Kafka and KSQL projects.

Transcript

  1. Apache Kafka and KSQL in Action :
    Let’s Build a Streaming Data Pipeline!
    @rmoff [email protected]
    confluent.io/ksql
    USENIX Large Installation System Administration Conference (LISA)
    October 31 2018

  2. $ whoami
    @rmoff / http://cnfl.io/ksql
    • Developer Advocate @ Confluent
    • Working in data & analytics since 2001
    • Oracle Developer Champion
    • Blogging: http://rmoff.net & http://cnfl.io/rmoff
    • Twitter: @rmoff
    • Geek stuff
    • Beer & Fried Breakfasts
    https://speakerdeck.com/rmoff/

  3. Kafka

  4. Kafka is a Streaming Platform
    [Diagram: KAFKA connected to a DWH, Hadoop, and multiple Apps; labels: request-response, messaging OR stream processing, streaming data pipelines, changelogs]

  5. Streaming is not just for realtime

  6. Streaming is for everyone

  7. All data is events

  8. A Dumb Pipeline
    HDFS / S3 / BigQuery etc
    Logs

  9. A Dumb Pipeline
    HDFS / S3 / BigQuery etc
    Logs
    Logs

  10. Stream Processing with Apache Kafka and KSQL
    Stream Processing
    Logs
    HDFS / S3 / BigQuery etc
    All logs Errors

  11. Real-time Event Stream Enrichment
    order events
    customer
    Stream Processing
    customer orders
    RDBMS
    CDC

  12. Transform Once, Use Many
    order events
    customer
    Stream Processing
    customer orders
    RDBMS
    New App
    CDC

  13. Transform Once, Use Many
    order events
    customer
    Stream Processing
    customer orders
    RDBMS
    HDFS / S3 / etc
    New App
    CDC

  14. Let’s Build It!
    Rating events
    Join events to users, and filter
    Push notification
    Operational Dashboard
    Data Lake
    User data

  15. Let’s Build It!
    Rating events
    Join events to users, and filter
    Push notification
    Operational Dashboard
    Data Lake
    User data
    RDBMS
    S3/HDFS/SnowflakeDB etc
    Elasticsearch
    App
    App
    Producer API
    Consumer API

  16. Kafka Connect
    Rating events
    Join events to users, and filter
    Push notification
    Operational Dashboard
    Data Lake
    User data
    RDBMS
    S3/HDFS/SnowflakeDB etc
    Elasticsearch
    App
    App
    Producer API
    Consumer API
    Kafka Connect
    Kafka Connect
    Kafka Connect

  17. An API of Apache Kafka, providing reliable and scalable integration of Kafka with other systems – no coding required.
    {
      "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
      "table.whitelist": "sales,orders,customers"
    }
    https://docs.confluent.io/current/connect/
    Kafka Connect
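
As a rough illustration of how a config fragment like the one on this slide is usually submitted: Kafka Connect's REST API takes a JSON body with a `name` and a `config` object, the same shape as the curl call shown later in the deck. The sketch below only builds the request and never sends it. The connector name `jdbc_source_demo` and the worker URL `http://localhost:8083` are assumptions for illustration, not values from the talk, and the slide's config is abbreviated, so a real connector would likely need further settings.

```python
import json
import urllib.request

# Connector settings copied from the slide (abbreviated there).
connector_config = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
    "table.whitelist": "sales,orders,customers",
}


def build_create_connector_request(name, config, connect_url="http://localhost:8083"):
    """Build (but do not send) a POST request that registers a connector.

    `name` and `connect_url` are hypothetical values chosen for illustration.
    """
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_connector_request("jdbc_source_demo", connector_config)
print(request.get_method(), request.full_url)
print(json.dumps(json.loads(request.data), indent=2))
```

Passing the built request to `urllib.request.urlopen` would submit it to a running Connect worker; that step is left out so the sketch runs without any infrastructure.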

  18. Streaming Integration with Kafka Connect
    Kafka Brokers
    Kafka Connect
    Tasks Workers
    Sources
    syslog
    flat file
    CSV
    JSON
    MQTT

  19. Streaming Integration with Kafka Connect
    Kafka Brokers
    Kafka Connect
    Tasks Workers
    Sinks
    Amazon S3
    MQTT

  20. Streaming Integration with Kafka Connect
    Kafka Brokers
    Kafka Connect
    Tasks Workers
    Sources Sinks
    Amazon S3
    MQTT
    syslog
    flat file
    CSV
    JSON
    MQTT

  21. Confluent Hub
    hub.confluent.io
    • One-stop place to discover and download:
      • Connectors
      • Transformations
      • Converters

  22. Kafka Connect + Schema Registry = WIN
    RDBMS
    Avro Message
    Elasticsearch
    Schema Registry
    Avro Schema
    Kafka Connect
    Kafka Connect

  23. Kafka Connect + Schema Registry = WIN
    RDBMS
    Elasticsearch
    Schema Registry
    Avro Schema
    Kafka Connect
    Kafka Connect
    Avro Message

  24. Kafka → Elasticsearch
    curl -X "POST" "http://kafka-connect-cp:18083/connectors/" \
         -H "Content-Type: application/json" \
         -d '{
      "name": "es_sink_lisa18",
      "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": false,
        "topics": "lisa18",
        "key.ignore": "true",
        "schema.ignore": "true",
        "type.name": "kafkaconnect",
        "connection.url": "http://elasticsearch:9200"
      }
    }'

  25. Demo Time!
    MySQL Debezium
    Kafka Connect
    Producer API

  26. Let’s Build It!
    Rating events
    Join events to users, and filter
    Push notification
    Operational Dashboard
    Data Lake
    User data
    RDBMS
    S3/HDFS/SnowflakeDB etc
    Elasticsearch
    App
    App
    Producer API
    Consumer API
    Kafka Connect
    Kafka Connect
    Kafka Connect

  27. KSQL
    Rating events
    Join events to users, and filter
    Push notification
    Operational Dashboard
    Data Lake
    User data
    RDBMS
    S3/HDFS/SnowflakeDB etc
    Elasticsearch
    App
    App
    Producer API
    Consumer API
    Kafka Connect
    Kafka Connect
    Kafka Connect
    KSQL

  28. KSQL is a Declarative Stream Processing Language

  29. KSQL is the Streaming SQL Engine for Apache Kafka

  30. KSQL for Real-Time Monitoring
    • Log data monitoring, tracking and alerting
    • syslog data
    • Sensor / IoT data
    CREATE STREAM SYSLOG_INVALID_USERS AS
    SELECT HOST, MESSAGE
    FROM SYSLOG
    WHERE MESSAGE LIKE '%Invalid user%';
    http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting
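
To make the semantics of the filter concrete, here is a minimal plain-Python sketch of what the `SYSLOG_INVALID_USERS` statement does to each event. It is not KSQL and does not talk to Kafka; the sample events, including the `FACILITY` field, are hypothetical, and a real stream would be unbounded rather than a list.

```python
def filter_invalid_users(events):
    """Yield HOST and MESSAGE for events whose message contains 'Invalid user'.

    Mirrors: SELECT HOST, MESSAGE FROM SYSLOG WHERE MESSAGE LIKE '%Invalid user%'
    """
    for event in events:
        if "Invalid user" in event["MESSAGE"]:
            # Project only the two selected columns.
            yield {"HOST": event["HOST"], "MESSAGE": event["MESSAGE"]}


# Hypothetical sample events standing in for the SYSLOG stream.
syslog = [
    {"HOST": "web01", "MESSAGE": "Invalid user admin from 10.0.0.5", "FACILITY": 4},
    {"HOST": "web01", "MESSAGE": "Accepted publickey for deploy", "FACILITY": 4},
    {"HOST": "db02", "MESSAGE": "Invalid user test from 10.0.0.9", "FACILITY": 4},
]

syslog_invalid_users = list(filter_invalid_users(syslog))
for row in syslog_invalid_users:
    print(row)
```

Because the function is a generator, each event is handled as it arrives, which is the per-event behaviour a streaming filter has.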

  31. KSQL for Streaming ETL
    Joining, filtering, and aggregating streams of event data
    CREATE STREAM vip_actions AS
    SELECT userid, page, action
    FROM clickstream c
    LEFT JOIN users u
    ON c.userid = u.user_id
    WHERE u.level = 'Platinum';

  32. KSQL for Anomaly Detection
    CREATE TABLE possible_fraud AS
    SELECT card_number, count(*)
    FROM authorization_attempts
    WINDOW TUMBLING (SIZE 5 SECONDS)
    GROUP BY card_number
    HAVING count(*) > 3;
    Identifying patterns or anomalies in real-time data,
    surfaced in milliseconds
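
The windowed aggregate can be sketched in plain Python to show how a tumbling window assigns events to buckets. This is an illustration only: it computes the result over a finite list in one pass, whereas KSQL updates the table continuously as events arrive. The sample attempts and the `(timestamp_ms, card_number)` tuple layout are assumptions.

```python
from collections import Counter

WINDOW_SIZE_MS = 5_000  # WINDOW TUMBLING (SIZE 5 SECONDS)


def possible_fraud(attempts, threshold=3):
    """Count attempts per (window start, card) and keep counts above threshold.

    `attempts` is an iterable of (timestamp_ms, card_number) pairs.
    """
    counts = Counter()
    for timestamp_ms, card_number in attempts:
        # Tumbling windows are fixed-size and non-overlapping, so each event
        # belongs to exactly one window, identified by its start time.
        window_start = timestamp_ms - (timestamp_ms % WINDOW_SIZE_MS)
        counts[(window_start, card_number)] += 1
    # HAVING count(*) > 3
    return {key: n for key, n in counts.items() if n > threshold}


# Hypothetical authorization attempts: (epoch millis, card number).
authorization_attempts = [
    (1_000, "4111"), (1_500, "4111"), (2_000, "4111"), (4_900, "4111"),
    (5_100, "4111"),                   # falls in the next window
    (1_200, "5500"), (3_000, "5500"),  # only two attempts: not flagged
]

flagged = possible_fraud(authorization_attempts)
print(flagged)
```

Only card `4111` in the window starting at 0 ms is flagged: it has four attempts there, while its fifth attempt lands in the following window.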

  33. KSQL for Data Transformation
    Make simple derivations of existing topics from the command line
    CREATE STREAM pageviews
    WITH (PARTITIONS=4,
    VALUE_FORMAT='AVRO') AS
    SELECT * FROM pageviews_json;

  34. KSQL in Development and Production
    Interactive KSQL for development and testing
    “Hmm, let me try out this idea...”
    REST
    Headless KSQL for Production
    Desired KSQL queries have been identified

  35. Demo Time!
    MySQL Debezium
    Kafka Connect
    Producer API
    Elasticsearch
    Kafka Connect

  36. Producer API
    {
    "rating_id": 5313,
    "user_id": 3,
    "stars": 4,
    "route_id": 6975,
    "rating_time": 1519304105213,
    "channel": "web",
    "message": "worst. flight. ever. #neveragain"
    }
    POOR_RATINGS
    Filter all ratings where STARS<3
    CREATE STREAM POOR_RATINGS AS
    SELECT * FROM ratings WHERE STARS < 3;

  37. Do you think that’s a table you are querying?

  38. The Stream Table Duality
    Stream (events over time)       Table (state after each event)
    Account ID  Amount              Account ID  Balance
    12345       + €50               12345       €50
    12345       + €25               12345       €75
    12345       - €60               12345       €15
    Read more: https://cnfl.io/stream-table-duality
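
The duality can be worked through with the slide's own numbers: replaying the stream of amounts and keeping a running total per account gives the table's state after each event. The sketch below is a plain-Python illustration under that reading, not Kafka or KSQL code, and the function name is made up for the example.

```python
def apply_stream(events):
    """Fold a stream of (account_id, amount) events into a balance table.

    Yields a snapshot of the table after each event, so the table's
    history can be read alongside the stream.
    """
    table = {}
    for account_id, amount in events:
        table[account_id] = table.get(account_id, 0) + amount
        yield dict(table)  # copy, so earlier snapshots are not mutated


# The three events from the slide, in euros.
stream = [("12345", 50), ("12345", 25), ("12345", -60)]

snapshots = list(apply_stream(stream))
for event, snapshot in zip(stream, snapshots):
    print(event, "->", snapshot)
```

The balances come out as 50, 75, and 15, matching the slide. The last snapshot is the table as it stands now, and it can always be rebuilt by replaying the stream from the start.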

  39. The truth is the log. The database is a cache of a subset of the log.
    —Pat Helland
    Immutability Changes Everything
    http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
    Photo by Bobby Burch on Unsplash

  40. Kafka Connect
    Producer API
    {
    "rating_id": 5313,
    "user_id": 3,
    "stars": 4,
    "route_id": 6975,
    "rating_time": 1519304105213,
    "channel": "web",
    "message": "worst. flight. ever. #neveragain"
    }
    {
    "id": 3,
    "first_name": "Merilyn",
    "last_name": "Doughartie",
    "email": "[email protected]",
    "gender": "Female",
    "club_status": "platinum",
    "comments": "none"
    }
    RATINGS_WITH_CUSTOMER_DATA
    Join each rating to customer data
    CREATE STREAM RATINGS_WITH_CUSTOMER_DATA AS
    SELECT * FROM RATINGS R LEFT JOIN CUSTOMERS C
    ON R.USER_ID = C.ID;
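
A stream-table join of this kind can be sketched in plain Python: each rating event is looked up against the latest customer row by key, matching the rating's `user_id` to the customer's `id` as the two sample records suggest. This is an illustration of the join semantics only, not KSQL. The second rating (user 99) is hypothetical and is there to show what a left join does when no customer matches.

```python
def left_join(ratings, customers_by_id):
    """Enrich each rating with the customer row whose id equals user_id.

    A left join keeps ratings with no matching customer; their customer
    fields are filled with None.
    """
    customer_fields = ("first_name", "last_name", "club_status")
    for rating in ratings:
        customer = customers_by_id.get(rating["user_id"], {})
        enriched = dict(rating)
        for field in customer_fields:
            enriched[field] = customer.get(field)
        yield enriched


# Table side: latest customer row per key (abbreviated from the slide).
customers = {
    3: {"id": 3, "first_name": "Merilyn", "last_name": "Doughartie",
        "club_status": "platinum"},
}

# Stream side: the rating from the slide (abbreviated) plus a hypothetical
# rating from user 99, who has no customer row.
ratings = [
    {"rating_id": 5313, "user_id": 3, "stars": 4, "channel": "web"},
    {"rating_id": 5314, "user_id": 99, "stars": 1, "channel": "ios"},
]

ratings_with_customer_data = list(left_join(ratings, customers))
for row in ratings_with_customer_data:
    print(row)
```

Keeping the customer side as a keyed lookup reflects the table half of the join: only the most recent row per customer matters when a rating arrives.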

  41. Kafka Connect
    Producer API
    {
    "rating_id": 5313,
    "user_id": 3,
    "stars": 4,
    "route_id": 6975,
    "rating_time": 1519304105213,
    "channel": "web",
    "message": "worst. flight. ever. #neveragain"
    }
    {
    "id": 3,
    "first_name": "Merilyn",
    "last_name": "Doughartie",
    "email": "[email protected]",
    "gender": "Female",
    "club_status": "platinum",
    "comments": "none"
    }
    RATINGS_WITH_CUSTOMER_DATA
    Join each rating to customer data
    UNHAPPY_PLATINUM_CUSTOMERS
    Filter for just PLATINUM customers
    CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS
    SELECT * FROM RATINGS_WITH_CUSTOMER_DATA
    WHERE STARS < 3 AND CLUB_STATUS = 'platinum';

  42. Confluent Open Source: Apache Kafka with a bunch of cool stuff! For free!
    Database Changes, Log Events, IoT Data, Web Events, …
    CRM
    Data Warehouse
    Database
    Hadoop
    Data Integration
    Monitoring
    Analytics
    Custom Apps
    Transformations
    Real-time Applications
    Confluent Platform
    Apache Kafka®: Core | Connect API | Streams API
    Data Compatibility: Schema Registry
    Monitoring & Administration: Confluent Control Center | Security
    Operations: Replicator | Auto Data Balancing
    Development and Connectivity: Clients | Connectors | REST Proxy | CLI
    SQL Stream Processing: KSQL
    Customer self-managed: Datacenter, Public Cloud
    Confluent fully-managed: Confluent Cloud

  43. If you remember one thing… (or three)
    • Kafka Connect
      • Integration between Kafka and other data stores
    • Kafka
      • Provides stream processing natively
    • KSQL
      • Build stream processing apps with just SQL

  44. Free Books!
    https://www.confluent.io/apache-kafka-stream-processing-book-bundle

  45. Try it out!
    https://cnfl.io/kafka-ksql-elastic

  46. @rmoff
    [email protected]
    https://www.confluent.io/ksql
    http://cnfl.io/slack

  47. Useful links
    • Embrace the Anarchy: Apache Kafka's Role in Modern Data Architectures (Recording & Slides)
    • Look Ma, no Code! Building Streaming Data Pipelines with Apache Kafka and KSQL
    • Steps to Building a Streaming ETL Pipeline with Apache Kafka and KSQL (Recording & Slides)
    • https://www.confluent.io/blog/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data
    • https://github.com/confluentinc/ksql/

  48. Resources
    • CDC Spreadsheet
    • Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC
    • #partner-engineering on Slack for questions
    • BD team (#partners / [email protected]) can help with introductions on a given sales op
    #EOF
