$30 off During Our Annual Pro Sale. View Details »

No More Silos: Integrating Databases and Apache Kafka

Robin Moffatt
December 05, 2018

No More Silos: Integrating Databases and Apache Kafka

Companies new and old are all recognising the importance of a low-latency, scalable, fault-tolerant data backbone, in the form of the Apache Kafka® streaming platform. With Kafka, developers can integrate multiple sources and systems, which enables low latency analytics, event driven architectures and the population of multiple downstream systems.

In this talk we’ll look at one of the most common integration requirements - connecting databases to Kafka. We’ll consider the concept that all data is a stream of events, including that residing within a database. We’ll look at why we’d want to stream data from a database, including driving applications in Kafka from events upstream. We’ll discuss the different methods for connecting databases to Kafka, and the pros and cons of each. Techniques including Change-Data-Capture (CDC) and Kafka Connect will be covered, as well as an exploration of the power of KSQL for performing transformations such as joins on the inbound data.

Attendees of this talk will learn:

- That all data is event streams; databases are just a materialised view of a stream of events.

- The best ways to integrate databases with Kafka.

- Anti-patterns of which to be aware.

- The power of KSQL for transforming streams of data in Kafka.

Robin Moffatt

December 05, 2018
Tweet

More Decks by Robin Moffatt

Other Decks in Technology

Transcript

  1. Integrating
    Databases
    and Apache
    Kafka
    #ukoug_tech18
    @rmoff
    Robin Moffatt, Developer Advocate @ Confluent

    View Slide

  2. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Photo by Emily Morter on Unsplash

    View Slide

  3. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Kafka is a Streaming Platform
    KAFKA
    DWH Hadoop
    App
    App App App App
    App
    App
    App
    request-response
    messaging
    OR
    stream
    processing
    streaming data pipelines
    changelogs

    View Slide

  4. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Analytics - Database Offload
    HDFS / S3 /
    BigQuery etc
    RDBMS

    View Slide

  5. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Real-time Event Stream Enrichment
    order events
    customer
    Stream
    Processing
    customer orders
    RDBMS

    CDC

    View Slide

  6. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Evolve processing from old systems to new
    Stream
    Processing
    RDBMS
    Existing
    App
    New App

    View Slide


  7. @rmoff / No More Silos: Integrating Databases and Apache Kafka
    But streaming…I've just got
    data in a database…right?

    View Slide


  8. @rmoff / No More Silos: Integrating Databases and Apache Kafka
    Bold claim: all your
    data is event streams

    View Slide

  9. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    A Customer
    Experience

    View Slide

  10. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    A Sale

    View Slide

  11. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    A Sensor
    Reading

    View Slide

  12. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    An Application
    Log Entry

    View Slide

  13. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Databases

    View Slide

  14. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Do you think that’s a table
    you are querying?

    View Slide

  15. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    The Stream Table Duality
    Account ID Balance
    12345 €50

    View Slide

  16. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    The Stream Table Duality
    Account ID Balance
    12345 €50
    Account ID Amount
    12345 + €50
    Time

    View Slide

  17. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    The Stream Table Duality
    Account ID Amount
    12345 + €50
    12345 + €25
    Account ID Balance
    12345 €75
    Time

    View Slide

  18. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    The Stream Table Duality
    Account ID Amount
    12345 + €50
    12345 + €25
    12345 -€60
    Account ID Balance
    12345 €15
    Time

    View Slide

  19. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    The Stream Table Duality
    Account ID Amount
    12345 + €50
    12345 + €25
    12345 -€60
    Account ID Balance
    12345 €15
    Time
    Stream Table

    View Slide

  20. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    The truth is the log.
    The database is a cache
    of a subset of the log.
    —Pat Helland
    Immutability Changes Everything
    http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
    Photo by Bobby Burch on Unsplash

    View Slide

  21. @rmoff #ukoug_tech18
    No More Silos: Integrating Databases and Apache Kafka
    KSQL
    is the
    Streaming
    SQL Engine
    for
    Apache Kafka

    View Slide

  22. @rmoff #ukoug_tech18
    No More Silos: Integrating Databases and Apache Kafka
    KSQL for Real-Time Monitoring
    • Log data monitoring, tracking and alerting
    • syslog data
    • Sensor / IoT data
    CREATE STREAM SYSLOG_INVALID_USERS AS
    SELECT HOST, MESSAGE
    FROM SYSLOG
    WHERE MESSAGE LIKE '%Invalid user%';
    http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting

    View Slide

  23. @rmoff #ukoug_tech18
    No More Silos: Integrating Databases and Apache Kafka
    KSQL for Streaming ETL
    CREATE STREAM vip_actions AS 

    SELECT userid, page, action
    FROM clickstream c
    LEFT JOIN users u
    ON c.userid = u.user_id 

    WHERE u.level = 'Platinum';
    Joining, filtering, and aggregating streams of event data

    View Slide

  24. @rmoff #ukoug_tech18
    No More Silos: Integrating Databases and Apache Kafka
    Oracle Debezium
    Kafka Connect
    Producer API
    Elasticsearch
    Kafka Connect
    Streaming ETL with
    Apache Kafka and KSQL

    View Slide

  25. @rmoff #ukoug_tech18
    No More Silos: Integrating Databases and Apache Kafka
    KSQL for Anomaly Detection
    CREATE TABLE possible_fraud AS

    SELECT card_number, count(*)

    FROM authorization_attempts 

    WINDOW TUMBLING (SIZE 5 SECONDS)

    GROUP BY card_number

    HAVING count(*) > 3;
    Identifying patterns or anomalies in real-time data,
    surfaced in milliseconds

    View Slide

  26. @rmoff #ukoug_tech18
    No More Silos: Integrating Databases and Apache Kafka
    Photo by Vadim Sherbakov on Unsplash

    View Slide

  27. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Streaming Integration with Kafka Connect
    Kafka Brokers
    Kafka Connect
    Tasks Workers
    Sources
    syslog
    flat file
    CSV
    JSON
    MQTT

    View Slide

  28. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Streaming Integration with Kafka Connect
    Kafka Brokers
    Kafka Connect
    Tasks Workers
    Sinks
    Amazon S3
    MQTT

    View Slide

  29. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Streaming Integration with Kafka Connect
    Kafka Brokers
    Kafka Connect
    Tasks Workers
    Sources Sinks
    Amazon S3
    MQTT
    syslog
    flat file
    CSV
    JSON
    MQTT

    View Slide

  30. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Kafka Connect basics
    Kafka
    Kafka Connect
    Source

    View Slide

  31. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Connectors
    Kafka
    Kafka Connect
    Source
    Connector

    View Slide

  32. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Converters
    Kafka
    Kafka Connect
    Source
    Connector Converter
    https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained

    View Slide

  33. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Single Message Transforms
    Kafka
    Kafka Connect
    Source
    Connector Converter
    Transform(s)

    View Slide

  34. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Extensible
    Kafka
    Kafka Connect
    Source
    Connector Converter
    Transform(s)
    https://docs.confluent.io/current/connect/javadocs/
    Browse & download at hub.confluent.io

    View Slide

  35. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Kafka Connect + Schema Registry = WIN
    RDBMS
    Avro
    Message
    Elasticsearch
    Schema
    Registry
    Avro
    Schema
    Kafka
    Connect
    Kafka
    Connect

    View Slide

  36. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Kafka Connect + Schema Registry = WIN
    RDBMS
    Elasticsearch
    Schema
    Registry
    Avro
    Schema
    Kafka
    Connect
    Kafka
    Connect
    Avro
    Message

    View Slide

  37. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    • CDC is a generic term referring to
    capturing changing data typically
    from a RDBMS.
    • Two general approaches:
    • Query-based CDC
    • Log-based CDC
    Change-Data-Capture (CDC)
    There are other options including hacks with
    Triggers, Flashback etc but these are system and/or
    technology-specific.

    View Slide

  38. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    • Use a database query to try and identify new & changed rows



    • Implemented with the open source Kafka Connect JDBC connector
    • Can import based on table names, schema, or bespoke SQL query
    •Incremental ingest driven through incrementing ID column and/or
    timestamp column
    Query-based CDC
    SELECT * FROM my_table 

    WHERE col >

    View Slide

  39. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    • Use the database's
    transaction log to identify
    every single change event
    • Various CDC tools available
    that integrate with Apache
    Kafka (more of this later…)
    Log-based CDC

    View Slide

  40. @rmoff #ukoug_tech18
    No More Silos: Integrating Databases and Apache Kafka
    Demo Time!

    View Slide

  41. @rmoff #ukoug_tech18
    No More Silos: Integrating Databases and Apache Kafka
    "Which one
    should I use?"
    Photo by Tyler Nix on Unsplash

    View Slide

  42. @rmoff #ukoug_tech18
    No More Silos: Integrating Databases and Apache Kafka
    It Depends!
    Photo by Trevor Cole on Unsplash

    View Slide

  43. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Query-based vs Log-based CDC
    Photo by Matese Fields on Unsplash
    • Query-based
    +Usually easier to setup, and requires
    fewer permissions
    - Needs specific columns in source
    schema
    - Impact of polling the DB (or higher
    latencies tradeoff)
    - Can't track deletes, or multiple events
    between polling interval
    Read more: http://cnfl.io/kafka-cdc

    View Slide

  44. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Query-based vs Log-based CDC
    Photo by Sebastian Pociecha on Unsplash
    Read more: http://cnfl.io/kafka-cdc
    • Log-based
    +Greater data fidelity
    +Lower latency
    +Lower impact on source
    - More setup steps
    - Higher system privileges required
    - For propriatory databases, usually $$$

    View Slide

  45. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Considerations for Integration into Apache Kafka
    Photo by Matthew Smith on Unsplash
    • Chucking data over the fence into a Kafka topic is
    not enough
    • CDC tools should integrate with standard ways of
    building data pipelines in Kafka
    • Schema handling
    • Serialisation formats

    View Slide

  46. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    •Oracle GoldenGate for Big
    Data—Requires the OGGBD
    licence, not just OGG
    •Debezium—Open
    source,Oracle support in
    Beta
    • currently uses XStream—
    which requires OGG
    licence
    •Attunity, IBM IIDR, HVR,
    SQData, StreamSets—all
    offer commerical CDC
    integration into Kafka with
    support for Schema Registry
    •DBVisit Replicate—no
    longer under development
    •JDBC Connector—Open
    source, but not "true" CDC
    Oracle and Kafka integration

    View Slide

  47. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Which Log-Based CDC Tool?
    All these options integrate with Apache Kafka and Confluent
    Platform, including support for the Schema Registry
    ⓘ For query-based CDC, use the
    Confluent Kafka Connect JDBC
    connector
    • Open Source RDBMS, 

    e.g. MySQL, PostgreSQL
    • Debezium
    • (+ paid options)
    • Mainframe

    e.g. VSAM, IMS
    • Attunity
    • SQData
    • Proprietory RDBMS, 

    e.g. Oracle, MS SQL
    • Oracle GoldenGate
    • Debezium + XStream
    • Attunity
    • IBM InfoSphere Data Replication
    (IIDR)
    • SQData
    • HVR

    View Slide

  48. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Confluent Open Source :
    Apache Kafka with a bunch of cool stuff! For free!
    Database Changes Log Events loT Data Web Events …
    CRM
    Data Warehouse
    Database
    Hadoop
    Data

    Integration

    Monitoring
    Analytics
    Custom Apps
    Transformations
    Real-time Applications

    Confluent Platform
    Confluent Platform
    Apache Kafka®
    Core | Connect API | Streams API
    Data Compatibility
    Schema Registry
    Monitoring & Administration
    Confluent Control Center | Security
    Operations
    Replicator | Auto Data Balancing
    Development and Connectivity
    Clients | Connectors | REST Proxy | CLI
    SQL Stream Processing
    KSQL
    Datacenter Public Cloud Confluent Cloud
    CONFLUENT FULLY-MANAGED
    CUSTOMER SELF-MANAGED

    View Slide

  49. No More Silos: Integrating Databases and Apache Kafka
    @rmoff #ukoug_tech18
    Free Books!
    https://www.confluent.io/apache-kafka-stream-processing-book-bundle

    View Slide

  50. @rmoff #ukoug_tech18
    No More Silos: Integrating Databases and Apache Kafka
    @rmoff [email protected]
    http://cnfl.io/slack
    https://www.confluent.io/download/
    http://cnfl.io/kafka-cdc

    View Slide

  51. @rmoff #ukoug_tech18
    No More Silos: Integrating Databases and Apache Kafka
    #EOF

    View Slide