No More Silos: Integrating Databases and Apache Kafka

2bded62396ea66c84bd10e91c718dea9?s=47 Robin Moffatt
December 05, 2018

No More Silos: Integrating Databases and Apache Kafka

Companies new and old are all recognising the importance of a low-latency, scalable, fault-tolerant data backbone, in the form of the Apache Kafka® streaming platform. With Kafka, developers can integrate multiple sources and systems, which enables low latency analytics, event driven architectures and the population of multiple downstream systems.

In this talk we’ll look at one of the most common integration requirements - connecting databases to Kafka. We’ll consider the concept that all data is a stream of events, including that residing within a database. We’ll look at why we’d want to stream data from a database, including driving applications in Kafka from events upstream. We’ll discuss the different methods for connecting databases to Kafka, and the pros and cons of each. Techniques including Change-Data-Capture (CDC) and Kafka Connect will be covered, as well as an exploration of the power of KSQL for performing transformations such as joins on the inbound data.

Attendees of this talk will learn:

- That all data is event streams; databases are just a materialised view of a stream of events.

- The best ways to integrate databases with Kafka.

- Anti-patterns of which to be aware.

- The power of KSQL for transforming streams of data in Kafka.

2bded62396ea66c84bd10e91c718dea9?s=128

Robin Moffatt

December 05, 2018
Tweet

Transcript

  1. Integrating Databases and Apache Kafka #ukoug_tech18 @rmoff Robin Moffatt, Developer

    Advocate @ Confluent
  2. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Photo by Emily Morter on Unsplash
  3. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Kafka is a Streaming Platform KAFKA DWH Hadoop App App App App App App App App request-response messaging OR stream processing streaming data pipelines changelogs
  4. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Analytics - Database Offload HDFS / S3 / BigQuery etc RDBMS
  5. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Real-time Event Stream Enrichment order events customer Stream Processing customer orders RDBMS <y> CDC
  6. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Evolve processing from old systems to new Stream Processing RDBMS Existing App New App <x>
  7. “ @rmoff / No More Silos: Integrating Databases and Apache

    Kafka But streaming…I've just got data in a database…right?
  8. “ @rmoff / No More Silos: Integrating Databases and Apache

    Kafka Bold claim: all your data is event streams
  9. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    A Customer Experience
  10. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    A Sale
  11. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    A Sensor Reading
  12. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    An Application Log Entry
  13. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Databases
  14. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Do you think that’s a table you are querying?
  15. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    The Stream Table Duality Account ID Balance 12345 €50
  16. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    The Stream Table Duality Account ID Balance 12345 €50 Account ID Amount 12345 + €50 Time
  17. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    The Stream Table Duality Account ID Amount 12345 + €50 12345 + €25 Account ID Balance 12345 €75 Time
  18. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    The Stream Table Duality Account ID Amount 12345 + €50 12345 + €25 12345 -€60 Account ID Balance 12345 €15 Time
  19. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    The Stream Table Duality Account ID Amount 12345 + €50 12345 + €25 12345 -€60 Account ID Balance 12345 €15 Time Stream Table
  20. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by Bobby Burch on Unsplash
  21. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    KSQL is the Streaming SQL Engine for Apache Kafka
  22. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    KSQL for Real-Time Monitoring • Log data monitoring, tracking and alerting • syslog data • Sensor / IoT data CREATE STREAM SYSLOG_INVALID_USERS AS SELECT HOST, MESSAGE FROM SYSLOG WHERE MESSAGE LIKE '%Invalid user%'; http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting
  23. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    KSQL for Streaming ETL CREATE STREAM vip_actions AS 
 SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id 
 WHERE u.level = 'Platinum'; Joining, filtering, and aggregating streams of event data
  24. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    Oracle Debezium Kafka Connect Producer API Elasticsearch Kafka Connect Streaming ETL with Apache Kafka and KSQL
  25. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    KSQL for Anomaly Detection CREATE TABLE possible_fraud AS
 SELECT card_number, count(*)
 FROM authorization_attempts 
 WINDOW TUMBLING (SIZE 5 SECONDS)
 GROUP BY card_number
 HAVING count(*) > 3; Identifying patterns or anomalies in real-time data, surfaced in milliseconds
  26. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    Photo by Vadim Sherbakov on Unsplash
  27. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Streaming Integration with Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sources syslog flat file CSV JSON MQTT
  28. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Streaming Integration with Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sinks Amazon S3 MQTT
  29. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Streaming Integration with Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sources Sinks Amazon S3 MQTT syslog flat file CSV JSON MQTT
  30. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Kafka Connect basics Kafka Kafka Connect Source
  31. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Connectors Kafka Kafka Connect Source Connector
  32. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Converters Kafka Kafka Connect Source Connector Converter https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
  33. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Single Message Transforms Kafka Kafka Connect Source Connector Converter Transform(s)
  34. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Extensible Kafka Kafka Connect Source Connector Converter Transform(s) https://docs.confluent.io/current/connect/javadocs/ Browse & download at hub.confluent.io
  35. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Kafka Connect + Schema Registry = WIN RDBMS Avro Message Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect
  36. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Kafka Connect + Schema Registry = WIN RDBMS Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect Avro Message
  37. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    • CDC is a generic term referring to capturing changing data typically from a RDBMS. • Two general approaches: • Query-based CDC • Log-based CDC Change-Data-Capture (CDC) There are other options including hacks with Triggers, Flashback etc but these are system and/or technology-specific.
  38. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    • Use a database query to try and identify new & changed rows
 
 
 • Implemented with the open source Kafka Connect JDBC connector • Can import based on table names, schema, or bespoke SQL query •Incremental ingest driven through incrementing ID column and/or timestamp column Query-based CDC SELECT * FROM my_table 
 WHERE col > <value of col last time we polled>
  39. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    • Use the database's transaction log to identify every single change event • Various CDC tools available that integrate with Apache Kafka (more of this later…) Log-based CDC
  40. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    Demo Time!
  41. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    "Which one should I use?" Photo by Tyler Nix on Unsplash
  42. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    It Depends! Photo by Trevor Cole on Unsplash
  43. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Query-based vs Log-based CDC Photo by Matese Fields on Unsplash • Query-based +Usually easier to setup, and requires fewer permissions - Needs specific columns in source schema - Impact of polling the DB (or higher latencies tradeoff) - Can't track deletes, or multiple events between polling interval Read more: http://cnfl.io/kafka-cdc
  44. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Query-based vs Log-based CDC Photo by Sebastian Pociecha on Unsplash Read more: http://cnfl.io/kafka-cdc • Log-based +Greater data fidelity +Lower latency +Lower impact on source - More setup steps - Higher system privileges required - For propriatory databases, usually $$$
  45. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Considerations for Integration into Apache Kafka Photo by Matthew Smith on Unsplash • Chucking data over the fence into a Kafka topic is not enough • CDC tools should integrate with standard ways of building data pipelines in Kafka • Schema handling • Serialisation formats
  46. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    •Oracle GoldenGate for Big Data—Requires the OGGBD licence, not just OGG •Debezium—Open source,Oracle support in Beta • currently uses XStream— which requires OGG licence •Attunity, IBM IIDR, HVR, SQData, StreamSets—all offer commerical CDC integration into Kafka with support for Schema Registry •DBVisit Replicate—no longer under development •JDBC Connector—Open source, but not "true" CDC Oracle and Kafka integration
  47. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Which Log-Based CDC Tool? All these options integrate with Apache Kafka and Confluent Platform, including support for the Schema Registry ⓘ For query-based CDC, use the Confluent Kafka Connect JDBC connector • Open Source RDBMS, 
 e.g. MySQL, PostgreSQL • Debezium • (+ paid options) • Mainframe
 e.g. VSAM, IMS • Attunity • SQData • Proprietory RDBMS, 
 e.g. Oracle, MS SQL • Oracle GoldenGate • Debezium + XStream • Attunity • IBM InfoSphere Data Replication (IIDR) • SQData • HVR
  48. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Confluent Open Source : Apache Kafka with a bunch of cool stuff! For free! Database Changes Log Events loT Data Web Events … CRM Data Warehouse Database Hadoop Data
 Integration … Monitoring Analytics Custom Apps Transformations Real-time Applications … Confluent Platform Confluent Platform Apache Kafka® Core | Connect API | Streams API Data Compatibility Schema Registry Monitoring & Administration Confluent Control Center | Security Operations Replicator | Auto Data Balancing Development and Connectivity Clients | Connectors | REST Proxy | CLI SQL Stream Processing KSQL Datacenter Public Cloud Confluent Cloud CONFLUENT FULLY-MANAGED CUSTOMER SELF-MANAGED
  49. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Free Books! https://www.confluent.io/apache-kafka-stream-processing-book-bundle
  50. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    @rmoff robin@confluent.io http://cnfl.io/slack https://www.confluent.io/download/ http://cnfl.io/kafka-cdc
  51. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    #EOF