Upgrade to Pro — share decks privately, control downloads, hide ads and more …

No More Silos: Integrating Databases and Apache Kafka

Robin Moffatt
December 05, 2018

No More Silos: Integrating Databases and Apache Kafka

Companies new and old are all recognising the importance of a low-latency, scalable, fault-tolerant data backbone, in the form of the Apache Kafka® streaming platform. With Kafka, developers can integrate multiple sources and systems, which enables low latency analytics, event driven architectures and the population of multiple downstream systems.

In this talk we’ll look at one of the most common integration requirements - connecting databases to Kafka. We’ll consider the concept that all data is a stream of events, including that residing within a database. We’ll look at why we’d want to stream data from a database, including driving applications in Kafka from events upstream. We’ll discuss the different methods for connecting databases to Kafka, and the pros and cons of each. Techniques including Change-Data-Capture (CDC) and Kafka Connect will be covered, as well as an exploration of the power of KSQL for performing transformations such as joins on the inbound data.

Attendees of this talk will learn:

- That all data is event streams; databases are just a materialised view of a stream of events.

- The best ways to integrate databases with Kafka.

- Anti-patterns of which to be aware.

- The power of KSQL for transforming streams of data in Kafka.

Robin Moffatt

December 05, 2018
Tweet

More Decks by Robin Moffatt

Other Decks in Technology

Transcript

  1. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Kafka is a Streaming Platform KAFKA DWH Hadoop App App App App App App App App request-response messaging OR stream processing streaming data pipelines changelogs
  2. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Analytics - Database Offload HDFS / S3 / BigQuery etc RDBMS
  3. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Real-time Event Stream Enrichment order events customer Stream Processing customer orders RDBMS <y> CDC
  4. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Evolve processing from old systems to new Stream Processing RDBMS Existing App New App <x>
  5. “ @rmoff / No More Silos: Integrating Databases and Apache

    Kafka But streaming…I've just got data in a database…right?
  6. “ @rmoff / No More Silos: Integrating Databases and Apache

    Kafka Bold claim: all your data is event streams
  7. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Do you think that’s a table you are querying?
  8. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    The Stream Table Duality Account ID Balance 12345 €50
  9. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    The Stream Table Duality Account ID Balance 12345 €50 Account ID Amount 12345 + €50 Time
  10. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    The Stream Table Duality Account ID Amount 12345 + €50 12345 + €25 Account ID Balance 12345 €75 Time
  11. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    The Stream Table Duality Account ID Amount 12345 + €50 12345 + €25 12345 -€60 Account ID Balance 12345 €15 Time
  12. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    The Stream Table Duality Account ID Amount 12345 + €50 12345 + €25 12345 -€60 Account ID Balance 12345 €15 Time Stream Table
  13. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by Bobby Burch on Unsplash
  14. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    KSQL is the Streaming SQL Engine for Apache Kafka
  15. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    KSQL for Real-Time Monitoring • Log data monitoring, tracking and alerting • syslog data • Sensor / IoT data CREATE STREAM SYSLOG_INVALID_USERS AS SELECT HOST, MESSAGE FROM SYSLOG WHERE MESSAGE LIKE '%Invalid user%'; http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting
  16. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    KSQL for Streaming ETL CREATE STREAM vip_actions AS 
 SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id 
 WHERE u.level = 'Platinum'; Joining, filtering, and aggregating streams of event data
  17. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    Oracle Debezium Kafka Connect Producer API Elasticsearch Kafka Connect Streaming ETL with Apache Kafka and KSQL
  18. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    KSQL for Anomaly Detection CREATE TABLE possible_fraud AS
 SELECT card_number, count(*)
 FROM authorization_attempts 
 WINDOW TUMBLING (SIZE 5 SECONDS)
 GROUP BY card_number
 HAVING count(*) > 3; Identifying patterns or anomalies in real-time data, surfaced in milliseconds
  19. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Streaming Integration with Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sources syslog flat file CSV JSON MQTT
  20. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Streaming Integration with Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sinks Amazon S3 MQTT
  21. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Streaming Integration with Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sources Sinks Amazon S3 MQTT syslog flat file CSV JSON MQTT
  22. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Kafka Connect basics Kafka Kafka Connect Source
  23. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Connectors Kafka Kafka Connect Source Connector
  24. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Converters Kafka Kafka Connect Source Connector Converter https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
  25. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Single Message Transforms Kafka Kafka Connect Source Connector Converter Transform(s)
  26. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Extensible Kafka Kafka Connect Source Connector Converter Transform(s) https://docs.confluent.io/current/connect/javadocs/ Browse & download at hub.confluent.io
  27. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Kafka Connect + Schema Registry = WIN RDBMS Avro Message Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect
  28. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Kafka Connect + Schema Registry = WIN RDBMS Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect Avro Message
  29. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    • CDC is a generic term referring to capturing changing data typically from a RDBMS. • Two general approaches: • Query-based CDC • Log-based CDC Change-Data-Capture (CDC) There are other options including hacks with Triggers, Flashback etc but these are system and/or technology-specific.
  30. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    • Use a database query to try and identify new & changed rows
 
 
 • Implemented with the open source Kafka Connect JDBC connector • Can import based on table names, schema, or bespoke SQL query •Incremental ingest driven through incrementing ID column and/or timestamp column Query-based CDC SELECT * FROM my_table 
 WHERE col > <value of col last time we polled>
  31. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    • Use the database's transaction log to identify every single change event • Various CDC tools available that integrate with Apache Kafka (more of this later…) Log-based CDC
  32. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    "Which one should I use?" Photo by Tyler Nix on Unsplash
  33. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Query-based vs Log-based CDC Photo by Matese Fields on Unsplash • Query-based +Usually easier to setup, and requires fewer permissions - Needs specific columns in source schema - Impact of polling the DB (or higher latencies tradeoff) - Can't track deletes, or multiple events between polling interval Read more: http://cnfl.io/kafka-cdc
  34. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Query-based vs Log-based CDC Photo by Sebastian Pociecha on Unsplash Read more: http://cnfl.io/kafka-cdc • Log-based +Greater data fidelity +Lower latency +Lower impact on source - More setup steps - Higher system privileges required - For propriatory databases, usually $$$
  35. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Considerations for Integration into Apache Kafka Photo by Matthew Smith on Unsplash • Chucking data over the fence into a Kafka topic is not enough • CDC tools should integrate with standard ways of building data pipelines in Kafka • Schema handling • Serialisation formats
  36. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    •Oracle GoldenGate for Big Data—Requires the OGGBD licence, not just OGG •Debezium—Open source,Oracle support in Beta • currently uses XStream— which requires OGG licence •Attunity, IBM IIDR, HVR, SQData, StreamSets—all offer commerical CDC integration into Kafka with support for Schema Registry •DBVisit Replicate—no longer under development •JDBC Connector—Open source, but not "true" CDC Oracle and Kafka integration
  37. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Which Log-Based CDC Tool? All these options integrate with Apache Kafka and Confluent Platform, including support for the Schema Registry ⓘ For query-based CDC, use the Confluent Kafka Connect JDBC connector • Open Source RDBMS, 
 e.g. MySQL, PostgreSQL • Debezium • (+ paid options) • Mainframe
 e.g. VSAM, IMS • Attunity • SQData • Proprietory RDBMS, 
 e.g. Oracle, MS SQL • Oracle GoldenGate • Debezium + XStream • Attunity • IBM InfoSphere Data Replication (IIDR) • SQData • HVR
  38. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Confluent Open Source : Apache Kafka with a bunch of cool stuff! For free! Database Changes Log Events loT Data Web Events … CRM Data Warehouse Database Hadoop Data
 Integration … Monitoring Analytics Custom Apps Transformations Real-time Applications … Confluent Platform Confluent Platform Apache Kafka® Core | Connect API | Streams API Data Compatibility Schema Registry Monitoring & Administration Confluent Control Center | Security Operations Replicator | Auto Data Balancing Development and Connectivity Clients | Connectors | REST Proxy | CLI SQL Stream Processing KSQL Datacenter Public Cloud Confluent Cloud CONFLUENT FULLY-MANAGED CUSTOMER SELF-MANAGED
  39. No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18

    Free Books! https://www.confluent.io/apache-kafka-stream-processing-book-bundle
  40. @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

    @rmoff [email protected] http://cnfl.io/slack https://www.confluent.io/download/ http://cnfl.io/kafka-cdc