Slide 1

Slide 1 text

Integrating Databases and Apache Kafka #ukoug_tech18 @rmoff Robin Moffatt, Developer Advocate @ Confluent

Slide 2

Slide 2 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Photo by Emily Morter on Unsplash

Slide 3

Slide 3 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Kafka is a Streaming Platform KAFKA DWH Hadoop App App App App App App App App request-response messaging OR stream processing streaming data pipelines changelogs

Slide 4

Slide 4 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Analytics - Database Offload HDFS / S3 / BigQuery etc RDBMS

Slide 5

Slide 5 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Real-time Event Stream Enrichment order events customer Stream Processing customer orders RDBMS CDC

Slide 6

Slide 6 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Evolve processing from old systems to new Stream Processing RDBMS Existing App New App

Slide 7

Slide 7 text

“ @rmoff / No More Silos: Integrating Databases and Apache Kafka But streaming…I've just got data in a database…right?

Slide 8

Slide 8 text

“ @rmoff / No More Silos: Integrating Databases and Apache Kafka Bold claim: all your data is event streams

Slide 9

Slide 9 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 A Customer Experience

Slide 10

Slide 10 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 A Sale

Slide 11

Slide 11 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 A Sensor Reading

Slide 12

Slide 12 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 An Application Log Entry

Slide 13

Slide 13 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Databases

Slide 14

Slide 14 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Do you think that’s a table you are querying?

Slide 15

Slide 15 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 The Stream Table Duality Account ID Balance 12345 €50

Slide 16

Slide 16 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 The Stream Table Duality Account ID Balance 12345 €50 Account ID Amount 12345 + €50 Time

Slide 17

Slide 17 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 The Stream Table Duality Account ID Amount 12345 + €50 12345 + €25 Account ID Balance 12345 €75 Time

Slide 18

Slide 18 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 The Stream Table Duality Account ID Amount 12345 + €50 12345 + €25 12345 -€60 Account ID Balance 12345 €15 Time

Slide 19

Slide 19 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 The Stream Table Duality Account ID Amount 12345 + €50 12345 + €25 12345 -€60 Account ID Balance 12345 €15 Time Stream Table

Slide 20

Slide 20 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by Bobby Burch on Unsplash

Slide 21

Slide 21 text

@rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka KSQL is the Streaming SQL Engine for Apache Kafka

Slide 22

Slide 22 text

@rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka KSQL for Real-Time Monitoring • Log data monitoring, tracking and alerting • syslog data • Sensor / IoT data CREATE STREAM SYSLOG_INVALID_USERS AS SELECT HOST, MESSAGE FROM SYSLOG WHERE MESSAGE LIKE '%Invalid user%'; http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting

Slide 23

Slide 23 text

@rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka KSQL for Streaming ETL CREATE STREAM vip_actions AS 
 SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id 
 WHERE u.level = 'Platinum'; Joining, filtering, and aggregating streams of event data

Slide 24

Slide 24 text

@rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka Oracle Debezium Kafka Connect Producer API Elasticsearch Kafka Connect Streaming ETL with Apache Kafka and KSQL

Slide 25

Slide 25 text

@rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka KSQL for Anomaly Detection CREATE TABLE possible_fraud AS
 SELECT card_number, count(*)
 FROM authorization_attempts 
 WINDOW TUMBLING (SIZE 5 SECONDS)
 GROUP BY card_number
 HAVING count(*) > 3; Identifying patterns or anomalies in real-time data, surfaced in milliseconds

Slide 26

Slide 26 text

@rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka Photo by Vadim Sherbakov on Unsplash

Slide 27

Slide 27 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Streaming Integration with Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sources syslog flat file CSV JSON MQTT

Slide 28

Slide 28 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Streaming Integration with Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sinks Amazon S3 MQTT

Slide 29

Slide 29 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Streaming Integration with Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sources Sinks Amazon S3 MQTT syslog flat file CSV JSON MQTT

Slide 30

Slide 30 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Kafka Connect basics Kafka Kafka Connect Source

Slide 31

Slide 31 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Connectors Kafka Kafka Connect Source Connector

Slide 32

Slide 32 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Converters Kafka Kafka Connect Source Connector Converter https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained

Slide 33

Slide 33 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Single Message Transforms Kafka Kafka Connect Source Connector Converter Transform(s)

Slide 34

Slide 34 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Extensible Kafka Kafka Connect Source Connector Converter Transform(s) https://docs.confluent.io/current/connect/javadocs/ Browse & download at hub.confluent.io

Slide 35

Slide 35 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Kafka Connect + Schema Registry = WIN RDBMS Avro Message Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect

Slide 36

Slide 36 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Kafka Connect + Schema Registry = WIN RDBMS Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect Avro Message

Slide 37

Slide 37 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 • CDC is a generic term referring to capturing changing data typically from a RDBMS. • Two general approaches: • Query-based CDC • Log-based CDC Change-Data-Capture (CDC) There are other options including hacks with Triggers, Flashback etc but these are system and/or technology-specific.

Slide 38

Slide 38 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 • Use a database query to try and identify new & changed rows
 
 
 • Implemented with the open source Kafka Connect JDBC connector • Can import based on table names, schema, or bespoke SQL query •Incremental ingest driven through incrementing ID column and/or timestamp column Query-based CDC SELECT * FROM my_table 
 WHERE col >

Slide 39

Slide 39 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 • Use the database's transaction log to identify every single change event • Various CDC tools available that integrate with Apache Kafka (more of this later…) Log-based CDC

Slide 40

Slide 40 text

@rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka Demo Time!

Slide 41

Slide 41 text

@rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka "Which one should I use?" Photo by Tyler Nix on Unsplash

Slide 42

Slide 42 text

@rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka It Depends! Photo by Trevor Cole on Unsplash

Slide 43

Slide 43 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Query-based vs Log-based CDC Photo by Matese Fields on Unsplash • Query-based +Usually easier to setup, and requires fewer permissions - Needs specific columns in source schema - Impact of polling the DB (or higher latencies tradeoff) - Can't track deletes, or multiple events between polling interval Read more: http://cnfl.io/kafka-cdc

Slide 44

Slide 44 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Query-based vs Log-based CDC Photo by Sebastian Pociecha on Unsplash Read more: http://cnfl.io/kafka-cdc • Log-based +Greater data fidelity +Lower latency +Lower impact on source - More setup steps - Higher system privileges required - For propriatory databases, usually $$$

Slide 45

Slide 45 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Considerations for Integration into Apache Kafka Photo by Matthew Smith on Unsplash • Chucking data over the fence into a Kafka topic is not enough • CDC tools should integrate with standard ways of building data pipelines in Kafka • Schema handling • Serialisation formats

Slide 46

Slide 46 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 •Oracle GoldenGate for Big Data—Requires the OGGBD licence, not just OGG •Debezium—Open source,Oracle support in Beta • currently uses XStream— which requires OGG licence •Attunity, IBM IIDR, HVR, SQData, StreamSets—all offer commerical CDC integration into Kafka with support for Schema Registry •DBVisit Replicate—no longer under development •JDBC Connector—Open source, but not "true" CDC Oracle and Kafka integration

Slide 47

Slide 47 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Which Log-Based CDC Tool? All these options integrate with Apache Kafka and Confluent Platform, including support for the Schema Registry ⓘ For query-based CDC, use the Confluent Kafka Connect JDBC connector • Open Source RDBMS, 
 e.g. MySQL, PostgreSQL • Debezium • (+ paid options) • Mainframe
 e.g. VSAM, IMS • Attunity • SQData • Proprietory RDBMS, 
 e.g. Oracle, MS SQL • Oracle GoldenGate • Debezium + XStream • Attunity • IBM InfoSphere Data Replication (IIDR) • SQData • HVR

Slide 48

Slide 48 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Confluent Open Source : Apache Kafka with a bunch of cool stuff! For free! Database Changes Log Events loT Data Web Events … CRM Data Warehouse Database Hadoop Data
 Integration … Monitoring Analytics Custom Apps Transformations Real-time Applications … Confluent Platform Confluent Platform Apache Kafka® Core | Connect API | Streams API Data Compatibility Schema Registry Monitoring & Administration Confluent Control Center | Security Operations Replicator | Auto Data Balancing Development and Connectivity Clients | Connectors | REST Proxy | CLI SQL Stream Processing KSQL Datacenter Public Cloud Confluent Cloud CONFLUENT FULLY-MANAGED CUSTOMER SELF-MANAGED

Slide 49

Slide 49 text

No More Silos: Integrating Databases and Apache Kafka @rmoff #ukoug_tech18 Free Books! https://www.confluent.io/apache-kafka-stream-processing-book-bundle

Slide 50

Slide 50 text

@rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka @rmoff robin@confluent.io http://cnfl.io/slack https://www.confluent.io/download/ http://cnfl.io/kafka-cdc

Slide 51

Slide 51 text

@rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka #EOF