Slide 1

Slide 1 text

Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL Oracle Developer Meetup 25 Jun 2018 / Robin Moffatt @rmoff [email protected] https://speakerdeck.com/rmoff/

Slide 2

Slide 2 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 2 $ whoami • Developer Advocate @ Confluent • Working in data & analytics since 2001 • Oracle ACE Director & Dev Champion • Blogging : http://rmoff.net & http://cnfl.io/rmoff • Twitter: @rmoff • Geek stuff • Beer & Fried Breakfasts https://speakerdeck.com/rmoff/

Slide 3

Slide 3 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 3 Analytics—In the beginning… Sales DWH

Slide 4

Slide 4 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 4 And then there were more data sources… Sales DWH Inventory

Slide 5

Slide 5 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 5 Batch Transformations … (ETL / ELT) Sales DWH Inventory

Slide 6

Slide 6 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 6 Add a Data Lake… Sales DWH Inventory Data Lake

Slide 7

Slide 7 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 7 …or Replace the Data Warehouse Sales Inventory Data Lake

Slide 8

Slide 8 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 8 Still need to do Batch transformations… Sales Inventory Data Lake

Slide 9

Slide 9 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 9 Want your data anytime ? Batch is Latency built in by Design

Slide 10

Slide 10 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Photo by Denys Nevozhai on Unsplash Microservices Mobile Machine 
 Learning Internet of 
 Things The World has Changed

Slide 11

Slide 11 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 11 Photo by Rosie Fraser on Unsplash Lots of new technologies (whether you like it or not)

Slide 12

Slide 12 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 App App App App search Hadoop DWH monitoring security MQ MQ cache cache

Slide 13

Slide 13 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 KAFKA DWH Hadoop App App App App App App App App request-response messaging OR stream processing streaming data pipelines changelogs

Slide 14

Slide 14 text

“ But streaming…I've just got data in a database…right?

Slide 15

Slide 15 text

“ Bold claim: all your data is event streams

Slide 16

Slide 16 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 16 A Customer Experience

Slide 17

Slide 17 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 17 A Sale

Slide 18

Slide 18 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 18 A Sensor Reading

Slide 19

Slide 19 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 19 An Application Log Entry

Slide 20

Slide 20 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 20 Databases

Slide 21

Slide 21 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 21 Do you think that’s a table you are querying?

Slide 22

Slide 22 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 22 The Table Stream Duality Account ID Balance 12345 €50

Slide 23

Slide 23 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 22 The Table Stream Duality Account ID Balance 12345 €50 Account ID Amount 12345 + €50 Time

Slide 24

Slide 24 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 22 The Table Stream Duality Account ID Balance 12345 €50 Account ID Amount 12345 + €50 12345 + €25 Account ID Balance 12345 €75 Time

Slide 25

Slide 25 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 22 The Table Stream Duality Account ID Balance 12345 €50 Account ID Amount 12345 + €50 12345 + €25 12345 -€60 Account ID Balance 12345 €75 Account ID Balance 12345 €15 Time

Slide 26

Slide 26 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 22 The Table Stream Duality Account ID Balance 12345 €50 Account ID Amount 12345 + €50 12345 + €25 12345 -€60 Account ID Balance 12345 €75 Account ID Balance 12345 €15 Time Stream Table

Slide 27

Slide 27 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 23 The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by Bobby Burch on Unsplash

Slide 28

Slide 28 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Streaming Platform Vision

Slide 29

Slide 29 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Streaming Platform Vision Event-Driven Scalable Decoupled

Slide 30

Slide 30 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Streaming Platform Vision Event-Driven Scalable Decoupled

Slide 31

Slide 31 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 26 Database offload→Hadoop/Object Storage/Cloud DW for Analytics HDFS / S3 / BigQuery etc RDBMS CDC

Slide 32

Slide 32 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 27 Streaming ETL with Apache Kafka and KSQL order items customer customer orders Stream Processing RDBMS CDC

Slide 33

Slide 33 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 28 Real-time Event Stream Enrichment with Apache Kafka and KSQL order events customer Stream Processing customer orders RDBMS CDC

Slide 34

Slide 34 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 29 Transform Once, Use Many order events customer Stream Processing customer orders RDBMS New App CDC

Slide 35

Slide 35 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 30 Transform Once, Use Many order events customer Stream Processing customer orders RDBMS HDFS / S3 / etc New App CDC

Slide 36

Slide 36 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 31 Evolve processing from old systems to new RDBMS Existing App CDC

Slide 37

Slide 37 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 31 Evolve processing from old systems to new Stream Processing RDBMS Existing App CDC New App

Slide 38

Slide 38 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 32 Evolve processing from old systems to new Stream Processing RDBMS Existing App New App New App CDC

Slide 39

Slide 39 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 33 Streaming ETL, powered by Apache Kafka and Confluent Platform KSQL

Slide 40

Slide 40 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Elasticsearch Oracle Orders Logins Txn log Kafka Streams API KSQL Kafka Kafka
 Connect Kafka Connect Swingbench Realtime Analytics with Kafka and KSQL

Slide 41

Slide 41 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Elasticsearch Oracle Orders Logins Txn log Kafka Streams API KSQL Kafka Kafka
 Connect Kafka Connect Swingbench Realtime Analytics with Kafka and KSQL

Slide 42

Slide 42 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sources Sinks Amazon S3 syslog flat file CSV JSON

Slide 43

Slide 43 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 The Connect API of Apache Kafka® ✓ Fault tolerant and automatically load balanced ✓ Extensible API ✓ Single Message Transforms ✓ Part of Apache Kafka, included in
 Confluent Open Source Reliable and scalable integration of Kafka with other systems – no coding required. { "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector", "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo", "table.whitelist": "sales,orders,customers" } https://docs.confluent.io/current/connect/ ✓ Centralized management and configuration ✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3 ✓ Supports CDC ingest of events from RDBMS ✓ Preserves data schema

Slide 44

Slide 44 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Confluent Hub • Launched June 2018 • One-stop place to discover and download : • Connectors • Transformations • Converters hub.confluent.io

Slide 45

Slide 45 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 38 Kafka Connect + Schema Registry = WIN RDBMS Avro Message Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect

Slide 46

Slide 46 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 39 Kafka Connect + Schema Registry = WIN RDBMS Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect Avro Message

Slide 47

Slide 47 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 39 Kafka Connect + Schema Registry = WIN RDBMS Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect Avro Message

Slide 48

Slide 48 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 40 Change-Data-Capture (CDC)

Slide 49

Slide 49 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 40 Change-Data-Capture (CDC) • CDC is a generic term referring to capturing changing data typically from a RDBMS. • Two general approaches:

Slide 50

Slide 50 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 40 Change-Data-Capture (CDC) • CDC is a generic term referring to capturing changing data typically from a RDBMS. • Two general approaches: • Query-based CDC

Slide 51

Slide 51 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 40 Change-Data-Capture (CDC) • CDC is a generic term referring to capturing changing data typically from a RDBMS. • Two general approaches: • Query-based CDC • Log-based CDC There are other options including hacks with Triggers, Flashback etc but these are system and/or technology-specific.

Slide 52

Slide 52 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 • Use a database query to try and identify new & changed rows
 
 
 • Implemented with the open source Kafka Connect JDBC connector • Can import based on table names, schema, or bespoke SQL query •Incremental ingest driven through incrementing ID column and/or timestamp column 41 Query-based CDC SELECT * FROM my_table 
 WHERE col >

Slide 53

Slide 53 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 42 Log-based CDC • Use the database's transaction log to identify every single change event • Various CDC tools available that integrate with Apache Kafka (more of this later…)

Slide 54

Slide 54 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 43 Query-based vs Log-based CDC Photo by Matese Fields on Unsplash Read more: http://cnfl.io/kafka-cdc • Query-based +Usually easier to setup, and requires fewer permissions - Needs specific columns in source schema - Impact of polling the DB (or higher latencies tradeoff) - Can't track deletes, or multiple events between polling interval

Slide 55

Slide 55 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 44 Query-based vs Log-based CDC Photo by Sebastian Pociecha on Unsplash Read more: http://cnfl.io/kafka-cdc • Log-based +Greater data fidelity +Lower latency +Lower impact on source - More setup steps - Higher system privileges required - For propriatory databases, usually $$$

Slide 56

Slide 56 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 45 Which Log-Based CDC Tool? • Proprietory RDBMS, 
 e.g. Oracle, MS SQL • Attunity • IBM InfoSphere Data Replication (IIDR) • Oracle GoldenGate • SQData • HVR • Open Source RDBMS, 
 e.g. MySQL, PostgreSQL • Debezium • (+ paid options) • Mainframe
 e.g. VSAM, IMS • Attunity • SQData All these options integrate with Apache Kafka and Confluent Platform, including support for the Schema Registry ⓘ For query-based CDC, use the Confluent Kafka Connect JDBC connector

Slide 57

Slide 57 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Elasticsearch Oracle Orders Logins Txn log Kafka Streams API KSQL Kafka Kafka
 Connect Kafka Connect Swingbench Realtime Analytics with Kafka and KSQL

Slide 58

Slide 58 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Elasticsearch Oracle Orders Logins Txn log Kafka Streams API KSQL Kafka Kafka
 Connect Kafka Connect Swingbench Realtime Analytics with Kafka and KSQL

Slide 59

Slide 59 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Declarative Stream Language Processing KSQL is a

Slide 60

Slide 60 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 KSQL is the Streaming SQL Engine for Apache Kafka

Slide 61

Slide 61 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 KSQL in Development and Production Interactive KSQL
 for development and testing REST “Hmm, let me try
 out this idea...”

Slide 62

Slide 62 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 KSQL in Development and Production Interactive KSQL
 for development and testing Headless KSQL
 for Production Desired KSQL queries have been identified REST “Hmm, let me try
 out this idea...”

Slide 63

Slide 63 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 KSQL for Streaming ETL CREATE STREAM vip_actions AS 
 SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id 
 WHERE u.level = 'Platinum'; Joining, filtering, and aggregating streams of event data

Slide 64

Slide 64 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 KSQL for Anomaly Detection CREATE TABLE possible_fraud AS
 SELECT card_number, count(*)
 FROM authorization_attempts 
 WINDOW TUMBLING (SIZE 5 SECONDS)
 GROUP BY card_number
 HAVING count(*) > 3; Identifying patterns or anomalies in real-time data, surfaced in milliseconds

Slide 65

Slide 65 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 KSQL for Real-Time Monitoring • Log data monitoring, tracking and alerting • syslog data • Sensor / IoT data CREATE STREAM SYSLOG_INVALID_USERS AS SELECT HOST, MESSAGE FROM SYSLOG WHERE MESSAGE LIKE '%Invalid user%'; http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting

Slide 66

Slide 66 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 CREATE STREAM views_by_userid WITH (PARTITIONS=6, REPLICAS=5, VALUE_FORMAT='AVRO', TIMESTAMP='view_time') AS 
 SELECT * FROM clickstream PARTITION BY user_id; KSQL for Data Transformation Make simple derivations of existing topics from the command line

Slide 67

Slide 67 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 54 Streaming ETL, powered by Apache Kafka and Confluent Platform KSQL

Slide 68

Slide 68 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Demo Time!

Slide 69

Slide 69 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Demo Time!

Slide 70

Slide 70 text

56

Slide 71

Slide 71 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 57 Customers Logons Logon_Enriched Join, and Derive additional columns

Slide 72

Slide 72 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 58 Orders Orders_agg_hourly Aggregate order metrics by hour

Slide 73

Slide 73 text

59

Slide 74

Slide 74 text

59

Slide 75

Slide 75 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 60

Slide 76

Slide 76 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Confluent Open Source : Apache Kafka with a bunch of cool stuff! For free! Database Changes Log Events loT Data Web Events … CRM Data Warehouse Database Hadoop Data
 Integration … Monitoring Analytics Custom Apps Transformations Real-time Applications … Apache Open Source Confluent Open Source Confluent Enterprise Confluent Platform Confluent Platform Apache Kafka® Core | Connect API | Streams API Data Compatibility Schema Registry Monitoring & Administration Confluent Control Center | Security Operations Replicator | Auto Data Balancing Development and Connectivity Clients | Connectors | REST Proxy | CLI Apache Open Source Confluent Open Source Confluent Enterprise SQL Stream Processing KSQL

Slide 77

Slide 77 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 Free Books! https://www.confluent.io/apache-kafka-stream-processing-book-bundle

Slide 78

Slide 78 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 @rmoff [email protected] https://slackpass.io/confluentcommunity https://www.confluent.io/download/

Slide 79

Slide 79 text

@rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 #EOF