Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Oracle Developer Meetup - Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Oracle Developer Meetup - Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Robin Moffatt

June 25, 2018
Tweet

More Decks by Robin Moffatt

Other Decks in Technology

Transcript

  1. Streaming ETL in
    Practice with Oracle,
    Apache Kafka, and KSQL
    Oracle Developer Meetup
    25 Jun 2018 / Robin Moffatt
    @rmoff [email protected]
    https://speakerdeck.com/rmoff/

    View full-size slide

  2. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 2
    $ whoami
    • Developer Advocate @ Confluent
    • Working in data & analytics since 2001
    • Oracle ACE Director & Dev Champion
    • Blogging : http://rmoff.net & http://cnfl.io/rmoff
    • Twitter: @rmoff
    • Geek stuff
    • Beer & Fried Breakfasts
    https://speakerdeck.com/rmoff/

    View full-size slide

  3. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 3
    Analytics—In the beginning…
    Sales DWH

    View full-size slide

  4. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 4
    And then there were more data sources…
    Sales DWH
    Inventory

    View full-size slide

  5. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 5
    Batch Transformations … (ETL / ELT)
    Sales DWH
    Inventory

    View full-size slide

  6. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 6
    Add a Data Lake…
    Sales DWH
    Inventory Data Lake

    View full-size slide

  7. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 7
    …or Replace the Data Warehouse
    Sales
    Inventory Data Lake

    View full-size slide

  8. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 8
    Still need to do Batch transformations…
    Sales
    Inventory Data Lake

    View full-size slide

  9. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 9
    Want your data anytime ?
    Batch is Latency built in by Design

    View full-size slide

  10. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Photo by Denys Nevozhai on Unsplash
    Microservices Mobile Machine 

    Learning
    Internet of 

    Things
    The World has Changed

    View full-size slide

  11. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 11
    Photo by Rosie Fraser on Unsplash
    Lots of new technologies
    (whether you like it or not)

    View full-size slide

  12. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    App App App App
    search
    Hadoop
    DWH
    monitoring security
    MQ MQ
    cache
    cache

    View full-size slide

  13. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    KAFKA
    DWH Hadoop
    App
    App App App App
    App
    App
    App
    request-response
    messaging
    OR
    stream
    processing
    streaming data pipelines
    changelogs

    View full-size slide


  14. But streaming…I've just got
    data in a database…right?

    View full-size slide


  15. Bold claim: all your data
    is event streams

    View full-size slide

  16. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 16
    A Customer
    Experience

    View full-size slide

  17. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 17
    A Sale

    View full-size slide

  18. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 18
    A Sensor
    Reading

    View full-size slide

  19. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 19
    An Application
    Log Entry

    View full-size slide

  20. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 20
    Databases

    View full-size slide

  21. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 21
    Do you think that’s a table
    you are querying?

    View full-size slide

  22. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 22
    The Table Stream Duality
    Account ID Balance
    12345 €50

    View full-size slide

  23. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 22
    The Table Stream Duality
    Account ID Balance
    12345 €50
    Account ID Amount
    12345 + €50
    Time

    View full-size slide

  24. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 22
    The Table Stream Duality
    Account ID Balance
    12345 €50
    Account ID Amount
    12345 + €50
    12345 + €25
    Account ID Balance
    12345 €75
    Time

    View full-size slide

  25. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 22
    The Table Stream Duality
    Account ID Balance
    12345 €50
    Account ID Amount
    12345 + €50
    12345 + €25
    12345 -€60
    Account ID Balance
    12345 €75
    Account ID Balance
    12345 €15
    Time

    View full-size slide

  26. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 22
    The Table Stream Duality
    Account ID Balance
    12345 €50
    Account ID Amount
    12345 + €50
    12345 + €25
    12345 -€60
    Account ID Balance
    12345 €75
    Account ID Balance
    12345 €15
    Time
    Stream Table

    View full-size slide

  27. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 23
    The truth is the log.
    The database is a cache
    of a subset of the log.
    —Pat Helland
    Immutability Changes Everything
    http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
    Photo by Bobby Burch on Unsplash

    View full-size slide

  28. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Streaming Platform Vision

    View full-size slide

  29. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Streaming Platform Vision
    Event-Driven
    Scalable
    Decoupled

    View full-size slide

  30. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Streaming Platform Vision
    Event-Driven
    Scalable
    Decoupled

    View full-size slide

  31. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 26
    Database offload→Hadoop/Object Storage/Cloud DW for Analytics
    HDFS / S3 /
    BigQuery etc
    RDBMS
    CDC

    View full-size slide

  32. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 27
    Streaming ETL with Apache Kafka and KSQL
    order items
    customer
    customer orders
    Stream
    Processing
    RDBMS CDC

    View full-size slide

  33. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 28
    Real-time Event Stream Enrichment with Apache Kafka and KSQL
    order events
    customer
    Stream
    Processing
    customer orders
    RDBMS

    CDC

    View full-size slide

  34. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 29
    Transform Once, Use Many
    order events
    customer
    Stream
    Processing
    customer orders
    RDBMS

    New App

    CDC

    View full-size slide

  35. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 30
    Transform Once, Use Many
    order events
    customer
    Stream
    Processing
    customer orders
    RDBMS

    HDFS / S3 / etc
    New App

    CDC

    View full-size slide

  36. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 31
    Evolve processing from old systems to new
    RDBMS
    Existing
    App
    CDC

    View full-size slide

  37. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 31
    Evolve processing from old systems to new
    Stream
    Processing
    RDBMS
    Existing
    App
    CDC
    New App

    View full-size slide

  38. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 32
    Evolve processing from old systems to new
    Stream
    Processing
    RDBMS
    Existing
    App
    New App

    New App

    CDC

    View full-size slide

  39. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 33
    Streaming ETL, powered by Apache Kafka and Confluent Platform
    KSQL

    View full-size slide

  40. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Elasticsearch
    Oracle
    Orders Logins
    Txn log
    Kafka Streams
    API
    KSQL
    Kafka
    Kafka

    Connect
    Kafka Connect
    Swingbench
    Realtime Analytics with Kafka and KSQL

    View full-size slide

  41. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Elasticsearch
    Oracle
    Orders Logins
    Txn log
    Kafka Streams
    API
    KSQL
    Kafka
    Kafka

    Connect
    Kafka Connect
    Swingbench
    Realtime Analytics with Kafka and KSQL

    View full-size slide

  42. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Kafka Connect
    Kafka Brokers
    Kafka Connect
    Tasks Workers
    Sources Sinks
    Amazon S3
    syslog
    flat file
    CSV
    JSON

    View full-size slide

  43. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    The Connect API of Apache Kafka®
    ✓ Fault tolerant and automatically load balanced
    ✓ Extensible API
    ✓ Single Message Transforms
    ✓ Part of Apache Kafka, included in

    Confluent Open Source
    Reliable and scalable integration of Kafka
    with other systems – no coding required.
    {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
    "table.whitelist": "sales,orders,customers"
    }
    https://docs.confluent.io/current/connect/
    ✓ Centralized management and configuration
    ✓ Support for hundreds of technologies including
    RDBMS, Elasticsearch, HDFS, S3
    ✓ Supports CDC ingest of events from RDBMS
    ✓ Preserves data schema

    View full-size slide

  44. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Confluent Hub
    • Launched June 2018
    • One-stop place to discover
    and download :
    • Connectors
    • Transformations
    • Converters
    hub.confluent.io

    View full-size slide

  45. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 38
    Kafka Connect + Schema Registry = WIN
    RDBMS
    Avro
    Message
    Elasticsearch
    Schema
    Registry
    Avro
    Schema
    Kafka
    Connect
    Kafka
    Connect

    View full-size slide

  46. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 39
    Kafka Connect + Schema Registry = WIN
    RDBMS
    Elasticsearch
    Schema
    Registry
    Avro
    Schema
    Kafka
    Connect
    Kafka
    Connect
    Avro
    Message

    View full-size slide

  47. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 39
    Kafka Connect + Schema Registry = WIN
    RDBMS
    Elasticsearch
    Schema
    Registry
    Avro
    Schema
    Kafka
    Connect
    Kafka
    Connect
    Avro
    Message

    View full-size slide

  48. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 40
    Change-Data-Capture (CDC)

    View full-size slide

  49. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 40
    Change-Data-Capture (CDC)
    • CDC is a generic term referring to
    capturing changing data typically
    from a RDBMS.
    • Two general approaches:

    View full-size slide

  50. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 40
    Change-Data-Capture (CDC)
    • CDC is a generic term referring to
    capturing changing data typically
    from a RDBMS.
    • Two general approaches:
    • Query-based CDC

    View full-size slide

  51. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 40
    Change-Data-Capture (CDC)
    • CDC is a generic term referring to
    capturing changing data typically
    from a RDBMS.
    • Two general approaches:
    • Query-based CDC
    • Log-based CDC
    There are other options including hacks with
    Triggers, Flashback etc but these are system and/or
    technology-specific.

    View full-size slide

  52. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    • Use a database query to try and identify new & changed rows



    • Implemented with the open source Kafka Connect JDBC connector
    • Can import based on table names, schema, or bespoke SQL query
    •Incremental ingest driven through incrementing ID column and/or
    timestamp column
    41
    Query-based CDC
    SELECT * FROM my_table 

    WHERE col >

    View full-size slide

  53. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 42
    Log-based CDC
    • Use the database's
    transaction log to identify
    every single change event
    • Various CDC tools available
    that integrate with Apache
    Kafka (more of this later…)

    View full-size slide

  54. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 43
    Query-based vs Log-based CDC
    Photo by Matese Fields on Unsplash
    Read more: http://cnfl.io/kafka-cdc
    • Query-based
    +Usually easier to setup, and requires
    fewer permissions
    - Needs specific columns in source
    schema
    - Impact of polling the DB (or higher
    latencies tradeoff)
    - Can't track deletes, or multiple events
    between polling interval

    View full-size slide

  55. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 44
    Query-based vs Log-based CDC
    Photo by Sebastian Pociecha on Unsplash
    Read more: http://cnfl.io/kafka-cdc
    • Log-based
    +Greater data fidelity
    +Lower latency
    +Lower impact on source
    - More setup steps
    - Higher system privileges required
    - For propriatory databases, usually $$$

    View full-size slide

  56. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 45
    Which Log-Based CDC Tool?
    • Proprietory RDBMS, 

    e.g. Oracle, MS SQL
    • Attunity
    • IBM InfoSphere Data
    Replication (IIDR)
    • Oracle GoldenGate
    • SQData
    • HVR
    • Open Source RDBMS, 

    e.g. MySQL, PostgreSQL
    • Debezium
    • (+ paid options)
    • Mainframe

    e.g. VSAM, IMS
    • Attunity
    • SQData
    All these options integrate with Apache Kafka and Confluent
    Platform, including support for the Schema Registry
    ⓘ For query-based CDC, use the Confluent Kafka Connect JDBC connector

    View full-size slide

  57. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Elasticsearch
    Oracle
    Orders Logins
    Txn log
    Kafka Streams
    API
    KSQL
    Kafka
    Kafka

    Connect
    Kafka Connect
    Swingbench
    Realtime Analytics with Kafka and KSQL

    View full-size slide

  58. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Elasticsearch
    Oracle
    Orders Logins
    Txn log
    Kafka Streams
    API
    KSQL
    Kafka
    Kafka

    Connect
    Kafka Connect
    Swingbench
    Realtime Analytics with Kafka and KSQL

    View full-size slide

  59. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Declarative
    Stream
    Language
    Processing
    KSQL
    is a

    View full-size slide

  60. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    KSQL
    is the
    Streaming
    SQL Engine
    for
    Apache Kafka

    View full-size slide

  61. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    KSQL in Development and Production
    Interactive KSQL

    for development and testing
    REST
    “Hmm, let me try

    out this idea...”

    View full-size slide

  62. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    KSQL in Development and Production
    Interactive KSQL

    for development and testing
    Headless KSQL

    for Production
    Desired KSQL queries
    have been identified
    REST
    “Hmm, let me try

    out this idea...”

    View full-size slide

  63. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    KSQL for Streaming ETL
    CREATE STREAM vip_actions AS 

    SELECT userid, page, action
    FROM clickstream c
    LEFT JOIN users u
    ON c.userid = u.user_id 

    WHERE u.level = 'Platinum';
    Joining, filtering, and aggregating streams of event data

    View full-size slide

  64. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    KSQL for Anomaly Detection
    CREATE TABLE possible_fraud AS

    SELECT card_number, count(*)

    FROM authorization_attempts 

    WINDOW TUMBLING (SIZE 5 SECONDS)

    GROUP BY card_number

    HAVING count(*) > 3;
    Identifying patterns or anomalies in real-time data,
    surfaced in milliseconds

    View full-size slide

  65. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    KSQL for Real-Time Monitoring
    • Log data monitoring, tracking and alerting
    • syslog data
    • Sensor / IoT data
    CREATE STREAM SYSLOG_INVALID_USERS AS
    SELECT HOST, MESSAGE
    FROM SYSLOG
    WHERE MESSAGE LIKE '%Invalid user%';
    http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting

    View full-size slide

  66. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    CREATE STREAM views_by_userid
    WITH (PARTITIONS=6, REPLICAS=5,
    VALUE_FORMAT='AVRO',
    TIMESTAMP='view_time') AS 

    SELECT * FROM clickstream
    PARTITION BY user_id;
    KSQL for Data Transformation
    Make simple derivations of existing topics from the command line

    View full-size slide

  67. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 54
    Streaming ETL, powered by Apache Kafka and Confluent Platform
    KSQL

    View full-size slide

  68. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Demo Time!

    View full-size slide

  69. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Demo Time!

    View full-size slide

  70. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 57
    Customers
    Logons
    Logon_Enriched
    Join, and Derive
    additional columns

    View full-size slide

  71. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 58
    Orders
    Orders_agg_hourly
    Aggregate order
    metrics by hour

    View full-size slide

  72. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018 60

    View full-size slide

  73. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Confluent Open Source :
    Apache Kafka with a bunch of cool stuff! For free!
    Database Changes Log Events loT Data Web Events …
    CRM
    Data Warehouse
    Database
    Hadoop
    Data

    Integration

    Monitoring
    Analytics
    Custom Apps
    Transformations
    Real-time Applications

    Apache Open Source Confluent Open Source Confluent Enterprise
    Confluent Platform
    Confluent Platform
    Apache Kafka®
    Core | Connect API | Streams API
    Data Compatibility
    Schema Registry
    Monitoring & Administration
    Confluent Control Center | Security
    Operations
    Replicator | Auto Data Balancing
    Development and Connectivity
    Clients | Connectors | REST Proxy | CLI
    Apache Open Source Confluent Open Source Confluent Enterprise
    SQL Stream Processing
    KSQL

    View full-size slide

  74. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    Free Books!
    https://www.confluent.io/apache-kafka-stream-processing-book-bundle

    View full-size slide

  75. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    @rmoff [email protected]
    https://slackpass.io/confluentcommunity
    https://www.confluent.io/download/

    View full-size slide

  76. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL / June 2018
    #EOF

    View full-size slide