
Apache Kafka's Role in Implementing Oracle's Big Data Reference Architecture

Big data... big mess? Without a flexible and proven platform design up front there is the risk of a mess of point-to-point feeds. The solution to this is Apache Kafka, which enables stream or batch consumption of the data by multiple consumers. Implemented as part of Oracle's big data architecture, it acts as a flexible and scalable data bus for the enterprise. This session introduces the concepts of Kafka and a distributed stream platform, and explains how it fits within the big data architecture. See it used with Oracle GoldenGate to stream data into the data reservoir, as well as ad hoc population of discovery lab environments, microservices, and real-time search.

Robin Moffatt

October 01, 2017


Transcript

  1. 1
    Apache Kafka™'s Role in
    Implementing Oracle's Big
    Data Reference Architecture
    SUN6259
    Oracle OpenWorld, 1 Oct 2017
    Robin Moffatt, Partner Technology Evangelist, EMEA
t: @rmoff e: [email protected]


  2. 2
    What is Apache Kafka™?


  3. 3
“Apache Kafka™ is a distributed streaming platform”


  4. 4
    Sounds Fancy


  5. 5
    But


  6. 6
    What is Kafka?


  7. 7
    Kafka is a Distributed Streaming Platform
• Publish and subscribe to streams of data, similar to a message queue or enterprise messaging system
• Store streams of data in a fault-tolerant way
• Process streams of data in real time, as they occur


  8. 8
    Powered by Apache Kafka™


  9. 9
    $ whoami
    • Partner Technology Evangelist @ Confluent
    • Working in data & analytics since 2001
    • Oracle ACE Director
• Blogging: http://rmoff.net & https://www.confluent.io/blog/author/robin/
    • Twitter: @rmoff
    • Geek stuff
    • Beer & Fried Breakfasts


  10. 10
    The Reference Architecture
    Or “Information Management and Big Data - A Reference Architecture” for short…


  11. 11
    Information Management and Big Data Reference Architecture
• Tool-agnostic logical architecture for Information Management, taking into account Big Data
• Written by Oracle, with input from Mark Rittman and Stewart Bryson
• Three years old, but still a good starting point for implementation design
    http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf


  12. 12
    Conceptual Architecture
    http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf


  13. 13
    Implementation Patterns
    http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf


  14. 14
    How Do We Build For the Future?
    http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf


  15. 15
    Kafka in Detail


  16. 16
    What is Kafka?
• Messages are stored in Topics
  • Roughly analogous to a database table
• Topics can be partitioned across multiple Kafka nodes for redundancy and performance
• Not just about streaming: huge uses for data integration too
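To make the partitioning idea concrete, here is a toy sketch of key-based partition assignment. This is not Kafka's actual partitioner (the real default uses murmur2 hashing of the message key); the hash function and partition count are assumptions for illustration only:

```python
# Toy illustration of key-based partitioning: messages with the same
# key always land in the same partition, so per-key ordering is
# preserved while load spreads across partitions (and nodes).
import hashlib

NUM_PARTITIONS = 3  # assumption for this example

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a message key to a partition."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions

# The same key always maps to the same partition
assert partition_for("customer-42") == partition_for("customer-42")

# Different keys spread across the available partitions
used = {partition_for(f"customer-{i}") for i in range(100)}
assert used <= set(range(NUM_PARTITIONS))
```

Because assignment is deterministic, a consumer reading one partition sees all events for a given key, in order.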


  17. 17
    What is Kafka?
• Kafka makes its data available to any consumer
  • Security permitting
• Consumers:
  • Are independent of each other
  • Can be grouped and parallelised for performance and resilience
  • Can re-read messages as required
  • Can read messages as a stream or in batch
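These consumer properties all fall out of one design decision: Kafka is a log that each consumer reads at its own offset. A minimal in-memory sketch of that idea (a toy model, not the Kafka client API):

```python
# Toy model of Kafka's log + consumer-offset design (illustration
# only). Because each consumer tracks its own offset into a shared,
# append-only log, consumers are independent of each other, can
# re-read data, and can consume as a stream or in batches.

class Log:
    """An append-only log, like a single Kafka topic partition."""
    def __init__(self):
        self._messages = []

    def append(self, msg):
        self._messages.append(msg)

    def read(self, offset, max_messages):
        return self._messages[offset:offset + max_messages]

class Consumer:
    """Reads the log at its own pace; the log tracks nothing about it."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

log = Log()
for i in range(5):
    log.append(f"event-{i}")

streamer = Consumer(log)  # consumes a few events at a time, as they arrive
batcher = Consumer(log)   # reads the same data later, in bulk

assert streamer.poll(2) == ["event-0", "event-1"]
assert batcher.poll() == [f"event-{i}" for i in range(5)]  # unaffected by streamer
streamer.offset = 0       # consumers can rewind and re-read at will
assert streamer.poll(1) == ["event-0"]
```

Real Kafka adds partitioning, replication, and consumer groups on top, but the offset-per-consumer model is the core of why consumers stay decoupled.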


  18. 18
    Running Kafka
• Apache Kafka is open source
  • Includes Kafka Core, stream processing, and data integration capabilities
• Can be deployed standalone or as part of Confluent Platform
• Also available in most Hadoop distributions, but as older versions without the latest functionality


  19. 19
    Confluent Platform: Enterprise Streaming based on Apache Kafka™
Sources: Database Changes | Log Events | IoT Data | Web Events | …
Targets: CRM | Data Warehouse | Database | Hadoop | Data Integration | Monitoring | Analytics | Custom Apps | Transformations | Real-time Applications

Confluent Platform (Apache Open Source, Confluent Open Source, Confluent Enterprise):
• Apache Kafka™: Core | Connect API | Streams API
• Data Compatibility: Schema Registry
• Monitoring & Administration: Confluent Control Center | Security
• Operations: Replicator | Auto Data Balancing
• Development and Connectivity: Clients | Connectors | REST Proxy | KSQL | CLI


  20. 20
What are the Problems that Kafka Solves?


  21. 21
    What are the Problems That Kafka Solves?
Producer → Kafka → Consumer A


  22. 22
    Multiple Independent Customers of the Same Data
Producer → Kafka → Consumer A, Consumer B


  23. 23
    Multiple Parallel Customers of the Same Data
Producer → Kafka → Consumer A (×2, in parallel), Consumer B


  24. 24
    Multiple Sources of the Same Type of Data
Producer ×2 → Kafka → Consumer A (×2), Consumer B


  25. 25
    Scaling Throughput and Resilience
Producer ×2 → Kafka (clustered) → Consumer A (×2), Consumer B


  26. 26
    System Availability and Event Buffering
Producer → Consumer A


  27. 27
    System Availability and Event Buffering
Producer → Kafka → Consumer A


  28. 28
    Varying Latency Requirements / Batch vs Stream
Producer → (24hr batch extract) → Consumer A


  29. 29
    Varying Latency Requirements / Batch vs Stream
Producer → (24hr batch extract) → Consumer A
Consumer B needs near-realtime / streamed data


  30. 30
    Varying Latency Requirements / Batch vs Stream
Producer → (24hr batch extract) → Consumer A → Consumer B
Now unnecessarily coupled together; Consumer B still only gets a 24hr batch dump of data


  31. 31
    Varying Latency Requirements / Batch vs Stream
Producer → (24hr batch extract) → Consumer A
Producer → (event stream) → Consumer B
Hits the source system twice


  32. 32
    Varying Latency Requirements / Batch vs Stream
Producer → (event stream) → Consumer A, Consumer B
Requires reimplementation of Consumer A (previously a 24hr batch extract)


  33. 33
    Varying Latency Requirements / Batch vs Stream
Producer → (event stream) → Kafka → Consumer A (batch pull), Consumer B (event stream)


  34. 34
    Technology & Code Changes
Producer → Kafka → Consumer A (v1)


  35. 35
    Technology & Code Changes
Producer → Kafka → Consumer A (v1), Consumer A (v2)


  36. 36
    Technology & Code Changes
Producer → Kafka → Consumer A (v2)


  37. 37
    How do I get my data into Kafka?


  38. 38
Kafka Connect: stream data in and out of Kafka
Amazon S3
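A Kafka Connect connector is defined declaratively, with no code. As a rough sketch, a sink that streams a topic out to Elasticsearch might be configured like the following, assuming Confluent's Elasticsearch sink connector; the connector name, topic, and connection URL are illustrative placeholders:

```json
{
  "name": "elasticsearch-sink-example",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "ORDERS",
    "connection.url": "http://localhost:9200",
    "type.name": "kafka-connect",
    "key.ignore": "true"
  }
}
```

Posting this JSON to the Kafka Connect REST API (`POST /connectors`) starts the connector; the same declarative pattern applies to source connectors pulling data into Kafka.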


  39. 39
But I need to join… aggregate… filter…


  40. 40
    KSQL: a Streaming SQL Engine for Apache Kafka™ from Confluent
• Enables stream processing with zero coding required
• The simplest way to process streams of data in real time
• Powered by Kafka: scalable, distributed, battle-tested
• All you need is Kafka: no complex deployments of bespoke systems for stream processing


  41. 41
KSQL: the Simplest Way to Do Stream Processing

CREATE STREAM possible_fraud AS
  SELECT card_number, COUNT(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING COUNT(*) > 3;


  42. 42
    Streaming ETL, powered by Apache Kafka and Confluent Platform
    KSQL


  43. 43
    Reference Architecture - Implementation Patterns
    http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf


  44. 44
    Building for the Future
• Enable flexibility & agility for:
  • Performance / scaling / resilience
  • Reducing latency / moving to stream processing
  • Increasing number of data sources
  • Connecting other (as yet unknown) consuming applications
  • Taking advantage of improved technologies (functionality, resilience, cost, scaling, performance)


  45. 45
    Tightly-coupled = Inflexible


  46. 46
Apache Kafka is the solid foundation upon which you build any successful data platform


  47. 47
    Conceptual Architecture
    http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf


  48. 48
    Conceptual Architecture - with Kafka as the Backbone
Search Replica | Graph DB | NoSQL


  49. 49
It's not just about Information Management and Analytics
Best Tool → Best Job


  50. 50


  51. 51


  52. 52
    Putting it into Practice


  53. 53
    Kafka for building Streaming Data Pipelines
• Source data is an online ordering system running on Oracle
• Requirements:
  • Realtime view of key customers logging onto the application
  • Realtime view of aggregated order counts and values
  • Long-term storage of data
  • Populate DW performance layer


  54. Streaming ETL with Apache Kafka and Confluent Platform
Oracle → Oracle GoldenGate for Big Data (Kafka Connect handler) → Kafka → KSQL → Kafka Connect → Elasticsearch / Oracle / Hadoop (with Schema Registry throughout)


  55. 55
Oracle → Oracle GoldenGate for Big Data (Kafka Connect handler) → Kafka topics LOGON-JSON, CUSTOMERS-JSON → KSQL → Kafka Connect → Elasticsearch

LOGON-JSON topic → LOGON stream:
CREATE STREAM LOGON (LOGON_ID INT, …)
  WITH (kafka_topic='LOGON-JSON', value_format='JSON');

CUSTOMERS-JSON topic → CUSTOMERS table:
CREATE TABLE CUSTOMERS (CUSTOMER_ID INT…)
  WITH (kafka_topic='CUSTOMERS-JSON', value_format='JSON');

LOGON_ENRICHED stream (backed by the LOGON_ENRICHED topic):
CREATE STREAM LOGON_ENRICHED AS
  SELECT L.LOGON_ID, C.CUSTOMER_ID…
  FROM LOGON L
    LEFT OUTER JOIN CUSTOMERS C
      ON L.CUSTOMER_ID = C.CUSTOMER_ID;


  56. 56
    Driving Realtime Analytics with Apache Kafka
    Events in the source system (Oracle) are streamed in realtime through Kafka,
    enriched via KSQL, and streamed out through Kafka Connect into Elasticsearch.


  57. 57
Oracle → Oracle GoldenGate for Big Data (Kafka Connect handler) → ORDERS topic → ORDERS stream → order_mode_by_hour table/topic → Kafka Connect → Elasticsearch

CREATE STREAM ORDERS (ORDER_DATE STRING …)
  WITH (kafka_topic='ORDERS', value_format='JSON');

CREATE TABLE order_mode_by_hour AS
  SELECT order_mode, COUNT(*) AS order_count
  FROM orders
  WINDOW TUMBLING (SIZE 1 HOUR)
  GROUP BY order_mode;


  58. 58
    Streaming ETL with Apache Kafka and Confluent Platform
Kafka Connect (in) → Kafka (KSQL | Kafka Streams | Schema Registry) → Kafka Connect (out) → Amazon S3


  59. 59


  60. 60


61.

  62. 62
    Apache Kafka's Role in
    Implementing Oracle's Big
    Data Reference Architecture
    SUN6259
    Oracle OpenWorld, 1 Oct 2017
    Robin Moffatt, Partner Technology Evangelist, EMEA
    t: @rmoff e: [email protected]
    https://www.confluent.io/download/
    https://speakerdeck.com/rmoff/
