
Apache Kafka's Role in Implementing Oracle's Big Data Reference Architecture

Big data... big mess? Without a flexible and proven platform design up front, there is the risk of a tangle of point-to-point feeds. The solution to this is Apache Kafka, which enables stream or batch consumption of the data by multiple consumers. Implemented as part of Oracle's big data architecture, it acts as a flexible and scalable data bus for the enterprise. This session introduces the concepts of Kafka as a distributed streaming platform, and explains how it fits within the big data architecture. See it used with Oracle GoldenGate to stream data into the data reservoir, as well as for ad hoc population of discovery lab environments, microservices, and real-time search.

Robin Moffatt

October 01, 2017

Transcript

  1. 1 Apache Kafka™'s Role in Implementing Oracle's Big Data Reference Architecture
     SUN6259, Oracle OpenWorld, 1 Oct 2017
     Robin Moffatt, Partner Technology Evangelist, EMEA
     t: @rmoff e: [email protected]
  2. 7 Kafka is a Distributed Streaming Platform
     • Publish and subscribe to streams of data, similar to a message queue or enterprise messaging system (see the sketch below)
     • Store streams of data in a fault-tolerant way
     • Process streams of data in real time, as they occur
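
     As an illustration of the publish/subscribe model, here is a minimal sketch of the producer side using the Kafka Java client. The broker address localhost:9092 and the topic name "events" are hypothetical, not from the deck:

         import org.apache.kafka.clients.producer.KafkaProducer;
         import org.apache.kafka.clients.producer.Producer;
         import org.apache.kafka.clients.producer.ProducerRecord;
         import java.util.Properties;

         public class PublishExample {
             public static void main(String[] args) {
                 Properties props = new Properties();
                 props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
                 props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                 props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                 try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                     // Publish one event to the (hypothetical) "events" topic. Any number
                     // of subscribers can then read the same stream independently.
                     producer.send(new ProducerRecord<>("events", "order-42", "{\"status\":\"CREATED\"}"));
                 }
             }
         }
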
  3. 9 $ whoami
     • Partner Technology Evangelist @ Confluent
     • Working in data & analytics since 2001
     • Oracle ACE Director
     • Blogging: http://rmoff.net & https://www.confluent.io/blog/author/robin/
     • Twitter: @rmoff
     • Geek stuff
     • Beer & Fried Breakfasts
  4. 11 Information Management and Big Data Reference Architecture
     • Tool-agnostic logical architecture for Information Management, taking into account Big Data
     • Written by Oracle, with input from Mark Rittman and Stewart Bryson
     • Three years old, but still a good starting point for implementation design
     http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf
  5. 16 What is Kafka?
     • Messages are stored in Topics
     • Roughly analogous to a database table
     • Topics can be partitioned across multiple Kafka nodes for redundancy and performance (see the sketch below)
     • Not just for streaming: hugely useful for data integration too
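
     As a sketch of how a partitioned, replicated topic can be created with the Java AdminClient: the topic name ORDERS and the partition/replica counts are illustrative assumptions, not from the deck:

         import org.apache.kafka.clients.admin.AdminClient;
         import org.apache.kafka.clients.admin.NewTopic;
         import java.util.Collections;
         import java.util.Properties;

         public class CreateTopicExample {
             public static void main(String[] args) throws Exception {
                 Properties props = new Properties();
                 props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
                 try (AdminClient admin = AdminClient.create(props)) {
                     // Six partitions spread load across the cluster; a replication
                     // factor of 3 keeps copies on three nodes for redundancy.
                     NewTopic orders = new NewTopic("ORDERS", 6, (short) 3);
                     admin.createTopics(Collections.singleton(orders)).all().get();
                 }
             }
         }
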
  6. 17 What is Kafka?
     • Kafka makes its data available to any consumer (security permitting)
     • Consumers:
       • Are independent from each other
       • Can be grouped and parallelised for performance and resilience
       • Can re-read messages as required (sketched below)
       • Can read messages as a stream or in batch
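
     A minimal sketch of an independent consumer that rewinds and re-reads a topic, assuming a recent Java client; the group name "audit-service" and topic ORDERS are hypothetical:

         import org.apache.kafka.clients.consumer.ConsumerRecords;
         import org.apache.kafka.clients.consumer.KafkaConsumer;
         import java.time.Duration;
         import java.util.Collections;
         import java.util.Properties;

         public class RereadExample {
             public static void main(String[] args) {
                 Properties props = new Properties();
                 props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
                 props.put("group.id", "audit-service");           // hypothetical consumer group
                 props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                 props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                 try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                     consumer.subscribe(Collections.singleton("ORDERS"));
                     consumer.poll(Duration.ofSeconds(1)); // join the group and get partitions assigned
                     // Kafka retains messages, so a consumer can rewind and re-read
                     // the topic from the beginning whenever it needs to.
                     consumer.seekToBeginning(consumer.assignment());
                     ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                     records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
                 }
             }
         }
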
  7. 18 Running Kafka
     • Apache Kafka is open source
     • Includes Kafka Core, stream processing, and data integration capabilities
     • Can be deployed standalone or as part of Confluent Platform
     • Also available in most Hadoop distributions, though often as older versions without the latest functionality
  8. 19 Confluent Platform: Enterprise Streaming based on Apache Kafka™
     (Diagram: sources such as database changes, log events, IoT data, and web events flow through the platform to destinations such as CRM, data warehouse, database, Hadoop, data integration, monitoring, analytics, custom apps, transformations, and real-time applications.)
     • Apache open source: Apache Kafka™ Core | Connect API | Streams API
     • Confluent open source adds: Schema Registry (data compatibility); Clients | Connectors | REST Proxy | KSQL | CLI (development and connectivity)
     • Confluent Enterprise adds: Confluent Control Center | Security (monitoring & administration); Replicator | Auto Data Balancing (operations)
  9. 23 Multiple Parallel Consumers of the Same Data
     (Diagram: one Producer writes to Kafka; Consumers A and B each read the same data independently.)
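
     One way this looks in code, as a sketch: two consumers with different group.ids (the names "reporting" and "fraud-check" are hypothetical) subscribe to the same topic, and each group receives every message, with its own independently tracked offsets:

         import org.apache.kafka.clients.consumer.KafkaConsumer;
         import java.util.Collections;
         import java.util.Properties;

         public class ParallelConsumersExample {
             // Each group.id tracks its own offsets, so both groups below receive
             // every message on the topic, at their own pace, without interfering.
             static KafkaConsumer<String, String> consumerFor(String groupId) {
                 Properties props = new Properties();
                 props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
                 props.put("group.id", groupId);
                 props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                 props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                 KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 consumer.subscribe(Collections.singleton("ORDERS")); // hypothetical topic
                 return consumer;
             }

             public static void main(String[] args) {
                 KafkaConsumer<String, String> reporting  = consumerFor("reporting");
                 KafkaConsumer<String, String> fraudCheck = consumerFor("fraud-check");
                 // Poll each consumer in its own thread; neither affects the other.
             }
         }
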
  10. 24 Multiple Sources of the Same Type of Data
      (Diagram: multiple Producers write the same type of data into Kafka; Consumers A and B read it without needing to know how many sources there are.)
  11. 29 Varying Latency Requirements / Batch vs Stream
      (Diagram: a Producer feeds Consumer A via a 24hr batch extract; Consumer B needs near-realtime / streamed data.)
  12. 30 Varying Latency Requirements / Batch vs Stream
      (Diagram: Consumer B is fed from Consumer A's 24hr batch extract; the two are now unnecessarily coupled together, and Consumer B still only gets a 24hr batch dump of data.)
  13. 31 Varying Latency Requirements / Batch vs Stream
      (Diagram: the Producer serves Consumer A a 24hr batch extract and Consumer B an event stream, hitting the source system twice.)
  14. 32 Varying Latency Requirements / Batch vs Stream
      (Diagram: both consumers take the event stream, but this requires reimplementation of Consumer A.)
  15. 33 Varying Latency Requirements / Batch vs Stream
      (Diagram: the Producer sends an event stream into Kafka once; Consumer A takes a batch pull and Consumer B an event stream from the same topic.)
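
      A sketch of the two consumption styles against the same topic, assuming the Java client (consumer construction as in the earlier sketches): a long-running streaming loop, and a scheduled batch job that drains whatever has accumulated and then exits:

          import org.apache.kafka.clients.consumer.ConsumerRecord;
          import org.apache.kafka.clients.consumer.ConsumerRecords;
          import org.apache.kafka.clients.consumer.KafkaConsumer;
          import java.time.Duration;

          public class BatchVsStreamExample {
              // Streaming consumer: runs forever, handling events as they arrive.
              static void stream(KafkaConsumer<String, String> consumer) {
                  while (true) {
                      for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                          process(r);
                      }
                  }
              }

              // Batch consumer: a scheduled job that drains whatever has accumulated
              // since its last run, then exits. Same topic, same data, different cadence.
              static void batch(KafkaConsumer<String, String> consumer) {
                  ConsumerRecords<String, String> records;
                  while (!(records = consumer.poll(Duration.ofSeconds(5))).isEmpty()) {
                      records.forEach(BatchVsStreamExample::process);
                  }
              }

              static void process(ConsumerRecord<String, String> r) {
                  // application logic goes here
              }
          }
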
  16. 40 KSQL: a Streaming SQL Engine for Apache Kafka™ from Confluent
      • Enables stream processing with zero coding required
      • The simplest way to process streams of data in real time
      • Powered by Kafka: scalable, distributed, battle-tested
      • All you need is Kafka; no complex deployments of bespoke systems for stream processing
  17. 41 KSQL: the Simplest Way to Do Stream Processing
      CREATE STREAM possible_fraud AS
        SELECT card_number, COUNT(*)
        FROM authorization_attempts
        WINDOW TUMBLING (SIZE 5 SECONDS)
        GROUP BY card_number
        HAVING COUNT(*) > 3;
  18. 44 Building for the Future
      • Enable flexibility & agility for:
        • Performance / scaling / resilience
        • Reducing latency / moving to stream processing
        • Increasing number of data sources
        • Connecting other (as yet unknown) consuming applications
        • Taking advantage of improved technologies (functionality, resilience, cost, scaling, performance)
  19. 46 Apache Kafka is the solid foundation upon which you build any successful data platform
  22. 53 Kafka for building Streaming Data Pipelines
      • Source data is an online ordering system running on Oracle
      • Requirements:
        • Realtime view of key customers logging onto the application
        • Realtime view of aggregated order counts and values
        • Long-term storage of data
        • Populate DW performance layer
  23. 54 Streaming ETL with Apache Kafka and Confluent Platform
      (Diagram: Oracle → Oracle GoldenGate for Big Data (Kafka Connect handler) → Kafka → KSQL → Kafka Connect → Elasticsearch, Oracle, and Hadoop, with the Schema Registry alongside.)
  24. 55 (Diagram: Oracle → Oracle GoldenGate for Big Data (Kafka Connect handler) → Kafka topics LOGON-JSON and CUSTOMERS-JSON → KSQL streams and tables → Elasticsearch via Kafka Connect.)
      CREATE STREAM LOGON (LOGON_ID INT, …)
        WITH (kafka_topic='LOGON-JSON', value_format='JSON');
      CREATE TABLE CUSTOMERS (CUSTOMER_ID INT…)
        WITH (kafka_topic='CUSTOMERS-JSON', value_format='JSON');
      CREATE STREAM LOGON_ENRICHED AS
        SELECT L.LOGON_ID, C.CUSTOMER_ID…
        FROM LOGON L
          LEFT OUTER JOIN CUSTOMERS C
          ON L.CUSTOMER_ID = C.CUSTOMER_ID;
  25. 56 Driving Realtime Analytics with Apache Kafka
      Events in the source system (Oracle) are streamed in realtime through Kafka, enriched via KSQL, and streamed out through Kafka Connect into Elasticsearch.
  26. 57 (Diagram: Oracle → Oracle GoldenGate for Big Data (Kafka Connect handler) → ORDERS topic → KSQL ORDERS stream and order_mode_by_hour table → order_mode_by_hour topic → Elasticsearch via Kafka Connect.)
      CREATE STREAM orders (ORDER_DATE STRING …)
        WITH (kafka_topic='ORDERS', value_format='JSON');
      CREATE TABLE order_mode_by_hour AS
        SELECT order_mode, COUNT(*) AS order_count
        FROM orders
        WINDOW TUMBLING (SIZE 1 HOUR)
        GROUP BY order_mode;
  27. 58 Streaming ETL with Apache Kafka and Confluent Platform
      (Diagram: a variant of the pipeline with Amazon S3 as a destination, using Kafka Connect on both sides, KSQL, Kafka Streams, and the Schema Registry.)
  30. 62 Apache Kafka's Role in Implementing Oracle's Big Data Reference Architecture
      SUN6259, Oracle OpenWorld, 1 Oct 2017
      Robin Moffatt, Partner Technology Evangelist, EMEA
      t: @rmoff e: [email protected]
      https://www.confluent.io/download/
      https://speakerdeck.com/rmoff/