
Apache Kafka's Role in Implementing Oracle's Big Data Reference Architecture

Big data... big mess? Without a flexible and proven platform design up front, you risk ending up with a tangle of point-to-point feeds. Apache Kafka addresses this by enabling stream or batch consumption of the data by multiple consumers. Implemented as part of Oracle's big data architecture, it acts as a flexible and scalable data bus for the enterprise. This session introduces the concepts of Kafka and of a distributed streaming platform, and explains how Kafka fits within the big data architecture. See it used with Oracle GoldenGate to stream data into the data reservoir, as well as for ad hoc population of discovery lab environments, microservices, and real-time search.


Robin Moffatt

October 01, 2017

Transcript

  1. 1 Apache Kafka™'s Role in Implementing Oracle's Big Data Reference Architecture. SUN6259, Oracle OpenWorld, 1 Oct 2017. Robin Moffatt, Partner Technology Evangelist, EMEA. t: @rmoff e: robin@confluent.io

  2. 2 What is Apache Kafka™?

  3. 3 “Apache Kafka™ is a distributed streaming platform”

  4. 4 Sounds Fancy

  5. 5 But

  6. 6 What is Kafka?

  7. 7 Kafka is a Distributed Streaming Platform. Publish and subscribe to streams of data, similar to a message queue or enterprise messaging system. Store streams of data in a fault-tolerant way. Process streams of data in real time, as they occur.

  8. 8 Powered by Apache Kafka™

  9. 9 $ whoami • Partner Technology Evangelist @ Confluent • Working in data & analytics since 2001 • Oracle ACE Director • Blogging: http://rmoff.net & https://www.confluent.io/blog/author/robin/ • Twitter: @rmoff • Geek stuff • Beer & Fried Breakfasts

  10. 10 The Reference Architecture. Or “Information Management and Big Data - A Reference Architecture” for short…

  11. 11 Information Management and Big Data Reference Architecture • Tool-agnostic logical architecture for Information Management, taking into account Big Data • Written by Oracle, with input from Mark Rittman and Stewart Bryson • Three years old, but still a good starting point for implementation design http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf

  12. 12 Conceptual Architecture http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf

  13. 13 Implementation Patterns http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf

  14. 14 How Do We Build For the Future? http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf

  15. 15 Kafka in Detail

  16. 16 What is Kafka? • Messages are stored in Topics • Roughly analogous to a database table • Topics can be partitioned across multiple Kafka nodes for redundancy and performance • Not just about streaming - hugely useful for data integration too
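
    To make the topic/table analogy concrete, here is a minimal sketch using the KSQL CLI that appears later in the deck (the topic name LOGON-JSON is borrowed from the demo slides; exact output varies by version):

        -- List the topics known to the cluster, with their partition counts
        SHOW TOPICS;

        -- Peek at the raw messages on a topic, starting from the beginning
        PRINT 'LOGON-JSON' FROM BEGINNING;
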
  17. 17 What is Kafka? • Kafka makes its data available to any consumer (security permitting) • Consumers: • Are independent from each other • Can be grouped and parallelised for performance and resilience • Can re-read messages as required • Can read messages as a stream or in batches
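
    Re-reading is simply a matter of where a consumer chooses to start. As a sketch in KSQL terms (SET 'auto.offset.reset' is standard KSQL; the LOGON stream is defined in the demo later in the deck):

        -- Start subsequent queries from the earliest available message, not from 'now'
        SET 'auto.offset.reset' = 'earliest';
        SELECT * FROM LOGON;
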
  18. 18 Running Kafka • Apache Kafka is open source • Includes Kafka Core, stream processing, and data integration capabilities • Can be deployed standalone or as part of Confluent Platform • Also available in most Hadoop distributions, but as older versions without the latest functionality

  19. 19 Confluent Platform: Enterprise Streaming based on Apache Kafka™ [Diagram: sources such as database changes, log events, IoT data, and web events flow through the platform to destinations such as CRM, data warehouse, database, Hadoop, data integration, monitoring, analytics, custom apps, transformations, and real-time applications. Apache open source: Apache Kafka™ Core | Connect API | Streams API. Confluent Open Source adds development and connectivity (Clients | Connectors | REST Proxy | KSQL | CLI) and data compatibility (Schema Registry). Confluent Enterprise adds monitoring & administration (Confluent Control Center | Security) and operations (Replicator | Auto Data Balancing).]

  20. 20 What are the Problems that Kafka Solves?

  21. 21 What are the Problems That Kafka Solves? [Diagram: Producer → Kafka → Consumer A]

  22. 22 Multiple Independent Consumers of the Same Data [Diagram: Producer → Kafka → Consumer A, Consumer B]

  23. 23 Multiple Parallel Consumers of the Same Data [Diagram: Producer → Kafka → Consumer A (two parallel instances), Consumer B]

  24. 24 Multiple Sources of the Same Type of Data [Diagram: two Producers → Kafka → Consumer A, Consumer B]

  25. 25 Scaling Throughput and Resilience [Diagram: Producers → Kafka cluster of multiple nodes → Consumer A, Consumer B]

  26. 26 System Availability and Event Buffering [Diagram: Producer → Consumer A, directly coupled]

  27. 27 System Availability and Event Buffering [Diagram: Producer → Kafka → Consumer A]

  28. 28 Varying Latency Requirements / Batch vs Stream [Diagram: Producer → 24hr batch extract → Consumer A]

  29. 29 Varying Latency Requirements / Batch vs Stream [Diagram: Producer → 24hr batch extract → Consumer A; Consumer B needs near-realtime / streamed data]

  30. 30 Varying Latency Requirements / Batch vs Stream [Diagram: Producer → 24hr batch extract → Consumer A → Consumer B; the consumers are now unnecessarily coupled together, and Consumer B still only gets a 24hr batch dump of the data]

  31. 31 Varying Latency Requirements / Batch vs Stream [Diagram: Producer → 24hr batch extract → Consumer A, plus Producer → event stream → Consumer B; this hits the source system twice]

  32. 32 Varying Latency Requirements / Batch vs Stream [Diagram: Producer → event stream → Consumer A, Consumer B; this requires reimplementation of Consumer A]

  33. 33 Varying Latency Requirements / Batch vs Stream [Diagram: Producer → event stream → Kafka; Consumer A takes a batch pull, Consumer B takes the event stream]

  34. 34 Technology & Code Changes [Diagram: Producer → Kafka → Consumer A (v1)]

  35. 35 Technology & Code Changes [Diagram: Producer → Kafka → Consumer A (v1) and Consumer A (v2) side by side]
  36. 36 Technology & Code Changes [Diagram: Producer → Kafka → Consumer A (v2)]

  37. 37 How do I get my data into Kafka?

  38. 38 Kafka Connect: stream data in and out of Kafka [Diagram: Kafka Connect moving data between Kafka and external systems such as Amazon S3]
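
    At the time of this deck, Kafka Connect connectors were configured with JSON or properties files via the Connect REST API. Later versions of KSQL's successor, ksqlDB, can also define them in SQL; purely as an illustrative sketch of the same idea (the bucket name is hypothetical; the connector class is Confluent's S3 sink):

        -- Define a sink connector that copies the ORDERS topic to Amazon S3
        -- (ksqlDB syntax, newer than this 2017 deck; bucket name is made up)
        CREATE SINK CONNECTOR s3_sink WITH (
          'connector.class' = 'io.confluent.connect.s3.S3SinkConnector',
          'topics'          = 'ORDERS',
          's3.bucket.name'  = 'my-example-bucket',
          's3.region'       = 'us-east-1',
          'storage.class'   = 'io.confluent.connect.s3.storage.S3Storage',
          'format.class'    = 'io.confluent.connect.s3.format.json.JsonFormat',
          'flush.size'      = '1000'  -- records per S3 object
        );
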
  39. 39 But I need to join… aggregate… filter…

  40. 40 KSQL: a Streaming SQL Engine for Apache Kafka™ from Confluent • Enables stream processing with zero coding required • The simplest way to process streams of data in real-time • Powered by Kafka: scalable, distributed, battle-tested • All you need is Kafka - no complex deployments of bespoke systems for stream processing

  41. 41 KSQL: the Simplest Way to Do Stream Processing

        CREATE STREAM possible_fraud AS
          SELECT card_number, count(*)
          FROM authorization_attempts
          WINDOW TUMBLING (SIZE 5 SECONDS)
          GROUP BY card_number
          HAVING count(*) > 3;
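
    As a minimal usage sketch (assuming the stream above has been created and a KSQL CLI is attached to the cluster):

        -- Continuous query over the derived stream; rows are emitted as new
        -- cards exceed three authorization attempts within a 5-second window
        SELECT * FROM possible_fraud;
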
  42. 42 Streaming ETL, powered by Apache Kafka and Confluent Platform [Diagram: streaming ETL pipeline with KSQL]

  43. 43 Reference Architecture - Implementation Patterns http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf

  44. 44 Building for the Future • Enable flexibility & agility for: • Performance / scaling / resilience • Reducing latency / moving to stream processing • Increasing number of data sources • Connecting other (as yet unknown) consuming applications • Taking advantage of improved technologies (functionality, resilience, cost, scaling, performance)

  45. 45 Tightly-coupled = Inflexible

  46. 46 Apache Kafka is the solid foundation upon which you build any successful data platform

  47. 47 Conceptual Architecture http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf

  48. 48 Conceptual Architecture - with Kafka as the Backbone [Diagram: conceptual architecture with Kafka as the backbone, feeding stores such as a search replica, graph DB, and NoSQL]

  49. 49 It's not just about Information Management and Analytics: Best Tool -> Best Job

  50. 50

  51. 51

  52. 52 Putting it into Practice

  53. 53 Kafka for building Streaming Data Pipelines • Source data is an online ordering system running on Oracle • Requirements: • Realtime view of key customers logging onto the application • Realtime view of aggregated order counts and values • Long-term storage of data • Populate DW performance layer

  54. 54 Streaming ETL with Apache Kafka and Confluent Platform [Diagram: Oracle → Oracle GoldenGate for Big Data (Kafka Connect handler) → Kafka, with KSQL and Schema Registry, and Kafka Connect out to Elasticsearch, Oracle, and Hadoop]

  55. 55 [Diagram: Oracle → Oracle GoldenGate for Big Data (Kafka Connect handler) → Kafka topics LOGON-JSON and CUSTOMERS-JSON → KSQL (LOGON stream, CUSTOMERS table, LOGON_ENRICHED stream) → LOGON_ENRICHED topic → Kafka Connect → Elasticsearch]

        CREATE STREAM LOGON (LOGON_ID INT, …)
          WITH (kafka_topic='LOGON-JSON', value_format='JSON');

        CREATE TABLE CUSTOMERS (CUSTOMER_ID INT, …)
          WITH (kafka_topic='CUSTOMERS-JSON', value_format='JSON');

        CREATE STREAM LOGON_ENRICHED AS
          SELECT L.LOGON_ID, C.CUSTOMER_ID, …
          FROM LOGON L
            LEFT OUTER JOIN CUSTOMERS C
            ON L.CUSTOMER_ID = C.CUSTOMER_ID;
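
    A quick way to sanity-check the enriched stream from the KSQL CLI (a sketch; assumes the objects above exist, with any columns beyond those shown on the slide elided):

        -- Stream enriched logon events as they arrive
        SELECT LOGON_ID, CUSTOMER_ID FROM LOGON_ENRICHED;
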
  56. 56 Driving Realtime Analytics with Apache Kafka. Events in the source system (Oracle) are streamed in realtime through Kafka, enriched via KSQL, and streamed out through Kafka Connect into Elasticsearch.

  57. 57 [Diagram: Oracle → Oracle GoldenGate for Big Data (Kafka Connect handler) → Kafka topic ORDERS → KSQL (ORDERS stream, order_mode_by_hour table) → order_mode_by_hour topic → Kafka Connect → Elasticsearch]

        CREATE STREAM orders (ORDER_DATE STRING, …)
          WITH (kafka_topic='ORDERS', value_format='JSON');

        CREATE TABLE order_mode_by_hour AS
          SELECT order_mode, count(*) AS order_count
          FROM orders
          WINDOW TUMBLING (SIZE 1 HOUR)
          GROUP BY order_mode;
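
    And to watch the hourly aggregate update (a sketch; assumes the table above is populated):

        -- Continuously emit updated per-hour counts for each order mode
        SELECT order_mode, order_count FROM order_mode_by_hour;
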
  58. 58 Streaming ETL with Apache Kafka and Confluent Platform [Diagram: data in and out via Kafka Connect (including Amazon S3), with KSQL, Schema Registry, and Kafka Streams]

  59. 59

  60. 60

  61. None
  62. 62 Apache Kafka's Role in Implementing Oracle's Big Data Reference Architecture. SUN6259, Oracle OpenWorld, 1 Oct 2017. Robin Moffatt, Partner Technology Evangelist, EMEA. t: @rmoff e: robin@confluent.io https://www.confluent.io/download/ https://speakerdeck.com/rmoff/