
Apache Kafka's Role in Implementing Oracle's Big Data Reference Architecture

Big data... big mess? Without a flexible and proven platform design up front, there is the risk of a tangle of point-to-point feeds. The solution to this is Apache Kafka, which enables stream or batch consumption of the data by multiple consumers. Implemented as part of Oracle's big data architecture, it acts as a flexible and scalable data bus for the enterprise. This session introduces the concepts of Kafka as a distributed streaming platform, and explains how it fits within the big data architecture. See it used with Oracle GoldenGate to stream data into the data reservoir, as well as for ad hoc population of discovery lab environments, microservices, and real-time search.

Robin Moffatt

October 01, 2017

Transcript

  1. 1 Apache Kafka™'s Role in Implementing Oracle's Big Data Reference Architecture
     SUN6259, Oracle OpenWorld, 1 Oct 2017
     Robin Moffatt, Partner Technology Evangelist, EMEA
     t: @rmoff e: [email protected]
  2. 7 Kafka is a Distributed Streaming Platform
     • Publish and subscribe to streams of data, similar to a message queue or enterprise messaging system (see the sketch below)
     • Store streams of data in a fault-tolerant way
     • Process streams of data in real time, as they occur
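
     As an illustration of the publish/subscribe model, here is a minimal sketch of the producer side using the Kafka Java client. The broker address localhost:9092 and the topic name "events" are hypothetical, not from the deck:

         import org.apache.kafka.clients.producer.KafkaProducer;
         import org.apache.kafka.clients.producer.Producer;
         import org.apache.kafka.clients.producer.ProducerRecord;
         import java.util.Properties;

         public class PublishExample {
             public static void main(String[] args) {
                 Properties props = new Properties();
                 props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
                 props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                 props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                 try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                     // Publish one event to the (hypothetical) "events" topic. Any number
                     // of subscribers can then read the same stream independently.
                     producer.send(new ProducerRecord<>("events", "order-42", "{\"status\":\"CREATED\"}"));
                 }
             }
         }
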
  3. 9 $ whoami
     • Partner Technology Evangelist @ Confluent
     • Working in data & analytics since 2001
     • Oracle ACE Director
     • Blogging: http://rmoff.net & https://www.confluent.io/blog/author/robin/
     • Twitter: @rmoff
     • Geek stuff
     • Beer & Fried Breakfasts
  4. 11 Information Management and Big Data Reference Architecture
     • Tool-agnostic logical architecture for Information Management, taking into account Big Data
     • Written by Oracle, with input from Mark Rittman and Stewart Bryson
     • Three years old, but still a good starting point for implementation design
     http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatarefarchitecture-2297765.pdf
  5. 16 What is Kafka?
     • Messages are stored in Topics
     • Roughly analogous to a database table
     • Topics can be partitioned across multiple Kafka nodes for redundancy and performance (see the sketch below)
     • Not just for streaming: hugely useful for data integration too
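
     As a sketch of how a partitioned, replicated topic can be created with the Java AdminClient: the topic name ORDERS and the partition/replica counts are illustrative assumptions, not from the deck:

         import org.apache.kafka.clients.admin.AdminClient;
         import org.apache.kafka.clients.admin.NewTopic;
         import java.util.Collections;
         import java.util.Properties;

         public class CreateTopicExample {
             public static void main(String[] args) throws Exception {
                 Properties props = new Properties();
                 props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
                 try (AdminClient admin = AdminClient.create(props)) {
                     // Six partitions spread load across the cluster; a replication
                     // factor of 3 keeps copies on three nodes for redundancy.
                     NewTopic orders = new NewTopic("ORDERS", 6, (short) 3);
                     admin.createTopics(Collections.singleton(orders)).all().get();
                 }
             }
         }
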
  6. 17 What is Kafka?
     • Kafka makes its data available to any consumer (security permitting)
     • Consumers:
       • Are independent from each other
       • Can be grouped and parallelised for performance and resilience
       • Can re-read messages as required (sketched below)
       • Can read messages as a stream or in batch
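
     A minimal sketch of an independent consumer that rewinds and re-reads a topic, assuming a recent Java client; the group name "audit-service" and topic ORDERS are hypothetical:

         import org.apache.kafka.clients.consumer.ConsumerRecords;
         import org.apache.kafka.clients.consumer.KafkaConsumer;
         import java.time.Duration;
         import java.util.Collections;
         import java.util.Properties;

         public class RereadExample {
             public static void main(String[] args) {
                 Properties props = new Properties();
                 props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
                 props.put("group.id", "audit-service");           // hypothetical consumer group
                 props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                 props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                 try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                     consumer.subscribe(Collections.singleton("ORDERS"));
                     consumer.poll(Duration.ofSeconds(1)); // join the group and get partitions assigned
                     // Kafka retains messages, so a consumer can rewind and re-read
                     // the topic from the beginning whenever it needs to.
                     consumer.seekToBeginning(consumer.assignment());
                     ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                     records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
                 }
             }
         }
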
  7. 18 Running Kafka
     • Apache Kafka is open source
     • Includes Kafka Core, stream processing, and data integration capabilities
     • Can be deployed standalone or as part of Confluent Platform
     • Also available in most Hadoop distributions, though often as older versions without the latest functionality
  8. 19 Confluent Platform: Enterprise Streaming based on Apache Kafka™
     (Diagram: sources such as database changes, log events, IoT data, and web events flow through the platform to destinations such as CRM, data warehouse, database, Hadoop, data integration, monitoring, analytics, custom apps, transformations, and real-time applications.)
     • Apache open source: Apache Kafka™ Core | Connect API | Streams API
     • Confluent open source adds: Schema Registry (data compatibility); Clients | Connectors | REST Proxy | KSQL | CLI (development and connectivity)
     • Confluent Enterprise adds: Confluent Control Center | Security (monitoring & administration); Replicator | Auto Data Balancing (operations)
  9. 23 Multiple Parallel Consumers of the Same Data
     (Diagram: one Producer writes to Kafka; Consumers A and B each read the same data independently.)
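
     One way this looks in code, as a sketch: two consumers with different group.ids (the names "reporting" and "fraud-check" are hypothetical) subscribe to the same topic, and each group receives every message, with its own independently tracked offsets:

         import org.apache.kafka.clients.consumer.KafkaConsumer;
         import java.util.Collections;
         import java.util.Properties;

         public class ParallelConsumersExample {
             // Each group.id tracks its own offsets, so both groups below receive
             // every message on the topic, at their own pace, without interfering.
             static KafkaConsumer<String, String> consumerFor(String groupId) {
                 Properties props = new Properties();
                 props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
                 props.put("group.id", groupId);
                 props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                 props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                 KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 consumer.subscribe(Collections.singleton("ORDERS")); // hypothetical topic
                 return consumer;
             }

             public static void main(String[] args) {
                 KafkaConsumer<String, String> reporting  = consumerFor("reporting");
                 KafkaConsumer<String, String> fraudCheck = consumerFor("fraud-check");
                 // Poll each consumer in its own thread; neither affects the other.
             }
         }
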
  10. 24 Multiple Sources of the Same Type of Data
      (Diagram: multiple Producers write the same type of data into Kafka; Consumers A and B read it without needing to know how many sources there are.)
  11. 29 Varying Latency Requirements / Batch vs Stream
      (Diagram: a Producer feeds Consumer A via a 24hr batch extract; Consumer B needs near-realtime / streamed data.)
  12. 30 Varying Latency Requirements / Batch vs Stream
      (Diagram: Consumer B is fed from Consumer A's 24hr batch extract; the two are now unnecessarily coupled together, and Consumer B still only gets a 24hr batch dump of data.)
  13. 31 Varying Latency Requirements / Batch vs Stream
      (Diagram: the Producer serves Consumer A a 24hr batch extract and Consumer B an event stream, hitting the source system twice.)
  14. 32 Varying Latency Requirements / Batch vs Stream
      (Diagram: both consumers take the event stream, but this requires reimplementation of Consumer A.)
  15. 33 Varying Latency Requirements / Batch vs Stream
      (Diagram: the Producer sends an event stream into Kafka once; Consumer A takes a batch pull and Consumer B an event stream from the same topic.)
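
      A sketch of the two consumption styles against the same topic, assuming the Java client (consumer construction as in the earlier sketches): a long-running streaming loop, and a scheduled batch job that drains whatever has accumulated and then exits:

          import org.apache.kafka.clients.consumer.ConsumerRecord;
          import org.apache.kafka.clients.consumer.ConsumerRecords;
          import org.apache.kafka.clients.consumer.KafkaConsumer;
          import java.time.Duration;

          public class BatchVsStreamExample {
              // Streaming consumer: runs forever, handling events as they arrive.
              static void stream(KafkaConsumer<String, String> consumer) {
                  while (true) {
                      for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                          process(r);
                      }
                  }
              }

              // Batch consumer: a scheduled job that drains whatever has accumulated
              // since its last run, then exits. Same topic, same data, different cadence.
              static void batch(KafkaConsumer<String, String> consumer) {
                  ConsumerRecords<String, String> records;
                  while (!(records = consumer.poll(Duration.ofSeconds(5))).isEmpty()) {
                      records.forEach(BatchVsStreamExample::process);
                  }
              }

              static void process(ConsumerRecord<String, String> r) {
                  // application logic goes here
              }
          }
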
  16. 40 KSQL: a Streaming SQL Engine for Apache Kafka™ from Confluent
      • Enables stream processing with zero coding required
      • The simplest way to process streams of data in real time
      • Powered by Kafka: scalable, distributed, battle-tested
      • All you need is Kafka; no complex deployments of bespoke systems for stream processing
  17. 41 KSQL: the Simplest Way to Do Stream Processing
      CREATE STREAM possible_fraud AS
        SELECT card_number, COUNT(*)
        FROM authorization_attempts
        WINDOW TUMBLING (SIZE 5 SECONDS)
        GROUP BY card_number
        HAVING COUNT(*) > 3;
  18. 44 Building for the Future
      • Enable flexibility & agility for:
        • Performance / scaling / resilience
        • Reducing latency / moving to stream processing
        • Increasing number of data sources
        • Connecting other (as yet unknown) consuming applications
        • Taking advantage of improved technologies (functionality, resilience, cost, scaling, performance)
  19. 46 Apache Kafka is the solid foundation upon which you build any successful data platform
  22. 53 Kafka for building Streaming Data Pipelines
      • Source data is an online ordering system running on Oracle
      • Requirements:
        • Realtime view of key customers logging onto the application
        • Realtime view of aggregated order counts and values
        • Long-term storage of data
        • Populate DW performance layer
  23. 54 Streaming ETL with Apache Kafka and Confluent Platform
      (Diagram: Oracle → Oracle GoldenGate for Big Data (Kafka Connect handler) → Kafka → KSQL → Kafka Connect → Elasticsearch, Oracle, and Hadoop, with the Schema Registry alongside.)
  24. 55 (Diagram: Oracle → Oracle GoldenGate for Big Data (Kafka Connect handler) → Kafka topics LOGON-JSON and CUSTOMERS-JSON → KSQL streams and tables → Elasticsearch via Kafka Connect.)
      CREATE STREAM LOGON (LOGON_ID INT, …)
        WITH (kafka_topic='LOGON-JSON', value_format='JSON');
      CREATE TABLE CUSTOMERS (CUSTOMER_ID INT…)
        WITH (kafka_topic='CUSTOMERS-JSON', value_format='JSON');
      CREATE STREAM LOGON_ENRICHED AS
        SELECT L.LOGON_ID, C.CUSTOMER_ID…
        FROM LOGON L
          LEFT OUTER JOIN CUSTOMERS C
          ON L.CUSTOMER_ID = C.CUSTOMER_ID;
  25. 56 Driving Realtime Analytics with Apache Kafka
      Events in the source system (Oracle) are streamed in realtime through Kafka, enriched via KSQL, and streamed out through Kafka Connect into Elasticsearch.
  26. 57 (Diagram: Oracle → Oracle GoldenGate for Big Data (Kafka Connect handler) → ORDERS topic → KSQL ORDERS stream and order_mode_by_hour table → order_mode_by_hour topic → Elasticsearch via Kafka Connect.)
      CREATE STREAM orders (ORDER_DATE STRING …)
        WITH (kafka_topic='ORDERS', value_format='JSON');
      CREATE TABLE order_mode_by_hour AS
        SELECT order_mode, COUNT(*) AS order_count
        FROM orders
        WINDOW TUMBLING (SIZE 1 HOUR)
        GROUP BY order_mode;
  27. 58 Streaming ETL with Apache Kafka and Confluent Platform
      (Diagram: a variant of the pipeline with Amazon S3 as a destination, using Kafka Connect on both sides, KSQL, Kafka Streams, and the Schema Registry.)
  30. 62 Apache Kafka's Role in Implementing Oracle's Big Data Reference Architecture
      SUN6259, Oracle OpenWorld, 1 Oct 2017
      Robin Moffatt, Partner Technology Evangelist, EMEA
      t: @rmoff e: [email protected]
      https://www.confluent.io/download/
      https://speakerdeck.com/rmoff/