Slide 1

Slide 1 text

Apache Hop (Incubating) User Group Japan #1: Apache Kafka® Data under the Abstraction Layer (How Apache Kafka's Data Characteristics Relate to Stream Processing). Shinichi Hashitani, Solutions Engineer, Nov 16, 2021.

Slide 2

Slide 2 text

Hop and Abstraction Layers
Kafka and Data
Stream-Table Duality

Slide 3

Slide 3 text

Managing Data Flow with Apache Hop
[Diagram: sources and sinks (database, Kafka, MQTT clients, analytics dashboard, Splunk, S3, neo4j) connected through the Hop Runtime, which runs via Apache Beam® or on plain "servers 'n stuff".]

Slide 4

Slide 4 text

You can configure, run, and test data flows on Hop based on metadata. Everything else is abstracted away.

Slide 5

Slide 5 text

Abstraction is a beautiful thing.

Slide 6

Slide 6 text

But sometimes it obscures some important details.

Slide 7

Slide 7 text

Under the Abstraction Layer and Apache Kafka
[Diagram: Hop Runtime → Apache Beam → Google Dataflow, Apache Spark®, or Apache Flink® → YARN, Kubernetes, or other infrastructure.] Each of these runtimes can talk to Kafka independently, and Kafka itself can run anywhere: bare metal, VMs, and so on.

Slide 8

Slide 8 text

Who talks to Kafka, and where Kafka is hosted, can affect how flows on Hop communicate with Kafka.

Slide 9

Slide 9 text

Another subtle, but important detail.

Slide 10

Slide 10 text

What Is Behind Kafka?
[Diagram: database, MQTT clients, analytics dashboard, Splunk, S3, neo4j.] Each of these can talk to Kafka independently, so Kafka can act as an abstraction layer between Hop and its sources/sinks.

Slide 11

Slide 11 text

Kafka Connect: A Native Way to Connect
● Fault tolerant
● Exactly-once semantics
● Connect via configuration
● Preserves data schemas
Kafka Connect defines an API for implementing connectors and handles all the plumbing needed to keep data consistent. Connectors exist for hundreds of data stores (MySQL, Postgres, Oracle, DB2, Cassandra, neo4j, BigQuery, S3, Salesforce, Splunk, etc.).
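As a hedged sketch of what "connect via configuration" means in practice: a connector is registered by POSTing a JSON configuration to the Connect REST API, and no producer or consumer code is written. The connector name, database coordinates, and Connect host below are assumptions of mine; the config keys are those of the Confluent JDBC source connector.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC source: turns new rows of an "orders" database
        // into events on topics prefixed with "orderdb-".
        String config = """
            {
              "name": "orderdb-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:postgresql://orderdb:5432/orders",
                "mode": "incrementing",
                "incrementing.column.name": "id",
                "topic.prefix": "orderdb-"
              }
            }""";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect:8083/connectors")) // assumed Connect worker host
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}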

Slide 12

Slide 12 text

Kafka Connect lets you extract events out of data stores while keeping the data consistent.

Slide 13

Slide 13 text

How does Kafka handle data?

Slide 14

Slide 14 text

Hop and Abstraction Layers
Kafka and Data
Stream-Table Duality

Slide 15

Slide 15 text

Data Abstraction

Slide 16

Slide 16 text

Relational Database - Tables
Data Lake (Hadoop) - Files
Apache Kafka - Streams (Logs)

Slide 17

Slide 17 text

Stream ● Continuous ● Logical ● Infinite
Log ● Continuous ● Physical ● Finite

Slide 18

Slide 18 text

Event Streams Event-Driven Architecture Event Sourcing Event Hub ...

Slide 19

Slide 19 text

Events… Streams… Logs...

Slide 20

Slide 20 text

Events and Streams

Slide 21

Slide 21 text

final Properties settings = new Properties();
settings.put(ProducerConfig.CLIENT_ID_CONFIG, driverId);
settings.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
settings.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
settings.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
settings.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://schema-registry:8081");

final KafkaProducer<String, Object> producer = new KafkaProducer<>(settings);
...
// To where? ("store-order")  What? (key, value)
final ProducerRecord<String, Object> record = new ProducerRecord<>("store-order", key, value);
producer.send(record);

Slide 22

Slide 22 text

final ProducerRecord<String, Object> record = new ProducerRecord<>("store-order", key, value);
"store-order" is a topic: the only indicator that explains the event ("event for what?") when sending it to Kafka. The key and value are the event itself. Kafka consolidates and manages events per topic.

Slide 23

Slide 23 text

Stream = Series of Events for a Particular Topic
A mixed flow of events (customer logins, order updates, package scans, database changes) is consolidated per topic:
● store-order: order confirmed #001, order updated #002, order canceled #003
● store-customer: customer login abc, customer login efg
● logistic: package received #a01, at dist center #b02, left dist center #a02, delivered #a01
● orderdb-c: customer C 0001, customer U 0002
● orderdb-o: order U 0003
● orderdb-p: payment C 0003, payment U 0002
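To make "series of events" concrete, here is a minimal consumer sketch that reads the store-order topic back as a stream, in offset order. The broker address and group id are assumptions, and values are read as plain strings for simplicity (the producer slide above uses Avro).

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class StoreOrderReader {
    public static void main(String[] args) {
        final Properties settings = new Properties();
        settings.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        settings.put(ConsumerConfig.GROUP_ID_CONFIG, "store-order-reader");
        settings.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // start from the oldest event
        settings.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        settings.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(settings)) {
            consumer.subscribe(List.of("store-order")); // only this topic's events arrive
            while (true) {
                // Each poll returns the next slice of the (potentially infinite) stream.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("offset=%d key=%s value=%s%n",
                        r.offset(), r.key(), r.value()));
            }
        }
    }
}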

Slide 24

Slide 24 text

Stream and Log
[Diagram: a stream of events (customer logins, order updates) stored as an append-only, immutable log, with new events appended at the end (old → new).]

Slide 25

Slide 25 text

- Events of the same topic are aggregated and stored.
- Events are immutable.
- New events are appended at the end.
∴ [Topic(Event)] = Stream ≒ Log

Slide 26

Slide 26 text

[Topic(Event)] = Stream ≒ Log
More precisely:
[Topic(Event)] = Stream
Partition([Topic(Event)]) = Log

Slide 27

Slide 27 text

Stream and Partition
("store-order", key, value) → hash(key) mod (partition count) = partition #
[Diagram: events of one stream distributed across Partition 0, Partition 1, and Partition 2 according to the key hash.]
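A minimal sketch of the selection rule above. This is illustrative only: Kafka's real DefaultPartitioner hashes the serialized key with murmur2, not with Java's Arrays.hashCode, so the actual partition numbers differ.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartitionSketch {
    // hash(key) mod (partition count) = partition #
    static int partitionFor(String key, int partitionCount) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // floorMod keeps the result non-negative even for negative hashes
        return Math.floorMod(Arrays.hashCode(keyBytes), partitionCount);
    }

    public static void main(String[] args) {
        int partitions = 3;
        for (String key : new String[]{"order-001", "order-002", "order-003"}) {
            // The same key always lands on the same partition, which is
            // why per-key ordering holds (see the next slide).
            System.out.printf("%s -> partition %d%n", key, partitionFor(key, partitions));
        }
    }
}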

Slide 28

Slide 28 text

Partition, Event Order, and Parallelism
● The order of events is guaranteed only within a given partition.
● The number of partitions = the maximum number of consuming threads (concurrency).
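As a hedged sketch of that concurrency limit: consumers sharing a group.id split the partitions among themselves, so for a 3-partition topic at most three members do useful work, and a fourth would sit idle. Broker address, group id, and the thread handling are assumptions of mine.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class StoreOrderConsumerGroup {
    public static void main(String[] args) {
        // One KafkaConsumer per thread: the consumer client is not thread-safe.
        for (int i = 0; i < 3; i++) {
            final String clientId = "store-order-worker-" + i;
            new Thread(() -> run(clientId)).start();
        }
    }

    static void run(String clientId) {
        Properties settings = new Properties();
        settings.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        settings.put(ConsumerConfig.CLIENT_ID_CONFIG, clientId);
        // Same group.id => Kafka assigns each partition to exactly one member.
        settings.put(ConsumerConfig.GROUP_ID_CONFIG, "store-order-app");
        settings.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        settings.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(settings)) {
            consumer.subscribe(List.of("store-order"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // Within each partition, records arrive in offset order.
                records.forEach(r -> System.out.printf("%s: partition=%d offset=%d key=%s%n",
                        clientId, r.partition(), r.offset(), r.key()));
            }
        }
    }
}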

Slide 29

Slide 29 text

Next: Data and Streams

Slide 30

Slide 30 text

Hop and Abstraction Layers
Kafka and Data
Stream-Table Duality

Slide 31

Slide 31 text

Chess Game Play and Chessboard
A series of piece movements and the state of the board are different representations of the same data.
• The chessboard represents the complete state of the game at a given point in time.
• Replaying each piece movement, one by one, in the exact order, reproduces the state on demand.
("Streams and Tables in Apache Kafka: A Primer", Michael Noll, Confluent Blog.)

Slide 32

Slide 32 text

Relational Database Internals
All processing is performed in memory: the records required for a computation are loaded into a buffer, and updates are synced back to the physical device periodically (fsync). Every operation is also recorded in a log. If a failure occurs during processing, the data can be recovered by replaying all operations not yet synced back to storage, one by one, in the exact order.

Slide 33

Slide 33 text

Database Replication
• The write-ahead log is passed from the primary to the secondary, one entry at a time, in the exact order.
• The secondary processes the log one entry at a time, in the exact order.
("Postgres Replication and Automatic Failover Tutorial", Abbas Butt, EDB.)

Slide 34

Slide 34 text

Change Data Capture
Once the data updates are turned into a stream, you can:
• extract just a part of it,
• transform it into a different schema,
• send it to a different data store,
• or archive it to cheap object storage.
By transferring the updates one by one, in the exact same order, the data can be sent to different places for different use cases while staying intact.

Slide 35

Slide 35 text

Stream-Table Duality

Slide 36

Slide 36 text

Stream-Table Duality
Stream (facts): (alice, Berlin), (bob, Lima), (alice, Rome), (alice, Paris), (bob, Sydney)
Table (dims): replaying the stream and keeping the latest value per key changes the stream into a table and keeps that table up to date; here it ends as alice → Paris, bob → Sydney.
Stream and table are different representations of the same data. We call this "Stream-Table Duality".
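A minimal sketch of the duality using the same (user, city) events from the slide: replaying the stream in order and upserting by key materializes the table. The class and variable names are mine, not from the deck.

import java.util.LinkedHashMap;
import java.util.Map;

public class StreamTableDuality {
    record Event(String user, String city) {}

    public static void main(String[] args) {
        // The stream of facts, in arrival order.
        Event[] stream = {
                new Event("alice", "Berlin"),
                new Event("bob", "Lima"),
                new Event("alice", "Rome"),
                new Event("alice", "Paris"),
                new Event("bob", "Sydney"),
        };

        // Replaying the stream and upserting per key yields the table:
        // newer events overwrite older state.
        Map<String, String> table = new LinkedHashMap<>();
        for (Event e : stream) {
            table.put(e.user(), e.city());
        }

        System.out.println(table); // {alice=Paris, bob=Sydney}
    }
}

This replay-and-upsert step is essentially what a Kafka Streams KTable does when it consumes a topic as a changelog.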

Slide 37

Slide 37 text

Summary

Slide 38

Slide 38 text

● Apache Kafka can be a layer of abstraction
● Kafka can connect to a wide range of sources/sinks via Kafka Connect
● Kafka can provide you with data in a consistent manner

Slide 39

Slide 39 text

Apache Kafka is Your Friend♥

Slide 40

Slide 40 text

with Bonus Content ♥

Slide 41

Slide 41 text

Hello! My Name Is: Confluent
Established: 2014. CEO: Jay Kreps. Opened Japan office in 2021.
Engineers @ Japan office:
● Ayumu Aizawa, Solutions Engineer
● Shinichi Hashitani, Solutions Engineer
● Keigo Suda, Solutions Architect
● Wei Ding, Customer Success Technical Architect

Slide 42

Slide 42 text

Free Stuff!
● I ♥ Logs by Jay Kreps: eBook (PDF)
● Mastering Kafka Streams and ksqlDB by Mitch Seymour: eBook (PDF)
● 200 USD credit for 3 months: http://cnfl.io/mu-try-cloud