Apache Kafka Data under Abstraction Layer - Kafka Data Characteristics and Stream processing

hashi
November 16, 2021

Apache Hop Incubating - User Group Japan #1 presentation slides

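From Apache Hop's point of view, Apache Kafka sits in the data layer, and thanks to the abstraction provided by orchestration layers such as Apache Beam, you rarely need to be aware of what is actually underneath. However, the streams Apache Kafka handles are not just collections of ordinary pub/sub messages; they are streams that maintain strong data consistency. This session explains how data and streams are treated in the Apache Kafka worldview, and covers Exactly Once Semantics, which matters when you process that data indirectly.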

Transcript

  1. Apache Hop Incubating - User Group Japan #1

    Apache Kafka® Data under Abstraction Layer - The Relationship between Apache Kafka's Data Characteristics and Stream Processing
    Shinichi Hashitani, Solutions Engineer, Nov-16, 2021.
  2. Managing Data Flow with Apache Hop

    [Diagram labels: Database, Kafka, MQTT, Clients, Analytics Dashboard, Splunk, S3, neo4j, Hop Runtime, Apache Beam®, OR, Servers 'N Stuff]
    Copyright 2021, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc.
  3. On Hop, you can configure, run, and test data flows based on metadata. Everything else is abstracted away.
  4. Under the Abstraction Layer and Apache Kafka

    [Diagram labels: Hop Runtime, Apache Beam, Google Dataflow, Apache Spark®, Apache Flink®, YARN or Kubernetes or something else]
    All of these can talk to Kafka independently.
    [Diagram labels: Bare Metal, VM]
    Kafka runs anywhere.
  5. Who talks to Kafka, and where Kafka is hosted, can affect how flows on Hop communicate with Kafka.
  6. What is behind Kafka?

    [Diagram labels: Database, MQTT, Clients, Analytics Dashboard, Splunk, S3, neo4j]
    All of these can talk to Kafka independently.
    Kafka can be an abstraction layer between Hop and sources/sinks.
  7. Kafka Connect - A Native Way to Connect

    • Fault tolerant
    • Exactly Once Semantics
    • Connect with configuration
    • Preserve data schema
    Kafka Connect defines an API for connector implementations and handles all the plumbing needed to keep data consistent. It supports hundreds of data stores (MySQL, Postgres, Oracle, DB2, Cassandra, neo4j, BigQuery, S3, Salesforce, Splunk, etc.)
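  8. Producer code example

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import io.confluent.kafka.serializers.KafkaAvroSerializer;
    import io.confluent.kafka.serializers.KafkaAvroSerializerConfig;

    // Client id, broker address, key/value serializers, and Schema Registry URL
    final Properties settings = new Properties();
    settings.put(ProducerConfig.CLIENT_ID_CONFIG, driverId);
    settings.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
    settings.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    settings.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
    settings.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://schema-registry:8081");
    final KafkaProducer<String, PositionValue> producer = new KafkaProducer<>(settings);
    ...
    // The record names its topic ("store-order") and carries the event as key and value
    final ProducerRecord<String, PositionValue> record = new ProducerRecord<>("store-order", key, value);
    producer.send(record);

    Callouts: "To Where?", "What?"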
  9. final ProducerRecord<String, PositionValue> record = new ProducerRecord<>("store-order", key, value);

    Callouts: "Event for what?", "Event itself"
    "store-order" is a Topic: the only indicator that explains the event when it is sent to Kafka.
    Kafka consolidates and manages events per topic.
  10. Stream = Series of Events for a Particular Topic

    Incoming events:
    customer login: abc, order confirmed: #001, order updated: #002, customer login: efg, order canceled: #003, package received: #a01, at dist center: #b02, left dist center: #a02, delivered: #a01, customer C: 0001, order U: 0003, payment U: 0002, payment C: 0003, customer U: 0002

    Events per topic:
    store-order: order confirmed: #001, order updated: #002, order canceled: #003
    store-customer: customer login: abc, customer login: efg
    logistic: package received: #a01, left dist center: #a02, delivered: #a01, at dist center: #b02
    orderdb-c: customer C: 0001, customer U: 0002
    orderdb-o: order U: 0003
    orderdb-p: payment C: 0003, payment U: 0002
  11. Stream and Log

    customer login: abc, order confirmed: #001, order updated: #002, customer login: efg, order canceled: #003
    Append-Only, Immutable
    [Diagram: numbered events laid out from Old to New]
  12. - Events with the same topic are aggregated and stored.
      - Events are immutable.
      - New events are appended at the end.

    ∴ [Topic(Event)] = Stream ≒ Log
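The three properties above can be sketched as a tiny in-memory log. This is only an illustration: `AppendOnlyLog`, `append`, `read`, and `snapshot` are hypothetical names rather than Kafka APIs, offsets are assumed to start at 0, and persistence, replication, and retention are ignored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** A toy append-only log: events are only added at the end and never changed. */
final class AppendOnlyLog {
    private final List<String> events = new ArrayList<>();

    /** Appends an event at the end and returns its offset (position in the log). */
    long append(String event) {
        events.add(event);
        return events.size() - 1;
    }

    /** Reads the event stored at the given offset. */
    String read(long offset) {
        return events.get((int) offset);
    }

    /** Read-only view, oldest event first; callers cannot modify it. */
    List<String> snapshot() {
        return Collections.unmodifiableList(events);
    }

    public static void main(String[] args) {
        AppendOnlyLog storeOrder = new AppendOnlyLog();
        storeOrder.append("order confirmed: #001");
        storeOrder.append("order updated: #002");
        long offset = storeOrder.append("order canceled: #003");
        System.out.println(offset + " -> " + storeOrder.read(offset));
    }
}
```

There is deliberately no update or delete method: a correction is expressed as a new event appended after the old one.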
  13. Stream and Partition

    ("store-order", key, value)
    hash(key) mod (Partition Count) = Partition #
    [Diagram: events 1 to 23 from the stream distributed across Partition 0, Partition 1, and Partition 2]
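A minimal sketch of the mapping `hash(key) mod (Partition Count)`. `KeyPartitioner` and `partitionFor` are hypothetical names, and the hash is a simple stand-in: Kafka's default partitioner hashes the serialized key bytes with murmur2, so the partition numbers produced here will not match what a real producer picks.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy key-to-partition mapping (assumption: a plain array hash instead of murmur2). */
final class KeyPartitioner {

    /** Maps a key to a partition number in the range [0, partitionCount). */
    static int partitionFor(String key, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(hash, partitionCount);
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition.
        for (String key : new String[] {"order-001", "order-002", "order-001"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
    }
}
```

Because the mapping depends on the partition count, changing the number of partitions changes where existing keys land.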
  14. Partition, Event Order, and Parallelism

    The order of events is guaranteed only within a given partition.
    Number of partitions = maximum number of threads (concurrency)
    [Diagram: numbered events grouped by partition]
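One way to see why the partition count caps concurrency: if each partition is read by exactly one worker so that its order is preserved, any worker beyond the partition count has nothing to read. The round-robin split below is a hypothetical illustration (`PartitionAssignment` and `assign` are made-up names), not the assignment strategy Kafka itself uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy assignment of partitions to workers; each partition has exactly one owner. */
final class PartitionAssignment {

    /** Spreads partitions 0..partitionCount-1 over the workers round-robin. */
    static List<List<Integer>> assign(int partitionCount, int workerCount) {
        if (workerCount <= 0) {
            throw new IllegalArgumentException("workerCount must be positive");
        }
        List<List<Integer>> owned = new ArrayList<>();
        for (int w = 0; w < workerCount; w++) {
            owned.add(new ArrayList<>());
        }
        for (int p = 0; p < partitionCount; p++) {
            owned.get(p % workerCount).add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        System.out.println(assign(3, 2)); // prints [[0, 2], [1]]
        System.out.println(assign(3, 4)); // prints [[0], [1], [2], []]
    }
}
```

With three partitions and four workers, the fourth worker's list stays empty, which is the "maximum number of threads" limit in practice.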
  15. Chess Game Play and Chessboard

    A series of piece movements and the state of the board are different representations of the same data.
    • The chessboard represents the complete state of the game at a given point in time.
    • Following each piece movement, one by one, in the exact order, allows the state to be reproduced on demand.
    Source: "Streams and Tables in Apache Kafka: A Primer", Michael Noll, Confluent Blog.
  16. Relational Database Internals

    All processing is performed in memory.
    All records required for a computation are loaded into memory.
    Updates are synced back to the physical device at certain points in time.
    Each operation is recorded in a log.
    When a failure occurs during processing, the data can be recovered by replaying every operation not yet synced back to storage, one by one, in the exact order.
    [Diagram labels: fsync, buffer, load]
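The recovery idea can be sketched as "last synced state plus replay of the log entries written after it". This is a deliberately simplified model, not how any particular database implements it: `RedoRecovery`, `LogEntry`, and the log sequence number field `lsn` are hypothetical names, and every operation is assumed to be a plain "set key to value".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy redo recovery: replay logged operations that storage has not seen yet. */
final class RedoRecovery {

    /** One logged operation: set key to value, tagged with a log sequence number. */
    static final class LogEntry {
        final long lsn;
        final String key;
        final String value;

        LogEntry(long lsn, String key, String value) {
            this.lsn = lsn;
            this.key = key;
            this.value = value;
        }
    }

    /** Rebuilds the state from the last synced state and a log ordered oldest first. */
    static Map<String, String> recover(Map<String, String> synced, long syncedLsn, List<LogEntry> log) {
        Map<String, String> state = new HashMap<>(synced);
        for (LogEntry entry : log) {
            if (entry.lsn > syncedLsn) {
                // Redo only the operations written after the last sync, in log order.
                state.put(entry.key, entry.value);
            }
        }
        return state;
    }

    public static void main(String[] args) {
        Map<String, String> synced = new HashMap<>();
        synced.put("order#001", "confirmed");
        List<LogEntry> log = List.of(
            new LogEntry(1, "order#001", "confirmed"),
            new LogEntry(2, "order#001", "updated"),
            new LogEntry(3, "order#002", "confirmed"));
        System.out.println(recover(synced, 1, log));
    }
}
```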
  17. Database Replication

    • The Write-Ahead Log is passed from Primary to Secondary, one by one, in the exact order.
    • The Secondary processes the log one by one, in the exact order.
    Source: "Postgres Replication and Automatic Failover Tutorial", Abbas Butt, EDB.
  18. Change Data Capture

    Once data updates are turned into streams, you can:
    • Extract a part of the data
    • Send it into a different schema
    • Send it to a different storage
    • Send it to cheap object storage
    By transferring updates one by one, in exactly the same order, the data can be sent to different places for different use cases while keeping it intact.
    [Diagram label: The DB]
  19. Stream-Table Duality
  20. Stream-Table Duality

    [Diagram: a Stream (facts) of name and city events for alice (Berlin, Rome, Paris) and bob (Lima, Sydney), next to the Table (dims) built from it; labels: "Change stream to a table", "Keeping the table up to date"]
    A stream and a table are different representations of the same data. We call this "Stream-Table Duality".
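Turning a stream into a table can be sketched as a fold that keeps the latest value per key. `StreamTableDuality` and `toTable` are hypothetical names, and the order of the alice/bob events below is an assumption, since the slide's exact sequence is not recoverable from the transcript.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy stream-to-table conversion: the table holds the latest value seen per key. */
final class StreamTableDuality {

    /** Folds (key, value) events, oldest first, into a table. */
    static Map<String, String> toTable(List<String[]> stream) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] event : stream) {
            // A later event for the same key overwrites the earlier one.
            table.put(event[0], event[1]);
        }
        return table;
    }

    public static void main(String[] args) {
        // Assumed event order, for illustration only.
        List<String[]> stream = List.of(
            new String[] {"alice", "Berlin"},
            new String[] {"bob", "Lima"},
            new String[] {"alice", "Rome"},
            new String[] {"alice", "Paris"},
            new String[] {"bob", "Sydney"});
        System.out.println(toTable(stream)); // prints {alice=Paris, bob=Sydney}
    }
}
```

Keeping the table up to date is the same loop run incrementally: apply each new event to the existing map as it arrives.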
  21. • Apache Kafka can be a layer of abstraction
      • Kafka can connect to many sources/sinks via Kafka Connect
      • Kafka can provide data in a consistent manner
  22. Hello! My Name is:

    Established: 2014
    CEO: Jay Kreps
    Opened Japan Office in 2021
    Engineers@Japan Office:
    • Ayumu Aizawa, Solutions Engineer
    • Shinichi Hashitani, Solutions Engineer
    • Keigo Suda, Solutions Architect
    • Wei Ding, Customer Success Technical Architect
  23. Free Stuff!

    • I ♥ Logs by Jay Kreps, eBook (pdf)
    • Mastering Kafka Streams and ksqlDB by Mitch Seymour, eBook (pdf)
    • 200 USD Credit for 3 months! http://cnfl.io/mu-try-cloud