Slide 1

Introduction to Apache Kafka & KSQL
Tugdual Grall, Product Management, Red Hat
@tgrall

Slide 2

About me

Tugdual “Tug” Grall
- Red Hat: Product Management, Developer Tools / Developer Experience
- MapR: Tech Evangelist / Product Manager
- MongoDB: Technical Evangelist
- Couchbase: Technical Evangelist
- eXo Platform: CTO
- Oracle: PM / Developer / Consultant
- NantesJUG co-founder, 2008
- @tgrall | http://github.com/tgrall | [email protected] | [email protected]
- Pet project: https://promoglisse-speed-challenge.com

Slide 3

Red Hat Developer Tools Portfolio

Covering OpenShift developer tools, middleware and RHEL developer tools, and public SaaS tools:
- OpenShift-DO CLI
- Container Development Kit
- Eclipse Che and Eclipse Che plugins
- DevStudio plugins
- VS Code plugins
- RHEL tools
- OpenShift.io (ToolChain)
- Hosted Eclipse Che

https://developers.redhat.com/

Slide 4

What is Kafka?

“… a publish/subscribe messaging system …”

Slide 5

What is Kafka?

“… a streaming data platform …”

Slide 6

What is Kafka?

“… a distributed, horizontally scalable, fault-tolerant commit log …”

Slide 7

What is Kafka?
- Developed at LinkedIn in 2010, open sourced in 2011
- Designed to be fast, scalable, durable and available
- Distributed by nature
- Data partitioning
- High throughput / low latency
- Able to handle a huge number of consumers

Slide 8

Apache Kafka & Kubernetes: the Strimzi project
- Strimzi provides a way to run an Apache Kafka cluster on OpenShift and Kubernetes in various deployment configurations
- http://strimzi.io/
- Red Hat supported version: Red Hat AMQ Streams
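As a rough sketch of how Strimzi works in practice: the cluster is described declaratively as a Kubernetes custom resource, and the Strimzi operator turns it into running broker and Zookeeper pods. The resource below is illustrative only; the API version and field names should be checked against the Strimzi release in use.

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3              # three Kafka brokers
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: ephemeral        # demo only; use persistent storage in production
  zookeeper:
    replicas: 3              # Zookeeper ensemble managed alongside the brokers
    storage:
      type: ephemeral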

Slide 9

Ecosystem

Slide 10

Apache Kafka ecosystem
- Apache Kafka has an ecosystem consisting of many components / tools:
  - Kafka Core
    - Broker
    - Client libraries (Producer, Consumer, Admin)
    - Management tools
  - Kafka Connect
  - Kafka Streams
  - Mirror Maker

Slide 11

Apache Kafka ecosystem: components
- Kafka Broker
  - Central component, responsible for hosting topics and delivering messages
  - One or more brokers run in a cluster alongside a Zookeeper ensemble
- Kafka Producers and Consumers
  - Java-based clients for sending and receiving messages
- Kafka Admin tools
  - Java- and Scala-based tools for managing Kafka brokers
  - Managing topics, ACLs, monitoring, etc.

Slide 12

Kafka & Zookeeper
(diagram: applications and admin tools connect to the Kafka brokers, which coordinate through a Zookeeper ensemble)

Slide 13

Apache Kafka ecosystem: components
- Kafka Connect
  - Framework for transferring data between Kafka and other data systems
  - Facilitates data conversion, scaling, load balancing, fault tolerance, …
  - Connector plugins are deployed into a Kafka Connect cluster
    - Well-defined API for creating new connectors (source and sink)
    - Apache Kafka itself includes only the FileSource and FileSink plugins (reading records from a file and posting them as Kafka messages / writing Kafka messages to a file)
    - Many additional plugins are available outside of Apache Kafka (a configuration sketch follows)
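As an illustration of a connector configuration, the built-in file source can run on a standalone Connect worker with a properties file along these lines (connector name, file and topic are invented for the example):

# file-source.properties (illustrative)
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
# read lines from this file...
file=/tmp/input.txt
# ...and publish each line as a record on this topic
topic=connect-test

It would then be started with something like bin/connect-standalone.sh worker.properties file-source.properties.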

Slide 14

Apache Kafka ecosystem: components
(diagram: source systems feed Kafka through the Connect API, the Streams API processes data between topics, and the Connect API delivers data to sink systems)

Slide 15

Apache Kafka ecosystem: components
- Mirror Maker
  - Kafka clusters do not work well when split across multiple datacenters (low bandwidth, high latency)
  - With multiple datacenters it is recommended to set up an independent cluster in each datacenter and mirror the data
  - Tool for replicating topics between different clusters (see the invocation sketch below)
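For a concrete feel, the legacy MirrorMaker tool shipped with Kafka is essentially a consumer on the source cluster chained to a producer on the target cluster. A hedged invocation, where the two properties files (invented names) carry the bootstrap.servers of the source and target clusters respectively:

bin/kafka-mirror-maker.sh \
  --consumer.config source-cluster.properties \
  --producer.config target-cluster.properties \
  --whitelist "my-topic.*"

Recent Kafka versions also ship MirrorMaker 2, which is built on Kafka Connect.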

Slide 16

Across data centers
(diagram: Data Center 1 and Data Center 2, in different geo locations, each run an independent cluster of brokers 1…N hosting partitions T1-P1 and T1-P2, with local producers; MirrorMaker instances replicate the topics between the two clusters)

Slide 17

Kafka ecosystem: outside of the Apache Kafka project
- Clients for other languages
- REST Proxy for bridging between HTTP and Kafka
- Schema Registry
- Cluster balancers
- Management and monitoring consoles
- Kafka Connect plugins
- KSQL (Confluent)
- Kafka can be used with many other projects (e.g. Apache Spark, Apache Flink, Apache Storm)

Slide 18

Topic & Partitions

Slide 19

Topic & partitions
- Messages / records are sent to and received from topics
  - Topics are split into one or more partitions
  - Partition = shard
  - All actual work is done at the partition level; the topic is just a virtual object
- Each message is written to exactly one selected partition
  - Partitioning is usually done based on the message key
  - Message ordering within a partition is fixed
- Clean-up policies (see the topic-creation sketch below)
  - Based on size / message age
  - Compacted, based on message key
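As a hedged example, the partition count, replication factor and clean-up policy are fixed when a topic is created with the kafka-topics.sh tool that ships with Kafka (exact flags vary between versions; recent versions take --bootstrap-server):

# create a topic with 3 partitions, each replicated to 2 brokers;
# compact by message key instead of deleting by size/age
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic my-topic \
  --partitions 3 \
  --replication-factor 2 \
  --config cleanup.policy=compact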

Slide 20

Topic & partitions: producing messages
(diagram: a producer appends new messages at the head of each partition; offsets 0…11 in partition 0, 0…6 in partition 1, 0…10 in partition 2, ordered from old to new)

Slide 21

Topic & partitions: consuming messages
(diagram: a consumer reads each partition from older to newer offsets, tracking its own position per partition)

Slide 22

Replication: leaders & followers
- Replicas are “backups” for a partition
  - They provide redundancy
- Replication is the way Kafka guarantees availability and durability in case of node failures
- Two roles:
  - Leader: the replica used by producers/consumers for exchanging messages
  - Followers: all the other replicas
    - They don’t serve client requests
    - They replicate messages from the leader to stay “in sync” (ISR = in-sync replica)
  - A replica changes its role as brokers come and go
(the related settings are sketched below)
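To tie this to configuration, a hedged sketch of the settings that link replication to durability (values are illustrative):

# broker side (server.properties)
default.replication.factor=3   # new topics get 3 replicas per partition
min.insync.replicas=2          # with acks=all, at least 2 replicas must confirm a write

# producer side
acks=all                       # wait for all in-sync replicas before acknowledging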

Slide 23

Leaders & followers: partition distribution
(diagram: brokers 1, 2 and 3 each host a mix of leader and follower replicas of partitions T1-P1, T1-P2, T2-P1 and T2-P2)
- Leaders and followers are spread across the cluster
  - Producers/consumers connect to leaders
  - Multiple connections are needed for reading different partitions

Slide 24

Leaders & followers: partition distribution
(diagram: the same cluster after a failure)
- A broker holding a leader partition goes down
- A new leader partition is elected on a different node

Slide 25

Clients
- Kafka clients are really “smart” (unlike “traditional” messaging clients)
- Configured with a “bootstrap servers” list used to fetch the initial metadata
  - Where are the topics of interest? Connect to the brokers that hold the partition leaders
  - The producer specifies the destination partition
  - The consumer handles the message offsets to read
  - If an error happens, refresh the metadata (something changed in the cluster)
- Batching on the producing and consuming sides

Slide 26

Producers
- The destination partition is computed on the client
  - Round robin
  - By hashing the “key” of the message
  - Custom partitioning
- Writes messages to the “leader” of a partition
- Acknowledgement levels (see the producer sketch below):
  - No ack
  - Ack when the message is written to the “leader”
  - Ack when the message is also replicated to the “in-sync” replicas
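To make this concrete, a minimal Java producer sketch using the standard Kafka client (topic, key and value are invented for the example):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");  // ack only once the in-sync replicas have the record

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // records with the same key hash to the same partition,
            // which preserves per-key ordering
            producer.send(new ProducerRecord<>("my-topic", "user-42", "hello"));
        }
    }
}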

Slide 27

Consumers
- Read from one (or more) partition(s)
- Track (commit) the offset for each partition (see the consumer sketch below)
  - A partitioned internal topic, “__consumer_offsets”, is used for that
  - Key → [group, topic, partition], Value → [offset]
  - The offset is shared inside the consumer group
- QoS
  - At most once: read message, commit offset, process message
  - At least once: read message, process message, commit offset
  - Exactly once: read message, commit message output and offset to a transactional system
- Gets only “committed” messages (depends on the producer “ack” level)
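A minimal at-least-once Java consumer sketch (process first, commit the offset afterwards); topic and group names are invented:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group");         // consumer group used for offset tracking
        props.put("enable.auto.commit", "false");  // commit manually, after processing
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                                      record.partition(), record.offset(), record.value());
                }
                consumer.commitSync();  // process first, then commit = at-least-once
            }
        }
    }
}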

Slide 28

Producers & consumers: writing to / reading from leaders
(diagram: producers and consumers connect to whichever of brokers 1, 2 and 3 hosts the leader replica of the partition they write to or read from)

Slide 29

Consumer: partition assignment, available approaches
- The consumer asks for specific partitions (assign)
  - An application using one or more consumers has to handle the assignment on its own, and the scaling as well
- The consumer is part of a “consumer group” (subscribe)
  - Consumer groups are an easier way to scale up consumption
  - One of the consumers, as “group lead”, applies a strategy to assign partitions to the consumers in the group
  - When consumers join or leave, a rebalancing happens to reassign the partitions
  - This allows pluggable strategies for partition assignment (e.g. stickiness)
Both approaches are sketched below.
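A hedged sketch of the two approaches with the Java client (the topic name is invented):

import java.util.Arrays;
import java.util.Collections;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class AssignmentModes {
    // Manual assignment: the application picks the partitions itself;
    // no consumer-group coordination, no automatic rebalancing.
    static void manual(KafkaConsumer<String, String> consumer) {
        consumer.assign(Arrays.asList(
                new TopicPartition("my-topic", 0),
                new TopicPartition("my-topic", 1)));
    }

    // Group subscription: the group assigns the partitions and
    // rebalances them automatically as consumers join or leave.
    static void grouped(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("my-topic"));
    }
}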

Slide 30

Consumer groups
- Consumer group
  - Groups multiple consumers
  - Each consumer reads from a “unique” subset of partitions → max consumers = number of partitions
  - They are “competing” consumers on the topic; each message is delivered to one consumer
  - Messages with the same “key” are delivered to the same consumer
- Multiple consumer groups
  - Allow publish/subscribe
  - The same messages are delivered to different consumers in different consumer groups

Slide 31

Consumer groups: partition assignment
(diagram: a topic with partitions 0-3; the two consumers of Group 1 and the three consumers of Group 2 each split the four partitions within their group)

Slide 32

Consumer groups: rebalancing
(diagram: when a consumer joins or leaves a group, the four partitions are reassigned among the consumers remaining in that group)

Slide 33

Consumer groups: max parallelism & idle consumer
(diagram: with four partitions and five consumers in Group 1, one consumer stays idle; the partition count caps the parallelism)

Slide 34

Security
- Encryption between clients and brokers, and between brokers
  - Using SSL
- Authentication of clients (and brokers) connecting to brokers
  - Using SSL (mutual authentication)
  - Using SASL (with PLAIN, Kerberos or SCRAM-SHA as mechanisms)
- Authorization of read/write operations by clients
  - ACLs on resources such as topics
  - The authenticated “principal” is used for configuring ACLs
  - Pluggable
- It’s possible to mix encryption/no-encryption and authentication/no-authentication (a client configuration sketch follows)
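As a hedged illustration, a client combining TLS encryption with SCRAM authentication could be configured along these lines (user, password and file paths are placeholders):

# client security configuration - illustrative values only
security.protocol=SASL_SSL                       # TLS encryption + SASL authentication
ssl.truststore.location=/path/to/truststore.jks  # trust the brokers' certificates
ssl.truststore.password=changeit
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="alice" \
  password="alice-secret";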

Slide 35

Stream Processing

Slide 36

Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.

Slide 37

Processing
- Request/response
- Batch
- Stream processing
  - Real-time reaction to events
  - Continuous applications
  - Processes both real-time and historical data

Slide 38

Stream processing
- Validation
- Transformation
- Enrichment
- Deduplication
- Aggregations
- Joins
- Windowing

Slide 39

Kafka Streams

Slide 40

Kafka Streams

Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in Kafka clusters. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology.

Slide 41

Apache Kafka ecosystem: components
- Kafka Streams
  - Stream processing framework
  - Streams are Kafka topics (as input and output)
  - It’s really just a Java library to include in your application
  - Scales the stream application horizontally
  - Creates a topology of processing nodes (filter, map, join, etc.) acting on a stream
    - Low-level processor API
    - High-level DSL
    - Uses “internal” topics (when re-partitioning is needed or for “stateful” transformations)

Slide 42

Kafka Streams API
(diagram: a Streams API application consumes from Kafka topics, runs a chain of processing steps, and produces the results back to Kafka topics)

Slide 43

Sample: word count

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("streams-plaintext-input");
KTable<String, Long> counts = source
    .flatMapValues(value -> Arrays.asList(value.toLowerCase(Locale.getDefault()).split(" ")))
    .groupBy((key, value) -> value)
    .count();
// Serdes = Serializer/Deserializer
counts.toStream().to("streams-wordcount-output",
    Produced.with(Serdes.String(), Serdes.Long()));

Source: Apache Kafka

Slide 44

Kafka Streams architecture
(diagram) Source: Confluent

Slide 45

Aggregations
- Aggregate
- Reduce
- Count
- Based on a “key” (grouping) or on time/sessions (windowing)

Slide 46

Windowing

Windowing lets you control how to group records.

Window name           | Behavior      | Short description
Tumbling time window  | Time-based    | Fixed-size, non-overlapping, gap-less windows
Hopping time window   | Time-based    | Fixed-size, overlapping windows
Sliding time window   | Time-based    | Fixed-size, overlapping windows that work on differences between record timestamps
Session window        | Session-based | Dynamically-sized, non-overlapping, data-driven windows

Source: Confluent

Slide 47

Tumbling time window
(diagram) Source: Confluent

Slide 48

Windowing

// Key (String) is user ID, value (Avro record) is the page view event for that user.
// Such a data stream is often called a “clickstream”.
KStream<String, GenericRecord> pageViews = ...;

// Count page views per window, per user, with tumbling windows of size 5 minutes
KTable<Windowed<String>, Long> windowedPageViewCounts = pageViews
    .groupByKey(Serialized.with(Serdes.String(), genericAvroSerde))
    .windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(5)))
    .count();

Source: Confluent

Slide 49

Joins

Join operands            | Type         | (INNER) JOIN  | LEFT JOIN     | OUTER JOIN
KStream-to-KStream       | Windowed     | Supported     | Supported     | Supported
KTable-to-KTable         | Non-windowed | Supported     | Supported     | Supported
KStream-to-KTable        | Non-windowed | Supported     | Supported     | Not supported
KStream-to-GlobalKTable  | Non-windowed | Supported     | Supported     | Not supported
KTable-to-GlobalKTable   | N/A          | Not supported | Not supported | Not supported

Source: Confluent

Slide 50

Joins

KStream<String, Long> left = ...;
KTable<String, Double> right = ...;

// Java 8+ example, using lambda expressions
KStream<String, String> joined = left.join(right,
    (leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue,
    Joined.keySerde(Serdes.String()) /* key */
          .withValueSerde(Serdes.Long()) /* left value */
);

Source: Confluent

Slide 51

KSQL

Slide 52

Kafka history
(diagram) Source: Confluent

Slide 53

KSQL
(diagram: applications read and write Kafka topics directly, while KSQL drives the data processing with CREATE STREAM, CREATE TABLE and SELECT statements)

Slide 54

KSQL
(diagram: as on the previous slide, with one addition)
KSQL is built on Kafka Streams.

Slide 55

KSQL: SELECT statement syntax

SELECT select_expr [, ...]
  FROM from_item
  [ LEFT JOIN join_table ON join_criteria ]
  [ WINDOW window_expression ]
  [ WHERE condition ]
  [ GROUP BY grouping_expression ]
  [ HAVING having_expression ]
  [ LIMIT count ];
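To illustrate the syntax, a hedged example with invented stream, column and topic names: first register a stream over an existing topic, then run a windowed aggregation over it.

-- register a stream over an existing Kafka topic
CREATE STREAM pageviews (userid VARCHAR, pageid VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

-- count page views per user over 5-minute tumbling windows
SELECT userid, COUNT(*)
  FROM pageviews
  WINDOW TUMBLING (SIZE 5 MINUTES)
  GROUP BY userid;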

Slide 56

KSQL: what for?
- Data exploration: an easy way to look at topics
- Data enrichment / ETL: join multiple topics
- Anomaly detection: for example using windowing
- Real-time monitoring/alerting: find errors when they happen

Slide 57

Kafka & KSQL Demonstration

Slide 58

Introduction to Apache Kafka & KSQL
Tugdual Grall, Product Management, Red Hat
@tgrall