#33 Big Data processing with Apache Kafka and KSQL

Introduction to Apache Kafka & KSQL
Tugdual Grall, Product Management, Red Hat
@tgrall

About me
Tugdual “Tug” Grall
• Red Hat: Product Management
  ◦ Developer Tools/Developer Experience
• MapR: Tech Evangelist/Product Manager
• MongoDB: Technical Evangelist
• Couchbase: Technical Evangelist
• eXo Platform: CTO
• Oracle: PM/Developer/Consultant
• NantesJUG co-founder (2008)
• @tgrall
• http://github.com/tgrall
• tug@redhat.com | tugdual@gmail.com
• Pet project: https://promoglisse-speed-challenge.com

Red Hat Developer Tools Portfolio
Categories: OpenShift developer tools, public SaaS tools, middleware and RHEL developer tools
• OpenShift-DO CLI
• Container Development Kit
• Eclipse Che
• Eclipse Che plugins
• DevStudio plugins
• VS Code plugins
• RHEL tools
• OpenShift.io (ToolChain)
• Hosted Eclipse Che
https://developers.redhat.com/

What is Kafka?
“… a publish/subscribe messaging system …”

What is Kafka?
“… a streaming data platform …”

What is Kafka?
“… a distributed, horizontally-scalable, fault-tolerant, commit log …”

What is Kafka?
• Developed at LinkedIn in 2010, open-sourced in 2011
• Designed to be fast, scalable, durable and available
• Distributed by nature
• Data partitioning
• High throughput / low latency
• Ability to handle a huge number of consumers

Apache Kafka & Kubernetes: Strimzi Project
• Strimzi provides a way to run an Apache Kafka cluster on OpenShift and Kubernetes in various deployment configurations
• http://strimzi.io/
• Red Hat supported version: Red Hat AMQ Streams

Ecosystem

Apache Kafka ecosystem
• Apache Kafka has an ecosystem consisting of many components / tools
  ◦ Kafka Core
    ▪ Broker
    ▪ Client libraries (Producer, Consumer, Admin)
    ▪ Management tools
  ◦ Kafka Connect
  ◦ Kafka Streams
  ◦ Mirror Maker

Apache Kafka ecosystem: Apache Kafka components
• Kafka Broker
  ◦ Central component responsible for hosting topics and delivering messages
  ◦ One or more brokers run in a cluster alongside a ZooKeeper ensemble
• Kafka Producers and Consumers
  ◦ Java-based clients for sending and receiving messages
• Kafka Admin tools
  ◦ Java- and Scala-based tools for managing Kafka brokers
  ◦ Managing topics, ACLs, monitoring, etc.

Kafka & ZooKeeper
[Diagram: ZooKeeper, Kafka, applications, and admin tools]

Kafka ecosystem: Apache Kafka components
• Kafka Connect
  ◦ Framework for transferring data between Kafka and other data systems
  ◦ Facilitates data conversion, scaling, load balancing, fault tolerance, …
  ◦ Connector plugins are deployed into a Kafka Connect cluster
    ▪ Well-defined API for creating new connectors (with Sink/Source)
    ▪ Apache Kafka itself includes only the FileSink and FileSource plugins (reading records from a file and posting them as Kafka messages / writing Kafka messages to files)
    ▪ Many additional plugins are available outside of Apache Kafka

Kafka ecosystem: Apache Kafka components
[Diagram: Connect API (source), Stream API, Connect API (sink)]

Kafka ecosystem: Apache Kafka components
• Mirror Maker
  ◦ Kafka clusters do not work well when split across multiple datacenters
    ▪ Low bandwidth, high latency
    ▪ For use across multiple datacenters, it is recommended to set up an independent cluster in each datacenter and mirror the data
  ◦ Tool for replicating topics between different clusters

Across Data Centers
[Diagram: Data Center 1 (geo location 1) and Data Center 2 (geo location 2), each with its own producer and brokers 1 to N hosting partitions T1-P1 and T1-P2; Mirror Maker replicates between the two clusters]

Kafka ecosystem: Outside of Apache Kafka project
• Clients for other languages
• REST Proxy for bridging between HTTP and Kafka
• Schema Registry
• Cluster Balancers
• Management and Monitoring consoles
• Kafka Connect plugins
• KSQL (Confluent)
• Kafka can be used with many other projects (e.g. Apache Spark, Apache Flink, Apache Storm)

Topic & Partitions

Topic & Partitions
• Messages / records are sent to / received from a topic
  ◦ Topics are split into one or more partitions
  ◦ Partition = Shard
  ◦ All actual work is done at the partition level; the topic is just a virtual object
• Each message is written to exactly one selected partition
  ◦ Partitioning is usually done based on the message key
  ◦ Message ordering within the partition is fixed
• Clean-up policies
  ◦ Based on size / message age
  ◦ Compacted based on message key
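
The compaction clean-up policy can be pictured as keeping only the most recent record for each key. The sketch below is a plain-Java illustration of that idea, not Kafka's implementation: `CompactionSketch` and `compact` are hypothetical names, the log is modelled as an in-memory list of key/value pairs, and details of the real broker-side process (background cleaning of log segments, delete markers) are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Kafka: simulates what a compacted partition retains.
final class CompactionSketch {

    // Returns the records that survive compaction: the latest record per key,
    // kept in the order of their original offsets.
    static List<Map.Entry<String, String>> compact(List<Map.Entry<String, String>> log) {
        // Remember the offset (list index) of the last record seen for each key.
        Map<String, Integer> lastOffset = new HashMap<>();
        for (int offset = 0; offset < log.size(); offset++) {
            lastOffset.put(log.get(offset).getKey(), offset);
        }
        // Keep a record only if it is the last one written for its key.
        List<Map.Entry<String, String>> compacted = new ArrayList<>();
        for (int offset = 0; offset < log.size(); offset++) {
            Map.Entry<String, String> rec = log.get(offset);
            if (lastOffset.get(rec.getKey()) == offset) {
                compacted.add(rec);
            }
        }
        return compacted;
    }
}
```

For a log containing a=1, b=2, a=3 (in offset order), `compact` keeps b=2 and a=3: the older value for key a is dropped, while ordering by offset is preserved.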

Topic & Partitions: Producing messages
[Diagram: a producer and three partitions (Partition 0, Partition 1, Partition 2), each a sequence of numbered messages ordered from old to new]

Topic & Partitions: Consuming messages
[Diagram: a consumer and three partitions (Partition 0, Partition 1, Partition 2), each a sequence of numbered messages ordered from old to new]

Replication: Leaders & Followers
• Replicas are “backups” of a partition
  ◦ Provide redundancy
• It’s the way Kafka guarantees availability and durability in case of node failures
• Two roles:
  ◦ Leader: the replica used by producers/consumers for exchanging messages
  ◦ Followers: all the other replicas
    ▪ They don’t serve client requests
    ▪ They replicate messages from the leader to stay “in-sync” (ISR)
  ◦ A replica changes its role as brokers come and go

Leaders & Followers: Partitions Distribution
[Diagram: partitions T1-P1, T1-P2, T2-P1, and T2-P2 replicated on Broker 1, Broker 2, and Broker 3]
• Leaders and followers are spread across the cluster
  ◦ Producers/consumers connect to leaders
  ◦ Multiple connections are needed for reading different partitions

Leaders & Followers: Partitions Distribution
[Diagram: partitions T1-P1, T1-P2, T2-P1, and T2-P2 replicated on Broker 1, Broker 2, and Broker 3]
• A broker with a leader partition goes down
• A new leader partition is elected on a different node

Clients
• Clients are really “smart” (unlike in “traditional” messaging)
• Configured with a “bootstrap servers” list for fetching the first metadata
  ◦ Where are the topics of interest? Connect to the brokers that hold the partition leaders
  ◦ The producer specifies the destination partition
  ◦ The consumer handles the message offsets to read
  ◦ If an error happens, refresh the metadata (something has changed in the cluster)
• Batching on producing/consuming

Producers
• Destination partition computed on the client
  ◦ Round robin
  ◦ Specified by hashing the “key” in the message
  ◦ Custom partitioning
• Writes messages to the “leader” for a partition
• Acknowledge:
  ◦ No ack
  ◦ Ack on message written to the “leader”
  ◦ Ack on message also replicated to the “in-sync” replicas
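
The first two partitioning strategies on this slide can be sketched in a few lines of plain Java. This is an illustration only: `PartitionChooser` is a hypothetical class, not Kafka's partitioner, and `String.hashCode()` stands in for the hash that Kafka's default partitioner applies to the serialized key bytes. The point it shows is that equal keys always land in the same partition, while keyless records are spread round robin.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Kafka's partitioner class.
final class PartitionChooser {
    private final int numPartitions;
    private final AtomicInteger counter = new AtomicInteger();

    PartitionChooser(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    // Keyed records: hash the key, so equal keys map to the same partition.
    // Keyless records (null key): rotate over the partitions round robin.
    int choose(String key) {
        if (key == null) {
            return Math.floorMod(counter.getAndIncrement(), numPartitions);
        }
        // floorMod keeps the result in [0, numPartitions) even for negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Because the partition depends on the number of partitions, adding partitions to a topic changes where a given key lands, which is one reason the ordering guarantee only holds per partition.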

Consumers
• Read from one (or more) partition(s)
• Track (commit) the offset for a given partition
  ◦ A partitioned topic “__consumer_offsets” is used for that
  ◦ Key → [group, topic, partition], Value → [offset]
  ◦ The offset is shared inside the consumer group
• QoS
  ◦ At most once: read message, commit offset, process message
  ◦ At least once: read message, process message, commit offset
  ◦ Exactly once: read message, commit message output and offset to a transactional system
• Gets only “committed” messages (depends on the producer “ack” level)
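
The difference between “at most once” and “at least once” comes down to the order of the commit and process steps, which shows up when a consumer crashes between them. The following plain-Java simulation is a sketch under stated assumptions: `DeliverySketch`, `Mode`, and `run` are hypothetical names, the partition is an in-memory list, and the crash is simulated by returning early. It does not use the Kafka consumer API.

```java
import java.util.List;

// Hypothetical simulation of the two commit orderings; not the Kafka consumer API.
final class DeliverySketch {
    enum Mode { AT_MOST_ONCE, AT_LEAST_ONCE }

    // Consumes `log` starting at the committed offset and appends processed
    // messages to `processed`. If crashAtOffset >= 0, the consumer "crashes"
    // on that offset between its first and second step.
    // Returns the committed offset at the time the run ends.
    static int run(List<String> log, int committed, Mode mode,
                   int crashAtOffset, List<String> processed) {
        for (int offset = committed; offset < log.size(); offset++) {
            String message = log.get(offset);
            if (mode == Mode.AT_MOST_ONCE) {
                committed = offset + 1;                  // commit first
                if (offset == crashAtOffset) {
                    return committed;                    // crash: message never processed
                }
                processed.add(message);                  // then process
            } else {
                processed.add(message);                  // process first
                if (offset == crashAtOffset) {
                    return committed;                    // crash: offset never committed
                }
                committed = offset + 1;                  // then commit
            }
        }
        return committed;
    }
}
```

With a log of m0, m1, m2 and a crash on offset 1 followed by a restart from the committed offset, the at-most-once run processes m0 and m2 (m1 is lost), while the at-least-once run processes m0, m1, m1, m2 (m1 is duplicated).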

Producers & Consumers: Writing/Reading to/from leaders
[Diagram: producers and consumers connected to the leader partitions among T1-P1, T1-P2, T2-P1, and T2-P2 on Broker 1, Broker 2, and Broker 3]

Consumer: partitions assignment
Available approaches
• The consumer asks for a specific partition (assign)
  ◦ An application using one or more consumers has to handle this assignment, and the scaling, on its own
• The consumer is part of a “consumer group”
  ◦ Consumer groups are an easier way to scale up consumption
  ◦ One of the consumers, as “group lead”, applies a strategy to assign partitions to the consumers in the group
  ◦ When consumers join or leave, a rebalancing happens to reassign partitions
  ◦ This allows pluggable strategies for partition assignment (e.g. stickiness)

Consumer Groups
• Consumer Group
  ◦ Grouping multiple consumers
  ◦ Each consumer reads from a “unique” subset of partitions → max consumers = num partitions
  ◦ They are “competing” consumers on the topic; each message is delivered to one consumer
  ◦ Messages with the same “key” are delivered to the same consumer
• More consumer groups
  ◦ Allows publish/subscribe
  ◦ The same messages are delivered to different consumers in different consumer groups

Consumer Groups: Partitions assignment
[Diagram: a topic with Partition 0 to Partition 3; Group 1 with two consumers and Group 2 with three consumers]

Consumer Groups: Rebalancing
[Diagram: a topic with Partition 0 to Partition 3; Group 1 with two consumers and Group 2 with three consumers]

Consumer Groups: Max parallelism & idle consumer
[Diagram: a topic with Partition 0 to Partition 3 and a group of five consumers]
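
The “max consumers = num partitions” rule and the idle-consumer case can be reproduced with a small round-robin assignment. This is a simplified sketch: `GroupAssignmentSketch` and `assign` are hypothetical names, consumer names are assumed to be unique, and Kafka's own pluggable assignors are more elaborate than this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a round-robin assignment inside one consumer group.
final class GroupAssignmentSketch {

    // Assigns partitions 0..numPartitions-1 to the consumers in turn.
    // Every partition gets exactly one owner; a consumer may own zero partitions.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment;
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }
}
```

With four partitions, a group of two consumers gets two partitions each; a group of five consumers leaves the fifth with an empty list, which is the idle consumer of the last diagram. Calling `assign` again with a changed consumer list mimics a rebalance.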

Security
• Encryption between clients and brokers and between brokers
  ◦ Using SSL
• Authentication of clients (and brokers) connecting to brokers
  ◦ Using SSL (mutual authentication)
  ◦ Using SASL (with PLAIN, Kerberos or SCRAM-SHA as mechanisms)
• Authorization of read/write operations by clients
  ◦ ACLs on resources such as topics
  ◦ Authenticated “principal” for configuring ACLs
  ◦ Pluggable
• It’s possible to mix encryption/no-encryption and authentication/no-authentication

Stream Processing

Streaming technology is enabling the obvious: continuous processing on data that is continuously produced

Processing
• Request/Response
• Batch
• Stream Processing
  ◦ Real-time reaction to events
  ◦ Continuous applications
  ◦ Process both real-time and historical data

Stream Processing
• Validation
• Transformation
• Enrichment
• Deduplication
• Aggregations
• Joins
• Windowing

Kafka Streams

Kafka Streams
Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in Kafka clusters. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology.

Kafka ecosystem: Apache Kafka components
• Kafka Streams
  ◦ Stream processing framework
  ◦ Streams are Kafka topics (as input and output)
  ◦ It’s really just a Java library to include in your application
  ◦ Scaling the stream application horizontally
  ◦ Creates a topology of processing nodes (filter, map, join, etc.) acting on a stream
    ▪ Low-level Processor API
    ▪ High-level DSL
    ▪ Using “internal” topics (when re-partitioning is needed or for “stateful” transformations)

Kafka Streams API
[Diagram: a Streams API application made of several processing steps, reading from and writing to topics]

Sample: Word count

    …
    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> source = builder.stream("streams-plaintext-input");
    KTable<String, Long> counts = source
        // Split each line into lower-case words
        .flatMapValues(value -> Arrays.asList(value.toLowerCase(Locale.getDefault()).split(" ")))
        // Re-key the stream by word, then count occurrences per word
        .groupBy((key, value) -> value)
        .count();
    // Serdes = Serializer/Deserializer
    counts.toStream().to("streams-wordcount-output", Produced.with(Serdes.String(), Serdes.Long()));
    …

Source: Apache Kafka

Kafka Streams Architecture
[Diagram. Source: Confluent]

Aggregations
• Aggregate
• Reduce
• Count
• Based on a “key” (grouping) or on time/session (windowing)

Windowing
Windowing lets you control how to group records

Window name            Behavior        Short description
Tumbling time window   Time-based      Fixed-size, non-overlapping, gap-less windows
Hopping time window    Time-based      Fixed-size, overlapping windows
Sliding time window    Time-based      Fixed-size, overlapping windows that work on differences between record timestamps
Session window         Session-based   Dynamically-sized, non-overlapping, data-driven windows

Source: Confluent

Tumbling time window
[Diagram. Source: Confluent]

Windowing

    // Key (String) is user ID, value (Avro record) is the page view event for that user.
    // Such a data stream is often called a "clickstream".
    KStream<String, GenericRecord> pageViews = ...;

    // Count page views per window, per user, with tumbling windows of size 5 minutes
    KTable<Windowed<String>, Long> windowedPageViewCounts = pageViews
        .groupByKey(Serialized.with(Serdes.String(), genericAvroSerde))
        .windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(5)))
        .count();

Source: Confluent
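
To make the tumbling-window grouping concrete, here is a plain-Java sketch of how a record timestamp maps to a fixed-size, non-overlapping window and how counts accumulate per window. It is an illustration, not the Kafka Streams implementation: `TumblingWindowSketch`, `windowStart`, and `countPerWindow` are hypothetical names, and the sketch assumes windows aligned to timestamp 0 and a single key.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of tumbling-window bucketing; not the Kafka Streams API.
final class TumblingWindowSketch {

    // Start (inclusive) of the tumbling window that contains the timestamp.
    // Each window covers [start, start + windowSizeMs).
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - Math.floorMod(timestampMs, windowSizeMs);
    }

    // Counts events per window; the map is keyed by window start, in ascending order.
    static Map<Long, Long> countPerWindow(long[] timestampsMs, long windowSizeMs) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            counts.merge(windowStart(ts, windowSizeMs), 1L, Long::sum);
        }
        return counts;
    }
}
```

With 5-minute windows (300000 ms), events at 0, 299999, 300000, and 650000 ms fall into the windows starting at 0 (two events), 300000 (one event), and 600000 (one event). Every timestamp belongs to exactly one window, which is what distinguishes tumbling windows from hopping ones.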

Joins

Join operands             Type           (INNER) JOIN    LEFT JOIN       OUTER JOIN
KStream-to-KStream        Windowed       Supported       Supported       Supported
KTable-to-KTable          Non-windowed   Supported       Supported       Supported
KStream-to-KTable         Non-windowed   Supported       Supported       Not Supported
KStream-to-GlobalKTable   Non-windowed   Supported       Supported       Not Supported
KTable-to-GlobalKTable    N/A            Not Supported   Not Supported   Not Supported

Source: Confluent

Joins

    KStream<String, Long> left = ...;
    KTable<String, Double> right = ...;

    // Java 8+ example, using lambda expressions
    KStream<String, String> joined = left.join(right,
        (leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue,
        Joined.keySerde(Serdes.String())   /* key */
            .withValueSerde(Serdes.Long()) /* left value */
    );

Source: Confluent

KSQL

Kafka History
[Diagram. Source: Confluent]

KSQL
[Diagram: KSQL reads from and writes to topics in Apache Kafka; data processing is expressed with CREATE STREAM, CREATE TABLE, and SELECT … statements]

KSQL
[Diagram: KSQL reads from and writes to topics in Apache Kafka; data processing is expressed with CREATE STREAM, CREATE TABLE, and SELECT … statements]
KSQL is based on Kafka Streams

KSQL: Select statement syntax

    SELECT select_expr [, ...]
      FROM from_item
      [ LEFT JOIN join_table ON join_criteria ]
      [ WINDOW window_expression ]
      [ WHERE condition ]
      [ GROUP BY grouping_expression ]
      [ HAVING having_expression ]
      [ LIMIT count ];

KSQL: What for?
• Data Exploration: an easy way to look at topics
• Data Enrichment/ETL: join multiple topics
• Anomaly Detection: for example, using windowing
• Real-Time Monitoring/Alerting: find errors when they happen

Kafka & KSQL Demonstration

