#33 Big Data processing with Apache Kafka and KSQL

More and more applications need to process data as it arrives. Apache Kafka provides a distributed streaming platform that simplifies stream processing.
In this session you will learn the basics of Apache Kafka (messaging), how to integrate different data sources using Kafka Connect, and how to process data with a simple API: Kafka Streams.

Bio:
Tugdual Grall - Director of Product Management at Red Hat, responsible for Developer Experience. Before joining Red Hat, "Tug" was Chief Technical Evangelist EMEA at MapR, and Technical Evangelist at MongoDB and Couchbase. Tug worked as CTO at eXo Platform, and as Product Manager and Developer on Oracle's Java/Java EE platform.
Tugdual is also co-founder of the Nantes JUG (Java Users Group), which has brought together developers and architects from the Nantes area every month since 2008.

Toulouse Data Science

November 08, 2018

Transcript

  1. 2.

    @tgrall About me: Tugdual “Tug” Grall
    • Red Hat: Product Management
      ◦ Developer Tools/Developer Experience
    • MapR: Tech Evangelist/Product Manager
    • MongoDB: Technical Evangelist
    • Couchbase: Technical Evangelist
    • eXo Platform: CTO
    • Oracle: PM/Developer/Consultant
    • NantesJUG co-founder (2008)
    • @tgrall
    • http://github.com/tgrall
    • tug@redhat.com | tugdual@gmail.com
    • Pet project: https://promoglisse-speed-challenge.com
  2. 3.

    Red Hat Developer Tools Portfolio
    Categories: OpenShift Developer Tools; Public SaaS Tools; Middleware and RHEL Developer Tools
    - OpenShift-DO CLI
    - Container Development Kit
    - Eclipse Che
    - Eclipse Che plugins
    - DevStudio plugins
    - VS Code plugins
    - RHEL tools
    - OpenShift.io (ToolChain)
    - Hosted Eclipse Che
    https://developers.redhat.com/
  3. 7.

    What is Kafka?
    • Developed at LinkedIn back in 2010, open sourced in 2011
    • Designed to be fast, scalable, durable and available
    • Distributed by nature
    • Data partitioning
    • High throughput / low latency
    • Ability to handle a huge number of consumers
  4. 8.

    Apache Kafka & Kubernetes: Strimzi Project
    • Strimzi provides a way to run an Apache Kafka cluster on OpenShift and Kubernetes in various deployment configurations
    • http://strimzi.io/
    • Red Hat supported version: Red Hat AMQ Streams
  5. 10.

    Apache Kafka ecosystem
    • Apache Kafka has an ecosystem consisting of many components / tools
      ◦ Kafka Core
        ▪ Broker
        ▪ Client libraries (Producer, Consumer, Admin)
        ▪ Management tools
      ◦ Kafka Connect
      ◦ Kafka Streams
      ◦ Mirror Maker
  6. 11.

    Apache Kafka ecosystem: Apache Kafka components
    • Kafka Broker
      ◦ Central component responsible for hosting topics and delivering messages
      ◦ One or more brokers run in a cluster alongside a ZooKeeper ensemble
    • Kafka Producers and Consumers
      ◦ Java-based clients for sending and receiving messages
    • Kafka Admin tools
      ◦ Java- and Scala-based tools for managing Kafka brokers
      ◦ Managing topics, ACLs, monitoring, etc.
  7. 13.

    Kafka ecosystem: Apache Kafka components
    • Kafka Connect
      ◦ Framework for transferring data between Kafka and other data systems
      ◦ Facilitates data conversion, scaling, load balancing, fault tolerance, …
      ◦ Connector plugins are deployed into a Kafka Connect cluster
        ▪ Well-defined API for creating new connectors (with Sink/Source)
        ▪ Apache Kafka itself includes only FileSink and FileSource plugins (reading records from a file and posting them as Kafka messages / writing Kafka messages to files)
        ▪ Many additional plugins are available outside of Apache Kafka
  8. 15.

    Kafka ecosystem: Apache Kafka components
    • Mirror Maker
      ◦ Kafka clusters do not work well when split across multiple data centers
        ▪ Low bandwidth, high latency
        ▪ For use across multiple data centers, it is recommended to set up an independent cluster in each data center and mirror the data
      ◦ Tool for replicating topics between different clusters
  9. 16.

    Mirror Maker: Across Data Centers
    [Diagram: two Kafka clusters, Data Center 1 (Geo location 1) and Data Center 2 (Geo location 2). Each has Brokers 1 to N hosting T1 - P1 and T1 - P2 and a local Producer; Mirror Maker instances replicate topics between the two clusters.]
  10. 17.

    Kafka ecosystem: Outside of Apache Kafka project
    • Clients for other languages
    • REST Proxy for bridging between HTTP and Kafka
    • Schema Registry
    • Cluster Balancers
    • Management and Monitoring consoles
    • Kafka Connect plugins
    • KSQL (Confluent)
    • Kafka can be used with many other projects (e.g. Apache Spark, Apache Flink, Apache Storm)
  11. 19.

    Topic & Partitions
    • Messages / records are sent to / received from a topic
      ◦ Topics are split into one or more partitions
      ◦ Partition = Shard
      ◦ All actual work is done at the partition level; a topic is just a virtual object
    • Each message is written to exactly one selected partition
      ◦ Partitioning is usually done based on the message key
      ◦ Message ordering within the partition is fixed
    • Clean-up policies
      ◦ Based on size / message age
      ◦ Compacted based on message key
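The "compacted based on message key" policy can be pictured with a small in-memory sketch: for each key, only the most recent record survives. This is only an illustration of the idea, not how the broker implements compaction; the class `CompactionSketch` and its `Entry` type are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of key-based compaction; not broker code.
final class CompactionSketch {

    // One log entry: offset in the partition, message key, message value.
    static final class Entry {
        final long offset;
        final String key;
        final String value;

        Entry(long offset, String key, String value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }
    }

    // Keeps only the latest entry per key; the input is assumed to be in
    // offset order, and the result stays in offset order.
    static List<Entry> compact(List<Entry> log) {
        Map<String, Entry> latest = new LinkedHashMap<>();
        for (Entry e : log) {
            latest.remove(e.key); // drop the older entry so the key moves to the end
            latest.put(e.key, e);
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<Entry> log = Arrays.asList(
                new Entry(0, "user-1", "a"),
                new Entry(1, "user-2", "b"),
                new Entry(2, "user-1", "c"));
        for (Entry e : compact(log)) {
            System.out.println(e.offset + " " + e.key + "=" + e.value);
        }
    }
}
```

With the three entries in `main`, the entry at offset 0 is superseded by offset 2 because both share the key `user-1`.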
  12. 20.

    Topic & Partitions: Producing messages
    [Diagram: a Producer appends messages to the "new" end of three partitions, each ordered from old to new: Partition 0 (offsets 0 to 11), Partition 1 (offsets 0 to 6), Partition 2 (offsets 0 to 10).]
  13. 21.

    Topic & Partitions: Consuming messages
    [Diagram: a Consumer reads from three partitions, each ordered from old to new: Partition 0 (offsets 0 to 11), Partition 1 (offsets 0 to 6), Partition 2 (offsets 0 to 10).]
  14. 22.

    Replication: Leaders & Followers
    • Replicas are “backups” for a partition
      ◦ Provide redundancy
    • It’s the way Kafka guarantees availability and durability in case of node failures
    • Two roles:
      ◦ Leader: the replica used by producers/consumers for exchanging messages
      ◦ Followers: all the other replicas
        ▪ They don’t serve client requests
        ▪ They replicate messages from the leader to be “in-sync” (ISR)
      ◦ A replica changes its role as brokers come and go
  15. 23.

    Partitions Distribution: Leaders & Followers
    [Diagram: Brokers 1, 2 and 3 each hold a replica of T1 - P1, T1 - P2, T2 - P1 and T2 - P2.]
    • Leaders and followers are spread across the cluster
      ◦ Producers/consumers connect to leaders
      ◦ Multiple connections are needed for reading different partitions
  16. 24.

    Partitions Distribution: Leaders & Followers
    [Diagram: Brokers 1, 2 and 3 each hold a replica of T1 - P1, T1 - P2, T2 - P1 and T2 - P2.]
    • A broker with a leader partition goes down
    • A new leader partition is elected on a different node
  17. 25.

    Clients
    • They are really “smart” (unlike “traditional” messaging)
    • Configured with a “bootstrap servers” list for fetching the first metadata
      ◦ Where are the topics of interest? Connect to the brokers that hold the partition leaders
      ◦ The producer specifies the destination partition
      ◦ The consumer handles the message offsets to read
      ◦ If an error happens, refresh the metadata (something has changed in the cluster)
    • Batching on producing/consuming
  18. 26.

    Producers
    • Destination partition computed on the client
      ◦ Round robin
      ◦ Specified by hashing the “key” in the message
      ◦ Custom partitioning
    • Writes messages to the “leader” for a partition
    • Acknowledge:
      ◦ No ack
      ◦ Ack on message written to “leader”
      ◦ Ack on message also replicated to “in-sync” replicas
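A minimal sketch of how a client can compute the destination partition, covering the round-robin and key-hash cases from the slide. `PartitionChooser` is a hypothetical class, not part of the Kafka client library, and `Arrays.hashCode` is used here only for brevity; Kafka's own default partitioner relies on a different hash function (murmur2), so the partition numbers below would not match a real cluster.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of client-side partition selection.
final class PartitionChooser {
    private final int numPartitions;
    private final AtomicInteger counter = new AtomicInteger();

    PartitionChooser(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    // Returns a partition in [0, numPartitions).
    int partitionFor(String key) {
        if (key == null) {
            // No key: spread messages round robin over the partitions.
            return Math.floorMod(counter.getAndIncrement(), numPartitions);
        }
        // Keyed message: the same key always hashes to the same partition.
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        PartitionChooser chooser = new PartitionChooser(3);
        System.out.println("user-42 -> " + chooser.partitionFor("user-42"));
        System.out.println("no key  -> " + chooser.partitionFor(null));
    }
}
```

Because the partition depends only on the key and the partition count, all messages for one key land in the same partition and keep their relative order, as long as the number of partitions does not change.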
  19. 27.

    Consumers
    • Read from one (or more) partition(s)
    • Track (commit) the offset for a given partition
      ◦ A partitioned topic “__consumer_offsets” is used for that
      ◦ Key → [group, topic, partition], Value → [offset]
      ◦ Offset is shared inside the consumer group
    • QoS
      ◦ At most once: read message, commit offset, process message
      ◦ At least once: read message, process message, commit offset
      ◦ Exactly once: read message, commit message output and offset to a transactional system
    • Gets only “committed” messages (depends on producer “ack” level)
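The difference between "at most once" and "at least once" is only the order of the commit and process steps, which a small simulation can make concrete. This is a sketch under simplifying assumptions: a single partition held in a list, one simulated crash, and a restart from the last committed offset. `DeliverySemanticsSketch` and its parameters are hypothetical names; no Kafka API is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of commit/process ordering; no Kafka API involved.
final class DeliverySemanticsSketch {

    // Consumes the log, "crashing" once between the two steps performed for
    // the message at crashOffset, then restarting from the committed offset.
    // Returns the messages that were actually processed, in order.
    static List<String> run(List<String> log, int crashOffset, boolean commitBeforeProcess) {
        List<String> processed = new ArrayList<>();
        int committed = 0;        // next offset to read after a restart
        boolean crashed = false;  // the crash happens only once
        int offset = committed;
        while (offset < log.size()) {
            String message = log.get(offset);
            boolean crashNow = !crashed && offset == crashOffset;
            if (commitBeforeProcess) {
                committed = offset + 1;          // at most once: commit first
                if (crashNow) { crashed = true; offset = committed; continue; }
                processed.add(message);
            } else {
                processed.add(message);          // at least once: process first
                if (crashNow) { crashed = true; offset = committed; continue; }
                committed = offset + 1;
            }
            offset++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("m0", "m1", "m2");
        System.out.println("at most once:  " + run(log, 1, true));
        System.out.println("at least once: " + run(log, 1, false));
    }
}
```

With a crash while handling `m1`, the commit-first variant skips `m1` after the restart (a lost message), while the process-first variant handles `m1` twice (a duplicate), which is why at-least-once consumers usually need idempotent processing.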
  20. 28.

    Producers & Consumers: Writing/Reading to/from leaders
    [Diagram: Brokers 1, 2 and 3 each hold a replica of T1 - P1, T1 - P2, T2 - P1 and T2 - P2; two Producers and three Consumers connect to the brokers hosting the leader partitions.]
  21. 29.

    Consumer: partitions assignment, available approaches
    • The consumer asks for a specific partition (assign)
      ◦ An application using one or more consumers has to handle such assignment on its own, and the scaling as well
    • The consumer is part of a “consumer group”
      ◦ Consumer groups are an easier way to scale up consumption
      ◦ One of the consumers, as “group lead”, applies a strategy to assign partitions to consumers in the group
      ◦ When consumers join or leave, a rebalancing happens to reassign partitions
      ◦ This allows pluggable strategies for partition assignment (e.g. stickiness)
  22. 30.

    Consumer Groups
    • Consumer Group
      ◦ Grouping multiple consumers
      ◦ Each consumer reads from a “unique” subset of partitions → max active consumers = number of partitions
      ◦ They are “competing” consumers on the topic; each message is delivered to one consumer
      ◦ Messages with the same “key” are delivered to the same consumer
    • More consumer groups
      ◦ Allows publish/subscribe
      ◦ The same messages are delivered to different consumers in different consumer groups
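The rule "max consumers = num partitions" follows directly from how partitions are handed out inside a group. The sketch below distributes partitions round robin over the members of one group; it is a simplified illustration, not the code of any of Kafka's built-in assignors, and `GroupAssignmentSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of partition assignment inside one consumer group.
final class GroupAssignmentSketch {

    // Distributes partitions 0..numPartitions-1 over the consumers round robin.
    // Each partition goes to exactly one consumer of the group.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment;
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two consumers share four partitions.
        System.out.println(assign(Arrays.asList("c1", "c2"), 4));
        // Five consumers, four partitions: the fifth consumer stays idle.
        System.out.println(assign(Arrays.asList("c1", "c2", "c3", "c4", "c5"), 4));
    }
}
```

Calling `assign` again with a changed member list plays the role of a rebalance: when a consumer leaves, its partitions are redistributed over the remaining members.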
  23. 31.

    Consumer Groups: Partitions assignment
    [Diagram: a Topic with Partitions 0 to 3; Group 1 has two Consumers and Group 2 has three Consumers, each group reading all four partitions.]
  24. 32.

    Consumer Groups: Rebalancing
    [Diagram: a Topic with Partitions 0 to 3; Group 1 has two Consumers and Group 2 has three Consumers; partitions are reassigned within a group when its membership changes.]
  25. 33.

    Consumer Groups: Max parallelism & idle consumer
    [Diagram: a Topic with Partitions 0 to 3; Group 1 has five Consumers, one more than the number of partitions, so one consumer is idle.]
  26. 34.

    Security
    • Encryption between clients and brokers and between brokers
      ◦ Using SSL
    • Authentication of clients (and brokers) connecting to brokers
      ◦ Using SSL (mutual authentication)
      ◦ Using SASL (with PLAIN, Kerberos or SCRAM-SHA as mechanisms)
    • Authorization of read/write operations by clients
      ◦ ACLs on resources such as topics
      ◦ Authenticated “principal” for configuring ACLs
      ◦ Pluggable
    • It’s possible to mix encryption/no-encryption and authentication/no-authentication
  27. 37.

    Processing
    • Request/Response
    • Batch
    • Stream Processing
    • Real-time reaction to events
    • Continuous applications
    • Process both real-time and historical data
  28. 38.

    Stream Processing
    • Validation
    • Transformation
    • Enrichment
    • Deduplication
    • Aggregations
    • Joins
    • Windowing
  29. 40.

    Kafka Streams
    Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in Kafka clusters. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology.
  30. 41.

    Kafka ecosystem: Apache Kafka components
    • Kafka Streams
      ◦ Stream processing framework
      ◦ Streams are Kafka topics (as input and output)
      ◦ It’s really just a Java library to include in your application
      ◦ Scaling the stream application horizontally
      ◦ Creates a topology of processing nodes (filter, map, join, etc.) acting on a stream
        ▪ Low-level Processor API
        ▪ High-level DSL
        ▪ Using “internal” topics (when re-partitioning is needed or for “stateful” transformations)
  31. 42.

    Kafka Streams API
    [Diagram: a Streams API application reads from input Topics, runs several Processing steps connected through intermediate Topics, and writes to output Topics.]
  32. 43.

    Sample: Word count
    …
    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> source = builder.stream("streams-plaintext-input");
    KTable<String, Long> counts = source
        .flatMapValues(value -> Arrays.asList(value.toLowerCase(Locale.getDefault()).split(" ")))
        .groupBy((key, value) -> value)
        .count();
    // Serdes = Serializer/Deserializer
    counts.toStream().to("streams-wordcount-output", Produced.with(Serdes.String(), Serdes.Long()));
    …
    Source: Apache Kafka
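To see what the word count topology computes without running a Kafka cluster, the same three steps (split each line into words, group by word, count) can be applied to an in-memory list with the JDK's stream API. This is only an analogy: `WordCountSketch` is a hypothetical class, it produces a single final result rather than a continuously updated table, and it uses `Locale.ROOT` and skips empty tokens for predictable output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical in-memory analogue of the word count topology; no Kafka involved.
final class WordCountSketch {

    // Splits each line into lowercase words, groups by word and counts.
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // like flatMapValues: one line becomes many words
                .flatMap(line -> Arrays.stream(line.toLowerCase(Locale.ROOT).split(" ")))
                .filter(word -> !word.isEmpty())
                // like groupBy(word).count(): one counter per distinct word
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("Hello Kafka", "hello streams")));
    }
}
```

In Kafka Streams the counts are not computed once at the end; each new input line updates the affected counters, and the updates are emitted to the output topic.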
  33. 45.

    Aggregations
    • Aggregate
    • Reduce
    • Count
    • Based on a “key” (grouping) or on time/session (windowing)
  34. 46.

    Windowing
    | Window name | Behavior | Short description |
    | Tumbling time window | Time-based | Fixed-size, non-overlapping, gap-less windows |
    | Hopping time window | Time-based | Fixed-size, overlapping windows |
    | Sliding time window | Time-based | Fixed-size, overlapping windows that work on differences between record timestamps |
    | Session window | Session-based | Dynamically-sized, non-overlapping, data-driven windows |
    Windowing lets you control how to group records
    Source: Confluent
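The difference between tumbling and hopping windows comes down to simple arithmetic on the record timestamp. The sketch below assumes windows aligned to timestamp zero and half-open intervals [start, start + size); `WindowSketch` and its method names are hypothetical and do not come from the Kafka Streams API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of window assignment; not Kafka Streams code.
final class WindowSketch {

    // Start of the single tumbling window that contains the timestamp.
    static long tumblingWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // Starts of all hopping windows (length sizeMs, a new window every
    // advanceMs) that contain the timestamp, oldest first.
    static List<Long> hoppingWindowStarts(long timestampMs, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        long latest = timestampMs - Math.floorMod(timestampMs, advanceMs);
        for (long start = latest; start > timestampMs - sizeMs; start -= advanceMs) {
            if (start >= 0) {
                starts.add(0, start); // prepend to keep the oldest window first
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60_000L;
        long oneMinute = 60_000L;
        long timestamp = 7 * 60_000L + 1_500L; // 7 min 1.5 s
        System.out.println(tumblingWindowStart(timestamp, fiveMinutes));
        System.out.println(hoppingWindowStarts(timestamp, fiveMinutes, oneMinute));
    }
}
```

A record falls into exactly one tumbling window but into several hopping windows (size divided by advance, here five), which is why hopping aggregations update several results per record.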
  35. 48.

    Windowing
    // Key (String) is user ID, value (Avro record) is the page view event for that user.
    // Such a data stream is often called a "clickstream".
    KStream<String, GenericRecord> pageViews = ...;
    // Count page views per window, per user, with tumbling windows of size 5 minutes
    KTable<Windowed<String>, Long> windowedPageViewCounts = pageViews
        .groupByKey(Serialized.with(Serdes.String(), genericAvroSerde))
        .windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(5)))
        .count();
    Source: Confluent
  36. 49.

    Joins
    | Join operands | Type | (INNER) JOIN | LEFT JOIN | OUTER JOIN |
    | KStream-to-KStream | Windowed | Supported | Supported | Supported |
    | KTable-to-KTable | Non-windowed | Supported | Supported | Supported |
    | KStream-to-KTable | Non-windowed | Supported | Supported | Not Supported |
    | KStream-to-GlobalKTable | Non-windowed | Supported | Supported | Not Supported |
    | KTable-to-GlobalKTable | N/A | Not Supported | Not Supported | Not Supported |
    Source: Confluent
  37. 50.

    Joins
    KStream<String, Long> left = ...;
    KTable<String, Double> right = ...;
    // Java 8+ example, using lambda expressions
    KStream<String, String> joined = left.join(right,
        (leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue,
        Joined.keySerde(Serdes.String()) /* key */
            .withValueSerde(Serdes.Long()) /* left value */
    );
    Source: Confluent
  38. 53.
  39. 54.

    KSQL
    [Diagram: KSQL reads from and writes to Apache Kafka topics; data processing is expressed with CREATE STREAM, CREATE TABLE and SELECT … statements.]
    KSQL is based on Kafka Streams
  40. 55.

    KSQL: Select statement syntax
    SELECT select_expr [, ...]
      FROM from_item
      [ LEFT JOIN join_table ON join_criteria ]
      [ WINDOW window_expression ]
      [ WHERE condition ]
      [ GROUP BY grouping_expression ]
      [ HAVING having_expression ]
      [ LIMIT count ];
  41. 56.

    KSQL: What for?
    • Data Exploration: an easy way to look at topics
    • Data Enrichment/ETL: join multiple topics
    • Anomaly Detection: for example using windowing
    • Real-Time Monitoring/Alerting: find errors when they happen