
#33 Big Data Processing with Apache Kafka and KSQL

More and more applications need to process data as it flows. Apache Kafka provides a distributed streaming platform that simplifies stream processing.
In this session you will discover the basics of Apache Kafka (messaging), how to integrate various data sources through Kafka Connect, and how to process data through a simple API: Kafka Streams.

Bio:
Tugdual Grall - Director of Product Management at Red Hat, in charge of Developer Experience. Before Red Hat, "Tug" was Chief Technical Evangelist EMEA at MapR, and Technical Evangelist at MongoDB and Couchbase. He has also worked as CTO of eXo Platform, and as a Product Manager and Developer on Oracle's Java/Java EE platform.
Tugdual is also a co-founder of the Nantes JUG (Java Users Group), which has brought together developers and architects of the Nantes region every month since 2008.

Toulouse Data Science

November 08, 2018

Transcript

  1. @tgrall
    Introduction to
    Apache Kafka
    &
    KSQL
    Tugdual Grall
    Product Management
    Red Hat
    @tgrall

  2. @tgrall
    About me
    Tugdual “Tug” Grall
    ! Red Hat: Product Management
    ! Developer Tools/Developer Experience
    ! MapR: Tech Evangelist/Product Manager
    ! MongoDB: Technical Evangelist
    ! Couchbase: Technical Evangelist
    ! eXo Platform: CTO
    ! Oracle: PM/Developer/Consultant
    ! NantesJUG co-founder 2008!
    ! @tgrall
    ! http://github.com/tgrall
    ! [email protected] | [email protected]
    ! Pet Project

    https://promoglisse-speed-challenge.com

  3. @tgrall
Red Hat Developer Tools Portfolio
    (OpenShift Developer Tools | Public SaaS Tools | Middleware and RHEL Developer Tools)
    - OpenShift-DO CLI
    - Container Development Kit
    - Eclipse Che
    - Eclipse Che plugins
    - DevStudio plugins
    - VS Code plugins
    - RHEL tools
    - OpenShift.io (ToolChain)
    - Hosted Eclipse Che
    https://developers.redhat.com/

  4. @tgrall
    “ … a publish/subscribe
    messaging system …”
    What is Kafka ?

  5. @tgrall
    “ … a streaming
    data platform …”
    What is Kafka ?

  6. @tgrall
    “ … a distributed, horizontally-scalable,
    fault-tolerant, commit log …”
    What is Kafka ?

  7. @tgrall
    ! Developed at LinkedIn back in 2010, open sourced in 2011
    ! Designed to be fast, scalable, durable and available
    ! Distributed by nature
    ! Data partitioning
    ! High throughput / low latency
! Ability to handle a huge number of consumers
    What is Kafka ?

  8. @tgrall
    ! Strimzi provides a way to run an Apache Kafka cluster
    on OpenShift and Kubernetes in various deployment configurations
    ! http://strimzi.io/
    ! Red Hat supported version: Red Hat AMQ Streams
    Apache Kafka & Kubernetes: Strimzi Project
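
    A rough sketch of what this looks like (not from the deck; the exact apiVersion and fields vary by Strimzi release, and all names are hypothetical): a Kafka cluster is declared as a Kubernetes custom resource, for example:

    apiVersion: kafka.strimzi.io/v1alpha1    # depends on the Strimzi release
    kind: Kafka
    metadata:
      name: my-cluster                       # hypothetical cluster name
    spec:
      kafka:
        replicas: 3
        storage:
          type: ephemeral
      zookeeper:
        replicas: 3
        storage:
          type: ephemeral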

  9. @tgrall
    Ecosystem

  10. @tgrall
    ! Apache Kafka has an ecosystem consisting of many components / tools
    ○ Kafka Core
    ■ Broker
    ■ Clients library (Producer, Consumer, Admin)
    ■ Management tools
    ○ Kafka Connect
    ○ Kafka Streams
    ○ Mirror Maker
    Apache Kafka ecosystem

  11. @tgrall
    Apache Kafka ecosystem
    ! Kafka Broker
    ○ Central component responsible for hosting topics and delivering messages
○ One or more brokers run in a cluster alongside a ZooKeeper ensemble
    ! Kafka Producers and Consumers
    ○ Java-based clients for sending and receiving messages
    ! Kafka Admin tools
○ Java- and Scala-based tools for managing Kafka brokers
    ○ Managing topics, ACLs, monitoring etc.
    Apache Kafka components

  12. @tgrall
    Kafka & Zookeeper
[Diagram: a ZooKeeper ensemble supporting a Kafka cluster, with applications and admin tools connecting to the brokers]

  13. @tgrall
    ! Kafka Connect
    ○ Framework for transferring data between Kafka and other data systems
    ○ Facilitate data conversion, scaling, load balancing, fault tolerance, …
○ Connector plugins are deployed into a Kafka Connect cluster
    ■ Well-defined API for creating new connectors (Sink/Source)
    ■ Apache Kafka itself includes only FileSink and FileSource plugins (reading records
    from file and posting them as Kafka messages / writing Kafka messages to files)
    ■ Many additional plugins are available outside of Apache Kafka
    Kafka ecosystem
    Apache Kafka components
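
    To make the FileSource plugin above concrete, here is a minimal sketch of a standalone connector configuration (connector name, file, and topic are hypothetical):

    name=local-file-source
    connector.class=FileStreamSource
    tasks.max=1
    file=/tmp/input.txt
    topic=connect-file-demo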

  14. @tgrall
    Kafka ecosystem
    Apache Kafka components
[Diagram: source systems → Connect API → Kafka topics → Connect API → sink systems, with the Streams API processing data in between]

  15. @tgrall
! Mirror Maker
    ○ Kafka clusters do not work well when split across multiple datacenters
    ■ Low bandwidth, high latency
    ■ With multiple datacenters, it is recommended to set up an independent cluster in each datacenter and mirror the data
    ○ Tool for replicating topics between different clusters
    Kafka ecosystem
    Apache Kafka components
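
    As an illustrative sketch (config file names and topic pattern are hypothetical), the MirrorMaker tool shipped with Kafka is driven by a consumer config for the source cluster and a producer config for the target cluster:

    bin/kafka-mirror-maker.sh \
      --consumer.config source-cluster.properties \
      --producer.config target-cluster.properties \
      --whitelist "orders.*"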

  16. @tgrall
    Across Data Centers
[Diagram: two independent clusters (Brokers 1..N, each hosting T1-P1 and T1-P2) in Data Center 1 / Geo location 1 and Data Center 2 / Geo location 2, each with local producers; Mirror Maker replicates topics between the two clusters]
    Mirror Maker

  17. @tgrall
    ! Clients for other languages
    ! REST Proxy for bridging between HTTP and Kafka
    ! Schema Registry
    ! Cluster Balancers
    ! Management and Monitoring consoles
    ! Kafka Connect plugins
    ! KSQL (Confluent)
! Kafka can be used with many other projects (e.g. Apache Spark, Apache Flink, Apache Storm)
    Kafka ecosystem
    Outside of Apache Kafka project

  18. @tgrall
    Topic & Partitions

  19. @tgrall
    ! Messages / records are sent to / received from topic
    ○ Topics are split into one or more partitions
    ○ Partition = Shard
    ○ All actual work is done on partition level, topic is just a virtual object
! Each message is written to exactly one selected partition
    ○ Partitioning is usually done based on the message key
    ○ Message ordering within the partition is fixed
    ! Clean-up policies
    ○ Based on size / message age
    ○ Compacted based on message key
    Topic & Partitions
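
    A minimal sketch of creating such a partitioned topic with the Java AdminClient (topic name, sizing, and broker address are hypothetical):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    try (AdminClient admin = AdminClient.create(props)) {
        // Topic with 3 partitions and replication factor 2 (illustrative values)
        admin.createTopics(Collections.singleton(new NewTopic("my-topic", 3, (short) 2))).all().get();
    }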

  20. @tgrall
    Topic & Partitions
[Diagram: a producer appends records to Partition 0 (offsets 0-11), Partition 1 (offsets 0-6), and Partition 2 (offsets 0-10); offsets grow from old to new]
    Producing messages

  21. @tgrall
    Topic & Partitions
[Diagram: a consumer reads records from Partition 0 (offsets 0-11), Partition 1 (offsets 0-6), and Partition 2 (offsets 0-10), moving from old offsets toward new ones]
    Consuming messages

  22. @tgrall
! Replicas are “backups” for a partition
    ○ Provide redundancy
    ! They are how Kafka guarantees availability and durability in case of node failures
    ! Two roles:
    ○ Leader: the replica used by producers/consumers for exchanging messages
    ○ Followers: all the other replicas
    ■ They don’t serve client requests
    ■ They replicate messages from the leader to stay “in-sync” (ISR)
    ○ A replica changes its role as brokers come and go
    Replication
    Leaders & Followers

  23. @tgrall
Partitions Distribution
    [Diagram: Brokers 1-3 each hosting replicas of partitions T1-P1, T1-P2, T2-P1, T2-P2, with leaders spread across the brokers]
    ! Leaders and followers spread across the cluster
    ○ producers/consumers connect to leaders
    ○ multiple connections needed for reading different partitions
    Leaders & Followers

  24. @tgrall
Partitions Distribution
    [Diagram: the same cluster after a broker holding leader partitions fails; new leaders are elected on the surviving brokers]
    ! A broker with leader partition goes down
    ! New leader partition is elected on different node
    Leaders & Followers

  25. @tgrall
! They are really “smart” (unlike in “traditional” messaging)
    ! Configured with a “bootstrap servers” list for fetching initial metadata
    ○ Where are the topics of interest? Connect to the brokers holding the partition leaders
    ○ Producer specifies the destination partition
    ○ Consumer handles the message offsets to read
    ○ If an error happens, refresh the metadata (something has changed in the cluster)
    ! Batching on producing/consuming
    Clients

  26. @tgrall
    ! Destination partition computed on client
    ○ Round robin
    ○ Specified by hashing the “key” in the message
    ○ Custom partitioning
    ! Writes messages to “leader” for a partition
! Acknowledgement:
    ○ No ack
    ○ Ack when the message is written to the “leader”
    ○ Ack when the message is also replicated to the “in-sync” replicas
    Producers
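
    A minimal producer sketch illustrating the points above (topic, key, and broker address are hypothetical):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
    props.put(ProducerConfig.ACKS_CONFIG, "all");   // ack once all in-sync replicas have the record

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
        // The default partitioner hashes the key: same key -> same partition
        producer.send(new ProducerRecord<>("my-topic", "user-42", "hello"));
    }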

  27. @tgrall
    ! Read from one (or more) partition(s)
! Track (commit) the offset for a given partition
    ○ A partitioned topic “__consumer_offsets” is used for that
    ○ Key → [group, topic, partition], Value → [offset]
    ○ Offset is shared inside the consumer group
    ! QoS
    ○ At most once : read message, commit offset, process message
    ○ At least once : read message, process message, commit offset
    ○ Exactly once : read message, commit message output and offset to a transactional system
    ! Gets only “committed” messages (depends on producer “ack” level)
    Consumers
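
    A minimal consumer sketch showing the “at least once” pattern above (group, topic, broker address, and the process() step are hypothetical):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.*;

    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

    try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
        consumer.subscribe(Collections.singletonList("my-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                process(record);       // hypothetical processing step
            }
            consumer.commitSync();     // commit after processing = at least once
        }
    }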

  28. @tgrall
Producers & Consumers
    [Diagram: producers and consumers connected to the leader replicas of the T1/T2 partitions spread across Brokers 1-3]
    Writing/Reading to/from leaders

  29. @tgrall
    ! The consumer asks for a specific partition (assign)
○ An application using one or more consumers has to handle such assignment on its own, as well as the scaling
    ! The consumer is part of a “consumer group”
    ○ Consumer groups are an easier way to scale up consumption
    ○ One of the consumers, as “group lead”, applies a strategy to assign partitions to consumers in
    the group
    ○ When new consumers join or leave, a rebalancing happens to reassign partitions
    ○ This allows pluggable strategies for partition assignment (e.g. stickiness)
    Consumer: partitions assignment
    Available approaches
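
    The two approaches side by side, as a sketch (assuming an existing KafkaConsumer<String, String> named consumer; topic name is hypothetical):

    // Manual assignment: the application picks partitions itself, no group rebalancing
    consumer.assign(Collections.singletonList(new TopicPartition("my-topic", 0)));

    // Group subscription: partitions are assigned and rebalanced via the group protocol
    consumer.subscribe(Collections.singletonList("my-topic"));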

  30. @tgrall
    ! Consumer Group
    ○ Grouping multiple consumers
○ Each consumer reads from a “unique” subset of partitions → max consumers = num partitions
    ○ They are “competing” consumers on the topic, each message delivered to one consumer
    ○ Messages with same “key” delivered to same consumer
! Multiple consumer groups
    ○ Allows publish/subscribe
    ○ Same messages delivered to different consumers in different consumer groups
    Consumer Groups

  31. @tgrall
Consumer Groups
    [Diagram: a topic with Partitions 0-3; Group 1 (two consumers) and Group 2 (three consumers) each consuming all four partitions]
    Partitions assignment

  32. @tgrall
Consumer Groups
    [Diagram: the same topic and groups after a consumer joins or leaves; partitions are reassigned across the remaining consumers]
    Rebalancing

  33. @tgrall
Consumer Groups
    [Diagram: a group with five consumers on a four-partition topic; four consumers get one partition each and the fifth stays idle]
    Max parallelism & idle consumer

  34. @tgrall
    ! Encryption between clients and brokers and between brokers
    ○ Using SSL
    ! Authentication of clients (and brokers) connecting to brokers
    ○ Using SSL (mutual authentication)
    ○ Using SASL (with PLAIN, Kerberos or SCRAM-SHA as mechanisms)
! Authorization of read/write operations by clients
    ○ ACLs on resources such as topics
    ○ Authenticated “principal” used for configuring ACLs
    ○ Pluggable
    ! It’s possible to mix encryption/no-encryption and authentication/no-authentication
    Security
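
    For illustration, a client combining SSL encryption with SASL/SCRAM authentication would carry properties along these lines (paths and credentials are hypothetical):

    security.protocol=SASL_SSL
    ssl.truststore.location=/path/to/truststore.jks
    ssl.truststore.password=changeit
    sasl.mechanism=SCRAM-SHA-512
    sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
      username="alice" password="alice-secret";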

  35. @tgrall
    Stream Processing

  36. @tgrall
Streaming technology is enabling the obvious:
    continuous processing on data that is continuously produced

  37. @tgrall
    Processing
    ! Request/Response
    ! Batch
    ! Stream Processing
    ! Real-time reaction to events
    ! Continuous applications
! Process both real-time and historical data

  38. @tgrall
    Stream Processing
    ! Validation
    ! Transformation
    ! Enrichment
    ! Deduplication
    ! Aggregations
    ! Joins
    ! Windowing

  39. @tgrall
    Kafka Streams

  40. @tgrall
    Kafka Streams
    Kafka Streams is a client library for building applications and microservices,
    where the input and output data are stored in Kafka clusters. It combines the
    simplicity of writing and deploying standard Java and Scala applications on the
    client side with the benefits of Kafka's server-side cluster technology.

  41. @tgrall
    ! Kafka Streams
    ○ Stream processing framework
    ○ Streams are Kafka topics (as input and output)
    ○ It’s really just a Java library to include in your application
    ○ Scaling the stream application horizontally
○ Creates a topology of processing nodes (filter, map, join, etc.) acting on a stream
    ■ Low level processor API
    ■ High level DSL
    ■ Using “internal” topics (when re-partitioning is needed or for “stateful”
    transformations)
    Kafka ecosystem
    Apache Kafka components

  42. @tgrall
    Kafka Streams API
[Diagram: a Streams API application consuming from Kafka topics, processing records through chained processing steps, and producing to Kafka topics]

  43. @tgrall
    Sample: Word count

import java.util.Arrays;
    import java.util.Locale;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.*;

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> source = builder.stream("streams-plaintext-input");
    KTable<String, Long> counts = source
        .flatMapValues(value -> Arrays.asList(value.toLowerCase(Locale.getDefault()).split(" ")))
        .groupBy((key, value) -> value)
        .count();
    // Serdes = Serializer/Deserializer
    counts.toStream().to("streams-wordcount-output", Produced.with(Serdes.String(), Serdes.Long()));

    Source: Apache Kafka

  44. @tgrall
    Kafka Streams Architecture
    Source: Confluent

  45. @tgrall
    Aggregations
    ! Aggregate
    ! Reduce
    ! Count
! Based on a “key” (grouping) or on time/session (windowing)
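
    A small Kafka Streams sketch of these operations (assuming a KStream<String, Long> named stream; all names are illustrative):

    KGroupedStream<String, Long> grouped = stream.groupByKey();

    KTable<String, Long> counts = grouped.count();                       // count per key
    KTable<String, Long> sums   = grouped.reduce((v1, v2) -> v1 + v2);   // reduce per key
    KTable<String, Long> maxes  = grouped.aggregate(
        () -> Long.MIN_VALUE,                                            // initializer
        (key, value, agg) -> Math.max(value, agg));                      // aggregator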

  46. @tgrall
    Windowing
Window name          | Behavior      | Short description
    Tumbling time window | Time-based    | Fixed-size, non-overlapping, gap-less windows
    Hopping time window  | Time-based    | Fixed-size, overlapping windows
    Sliding time window  | Time-based    | Fixed-size, overlapping windows that work on differences between record timestamps
    Session window       | Session-based | Dynamically-sized, non-overlapping, data-driven windows
    Windowing lets you control how to group records
    Source: Confluent

  47. @tgrall
    Tumbling time window
    Source: Confluent

  48. @tgrall
    Windowing
// Key (String) is user ID, value (Avro record) is the page view event for that user.
    // Such a data stream is often called a “clickstream”.
    KStream<String, GenericRecord> pageViews = ...;
    // Count page views per window, per user, with tumbling windows of size 5 minutes
    KTable<Windowed<String>, Long> windowedPageViewCounts = pageViews
        .groupByKey(Serialized.with(Serdes.String(), genericAvroSerde))
        .windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(5)))
        .count();
    Source: Confluent

  49. @tgrall
    Joins
Join operands           | Type         | (INNER) JOIN  | LEFT JOIN     | OUTER JOIN
    KStream-to-KStream      | Windowed     | Supported     | Supported     | Supported
    KTable-to-KTable        | Non-windowed | Supported     | Supported     | Supported
    KStream-to-KTable       | Non-windowed | Supported     | Supported     | Not Supported
    KStream-to-GlobalKTable | Non-windowed | Supported     | Supported     | Not Supported
    KTable-to-GlobalKTable  | N/A          | Not Supported | Not Supported | Not Supported
    Source: Confluent

  50. @tgrall
    Joins
KStream<String, Long> left = ...;
    KTable<String, Double> right = ...;
    // Java 8+ example, using lambda expressions
    KStream<String, String> joined = left.join(right,
        (leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue,
        Joined.keySerde(Serdes.String()) /* key */
            .withValueSerde(Serdes.Long()) /* left value */
    );
    Source: Confluent

  51. @tgrall
    KSQL

  52. @tgrall
    Kafka History
    Source: Confluent

  53. @tgrall
    KSQL
[Diagram: KSQL issues CREATE STREAM / CREATE TABLE / SELECT … statements for data processing, reading and writing Kafka topics underneath]

  54. @tgrall
    KSQL
[Diagram: KSQL issues CREATE STREAM / CREATE TABLE / SELECT … statements for data processing, reading and writing Kafka topics underneath]
    KSQL is built on Kafka Streams

  55. @tgrall
    KSQL: Select statement syntax
    SELECT select_expr [, ...]
    FROM from_item
    [ LEFT JOIN join_table ON join_criteria ]
    [ WINDOW window_expression ]
    [ WHERE condition ]
    [ GROUP BY grouping_expression ]
    [ HAVING having_expression ]
    [ LIMIT count ];
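
    A concrete instance of this syntax, as a sketch (stream, columns, and topic names are hypothetical; KSQL circa 2018):

    CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
      WITH (kafka_topic='pageviews', value_format='JSON');

    SELECT userid, COUNT(*)
      FROM pageviews
      WINDOW TUMBLING (SIZE 5 MINUTES)
      GROUP BY userid;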

  56. @tgrall
    KSQL: What for?
    ! Data Exploration: an easy way to look at topics
    ! Data Enrichment/ETL: join multiple topics
    ! Anomaly Detection: for example using windowing
! Real-Time Monitoring/Alerting: find errors when they happen

  57. @tgrall
    Kafka & KSQL
    Demonstration

  58. @tgrall
    Introduction to
    Apache Kafka
    &
    KSQL
    Tugdual Grall
    Product Management
    Red Hat
    @tgrall
