[VirtualJUG] Apache Kafka — A Streaming Data Platform

When it comes time to choose a distributed messaging system, everyone knows the answer: Apache Kafka. But how about when you’re on the hook to choose a world-class, horizontally scalable stream data processing system? When you need not just publish and subscribe messaging, but also long-term storage, a flexible integration framework, and a means of deploying real-time stream processing applications at scale without having to integrate a number of different pieces of infrastructure yourself? The answer is still Apache Kafka.

In this talk, we’ll take a rapid-fire tour of the breadth of Kafka as a streaming data platform. We’ll look at its internal architecture, including how it partitions messaging workloads in a fault-tolerant way. We’ll learn how it provides message durability. We’ll look at its approach to pub/sub messaging. We’ll even take a peek at how Kafka Connect provides code-free, scalable, fault-tolerant integration, and how the Streams API provides a complete framework for computation over all the streaming data in your cluster.

Viktor Gamov

August 07, 2018

Transcript

  1. @gamussa @virtualJUG @confluentinc

  2.

  3. Who am I? Solutions Architect, Developer Advocate. @gamussa in the
     internetz. Hey you, yes, you, go follow me on Twitter :)
  4. Kafka & Confluent

  5. We are hiring! https://www.confluent.io/careers/

  6. A company is built on DATA FLOWS, but all we have is DATA STORES
  7.

  8.

  9.
  10. Origins in Stream Processing: high-throughput messaging, API-based
      clustering, continuous computation (Kafka Streams / KSQL), and a
      serving layer (Cassandra, KV storage, cache, etc.)

  11. Streaming Platform: 1. Pub/Sub 2. Store 3. Process

  12. Kafka is a Streaming Platform: The Log, Connectors, Producer,
      Consumer, Streaming Engine

  13. What exactly is Stream Processing? authorization_attempts ->
      possible_fraud

  14. What exactly is Stream Processing?
      CREATE STREAM possible_fraud AS
        SELECT card_number, count(*)
        FROM authorization_attempts
        WINDOW TUMBLING (SIZE 5 MINUTE)
        GROUP BY card_number
        HAVING count(*) > 3;
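The windowed aggregation in that query is ordinary counting logic. As a rough plain-Java sketch of what the engine computes, folding over a finite list instead of running continuously over the stream (the class name and the bucket arithmetic here are ours, not KSQL internals):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Counts authorization attempts per card in 5-minute tumbling windows
// and flags any (card, window) pair seen more than 3 times.
class FraudCheck {
    static final long WINDOW_MS = 5 * 60 * 1000L;

    // Each attempt is (cardNumber, timestampMillis).
    static List<String> possibleFraud(List<Map.Entry<String, Long>> attempts) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Long> a : attempts) {
            long window = a.getValue() / WINDOW_MS; // tumbling window id
            counts.merge(a.getKey() + "@" + window, 1, Integer::sum);
        }
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 3) flagged.add(e.getKey());
        }
        return flagged;
    }
}
```

The difference in KSQL is that this computation never terminates: results update as new authorization attempts arrive.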
  20. Streaming is the toolset for dealing with events as they move!

  21. What is a Streaming Platform? The Log, Connectors, Producer,
      Consumer, Streaming Engine

  22. Kafka's Distributed Log

  23. The log is a type of durable messaging system. Similar to a
      traditional messaging system (ActiveMQ, RabbitMQ, etc.) but with:
      (a) far better scalability, (b) built-in fault tolerance / HA,
      (c) storage

  24. The log is a simple idea: messages are added at the end of the log
      (old -> new)
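The append-only log can be sketched in a few lines of plain Java (an in-memory stand-in for Kafka's on-disk, replicated log; the class name is ours):

```java
import java.util.ArrayList;
import java.util.List;

// In-memory sketch of an append-only log: records only ever go on the
// end, and each gets a monotonically increasing offset.
class AppendOnlyLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns the offset it was assigned.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Records are immutable once written; reads are by offset.
    String read(long offset) {
        return records.get((int) offset);
    }

    long endOffset() {
        return records.size();
    }
}
```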
  25. Consumers have a position all of their own: Sally is here, George
      is here, Fred is here (old -> new, each scanning forward)

  26. Only sequential access: read to offset & scan (old -> new)
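Because reads are sequential scans from an offset, each consumer only needs to remember a single number. A minimal sketch in plain Java (in real Kafka, committed offsets are stored in an internal topic rather than in broker memory):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each consumer keeps its own position in a shared log, so Sally,
// George, and Fred can read the same records independently.
class ConsumerPositions {
    private final List<String> log;
    private final Map<String, Integer> positions = new HashMap<>();

    ConsumerPositions(List<String> log) {
        this.log = log;
    }

    // Returns the next record for this consumer and advances its
    // position, or null when the consumer has reached the end.
    String poll(String consumer) {
        int pos = positions.getOrDefault(consumer, 0);
        if (pos >= log.size()) return null;
        positions.put(consumer, pos + 1);
        return log.get(pos);
    }
}
```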
  27. Scaling Out

  28. Shard data to get scalability: messages are sent to different
      partitions. Producers (1), (2), (3) write to a cluster of machines;
      partitions live on different machines
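Sharding is driven by the record key: the same key always maps to the same partition, which is what preserves per-key ordering while the topic scales out. Kafka's default partitioner hashes the serialized key with murmur2; a plain `hashCode` stands in for it in this illustrative sketch:

```java
// Key-based partitioning sketch: hash the key and take it modulo the
// partition count, so records with the same key land on the same machine.
class KeyPartitioner {
    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the result is never negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```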
  29. Replicate to get fault tolerance: the leader on Machine A
      replicates each msg to Machine B

  30. Partition Leadership and Replication: Topic1's partitions 1-4 are
      spread across Brokers 1-4, each partition with one leader replica
      and follower replicas on other brokers

  31. Replication provides resiliency: a 'replica' takes over on machine
      failure

  32. Partition Leadership and Replication, node failure: when a broker
      fails, a follower replica on another broker becomes leader for its
      partitions
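The failover idea itself is small: each partition has an ordered set of in-sync replicas, and when the leader's broker dies, the next in-sync replica is promoted. A toy sketch of just that idea (real Kafka elects leaders through the controller and tracks the in-sync replica set dynamically; the class and method names are ours):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Toy model of partition leadership: the first in-sync replica is the
// leader; on failure, the next one takes over.
class PartitionLeadership {
    private final Deque<String> inSyncReplicas;

    PartitionLeadership(String... brokers) {
        this.inSyncReplicas = new ArrayDeque<>(List.of(brokers));
    }

    String leader() {
        return inSyncReplicas.peekFirst();
    }

    // Simulates the current leader's broker failing: drop it from the
    // in-sync set and promote the next replica.
    String failLeader() {
        inSyncReplicas.removeFirst();
        return leader();
    }
}
```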
  33. Linearly Scalable Architecture. Single topic: many producer
      machines, many consumer machines, many broker machines. No
      bottleneck!!

  34. Worldwide, localized views: NY, London, Tokyo, linked by Replicator

  35. The Connect API

  36. Ingest / egest into any data source with Kafka Connect

  37. Ingest/egest data from/to data sources: Amazon S3, Elasticsearch,
      HDFS, JDBC, Couchbase, Cassandra, Oracle, SAP, Vertica, Blockchain,
      JMX, Kinesis, MongoDB, MQTT, NATS, Postgres, RabbitMQ, Redis,
      Twitter, Bintray, DynamoDB, FTP, GitHub, BigQuery, Google Pub/Sub,
      RethinkDB, Salesforce, Solr, Splunk
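Configuration-driven integration is what "code-free" means here: a connector is a set of properties, not a program. As a hedged example, a JDBC source connector in Connect's standalone properties format might look roughly like this (property names follow Confluent's JDBC connector documentation as we recall it; the connection URL, column name, and prefix are placeholders):

```
name=jdbc-orders-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:postgresql://localhost:5432/shop
mode=incrementing
incrementing.column.name=id
topic.prefix=postgres-
```

Connect runs the connector across its workers, handling scaling and fault tolerance itself.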
  38. Kafka Streams and KSQL

  39. Engine for Continuous Computation:
      SELECT card_number, count(*)
      FROM authorization_attempts
      WINDOW TUMBLING (SIZE 5 MINUTE)
      GROUP BY card_number
      HAVING count(*) > 3;

  40. But it's just an API:
      public static void main(String[] args) {
          StreamsBuilder builder = new StreamsBuilder();
          builder.stream("caterpillars")
                 .map(StreamsApp::coolTransformation)
                 .to("butterflies");
          new KafkaStreams(builder.build(), props()).start();
      }
  41. Join Streams and Tables: a Kafka topic feeds a Stream, a compacted
      Kafka topic feeds a Table, and Kafka Streams / KSQL joins them

  42. Windows / Retention – handle late events. In an asynchronous world,
      will the payment come first, or the order? Join by key: Payments
      and Orders are buffered 5 mins in Kafka ahead of the Emailer

  43. Windows / Retention – handle late events. Join by key:
      KStream orders = builder.stream("Orders");
      KStream payments = builder.stream("Payments");
      orders.join(payments, KeyValue::new, JoinWindows.of(1 * MIN))
            .peek((key, pair) -> emailer.sendMail(pair));
  44. A KTable is just a stream with infinite retention: join Orders,
      Payments, and Customers in Kafka, feeding the Emailer

  45. A KTable is a stream with infinite retention. Materialize a table
      in two lines of code!
      KStream orders = builder.stream("Orders");
      KStream payments = builder.stream("Payments");
      KTable customers = builder.table("Customers");
      orders.join(payments, EmailTuple::new, JoinWindows.of(1 * MIN))
            .join(customers, (tuple, cust) -> tuple.setCust(cust))
            .peek((key, tuple) -> emailer.sendMail(tuple));
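The "stream with infinite retention" framing has a concrete meaning: replay every (key, value) update and keep only the latest value per key, and you have materialized a table. A plain-Java sketch of that idea (not the Kafka Streams implementation; the class name is ours):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Materializes a table from a changelog stream: later updates for a key
// overwrite earlier ones, which is exactly how a KTable reads a topic.
class ChangelogTable {
    static Map<String, String> materialize(List<String[]> changelog) {
        Map<String, String> table = new HashMap<>();
        for (String[] kv : changelog) {
            table.put(kv[0], kv[1]); // kv[0] = key, kv[1] = latest value
        }
        return table;
    }
}
```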
  46. Kafka is a complete Streaming Platform: The Log, Connectors,
      Producer, Consumer, Streaming Engine

  47. Find your local Meetup group: https://cnfl.io/kafka-meetups
      Join us in Slack: http://cnfl.io/slack
      Grab Stream Processing books: https://cnfl.io/book-bundle

  48. www.kafka-summit.org, promo: Gamov20

  49. https://www.confluent.io/download/

  50. One more thing…

  51.

  52.

  53. A Major New Paradigm

  54. Thanks! @gamussa viktor@confluent.io We are hiring!
      https://www.confluent.io/careers/