[SF JUG] Apache Kafka — A Streaming Data Platform

When it comes time to choose a distributed messaging system, everyone knows the answer: Apache Kafka. But how about when you’re on the hook to choose a world-class, horizontally scalable stream data processing system? When you need not just publish and subscribe messaging, but also long-term storage, a flexible integration framework, and a means of deploying real-time stream processing applications at scale without having to integrate a number of different pieces of infrastructure yourself? The answer is still Apache Kafka.

In this talk, we’ll take a rapid-fire tour of the breadth of Kafka as a streaming data platform. We’ll look at its internal architecture, including how it partitions messaging workloads in a fault-tolerant way. We’ll learn how it provides message durability. We’ll look at its approach to pub/sub messaging. We’ll even take a peek at how Kafka Connect provides code-free, scalable, fault-tolerant integration, and how the Streams API provides a complete framework for computation over all the streaming data in your cluster.

Viktor Gamov

May 21, 2018

Transcript

  1. @gamussa @sfjava @confluentinc Who am I? Solutions Architect, Developer Advocate. @gamussa in internetz. Hey you, yes, you, go follow me on Twitter.
  2. Kafka is a Streaming Platform: The Log, Connectors, Producer, Consumer, Streaming Engine
  3. Origins in Stream Processing: Kafka; High Throughput Messaging; API based clustering; Kafka Streams / KSQL; Continuous Computation; Serving Layer (Cassandra, KV-storage etc.)
  4. What exactly is Stream Processing? Input stream: authorization_attempts. Output stream: possible_fraud.

    CREATE STREAM possible_fraud AS
      SELECT card_number, count(*)
      FROM authorization_attempts
      WINDOW TUMBLING (SIZE 5 MINUTE)
      GROUP BY card_number
      HAVING count(*) > 3;
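
To make the windowed query concrete, here is a minimal plain-Java sketch of the same idea: count events per card number in non-overlapping 5-minute windows and keep only the groups with more than 3 attempts. This is not KSQL or Kafka Streams code. The class `TumblingCountSketch`, the `Attempt` type, and the sample data are hypothetical, and the sketch works on a finite list, whereas a stream processor evaluates the query continuously as events arrive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TumblingCountSketch {
    static final long WINDOW_MS = 5 * 60 * 1000L; // SIZE 5 MINUTE

    /** One authorization attempt: card number plus event time in milliseconds. */
    static final class Attempt {
        final String cardNumber;
        final long timestampMs;

        Attempt(String cardNumber, long timestampMs) {
            this.cardNumber = cardNumber;
            this.timestampMs = timestampMs;
        }
    }

    /** Returns "card@windowStart" -> count for every group with more than 3 attempts. */
    static Map<String, Integer> possibleFraud(List<Attempt> attempts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Attempt a : attempts) {
            // Tumbling windows do not overlap: each event falls into exactly one window.
            long windowStart = (a.timestampMs / WINDOW_MS) * WINDOW_MS;
            counts.merge(a.cardNumber + "@" + windowStart, 1, Integer::sum);
        }
        counts.values().removeIf(c -> c <= 3); // HAVING count(*) > 3
        return counts;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            attempts.add(new Attempt("1111", i * 1000L)); // four attempts in the first window
        }
        attempts.add(new Attempt("2222", 0L)); // a single attempt, filtered out
        System.out.println(possibleFraud(attempts));
    }
}
```

Keying the counts by card number and window start is what makes the window "tumbling": an attempt at minute 6 starts a fresh count instead of extending the previous one.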
  10. What is a Streaming Platform? The Log, Connectors, Producer, Consumer, Streaming Engine
  11. The log is a type of durable messaging system. Similar to a traditional messaging system (ActiveMQ, RabbitMQ, etc.) but with: (a) far better scalability, (b) built-in fault tolerance / HA, (c) storage
  12. The log is a simple idea: messages are added at the end of the log (old to new)
  13. Consumers have a position all of their own: Sally is here, George is here, Fred is here. Each consumer scans the log from old to new.
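
The two ideas above (append at the end, one position per consumer) fit in a few lines of plain Java. This is an in-memory illustration only, not how Kafka stores data; `LogSketch`, `append`, and `poll` are hypothetical names chosen for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LogSketch {
    private final List<String> messages = new ArrayList<>();        // append-only
    private final Map<String, Integer> positions = new HashMap<>(); // consumer -> next offset

    /** Appends at the end of the log and returns the new message's offset. */
    int append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Returns what this consumer has not read yet and advances only its own position. */
    List<String> poll(String consumer) {
        int from = positions.getOrDefault(consumer, 0);
        List<String> batch = new ArrayList<>(messages.subList(from, messages.size()));
        positions.put(consumer, messages.size());
        return batch;
    }

    public static void main(String[] args) {
        LogSketch log = new LogSketch();
        log.append("a");
        log.append("b");
        System.out.println(log.poll("Sally")); // Sally reads a and b
        log.append("c");
        System.out.println(log.poll("Sally")); // Sally reads only c
        System.out.println(log.poll("Fred"));  // Fred starts at the oldest message
    }
}
```

Because reading never removes anything from the log, a slow or late consumer such as Fred does not affect what Sally sees.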
  14. Shard data to get scalability. Messages from Producer (1), Producer (2), and Producer (3) are sent to different partitions. Partitions live on different machines in a cluster.
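
One common way to decide which partition a message goes to is to hash its key, so that all messages with the same key land on the same partition. The sketch below shows that idea in plain Java. It is an assumption-laden stand-in: Kafka's default partitioner hashes the serialized key with murmur2, while this sketch uses `Arrays.hashCode` for brevity, and `PartitionSketch` and `partitionFor` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PartitionSketch {
    /** Maps a key to a partition in [0, numPartitions); the same key always maps to the same partition. */
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8)); // stand-in hash, not murmur2
        return Math.floorMod(hash, numPartitions); // floorMod keeps the result non-negative
    }

    public static void main(String[] args) {
        String[] keys = {"card-1", "card-2", "card-3", "card-1"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, 4));
        }
    }
}
```

Since each partition is an independent log, adding partitions (and machines to host them) spreads the write load, while per-key ordering is preserved within a partition.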
  15. Partition Leadership and Replication: Brokers 1 to 4 host the replicas of Topic1 partitions 1 to 4. Each partition has three replicas on different brokers, one Leader and the others Followers.
  16. Partition Leadership and Replication - node failure: the same layout of Brokers 1 to 4 and Topic1 partitions 1 to 4, with Leader and Follower replicas.
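
The node-failure slide can be read as: when the broker hosting a partition's leader goes away, one of the followers takes over. A deliberately simplified Java sketch of that rule follows. It assumes the new leader is just the first replica in the assignment that sits on a live broker; real Kafka additionally restricts the choice to in-sync replicas and coordinates the election through a controller. `FailoverSketch`, `leaderFor`, and the replica assignment are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FailoverSketch {
    /** Returns the first replica hosted on a live broker, or -1 if no replica is left. */
    static int leaderFor(List<Integer> replicas, Set<Integer> liveBrokers) {
        for (int broker : replicas) {
            if (liveBrokers.contains(broker)) {
                return broker;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Integer> partition1Replicas = Arrays.asList(1, 2, 3); // hypothetical assignment
        Set<Integer> live = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        System.out.println(leaderFor(partition1Replicas, live)); // broker 1 leads
        live.remove(1);                                          // broker 1 fails
        System.out.println(leaderFor(partition1Replicas, live)); // broker 2 takes over
    }
}
```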
  17. Linearly Scalable Architecture. Single topic: many producer machines, many consumer machines, many broker machines. No bottleneck!
  18. The Connect API: The Log, Connectors, Producer, Consumer, Streaming Engine
  19. Ingest/Egest data from/to data sources: Amazon S3, Elasticsearch, HDFS, JDBC, Couchbase, Cassandra, Oracle, SAP, Vertica, Blockchain, JMX, Kinesis, MongoDB, MQTT, NATS, Postgres, RabbitMQ, Redis, Twitter, DynamoDB, FTP, GitHub, BigQuery, Google Pub/Sub, RethinkDB, Salesforce, Solr, Splunk
  20. Kafka Streams and KSQL: The Log, Connectors, Producer, Consumer, Streaming Engine
  21. Engine for Continuous Computation

    SELECT card_number, count(*)
    FROM authorization_attempts
    WINDOW TUMBLING (SIZE 5 MINUTE)
    GROUP BY card_number
    HAVING count(*) > 3;
  22. But it’s just an API

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // Read "caterpillars", transform each record, write the result to "butterflies".
        builder.stream("caterpillars")
               .map(StreamsApp::coolTransformation)
               .to("butterflies");

        // Build the topology and start processing.
        new KafkaStreams(builder.build(), props()).start();
    }
  23. Join Streams and Tables. In Kafka: Topic, Compacted Topic. In Kafka Streams / KSQL: Stream, Table, Join.
  24. Windows / Retention: Handle Late Events. In an asynchronous world, will the payment come first, or the order? KAFKA: Payments, Orders; Join by Key; Buffer 5 mins; Emailer
  25. Windows / Retention: Handle Late Events. KAFKA: Payments, Orders; Join by Key; Buffer 5 mins; Emailer

    KStream orders = builder.stream("Orders");
    KStream payments = builder.stream("Payments");

    orders.join(payments,
                KeyValue::new,
                JoinWindows.of(1 * MIN))
          .peek((key, pair) -> emailer.sendMail(pair));
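
What "join by key within a window" means can be shown without Kafka at all. The plain-Java sketch below pairs an order with a payment when they share a key and their timestamps are at most `windowMs` apart, regardless of which one arrived first. It is a batch illustration under stated assumptions, not the Kafka Streams implementation; `WindowedJoinSketch`, `Event`, and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WindowedJoinSketch {
    /** A keyed event with an event-time timestamp in milliseconds. */
    static final class Event {
        final String key;
        final String value;
        final long timestampMs;

        Event(String key, String value, long timestampMs) {
            this.key = key;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    /** Joins orders and payments that share a key and lie within windowMs of each other. */
    static List<String> join(List<Event> orders, List<Event> payments, long windowMs) {
        List<String> joined = new ArrayList<>();
        for (Event o : orders) {
            for (Event p : payments) {
                boolean sameKey = o.key.equals(p.key);
                // Symmetric window: the payment may come before or after the order.
                boolean inWindow = Math.abs(o.timestampMs - p.timestampMs) <= windowMs;
                if (sameKey && inWindow) {
                    joined.add(o.key + ":" + o.value + "+" + p.value);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Event> orders = Arrays.asList(
                new Event("o1", "order-1", 0L),
                new Event("o2", "order-2", 0L));
        List<Event> payments = Arrays.asList(
                new Event("o1", "pay-1", 30_000L),   // 30 s after its order: joins
                new Event("o2", "pay-2", 120_000L)); // 2 min after its order: too late
        System.out.println(join(orders, payments, 60_000L));
    }
}
```

A streaming engine cannot scan both lists like this, so it buffers recent events from each side for the length of the window, which is the role of the "Buffer" box in the diagram.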
  26. A KTable is just a stream with infinite retention. KAFKA: Orders, Payments, Customers; Join; Emailer
  27. A KTable is a stream with infinite retention. Materialize a table in two lines of code! KAFKA: Orders, Payments, Customers; Join; Emailer

    KStream orders = builder.stream("Orders");
    KStream payments = builder.stream("Payments");
    KTable customers = builder.table("Customers");

    orders.join(payments, EmailTuple::new, JoinWindows.of(1 * MIN))
          .join(customers, (tuple, cust) -> tuple.setCust(cust))
          .peek((key, tuple) -> emailer.sendMail(tuple));
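
Conceptually, a table built from a stream keeps only the latest value per key, and a stream-table join is a lookup against that current state. The plain-Java sketch below illustrates both steps with a `HashMap`; it is not the Kafka Streams implementation, and `TableSketch`, `upsert`, and `enrich` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

final class TableSketch {
    private final Map<String, String> customers = new HashMap<>(); // latest value per key

    /** Applies one record from the customers stream; a later record for a key overwrites the earlier one. */
    void upsert(String customerId, String details) {
        customers.put(customerId, details);
    }

    /** Stream-table join: enriches an event with the current row for its key, or returns null if none exists. */
    String enrich(String customerId, String event) {
        String details = customers.get(customerId);
        return details == null ? null : event + " for " + details;
    }

    public static void main(String[] args) {
        TableSketch table = new TableSketch();
        table.upsert("c1", "Ann, ann@old.example");
        table.upsert("c1", "Ann, ann@new.example"); // update replaces the earlier row
        System.out.println(table.enrich("c1", "order-1 paid"));
        System.out.println(table.enrich("c2", "order-2 paid")); // no customer row: null
    }
}
```

Unlike the windowed stream-stream join, there is no time bound here: the lookup sees whatever the table holds when the event arrives, which is why the table side needs the full history compacted down to one row per key.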
  28. Kafka is a complete Streaming Platform: The Log, Connectors, Producer, Consumer, Streaming Engine