[SF JUG] Apache Kafka — A Streaming Data Platform

When it comes time to choose a distributed messaging system, everyone knows the answer: Apache Kafka. But how about when you’re on the hook to choose a world-class, horizontally scalable stream data processing system? When you need not just publish and subscribe messaging, but also long-term storage, a flexible integration framework, and a means of deploying real-time stream processing applications at scale without having to integrate a number of different pieces of infrastructure yourself? The answer is still Apache Kafka.

In this talk, we’ll make a rapid-fire review of the breadth of Kafka as a streaming data platform. We’ll look at its internal architecture, including how it partitions messaging workloads in a fault-tolerant way. We’ll learn how it provides message durability. We’ll look at its approach to pub/sub messaging. We’ll even take a peek at how Kafka Connect provides code-free, scalable, fault-tolerant integration, and how the Streams API provides a complete framework for computation over all the streaming data in your cluster.

Viktor Gamov

May 21, 2018

Transcript

  1. @
    Apache Kafka
    A Streaming Data Platform
    and
    #javapuzzlersng
    @gamussa @sfjava @confluentinc

  2. @

  3. @

  4. @
    Who am I?

  5. @
    Solutions Architect
    Who am I?

  6. @
    Solutions Architect
    Developer Advocate
    Who am I?

  7. @
    Solutions Architect
    Developer Advocate
    @gamussa in internetz
    Who am I?

  8. @
    Solutions Architect
    Developer Advocate
    @gamussa in internetz
    Hey you, yes, you,
    go follow me on Twitter
    Who am I?

  9. @
    Kafka & Confluent

  10. @
    We are hiring!
    https://www.confluent.io/careers/

  11. @

  12. @
    A company is built on

  13. @
    A company is built on
    DATA FLOWS
    but
    All we have is
    DATA STORES

  14. @

  15. @

  16. @

  17. @
    Kafka is a Streaming Platform
    The Log Connectors
    Connectors
    Producer Consumer
    Streaming Engine

  18. @
    Kafka
    Serving Layer (Cassandra, KV-storage etc.)
    Kafka Streams / KSQL
    Continuous Computation
    High Throughput Messaging
    API based clustering
    Origins in Stream Processing

  19. @
    authorization_attempts possible_fraud
    What exactly is Stream Processing?

  20. @
    CREATE STREAM possible_fraud AS
    SELECT card_number, count(*)
    FROM authorization_attempts
    WINDOW TUMBLING (SIZE 5 MINUTE)
    GROUP BY card_number
    HAVING count(*) > 3;
    authorization_attempts possible_fraud
    What exactly is Stream Processing?
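
One way to read the KSQL statement on this slide: put each authorization attempt into a fixed five-minute bucket, count attempts per card within each bucket, and emit the cards whose count exceeds 3. The following is a minimal plain-Java sketch of that logic. It uses no Kafka classes, and the `Attempt` type, the `possibleFraud` method, and the sample data are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of a tumbling-window count; not KSQL or Kafka Streams code.
class TumblingWindowCount {
    static final long WINDOW_MS = 5 * 60 * 1000L; // corresponds to SIZE 5 MINUTE

    // Hypothetical event type: one authorization attempt.
    static final class Attempt {
        final String cardNumber;
        final long timestampMs;

        Attempt(String cardNumber, long timestampMs) {
            this.cardNumber = cardNumber;
            this.timestampMs = timestampMs;
        }
    }

    // Returns "card@windowStart" for every (card, window) pair with more than 3 attempts.
    static List<String> possibleFraud(List<Attempt> attempts) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> flagged = new ArrayList<>();
        for (Attempt a : attempts) {
            // Tumbling windows are fixed-size and do not overlap.
            long windowStart = (a.timestampMs / WINDOW_MS) * WINDOW_MS;
            String key = a.cardNumber + "@" + windowStart;
            int count = counts.merge(key, 1, Integer::sum);
            if (count == 4) { // the first moment "count(*) > 3" becomes true
                flagged.add(key);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        for (int i = 0; i < 4; i++) attempts.add(new Attempt("1111", i * 1000L)); // 4 attempts, first window
        for (int i = 0; i < 3; i++) attempts.add(new Attempt("2222", i * 1000L)); // only 3 attempts
        attempts.add(new Attempt("2222", WINDOW_MS + 1));                          // lands in the next window
        System.out.println(possibleFraud(attempts));
    }
}
```

A streaming engine would keep these counts in fault-tolerant state and update them as events arrive; the in-memory map here only stands in for that state.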

  26. @
    Streaming is the toolset for dealing with events as they move!

  27. @
    What is a Streaming Platform?
    The Log Connectors
    Connectors
    Producer Consumer
    Streaming Engine

  28. @
    Kafka’s Distributed Log
    The Log Connectors
    Connectors
    Producer Consumer
    Streaming Engine

  29. @
    The log is a type of durable messaging system

  30. @
    Similar to a traditional messaging system (ActiveMQ, RabbitMQ, etc.)
    but with:
    (a) Far better scalability
    (b) Built-in fault tolerance / HA
    (c) Storage
    The log is a type of durable messaging system

  31. @
    The log is a simple idea
    Messages are added
    at the end of the log
    Old New

  32. @
    Consumers have a position all of their own
    Sally is here
    George is here
    Fred is here
    Old New
    Scan Scan Scan

  33. @
    Only Sequential Access
    Old New
    Read to offset & scan
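
The last three slides describe the log as append-only, read sequentially, with each consumer holding its own position. A small in-memory sketch of those three properties follows. The `SimpleLog` class and its `append`, `seek`, and `poll` methods are hypothetical names chosen for this illustration and are not Kafka's client API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration of an append-only log; not Kafka's implementation.
class SimpleLog {
    private final List<String> messages = new ArrayList<>();
    private final Map<String, Integer> positions = new HashMap<>(); // consumer name -> next offset to read

    // Messages are only ever added at the end; the return value is the new message's offset.
    int append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Move a consumer to an offset so that its next read scans forward from there.
    void seek(String consumer, int offset) {
        if (offset < 0 || offset > messages.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        positions.put(consumer, offset);
    }

    // Sequential read: return up to max messages from the consumer's position and advance it.
    List<String> poll(String consumer, int max) {
        int from = positions.getOrDefault(consumer, 0);
        int to = Math.min(messages.size(), from + max);
        positions.put(consumer, to);
        return new ArrayList<>(messages.subList(from, to));
    }

    public static void main(String[] args) {
        SimpleLog log = new SimpleLog();
        for (String m : new String[] {"a", "b", "c", "d"}) log.append(m);
        System.out.println(log.poll("Sally", 3)); // Sally starts at the oldest message
        log.seek("Fred", 2);
        System.out.println(log.poll("Fred", 10)); // Fred scans from offset 2
        System.out.println(log.poll("Sally", 3)); // Sally's position is independent of Fred's
    }
}
```

Because reading never removes anything, adding another consumer costs only one more stored offset.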

  34. @
    Scaling Out

  35. @
    Shard data to get scalability
    Messages are sent to different partitions
    Producer (1) Producer (2) Producer (3)
    Cluster of machines
    Partitions live on different machines
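
Sharding by key can be sketched as a function from a message key to a partition number, so that all messages with the same key land in the same partition. The example below is a simplified illustration: the `KeyPartitioner` class is hypothetical, and it hashes with `String.hashCode()` for brevity, which is an assumption of this sketch rather than the hash Kafka's clients use.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of key-based sharding; the hash function is a simplifying assumption.
class KeyPartitioner {
    // floorMod keeps the result in [0, numPartitions) even when hashCode() is negative.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());

        // The repeated key "card-1" is routed to the same partition both times.
        for (String key : new String[] {"card-1", "card-2", "card-3", "card-1"}) {
            partitions.get(partitionFor(key, numPartitions)).add(key);
        }
        System.out.println(partitions);
    }
}
```

Routing by key preserves per-key ordering within a partition while letting different partitions live on different machines.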

  36. @
    Replicate to get fault tolerance
    replicate
    msg
    msg
    leader
    Machine A
    Machine B

  37. @
    Partition Leadership and Replication
    [Diagram: Topic1 partitions 1-4 replicated across Brokers 1-4; each partition has three replicas, one Leader and the rest Followers]

  38. @
    Replication provides resiliency
    A ‘replica’ takes over on machine failure
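
The failover idea on this slide can be sketched as a toy model of one partition: a list of live replicas whose first entry acts as leader, with the next replica taking over when the leader's broker fails. The `PartitionReplicas` class below is hypothetical and deliberately simplified; a real cluster coordinates the election and only considers replicas that are in sync with the leader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of leader failover for one partition; a real cluster's election is more involved.
class PartitionReplicas {
    private final List<Integer> liveReplicas; // broker ids; the first entry is the current leader

    PartitionReplicas(List<Integer> brokerIds) {
        if (brokerIds.isEmpty()) throw new IllegalArgumentException("need at least one replica");
        this.liveReplicas = new ArrayList<>(brokerIds);
    }

    int leader() {
        if (liveReplicas.isEmpty()) throw new IllegalStateException("partition is offline");
        return liveReplicas.get(0);
    }

    // Remove the failed broker; if it was the leader, the next live replica takes over.
    void brokerFailed(int brokerId) {
        liveReplicas.remove(Integer.valueOf(brokerId));
    }

    public static void main(String[] args) {
        PartitionReplicas partition = new PartitionReplicas(Arrays.asList(1, 2, 3));
        System.out.println("leader: " + partition.leader());
        partition.brokerFailed(1);
        System.out.println("leader after broker 1 fails: " + partition.leader());
    }
}
```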

  39. @
    Partition Leadership and Replication - node failure
    [Diagram: the same Topic1 partition layout across Brokers 1-4, shown after a node failure]

  40. @
    Linearly Scalable Architecture
    Single topic:
    - Many producer machines
    - Many consumer machines
    - Many Broker machines
    No Bottleneck!!
    Consumers
    Producers

  41. @
    Worldwide, localized views
    NY
    London
    Tokyo
    Replicator Replicator
    Replicator

  42. @
    The Connect API
    The Log Connectors
    Connectors
    Producer Consumer
    Streaming Engine

  43. @
    Ingest / Egest into any data source
    Kafka Connect
    Kafka Connect

  44. @
    Ingest/Egest data from/to data sources
    Amazon S3
    Elasticsearch
    HDFS
    JDBC
    Couchbase
    Cassandra
    Oracle
    SAP
    Vertica
    Blockchain
    JMX
    Kinesis
    MongoDB
    MQTT
    NATS
    Postgres
    Rabbit
    Redis
    Twitter
    DynamoDB
    FTP
    GitHub
    BigQuery
    Google Pub Sub
    RethinkDB
    Salesforce
    Solr
    Splunk

  45. @
    Kafka Streams and KSQL
    The Log Connectors
    Connectors
    Producer Consumer
    Streaming Engine

  46. @
    SELECT card_number, count(*)
    FROM authorization_attempts
    WINDOW TUMBLING (SIZE 5 MINUTE)
    GROUP BY card_number
    HAVING count(*) > 3;
    Engine for Continuous Computation

  47. @
    But it’s just an API
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("caterpillars")
               .map(StreamsApp::coolTransformation)
               .to("butterflies");

        new KafkaStreams(builder.build(), props()).start();
    }
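
The slide calls a `props()` helper that is not shown. A hedged guess at what such a helper might return is sketched below, using only `java.util.Properties` with Kafka Streams configuration keys written as plain strings. The application id, the broker address, and the choice of String serdes are placeholder assumptions for illustration, not values from the talk.

```java
import java.util.Properties;

// Hypothetical stand-in for the props() helper referenced on the slide.
class StreamsProps {
    static Properties props() {
        Properties p = new Properties();
        p.put("application.id", "caterpillars-to-butterflies"); // assumed name; identifies this Streams app
        p.put("bootstrap.servers", "localhost:9092");           // assumed broker address
        // Default serdes for record keys and values; String is an assumption of this sketch.
        p.put("default.key.serde", "org.apache.kafka.common.serialization.Serdes$StringSerde");
        p.put("default.value.serde", "org.apache.kafka.common.serialization.Serdes$StringSerde");
        return p;
    }

    public static void main(String[] args) {
        props().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With the Kafka Streams library on the classpath, the `StreamsConfig` constants would normally be used in place of the literal key strings.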

  48. @
    Compacted Topic
    Join
    Stream
    Table
    Kafka Kafka Streams / KSQL
    Topic
    Join Streams and Tables
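
The relationship between a compacted topic and a table can be sketched as "the latest value per key wins". The plain-Java example below replays a sequence of keyed updates into a map. The `TableFromStream` class and the sample customer data are hypothetical, and treating a null value as a delete mirrors how tombstone records behave in a compacted topic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of the table view of a keyed stream: the latest value per key wins.
class TableFromStream {
    // Each update is {key, value}; a null value is treated as a delete (a "tombstone").
    static Map<String, String> materialize(String[][] updates) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] update : updates) {
            if (update[1] == null) {
                table.remove(update[0]);
            } else {
                table.put(update[0], update[1]);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        String[][] updates = {
            {"cust-1", "NY"},
            {"cust-2", "London"},
            {"cust-1", "Tokyo"}, // overwrites the earlier value for cust-1
            {"cust-2", null},    // deletes cust-2
        };
        System.out.println(materialize(updates));
    }
}
```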

  49. @
    KAFKA
    Payments
    Orders
    Buffer 5 mins
    Emailer
    Windows / Retention – Handle Late Events
    In an asynchronous world, will the payment come first, or the order?
    Join by Key

  50. @
    Windows / Retention – Handle Late Events
    KAFKA
    Payments
    Orders
    Buffer 5 mins
    Emailer
    Join by Key
    KStream orders = builder.stream("Orders");
    KStream payments = builder.stream("Payments");

    orders.join(payments, KeyValue::new, JoinWindows.of(1 * MIN))
    .peek((key, pair) -> emailer.sendMail(pair));
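
To make the windowed join concrete, here is a plain-Java sketch of its matching rule: an order and a payment are paired when they share a key and their timestamps differ by no more than the window, whichever arrived first. The `WindowedJoin` class and its `Event` type are hypothetical. Kafka Streams performs this incrementally with buffered state, whereas this sketch simply compares two in-memory lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of a windowed join by key; not the Kafka Streams implementation.
class WindowedJoin {
    // Hypothetical event type shared by both inputs.
    static final class Event {
        final String key;
        final String value;
        final long timestampMs;

        Event(String key, String value, long timestampMs) {
            this.key = key;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    // Pair each order with each payment that has the same key and a timestamp
    // within windowMs of the order's, in either direction.
    static List<String> join(List<Event> orders, List<Event> payments, long windowMs) {
        List<String> joined = new ArrayList<>();
        for (Event order : orders) {
            for (Event payment : payments) {
                boolean sameKey = order.key.equals(payment.key);
                boolean inWindow = Math.abs(order.timestampMs - payment.timestampMs) <= windowMs;
                if (sameKey && inWindow) {
                    joined.add(order.key + ":" + order.value + "+" + payment.value);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        long oneMinute = 60_000L;
        List<Event> orders = Arrays.asList(new Event("o1", "order", 10_000L));
        List<Event> payments = Arrays.asList(
            new Event("o1", "payment", 5_000L),         // arrived before the order, still inside the window
            new Event("o1", "late-payment", 100_000L)); // outside the one-minute window
        System.out.println(join(orders, payments, oneMinute));
    }
}
```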

  51. @
    A KTable is just a stream with infinite retention
    KAFKA
    Emailer
    Orders, Payments
    Customers Join

  52. @
    A KTable is a stream with infinite retention
    KAFKA
    Emailer
    Orders, Payments
    Customers
    Join
    Materialize a table in two lines of code!
    KStream orders = builder.stream("Orders");
    KStream payments = builder.stream("Payments");
    KTable customers = builder.table("Customers");

    orders.join(payments, EmailTuple::new, JoinWindows.of(1 * MIN))
    .join(customers, (tuple, cust) -> tuple.setCust(cust))
    .peek((key, tuple) -> emailer.sendMail(tuple));
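
The second join on this slide enriches stream events from a table rather than from another windowed stream. A plain-Java sketch of that lookup follows: each event is matched against the current table entry for its key, and events with no matching entry are dropped, as in an inner join. The `StreamTableJoin` class, the `enrich` method, and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration of a stream-table join: look up each event's key in a table of current values.
class StreamTableJoin {
    // events: {customerId, payload}; customers: table of customerId -> customer name.
    static List<String> enrich(String[][] events, Map<String, String> customers) {
        List<String> out = new ArrayList<>();
        for (String[] event : events) {
            String customer = customers.get(event[0]); // table lookup by key, no window involved
            if (customer != null) {                    // inner join: skip events without a match
                out.add(event[1] + " for " + customer);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> customers = new HashMap<>();
        customers.put("c1", "Sally");
        String[][] events = {
            {"c1", "order-42 paid"},
            {"c9", "order-43 paid"}, // no such customer in the table, so no output
        };
        System.out.println(enrich(events, customers));
    }
}
```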

  53. @
    The Log Connectors
    Connectors
    Producer Consumer
    Streaming Engine
    Kafka is a complete Streaming Platform

  55. @
    https://www.confluent.io/download/

  56. @
    We are hiring!
    https://www.confluent.io/careers/

  57. @
    One more thing…

  58. @

  59. @

  60. @

  61. @

  62. @

  63. @
    A Major New Paradigm

  64. @
    Thanks!
    Stay for #javapuzzlersng!!!
    @gamussa
    [email protected]
    We are hiring!
    https://www.confluent.io/careers/
