Understanding Streaming Data and Analytics with Apache Kafka®

Ricardo Ferreira

October 01, 2020

Transcript

  1. Understanding streaming
    data and analytics with
    apache kafka®
    @riferrei | @apachekafka | @elastic

  2. About me
    @riferrei | @apachekafka | @elastic
    • RICARDO FERREIRA
    • Developer advocate
    • Elastic community team
    • Kafka summit pc member
    [email protected]
    [email protected]

  5. @riferrei | @apachekafka | @elastic
    “there were lots of databases and
    other systems built to store data,
    but what was missing in our
    architecture was something that
    would help us to handle continuous
    flows of data.” – jay kreps
    Origins of apache kafka

  7. @riferrei | @apachekafka | @elastic
    Event-driven architecture
    Job change recommendation engine
    Search engine
    Email service

  8. @riferrei | @apachekafka | @elastic
    SQL
    SQL
    SQL
    Recommendation engine
    Search engine
    Email service
    database
    LOG
    IMPLEMENT WITH a DATABASE

  9. @riferrei | @apachekafka | @elastic
    Databases CAN’T handle events
    database
    1000x more volume
    Non-transactional events
    Transactional events
    LOG

  10. Databases 30
    years ago...

  11. Databases
    these days

  12. @riferrei | @apachekafka | @elastic
    Databases
    are limited

  13. Limited?
    Are you
    kidding me?

  14. @riferrei | @apachekafka | @elastic
    ARE DATABASES LIMITED?
    YES THEY ARE. WHY
    DO WE HAVE TO MOVE
    DATA FROM ONE DB
    TO ANOTHER JUST
    for ANALYTICS?

  15. @riferrei | @apachekafka | @elastic
    What then?

  16. “The truth is the log.
    The database is a cache
    of a subset of the log.”
    — pat helland
    Immutability changes everything
    http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf

  17. @riferrei | @apachekafka | @elastic
    log as first-class citizen
    database
    LOG
    0 1 2 3 4 5 6 7 8
    LOG
    reads
    writes
    Destination System a
    (time = 1)
    Destination System b
    (time = 3)
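
The diagram on this slide can be sketched in a few lines of Python. This is only an illustration, not Kafka's implementation: `CommitLog`, `position_a`, and `position_b` are hypothetical names, and the log lives in memory. It shows writes appending at the end while each destination system reads from its own position.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class CommitLog:
    """Append-only log: records are added at the end and never changed."""
    records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Write a record and return the offset it was stored at."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 1) -> List[Any]:
        """Read up to max_records starting at offset; the log is not modified."""
        return self.records[offset:offset + max_records]


log = CommitLog()
for i in range(9):              # offsets 0..8, as on the slide
    log.append(f"event-{i}")

# Each destination system keeps its own position in the same log.
position_a = 1                  # "Destination System a (time = 1)"
position_b = 3                  # "Destination System b (time = 3)"

print(log.read(position_a, 2))  # ['event-1', 'event-2']
print(log.read(position_b, 2))  # ['event-3', 'event-4']
```

Because reading does not remove anything, a slower reader never holds back a faster one, and either can rewind by choosing an earlier offset.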

  18. @riferrei | @apachekafka | @elastic
    SOLUTION: BUILD A COMMIT LOG
    Commit LOG
    User
    tracking
    Historical
    data
    Operational
    metrics
    Nosql
    database
    Graph
    database
    Sql
    database
    ...
    HADOOP
    Elastic
    search
    grafana
    Machine
    learning
    REC.
    ENGINE SEARCH SECURITY EMAIL
    SOCIAL
    GRAPH
    microservices

  19. @riferrei | @apachekafka | @elastic
    “WE’VE COME TO THINK OF KAFKA AS A
    STREAMING PLATFORM: A SYSTEM THAT
    LETS YOU PUBLISH AND SUBSCRIBE TO
    STREAMS OF DATA, STORE THEM, AND
    PROCESS THEM, AND THAT IS EXACTLY
    WHAT APACHE KAFKA IS BUILT TO BE.”
    – jay kreps
    Origins of apache kafka

  20. @riferrei | @apachekafka | @elastic
    ORIGINS OF APACHE KAFKA
    Databases Messaging
    Batch
    Expensive
    Time Consuming
    Difficult to Scale
    No Persistence After
    Consumption
    No Replay
    Highly Scalable
    Durable
    Persistent
    Ordered
    Fast (Low Latency)

  21. @riferrei | @apachekafka | @elastic
    ORIGINS OF APACHE KAFKA
    Databases Messaging
    Batch
    Expensive
    Time Consuming
    Difficult to Scale
    No Persistence After
    Consumption
    No Replay
    Highly Scalable
    Durable
    Persistent
    Ordered
    Fast (Low Latency)
    Highly Scalable
    Durable
    Persistent
    Ordered
    Fast (Low Latency)
    Distributed
    Commit log

  22. @riferrei | @apachekafka | @elastic
    ORIGINS OF APACHE KAFKA
    Databases Messaging
    Batch
    Expensive
    Time Consuming
    Difficult to Scale
    No Persistence After
    Consumption
    No Replay
    Highly Scalable
    Durable
    Persistent
    Ordered
    Fast (Low Latency)
    Highly Scalable
    Durable
    Persistent
    Ordered
    Fast (Low Latency)
    Stream processing
    Continuous flows
    Scalable integration
    Distributed
    Streaming platform

  23. @riferrei | @apachekafka | @elastic
    “the ability to combine these three
    areas – to bring all the streams of
    data together across all the use
    cases – is what makes the idea of a
    streaming platform so appealing
    to people” – jay kreps
    Origins of apache kafka

  24. @riferrei | @confluentinc | @itau

  25. 01
    Data Streams
    with messaging
    02
    Data analytics with
    stream processing
    03
    Sophisticated
    STORAGE SYSTEM
    Distributed streaming platform

  26. @riferrei | @apachekafka | @elastic
    Data streams
    With messaging

  27. @riferrei | @apachekafka | @elastic
    producer
    Messaging as you know it
    consumer
    broker
    write
    push

  28. @riferrei | @apachekafka | @elastic
    producer
    Kafka does messaging different
    consumer
    broker
    write
    pull

  29. @riferrei | @apachekafka | @elastic
    Kafka does messaging different
    broker
    pull
    Group 1
    Group 2
    Group 3
    pull
    pull
    queueing
    Pub/sub
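
The queueing and pub/sub labels on this slide can be illustrated with a small simulation. It is a sketch under assumptions: the group layout and the `deliver` function are hypothetical, and simple round-robin stands in for Kafka's partition assignment. Every group receives every record (pub/sub), while inside a group each record goes to exactly one consumer (queueing).

```python
from collections import defaultdict

records = [f"r{i}" for i in range(6)]

# Hypothetical layout: group name -> consumers that belong to it.
groups = {
    "group-1": ["c1"],
    "group-2": ["c1", "c2"],
    "group-3": ["c1", "c2", "c3"],
}


def deliver(records, groups):
    """Return {group: {consumer: [records]}}.

    Across groups, every record is seen by every group (pub/sub).
    Within a group, each record is handled by one consumer only (queueing).
    Round-robin is a stand-in for Kafka's partition assignment.
    """
    out = {}
    for group, consumers in groups.items():
        per_consumer = defaultdict(list)
        for i, record in enumerate(records):
            per_consumer[consumers[i % len(consumers)]].append(record)
        out[group] = dict(per_consumer)
    return out


delivered = deliver(records, groups)
for group, per_consumer in delivered.items():
    print(group, per_consumer)
```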

  30. @riferrei | @apachekafka | @elastic
    Kafka does messaging different
    0 1 2 3 4 5 6 7
    topic
    0 1 2 3
    Partition 1
    4 5 6 7
    Partition 2

  31. @riferrei | @apachekafka | @elastic
    Kafka does messaging different
    0 1 2 3
    Partition 1
    4 5 6 7
    Partition 2
    8 9
    Partition 3
    producer
    write
    consumer
    consumer
    consumer
    pull
    pull
    pull

  32. @riferrei | @apachekafka | @elastic
    Kafka does messaging different
    0 1 2 3
    Partition 1
    4 5 6 7
    Partition 2
    8 9
    Partition 3
    producer Key 002
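
One way to picture how a key such as 002 ends up in a partition is to hash the key and take the result modulo the number of partitions. The sketch below makes that concrete. It is an assumption-laden illustration: `choose_partition` is a hypothetical helper, and `zlib.crc32` is used only as a convenient stable hash, not as the hash Kafka's own partitioner uses. Partition indices here are zero-based, while the slide labels them 1 to 3.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index in [0, num_partitions).

    crc32 is a stand-in hash chosen for this sketch.
    """
    return zlib.crc32(key) % num_partitions


first = choose_partition(b"002", 3)
second = choose_partition(b"002", 3)

# The same key always maps to the same partition, which is what keeps
# records for one key in order relative to each other.
print(first == second)
```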

  33. @riferrei | @apachekafka | @elastic
    Kafka does messaging different
    producer
    write
    consumer
    pull
    Bytes
    serialize deserialize
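
Since the broker only sees bytes, the producer has to serialize each record and the consumer has to deserialize it. A minimal sketch of that round trip, assuming JSON as the format (the slide does not name one) and using the hypothetical helpers `serialize` and `deserialize`:

```python
import json


def serialize(value: dict) -> bytes:
    """Producer side: turn an object into bytes before writing."""
    return json.dumps(value).encode("utf-8")


def deserialize(data: bytes) -> dict:
    """Consumer side: turn the pulled bytes back into an object."""
    return json.loads(data.decode("utf-8"))


event = {"user": "player-1", "score": 120}   # made-up example record
payload = serialize(event)                   # what would travel to the broker
round_trip = deserialize(payload)            # what the consumer works with

print(type(payload).__name__, round_trip == event)
```

Both sides must agree on the format; if the consumer assumed a different one, the bytes would still arrive but could not be decoded.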

  34. @riferrei | @apachekafka | @elastic
    producer
    Kafka does messaging different
    broker
    write
    250gb 250gb 500gb
    Data is always Persistent

  35. @riferrei | @apachekafka | @elastic
    Data ANALYTICS
    WITH STREAM
    PROCESSING

  36. @riferrei | @apachekafka | @elastic
    How to process data streams?
    consumer
    broker
    1) pull
    number of
    records < 4
    12
    number of
    records > 5
    9
    3) write
    2) process
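
The three numbered steps on this slide (pull, process, write) can be sketched as a plain loop. Everything here is an assumption made for illustration: the in-memory lists stand in for topics on a broker, `pull` is a hypothetical helper, and counting records per value is just one example of a processing step.

```python
from collections import Counter

input_topic = ["a", "b", "a", "c", "a", "b"]   # stand-in for records on a broker
output_topic = []                               # stand-in for a results topic


def pull(topic, offset, max_records):
    """Fetch up to max_records starting at offset."""
    return topic[offset:offset + max_records]


counts = Counter()
offset = 0
while True:
    batch = pull(input_topic, offset, max_records=4)   # 1) pull
    if not batch:
        break                                          # nothing left to read
    counts.update(batch)                               # 2) process
    output_topic.append(dict(counts))                  # 3) write
    offset += len(batch)                               # remember the position

print(output_topic[-1])
```

Written by hand like this, the consumer owns all of the bookkeeping (offsets, state, output), which is the burden a dedicated processing layer is meant to take over.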

  37. @riferrei | @apachekafka | @elastic
    How to process data streams?
    consumer
    broker
    1) pull
    3) write
    What IF WE COULD HAVE
    A Processing LAYER FOR
    THE DATA STREAMS?
    number of
    records < 4
    12
    number of
    records > 5
    9
    2) process

  38. @riferrei | @apachekafka | @elastic
    Using stream processors
    producer consumer
    broker
    write pull
    Stream
    processors

  39. @riferrei | @apachekafka | @elastic
    Using stream processors
    Kafka streams

  40. @riferrei | @apachekafka | @elastic
    Using stream processors
    ksqldb

  41. @riferrei | @apachekafka | @elastic
    Scalable data integration
    broker
    Stream
    processors
    connectors

  42. @riferrei | @apachekafka | @elastic
    sophisticated
    Storage system

  43. @riferrei | @apachekafka | @elastic
    Kafka as a storage system
    Broker 1
    250gb 250gb 500gb
    1tb storage
    Broker 2
    500gb 500gb 500gb
    1.5tb storage
    Cluster storage → 2.5tb
    Elastic storage

  44. @riferrei | @apachekafka | @elastic
    Kafka as a storage system
    Broker 1
    250gb 250gb 500gb
    1tb storage
    Broker 2
    500gb 500gb 500gb
    1.5tb storage
    Partition-level replication
    Partition 1
    Partition 2 Partition 2

  45. @riferrei | @apachekafka | @elastic
    Kafka as a storage system
    Commit LOG consumer
    Polling 100 records
    consumer
    Constant time performance
    Time spent: 1 MS
    Polling 100 records
    Time spent: 1 MS
    Commit LOG
    5kb
    5tb
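
The idea that polling 100 records costs about the same in a small log as in a huge one can be illustrated by reading at a computed position instead of scanning from the start. The sketch below is a simplification under stated assumptions: records have a hypothetical fixed size (`RECORD_SIZE`), so the byte position is simple arithmetic, whereas real logs hold variable-size records and rely on an index. The logs are in-memory buffers and far smaller than the sizes on the slide.

```python
import io

RECORD_SIZE = 16  # hypothetical fixed record size in bytes


def build_log(num_records: int) -> io.BytesIO:
    """Create an in-memory log whose record i is the number i, zero-padded."""
    buf = io.BytesIO()
    for i in range(num_records):
        buf.write(str(i).zfill(RECORD_SIZE).encode("ascii"))
    return buf


def poll(log: io.BytesIO, offset: int, max_records: int) -> list:
    """Jump straight to the offset and read only the requested records."""
    log.seek(offset * RECORD_SIZE)
    data = log.read(max_records * RECORD_SIZE)
    return [
        int(data[i:i + RECORD_SIZE].decode("ascii"))
        for i in range(0, len(data), RECORD_SIZE)
    ]


small_log = build_log(200)
large_log = build_log(200_000)

# Both polls touch exactly 100 records, regardless of total log size.
small_batch = poll(small_log, 100, 100)
large_batch = poll(large_log, 199_900, 100)
print(len(small_batch), len(large_batch))
```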

  46. @riferrei | @apachekafka | @elastic
    Kafka as a storage system
    Optimized for massive reads
    Broker 1
    250gb 250gb 500gb
    1tb storage
    pagecache
    nic
    consumer
    Kafka uses the sendfile api to skip intermediate copies:
    - No copy from pagecache (kernel space) to a user buffer
    - No copy from the user buffer back to kernel space
    - Data goes from pagecache to the socket and nic without passing through the application
    Partition 1
    Partition 2
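
The sendfile call named on this slide can be tried from Python through `socket.sendfile`, which uses `os.sendfile` where the platform supports it and falls back to ordinary sends otherwise. This is a small sketch, not how a broker is written: the temporary file stands in for a partition segment, and a local socket pair stands in for the connection to a consumer.

```python
import os
import socket
import tempfile

payload = b"record-0\nrecord-1\nrecord-2\n"

# Temporary file standing in for a segment file on the broker's disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

sender, receiver = socket.socketpair()   # stand-in for broker -> consumer
try:
    with open(path, "rb") as segment:
        # The file content is handed to the socket without being read
        # into a buffer owned by this program first.
        sent = sender.sendfile(segment)
    sender.close()                       # signal end of stream to the reader

    received = b""
    while True:
        chunk = receiver.recv(4096)
        if not chunk:
            break
        received += chunk
finally:
    sender.close()
    receiver.close()
    os.unlink(path)

print(sent, received == payload)
```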

  47. @riferrei | @apachekafka | @elastic
    Kafka as a storage system
    File management in kafka
    Partition 0
    Partition 1
    Partition 2
    Segment 0
    Segment 1
    +
    Segment 2
    +
    0000Seg1.log
    0000Seg1.index
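
The layout on this slide (a partition split into segments, each with a `.log` and an `.index` file) suggests how a read finds the right file: pick the segment whose first offset is the largest one not greater than the requested offset. The sketch below illustrates that lookup. The base offsets are made up, and `find_segment` and `segment_file_names` are hypothetical helpers; the 20-digit zero-padded naming follows the convention Kafka uses for segment files, in place of the slide's shortened `0000Seg1` placeholder.

```python
import bisect

# Hypothetical base offsets: the first offset stored in each segment.
segment_base_offsets = [0, 1000, 2000]


def segment_file_names(base_offset: int):
    """Return the (.log, .index) file names for a segment's base offset."""
    name = str(base_offset).zfill(20)
    return f"{name}.log", f"{name}.index"


def find_segment(offset: int) -> int:
    """Return the base offset of the segment that contains `offset`."""
    i = bisect.bisect_right(segment_base_offsets, offset) - 1
    if i < 0:
        raise ValueError("offset is before the first segment")
    return segment_base_offsets[i]


base = find_segment(1500)
log_file, index_file = segment_file_names(base)
print(base, log_file, index_file)
```

Keeping segments separate also means old data can be dropped by deleting whole files rather than rewriting one large file.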

  48. @riferrei | @apachekafka | @elastic
    Putting the
    Pieces together

  49. @riferrei | @apachekafka | @elastic
    Streaming PAC-MAN

  50. @riferrei | @apachekafka | @elastic
    STREAMING PAC-MAN
    Api
    gateway
    Lambda
    function
    Kafka
    (MSK)
    Ksqldb
    (ecs)
    Kafka
    (MSK)
    scoreboard
    https://github.com/riferrei/streaming-pacman-aws

  51. @riferrei | @apachekafka | @elastic
    2. Name yourself
    1. Get the game
    Streaming pac-man

  52. @riferrei | @apachekafka | @elastic
    Making data
    available

  53. @riferrei | @apachekafka | @elastic
    Api
    gateway
    Lambda
    function
    scoreboard
    Redis
    cache
    push
    From kafka to the world

  54. From kafka to the world
    @riferrei | @apachekafka | @elastic
    Amazon
    alexa
    Lambda
    function
    scoreboard
    Redis
    cache
    push

  55. @riferrei | @apachekafka | @elastic
    Your
    code
    Ksqldb
    (ECS)
    pull
    Kafka
    (MSK)
    From kafka to the world

  56. @riferrei | @apachekafka | @elastic
    how can I
    learn more?

  57. @riferrei | @apachekafka | @elastic
    Use professional books

  58. @riferrei | @apachekafka | @elastic
    Thank you
