Understanding Streaming Data and Analytics with Apache Kafka®

Ricardo Ferreira

October 01, 2020

Transcript

  1. About me (@riferrei | @apachekafka | @elastic)

    • Ricardo Ferreira
    • Developer advocate
    • Elastic community team
    • Kafka Summit PC member
  2. Origins of Apache Kafka

    “There were lots of databases and other systems built to store data, but what was missing in our architecture was something that would help us to handle continuous flows of data.” – Jay Kreps
  3. Implement with a database

    Diagram labels: SQL (three times), recommendation engine, search engine, email service, database, log.
  4. Databases can't handle events

    Diagram labels: database, log, 1000x more volume, non-transactional events, transactional events.
  5. Are databases limited? Yes, they are.

    Why do we have to move data from one DB to another just for analytics?
  6. Immutability changes everything

    “The truth is the log. The database is a cache of a subset of the log.” – Pat Helland
    http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
  7. Log as first-class citizen

    Diagram labels: database, log; a log with offsets 0 to 8; writes, reads; destination system A (time = 1), destination system B (time = 3).
  8. Solution: build a commit log

    Diagram labels: commit log; user tracking, historical data, operational metrics; NoSQL database, graph database, SQL database, ...; Hadoop, Elasticsearch, Grafana, machine learning; rec. engine, search, security, email, social graph, microservices.
  9. Origins of Apache Kafka

    “We've come to think of Kafka as a streaming platform: a system that lets you publish and subscribe to streams of data, store them, and process them, and that is exactly what Apache Kafka is built to be.” – Jay Kreps
  10. Origins of Apache Kafka

    Diagram labels: databases, messaging, batch; expensive, time consuming, difficult to scale, no persistence after consumption, no replay; highly scalable, durable, persistent, ordered, fast (low latency).
  11. Origins of Apache Kafka

    Diagram labels: databases, messaging, batch; expensive, time consuming, difficult to scale, no persistence after consumption, no replay; highly scalable, durable, persistent, ordered, fast (low latency); distributed commit log.
  12. Origins of Apache Kafka

    Diagram labels: databases, messaging, batch; expensive, time consuming, difficult to scale, no persistence after consumption, no replay; highly scalable, durable, persistent, ordered, fast (low latency); stream processing, continuous flows, scalable integration; distributed streaming platform.
  13. Origins of Apache Kafka

    “The ability to combine these three areas – to bring all the streams of data together across all the use cases – is what makes the idea of a streaming platform so appealing to people.” – Jay Kreps
  14. Distributed streaming platform

    01 Data streams with messaging
    02 Data analytics with stream processing
    03 Sophisticated storage system
  15. Kafka does messaging differently

    Diagram labels: broker; Group 1, Group 2, Group 3, each pulling from the broker; queueing, pub/sub.
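
The queueing and pub/sub labels on slide 15 can be illustrated with a toy model: inside one group the partitions are divided among the members (queueing), while every group independently sees all partitions (pub/sub). This is only a sketch. The group names, consumer names, and the plain round-robin assignment are assumptions made for illustration, not the assignment strategy a real Kafka cluster is guaranteed to use.

```python
from collections import defaultdict


def assign_partitions(partitions, consumers):
    """Round-robin the partitions across the consumers of one group (assumed strategy)."""
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)


# Hypothetical topic with four partitions and three independent groups.
partitions = ["p0", "p1", "p2", "p3"]
groups = {
    "group-1": ["c1"],
    "group-2": ["c1", "c2"],
    "group-3": ["c1", "c2", "c3", "c4"],
}

# Each group gets its own full assignment: pub/sub across groups,
# queueing (work sharing) among the members of a single group.
assignments = {
    name: assign_partitions(partitions, members)
    for name, members in groups.items()
}

for name, assignment in assignments.items():
    print(name, assignment)
```

Adding members to a group spreads the same partitions over more consumers; adding a group creates another independent reader of the whole topic.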
  16. Kafka does messaging differently

    Diagram labels: a topic with records 0 to 7, split into Partition 1 (records 0 to 3) and Partition 2 (records 4 to 7).
  17. Kafka does messaging differently

    Diagram labels: Partition 1 (records 0 to 3), Partition 2 (records 4 to 7), Partition 3 (records 8 and 9); one producer writes; three consumers pull.
  18. Kafka does messaging differently

    Diagram labels: Partition 1 (records 0 to 3), Partition 2 (records 4 to 7), Partition 3 (records 8 and 9); producer; key 002.
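
Slide 18 shows a producer attaching a key to a record. A minimal sketch of how a key can select a partition is to hash the key bytes and take the result modulo the partition count. Note the assumption: Kafka's default partitioner hashes keys with murmur2, whereas this sketch substitutes CRC32 from the Python standard library so that it stays self-contained. The function name `choose_partition` is hypothetical.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index.

    Assumption: CRC32 stands in for the murmur2 hash used by Kafka's
    default partitioner, so the concrete indexes differ from real Kafka.
    """
    return zlib.crc32(key) % num_partitions


# Hypothetical keys; "002" mirrors the key shown on the slide.
keys = [b"001", b"002", b"003", b"002"]
placements = [(key, choose_partition(key, 3)) for key in keys]

for key, partition in placements:
    print(key.decode(), "-> partition index", partition)
```

The property that matters is determinism: records with the same key always land in the same partition, which is what keeps them ordered relative to each other.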
  19. Kafka does messaging differently

    Diagram labels: the producer serializes and writes; bytes; the consumer pulls and deserializes.
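
Slide 19 makes the point that only bytes travel between producer and consumer: the producer serializes before writing and the consumer deserializes after pulling. The round trip can be sketched with JSON as the wire format. JSON, the helper names, and the sample event are assumptions chosen for illustration; the slides do not name a format.

```python
import json


def serialize(record: dict) -> bytes:
    """Producer side: turn a record into bytes before writing."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> dict:
    """Consumer side: turn pulled bytes back into a record."""
    return json.loads(payload.decode("utf-8"))


# Hypothetical event.
event = {"player": "riferrei", "score": 120}

payload = serialize(event)       # what would be written
restored = deserialize(payload)  # what the consumer would see

print(type(payload).__name__, restored)
```

Producer and consumer have to agree on the format, because the bytes in between carry no type information of their own.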
  20. Kafka does messaging differently

    Data is always persistent.
    Diagram labels: producer, write, broker; 250 GB, 250 GB, 500 GB.
  21. How to process data streams?

    Diagram labels: consumer, broker; 1) pull, 2) process, 3) write; number of records < 4, 12; number of records > 5, 9.
  22. How to process data streams?

    What if we could have a processing layer for the data streams?
    Diagram labels: consumer, broker; 1) pull, 2) process, 3) write; number of records < 4, 12; number of records > 5, 9.
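
The three steps on slides 21 and 22 (pull, process, write) can be sketched as a hand-written loop. In-memory lists stand in for the input and output topics, and a running count per key stands in for the processing step. All names here are hypothetical, and this is not how any particular stream processing library is implemented; it only shows the repetitive plumbing that a dedicated processing layer would take over.

```python
from collections import Counter


def poll(topic, offset, max_records):
    """Step 1: pull up to max_records starting at the given offset."""
    return topic[offset:offset + max_records]


def process(batch, counts):
    """Step 2: update a running count per key and emit the new totals."""
    counts.update(record["key"] for record in batch)
    touched = sorted({record["key"] for record in batch})
    return [{"key": key, "count": counts[key]} for key in touched]


# Hypothetical input topic; the output topic starts empty.
input_topic = [{"key": key} for key in ["a", "b", "a", "c", "a", "b"]]
output_topic = []
counts = Counter()
offset = 0

while offset < len(input_topic):
    batch = poll(input_topic, offset, max_records=2)
    results = process(batch, counts)
    output_topic.extend(results)  # Step 3: write the results back
    offset += len(batch)          # remember how far we have read

print(dict(counts))
```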
  23. Kafka as a storage system

    Elastic storage
    Broker 1: 250 GB + 250 GB + 500 GB = 1 TB storage
    Broker 2: 500 GB + 500 GB + 500 GB = 1.5 TB storage
    Cluster storage → 2.5 TB
  24. Kafka as a storage system

    Partition-level replication
    Broker 1: 250 GB + 250 GB + 500 GB = 1 TB storage
    Broker 2: 500 GB + 500 GB + 500 GB = 1.5 TB storage
    Diagram labels: Partition 1, Partition 2, Partition 2.
  25. Kafka as a storage system

    Constant time performance
    Commit log of 5 KB: consumer polling 100 records, time spent: 1 ms
    Commit log of 5 TB: consumer polling 100 records, time spent: 1 ms
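
The constant-time claim on slide 25 follows from how a log is read: a consumer asks for a batch starting at a known offset, so the cost depends on the batch size and not on how much data sits in the log. A toy in-memory model shows the idea. The `CommitLog` class is hypothetical, and a real broker reads from files on disk rather than from a Python list, so no timings are claimed here.

```python
class CommitLog:
    """Toy append-only log addressed by offset (in-memory stand-in)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, offset, max_records=100):
        """Read up to max_records starting at offset.

        The work done is proportional to max_records, not to the
        total number of records in the log.
        """
        return self._records[offset:offset + max_records]


small = CommitLog()
large = CommitLog()
for i in range(200):
    small.append(i)
for i in range(200_000):
    large.append(i)

# Both consumers poll 100 records; the size of the log does not change the request.
small_batch = small.poll(100)
large_batch = large.poll(199_900)

print(len(small_batch), len(large_batch))
```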
  26. Kafka as a storage system

    Optimized for massive reads
    Diagram labels: Broker 1 (250 GB, 250 GB, 500 GB; 1 TB storage), Partition 1, Partition 2, pagecache, NIC, consumer.
    Kafka uses the sendfile API to:
    - Bypass pagecache to kernel space
    - Bypass kernel space to user buffer
    - Bypass user buffer to kernel space
    - Bypass kernel space to socket buffer
  27. Kafka as a storage system

    File management in Kafka
    Diagram labels: Partition 0, Partition 1, Partition 2; Segment 0, Segment 1, Segment 2; 0000Seg1.log, 0000Seg1.index.
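
Slide 27 shows each partition split into segments, with a .log file and an .index file per segment. Finding the segment that holds a given offset can be sketched as a binary search over the segments' base offsets. The base offsets below are made up, the helper names are hypothetical, and the 20-digit zero-padded file name reflects Kafka's on-disk naming as an assumption of this sketch; the slide itself uses the simplified label 0000Seg1.

```python
from bisect import bisect_right


def segment_file_names(base_offset: int):
    """Return the (.log, .index) file names for a segment's base offset."""
    stem = f"{base_offset:020d}"  # zero-padded to 20 digits (assumed naming)
    return f"{stem}.log", f"{stem}.index"


def find_segment(base_offsets, offset):
    """Return the base offset of the segment that contains `offset`.

    `base_offsets` must be sorted in ascending order.
    """
    if not base_offsets or offset < base_offsets[0]:
        raise ValueError("offset is before the first segment")
    return base_offsets[bisect_right(base_offsets, offset) - 1]


# Hypothetical partition with three segments.
base_offsets = [0, 1000, 2000]

base = find_segment(base_offsets, 1500)
log_name, index_name = segment_file_names(base)

print(base, log_name, index_name)
```

Keeping segments small and individually indexed is what lets old data be located or deleted one file at a time without touching the rest of the partition.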
  28. Streaming Pac-Man

    Diagram labels: API gateway, Lambda function, Kafka (MSK), ksqlDB (ECS), Kafka (MSK), scoreboard.
    https://github.com/riferrei/streaming-pacman-aws
  29. From Kafka to the world

    Diagram labels: Amazon Alexa, Lambda function, scoreboard, Redis cache, push.