
Building real-time streaming applications using Apache Kafka

Jayesh
October 28, 2017

Act on the data as it happens! This presentation discusses Apache Kafka, the need for building real-time streaming applications and how to build streaming applications using Kafka Streams.

We end by discussing the current data architecture at Hotstar.

Transcript

  1. Agenda
     • What is Apache Kafka?
     • Why do we need stream processing?
     • Stream processing using Apache Kafka
     • Kafka @ Hotstar
     Feel free to stop me for questions
  2. $ whoami
     • Personalisation lead at Hotstar
     • Led Data Infrastructure team at Grofers and TinyOwl
     • Kafka fanboy
     • Usually rant on Twitter @jayeshsidhwani
  3. What is Kafka?
     • Kafka is a scalable, fault-tolerant, distributed queue
     • Producers and Consumers
     • Uses
       ◦ Asynchronous communication in event-driven architectures
       ◦ Message broadcast for database replication
     Diagram credits: http://kafka.apache.org
  4. Inside Kafka
     • Brokers
       ◦ Heart of Kafka
       ◦ Store data
       ◦ Data is stored in topics
     • Zookeeper
       ◦ Manages cluster state information
       ◦ Leader election
     [Diagram labels: BROKER (x3), ZOOKEEPER (x2), TOPIC (x3), P (x3), C (x3)]
  5. Inside a topic
     • Topics are partitioned
       ◦ A partition is an append-only commit-log file
       ◦ Achieves horizontal scalability
     • Messages written to a partition are ordered
     • Each message gets an auto-incrementing offset number
       ◦ {"user_id": 1, "term": "GoT"} is a message in the topic "searched"
     Diagram credits: http://kafka.apache.org
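
The partition and offset ideas on this slide can be sketched with a toy in-memory model. This is not the Kafka client API: the `Topic` class, its methods, and the CRC32-based key hashing are all assumptions made for illustration (Kafka's default partitioner uses a different hash).

```python
import zlib


class Topic:
    """Toy in-memory model of a partitioned topic (hypothetical, not Kafka's API)."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an append-only list; the list index is the offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stand-in hash: CRC32 of the key bytes modulo the partition count.
        return zlib.crc32(str(key).encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append a message and return its (partition, offset)."""
        p = self.partition_for(key)
        log = self.partitions[p]
        log.append(value)
        return p, len(log) - 1

    def read(self, partition, offset):
        """Return every message in a partition from the given offset onward."""
        return self.partitions[partition][offset:]


searched = Topic("searched", num_partitions=3)
p0, o0 = searched.append(1, {"user_id": 1, "term": "GoT"})
p1, o1 = searched.append(1, {"user_id": 1, "term": "Sherlock"})
print(p0 == p1, o0, o1)  # prints: True 0 1
```

Messages with the same key land in the same partition, so their relative order is preserved; ordering across different partitions is not modelled here.
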
  6. How do consumers read?
     • Consumer subscribes to a topic
     • Consumers read from the head of the queue
     • Multiple consumers can read from a single topic
     Diagram credits: http://kafka.apache.org
  7. Kafka consumer scales horizontally
     • Consumers can be grouped
     • Consumer Groups
       ◦ Horizontally scalable
       ◦ Fault tolerant
       ◦ Delivery guaranteed
     Diagram credits: http://kafka.apache.org
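
A rough sketch of how a consumer group spreads partitions across its members, and what happens when a member drops out. The `assign` function and the consumer names are hypothetical; Kafka ships several assignment strategies, and this simplified round-robin is only a stand-in for them.

```python
def assign(partitions, consumers):
    """Spread partitions over the group's consumers round-robin.

    Each partition is owned by exactly one consumer in the group.
    """
    if not consumers:
        raise ValueError("a consumer group needs at least one member")
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


partitions = [0, 1, 2, 3]
before = assign(partitions, ["c1", "c2"])
# c2 leaves the group (for example, it crashed); its partitions move to c1.
after = assign(partitions, ["c1"])
print(before)  # {'c1': [0, 2], 'c2': [1, 3]}
print(after)   # {'c1': [0, 1, 2, 3]}
```

Adding consumers (up to the number of partitions) spreads the load; removing one hands its partitions to the survivors, which is where the fault tolerance comes from.
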
  8. Discrete data processing models
     • Request / Response processing mode
       ◦ Processing time <1 second
       ◦ Clients can use this data
     [Diagram labels: APP (x3)]
  9. Discrete data processing models
     • Request / Response processing mode
       ◦ Processing time <1 second
       ◦ Clients can use this data
     • Batch processing mode
       ◦ Processing time: a few hours to a day
       ◦ Analysts can use this data
     [Diagram labels: APP (x3), DWH, HADOOP]
  10. Discrete data processing models
     • As the system grows, such a synchronous processing model leads to an unmaintainable, spaghetti design
     [Diagram labels: APP (x4), SEARCH, MONIT, CACHE]
  11. Promise of stream processing
     • Untangle movement of data
       ◦ Single source of truth
       ◦ No duplicate writes
       ◦ Anyone can consume anything
       ◦ Decouples data generation from data computation
     [Diagram labels: APP (x4), SEARCH, MONIT, CACHE, STREAM PROCESSING FRAMEWORK]
  12. Promise of stream processing
     • Untangle movement of data
       ◦ Single source of truth
       ◦ No duplicate writes
       ◦ Anyone can consume anything
     • Process, transform and react to the data as it happens
       ◦ Sub-second latencies
       ◦ Anomaly detection on bad stream quality
       ◦ Timely notification to users who dropped off in a live match
     [Diagram labels: Intelligence, APP (x4), STREAM PROCESSING FRAMEWORK, Filter, Window, Join, Anomaly, Action]
  13. Stream processing frameworks
     • Write your own?
       ◦ Windowing
       ◦ State management
       ◦ Fault tolerance
       ◦ Scalability
     • Use frameworks such as Apache Spark, Samza, Storm
       ◦ Batteries included
       ◦ Cluster manager to coordinate resources
       ◦ High memory / CPU footprint
  14. Kafka Streams
     • Kafka Streams is a simple, low-latency, framework-independent stream processing library
     • Simple DSL
     • Same principles as Kafka consumer (minus the operational overhead)
     • No cluster manager! yay!
  15. Writing Kafka Streams
     • Define a processing topology
       ◦ Source nodes
       ◦ Processor nodes
         ▪ One or more
         ▪ Filtering, windowing, joins, etc.
       ◦ Sink nodes
     • Compile it and run it like any other Java application
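
The source, processor, and sink structure of a topology can be illustrated with plain Python generators. This is a conceptual sketch only, not the Kafka Streams DSL (which is a Java library); every function name and the sample records are hypothetical.

```python
def source(records):
    """Source node: emits records from an input (here, an in-memory list)."""
    yield from records


def filter_node(stream, predicate):
    """Processor node: forwards only records that satisfy the predicate."""
    return (r for r in stream if predicate(r))


def map_node(stream, fn):
    """Processor node: transforms each record."""
    return (fn(r) for r in stream)


def sink(stream, output):
    """Sink node: writes results to an output (here, a list)."""
    for r in stream:
        output.append(r)


searches = [
    {"user_id": 1, "term": "GoT"},
    {"user_id": 2, "term": ""},
    {"user_id": 3, "term": "cricket"},
]
results = []
# Topology: source -> filter (drop empty terms) -> map (lower-case) -> sink
non_empty = filter_node(source(searches), lambda r: r["term"])
terms = map_node(non_empty, lambda r: r["term"].lower())
sink(terms, results)
print(results)  # ['got', 'cricket']
```

Because each node is lazy, records flow through the whole chain one at a time, which loosely mirrors how a streaming topology processes records as they arrive.
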
  16. Kafka Streams architecture and operations
     • Kafka manages
       ◦ Parallelism
       ◦ Fault tolerance
       ◦ Ordering
       ◦ State Management
     Diagram credits: http://confluent.io
  17. Streaming joins and state-stores
     • Beyond filtering and windowing
     • Streaming joins are hard to scale
       ◦ Kafka scales at 800k writes/sec*
       ◦ How about your database?
     • Solution: Cache a static stream in-memory
       ◦ Join with running stream
       ◦ Stream<>table duality
     • Kafka supports in-memory cache out of the box
       ◦ RocksDB
       ◦ In-memory hash
       ◦ Persistent / Transient
     Diagram credits: http://confluent.io
     *achieved using librdkafka C++ library
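
The stream<>table duality and the cached join can be sketched as follows: a changelog stream is folded into an in-memory hash (the "table"), and each record of the running stream is joined against it without a database round trip. The function names, the treatment of `None` as a delete, and the sample data are assumptions for illustration, not Kafka Streams APIs.

```python
def materialize(changelog):
    """Fold a changelog stream of (key, value) pairs into a table of latest values."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)  # assumption: a None value means "delete"
        else:
            table[key] = value
    return table


def join(stream, table):
    """Inner-join each (key, value) stream record with the table row for its key."""
    for key, value in stream:
        if key in table:
            yield key, (value, table[key])


# Hypothetical data: a slowly changing stream of user plans and a stream of plays.
user_plans = [("u1", "free"), ("u2", "premium"), ("u1", "premium")]
plays = [("u1", "match-42"), ("u3", "match-42"), ("u2", "movie-7")]

table = materialize(user_plans)
joined = list(join(plays, table))
print(joined)
# [('u1', ('match-42', 'premium')), ('u2', ('movie-7', 'premium'))]
```

The lookup is a local dictionary access, which is the point of the slide: the join keeps up with the stream because it never leaves the process.
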
  18. Demo
     • Inputs:
       ◦ Incoming stream of benchmark stream quality from CDN provider
       ◦ Incoming stream quality reported by Hotstar clients
     • Output:
       ◦ Calculate the locations reporting bad QoS in real time
     Diagram credits: http://confluent.io
  19. Demo
     • Inputs:
       ◦ Incoming stream of benchmark stream quality from CDN provider
       ◦ Incoming stream quality reported by Hotstar clients
     • Output:
       ◦ Calculate the locations reporting bad QoS in real time
     Diagram credits: http://confluent.io
     [Diagram labels: CDN benchmarks, Client reports, Alerts]
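
The transcript does not include the demo code, so the following is only a guess at the shape of such a job: client reports are averaged per location in tumbling windows and compared with the CDN benchmark for that location. The record layout (timestamp, location, bitrate in kbps), the 60-second window, and the 0.5 ratio threshold are all hypothetical, and the sketch processes a finished list rather than a live stream.

```python
from collections import defaultdict


def bad_qos_alerts(benchmarks, reports, window_s=60, ratio=0.5):
    """Return (window_start, location, average) where clients fall below ratio * benchmark."""
    expected = dict(benchmarks)  # location -> benchmark bitrate (the "table" side)
    windows = defaultdict(list)  # (window_start, location) -> reported bitrates
    for ts, location, kbps in reports:
        window_start = ts - ts % window_s  # tumbling window the report falls in
        windows[(window_start, location)].append(kbps)
    alerts = []
    for (window_start, location), values in sorted(windows.items()):
        avg = sum(values) / len(values)
        if location in expected and avg < ratio * expected[location]:
            alerts.append((window_start, location, avg))
    return alerts


# Hypothetical inputs: (location, benchmark kbps) and (timestamp s, location, kbps).
benchmarks = [("mumbai", 4000), ("delhi", 4000)]
reports = [
    (0, "mumbai", 3800), (10, "delhi", 1500), (30, "delhi", 1700),
    (65, "mumbai", 3900), (70, "delhi", 3600),
]
alerts = bad_qos_alerts(benchmarks, reports)
print(alerts)  # [(0, 'delhi', 1600.0)]
```

Only the first window for "delhi" averages below half of its benchmark, so it is the only alert; a real streaming job would emit such alerts incrementally as each window closes.
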