Building real-time streaming applications using Apache Kafka

Jayesh
October 28, 2017


Act on the data as it happens! This presentation introduces Apache Kafka, explains the need for real-time stream processing, and shows how to build streaming applications using Kafka Streams.

We end by discussing the current data architecture at Hotstar.


Transcript

  1. 2.

    Agenda • What is Apache Kafka? • Why do we

    need stream processing? • Stream processing using Apache Kafka • Kafka @ Hotstar Feel free to stop me for questions 2
  2. 3.

    $ whoami • Personalisation lead at Hotstar • Led Data

    Infrastructure team at Grofers and TinyOwl • Kafka fanboy • Usually rant on Twitter @jayeshsidhwani 3
  3. 4.

    What is Kafka? 4 • Kafka is a scalable, fault-tolerant,

    distributed queue • Producers and Consumers • Uses ◦ Asynchronous communication in event-driven architectures ◦ Message broadcast for database replication Diagram credits: http://kafka.apache.org
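The producer/consumer decoupling described above can be sketched with a plain in-memory queue standing in for a Kafka topic. This is a pure-Python toy (no broker, no Kafka client library): the point is only that the producer fires messages without waiting for the consumer.

```python
from queue import Queue
from threading import Thread

topic = Queue()  # stand-in for a Kafka topic

def producer():
    # Fire-and-forget: the producer never waits for consumers.
    for event in ({"user_id": 1, "term": "GoT"}, {"user_id": 2, "term": "IPL"}):
        topic.put(event)
    topic.put(None)  # sentinel so the toy consumer knows to stop

def consumer(out):
    # The consumer drains the topic at its own pace.
    while (msg := topic.get()) is not None:
        out.append(msg)

received = []
t = Thread(target=consumer, args=(received,))
t.start()
producer()
t.join()
print(received)
```

In real Kafka the broker durably stores the messages, so the consumer need not even be running when the producer writes.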
  4. 5.

    • Brokers ◦ Heart of Kafka ◦ Stores data ◦

    Data stored into topics • Zookeeper ◦ Manages cluster state information ◦ Leader election Inside Kafka 5 BROKER ZOOKEEPER BROKER BROKER ZOOKEEPER TOPIC TOPIC TOPIC P P P C C C
  5. 6.

    • Topics are partitioned ◦ A partition is an append-only

    commit-log file ◦ Achieves horizontal scalability • Messages written to a partition are ordered • Each message gets an auto-incrementing offset # ◦ {“user_id”: 1, “term”: “GoT”} is a message in the topic searched Inside a topic 6 Diagram credits: http://kafka.apache.org
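The append-only commit log with auto-incrementing offsets can be simulated in a few lines of pure Python (toy class; in real Kafka the broker assigns offsets):

```python
class Partition:
    """Toy append-only commit log: each appended message gets the next offset."""
    def __init__(self):
        self.log = []

    def append(self, message):
        offset = len(self.log)  # auto-incrementing offset #
        self.log.append(message)
        return offset

    def read(self, offset):
        # Reads never mutate the log; any consumer can re-read any offset.
        return self.log[offset]

searched = Partition()
print(searched.append({"user_id": 1, "term": "GoT"}))  # offset 0
print(searched.append({"user_id": 2, "term": "IPL"}))  # offset 1
print(searched.read(0))
```

Ordering holds only within one partition; across partitions of the same topic there is no global order.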
  6. 7.

    How do consumers read? • Consumer subscribes to a topic

    • Consumers read from the head of the queue • Multiple consumers can read from a single topic 7 Diagram credits: http://kafka.apache.org
  7. 8.

    Kafka consumer scales horizontally • Consumers can be grouped •

    Consumer Groups ◦ Horizontally scalable ◦ Fault tolerant ◦ Delivery guaranteed 8 Diagram credits: http://kafka.apache.org
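How a consumer group spreads a topic's partitions over its members can be sketched with a toy round-robin assignment (hypothetical `assign` helper; real Kafka has pluggable assignors and the group coordinator triggers rebalances):

```python
from itertools import cycle

def assign(partitions, consumers):
    """Toy round-robin partition assignment across group members."""
    assignment = {c: [] for c in consumers}
    for partition, consumer in zip(partitions, cycle(consumers)):
        assignment[consumer].append(partition)
    return assignment

print(assign([0, 1, 2, 3], ["c1", "c2"]))   # {'c1': [0, 2], 'c2': [1, 3]}
# Fault tolerance: if c2 dies, rerunning over the survivors "rebalances":
print(assign([0, 1, 2, 3], ["c1"]))         # {'c1': [0, 1, 2, 3]}
```

Because each partition is owned by exactly one member at a time, adding consumers (up to the partition count) scales the group horizontally.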
  8. 10.

    Discrete data processing models 10 APP APP APP • Request

    / Response processing mode ◦ Processing time <1 second ◦ Clients can use this data
  9. 11.

    Discrete data processing models 11 APP APP APP • Request

    / Response processing mode ◦ Processing time <1 second ◦ Clients can use this data DWH HADOOP • Batch processing mode ◦ Processing time few hours to a day ◦ Analysts can use this data
  10. 12.

    Discrete data processing models 12 • As the system grows,

    such a synchronous processing model leads to a spaghetti-like, unmaintainable design APP APP APP APP SEARCH MONIT CACHE
  11. 13.

    Promise of stream processing 13 • Untangle movement of data

    ◦ Single source of truth ◦ No duplicate writes ◦ Anyone can consume anything ◦ Decouples data generation from data computation APP APP APP APP SEARCH MONIT CACHE STREAM PROCESSING FRAMEWORK
  12. 14.

    Promise of stream processing 14 • Untangle movement of data

    ◦ Single source of truth ◦ No duplicate writes ◦ Anyone can consume anything • Process, transform and react on the data as it happens ◦ Sub-second latencies ◦ Anomaly detection on bad stream quality ◦ Timely notification to users who dropped off in a live match Intelligence APP APP APP APP STREAM PROCESSING FRAMEWORK Filter Window Join Anomaly Action
  13. 16.

    Stream processing frameworks • Write your own? ◦ Windowing ◦

    State management ◦ Fault tolerance ◦ Scalability • Use frameworks such as Apache Spark, Samza, Storm ◦ Batteries included ◦ Cluster manager to coordinate resources ◦ High memory / CPU footprint 16
  14. 17.

    Kafka Streams • Kafka Streams is a simple, low-latency, framework-

    independent stream processing library • Simple DSL • Same principles as Kafka consumer (minus operations overhead) • No cluster manager! yay! 17
  15. 18.

    Writing Kafka Streams • Define a processing topology ◦ Source

    nodes ◦ Processor nodes ▪ One or more ▪ Filtering, windowing, joins, etc. ◦ Sink nodes • Compile it and run like any other Java application 18
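The source → processor → sink topology above can be mimicked with chained generators in pure Python (toy function names; the real thing is the Kafka Streams Java DSL, where sources and sinks are topics):

```python
def source(records):
    # Source node: in real Kafka Streams this reads from an input topic.
    yield from records

def filter_processor(stream, predicate):
    # Processor node: keep only records matching the predicate.
    return (r for r in stream if predicate(r))

def map_processor(stream, fn):
    # Processor node: transform each record.
    return (fn(r) for r in stream)

def sink(stream):
    # Sink node: a real sink would write to an output topic.
    return list(stream)

records = [{"term": "GoT"}, {"term": ""}, {"term": "IPL"}]
stream = source(records)
stream = filter_processor(stream, lambda r: r["term"])       # drop empty searches
stream = map_processor(stream, lambda r: r["term"].lower())  # normalise
print(sink(stream))  # ['got', 'ipl']
```

Because generators are lazy, records flow through the whole topology one at a time, which is also how a streams app processes records as they arrive.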
  16. 20.

    Kafka Streams architecture and operations • Kafka manages ◦ Parallelism

    ◦ Fault tolerance ◦ Ordering ◦ State Management 20 Diagram credits: http://confluent.io
  17. 21.

    Streaming joins and state-stores • Beyond filtering and windowing •

    Streaming joins are hard to scale ◦ Kafka scales at 800k writes/sec* ◦ How about your database? • Solution: Cache a static stream in-memory ◦ Join with running stream ◦ Stream<>table duality • Kafka supports in-memory caches out of the box ◦ RocksDB ◦ In-memory hash ◦ Persistent / Transient 21 Diagram credits: http://confluent.io *achieved using the librdkafka C++ library
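The stream<>table duality behind such joins can be sketched in pure Python: the "static" changelog stream is materialised into a dict (a stand-in for the state store, which real Kafka Streams backs with RocksDB), and each record of the running stream is enriched against it. All field names here are hypothetical.

```python
user_table = {}  # toy state store materialised from a changelog stream

# Table = changelog stream folded by key, last write per key wins.
for key, value in [(1, "mumbai"), (2, "delhi"), (1, "pune")]:
    user_table[key] = value

# Running stream joined record-by-record against the local table,
# avoiding a per-record query to an external database.
searches = [{"user_id": 1, "term": "GoT"}, {"user_id": 2, "term": "IPL"}]
joined = [dict(s, city=user_table.get(s["user_id"])) for s in searches]
print(joined)
```

Note user 1 joins against "pune", not "mumbai": the table always reflects the latest value per key.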
  18. 22.

    Demo • Inputs: ◦ Incoming stream of benchmark stream quality

    from CDN provider ◦ Incoming stream quality reported by Hotstar clients • Output: ◦ Calculate the locations reporting bad QoS in real-time 22 Diagram credits: http://confluent.io
  19. 23.

    Demo • Inputs: ◦ Incoming stream of benchmark stream quality

    from CDN provider ◦ Incoming stream quality reported by Hotstar clients • Output: ◦ Calculate the locations reporting bad QoS in real-time 23 Diagram credits: http://confluent.io CDN benchmarks Client reports Alerts
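The demo's core computation can be sketched as a pure-Python pass over the two inputs (function, field names, and the flagging rule are all hypothetical; the actual demo runs this continuously as a Kafka Streams app joining the two topics):

```python
from collections import defaultdict

def bad_qos_locations(benchmarks, client_reports, threshold=0.2):
    """Toy QoS check: flag locations where the worst client-reported bitrate
    falls more than `threshold` below the CDN benchmark for that location."""
    bench = {b["location"]: b["bitrate"] for b in benchmarks}
    worst = defaultdict(lambda: float("inf"))
    for r in client_reports:
        worst[r["location"]] = min(worst[r["location"]], r["bitrate"])
    return sorted(loc for loc, bitrate in worst.items()
                  if loc in bench and bitrate < bench[loc] * (1 - threshold))

benchmarks = [{"location": "mumbai", "bitrate": 1000},
              {"location": "delhi", "bitrate": 1000}]
reports = [{"location": "mumbai", "bitrate": 700},
           {"location": "delhi", "bitrate": 950}]
print(bad_qos_locations(benchmarks, reports))  # ['mumbai']
```

In the streaming version the same comparison would run inside a time window per location, emitting alerts as soon as a window's worst report dips below the benchmark.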
  20. 26.

    26