Slide 1

Real-time stream processing using Apache Kafka

Slide 2

Agenda
● What is Apache Kafka?
● Why do we need stream processing?
● Stream processing using Apache Kafka
● Kafka @ Hotstar

Feel free to stop me for questions

Slide 3

$ whoami
● Personalisation lead at Hotstar
● Led Data Infrastructure team at Grofers and TinyOwl
● Kafka fanboy
● Usually rant on Twitter @jayeshsidhwani

Slide 4

What is Kafka?
● Kafka is a scalable, fault-tolerant, distributed queue
● Producers and Consumers
● Uses
○ Asynchronous communication in event-driven architectures
○ Message broadcast for database replication
Diagram credits: http://kafka.apache.org

Slide 5

Inside Kafka
● Brokers
○ Heart of Kafka
○ Store data
○ Data is stored in topics
● Zookeeper
○ Manages cluster state information
○ Leader election
[Diagram: brokers hosting topics, coordinated by Zookeeper, with producers (P) writing and consumers (C) reading]

Slide 6

Inside a topic
● Topics are partitioned
○ A partition is an append-only commit-log file
○ Partitioning achieves horizontal scalability
● Messages written to a partition are ordered
● Each message gets an auto-incrementing offset #
○ {“user_id”: 1, “term”: “GoT”} is a message in the topic searched
Diagram credits: http://kafka.apache.org
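The partition-and-offset model above can be sketched as a toy in Python. This is an illustration of the concept only, not the Kafka API: the `Topic` / `Partition` classes are invented, and real Kafka hashes keys with murmur2, not crc32.

```python
import zlib

class Partition:
    """Toy append-only commit log: a list plus auto-incrementing offsets."""
    def __init__(self):
        self.log = []

    def append(self, message):
        offset = len(self.log)    # the next offset is just the log length
        self.log.append(message)
        return offset

class Topic:
    """Toy partitioned topic (crc32 stands in for Kafka's murmur2 hash)."""
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

    def produce(self, key, message):
        # Same key -> same partition, so per-key ordering holds within a
        # partition (never across the whole topic).
        idx = zlib.crc32(key.encode()) % len(self.partitions)
        return idx, self.partitions[idx].append(message)

searched = Topic("searched", num_partitions=3)
p1, o1 = searched.produce("user-1", {"user_id": 1, "term": "GoT"})
p2, o2 = searched.produce("user-1", {"user_id": 1, "term": "IPL"})
# Both messages land in the same partition; offsets auto-increment: 0, then 1.
```

Note that ordering is a per-partition guarantee: two messages with different keys may land in different partitions and be consumed in any relative order.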

Slide 7

How do consumers read?
● Consumer subscribes to a topic
● Consumers read from the head of the queue
● Multiple consumers can read from a single topic
Diagram credits: http://kafka.apache.org

Slide 8

Kafka consumers scale horizontally
● Consumers can be grouped
● Consumer Groups
○ Horizontally scalable
○ Fault tolerant
○ Delivery guaranteed
Diagram credits: http://kafka.apache.org
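Why grouping scales horizontally: each partition is owned by exactly one consumer in the group, so adding consumers (up to the partition count) spreads the partitions out, and losing a consumer just triggers a reassignment over the survivors. A toy round-robin assignment sketch (Kafka's real assignor strategies are more sophisticated):

```python
def assign_partitions(partitions, consumers):
    """Map each partition to exactly one consumer in the group, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions shared by a group of 3 consumers: 2 partitions each.
print(assign_partitions(list(range(6)), ["c0", "c1", "c2"]))

# Fault tolerance: if c1 dies, rebalancing re-runs assignment over survivors.
print(assign_partitions(list(range(6)), ["c0", "c2"]))
```

A corollary worth remembering: more consumers than partitions leaves the extra consumers idle, since a partition is never split across group members.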

Slide 9

Stream processing and its use-cases

Slide 10

Discrete data processing models
● Request / Response processing mode
○ Processing time: <1 second
○ Clients can use this data

Slide 11

Discrete data processing models
● Request / Response processing mode
○ Processing time: <1 second
○ Clients can use this data
● Batch processing mode (DWH / Hadoop)
○ Processing time: a few hours to a day
○ Analysts can use this data

Slide 12

Discrete data processing models
● As the system grows, such synchronous processing models lead to a spaghetti, unmaintainable design
[Diagram: apps wired point-to-point to search, monitoring and cache systems]

Slide 13

Promise of stream processing
● Untangle movement of data
○ Single source of truth
○ No duplicate writes
○ Anyone can consume anything
○ Decouples data generation from data computation
[Diagram: apps publish into a stream processing framework; search, monitoring and cache consume from it]

Slide 14

Promise of stream processing
● Untangle movement of data
○ Single source of truth
○ No duplicate writes
○ Anyone can consume anything
● Process, transform and react to the data as it happens
○ Sub-second latencies
○ Anomaly detection on bad stream quality
○ Timely notifications to users who dropped off in a live match
[Diagram: Filter → Window → Join → Anomaly → Action pipeline feeding intelligence back to the apps]
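The filter / window / anomaly stages above can be sketched as a toy tumbling-window aggregation: count "bad quality" reports per 10-second window and flag windows that exceed a threshold. The event shape, window size, and threshold here are invented for illustration; a real pipeline would express this in a stream processing framework rather than a loop over a list.

```python
from collections import Counter

WINDOW_SECS = 10   # tumbling window size (assumed)
THRESHOLD = 2      # max bad reports tolerated per window (assumed)

def window_of(ts):
    return ts - (ts % WINDOW_SECS)   # start of the tumbling window

def detect_anomalies(events):
    """events: (timestamp_seconds, quality) tuples from the stream."""
    bad_per_window = Counter()
    for ts, quality in events:
        if quality == "bad":                    # Filter stage
            bad_per_window[window_of(ts)] += 1  # Window stage
    # Anomaly stage: windows with too many bad reports
    return sorted(w for w, n in bad_per_window.items() if n > THRESHOLD)

events = [(1, "bad"), (3, "good"), (5, "bad"), (8, "bad"), (12, "bad")]
print(detect_anomalies(events))   # [0]: window 0-10 has 3 bad reports
```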

Slide 15

Stream processing using Kafka

Slide 16

Stream processing frameworks
● Write your own?
○ Windowing
○ State management
○ Fault tolerance
○ Scalability
● Use frameworks such as Apache Spark, Samza, Storm
○ Batteries included
○ Cluster manager to coordinate resources
○ High memory / CPU footprint

Slide 17

Kafka Streams
● Kafka Streams is a simple, low-latency, framework-independent stream processing library
● Simple DSL
● Same principles as the Kafka consumer (minus the operational overhead)
● No cluster manager! yay!

Slide 18

Writing Kafka Streams
● Define a processing topology
○ Source nodes
○ Processor nodes
■ One or more
■ Filtering, windowing, joins etc.
○ Sink nodes
● Compile it and run it like any other Java application
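The source → processor → sink shape of a topology can be mimicked with plain generators, with no Kafka dependency. This is a toy model, not the Kafka Streams DSL; the node names and sample data are invented.

```python
def source(records):
    for record in records:       # source node: emits the input stream
        yield record

def filter_node(stream, predicate):
    return (r for r in stream if predicate(r))   # processor node: filter

def map_node(stream, fn):
    return (fn(r) for r in stream)               # processor node: transform

def sink(stream):
    return list(stream)          # sink node: materialise the results

search_terms = ["GoT", "", "ipl", "", "news"]
result = sink(
    map_node(
        filter_node(source(search_terms), lambda t: t != ""),
        str.upper,
    )
)
print(result)   # ['GOT', 'IPL', 'NEWS']
```

Like a real topology, nothing runs until the sink pulls: the generators are lazy, so each record flows through filter and map one at a time rather than in batch passes.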

Slide 19

Demo
Simple Kafka Stream

Slide 20

Kafka Streams architecture and operations
● Kafka manages
○ Parallelism
○ Fault tolerance
○ Ordering
○ State management
Diagram credits: http://confluent.io

Slide 21

Streaming joins and state-stores
● Beyond filtering and windowing
● Streaming joins are hard to scale
○ Kafka scales to 800k writes/sec*
○ How about your database?
● Solution: cache a static stream in-memory
○ Join it with the running stream
○ Stream <> table duality
● Kafka supports an in-memory cache out of the box
○ RocksDB
○ In-memory hash
○ Persistent / transient
Diagram credits: http://confluent.io
*achieved using the librdkafka C++ library
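The stream <> table duality above can be sketched as a toy stream-table join: the slow-changing stream (say, CDN benchmarks per location) is compacted into an in-memory dict, standing in for a state store such as RocksDB, and the fast stream of client reports is joined against it. All field names and the bad-QoS rule here are invented for illustration.

```python
state_store = {}   # location -> benchmark bitrate: the "table" side

def consume_benchmark(location, bitrate):
    """Each benchmark event upserts the table: latest value per key wins."""
    state_store[location] = bitrate

def join_client_report(location, observed_bitrate):
    """Join a client report against the cached benchmark for its location."""
    benchmark = state_store.get(location)
    if benchmark is None:
        return None                   # no benchmark yet: cannot judge quality
    return {"location": location,
            "bad_qos": observed_bitrate < 0.5 * benchmark}   # assumed rule

consume_benchmark("mumbai", 4000)
consume_benchmark("delhi", 3500)
print(join_client_report("mumbai", 1200))   # well below benchmark -> bad QoS
print(join_client_report("delhi", 3400))    # healthy
```

The point of the in-memory cache is exactly this lookup path: the join hits local state on every event instead of a remote database, which is what lets it keep up with the write rate of the stream.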

Slide 22

Demo
● Inputs:
○ Incoming stream of benchmark stream quality from the CDN provider
○ Incoming stream quality reported by Hotstar clients
● Output:
○ Calculate the locations reporting bad QoS in real-time

Slide 23

Demo
[Diagram: CDN benchmarks and client reports joined to produce alerts]

Slide 24

KSQL - Kafka Streams ++

Slide 25

Kafka @ Hotstar
