Running Production Kafka Clusters in Kubernetes
Balthazar Rouberol - Datadog
London Kafka Summit - May 2019
Slide 2
Slide 2 text
– Why are we even doing this?
– Established Kafka tooling and practices at Datadog
– Description of important Kubernetes resources
– Deployment of Kafka in Kubernetes
– Description of routine operations
Agenda
Slide 3
Slide 3 text
Who am I?
@brouberol
Data Reliability Engineer @ Datadog
Slide 4
Slide 4 text
– New instance of Datadog in Europe
– Completely independent and isolated from the existing system
– Leave legacy behind and start fresh
– Have every team use it
Background
Slide 5
Slide 5 text
– Dedicated resources
– Local storage
– Clusters running up to hundreds of instances
– Rack awareness
– Unsupervised (when possible) operations backed by kafka-kit *
– Simplified configuration
* https://github.com/datadog/kafka-kit
Objectives
“Map”: assignment of a set of topics to Kafka brokers
map1: "events.*" => [1001,1002,1003,1004,1005,1006]
map2: "check_runs|notifications" => [1007,1008,1009]
Topic mapping
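A minimal sketch of how such a map could be resolved for a given topic. The dictionary layout, the function name and the choice of full-match regex semantics are assumptions made for illustration; the slides only give the map definitions themselves.

```python
import re

# Hypothetical map definitions mirroring the slide: a topic regex and the
# broker IDs that should host the topics matching it.
TOPIC_MAPS = {
    "map1": (r"events.*", [1001, 1002, 1003, 1004, 1005, 1006]),
    "map2": (r"check_runs|notifications", [1007, 1008, 1009]),
}


def brokers_for_topic(topic, maps=TOPIC_MAPS):
    """Return (map_name, broker_ids) for the first map whose regex matches.

    Assumption: the regex must match the whole topic name.
    """
    for name, (pattern, brokers) in maps.items():
        if re.fullmatch(pattern, topic):
            return name, brokers
    raise LookupError(f"no map covers topic {topic!r}")
```

For example, brokers_for_topic("notifications") resolves to map2 and its three brokers, while a topic covered by no map raises LookupError.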
Slide 13
Slide 13 text
“Map”: assignment of a set of topics to Kafka brokers
map1: "events.*" => 6x i3.4xlarge
map2: "check_runs|notifications" => 3x i3.8xlarge
Topic mapping
Slide 14
Slide 14 text
k8s concepts
Slide 15
Slide 15 text
NodeGroup
Slide 16
Slide 16 text
StatefulSet
Slide 17
Slide 17 text
Persistent Volume (Claim)
Slide 18
Slide 18 text
Cluster deployment in k8s
Slide 19
Slide 19 text
One broker pod per node (nodeAffinity & podAntiAffinity)
Broker deployment
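As an illustration of the scheduling constraint, the affinity stanza of a broker pod spec might look like the following, written here as a Python dict. The field names are the standard Kubernetes ones; the "nodegroup" node label and the "app" pod label are hypothetical, since the slides do not show the actual manifests.

```python
def broker_affinity(node_group, app_label):
    """Build the `affinity` stanza of a broker pod spec as a plain dict."""
    return {
        # Only schedule onto nodes belonging to the dedicated node group.
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        "key": "nodegroup",  # hypothetical node label
                        "operator": "In",
                        "values": [node_group],
                    }]
                }]
            }
        },
        # Never co-locate two broker pods on the same node (hostname).
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [{
                "labelSelector": {"matchLabels": {"app": app_label}},  # hypothetical pod label
                "topologyKey": "kubernetes.io/hostname",
            }]
        },
    }
```

The node affinity pins brokers to their dedicated nodes, and the anti-affinity on kubernetes.io/hostname is what yields at most one broker pod per node.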
Slide 20
Slide 20 text
One NodeGroup/StatefulSet per map
Slide 21
Slide 21 text
– Instance store drives
– Data is persisted between pod restarts
– Data replicated on new nodes
– Rack-awareness: query zone of current node via k8s API at init
Data persistence and locality
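The rack-awareness step could be sketched as below, assuming the node object has already been fetched from the Kubernetes API and is available as a dict. The function names are hypothetical; the zone labels are the well-known Kubernetes ones, and broker.rack is the Kafka broker setting for rack awareness.

```python
# Well-known zone labels set on nodes by Kubernetes.
ZONE_LABELS = (
    "topology.kubernetes.io/zone",
    "failure-domain.beta.kubernetes.io/zone",  # older, deprecated label
)


def zone_of(node):
    """Return the availability zone recorded in a node's labels."""
    labels = node.get("metadata", {}).get("labels", {})
    for key in ZONE_LABELS:
        if key in labels:
            return labels[key]
    raise KeyError("node carries no zone label")


def rack_config_line(node):
    """Render the broker.rack line an init step could write to the broker config."""
    return f"broker.rack={zone_of(node)}"
```

How the node object is obtained (client library, service account, init container) is left out here, as the slides do not describe it.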
Slide 23
Slide 23 text
Broker identity
Slide 25
Slide 25 text
ZooKeeper:
– Liveness: port 2181 open?
– Readiness: leader/follower?
Kafka:
– Liveness: port 9092 open?
– Readiness: broker 100% in-sync?
– Break glass: readiness can be forced (when two incidents coincide)
Pod health and readiness
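A rough sketch of the Kafka probes described above. Everything here is an assumption about how such checks could be written: the partition metadata is passed in as plain dicts with "replicas" and "isr" lists, and the force file path is a hypothetical stand-in for the readiness override.

```python
import os
import socket


def port_open(host, port, timeout=1.0):
    """Liveness: can a TCP connection be opened to the given port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def broker_in_sync(broker_id, partitions):
    """True if the broker is in the ISR of every partition it replicates."""
    return all(
        broker_id in p["isr"]
        for p in partitions
        if broker_id in p["replicas"]
    )


def kafka_ready(broker_id, partitions, force_file="/tmp/force-ready"):
    """Readiness: fully in sync, unless an operator forces it via a file."""
    if os.path.exists(force_file):  # hypothetical override mechanism
        return True
    return broker_in_sync(broker_id, partitions)
```

Gating readiness on the in-sync state means a restarted broker is not reported ready until it has caught up on all its partitions.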
Slide 26
Slide 26 text
Safe rolling-restarts
Slide 27
Slide 27 text
– Topic definition in a ConfigMap
– Regularly applied via a CronJob
– Broker ids/map resolved by looking up the k8s API
Topic management
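One way the broker ID lookup could work is to group broker pods by the map they belong to, as sketched below. The pod objects are assumed to have been listed from the Kubernetes API already; the "kafka-map" label and "broker-id" annotation are hypothetical names, not taken from the talk.

```python
from collections import defaultdict


def brokers_by_map(pods):
    """Group broker IDs by map name from a list of pod objects (as dicts)."""
    grouped = defaultdict(list)
    for pod in pods:
        meta = pod["metadata"]
        map_name = meta["labels"]["kafka-map"]  # hypothetical label
        broker_id = int(meta["annotations"]["broker-id"])  # hypothetical annotation
        grouped[map_name].append(broker_id)
    # Sort for a stable result regardless of the order pods were listed in.
    return {name: sorted(ids) for name, ids in grouped.items()}
```

The resulting mapping from map name to broker IDs is the kind of input a periodic job would need to place each topic on the right brokers.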
Slide 28
Slide 28 text
Operations
Slide 29
Slide 29 text
Broker replacement
Slide 36
Slide 36 text
SSL certificate management
Slide 37
Slide 37 text
SSL certificate management
Slide 38
Slide 38 text
– We twist the Kubernetes model to ensure dedicated resources
– We take advantage of Kubernetes’ APIs to simplify configuration and operations
– The kafka-kit tooling works well in Kubernetes
– We gradually automate operations where possible
Conclusion