Running large Kafka clusters with minimum toil
Balthazar Rouberol
DRE - Datadog
Slide 2
Who am I?
Balthazar Rouberol
Senior Data Reliability Engineer, Datadog
Slide 3
– Multiple regions / datacenters / cloud providers
– Dozens of Kafka/ZooKeeper clusters
– Petabytes of data on local storage
– Trillions of messages per day
– Double-digit GB/s bandwidth
– 2 (mostly) dedicated SREs
Our Kafka infrastructure
Slide 6
● Disk full
● Broker dead
● Storage hotspot
● Network hotspot
● Hot reassignment
● Expired SSL certificates
● $$$
● Computers
What can go wrong?
Slide 7
● Partition assignment calculation
● Investigating under-replication
● Replacing brokers
● Adjusting reassignment throttle
● Scaling up / down
● Computers
● Humans
What can be time consuming?
Slide 8
Tooling
Slide 9
A good partition assignment enforces rack balancing and de-hotspots:
● disk usage
● network throughput
● leadership
Getting partition assignment right
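As a rough illustration of the rack-balancing half of this, the sketch below flags partitions whose replicas share a rack. The `assignment` and `broker_rack` mappings are hypothetical inputs invented for the example; they are not a Kafka or topicmappr data format.

```python
def rack_violations(assignment, broker_rack):
    """Return the partitions that have two or more replicas in the same rack.

    assignment:  {partition name: [broker ids holding a replica]}  (hypothetical shape)
    broker_rack: {broker id: rack name}                            (hypothetical shape)
    """
    violations = []
    for partition, replicas in assignment.items():
        racks = [broker_rack[broker] for broker in replicas]
        # Fewer distinct racks than replicas means at least two replicas share a rack.
        if len(set(racks)) < len(racks):
            violations.append(partition)
    return violations
```

A check like this could run against any proposed partition map before it is applied.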
Slide 10
Homogeneous partition size?
Slide 12
Usage:
topicmappr [command]
Available Commands:
help Help about any command
rebalance Rebalance partition allotments among a set of topics and brokers
rebuild Rebuild a partition map for one or more topics
version Print the version
https://github.com/datadog/kafka-kit
Enter topicmappr
Slide 13
$ topicmappr rebuild --topics --brokers
● Assumes homogeneous partition size by default
● Can binpack on partition sizes and disk usage
● Possible optimizations:
○ Partition spread
○ Storage homogeneity
○ Leadership / broker
topicmappr rebuild
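To make "binpack on partition sizes" concrete, here is a minimal greedy sketch: place the largest partitions first, each replica on the least-loaded broker. This is a generic heuristic written for illustration, not topicmappr's actual algorithm; the function name and input shapes are assumptions.

```python
def binpack_partitions(partition_sizes, broker_ids, replication_factor=1):
    """Greedy size-aware placement (illustrative only, not topicmappr's code).

    partition_sizes: {partition name: size in bytes}
    Returns (assignment, usage): replicas per partition and bytes per broker.
    """
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor exceeds broker count")
    usage = {broker: 0 for broker in broker_ids}
    assignment = {}
    # Largest partitions first, so the small ones can even out the remainder.
    by_size = sorted(partition_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for partition, size in by_size:
        # Least-loaded brokers first; ties broken by broker id for determinism.
        # Slicing a list of distinct brokers keeps replicas on distinct brokers.
        replicas = sorted(broker_ids, key=lambda b: (usage[b], b))[:replication_factor]
        for broker in replicas:
            usage[broker] += size
        assignment[partition] = replicas
    return assignment, usage
```

Unlike an assignment that assumes homogeneous partition sizes, this one balances bytes per broker rather than partition count per broker.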
● Target ~80-85% utilization (disk storage, bandwidth) per broker pool
● Rebalance first, scale up with leeway
Capacity model
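One way to read this capacity model is as a simple decision rule: if a single broker is hot but the pool as a whole is under target, rebalance; if the pool itself is at target, add brokers. The sketch below encodes that reading. The 0.80 threshold comes from the slide; the function names and the exact rule are assumptions.

```python
import math


def capacity_action(broker_usage, broker_capacity, target=0.80):
    """Suggest 'ok', 'rebalance' or 'scale up' for one resource (disk or bandwidth).

    broker_usage:    {broker id: amount used}, same unit as broker_capacity
    broker_capacity: capacity of a single broker (pool assumed homogeneous)
    """
    pool = sum(broker_usage.values()) / (broker_capacity * len(broker_usage))
    hottest = max(broker_usage.values()) / broker_capacity
    if pool >= target:
        return "scale up"   # the pool as a whole is out of headroom
    if hottest >= target:
        return "rebalance"  # headroom exists, it is just unevenly used
    return "ok"


def brokers_needed(total_usage, broker_capacity, target=0.80):
    """Smallest broker count that keeps pool utilization at or below target."""
    return math.ceil(total_usage / (broker_capacity * target))
```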
Slide 22
autothrottle: reassign fast enough
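The idea of reassigning "fast enough" without starving production traffic can be sketched as giving replication a share of whatever bandwidth is currently free. This is a simplified illustration under assumed parameters (`headroom_fraction`, `min_rate_mbps`), not autothrottle's actual implementation.

```python
def replication_throttle(link_capacity_mbps, current_traffic_mbps,
                         headroom_fraction=0.9, min_rate_mbps=10.0):
    """Return a replication throttle (MB/s) derived from free bandwidth.

    All parameter names and defaults are hypothetical.
    """
    # Bandwidth not currently used by regular traffic (never negative).
    free = max(link_capacity_mbps - current_traffic_mbps, 0.0)
    # Keep a floor so a busy broker still makes progress on the reassignment.
    return max(free * headroom_fraction, min_rate_mbps)
```

Re-evaluating this periodically lets the throttle rise when traffic drops and fall when traffic spikes.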
Slide 23
Adjust retention, don’t page
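A minimal sketch of what "adjust retention" could mean in practice: when a disk crosses a usage target, shrink the retention proportionally instead of paging a human. It assumes disk usage scales roughly linearly with retention, which is a simplification; the names and defaults are hypothetical.

```python
def adjusted_retention_ms(current_retention_ms, disk_used_fraction,
                          target_fraction=0.8, min_retention_ms=3_600_000):
    """Return a retention (ms) expected to bring disk usage back to target.

    Assumes usage is proportional to retention; never goes below the floor.
    """
    if disk_used_fraction <= target_fraction:
        return current_retention_ms  # nothing to do
    scaled = round(current_retention_ms * target_fraction / disk_used_fraction)
    return max(scaled, min_retention_ms)
```

For example, a 48-hour retention on a disk that is 96% full would be reduced to roughly 40 hours to aim for 80% usage.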
Slide 24
SSL certificates hot reloading
Slide 26
Make everything discoverable
$ autothrottle-cli get
no throttle override is set
$ curl localhost:8080/api/kafka/ops/throttle
{
"throttle": null,
"autoremove": false
}
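The CLI output and the HTTP response above describe the same state, which suggests a thin CLI layered over the API. The sketch below shows how such a wrapper might render the JSON body. Only the null-throttle case is taken from the slide; the wording for a set throttle is invented for illustration.

```python
import json


def describe_throttle(response_body):
    """Turn the throttle API's JSON body into a human-readable line."""
    state = json.loads(response_body)
    if state.get("throttle") is None:
        return "no throttle override is set"
    # Hypothetical wording: the slide does not show this case.
    suffix = " (removed automatically)" if state.get("autoremove") else ""
    return f"throttle override: {state['throttle']}{suffix}"
```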
Slide 27
Build layered tooling
Slide 28
Monitoring
Slide 29
● Storage hotspot (>90%)
● Sustained elevated traffic
● Under-replication by topic/cluster
● Long-running reassignment
● Replication factor = 1
● Set write success SLI/SLO
● SSL certificate TTL
Monitoring
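As one example from this list, the SSL certificate TTL check can be sketched with Python's standard `ssl` module, which parses the `notAfter` string found in a certificate. The 30-day alert threshold and the function names are assumptions.

```python
import ssl

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after, now_epoch):
    """Days left before a certificate expires.

    not_after: the certificate's notAfter field, e.g. "Jan  5 09:34:43 2018 GMT"
    now_epoch: current time as seconds since the Unix epoch
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now_epoch) / SECONDS_PER_DAY


def certificate_needs_attention(not_after, now_epoch, threshold_days=30):
    """True when the certificate expires within threshold_days (hypothetical default)."""
    return days_until_expiry(not_after, now_epoch) < threshold_days
```

Exporting the remaining days as a metric, rather than only alerting, makes an upcoming expiry visible on dashboards well before it pages.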
Slide 30
– Alert by topic or even cluster
– Exports tagged partition metrics
– Automatically muted during rolling restarts
Monitoring: under-replication
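The points above can be sketched as a per-topic aggregation of partition state, with alerts suppressed while a rolling restart is running. The partition record shape (`topic`, `replicas`, `isr`) and the function names are assumptions made for the example.

```python
from collections import Counter


def under_replicated_by_topic(partitions):
    """Count under-replicated partitions per topic.

    partitions: iterable of dicts with 'topic', 'replicas' and 'isr' keys
                (hypothetical shape); a partition is under-replicated when
                its in-sync replica set is smaller than its replica set.
    """
    counts = Counter()
    for partition in partitions:
        if len(partition["isr"]) < len(partition["replicas"]):
            counts[partition["topic"]] += 1
    return dict(counts)


def topics_to_alert(counts, rolling_restart_in_progress):
    """Topics that should alert; none while a rolling restart is in progress."""
    if rolling_restart_in_progress:
        return []
    return sorted(topic for topic, count in counts.items() if count > 0)
```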
Slide 31
Measure write success
Slide 32
Measure write success: poor man’s version
– Write synthetic data to an SLI topic
– Every broker is the leader of at least one partition
– Should reflect write success
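A minimal sketch of the two checks implied here: the success ratio of synthetic writes, and a guard that every broker leads at least one partition of the SLI topic (otherwise some brokers are never exercised). Input shapes and names are hypothetical.

```python
def write_success_ratio(results):
    """Fraction of successful synthetic writes.

    results: list of booleans, one per attempted write.
    """
    if not results:
        raise ValueError("no write attempts recorded")
    return sum(results) / len(results)


def brokers_without_leadership(partition_leaders, broker_ids):
    """Brokers that lead no partition of the SLI topic.

    partition_leaders: {partition number: leader broker id}
    """
    return set(broker_ids) - set(partition_leaders.values())
```

If `brokers_without_leadership` returns anything, the measured ratio says nothing about writes to those brokers.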
Slide 33
– Kafka admin tools are not sufficient at scale
– Measure partition volume
– Measure under-replication per topic
– Partition assignment is a machine job
– Know your bottleneck (storage / bandwidth)
– Make everything discoverable
– Monitor unsafe configuration
– Set write success SLO
Conclusion