Slide 1

Slide 1 text

Running Production Kafka Clusters in Kubernetes
Balthazar Rouberol - Datadog
London Kafka Summit - May 2019

Slide 2

Slide 2 text

Agenda
– Why are we even doing this?
– Established Kafka tooling and practices at Datadog
– Description of important Kubernetes resources
– Deployment of Kafka in Kubernetes
– Description of routine operations

Slide 3

Slide 3 text

Who am I?
@brouberol
Data Reliability Engineer @ Datadog

Slide 4

Slide 4 text

Background
– New instance of Datadog in Europe
– Completely independent and isolated from the existing system
– Leave legacy behind and start fresh
– Have every team use it

Slide 5

Slide 5 text

Objectives
– Dedicated resources
– Local storage
– Clusters running up to hundreds of instances
– Rack awareness
– Unsupervised (when possible) operations backed by kafka-kit *
– Simplified configuration
* https://github.com/datadog/kafka-kit

Slide 6

Slide 6 text

Tooling

Slide 7

Slide 7 text

Kafka-Kit: scaling operations
– https://github.com/datadog/kafka-kit
– topicmappr:
  ○ partition to broker mapping
  ○ failed broker replacement
  ○ storage-based cluster rebalancing

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Kafka-Kit: scaling operations
– https://github.com/datadog/kafka-kit
– topicmappr:
  ○ partition to broker mapping
  ○ failed broker replacement
  ○ storage-based cluster rebalancing
– autothrottle: replication auto-throttling

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Kafka-Kit: scaling operations
– https://github.com/datadog/kafka-kit
– topicmappr:
  ○ partition to broker mapping
  ○ failed broker replacement
  ○ storage-based cluster rebalancing
– autothrottle: replication auto-throttling
– pluggable metrics backend

Slide 12

Slide 12 text

Topic mapping
“Map”: assignment of a set of topics to Kafka brokers
 map1: "events.*" => [1001,1002,1003,1004,1005,1006]
 map2: "check_runs|notifications" => [1007,1008,1009]
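
To make the idea of a map concrete, here is a minimal Python sketch that resolves a topic name to the brokers of the first map whose regex matches it. This is not topicmappr's implementation: the `TOPIC_MAPS` structure, the `brokers_for_topic` helper and the full-match semantics are assumptions chosen for illustration; only the regexes and broker ids come from the slide.

```python
import re

# Hypothetical structure mirroring the slide: map name -> (topic regex, broker ids).
TOPIC_MAPS = {
    "map1": (r"events.*", [1001, 1002, 1003, 1004, 1005, 1006]),
    "map2": (r"check_runs|notifications", [1007, 1008, 1009]),
}


def brokers_for_topic(topic, maps=TOPIC_MAPS):
    """Return (map name, broker ids) for the first map whose regex
    matches the whole topic name, or None if no map claims the topic."""
    for name, (pattern, brokers) in maps.items():
        if re.fullmatch(pattern, topic):
            return name, brokers
    return None
```

Calling `brokers_for_topic("events.raw")` would select `map1`, so every partition of that topic stays on the six brokers dedicated to it, while a topic matched by no map returns `None`.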

Slide 13

Slide 13 text

Topic mapping
“Map”: assignment of a set of topics to Kafka brokers
 map1: "events.*" => 6x i3.4xlarge
 map2: "check_runs|notifications" => 3x i3.8xlarge

Slide 14

Slide 14 text

k8s concepts

Slide 15

Slide 15 text

NodeGroup

Slide 16

Slide 16 text

StatefulSet

Slide 17

Slide 17 text

Persistent Volume (Claim)

Slide 18

Slide 18 text

Cluster deployment in k8s

Slide 19

Slide 19 text

Broker deployment
One broker pod per node (nodeAffinity & podAntiAffinity)
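
As an illustration of how nodeAffinity and podAntiAffinity can combine to give one broker pod per node, the Python sketch below builds the `affinity` stanza of a pod spec as a plain dict. The field names are those of the Kubernetes API; the `nodegroup` node label, the `app` pod label and the `broker_affinity` helper are assumptions, not the manifests used at Datadog.

```python
import json


def broker_affinity(node_group, app_label):
    """Build a pod-spec `affinity` stanza as a plain dict.

    nodeAffinity pins the pod to nodes of one node group;
    podAntiAffinity refuses nodes already hosting a pod with the same label.
    """
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        "key": "nodegroup",  # hypothetical node label
                        "operator": "In",
                        "values": [node_group],
                    }]
                }]
            }
        },
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [{
                # hypothetical pod label shared by all brokers of a map
                "labelSelector": {"matchLabels": {"app": app_label}},
                # one matching pod at most per hostname, i.e. per node
                "topologyKey": "kubernetes.io/hostname",
            }]
        },
    }


if __name__ == "__main__":
    print(json.dumps(broker_affinity("kafka-map1", "kafka-map1"), indent=2))
```

Anti-affinity only keeps brokers away from each other; keeping unrelated pods off the dedicated nodes would need something extra, such as taints and tolerations, which the slide does not cover.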

Slide 20

Slide 20 text

One NodeGroup/StatefulSet per map

Slide 21

Slide 21 text

Data persistence and locality
– Instance store drives
– Data is persisted between pod restarts
– Data replicated on new nodes
– Rack-awareness: query zone of current node via k8s API at init
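
The rack-awareness step can be sketched as follows, assuming the init step has already fetched the Node object from the Kubernetes API as a dict. The two zone labels are well-known Kubernetes labels and `broker.rack` is the Kafka broker setting for rack awareness; the helper names and the idea of emitting a properties line are assumptions for illustration.

```python
# Well-known zone labels: the current one first, then the older beta label.
ZONE_LABELS = (
    "topology.kubernetes.io/zone",
    "failure-domain.beta.kubernetes.io/zone",
)


def rack_from_node(node):
    """Extract the availability zone from a Node object (as a dict)."""
    labels = node.get("metadata", {}).get("labels", {})
    for key in ZONE_LABELS:
        if key in labels:
            return labels[key]
    raise LookupError("node has no zone label")


def broker_rack_line(node):
    """Render the line an init step could append to server.properties."""
    return f"broker.rack={rack_from_node(node)}"
```

With `broker.rack` set per broker, Kafka can spread the replicas of a partition across zones when assigning them.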

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

Broker identity

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Pod health and readiness
ZooKeeper:
– Liveness: port 2181 open?
– Readiness: leader/follower?
Kafka:
– Liveness: port 9092 open?
– Readiness: broker 100% in-sync?
– Break the glass: readiness can be forced (when 2 incidents coincide)
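
The Kafka readiness rule can be sketched as a pure function over partition metadata: a broker is ready when it is in the in-sync replica set of every partition it replicates, unless readiness is forced. The metadata shape (`replicas` and `isr` lists per partition) and the `force_ready` flag are assumptions made for this example; the talk does not show the actual probe.

```python
def broker_is_ready(broker_id, partitions, force_ready=False):
    """Return True when the broker is in-sync for all of its partitions.

    partitions: iterable of dicts with "replicas" and "isr" lists of broker ids.
    force_ready: break-the-glass override that skips the in-sync check.
    """
    if force_ready:
        return True
    for partition in partitions:
        is_replica = broker_id in partition["replicas"]
        if is_replica and broker_id not in partition["isr"]:
            return False  # still catching up on at least one partition
    return True
```

Because a rolling restart waits for a pod to become ready before moving on, a probe of this kind stops the next broker from restarting while the previous one is still catching up.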

Slide 26

Slide 26 text

Safe rolling-restarts

Slide 27

Slide 27 text

Topic management
– Topic definition in a ConfigMap
– Regularly applied via a CronJob
– Broker ids/map resolved by looking up the k8s API
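
A periodic job of this kind is usually written as an idempotent reconciliation: compare the desired topic definitions with what the cluster reports and act only on the difference. The sketch below shows just that comparison; the `plan_topic_changes` helper and the shape of the topic definitions are hypothetical, and the calls that would read the ConfigMap or talk to Kafka are deliberately left out.

```python
def plan_topic_changes(desired, existing):
    """Compare desired topic configs with the cluster's current ones.

    Both arguments map topic name -> config dict.
    Returns (to_create, to_update); topics absent from `desired` are left alone.
    """
    to_create = {
        name: cfg for name, cfg in desired.items() if name not in existing
    }
    to_update = {
        name: cfg
        for name, cfg in desired.items()
        if name in existing and existing[name] != cfg
    }
    return to_create, to_update
```

Running the same plan twice against an up-to-date cluster yields two empty dicts, which is what makes it safe to apply on a schedule.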

Slide 28

Slide 28 text

Operations

Slide 29

Slide 29 text

Broker replacement

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

SSL certificate management

Slide 37

Slide 37 text

SSL certificate management

Slide 38

Slide 38 text

Conclusion
– We twist the Kubernetes model to ensure dedicated resources
– We take advantage of Kubernetes’ APIs to simplify configuration and operations
– The kafka-kit tooling works well in Kubernetes
– We gradually automate operations where possible

Slide 39

Slide 39 text

Thank you! @brouberol
 
 We’re hiring! https://www.datadoghq.com/careers