Running Production Kafka Clusters in Kubernetes

Deploying stateful production applications such as Kafka in Kubernetes is often seen as ill-advised. The arguments are that it’s easy to get wrong, requires learning new skills, is too risky for unclear gains, or that Kubernetes is simply too young a project. This does not have to be true, and we will explain why. When Datadog chose to migrate its entire infrastructure to Kubernetes, my team was tasked with deploying reliable, production-ready Kafka clusters.

This talk will go over our deployment strategy and lessons learned, and describe the challenges we faced along the way as well as the reliability benefits we have observed.

This presentation will go through:
– an introduction to the tools and practices established by Datadog
– a brief introduction to Kubernetes and associated concepts
– a deep dive into the deployment and bootstrap strategy of a production-bearing Kafka cluster in Kubernetes
– a walkthrough of some routine operations in a Kubernetes-based Kafka cluster

Balthazar Rouberol

May 14, 2019

Transcript

  1. Running Production Kafka Clusters in Kubernetes
    Balthazar Rouberol - Datadog
    London Kafka Summit - May 2019
  2. Agenda
    – Why are we even doing this?
    – Established Kafka tooling and practices at Datadog
    – Description of important Kubernetes resources
    – Deployment of Kafka in Kubernetes
    – Description of routine operations
  3. Who am I? @brouberol Data Reliability Engineer @ Datadog

  4. Background
    – New instance of Datadog in Europe
    – Completely independent and isolated from the existing system
    – Leave legacy behind and start fresh
    – Have every team use it
  5. Objectives
    – Dedicated resources
    – Local storage
    – Clusters running up to hundreds of instances
    – Rack awareness
    – Unsupervised (when possible) operations backed by kafka-kit *
    – Simplified configuration
    * https://github.com/datadog/kafka-kit
  6. Tooling

  7. Kafka-Kit: scaling operations
    – https://github.com/datadog/kafka-kit
    – topicmappr:
      ◦ partition to broker mapping
      ◦ failed broker replacement
      ◦ storage-based cluster rebalancing
  8. None
  9. Kafka-Kit: scaling operations
    – https://github.com/datadog/kafka-kit
    – topicmappr:
      ◦ partition to broker mapping
      ◦ failed broker replacement
      ◦ storage-based cluster rebalancing
    – autothrottle: replication auto-throttling
  10. None
  11. Kafka-Kit: scaling operations
    – https://github.com/datadog/kafka-kit
    – topicmappr:
      ◦ partition to broker mapping
      ◦ failed broker replacement
      ◦ storage-based cluster rebalancing
    – autothrottle: replication auto-throttling
    – pluggable metrics backend
  12. Topic mapping
    “Map”: assignment of a set of topics to Kafka brokers
    map1: "events.*" => [1001,1002,1003,1004,1005,1006]
    map2: "check_runs|notifications" => [1007,1008,1009]
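
The map idea on this slide can be sketched in a few lines of Python. This is only an illustration, not how topicmappr itself is implemented: the `TOPIC_MAPS` structure and the `brokers_for_topic` helper are hypothetical names, and matching the whole topic name against the regex is an assumption.

```python
import re

# Hypothetical in-memory form of the two maps shown on the slide:
# each map pairs a topic-name regex with the broker IDs that host matching topics.
TOPIC_MAPS = {
    "map1": (r"events.*", [1001, 1002, 1003, 1004, 1005, 1006]),
    "map2": (r"check_runs|notifications", [1007, 1008, 1009]),
}


def brokers_for_topic(topic, maps=TOPIC_MAPS):
    """Return (map name, broker IDs) for the first map whose regex matches the
    whole topic name, or None when no map covers the topic."""
    for name, (pattern, brokers) in maps.items():
        # Assumption: the regex must match the full topic name.
        if re.fullmatch(pattern, topic):
            return name, brokers
    return None


if __name__ == "__main__":
    for topic in ("events.web", "notifications", "unmapped"):
        print(topic, "->", brokers_for_topic(topic))
```

Keeping each map's broker list disjoint from the others is what lets a map later be backed by its own group of instances.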
  13. Topic mapping
    “Map”: assignment of a set of topics to Kafka brokers
    map1: "events.*" => 6x i3.4xlarge
    map2: "check_runs|notifications" => 3x i3.8xlarge
  14. k8s concepts

  15. NodeGroup

  16. StatefulSet

  17. Persistent Volume (Claim)

  18. Cluster deployment in k8s

  19. One broker pod per node (nodeAffinity & podAntiAffinity) Broker deployment
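
One possible shape for the scheduling constraints named on this slide, written as a Python function that builds the pod `affinity` stanza. The field names follow the Kubernetes pod spec, but the `nodegroup` node label, the `app` pod label, and the function name are assumptions made for illustration; the talk does not show the actual manifests.

```python
def broker_affinity(node_group, app_label):
    """Build a pod `affinity` stanza that pins brokers to one node group
    and forbids two broker pods from sharing a node."""
    return {
        # nodeAffinity: only schedule onto nodes carrying the (hypothetical)
        # `nodegroup` label with the expected value.
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "nodegroup",
                                "operator": "In",
                                "values": [node_group],
                            }
                        ]
                    }
                ]
            }
        },
        # podAntiAffinity: refuse any node that already runs a pod with the
        # same (hypothetical) `app` label, giving one broker per node.
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        },
    }


if __name__ == "__main__":
    import pprint

    pprint.pprint(broker_affinity("kafka-map1", "kafka-map1"))
```

Using the hostname as the anti-affinity topology key is what turns "spread the pods" into "at most one broker per node".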

  20. One NodeGroup/StatefulSet per map

  21. Data persistence and locality
    – Instance store drives
    – Data is persisted between pod restarts
    – Data replicated on new nodes
    – Rack-awareness: query zone of current node via k8s API at init
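
The rack-awareness step could look roughly like the sketch below. It assumes the init step has already fetched the labels of the node the pod runs on through the Kubernetes API and only shows how they might be turned into Kafka's `broker.rack` setting. The helper names are hypothetical, and the slides do not say which node label is actually read; the two well-known zone labels are tried here.

```python
# Well-known Kubernetes zone labels, newest first.
ZONE_LABELS = (
    "topology.kubernetes.io/zone",
    "failure-domain.beta.kubernetes.io/zone",  # older, deprecated label
)


def zone_from_node_labels(labels):
    """Return the availability zone recorded in a node's labels."""
    for key in ZONE_LABELS:
        if key in labels:
            return labels[key]
    raise KeyError("node has no zone label")


def rack_config_line(labels):
    """Render the broker configuration line that makes Kafka rack-aware."""
    return f"broker.rack={zone_from_node_labels(labels)}"


if __name__ == "__main__":
    # Example labels as they might appear on a node; the values are made up.
    node_labels = {
        "kubernetes.io/hostname": "node-a",
        "topology.kubernetes.io/zone": "eu-west-1b",
    }
    print(rack_config_line(node_labels))
```

With `broker.rack` set per broker, Kafka spreads the replicas of a partition across zones when partitions are assigned.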
  22. None
  23. Broker identity

  24. None
  25. Pod health and readiness
    ZooKeeper:
    – Liveness: port 2181 open?
    – Readiness: leader/follower?
    Kafka:
    – Liveness: port 9092 open?
    – Readiness: broker 100% in-sync?
    – break the glass: forceable readiness (when 2 incidents coincide)
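
The checks listed on this slide might be implemented along the following lines. This is a sketch under stated assumptions: the ZooKeeper role is read from the output of its `srvr` four-letter command, "100% in-sync" is approximated by a count of under-replicated partitions supplied by the caller, and the `force_ready` flag stands in for the break-glass override. The talk does not describe the actual probe scripts, and all function names are hypothetical.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Liveness-style check: can a TCP connection to host:port be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def zookeeper_ready(srvr_output):
    """Readiness: the node reports a leader or follower role in `srvr` output."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip() in ("leader", "follower")
    return False


def kafka_ready(under_replicated_partitions, force_ready=False):
    """Readiness: the broker has no under-replicated partitions, unless an
    operator has forced readiness during an incident."""
    return force_ready or under_replicated_partitions == 0


if __name__ == "__main__":
    print(zookeeper_ready("Zookeeper version: 3.4.13\nMode: follower\n"))
    print(kafka_ready(3))
    print(kafka_ready(3, force_ready=True))
```

Gating readiness on replication state is what lets a rolling restart wait for each broker to catch up before the next pod is restarted.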
  26. Safe rolling-restarts

  27. Topic management
    – Topic definition in a ConfigMap
    – Regularly applied via a CronJob
    – Broker ids/map resolved by looking up the k8s API
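
The periodic "apply" step can be thought of as a reconciliation between the topic definitions held in the ConfigMap and what the cluster currently reports. The sketch below only computes the plan. The dictionary layout, the field names, and the `plan_topic_changes` helper are assumptions; the talk does not show the real job.

```python
def plan_topic_changes(desired, existing):
    """Compare desired topic definitions with the cluster's current topics.

    Both arguments map topic name -> settings dict (hypothetical layout).
    Topics present in the cluster but absent from the definitions are only
    reported, never scheduled for deletion.
    """
    to_create = sorted(set(desired) - set(existing))
    to_update = sorted(
        name
        for name in desired
        if name in existing and desired[name] != existing[name]
    )
    unmanaged = sorted(set(existing) - set(desired))
    return {"create": to_create, "update": to_update, "unmanaged": unmanaged}


if __name__ == "__main__":
    desired = {
        "events.web": {"partitions": 64, "replication": 3},
        "notifications": {"partitions": 16, "replication": 3},
    }
    existing = {
        "events.web": {"partitions": 32, "replication": 3},
        "legacy": {"partitions": 8, "replication": 2},
    }
    print(plan_topic_changes(desired, existing))
```

Running such a comparison on a schedule makes the job idempotent: an unchanged definition produces an empty plan.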
  28. Operations

  29. Broker replacement

  30. None
  31. None
  32. None
  33. None
  34. None
  35. None
  36. SSL certificate management

  37. SSL certificate management

  38. Conclusion
    – We twist the Kubernetes model to ensure dedicated resources
    – We take advantage of Kubernetes’ APIs to simplify configuration and operations
    – The kafka-kit tooling works well in Kubernetes
    – We gradually automate operations where possible
  39. Thank you! @brouberol
    We’re hiring! https://www.datadoghq.com/careers