Running Production Kafka Clusters in Kubernetes

Deploying stateful production applications such as Kafka in Kubernetes is often seen as ill-advised. The arguments are that it is easy to get wrong, requires learning new skills, is too risky for unclear gains, or that Kubernetes is simply too young a project. This does not have to be true, and we will explain why. When Datadog made the choice to migrate its entire infrastructure to Kubernetes, my team was tasked with deploying reliable, production-ready Kafka clusters.

This talk will go over our deployment strategy and lessons learned, describe the challenges we faced along the way, and present the reliability benefits we have observed.

This presentation will go through:
– an introduction to the tools and practices established by Datadog
– a brief introduction to Kubernetes and associated concepts
– a deep dive into the deployment and bootstrap strategy of a production-bearing Kafka cluster in Kubernetes
– a walkthrough of some routine operations in a Kubernetes-based Kafka cluster

Balthazar Rouberol

May 14, 2019

Transcript

  1. Agenda
     – Why are we even doing this?
     – Established Kafka tooling and practices at Datadog
     – Description of important Kubernetes resources
     – Deployment of Kafka in Kubernetes
     – Description of routine operations
  2. Background
     – New instance of Datadog in Europe
     – Completely independent and isolated from the existing system
     – Leave legacy behind and start fresh
     – Have every team use it
  3. Objectives
     – Dedicated resources
     – Local storage
     – Clusters running up to hundreds of instances
     – Rack awareness
     – Unsupervised (when possible) operations backed by kafka-kit*
     – Simplified configuration
     * https://github.com/datadog/kafka-kit
  4. Kafka-Kit: scaling operations
     – https://github.com/datadog/kafka-kit
     – topicmappr:
       ◦ partition-to-broker mapping
       ◦ failed broker replacement
       ◦ storage-based cluster rebalancing
  5. Kafka-Kit: scaling operations
     – https://github.com/datadog/kafka-kit
     – topicmappr:
       ◦ partition-to-broker mapping
       ◦ failed broker replacement
       ◦ storage-based cluster rebalancing
     – autothrottle: replication auto-throttling
  6. Kafka-Kit: scaling operations
     – https://github.com/datadog/kafka-kit
     – topicmappr:
       ◦ partition-to-broker mapping
       ◦ failed broker replacement
       ◦ storage-based cluster rebalancing
     – autothrottle: replication auto-throttling
     – pluggable metrics backend
  7. Topic mapping
     “Map”: assignment of a set of topics to Kafka brokers
     map1: "events.*" => [1001,1002,1003,1004,1005,1006]
     map2: "check_runs|notifications" => [1007,1008,1009]
  8. Topic mapping
     “Map”: assignment of a set of topics to Kafka brokers
     map1: "events.*" => 6x i3.4xlarge
     map2: "check_runs|notifications" => 3x i3.8xlarge
  9. Data persistence and locality
     – Instance store drives
     – Data is persisted between pod restarts
     – Data replicated on new nodes
     – Rack awareness: query the zone of the current node via the k8s API at init (see the sketch below)
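As a rough illustration of that init-time zone lookup, the sketch below uses the official kubernetes Python client and assumes the node name is injected into the pod through the downward API as a NODE_NAME environment variable; the talk does not prescribe this exact implementation.

```python
import os

from kubernetes import client, config

# Init-step sketch: look up the zone label of the node hosting this pod so the
# broker can advertise it as broker.rack for Kafka rack awareness.
# NODE_NAME is assumed to be set via the downward API (spec.nodeName).
config.load_incluster_config()
node = client.CoreV1Api().read_node(os.environ["NODE_NAME"])
labels = node.metadata.labels
zone = labels.get("topology.kubernetes.io/zone") or labels.get(
    "failure-domain.beta.kubernetes.io/zone"  # older label name
)
print(f"broker.rack={zone}")
```
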
  10. Pod health and readiness
      ZooKeeper:
      – Liveness: port 2181 open?
      – Readiness: leader/follower?
      Kafka:
      – Liveness: port 9092 open? (see the probe sketch below)
      – Readiness: broker 100% in-sync?
      – Break the glass: forceable readiness (when two incidents coincide)
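The "port open?" liveness checks can be expressed as a plain TCP connect run from an exec probe. Below is a minimal, hypothetical Python sketch of such a check; the readiness side (broker 100% in-sync) additionally requires broker metrics and is not shown here.

```python
import socket
import sys

# Liveness-style check: exit 0 iff the given TCP port accepts connections.
# The default port (9092 for Kafka; 2181 would be used for ZooKeeper) and the
# localhost target are illustrative defaults, not taken from the talk.
def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    port = int(sys.argv[1]) if len(sys.argv) > 1 else 9092
    sys.exit(0 if port_open("localhost", port) else 1)
```
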
  11. Topic management
      – Topic definitions in a ConfigMap
      – Regularly applied via a CronJob (see the sketch below)
      – Broker ids/map resolved by looking up the k8s API
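A minimal sketch of what such a CronJob could run is shown below, assuming the topic definitions are mounted from the ConfigMap as a JSON file and that the kafka-python admin client is used; the file path, spec format, and bootstrap address are illustrative, not the actual Datadog setup.

```python
import json

from kafka.admin import KafkaAdminClient, NewTopic

# Hypothetical path where the ConfigMap is mounted into the CronJob pod.
SPEC_PATH = "/etc/kafka-topics/topics.json"

def main() -> None:
    # Illustrative spec format:
    # [{"name": "events", "partitions": 64, "replication": 3}, ...]
    with open(SPEC_PATH) as f:
        wanted = json.load(f)

    admin = KafkaAdminClient(bootstrap_servers="kafka:9092")
    existing = set(admin.list_topics())

    # Create only the topics that do not exist yet.
    missing = [
        NewTopic(
            name=t["name"],
            num_partitions=t["partitions"],
            replication_factor=t["replication"],
        )
        for t in wanted
        if t["name"] not in existing
    ]
    if missing:
        admin.create_topics(missing)

if __name__ == "__main__":
    main()
```
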
  12. Conclusion
      – We twist the Kubernetes model to ensure dedicated resources
      – We take advantage of Kubernetes’ APIs to simplify configuration and operations
      – The kafka-kit tooling works well in Kubernetes
      – We gradually automate operations where possible