Running Production Kafka Clusters in Kubernetes

Deploying stateful production applications such as Kafka in Kubernetes is often seen as ill-advised. The usual arguments are that it is easy to get wrong, requires learning new skills, is too risky for unclear gains, or that Kubernetes is simply too young a project. This does not have to be true, and we will explain why. When Datadog chose to migrate its entire infrastructure to Kubernetes, my team was tasked with deploying reliable, production-ready Kafka clusters.

This talk will go over our deployment strategy, describe the challenges we faced along the way, and cover the lessons we learned as well as the reliability benefits we have observed.

This presentation will go through:
– an introduction to the tools and practices established by Datadog
– a brief introduction of Kubernetes and associated concepts
– a deep dive into the deployment and bootstrap strategy of a production-bearing Kafka cluster in Kubernetes
– a walkthrough of some routine operations in a Kubernetes-based Kafka cluster

Balthazar Rouberol

May 14, 2019

Transcript

  1. Running Production
    Kafka Clusters in
    Kubernetes
    Balthazar Rouberol - Datadog
    London Kafka Summit - May 2019

  2. – Why are we even doing this?
    – Established Kafka tooling and practices at Datadog
    – Description of important Kubernetes resources
    – Deployment of Kafka in Kubernetes
    – Description of routine operations
    Agenda

  3. Who am I?
    @brouberol
    Data Reliability Engineer @ Datadog

  4. – New instance of Datadog in Europe
    – Completely independent and isolated from the existing system
    – Leave legacy behind and start fresh
    – Have every team use it
    Background

  5. – Dedicated resources
    – Local storage
    – Clusters running up to hundreds of instances
    – Rack awareness
    – Unsupervised (when possible) operations backed by kafka-kit *
    – Simplified configuration
    * https://github.com/datadog/kafka-kit
    Objectives

  6. Tooling

  7. – https://github.com/datadog/kafka-kit
    – topicmappr:
    ○partition to broker mapping
    ○failed broker replacement
    ○storage-based cluster rebalancing
    Kafka-Kit: scaling operations

  8. (image-only slide)

  9. – https://github.com/datadog/kafka-kit
    – topicmappr:
    ○partition to broker mapping
    ○failed broker replacement
    ○storage-based cluster rebalancing
    – autothrottle: replication auto-throttling
    Kafka-Kit: scaling operations

  10. (image-only slide)

  11. – https://github.com/datadog/kafka-kit
    – topicmappr:
    ○partition to broker mapping
    ○failed broker replacement
    ○storage-based cluster rebalancing
    – autothrottle: replication auto-throttling
    – pluggable metrics backend
    Kafka-Kit: scaling operations

  12. “Map”: assignment of a set of topics to Kafka brokers

    map1: "events.*" => [1001,1002,1003,1004,1005,1006]


    map2: "check_runs|notifications" => [1007,1008,1009]
    Topic mapping

  13. “Map”: assignment of a set of topics to Kafka brokers

    map1: "events.*" => 6x i3.4xlarge


    map2: "check_runs|notifications" => 3x i3.8xlarge
    Topic mapping
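The "map" concept on these two slides can be sketched in a few lines of Python. This is an illustrative stand-in, not kafka-kit's actual implementation: each map pairs a topic-name pattern with the broker IDs (and, equivalently, the instance group) that host the matching topics. The map names, patterns, and broker IDs are the ones shown on the slides.

```python
import re

# Each map ties a topic-name regex to the broker IDs hosting those topics
# (map1 runs on 6x i3.4xlarge, map2 on 3x i3.8xlarge, per the slides).
MAPS = {
    "map1": (re.compile(r"events.*"), [1001, 1002, 1003, 1004, 1005, 1006]),
    "map2": (re.compile(r"check_runs|notifications"), [1007, 1008, 1009]),
}

def brokers_for_topic(topic: str) -> list[int]:
    """Return the broker IDs whose map pattern fully matches the topic name."""
    for pattern, brokers in MAPS.values():
        if pattern.fullmatch(topic):
            return brokers
    raise LookupError(f"no map covers topic {topic!r}")
```

Resolving a topic then picks its dedicated instance group, e.g. `brokers_for_topic("notifications")` returns map2's brokers.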

  14. k8s concepts

  15. NodeGroup

  16. StatefulSet

  17. Persistent Volume (Claim)

  18. Cluster deployment in k8s

  19. One broker pod per node (nodeAffinity & podAntiAffinity)
    Broker deployment
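The scheduling constraint on this slide can be expressed with a pod-spec `affinity` stanza along these lines. This is an illustrative fragment, not the actual production manifest; the `nodegroup` label and its value are hypothetical, while the affinity field names and the `kubernetes.io/hostname` topology key are standard Kubernetes API:

```yaml
# Illustrative: require the map's node group, and forbid two broker pods
# from landing on the same node.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nodegroup          # hypothetical label carrying the map name
          operator: In
          values: ["kafka-events"]
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: kafka
      topologyKey: kubernetes.io/hostname
```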

  20. One NodeGroup/StatefulSet per map

  21. – Instance store drives
    – Data is persisted between pod restarts
    – Data replicated on new nodes
    – Rack-awareness: query zone of current node via k8s API at init
    Data persistence and locality
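The rack-awareness bullet can be sketched as follows. This is a simplified stand-in for the actual init logic, which queries the Kubernetes API for the current node's labels; the label keys below are the standard well-known zone labels, and the function merely extracts the zone to use as Kafka's `broker.rack`:

```python
# Standard well-known zone labels (current key plus its legacy equivalent).
ZONE_LABELS = (
    "topology.kubernetes.io/zone",
    "failure-domain.beta.kubernetes.io/zone",
)

def broker_rack(node_labels: dict[str, str]) -> str:
    """Map the node's zone label to Kafka's broker.rack setting."""
    for key in ZONE_LABELS:
        if key in node_labels:
            return node_labels[key]
    raise KeyError("node has no zone label")
```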

  22. (image-only slide)

  23. Broker identity
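One common way to derive a stable broker identity in a StatefulSet, sketched here as an assumption about the scheme rather than the talk's exact implementation: pods get stable ordinal names such as `kafka-events-3`, and a per-map base offset turns the ordinal into a cluster-unique `broker.id`. The base offsets below are illustrative, chosen to match the broker IDs on the earlier mapping slides:

```python
import re

# Hypothetical per-StatefulSet base offsets for broker IDs.
BASE_IDS = {"kafka-events": 1001, "kafka-checks": 1007}

def broker_id(pod_name: str) -> int:
    """Derive a cluster-unique broker.id from a StatefulSet pod name."""
    statefulset, ordinal = re.fullmatch(r"(.+)-(\d+)", pod_name).groups()
    return BASE_IDS[statefulset] + int(ordinal)
```

For instance, pod `kafka-events-3` would become broker 1004, and the identity survives pod rescheduling because StatefulSet names are stable.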

  24. (image-only slide)

  25. ZooKeeper:
    – Liveness: port 2181 open?
    – Readiness: leader/follower?
    Kafka:
    – Liveness: port 9092 open?
    – Readiness: broker 100% in-sync?
    – break-glass: readiness can be forced when two incidents coincide
    Pod health and readiness
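The Kafka readiness rule above ("100% in-sync") can be sketched like this. The partition metadata is a simplified stand-in for what would be fetched from the cluster, and `force_ready` models the break-glass override; none of this is the probe's literal implementation:

```python
def broker_is_ready(broker: int, partitions, force_ready: bool = False) -> bool:
    """Ready only if every partition the broker hosts lists it in the ISR."""
    if force_ready:  # break the glass: e.g. two simultaneous incidents
        return True
    return all(
        broker in p["isr"]
        for p in partitions
        if broker in p["replicas"]
    )
```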

  26. Safe rolling-restarts
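The safety property behind a rolling restart — never take down the next broker until the previous one is fully back in sync — can be sketched as a gating loop. `restart` and `is_ready` are hypothetical callables standing in for the kubelet and the readiness probe described on the previous slide; in a real StatefulSet this gating is what the `RollingUpdate` strategy plus a strict readiness probe gives you for free:

```python
import time

def rolling_restart(brokers, restart, is_ready, poll_seconds=1.0):
    """Restart brokers one at a time, waiting for readiness between each."""
    for broker in brokers:
        restart(broker)
        while not is_ready(broker):
            time.sleep(poll_seconds)
```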

  27. – Topic definition in a ConfigMap
    – Regularly applied via a CronJob
    – Broker ids/map resolved by looking up the k8s API
    Topic management
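The shape of that setup might look like the fragment below. This is illustrative only: the resource names, topic attributes, schedule, and applier image are all hypothetical, while the `ConfigMap`/`CronJob` kinds and fields are standard Kubernetes API:

```yaml
# Topic definitions live in a ConfigMap...
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-topics
data:
  topics.yaml: |
    - name: events.prod
      partitions: 64
      replication_factor: 3
---
# ...and a CronJob periodically applies them, resolving broker IDs and
# maps from the Kubernetes API.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kafka-topic-apply
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: apply-topics
            image: example/topic-applier   # hypothetical image
```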

  28. Operations

  29. Broker
    replacement

  30.–35. (image-only slides: broker replacement walkthrough)

  36. SSL certificate management

  37. SSL certificate management

  38. – We twist the Kubernetes model to ensure dedicated resources
    – We take advantage of Kubernetes’ APIs to simplify configuration and
    operations
    – The kafka-kit tooling works well in Kubernetes
    – We gradually automate operations where possible
    Conclusion

  39. Thank you!
    @brouberol


    We’re hiring!
    https://www.datadoghq.com/careers
