Kafka on Kubernetes: Keeping It Simple

Nikki Thean
October 01, 2019

I gave a presentation at Kafka Summit San Francisco 2019 that attempts to convince viewers that it is possible to run a stable Kafka cluster on Kubernetes without a complicated setup.

Bonus: I illustrated the slides by hand, and hid the Kafka logo in many of the pictures!

The talk description is below:

If you’ve ever thought that running Kafka on Kubernetes was a terrible idea, welcome to the club: that’s what we thought as well. When we migrated Etsy’s Kafka deployment from bare metal to GCP, we made a surprising discovery: running Kafka on Kubernetes was the best option for us — and it wasn’t half as complicated as we thought it had to be.

I’ll use the story of our cloud migration journey to frame a discussion of how a “simple” Kafka-on-k8s setup can work. You’ll walk away with an example of how to set up a stable production system that uses vanilla Kubernetes workloads, as well as some technical tips and tricks — and pitfalls to avoid — for your own cloud migration or first Kafka-on-k8s deployment.

Transcript

  1. KAFKA ON KUBERNETES:
    Keeping It Simple

  10. We’re not doing that… right?
    Definitely not!

  12. ONE YEAR LATER
    1. Production Kafka cluster on Kubernetes.
    2. Suggesting this idea to other people.

  17. RUNNING KAFKA ON KUBERNETES DOESN'T HAVE TO BE COMPLICATED.
    * "Complicated" = custom resource definitions, plugins, operators, etc.

  27. WHAT YOU’LL GET OUT OF THIS
    ➤ Example of real-life production setup
    ➤ Technical tips and tricks
    ➤ Advice for migrating production systems

  28. SYSTEMS MENTIONED
    ➤ Kafka
    ➤ Kubernetes
    ➤ Chef
    ➤ Terraform
    ➤ Helm
    ➤ Prometheus
    ➤ Google Cloud Platform

  31. WHAT WE BUILT
    and how it works

  40. WHAT WE CONSIDERED
    ➤ VMs + Chef
      ➤ Mutable infrastructure
    ➤ VMs + Machine/Docker Images
      ➤ Same amount of toil, new cloud infra problems
    ➤ Managed Instance Groups
      ➤ Not good for stateful workloads
    ➤ …Kubernetes?!

  43. CROSS-ZONE TRAFFIC IS $$$.

  48. RESOURCE MANAGEMENT
    ➤ Examples of separation:
      ➤ zones
      ➤ node pools - varying workloads, e.g. high CPU requirement
    ➤ Workload allocation controlled by:
      ➤ nodeSelectors
        ➤ pool: highmem
        ➤ failure-domain.beta.kubernetes.io/zone: us-central1-a
      ➤ taints + tolerations
        ➤ - key: pool
            operator: Equal
            value: highmem
            effect: NoSchedule
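
As a rough illustration of how these two mechanisms combine, here is a simplified Python model of the scheduling decision. It is a sketch, not the real Kubernetes scheduler: it only handles `nodeSelector` labels and `NoSchedule` taints, and the node and pod definitions (`highmem_node`, `default_node`, `kafka_pod`, `other_pod`) are hypothetical examples built from the label and taint values on the slide.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """True if one toleration matches one taint (Equal and Exists operators)."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False  # an empty effect on the toleration matches any effect
    key = toleration.get("key")
    if toleration.get("operator", "Equal") == "Exists":
        return not key or key == taint["key"]
    return key == taint["key"] and toleration.get("value") == taint.get("value")


def can_schedule(pod: dict, node: dict) -> bool:
    """Simplified model: every nodeSelector label must match the node, and
    every NoSchedule taint on the node must be tolerated by the pod."""
    labels = node.get("labels", {})
    if any(labels.get(k) != v for k, v in pod.get("nodeSelector", {}).items()):
        return False
    return all(
        any(tolerates(t, taint) for t in pod.get("tolerations", []))
        for taint in node.get("taints", [])
        if taint["effect"] == "NoSchedule"
    )


ZONE = "failure-domain.beta.kubernetes.io/zone"

# Hypothetical nodes: one in a tainted "highmem" pool, one in a default pool.
highmem_node = {
    "labels": {"pool": "highmem", ZONE: "us-central1-a"},
    "taints": [{"key": "pool", "value": "highmem", "effect": "NoSchedule"}],
}
default_node = {"labels": {"pool": "default", ZONE: "us-central1-a"}, "taints": []}

# Hypothetical pods: a broker pinned to the highmem pool, and an unrelated pod.
kafka_pod = {
    "nodeSelector": {"pool": "highmem", ZONE: "us-central1-a"},
    "tolerations": [
        {"key": "pool", "operator": "Equal", "value": "highmem", "effect": "NoSchedule"}
    ],
}
other_pod = {"nodeSelector": {}, "tolerations": []}
```

The selector pulls the broker onto the highmem pool, while the taint keeps pods without a matching toleration off it; in this model the two together give a dedicated pool.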

  56. SERVICE DISCOVERY
    ➤ Within the cluster: ClusterIP Service
      ➤ e.g. Kafka broker to Kafka broker, Kafka Connect worker to Kafka broker
    ➤ External services to Kafka:
      ➤ Bootstrapping: Cloud DNS to LoadBalancer
      ➤ Direct to broker: NodePort*
    ➤ Configuring your Kafka listeners
      ➤ https://rmoff.net/2018/08/02/kafka-listeners-explained/
    * In our specific case, ClusterIP due to VPC-native IP aliasing on GCP. Also possible: dedicated LoadBalancers.
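
To make the listener point concrete, the sketch below renders broker properties for one internal and one external listener. The property names are standard Kafka broker settings; the listener names (`INTERNAL`, `EXTERNAL`), ports, hostnames, and the `PLAINTEXT` protocol are assumptions chosen for illustration and are not taken from the talk.

```python
def listener_properties(internal_host: str, external_host: str,
                        internal_port: int = 9092,
                        external_port: int = 9093) -> dict:
    """Build listener settings for a broker reachable on two networks.

    Clients bootstrap through any address, then reconnect to whatever the
    broker advertises on the listener they came in on, so each advertised
    address must be resolvable from that client's side of the network.
    """
    return {
        "listeners": f"INTERNAL://0.0.0.0:{internal_port},"
                     f"EXTERNAL://0.0.0.0:{external_port}",
        "advertised.listeners": f"INTERNAL://{internal_host}:{internal_port},"
                                f"EXTERNAL://{external_host}:{external_port}",
        "listener.security.protocol.map": "INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT",
        "inter.broker.listener.name": "INTERNAL",
    }


def render(props: dict) -> str:
    """Render the settings in server.properties key=value form."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Hypothetical hostnames for a single broker.
props = listener_properties(
    internal_host="kafka-0.kafka.svc.cluster.local",
    external_host="kafka-0.example.com",
)
properties_text = render(props)
```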

  60. SOME NOTES ON MANAGEMENT
    ➤ Installation, deploy:
      ➤ Confluent Helm charts
        ➤ https://github.com/confluentinc/cp-helm-charts
        ➤ No Tiller, Helm templating only
      ➤ Confluent Docker images, with additions

  63. SOME NOTES ON MANAGEMENT
    ➤ Monitoring:
      ➤ Kafka exposes JMX metrics by default
      ➤ Prometheus JMX Exporter as Java agent (vs. Helm chart sidecar)
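
Running the exporter as a Java agent usually comes down to one extra JVM flag of the form `-javaagent:<jar>=<port>:<config>`. The sketch below builds that flag and wraps it in a container environment entry. The jar path, config path, and port are hypothetical, and passing the flag through `KAFKA_OPTS` is an assumption about how the image starts Kafka.

```python
def jmx_exporter_flag(jar_path: str, port: int, config_path: str) -> str:
    """Build the -javaagent flag that loads the Prometheus JMX Exporter
    inside the broker's own JVM and serves metrics over HTTP on `port`."""
    return f"-javaagent:{jar_path}={port}:{config_path}"


# Hypothetical paths and port; adjust to wherever the image stores the agent.
flag = jmx_exporter_flag(
    jar_path="/opt/jmx-exporter/jmx_prometheus_javaagent.jar",
    port=5556,
    config_path="/opt/jmx-exporter/kafka.yml",
)

# Container env entry in the shape used by a Kubernetes pod spec.
env_entry = {"name": "KAFKA_OPTS", "value": flag}
```

Compared with a sidecar, the agent reads metrics in-process, so no remote JMX port has to be opened between containers.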

  64. WHAT WE LEARNED
    and what we think you should know

  75. WHAT WE LEARNED
    ➤ Ephemeral resources
      ➤ Don’t assume static IPs; make sure Kafka clients can handle pod evictions
      ➤ Check your JVM DNS cache TTL (networkaddress.cache.ttl)
      ➤ Check the Apache Kafka JIRA and upgrade Kafka
        ➤ Producers and consumers too!
        ➤ Examples: KAFKA-7755, KAFKA-7890
        ➤ See also: cp-helm-charts issue #240
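
The DNS cache point is easier to see with a toy model. The Python sketch below is not JVM code; it imitates a resolver cache with a configurable TTL to show how a client that caches forever (what `networkaddress.cache.ttl=-1` means in the JVM) keeps using a broker's old pod IP after the pod is rescheduled. The hostname, IP addresses, and the `CachingResolver` class are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CachingResolver:
    """Caches lookups for ttl_seconds; ttl_seconds=None means cache forever."""

    def __init__(self, lookup: Callable[[str], str],
                 ttl_seconds: Optional[float],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, host: str) -> str:
        now = self._clock()
        cached = self._cache.get(host)
        if cached is not None:
            ip, stored_at = cached
            if self._ttl is None or now - stored_at < self._ttl:
                return ip  # still considered fresh
        ip = self._lookup(host)  # cache miss or expired entry
        self._cache[host] = (ip, now)
        return ip


# Fake DNS records and a fake clock so the example is deterministic.
HOST = "kafka-0.example.internal"
records = {HOST: "10.0.0.5"}
now = [0.0]

cache_forever = CachingResolver(records.__getitem__, None, clock=lambda: now[0])
cache_30s = CachingResolver(records.__getitem__, 30, clock=lambda: now[0])
cache_forever.resolve(HOST)
cache_30s.resolve(HOST)

records[HOST] = "10.0.0.9"  # the broker pod is evicted and comes back elsewhere
now[0] = 60.0               # one minute later

stale_ip = cache_forever.resolve(HOST)
fresh_ip = cache_30s.resolve(HOST)
```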

  79. WHAT WE LEARNED
    ➤ Different kinds of updates/rolling restarts
      ➤ Changes to Kafka cluster: versions, broker properties
      ➤ Workload configuration: resources, security policies
      ➤ Upgrading Kubernetes nodes

  80. Health checks are important for self-healing clusters!

  85. CONTAINER PROBES
    ➤ Liveness Probe: “Should I restart this container?”
    ➤ Readiness Probe: “Should this container accept traffic?”

  87. https://github.com/andreas-schroeder/kafka-health-check

  89. CONTAINER PROBES
    ➤ Liveness Probe: “Should I restart this container?”
      ➤ Endpoint: is this broker healthy?
      ➤ Don’t be too strict!
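
One way to read "don't be too strict" is sketched below: the liveness check fails only when the broker is unreachable, and a restart is triggered only after several consecutive failures, mirroring the `failureThreshold` field on Kubernetes probes. The `BrokerStatus` values are hypothetical and do not represent the kafka-health-check tool's actual output.

```python
from enum import Enum
from typing import Sequence


class BrokerStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"        # up, but e.g. some replicas are out of sync
    UNREACHABLE = "unreachable"  # the broker does not respond at all


def liveness_ok(status: BrokerStatus) -> bool:
    """Lenient liveness: a degraded broker is still alive. Restarting it
    would only delay the catch-up it is already doing."""
    return status is not BrokerStatus.UNREACHABLE


def should_restart(history: Sequence[BrokerStatus],
                   failure_threshold: int = 3) -> bool:
    """Restart only after `failure_threshold` consecutive liveness failures,
    counted from the most recent probe result backwards."""
    if len(history) < failure_threshold:
        return False
    return all(not liveness_ok(s) for s in history[-failure_threshold:])


H, D, U = BrokerStatus.HEALTHY, BrokerStatus.DEGRADED, BrokerStatus.UNREACHABLE
degraded_broker = should_restart([D, D, D, D])  # False: degraded is not dead
dead_broker = should_restart([H, U, U, U])      # True: three failures in a row
flapping_broker = should_restart([U, U, H])     # False: the streak was broken
```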

  96. WHAT WE HAD TO FIGURE OUT
    ➤ Different kinds of updates/rolling restarts
      ➤ Changes to Kafka cluster: versions, broker properties
      ➤ Workload configuration: resources, security policies
      ➤ Upgrading Kubernetes nodes
    ➤ How health checks affect updates
      ➤ podDisruptionBudget, podManagementPolicy
    ➤ Health check overrides
      ➤ What happens if you deploy a change that breaks the health checks?
      ➤ See Kubernetes issue #62750
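
The interaction between health checks and node upgrades can be sketched with the arithmetic behind a PodDisruptionBudget that uses `maxUnavailable`. This is a simplified model of the calculation, not Kubernetes code; the three-broker cluster and the budget of one are assumed values.

```python
def disruptions_allowed(expected_pods: int, ready_pods: int,
                        max_unavailable: int) -> int:
    """How many more pods may be voluntarily evicted right now.

    Only pods passing their readiness probe count as healthy, so a broker
    that is still recovering uses up the budget until it is ready again.
    """
    desired_healthy = expected_pods - max_unavailable
    return max(0, ready_pods - desired_healthy)


# Assumed setup: 3 brokers, at most 1 unavailable at a time.
all_ready = disruptions_allowed(expected_pods=3, ready_pods=3, max_unavailable=1)
one_recovering = disruptions_allowed(expected_pods=3, ready_pods=2, max_unavailable=1)
```

With all brokers ready, a node drain may evict one broker; while that broker is still not ready, the next eviction is refused. This is also why a change that breaks the readiness check can stall a node upgrade.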

  108. MIGRATING PRODUCTION ARCHITECTURE
    ➤ Get your hands dirty!
      ➤ Simulate common maintenance tasks
      ➤ Benchmark for performance
        ➤ kafka-producer-perf-test, kafka-consumer-perf-test
        ➤ Variables: disk type, CPU count, producer record size, producer batch size, Java opts…
        ➤ We were able to use Compute Engine persistent disks (shared storage) rather than local SSDs
      ➤ Simulate failure
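
A benchmark sweep over two of those variables might be generated as below. This sketch only builds `kafka-producer-perf-test` command lines for each combination of record size and batch size; it does not run them. The bootstrap address, topic name, and value ranges are hypothetical, and the flags should be checked against `--help` for the Kafka version in use.

```python
import itertools
import shlex
from typing import Iterable, List


def producer_perf_commands(bootstrap: str, topic: str,
                           record_sizes: Iterable[int],
                           batch_sizes: Iterable[int],
                           num_records: int = 1_000_000) -> List[str]:
    """Return one perf-test command line per (record size, batch size) pair."""
    commands = []
    for record_size, batch_size in itertools.product(record_sizes, batch_sizes):
        args = [
            "kafka-producer-perf-test",
            "--topic", topic,
            "--num-records", str(num_records),
            "--record-size", str(record_size),  # bytes per record
            "--throughput", "-1",               # -1 disables throttling
            "--producer-props",
            f"bootstrap.servers={bootstrap}",
            f"batch.size={batch_size}",
        ]
        commands.append(shlex.join(args))
    return commands


# Hypothetical sweep: 2 record sizes x 2 batch sizes = 4 runs.
commands = producer_perf_commands(
    bootstrap="kafka:9092",
    topic="perf-test",
    record_sizes=[100, 1000],
    batch_sizes=[16384, 65536],
)
```

Variables such as disk type and CPU count live outside the command, so each sweep would be repeated once per infrastructure variant.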

  109. Try to keep it simple!

  110. WHY IT WORKS FOR US
    (for now, at least!)

  115. WHY IT WORKS FOR US
    ➤ Increased automation
    ➤ Simpler configuration
    ➤ Efficient resource usage
      ➤ Bin packing
      ➤ GKE autoscaling
    ➤ Improved developer workflows for streaming services
      ➤ e.g. adding new Kafka Streams applications, Kafka Connect workloads

  117. THANK YOU!
    Twitter: @NikkiThean
    Confluent Slack: @nikki
    Email: [email protected]
    Thank you to Kamo for drawing inspiration!