Kafka on Kubernetes: Keeping It Simple

We’re not doing that… right? Definitely not!

ONE YEAR LATER
1. Production Kafka cluster on Kubernetes.
2. Suggesting this idea to other people.

RUNNING KAFKA ON KUBERNETES DOESN'T HAVE TO BE COMPLICATED.*
* "Complicated" = custom resource definitions, plugins, operators, etc.

WHAT YOU’LL GET OUT OF THIS
➤ Example of real-life production setup
➤ Technical tips and tricks
➤ Advice for migrating production systems

SYSTEMS MENTIONED
➤ Kafka
➤ Kubernetes
➤ Chef
➤ Terraform
➤ Helm
➤ Prometheus
➤ Google Cloud Platform

WHAT WE BUILT and how it works

WHAT WE CONSIDERED
➤ VMs + Chef
  ➤ Mutable infrastructure
➤ VMs + Machine/Docker Images
  ➤ Same amount of toil, new cloud infra problems
➤ Managed Instance Groups
  ➤ Not good for stateful workloads
➤ …Kubernetes?!

CROSS-ZONE TRAFFIC IS $$$.

RESOURCE MANAGEMENT
➤ Examples of separation:
  ➤ zones
  ➤ node pools - varying workloads, e.g. high CPU requirement
➤ Workload allocation controlled by:
  ➤ nodeSelectors
      pool: highmem
      failure-domain.beta.kubernetes.io/zone: us-central1-a
  ➤ taints + tolerations
      - key: pool
        operator: Equal
        value: highmem
        effect: NoSchedule
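The nodeSelector and toleration values from this slide might sit together in a pod spec roughly as follows. This is a minimal sketch, not the speaker's actual manifest: the pod name, container name, and image are hypothetical, and it assumes the highmem node pool carries the label `pool: highmem` and the taint `pool=highmem:NoSchedule`.

```yaml
# Sketch only: metadata.name, the container name, and the image are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: kafka-0                      # hypothetical
spec:
  # Only schedule onto nodes carrying both labels.
  nodeSelector:
    pool: highmem
    failure-domain.beta.kubernetes.io/zone: us-central1-a
  # Allow scheduling onto nodes tainted pool=highmem:NoSchedule,
  # which keeps other workloads off that pool.
  tolerations:
    - key: pool
      operator: Equal
      value: highmem
      effect: NoSchedule
  containers:
    - name: kafka                    # hypothetical
      image: confluentinc/cp-kafka   # pin a version in practice
```

The selector pulls the pod toward the pool, while the taint pushes everything else away from it; using both is what reserves the nodes. Newer Kubernetes releases use `topology.kubernetes.io/zone` in place of the beta zone label shown on the slide.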

SERVICE DISCOVERY
➤ Within the cluster: ClusterIP Service
  ➤ e.g. Kafka broker to Kafka broker, Kafka Connect worker to Kafka broker
➤ External services to Kafka:
  ➤ Bootstrapping: Cloud DNS to LoadBalancer
  ➤ Direct to broker: NodePort*
➤ Configuring your Kafka listeners
  ➤ https://rmoff.net/2018/08/02/kafka-listeners-explained/
* In our specific case, ClusterIP due to VPC-native IP aliasing on GCP. Also possible: dedicated LoadBalancers.
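Separate internal and external listeners are one common way to serve both in-cluster and external clients. The container `env` fragment below is a sketch assuming the Confluent `cp-kafka` image, which maps `KAFKA_*` environment variables onto broker properties. The listener names, ports, and hostnames are all hypothetical and are not taken from the talk.

```yaml
# Sketch only: listener names, ports, and DNS names are illustrative assumptions.
env:
  # Map each named listener to a security protocol.
  - name: KAFKA_LISTENER_SECURITY_PROTOCOL_MAP
    value: "INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT"
  # Sockets the broker binds to inside the pod.
  - name: KAFKA_LISTENERS
    value: "INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093"
  # Addresses handed back to clients after bootstrapping;
  # each must be reachable from where that client runs.
  - name: KAFKA_ADVERTISED_LISTENERS
    value: "INTERNAL://kafka-0.kafka.default.svc.cluster.local:9092,EXTERNAL://broker-0.kafka.example.com:9093"
  # Brokers talk to each other over the internal listener.
  - name: KAFKA_INTER_BROKER_LISTENER_NAME
    value: "INTERNAL"
```

The advertised address is the part that tends to go wrong: a client bootstraps through the load balancer, then connects directly to whatever address the broker advertises for that listener.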

SOME NOTES ON MANAGEMENT
➤ Installation, deploy:
  ➤ Confluent Helm charts
    ➤ https://github.com/confluentinc/cp-helm-charts
  ➤ No Tiller, Helm templating only
  ➤ Confluent Docker images, with additions

SOME NOTES ON MANAGEMENT
➤ Monitoring:
  ➤ Kafka exposes JMX metrics by default
  ➤ Prometheus JMX Exporter as Java agent (vs. Helm chart sidecar)
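Running the Prometheus JMX Exporter as a Java agent means loading it into the broker JVM with a `-javaagent` flag instead of running a second container. A sketch of the container fragment, assuming the agent jar and its config file have been added to the image at the hypothetical paths shown, and that port 5556 is free:

```yaml
# Sketch only: jar path, config path, and port are illustrative assumptions.
env:
  # Agent syntax: -javaagent:<jar>=<port>:<exporter config file>
  - name: KAFKA_OPTS
    value: "-javaagent:/opt/jmx-exporter/jmx_prometheus_javaagent.jar=5556:/etc/jmx-exporter/kafka.yml"
ports:
  # Port Prometheus scrapes for metrics.
  - name: metrics
    containerPort: 5556
```

One thing to watch with this approach: other Kafka command-line tools started inside the same container may inherit `KAFKA_OPTS` and try to bind the same port.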

WHAT WE LEARNED and what we think you should know

WHAT WE LEARNED
➤ Ephemeral resources
  ➤ Don’t assume static IPs; make sure Kafka clients can handle pod evictions
  ➤ Check your JVM DNS cache TTL (networkaddress.cache.ttl)
  ➤ Check the Apache Kafka JIRA, and upgrade Kafka
    ➤ Producers and consumers too!
    ➤ Examples: KAFKA-7755, KAFKA-7890
    ➤ See also: cp-helm-charts issue #240
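`networkaddress.cache.ttl` is a Java security property, so it is normally set in the JVM's `java.security` file or programmatically. When that property is left unset, the JVM falls back to the `sun.net.inetaddr.ttl` system property, which can be passed through the environment. The fragment below is a sketch for a client workload; the 30-second value is an illustrative assumption, not a recommendation from the talk.

```yaml
# Sketch only: the TTL value is illustrative.
env:
  # JAVA_TOOL_OPTIONS is picked up by the JVM at startup.
  # sun.net.inetaddr.ttl is the fallback used when the
  # networkaddress.cache.ttl security property is not set.
  - name: JAVA_TOOL_OPTIONS
    value: "-Dsun.net.inetaddr.ttl=30"
```

A short TTL lets clients re-resolve a broker's hostname after its pod has been rescheduled onto a new IP.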

WHAT WE LEARNED
➤ Different kinds of updates/rolling restarts
  ➤ Changes to Kafka cluster: versions, broker properties
  ➤ Workload configuration: resources, security policies
  ➤ Upgrading Kubernetes nodes

Health checks are important for self-healing clusters!

CONTAINER PROBES
➤ Liveness Probe: “Should I restart this container?”
➤ Readiness Probe: “Should this container accept traffic?”
https://github.com/andreas-schroeder/kafka-health-check
➤ Endpoint: is this broker healthy?
➤ Don’t be too strict!
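A lenient liveness probe paired with a stricter readiness probe is one way to follow the "don't be too strict" advice. The fragment below is a sketch that assumes a health-check process serving HTTP at `/` on port 8000 alongside the broker; the port, path, and all timing values are assumptions, so check the health-check project's README for its actual defaults.

```yaml
# Sketch only: ports, path, and timings are illustrative assumptions.
livenessProbe:
  # Lenient: restart only if the broker port stops accepting connections.
  tcpSocket:
    port: 9092
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 6
readinessProbe:
  # Stricter: take the pod out of Service endpoints while the
  # health-check endpoint reports the broker as unhealthy.
  httpGet:
    path: /
    port: 8000
  periodSeconds: 10
  failureThreshold: 3
```

Keeping the liveness check loose avoids restart loops while a broker is slowly recovering or catching up on replication.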

WHAT WE HAD TO FIGURE OUT
➤ Different kinds of updates/rolling restarts
  ➤ Changes to Kafka cluster: versions, broker properties
  ➤ Workload configuration: resources, security policies
  ➤ Upgrading Kubernetes nodes
➤ How health checks affect updates
  ➤ podDisruptionBudget, podManagementPolicy
  ➤ Health check overrides
    ➤ What happens if you deploy a change that breaks the health checks?
    ➤ See Kubernetes issue #62750
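The two settings named on this slide live in different objects: a PodDisruptionBudget limits voluntary evictions (for example during node upgrades), while `podManagementPolicy` is a StatefulSet field. A minimal sketch, with hypothetical names, labels, and replica count; clusters from the era of this talk would have used `policy/v1beta1` for the budget.

```yaml
# Sketch only: names, labels, and replica count are illustrative.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  maxUnavailable: 1          # evict at most one broker at a time
  selector:
    matchLabels:
      app: kafka
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka
  replicas: 3
  podManagementPolicy: OrderedReady   # the default; the alternative is Parallel
  updateStrategy:
    type: RollingUpdate      # waits for each updated pod to be Ready
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka   # pin a version in practice
```

Because a rolling update waits on readiness, a change that breaks the readiness probe can stall the rollout at the first updated pod, which is why the probes and the update settings need to be designed together.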

MIGRATING PRODUCTION ARCHITECTURE
➤ Get your hands dirty!
➤ Simulate common maintenance tasks
➤ Benchmark for performance
  ➤ kafka-producer-perf-test, kafka-consumer-perf-test
  ➤ Variables: disk type, CPU count, producer record size, producer batch size, Java opts…
  ➤ We were able to use Compute Engine persistent disks (shared storage) rather than local SSDs
➤ Simulate failure
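One way to run the producer benchmark from inside the cluster is a one-off Kubernetes Job. This is a sketch under assumptions: the Job name, topic, bootstrap address, and all the numbers are hypothetical, and the image is assumed to ship `kafka-producer-perf-test` on its path, as the Confluent images do.

```yaml
# Sketch only: names, topic, address, and sizes are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: kafka-producer-perf
spec:
  backoffLimit: 0            # do not retry a failed benchmark run
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: perf
          image: confluentinc/cp-kafka   # pin a version in practice
          command:
            - kafka-producer-perf-test
            - --topic
            - perf-test
            - --num-records
            - "1000000"
            - --record-size    # bytes per record: one of the variables to sweep
            - "1024"
            - --throughput     # -1 disables throttling
            - "-1"
            - --producer-props
            - bootstrap.servers=kafka:9092
            - batch.size=16384 # producer batch size: another variable to sweep
```

Rerunning the same Job while changing one variable at a time (disk type, CPU count, record size, batch size) keeps the comparisons meaningful.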

Try to keep it simple!

WHY IT WORKS FOR US (for now, at least!)

WHY IT WORKS FOR US
➤ Increased automation
➤ Simpler configuration
➤ Efficient resource usage
  ➤ Bin packing
  ➤ GKE autoscaling
➤ Improved developer workflows for streaming services
  ➤ e.g. adding new Kafka Streams applications, Kafka Connect workloads

THANK YOU!
Twitter: @NikkiThean
Confluent Slack: @nikki
Email: [email protected]
Thank you to Kamo for drawing inspiration!
