Slide 1

Slide 1 text

KAFKA ON KUBERNETES: Keeping It Simple

Slide 8

Slide 8 text

We’re not doing that…
 right?

Slide 10

Slide 10 text

We’re not doing that…
 right? Definitely not!

Slide 11

Slide 11 text

ONE YEAR LATER

Slide 12

Slide 12 text

ONE YEAR LATER
1. Production Kafka cluster on Kubernetes.
2. Suggesting this idea to other people.

Slide 14

Slide 14 text

We’re not doing that…
 right? Definitely not!

Slide 15

Slide 15 text

RUNNING KAFKA ON KUBERNETES DOESN'T HAVE TO BE COMPLICATED.

Slide 16

Slide 16 text

RUNNING KAFKA ON KUBERNETES DOESN'T HAVE TO BE COMPLICATED. * "Complicated"

Slide 17

Slide 17 text

RUNNING KAFKA ON KUBERNETES DOESN'T HAVE TO BE COMPLICATED.
* "Complicated" = custom resource definitions, plugins, operators, etc.

Slide 24

Slide 24 text

WHAT YOU’LL GET OUT OF THIS

Slide 25

Slide 25 text

WHAT YOU’LL GET OUT OF THIS ➤ Example of real-life production setup

Slide 26

Slide 26 text

WHAT YOU’LL GET OUT OF THIS ➤ Example of real-life production setup ➤ Technical tips and tricks

Slide 27

Slide 27 text

WHAT YOU’LL GET OUT OF THIS ➤ Example of real-life production setup ➤ Technical tips and tricks ➤ Advice for migrating production systems

Slide 28

Slide 28 text

SYSTEMS MENTIONED ➤ Kafka ➤ Kubernetes ➤ Chef ➤ Terraform ➤ Helm ➤ Prometheus ➤ Google Cloud Platform

Slide 31

Slide 31 text

WHAT WE BUILT and how it works

Slide 33

Slide 33 text

WHAT WE CONSIDERED ➤ VMs + Chef

Slide 34

Slide 34 text

WHAT WE CONSIDERED ➤ VMs + Chef ➤ Mutable infrastructure

Slide 35

Slide 35 text

WHAT WE CONSIDERED ➤ VMs + Chef ➤ Mutable infrastructure ➤ VMs + Machine/Docker Images

Slide 36

Slide 36 text

WHAT WE CONSIDERED ➤ VMs + Chef ➤ Mutable infrastructure ➤ VMs + Machine/Docker Images ➤ Same amount of toil, new cloud infra problems

Slide 37

Slide 37 text

WHAT WE CONSIDERED ➤ VMs + Chef ➤ Mutable infrastructure ➤ VMs + Machine/Docker Images ➤ Same amount of toil, new cloud infra problems ➤ Managed Instance Groups

Slide 38

Slide 38 text

WHAT WE CONSIDERED ➤ VMs + Chef ➤ Mutable infrastructure ➤ VMs + Machine/Docker Images ➤ Same amount of toil, new cloud infra problems ➤ Managed Instance Groups ➤ Not good for stateful workloads

Slide 39

Slide 39 text

WHAT WE CONSIDERED
➤ VMs + Chef
  ➤ Mutable infrastructure
➤ VMs + Machine/Docker Images
  ➤ Same amount of toil, new cloud infra problems
➤ Managed Instance Groups
  ➤ Not good for stateful workloads
➤ …Kubernetes?!

Slide 43

Slide 43 text

CROSS-ZONE TRAFFIC IS $$$.

Slide 44

Slide 44 text

RESOURCE MANAGEMENT ➤ Examples of separation: ➤ zones

Slide 45

Slide 45 text

RESOURCE MANAGEMENT ➤ Examples of separation: ➤ zones ➤ node type

Slide 46

Slide 46 text

RESOURCE MANAGEMENT ➤ Examples of separation: ➤ zones ➤ node type - varying workloads, e.g. high CPU requirement

Slide 47

Slide 47 text

RESOURCE MANAGEMENT ➤ Examples of separation: ➤ zones ➤ node pools - varying workloads, e.g. high CPU requirement ➤ Workload allocation controlled by: ➤ nodeSelectors ➤ pool: highmem ➤ failure-domain.beta.kubernetes.io/zone: us-central1-a

Slide 48

Slide 48 text

RESOURCE MANAGEMENT
➤ Examples of separation:
  ➤ zones
  ➤ node pools - varying workloads, e.g. high CPU requirement
➤ Workload allocation controlled by:
  ➤ nodeSelectors
    ➤ pool: highmem
    ➤ failure-domain.beta.kubernetes.io/zone: us-central1-a
  ➤ taints + tolerations
    ➤ - key: pool
        operator: Equal
        value: highmem
        effect: NoSchedule
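
As a rough sketch of how the two mechanisms combine, the pod template fragment below pins a broker to a dedicated pool and zone and tolerates that pool's taint. The `pool: highmem` label, the zone, and the toleration come from the slide. The assumption is that the nodes carry a matching label and taint (for example, applied with `kubectl taint nodes <node> pool=highmem:NoSchedule`). On newer Kubernetes versions the zone label is `topology.kubernetes.io/zone`; the `failure-domain.beta` form is the older, deprecated name.

```yaml
# Fragment of a pod template (e.g. under spec.template.spec of a StatefulSet).
# Assumes nodes are labeled pool=highmem and tainted pool=highmem:NoSchedule.
nodeSelector:
  pool: highmem
  failure-domain.beta.kubernetes.io/zone: us-central1-a
tolerations:
  - key: pool
    operator: Equal
    value: highmem
    effect: NoSchedule
```

The taint keeps other workloads off the pool, while the nodeSelector keeps the brokers on it. A toleration on its own only permits scheduling onto tainted nodes; it does not require it.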

Slide 49

Slide 49 text

SERVICE DISCOVERY ➤ Within the cluster: ClusterIP Service ➤ e.g. Kafka broker to Kafka broker, Kafka Connect worker to Kafka broker

Slide 51

Slide 51 text

SERVICE DISCOVERY ➤ Within the cluster: ClusterIP Service ➤ e.g. Kafka broker to Kafka broker, Kafka Connect worker to Kafka broker ➤ External services to Kafka: ➤ Bootstrapping: Cloud DNS to LoadBalancer

Slide 53

Slide 53 text

SERVICE DISCOVERY ➤ Within the cluster: ClusterIP Service ➤ e.g. Kafka broker to Kafka broker, Kafka Connect worker to Kafka broker ➤ External services to Kafka: ➤ Bootstrapping: Cloud DNS to LoadBalancer ➤ Direct to broker: NodePort* * In our specific case, ClusterIP due to VPC-native IP aliasing on GCP. Also possible: dedicated LoadBalancers.

Slide 54

Slide 54 text

SERVICE DISCOVERY ➤ Within the cluster: ClusterIP Service ➤ e.g. Kafka broker to Kafka broker, Kafka Connect worker to Kafka broker ➤ External services to Kafka: ➤ Bootstrapping: Cloud DNS to LoadBalancer ➤ Direct to broker: NodePort* ➤ Configuring your Kafka listeners * In our specific case, ClusterIP due to VPC-native IP aliasing on GCP. Also possible: dedicated LoadBalancers.

Slide 56

Slide 56 text

SERVICE DISCOVERY
➤ Within the cluster: ClusterIP Service
  ➤ e.g. Kafka broker to Kafka broker, Kafka Connect worker to Kafka broker
➤ External services to Kafka:
  ➤ Bootstrapping: Cloud DNS to LoadBalancer
  ➤ Direct to broker: NodePort*
➤ Configuring your Kafka listeners
  ➤ https://rmoff.net/2018/08/02/kafka-listeners-explained/
* In our specific case, ClusterIP due to VPC-native IP aliasing on GCP. Also possible: dedicated LoadBalancers.
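
A client first connects to a bootstrap address and is then told each broker's advertised address, so every listener has to advertise something reachable from where that client sits. The fragment below is a minimal sketch of a two-listener setup for a single broker. It relies on the Confluent images' convention of mapping `KAFKA_`-prefixed environment variables to broker properties. All hostnames, ports, and listener names are hypothetical, and PLAINTEXT is used only to keep the example short.

```yaml
# Container env fragment for one broker. Hostnames, ports, and listener
# names are hypothetical; each broker needs its own advertised addresses.
env:
  - name: KAFKA_LISTENERS
    value: "INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093"
  - name: KAFKA_ADVERTISED_LISTENERS
    value: "INTERNAL://kafka-0.kafka.default.svc.cluster.local:9092,EXTERNAL://kafka-0.example.com:9093"
  - name: KAFKA_LISTENER_SECURITY_PROTOCOL_MAP
    value: "INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT"
  - name: KAFKA_INTER_BROKER_LISTENER_NAME
    value: "INTERNAL"
```

In-cluster clients and other brokers use the INTERNAL listener; clients outside the cluster bootstrap through the load balancer and are then handed the EXTERNAL address of each broker.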

Slide 57

Slide 57 text

SOME NOTES ON MANAGEMENT

Slide 58

Slide 58 text

SOME NOTES ON MANAGEMENT ➤ Installation, deploy: ➤ Confluent Helm charts ➤ https://github.com/confluentinc/cp-helm-charts

Slide 59

Slide 59 text

SOME NOTES ON MANAGEMENT ➤ Installation, deploy: ➤ Confluent Helm charts ➤ https://github.com/confluentinc/cp-helm-charts ➤ No Tiller, Helm templating only

Slide 60

Slide 60 text

SOME NOTES ON MANAGEMENT
➤ Installation, deploy:
  ➤ Confluent Helm charts
    ➤ https://github.com/confluentinc/cp-helm-charts
    ➤ No Tiller, Helm templating only
  ➤ Confluent Docker images, with additions

Slide 61

Slide 61 text

SOME NOTES ON MANAGEMENT ➤ Monitoring:

Slide 62

Slide 62 text

SOME NOTES ON MANAGEMENT ➤ Monitoring: ➤ Kafka exposes JMX metrics by default

Slide 63

Slide 63 text

SOME NOTES ON MANAGEMENT
➤ Monitoring:
  ➤ Kafka exposes JMX metrics by default
  ➤ Prometheus JMX Exporter as Java agent (vs. Helm chart sidecar)
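
One way to run the exporter as a Java agent is to add a `-javaagent` flag to the broker JVM through `KAFKA_OPTS`, in the exporter's `jar=port:config` form. This is a sketch, not the talk's actual configuration: the jar path, port 5556, and rules file are hypothetical, and the jar is assumed to have been baked into the image.

```yaml
# Container fragment. Jar path, port 5556, and rules file are hypothetical;
# the jar and config are assumed to be present in the image.
env:
  - name: KAFKA_OPTS
    value: "-javaagent:/opt/jmx-exporter/jmx_prometheus_javaagent.jar=5556:/opt/jmx-exporter/kafka.yml"
ports:
  - name: metrics
    containerPort: 5556
```

Prometheus can then scrape the `metrics` port directly, with no separate sidecar container. One thing to watch for: Kafka command-line tools started inside the same container inherit `KAFKA_OPTS` and may fail when they try to bind the same port.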

Slide 64

Slide 64 text

WHAT WE LEARNED and what we think you should know

Slide 65

Slide 65 text

WHAT WE LEARNED

Slide 67

Slide 67 text

WHAT WE LEARNED ➤ Ephemeral resources

Slide 69

Slide 69 text

WHAT WE LEARNED ➤ Ephemeral resources ➤ Don’t assume static IPs; make sure Kafka clients can handle pod evictions

Slide 70

Slide 70 text

WHAT WE LEARNED ➤ Ephemeral resources ➤ Don’t assume static IPs; make sure Kafka clients can handle pod evictions ➤ Check your JVM DNS cache TTL (networkaddress.cache.ttl)

Slide 72

Slide 72 text

WHAT WE LEARNED ➤ Ephemeral resources ➤ Don’t assume static IPs; make sure Kafka clients can handle pod evictions ➤ Check your JVM DNS cache TTL (networkaddress.cache.ttl) ➤ Check the Apache Kafka JIRA

Slide 73

Slide 73 text

WHAT WE LEARNED ➤ Ephemeral resources ➤ Don’t assume static IPs; make sure Kafka clients can handle pod evictions ➤ Check your JVM DNS cache TTL (networkaddress.cache.ttl) ➤ Oh, and upgrade Kafka.

Slide 74

Slide 74 text

WHAT WE LEARNED ➤ Ephemeral resources ➤ Don’t assume static IPs; make sure Kafka clients can handle pod evictions ➤ Check your JVM DNS cache TTL (networkaddress.cache.ttl) ➤ Check JIRA/Upgrade Kafka ➤ Producers and consumers too!

Slide 75

Slide 75 text

WHAT WE LEARNED
➤ Ephemeral resources
  ➤ Don’t assume static IPs; make sure Kafka clients can handle pod evictions
  ➤ Check your JVM DNS cache TTL (networkaddress.cache.ttl)
  ➤ Check JIRA/Upgrade Kafka
    ➤ Producers and consumers too!
    ➤ Examples: KAFKA-7755, KAFKA-7890
    ➤ See also: cp-helm-charts issue #240
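
`networkaddress.cache.ttl` is a Java security property, so it is normally set in the JVM's `java.security` file or programmatically at startup. A lighter-weight option, sketched below, is the legacy system property `sun.net.inetaddr.ttl`, which the JDK documents as the command-line equivalent and which is consulted when the security property is not set. The 30-second value is illustrative only, and using `JAVA_TOOL_OPTIONS` to pass the flag is an assumption about how the client container is launched.

```yaml
# Container env fragment for a JVM-based Kafka client.
# The 30-second TTL is illustrative; pick a value that suits your failover needs.
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Dsun.net.inetaddr.ttl=30"
```

A short TTL means a client re-resolves a broker's hostname soon after the pod is rescheduled onto a new IP, instead of retrying a stale address.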

Slide 76

Slide 76 text

WHAT WE LEARNED ➤ Different kinds of updates/rolling restarts

Slide 77

Slide 77 text

WHAT WE LEARNED ➤ Different kinds of updates/rolling restarts ➤ Changes to Kafka cluster: versions, broker properties

Slide 78

Slide 78 text

WHAT WE LEARNED ➤ Different kinds of updates/rolling restarts ➤ Changes to Kafka cluster: versions, broker properties ➤ Workload configuration: resources, security policies

Slide 79

Slide 79 text

WHAT WE LEARNED ➤ Different kinds of updates/rolling restarts ➤ Changes to Kafka cluster: versions, broker properties ➤ Workload configuration: resources, security policies ➤ Upgrading Kubernetes nodes

Slide 80

Slide 80 text

Health checks are important for self-healing clusters!

Slide 83

Slide 83 text

CONTAINER PROBES

Slide 84

Slide 84 text

CONTAINER PROBES ➤ Liveness Probe
 “Should I restart this container?”

Slide 85

Slide 85 text

CONTAINER PROBES
➤ Liveness Probe
  “Should I restart this container?”
➤ Readiness Probe
  “Should this container accept traffic?”

Slide 87

Slide 87 text

https://github.com/andreas-schroeder/kafka-health-check

Slide 88

Slide 88 text

CONTAINER PROBES ➤ Liveness Probe
 “Should I restart this container?” ➤ Endpoint: is this broker healthy?

Slide 89

Slide 89 text

CONTAINER PROBES
➤ Liveness Probe
  “Should I restart this container?”
  ➤ Endpoint: is this broker healthy?
  ➤ Don’t be too strict!
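
A sketch of what "not too strict" might look like in practice: the liveness probe below polls an HTTP health endpoint (such as one served by a tool like kafka-health-check) but tolerates several consecutive failures before the container is restarted. Port 8000 and path `/` are assumptions about where that endpoint listens, and every timing value is illustrative.

```yaml
# Container probe fragment. Port 8000 and path "/" are assumptions about
# the health endpoint; all timings and thresholds are illustrative.
livenessProbe:
  httpGet:
    path: /
    port: 8000
  initialDelaySeconds: 60   # give the broker time to start and recover logs
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6       # about a minute of failures before a restart
readinessProbe:
  tcpSocket:
    port: 9092              # assumed broker listener port
  periodSeconds: 10
  failureThreshold: 3
```

With an overly strict liveness probe, a broker that is slow but recovering gets killed and has to start recovery over, which can make an incident worse rather than heal it.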

Slide 90

Slide 90 text

https://github.com/andreas-schroeder/kafka-health-check

Slide 91

Slide 91 text

WHAT WE HAD TO FIGURE OUT ➤ Different kinds of updates/rolling restarts ➤ Changes to Kafka cluster: versions, broker properties ➤ Workload configuration: resources, security policies ➤ Upgrading Kubernetes nodes ➤ How health checks affect updates

Slide 93

Slide 93 text

WHAT WE HAD TO FIGURE OUT ➤ Different kinds of updates/rolling restarts ➤ Changes to Kafka cluster: versions, broker properties ➤ Workload configuration: resources, security policies ➤ Upgrading Kubernetes nodes ➤ How health checks affect updates ➤ podDisruptionBudget

Slide 94

Slide 94 text

WHAT WE HAD TO FIGURE OUT ➤ Different kinds of updates/rolling restarts ➤ Changes to Kafka cluster: versions, broker properties ➤ Workload configuration: resources, security policies ➤ Upgrading Kubernetes nodes ➤ How health checks affect updates ➤ podDisruptionBudget, podManagementPolicy

Slide 95

Slide 95 text

WHAT WE HAD TO FIGURE OUT ➤ Different kinds of updates/rolling restarts ➤ Changes to Kafka cluster: versions, broker properties ➤ Workload configuration: resources, security policies ➤ Upgrading Kubernetes nodes ➤ How health checks affect updates ➤ podDisruptionBudget, podManagementPolicy ➤ Health check overrides ➤ What happens if you deploy a change that breaks the health checks?

Slide 96

Slide 96 text

WHAT WE HAD TO FIGURE OUT
➤ Different kinds of updates/rolling restarts
  ➤ Changes to Kafka cluster: versions, broker properties
  ➤ Workload configuration: resources, security policies
  ➤ Upgrading Kubernetes nodes
➤ How health checks affect updates
  ➤ podDisruptionBudget, podManagementPolicy
  ➤ Health check overrides
    ➤ What happens if you deploy a change that breaks the health checks?
    ➤ See Kubernetes issue #62750
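
For illustration, the two settings named above might look like the following. This is a sketch under assumptions: the resource names and the `app: kafka` label are hypothetical, and `maxUnavailable: 1` is one plausible choice rather than the value used in the talk.

```yaml
# Names and labels are hypothetical. Clusters older than Kubernetes 1.21
# use apiVersion policy/v1beta1 for PodDisruptionBudget.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  maxUnavailable: 1          # voluntary evictions take down one broker at a time
  selector:
    matchLabels:
      app: kafka
---
# Fragment of the broker StatefulSet, not a complete manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  podManagementPolicy: OrderedReady   # start/stop pods one at a time when scaling
  updateStrategy:
    type: RollingUpdate
  # serviceName, replicas, selector, and template omitted
```

The PodDisruptionBudget matters during node upgrades, where pods are evicted by a drain. A StatefulSet rolling update waits for each updated pod to become Ready before moving on, so a change that breaks the readiness check stalls the rollout partway through and may need manual intervention to recover.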

Slide 97

Slide 97 text

MIGRATING PRODUCTION ARCHITECTURE

Slide 101

Slide 101 text

MIGRATING PRODUCTION ARCHITECTURE ➤ Get your hands dirty!

Slide 103

Slide 103 text

MIGRATING PRODUCTION ARCHITECTURE ➤ Get your hands dirty! ➤ Simulate common maintenance tasks

Slide 104

Slide 104 text

MIGRATING PRODUCTION ARCHITECTURE ➤ Get your hands dirty! ➤ Simulate common maintenance tasks ➤ Benchmark for performance ➤ kafka-producer-perf-test, kafka-consumer-perf-test

Slide 105

Slide 105 text

MIGRATING PRODUCTION ARCHITECTURE ➤ Get your hands dirty! ➤ Simulate common maintenance tasks ➤ Benchmark for performance ➤ kafka-producer-perf-test, kafka-consumer-perf-test ➤ Variables: disk type, CPU count, producer record size, producer batch size, Java opts…

Slide 106

Slide 106 text

MIGRATING PRODUCTION ARCHITECTURE ➤ Get your hands dirty! ➤ Simulate common maintenance tasks ➤ Benchmark for performance ➤ kafka-producer-perf-test, kafka-consumer-perf-test ➤ Variables: disk type, CPU count, producer record size, producer batch size, Java opts… ➤ We were able to use Compute Engine persistent disks (shared storage) rather than local SSDs

Slide 107

Slide 107 text

MIGRATING PRODUCTION ARCHITECTURE
➤ Get your hands dirty!
➤ Simulate common maintenance tasks
➤ Benchmark for performance
  ➤ kafka-producer-perf-test, kafka-consumer-perf-test
  ➤ Variables: disk type, CPU count, producer record size, producer batch size, Java opts…
  ➤ We were able to use Compute Engine persistent disks (shared storage) rather than local SSDs
➤ Simulate failure
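
One way to run such a benchmark from inside the cluster is a one-off Kubernetes Job wrapping `kafka-producer-perf-test`. The sketch below is not the talk's setup: the image tag, topic name, bootstrap address, and every number are hypothetical, and the topic is assumed to exist already.

```yaml
# One-off benchmark Job. Image tag, topic, bootstrap address, and all
# numbers are hypothetical; the topic is assumed to exist already.
apiVersion: batch/v1
kind: Job
metadata:
  name: kafka-producer-perf
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: perf
          image: confluentinc/cp-kafka:7.6.0
          command:
            - kafka-producer-perf-test
            - --topic
            - perf-test
            - --num-records
            - "1000000"
            - --record-size      # producer record size in bytes
            - "1024"
            - --throughput       # -1 disables throttling
            - "-1"
            - --producer-props
            - bootstrap.servers=kafka:9092
            - batch.size=65536   # producer batch size in bytes
```

Re-running the Job while changing one variable at a time (record size, batch size, disk type behind the brokers, CPU limits) gives comparable numbers for each configuration.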

Slide 109

Slide 109 text

Try to keep it simple!

Slide 110

Slide 110 text

WHY IT WORKS FOR US (for now, at least!)

Slide 111

Slide 111 text

WHY IT WORKS FOR US

Slide 112

Slide 112 text

WHY IT WORKS FOR US ➤ Increased automation

Slide 113

Slide 113 text

WHY IT WORKS FOR US ➤ Increased automation ➤ Simpler configuration

Slide 114

Slide 114 text

WHY IT WORKS FOR US ➤ Increased automation ➤ Simpler configuration ➤ Efficient resource usage ➤ Bin packing ➤ GKE autoscaling

Slide 115

Slide 115 text

WHY IT WORKS FOR US
➤ Increased automation
➤ Simpler configuration
➤ Efficient resource usage
  ➤ Bin packing
  ➤ GKE autoscaling
➤ Improved developer workflows for streaming services
  ➤ e.g. adding new Kafka Streams applications, Kafka Connect workloads

Slide 117

Slide 117 text

THANK YOU!
Twitter: @NikkiThean
Confluent Slack: @nikki
Email: [email protected]
Thank you to Kamo for drawing inspiration!