Scale in / scale out with Kafka Streams and Kubernetes

Slide 1

Slide 1 text

@Xebiconfr #Xebicon18 @LoicMDivad Build the future Scale in / scale out with Kafka-Streams and Kubernetes Loïc DIVAD Data Engineer @ Xebia France

Slide 2

Slide 2 text


Slide 3

Slide 3 text

Auto-scaling applied to streaming apps: scale out by respecting the workload and resources

Slide 4

Slide 4 text

Loïc DIVAD, Developer @XebiaFr (also #Data Engineer, #Spark Trainer, @DataXDay #Organiser, #[email protected], #DataLover) @LoicMDivad

Slide 5

Slide 5 text

Streaming Apps: Consumer polling

Slide 6

Slide 6 text

Kafka clients and the consumer polling system (diagram: APP)

Slide 7

Slide 7 text

Kafka clients and the consumer polling system (diagram: APP)

Slide 8

Slide 8 text

Kafka clients and the consumer polling system (diagram: APP)

Slide 9

Slide 9 text

Kafka clients and the consumer polling system (diagram: APP)

Slide 10

Slide 10 text

Kafka clients and the consumer polling system (diagram: APP)

Slide 11

Slide 11 text

Streaming Apps: Kafka-Streams and the consumer protocol

Slide 12

Slide 12 text

Kafka-Streams and the consumer protocol
(diagram: topic-partition-0, topic-partition-1, topic-partition-2, …, topic-partition-N and one APP instance)
● Every topic in Kafka is split into one or more partitions
● All the streaming tasks are executed by one or more threads of the same instance

Slide 13

Slide 13 text

Kafka-Streams and the consumer protocol
(diagram: topic-partition-0, topic-partition-1, topic-partition-2, …, topic-partition-N and two APP instances)
● Consumers from the same consumer group cooperate to consume data from topics
● Every instance that joins the group triggers a partition rebalance

Slide 14

Slide 14 text

Kafka-Streams and the consumer protocol
(diagram: topic-partition-0, topic-partition-1, topic-partition-2, …, topic-partition-N and four APP instances)
● The maximum parallelism is determined by the number of partitions of the input topic(s)
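The parallelism limit stated above can be illustrated with a small Python sketch. The round-robin `assign_partitions` helper and the `app-N` instance names are hypothetical, and the logic is far simpler than the assignors Kafka actually uses; the only point is that instances beyond the partition count receive no work.

```python
def assign_partitions(num_partitions, instances):
    """Spread partitions over instances round-robin (illustrative only)."""
    assignment = {instance: [] for instance in instances}
    for partition in range(num_partitions):
        owner = instances[partition % len(instances)]
        assignment[owner].append(partition)
    return assignment


def active_instances(assignment):
    """Return the instances that own at least one partition."""
    return [name for name, partitions in assignment.items() if partitions]


# Two instances share four partitions evenly.
two_instances = assign_partitions(4, ["app-0", "app-1"])

# Six instances for four partitions: two of them stay idle.
six_instances = assign_partitions(4, [f"app-{i}" for i in range(6)])

print(two_instances)
print(len(active_instances(six_instances)))
```

Under these assumptions, scaling out past the number of input partitions adds pods but no extra throughput, which is why an autoscaler's maximum replica count is usually capped at the partition count.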

Slide 15

Slide 15 text

Container Orchestration: K8s or the state of the art

Slide 16

Slide 16 text

Container Orchestration: K8s or the state of the art
➔ Source: Kubernetes.io Documentation

Slide 17

Slide 17 text

Container Orchestration: Support for custom metrics

Slide 18

Slide 18 text

K8s: Support for custom metrics

# deployment.yaml
kind: Deployment
#...
template:
  containers:
  - name: streaming-app
    # ...
  - name: prometheus-to-sd
    # ...

# adapter.yaml
- name: custom-metrics-sd-adapter

Diagram components:
● Your Streaming App (JMX metrics in a Prometheus format)
● Prometheus to Stackdriver: https://gcr.io/google-containers/prometheus-to-sd
● Stackdriver
● Metrics Server: https://gcr.io/google-containers/custom-metrics-stackdriver-adapter

Slide 19

Slide 19 text

K8s: Support for custom metrics

# jmx-exporter.conf
---
global:
  scrape_interval: 1s
  evaluation_interval: 1s
rules:
  - pattern: "kafka.consumer<>(.*):(.*)"
    labels: { partition: $2, topic: GAME-FRAME-RS, metric: $3 }
    name: "consumer_lag_game_frame_rs"
    type: GAUGE
  - pattern: "kafka.consumer<>(.*):(.*)"
    labels: { partition: $2, topic: GAME-FRAME-RQ, metric: $3 }
    name: "consumer_lag_game_frame_rq"
    type: GAUGE
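The consumer_lag_* gauges above track consumer lag. As a minimal Python sketch of what such a gauge measures, assuming lag is the log end offset of a partition minus the consumer's current position on it: the function names and the sample offsets below are made up for illustration and are not taken from the talk's code.

```python
def partition_lag(log_end_offset, consumer_position):
    """Records still to be read on one partition (never negative)."""
    return max(0, log_end_offset - consumer_position)


def total_lag(end_offsets, positions):
    """Sum the lag over all partitions of a topic.

    A partition with no known position is treated as unread from offset 0,
    which is an assumption of this sketch.
    """
    return sum(
        partition_lag(end_offset, positions.get(partition, 0))
        for partition, end_offset in end_offsets.items()
    )


# Hypothetical offsets for a two-partition topic.
end_offsets = {0: 100, 1: 250}
positions = {0: 90, 1: 250}

print(total_lag(end_offsets, positions))
```

A lag that keeps growing means the instances consume more slowly than producers write, which is the signal an autoscaler can react to.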

Slide 20

Slide 20 text

K8s: Support for custom metrics

lolo ➜ ./gradlew dockerPush
<=========----> 73% EXECUTING [2s]
> :docker
…
BUILD SUCCESSFUL in 14s
10 actionable tasks: 5 executed, 5 up-to-date

Slide 21

Slide 21 text

K8s: Support for custom metrics

lolo ➜ ./gradlew dockerPush
<=========----> 73% EXECUTING [2s]
> :docker
…
BUILD SUCCESSFUL in 14s
10 actionable tasks: 5 executed, 5 up-to-date

lolo ➜ terraform apply
…
+ google_container_cluster.primary

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Slide 25

Slide 25 text

Slide 26

Slide 26 text

Slide 27

Slide 27 text

K8s: Support for custom metrics

# deployment.yaml
kind: Deployment
#...
template:
  containers:
  - name: streaming-app
    # ...
  - name: prometheus-to-sd
    # ...

# adapter.yaml
- name: custom-metrics-sd-adapter

Diagram components:
● Your Streaming App
● Prometheus to Stackdriver: https://gcr.io/google-containers/prometheus-to-sd
● Stackdriver
● Metrics Server: https://gcr.io/google-containers/custom-metrics-stackdriver-adapter

Slide 28

Slide 28 text

Container Orchestration: Horizontal Pod Autoscaler

Slide 29

Slide 29 text

K8s: Horizontal Pod Autoscaler
➔ Source: Kubernetes.io Documentation

Slide 30

Slide 30 text

K8s: Horizontal Pod Autoscaler
- A Kubernetes resource
- Periodically adjusts the number of replicas
- Based on CPU usage in autoscaling/v1
- Memory and custom metrics are covered by autoscaling/v2beta1
- Uses the metrics.k8s.io API through a metrics server
➔ Source: Kubernetes.io Documentation

Slide 31

Slide 31 text

Scale in / scale out with Kafka-Streams and K8s

Slide 32

Slide 32 text


Slide 33

Slide 33 text

Scale in / scale out with Kafka-Streams and K8s

Slide 34

Slide 34 text

CONCLUSION
State migration, changelog compaction, topology upgrades and the adoption of K8s StatefulSets are the next challenges to ease auto-scaling.
1. Kafka-Streams exposes relevant metrics related to stream processing
2. Consumer lag is one of the key metrics to monitor in real-time applications
3. The cloud-native trend brings a set of powerful tools that the Kafka community keeps a close eye on

Slide 35

Slide 35 text

THANK YOU

Slide 36

Slide 36 text


Slide 37

Slide 37 text


Slide 38

Slide 38 text

APPENDICES

Slide 39

Slide 39 text

HPA & thrashing: “Should I stay or should I Go?”

The Horizontal Pod Autoscaler algorithm depends on the current metric value and the current number of replicas:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

➢ A ratio of two doubles the number of instances, within the limit of maxReplicas
➢ With targetAverageValue, the metric is computed by averaging the given metric across all Pods

The number of replicas may fluctuate frequently due to the dynamic nature of the metrics; this is called thrashing. The following flags limit it:
➢ --horizontal-pod-autoscaler-downscale-delay (default 5m0s)
➢ --horizontal-pod-autoscaler-upscale-delay (default 3m0s)

Note: Kafka-Streams topology modifications and the HPA both make rolling updates impossible
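The desiredReplicas formula above can be tried out with a few numbers. The Python sketch below is only an approximation of the controller: it applies the ceil formula and clamps the result between assumed `min_replicas` and `max_replicas` bounds, and it ignores the tolerance band and the upscale/downscale delays of the real HPA. The function names and the sample lag values are hypothetical.

```python
import math


def average_metric(per_pod_values):
    """Average a metric across Pods, as targetAverageValue does."""
    return sum(per_pod_values) / len(per_pod_values)


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Apply ceil(currentReplicas * current / target), then clamp to bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))


# Two Pods report a lag of 150 and 250 records; the target average is 100.
lag = average_metric([150, 250])

print(desired_replicas(2, lag, 100))                  # ratio of two: 2 -> 4
print(desired_replicas(4, lag, 100, max_replicas=6))  # capped by max_replicas
print(desired_replicas(4, 50, 100))                   # ratio of one half: 4 -> 2
```

Because the metric is re-read on every control loop, a lag oscillating around the target would flip the result back and forth, which is the thrashing that the delay flags are meant to dampen.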

Slide 40

Slide 40 text

Kafka-Streams & persistent storage: “Let’s talk about states baby”
● Streaming apps maintain states with RocksDB
● States are backed up in changelog topics
● Stateful topology nodes have a portion of the state attached
● At startup, an instance that holds old versions of states is more likely to be assigned the corresponding streaming tasks
● This reduces state migration
● Large states imply a high recovery time
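To make the link between changelog topics and recovery time concrete, here is a minimal Python sketch of a state restore. It assumes the changelog is an ordered list of (key, value) records in which a `None` value is a delete marker (tombstone); a plain dict stands in for the RocksDB store, and `restore_state` is a hypothetical helper, not a Kafka-Streams API.

```python
def restore_state(changelog):
    """Rebuild a key-value store by replaying changelog records in order."""
    state = {}
    for key, value in changelog:
        if value is None:
            # Tombstone: the key was deleted upstream.
            state.pop(key, None)
        else:
            # Later records overwrite earlier ones for the same key.
            state[key] = value
    return state


# Hypothetical changelog for a store counting hits per player.
changelog = [
    ("player-1", 1),
    ("player-2", 1),
    ("player-1", 2),
    ("player-2", None),
]

print(restore_state(changelog))
```

Every record has to be replayed before the task can resume, so the restore time grows with the size of the changelog. An instance that already holds a recent copy of the state only needs the tail of the log, which is why assigning tasks back to such an instance reduces migration.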

Slide 41

Slide 41 text

K8s: Horizontal Pod Autoscaler

Slide 42

Slide 42 text

Apache Kafka now supports more than 200K partitions
https://www.confluent.io/blog/apache-kafka-supports-200k-partitions-per-cluster

Slide 43

Slide 43 text

“Everything is awesome, when you're living in THE CLOUD”

Slide 44

Slide 44 text

Use Case - King Of Fighters: The combos sessionization

Streaming App steps: Correlate, Flatten, Decode, Group, Produce Back

Key => { "ts":1542609460412, "machine":"903071", "zone":"AU" }
Value => { "bytes":[ "c3ff8ab19d00d9e5", "e3ff8c72b600d9e5" ]}

[{ "impact":0, "key":"X", "direction":"DOWN", "type":"Missed", "level":"Pro", "game":"Neowave" }, ...]

Slide 45

Slide 45 text

Links and references
https://www.confluent.io/kafka-summit-sf18/deploying-kafka-streams-applications
https://www.youtube.com/watch?v=9cyXXmRlGWQ
https://www.confluent.io/blog/apache-kafka-kubernetes-could-you-should-you
https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/
https://stackoverflow.com/questions/49482873/how-to-deploy-kafka-stream-applications-on-kubernetes
https://www.youtube.com/watch?v=9TOoThIKafo&list=PLhMG-8t0efEvJM5Bt2_zNLNVCEWcfRN0i
https://www.confluent.io/blog/streaming-in-the-clouds-where-to-start

Ideas from The event oriented architecture
Concepts from Streams and tables white paper
All the code on github: DivLoic/xke-kingof-scaling

Pictures:
● Stormtroopers Photo by Corey Motta on Unsplash
● Screen from “Close Rick-counters of the Rick Kind”
● Screens from https://kubernetes.io/docs/tasks/
● Coin-op machine by xoxo from the Noun Project
● Arcade game by Icons Producer from the Noun Project
● Mainframe by monkik from the Noun Project

Slide 46

Slide 46 text

With special thanks to: Stéphane M., Florent R., Sylvain L. (Teddy Beer)

Slide 47

Slide 47 text

Auto-scaling applied to streaming apps: scale out by respecting the workload and resources