
Kubernetes performance tuning dilemma: How to solve it with AI

There is a “dark side” to Kubernetes that makes it difficult to ensure the desired performance and resilience of cloud-native applications while also keeping their costs under control. The combined effect of Kubernetes resource management mechanisms and application runtime heuristics can create serious performance and resilience risks. See how Akamas' AI-powered optimization solves this!

Presented at MiaPlatform meetup on June 27th, 2023:
https://www.meetup.com/mia-platform-cultura-innovazione-team/events/293631739

Stefano Doni

July 06, 2023


Transcript

  1. Kubernetes performance tuning dilemma: How to solve it with AI. Stefano Doni, CTO
  2. Agenda
     • The dark side of K8s
     • Why is K8s so hard? A peek under the cover
     • Enter AI-powered optimization
     • Demo
  3. Who am I
     • Obsessed with performance optimization
     • 18+ years of capacity & performance work
     • Conference speaker since 2014
     • Co-founder and CTO @ Akamas, the software platform for autonomous optimization, powered by AI
  4. And so K8s was born...
     “So let me get this straight. You want to build an external version of the Borg task scheduler. One of our most important competitive advantages. The one we don’t even talk about externally. And, on top of that, you want to open source it?”
     Craig McLuckie, co-founder of Kubernetes and Senior Product Manager at Google, 2013
     https://cloud.google.com/blog/products/containers-kubernetes/from-google-to-the-world-the-kubernetes-origin-story
  5. The dark side of Kubernetes
     Cost efficiency, app reliability, app performance (Kubernetes FinOps Report, June 2021)
     Kubernetes failure stories: k8s.af
     Talks: youtube.com/watch?v=4CT0cI62YHk • youtu.be/QXApVwRBeys
  6. New tuning challenges for cloud-native apps: 100s-1000s of microservices, 10s-100s of inter-dependent configurations
     • Application runtime resource management: heap memory sizing • garbage collection • processor & thread settings
     • Kubernetes resource management: container resource requests & limits • number of replicas • horizontal auto-scaling settings
  7. Why is K8s so hard? K8s resource management
  8. Resource requests drive K8s cluster costs
     • Requests are resources the container is guaranteed to get
     • Cluster capacity is based on pod resource requests - there is no overcommitment!
     • Resource requests != resource utilization: a cluster can be full even if utilization is 10%
     Example: Pod A requests 2 CPU cores and 2 GB of memory on a node with 4 CPUs and 8 GB of memory. Resource requests from the pod manifest:

       apiVersion: v1
       kind: Pod
       metadata:
         name: pod-a          # "Pod A" on the slide; K8s names must be lowercase
       spec:
         containers:
         - name: app
           image: nginx:1.1
           resources:
             requests:
               memory: "2Gi"
               cpu: "2"
  9. Resource limits may strongly impact application performance and stability
     • A container can consume more resources than it has requested
     • Resource limits specify the maximum resources a container can use (e.g. CPU = 2)
     • When a container hits its resource limits, bad things can happen, as sketched below:
       - When hitting CPU limits: K8s throttles container CPU -> application performance slowdown
       - When hitting memory limits: K8s kills the container -> application stability issues
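     A minimal sketch of how limits are declared next to requests in a pod manifest (hypothetical pod name and image; the values are illustrative, not recommendations):

       apiVersion: v1
       kind: Pod
       metadata:
         name: limited-app            # hypothetical name
       spec:
         containers:
         - name: app
           image: myorg/app:1.0       # hypothetical image
           resources:
             requests:                # guaranteed resources, used for scheduling
               memory: "2Gi"
               cpu: "2"
             limits:                  # hard caps enforced at runtime
               memory: "4Gi"          # usage above this -> container is OOM-killed
               cpu: "2"               # usage above this -> container is CPU-throttled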
  10. So achieving cost-effective, performant & reliable apps on K8s is EASY, right? YES, sure... hell NO! (image generated with Midjourney)
  11. Get ready for some real-world K8s horror stories...
  12. CPU limits don't work the way you think! Surprising impacts on app performance
      A Dev/SRE sees significant CPU throttling... with CPU utilization below 40%: "Why do I have CPU throttling if I'm using less than 40% of my CPU limit? Must be a K8s infrastructure issue..."
      "The container's CPU use is being throttled, because the container is attempting to use more CPU resources than its limit" - https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource
  13. How CPU limits & throttling actually work
      Example: how CPU throttling works with CPU limit = 1 core. The limit is enforced as a CPU-time quota of 100 ms per 100 ms period, shared across all threads:
      • Single-threaded app: thread 1 can run for the whole 100 ms period without throttling.
      • Multi-threaded app: threads 1-4 together consume the 100 ms quota in ~25 ms; the app is then stalled (CPU throttling) until the next period.
      Key takeaways:
      • CPU limits act on CPU time - your container can access all of the CPUs of the node
      • There are no universally good thresholds for CPU throttling
      • Experiment with your different K8s and app runtime settings and monitor app performance!
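      To make the quota arithmetic concrete, here is the same scenario sketched as a manifest (hypothetical names; assumes the default 100 ms CFS period):

        apiVersion: v1
        kind: Pod
        metadata:
          name: throttled-app        # hypothetical name
        spec:
          containers:
          - name: app
            image: myorg/app:1.0     # hypothetical image
            resources:
              limits:
                cpu: "1"             # quota = 1 core x 100 ms period = 100 ms of
                                     # CPU time per period, shared by ALL threads;
                                     # 4 busy threads burn it in ~25 ms, then the
                                     # whole app stalls for the remaining ~75 ms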
  14. Memory limits don't work the way you think! Surprising impacts on app availability
      Container memory usage is below 70% of the container memory limit. A new configuration is recommended to save costs, adapting container memory limits to resource usage: memory limit reduced by 10%.
  15. Memory limits don't work the way you think! Surprising impacts on app availability (continued)
      The new configuration causes a misalignment between K8s resources and the app runtime (JVM heap size vs memory limits). As a result, the application pods get killed by K8s with out-of-memory restarts, causing service availability issues. A sketch of this failure mode follows.
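      Sketch with hypothetical values: the JVM heap was sized for the old limit, so once the limit is cut, heap plus off-heap overhead (metaspace, thread stacks, native buffers) no longer fits, and K8s OOM-kills the pod even though the JVM itself is healthy:

        apiVersion: v1
        kind: Pod
        metadata:
          name: java-app                  # hypothetical name
        spec:
          containers:
          - name: app
            image: myorg/java-app:1.0     # hypothetical image
            env:
            - name: JAVA_TOOL_OPTIONS
              value: "-Xmx1536m"          # fixed heap, unaware of the new limit
            resources:
              limits:
                memory: "1600Mi"          # reduced to save costs: 1.5 GB heap
                                          # + JVM overhead no longer fits ->
                                          # OOMKilled pod restarts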
  16. Rightsizing overprovisioned containers is easy, right?
      Context: Java microservice running in a K8s container with a 6 GB memory limit (default JVM settings). Container memory used is far from saturation (< 35% of the limit). SRE: "I can safely save money by reducing memory limits... right?"
  17. Rightsizing overprovisioned containers is easy, right? Performance impact
      Tuning experiment: the container memory limit is progressively cut from 6 GB to 2 GB under constant load. The app slows down significantly, while container memory utilization stays below 50%! SRE: "Ouch! App performance severely degraded... Why is that?"
  18. Let's dive deeper into the stack... Application runtime resource management
  19. App runtimes are complex engines
      JVM architecture: class loader; execution engine (interpreter, just-in-time compiler, garbage collector); runtime memory areas (heap, thread stacks, loaded classes, compiled code, ...)
  20. How does the JVM set the max heap? JVM ergonomics in K8s are tricky
      Example (source: Microsoft): with a container memory limit of 1 GB, the JVM defaults to a max heap of just 256 MB.
      Key takeaways:
      • The MaxRAMPercentage default is very conservative: increase it, but watch out for OOM kills by K8s
      • Do not trust JVM ergonomics: it's best to explicitly set JVM flags to avoid surprises (-Xmx <max-heap>); see the sketch below
      • Check your apps: docker run --memory 1G <image> java -XX:+PrintFlagsFinal -version 2>&1 | grep -w MaxHeapSize
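      A minimal sketch of setting the heap explicitly via JAVA_TOOL_OPTIONS, which the JVM picks up automatically (hypothetical name and image; 75% is a common starting point, not a universal recommendation):

        apiVersion: v1
        kind: Pod
        metadata:
          name: java-app                       # hypothetical name
        spec:
          containers:
          - name: app
            image: myorg/java-app:1.0          # hypothetical image
            env:
            - name: JAVA_TOOL_OPTIONS
              value: "-XX:MaxRAMPercentage=75" # max heap = 75% of the container
                                               # memory limit, leaving headroom
                                               # for off-heap memory
            resources:
              limits:
                memory: "1Gi"                  # max heap becomes ~768 MB
                                               # instead of the default 256 MB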
  21. Rightsizing overprovisioned containers is hard!
      JVM ergonomics configure heap memory based on container memory (max heap = 25% of the memory limit). As the container memory limit is cut, the JVM max heap shrinks with it, JVM memory pressure builds up, and app response time degrades - even though container memory utilization looks low.
      Key takeaways:
      • Rightsizing K8s containers by looking only at resource usage can lead to huge performance issues
      • Understand what's happening in your application runtime environment
      • Explicitly set app runtime options and check app performance!
  22. A well-tuned GC delivers huge cost benefits as well
      Example: the same microservice consumed 1500 millicores with G1 GC (-XX:+UseG1GC) and 600 millicores with Parallel GC (-XX:+UseParallelGC), while keeping app response time in check: -60% CPU used ($$$).
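      Along the same lines, the GC can be set explicitly instead of being left to ergonomics. A sketch of the relevant container env (whether Parallel GC beats G1 depends on the workload, so measure before and after):

        env:
        - name: JAVA_TOOL_OPTIONS
          value: "-XX:+UseParallelGC -XX:MaxRAMPercentage=75"  # explicit GC + heap sizing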
  23. JVM default ergonomics in K8s: garbage collector
      Chart: GC selected as a function of the number of CPUs and container memory; below 2 CPUs or below 1791 MB of memory the JVM picks Serial GC, otherwise G1 GC.
      Key takeaways:
      • Default GC selection is based on hard-coded thresholds defined decades ago
      • You may end up paying the cost of a suboptimal GC, and you may not even know it!
      • Do not trust JVM ergonomics - always set your JVM options!
  24. What about Golang, Node.js and .NET? Kind of the same :)
      Example: tuning GOGC on a Golang microservice cut CPU used from 400 to 180 millicores (-55%).
      GC tuning guides:
      • Golang: https://tip.golang.org/doc/gc-guide
      • Node.js/V8: https://flaviocopes.com/node-runtime-v8-options
      • .NET: https://learn.microsoft.com/en-us/dotnet/core/runtime-config/garbage-collector
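      For Go, the equivalent knobs are runtime environment variables; a sketch of the container env (values are illustrative; GOMEMLIMIT requires Go 1.19+):

        env:
        - name: GOGC                 # GC target percentage: higher values trade
          value: "200"               # memory for fewer GC cycles (less CPU)
        - name: GOMEMLIMIT           # soft memory limit for the Go runtime;
          value: "900MiB"            # align it with the container memory limit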
  25. How to solve this problem? Enter AI-driven optimization
  26. The Akamas Platform: Optimization Studies and Live Optimizations
  27. Demo: reducing the cost of a Kubernetes microservice, while preserving app performance & reliability
  28. Optimization in cloud-native platforms
      https://www.akamas.io/resources/innovation-self-service-dev-portal-kubernetes-optimization/
  29. Key takeaways
      1. K8s enables unprecedented scalability & efficiency, but it's not automatic
      2. Tuning is your responsibility - if you don't tune, you don't save!
      3. The biggest cost & reliability wins lie in the K8s workload and app runtime layers - don't rely on ergonomics!
      4. AI-powered optimization enables you to automate tuning and achieve savings at scale
  30. Contacts
      Email: [email protected] • Twitter: @AkamasLabs • LinkedIn: @akamaslabs
      Italy HQ: Via Schiaffino 11, Milan, 20158, +39-02-4951-7001
      USA East: 211 Congress Street, Boston, MA 02110, +1-617-936-0212
      USA West: 12130 Millennium Drive, Los Angeles, CA 90094, +1-323-524-0524
      Singapore: 5 Temasek Blvd, Singapore 038985