Kubernetes on Spot Instances

How Delivery Hero uses Spot Instances in production. The caveats of running on spots can teach you a great deal about the resiliency of your applications.


Vojtěch Vondra

February 27, 2019


  1. Kubernetes on Spot Instances: a cost-effective and reliable cloud setup

  2. Logistics Tech in Delivery Hero

    Own delivery fleets in 40 countries, delivering over a million orders a week. Workloads across 4 continents in 3 AWS regions. Typically a few hundred Kubernetes nodes running web and worker workloads.
  3. Logistics Tech in Delivery Hero

    Rails and Spring for web workloads; Scala & Akka for real-time apps; Python, R and various solvers for batch tasks. Mostly Amazon RDS for PostgreSQL and Amazon DynamoDB for persistence. Kubernetes deployments are currently transitioning from kops to Amazon EKS and usually run 1-2 minor versions behind.
  4. None
  5. Cloud has been an obvious choice for us: dealing with peaks and spikiness

    Sample country highlighted: we open the floodgates at 11:30 am.
  6. Primary objective: save money on infrastructure

    ...without spending time tuning to downscale.
  7. Our typical monthly AWS bill structure

    EC2 is typically the biggest component of our bills.
  8. Focusing on big-impact savings first

    Where are the largest marginal improvements? → With the biggest cost contributor: EC2.
  9. Why not Reserved Instances?

    We could only reserve our base load without peaks, or be less elastic. The business is too volatile for capacity predictions. Our workloads change over time unpredictably (memory vs. CPU intensive). … We use them for Kubernetes masters (at least until we’re migrated to EKS).
  10. Impact on our AWS bill

    Cost per order. Order growth vs. cost growth.
  11. Impact on our AWS bill

    Cost per order. Order growth vs. cost growth. Spots introduced.
  12. Example regional workload

  13. Overview of Spot Instances

    Refresher on how the Spot Instance market works.
  14. Refresher on spot fleets

    Instead of a fixed list price, there is a floating market price at which instance types are available. This price is typically less than 50% of the list price. The catch?
    1. AWS can take the instance from you at any time, with a 2-minute warning, if it’s needed elsewhere.
    2. Some of your preferred instance types can be sold out.
  15. Refresher on spot fleets

    Spot instance markets are defined by an AZ and an instance type. If you choose 2 instance types in 3 AZs, you are bidding in 6 different spot instance markets (pools). The more pools you specify, the lower the chance of not being able to maintain the target capacity of a fleet.
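The pool arithmetic on this slide can be sketched in a few lines of Python. The instance types and AZ names below are hypothetical placeholders, not the fleet definition used in the talk; the only point is that every (instance type, AZ) pair is its own pool.

```python
from itertools import product


def spot_pools(instance_types, availability_zones):
    """Return every (instance type, AZ) pair, i.e. every spot pool a fleet can draw from."""
    return list(product(instance_types, availability_zones))


# Hypothetical fleet definition: 2 instance types across 3 AZs.
pools = spot_pools(
    ["m5.xlarge", "c5.xlarge"],
    ["eu-west-1a", "eu-west-1b", "eu-west-1c"],
)
print(len(pools))  # 6
```

Adding a third instance type to the same three AZs would give 9 pools, which is why widening the list of acceptable types is the cheapest way to reduce the risk of a fleet falling below target capacity.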
  16. A lot of recent effort went into spot instance usability

  17. Provisioning spot fleets today

    Good console experience. Well supported by infra automation.
  18. Adapting to spot fleets for a more resilient architecture

    General challenges with spots.
  19. Termination handling

    Close all connections. Finish any long polling. Stop in-progress worker jobs. Terminate your pods. Remove from load balancer. Re-scale to target capacity.
    curl
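As a rough illustration of how a process might learn how much time it has left for the shutdown steps above, here is a minimal Python sketch. It assumes the notice is read from the EC2 instance metadata path `spot/instance-action`, which AWS documents as returning HTTP 404 while no interruption is pending and a small JSON document with `action` and `time` fields once one is scheduled. The slide's own `curl` command is cut off in this transcript, so this endpoint is an assumption about what it showed. The HTTP call itself is left out; the function name and sample values are hypothetical.

```python
import json
from datetime import datetime, timezone

# Metadata path documented by AWS for spot interruption notices (not taken from the slide).
INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def seconds_until_interruption(status_code, body, now):
    """Return the seconds left before interruption, or None if no notice is pending."""
    if status_code != 200:
        # A 404 means no interruption has been scheduled for this instance.
        return None
    notice = json.loads(body)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return max(0.0, (deadline - now).total_seconds())


# Sample response body in the documented shape, with made-up timestamps.
sample_body = '{"action": "terminate", "time": "2019-02-27T12:02:00Z"}'
sample_now = datetime(2019, 2, 27, 12, 0, 0, tzinfo=timezone.utc)
print(seconds_until_interruption(200, sample_body, sample_now))  # 120.0
```

Whatever the remaining time is, it is the budget for closing connections, finishing jobs and deregistering from the load balancer, so shutdown hooks should be sized to fit well inside it.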
  20. Termination handling

    - That’s like Chaos Monkey, right? Yes, but it happens 24/7, not just business hours, and several instances might disappear at the same time.
    - My CTO’s never going to risk that!
  21. Actual issues arising from the volatility

    Applications not terminating gracefully → abruptly terminated connections, stuck jobs.
    Too much of target capacity being collocated on terminated nodes → too many pods of a deployment being affected.
    New capacity not starting fast enough → a lot of apps starting at the same time can cause CPU starvation.
  22. Case study: Spring Boot behavior on boot

    We spent some time making Java pods with Spring Boot boot up more efficiently. They have an ugly pattern of using 100% CPU until all classes are loaded and then idling under load. Some help: -XX:TieredStopAtLevel=1 and removing bytecode-instrumenting APM monitoring.
  23. Kubernetes native handling of spot instance terminations

    How to approach all the challenges using Kubernetes.
  24. Spot Termination Notice Handler DaemonSet

    DaemonSet running on all nodes which drains the node immediately upon seeing the notice. Optional Slack notification to give you a log to correlate monitoring noise / job disruptions with terminations.
    github.com/helm/charts/tree/master/incubator/kube-spot-termination-notice-handler (don’t worry, you’ll find all the links on the last slide)
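To make the "drain immediately" step concrete, the following Python sketch builds the `kubectl drain` invocation such a handler could run for its own node. It is not the chart's actual implementation. The function name, the node name and the 110-second default are hypothetical; the default is chosen only to stay inside the 2-minute warning.

```python
def drain_command(node_name, grace_period_seconds=110):
    """Build a kubectl drain command line for the node that received the notice."""
    return [
        "kubectl", "drain", node_name,
        # DaemonSet pods cannot be evicted to another node, so skip them.
        "--ignore-daemonsets",
        # Also evict pods that are not managed by a controller.
        "--force",
        # Cap pod shutdown time so the drain finishes before the instance is reclaimed.
        f"--grace-period={grace_period_seconds}",
    ]


# Pods using emptyDir volumes need one more flag, whose name depends on the kubectl version.
print(" ".join(drain_command("ip-10-0-0-1.eu-west-1.compute.internal")))
```

The list form can be passed directly to `subprocess.run(...)` by a handler that has detected a termination notice.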
  25. Slack Notifications

  26. Descheduler

    Spot instances can stick around for a long time (~1 year is no problem). Pods will pile up on those nodes; Kubernetes won’t reschedule them by itself.
  27. Descheduler

    Prevent pods of the same deployment from running on the same node. Target node CPU/memory utilization → redistributes pods to less utilized nodes.
    github.com/kubernetes-incubator/descheduler
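The first rule, no two pods of the same deployment on one node, can be illustrated with a small Python sketch. This only shows the idea behind the descheduler's duplicate-removal strategy, not its code; the tuple layout and the pod, owner and node names are hypothetical.

```python
def duplicate_pods(pods):
    """Return pod names to evict so that each owner keeps at most one pod per node.

    `pods` is an iterable of (pod_name, owner, node) tuples.
    """
    seen = set()
    to_evict = []
    for pod_name, owner, node in pods:
        key = (owner, node)
        if key in seen:
            # A pod of this owner already runs on this node, so this one is a duplicate.
            to_evict.append(pod_name)
        else:
            seen.add(key)
    return to_evict


example_pods = [
    ("web-1", "web", "node-a"),
    ("web-2", "web", "node-a"),
    ("web-3", "web", "node-b"),
    ("worker-1", "worker", "node-a"),
]
print(duplicate_pods(example_pods))  # ['web-2']
```

Evicted pods are recreated by their controller and placed by the regular scheduler, which is what spreads them back out across nodes.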
  28. Auto-scaling strategy

    Goal: always have enough capacity to launch new pods. Multiple strategies in Delivery Hero:
    1. Scaling the spot fleet based on CPU/RAM reservations, not usage
    2. Overprovisioning using PodPriority
    github.com/helm/charts/tree/master/stable/cluster-overprovisioner
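A reservation-based metric, as opposed to a usage-based one, could be computed roughly as follows. This is a sketch under assumptions: the function name and the sample numbers are hypothetical, and CPU is expressed in millicores as in Kubernetes resource requests.

```python
def cpu_reserved_percent(pod_cpu_requests_millicores, node_allocatable_millicores):
    """Share of a node's allocatable CPU already claimed by pod requests, in percent."""
    if node_allocatable_millicores <= 0:
        raise ValueError("node allocatable CPU must be positive")
    return 100.0 * sum(pod_cpu_requests_millicores) / node_allocatable_millicores


# Hypothetical node with 4 allocatable cores and four pods requesting 3.2 cores in total.
print(cpu_reserved_percent([500, 1000, 250, 1450], 4000))  # 80.0
```

The scheduler places pods by requests, so a node can be "full" at 80% reserved while its actual CPU usage is far lower. Scaling on reservations tracks whether new pods will fit, which is the stated goal.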
  29. Auto-scaling spot fleet based on custom metrics

    Diagram: DaemonSet → AWS CloudWatch (% CPU/RAM reserved of node) → Spot Fleet autoscaling policy (80% CPU reserved, 50% CPU reserved).
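One way to read the two thresholds on this slide is as a scale-out trigger at 80% CPU reserved and a scale-in trigger at 50%. That reading is an assumption, as are the function and action names in this Python sketch; in the setup described, the actual decision would be made by the Spot Fleet autoscaling policy on the CloudWatch metric, not by application code.

```python
def scaling_decision(cpu_reserved_percent, scale_out_at=80.0, scale_in_at=50.0):
    """Map the reserved-CPU metric to a fleet action using two thresholds."""
    if cpu_reserved_percent >= scale_out_at:
        return "scale_out"  # little room left for new pods, so add nodes
    if cpu_reserved_percent <= scale_in_at:
        return "scale_in"   # plenty of unreserved capacity, so remove nodes
    return "hold"           # between the thresholds: leave the fleet as it is


for value in (85.0, 65.0, 40.0):
    print(value, scaling_decision(value))
```

The gap between the two thresholds acts as a dead band that keeps the fleet from flapping when the metric hovers around a single value.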
  30. Auto-scaling spot fleet based on custom metrics

  31. Case study: beyond stateless apps, resiliency with stateful components

    Our dispatching algo runs on Akka. It’s a stateful, actor-based framework for distributed and concurrent apps. The volatility of spots forced us to fix very broken cluster formation and split-brain situations.
  32. Our current fleet composition

    Take it step by step: move your most stateless, fast-booting pods over first, then continue one by one and monitor for noise. We migrated from on-demand over to spot within 6 months. No need to rush it.
  33. Thanks! Questions? Vojtěch Vondra (@vvondra) tech.deliveryhero.com