Taming your AWS bill – embracing ARM and other tips

Recording: https://youtu.be/Md5J59XMdsw?si=8lTCDWQTZfDlNDpi

Clouds are great enablers of growth – including the growth of your cloud bill. Dext runs predominantly on AWS. Keeping our AWS costs at bay has been an ongoing challenge for the past 4-5 years, with lots of learnings along the way. The most recent one has been our journey from x86 to ARM, for which we will share lessons learned. I also plan to cover our general approach to cost saving in AWS – how we look at our AWS costs, what we do for Kubernetes workloads (a kubecost-inspired approach but without kubecost) and some other tips and tricks that are only mentioned in passing in the fine print of some AWS documentation somewhere, or not mentioned at all. We'll cover S3, EC2, compute in general, traffic, load balancers and others.

Dimitar Dimitrov

November 02, 2023

  1. Taming your AWS bill: Embracing ARM and other tips

    Dimitar Dimitrov, AWS Community Day Bulgaria 2023
  2. Who am I

    Leading the infrastructure team at Dext. Been there for 7 years now (since Oct 2016). Background in full-stack web development. I love solving problems and diving deep into technical issues. I won’t object if you call me a scrooge. 💸
  3. What I’ll share

    • A bit of context and a disclaimer
    • What we’ve achieved
    • How we did what we did
    • Lessons learned (+ practical tips)
    • FAQ (and your questions) at the end
  4. Disclaimer

    Our approach is far from perfect: there are still many pain points in our cost optimisation tooling and processes. We achieve quite a few savings almost by accident, and thus still have lots of room for improvement.
  5. A bit of context

    • Dext is a product company using AWS almost exclusively
    • Mostly web apps, plus a bunch of machine learning
    • Stateless bits are containerised and run in Kubernetes
    • The largest app handles ~15 million requests/day
    • The infrastructure team is a dozen skilled engineers
    • Business grows 20-30% YoY, AWS costs are mostly flat
  6. What we’ve achieved

    A quick overview of our AWS costs historically.
  7. The secret sauce

    while true; do
      How much are we paying for this?
      Why?
      What can we do?
      Do it!
    done
  8. How: a combination of…

    1. A systematic, continuous, never-ending effort
    2. Measuring, questioning and understanding our usage
    3. Purely infrastructural changes (little to no app changes)
    4. Commercial measures (commitments and discounts)
    5. Application changes (potentially the biggest impact)
  9. 1. A never-ending effort

    • It’s never “done”
    • The single most important lesson for me so far
    • While you’re doing R&D, you should be working on cost too
    • Estimate costs during design & development, and…
    • Look at costs after things have been running for a while
  10. 2. Measure & question

    • You’ll need to slice and dice costs in various dimensions
    • You’ll need to have a look at your historical data
    • You’ll likely use a variety of tools and techniques
    • The earlier you put probes in, the better for the future you
  11. Measure & question (pt. 2)

    1. Tag your resources (Terraform default tags are helpful)
    2. Define your billing tags under AWS Billing > Cost allocation tags
    3. Define a Cost and Usage Report to collect detailed cost data in S3
    4. Learn to use the built-in tools:
      ◦ Cost Explorer is very useful, learn to use it
      ◦ Use the cost anomaly detection and budgets
      ◦ AWS Trusted Advisor helps with rightsizing
      ◦ AWS Cost Categories sound promising, but are limited
  12. Measure: tagging

    We use the following tags – lots of legacy in these names: KubernetesCluster, Name, Service, region, env, owner, application, system.

    Billing tags get baked into historical data, so think ahead.
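A tag convention only helps if something checks it. The following is a minimal sketch of such a check, not the tooling described in the talk: the set of required tags is an assumed subset of the list above, and the resource inventory is made-up sample data that in practice would come from whatever inventory source you already have.

```python
# Hypothetical tag audit. REQUIRED_TAGS is an assumed subset of the tags
# listed on the slide; the resources dict is made-up sample data.
REQUIRED_TAGS = {"Service", "env", "owner"}


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or empty, sorted."""
    return sorted(key for key in required if not resource_tags.get(key))


def untagged_report(resources):
    """Map resource ID -> missing tag keys, skipping fully tagged resources."""
    report = {}
    for resource_id, tags in resources.items():
        missing = missing_tags(tags)
        if missing:
            report[resource_id] = missing
    return report


resources = {
    "i-0123": {"Service": "api", "env": "prod", "owner": "infra"},
    "vol-0456": {"Service": "api", "env": ""},
}
print(untagged_report(resources))
```

Treating an empty value the same as a missing key is a design choice here: an empty billing tag is just as useless for cost allocation as no tag at all.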
  13. Measure: custom breakdowns

    • We collect cost data in an S3 bucket via AWS Cost and Usage Reports with hourly granularity, with all the default data + resource IDs.
    • We get a collection of large, gzipped CSVs which we then process and split into our own categories.
  14. Measure: custom breakdowns

    • A cron job pulls in the cost reports from S3 daily
    • Custom logic for assigning each cost line to a category – helpful for the messy reality where uniform conventions are but a mirage
    • It spits out a single static HTML page – the cost breakdown
    • The UI allows filtering and interactivity via client-side JS
    • A very useful tool, but we’ve outgrown it – needs a rewrite
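The "custom logic for assigning each cost line to a category" step could look roughly like the sketch below. It is an illustration under assumptions, not Dext's actual tool: the category rules are invented, and the column names (`lineItem/UnblendedCost`, `lineItem/ProductCode`, `resourceTags/user:<tag>`) follow the CSV Cost and Usage Report layout as I understand it, so verify them against your own report.

```python
import csv
import io
from collections import defaultdict


def categorise(row):
    """Hypothetical rules: prefer the Service tag, then fall back to coarser buckets."""
    service_tag = row.get("resourceTags/user:Service", "")
    if service_tag:
        return service_tag
    if row.get("resourceTags/user:KubernetesCluster"):
        return "kubernetes-untagged"
    return "other/" + (row.get("lineItem/ProductCode") or "unknown")


def breakdown(csv_file):
    """Sum unblended cost per category from an open CSV file object."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[categorise(row)] += float(row["lineItem/UnblendedCost"] or 0)
    return dict(totals)


# Made-up sample rows standing in for a real report file.
sample = io.StringIO(
    "lineItem/ProductCode,lineItem/UnblendedCost,"
    "resourceTags/user:Service,resourceTags/user:KubernetesCluster\n"
    "AmazonEC2,1.50,api,\n"
    "AmazonEC2,0.25,,main\n"
    "AmazonS3,2.00,,\n"
    "AmazonEC2,0.50,api,\n"
)
print(breakdown(sample))
```

For the real gzipped reports, the same `breakdown` function can be fed a file opened with `gzip.open(path, "rt", newline="")`.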
  15. Custom breakdowns

    • It provides a useful high-level overview
    • It breaks down Kubernetes workloads as well (in a kubecost-style way), based on the used CPU and memory of containers
    • Useful for devs as well:
      ◦ Filters are encoded in the URL which enables sharing
    • We still resort to the AWS Cost Explorer for details, though
  16. Kubernetes cost breakdowns

    We fetch CPU and memory requests and actual usage, take the higher of [requested, used] and then calculate the proportion of EC2 cost:

    cost_weight = cpu_occupied * CPU_TO_RAM_COST_RATIO + memory_occupied

    It’s not rocket science, but better use kubecost if it works for you.
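The weighting formula above can be sketched in a few lines. The value of `CPU_TO_RAM_COST_RATIO`, the units (vCPUs and GiB) and the sample workloads are assumptions for illustration; the talk does not state them. The sketch also assumes the whole EC2 bill is split across workloads, so idle capacity is spread proportionally.

```python
# Assumed ratio: one vCPU is priced like 4 GiB of memory. The talk does not
# give the real value; derive yours from the instance prices you pay.
CPU_TO_RAM_COST_RATIO = 4.0


def cost_weight(cpu_requested, cpu_used, mem_requested, mem_used):
    """Weight of one workload: the higher of requested and used, per resource."""
    cpu_occupied = max(cpu_requested, cpu_used)
    memory_occupied = max(mem_requested, mem_used)
    return cpu_occupied * CPU_TO_RAM_COST_RATIO + memory_occupied


def split_ec2_cost(ec2_cost, workloads):
    """Split the EC2 cost across workloads in proportion to their weights."""
    weights = {name: cost_weight(**w) for name, w in workloads.items()}
    total = sum(weights.values())
    if total == 0:
        return {name: 0.0 for name in weights}
    return {name: ec2_cost * w / total for name, w in weights.items()}


# Made-up workloads: CPU in vCPUs, memory in GiB.
workloads = {
    "web": dict(cpu_requested=2.0, cpu_used=1.0, mem_requested=4.0, mem_used=6.0),
    "worker": dict(cpu_requested=1.0, cpu_used=1.5, mem_requested=2.0, mem_used=1.0),
}
shares = split_ec2_cost(100.0, workloads)
print(shares)
```

Taking the maximum of requested and used means a workload pays for capacity it reserves even when idle, and also for bursting above its requests.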
  17. Measuring: challenges

    • Traffic data is hard to understand and break down
    • New untagged resources are a never-ending chase
      ◦ Feeling brave? You can impose a restriction within AWS
    • Our custom breakdown tool is hard to maintain
    • Devs still don’t own (or even understand well) their costs
  18. 3. Purely* infra changes

    • Rightsizing
    • Autoscaling
    • Using spot instances
    • Reducing traffic, especially cross-AZ
    • Self-hosting stuff (vs. using a managed service)
    • Using recent versions of services (e.g. gp3, m6 and m7 EC2s…)
    • Using ARM instead of x86!
  19. Infra changes: rightsizing

    • Depends on resource utilisation monitoring
    • Lots of instance types to choose from – including AMD-based ones; get creative, test, measure and optimise
    • Containerisation allows for more flexibility (but is non-trivial)
    • Our setup: containerised apps running in a single, shared pool of Kubernetes nodes; the challenge – isolation between apps
  20. Rightsizing: the fine print

    • Don’t forget disk and network IO (and their limits)
    • gp3 > gp2 for smaller disks; but gp3 has a 125 MiB/s baseline, while gp2 is 250 MiB/s when the volume is >170 GB
    • EC2 instance types have different caps on EBS bandwidth (key for DBs – watch the EBSIOBalance% and EBSByteBalance% metrics)
    • Don’t forget Throughput Optimized HDD (st1) and Cold HDD (sc1). sc1 is 65% of the price of S3 Standard per GB.
  21. Rightsizing: S3

    • Complicated pricing model, invest time to grok it
    • Don't be wasteful (in number of objects, total size and requests) – things accumulate over time; our biggest expense is S3 currently
    • Storage classes are great, make use of those
    • S3 Glacier Instant Retrieval is very promising (<20% of S3 Standard!)
    • Understand your usage needs (there's a storage analysis tool in the console that's helpful to an extent, though a bit cryptic to understand)
  22. Infra changes: autoscaling

    • Use the native predictive autoscaling for EC2 ASGs
    • Apps must be able to gracefully handle shutdown
    • Minimal to no provisioning at instance launch time
    • We use the K8s HPA on top of custom utilisation metrics – usually the ratio of busy vs currently available workers
    • Timing in autoscaling is key. It usually needs 2-3 mins to react. Tricky to get right in some cases
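To make the "busy vs available workers" metric concrete, here is a sketch of the scaling rule the Kubernetes HPA documents: desired replicas = ceil(current replicas × current metric / target metric), with a tolerance band around the target inside which nothing changes. The target utilisation, bounds and sample numbers are assumptions; the talk does not give Dext's values.

```python
import math


def desired_replicas(current_replicas, busy_workers, available_workers,
                     target_utilisation=0.7, tolerance=0.1,
                     min_replicas=1, max_replicas=50):
    """HPA-style rule applied to a busy/available worker ratio.

    target_utilisation, tolerance and the replica bounds are illustrative
    assumptions, not values from the talk.
    """
    if available_workers <= 0:
        return current_replicas  # no signal, so leave the deployment alone
    utilisation = busy_workers / available_workers
    ratio = utilisation / target_utilisation
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, avoid flapping
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(10, busy_workers=45, available_workers=50))
print(desired_replicas(10, busy_workers=10, available_workers=50))
```

Targeting well below 100% utilisation leaves headroom for the 2-3 minutes the system needs to react.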
  23. Autoscaling

    Start simple, then observe and improve as time allows: even a cron that scales down at 30% at night and on weekends is great and delivers 80% of the value.
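The "simple cron" idea can be sketched as a pure function that a scheduled job would call before setting the replica count. This reads "at 30%" as "30% of daytime capacity", which is an assumption, as are the day boundaries and the function name.

```python
import math
from datetime import datetime


def scheduled_replicas(now, daytime_replicas, off_peak_percent=30,
                       day_start_hour=7, day_end_hour=20):
    """Replica count for a naive time-based schedule.

    Assumes off-peak means nights and weekends, and that off-peak capacity
    is off_peak_percent of the daytime figure; all defaults are illustrative.
    """
    is_weekend = now.weekday() >= 5  # Saturday is 5, Sunday is 6
    is_night = not (day_start_hour <= now.hour < day_end_hour)
    if is_weekend or is_night:
        return max(1, math.ceil(daytime_replicas * off_peak_percent / 100))
    return daytime_replicas


print(scheduled_replicas(datetime(2023, 11, 2, 12, 0), daytime_replicas=20))
print(scheduled_replicas(datetime(2023, 11, 2, 23, 0), daytime_replicas=20))
```

Keeping the decision in a pure function makes the schedule easy to test before it is wired to anything that actually changes capacity.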
  24. Infra changes: spot instances

    • Similar to autoscaling: apps must be able to gracefully handle shutdown within 2 minutes
    • Take care to implement and test the spot termination signal handling
    • Some instance families are less volatile than others
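Graceful shutdown on the application side usually comes down to "stop picking up new work when asked, finish what is in flight". The sketch below shows that pattern for a worker loop. It assumes the termination notice reaches the process as SIGTERM (for example, because the node is drained when the spot interruption notice arrives); the class and function names are hypothetical.

```python
import signal
import threading


class GracefulWorker:
    """Processes jobs until a stop is requested, then exits between jobs."""

    def __init__(self):
        self.stop_requested = threading.Event()
        self.processed = []

    def request_stop(self, signum=None, frame=None):
        # Signature matches what signal.signal() expects from a handler.
        self.stop_requested.set()

    def run(self, jobs):
        for job in jobs:
            if self.stop_requested.is_set():
                break  # leave remaining jobs for another worker
            self.processed.append(job)  # stand-in for the real work
        return self.processed


def install_handlers(worker):
    """Route SIGTERM to the worker; call this from the main thread."""
    signal.signal(signal.SIGTERM, worker.request_stop)


if __name__ == "__main__":
    worker = GracefulWorker()
    install_handlers(worker)
    print(worker.run(["job-1", "job-2", "job-3"]))
```

Each individual job still has to finish well within the two-minute window, so long-running jobs need checkpoints or to be split up.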
  25. Infra changes: reducing traffic

    • Cross-AZ traffic is $0.02/GB – same cost as cross-region traffic. This is really hurting high availability in AWS
    • Keep chatty resources in the same AZ
    • Compress payloads before sending over the network (there’s usually sufficient extra CPU) – gzip, br, etc.
    • Cache stuff
    • Fewer cross-VPC peerings (and do not use Transit Gateway)
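A back-of-the-envelope sketch of what the per-GB price and payload compression mean for the bill. The $0.02/GB figure is the one quoted above; the traffic volume and the JSON payload are made up, and real compression ratios depend entirely on your data.

```python
import gzip
import json

CROSS_AZ_USD_PER_GB = 0.02  # figure quoted on the slide


def monthly_cross_az_cost(gb_per_day, compression_ratio=1.0, days=30):
    """Estimated monthly cost of cross-AZ traffic, optionally compressed."""
    return gb_per_day / compression_ratio * days * CROSS_AZ_USD_PER_GB


# Made-up, repetitive JSON payload; real payloads will compress differently.
payload = json.dumps(
    [{"id": i, "status": "processed", "currency": "EUR"} for i in range(1000)]
).encode()
compressed = gzip.compress(payload)
ratio = len(payload) / len(compressed)

print(f"compression ratio: {ratio:.1f}x")
print(f"uncompressed: ${monthly_cross_az_cost(5000):.0f}/mo")
print(f"compressed:   ${monthly_cross_az_cost(5000, ratio):.0f}/mo")
```

The same function is handy for sanity-checking a traffic anomaly: a few extra terabytes a day across AZs quickly adds up to thousands of dollars a month.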
  26. Infra changes: self-hosting

    • Most AWS managed services are pricey, especially when you scale up
    • We self-host a number of PostgreSQL DBs (on EC2s) for both price (probably <30% of the managed RDS) as well as feature reasons; we still run a few smaller RDSes, though
    • Depends on team capacity and expertise
  27. Infra changes: x86 → ARM

    • Graviton2: 15% faster and 20% cheaper than x86 on avg
    • Graviton3: 30% faster and 15% cheaper than x86 on avg
    • But the single most important feature: vCPUs = cores
    • x86 instances can use up to 60-65% CPU before workloads start suffering (depending on the workload)
    • ARM instances can tolerate 100% CPU load without issues
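Combining the price difference with the usable-CPU observation gives a rough cost per usable vCPU. The sketch below uses normalised, hypothetical prices (x86 at 1.00, ARM 20% cheaper) and the 65% and 100% utilisation ceilings quoted above, which are the speaker's observations rather than guarantees; it ignores per-core speed differences.

```python
def cost_per_usable_vcpu(hourly_price, vcpus, usable_cpu_fraction):
    """Price divided by the vCPUs you can actually load before things degrade."""
    return hourly_price / (vcpus * usable_cpu_fraction)


# Normalised, hypothetical prices for two 16-vCPU instances.
x86 = cost_per_usable_vcpu(hourly_price=1.00, vcpus=16, usable_cpu_fraction=0.65)
arm = cost_per_usable_vcpu(hourly_price=0.80, vcpus=16, usable_cpu_fraction=1.00)

print(f"x86: {x86:.4f} per usable vCPU-hour")
print(f"ARM: {arm:.4f} per usable vCPU-hour")
print(f"ARM / x86: {arm / x86:.2f}")
```

Under these assumptions the list-price discount is the smaller part of the saving; most of it comes from being able to run the nodes hotter.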
  28. ARM migration: overview

    First, let’s look at containerised workloads only (running in Kubernetes). We’ll talk about the process and challenges after that.
  29. ARM migration: all resources

    We gradually moved the remaining non-Kubernetes services to Kubernetes. We also migrated them from x86 to ARM along the way.
  30. ARM migration: overall

    • We use mostly scripting languages, but also a bunch of OS packages and a few proprietary binaries here and there
    • Trickier than it looks – took us 6+ months to migrate
    • The investment was worth it – workloads are cheaper and more stable
  31. ARM migration: approach

    • Using Graviton mainly in containerised apps, but also for our EC2 DBs
    • We had to build multi-arch Docker images in the transition period (where we had parts of an app running on one architecture and the rest running on the other)
    • Cross-arch image building is tricky, avoid it if possible
    • We used separate per-architecture node groups in Kubernetes
    • We had to update the base OS and many packages
  32. ARM migration: challenges

    • ARM could be a bit slower core-per-core for some workloads (5-10%)
    • Some workloads are not ARM-optimized and are slower
    • Proprietary binaries are a PITA, often don’t support ARM
    • The work was mostly “Get X to compile/install” and “Verify app still works with the new version of package Y”
  33. ARM migration: gotchas

    There was a bug in one app due to ImageMagick issues and we had a hard time reproducing and troubleshooting it. We needed it to load a locally-compiled /usr/local/lib/libmagic.so.1 file, but it wasn’t doing that on ARM. Finally, we were able to track down the only difference between the working x86 image and the non-working ARM one:
  34. ARM migration: gotchas

    Lexicographic ordering. It confused the ld shared object loader.
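The transcript does not include the image showing the actual difference, so the following is only a guess at the kind of thing that can go wrong, not a reconstruction of the bug. It assumes a Debian-style layout where library search paths come from snippets in /etc/ld.so.conf.d/ that are applied in sorted filename order; the file names and paths are typical examples, not taken from the talk.

```python
def loader_search_order(conf_files):
    """Hypothetical model: config snippets are applied in sorted filename order."""
    ordered = []
    for name in sorted(conf_files):
        ordered.extend(conf_files[name])
    return ordered


# Assumed, typical snippet names; not taken from the talk.
x86_conf = {
    "libc.conf": ["/usr/local/lib"],
    "x86_64-linux-gnu.conf": ["/usr/lib/x86_64-linux-gnu"],
}
arm_conf = {
    "libc.conf": ["/usr/local/lib"],
    "aarch64-linux-gnu.conf": ["/usr/lib/aarch64-linux-gnu"],
}

print(loader_search_order(x86_conf))  # /usr/local/lib comes first
print(loader_search_order(arm_conf))  # the system directory comes first
```

In this model "aarch64" sorts before "libc" while "x86_64" sorts after it, so a locally compiled library in /usr/local/lib would win on x86 and lose to the system copy on ARM, purely because of the architecture name.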
  35. 4. Commercial measures

    • Private pricing and discount programs (get a discount % in exchange for a spend commitment over time)
    • Reservations
    • Compute savings plans have been great, reducing compute bills by at least 20-30%
    • We purchase 3-year no-upfront plans and aim to have ~80% coverage at peak times, leaving room for optimisations
  36. 5. Application changes

    • Not gonna lie – the biggest savings come from these. Two examples:
      ◦ We’re reorganising our document processing and how we store docs in S3. We expect about $15k/mo savings from this (over 60% savings from our current S3 costs!)
      ◦ Once, we discovered that we were paying $3k/mo more for traffic. Turned out someone shipped a “small change” that was pumping unnecessary cross-AZ traffic from the app to a Redis instance.
  37. FAQ

    • Should I care at all?
    • Outsourcing cost-saving?
    • Can I treat apps like black-boxes?
    • Self-host vs managed services? E.g. RDS vs DB-on-EC2