Taming your AWS bill – embracing ARM and other tips

Recording: https://youtu.be/Md5J59XMdsw?si=8lTCDWQTZfDlNDpi

Clouds are great enablers of growth – including the growth of your cloud bill. Dext runs predominantly on AWS. Keeping our AWS costs at bay has been an ongoing challenge for the past 4-5 years, with lots of learnings along the way. The most recent one has been our journey from x86 to ARM, for which we will share lessons learned. I also plan to cover our general approach to cost saving in AWS – how we look at our AWS costs, what we do for Kubernetes workloads (a kubecost-inspired approach but without kubecost) and some other tips and tricks that are only mentioned in passing in the fine print of some AWS documentation somewhere, or not mentioned at all. We'll cover S3, EC2, compute in general, traffic, load balancers and others.

Dimitar Dimitrov

November 02, 2023

Transcript

  1. Taming your AWS bill
    Embracing ARM and other tips
    Dimitar Dimitrov
    AWS Community Day Bulgaria
    2023

  2. 2010

  3. Heroku

  4. ~$200 /mo

  5. Source: http://tcwd.net/vblog/2022/03/16/operation-homelab-time-to-get-organized/

  6. 2015

  7. [Image slide, no text]

  8. [Image slide, no text]

  9. ~$80k+ /mo 🤯

  10. Who am I
    Leading the infrastructure team at Dext.
    Been there for 7 years now (since Oct 2016).
    Background in full-stack web development.
    I love solving problems and diving deep into technical issues.
    I won’t object if you call me a scrooge. 💸

  11. What I’ll share
    ● A bit of context and a disclaimer
    ● What we’ve achieved
    ● How we did what we did
    ● Lessons learned (+ practical tips)
    ● FAQ (and your questions) at the end

  12. Disclaimer
    Our approach is far from perfect: there are still many pain points in our cost optimisation tooling and processes, and quite a few of our savings were achieved almost by accident, so we still have lots of room for improvement.

  13. A bit of context
    ● Dext is a product company using AWS almost exclusively
    ● Mostly web apps, plus a bunch of machine learning
    ● Stateless bits are containerised and run in Kubernetes
    ● The largest app handles ~15 mil requests/day
    ● The infrastructure team is a dozen skilled engineers
    ● Business grows 20-30% YoY, AWS costs are mostly flat

  14. What we’ve achieved
    A quick historical overview of our AWS costs.

  15. [Chart, x-axis: 2020, 2021, 2022, 2023]

  16. How?

  17. The secret sauce:
    while true; do
    How much are we paying for this?
    Why?
    What can we do?
    Do it!
    done

  18. How: a combination of…
    1. A systematic, continuous, never-ending effort
    2. Measuring, questioning and understanding our usage
    3. Purely infrastructural changes (little to no app changes)
    4. Commercial measures (commitments and discounts)
    5. Application changes (potentially the biggest impact)

  19. 1. A never-ending effort
    ● It’s never “done”
    ● The single most important lesson for me so far
    ● While you’re doing R&D, you should be working on cost too
    ● Estimate costs during design & development, and…
    ● Look at costs after things have been running for a while

  20. 2. Measure & question
    ● You’ll need to slice and dice costs in various dimensions
    ● You’ll need to have a look at your historical data
    ● You’ll likely use a variety of tools and techniques
    ● The earlier you put probes in, the better for the future you

  21. Measure & question (pt. 2)
    1. Tag your resources (Terraform default tags are helpful)
    2. Define your billing tags under AWS Billing > Cost allocation tags
    3. Define a Cost and Usage Report to collect detailed cost data in S3
    4. Learn to use the built-in tools:
    ○ Cost Explorer is very useful
    ○ Use cost anomaly detection and budgets
    ○ AWS Trusted Advisor helps with rightsizing
    ○ AWS Cost Categories sound promising, but are limited

  22. Measure: tagging
    We use the following tags (there’s lots of legacy in these names):
    KubernetesCluster, Name, Service, region, env, owner, application, system
    Billing tags get baked into historical data, so think ahead.

  23. Measure: custom breakdowns
    ● We collect cost data in an S3 bucket via AWS Cost and Usage Reports with hourly granularity, with all the default data + resource IDs.
    ● We get a collection of large, gzipped CSVs which we then process and split into our own categories.
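
A minimal sketch of this kind of processing, assuming the report arrives as gzipped CSV with the standard Cost and Usage Report columns `lineItem/UnblendedCost` and `resourceTags/user:application`. The function name, the choice of tag and the sample rows are illustrative and not taken from the talk.

```python
import csv
import gzip
import io
from collections import defaultdict


def cost_by_tag(gz_bytes: bytes, tag_column: str = "resourceTags/user:application") -> dict:
    """Sum unblended cost per tag value from one gzipped CUR CSV file."""
    totals = defaultdict(float)
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            # Untagged resources get their own bucket so they stay visible.
            tag = row.get(tag_column) or "(untagged)"
            totals[tag] += float(row["lineItem/UnblendedCost"] or 0)
    return dict(totals)


# Tiny in-memory sample standing in for a real report file (made-up data).
sample = (
    "lineItem/ResourceId,lineItem/UnblendedCost,resourceTags/user:application\n"
    "i-aaa,1.50,web\n"
    "i-bbb,0.25,\n"
    "vol-ccc,0.50,web\n"
)
print(cost_by_tag(gzip.compress(sample.encode())))
```

Real reports are much larger, so a production version would stream files from S3 rather than hold them in memory.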

  24. ● Custom costs hierarchy
    ● Various filters
    ● Longer history
    ● Per-category month-over-month diffs

  25. Measure: custom breakdowns
    ● A cron job pulls in the cost reports from S3 daily
    ● Custom logic assigns each cost line to a category – helpful for the messy reality where uniform conventions are but a mirage
    ● It spits out a single static HTML page – the cost breakdown
    ● The UI allows filtering and interactivity via client-side JS
    ● A very useful tool, but we’ve outgrown it – needs a rewrite
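
The category assignment and the month-over-month diffs could look roughly like the sketch below. The rules, category names and cost-line fields (`application`, `product`, `usage_type`, `cost`) are hypothetical; the talk does not describe the actual logic.

```python
from collections import defaultdict

# Hypothetical rules, checked in order; the first match wins.
RULES = [
    (lambda line: line.get("application"), lambda line: "apps/" + line["application"]),
    (lambda line: line.get("product") == "AmazonS3", lambda line: "storage/s3"),
    (lambda line: "DataTransfer" in line.get("usage_type", ""), lambda line: "traffic"),
]


def categorise(line: dict) -> str:
    """Return the category of one cost line, falling back to 'uncategorised'."""
    for matches, name in RULES:
        if matches(line):
            return name(line)
    return "uncategorised"


def totals_by_category(lines: list) -> dict:
    totals = defaultdict(float)
    for line in lines:
        totals[categorise(line)] += line["cost"]
    return dict(totals)


def month_over_month(previous: dict, current: dict) -> dict:
    """Absolute change per category; a category missing in a month counts as 0."""
    categories = set(previous) | set(current)
    return {c: current.get(c, 0.0) - previous.get(c, 0.0) for c in sorted(categories)}


# Made-up cost lines for two months.
october = [{"application": "web", "cost": 100.0}, {"product": "AmazonS3", "cost": 40.0}]
november = [
    {"application": "web", "cost": 90.0},
    {"product": "AmazonS3", "cost": 55.0},
    {"usage_type": "EUW1-DataTransfer-Regional-Bytes", "cost": 5.0},
]
diff = month_over_month(totals_by_category(october), totals_by_category(november))
print(diff)
```

An ordered rule list keeps the fallbacks explicit, which helps when tagging conventions are not uniform.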

  26. Custom breakdowns
    ● It provides a useful high-level overview
    ● It breaks down Kubernetes workloads as well (in a kubecost-style way), based on the used CPU and memory of containers
    ● Useful for devs as well:
    ○ Filters are encoded in the URL, which enables sharing
    ● We still resort to the AWS Cost Explorer for details, though

  27. Kubernetes cost breakdowns
    We fetch CPU and memory requests and actual usage, take the higher of [requested, used] and then calculate the proportion of EC2 cost:
    cost_weight = cpu_occupied * CPU_TO_RAM_COST_RATIO + memory_occupied
    It’s not rocket science, but you’re better off using kubecost if it works for you.
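
The weighting formula can be sketched as follows. The value of `CPU_TO_RAM_COST_RATIO`, the units (cores and GiB) and the workload numbers are assumptions made for the example; the talk gives only the formula.

```python
# Assumed for illustration: one CPU core costs as much as 4 GiB of RAM.
CPU_TO_RAM_COST_RATIO = 4.0


def cost_weight(cpu_requested, cpu_used, mem_requested_gib, mem_used_gib):
    """Weight of one workload: the higher of requested/used for each resource."""
    cpu_occupied = max(cpu_requested, cpu_used)
    memory_occupied = max(mem_requested_gib, mem_used_gib)
    return cpu_occupied * CPU_TO_RAM_COST_RATIO + memory_occupied


def split_node_cost(node_cost, workloads):
    """Share node_cost between workloads in proportion to their cost weight."""
    weights = {name: cost_weight(*usage) for name, usage in workloads.items()}
    total = sum(weights.values())
    return {name: node_cost * weight / total for name, weight in weights.items()}


shares = split_node_cost(
    100.0,
    {
        # (cpu_requested, cpu_used, mem_requested_gib, mem_used_gib)
        "api": (2.0, 1.0, 4.0, 6.0),     # weight: 2 * 4 + 6 = 14
        "worker": (1.0, 1.5, 2.0, 1.0),  # weight: 1.5 * 4 + 2 = 8
    },
)
print(shares)
```

Splitting the full node cost by weight also spreads the cost of idle capacity across the workloads, which is one possible design choice among several.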

  28. Measuring: Infracost

  29. Measuring: challenges
    ● Traffic data is hard to understand and break down
    ● New untagged resources are a never-ending chase
    ○ Feeling brave? You can impose a restriction within AWS
    ● Our custom breakdown tool is hard to maintain
    ● Devs still don’t own (or even understand well) their costs

  30. 3. Purely* infra changes
    ● Rightsizing
    ● Autoscaling
    ● Using spot instances
    ● Reducing traffic, especially cross-AZ
    ● Self-hosting stuff (vs. using a managed service)
    ● Using recent versions of services (e.g. gp3, m6 and m7 EC2s…)
    ● Using ARM instead of x86!

  31. Infra changes: rightsizing
    ● Depends on resource utilisation monitoring
    ● Lots of instance types to choose from – including AMD-based
    ones; get creative, test, measure and optimise
    ● Containerisation allows for more flexibility (but is non-trivial)
    ● Our setup: containerised apps running in a single, shared pool of
    Kubernetes nodes; the challenge – isolation between apps

  32. Rightsizing: the fine print
    ● Don’t forget disk and network IO (and their limits)
    ● gp3 > gp2 for smaller disks; but gp3 has a 125 MiB/s baseline, while gp2 reaches 250 MiB/s when the volume is >170 GB
    ● Each EC2 instance type has its own cap on EBS bandwidth (key for DBs – watch the EBSIOBalance% and EBSByteBalance% metrics)
    ● Don’t forget Throughput Optimized HDD (st1) and Cold HDD (sc1). sc1 is 65% of the price of S3 Standard per GB.

  33. Rightsizing: S3
    ● Complicated pricing model, invest time to grok it
    ● Don't be wasteful (in number of objects, total size and requests) – things accumulate over time; S3 is currently our biggest expense
    ● Storage classes are great, make use of those
    ● S3 Glacier Instant Retrieval is very promising (<20% of the price of S3 Standard!)
    ● Understand your usage needs (there's a storage analysis tool in the console that's helpful to an extent, though a bit cryptic)
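
A rough way to compare storage classes is to price the same volume of data in each. The per-GB prices and the data volume below are illustrative assumptions, not figures from the talk, and should be replaced with current pricing for your region. The sketch also ignores request and retrieval charges, which can change the picture for frequently accessed data.

```python
def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Storage-only cost; requests, retrievals and transfer are not included."""
    return size_gb * price_per_gb_month


# Illustrative per-GB-month prices (assumptions, check current pricing).
STANDARD = 0.023
GLACIER_INSTANT_RETRIEVAL = 0.004

size_gb = 100_000  # made-up volume
standard_cost = monthly_storage_cost(size_gb, STANDARD)
glacier_ir_cost = monthly_storage_cost(size_gb, GLACIER_INSTANT_RETRIEVAL)
ratio = glacier_ir_cost / standard_cost
print(f"Standard: ${standard_cost:,.0f}/mo, Glacier IR: ${glacier_ir_cost:,.0f}/mo ({ratio:.0%})")
```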

  34. Infra changes: autoscaling
    ● Use the native predictive autoscaling for EC2 ASGs
    ● Apps must be able to handle shutdown gracefully
    ● Minimal to no provisioning at instance launch time
    ● We use the K8s HPA on top of custom utilisation metrics – usually the ratio of busy vs currently available workers
    ● Timing in autoscaling is key. It usually needs 2-3 mins to react, which is tricky to get right in some cases
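
For a utilisation metric such as busy vs available workers, the scaling rule documented for the Kubernetes HPA is `desired = ceil(current_replicas * current_metric / target_metric)`. The sketch below applies that rule in plain Python; the 70% target and the replica bounds are assumptions for the example, and the real HPA additionally applies a tolerance and stabilisation windows.

```python
import math


def desired_replicas(current_replicas, busy_workers, available_workers,
                     target_utilisation=0.7, min_replicas=2, max_replicas=50):
    """Replica count that would bring worker utilisation back to the target."""
    if available_workers <= 0:
        return max_replicas  # nothing can take work: scale out as far as allowed
    utilisation = busy_workers / available_workers
    desired = math.ceil(current_replicas * utilisation / target_utilisation)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(10, busy_workers=45, available_workers=50))  # scale out
print(desired_replicas(10, busy_workers=20, available_workers=50))  # scale in
```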

  35. Autoscaling
    Start simple, then observe and improve as time allows:
    Even a cron job that scales down to 30% at night and on weekends is great and delivers 80% of the value.
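
A schedule-based approach can be as small as the function below, which a cron job could call before setting the desired capacity. The off-peak window (22:00 to 06:00), the reading of "30%" as 30% of full capacity and the function name are all assumptions made for the sketch.

```python
from datetime import datetime


def scheduled_capacity(now: datetime, full_capacity: int, off_peak_fraction: float = 0.3) -> int:
    """Capacity to run at `now`: reduced at night and on weekends, full otherwise."""
    is_weekend = now.weekday() >= 5             # Saturday = 5, Sunday = 6
    is_night = now.hour >= 22 or now.hour < 6   # assumed off-peak window
    if is_weekend or is_night:
        return max(1, round(full_capacity * off_peak_fraction))
    return full_capacity


print(scheduled_capacity(datetime(2023, 11, 2, 14, 0), full_capacity=20))  # weekday afternoon
print(scheduled_capacity(datetime(2023, 11, 2, 23, 0), full_capacity=20))  # weekday night
```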

  36. Infra changes: spot instances
    ● Similar to autoscaling: apps must be able to gracefully
    handle shutdown within 2 minutes
    ● Take care to implement and test the spot termination signal
    handling
    ● Some instance families are less volatile than others
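
Graceful shutdown usually comes down to finishing the job in progress and not starting new ones once a termination signal arrives. The sketch below assumes the spot interruption notice reaches the process as an ordinary SIGTERM (for example because the node is drained first); the function names and the job list are illustrative.

```python
import signal
import threading

shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: only set a flag; the worker loop does the actual wind-down."""
    shutdown_requested.set()


def install_handlers():
    """Call once at startup, from the main thread."""
    signal.signal(signal.SIGTERM, request_shutdown)


def run_worker(jobs, handle):
    """Process jobs until they run out or a shutdown is requested.

    Returns the jobs that were never started, so they can be re-queued.
    """
    pending = list(jobs)
    while pending and not shutdown_requested.is_set():
        handle(pending.pop(0))
    return pending
```

Each job has to be short enough, or checkpointed often enough, to finish within the two-minute window mentioned above.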

  37. Infra changes: reducing traffic
    ● Cross-AZ traffic is $0.02/GB – same cost as cross-region traffic. This is
    really hurting high availability in AWS
    ● Keep chatty resources in the same AZ
    ● Compress payloads before sending over the network (there’s usually
    sufficient extra CPU) – gzip, br, etc.
    ● Cache stuff
    ● Fewer cross-VPC peerings (and do not use Transit Gateway)
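
To get a feel for what compression can save, the sketch below estimates monthly cross-AZ cost for a payload before and after gzip, using the $0.02/GB figure quoted above. The payload, the request volume and the function name are made up for illustration.

```python
import gzip
import json

CROSS_AZ_PRICE_PER_GB = 0.02  # figure quoted on the slide


def monthly_cross_az_cost(bytes_per_request: int, requests_per_day: int, days: int = 30) -> float:
    """Traffic cost for one month, treating 1 GB as 10**9 bytes."""
    gigabytes = bytes_per_request * requests_per_day * days / 1e9
    return gigabytes * CROSS_AZ_PRICE_PER_GB


# Made-up JSON payload with the kind of repetition typical of API responses.
payload = json.dumps(
    [{"id": i, "status": "processed", "currency": "EUR"} for i in range(1000)]
).encode()
compressed = gzip.compress(payload)

before = monthly_cross_az_cost(len(payload), requests_per_day=1_000_000)
after = monthly_cross_az_cost(len(compressed), requests_per_day=1_000_000)
print(f"{len(payload)} -> {len(compressed)} bytes per request")
print(f"${before:,.2f}/mo -> ${after:,.2f}/mo")
```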

  38. Infra changes: self-hosting
    ● Most AWS managed services are pricey, especially when you
    scale up
    ● We self-host a number of PostgreSQL DBs (on EC2s) for both
    price (probably <30% of the managed RDS) as well as feature
    reasons; we still run a few smaller RDSes, though
    ● Depends on team capacity and expertise

  39. Infra changes: x86 → ARM
    ● Graviton2: 15% faster and 20% cheaper than x86 on avg
    ● Graviton3: 30% faster and 15% cheaper than x86 on avg
    ● But the single most important feature: vCPUs = cores
    ● x86 instances can use up to 60-65% CPU before workloads start
    suffering (depending on the workload)
    ● ARM instances can tolerate 100% CPU load without issues
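
The effect of usable CPU headroom can be estimated with some back-of-the-envelope arithmetic. The sketch takes the figures from the slide (Graviton2 about 20% cheaper, x86 usable up to roughly 65% CPU, ARM up to 100%) and assumes a purely CPU-bound workload with equal per-core speed, so it gives an optimistic estimate rather than a prediction.

```python
def cost_per_usable_vcpu(price_per_vcpu_hour: float, usable_cpu_fraction: float) -> float:
    """Price of one vCPU scaled by the share of it that can safely be used."""
    return price_per_vcpu_hour / usable_cpu_fraction


x86_price = 1.0                     # normalised; only the ratio matters
graviton2_price = x86_price * 0.80  # "20% cheaper" from the slide

x86 = cost_per_usable_vcpu(x86_price, usable_cpu_fraction=0.65)
arm = cost_per_usable_vcpu(graviton2_price, usable_cpu_fraction=1.00)
saving = 1 - arm / x86
print(f"Estimated saving per usable vCPU: {saving:.0%}")
```

Real workloads also depend on memory, IO and per-core speed, so measured savings will typically come out lower than this estimate.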

  40. ARM migration: overview
    First, let’s look at containerised workloads only (running in Kubernetes).
    We’ll talk about the process and challenges after that.

  41. ARM migration start
    ARM migration done

  42. ARM migration start
    ARM migration done:
    35-40% savings

  43. [Image slide, no text]

  44. ARM migration: all resources
    We gradually moved the remaining non-Kubernetes services to Kubernetes. We also migrated them from x86 to ARM along the way.

  45. [Image slide, no text]

  46. [Image slide, no text]

  47. [Image slide, no text]

  48. ARM migration: overall
    ● We use mostly scripting languages, but also a bunch of OS
    packages and a few proprietary binaries here and there
    ● Trickier than it looks – took us 6+ months to migrate
    ● The investment was worth it – workloads are cheaper and
    more stable

  49. ARM migration: approach
    ● Using Graviton mainly in containerised apps, but also for our EC2 DBs
    ● We had to build multi-arch Docker images in the transition period
    (where we had parts of an app running on one architecture and the rest
    running on the other)
    ● Cross-arch image building is tricky, avoid it if possible
    ● We used separate per-architecture node groups in Kubernetes
    ● We had to update the base OS and many packages
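
With per-architecture node groups, each workload has to land on nodes that match the architecture of its image. One way to express that is the well-known `kubernetes.io/arch` node label, as in the sketch below. Whether the team used this label, custom labels or taints is not stated in the talk, and the helper names are hypothetical.

```python
import platform

# Common machine names mapped to the architecture names used by Docker and Kubernetes.
MACHINE_TO_ARCH = {"x86_64": "amd64", "amd64": "amd64", "aarch64": "arm64", "arm64": "arm64"}


def container_arch(machine=None) -> str:
    """Translate a machine name (default: the current host) to amd64/arm64."""
    machine = (machine or platform.machine()).lower()
    try:
        return MACHINE_TO_ARCH[machine]
    except KeyError:
        raise ValueError(f"unsupported architecture: {machine}") from None


def node_selector(arch: str) -> dict:
    """Pod spec fragment that pins a workload to nodes of one architecture."""
    return {"nodeSelector": {"kubernetes.io/arch": arch}}


print(node_selector(container_arch("aarch64")))
```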

  50. ARM migration: challenges
    ● ARM can be a bit slower core-for-core for some workloads (5-10%)
    ● Some workloads are not ARM-optimized and are slower
    ● Proprietary binaries are a PITA and often don’t support ARM
    ● The work was mostly “Get X to compile/install” and “Verify app still works with the new version of package Y”

  51. ARM migration: gotchas
    There was a bug in one app due to ImageMagick issues and we had a hard time reproducing and troubleshooting it. We needed it to load a locally-compiled /usr/local/lib/libmagic.so.1 file, but it wasn’t doing that on ARM.
    Finally, we were able to track down the only difference between the working x86 image and the non-working ARM one:

  52. ARM migration: gotchas
    x86 (amd64):
    /usr/local/lib/libmagic.so.1
    /lib/x86_64-linux-gnu/libmagic.so.1
    /usr/local/lib/libmagic.so

  53. ARM migration: gotchas
    arm64:
    /lib/aarch64-linux-gnu/libmagic.so.1
    /usr/local/lib/libmagic.so.1
    /usr/local/lib/libmagic.so

  54. ARM migration: gotchas
    Lexicographic ordering.
    It confused the ld shared object loader.
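
The slides do not say what exactly was being ordered. One plausible reading, and it is only an assumption, involves Debian-style images, where `/etc/ld.so.conf.d/libc.conf` lists `/usr/local/lib` and a per-architecture file lists the multiarch directories. If those files are read in name order, the library search order flips between the two architectures, as the sketch below shows.

```python
# Assumed file layout, modelled on Debian-style images; not taken from the talk.
CONF_FILES = {
    "amd64": {
        "libc.conf": ["/usr/local/lib"],
        "x86_64-linux-gnu.conf": ["/lib/x86_64-linux-gnu"],
    },
    "arm64": {
        "libc.conf": ["/usr/local/lib"],
        "aarch64-linux-gnu.conf": ["/lib/aarch64-linux-gnu"],
    },
}


def search_order(arch: str) -> list:
    """Directories in the order they appear when config files are read sorted by name."""
    order = []
    for name in sorted(CONF_FILES[arch]):
        order.extend(CONF_FILES[arch][name])
    return order


for arch in ("amd64", "arm64"):
    print(arch, search_order(arch))
```

Under this assumption, "aarch64…" sorts before "libc.conf" while "x86_64…" sorts after it, which would match the two listings above.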

  55. 4. Commercial measures
    ● Private pricing and discount programs (get a discount % in exchange for a spend commitment over time)
    ● Reservations
    ● Compute savings plans have been great, reducing compute bills by at least 20-30%
    ● We purchase 3-year no-upfront plans and aim to have ~80% coverage at peak times, leaving room for optimisations
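
One reading of "~80% coverage at peak times" is to size the hourly commitment at 80% of the peak hour and then check how much of that commitment is actually used. The sketch below follows that reading. The hourly figures are made up, and spend is assumed to be measured at savings plan rates.

```python
def commitment_for_coverage(hourly_spend, coverage_at_peak=0.8):
    """Hourly commitment that covers the given share of the peak hour."""
    return max(hourly_spend) * coverage_at_peak


def utilisation(hourly_spend, commitment):
    """Share of the commitment that is actually used across all hours."""
    used = sum(min(spend, commitment) for spend in hourly_spend)
    return used / (commitment * len(hourly_spend))


hourly_spend = [40.0, 60.0, 100.0, 80.0]  # made-up compute spend per hour
commitment = commitment_for_coverage(hourly_spend)
print(f"commitment: {commitment:.2f}/h, utilisation: {utilisation(hourly_spend, commitment):.1%}")
```

The gap between peak and commitment is the room left for later optimisations; the lower the utilisation, the more of the commitment is paid for without being used.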

  56. 5. Application changes
    ● Not gonna lie – the biggest savings come from these. Two examples:
    ○ We’re reorganising our document processing and how we store docs
    in S3. We expect about $15k/mo savings from this (over 60% savings
    from our current S3 costs!)
    ○ Once, we discovered that we were paying $3k/mo more for traffic.
    Turned out someone shipped a “small change” that was pumping
    unnecessary cross-AZ traffic from the app to a Redis instance.

  57. FAQ
    ● Should I care at all?
    ● Outsourcing cost-saving?
    ● Can I treat apps like black-boxes?
    ● Self-host vs managed services? E.g. RDS vs DB-on-EC2

  58. Thank You
    [email protected]
    https://ddimitrov.name
    https://github.com/mitio
    https://twitter.com/mitio
    https://linkedin.com/in/mitio