A needle in the haystack: optimizing cloud configurations for price-performance

AWS has 200+ EC2 instance types, each offering a different mix of CPU count, memory size, CPU model and speed, not to mention the various flavors of EBS block storage available. What is best for your workload: high-memory or high-CPU instances? A few high-speed Intel CPUs, or more lower-power AMD or ARM processors? And are the EBS storage options that offer guaranteed IOPS worth their price?

In this talk I presented the results of a study where we tackled the problem using automated performance tests and AI to smartly navigate the sheer number of cloud configurations. The goal was to achieve maximum application performance with minimum cloud costs.

Stefano Doni

January 12, 2024


  1. Cloud compute services offer overwhelming choices

    EC2 instance costs range from $3.4 to $19482 per month (on demand). https://www.slideshare.net/AmazonWebServices/deep-dive-on-amazon-ec2-instances-performance-optimization-best-practices-cmp307r1-aws-reinvent-2018
  2. The experimental approach, aka load test your app

    “There is no substitute for measuring the performance of your entire application, because application performance can be impacted by the underlying infrastructure or by software and architectural limitations. We recommend application-level testing, including the use of application profiling and load testing tools and services.” https://aws.amazon.com/ec2/instance-types/
  3. A bigger problem: same specs, different performance across different cloud providers

    “CockroachDB 2.1 achieves 40% more throughput (tpmC) on TPC-C when tested on AWS using c5d.4xlarge than on GCP via n1-standard-16. We were shocked that AWS offered such superior performance.” Cockroach Labs, https://www.cockroachlabs.com/blog/2018_cloud_report/
  4. Why can't current approaches assure optimal application performance and low costs?

    • May not consider end-to-end application performance
    • May not capture hidden bottlenecks
    • May not capture unique application / workload behaviour
    • May not factor in cloud-specific platforms and implementations (e.g. hypervisors, CPU architectures)
    • Can’t scale to the sheer complexity of cloud options
  5. The use case

    Goal: minimize price/performance of a MongoDB database hosted on AWS. Performance is the throughput of the database (queries/sec); price is the monthly AWS price for the provisioned resources.
    Scenario: Akamas driving automated optimization, including application load tests, with a workflow to provision AWS EC2 and EBS resources as suggested by the AI engine.
    Optimization scope: AWS EC2 instances and EBS storage volumes powering MongoDB.
  6. Modeling the cloud cost-optimization problem

    EC2 (example: c5d.2xlarge): instance family, instance generation, additional capabilities, instance size
    EBS (example: io1, 70 GB, 1000 IOPS): volume type, volume size, volume IOPS
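The EC2 and EBS parameters named on this slide can be sketched as a small data model. This is a minimal illustration, not the talk's actual tooling: `InstanceType`, `EbsVolume` and `parse_instance_type` are hypothetical names, and the regular expression only assumes simple type names such as the ones that appear in this deck (c5d.2xlarge, r4.large, m5a.large).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    """EC2 parameters from the slide (hypothetical model)."""
    family: str        # e.g. "c"
    generation: int    # e.g. 5
    capabilities: str  # e.g. "d"; empty string when there are none
    size: str          # e.g. "2xlarge"


@dataclass(frozen=True)
class EbsVolume:
    """EBS parameters from the slide (hypothetical model)."""
    volume_type: str   # e.g. "io1"
    size_gb: int       # e.g. 70
    iops: int          # e.g. 1000


# Assumption: family letters, then generation digits, then optional
# capability letters, then a dot and the size.
_NAME = re.compile(r"([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)")


def parse_instance_type(name: str) -> InstanceType:
    """Split a simple EC2 type name into the slide's four parameters."""
    match = _NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized instance type name: {name!r}")
    family, generation, capabilities, size = match.groups()
    return InstanceType(family, int(generation), capabilities, size)


example_instance = parse_instance_type("c5d.2xlarge")
example_volume = EbsVolume(volume_type="io1", size_gb=70, iops=1000)
```

Treating each part as a separate parameter is what lets an optimizer vary family, generation, capabilities and size independently instead of choosing from a flat list of type names.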
  7. AI-driven price-performance optimization results

    Baseline configuration: price/performance of r4.large, gp2 70 GB
    Best configuration: -68% price/performance after 18 experiments or approx 22 hours
  8. Best configuration: for the same price, 3x throughput and -90% latency

    Price: -2.9%, 65.52 (best) vs 67.48 (baseline) €/month
    Throughput: +205%, 7605 (best) vs 2493 (baseline) queries/sec
    Latency (avg): -90%, 1330 (best) vs 14575 (baseline) milliseconds
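As a quick check, the price/performance figure can be recomputed from the numbers on this slide. The sketch below assumes price/performance means monthly price divided by throughput in queries/sec, which is how the use case defines the two terms; the function name `price_performance` is only illustrative.

```python
def price_performance(price_per_month: float, throughput_qps: float) -> float:
    """Monthly price paid per query/sec of throughput (lower is better)."""
    if throughput_qps <= 0:
        raise ValueError("throughput must be positive")
    return price_per_month / throughput_qps


# Figures taken from the slide (price in EUR/month, throughput in queries/sec).
baseline = price_performance(67.48, 2493)
best = price_performance(65.52, 7605)

# Relative change of the ratio, in percent (negative means improvement).
change_pct = (best / baseline - 1) * 100
print(f"price/performance change: {change_pct:.1f}%")
```

Under that assumption the result comes out at roughly -68%, in line with the improvement reported for the best configuration.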
  9. How did AI achieve that? A look at the best configuration

    The best configuration for this workload is: m5d.large

    HW specs comparison

    Instance Name | Use cases | vCPUs | Memory (GiB) | Instance Storage | Block Storage (EBS)
    r4.large (baseline) | Memory optimized | 2 x Intel Xeon E5-2686 | 15.25 | - | gp2 70 GB
    m5d.large (best) | General purpose | 2 x Custom Intel Xeon Platinum 8175M | 8 | 1 x 150 GB NVMe SSD | n/a
  10. AI can find unusual configurations: AMD CPUs with half the memory can cut costs and still improve throughput

    The cheapest configuration for this workload is m5a.large: -24% cost with +12% throughput

    HW specs comparison

    Instance Name | Use cases | vCPUs | Memory (GiB) | Instance Storage | Block Storage (EBS)
    r4.large (baseline) | Memory optimized | 2 x Intel Xeon E5-2686 | 15.25 | - | gp2 70 GB
    m5a.large (cheapest) | General purpose | 2 x AMD EPYC | 8 | - | gp2 114 GB

    (Chart: top 5 best configurations, searching instances with EBS storage)
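The two percentages for the cheapest configuration can be combined into a single price/performance change. This is a back-of-the-envelope sketch that again assumes price/performance is price divided by throughput; the function name is hypothetical and the result is derived here, not quoted from the talk.

```python
def ratio_change_pct(price_change_pct: float, throughput_change_pct: float) -> float:
    """Percent change of price/throughput given percent changes of each term."""
    price_factor = 1 + price_change_pct / 100
    throughput_factor = 1 + throughput_change_pct / 100
    if throughput_factor <= 0:
        raise ValueError("throughput change must be above -100%")
    return (price_factor / throughput_factor - 1) * 100


# m5a.large vs r4.large: -24% cost, +12% throughput (figures from the slide).
cheapest_change = ratio_change_pct(-24, 12)
print(f"price/performance change: {cheapest_change:.1f}%")
```

With these inputs the ratio drops by roughly a third, a smaller gain than the best configuration achieves but at a lower monthly bill.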
  11. Debunking a common myth: high resource usage != application performance bottleneck

    Throughput is 12% higher for the m5a.large (cheapest) than for the r4.large (baseline) instance, despite m5a.large having half the memory of r4.large.

    (Charts: memory used and throughput, r4.large vs m5a.large)
  12. Takeaways

    • The technology landscape is becoming more and more complex
    • Traditional approaches are not effective and can’t scale: significant optimization opportunities are left on the table
    • AI for IT optimization is required and can reach previously unthinkable benefits, beyond what human experts can do
    • In the cloud, 70% price/performance improvements are possible by properly exploiting the choices we have
    • Cloud rightsizing recommendations may suggest higher-priced options