
Million Dollar Lines of Code: An Engineering Perspective on Cloud Cost Optimization

A single line of code can shape an organization's financial future. Drawing inspiration from five real examples of million-dollar lines of code, we will challenge conventional views on engineering's pivotal role in cloud cost optimization.

Prepare for a plot twist, however: premature cost optimization can become a distraction and inadvertently hinder innovation. While unchecked cloud costs can cause another kind of distraction (one where you run out of money), pursuing cost reduction incorrectly or at the wrong time can be fatal to market growth and finding that elusive product-market fit.

Done right, cost optimization starts with cultivating a deeper understanding of the economic implications of our efforts as engineers, allowing us to find the balance between user delight, innovation, and cost-efficient code. By venturing into the heart of these million-dollar lines of code and the forces that created them, we uncover the right timing and approach for engineering cost optimization and how to use cost efficiency metrics as powerful constraints that drive innovation, accelerate growth, and engineer profit.

https://qconsf.com/presentation/oct2023/million-dollar-lines-code-engineering-perspective-cloud-cost-optimization

https://www.infoq.com/news/2023/10/engineering-optimize-cost/

Erik Peterson

October 04, 2023

Transcript

  1. @silvexis Million Dollar Lines of Code: An engineering perspective on
     Cloud Cost Optimization. Erik Peterson, Founder & CTO. @silvexis | [email protected] | www.erikpeterson.com
  2. @silvexis I'm Erik, I'm the CTO and Founder of
     CloudZero. I've been building on AWS since 2006. I'm a Serverless believer and startup addict. I founded CloudZero in 2016 to empower engineers to build profitable cloud software. @silvexis [email protected] www.erikpeterson.com
  3. @silvexis “People sometimes forget that 90-plus percent of global IT

    spend is still on premise, and if you believe that equation is going to flip — which we do — it’s going to move to the cloud.” – Andy Jassy, CEO, Amazon The $4.6 trillion* potential shift Sources: *Gartner 2022 Global IT spend, https://www.gartner.com/en/newsroom/press-releases/2022-10-19-gartner-forecasts-worldwide-it-spending-to-grow-5-percent-in-2023 Financial Times, Amazon says cloud growth slowed as customers cut costs, April 27 2023, https://on.ft.com/3Vfk0A7
  4. @silvexis The Engineer's Role in Software Profitability
     • Cost-efficiency often reflects system quality.
     • One line of code can dictate a company's profit margin.
     • Challenge: Which metric best measures cost-efficiency?
  5. @silvexis Disclaimer
     • The following are all real examples that did, or would likely, result in over a million dollars in cloud costs if not caught in time.
     • The code has been refactored and/or rewritten in Python for consistency, to focus on the issue and protect the innocent.
  6. @silvexis Death by debug logging: Even DevOps costs money
     The situation:
     • AWS Lambda function with an average monthly cost of $628.00
     • AWS CloudWatch costs with an average monthly cost of $31,000.00
     • Total cost of this single function since deployment 3 years ago: $1.1 million
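     For scale, a back-of-envelope sketch (the $0.50/GB CloudWatch Logs ingestion rate is my assumption, roughly the us-east-1 list price, not a figure from the deck):

     INGESTION_RATE_PER_GB = 0.50   # assumed us-east-1 CloudWatch Logs ingestion price
     monthly_logs_bill = 31_000.00
     print(monthly_logs_bill / INGESTION_RATE_PER_GB / 1000)  # ~62 TB of logs ingested per month

     lambda_monthly, months = 628.00, 36
     print((lambda_monthly + monthly_logs_bill) * months)     # ~$1.14M over 3 years, matching the ~$1.1M above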
  7. @silvexis
     from aws_lambda_powertools import Logger

     logger = Logger()

     def something_important(really_big_list_of_big_files):
         # This is a really important function that does a lot of stuff
         results = []
         for file in really_big_list_of_big_files:
             with open(file) as f:
                 for line in f:
                     result = do_important_something_with_line(line)
                     logger.debug("Processed line", extra={"line": line, "result": result})
                     results.append(result)
         logger.info("Finished processing files")
         return results
  9. @silvexis THE FIX
     from aws_lambda_powertools import Logger

     logger = Logger()

     def something_important(really_big_list_of_big_files):
         # This is a really important function that does important stuff
         results = []
         for file in really_big_list_of_big_files:
             with open(file) as f:
                 for line in f:
                     result = do_important_something_with_line(line)
                     results.append(result)
         logger.info("Finished processing files")
         return results
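     The fix simply deletes the per-line debug call. If some debug visibility is still wanted in production, Powertools' Logger supports sampling; a minimal sketch (the 1% rate and the handler are my illustration, not from the deck):

     from aws_lambda_powertools import Logger

     logger = Logger(service="important-service", sample_rate=0.01)

     @logger.inject_lambda_context
     def handler(event, context):
         # With sample_rate=0.01, Powertools enables DEBUG output for roughly
         # 1% of invocations; the other 99% log at INFO and never pay
         # CloudWatch ingestion for the per-line debug records.
         logger.debug("expensive detail we only need occasionally")
         return {"status": "ok"}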
  10. @silvexis The API Costs Money? The situation:
      • An MVP/prototype that found its way into production.
      • Now, years later, the product is making billions of API requests to S3.
      • Total cost of this code over 1 year: $1.3 million
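      Why billions of requests turn into seven figures, as a back-of-envelope sketch (the per-request rates are my assumptions, roughly us-east-1 S3 list prices; the volume is illustrative but lands in the same order of magnitude as the $1.3M above):

      GET_PER_1K, LIST_PER_1K = 0.0004, 0.005   # assumed $ per 1,000 S3 GET / LIST requests
      splines_per_day = 1_000_000_000           # one GET + one LIST per spline, per the next slide
      daily_cost = splines_per_day / 1000 * (GET_PER_1K + LIST_PER_1K)
      print(f"${daily_cost:,.0f}/day, ${daily_cost * 365:,.0f}/year")  # ~$5,400/day, ~$2M/year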
  11. @silvexis
      def reticulate_splines(splines_to_reticulate: list, solar_flare_count: int):
          logger.info("Reticulating splines", extra={"spline_count": len(splines_to_reticulate), "solar_flare_count": solar_flare_count})
          client = boto3.client("s3")
          reticulated_splines = []
          for spline in splines_to_reticulate:
              configuration = client.get_object(Bucket="my-bucket", Key="my-config-file.json")
              result = reticulate(spline, configuration)
              paginator = client.get_paginator('list_objects_v2')
              flare_reduction_maps = paginator.paginate(Bucket="my-bucket", Prefix="solar/flare-reduction-maps")
              reticulated_splines.append(adjust_for_solar_flares(result, solar_flare_count, flare_reduction_maps))
          return reticulated_splines
  14. @silvexis THE FIX
      def reticulate_splines(splines_to_reticulate: list, solar_flare_count: int):
          logger.info("Reticulating splines", extra={"spline_count": len(splines_to_reticulate), "solar_flare_count": solar_flare_count})
          client = boto3.client("s3")
          # fetch the configuration and the flare-reduction data once, before the loop
          configuration = client.get_object(Bucket="my-bucket", Key="my-config-file.json")
          flare_reduction_data = get_flare_reduction_data(Bucket="my-bucket", Prefix="solar/flare-reduction-maps")
          reticulated_splines = []
          for spline in splines_to_reticulate:
              result = reticulate(spline, configuration)
              reticulated_splines.append(adjust_for_solar_flares(result, solar_flare_count, flare_reduction_data))
          return reticulated_splines
      + re-factor to use preloaded data here as well
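      One subtlety: `paginator.paginate()` returns a lazy page iterator, so preloading only helps if the listing is materialized once. A hypothetical body for the `get_flare_reduction_data` helper the fix calls (my assumption; the capitalized parameters simply mirror the slide's keyword arguments):

      import boto3

      def get_flare_reduction_data(Bucket: str, Prefix: str) -> list:
          # List the flare-reduction map objects once and download each one
          # once, so every spline reuses the same in-memory data instead of
          # re-listing and re-fetching from S3 on each loop iteration.
          client = boto3.client("s3")
          maps = []
          for page in client.get_paginator("list_objects_v2").paginate(Bucket=Bucket, Prefix=Prefix):
              for obj in page.get("Contents", []):
                  maps.append(client.get_object(Bucket=Bucket, Key=obj["Key"])["Body"].read())
          return maps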
  15. @silvexis Know thy limits, or how to 2x your
      DynamoDB write costs with only a few bytes
      The situation:
      • A developer was asked to add a datetime stamp to every write to DynamoDB
      • The code change took seconds to write, but doubled their DynamoDB costs
      • The total cost to run?
        ◦ OK, I'm not 100% sure :-)
        ◦ I learned of this from @ksshams and loved it so much I had to include it
        ◦ Get the full story here: https://youtu.be/Niy1TZe4dVg
  16. @silvexis
      def write_data_to_dynamodb(k_data: dict):
          # write 1000 byte record to DynamoDB and add a timestamp
          k_data['timestamp'] = datetime.now().astimezone(tz=timezone.utc).isoformat()
          dynamodb = boto3.resource('dynamodb')
          table = dynamodb.Table('my-table')
          table.put_item(Item={'data': k_data})
  17. @silvexis
      def write_data_to_dynamodb(k_data: dict):
          # write 1000 byte record to DynamoDB and add a timestamp
          k_data['timestamp'] = datetime.now().astimezone(tz=timezone.utc).isoformat()
          dynamodb = boto3.resource('dynamodb')
          table = dynamodb.Table('my-table')
          table.put_item(Item={'data': k_data})
      1000 + 32 + 9 = 1041 bytes: over the 1024-bytes-per-WCU limit by 17 bytes. Write costs are now 2x.
  18. @silvexis THE FIX
      def write_data_to_dynamodb(k_data: dict):
          # write 1000 byte record to DynamoDB and add a timestamp
          k_data['ts'] = datetime.utcnow().isoformat(timespec='seconds') + 'Z'
          dynamodb = boto3.resource('dynamodb')
          table = dynamodb.Table('my-table')
          table.put_item(Item={'data': k_data})
      1000 + 20 + 2 = 1022 bytes: under the 1024-bytes-per-WCU limit.
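      A quick check of the byte math from the two slides — attribute names count toward DynamoDB item size just like values do:

      from datetime import datetime, timezone

      long_ts = datetime.now().astimezone(tz=timezone.utc).isoformat()
      short_ts = datetime.utcnow().isoformat(timespec='seconds') + 'Z'
      print(len('timestamp') + len(long_ts))   # 9 + 32 = 41 extra bytes -> the item hits 1041
      print(len('ts') + len(short_ts))         # 2 + 20 = 22 extra bytes -> the item stays at 1022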
  19. @silvexis Leaking Infra-as-code: Terraform Edition
      The situation:
      • Terraform template creating ASGs that scaled up clusters of hundreds of EC2 instances
      • The system refreshed each instance every 24 hours, but for safety, EBS volumes were not automatically deleted
      • Average cost of this system for 1 year: $1.1 million
  20. @silvexis
      AWS Launch Template:
      resource "aws_launch_template" "my_lt" {
        name                   = "my-launch-template"
        image_id               = data.aws_ami.amzlinux2.id
        instance_type          = "c7gd.xlarge"
        vpc_security_group_ids = [module.private_sg.security_group_id]
        key_name               = "my-ssh-key"
        user_data              = filebase64("${path.module}/install.sh")
        ebs_optimized          = true
        update_default_version = true
        block_device_mappings {
          device_name = "/dev/sda1"
          ebs {
            volume_size           = 100
            delete_on_termination = false
            volume_type           = "gp3"
          }
        }
      }
      AWS Autoscaling group:
      resource "aws_autoscaling_group" "leaky_asg" {
        name_prefix           = "leaky-"
        desired_capacity      = 10
        max_size              = 1000
        min_size              = 10
        max_instance_lifetime = 86400
        vpc_zone_identifier   = module.vpc.private_subnets
        target_group_arns     = module.alb.target_group_arns
        health_check_type     = "EC2"
        launch_template {
          id      = aws_launch_template.my_lt.id
          version = aws_launch_template.my_lt.latest_version
        }
        instance_refresh {
          strategy = "Rolling"
          preferences {
            instance_warmup = 300
          }
          triggers = ["launch_template", "desired_capacity"]
        }
      }
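      The leak is `delete_on_termination = false`: with `max_instance_lifetime = 86400`, every daily refresh strands another 100 GB gp3 volume, so the one-line fix is `delete_on_termination = true`. To size an existing leak, a minimal boto3 sketch (mine, not from the deck; the ~$0.08/GB-month gp3 rate is an assumed us-east-1 price):

      import boto3

      # Every unattached ("available") volume is still billed monthly.
      ec2 = boto3.client("ec2")
      orphaned_gb = 0
      for page in ec2.get_paginator("describe_volumes").paginate(
              Filters=[{"Name": "status", "Values": ["available"]}]):
          for volume in page["Volumes"]:
              orphaned_gb += volume["Size"]
      print(f"{orphaned_gb} GB orphaned ~= ${orphaned_gb * 0.08:,.2f}/month")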
  22. @silvexis Cost Delivery Network It's as if 2.3 million
      devices all cried out at once…asking for money
      A software bug on the device "phoned home" repeatedly, causing a massive jump in network costs:
      • After ~14 hours, reached a steady state of about $4.5k an hour
      • Issue was resolved after ~6 days
      • Total cost of incident: ~$648,000.00
      • Potential annual impact: ~$39,420,000.00
  23. @silvexis
      def check_for_update() -> bool:
          # return true if update available, also, now calling this every hour
          logger.info("Checking for updates")
          update_metadata_url = "https://update.cdn.mycmpany.com/update_metadata"
          update_image_url = "https://update.cdn.mycompany.com/latest_image"
          metadata, response_code = download_update_metadata(update_metadata_url)
          if response_code == 200:
              return bool(MY_VERSION < metadata['latest_version'])
          else:
              # fallback file hash check, we don't want a bug to brick our devices
              filename, hash = download_latest_image(update_image_url)
              return bool(MY_HASH != hash)
  25. @silvexis THE (real) FIX
      def check_for_update() -> bool:
          # return true if update available, also, now calling this every hour
          logger.info("Checking for updates")
          # the hostname typo ("mycmpany") made every metadata fetch fail,
          # sending every device down the full-image fallback path every hour
          update_metadata_url = "https://update.cdn.mycompany.com/update_metadata"
          update_image_url = "https://update.cdn.mycompany.com/latest_image"
          metadata, response_code = download_update_metadata(update_metadata_url)
          if response_code == 200:
              return bool(MY_VERSION < metadata['latest_version'])
          else:
              # fallback file hash check, we don't want a bug to brick our devices
              filename, hash = download_latest_image(update_image_url)
              return bool(MY_HASH != hash)
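      Beyond the typo, the expensive behavior was the fallback itself: every failed metadata fetch triggered a full image download, fleet-wide, every hour. A defensive variant (my sketch, reusing the slide's helpers and constants; the thresholds are illustrative):

      import random

      FAILED_CHECKS = 0  # persisted across the hourly runs

      def check_for_update_defensively() -> bool:
          # On metadata failure, do NOT immediately download the full image:
          # only run the expensive hash fallback after ~a day of consecutive
          # failures, and only for a random ~10% of the fleet each hour.
          global FAILED_CHECKS
          metadata, response_code = download_update_metadata(
              "https://update.cdn.mycompany.com/update_metadata")
          if response_code == 200:
              FAILED_CHECKS = 0
              return bool(MY_VERSION < metadata['latest_version'])
          FAILED_CHECKS += 1
          if FAILED_CHECKS >= 24 and random.random() < 0.1:
              filename, image_hash = download_latest_image(
                  "https://update.cdn.mycompany.com/latest_image")
              return bool(MY_HASH != image_hash)
          return False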
  26. @silvexis What did we learn?
      • Storage is still cheap, but calling APIs costs money
      • I've got infinite scale, but apparently not infinite wallet
      • CDNs are very good at eating traffic…and money
      • Engineers now need to add "agonizing over the cost of their code" to their list of things to do?
  27. @silvexis "We should forget about small efficiencies, say about 97%
      of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." – Donald Knuth
      Source: "Structured Programming with go to Statements", Knuth, 1974
  28. @silvexis "Premature optimization is the root of all evil."
      • All of these examples are only problems at scale
      • Cloud software engineers should think about these questions, but iteratively and over time, not all at once:
        ◦ Can it be done?
        ◦ Is this the best way to do it, as a team?
        ◦ What happens if this thing becomes popular?
        ◦ How much money should it cost to run?
      • The metric for answering the money question, however, is not raw dollars; it's tracking your desired Cloud Efficiency Rate
  29. @silvexis Cloud Efficiency Rate (CER): How and when to optimize
      CER = (Revenue – Cloud Costs) / Revenue
      https://www.cloudzero.com/blog/cloud-efficiency-rate
      Example (computed in the sketch below):
      • Company A's annual revenue is $100M and annual cloud costs are $20M.
      • Company A's cost per dollar of revenue is $0.20, for a Cloud Efficiency Rate of 80%.
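      The definition above as a trivial sketch:

      def cloud_efficiency_rate(revenue: float, cloud_costs: float) -> float:
          # CER = (Revenue - Cloud Costs) / Revenue, expressed as a percentage
          return (revenue - cloud_costs) / revenue * 100

      print(cloud_efficiency_rate(100_000_000, 20_000_000))  # Company A: 80.0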
  30. @silvexis Cloud Efficiency Rate in practice
      Cloud Efficiency Rate (CER) should become a non-functional requirement for any cloud project, with defined stages to aid prioritization (a guardrail sketch follows this list):
      • Research and development: a negative CER is acceptable!
      • Version 1 / MVP: break even or low CER (0% to 25%)
      • Product-market fit (PMF): acceptable margins are conceivable (25% to 50%)
      • Scaling: demonstrable quarter-over-quarter path to healthy margins (50% to 80%)
      • Steady state: healthy margins = healthy business (CER is 80%+)
      As a rule of thumb, you should target a CER of 90%.
      https://www.cloudzero.com/blog/cloud-efficiency-rate
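      A sketch of CER as a checkable non-functional requirement (the stage floors come from the list above; the function and stage names are my illustration):

      CER_FLOORS = {
          "r&d": float("-inf"),      # negative CER is acceptable
          "mvp": 0.0,
          "pmf": 25.0,
          "scaling": 50.0,
          "steady-state": 80.0,
      }

      def meets_cer_target(stage: str, revenue: float, cloud_costs: float) -> bool:
          cer = (revenue - cloud_costs) / revenue * 100
          return cer >= CER_FLOORS[stage]

      print(meets_cer_target("scaling", 10_000_000, 6_000_000))  # 40% CER -> False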
  31. @silvexis Final thoughts
      "I call it my billion-dollar mistake. It was the invention of the null reference in 1965." "This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years." – Sir Tony Hoare
      The Billion Dollar Mistake - QCon London
      https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/
  32. @silvexis Remember to vote and share feedback on the QCon
      App. Any questions? Let's continue the conversation!
      • @silvexis
      • [email protected]