
Million Dollar Lines of Code: An Engineering Perspective on Cloud Cost Optimization

A single line of code can shape an organization's financial future. Drawing inspiration from five real examples of million-dollar lines of code, we will challenge conventional views on engineering's pivotal role in cloud cost optimization.

Prepare for a plot twist, however: premature cost optimization can become a distraction and inadvertently hinder innovation. While unchecked cloud costs can cause another kind of distraction (one where you run out of money), pursuing cost reduction incorrectly or at the wrong time can be fatal to market growth and finding that elusive product-market fit.

Done right, cost optimization starts with cultivating a deeper understanding of the economic implications of our efforts as engineers, allowing us to find the balance between user delight, innovation, and cost-efficient code. By venturing into the heart of these million-dollar lines of code and the forces that created them, we uncover the right timing and approach for engineering cost optimization and how to use cost efficiency metrics as powerful constraints that drive innovation, accelerate growth, and engineer profit.

https://qconsf.com/presentation/oct2023/million-dollar-lines-code-engineering-perspective-cloud-cost-optimization

https://www.infoq.com/news/2023/10/engineering-optimize-cost/

Erik Peterson

October 04, 2023

Transcript

  1. @silvexis Million Dollar Lines of Code: An engineering perspective on
     Cloud Cost Optimization. Erik Peterson, Founder & CTO. @silvexis | [email protected] | www.erikpeterson.com
  2. @silvexis I'm Erik, I'm the CTO and Founder of
     CloudZero. I've been building on AWS since 2006. I'm a Serverless believer and startup addict. I founded CloudZero in 2016 to empower engineers to build profitable cloud software. @silvexis [email protected] www.erikpeterson.com
  3. @silvexis “People sometimes forget that 90-plus percent of global IT

    spend is still on premise, and if you believe that equation is going to flip — which we do — it’s going to move to the cloud.” – Andy Jassy, CEO, Amazon The $4.6 trillion* potential shift Sources: *Gartner 2022 Global IT spend, https://www.gartner.com/en/newsroom/press-releases/2022-10-19-gartner-forecasts-worldwide-it-spending-to-grow-5-percent-in-2023 Financial Times, Amazon says cloud growth slowed as customers cut costs, April 27 2023, https://on.ft.com/3Vfk0A7
  4. @silvexis The Engineer's Role in Software Profitability
     • Cost-efficiency often reflects system quality.
     • One line of code can dictate a company's profit margin.
     • Challenge: Which metric best measures cost-efficiency?
  5. @silvexis Disclaimer
     • The following are all real examples that did, or would likely, result in over a million dollars in cloud costs if not caught in time.
     • The code has been refactored and/or rewritten in Python for consistency, to focus on the issue and protect the innocent.
  6. @silvexis Death by debug logging: Even DevOps costs money
     The situation:
     • AWS Lambda function with an average monthly cost of $628.00
     • AWS CloudWatch costs with an average monthly cost of $31,000.00
     • Total cost of this single function since deployment 3 years ago: $1.1 million
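     For scale, a back-of-envelope sketch (the $0.50/GB CloudWatch Logs ingestion rate is my assumption, roughly the us-east-1 list price, not a figure from the deck):

     INGESTION_RATE_PER_GB = 0.50   # assumed us-east-1 CloudWatch Logs ingestion price
     monthly_logs_bill = 31_000.00
     print(monthly_logs_bill / INGESTION_RATE_PER_GB / 1000)  # ~62 TB of logs ingested per month

     lambda_monthly, months = 628.00, 36
     print((lambda_monthly + monthly_logs_bill) * months)     # ~$1.14M over 3 years, matching the ~$1.1M above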
  7. @silvexis
     from aws_lambda_powertools import Logger

     logger = Logger()

     def something_important(really_big_list_of_big_files):
         # This is a really important function that does a lot of stuff
         results = []
         for file in really_big_list_of_big_files:
             with open(file) as f:
                 for line in f:
                     result = do_important_something_with_line(line)
                     logger.debug("Processed line", extra={"line": line, "result": result})
                     results.append(result)
         logger.info("Finished processing files")
         return results
  9. @silvexis THE FIX
     from aws_lambda_powertools import Logger

     logger = Logger()

     def something_important(really_big_list_of_big_files):
         # This is a really important function that does important stuff
         results = []
         for file in really_big_list_of_big_files:
             with open(file) as f:
                 for line in f:
                     result = do_important_something_with_line(line)
                     results.append(result)
         logger.info("Finished processing files")
         return results
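     The fix simply deletes the per-line debug call. If some debug visibility is still wanted in production, Powertools' Logger supports sampling; a minimal sketch (the 1% rate and the handler are my illustration, not from the deck):

     from aws_lambda_powertools import Logger

     logger = Logger(service="important-service", sample_rate=0.01)

     @logger.inject_lambda_context
     def handler(event, context):
         # With sample_rate=0.01, Powertools enables DEBUG output for roughly
         # 1% of invocations; the other 99% log at INFO and never pay
         # CloudWatch ingestion for the per-line debug records.
         logger.debug("expensive detail we only need occasionally")
         return {"status": "ok"}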
  10. @silvexis The API Costs Money? The situation:
      • An MVP/prototype that found its way into production.
      • Now, years later, the product is making billions of API requests to S3.
      • Total cost of this code over 1 year: $1.3 million
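      Why billions of requests turn into seven figures, as a back-of-envelope sketch (the per-request rates are my assumptions, roughly us-east-1 S3 list prices; the volume is illustrative but lands in the same order of magnitude as the $1.3M above):

      GET_PER_1K, LIST_PER_1K = 0.0004, 0.005   # assumed $ per 1,000 S3 GET / LIST requests
      splines_per_day = 1_000_000_000           # one GET + one LIST per spline, per the next slide
      daily_cost = splines_per_day / 1000 * (GET_PER_1K + LIST_PER_1K)
      print(f"${daily_cost:,.0f}/day, ${daily_cost * 365:,.0f}/year")  # ~$5,400/day, ~$2M/year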
  11. @silvexis
      def reticulate_splines(splines_to_reticulate: list, solar_flare_count: int):
          logger.info("Reticulating splines", extra={"spline_count": len(splines_to_reticulate), "solar_flare_count": solar_flare_count})
          client = boto3.client("s3")
          reticulated_splines = []
          for spline in splines_to_reticulate:
              configuration = client.get_object(Bucket="my-bucket", Key="my-config-file.json")
              result = reticulate(spline, configuration)
              paginator = client.get_paginator('list_objects_v2')
              flare_reduction_maps = paginator.paginate(Bucket="my-bucket", Prefix="solar/flare-reduction-maps")
              reticulated_splines.append(adjust_for_solar_flares(result, solar_flare_count, flare_reduction_maps))
          return reticulated_splines
  14. @silvexis THE FIX
      def reticulate_splines(splines_to_reticulate: list, solar_flare_count: int):
          logger.info("Reticulating splines", extra={"spline_count": len(splines_to_reticulate), "solar_flare_count": solar_flare_count})
          client = boto3.client("s3")
          # fetch the configuration and the flare-reduction data once, before the loop
          configuration = client.get_object(Bucket="my-bucket", Key="my-config-file.json")
          flare_reduction_data = get_flare_reduction_data(Bucket="my-bucket", Prefix="solar/flare-reduction-maps")
          reticulated_splines = []
          for spline in splines_to_reticulate:
              result = reticulate(spline, configuration)
              reticulated_splines.append(adjust_for_solar_flares(result, solar_flare_count, flare_reduction_data))
          return reticulated_splines
      + re-factor to use preloaded data here as well
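      One subtlety: `paginator.paginate()` returns a lazy page iterator, so preloading only helps if the listing is materialized once. A hypothetical body for the `get_flare_reduction_data` helper the fix calls (my assumption; the capitalized parameters simply mirror the slide's keyword arguments):

      import boto3

      def get_flare_reduction_data(Bucket: str, Prefix: str) -> list:
          # List the flare-reduction map objects once and download each one
          # once, so every spline reuses the same in-memory data instead of
          # re-listing and re-fetching from S3 on each loop iteration.
          client = boto3.client("s3")
          maps = []
          for page in client.get_paginator("list_objects_v2").paginate(Bucket=Bucket, Prefix=Prefix):
              for obj in page.get("Contents", []):
                  maps.append(client.get_object(Bucket=Bucket, Key=obj["Key"])["Body"].read())
          return maps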
  15. @silvexis Know thy limits, or how to 2x your
      DynamoDB write costs with only a few bytes
      The situation:
      • A developer was asked to add a datetime stamp to every write to DynamoDB
      • The code change took seconds to write, but doubled their DynamoDB costs
      • The total cost to run?
        ◦ OK, I'm not 100% sure :-)
        ◦ I learned of this from @ksshams and loved it so much I had to include it
        ◦ Get the full story here: https://youtu.be/Niy1TZe4dVg
  16. @silvexis
      def write_data_to_dynamodb(k_data: dict):
          # write 1000 byte record to DynamoDB and add a timestamp
          k_data['timestamp'] = datetime.now().astimezone(tz=timezone.utc).isoformat()
          dynamodb = boto3.resource('dynamodb')
          table = dynamodb.Table('my-table')
          table.put_item(Item={'data': k_data})
  17. @silvexis
      def write_data_to_dynamodb(k_data: dict):
          # write 1000 byte record to DynamoDB and add a timestamp
          k_data['timestamp'] = datetime.now().astimezone(tz=timezone.utc).isoformat()
          dynamodb = boto3.resource('dynamodb')
          table = dynamodb.Table('my-table')
          table.put_item(Item={'data': k_data})
      1000 + 32 + 9 = 1041 bytes: over the 1024-bytes-per-WCU limit by 17 bytes. Write costs are now 2x.
  18. @silvexis THE FIX
      def write_data_to_dynamodb(k_data: dict):
          # write 1000 byte record to DynamoDB and add a timestamp
          k_data['ts'] = datetime.utcnow().isoformat(timespec='seconds') + 'Z'
          dynamodb = boto3.resource('dynamodb')
          table = dynamodb.Table('my-table')
          table.put_item(Item={'data': k_data})
      1000 + 20 + 2 = 1022 bytes: under the 1024-bytes-per-WCU limit.
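      A quick check of the byte math from the two slides — attribute names count toward DynamoDB item size just like values do:

      from datetime import datetime, timezone

      long_ts = datetime.now().astimezone(tz=timezone.utc).isoformat()
      short_ts = datetime.utcnow().isoformat(timespec='seconds') + 'Z'
      print(len('timestamp') + len(long_ts))   # 9 + 32 = 41 extra bytes -> the item hits 1041
      print(len('ts') + len(short_ts))         # 2 + 20 = 22 extra bytes -> the item stays at 1022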
  19. @silvexis Leaking Infra-as-code: Terraform Edition
      The situation:
      • Terraform template creating ASGs that scaled up clusters of hundreds of EC2 instances
      • The system refreshed each instance every 24 hours, but for safety, EBS volumes were not automatically deleted
      • Average cost of this system for 1 year: $1.1 million
  20. @silvexis
      AWS Launch Template:
      resource "aws_launch_template" "my_lt" {
        name                   = "my-launch-template"
        image_id               = data.aws_ami.amzlinux2.id
        instance_type          = "c7gd.xlarge"
        vpc_security_group_ids = [module.private_sg.security_group_id]
        key_name               = "my-ssh-key"
        user_data              = filebase64("${path.module}/install.sh")
        ebs_optimized          = true
        update_default_version = true
        block_device_mappings {
          device_name = "/dev/sda1"
          ebs {
            volume_size           = 100
            delete_on_termination = false
            volume_type           = "gp3"
          }
        }
      }
      AWS Autoscaling group:
      resource "aws_autoscaling_group" "leaky_asg" {
        name_prefix           = "leaky-"
        desired_capacity      = 10
        max_size              = 1000
        min_size              = 10
        max_instance_lifetime = 86400
        vpc_zone_identifier   = module.vpc.private_subnets
        target_group_arns     = module.alb.target_group_arns
        health_check_type     = "EC2"
        launch_template {
          id      = aws_launch_template.my_lt.id
          version = aws_launch_template.my_lt.latest_version
        }
        instance_refresh {
          strategy = "Rolling"
          preferences {
            instance_warmup = 300
          }
          triggers = ["launch_template", "desired_capacity"]
        }
      }
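      The leak is `delete_on_termination = false`: with `max_instance_lifetime = 86400`, every daily refresh strands another 100 GB gp3 volume, so the one-line fix is `delete_on_termination = true`. To size an existing leak, a minimal boto3 sketch (mine, not from the deck; the ~$0.08/GB-month gp3 rate is an assumed us-east-1 price):

      import boto3

      # Every unattached ("available") volume is still billed monthly.
      ec2 = boto3.client("ec2")
      orphaned_gb = 0
      for page in ec2.get_paginator("describe_volumes").paginate(
              Filters=[{"Name": "status", "Values": ["available"]}]):
          for volume in page["Volumes"]:
              orphaned_gb += volume["Size"]
      print(f"{orphaned_gb} GB orphaned ~= ${orphaned_gb * 0.08:,.2f}/month")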
  22. @silvexis Cost Delivery Network It's as if 2.3 million
      devices all cried out at once…asking for money
      A software bug on the device "phoned home" repeatedly, causing a massive jump in network costs:
      • After ~14 hours, reached a steady state of about $4.5k an hour
      • Issue was resolved after ~6 days
      • Total cost of incident: ~$648,000.00
      • Potential annual impact: ~$39,420,000.00
  23. @silvexis
      def check_for_update() -> bool:
          # return true if update available, also, now calling this every hour
          logger.info("Checking for updates")
          update_metadata_url = "https://update.cdn.mycmpany.com/update_metadata"
          update_image_url = "https://update.cdn.mycompany.com/latest_image"
          metadata, response_code = download_update_metadata(update_metadata_url)
          if response_code == 200:
              return bool(MY_VERSION < metadata['latest_version'])
          else:
              # fallback file hash check, we don't want a bug to brick our devices
              filename, hash = download_latest_image(update_image_url)
              return bool(MY_HASH != hash)
  25. @silvexis THE (real) FIX
      def check_for_update() -> bool:
          # return true if update available, also, now calling this every hour
          logger.info("Checking for updates")
          # the hostname typo ("mycmpany") made every metadata fetch fail,
          # sending every device down the full-image fallback path every hour
          update_metadata_url = "https://update.cdn.mycompany.com/update_metadata"
          update_image_url = "https://update.cdn.mycompany.com/latest_image"
          metadata, response_code = download_update_metadata(update_metadata_url)
          if response_code == 200:
              return bool(MY_VERSION < metadata['latest_version'])
          else:
              # fallback file hash check, we don't want a bug to brick our devices
              filename, hash = download_latest_image(update_image_url)
              return bool(MY_HASH != hash)
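      Beyond the typo, the expensive behavior was the fallback itself: every failed metadata fetch triggered a full image download, fleet-wide, every hour. A defensive variant (my sketch, reusing the slide's helpers and constants; the thresholds are illustrative):

      import random

      FAILED_CHECKS = 0  # persisted across the hourly runs

      def check_for_update_defensively() -> bool:
          # On metadata failure, do NOT immediately download the full image:
          # only run the expensive hash fallback after ~a day of consecutive
          # failures, and only for a random ~10% of the fleet each hour.
          global FAILED_CHECKS
          metadata, response_code = download_update_metadata(
              "https://update.cdn.mycompany.com/update_metadata")
          if response_code == 200:
              FAILED_CHECKS = 0
              return bool(MY_VERSION < metadata['latest_version'])
          FAILED_CHECKS += 1
          if FAILED_CHECKS >= 24 and random.random() < 0.1:
              filename, image_hash = download_latest_image(
                  "https://update.cdn.mycompany.com/latest_image")
              return bool(MY_HASH != image_hash)
          return False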
  26. @silvexis What did we learn?
      • Storage is still cheap, but calling APIs costs money
      • I've got infinite scale, but apparently not infinite wallet
      • CDNs are very good at eating traffic…and money
      • Engineers now need to add "agonizing over the cost of their code" to their list of things to do?
  27. @silvexis "We should forget about small efficiencies, say about 97%
      of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." – Donald Knuth
      Source: "Structured Programming with go to Statements", Knuth, 1974
  28. @silvexis "Premature optimization is the root of all evil."
      • All of these examples are only problems at scale
      • Cloud software engineers should think about these questions, but iteratively and over time, not all at once:
        ◦ Can it be done?
        ◦ Is this the best way to do it, as a team?
        ◦ What happens if this thing becomes popular?
        ◦ How much money should it cost to run?
      • The metric for answering the money question, however, is not raw dollars; it's tracking your desired Cloud Efficiency Rate
  29. @silvexis Cloud Efficiency Rate (CER): How and when to optimize
      CER = (Revenue – Cloud Costs) / Revenue
      https://www.cloudzero.com/blog/cloud-efficiency-rate
      Example (computed in the sketch below):
      • Company A's annual revenue is $100M and annual cloud costs are $20M.
      • Company A's cost per dollar of revenue is $0.20, for a Cloud Efficiency Rate of 80%.
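      The definition above as a trivial sketch:

      def cloud_efficiency_rate(revenue: float, cloud_costs: float) -> float:
          # CER = (Revenue - Cloud Costs) / Revenue, expressed as a percentage
          return (revenue - cloud_costs) / revenue * 100

      print(cloud_efficiency_rate(100_000_000, 20_000_000))  # Company A: 80.0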
  30. @silvexis Cloud Efficiency Rate in practice
      Cloud Efficiency Rate (CER) should become a non-functional requirement for any cloud project, with defined stages to aid prioritization (a guardrail sketch follows this list):
      • Research and development: a negative CER is acceptable!
      • Version 1 / MVP: break even or low CER (0% to 25%)
      • Product-market fit (PMF): acceptable margins are conceivable (25% to 50%)
      • Scaling: demonstrable quarter-over-quarter path to healthy margins (50% to 80%)
      • Steady state: healthy margins = healthy business (CER is 80%+)
      As a rule of thumb, you should target a CER of 90%.
      https://www.cloudzero.com/blog/cloud-efficiency-rate
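      A sketch of CER as a checkable non-functional requirement (the stage floors come from the list above; the function and stage names are my illustration):

      CER_FLOORS = {
          "r&d": float("-inf"),      # negative CER is acceptable
          "mvp": 0.0,
          "pmf": 25.0,
          "scaling": 50.0,
          "steady-state": 80.0,
      }

      def meets_cer_target(stage: str, revenue: float, cloud_costs: float) -> bool:
          cer = (revenue - cloud_costs) / revenue * 100
          return cer >= CER_FLOORS[stage]

      print(meets_cer_target("scaling", 10_000_000, 6_000_000))  # 40% CER -> False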
  31. @silvexis Final thoughts
      "I call it my billion-dollar mistake. It was the invention of the null reference in 1965." "This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years." – Sir Tony Hoare
      The Billion Dollar Mistake - QCon London
      https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/
  32. @silvexis Remember to vote and share feedback on the QCon
      App. Any questions? Let's continue the conversation!
      • @silvexis
      • [email protected]