Slide 1

Slide 1 text

@silvexis 1 Million Dollar Lines of Code: An engineering perspective on Cloud Cost Optimization
Erik Peterson, Founder & CTO
@silvexis [email protected] www.erikpeterson.com

Slide 2

Slide 2 text

@silvexis 2 I’m Erik, I’m the CTO and Founder of CloudZero. I’ve been building on AWS since 2006. I’m a Serverless believer and startup addict. I founded CloudZero in 2016 to empower engineers to build profitable cloud software. @silvexis [email protected] www.erikpeterson.com

Slide 3

Slide 3 text

@silvexis 3 EVERY ENGINEERING DECISION IS A BUYING DECISION

Slide 4

Slide 4 text

@silvexis 4

Slide 5

Slide 5 text

@silvexis 5 WORKED FINE FOR DEV. FINANCE'S PROBLEM NOW.

Slide 6

Slide 6 text

@silvexis 6 “People sometimes forget that 90-plus percent of global IT spend is still on premise, and if you believe that equation is going to flip — which we do — it’s going to move to the cloud.” – Andy Jassy, CEO, Amazon
The $4.6 trillion* potential shift
Sources: *Gartner 2022 Global IT spend, https://www.gartner.com/en/newsroom/press-releases/2022-10-19-gartner-forecasts-worldwide-it-spending-to-grow-5-percent-in-2023; Financial Times, “Amazon says cloud growth slowed as customers cut costs”, April 27 2023, https://on.ft.com/3Vfk0A7

Slide 7

Slide 7 text

@silvexis 7 Some people, however, are pretty convinced that won’t happen. FYI, he’s horribly wrong…

Slide 8

Slide 8 text

8 I’ve checked the numbers: it does make economic sense, but only if we build differently.

Slide 9

Slide 9 text

@silvexis 9 The Engineer's Role in Software Profitability
• Cost-efficiency often reflects system quality.
• One line of code can dictate a company's profit margin.
• Challenge: Which metric best measures cost-efficiency?

Slide 10

Slide 10 text

10 Million Dollar Lines of Code

Slide 11

Slide 11 text

@silvexis 11 Disclaimer
● The following are all real examples that did, or would likely, result in over a million dollars in cloud costs if not caught in time.
● The code has been refactored and/or rewritten in Python for consistency, to focus on the issue, and to protect the innocent.

Slide 12

Slide 12 text

@silvexis 12 Death by Debug: Even DevOps costs money
The situation:
● AWS Lambda function with an average monthly cost of $628.00
● AWS CloudWatch costs with an average monthly cost of $31,000.00
● Total cost of this single function since deployment 3 years ago: $1.1 million

Slide 13

Slide 13 text

@silvexis 13
from aws_lambda_powertools import Logger

logger = Logger()

def something_important(really_big_list_of_big_files):
    # This is a really important function that does a lot of stuff
    results = []
    for file in really_big_list_of_big_files:
        with open(file) as f:
            for line in f:
                result = do_important_something_with_line(line)
                logger.debug("Processed line", extra={"line": line, "result": result})
                results.append(result)
    logger.info("Finished processing files")
    return results

Slide 15

Slide 15 text

@silvexis 15 THE FIX
from aws_lambda_powertools import Logger

logger = Logger()

def something_important(really_big_list_of_big_files):
    # This is a really important function that does important stuff
    results = []
    for file in really_big_list_of_big_files:
        with open(file) as f:
            for line in f:
                result = do_important_something_with_line(line)
                # The per-line logger.debug() call is gone: at millions of lines,
                # its CloudWatch ingestion charges were the $31,000.00 a month.
                results.append(result)
    logger.info("Finished processing files")
    return results
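An alternative worth noting (a sketch, not from the talk): rather than deleting debug lines outright, Lambda Powertools can keep them switched off in production via its log level and sampling controls. The parameter names here (service, level, sampling_rate, and the POWERTOOLS_LOG_LEVEL / POWERTOOLS_LOGGER_SAMPLE_RATE environment variables) are assumptions based on the library's documented behavior.

from aws_lambda_powertools import Logger

# Run at INFO in production so logger.debug() emits nothing; sampling_rate
# promotes a small fraction of invocations to DEBUG, so some diagnostic
# data survives without the full CloudWatch bill. Both knobs can also be
# set per environment via the POWERTOOLS_* environment variables.
logger = Logger(service="file-processor", level="INFO", sampling_rate=0.01)

def something_important(really_big_list_of_big_files):
    results = []
    for file in really_big_list_of_big_files:
        with open(file) as f:
            for line in f:
                result = do_important_something_with_line(line)
                logger.debug("Processed line", extra={"result": result})  # usually a no-op now
                results.append(result)
    logger.info("Finished processing files")
    return results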

Slide 16

Slide 16 text

@silvexis 16 The API Costs Money?
The situation:
● An MVP/prototype that found its way into production.
● Now, years later, the product is making billions of API requests to S3.
● Total cost of this code over 1 year: $1.3 million

Slide 17

Slide 17 text

@silvexis 17
def reticulate_splines(splines_to_reticulate: list, solar_flare_count: int):
    logger.info("Reticulating splines", extra={"spline_count": len(splines_to_reticulate), "solar_flare_count": solar_flare_count})
    client = boto3.client("s3")
    reticulated_splines = []
    for spline in splines_to_reticulate:
        # The same config object is fetched from S3 on every iteration...
        configuration = client.get_object(Bucket="my-bucket", Key="my-config-file.json")
        result = reticulate(spline, configuration)
        # ...and the same prefix is re-listed on every iteration too.
        paginator = client.get_paginator('list_objects_v2')
        flare_reduction_maps = paginator.paginate(Bucket="my-bucket", Prefix="solar/flare-reduction-maps")
        reticulated_splines.append(adjust_for_solar_flares(result, solar_flare_count, flare_reduction_maps))
    return reticulated_splines

Slide 20

Slide 20 text

@silvexis 20 THE FIX
def reticulate_splines(splines_to_reticulate: list, solar_flare_count: int):
    logger.info("Reticulating splines", extra={"spline_count": len(splines_to_reticulate), "solar_flare_count": solar_flare_count})
    client = boto3.client("s3")
    # Loop-invariant S3 work is hoisted out of the loop: one GetObject for the
    # config and one listing of the flare-reduction maps, instead of one of each per spline.
    configuration = client.get_object(Bucket="my-bucket", Key="my-config-file.json")
    flare_reduction_data = get_flare_reduction_data(Bucket="my-bucket", Prefix="solar/flare-reduction-maps")
    reticulated_splines = []
    for spline in splines_to_reticulate:
        result = reticulate(spline, configuration)  # + refactor to use preloaded data here as well
        reticulated_splines.append(adjust_for_solar_flares(result, solar_flare_count, flare_reduction_data))
    return reticulated_splines
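The fix calls get_flare_reduction_data but the slide never defines it. A minimal sketch of what such a helper might do (hypothetical; the slides only show its name and keyword arguments), materializing the list_objects_v2 pagination exactly once:

import boto3

def get_flare_reduction_data(Bucket: str, Prefix: str) -> list:
    # Run the paginator a single time, up front, and hand back plain data;
    # callers then reuse the result instead of re-listing S3 per iteration.
    # (Capitalized parameter names mirror the call site on the slide.)
    client = boto3.client("s3")
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=Bucket, Prefix=Prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys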

Slide 21

Slide 21 text

@silvexis 21 Know thy limits
Or: how to 2x your DynamoDB write costs with only a few bytes
The situation:
● A developer was asked to add a datetime stamp to every write to DynamoDB
● The code change took seconds to write, but doubled their DynamoDB costs
● The total cost to run?
○ OK, I'm not 100% sure :-)
○ I learned of this from @ksshams and loved it so much I had to include it
○ Get the full story here: https://youtu.be/Niy1TZe4dVg

Slide 22

Slide 22 text

@silvexis 22
import boto3
from datetime import datetime, timezone

def write_data_to_dynamodb(k_data: dict):
    # write a 1000-byte record to DynamoDB and add a timestamp
    k_data['timestamp'] = datetime.now().astimezone(tz=timezone.utc).isoformat()
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('my-table')
    table.put_item(Item={'data': k_data})

Slide 23

Slide 23 text

@silvexis 23
import boto3
from datetime import datetime, timezone

def write_data_to_dynamodb(k_data: dict):
    # write a 1000-byte record to DynamoDB and add a timestamp
    k_data['timestamp'] = datetime.now().astimezone(tz=timezone.utc).isoformat()
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('my-table')
    table.put_item(Item={'data': k_data})

1000 + 32 + 9 = 1041 bytes (record + 32-byte ISO-8601 timestamp value + the 9-byte attribute name "timestamp").
Over the 1024-byte-per-WCU limit by 17 bytes. Write costs are now 2x.

Slide 24

Slide 24 text

@silvexis 24 THE FIX
import boto3
from datetime import datetime

def write_data_to_dynamodb(k_data: dict):
    # write a 1000-byte record to DynamoDB and add a timestamp
    k_data['ts'] = datetime.utcnow().isoformat(timespec='seconds') + 'Z'
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('my-table')
    table.put_item(Item={'data': k_data})

1000 + 20 + 2 = 1022 bytes (record + 20-byte seconds-precision timestamp value + the 2-byte attribute name "ts").
Under the 1024-byte-per-WCU limit.
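Since the whole incident hinges on item size crossing a 1 KB boundary, a cheap guardrail is to estimate size before writing. A rough sketch (my own addition, not from the talk; it handles string attributes only, while DynamoDB's real sizing rules also cover numbers, binary, and nested types):

def estimated_item_size_bytes(item: dict) -> int:
    # DynamoDB counts the UTF-8 bytes of every attribute name and value,
    # which is exactly why "timestamp" (9 bytes) vs "ts" (2 bytes) mattered.
    return sum(len(k.encode('utf-8')) + len(str(v).encode('utf-8'))
               for k, v in item.items())

def wcus_per_write(item: dict) -> int:
    # Standard writes consume one WCU per 1024 bytes, rounded up.
    return -(-estimated_item_size_bytes(item) // 1024)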

Slide 25

Slide 25 text

@silvexis 25 Leaking Infra-as-Code: Terraform Edition
The situation:
● A Terraform template created ASGs that scaled up clusters of hundreds of EC2 systems
● The system refreshed each instance every 24 hours, but for safety, EBS volumes were not automatically deleted
● Average cost of this system for 1 year: $1.1 million

Slide 26

Slide 26 text

@silvexis 26
# AWS Launch Template
resource "aws_launch_template" "my_lt" {
  name                   = "my-launch-template"
  image_id               = data.aws_ami.amzlinux2.id
  instance_type          = "c7gd.xlarge"
  vpc_security_group_ids = [module.private_sg.security_group_id]
  key_name               = "my-ssh-key"
  user_data              = filebase64("${path.module}/install.sh")
  ebs_optimized          = true
  update_default_version = true

  block_device_mappings {
    device_name = "/dev/sda1"
    ebs {
      volume_size           = 100
      delete_on_termination = false # every terminated instance leaves its 100 GB volume behind
      volume_type           = "gp3"
    }
  }
}

# AWS Autoscaling group
resource "aws_autoscaling_group" "leaky_asg" {
  name_prefix           = "leaky-"
  desired_capacity      = 10
  max_size              = 1000
  min_size              = 10
  max_instance_lifetime = 86400 # every instance is replaced every 24 hours
  vpc_zone_identifier   = module.vpc.private_subnets
  target_group_arns     = module.alb.target_group_arns
  health_check_type     = "EC2"

  launch_template {
    id      = aws_launch_template.my_lt.id
    version = aws_launch_template.my_lt.latest_version
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      instance_warmup = 300
    }
    triggers = ["launch_template", "desired_capacity"]
  }
}

Slide 28

Slide 28 text

@silvexis 28 THE FIX: Beware well-intentioned infrastructure as code
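The slide doesn't show the code change, but given the setup, the likely remedy is flipping delete_on_termination to true, or adding an explicit cleanup of detached volumes. For the cleanup side, a boto3 sketch of my own (not from the talk) that finds the leak: volumes in the "available" state are attached to nothing, and with daily refreshes plus delete_on_termination = false they pile up at roughly one 100 GB gp3 volume per replaced instance.

import boto3

def find_orphaned_volumes(region: str = "us-east-1") -> list:
    # "available" means the volume is not attached to any instance,
    # i.e. it was left behind when its instance was refreshed away.
    ec2 = boto3.client("ec2", region_name=region)
    orphans = []
    paginator = ec2.get_paginator("describe_volumes")
    for page in paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}]):
        orphans.extend(page["Volumes"])
    return orphans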

Slide 29

Slide 29 text

@silvexis 29 Cost Delivery Network
It's as if 2.3 million devices all cried out at once… asking for money
A software bug caused devices to "phone home" repeatedly, creating a massive jump in network costs
● After ~14 hours, it reached a steady state of about $4.5k an hour
● The issue was resolved after ~6 days
● Total cost of incident: ~$648,000.00
● Potential annual impact: $39,420,000.00

Slide 30

Slide 30 text

@silvexis 30
def check_for_update() -> bool:
    # return True if an update is available; also, now calling this every hour
    logger.info("Checking for updates")
    update_metadata_url = "https://update.cdn.mycmpany.com/update_metadata"
    update_image_url = "https://update.cdn.mycompany.com/latest_image"
    metadata, response_code = download_update_metadata(update_metadata_url)
    if response_code == 200:
        return bool(MY_VERSION < metadata['latest_version'])
    else:
        # fallback file hash check, we don't want a bug to brick our devices
        filename, hash = download_latest_image(update_image_url)
        return bool(MY_HASH != hash)

Slide 32

Slide 32 text

@silvexis 32 THE (real) FIX
def check_for_update() -> bool:
    # return True if an update is available; also, now calling this every hour
    logger.info("Checking for updates")
    # The bug was a one-character typo: the metadata host read "mycmpany",
    # so the metadata fetch never returned 200 and every hourly check fell
    # through to the fallback, downloading the full image just to hash it.
    update_metadata_url = "https://update.cdn.mycompany.com/update_metadata"
    update_image_url = "https://update.cdn.mycompany.com/latest_image"
    metadata, response_code = download_update_metadata(update_metadata_url)
    if response_code == 200:
        return bool(MY_VERSION < metadata['latest_version'])
    else:
        # fallback file hash check, we don't want a bug to brick our devices
        filename, hash = download_latest_image(update_image_url)
        return bool(MY_HASH != hash)
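Even with the URL corrected, the fallback path still downloads an entire firmware image just to compare hashes. A cheaper defensive pattern (a sketch of my own, not the talk's fix; it assumes the CDN returns a usable ETag header, and MY_HASH / update_image_url are reused from the slides):

import requests

MY_HASH = "0123abcd"  # placeholder: hash of the currently installed image

def update_available_cheaply(update_image_url: str) -> bool:
    # Issue a HEAD request and compare the CDN's ETag to our known image
    # hash, instead of pulling gigabytes over the CDN every hour.
    response = requests.head(update_image_url, timeout=10)
    etag = response.headers.get("ETag", "").strip('"')
    return bool(etag and etag != MY_HASH)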

Slide 33

Slide 33 text

@silvexis 33 What did we learn?
● Storage is still cheap, but calling APIs costs money
● I've got infinite scale, but apparently not infinite wallet
● CDNs are very good at eating traffic… and money
● Engineers now need to add "agonizing over the cost of their code" to their list of things to do?

Slide 34

Slide 34 text

@silvexis 34 “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” – Donald Knuth
Source: "Structured Programming with go to Statements", Knuth, 1974

Slide 35

Slide 35 text

@silvexis 35 "Premature optimization is the root of all evil."
● All of these examples are only problems at scale
● Cloud software engineers should think about these questions, but iteratively and over time, not all at once:
○ Can it be done?
○ Is this the best way to do it, as a team?
○ What happens if this thing becomes popular?
○ How much money should it cost to run?
● The metric for answering the money question, however, is not raw dollars: it's tracking your desired Cloud Efficiency Rate

Slide 36

Slide 36 text

@silvexis 36 Cloud Efficiency Rate (CER)
How and when to optimize
CER = (Revenue – Cloud Costs) / Revenue
https://www.cloudzero.com/blog/cloud-efficiency-rate
Example
● Company A's annual revenue is $100M and annual cloud costs are $20M.
● Company A's cost per dollar of revenue is $0.20 (20 cents), a Cloud Efficiency Rate of 80%.
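The same arithmetic as a tiny sketch (the function and variable names are mine):

def cloud_efficiency_rate(revenue: float, cloud_costs: float) -> float:
    # CER = (Revenue - Cloud Costs) / Revenue
    return (revenue - cloud_costs) / revenue

# Company A: $100M revenue, $20M cloud costs
print(f"{cloud_efficiency_rate(100_000_000, 20_000_000):.0%}")  # -> 80%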

Slide 37

Slide 37 text

@silvexis 37 Cloud Efficiency Rate in practice
Cloud Efficiency Rate (CER) should become a non-functional requirement for any cloud project, with defined stages to aid prioritization:
● Research and development: a negative CER is acceptable!
● Version 1 / MVP: break even or low CER (0% to 25%)
● Product-market fit (PMF): acceptable margins are conceivable (25% to 50%)
● Scaling: demonstrable quarter-over-quarter path to healthy margins (50% to 80%)
● Steady state: healthy margins = healthy business (CER is 80%+)
As a rule of thumb, you should target a CER of 90%
https://www.cloudzero.com/blog/cloud-efficiency-rate

Slide 38

Slide 38 text

@silvexis 38 Final thoughts
"I call it my billion-dollar mistake. It was the invention of the null reference in 1965." "This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years." – Sir Tony Hoare
"Null References: The Billion Dollar Mistake", QCon London
https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/

Slide 39

Slide 39 text

@silvexis 39 EVERY ENGINEERING DECISION IS A BUYING DECISION

Slide 40

Slide 40 text

@silvexis 40 Please vote and leave feedback on the QCon app!
Any questions? Let's continue the conversation!
● @silvexis
● [email protected]