Million Dollar Lines of Code: An Engineering Perspective on Cloud Cost Optimization

A single line of code can shape an organization's financial future. Drawing inspiration from five real examples of million-dollar lines of code, we will challenge conventional views on engineering's pivotal role in cloud cost optimization.

Prepare for a plot twist, however: premature cost optimization can become a distraction and inadvertently hinder innovation. While unchecked cloud costs can cause another kind of distraction (one where you run out of money), pursuing cost reduction incorrectly or at the wrong time can be fatal to market growth and finding that elusive product-market fit.

Done right, success lies in cultivating a deeper understanding of the economic implications of our efforts as engineers, allowing us to find the balance between user delight, innovation, and cost-efficient code. By venturing into the heart of these million-dollar lines of code and the forces that created them, we uncover the right timing and approach for engineering cost optimization and how to use cost efficiency metrics as powerful constraints that drive innovation, accelerate growth, and engineer profit.

https://qconsf.com/presentation/oct2023/million-dollar-lines-code-engineering-perspective-cloud-cost-optimization

https://www.infoq.com/news/2023/10/engineering-optimize-cost/?p13nId=65ee8c45-2632-483b-b3c3-85ad4829b4b1&p13nType=content

Erik Peterson

October 04, 2023

Transcript

  1. Million Dollar Lines of Code
    An engineering perspective on Cloud Cost Optimization
    Erik Peterson, FOUNDER & CTO
    @silvexis [email protected] www.erikpeterson.com

  2. I’m Erik, the CTO and founder of CloudZero. I’ve been building on AWS
    since 2006. I’m a serverless believer and startup addict. I founded
    CloudZero in 2016 to empower engineers to build profitable cloud software.

  3. EVERY ENGINEERING DECISION IS A BUYING DECISION

  4. WORKED FINE FOR DEV
    FINANCE'S PROBLEM NOW

  5. “People sometimes forget that 90-plus
    percent of global IT spend is still on
    premise, and if you believe that
    equation is going to flip — which we
    do — it’s going to move to the cloud.”
    – Andy Jassy, CEO, Amazon
    The $4.6 trillion* potential shift
    Sources:
    *Gartner 2022 Global IT spend, https://www.gartner.com/en/newsroom/press-releases/2022-10-19-gartner-forecasts-worldwide-it-spending-to-grow-5-percent-in-2023
    Financial Times, Amazon says cloud growth slowed as customers cut costs, April 27 2023, https://on.ft.com/3Vfk0A7

  6. Some people are pretty convinced that won't happen, however
    FYI, he’s horribly wrong…

  7. I've checked the numbers: it does make economic sense, but only if we build differently

  8. The Engineer's Role in Software Profitability
    • Cost-efficiency often reflects system quality.
    • One line of code can dictate a company's profit margin.
    • Challenge: Which metric best measures cost-efficiency?

  9. Million Dollar Lines of Code

  10. Disclaimer
    ● The following are all real examples that did, or likely would, result in over a million dollars in cloud costs if not caught in time.
    ● The code has been refactored and/or rewritten in Python for consistency, to focus on the issue, and to protect the innocent.

  11. Death by
    Even DevOps costs money
    The situation:
    ● An AWS Lambda function with an average monthly cost of $628.00
    ● AWS CloudWatch costs averaging $31,000.00 per month
    ● Total cost of this single function since deployment 3 years ago: $1.1 million
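
The total on this slide can be checked with a few lines of arithmetic. This is only a sketch that plugs in the slide's own averages; it assumes "3 years" means 36 equal months, which the slide does not state.

```python
# Figures taken from the slide; the 36-month span is an assumption.
LAMBDA_MONTHLY = 628.00          # average monthly Lambda cost
CLOUDWATCH_MONTHLY = 31_000.00   # average monthly CloudWatch cost
MONTHS_DEPLOYED = 3 * 12         # "deployed 3 years ago"

monthly_total = LAMBDA_MONTHLY + CLOUDWATCH_MONTHLY
lifetime_total = monthly_total * MONTHS_DEPLOYED

# Share of the bill that comes from logging rather than compute
cloudwatch_share = CLOUDWATCH_MONTHLY / monthly_total

print(f"Lifetime cost: ${lifetime_total:,.0f}")
print(f"CloudWatch share of monthly cost: {cloudwatch_share:.0%}")
```

Under these assumptions the function's compute is a rounding error next to the cost of the logs it produces.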

  12. from aws_lambda_powertools import Logger

    logger = Logger()

    def something_important(really_big_list_of_big_files):
        # This is a really important function that does a lot of stuff
        results = []
        for file in really_big_list_of_big_files:
            with open(file) as f:
                for line in f:
                    result = do_important_something_with_line(line)
                    # Debug statement in the innermost loop: called once per line of every file
                    logger.debug("Processed line", extra={"line": line,
                                                          "result": result})
                    results.append(result)
        logger.info("Finished processing files")
        return results

  14. THE FIX
    from aws_lambda_powertools import Logger

    logger = Logger()

    def something_important(really_big_list_of_big_files):
        # This is a really important function that does important stuff
        results = []
        for file in really_big_list_of_big_files:
            with open(file) as f:
                for line in f:
                    result = do_important_something_with_line(line)
                    results.append(result)
        logger.info("Finished processing files")
        return results
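
The fix on the slide deletes the debug statement. A different option, sketched here with Python's standard `logging` module rather than Powertools, is to keep the statement and rely on the configured log level to drop it. `CountingHandler` and `process_lines` are hypothetical names for this illustration, and the handler only counts records instead of shipping them anywhere.

```python
import logging


class CountingHandler(logging.Handler):
    """Counts records instead of shipping them (a stand-in for a log sink)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def emit(self, record):
        self.count += 1


def process_lines(lines, level):
    """Run a loop with a per-line debug statement; return how many records were emitted."""
    logger = logging.getLogger(f"sketch.{logging.getLevelName(level)}")
    logger.setLevel(level)
    logger.propagate = False  # keep the sketch's records away from the root logger
    handler = CountingHandler()
    logger.addHandler(handler)
    try:
        for line in lines:
            # Dropped before reaching any handler unless the level is DEBUG
            logger.debug("Processed line %s", line)
        logger.info("Finished processing files")
    finally:
        logger.removeHandler(handler)
    return handler.count


lines = [f"line {i}" for i in range(1000)]
debug_records = process_lines(lines, logging.DEBUG)  # one record per line, plus the summary
info_records = process_lines(lines, logging.INFO)    # only the summary record
print(debug_records, info_records)
```

The trade-off is that the cost returns the moment someone raises verbosity in production, which is presumably why removing the statement is the safer fix at this volume.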

  15. The API Costs Money?
    The situation:
    ● An MVP/prototype that found its way into production.
    ● Now, years later, the product is making billions of API requests to S3.
    ● Total cost of this code over 1 year: $1.3 million

  16. import boto3

    def reticulate_splines(splines_to_reticulate: list, solar_flare_count: int):
        logger.info("Reticulating splines",
                    extra={"spline_count": len(splines_to_reticulate),
                           "solar_flare_count": solar_flare_count})
        client = boto3.client("s3")
        reticulated_splines = []
        for spline in splines_to_reticulate:
            # The same configuration object is fetched on every iteration
            configuration = client.get_object(Bucket="my-bucket", Key="my-config-file.json")
            result = reticulate(spline, configuration)
            # The same prefix is listed again on every iteration
            paginator = client.get_paginator('list_objects_v2')
            flare_reduction_maps = paginator.paginate(Bucket="my-bucket",
                                                      Prefix="solar/flare-reduction-maps")
            reticulated_splines.append(adjust_for_solar_flares(result,
                                                               solar_flare_count,
                                                               flare_reduction_maps))
        return reticulated_splines

  19. THE FIX
    import boto3

    def reticulate_splines(splines_to_reticulate: list, solar_flare_count: int):
        logger.info("Reticulating splines",
                    extra={"spline_count": len(splines_to_reticulate),
                           "solar_flare_count": solar_flare_count})
        client = boto3.client("s3")
        # Load the loop-invariant data once, before the loop
        configuration = client.get_object(Bucket="my-bucket", Key="my-config-file.json")
        paginator = client.get_paginator('list_objects_v2')
        flare_reduction_data = get_flare_reduction_data(Bucket="my-bucket",
                                                        Prefix="solar/flare-reduction-maps")
        reticulated_splines = []
        for spline in splines_to_reticulate:
            result = reticulate(spline, configuration)
            # + re-factor to use preloaded data here as well
            reticulated_splines.append(adjust_for_solar_flares(result,
                                                               solar_flare_count,
                                                               flare_reduction_data))
        return reticulated_splines
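
The effect of hoisting the two S3 calls out of the loop can be shown without touching AWS. The sketch below uses a hypothetical `CountingClient` that only counts requests; it is not boto3, and its `list_objects` method is a simplified stand-in for the paginated listing on the slide.

```python
class CountingClient:
    """Hypothetical stand-in for an S3 client that only counts requests."""

    def __init__(self):
        self.requests = 0

    def get_object(self, Bucket, Key):
        self.requests += 1
        return {"Bucket": Bucket, "Key": Key}

    def list_objects(self, Bucket, Prefix):
        self.requests += 1
        return [f"{Prefix}/map-{i}" for i in range(3)]


def per_item_loading(client, items):
    """Fetch the same configuration and listing once per item (the costly shape)."""
    results = []
    for item in items:
        config = client.get_object(Bucket="my-bucket", Key="my-config-file.json")
        maps = client.list_objects(Bucket="my-bucket", Prefix="solar/flare-reduction-maps")
        results.append((item, config["Key"], len(maps)))
    return results


def preloaded(client, items):
    """Fetch loop-invariant data once, then reuse it for every item."""
    config = client.get_object(Bucket="my-bucket", Key="my-config-file.json")
    maps = client.list_objects(Bucket="my-bucket", Prefix="solar/flare-reduction-maps")
    return [(item, config["Key"], len(maps)) for item in items]


naive_client, hoisted_client = CountingClient(), CountingClient()
items = list(range(1000))
naive_result = per_item_loading(naive_client, items)
hoisted_result = preloaded(hoisted_client, items)
print(naive_client.requests, hoisted_client.requests)
```

Both versions return the same results; only the request count changes, from two requests per item to two requests in total.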

  20. Know thy limits
    Or how to 2x your DynamoDB write costs with only a few bytes
    The situation:
    ● A developer was asked to add a datetime stamp to every write to DynamoDB
    ● The code change took seconds to write, but doubled their DynamoDB costs
    ● The total cost to run?
      ○ OK, I'm not 100% sure :-)
      ○ I learned of this from @ksshams and loved it so much I had to include it
      ○ Get the full story here: https://youtu.be/Niy1TZe4dVg

  22. from datetime import datetime, timezone

    import boto3

    def write_data_to_dynamodb(k_data: dict):
        # write 1000 byte record to DynamoDB and add a timestamp
        k_data['timestamp'] = datetime.now().astimezone(tz=timezone.utc).isoformat()
        dynamodb = boto3.resource('dynamodb')
        table = dynamodb.Table('my-table')
        table.put_item(Item={'data': k_data})

    1000 (record) + 32 (timestamp value) + 9 (attribute name) = 1041 bytes
    Over the 1024-byte limit per WCU by 17 bytes
    Write costs are now 2x

  23. THE FIX
    from datetime import datetime

    import boto3

    def write_data_to_dynamodb(k_data: dict):
        # write 1000 byte record to DynamoDB and add a timestamp
        k_data['ts'] = datetime.utcnow().isoformat(timespec='seconds') + 'Z'
        dynamodb = boto3.resource('dynamodb')
        table = dynamodb.Table('my-table')
        table.put_item(Item={'data': k_data})

    1000 (record) + 20 (timestamp value) + 2 (attribute name) = 1022 bytes
    Under the 1024-byte limit per WCU
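
The byte counts on these two slides can be reproduced directly. This is a minimal sketch that takes the slide's 1000-byte record as given and counts only the attribute name and the timestamp string. `write_units` is a hypothetical helper, and the fixed `moment` is an arbitrary example instant with non-zero microseconds.

```python
import math
from datetime import datetime, timezone

WCU_BYTES = 1024  # one write capacity unit covers an item of up to 1 KB


def write_units(item_bytes: int) -> int:
    """Write units consumed by a single write of the given size."""
    return max(1, math.ceil(item_bytes / WCU_BYTES))


# Arbitrary example instant; microseconds are non-zero so isoformat() includes them
moment = datetime(2023, 10, 4, 12, 34, 56, 123456, tzinfo=timezone.utc)

long_value = moment.isoformat()
short_value = moment.replace(tzinfo=None).isoformat(timespec="seconds") + "Z"

RECORD_BYTES = 1000  # record size assumed by the slide
before = RECORD_BYTES + len("timestamp") + len(long_value)
after = RECORD_BYTES + len("ts") + len(short_value)

print(before, write_units(before))
print(after, write_units(after))
```

Because write units are rounded up to the next whole kilobyte, the 17 bytes over the boundary cost as much as a full second kilobyte would.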

  24. Leaking Infra-as-code
    Terraform Edition
    The situation:
    ● A Terraform template created ASGs that scaled up clusters of hundreds of EC2 instances
    ● The system refreshed each instance every 24 hours but, for safety, EBS volumes were not automatically deleted
    ● Average cost of this system for 1 year: $1.1 million

  25. # AWS Launch Template
    resource "aws_launch_template" "my_lt" {
      name                   = "my-launch-template"
      image_id               = data.aws_ami.amzlinux2.id
      instance_type          = "c7gd.xlarge"
      vpc_security_group_ids = [module.private_sg.security_group_id]
      key_name               = "my-ssh-key"
      user_data              = filebase64("${path.module}/install.sh")
      ebs_optimized          = true
      update_default_version = true
      block_device_mappings {
        device_name = "/dev/sda1"
        ebs {
          volume_size           = 100
          delete_on_termination = false # volumes outlive their instances
          volume_type           = "gp3"
        }
      }
    }

    # AWS Autoscaling group
    resource "aws_autoscaling_group" "leaky_asg" {
      name_prefix           = "leaky-"
      desired_capacity      = 10
      max_size              = 1000
      min_size              = 10
      max_instance_lifetime = 86400 # replace every instance after 24 hours
      vpc_zone_identifier   = module.vpc.private_subnets
      target_group_arns     = module.alb.target_group_arns
      health_check_type     = "EC2"
      launch_template {
        id      = aws_launch_template.my_lt.id
        version = aws_launch_template.my_lt.latest_version
      }
      instance_refresh {
        strategy = "Rolling"
        preferences { instance_warmup = 300 }
        triggers = ["launch_template", "desired_capacity"]
      }
    }

  27. THE FIX
    Beware well-intentioned infrastructure as code
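
To see why the leftover volumes add up, the following rough model estimates storage spend when every instance is replaced daily and each replacement leaves one 100 GB volume behind. The price of $0.08 per GB-month, the 30-day month, and the helper name `orphaned_volume_cost` are illustrative assumptions, not figures from the talk.

```python
def orphaned_volume_cost(instances, days, volume_gb=100,
                         price_per_gb_month=0.08, days_per_month=30):
    """Estimate spend on volumes left behind by daily instance replacement.

    Assumes every instance is replaced once per day and each replacement
    leaves one detached volume that is never deleted.
    """
    # A volume orphaned at the end of day d is billed for the remaining (days - d) days
    volume_days = instances * sum(days - d for d in range(1, days + 1))
    return volume_days * volume_gb * price_per_gb_month / days_per_month


one_month = orphaned_volume_cost(instances=10, days=30)
two_months = orphaned_volume_cost(instances=10, days=60)
print(f"After 30 days: ${one_month:,.2f}")
print(f"After 60 days: ${two_months:,.2f}")
```

In this model the orphaned volumes pile up linearly, so the cumulative bill grows roughly with the square of elapsed time: doubling the period about quadruples the spend.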

  28. Cost Delivery Network
    It's as if 2.3 million devices all cried out at once…asking for money
    A software bug caused devices to “phone home” repeatedly, causing a massive jump in network costs
    ● After ~14 hours, costs reached a steady state of about $4.5k an hour
    ● Issue was resolved after ~6 days
    ● Total cost of incident: ~$648,000.00
    ● Potential annual impact: $39,420,000.00
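
Both totals on this slide follow from the steady-state hourly rate. The sketch below reproduces them, assuming the full rate applies to all six days and to a 365-day year; the slide's own figures appear to use the same simplification.

```python
HOURLY_COST = 4_500   # steady-state cost per hour, from the slide
INCIDENT_DAYS = 6     # approximate incident duration, from the slide

incident_cost = HOURLY_COST * 24 * INCIDENT_DAYS
annual_cost = HOURLY_COST * 24 * 365

print(f"Incident: ${incident_cost:,}")
print(f"Annualized: ${annual_cost:,}")
```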

  29. def check_for_update() -> bool:
        # return true if update available, also, now calling this every hour
        logger.info("Checking for updates")
        update_metadata_url = "https://update.cdn.mycmpany.com/update_metadata"
        update_image_url = "https://update.cdn.mycompany.com/latest_image"
        metadata, response_code = download_update_metadata(update_metadata_url)
        if response_code == 200:
            return bool(MY_VERSION < metadata['latest_version'])
        else:
            # fallback file hash check, we don't want a bug to brick our devices
            filename, hash = download_latest_image(update_image_url)
            return bool(MY_HASH != hash)

  31. THE (real) FIX
    def check_for_update() -> bool:
        # return true if update available, also, now calling this every hour
        logger.info("Checking for updates")
        update_metadata_url = "https://update.cdn.mycmpany.com/update_metadata"
        update_image_url = "https://update.cdn.mycompany.com/latest_image"
        metadata, response_code = download_update_metadata(update_metadata_url)
        if response_code == 200:
            return bool(MY_VERSION < metadata['latest_version'])
        else:
            # fallback file hash check, we don't want a bug to brick our devices
            filename, hash = download_latest_image(update_image_url)
            return bool(MY_HASH != hash)

  32. What did we learn?
    ● Storage is still cheap, but calling APIs costs money
    ● I've got infinite scale, but apparently not an infinite wallet
    ● CDNs are very good at eating traffic…and money
    ● Engineers now need to add "agonizing over the cost of their code" to their list of things to do?

  33. “We should forget about small efficiencies, say about 97% of the time:
    premature optimization is the root of all evil. Yet we should not pass up
    our opportunities in that critical 3%.”
    – Donald Knuth
    Sources:
    "Structured Programming with go to Statements", Knuth, 1974

  34. "Premature optimization is the root of all evil."
    ● All of these examples are only problems at scale
    ● Cloud software engineers should think about these questions iteratively and over time, not all at once:
      ○ Can it be done?
      ○ Is this the best way to do it, as a team?
      ○ What happens if this thing becomes popular?
      ○ How much money should it cost to run?
    ● The metric for answering the money question, however, is not $$$; it is your desired Cloud Efficiency Rate

  35. Cloud Efficiency Rate (CER)
    How and when to optimize
    CER = (Revenue – Cloud Costs) / Revenue
    https://www.cloudzero.com/blog/cloud-efficiency-rate
    Example
    ● Company A’s annual revenue is $100M and annual cloud costs are $20M.
    ● Company A’s cost per dollar of revenue is $0.20, with a Cloud Efficiency Rate of 80%
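
The Company A example works out as follows. The function name `cloud_efficiency_rate` and the guard against non-positive revenue are choices made for this sketch, not part of the slide.

```python
def cloud_efficiency_rate(revenue: float, cloud_costs: float) -> float:
    """Return CER as a fraction: (revenue - cloud costs) / revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive to compute a CER")
    return (revenue - cloud_costs) / revenue


# Company A from the slide: $100M annual revenue, $20M annual cloud costs
revenue, cloud_costs = 100_000_000, 20_000_000
cer = cloud_efficiency_rate(revenue, cloud_costs)
cost_per_revenue_dollar = cloud_costs / revenue

print(f"CER: {cer:.0%}")
print(f"Cloud cost per dollar of revenue: ${cost_per_revenue_dollar:.2f}")
```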

  36. Cloud Efficiency Rate in practice
    Cloud Efficiency Rate (CER) should become a non-functional requirement for any cloud project, with defined stages to aid prioritization:
    ● Research and Development: A negative CER is acceptable!
    ● Version 1 / MVP: Break even or low CER (0% to 25%)
    ● Product market fit (PMF): Acceptable margins are conceivable (25% to 50%)
    ● Scaling: Demonstrable quarter-over-quarter path to healthy margins (50% to 80%)
    ● Steady state: Healthy margins = healthy business (CER of 80% or more)
    As a rule of thumb, you should target a CER of 90%
    https://www.cloudzero.com/blog/cloud-efficiency-rate
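
One way to treat CER as a non-functional requirement is to encode the lower bound of each stage and check a project against it. The sketch below does that. The table copies the lower end of each range from the slide, while the names `STAGE_MINIMUM_CER` and `meets_stage_target` are hypothetical.

```python
# Lower bound of the CER range for each stage, taken from the slide
STAGE_MINIMUM_CER = {
    "Research and Development": None,  # a negative CER is acceptable
    "Version 1 / MVP": 0.00,
    "Product market fit": 0.25,
    "Scaling": 0.50,
    "Steady state": 0.80,
}


def meets_stage_target(stage: str, revenue: float, cloud_costs: float) -> bool:
    """Check whether a project's CER reaches the minimum for its stage."""
    minimum = STAGE_MINIMUM_CER[stage]
    if minimum is None:
        return True  # no floor during R&D, even with zero revenue
    cer = (revenue - cloud_costs) / revenue
    return cer >= minimum


# A product with a CER of 60% is on target while scaling, but not at steady state
print(meets_stage_target("Scaling", revenue=100, cloud_costs=40))
print(meets_stage_target("Steady state", revenue=100, cloud_costs=40))
```

A check like this could run alongside other release gates, so a cost regression is flagged at the stage where it starts to matter and not earlier.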

  37. Final thoughts
    "I call it my billion-dollar mistake. It was the invention of the null reference in 1965."
    "This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years."
    – Sir Tony Hoare
    The Billion Dollar Mistake - QCon London
    https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/

  38. EVERY ENGINEERING DECISION IS A BUYING DECISION

  39. Any questions?
    Let’s continue the conversation!
    ● @silvexis
    ● [email protected]
    Remember to vote and share feedback on the QCon App.