
Why Cloud Zombies Are Destroying the Planet and How You Can Stop Them


Wait, zombies? Really? Zombies are servers which aren’t doing useful work. They’re everywhere, costing money, eating electricity, and belching carbon. And they’re useless! So how do we get rid of them? In this talk, Holly will explain how utilization and elasticity relate to sustainability. She will also introduce a range of practical zombie-hunting techniques, including absurdly-simple-automation, LightSwitchOps, and FinOps.

Holly Cummins

March 29, 2023


Transcript

  1. Holly Cummins
    Red Hat
    QCon London | March 29, 2023
    Why Cloud Zombies Are Destroying the Planet and How You Can Stop Them

  3. @holly_cummins #RedHat


  5. @therealmarkw1, twitter


  9. what do these servers do?
    one is a backup for the other.
    yes, but what do they do?
    no one has known for a couple of decades
    @therealmarkw1, twitter

  12. #RedHat
    Hey boss, I created a Kubernetes cluster.
    I forgot it for 2 months.
    … and it’s €1000 a month.
    2018

  15. #RedHat
    Hey boss, while I was working on a QCon talk about sustainability …
    I left the Quarkus CI on Mac disabled
    … and the instance is $159 a month.
    2023

  16. @holly_cummins #RedHat
    “measure, don’t guess”


    (or decide based on stories on the internet)


  17. @holly_cummins #RedHat
    actual picture of a zombie


    (it’s invisible)


  19. #RedHat
    2015 survey
    30% of 4,000 servers doing no useful work

  20. #RedHat
    2017 survey
    25% of 16,000 servers doing no useful work

  21. #RedHat
    zombie
    “they haven't delivered any information or computing services for six months or more”

  22. #RedHat
    “comatose servers”

  23. #RedHat
    under-utilised servers

  24. #RedHat
    “much of the energy consumed by U.S. data centers is used to power more than 12 million servers that do little or no work most of the time”
    NRDC

  25. #RedHat
    the average server:
    runs at 12-18% of capacity
    draws 30-60% of maximum power
    https://www.nrdc.org/sites/default/files/data-center-efficiency-assessment-IB.pdf
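
As a rough worked example of what the figures on this slide imply, the sketch below compares the energy used per unit of work by a lightly loaded server with that of a fully loaded one. It rests on an assumption that is not in the slides: a server at 100% of capacity draws 100% of its maximum power. The function name `relative_energy_per_unit_of_work` is purely illustrative.

```python
def relative_energy_per_unit_of_work(capacity_fraction: float, power_fraction: float) -> float:
    """Energy per unit of work, relative to a fully loaded server.

    Assumption (not from the slides): a server running at 100% of capacity
    draws 100% of its maximum power, so its own ratio is 1.0.
    """
    if not 0 < capacity_fraction <= 1:
        raise ValueError("capacity_fraction must be in (0, 1]")
    if not 0 <= power_fraction <= 1:
        raise ValueError("power_fraction must be in [0, 1]")
    return power_fraction / capacity_fraction


# Mid-points of the ranges quoted on the slide: 15% of capacity, 45% of maximum power.
midpoint_ratio = relative_energy_per_unit_of_work(0.15, 0.45)
print(f"about {midpoint_ratio:.1f}x the energy per unit of work")
```

Under that assumption, the mid-points of the quoted ranges work out to roughly three times the energy per unit of work, which is why raising utilisation matters even when a server is not a complete zombie.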


  26. #RedHat
    2014 survey
    29% of 4,000 servers active less than 5% of the time
    https://www.anthesisgroup.com/wp-content/uploads/2019/11/Comatose-Servers-Redux-2017.pdf
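
The two thresholds quoted so far (no useful work for six months, active less than 5% of the time) can be turned into a simple classifier. This is a minimal sketch rather than how the surveys measured anything: the function name, the labels, and the choice of 183 days for "six months" are assumptions, and deciding what counts as useful work is the hard part that the code leaves to the caller.

```python
from datetime import datetime, timedelta

# Thresholds from the slides; the exact day count for "six months" is an assumption.
ZOMBIE_AGE = timedelta(days=183)
MOSTLY_IDLE_FRACTION = 0.05


def classify_server(last_useful_work: datetime, active_fraction: float, now: datetime) -> str:
    """Return "zombie", "mostly idle" or "in use" for one server.

    last_useful_work: when the server last delivered any useful service
    active_fraction:  share of the observation window in which it was active (0..1)
    """
    if now - last_useful_work >= ZOMBIE_AGE:
        return "zombie"
    if active_fraction < MOSTLY_IDLE_FRACTION:
        return "mostly idle"
    return "in use"


now = datetime(2023, 3, 29)
print(classify_server(datetime(2022, 6, 1), 0.0, now))
print(classify_server(datetime(2023, 3, 1), 0.02, now))
print(classify_server(datetime(2023, 3, 28), 0.60, now))
```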


  29. @holly_cummins #RedHat
    2021 study
    $26.6 billion wasted by always-on cloud instances
    https://www.business2community.com/cloud-computing/overprovisioning-always-on-resources-lead-to-26-6-billion-in-public-cloud-waste-expected-in-2021-02381033

  31. #RedHat
    it’s not just runtime costs
    embodied carbon

  32. #RedHat
    why does this happen?

  33. @holly_cummins #RedHat
    managing machines is hard


  35. #RedHat
    “perhaps someone forgot to turn them off”
    Anthesis Group


  38. #RedHat
    projects ended
    business processes changed
    over-provisioning
    isolation requirements

  39. @holly_cummins #RedHat
    risk-averse processes

  40. @holly_cummins #RedHat
    “we run this as a batch job on weekends, but the servers stay up all week”

  42. @holly_cummins #RedHat
    “we only use this system in UK working hours, but we leave it running 24/7”

  44. @holly_cummins #RedHat
    auto-scaling algorithms are optimised for availability


  50. @holly_cummins #RedHat
    green computing model: the four vowels
    elasticity
    utilisation
    efficiency
    utility

  53. @holly_cummins #RedHat
    application
    utilisation
    high utilisation: good case

  55. @holly_cummins #RedHat
    application
    utilisation
    over-utilisation: very bad case
    under-utilisation: wasteful case

  56. @holly_cummins #RedHat
    application
    elasticity
    high utilisation: good case

  57. @holly_cummins #RedHat
    application
    elasticity
    scale-up: good utilisation

  58. @holly_cummins #RedHat
    application
    elasticity
    scale-down: good utilisation

  59. @holly_cummins #RedHat
    green computing model: the four vowels
    elasticity
    utilisation
    efficiency
    utility


  61. @holly_cummins #RedHat
    why utility matters
    There is nothing so useless as doing efficiently that which should not be done at all.
    Peter Drucker

  62. @holly_cummins #RedHat
    “efficient zombies”


  64. @holly_cummins #RedHat
    how do we solve the zombie problem?
    detection and destruction

  65. @holly_cummins #RedHat
    system archaeology
    … is not easy


  66. @holly_cummins #RedHat
    scream test


  67. @holly_cummins #RedHat
    “eco-monkey”


  72. @holly_cummins #RedHat
    the scream is real
    this internal server doesn’t seem to have a purpose
    let’s turn it off!
    uh … why did the backbone of a client’s network just vanish?
    oops.

  73. @holly_cummins #RedHat
    long meetings
    let’s figure out what all these cloud workloads are, since I’m paying for them
    IT Department, UK Bank

  75. @holly_cummins #RedHat
    long emails


  76. @holly_cummins #RedHat
    tags
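
Tags only help with zombie detection if someone checks that they are actually there. The sketch below scans an inventory for resources missing required tags. The inventory format (a list of dictionaries) and the tag names `owner`, `purpose` and `expires` are hypothetical; a real inventory would come from a cloud provider's API or export.

```python
REQUIRED_TAGS = ("owner", "purpose", "expires")  # hypothetical tag names


def find_untagged(resources):
    """Return {resource id: [missing tag names]} for resources lacking required tags."""
    problems = {}
    for resource in resources:
        tags = resource.get("tags", {})
        # A tag counts as missing if it is absent or empty.
        missing = [name for name in REQUIRED_TAGS if not tags.get(name)]
        if missing:
            problems[resource["id"]] = missing
    return problems


# Hypothetical inventory for illustration only.
inventory = [
    {"id": "vm-001", "tags": {"owner": "alice", "purpose": "ci", "expires": "2023-06-30"}},
    {"id": "vm-002", "tags": {"owner": "bob"}},
    {"id": "vm-003"},
]

for resource_id, missing in find_untagged(inventory).items():
    print(f"{resource_id}: missing {', '.join(missing)}")
```

Anything that shows up in this report has no known owner or purpose, which makes it a candidate for the kind of investigation described in the surrounding slides.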


  77. @holly_cummins #RedHat
    all the -opses

  79. @holly_cummins #RedHat
    GreenOps
    greenops is a mid-sized trilobite (really)

  80. @holly_cummins #RedHat
    FinOps
    figuring out who in your company forgot to turn off their cloud


  81. @holly_cummins #RedHat


  84. @holly_cummins #RedHat
    backstage.io
    • cost insights plugin
    • cloud carbon footprint plugin

  85. AIOps
    • Densify
    • Granulate
    • Turbonomic Application Resource Management
    • TSO Logic
    • etc

  86. 21% improvement from installing Turbonomic in IBM CIO office

  87. @holly_cummins #RedHat
    traffic monitoring


  88. @holly_cummins #RedHat
    but.
    knowing is only half the battle.

  92. @holly_cummins #RedHat
    the ikea effect
    labour → love

  93. @holly_cummins #RedHat
    shut it down?
    but … what if I need this cluster later?

  94. @holly_cummins #RedHat
    elasticity
    native Quarkus starts faster than a light bulb

  95. @holly_cummins
    #RedHat
    ultimate elasticity



  96. @holly_cummins #RedHat
    we don’t switch the light off because we’re not sure if it will come back on

  97. @holly_cummins #RedHat
    we don’t switch the server off because we’re not sure if it will come back on
    happens all the time

  98. @holly_cummins #RedHat
    we don’t switch the server off because it would be too much work to recreate it
    happens all the time

  99. @holly_cummins
    #RedHat



  105. @holly_cummins
    #RedHat
    turning it off and on again must
    • be fast
    • actually work
    • idempotency
    • resiliency
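
One way to picture the idempotency requirement is a reconcile step: describe the desired state, compute the difference from what is running, and act only on the difference. The sketch below uses in-memory sets and hypothetical service names; it illustrates the property and does not drive a real platform.

```python
from __future__ import annotations


def reconcile(running: set[str], desired: set[str]) -> tuple[set[str], list[str]]:
    """Move `running` towards `desired`; return the new state and the actions taken.

    Idempotent: calling it again with the state it returned produces no actions.
    """
    actions = [f"start {name}" for name in sorted(desired - running)]
    actions += [f"stop {name}" for name in sorted(running - desired)]
    return set(desired), actions


everything = {"api", "database", "worker"}  # hypothetical service names

state, actions = reconcile(set(), everything)  # switch on
print(actions)
state, actions = reconcile(state, everything)  # switch on again: nothing left to do
print(actions)
state, actions = reconcile(state, set())       # switch off
print(actions)
```

Because repeating a step is harmless, a script that fails half-way can simply be run again, which also helps with the resiliency point.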


  107. @holly_cummins #RedHat
    LightSwitchOps
    making turning servers off as safe and easy as turning lights off

  109. @holly_cummins #RedHat
    simple scripts
    we used to leave our applications running all the time
    when we scripted turning them off at night, we reduced our cloud bill by 30%
    @darkandnerdy, Chicago DevOpsDays
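
A script like the one described here needs little more than a rule for when things should be running. The sketch below is one hypothetical version: the working hours (08:00 to 18:00, Monday to Friday, in local time) are assumptions, and the actual start and stop commands are left out. It also estimates the share of the week the systems would stay on, which is a figure for running hours and not a prediction of the bill.

```python
from datetime import datetime

WORK_DAYS = range(0, 5)   # Monday=0 .. Friday=4
WORK_START_HOUR = 8       # assumed start of the working day (local time)
WORK_END_HOUR = 18        # assumed end of the working day (local time)


def should_be_running(local_time: datetime) -> bool:
    """True if the application should be up at this local time."""
    in_work_week = local_time.weekday() in WORK_DAYS
    in_work_hours = WORK_START_HOUR <= local_time.hour < WORK_END_HOUR
    return in_work_week and in_work_hours


def weekly_running_fraction() -> float:
    """Share of the 168 hours in a week during which the application stays on."""
    hours_on = len(WORK_DAYS) * (WORK_END_HOUR - WORK_START_HOUR)
    return hours_on / (7 * 24)


print(should_be_running(datetime(2023, 3, 29, 10, 0)))   # a Wednesday morning
print(should_be_running(datetime(2023, 3, 29, 23, 0)))   # the same day, late evening
print(f"on for {weekly_running_fraction():.0%} of the week")
```

A scheduler such as cron could call `should_be_running` periodically and start or stop the applications to match.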


  110. @holly_cummins #RedHat


  112. @holly_cummins #RedHat
    GitOps
    (infrastructure as code)

  113. @holly_cummins #RedHat


  117. @holly_cummins #RedHat
    spin it down
    spin it up
    kubectl apply -f all-my-cluster/
    ansible-playbook stuff.yml
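
The two halves of the light switch can be wrapped in a small helper. The slide only shows the spin-up command; using `kubectl delete -f` on the same manifests to spin down is an assumption made here, and it would also remove any storage defined in those manifests, so it suits environments that can be fully recreated from code. The helper defaults to a dry run that prints the command without executing it.

```python
import subprocess


def spin_up_command(manifest_dir: str) -> list[str]:
    """Command from the slide: apply every manifest in the directory."""
    return ["kubectl", "apply", "-f", manifest_dir]


def spin_down_command(manifest_dir: str) -> list[str]:
    # Assumption (not on the slide): delete what the same manifests created.
    # --ignore-not-found keeps the command safe to repeat.
    return ["kubectl", "delete", "-f", manifest_dir, "--ignore-not-found"]


def run(command: list[str], dry_run: bool = True) -> None:
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)


run(spin_down_command("all-my-cluster/"))
run(spin_up_command("all-my-cluster/"))
```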


  118. reducing snowflakes reduces redundancy

  120. we need to have another copy of our expensive cluster in another region so we have failover!
    uh … sounds expensive. are you sure about that?

  121. rapid recovery does not require redundant servers

  122. zombie reduction does not need to be fancy

  123. @holly_cummins #RedHat
    large bank, 2013
    50% reduction in CPUs with a
    lease system
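
The slides do not say how the bank's lease system worked. The sketch below shows one plausible shape under stated assumptions: every resource is granted a lease with an expiry date, owners must renew to keep it, and anything whose lease has lapsed is reported for reclamation. The class names and the 30-day duration are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Lease:
    owner: str
    expires_at: datetime


class LeaseRegistry:
    """Tracks which resources are still wanted; everything else can be reclaimed."""

    def __init__(self, duration: timedelta = timedelta(days=30)) -> None:
        self.duration = duration  # assumed lease length, not from the talk
        self._leases: dict[str, Lease] = {}

    def grant(self, resource_id: str, owner: str, now: datetime) -> None:
        self._leases[resource_id] = Lease(owner, now + self.duration)

    def renew(self, resource_id: str, now: datetime) -> None:
        # Renewal is the owner saying "yes, I still need this".
        self._leases[resource_id].expires_at = now + self.duration

    def expired(self, now: datetime) -> list[str]:
        """Resource ids whose lease has lapsed and which can be switched off."""
        return sorted(rid for rid, lease in self._leases.items() if lease.expires_at <= now)


registry = LeaseRegistry()
start = datetime(2013, 1, 1)
registry.grant("build-server", "alice", start)
registry.grant("demo-cluster", "bob", start)
registry.renew("build-server", start + timedelta(days=20))
print(registry.expired(start + timedelta(days=31)))
```

The point of the design is that keeping a resource takes a small deliberate action, while forgetting about it leads to it being switched off by default.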


  126. @holly_cummins #RedHat
    things that (maybe) don’t help
    cloud
    “out of sight, out of mind”

  127. @holly_cummins #RedHat


  129. @holly_cummins #RedHat
    things that (maybe) don’t help
    virtualisation
    2019 survey
    30% of virtual servers doing no useful work
    50% of virtual servers active less than 5% of the time

  130. #RedHat
    you still need to remember to turn the virtual machine off

  131. what about serverless?

  132. modernising to serverless is a big lift

  133. may not suit latency-sensitive workloads

  134. “we solve the cold-start problem by … keeping an instance running but not billing you”

  136. @holly_cummins #RedHat
    serverless systems may have high overheads
    control plane
    application

  140. virtualisation overheads mean each function request can use 30x more energy than a plain HTTP server
    https://hotcarbon.org/pdf/hotcarbon22-sharma.pdf

  141. are all parts of the system elastic?


  142. things that definitely don’t help

  144. @holly_cummins #RedHat
    things that don’t help
    prevention (?!)

  145. surely shutting the barn door before the horse has left is a good idea?

  146. prevention == heavy governance

  148. remember the ikea effect?
    people will not surrender servers that were hard to get

  149. zombies are not just servers

  150. zombie packets

  152. @holly_cummins #RedHat
    internet background noise
    5.5 gigabits/s

  153. @holly_cummins #RedHat
    unsolved problem == opportunity


  154. @holly_cummins #RedHat
    the double-win
    turning things off saves a lot of money


  155. @holly_cummins #RedHat


  157. @holly_cummins #RedHat
    users …
    up utilisation
    aim for elasticity
    limit kubesprawl
    de-zombify
    know what you’re using
    turn it off

  160. @holly_cummins #RedHat
    tool creators, support
    1-2%
    better utilisation
    elasticity
    multi-tenancy
    de-zombification
    visibility
    disposability

  161. GreenOps
    FinOps
    AIOps
    GitOps
    LightSwitchOps