Why Cloud Zombies Are Destroying the Planet and How You Can Stop Them

Wait, zombies? Really? Zombies are servers which aren’t doing useful work. They’re everywhere, costing money, eating electricity, and belching carbon. And they’re useless! So how do we get rid of them? In this talk, Holly will explain how utilization and elasticity relate to sustainability. She will also introduce a range of practical zombie-hunting techniques, including absurdly-simple-automation, LightSwitchOps, and FinOps.

Holly Cummins

March 29, 2023

Transcript

  1. Holly Cummins


    Red Hat


    QCon London | March 29, 2023
    Why Cloud Zombies Are
    Destroying the Planet


    and


    How You Can Stop Them


    View Slide

  3. @holly_cummins #RedHat

    View Slide

  5. @therealmarkw1, twitter

    View Slide

  6. what do these servers do?
    @therealmarkw1, twitter

    View Slide

  7. what do these servers do?
    one is a backup for the
    other.
    @therealmarkw1, twitter

    View Slide

  8. what do these servers do?
    one is a backup for the
    other.
    yes, but what do they do?
    @therealmarkw1, twitter

    View Slide

  9. what do these servers do?
    one is a backup for the
    other.
    yes, but what do they do?
    no one has known for a
    couple of decades
    @therealmarkw1, twitter

    View Slide

  10. #RedHat
    @[email protected]
    Hey boss, I created a
    Kubernetes cluster.
    2018

    View Slide

  11. #RedHat
    @[email protected]
    Hey boss, I created a
    Kubernetes cluster.
    I forgot it for 2 months.
    2018

    View Slide

  12. #RedHat
    @[email protected]
    Hey boss, I created a
    Kubernetes cluster.
    I forgot it for 2 months.
    … and it’s €1000 a month.
    2018

    View Slide

  13. #RedHat
    @[email protected]
    Hey boss, while I was
    working on a QCon talk
    about sustainability …
    2023

    View Slide

  14. #RedHat
    @[email protected]
    Hey boss, while I was
    working on a QCon talk
    about sustainability …
    I left the Quarkus CI
    on Mac disabled
    2023

    View Slide

  15. #RedHat
    @[email protected]
    Hey boss, while I was
    working on a QCon talk
    about sustainability …
    I left the Quarkus CI
    on Mac disabled
    … and the instance is $159 a
    month.
    2023

    View Slide

  16. @holly_cummins #RedHat
    “measure, don’t guess”


    (or decide based on stories on the internet)

    View Slide

  17. @holly_cummins #RedHat
    actual picture of a zombie


    (it’s invisible)

    View Slide

  19. #RedHat
    @[email protected]
    2015 survey


    30%


    of 4,000 servers doing
    no useful work

    View Slide

  20. #RedHat
    @[email protected]
    2017 survey


    25%


    of 16,000 servers doing
    no useful work

    View Slide

  21. #RedHat
    @[email protected]
    zombie


    “they haven't delivered any information or
    computing services for six months or more”

    View Slide
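
The six-month definition above can be turned into a simple check. This is only a sketch: the inventory, the server names, and the idea of a recorded "last useful activity" timestamp are assumptions for illustration, and "six months" is approximated as 182 days.

```python
from datetime import datetime, timedelta

# Assumption: "six months" is approximated as 182 days.
ZOMBIE_THRESHOLD = timedelta(days=182)


def find_zombies(last_useful_activity, now):
    """Return names of servers with no useful activity for six months or more.

    last_useful_activity maps a server name to the datetime of its most
    recent useful work, or None if no useful work was ever recorded.
    """
    zombies = []
    for name, last_seen in last_useful_activity.items():
        if last_seen is None or now - last_seen >= ZOMBIE_THRESHOLD:
            zombies.append(name)
    return sorted(zombies)


# Hypothetical inventory, for illustration only.
inventory = {
    "build-agent": datetime(2023, 3, 20),
    "old-demo": datetime(2022, 6, 1),
    "mystery-box": None,
}
zombies = find_zombies(inventory, now=datetime(2023, 3, 29))
print(zombies)
```

The hard part in practice is deciding what counts as "useful" activity; the sketch assumes that judgement has already been made and recorded.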

  22. #RedHat
    @[email protected]
    “comatose servers”

    View Slide

  23. #RedHat
    @[email protected]
    under-utilised servers

    View Slide

  24. #RedHat
    @[email protected]
    “much of the energy consumed by U.S. data
    centers is used to power more than 12
    million servers that do little or no work most
    of the time”


    NRDC

    View Slide

  25. #RedHat
    @[email protected]
    the average server:


    12-18% of capacity


    30-60% of maximum power
    https://www.nrdc.org/sites/default/files/data-center-efficiency-assessment-IB.pdf

    View Slide
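
A rough way to read the two ranges above together: a server at low utilisation still draws a large share of its maximum power, so each unit of work costs more energy. The sketch below assumes a simplified baseline (a server at 100% load drawing 100% of its maximum power); the function name is made up for illustration.

```python
def relative_energy_per_unit_of_work(utilisation, power_fraction):
    """Energy per unit of work, relative to a server at full load and full power.

    Both arguments are fractions between 0 and 1.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return power_fraction / utilisation


# Corner cases built from the ranges on the slide.
best_case = relative_energy_per_unit_of_work(utilisation=0.18, power_fraction=0.30)
worst_case = relative_energy_per_unit_of_work(utilisation=0.12, power_fraction=0.60)
print(round(best_case, 1), round(worst_case, 1))
```

Under that assumption, the average server spends roughly 1.7 to 5 times as much energy per unit of work as a fully loaded one.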

  26. #RedHat
    @[email protected]
    2014 survey


    29%


    of 4,000 active less than
    5% of the time
    https://www.anthesisgroup.com/wp-content/uploads/2019/11/Comatose-Servers-Redux-2017.pdf

    View Slide

  27. @holly_cummins #RedHat
    https://www.business2community.com/cloud-computing/overprovisioning-always-on-resources-lead-to-26-6-billion-in-public-cloud-waste-expected-in-2021-02381033
    2021 study


    View Slide

  28. @holly_cummins #RedHat
    $26.6 billion
    https://www.business2community.com/cloud-computing/overprovisioning-always-on-resources-lead-to-26-6-billion-in-public-cloud-waste-expected-in-2021-02381033
    2021 study


    View Slide

  29. @holly_cummins #RedHat
    2021 study
    $26.6 billion
    wasted by always-on
    cloud instances
    https://www.business2community.com/cloud-computing/overprovisioning-always-on-resources-lead-to-26-6-billion-in-public-cloud-waste-expected-in-2021-02381033


    View Slide

  30. #RedHat
    @[email protected]
    it’s not just runtime costs

    View Slide

  31. #RedHat
    @[email protected]
    embodied carbon
    it’s not just runtime costs

    View Slide

  32. #RedHat
    @[email protected]
    why does this happen?

    View Slide

  33. @holly_cummins #RedHat
    managing machines is hard

    View Slide

  35. View Slide

  36. #RedHat
    @[email protected]
    “perhaps someone
    forgot to turn them off”


    Anthesis Group

    View Slide

  37. View Slide

  38. #RedHat
    @[email protected]
    projects ended

    View Slide

  39. #RedHat
    @[email protected]
    projects ended
    business processes changed

    View Slide

  40. #RedHat
    @[email protected]
    projects ended
    business processes changed
    over-provisioning

    View Slide

  41. #RedHat
    @[email protected]
    projects ended
    business processes changed
    over-provisioning
    isolation requirements

    View Slide

  42. @holly_cummins #RedHat
    risk averse processes

    View Slide

  43. @holly_cummins #RedHat
    “we run this as a batch job on weekends,
    but the servers stay up all week”



    View Slide

  45. @holly_cummins #RedHat
    “we only use this system in UK working hours,
    but we leave it running 24/7”



    View Slide
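
To put a number on the two quotes above, this sketch works out how much of the week an always-on system sits idle. The working pattern is an assumption (eight hours a day, five days a week for "UK working hours", and two full days for the weekend batch job); the slides do not give exact hours.

```python
HOURS_PER_WEEK = 7 * 24


def idle_fraction(hours_per_day, days_per_week):
    """Fraction of the week an always-on system is running but not needed."""
    used_hours = hours_per_day * days_per_week
    return 1 - used_hours / HOURS_PER_WEEK


# Assumed usage patterns, for illustration only.
office_hours_idle = idle_fraction(hours_per_day=8, days_per_week=5)
weekend_batch_idle = idle_fraction(hours_per_day=24, days_per_week=2)
print(f"{office_hours_idle:.0%} {weekend_batch_idle:.0%}")
```

Under these assumptions the office-hours system is idle for about 76% of the week, and the weekend batch servers for about 71%.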

  47. @holly_cummins #RedHat
    auto-scaling algorithms are optimised for availability

    View Slide

  48. @holly_cummins #RedHat
    green computing model: the four vowels

    View Slide

  50. @holly_cummins #RedHat
    green computing model: the four vowels
    elasticity

    View Slide

  51. @holly_cummins #RedHat
    green computing model: the four vowels
    elasticity
    utilisation

    View Slide

  52. @holly_cummins #RedHat
    green computing model: the four vowels
    elasticity
    utilisation
    efficiency

    View Slide

  53. @holly_cummins #RedHat
    green computing model: the four vowels
    elasticity
    utilisation
    efficiency
    utility

    View Slide

  55. @holly_cummins #RedHat
    application
    utilisation

    View Slide

  56. @holly_cummins #RedHat
    application
    utilisation
    high utilisation


    good case

    View Slide

  57. @holly_cummins #RedHat
    application
    utilisation
    over-utilisation


    very bad case

    View Slide

  58. @holly_cummins #RedHat
    application
    utilisation
    over-utilisation


    very bad case
    under-utilisation


    wasteful case

    View Slide

  59. @holly_cummins #RedHat
    application
    elasticity
    high utilisation


    good case
    @holly_cummins

    View Slide

  60. @holly_cummins #RedHat
    application
    elasticity
    scale-up


    good utilisation
    @holly_cummins

    View Slide

  61. @holly_cummins #RedHat
    application
    elasticity
    scale-down


    good utilisation
    @holly_cummins

    View Slide

  62. @holly_cummins #RedHat
    green computing model: the four vowels
    elasticity
    utilisation
    efficiency
    utility

    View Slide

  64. @holly_cummins #RedHat
    why utility matters
    “There is nothing so useless as
    doing efficiently that which
    should not be done at all.”


    Peter Drucker

    View Slide

  65. @holly_cummins #RedHat
    “efficient zombies”

    View Slide

  66. @holly_cummins #RedHat
    how do we solve the zombie problem?

    View Slide

  67. @holly_cummins #RedHat
    how do we solve the zombie problem?
    detection and destruction

    View Slide

  68. View Slide

  69. @holly_cummins #RedHat
    system archaeology
    … is not easy

    View Slide

  70. @holly_cummins #RedHat
    scream test

    View Slide

  71. @holly_cummins #RedHat
    “eco-monkey”

    View Slide

  72. @holly_cummins
    #RedHat
    the scream is real

    View Slide

  73. @holly_cummins
    #RedHat
    the scream is real
    this internal server
    doesn’t seem to have
    a purpose

    View Slide

  74. @holly_cummins
    #RedHat
    the scream is real
    this internal server
    doesn’t seem to have
    a purpose
    let’s turn it off!

    View Slide

  75. @holly_cummins
    #RedHat
    the scream is real
    this internal server
    doesn’t seem to have
    a purpose
    let’s turn it off!
    uh … why did the
    backbone of a
    client’s network
    just vanish?

    View Slide

  76. @holly_cummins
    #RedHat
    the scream is real
    this internal server
    doesn’t seem to have
    a purpose
    let’s turn it off!
    uh … why did the
    backbone of a
    client’s network
    just vanish?
    oops.

    View Slide

  77. @holly_cummins #RedHat
    IT Department, UK Bank
    let’s figure out what all
    these cloud workloads are,
    since I’m paying for them
    long meetings

    View Slide

  79. @holly_cummins #RedHat
    long emails

    View Slide

  80. @holly_cummins #RedHat
    tags

    View Slide
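
Tags only help if they are actually present. A minimal sketch of an audit, assuming a hypothetical inventory that maps each resource name to its tags and an assumed policy that every resource needs an `owner` and a `purpose` tag:

```python
# Assumption: the tags an organisation might require on every resource.
REQUIRED_TAGS = {"owner", "purpose"}


def missing_tags(resources):
    """Map each under-tagged resource to the sorted list of tags it lacks."""
    report = {}
    for name, tags in resources.items():
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            report[name] = sorted(missing)
    return report


# Hypothetical inventory, for illustration only.
resources = {
    "vm-1": {"owner": "team-a", "purpose": "ci"},
    "vm-2": {"owner": "team-b"},
    "vm-3": {},
}
report = missing_tags(resources)
print(report)
```

Resources that show up in the report are the ones nobody can explain later, which makes them candidates for the long meetings and long emails mentioned earlier.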

  81. @holly_cummins #RedHat
    all the -opses

    View Slide

  82. @holly_cummins #RedHat
    GreenOps

    View Slide

  83. @holly_cummins #RedHat
    GreenOps
    greenops is a mid-sized trilobite (really)

    View Slide

  84. @holly_cummins #RedHat
    FinOps
    figuring out who in your company forgot to turn off their cloud

    View Slide

  85. @holly_cummins #RedHat

    View Slide

  86. @holly_cummins #RedHat
    backstage.io

    View Slide

  87. @holly_cummins #RedHat
    backstage.io
    •cost insights plugin

    View Slide

  88. @holly_cummins #RedHat
    backstage.io
    •cost insights plugin
    •cloud carbon footprint plugin

    View Slide

  89. • Densify


    • Granulate


    • Turbonomic Application Resource Management


    • TSO Logic


    • etc
    AIOps

    View Slide

  90. 21%
    improvement from installing Turbonomic


    in IBM CIO office

    View Slide

  91. @holly_cummins #RedHat
    traffic monitoring

    View Slide

  92. @holly_cummins #RedHat
    but.


    knowing is only half the battle.

    View Slide

  93. @holly_cummins #RedHat
    the ikea effect

    View Slide

  94. @holly_cummins #RedHat
    the ikea effect
    labour

    View Slide

  96. @holly_cummins #RedHat
    the ikea effect
    labour leads to love

    View Slide

  97. @holly_cummins #RedHat
    shut it down?


    but … what if I
    need this
    cluster later?

    View Slide

  98. @holly_cummins #RedHat
    elasticity
    native quarkus starts
    faster than a light bulb

    View Slide

  99. @holly_cummins
    #RedHat
    ultimate elasticity


    View Slide

  100. @holly_cummins #RedHat
    we don’t switch the light off
    because we’re not sure if it will
    come back on

    View Slide

  101. @holly_cummins #RedHat
    we don’t switch the server off
    because we’re not sure if it will come
    back on
    happens all the time


    View Slide

  102. @holly_cummins #RedHat
    we don’t switch the server off
    because it would be too much work
    to recreate it
    happens all the time


    View Slide

  103. @holly_cummins
    #RedHat

    View Slide

  104. @holly_cummins
    #RedHat

    View Slide

  105. @holly_cummins
    #RedHat
    turning it off and on again must

    View Slide

  106. @holly_cummins
    #RedHat
    turning it off and on again must
    • be fast

    View Slide

  107. @holly_cummins
    #RedHat
    turning it off and on again must
    • be fast
    • actually work

    View Slide

  108. @holly_cummins
    #RedHat
    turning it off and on again must
    • be fast
    • actually work
    • be idempotent

    View Slide

  109. @holly_cummins
    #RedHat
    turning it off and on again must
    • be fast
    • actually work
    • be idempotent
    • be resilient

    View Slide
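
Idempotency here means that asking for "off" (or "on") twice is harmless: the second request changes nothing and raises no error. The sketch below uses an in-memory stand-in for real infrastructure; the class and method names are hypothetical and not taken from any tool.

```python
class FakeCluster:
    """In-memory stand-in for real infrastructure (hypothetical)."""

    def __init__(self):
        self.running = set()
        self.actions = []  # log of the changes that were actually made

    def ensure_running(self, name):
        """Start the workload only if it is not already running."""
        if name not in self.running:
            self.running.add(name)
            self.actions.append(("start", name))

    def ensure_stopped(self, name):
        """Stop the workload only if it is currently running."""
        if name in self.running:
            self.running.remove(name)
            self.actions.append(("stop", name))


cluster = FakeCluster()
cluster.ensure_running("demo")
cluster.ensure_running("demo")  # repeated request: no second start
cluster.ensure_stopped("demo")
cluster.ensure_stopped("demo")  # repeated request: no error, no change
print(cluster.actions)
```

Describing the desired state ("ensure stopped") rather than the action ("stop") is what makes a scheduled script safe to re-run.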

  110. @holly_cummins
    #RedHat
    making turning servers off as safe and easy as turning lights off

    View Slide

  111. @holly_cummins
    #RedHat
    LightSwitchOps
    making turning servers off as safe and easy as turning lights off

    View Slide

  112. @holly_cummins
    #RedHat
    simple scripts
    we used to leave
    our applications
    running all the time
    @darkandnerdy, Chicago DevOpsDays

    View Slide

  113. @holly_cummins
    #RedHat
    simple scripts
    we used to leave
    our applications
    running all the time
    when we
    scripted turning
    them off at night,
    we reduced our
    cloud bill by
    30%
    @darkandnerdy, Chicago DevOpsDays

    View Slide
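
A script of this kind can be very small. The sketch below is not the script from the quote; it assumes workloads run as Kubernetes deployments, assumes working hours of 08:00 to 18:00 on weekdays, and only builds the `kubectl scale` command rather than running it. The deployment name is made up.

```python
from datetime import datetime, time

# Assumption: the hours during which the application is actually needed.
WORK_START = time(8, 0)
WORK_END = time(18, 0)


def should_be_running(now):
    """True on weekdays between WORK_START (inclusive) and WORK_END (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and WORK_START <= now.time() < WORK_END


def scale_command(deployment, now, replicas_when_on=1):
    """Build the kubectl command that scales a deployment up or down to zero."""
    replicas = replicas_when_on if should_be_running(now) else 0
    return ["kubectl", "scale", "deployment", deployment, f"--replicas={replicas}"]


# "demo-app" is a hypothetical deployment name.
night_command = scale_command("demo-app", datetime(2023, 3, 29, 23, 0))
print(" ".join(night_command))
```

Run from a scheduler at the start and end of the day, the same function handles both directions, so there is only one script to maintain.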

  114. @holly_cummins #RedHat

    View Slide

  115. @holly_cummins #RedHat
    GitOps

    View Slide

  116. @holly_cummins #RedHat
    GitOps
    (infrastructure as code)

    View Slide

  117. @holly_cummins #RedHat

    View Slide

  118. @holly_cummins #RedHat
    spin it down

    View Slide

  119. @holly_cummins #RedHat
    spin it down
    spin it up
    kubectl apply -f all-my-cluster/

    View Slide

  121. @holly_cummins #RedHat
    spin it down
    spin it up
    kubectl apply -f all-my-cluster/
    ansible-playbook stuff.yml

    View Slide
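
With everything in a manifest directory, spinning up and spinning down become two near-identical commands. The slides show only the `kubectl apply` side; using `kubectl delete -f` for the spin-down is an assumption here, and the wrapper functions are hypothetical. By default the sketch returns the command without running it.

```python
import subprocess

# Assumption: "up" maps to apply (as on the slide), "down" maps to delete.
VERBS = {"up": "apply", "down": "delete"}


def cluster_command(action, manifest_dir="all-my-cluster/"):
    """Build the kubectl command for spinning the workloads up or down."""
    if action not in VERBS:
        raise ValueError(f"unknown action: {action!r}")
    return ["kubectl", VERBS[action], "-f", manifest_dir]


def run(action, execute=False):
    """Return the command; only run it against a real cluster if execute is True."""
    command = cluster_command(action)
    if execute:
        subprocess.run(command, check=True)
    return command


print(" ".join(run("down")))
print(" ".join(run("up")))
```

Because the manifests are the source of truth, deleting the running resources loses nothing that `apply` cannot recreate, which is the property that makes switching off safe. Data held in persistent volumes is a separate concern that this sketch does not address.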

  122. reducing snowflakes
    reduces redundancy

    View Slide

  123. we need to have another
    copy of our expensive cluster in
    another region so we have
    failover!

    View Slide

  124. we need to have another
    copy of our expensive cluster in
    another region so we have
    failover!
    uh … sounds
    expensive. are you
    sure about that?

    View Slide

  125. rapid recovery does not
    require redundant servers

    View Slide

  126. zombie reduction does
    not need to be fancy

    View Slide

  127. @holly_cummins #RedHat
    large bank, 2013


    50%


    reduction in CPUs with a
    lease system

    View Slide
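
The slides do not say how the bank's lease system worked. One plausible reading, sketched here purely as an assumption, is that every resource is granted for a fixed period and is reclaimed unless someone renews it. The class, the 30-day period, and the resource names are all hypothetical.

```python
from datetime import datetime, timedelta


class LeaseRegistry:
    """Tracks when each resource's lease runs out (hypothetical design)."""

    def __init__(self, lease_length=timedelta(days=30)):
        self.lease_length = lease_length
        self.expiry = {}

    def grant(self, resource, now):
        """Start a lease, or renew an existing one, from the given moment."""
        self.expiry[resource] = now + self.lease_length

    def expired(self, now):
        """Resources whose lease has run out and that can be reclaimed."""
        return sorted(name for name, end in self.expiry.items() if end <= now)


registry = LeaseRegistry()
registry.grant("cluster-a", datetime(2023, 1, 1))
registry.grant("cluster-b", datetime(2023, 3, 1))

today = datetime(2023, 3, 29)
to_reclaim = registry.expired(today)
print(to_reclaim)
```

The useful property is that the default flips: keeping a server takes a small deliberate action, while forgetting about it leads to it being switched off.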

  129. things that (maybe) don’t help

    View Slide

  130. @holly_cummins #RedHat
    things that (maybe) don’t help


    cloud
    “out of sight, out of mind”

    View Slide

  131. @holly_cummins #RedHat

    View Slide

  132. @holly_cummins #RedHat
    things that (maybe) don’t help


    virtualisation
    2019 survey


    30%


    of virtual servers doing
    no useful work

    View Slide

  133. @holly_cummins #RedHat
    things that (maybe) don’t help


    virtualisation
    2019 survey


    30%


    of virtual servers doing
    no useful work


    50%


    of virtual servers active
    less than 5% of the time

    View Slide

  134. #RedHat
    @[email protected]
    you still need to remember to
    turn the virtual machine off

    View Slide

  135. what about serverless?

    View Slide

  136. modernising to serverless is a big lift

    View Slide

  137. may not suit latency-sensitive workloads

    View Slide

  138. “we solve the cold-start problem by …


    … keeping an instance running but not billing you”

    View Slide

  139. @holly_cummins #RedHat
    application
    serverless systems may have high overheads

    View Slide

  140. @holly_cummins #RedHat
    control plane
    application
    serverless systems may have high overheads

    View Slide

  143. https://hotcarbon.org/pdf/hotcarbon22-sharma.pdf

    View Slide

    virtualisation overheads
    mean each function request
    can use 30x more energy
    than a plain http server
    https://hotcarbon.org/pdf/hotcarbon22-sharma.pdf

    View Slide

  145. are all parts of the system elastic?

    View Slide

  146. things that definitely don’t help

    View Slide

  147. @holly_cummins #RedHat
    things that don’t help


    prevention

    View Slide

  148. @holly_cummins #RedHat
    things that don’t help


    prevention (?!)

    View Slide

  149. surely shutting the barn door before
    the horse has left is a good idea?

    View Slide

  150. prevention == heavy governance

    View Slide

  151. remember the ikea effect?

    View Slide

  152. remember the ikea effect?
    people will not surrender
    servers that were hard to get

    View Slide

  153. zombies are not just servers

    View Slide

  154. data

    View Slide

  155. traffic

    View Slide

  156. zombie packets

    View Slide

  157. @holly_cummins #RedHat
    internet background noise

    View Slide

  158. @holly_cummins #RedHat
    internet background noise
    5.5 gigabits/s

    View Slide

  159. @holly_cummins #RedHat
    unsolved problem == opportunity

    View Slide

  160. @holly_cummins #RedHat
    the double-win
    turning things off saves a lot of money

    View Slide

  161. @holly_cummins #RedHat

    View Slide

  162. @holly_cummins #RedHat
    users …

    View Slide

  163. @holly_cummins #RedHat
    users …
    up utilisation


    aim for elasticity


    limit kubesprawl


    de-zombify


    know what you’re using


    turn it off

    View Slide

  164. @holly_cummins #RedHat
    1-2%

    View Slide

  165. @holly_cummins #RedHat
    tool creators, support
    1-2%

    View Slide

  166. @holly_cummins #RedHat
    tool creators, support
    1-2%
    better utilisation


    elasticity


    multi-tenancy


    de-zombification


    visibility


    disposability

    View Slide

  167. GreenOps


    FinOps


    AIOps


    GitOps


    LightSwitchOps


    View Slide

  169. thank you


    @[email protected]
    slides

    View Slide