Holly Cummins
Red Hat
QCon London | March 29, 2023
Why Cloud Zombies Are
Destroying the Planet
and
How You Can Stop Them
Slide 3
Slide 3 text
@holly_cummins #RedHat
Slide 4
Slide 4 text
@holly_cummins #RedHat
Slide 5
Slide 5 text
@therealmarkw1, twitter
Slide 6
Slide 6 text
what do these servers do?
@therealmarkw1, twitter
Slide 7
Slide 7 text
what do these servers do?
one is a backup for the
other.
@therealmarkw1, twitter
Slide 8
Slide 8 text
what do these servers do?
one is a backup for the
other.
yes, but what do they do?
@therealmarkw1, twitter
Slide 9
Slide 9 text
what do these servers do?
one is a backup for the
other.
yes, but what do they do?
no one has known for a
couple of decades
@therealmarkw1, twitter
Slide 10
Slide 10 text
#RedHat
Hey boss, I created a
Kubernetes cluster.
2018
Slide 11
Slide 11 text
#RedHat
Hey boss, I created a
Kubernetes cluster.
I forgot it for 2 months.
2018
Slide 12
Slide 12 text
#RedHat
Hey boss, I created a
Kubernetes cluster.
I forgot it for 2 months.
… and it’s €1000 a month.
2018
Slide 13
Slide 13 text
#RedHat
Hey boss, while I was
working on a QCon talk
about sustainability …
2023
Slide 14
Slide 14 text
#RedHat
Hey boss, while I was
working on a QCon talk
about sustainability …
I left the Quarkus CI
on Mac disabled
2023
Slide 15
Slide 15 text
#RedHat
Hey boss, while I was
working on a QCon talk
about sustainability …
I left the Quarkus CI
on Mac disabled
… and the instance is $159 a
month.
2023
Slide 16
Slide 16 text
@holly_cummins #RedHat
“measure, don’t guess”
(or decide based on stories on the internet)
Slide 17
Slide 17 text
@holly_cummins #RedHat
actual picture of a zombie
(it’s invisible)
Slide 19
Slide 19 text
#RedHat
2015 survey
30%
of 4,000 servers doing
no useful work
Slide 20
Slide 20 text
#RedHat
2017 survey
25%
of 16,000 servers doing
no useful work
Slide 21
Slide 21 text
#RedHat
zombie
“they haven't delivered any information or
computing services for six months or more”
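A minimal sketch of how the six-month definition could become an automated check, assuming an inventory that records when each server last did useful work. The `find_zombies` function, the server names, and the 183-day approximation of six months are illustrative assumptions, not something the talk specifies.

```python
from datetime import datetime, timedelta

# "Six months" approximated as 183 days (an assumption; the quote gives no day count).
ZOMBIE_THRESHOLD = timedelta(days=183)


def find_zombies(last_useful_activity, now):
    """Return the names of servers with no useful activity for six months or more."""
    return sorted(
        name
        for name, last_seen in last_useful_activity.items()
        if now - last_seen >= ZOMBIE_THRESHOLD
    )


# Hypothetical inventory: server name -> last time it served a real request.
inventory = {
    "build-box": datetime(2023, 3, 1),
    "old-demo": datetime(2022, 6, 15),
    "mystery-vm": datetime(2021, 11, 2),
}
zombies = find_zombies(inventory, now=datetime(2023, 3, 29))
print(zombies)
```

The hard part in practice is deciding what counts as "useful activity"; the sketch simply assumes that timestamp is already known.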
#RedHat
“much of the energy consumed by U.S. data
centers is used to power more than 12
million servers that do little or no work most
of the time”
NRDC
Slide 25
Slide 25 text
#RedHat
the average server:
runs at 12-18% of capacity
draws 30-60% of maximum power
https://www.nrdc.org/sites/default/files/data-center-efficiency-assessment-IB.pdf
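A rough worked calculation of what those two ranges imply, assuming useful work scales linearly with utilisation. The function name and the pairing of range endpoints into a "best" and "worst" case are assumptions made for illustration.

```python
def relative_energy_per_unit_work(utilisation, power_fraction):
    """Energy per unit of work, relative to a server at full load and full power."""
    return power_fraction / utilisation


# Best case: highest utilisation (18%) paired with lowest power draw (30%).
best_case = relative_energy_per_unit_work(utilisation=0.18, power_fraction=0.30)
# Worst case: lowest utilisation (12%) paired with highest power draw (60%).
worst_case = relative_energy_per_unit_work(utilisation=0.12, power_fraction=0.60)

print(f"{best_case:.1f}x to {worst_case:.1f}x the energy per unit of work")
```

Under these assumptions, the average server spends somewhere between roughly 1.7 and 5 times as much energy per unit of work as a fully loaded one.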
Slide 26
Slide 26 text
#RedHat
2014 survey
29%
of 4,000 servers active less than
5% of the time
https://www.anthesisgroup.com/wp-content/uploads/2019/11/Comatose-Servers-Redux-2017.pdf
Slide 27
Slide 27 text
@holly_cummins #RedHat
https://www.business2community.com/cloud-computing/overprovisioning-always-on-resources-lead-to-26-6-billion-in-public-cloud-waste-expected-in-2021-02381033
2021 study
Slide 28
Slide 28 text
@holly_cummins #RedHat
$26.6 billion
https://www.business2community.com/cloud-computing/overprovisioning-always-on-resources-lead-to-26-6-billion-in-public-cloud-waste-expected-in-2021-02381033
2021 study
Slide 29
Slide 29 text
@holly_cummins #RedHat
$26.6 billion
wasted by always-on
cloud instances
https://www.business2community.com/cloud-computing/overprovisioning-always-on-resources-lead-to-26-6-billion-in-public-cloud-waste-expected-in-2021-02381033
2021 study
@holly_cummins #RedHat
“we run this as a batch job on weekends,
but the servers stay up all week”
Slide 45
Slide 45 text
@holly_cummins #RedHat
“we only use this system in UK working hours,
but we leave it running 24/7 ”
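The two quotes can be put into numbers with a short calculation. This sketch assumes a 40-hour working week (5 days of 8 hours) and a batch job that occupies the whole weekend; both figures are assumptions, since the quotes give no hours.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def idle_fraction(hours_needed_per_week):
    """Share of the week an always-on system runs without being needed."""
    return 1 - hours_needed_per_week / HOURS_PER_WEEK


working_hours_only = idle_fraction(5 * 8)   # used in working hours, up 24/7
weekend_batch_only = idle_fraction(2 * 24)  # used at weekends, up all week

print(f"working-hours system idle {working_hours_only:.0%} of the week")
print(f"weekend batch servers idle {weekend_batch_only:.0%} of the week")
```

With these assumed hours, both systems sit idle for well over two thirds of the week.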
Slide 47
Slide 47 text
@holly_cummins #RedHat
auto-scaling algorithms are optimised for availability
Slide 48
Slide 48 text
@holly_cummins #RedHat
green computing model: the four vowels
Slide 50
Slide 50 text
@holly_cummins #RedHat
green computing model: the four vowels
elasticity
Slide 51
Slide 51 text
@holly_cummins #RedHat
green computing model: the four vowels
elasticity
utilisation
Slide 52
Slide 52 text
@holly_cummins #RedHat
green computing model: the four vowels
elasticity
utilisation
efficiency
Slide 53
Slide 53 text
@holly_cummins #RedHat
green computing model: the four vowels
elasticity
utilisation
efficiency
utility
Slide 55
Slide 55 text
@holly_cummins #RedHat
application
utilisation
Slide 56
Slide 56 text
@holly_cummins #RedHat
application
utilisation
high utilisation
good case
Slide 57
Slide 57 text
@holly_cummins #RedHat
application
utilisation
over-utilisation
very bad case
Slide 58
Slide 58 text
@holly_cummins #RedHat
application
utilisation
over-utilisation
very bad case
under-utilisation
wasteful case
Slide 59
Slide 59 text
@holly_cummins #RedHat
application
elasticity
high utilisation
good case
Slide 60
Slide 60 text
@holly_cummins #RedHat
application
elasticity
scale-up
good utilisation
Slide 61
Slide 61 text
@holly_cummins #RedHat
application
elasticity
scale-down
good utilisation
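The relationship between elasticity and utilisation on the slides above can be sketched as a sizing rule: pick the smallest number of replicas that keeps utilisation at or below a target. The function names, the 80% target, and the load figures are hypothetical values chosen for illustration.

```python
import math


def replicas_needed(load, capacity_per_replica, target_utilisation=0.8):
    """Smallest replica count that keeps utilisation at or below the target."""
    if load <= 0:
        return 0  # scale to zero when there is nothing to do
    return math.ceil(load / (capacity_per_replica * target_utilisation))


def utilisation(load, replicas, capacity_per_replica):
    """Fraction of provisioned capacity that is actually in use."""
    return load / (replicas * capacity_per_replica)


peak = replicas_needed(load=900, capacity_per_replica=100)   # scale-up
quiet = replicas_needed(load=100, capacity_per_replica=100)  # scale-down
print(peak, quiet, utilisation(900, peak, 100))
```

A fixed fleet sized for the peak would run the quiet-period load at under 10% utilisation; scaling down keeps utilisation reasonable at both ends.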
Slide 62
Slide 62 text
@holly_cummins #RedHat
green computing model: the four vowels
elasticity
utilisation
efficiency
utility
Slide 64
Slide 64 text
@holly_cummins #RedHat
There is nothing so useless as
doing efficiently that which
should not be done at all.
Peter Drucker
why utility matters
Slide 65
Slide 65 text
@holly_cummins #RedHat
“efficient zombies”
Slide 66
Slide 66 text
@holly_cummins #RedHat
how do we solve the zombie problem?
Slide 67
Slide 67 text
@holly_cummins #RedHat
how do we solve the zombie problem?
detection and destruction
Slide 69
Slide 69 text
@holly_cummins #RedHat
system archaeology
… is not easy
Slide 70
Slide 70 text
@holly_cummins #RedHat
scream test
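A minimal sketch of the scream test as a procedure: switch the suspect off, watch for complaints, and restore it if anyone screams. The three callables are hypothetical hooks the caller would supply; nothing here refers to a real tool.

```python
def scream_test(stop, start, complaints_received):
    """Switch a suspected zombie off and restore it if anyone complains.

    stop, start: hypothetical callables that power the server off and on.
    complaints_received: hypothetical callable that watches for complaints
    over some observation window and returns True if any arrived.
    Returns True if the server turned out to be needed.
    """
    stop()
    if complaints_received():
        start()  # someone screamed: bring it back
        return True
    return False  # silence: candidate for decommissioning


# Usage with stand-in hooks that only record what happened.
log = []
needed = scream_test(
    stop=lambda: log.append("stopped"),
    start=lambda: log.append("restarted"),
    complaints_received=lambda: True,
)
print(needed, log)
```

The test is only safe if `start` works reliably and quickly, which is the point the later slides on turning things off and on again make.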
Slide 71
Slide 71 text
@holly_cummins #RedHat
“eco-monkey”
Slide 72
Slide 72 text
@holly_cummins
#RedHat
the scream is real
Slide 73
Slide 73 text
@holly_cummins
#RedHat
the scream is real
this internal server
doesn’t seem to have
a purpose
Slide 74
Slide 74 text
@holly_cummins
#RedHat
the scream is real
this internal server
doesn’t seem to have
a purpose
let’s turn it off!
Slide 75
Slide 75 text
@holly_cummins
#RedHat
the scream is real
this internal server
doesn’t seem to have
a purpose
let’s turn it off!
uh … why did the
backbone of a
client’s network
just vanish?
Slide 76
Slide 76 text
@holly_cummins
#RedHat
the scream is real
this internal server
doesn’t seem to have
a purpose
let’s turn it off!
uh … why did the
backbone of a
client’s network
just vanish?
oops.
Slide 77
Slide 77 text
@holly_cummins #RedHat
IT Department, UK Bank
let’s figure out what all
these cloud workloads are,
since I’m paying for them
long meetings
Slide 79
Slide 79 text
@holly_cummins #RedHat
long emails
Slide 80
Slide 80 text
@holly_cummins #RedHat
tags
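One way tags can help is as an ownership audit: anything without an owner or a stated purpose is a zombie candidate. This sketch assumes a tagging policy with `owner` and `purpose` keys and a made-up resource list; neither comes from the talk.

```python
REQUIRED_TAGS = {"owner", "purpose"}  # hypothetical tagging policy


def missing_tags(resources):
    """Map each non-compliant resource to the required tags it lacks."""
    report = {}
    for name, tags in resources.items():
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            report[name] = sorted(missing)
    return report


# Hypothetical cloud inventory: resource name -> its tags.
resources = {
    "payments-db": {"owner": "team-pay", "purpose": "production"},
    "test-cluster-7": {"owner": "alex"},
    "vm-0042": {},
}
audit = missing_tags(resources)
print(audit)
```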
Slide 81
Slide 81 text
@holly_cummins #RedHat
all the -opses
Slide 82
Slide 82 text
@holly_cummins #RedHat
GreenOps
Slide 83
Slide 83 text
@holly_cummins #RedHat
GreenOps
Greenops is a mid-sized trilobite (really)
Slide 84
Slide 84 text
@holly_cummins #RedHat
FinOps
figuring out who in your company forgot to turn off their cloud
21%
improvement from installing Turbonomic
in IBM CIO office
Slide 91
Slide 91 text
@holly_cummins #RedHat
traffic monitoring
Slide 92
Slide 92 text
@holly_cummins #RedHat
but.
knowing is only half the battle.
Slide 93
Slide 93 text
@holly_cummins #RedHat
the ikea effect
Slide 94
Slide 94 text
@holly_cummins #RedHat
the ikea effect
labour
Slide 96
Slide 96 text
@holly_cummins #RedHat
the ikea effect
labour → love
Slide 97
Slide 97 text
@holly_cummins #RedHat
shut it down?
but … what if I
need this
cluster later?
Slide 98
Slide 98 text
@holly_cummins #RedHat
elasticity
native quarkus starts
faster than a light bulb
Slide 99
Slide 99 text
@holly_cummins
#RedHat
ultimate elasticity
Slide 100
Slide 100 text
@holly_cummins #RedHat
we don’t switch the light off
because we’re not sure if it will
come back on
Slide 101
Slide 101 text
@holly_cummins #RedHat
we don’t switch the server off
because we’re not sure if it will come
back on
happens all the time
Slide 102
Slide 102 text
@holly_cummins #RedHat
we don’t switch the server off
because it would be too much work
to recreate it
happens all the time
Slide 103
Slide 103 text
@holly_cummins
#RedHat
Slide 104
Slide 104 text
@holly_cummins
#RedHat
Slide 105
Slide 105 text
@holly_cummins
#RedHat
turning it off and on again must
Slide 106
Slide 106 text
@holly_cummins
#RedHat
turning it off and on again must
• be fast
Slide 107
Slide 107 text
@holly_cummins
#RedHat
turning it off and on again must
• be fast
• actually work
Slide 108
Slide 108 text
@holly_cummins
#RedHat
turning it off and on again must
• be fast
• actually work
• be idempotent
Slide 109
Slide 109 text
@holly_cummins
#RedHat
turning it off and on again must
• be fast
• actually work
• be idempotent
• be resilient
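Idempotency in this context means that asking for "off" twice is as safe as asking once. A minimal sketch, assuming a hypothetical `Workload` class that tracks a desired state rather than issuing blind start and stop commands:

```python
class Workload:
    """Hypothetical stand-in for a server or application with an on/off state."""

    def __init__(self, name):
        self.name = name
        self.running = True
        self.transitions = 0  # counts real state changes only

    def ensure(self, running):
        """Converge on the desired state; do nothing if already there."""
        if self.running != running:
            self.running = running
            self.transitions += 1
        return self.running


demo = Workload("demo-cluster")
demo.ensure(running=False)
demo.ensure(running=False)  # repeating the request changes nothing
demo.ensure(running=True)
print(demo.running, demo.transitions)
```

Declaring the desired state instead of the action is what lets a script be re-run after a partial failure without making things worse.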
Slide 110
Slide 110 text
@holly_cummins
#RedHat
making turning servers off as safe and easy as turning lights off
Slide 111
Slide 111 text
@holly_cummins
#RedHat
LightSwitchOps
making turning servers off as safe and easy as turning lights off
Slide 112
Slide 112 text
@holly_cummins
#RedHat
simple scripts
we used to leave
our applications
running all the time
@darkandnerdy, Chicago DevOpsDays
Slide 113
Slide 113 text
@holly_cummins
#RedHat
simple scripts
we used to leave
our applications
running all the time
when we
scripted turning
them off at night,
we reduced our
cloud bill by
30%
@darkandnerdy, Chicago DevOpsDays
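A "simple script" of this kind mostly needs one decision: should the application be up right now? A sketch of that decision, assuming working hours of 08:00 to 18:00 on weekdays; the hours and the function name are assumptions, and the 30% saving quoted above is the speaker's figure, not something this code reproduces.

```python
from datetime import datetime, time

OPEN, CLOSE = time(8, 0), time(18, 0)  # assumed working hours


def should_be_running(now):
    """True during assumed working hours on weekdays, False otherwise."""
    is_weekday = now.weekday() < 5  # Monday is 0, Saturday is 5
    return is_weekday and OPEN <= now.time() < CLOSE


print(should_be_running(datetime(2023, 3, 29, 10, 0)))  # Wednesday morning
print(should_be_running(datetime(2023, 3, 29, 23, 0)))  # Wednesday night
print(should_be_running(datetime(2023, 4, 1, 10, 0)))   # Saturday morning
```

A scheduler such as cron could call a function like this and then start or stop the applications accordingly.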
Slide 114
Slide 114 text
@holly_cummins #RedHat
Slide 115
Slide 115 text
@holly_cummins #RedHat
GitOps
Slide 116
Slide 116 text
@holly_cummins #RedHat
GitOps
(infrastructure as code)
Slide 117
Slide 117 text
@holly_cummins #RedHat
Slide 118
Slide 118 text
@holly_cummins #RedHat
spin it down
Slide 119
Slide 119 text
@holly_cummins #RedHat
kubectl apply -f all-my-cluster/
spin it down
spin it up
Slide 121
Slide 121 text
@holly_cummins #RedHat
kubectl apply -f all-my-cluster/
ansible-playbook stuff.yml
spin it down
spin it up
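With everything defined as code, spinning down and up can be two invocations against the same manifests. The slide shows only `kubectl apply`; pairing it with `kubectl delete -f` for the spin-down is an assumption of this sketch, as are the function names. It prints the command by default rather than running it.

```python
import subprocess

MANIFEST_DIR = "all-my-cluster/"  # directory name taken from the slide


def kubectl_command(action, manifest_dir=MANIFEST_DIR):
    """Build the kubectl invocation for spinning the workloads up or down."""
    verbs = {"up": "apply", "down": "delete"}  # "delete" is this sketch's choice
    return ["kubectl", verbs[action], "-f", manifest_dir]


def spin(action, dry_run=True):
    """Print the command (dry run) or execute it against the current cluster."""
    command = kubectl_command(action)
    if dry_run:
        print(" ".join(command))
    else:
        subprocess.run(command, check=True)
    return command


spin("down")
spin("up")
```

Deleting resources discards any state that lives only in the cluster, so this only works when the manifests really are the source of truth.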
Slide 122
Slide 122 text
reducing snowflakes
reduces the need for redundancy
Slide 123
Slide 123 text
we need to have another
copy of our expensive cluster in
another region so we have
failover!
Slide 124
Slide 124 text
we need to have another
copy of our expensive cluster in
another region so we have
failover!
uh … sounds
expensive. are you
sure about that?
Slide 125
Slide 125 text
rapid recovery does not
require redundant servers
Slide 126
Slide 126 text
zombie reduction does
not need to be fancy
Slide 127
Slide 127 text
@holly_cummins #RedHat
large bank, 2013
50%
reduction in CPUs with a
lease system
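The talk does not describe how the bank's lease system worked. As an illustration of the general idea only, a lease gives a resource an expiry date, and anything not renewed becomes reclaimable. The class, the 14-day duration, and the names below are all hypothetical.

```python
from datetime import datetime, timedelta


class Lease:
    """Hypothetical lease: a resource is kept only while its lease is renewed."""

    def __init__(self, owner, granted, duration=timedelta(days=14)):
        self.owner = owner
        self.duration = duration
        self.expires = granted + duration

    def renew(self, now):
        """Extend the lease by one full duration from now."""
        self.expires = now + self.duration

    def expired(self, now):
        return now >= self.expires


def reclaimable(leases, now):
    """Names of resources whose leases have lapsed."""
    return sorted(name for name, lease in leases.items() if lease.expired(now))


leases = {
    "perf-test-rig": Lease("sam", granted=datetime(2023, 3, 1)),
    "team-sandbox": Lease("priya", granted=datetime(2023, 3, 1)),
}
leases["team-sandbox"].renew(now=datetime(2023, 3, 14))
lapsed = reclaimable(leases, now=datetime(2023, 3, 20))
print(lapsed)
```

The design choice is that keeping a resource requires a small, repeated action, while forgetting about it leads to release by default.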
Slide 129
Slide 129 text
things that (maybe) don’t help
Slide 130
Slide 130 text
@holly_cummins #RedHat
things that (maybe) don’t help
“out of sight, out of mind”
cloud
Slide 131
Slide 131 text
@holly_cummins #RedHat
Slide 132
Slide 132 text
@holly_cummins #RedHat
things that (maybe) don’t help
virtualisation
2019 survey
30%
of virtual servers doing
no useful work
Slide 133
Slide 133 text
@holly_cummins #RedHat
things that (maybe) don’t help
virtualisation
2019 survey
30%
of virtual servers doing
no useful work
50%
of virtual servers active
less than 5% of the time
Slide 134
Slide 134 text
#RedHat
you still need to remember to
turn the virtual machine off
Slide 135
Slide 135 text
what about serverless?
Slide 136
Slide 136 text
modernising to serverless is a big lift
Slide 137
Slide 137 text
may not suit latency-sensitive workloads
Slide 138
Slide 138 text
“we solve the cold-start problem by …
… keeping an instance running but not billing you”
Slide 139
Slide 139 text
@holly_cummins #RedHat
application
serverless systems may have high overheads
Slide 140
Slide 140 text
@holly_cummins #RedHat
control plane
application
serverless systems may have high overheads
Slide 143
Slide 143 text
https://hotcarbon.org/pdf/hotcarbon22-sharma.pdf
Slide 144
Slide 144 text
https://hotcarbon.org/pdf/hotcarbon22-sharma.pdf
virtualisation overheads
mean each function request
can use 30x more energy
than a plain http server
Slide 145
Slide 145 text
are all parts of the system elastic?
Slide 146
Slide 146 text
things that definitely don’t help
Slide 147
Slide 147 text
@holly_cummins #RedHat
things that don’t help
prevention
Slide 148
Slide 148 text
@holly_cummins #RedHat
things that don’t help
prevention (?!)
Slide 149
Slide 149 text
surely shutting the barn door before
the horse has left is a good idea?
Slide 150
Slide 150 text
prevention == heavy governance
Slide 151
Slide 151 text
remember the ikea effect?
Slide 152
Slide 152 text
remember the ikea effect?
people will not surrender
servers that were hard to get
Slide 153
Slide 153 text
zombies are not just servers
Slide 154
Slide 154 text
data
Slide 155
Slide 155 text
traffic
Slide 156
Slide 156 text
zombie packets
Slide 157
Slide 157 text
@holly_cummins #RedHat
internet background noise
Slide 158
Slide 158 text
@holly_cummins #RedHat
internet background noise
5.5 gigabits/s
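To give the rate a sense of scale, it can be converted to a daily volume. This is plain unit conversion, assuming decimal units (1 gigabit = 10^9 bits, 1 terabyte = 10^12 bytes) and assuming the rate is sustained for a full day.

```python
RATE_GIGABITS_PER_SECOND = 5.5  # figure from the slide
SECONDS_PER_DAY = 24 * 60 * 60

bits_per_day = RATE_GIGABITS_PER_SECOND * 1e9 * SECONDS_PER_DAY
terabytes_per_day = bits_per_day / 8 / 1e12  # 8 bits per byte, decimal terabytes

print(f"{terabytes_per_day:.1f} TB per day")
```

Under those assumptions the background noise amounts to roughly 59 terabytes a day.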
Slide 159
Slide 159 text
@holly_cummins #RedHat
unsolved problem == opportunity
Slide 160
Slide 160 text
@holly_cummins #RedHat
the double-win
turning things off saves a lot of money
Slide 161
Slide 161 text
@holly_cummins #RedHat
Slide 162
Slide 162 text
@holly_cummins #RedHat
users …
Slide 163
Slide 163 text
@holly_cummins #RedHat
up utilisation
aim for elasticity
limit kubesprawl
de-zombify
know what you’re using
turn it off
users …
Slide 164
Slide 164 text
@holly_cummins #RedHat
1-2%
Slide 165
Slide 165 text
@holly_cummins #RedHat
tool creators, support
1-2%