Holly Cummins
Red Hat
KubeCon | CloudNativeCon
April 3, 2025
Practical Zombie Hunting for Kubernetes Users
Slide 2
#RedHat
@hollycummins.com
Slide 5
Hey boss, I created a
Kubernetes cluster.
I forgot it for 2 months.
… and it’s €1000 a month.
2018
Slide 11
what do these servers do?
one is a backup for the other.
yes, but what do they do?
no one has known for a couple of decades
@therealmarkw1, twitter
Slide 12
“measure, don’t guess”
(or decide based on stories on the internet)
Slide 13
actual picture of a zombie
(it’s invisible)
Slide 15
2015 survey
30%
of 4,000 servers doing
no useful work
Slide 16
2017 survey
25%
of 16,000 servers doing
no useful work
Slide 17
zombie
“they haven't delivered any information or
computing services for six months or more”
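A minimal sketch of that definition, not taken from the talk: treat a server as a zombie when its last useful activity is six months or more in the past. The 182-day threshold, the `is_zombie` and `find_zombies` names, and the sample inventory are all assumptions for illustration. Getting a trustworthy "last useful activity" signal is the hard part and is not shown here.

```python
from datetime import datetime, timedelta

# Assumption: "six months" is approximated as 182 days.
ZOMBIE_AFTER = timedelta(days=182)


def is_zombie(last_useful_activity: datetime, now: datetime) -> bool:
    """True when nothing useful has been delivered for six months or more."""
    return now - last_useful_activity >= ZOMBIE_AFTER


def find_zombies(last_activity_by_server: dict, now: datetime) -> list:
    """Return the names of zombie servers, sorted for stable output."""
    return sorted(
        name
        for name, last_seen in last_activity_by_server.items()
        if is_zombie(last_seen, now)
    )


# Hypothetical inventory: server name -> last time it did useful work.
inventory = {
    "build-box": datetime(2025, 3, 1),
    "demo-cluster-2018": datetime(2018, 6, 1),
}
print(find_zombies(inventory, now=datetime(2025, 4, 3)))
```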
Slide 18
“comatose servers”
Slide 19
under-utilised servers
Slide 20
“we run this as a batch job on weekends,
but the servers stay up all week”
Slide 22
“we only use this system in UK working hours,
but we leave it running 24/7”
Slide 24
2014 survey
29%
of 4,000 active less than
5% of the time
https://www.anthesisgroup.com/wp-content/uploads/2019/11/Comatose-Servers-Redux-2017.pdf
Slide 25
the average server:
12-18% of capacity
30-60% of maximum power
https://www.nrdc.org/sites/default/files/data-center-efficiency-assessment-IB.pdf
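A back-of-the-envelope reading of those two ranges, as a sketch rather than a measurement: if a server does 12-18% of the work it could while drawing 30-60% of its maximum power, each unit of work costs more energy than it would on a fully loaded machine. The baseline (a fully loaded server drawing 100% power for 100% work) and the function name are assumptions for illustration.

```python
def relative_energy_per_unit_of_work(utilisation: float, power_fraction: float) -> float:
    """Energy per unit of work, relative to a fully loaded server (assumed baseline 1.0)."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return power_fraction / utilisation


# Most favourable and least favourable combinations of the quoted ranges.
best = relative_energy_per_unit_of_work(utilisation=0.18, power_fraction=0.30)
worst = relative_energy_per_unit_of_work(utilisation=0.12, power_fraction=0.60)
print(f"{best:.1f}x to {worst:.1f}x the energy per unit of work")
```

Under these assumptions the average server spends somewhere between roughly 1.7 and 5 times as much energy per unit of work as a busy one.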
Slide 28
$26.6 billion
wasted by always-on
cloud instances
https://www.business2community.com/cloud-computing/overprovisioning-always-on-resources-lead-to-26-6-billion-in-public-cloud-waste-expected-in-2021-02381033
2021 study
Slide 30
the big (green) picture
green principles
Slide 39
green software foundation: principles
carbon awareness: where, when
hardware efficiency
electricity efficiency: algorithms, stack (quarkus!)
“to avoid CRD conflicts, we use the
cluster as the unit of deployment”
Slide 70
auto-scaling algorithms are optimised for availability
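One place this bias shows up is the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: the desired count is the current count scaled by the ratio of observed to target metric, rounded up. The sketch below reimplements that arithmetic in plain Python. The function name and the 10% tolerance default are assumptions for illustration; this is not the autoscaler's own code.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """ceil(current_replicas * current_metric / target_metric), skipping small deviations."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: leave it alone
    return math.ceil(current_replicas * ratio)


# Busy: 4 replicas at 90% CPU against a 50% target.
print(desired_replicas(4, current_metric=90, target_metric=50))
# Quiet: 4 replicas at 20% CPU against a 50% target.
print(desired_replicas(4, current_metric=20, target_metric=50))
```

Because the result is rounded up, fractional answers always resolve towards more capacity, never less. Scale-down is usually dampened further by a stabilisation window, so the quiet case shrinks later than the busy case grows.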
Slide 72
application
utilisation
high utilisation
good case
Slide 74
application
utilisation
over-utilisation
very bad case
under-utilisation
wasteful case
Slide 75
application
elasticity
high utilisation
good case
@holly_cummins
Slide 76
application
elasticity
scale-up
good utilisation
Slide 77
application
elasticity
scale-down
good utilisation
Slide 79
how do we solve the zombie problem?
detection and destruction
Slide 81
system archaeology
… is not easy
Slide 83
IT Department, UK Bank
let’s figure out what all
these cloud workloads are,
since I’m paying for them
long meetings
zzzz
zzzzzzz
zz
zzzzzz
Slide 84
big spreadsheets
Slide 87
long emails
Slide 88
tags
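As a sketch of how tags can support system archaeology: if every resource is expected to carry an owner and a purpose, anything missing one is a candidate for a closer look. The required tag names, the inventory shape, and the function names below are hypothetical; a real version would read the inventory from a cloud provider or cluster API.

```python
# Hypothetical tagging policy: every resource should say who owns it and why it exists.
REQUIRED_TAGS = ("owner", "purpose")


def missing_tags(resource: dict) -> list:
    """Return the required tags that are absent or empty on one resource."""
    tags = resource.get("tags", {})
    return [tag for tag in REQUIRED_TAGS if not tags.get(tag)]


def audit(inventory: list) -> dict:
    """Map resource name -> missing tags, for resources that fail the policy."""
    findings = {}
    for resource in inventory:
        missing = missing_tags(resource)
        if missing:
            findings[resource["name"]] = missing
    return findings


# Hypothetical inventory for illustration only.
inventory = [
    {"name": "payments-prod", "tags": {"owner": "team-pay", "purpose": "production"}},
    {"name": "test-cluster-7", "tags": {"owner": "someone"}},
    {"name": "mystery-vm", "tags": {}},
]
print(audit(inventory))
```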
Slide 89
scream test
Slide 90
“eco-monkey”
Slide 95
the scream is real
this internal server doesn’t seem to have a purpose
let’s turn it off!
uh … why did the backbone of a client’s network just vanish?
oops.
Slide 96
all the —opses
Slide 98
GreenOps
greenops is a mid-sized trilobite (really)
Slide 99
FinOps
figuring out who in your company forgot to turn off their cloud
but.
detecting is only half the battle.
Slide 109
the ikea effect
labour love
Slide 110
shut it down?
but … what if I
need this
cluster later?
Slide 111
ultimate elasticity
Slide 112
we don’t switch the light off
because we’re not sure if it will
come back on
Slide 113
we don’t switch the server off
because we’re not sure if it will come
back on
happens all the time
Slide 114
we don’t switch the server off
because it would be too much work
to recreate it
happens all the time
Slide 121
turning it off and on again must
• be fast
• actually work
• be idempotent
• be resilient
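A minimal sketch of the idempotency requirement in that list: "make sure it is on" and "make sure it is off" should be safe to run any number of times, and should act only when the actual state differs from the desired one. `FakeCloud` and `ensure` are hypothetical stand-ins for a real provider API, used here so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """In-memory stand-in for a provider API (hypothetical, for illustration)."""
    running: set = field(default_factory=set)
    calls: list = field(default_factory=list)

    def start(self, name: str) -> None:
        self.calls.append(("start", name))
        self.running.add(name)

    def stop(self, name: str) -> None:
        self.calls.append(("stop", name))
        self.running.discard(name)


def ensure(cloud: FakeCloud, name: str, should_run: bool) -> str:
    """Converge one server to the desired state; repeating the call is a no-op."""
    is_running = name in cloud.running
    if should_run and not is_running:
        cloud.start(name)
        return "started"
    if not should_run and is_running:
        cloud.stop(name)
        return "stopped"
    return "unchanged"


cloud = FakeCloud()
print(ensure(cloud, "demo", should_run=True))   # first call does the work
print(ensure(cloud, "demo", should_run=True))   # second call changes nothing
print(ensure(cloud, "demo", should_run=False))  # switching off is just as safe
```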
Slide 123
LightSwitchOps
making turning servers off as safe and easy as turning lights off
Slide 124
step 1: no more scary state
Slide 125
GitOps
(infrastructure as code)
Slide 131
kubectl apply -f all-my-cluster/
ansible-playbook stuff.yml
spin it down
spin it up
Slide 132
step 1: no more scary state
step 2: automate, automate
Slide 133
zombie reduction does
not need to be fancy
Slide 134
large UK bank, 2013
50%
reduction in CPUs with a
lease system
self-destructing instances
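The slide does not say how the bank's lease system worked, so the following is only one possible shape for the idea: every instance gets an expiry time, owners who still need it renew, and anything past its expiry is handed to a reaper. The `Lease` class, `split_expired`, the 30-day duration, and the sample data are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Lease:
    instance: str
    owner: str
    expires_at: datetime

    def renew(self, now: datetime, duration: timedelta) -> None:
        """The owner confirms the instance is still needed."""
        self.expires_at = now + duration


def split_expired(leases: list, now: datetime) -> tuple:
    """Return (live, expired); expired instances are candidates for destruction."""
    live = [lease for lease in leases if lease.expires_at > now]
    expired = [lease for lease in leases if lease.expires_at <= now]
    return live, expired


# Hypothetical leases: one gets renewed, one is forgotten.
start = datetime(2025, 1, 1)
leases = [
    Lease("reporting-vm", "ana", start + timedelta(days=30)),
    Lease("old-poc", "nobody-remembers", start + timedelta(days=30)),
]
leases[0].renew(now=start + timedelta(days=29), duration=timedelta(days=30))

live, expired = split_expired(leases, now=start + timedelta(days=31))
print([lease.instance for lease in expired])
```

The design choice is that keeping something alive takes a small deliberate action, while forgetting about it leads to cleanup by default.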
Slide 137
timed shutoff
we used to leave
our applications
running all the time
when we
scripted turning
them off at night,
we reduced our
cloud bill by
30%
@darkandnerdy, Chicago DevOpsDays
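A sketch of what a timed shutoff decides, and how much of the week it can cover. The working hours (08:00 to 18:00, Monday to Friday) and the function names are assumptions for illustration, and the calculation ignores time zones and holidays. Actual savings depend on what is billed by the hour, so this is not a reconstruction of the 30% figure quoted above.

```python
from datetime import datetime

WORKDAYS = range(0, 5)  # Monday=0 ... Friday=4


def should_be_running(now: datetime, start_hour: int = 8, end_hour: int = 18) -> bool:
    """True during the assumed working hours, False at night and on weekends."""
    return now.weekday() in WORKDAYS and start_hour <= now.hour < end_hour


def fraction_of_week_running(start_hour: int = 8, end_hour: int = 18, days: int = 5) -> float:
    """Share of the 168-hour week the system stays up under this schedule."""
    return (end_hour - start_hour) * days / (24 * 7)


print(should_be_running(datetime(2025, 4, 3, 10, 0)))   # Thursday morning
print(should_be_running(datetime(2025, 4, 5, 10, 0)))   # Saturday morning
print(f"{fraction_of_week_running():.0%} of the week")
```

A scheduler (cron, a CI job, or similar) could call `should_be_running` periodically and converge the environment to the answer.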
Slide 138
my shell script to power
down machines overnight
saved my school €12,000
absurdly simple timed shutoff
Slide 139
fancier scripts …
with a frontend and a backend
Slide 140
commercial products
Slide 141
Open Source utilization optimisation
- Kruize Autotune
- PEAKS (Power Efficiency Aware Kubernetes Scheduler)
- OpenShift Cost Management
21%
cost savings from installing Turbonomic
in IBM CIO office
Slide 145
things that (maybe) don’t help
“out of sight, out of mind”
cloud
Slide 147
things that (maybe) don’t help
virtualisation
2019 survey
30%
of virtual servers doing
no useful work
50%
of virtual servers active
less than 5% of the time
Slide 148
you still need to remember to
turn the virtual machine off
Slide 149
what about serverless?
Slide 150
modernising to serverless is a big lift
Slide 151
may not suit latency-sensitive workloads
Slide 152
“we solve the cold-start problem by …
… keeping an instance running but not billing you”
Slide 156
control plane
application
serverless systems may have high overheads
Slide 157
https://hotcarbon.org/pdf/hotcarbon22-sharma.pdf
virtualisation overheads
mean each function request
can use 30x more energy
than a plain http server
Slide 158
things that definitely don’t help
Slide 160
things that don’t help
prevention (?!)
Slide 161
surely shutting the barn door before
the horse has left is a good idea?
Slide 162
prevention == heavy governance
Slide 164
remember the ikea effect?
people will not surrender
servers that were hard to get
Slide 165
unsolved problem == opportunity
Slide 166
the double-win
turning things off saves a lot of money
Slide 169
users …
up utilisation
aim for elasticity
limit kubesprawl
de-zombify
know what you’re using
turn it off
Slide 171
tool creators, support
1-2%