Slide 1

Slide 1 text

Practical Zombie Hunting for Kubernetes Users. Holly Cummins, Red Hat. KubeCon | CloudNativeCon, April 3, 2025.

Slide 2

Slide 2 text

#RedHat @hollycummins.com

Slide 3

Slide 3 text

Hey boss, I created a Kubernetes cluster. (2018)

Slide 4

Slide 4 text

Hey boss, I created a Kubernetes cluster. I forgot it for 2 months. (2018)

Slide 5

Slide 5 text

Hey boss, I created a Kubernetes cluster. I forgot it for 2 months. … and it’s €1000 a month. (2018)

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

@therealmarkw1, twitter

Slide 8

Slide 8 text

what do these servers do? @therealmarkw1, twitter

Slide 9

Slide 9 text

what do these servers do? one is a backup for the other. @therealmarkw1, twitter

Slide 10

Slide 10 text

what do these servers do? one is a backup for the other. yes, but what do they do? @therealmarkw1, twitter

Slide 11

Slide 11 text

what do these servers do? one is a backup for the other. yes, but what do they do? no one has known for a couple of decades @therealmarkw1, twitter

Slide 12

Slide 12 text

“measure, don’t guess” (or decide based on stories on the internet)

Slide 13

Slide 13 text

actual picture of a zombie (it’s invisible)

Slide 14

Slide 14 text

actual picture of a zombie (it’s invisible)

Slide 15

Slide 15 text

2015 survey: 30% of 4,000 servers doing no useful work

Slide 16

Slide 16 text

2017 survey: 25% of 16,000 servers doing no useful work

Slide 17

Slide 17 text

zombie: “they haven't delivered any information or computing services for six months or more”
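
A minimal sketch of the six-month rule quoted on this slide, assuming each server record carries a timestamp of its last useful activity. The names `is_zombie` and `last_useful_activity`, and the choice of 182 days as "six months", are illustrative assumptions rather than part of the survey's methodology.

```python
from datetime import datetime, timedelta

# Assumption: "six months" is approximated as 182 days.
ZOMBIE_THRESHOLD = timedelta(days=182)


def is_zombie(last_useful_activity: datetime, now: datetime) -> bool:
    """Return True if nothing useful has been delivered for six months or more."""
    return now - last_useful_activity >= ZOMBIE_THRESHOLD


now = datetime(2025, 4, 3)
print(is_zombie(datetime(2024, 6, 1), now))  # idle for roughly ten months
print(is_zombie(datetime(2025, 3, 1), now))  # active last month
```

The hard part in practice is deciding what counts as "useful activity"; the function only encodes the time threshold.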

Slide 18

Slide 18 text

“comatose servers”

Slide 19

Slide 19 text

under-utilised servers

Slide 20

Slide 20 text

“we run this as a batch job on weekends, but the servers stay up all week”

Slide 21

Slide 21 text

“we run this as a batch job on weekends, but the servers stay up all week”

Slide 22

Slide 22 text

“we only use this system in UK working hours, but we leave it running 24/7”

Slide 23

Slide 23 text

“we only use this system in UK working hours, but we leave it running 24/7”

Slide 24

Slide 24 text

2014 survey: 29% of 4,000 servers active less than 5% of the time https://www.anthesisgroup.com/wp-content/uploads/2019/11/Comatose-Servers-Redux-2017.pdf
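
A minimal sketch of how an "active less than 5% of the time" check might look, assuming utilisation is available as a list of CPU samples between 0.0 and 1.0. The 5% figure comes from the slide; the per-sample `busy_threshold` that decides whether a sample counts as "active", and the function names, are assumptions.

```python
def active_fraction(cpu_samples: list[float], busy_threshold: float = 0.05) -> float:
    """Fraction of samples whose CPU utilisation exceeds busy_threshold."""
    if not cpu_samples:
        return 0.0
    busy = sum(1 for sample in cpu_samples if sample > busy_threshold)
    return busy / len(cpu_samples)


def is_mostly_idle(cpu_samples: list[float], max_active_fraction: float = 0.05) -> bool:
    """True if the server was active for less than max_active_fraction of the samples."""
    return active_fraction(cpu_samples) < max_active_fraction


weekend_batch_box = [0.9] * 3 + [0.01] * 97   # busy in 3 of 100 samples
web_server = [0.9] * 50 + [0.01] * 50         # busy half the time
print(is_mostly_idle(weekend_batch_box))
print(is_mostly_idle(web_server))
```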

Slide 25

Slide 25 text

the average server: 12-18% of capacity, 30-60% of maximum power https://www.nrdc.org/sites/default/files/data-center-efficiency-assessment-IB.pdf
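
A back-of-the-envelope calculation shows why the two ranges matter together: a lightly loaded server still draws a large share of its maximum power, so each unit of work costs more. The sketch below assumes a fully loaded server draws its maximum power and uses the midpoints and corners of the slide's ranges as inputs; those choices are illustrative, not figures from the NRDC report.

```python
def relative_power_per_unit_work(utilisation: float, power_fraction: float) -> float:
    """Power per unit of work, relative to a server at 100% load drawing 100% power."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return power_fraction / utilisation


print(round(relative_power_per_unit_work(0.15, 0.45), 1))  # midpoints of both ranges
print(round(relative_power_per_unit_work(0.12, 0.60), 1))  # least favourable corner
print(round(relative_power_per_unit_work(0.18, 0.30), 1))  # most favourable corner
```

Under these assumptions the midpoint case works out to roughly three times the power per unit of work of a fully loaded server.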

Slide 26

Slide 26 text

2021 study https://www.business2community.com/cloud-computing/overprovisioning-always-on-resources-lead-to-26-6-billion-in-public-cloud-waste-expected-in-2021-02381033

Slide 27

Slide 27 text

2021 study: $26.6 billion https://www.business2community.com/cloud-computing/overprovisioning-always-on-resources-lead-to-26-6-billion-in-public-cloud-waste-expected-in-2021-02381033

Slide 28

Slide 28 text

2021 study: $26.6 billion wasted by always-on cloud instances https://www.business2community.com/cloud-computing/overprovisioning-always-on-resources-lead-to-26-6-billion-in-public-cloud-waste-expected-in-2021-02381033

Slide 29

Slide 29 text

the big (green) picture

Slide 30

Slide 30 text

the big (green) picture: green principles

Slide 31

Slide 31 text

green software foundation: principles

Slide 32

Slide 32 text

green software foundation principles: carbon awareness

Slide 33

Slide 33 text

green software foundation principles: carbon awareness, where

Slide 34

Slide 34 text

green software foundation principles: carbon awareness, where, when

Slide 35

Slide 35 text

green software foundation principles: carbon awareness, hardware efficiency, where, when

Slide 36

Slide 36 text

green software foundation principles: carbon awareness, hardware efficiency, electricity efficiency, where, when

Slide 37

Slide 37 text

green software foundation principles: algorithms, carbon awareness, hardware efficiency, electricity efficiency, where, when

Slide 38

Slide 38 text

green software foundation principles: algorithms, stack, carbon awareness, hardware efficiency, electricity efficiency, where, when

Slide 39

Slide 39 text

green software foundation principles: algorithms, stack, carbon awareness, hardware efficiency, electricity efficiency, where, when, quarkus!

Slide 40

Slide 40 text

green software foundation principles: carbon awareness, hardware efficiency, electricity efficiency

Slide 41

Slide 41 text

green software foundation principles: carbon awareness, hardware efficiency, electricity efficiency, elasticity

Slide 42

Slide 42 text

green software foundation principles: carbon awareness, hardware efficiency, electricity efficiency, elasticity, utilisation

Slide 43

Slide 43 text

it’s not just money

Slide 44

Slide 44 text

it’s not just money; it’s not just electricity

Slide 45

Slide 45 text

it’s not just money; it’s not just electricity; it’s water

Slide 46

Slide 46 text

it’s not just money; it’s not just electricity; it’s water; it’s e-waste

Slide 47

Slide 47 text

it’s not just money; it’s not just electricity; it’s water; it’s e-waste; it’s embodied carbon

Slide 48

Slide 48 text

why does this happen?

Slide 49

Slide 49 text

why does this happen?

Slide 50

Slide 50 text

why does this happen? forgetfulness

Slide 51

Slide 51 text

why does this happen? forgetfulness, laziness

Slide 52

Slide 52 text

why does this happen? forgetfulness, laziness, fear

Slide 53

Slide 53 text

why does this happen?

Slide 54

Slide 54 text

why does this happen? lack of institutional memory

Slide 55

Slide 55 text

why does this happen? lack of institutional memory, competing priorities

Slide 56

Slide 56 text

why does this happen? lack of institutional memory, competing priorities, risk aversion

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

“perhaps someone forgot to turn them off” (Anthesis Institute)

Slide 59

Slide 59 text

managing machines is hard

Slide 60

Slide 60 text

managing machines is hard

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

projects ended

Slide 63

Slide 63 text

projects ended; business processes changed

Slide 64

Slide 64 text

risk-averse processes

Slide 65

Slide 65 text

No content

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

over-provisioning

Slide 68

Slide 68 text

over-provisioning; isolation requirements

Slide 69

Slide 69 text

“to avoid CRD conflicts, we use the cluster as the unit of deployment”

Slide 70

Slide 70 text

auto-scaling algorithms are optimised for availability

Slide 71

Slide 71 text

application utilisation

Slide 72

Slide 72 text

application utilisation: high utilisation (good case)

Slide 73

Slide 73 text

application utilisation: over-utilisation (very bad case)

Slide 74

Slide 74 text

application utilisation: over-utilisation (very bad case), under-utilisation (wasteful case)

Slide 75

Slide 75 text

application elasticity: high utilisation (good case)

Slide 76

Slide 76 text

application elasticity: scale-up, good utilisation

Slide 77

Slide 77 text

application elasticity: scale-down, good utilisation

Slide 78

Slide 78 text

how do we solve the zombie problem?

Slide 79

Slide 79 text

how do we solve the zombie problem? detection and destruction

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

system archaeology … is not easy

Slide 82

Slide 82 text

long meetings. IT Department, UK Bank: “let’s figure out what all these cloud workloads are, since I’m paying for them”

Slide 83

Slide 83 text

long meetings. IT Department, UK Bank: “let’s figure out what all these cloud workloads are, since I’m paying for them” zzzz zzzzzzz zz zzzzzz

Slide 84

Slide 84 text

big spreadsheets

Slide 85

Slide 85 text

big spreadsheets

Slide 86

Slide 86 text

big spreadsheets

Slide 87

Slide 87 text

long emails

Slide 88

Slide 88 text

tags
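
One way tags can support detection: if every workload is expected to carry an owner and an expiry, anything missing them is a candidate for investigation. A minimal sketch over plain dictionaries; the tag keys `owner` and `expires` and the sample inventory are hypothetical, and a real inventory would come from a cloud provider or cluster API.

```python
REQUIRED_TAGS = ("owner", "expires")  # hypothetical tagging policy


def untagged(resources: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each non-compliant resource to the required tags it is missing."""
    report = {}
    for name, tags in resources.items():
        missing = [tag for tag in REQUIRED_TAGS if not tags.get(tag)]
        if missing:
            report[name] = missing
    return report


inventory = {
    "payments-cluster": {"owner": "team-payments", "expires": "2025-12-31"},
    "test-cluster-7": {"owner": "someone-who-left"},
    "mystery-vm": {},
}
print(untagged(inventory))
```

A tag only says who claimed a resource, not whether it is still doing useful work, so a report like this is a starting point for questions rather than a kill list.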

Slide 89

Slide 89 text

scream test

Slide 90

Slide 90 text

“eco-monkey”

Slide 91

Slide 91 text

the scream is real

Slide 92

Slide 92 text

the scream is real: “this internal server doesn’t seem to have a purpose”

Slide 93

Slide 93 text

the scream is real: “this internal server doesn’t seem to have a purpose” “let’s turn it off!”

Slide 94

Slide 94 text

the scream is real: “this internal server doesn’t seem to have a purpose” “let’s turn it off!” “uh … why did the backbone of a client’s network just vanish?”

Slide 95

Slide 95 text

the scream is real: “this internal server doesn’t seem to have a purpose” “let’s turn it off!” “uh … why did the backbone of a client’s network just vanish?” “oops.”

Slide 96

Slide 96 text

all the -opses

Slide 97

Slide 97 text

GreenOps

Slide 98

Slide 98 text

GreenOps: greenops is a mid-sized trilobite (really)

Slide 99

Slide 99 text

FinOps: figuring out who in your company forgot to turn off their cloud

Slide 100

Slide 100 text

image credit: RedMonk

Slide 101

Slide 101 text

backstage.io (image credit: RedMonk)

Slide 102

Slide 102 text

backstage.io
• cost insights plugin
image credit: RedMonk

Slide 103

Slide 103 text

backstage.io
• cost insights plugin
• cloud carbon footprint plugin
image credit: RedMonk

Slide 104

Slide 104 text

traffic monitoring

Slide 105

Slide 105 text

but. detecting is only half the battle.

Slide 106

Slide 106 text

the ikea effect

Slide 107

Slide 107 text

the ikea effect: labour

Slide 108

Slide 108 text

the ikea effect: labour

Slide 109

Slide 109 text

the ikea effect: labour, love

Slide 110

Slide 110 text

shut it down? but … what if I need this cluster later?

Slide 111

Slide 111 text

ultimate elasticity

Slide 112

Slide 112 text

we don’t switch the light off because we’re not sure if it will come back on

Slide 113

Slide 113 text

we don’t switch the server off because we’re not sure if it will come back on (happens all the time)

Slide 114

Slide 114 text

we don’t switch the server off because it would be too much work to recreate it (happens all the time)

Slide 115

Slide 115 text

No content

Slide 116

Slide 116 text

No content

Slide 117

Slide 117 text

turning it off and on again must

Slide 118

Slide 118 text

turning it off and on again must
• be fast

Slide 119

Slide 119 text

turning it off and on again must
• be fast
• actually work

Slide 120

Slide 120 text

turning it off and on again must
• be fast
• actually work
• idempotency

Slide 121

Slide 121 text

turning it off and on again must
• be fast
• actually work
• idempotency
• resiliency

Slide 122

Slide 122 text

making turning servers off as safe and easy as turning lights off

Slide 123

Slide 123 text

LightSwitchOps: making turning servers off as safe and easy as turning lights off

Slide 124

Slide 124 text

step 1: no more scary state

Slide 125

Slide 125 text

GitOps (infrastructure as code)

Slide 126

Slide 126 text

GitOps (infrastructure as code)

Slide 127

Slide 127 text

No content

Slide 128

Slide 128 text

spin it down

Slide 129

Slide 129 text

spin it down. spin it up: kubectl apply -f all-my-cluster/

Slide 130

Slide 130 text

spin it down. spin it up: kubectl apply -f all-my-cluster/

Slide 131

Slide 131 text

spin it down. spin it up: kubectl apply -f all-my-cluster/, ansible-playbook stuff.yml
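
A minimal sketch of wrapping both directions in one script, so that spinning down is as routine as spinning up. It assumes the manifests in `all-my-cluster/` fully describe the workload, and that spinning down means running `kubectl delete -f` on the same directory; the slide only shows the `apply` direction, so the delete step and the function names are assumptions. With `dry_run=True` the command is returned but not executed.

```python
import subprocess

MANIFEST_DIR = "all-my-cluster/"  # directory name taken from the slide


def kubectl_command(action: str, manifest_dir: str = MANIFEST_DIR) -> list[str]:
    """Build the kubectl invocation for 'up' (apply) or 'down' (delete)."""
    verbs = {"up": "apply", "down": "delete"}
    if action not in verbs:
        raise ValueError(f"unknown action: {action!r}")
    return ["kubectl", verbs[action], "-f", manifest_dir]


def spin(action: str, dry_run: bool = True) -> list[str]:
    """Return the command, and run it unless dry_run is set."""
    cmd = kubectl_command(action)
    if not dry_run:
        subprocess.run(cmd, check=True)  # needs kubectl and cluster access
    return cmd


print(" ".join(spin("down")))
print(" ".join(spin("up")))
```

Deleting and re-applying only round-trips cleanly if nothing important lives outside the manifests, which is the "no more scary state" precondition.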

Slide 132

Slide 132 text

step 1: no more scary state; step 2: automate, automate

Slide 133

Slide 133 text

zombie reduction does not need to be fancy

Slide 134

Slide 134 text

self-destructing instances. large UK bank, 2013: 50% reduction in CPUs with a lease system

Slide 135

Slide 135 text

self-destructing instances. large UK bank, 2013: 50% reduction in CPUs with a lease system
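
A minimal sketch of what a lease system could look like: every instance gets an expiry, owners must actively renew, and anything that lapses becomes eligible for reclamation. This is an illustration of the general idea only; the `Lease` class, its fields, and the sample data are hypothetical and do not describe the bank's actual system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Lease:
    instance: str
    owner: str
    expires_at: datetime

    def renew(self, now: datetime, duration: timedelta) -> None:
        """The owner asks for more time; doing nothing lets the lease lapse."""
        self.expires_at = now + duration


def expired_instances(leases: list[Lease], now: datetime) -> list[str]:
    """Instances whose lease has run out and can be reclaimed."""
    return [lease.instance for lease in leases if lease.expires_at <= now]


now = datetime(2025, 4, 3)
leases = [
    Lease("demo-env", "alice", now - timedelta(days=1)),
    Lease("perf-test", "bob", now - timedelta(days=1)),
]
leases[1].renew(now, timedelta(days=14))  # bob still needs his instance
print(expired_instances(leases, now))
```

The default flips from "keep forever unless someone deletes it" to "delete unless someone asks to keep it", which removes the need for anyone to remember.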

Slide 136

Slide 136 text

timed shutoff: “we used to leave our applications running all the time” (@darkandnerdy, Chicago DevOpsDays)

Slide 137

Slide 137 text

timed shutoff: “we used to leave our applications running all the time. when we scripted turning them off at night, we reduced our cloud bill by 30%” (@darkandnerdy, Chicago DevOpsDays)

Slide 138

Slide 138 text

absurdly simple timed shutoff: “my shell script to power down machines overnight saved my school €12,000”
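
A minimal sketch of the decision logic behind a timed shutoff, assuming working hours of 08:00 to 18:00 on weekdays in the machine's local time. Those hours and the function names are assumptions; the slide describes a shell script, and this Python version only illustrates the rule that a scheduler such as cron could evaluate before scaling a workload up or down.

```python
from datetime import datetime, time

WORK_START = time(8, 0)   # assumed working hours
WORK_END = time(18, 0)


def should_be_running(now: datetime) -> bool:
    """True on weekdays between WORK_START and WORK_END, False otherwise."""
    is_weekday = now.weekday() < 5  # Monday is 0, Saturday is 5
    return is_weekday and WORK_START <= now.time() < WORK_END


def desired_replicas(now: datetime, normal_replicas: int) -> int:
    """Scale to zero outside working hours, otherwise keep the normal count."""
    return normal_replicas if should_be_running(now) else 0


print(desired_replicas(datetime(2025, 4, 3, 10, 0), 3))  # Thursday morning
print(desired_replicas(datetime(2025, 4, 3, 23, 0), 3))  # Thursday night
print(desired_replicas(datetime(2025, 4, 5, 10, 0), 3))  # Saturday morning
```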

Slide 139

Slide 139 text

fancier scripts … with a frontend and a backend

Slide 140

Slide 140 text

commercial products

Slide 141

Slide 141 text

Open Source utilisation optimisation
- Kruize Autotune
- PEAKS (Power Efficiency Aware Kubernetes Scheduler)
- OpenShift Cost Management

Slide 142

Slide 142 text

Commercial AIOps
- Densify
- Granulate
- Turbonomic Application Resource Management
- TSO Logic
- etc

Slide 143

Slide 143 text

21% cost savings from installing Turbonomic in IBM CIO office

Slide 144

Slide 144 text

things that (maybe) don’t help

Slide 145

Slide 145 text

things that (maybe) don’t help: “out of sight, out of mind” cloud

Slide 146

Slide 146 text

things that (maybe) don’t help: virtualisation. 2019 survey: 30% of virtual servers doing no useful work

Slide 147

Slide 147 text

things that (maybe) don’t help: virtualisation. 2019 survey: 30% of virtual servers doing no useful work; 50% of virtual servers active less than 5% of the time

Slide 148

Slide 148 text

you still need to remember to turn the virtual machine off

Slide 149

Slide 149 text

what about serverless?

Slide 150

Slide 150 text

modernising to serverless is a big lift

Slide 151

Slide 151 text

may not suit latency-sensitive workloads

Slide 152

Slide 152 text

“we solve the cold-start problem by … keeping an instance running but not billing you”

Slide 153

Slide 153 text

serverless systems may have high overheads: application

Slide 154

Slide 154 text

serverless systems may have high overheads: control plane, application

Slide 155

Slide 155 text

serverless systems may have high overheads: control plane, application

Slide 156

Slide 156 text

serverless systems may have high overheads: control plane, application

Slide 157

Slide 157 text

virtualisation overheads mean each function request can use 30x more energy than a plain HTTP server https://hotcarbon.org/pdf/hotcarbon22-sharma.pdf

Slide 158

Slide 158 text

things that definitely don’t help

Slide 159

Slide 159 text

things that don’t help: prevention

Slide 160

Slide 160 text

things that don’t help: prevention (?!)

Slide 161

Slide 161 text

surely shutting the barn door before the horse has left is a good idea?

Slide 162

Slide 162 text

prevention == heavy governance

Slide 163

Slide 163 text

remember the ikea effect?

Slide 164

Slide 164 text

remember the ikea effect? people will not surrender servers that were hard to get

Slide 165

Slide 165 text

unsolved problem == opportunity

Slide 166

Slide 166 text

the double-win: turning things off saves a lot of money

Slide 167

Slide 167 text

No content

Slide 168

Slide 168 text

users …

Slide 169

Slide 169 text

users: up utilisation, aim for elasticity, limit kubesprawl, de-zombify, know what you’re using, turn it off

Slide 170

Slide 170 text

1-2%

Slide 171

Slide 171 text

tool creators, support 1-2%

Slide 172

Slide 172 text

tool creators, support: better utilisation, elasticity, multi-tenancy, de-zombification, visibility, disposability. 1-2%

Slide 173

Slide 173 text

thank you slides @hollycummins.com