Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Wildfires, Firefighters and Sustainability: Learnings from Mitigating Kubernetes Fires in the Community
Nabarun Pal & Madhav Jivrajani, VMware

Slide 3

Slide 3 text

Code of Conduct
Remember the Golden Rule: treat others as you would want to be treated, with kindness and respect.
Scan the QR code to access and review the CNCF Code of Conduct.

Slide 4

Slide 4 text

Virtual Audience Closed Captioning
Closed captioning for the virtual audience is available during each session through Wordly. The Wordly functionality can be found under the “Translations” tab on the session page. Wordly will default to English. If another language is needed, simply click the dropdown at the bottom of the “Translations” tab and choose from one of the 26+ available languages so you don’t miss a beat from our presenters.
*Note: Closed captioning is ONLY available during the scheduled live sessions and will not be available for the on-demand recordings within the virtual conference platform.

Slide 5

Slide 5 text

Who Are We?
Madhav Jivrajani (@MadhavJivrajani): Kubernetes SIG ContribEx Technical Lead
Nabarun Pal (@theonlynabarun): Kubernetes Steering Committee / SIG ContribEx Chair

Slide 6

Slide 6 text

Before We Start… @MadhavJivrajani & @theonlynabarun

Slide 7

Slide 7 text

registry.k8s.io is GA! 🎉
🚨❄ k8s.gcr.io is frozen ❄🚨
More info at https://k8s.io/image-registry-redirect
Also see: “k8s.gcr.io Redirect to registry.k8s.io - What You Need to Know”

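For anyone updating manifests, the migration mostly amounts to swapping the registry host in image references. A minimal sketch in Go, assuming the image path after the host is identical on both registries (check the linked page for the images you actually use); `rewriteImage` is a hypothetical helper, not part of any Kubernetes tooling, and the example tags are illustrative only:

```go
package main

import (
	"fmt"
	"strings"
)

const (
	frozenRegistry = "k8s.gcr.io/"      // legacy registry, now frozen
	activeRegistry = "registry.k8s.io/" // GA replacement
)

// rewriteImage is a hypothetical helper: it replaces the k8s.gcr.io host with
// registry.k8s.io and leaves every other image reference unchanged.
// Matching on the trailing slash avoids rewriting look-alike hosts.
func rewriteImage(ref string) string {
	if strings.HasPrefix(ref, frozenRegistry) {
		return activeRegistry + strings.TrimPrefix(ref, frozenRegistry)
	}
	return ref
}

func main() {
	// Illustrative references only.
	for _, ref := range []string{"k8s.gcr.io/pause:3.6", "docker.io/library/busybox:1.36"} {
		fmt.Println(ref, "->", rewriteImage(ref))
	}
}
```

Only references that start with `k8s.gcr.io/` change; images from other registries pass through untouched.
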
Slide 8

Slide 8 text

Agenda
● Timeline of a Kubernetes Release
● Introduction and setting the context
● Why were the releases delayed?
● What went right?
● What could be done better?
● Takeaways

Slide 9

Slide 9 text

Prelude: Timeline of a Kubernetes Release
Cadence: every ~4 months

Slide 10

Slide 10 text

Prelude: Timeline of a Kubernetes Release
Elaborate song and dance of people and processes

Slide 11

Slide 11 text

Prelude: Timeline of a Kubernetes Release
Emeritus Adviser
Release Lead, Branch Manager, Bug Triage, CI Signal, Comms, Docs, Enhancements, Release Notes
Release Lead Shadows, Branch Manager Shadow, Bug Triage Shadows, CI Signal Shadows, Comms Shadows, Docs Shadows, Enhancements Shadows, Release Notes Shadows

Slide 12

Slide 12 text

Prelude: Timeline of a Kubernetes Release

Slide 13

Slide 13 text

Prelude: Timeline of a Kubernetes Release

Slide 14

Slide 14 text

Prelude: Timeline of a Kubernetes Release

Slide 15

Slide 15 text

Prelude: Timeline of a Kubernetes Release

Slide 16

Slide 16 text

Prelude: Timeline of a Kubernetes Release

Slide 17

Slide 17 text

Prelude: Timeline of a Kubernetes Release

Slide 18

Slide 18 text

Prelude: Timeline of a Kubernetes Release

Slide 19

Slide 19 text

Prelude: Timeline of a Kubernetes Release

Slide 20

Slide 20 text

Prelude: Timeline of a Kubernetes Release

Slide 21

Slide 21 text

Prelude: Timeline of a Kubernetes Release

Slide 22

Slide 22 text

Wildfires, Firefighters and Sustainability: Learnings from Mitigating Kubernetes Fires in the Community

Slide 23

Slide 23 text

Wildfires, Firefighters and Sustainability: Learnings from Mitigating Kubernetes Fires in the Community

Slide 24

Slide 24 text

Wildfires, Firefighters and Sustainability: Learnings from Mitigating Kubernetes Fires in the Community

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

Wildfires, Firefighters and Sustainability: Learnings from Mitigating Kubernetes Fires in the Community

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Usually, release-blockers tend to happen towards the end of a release, but not necessarily:

Slide 35

Slide 35 text

Usually, release-blockers tend to happen towards the end of a release, but not necessarily:

Slide 36

Slide 36 text

Wildfires, Firefighters and Sustainability: Learnings from Mitigating Kubernetes Fires in the Community

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

Typical Flow of Fighting A Wildfire

Slide 44

Slide 44 text

Typical Flow of Fighting A Wildfire

Slide 45

Slide 45 text

Typical Flow of Fighting A Wildfire

Slide 46

Slide 46 text

Typical Flow of Fighting A Wildfire

Slide 47

Slide 47 text

Typical Flow of Fighting A Wildfire

Slide 48

Slide 48 text

Typical Flow of Fighting A Wildfire

Slide 49

Slide 49 text

Typical Flow of Fighting A Wildfire

Slide 50

Slide 50 text

Typical Flow of Fighting A Wildfire
Data for release-blockers for releases 1.24 - 1.27

Slide 51

Slide 51 text

Wildfires, Firefighters and Sustainability: Learnings from Mitigating Kubernetes Fires in the Community

Slide 52

Slide 52 text

Sustainability

Slide 53

Slide 53 text

Sustainability
According to Elinor Ostrom, in her Nobel Prize-winning work “Governing the Commons”:
“[A system is sustainable] as long as the average rate of withdrawal does not exceed the average rate of replenishment”

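Ostrom's condition can be written as an inequality between two averages taken over the same time window. A minimal formalization in our own notation (the symbols are not Ostrom's): over $T$ periods, $w_t$ is what is withdrawn from the shared resource in period $t$ and $r_t$ is what is replenished. Because both averages divide by the same $T > 0$, the condition reduces to comparing totals:

$$
\frac{1}{T}\sum_{t=1}^{T} w_t \;\le\; \frac{1}{T}\sum_{t=1}^{T} r_t
\quad\Longleftrightarrow\quad
\sum_{t=1}^{T} w_t \;\le\; \sum_{t=1}^{T} r_t
$$

Read against this talk, one possible (assumed) mapping treats firefighting effort drawn from a small group of contributors as withdrawal, and newly enabled contributors as replenishment.
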
Slide 54

Slide 54 text

Sustainability

Slide 55

Slide 55 text

Sustainability

Slide 56

Slide 56 text

Sustainability

Slide 57

Slide 57 text

Recapping…

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

Fire Stories: Regressions and Heroics!!!

Slide 60

Slide 60 text

Fire Stories: Regressions and Heroics!!!

Slide 61

Slide 61 text

Fire Stories: Regressions and Heroics!!!

Slide 62

Slide 62 text

Fire Stories: Regressions and Heroics!!!

Slide 63

Slide 63 text

Fire Stories: Regressions and Heroics!!!

Slide 64

Slide 64 text

Fire Stories: Regressions and Heroics!!!

Slide 65

Slide 65 text

Fire Stories: Regressions and Heroics!!!
Turnaround time: ~1 day

Slide 66

Slide 66 text

Fire Stories: Regressions and Heroics!!!
Observations:
● Detection was possible thanks to consumption of the latest version of Kubernetes

Slide 67

Slide 67 text

Fire Stories: Regressions and Heroics!!!
Observations:
● Detection was possible thanks to consumption of the latest version of Kubernetes
● Community release engineers and triagers were available around the globe

Slide 68

Slide 68 text

Fire Stories: Regressions and Heroics!!!
Observations:
● Detection was possible thanks to consumption of the latest version of Kubernetes
● Community release engineers and people with knowledge of the machinery were available around the globe
Thank you Andy, dims, liggitt, Kubernetes Release Managers and Google Build Admins!

Slide 69

Slide 69 text

Fire Stories: go1.18 Breaks CSR Validation
Like most fires, we start with our CI looking like this:

Slide 70

Slide 70 text

Fire Stories: go1.18 Breaks CSR Validation
Like most fires, we start with our CI looking like this:

Slide 71

Slide 71 text

Fire Stories: go1.18 Breaks CSR Validation
Quick summary of what happened:
● In go1.18, crypto/x509 started to reject certificates signed with the SHA-1 hash function.
● The problem was that it also rejected CSRs (certificate signing requests), when it should only have rejected certificates.
● Because of this, CI stays red until we get a fix in the next minor Go version.

Slide 72

Slide 72 text

Fire Stories: go1.18 Breaks CSR Validation
Triage

Slide 73

Slide 73 text

Fire Stories: go1.18 Breaks CSR Validation
Triage

Slide 74

Slide 74 text

Fire Stories: go1.18 Breaks CSR Validation
Triage
Quick fix to unblock CI

Slide 75

Slide 75 text

Fire Stories: go1.18 Breaks CSR Validation
Fix: When the actual fix isn’t in our control, “fixing” includes charting the best course forward with what we can control.

Slide 76

Slide 76 text

Fire Stories: go1.18 Breaks CSR Validation
Fix: Watch and List

Slide 77

Slide 77 text

Fire Stories: go1.18 Breaks CSR Validation
“Subfires”

Slide 78

Slide 78 text

Fire Stories: go1.18 Breaks CSR Validation
From fighting this, we largely see the need for:
● Folks with cross-functional knowledge of the tooling and machinery of the project.
● Folks with knowledge of the policies of other open source communities and projects that we depend on (Go, in this case).

Slide 79

Slide 79 text

What went right?

Slide 80

Slide 80 text

Dissecting issues into actionable chunks

Slide 81

Slide 81 text

Correct Set of Tools

Slide 82

Slide 82 text

Correct Set of Tools

Slide 83

Slide 83 text

Correct Set of Tools

Slide 84

Slide 84 text

Correct Set of Tools

Slide 85

Slide 85 text

Correct Set of Tools

Slide 86

Slide 86 text

Global Distribution of Contributors

Slide 87

Slide 87 text

Employer Support for OSS Work
Company supported: 3169
Independent: 669

Slide 88

Slide 88 text

What Can Be Improved?
We’ve seen what went right; let’s take a look at how we can potentially improve.

Slide 89

Slide 89 text

Strategically Growing OWNERS

Slide 90

Slide 90 text

Strategically Growing OWNERS
● Growing OWNERS in the project is critical. Period.

Slide 91

Slide 91 text

Strategically Growing OWNERS
● Growing OWNERS in the project is critical. Period.
● Looking back at our fire stories, we can get things back on track quicker if we have a geographically distributed set of firefighters:

Slide 92

Slide 92 text

Strategically Growing OWNERS
● Growing OWNERS in the project is critical. Period.
● Looking back at our fire stories, we can get things back on track quicker if we have a geographically distributed set of firefighters:
  ○ But is that enough?

Slide 93

Slide 93 text

Strategically Growing OWNERS

Slide 94

Slide 94 text

Strategically Growing OWNERS

Slide 95

Slide 95 text

Strategically Growing OWNERS

Slide 96

Slide 96 text

Strategically Growing OWNERS

Slide 97

Slide 97 text

Strategically Growing OWNERS

Slide 98

Slide 98 text

Strategically Growing OWNERS

Slide 99

Slide 99 text

Strategically Growing OWNERS

Slide 100

Slide 100 text

Strategically Growing OWNERS

Slide 101

Slide 101 text

Strategically Growing OWNERS
● Growing OWNERS in the project is critical. Period.
● Looking back at our fire stories, we can get things back on track quicker if we have a geographically distributed set of firefighters:
  ○ But is that enough?
  ○ Along with this, we also benefit from a geographically distributed set of OWNERS:
    ■ Brings things back on track faster (e.g. unblocks CI sooner)
    ■ More time for CI to soak changes made by PRs (especially towards the end of a release)

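One way to reason about a geographically distributed set of OWNERS is to count how many hours of the UTC day have at least one approver within working hours. A minimal sketch in Go under loud assumptions: whole-hour UTC offsets, a fixed 09:00-17:00 local working window, and hypothetical handles; it ignores daylight saving time, weekends and holidays:

```go
package main

import "fmt"

// Approver is a hypothetical OWNERS approver with a whole-hour UTC offset.
type Approver struct {
	Handle    string
	UTCOffset int // e.g. -8 for UTC-8, +9 for UTC+9
}

// Assumed local working window: 09:00 (inclusive) to 17:00 (exclusive).
const (
	workStart = 9
	workEnd   = 17
)

// availability counts, for each UTC hour 0..23, how many approvers are
// inside their local working window.
func availability(approvers []Approver) [24]int {
	var counts [24]int
	for _, a := range approvers {
		for local := workStart; local < workEnd; local++ {
			// UTC = local - offset, wrapped into 0..23.
			utc := ((local-a.UTCOffset)%24 + 24) % 24
			counts[utc]++
		}
	}
	return counts
}

// coverage returns how many UTC hours have at least one approver available.
func coverage(approvers []Approver) int {
	hours := 0
	for _, c := range availability(approvers) {
		if c > 0 {
			hours++
		}
	}
	return hours
}

func main() {
	// Hypothetical handles and offsets, for illustration only.
	sameRegion := []Approver{{"alice", -8}, {"bob", -7}}
	spread := []Approver{{"alice", -8}, {"carol", 1}, {"dave", 9}}
	fmt.Println("same region:", coverage(sameRegion), "of 24 UTC hours")
	fmt.Println("spread out:", coverage(spread), "of 24 UTC hours")
}
```

With these example inputs, two approvers in adjacent time zones (UTC-8 and UTC-7) cover 9 of 24 UTC hours, while three approvers spread across UTC-8, UTC+1 and UTC+9 cover 23.
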
Slide 102

Slide 102 text

Reliability

Slide 103

Slide 103 text

Reliability
We don’t need firefighters if we don’t have fires.

Slide 104

Slide 104 text

Reliability
Investing in the reliability of the project gives exponentially positive returns.

Slide 105

Slide 105 text

Reliability
Investing in the reliability of the project gives exponentially positive returns:
● A great deal of work has gone into the reliability of the Kubernetes project.

Slide 106

Slide 106 text

Reliability
Investing in the reliability of the project gives exponentially positive returns:
● A great deal of work has gone into the reliability of the Kubernetes project.
● This effort is largely owed to SIG Testing (thank you to everyone involved!), but a lot of help is still needed here.

Slide 107

Slide 107 text

Reliability
Investing in the reliability of the project gives exponentially positive returns:
● A great deal of work has gone into the reliability of the Kubernetes project.
● This effort is largely owed to SIG Testing (thank you to everyone involved!), but a lot of help is still needed here.
  ○ If you are an end user, a vendor or someone who cares about Kubernetes, investing in and funding folks to work on the Kubernetes project is critical for us as an ecosystem.

Slide 108

Slide 108 text

Having More Firefighters

Slide 109

Slide 109 text

Having More Firefighters
According to Curto-Millet et al. in “The sustainability of open source commons”:
“Not all participation is equal and projects and communities need to encourage positive social relations. This involves participants becoming core members through situated learning and identity construction.”

Slide 110

Slide 110 text

Having More Firefighters
● Undocumented context: one of the largest reasons we depend on a small number of project veterans.

Slide 111

Slide 111 text

Having More Firefighters
● Undocumented context: one of the largest reasons we depend on a small number of project veterans.
  ○ As a first step, let’s start writing and publishing post-mortems after each fire.

Slide 112

Slide 112 text

Having More Firefighters
● Undocumented context: one of the largest reasons we depend on a small number of project veterans.
  ○ As a first step, let’s start writing and publishing post-mortems after each fire.
● Enable folks who are potential firefighters.

Slide 113

Slide 113 text

Having More Firefighters
● Undocumented context: one of the largest reasons we depend on a small number of project veterans.
  ○ As a first step, let’s start writing and publishing post-mortems after each fire.
● Enable folks who are potential firefighters.
  ○ When fires come up, broken-down, tangible descriptions and analyses enable potential firefighters.

Slide 114

Slide 114 text

Having More Firefighters
● Undocumented context: one of the largest reasons we depend on a small number of project veterans.
  ○ As a first step, let’s start writing and publishing post-mortems after each fire.
● Enable folks who are potential firefighters.
  ○ When fires come up, broken-down, tangible descriptions and analyses enable potential firefighters.
● We have amazing teams, like the Release CI Signal team, that can be enabled to become the entry point for firefighting.

Slide 115

Slide 115 text

Having More Firefighters
Link to the video: YouTube

Slide 116

Slide 116 text

Takeaways

Slide 117

Slide 117 text

Takeaways
Globally distributed contributors, with employer support, trained to triage and debug fires, with the right tools.

Slide 118

Slide 118 text

Takeaways
Globally distributed contributors, with employer support, trained to triage and debug fires, with the right tools.

Slide 119

Slide 119 text

Takeaways
Globally distributed contributors, with employer support, trained to triage and debug fires, with the right tools.

Slide 120

Slide 120 text

Takeaways
Globally distributed contributors, with employer support, trained to triage and debug fires, with the right tools.

Slide 121

Slide 121 text

Takeaways
Globally distributed contributors, with employer support, trained to triage and debug fires, with the right tools.

Slide 122

Slide 122 text

Thank You!

Slide 123

Slide 123 text

Come join us at the Kubernetes SIG Meet and Greet tomorrow at 12:30 PM at Europe Foyer 1, Ground Floor, Congress Centre.

Slide 124

Slide 124 text

Please scan the QR code above to leave feedback on this session.