Wildfires, Firefighters and Sustainability: Learnings from Mitigating Kubernetes Fires in the Community

Nabarun Pal

April 20, 2023

Transcript

  1. Nabarun Pal & Madhav Jivrajani, VMware
    Wildfires, Firefighters and Sustainability
    Learnings from Mitigating Kubernetes
    Fires in the Community

  4. Who Are We?
    Madhav Jivrajani
    @MadhavJivrajani
    Kubernetes SIG ContribEx Technical Lead
    Nabarun Pal
    @theonlynabarun
    Kubernetes Steering Committee / SIG ContribEx Chair

  5. Before We Start…
    @MadhavJivrajani & @theonlynabarun

  6. registry.k8s.io is GA!🎉
    🚨❄k8s.gcr.io is frozen❄🚨
    More info on
    https://k8s.io/image-registry-redirect
    Also see: k8s.gcr.io Redirect to registry.k8s.io - What You Need to Know

  7. Agenda
    ● Timeline of a Kubernetes Release
    ● Introduction and Setting the context
    ● Why were the releases delayed?
    ● What went right?
    ● What could be done better?
    ● Takeaways

  8. Prelude: Timeline of a Kubernetes Release
    Cadence: Every ~4 months

  9. Prelude: Timeline of a Kubernetes Release
    Elaborate song and dance of People and Processes

  10. Prelude: Timeline of a Kubernetes Release
    [Diagram: release team roles]
    ● Emeritus Adviser
    ● Release Lead, with Release Lead Shadows
    ● Branch Manager, with Branch Manager Shadow
    ● Bug Triage, CI Signal, Comms, Docs, Enhancements and Release Notes, each with Shadows

  33. Release-blockers usually happen towards the end of a release, but not necessarily.

  49. Typical Flow of Fighting A Wildfire
    Data for release-blockers for releases 1.24 - 1.27

  52. Sustainability
    According to Elinor Ostrom, in her Nobel Prize-winning work “Governing the Commons”:
    “[A system is sustainable] as long as the average rate of withdrawal does not exceed the
    average rate of replenishment”

  56. Recapping…

  64. Fire Stories: Regressions and Heroics!!!
    Turn Around Time = ~1 day

  67. Fire Stories: Regressions and Heroics!!!
    Observations:
    ● Detection was possible because the latest version of Kubernetes was being consumed
    ● Community Release Engineers and people with knowledge of the machinery were available
    around the globe
    Thank you Andy, dims, liggitt, Kubernetes Release Managers and Google Build Admins!

  68. Fire Stories: go1.18 Breaks CSR Validation
    Like most fires, we start with our CI looking like this:

  70. Fire Stories: go1.18 Breaks CSR Validation
    Quick summary of what happened:
    ● In go1.18, crypto/x509 started rejecting certificates signed with the SHA-1 hash function.
    ● The problem was that it also rejected CSRs, when it should only have rejected certificates.
    ● Because of this, CI remains red until we get a fix in the next minor Go release.

  73. Fire Stories: go1.18 Breaks CSR Validation
    Triage
    Quick fix to unblock CI

  74. Fire Stories: go1.18 Breaks CSR Validation
    Fix: When the actual fix isn’t in our control, “fixing” includes charting the best course
    forward with what we can control.

  75. Fire Stories: go1.18 Breaks CSR Validation
    Fix: Watch and List

  76. Fire Stories: go1.18 Breaks CSR Validation
    “Subfires”

  77. Fire Stories: go1.18 Breaks CSR Validation
    From fighting this, we largely see the need for:
    ● Folks with cross-functional knowledge of the tooling and machinery of the project.
    ● Folks with knowledge about policies of other open source communities and projects
    that we depend on (Go in this case).

  78. What went right?

  79. Dissecting issues into actionable chunks

  80. Correct Set of Tools

  85. Global Distribution of Contributors

  86. Employer Support For OSS Work
    Company Supported: 3169
    Independent: 669
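
For scale, and assuming both figures count the same unit (the transcript does not say whether these are contributors or contributions), the split works out roughly as follows:

```latex
\[
\frac{3169}{3169 + 669} = \frac{3169}{3838} \approx 82.6\%,
\qquad
\frac{669}{3838} \approx 17.4\%
\]
```

Under that assumption, company-supported work outnumbers independent work by a little under five to one.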

  87. What Can Be Improved?
    We’ve seen what went right; now let’s look at how we can improve.

  100. Strategically Growing OWNERS
    ● Growing OWNERS in the project is critical. Period.
    ● Looking back at our fire stories, we can get things back on track quicker if we have a geo-distributed
    set of firefighters:
    ○ But is that enough?
    ○ Along with this, we also benefit from a geo-distributed set of OWNERS
    ■ Brings things back on track faster (e.g., unblocks CI faster)
    ■ More time for CI to soak changes made by PRs (especially towards the end of a release)

  102. Reliability
    We don’t need firefighters if we don’t have fires

  106. Reliability
    Investing in the reliability of the project gives exponentially positive returns:
    ● A great deal of work has gone into the reliability of the Kubernetes project.
    ● This effort is largely thanks to SIG Testing. Thank you to everyone involved! There is still a lot of
    help needed here.
    ○ If you are an end user, a vendor or someone who cares about Kubernetes, investing in and
    funding folks to work on the Kubernetes project is critical for us as an ecosystem.

  108. Having More Firefighters
    According to Curto-Millet et al. in “The sustainability of open source commons”:
    “Not all participation is equal and projects and communities need to encourage
    positive social relations. This involves participants becoming core members through
    situated learning and identity construction.”
    View full-size slide

  113. Having More Firefighters
    ● Undocumented context is one of the largest reasons we depend on a small number of project
    veterans.
    ○ As a first step, let’s start doing and publishing postmortems after each fire.
    ● Enable folks who are potential firefighters
    ○ When fires come up, broken-down, tangible descriptions and analyses enable potential
    firefighters.
    ● We have amazing teams like the Release CI Signal team who can be enabled to be the entry point for
    firefighting.

  114. Having More Firefighters
    Link to the video: YouTube

  116. Takeaways
    Globally distributed contributors, with employer
    support, trained to triage and debug fires, with the
    right tools.

  121. Thank You!

  122. Come join us at the Kubernetes SIG Meet and Greet Tomorrow at 12.30PM
    at Europe Foyer 1, Ground Floor, Congress Centre.