SRpfE: Site Reliability (principles for) Everyone

Some core principles that are a natural part of how SREs work can help an entire company.

Site Reliability doesn't need to be a mysterious black art practiced by mythical unicorns at the end of a rainbow. SRE principles can help all software engineering teams deliver higher-quality, more reliable software, even without a specialist team. This talk covers the key principles involved in SRE work, with the intent of enabling every engineer (and employee!) to understand and use them.

You can see the live presentation at the ADDO2021 program site

Kurt Andersen

October 28, 2021

Transcript

  1. Kurt Andersen (@drkurta) All Day DevOps 2021 (@addo2021) SRpfE

     1: Begin with the End in Mind...
     Photo by Glen Rushton on Unsplash
  2. What is “reliability”? For a car:
     • It works when and where you need it,
     • It accomplishes the task:
       • taking you and your stuff from here to where you want to go
       • in a reasonable amount of time
       • given environmental conditions
     • It does not distract you with a lot of “overhead” (physical or mental)
     Photo by Obie Fernandez on Unsplash
  3. Other aspects of reliability
     • Repeatable
     • (mostly) Context independent
     Photo by Alok Sharma on Unsplash
     Photo by Markus Spiske on Unsplash
  4. How does “reliability” really matter?
     • Customer satisfaction
     • Referrals
     • Reputation
     Photo by Patrick Robert Doyle on Unsplash
  5. How do you know if you’ve “got it”?
     Photo: JIM WILSON/The New York Times/RE
  6. 2: Measure what Matters!
     • Reliability is an emergent property of a sociotechnical system
     • Reliability is fuzzy, not a binary yes/no
     • You need to learn what matters to your users!
  7. What matters to your users? 🚫
     • CPU or memory usage?
     • Number of AWS instances?
     • Monthly spend on swag?
  8. What matters to your users?
     • How long does it take to do what they came to your site or app to do?
     • How easy was it?
     Put yourself in the user’s position; what else can you think of?
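One way to turn "how long does it take to do what they came to do" into a number is to summarize per-task completion times rather than machine metrics. The following is a minimal sketch, not something from the talk: the sample durations, the 5-second threshold, and the function names are all hypothetical, and the nearest-rank percentile is only one of several reasonable definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fraction_within(samples, threshold_s):
    """Share of user tasks that finished within threshold_s seconds."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for s in samples if s <= threshold_s) / len(samples)


# Hypothetical data: seconds each user needed to finish one task (e.g. a checkout).
task_durations_s = [2.1, 3.4, 2.8, 9.7, 3.0, 2.5, 4.1, 3.3, 2.9, 15.2]

print("median task time:", percentile(task_durations_s, 50))
print("p95 task time:", percentile(task_durations_s, 95))
print("finished within 5 s:", fraction_within(task_durations_s, 5.0))
```

The median looks healthy here while the 95th percentile does not, which is why a user-centric measure usually looks at the slow tail as well as the typical case.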
  9. 3: Develop an “eye” for problems
     One of the distinguishing characteristics of High Reliability Organizations (HROs): an ever-present awareness of how near FAILURE is at every moment
  10. “Process Feel” or Fingerspitzengefühl
      Literally meaning "finger tips feeling" and meaning intuitive flair or instinct... It describes a great situational awareness, and the ability to respond... describe[s] a superior ability to respond to an escalated situation. (Source: Wikipedia)
  11. Can you feel when the wheel is coming off?
  12. 4: Solve Problems Swiftly and Collaboratively
      • Empowered teams and team members do not just “make do” in the face of problems. They use their own agency to fix them.
      • The more swiftly a problem is solved, the less it impacts you, your team, and your users.
      • By tackling problems in a collaborative fashion, the whole team gets better.
  13. 5: Feed Learnings Back into the System
      “Those who fail to learn from history are condemned to repeat it” (Winston Churchill)
  14. How do you inform the rest of your org?
      • Disembodied metrics?
      • Tickets?
      • Word of mouth or informal tribal knowledge?
      “Let me explain. No, there is too much. Let me sum up.” (Inigo Montoya, The Princess Bride)
      Do you inform the rest of your org?
  15. How do you inform the rest of your org?
      Use stories
  16. 6: Move from Reactive to Proactive
      • The “Holy Grail” of reliability is to take pre-emptive action to avoid catastrophe
      In the late ’90s, a chorus of doom echoed across the computing landscape with the approach of the dreaded Y2K…
  17. How do you prepare for unscheduled events?
      • Constantly be learning – from success, near misses, and failures
      • Have a skilled team to solve problems
      • Be “in tune” with your systems – Fingerspitzengefühl
      • Instrument your systems so they can “talk” to you
      • Look for “leading indicators”, such as error budgets
      • Know what your goal is – so that you can make effective trade-off decisions
      Photo by set.sj on Unsplash
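As an illustration of an error budget used as a leading indicator, here is a minimal sketch. It assumes a request-based SLO (the share of "good" events out of all events in a window); the `ErrorBudget` class, its field names, the 99.9% target, and the 50% warning threshold are hypothetical choices for this example and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for one SLO window (hypothetical helper for illustration)."""

    slo_target: float   # e.g. 0.999 means 99.9% of events should be good
    total_events: int   # all events seen in the window
    bad_events: int     # events that violated the SLO

    def __post_init__(self):
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_events < 0 or not 0 <= self.bad_events <= self.total_events:
            raise ValueError("event counts are inconsistent")

    @property
    def allowed_bad_events(self) -> float:
        """How many bad events the SLO tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_events

    @property
    def fraction_consumed(self) -> float:
        """Share of the budget already spent (can exceed 1.0)."""
        allowed = self.allowed_bad_events
        if allowed == 0:
            return 0.0
        return self.bad_events / allowed

    def is_at_risk(self, warn_fraction: float = 0.5) -> bool:
        """True once at least warn_fraction of the budget is spent."""
        return self.fraction_consumed >= warn_fraction


budget = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=250)
print("budget consumed:", round(budget.fraction_consumed, 2))
print("at risk:", budget.is_at_risk())
```

Watching how much of the budget has been spent gives a team a signal to slow down or invest in reliability before users experience a full SLO miss, which is what makes it a leading rather than a lagging indicator.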
  18. Most Important of All…
      All of the above are continuous, dynamic, ever-changing processes
  19. Summary: Site Reliability principles for Everyone
      1. Begin with the end in mind
      2. Measure what matters (to your users)
      3. Develop an “eye” for problems
      4. Solve problems quickly and collaboratively
      5. Feed learnings back into the system
      6. Move from Reactive to Proactive
      Most Important of All:
      7. All of the above are continuous, dynamic, ever-changing processes
  20. Non-code Example
      1. Begin with the end in mind…
         • You want to get 90% of your speakers to promote this to their circle of friends and contacts
  21. Non-code Example
      2. Measure what matters
         • Social mentions
         • Conference signups
         • What about speaker engagement / satisfaction?
  22. Non-code Example
      3. Develop an “eye” for problems
         • Understand your users – the speaker community
      4. Solve problems quickly and collaboratively
         • Lower the barriers for action
  23. Non-code Example
      5. Feed learnings back into earlier parts of the system
      6. Move from reactive to proactive
         • Move from after-the-event surveys to tracking in-process measures
  24. Summary: Site Reliability principles for Everyone
      1. Begin with the end in mind
      2. Measure what matters (to your users)
      3. Develop an “eye” for problems
      4. Solve problems quickly and collaboratively
      5. Feed learnings back into the system
      6. Move from Reactive to Proactive
      Most Important of All:
      7. All of the above are continuous, dynamic, ever-changing processes