SRpfE: Site Reliability (principles for) Everyone

Kurt Andersen (@drkurta), All Day DevOps 2021 (@addo2021)


3 1. Begin with the End in Mind...
Photo by Glen Rushton on Unsplash

4 What is “reliability”?
For a car:
• It works when and where you need it
• It accomplishes the task:
  • taking you and your stuff from here to where you want to go
  • in a reasonable amount of time
  • given environmental conditions
• It does not distract you with a lot of “overhead” (physical or mental)
Photo by Obie Fernandez on Unsplash

5 Other aspects of reliability
• Repeatable
• (mostly) Context independent
Photo by Alok Sharma on Unsplash; photo by Markus Spiske on Unsplash

6 How does “reliability” really matter?
• Customer satisfaction
• Referrals
• Reputation
Photo by Patrick Robert Doyle on Unsplash

7 How do you know if you’ve “got it”?
Photo: JIM WILSON/The New York Times/RE

8 2. Measure what Matters!
• Reliability is an emergent property of a sociotechnical system
• Reliability is fuzzy, not a binary yes/no
• You need to learn what matters to your users!

9 What matters to your users? 🚫
• CPU or memory usage?
• Number of AWS instances?
• Monthly spend on swag?

10 What matters to your users?
• How long does it take to do what they came to your site or app to do?
• How easy was it?
Put yourself in the user’s position; what else can you think of?
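One way to put a number on "how long does it take to do what they came to do" is to count the share of user tasks that finish within a time the user would find acceptable. The following is only a minimal sketch in Python, assuming you already record how long each task took; the function name, the sample durations, and the 2-second threshold are made up for illustration and do not come from the talk.

```python
from typing import Optional, Sequence


def task_success_ratio(durations_s: Sequence[float], threshold_s: float) -> Optional[float]:
    """Fraction of user tasks that finished within threshold_s seconds.

    Returns None when there is no data, so "no traffic" is not mistaken
    for "everything is fine".
    """
    if not durations_s:
        return None
    good = sum(1 for d in durations_s if d <= threshold_s)
    return good / len(durations_s)


# Hypothetical sample: seconds each user needed to complete a checkout.
checkout_durations_s = [1.2, 0.8, 3.5, 2.0]
ratio = task_success_ratio(checkout_durations_s, threshold_s=2.0)
print(f"{ratio:.0%} of checkouts finished within 2 seconds")
```

The measurement is taken from the user's side of the task, not from CPU or instance counts, which is the contrast the two "What matters to your users?" slides draw.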

11 3. Develop an “eye” for problems
One of the distinguishing characteristics of High Reliability Organizations (HROs): an ever-present awareness of how near FAILURE is at every moment

12 “Process Feel” or Fingerspitzengefühl
"Literally meaning 'finger tips feeling' and meaning intuitive flair or instinct... It describes a great situational awareness, and the ability to respond... describe[s] a superior ability to respond to an escalated situation." (Wikipedia)

13 Can you feel when the wheel is coming off?

14 4. Solve Problems Swiftly and Collaboratively
• Empowered teams and team members do not just “make do” in the face of problems. They use their own agency to fix them.
• The more swiftly a problem is solved, the less it impacts you, your team, and your users.
• By tackling problems in a collaborative fashion, the whole team gets better.

15 • TBD – pit stop video

16 5. Feed Learnings Back into the System
“Those who fail to learn from history are condemned to repeat it” (Winston Churchill)

17 How do you inform the rest of your org?
• Disembodied metrics?
• Tickets?
• Word of mouth or informal tribal knowledge?
“Let me explain. No, there is too much. Let me sum up.” (Inigo Montoya, The Princess Bride)
Do you inform the rest of your org?

18 How do you inform the rest of your org?
Use stories

19 6. Move from Reactive to Proactive
• The “Holy Grail” of reliability is to take pre-emptive action to avoid catastrophe
In the late ’90s, a chorus of doom echoed across the computing landscape with the approach of the dreaded Y2K…

20 How do you prepare for unscheduled events?
• Constantly be learning – from success, near misses, and failures
• Have a skilled team to solve problems
• Be “in tune” with your systems – Fingerspitzengefühl
• Instrument your systems so they can “talk” to you
• Look for “leading indicators”, such as error budgets
• Know what your goal is – so that you can make effective trade-off decisions
Photo by set.sj on Unsplash
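The slide names error budgets as an example of a leading indicator. As a rough illustration of what that can mean, here is a minimal Python sketch, assuming a request-based objective (a target fraction of requests that must succeed). The class `ErrorBudget`, the function `burn_rate`, and every number below are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float     # assumed objective, e.g. 0.999 = 99.9% of requests succeed
    total_requests: int   # requests seen so far in the objective's time window
    failed_requests: int  # requests that did not meet the objective

    def __post_init__(self) -> None:
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")

    def allowed_failures(self) -> float:
        """Number of failures the objective tolerates for the traffic seen so far."""
        return (1.0 - self.slo_target) * self.total_requests

    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once the budget is blown."""
        allowed = self.allowed_failures()
        if allowed == 0:
            return 0.0
        return 1.0 - self.failed_requests / allowed


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the tolerated error rate.

    1.0 means the budget is being spent exactly as fast as the objective allows;
    values above 1.0 warn that the budget will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - slo_target)


# Hypothetical month so far: 1,000,000 requests, 250 of them failed.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"budget remaining: {budget.remaining_fraction():.0%}")

# Hypothetical last hour: 30 failures out of 10,000 requests.
print(f"burn rate: {burn_rate(30, 10_000, 0.999):.1f}x")
```

A burn rate well above 1 while most of the budget is still unspent is the kind of early signal that lets a team act before users see a catastrophe, which is the trade-off decision the last bullet refers to.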

21 Most Important of All…
All of the above are continuous, dynamic, ever-changing processes

22 Summary – Site Reliability principles for Everyone
1. Begin with the end in mind
2. Measure what matters (to your users)
3. Develop an “eye” for problems
4. Solve problems quickly and collaboratively
5. Feed learnings back into the system
6. Move from reactive to proactive
Most Important of All:
7. All of the above are continuous, dynamic, ever-changing processes

23 Non-code Example
1. Begin with the end in mind…
• You want to get 90% of your speakers to promote this to their circle of friends and contacts

24 Non-code Example
2. Measure what matters
• Social mentions
• Conference signups
• What about speaker engagement / satisfaction?

25 Non-code Example
3. Develop an “eye” for problems
• Understand your users – the speaker community
4. Solve problems quickly and collaboratively
• Lower the barriers for action

26 Non-code Example
5. Feed learnings back into earlier parts of the system
6. Move from reactive to proactive
• From after-the-event surveys to tracking in-process measures

