Slide 1

Slide 1 text

In the Center of the Cyclone: Finding Sources of Resilience
John Allspaw (@allspaw), Adaptive Capacity Labs

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

RESILIENCE

Slide 6

Slide 6 text

In the Center of the Cyclone: Finding Sources of Resilience
where to look? how to look?
implies that they require effort to be identified

Slide 7

Slide 7 text

sustained adaptive capacity

Slide 8

Slide 8 text

“…the ability to recognize and adapt to handle unanticipated perturbations…” (Woods) “a resilient system must be both prepared, and be prepared to be unprepared.” (Pariès)

Slide 9

Slide 9 text

unforeseen unanticipated unexpected fundamentally surprising

Slide 10

Slide 10 text

where to look? how to look?

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

[Diagram. Labels: internally sourced code; externally sourced code (e.g. DB); delivery technology stack; results; the using world]

Slide 13

Slide 13 text

[Diagram. Labels: macro descriptions; internally sourced code; externally sourced code (e.g. DB); delivery technology stack; results; the using world]

Slide 14

Slide 14 text

[Diagram. Labels: code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases; code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools; internally sourced code; externally sourced code (e.g. DB); delivery technology stack; results; the using world]

Slide 15

Slide 15 text

[Diagram, same labels as Slide 14. Tool groups: code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools]

Slide 16

Slide 16 text

[Diagram detail. Tool groups: code generating tools; deploy tools; organization/encapsulation tools; “monitoring” tools. Captions: Getting stuff ready to be part of the running system; Adding stuff to the running system; architectural & structural framing; keeping track of what “the system” is doing]

Slide 17

Slide 17 text

[Diagram, same labels as Slide 14, with captions: Getting stuff ready to be part of the running system; Adding stuff to the running system; architectural & structural framing; keeping track of what “the system” is doing]

Slide 18

Slide 18 text

[Diagram. Copyright © 2016 by R.I. Cook for ACL, all rights reserved. ack: Michael Angeles http://konigi.com/tools/]
above the line / the line of representation / below the line
individuals have unique models of the “system”
What matters. Why what matters matters.
goals, purposes, risks, cognition, actions, interactions, speech, gestures, clicks, signals, representations, artifacts
Why is it doing that? What needs to change? How should this work? What’s it doing? What is happening? What should be happening? What does it mean?
observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting
code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases
code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools
Getting stuff ready to be part of the running system; Adding stuff to the running system; architectural & structural framing; keeping track of what “the system” is doing
internally sourced code; externally sourced code (e.g. DB); delivery technology stack; results; the using world

Slide 19

Slide 19 text

[Diagram repeated from Slide 18, including the activities: observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting]

Slide 20

Slide 20 text

[Diagram detail from Slide 18, with labels partly cropped. Activities: observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting]

Slide 21

Slide 21 text

observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting
…is what copes with and adapts to:
unforeseen, unanticipated, unexpected, fundamentally surprising

Slide 22

Slide 22 text

[Diagram repeated from Slide 18]
RESILIENCE IS HERE (ABOVE THE LINE)

Slide 23

Slide 23 text

sustained adaptive capacity = human performance (cognitive work)

Slide 24

Slide 24 text

challenges and barriers to finding sources of resilience

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

smoothing-the-messy-details-to-fit-a-model goggles

Slide 27

Slide 27 text

detect the issue → develop hypothes(es) for contributors → (in)validate hypothes(es) → fix issue

Slide 28

Slide 28 text

[Flowchart. Steps: enumerate possible causes; use process of elimination; collect more data; refine remaining hypotheses; prove remaining hypotheses; fix issue. Branch labels: cannot; can]

Slide 29

Slide 29 text

?

Slide 30

Slide 30 text

?

Slide 31

Slide 31 text

?

Slide 32

Slide 32 text

this is what I want you to pay attention to
true, but unrelated to this talk

Slide 33

Slide 33 text

we need to understand actual work
…how the CTO knew an esoteric trick to get things working again
…the realization that yes, there IS actually a problem with AWS, and Lisa in your infrastructure team called someone she knows there to get the issue escalated
…how Vanessa managed to improvise a script to piece together accidentally deleted data from Hadoop, Elasticsearch indexes, and the Wayback Machine before anyone notices
…what Jenn in Security does to discern ‘normal’ bug behavior from signals of an attack
…the realization that when a bottleneck appears in the analytics pipeline, there are some bits of data collection that can be shut off without severe impact

Slide 34

Slide 34 text

we need to hear about
…the red herrings, rabbit holes, and unproductive threads of activity
…what sources of data or information people do not trust
…how responders bring newcomers up to speed
…what sacrifice decisions people are making
…how specialists in one field communicate problem solving to another
…

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

shallow data goggles

Slide 37

Slide 37 text

[Timeline from “site down” to “site up”. Durations shown: 29 min, 18 min, 16 min]

Slide 38

Slide 38 text

[Timeline from “site down” to “site up”. Responders: Infra Eng1 (4 years); DBA (2.5 years); DBA (1 year); App Eng1 (2 years); Mobile Eng1 (2.5 years); App Eng3 (3 years); Infra Eng2 (1.5 years); App Eng2 (1 year)]

Slide 39

Slide 39 text

[Timeline from “site down” to “site up”. Annotations: called in for specific expertise; also providing updates to customer support in parallel to responding to incident; in Budapest with poor conference wifi; unknowingly looking at incorrect availability zone log data]

Slide 40

Slide 40 text

[Timeline from “site down” to “site up”. Annotations: critical relayed observations; stated hypotheses]

Slide 41

Slide 41 text

logs
time of year
day of year
time of day
observations and hypotheses others share
what has been investigated thus far
what’s been happening in the world (news, service provider outages, etc.)
time-series data
alerts
tracing/observability tools
recent changes in existing tech
new dependencies
who is on vacation, at a conference, traveling, etc.
status of other ongoing work

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

we have to take off our goggles

Slide 45

Slide 45 text

My experience is that our community has very little patience for looking closely and deeply at cognitive work. “What tool do I use?”

Slide 46

Slide 46 text

1999 1978 1996 This will take time.

Slide 47

Slide 47 text

http://resiliencepapers.club
SRE Cognitive Work, Seeking SRE, O’Reilly
http://stella.report

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

THE END