In the Center of the Cyclone: Finding Sources of Resilience

Sustaining the potential to adapt to unforeseen situations (resilience) is a necessary element in complex systems. One could say that all successful endeavors require this. But resilience is, in many ways, both invisible and difficult to locate in concrete, grounded ways. Understanding complex systems cannot rely on simple approaches, by definition.

“Monitoring,” “observability,” “culture,” “management,” “organizational design” ... none of these terms, concepts, or approaches can singularly help us in this area. We’ll walk through empirically supported approaches that do.

John Allspaw

August 16, 2018

Transcript

  1. In the Center of the Cyclone Finding Sources of Resilience

    John Allspaw @allspaw Adaptive Capacity Labs
  2. In the Center of the Cyclone Finding Sources of Resilience

    where to look? how to look? implies that they require effort to be identified
  3. “…the ability to recognize and adapt to handle unanticipated perturbations…” (Woods)

    “a resilient system must be both prepared, and be prepared to be unprepared.” (Pariès)
  4. externally sourced code (e.g. DB); results; the using world; delivery; technology stack; internally sourced code; results
  5. externally sourced code (e.g. DB); results; the using world; delivery; technology stack; internally sourced code; results

    macro descriptions
  6. code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases

    code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools
    externally sourced code (e.g. DB); results; the using world; delivery; technology stack; internally sourced code; results
  7. code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases

    code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools
    externally sourced code (e.g. DB); results; the using world; delivery; technology stack; internally sourced code; results
  8. Adding stuff to the running system; getting stuff ready to be part of the running system; architectural & structural framing; keeping track of what “the system” is doing

    (cropped tool labels: code, deploy, organization/, “monitoring”)
  9. Adding stuff to the running system; getting stuff ready to be part of the running system; architectural & structural framing; keeping track of what “the system” is doing

    code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools
    code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases
    externally sourced code (e.g. DB); results; the using world; delivery; technology stack; internally sourced code; results
  10. above the line; the line of representation; below the line

    Copyright © 2016 by R.I. Cook for ACL, all rights reserved. ack: Michael Angeles http://konigi.com/tools/
    What matters. Why what matters matters.
    goals; purposes; risks; cognition; actions; interactions; speech; gestures; clicks; signals; representations; artifacts
    individuals have unique models of the “system”
    Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What does it mean? What is happening? What should be happening? What does it mean?
    observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting
    Adding stuff to the running system; getting stuff ready to be part of the running system; architectural & structural framing; keeping track of what “the system” is doing
    code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools
    code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases
    externally sourced code (e.g. DB); results; the using world; delivery; technology stack; internally sourced code; results
  11. above the line; the line of representation; below the line

    What matters. Why what matters matters.
    goals; purposes; risks; cognition; actions; interactions; speech; gestures; clicks; signals; representations; artifacts
    individuals have unique models of the “system”
    Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What does it mean? What is happening? What should be happening? What does it mean?
    observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting
    Adding stuff to the running system; getting stuff ready to be part of the running system; architectural & structural framing; keeping track of what “the system” is doing
    code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools
    code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases
    externally sourced code (e.g. DB); results; the using world; delivery; technology stack; internally sourced code; results
  12. What matters. Why what matters matters.

    goals; purposes; risks; cognition; actions; interactions; speech; gestures; clicks; signals; representations
    Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What does it mean? What is happening? What should be happening? What does it mean?
    observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting
    Adding stuff to the running system; getting stuff ready to be part of the running system; architectural & structural framing; keeping track of what “the system” is doing
    code; deploy; organization/encapsulation; “monitoring”
  13. observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting

    …is what copes with and adapts to: unforeseen, unanticipated, unexpected, fundamentally surprising
  14. RESILIENCE IS HERE (ABOVE THE LINE)

    above the line; the line of representation; below the line
    What matters. Why what matters matters.
    goals; purposes; risks; cognition; actions; interactions; speech; gestures; clicks; signals; representations; artifacts
    individuals have unique models of the “system”
    Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What does it mean? What is happening? What should be happening? What does it mean?
    observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting
    Adding stuff to the running system; getting stuff ready to be part of the running system; architectural & structural framing; keeping track of what “the system” is doing
    code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools
    code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases
    externally sourced code (e.g. DB); results; the using world; delivery; technology stack; internally sourced code; results
  15. enumerate possible causes; use process of elimination; collect more data;

    refine remaining hypotheses; prove remaining hypotheses; cannot; fix issue; can
  16. ?

  17. ?

  18. ?

  19. this is what I want you to pay attention to

    true, but unrelated to this talk
  20. we need to understand actual work

    …how the CTO knew an esoteric trick to get things working again
    …the realization that yes, there IS actually a problem with AWS, and Lisa in your infrastructure team called someone she knows there to get the issue escalated
    …how Vanessa managed to improvise a script to piece together accidentally deleted data from Hadoop, Elasticsearch indexes, and the Wayback Machine before anyone notices
    …what Jenn in Security does to discern ‘normal’ bug behavior from signals of an attack
    …the realization that when a bottleneck appears in the analytics pipeline, there are some bits of data collection that can be shut off without severe impact
  21. we need to hear about

    …the red herrings, rabbit holes, and unproductive threads of activity
    …what sources of data or information people do not trust
    …how responders bring newcomers up to speed
    …what sacrifice decisions people are making
    …how specialists in one field communicate problem solving to another
    …
  22. site down, site up

    Infra Eng1: 4 years
    DBA: 2.5 years
    DBA: 1 year
    App Eng1: 2 years
    Mobile Eng1: 2.5 years
    App Eng3: 3 years
    Infra Eng2: 1.5 years
    App Eng2: 1 year
  23. site down, site up

    called in for specific expertise
    also providing updates to customer support in parallel to responding to incident
    in Budapest with poor conference wifi
    unknowingly looking at incorrect availability zone log data
  24. logs; time of year; day of year; time of day

    observations and hypotheses others share
    what has been investigated thus far
    what’s been happening in the world (news, service provider outages, etc.)
    time-series data
    alerts
    tracing/observability tools
    recent changes in existing tech
    new dependencies
    who is on vacation, at a conference, traveling, etc.
    status of other ongoing work
  25. My experience is that our community has very little patience

    for looking closely and deeply at cognitive work. “What tool do I use?”