• Postmortems as re-calibration
• Blameless v. sanctionless after action actions
• Controlling the costs of coordination
• Visualizations during anomaly management
• Strange Loops
• Dark Debt
and a community
2. A cognitive systems perspective of the CD/CI community
3. Poise/potential/capacity to adapt
4. Some (hopefully) thought-provoking questions
researchers from…
Human Factors & Ergonomics, Cognitive Systems Engineering, Cybernetics, Complexity Science, Engineering*, Psychology, Sociology, Ecology, Safety Science
working in these domains…
Aviation/ATM, Rail, Maritime, Space, Surgery, Power Plants, Intelligence Agencies, Law Enforcement, Mining, Construction, Explosives, Firefighting, Anesthesia, Pediatrics, Power Grid & Distribution, Military Agencies, Software Engineering
Perry, Univ of Florida, Emergency Medicine
Dr. Richard Cook, Anesthesiologist, Researcher
Ivonne Andrade Herrera, SINTEF
Erik Hollnagel, Univ of S. Denmark
Anne-Sophie Nyssen, Université de Liège
Johan Bergström, Lund University
Sidney Dekker, Griffith University
Asher Balkin, CSEL/OSU
Laura Maguire, CSEL/OSU
in light of resilience engineering
Unmanned Aircraft Systems in (Inter)national Airspace: Resilience as a Lever in the Debate
Sociotechnical Networks for Power Grid Resilience: South Korean Case Study
Limits on adaptation: Modeling Resilience and Brittleness in Hospital Emergency Departments
[Diagram of the system. Panel captions: The Work Is Done Here / Your Product Or Service / The Stuff You Build and Maintain With]
Diagram labels: Adding stuff to the running system; Getting stuff ready to be part of the running system; architectural & structural framing; keeping track of what “the system” is doing; code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases; code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools; externally sourced code (e.g. DB); internally sourced code; delivery technology stack; results; the using world
[Diagram repeated, annotated with the questions people ask and the work they do]
Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What is happening? What should be happening?
goals, purposes, risks, cognition, actions, interactions, speech, gestures, clicks, signals, representations
What matters. Why what matters matters.
observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting
[Diagram repeated, with a time axis]
Time: things are changing here… …and things are changing here
cope with everyday situations – large and small – by adjusting their performance to the conditions. An organisation’s performance is resilient if it can function as required under expected and unexpected conditions alike (changes/disturbances/opportunities).”
Hollnagel, Erik. Safety-II in Practice: Developing the Resilience Potentials
are different than ramp-ups
• ramp-ups are different than db schema changes
• network, hardware, “front-end”, “fault injection/chaos”
• etc. etc. etc.
a change is not a change is not a change
Deploys get different evaluations based on their perceived risk.
Freedom to deploy is sometimes restricted.
Risk hedging is common.
Conventions are everywhere.
Dr. Richard Cook, Velocity Conf 2016, Santa Clara, CA
[Diagram repeated]
Context Is Constructed Here
to do
2. Knowing how the platform works
3. What the platform’s behavior means
4. Being able to devise a change that addresses 1, 2, & 3
5. Being able to predict the effects of that change
6. Being able to force the platform to change in that way
7. Being prepared to deal with the consequences
Dr. Richard Cook, Velocity Conf 2016, Santa Clara, CA
[Diagram repeated, divided into “above the line” and “below the line”]
incidents as… drivers of software design
 - “incidents of yesterday inform the architectures of tomorrow”
 - incidents “below the line” drive changes “above the line”
 - staffing, budgets, planning, roadmaps, etc.
 - shape the design of new components, subsystems, architectures
incidents as… motivators for policy
market value in <10min
3/23/2012 - BATS IPO - systems issue halted the exchange’s own IPO
5/23/2012 - Facebook IPO - systems issue delayed IPO trading
8/1/2012 - Knight Capital - $461 million in 45 minutes
“Regulation SCI”
 - tend also to give birth to new forms of regulations, policies, norms, compliance requirements, explosion of documentation, auditing, constraints, etc.
 - “incidents of yesterday inform the rules of tomorrow”
 - influence staffing, budgets, planning, roadmaps, etc.
PCI-DSS: 1988-1998, Visa and MasterCard reported credit card losses due to fraud of $750 million
will it do next?
How did it get into this state?
WTF is happening?
If we do Y, will it help us figure out what to do?
Is it getting worse?
It looks like it’s fixed…but is it…?
If we do X, will it prevent it from getting worse…or make it worse?
Who else should we call that can help us?
Is this OUR issue, or are we BEING ATTACKED?!
potential for uncovering elements of expertise and related cognitive phenomena.” (Klein, Crandall, Hoffman, 2006)
A family of well-worn methods, approaches, and techniques:
• Cognitive task/work analysis
• Process tracing
• Conversation analysis
• Critical decision method
• Critical incident technique
• more…
research validates these opportunities
how attention flows
how work is coordinated
how escalation manifests
the weight of time pressure
the effects of uncertainty
the impact of ambiguity
what consequences are consequential
…from these?
(M)TTR? (M)TTD? Frequency of incidents? Severity of incidents? Customer impact? Number of deploys?
“…while there is value in the items on the right, we value the items on the left more.”
behave - we continually build and revise our understandings based on the (relatively sparse) signals our tech sends us.
• Applying CD approaches/techniques is an implicit acknowledgement of this sparse and fleeting understanding; these approaches are coping strategies for this state of affairs.
• Understanding activities “above the line” is basically unexplored or ignored in our industry, and this needs to change.