The Psychology of Alert Design

It's 3:37am. Your phone starts buzzing. It doesn't stop. 1000s of alerts. All the things are broken. Where do you even begin?

You freeze.

The infrastructures we operate are increasingly complex and nuanced. Events at one edge can have unintended and unpredictable effects at the other, with no obvious causal relationship between them. This makes debugging failures hard.

Good alert design is important to lowering the MTTR when our complex infrastructures fail, but what constitutes a "good alert"? Our brains work in unexpected ways, with cognitive biases and priming skewing our perception of reality. It's vitally important to understand how we think and react under pressure when designing alerts and communicating failure.

In this talk, Lindsay will showcase some of the psychological underpinnings you should take into account when designing your alerts, how other industries handle alert design, and what tools are available to increase your operational effectiveness in the face of massive failures today.

Sources used to create this talk:

- http://www.columbiadisaster.info/images/foam_debris_548x627.jpg
- http://upload.wikimedia.org/wikipedia/commons/9/95/Impact-test.jpg
- http://www.youtube.com/watch?v=94J9oVeST0k
- http://www.youtube.com/watch?v=1oBTzbKx0jo
- http://www.flickr.com/photos/frostnova/2268471558
- http://www.flickr.com/photos/buttim/1297081125
- http://www.flickr.com/photos/gsairpics/8318261080
- http://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Map_Tenerife_Disaster_EN.svg/2000px-Map_Tenerife_Disaster_EN.svg.png
- http://i1.ytimg.com/vi/LSPkRMbyrGc/maxresdefault.jpg
- http://awesomestories.com/images/user/9add18ae4d.jpg
- http://library.mpib-berlin.mpg.de/ft/rh/RH_Fluency_2008.pdf
- http://www.theatlanticwire.com/global/2012/07/final-air-france-447-report-pilots-misunderstood-their-situation/54209/
- http://www.dailymail.co.uk/news/article-2020136/Pierre-Cedric-Bonin-David-Robert-blamed-Atlantic-Ocean-Air-France-crash-killed-228.html
- http://edition.cnn.com/2012/07/05/world/europe/france-air-crash-report/index.html
- http://www.newscientist.com/blogs/onepercent/2012/07/af447-final-report.html
- http://gizmodo.com/5923866/air-france-447-crash-a-result-of-crew-ignoring-alarms
- http://www.flightglobal.com/news/articles/af447-inquiry-grapples-with-stall-warning-enigma-373857/
- http://www.anesthesia-analgesia.org/content/112/1/78.long
- http://www.used-equipment-medical.com/th_sogemed/medias/big/moniteur-drager-kappa-xlt-infinity.jpg
- http://img.medicalexpo.com/pdf/repository_me/68268/zeus-infinity-empowered-83059_5b.jpg
- http://www.flickr.com/photos/quinnanya/5646121120
- http://www.flickr.com/photos/digital-noise/3650559857
- http://en.wikipedia.org/wiki/File:Arterial_kateter.jpg
- http://drugline.org/img/term/venous-catheter-central-15887_1.jpg
- http://riemann.io/howto.html#group-events-in-time

Lindsay Holmwood

September 19, 2013

Transcript

  1. Psychology of alert design

  2. None
  3. G'day! I'm Lindsay Holmwood @auxesis

  4. Engineering manager @ Bulletproof

  5. cucumber-nagios Visage Flapjack

  6. None
  7. January 16, 2003

  8. foam debris broke off the space shuttle's external tank, struck left wing http://www.columbiadisaster.info/images/foam_debris_548x627.jpg
  9. http://upload.wikimedia.org/wikipedia/commons/9/95/Impact-test.jpg mockup of polyurethane foam hitting wing structure at 850km/h

  10. February 3, 2003

  11. from nasa tv http://www.youtube.com/watch?v=94J9oVeST0k

  12. from free to air television http://www.youtube.com/watch?v=1oBTzbKx0jo

  13. Did NASA have "good alerts"?

  14. What constitutes a good alert?

  15. good alert is a moral judgement

  16. No one sets out to create "bad alerts"

  17. Alerts designed in context

  18. Locally rational

  19. “people make what they think are best decisions based on data at hand”
  20. We design alerts for humans

  21. Let's understand how humans think

  22. 2 principles

  23. Don't startle the operator

  24. Don't suggest, expose

  25. None
  26. What is cognitive bias?

  27. "Mental shortcut"

  28. http://www.flickr.com/photos/frostnova/2268471558/sizes/o

  29. Timeliness Accuracy http://www.flickr.com/photos/frostnova/2268471558/sizes/o

  30. None
  31. • Problem solving

  32. • Problem solving • Heuristic

  33. • Problem solving • Heuristic • Correct result

  34. • Problem solving • Heuristic • Correct result • Rational choice
  35. None
  36. • Problem solving

  37. • Problem solving • Heuristic

  38. • Problem solving • Heuristic • Incorrect result

  39. • Problem solving • Heuristic • Incorrect result • Cognitive bias!
  40. Heuristic?

  41. Pattern matching. Heuristics are simple, efficient rules often used by people to form judgements and make decisions. They involve focusing on specific information and ignoring the rest. http://www.flickr.com/photos/buttim/1297081125/sizes/o
  42. What helped your ancestors survive!

  43. None
  44. March 27, 1977

  45. http://www.flickr.com/photos/gsairpics/8318261080/

  46. http://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Map_Tenerife_Disaster_EN.svg/2000px-Map_Tenerife_Disaster_EN.svg.png

  47. http://i1.ytimg.com/vi/LSPkRMbyrGc/maxresdefault.jpg

  48. http://awesomestories.com/images/user/9add18ae4d.jpg

  49. KLM: 234 passengers 16 crew

  50. Pan Am: 326 passengers 9 crew

  51. Frozen in place

  52. None
  53. Normalcy bias

  54. Before a disaster:

  55. None
  56. • Underestimate:

  57. • Underestimate: • risk

  58. • Underestimate: • risk • effects

  59. • Underestimate: • risk • effects • preparation

  60. "Because something bad has never happened, it never will happen"

  61. During a disaster:

  62. people need an average of 4 prompts before they take action "this truly can't be happening, everything will be ok"

  63. • Response: people need an average of 4 prompts before they take action "this truly can't be happening, everything will be ok"

  64. • Response: • slow reaction people need an average of 4 prompts before they take action "this truly can't be happening, everything will be ok"

  65. • Response: • slow reaction • seek validation people need an average of 4 prompts before they take action "this truly can't be happening, everything will be ok"

  66. • Response: • slow reaction • seek validation • optimistic interpretation people need an average of 4 prompts before they take action "this truly can't be happening, everything will be ok"
  67. None
  68. Reaction steps

  69. None
  70. • Cognition

  71. • Cognition • Perception

  72. • Cognition • Perception • Comprehension

  73. • Cognition • Perception • Comprehension • Decision

  74. • Cognition • Perception • Comprehension • Decision • Implementation

  75. • Cognition • Perception • Comprehension • Decision • Implementation • Movement
  76. These are complex tasks

  77. You cannot skip these tasks

  78. You can practice to make them more automatic

  79. People who don't practice deliberate during the disaster

  80. http://i1.ytimg.com/vi/LSPkRMbyrGc/maxresdefault.jpg

  81. 70% freeze 15% freak out 15% react to situation

  82. No practice == higher MTTR

  83. Don't startle the operator

  84. Drill

  85. Limit interruptions

  86. This is a test

  87. None
  88. 1. Read the statement once

  89. 1. Read the statement once 2. Count the letter F

  90. None
  91. FINAL FOLIOS SEEM TO RESULT FROM YEARS OF DUTIFUL STUDY OF TEXTS ALONG WITH YEARS OF SCIENTIFIC EXPERIENCE.
  92. None
  93. How many did you see?

  94. How many did you see? The answer is 8

  95. Fluency heuristic http://library.mpib-berlin.mpg.de/ft/rh/RH_Fluency_2008.pdf

  96. FINAL FOLIOS SEEM TO RESULT FROM YEARS OF DUTIFUL STUDY OF TEXTS ALONG WITH YEARS OF SCIENTIFIC EXPERIENCE.
  97. Brain expects pattern to continue

  98. Brain skips other information

  99. None
  100. Modeling "failure"

  101. None
  102. • a

  103. • a • b

  104. • a • b • c

  105. • a • b • c • d

  106. • a • b • c • d • *boom*

  107. Let's add barriers

  108. None
  109. • a • b • c • d •

  110. • a • b • c • d •

  111. • a • b • c • d • Soft

    Hard Soft Hard
  112. • a • b • c • d •

  113. • a • b • c • d • e

    *boom*
  114. • a • b • c • d • e

    *boom*
  115. • a • b • c • d • e

    *boom* f
  116. • a • b • c • d • e

    *boom* f
  117. • a • b • c • d • e

    *boom* f g h
  118. • a • b • c • d • e

    *boom* f g h
  119. • a • b • c • d • e

    *boom* f g h i j k
  120. • a • b • c • d • e

    *boom* f g h i j k
  121. • a • b • c • d • e

    *boom* f g h i j k l m n o p q r s t u v w x y z
  122. • a • b • c • d • e

    *boom* f g h i j k l m n o p q r s t u v w x y z
  123. • a • b • c • d • e

    *boom* f g h i j k l m n o p q r s t u v w x y z
  124. • a • b • c • d • e

    *boom* f g h i j k l m n o p q r s t u v w x y z
  125. • a • b • c • d • e

    *boom* f g h i j k l m n o p q r s t u v w x y z
  126. • a • b • c • d • e

    *boom* f g h i j k l m n o p q r s t u v w x y z
  127. • a • b • c • d • e

    *boom* f g h i j k l m n o p q r s t u v w x y z
  128. • a • b • c • d • e

    *boom* f g h i j k l m n o p q r s t u v w x y z Complexity
  129. Our systems are not static

  130. Our systems are dynamic

  131. "Accidents come from relationships, not broken parts"

  132. Parenting: does it even make sense?

  133. Lots of work

  134. Rapidly out of date

  135. Emergent behaviour?

  136. • a • b • c • d • q

    n e f h i j k l m o p r s t u v w x y z g
  137. • a • b • c • d • n

    e f h i j k l m o p r s t u v w x y z g *boom*
  138. • a • b • c • d • n

    f h i k m o p r s t u v w x y z g *boom* *boom* *boom* *boom*
  139. • a • b • c • d • n

    f h i k m o p r s t u v w x y z g *boom* *boom* *boom* *boom* this is alerting
  140. Don't suggest, expose

  141. None
  142. Other industries

  143. Aviation

  144. AF447

  145. None
  146. 70 stall warnings

  147. http://www.theatlanticwire.com/global/2012/07/final-air-france-447-report-pilots-misunderstood-their-situation/54209/ http://www.dailymail.co.uk/news/article-2020136/Pierre-Cedric-Bonin-David-Robert-blamed-Atlantic-Ocean-Air-France-crash-killed-228.html http://edition.cnn.com/2012/07/05/world/europe/france-air-crash-report/index.html http://www.newscientist.com/blogs/onepercent/2012/07/af447-final-report.html http://gizmodo.com/5923866/air-france-447-crash-a-result-of-crew-ignoring-alarms

  148. • Final Air France 447 Report: Pilots misunderstood their situation • Poorly-trained pilots to blame for Air France crash that killed 228 • Final Air France crash report says pilots failed to react swiftly • Air France 447 downed as crew ignored alarms • Air France 447 crash a result of crew ignoring alarms http://www.theatlanticwire.com/global/2012/07/final-air-france-447-report-pilots-misunderstood-their-situation/54209/ http://www.dailymail.co.uk/news/article-2020136/Pierre-Cedric-Bonin-David-Robert-blamed-Atlantic-Ocean-Air-France-crash-killed-228.html http://edition.cnn.com/2012/07/05/world/europe/france-air-crash-report/index.html http://www.newscientist.com/blogs/onepercent/2012/07/af447-final-report.html http://gizmodo.com/5923866/air-france-447-crash-a-result-of-crew-ignoring-alarms
  149. “They should have reacted!”

  150. Autopilot disconnect audio warning

  151. Alternate law reconfiguration audio warning

  152. Stall warnings lasted for 54 seconds

  153. C-chord altitude horn lasted for 34 seconds

  154. Dual control signal indicator light on the controls

  155. Warnings by channel: Autopilot disconnect (aural) • Alternate law reconfiguration (aural) • Dual input control (visual) • Altitude (aural) • Stall warning (aural)
  156. Overwhelmed by feedback

  157. "In an aural environment that was already saturated by the C-chord warning, the possibility that the crew did not identify the stall warning cannot be ruled out" - BEA report on AF447 http://www.flightglobal.com/news/articles/af447-inquiry-grapples-with-stall-warning-enigma-373857/
  158. Operating theatres

  159. The Wolf Is Crying in the Operating Room: Patient Monitor and Anesthesia Workstation Alarming Patterns During Cardiac Surgery. Schmid F, Goepfert M, et al, Anesthesia & Analgesia, 2010 http://www.anesthesia-analgesia.org/content/112/1/78.long
  160. Kappa XLT patient monitor http://www.used-equipment-medical.com/th_sogemed/medias/big/moniteur-drager-kappa-xlt-infinity.jpg

  161. Drager Zeus anesthesia workstation http://img.medicalexpo.com/pdf/repository_me/68268/zeus-infinity-empowered-83059_5b.jpg

  162. http://www.flickr.com/photos/quinnanya/5646121120/sizes/l/ pulse oximeter was used

  163. http://www.flickr.com/photos/digital-noise/3650559857/sizes/o electrocardiogram was used

  164. http://en.wikipedia.org/wiki/File:Arterial_kateter.jpg arterial blood pressure monitoring

  165. central venous pressure was measured with a central venous catheter http://drugline.org/img/term/venous-catheter-central-15887_1.jpg
  166. 1 second sampling interval

  167. Procedures were video recorded

  168. Results?

  169. 1.2 alerts / minute

  170. 80% of the 8975 alarms were of no consequence

  171. 30% of the 8975 alarms were false positives

  172. None
  173. How can we improve?

  174. Provide more context

  175. None
  176. None
  177. None
  178. None
  179. Don't suggest, expose

  180. None
  181. Reduce notifications

  182. None
  183. No notifications on individual checks

  184. Notify on the aggregate

  185. check_check

  186. $ check_check.rb -s solrserver
    OK=27 WARNING=0 CRITICAL=1 UNKNOWN=0 services=/solrserver/ hosts=//
    Services in CRITICAL: frontend1.example.com => solrserver client tests
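
Notifying on the aggregate rather than on individual checks can be sketched as follows. This is a minimal Python illustration, not check_check.rb's implementation. The `CheckResult` type, the `aggregate` function, the `critical_threshold` parameter, and the "solrserver ping" service name are all assumptions made for the example. The summary line only imitates the shape of the output on slide 186.

```python
from collections import Counter
from dataclasses import dataclass

STATES = ("OK", "WARNING", "CRITICAL", "UNKNOWN")


@dataclass(frozen=True)
class CheckResult:
    """One individual check result (hypothetical type for this sketch)."""
    host: str
    service: str
    state: str  # one of STATES


def aggregate(results, critical_threshold=1):
    """Collapse many check results into one state, one summary, one failing list.

    Assumed policy: the aggregate is CRITICAL once `critical_threshold` checks
    are CRITICAL, WARNING if anything else is not OK, and OK otherwise.
    """
    counts = Counter(r.state for r in results)
    summary = " ".join(f"{state}={counts[state]}" for state in STATES)
    if counts["CRITICAL"] >= critical_threshold:
        overall = "CRITICAL"
    elif counts["CRITICAL"] or counts["WARNING"] or counts["UNKNOWN"]:
        overall = "WARNING"
    else:
        overall = "OK"
    failing = [f"{r.host} => {r.service}" for r in results if r.state == "CRITICAL"]
    return overall, summary, failing


# 27 passing checks plus one failing check, mirroring the counts on the slide.
results = [
    CheckResult(f"frontend{i}.example.com", "solrserver ping", "OK")
    for i in range(2, 29)
]
results.append(
    CheckResult("frontend1.example.com", "solrserver client tests", "CRITICAL")
)

overall, summary, failing = aggregate(results)
print(overall, summary)
print("Services in CRITICAL:", ", ".join(failing))
```

The operator receives one notification describing 28 checks instead of a notification per check, and the failing service is still named in that single message.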
  187. Riemann's event grouping http://riemann.io/howto.html#group-events-in-time

  188. Don't startle the operator

  189. None
  190. Rollup

  191. Limit alerts that are emitted

  192. Aggregate alerts together
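
One way to picture a rollup is to bucket alerts into fixed time windows and emit a single summary per window. The sketch below does that in Python. The `rollup` function, its `window_seconds` parameter, and the sample alerts are assumptions for illustration, and this is not how Riemann or any other tool named in the talk implements grouping.

```python
from collections import defaultdict


def rollup(alerts, window_seconds=60):
    """Group (timestamp, message) alerts into fixed-size time windows.

    Returns one tuple per window: (window_start, alert_count, distinct_messages).
    """
    buckets = defaultdict(list)
    for timestamp, message in alerts:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        buckets[window_start].append(message)
    return [
        (start, len(messages), sorted(set(messages)))
        for start, messages in sorted(buckets.items())
    ]


# Five raw alerts, timestamps in seconds (made-up sample data).
alerts = [
    (0, "db1 down"),
    (5, "web1 500s"),
    (12, "web2 500s"),
    (59, "web1 500s"),
    (61, "db1 down"),
]

summaries = rollup(alerts)
for start, count, messages in summaries:
    print(f"t={start}s: {count} alerts ({', '.join(messages)})")
```

Five raw alerts become two notifications, one per minute, each listing the distinct problems seen in that window.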

  193. Incident response:

  194. Brute force: manual silence

  195. limit # of engineers who watch alerts & graphs

  196. Alerting system

  197. Flapjack

  198. Delay-based notification

  199. Per-media rollup threshold
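
The two ideas on slides 198 and 199 can be sketched together. This Python example is an illustration under assumed names (`Media`, `persisted`, `messages_for`) and an assumed policy. It does not show Flapjack's code, configuration, or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Media:
    """A notification channel with its own rollup threshold (hypothetical)."""
    name: str
    rollup_threshold: int  # at or above this many failures, send one summary


def persisted(failing_since, now, delay_seconds):
    """Delay-based notification: keep only failures older than the delay."""
    return sorted(
        check
        for check, since in failing_since.items()
        if now - since >= delay_seconds
    )


def messages_for(media, failing_checks):
    """Return the messages to send on one media for the given failing checks."""
    if not failing_checks:
        return []
    if len(failing_checks) >= media.rollup_threshold:
        joined = ", ".join(sorted(failing_checks))
        return [f"{len(failing_checks)} checks failing: {joined}"]
    return [f"{check} is failing" for check in sorted(failing_checks)]


# Check name -> time (in seconds) at which it started failing; sample data.
failing_since = {"web1:http": 0, "web2:http": 10, "db1:disk": 20, "cache1:ping": 115}

# At t=120s with a 30s delay, the 5-second-old cache1 failure is held back.
due = persisted(failing_since, now=120, delay_seconds=30)

sms_messages = messages_for(Media("sms", rollup_threshold=2), due)
email_messages = messages_for(Media("email", rollup_threshold=10), due)
print(sms_messages)
print(email_messages)
```

With these assumed thresholds, the interruptive channel (SMS) gets a single summary while email keeps the per-check detail, and a failure that has only just started notifies nobody yet.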

  200. Don't startle the operator

  201. Granular alerting levels

  202. Alerta

  203. github.com/guardian/alerta/wiki/Alert-Format Alerta alerting levels

  204. Nagios alerting levels
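
To make the contrast concrete: Nagios plugins report one of four states through their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The Python sketch below sets that next to a finer-grained scale. The `Severity` levels and the `route` policy are invented for illustration and are not Alerta's actual levels. The Alert-Format page cited on slide 203 is the place to look for those.

```python
from enum import IntEnum


class NagiosState(IntEnum):
    """The four states a Nagios plugin can return via its exit code."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


class Severity(IntEnum):
    """A hypothetical finer-grained scale for this sketch (higher is worse)."""
    DEBUG = 0
    INFORMATIONAL = 1
    WARNING = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


def route(severity):
    """Pick a notification channel by severity (assumed policy)."""
    if severity >= Severity.MAJOR:
        return "page"
    if severity >= Severity.WARNING:
        return "email"
    return "dashboard"


for level in Severity:
    print(f"{level.name:<13} -> {route(level)}")
```

More levels give room to expose low-severity events on a dashboard without interrupting anyone, and to reserve pages for the few levels that warrant them.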

  205. • a • b • c • d • q

    n e f h i j k l m o p r s t u v w x y z g @abestanway's talk: https://speakerdeck.com/astanway/mom-my-algorithms-suck
  206. • a • b • c • d • q

    n e f h i j k l m o p r s t u v w x y z g @abestanway's talk: https://speakerdeck.com/astanway/mom-my-algorithms-suck
  207. • a • b • c • d • q

    n e f h i j k l m o p r s t u v w x y z g we alerts now @abestanway's talk: https://speakerdeck.com/astanway/mom-my-algorithms-suck
  208. None
  209. It's not all doom and gloom

  210. We are on the cutting edge

  211. http://www.flickr.com/photos/quinnanya/5646121120/sizes/l/ pulse oximeter was used

  212. None
  213. Don't startle the operator

  214. Don't suggest, expose

  215. We design alerts for humans

  216. Let's understand how humans think

  217. None
  218. Thank you!

  219. Thank you! — the talk? Let @auxesis know!