DevOpsPorto Meetup9: There is no such thing as human error by João Miranda

Talk delivered by João Miranda

DevOpsPorto

October 14, 2017

Transcript

  1. There is no such thing as... Human Error

  2. Ego Self-Massage • 17 years in the IT world: developer, scrum

    master, ALM team lead, agile coach, solution architect, engineering manager • Manages (huh... tries to cope with) 10 Scrum teams • Co-organizes DevOps Lisbon meetup • Loves Software Engineering
  3. Favourite journalist question after an accident?

  4. “Was it a human or a technical error?”

  5. “Employing simplicity thinking and linear logic, the official findings and

    the judicial rulings determined that the train driver was “exclusively” responsible for the crash.”* * Disaster complexity and the Santiago de Compostela train derailment
  6. Amazon’s outage “Amazon’s massive AWS outage was caused by human

    error. One incorrect command and the whole internet suffers.” Recode. March 2, 2017
  7. “During the deployment of the new code, however, one of

    Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment (...)” Knightmare: A DevOps Cautionary Tale Knight Capital Loses $440 Million in 30 Minutes
  8. Consumer credit reporting agency. Info on 800 million consumers.

  9. Hackers exposed the Social Security numbers, drivers licenses and other

    sensitive info of 143 million customers.
  10. … yup… you’ve guessed it...

  11. “Former Equifax CEO says breach boiled down to one person

    not doing their job.” https://techcrunch.com/2017/10/03/former-equifax-ceo-says-breach-boiled-down-to-one-person-not-doing-their-job/
  12. “It’s well established that accidents cannot be attributed to a

    single cause or (...) a single individual” Industrial Accident Prevention, H.W. Heinrich, Dan Petersen, Nestor Roos, 1980 (5th edition), McGraw-Hill Book Company (ISBN 0-07-028061-4)
  13. Coping With Complexity Humans are a feature of complex systems.

    They solve the most complex issues (not computers), but they also have some blind spots.
  14. Cognitive Demands of a Domain • Dynamism • Number of

    parts and extensiveness of its interconnections • Uncertainty • Risk A domain is complex if high in all of these dimensions. * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  15. Failure to Adapt to New Events People may get fixated

    on initial assessments. * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  16. “…[people] have difficulty in dealing with exponential developments (hard to

    imagine how fast things can change, or accelerate).” Dietrich Dörner, “On The Difficulties People Have In Dealing With Complexity” (1980), via John Allspaw, “Resilience Engineering Part II: Lenses” (2012)
  17. Failure to Use External Guidance to Direct Focus E.g.: Start

    treating a cause before treating more pressing consequences. * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  18. Failures of Prospective Memory Forgetting to recall an intention for

    some future point in time. * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  19. Treating Interconnected Events as Independent E.g.: Failing to consider how

    a recently deployed change to the Users API may be causing the Check-out process to fail. * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  20. “…[people] tend to think in causal series as opposed to

    causal nets (A, therefore B) -> (A and B, therefore C and D, therefore E and A and F)” Dietrich Dörner, “On The Difficulties People Have In Dealing With Complexity” (1980), via John Allspaw, “Resilience Engineering Part II: Lenses” (2012)
  21. Over Reliance on Familiar Signs “The site is so slow.

    It must be the database again.” * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  22. “Old” View Of Human Error • Human error is cause of failure

    • Engineered systems are safe • Make progress by protecting systems from unreliable humans
  23. It’s so easy and tempting to point fingers and find

    scapegoats after the fact. But...
  24. ...we’re humans. We’re not rational or objective beings. Here’s why.

  25. Hindsight Bias “The inclination, after an event has occurred, to

    see the event as having been predictable, despite there having been little or no objective basis for predicting it.” “Hindsight bias”
  26. It’s so obvious! How could they have missed it?

  27. Fundamental Attribution Error “Our tendency to explain someone’s behaviour based

    on internal factors, such as personality or disposition, and to underestimate the influence that external factors, such as situational influences (...).” “Fundamental Attribution Error - Definition & Overview”
  28. It’s easier to change people than basic beliefs about a

    system.
  29. Counterfactuals “The human tendency to create possible alternatives to life events

    that have already occurred. They are thoughts that consist of ‘If I had only’.” “Counterfactual Thinking”
  30. Counterfactuals can affect people’s emotions, e.g.: regret, guilt or relief.

    They can also affect how they decide who deserves blame and responsibility.
  31. Stressing what was not done explains nothing about what actually

    happened, or why.
  32. What can we do?

  33. Local Rationality Principle “People do things that make sense to

    them given their goals, understanding of the situation and focus of attention at that time. Work needs to be understood from the local perspectives of those doing the work.” “Local Rationality”
  34. The local decision is always right.

  35. Normal people, doing normal things.

  36. So… can we really find “the” cause?

  37. None
  38. Scott A. Snook, Friendly Fire: The Accidental Shootdown of U.S.

    Black Hawks over Northern Iraq
  39. We don’t find cause. We select cause.

  40. “New” View On Human Error • Human error as symptom of failure

    • Safety is not inherent in systems • Human error connected to features of people, tools, tasks and operating environment
  41. How Organizations Process Information

    Pathological                  | Bureaucratic               | Generative
    Power-oriented                | Rule-oriented              | Performance-oriented
    Low co-operation              | Modest co-operation        | High co-operation
    Messengers shot               | Messengers neglected       | Messengers trained
    Responsibilities shirked      | Narrow responsibilities    | Risks are shared
    Bridging discouraged          | Bridging tolerated         | Bridging encouraged
    Failure leads to scapegoating | Failure leads to justice   | Failure leads to inquiry
    Novelty crushed               | Novelty leads to problems  | Novelty implemented

    Ron Westrum, “A typology of organisational cultures” (2004)
  42. Four Needs an accident report must fulfill Sidney Dekker, “The

    psychology of accident investigation: epistemological, preventive, moral and existential meaning-making” (2014)
  43. Epistemological • Preventive • Moral • Existential

    Most of the time they are in conflict.
  44. The way we look at human error focuses on moral

    and existential needs.
  45. And what do we get by focusing on those two

    needs?
  46. Blame Culture Real or perceived. It doesn’t matter.

  47. Learning from failure is at least as important as fulfilling

    moral and existential needs.
  48. A Systematic Approach to Learn From Past Events Five steps:

    from context-specific to concept-dependent. Sidney Dekker, “Reconstructing human contributions to accidents: the new view on error and performance” (2002)
  49. 1. Lay Out the Sequence of Events in Context-Specific Language Data about

    an incident reveals a sequence of activities — human observations, actions, assessments, decisions, as well as changes in the state of the process or system.
  50. 2. Divide Sequence of Events into Episodes If the accident

    evolves over a long period of time.
  51. 3. Find Out How the World Looked or Changed During

    Each Episode Find out what their process was doing and what data was available. Couple behaviour with situation.
  52. 4. Identify People's Goals, Focus of Attention and Knowledge Active

    at the Time What people know and what they try to accomplish (their goals) determines where they will look, hence the data that is observable to them.
  53. 5. Step Up to a Conceptual Description It’s crucial so

    that we can learn from failures and identify commonalities between different events.
  54. Now go and make your organization more humane...

  55. ...and Resilient!

  56. Human Factors & System Safety MSc and Learning Labs

  57. Q&A?
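
The first two of the five steps on slides 49 and 50 (lay out the sequence of events, then divide it into episodes) could be sketched in code roughly as follows. This is only an illustration and not something shown in the talk: the `Event` type, the `split_into_episodes` helper, the 30-minute gap rule and the sample incident are all assumptions made for the example. The slides only say that events are observations, actions, assessments, decisions and state changes, and that episodes help when an accident evolves over a long period of time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    """One entry in the timeline (hypothetical type, not from the talk)."""
    at: datetime
    kind: str  # "observation", "action", "assessment", "decision" or "state_change"
    text: str  # context-specific wording, as the people involved would describe it


def split_into_episodes(events, max_gap):
    """Sort events by time and start a new episode when the gap exceeds max_gap.

    Splitting on a time gap is an assumption of this sketch; an investigator
    may prefer to split on changes in the situation instead.
    """
    ordered = sorted(events, key=lambda e: e.at)
    episodes = []
    for event in ordered:
        if episodes and event.at - episodes[-1][-1].at <= max_gap:
            episodes[-1].append(event)
        else:
            episodes.append([event])
    return episodes


# Made-up incident, loosely based on the Users API / Check-out example on slide 19.
t0 = datetime(2017, 10, 14, 9, 0)
timeline = [
    Event(t0 + timedelta(minutes=12), "observation", "Check-out error rate alert fires"),
    Event(t0, "action", "Change to the Users API is deployed"),
    Event(t0 + timedelta(minutes=15), "assessment", "On-call suspects the database"),
    Event(t0 + timedelta(hours=3), "decision", "Users API change is rolled back"),
]

episodes = split_into_episodes(timeline, max_gap=timedelta(minutes=30))
for number, episode in enumerate(episodes, start=1):
    print(f"Episode {number}")
    for event in episode:
        print(f"  {event.at:%H:%M} [{event.kind}] {event.text}")
```

Keeping the wording of each event context-specific and attaching what people saw and assessed at the time leaves the later steps (how the world looked in each episode, what goals and knowledge were active) to be filled in per episode, rather than judged in hindsight.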