DevOpsPorto Meetup9: There is no such thing as human error by João Miranda

DevOpsPorto
October 14, 2017

Talk delivered by João Miranda

Transcript

  1. Ego Self-Massage
    17 years in the IT world: developer, scrum master, ALM team lead, agile coach, solution architect, engineering manager.
    Manages (huh... tries to cope with) 10 Scrum teams.
    Co-organizes DevOps Lisbon meetup.
    Loves Software Engineering.
  2. “Employing simplicity thinking and linear logic, the official findings and the judicial rulings determined that the train driver was “exclusively” responsible for the crash.”*
    * Disaster complexity and the Santiago de Compostela train derailment
  3. Amazon’s outage
    “Amazon’s massive AWS outage was caused by human error. One incorrect command and the whole internet suffers.”
    Recode, March 2, 2017
  4. “During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment (...)”
    Knightmare: A DevOps Cautionary Tale
    Knight Capital Loses $440 Million in 30 Minutes
  5. “Former Equifax CEO says breach boiled down to one person not doing their job.”
    https://techcrunch.com/2017/10/03/former-equifax-ceo-says-breach-boiled-down-to-one-person-not-doing-their-job/
  6. “It’s well established that accidents cannot be attributed to a single cause or (...) a single individual”
    Industrial Accident Prevention, H.W. Heinrich, Dan Petersen, Nestor Roos, 1980 (5th edition), McGraw-Hill Book Company (ISBN 0-07-028061-4)
  7. Coping With Complexity
    Humans are a feature of complex systems. They solve the most complex issues (not computers), but they also have some blind spots.
  8. Cognitive Demands of a Domain
    • Dynamism
    • Number of parts and extensiveness of its interconnections
    • Uncertainty
    • Risk
    A domain is complex if high in all of these dimensions.
    * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  9. Failure to Adapt to New Events
    People may get fixated on initial assessments.
    * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  10. “…[people] have difficulty in dealing with exponential developments (hard to imagine how fast things can change, or accelerate).”
    Dietrich Dörner, “On The Difficulties People Have In Dealing With Complexity” (1980), via John Allspaw, “Resilience Engineering Part II: Lenses” (2012)
  11. Failure to Use External Guidance to Direct Focus
    E.g.: Start treating a cause before treating more pressing consequences.
    * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  12. Failures of Prospective Memory
    Forgetting to recall an intention for some future point in time.
    * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  13. Treating Interconnected Events as Independent
    E.g.: Failing to consider how a recently deployed change to the Users API may be causing the Check-out process to fail.
    * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  14. “…[people] tend to think in causal series as opposed to causal nets (A, therefore B) -> (A and B, therefore C and D, therefore E and A and F)”
    Dietrich Dörner, “On The Difficulties People Have In Dealing With Complexity” (1980), via John Allspaw, “Resilience Engineering Part II: Lenses” (2012)
  15. Over Reliance on Familiar Signs
    “The site is so slow. It must be the database again.”
    * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
  16. “Old” View Of Human Error
    • Human error is cause of failure
    • Engineered systems are safe
    • Make progress by protecting systems from unreliable humans
  17. It’s so easy and tempting to point fingers and find scapegoats after the fact. But...
  18. Hindsight Bias
    “The inclination, after an event has occurred, to see the event as having been predictable, despite there having been little or no objective basis for predicting it.”
    “Hindsight bias”
  19. Fundamental Attribution Error
    “Our tendency to explain someone’s behaviour based on internal factors, such as personality or disposition, and to underestimate the influence that external factors, such as situational influences (...).”
    “Fundamental Attribution Error - Definition & Overview”
  20. Counterfactuals
    “The human tendency to create possible alternatives to life events that have already occurred. They are thoughts that consist of ‘If I had only’.”
    “Counterfactual Thinking”
  21. Counterfactuals can affect people’s emotions, e.g.: regret, guilt or relief. They can also affect how they decide who deserves blame and responsibility.
  22. Local Rationality Principle
    “People do things that make sense to them given their goals, understanding of the situation and focus of attention at that time. Work needs to be understood from the local perspectives of those doing the work.”
    “Local Rationality”
  23. “New” View On Human Error
    • Human error as symptom of failure
    • Safety is not inherent in systems
    • Human error connected to features of people, tools, tasks and operating environment
  24. How Organizations Process Information

    | Pathological | Bureaucratic | Generative |
    |---|---|---|
    | Power-oriented | Rule-oriented | Performance-oriented |
    | Low co-operation | Modest co-operation | High co-operation |
    | Messengers shot | Messengers neglected | Messengers trained |
    | Responsibilities shirked | Narrow responsibilities | Risks are shared |
    | Bridging discouraged | Bridging tolerated | Bridging encouraged |
    | Failure leads to scapegoating | Failure leads to justice | Failure leads to inquiry |
    | Novelty crushed | Novelty leads to problems | Novelty implemented |

    Ron Westrum, “A typology of organisational cultures” (2004)
  25. Four Needs an accident report must fulfill
    Sidney Dekker, “The psychology of accident investigation: epistemological, preventive, moral and existential meaning-making” (2014)
  26. A Systematic Approach to Learn From Past Events
    Five steps: from context-specific to concept-dependent.
    Sidney Dekker, “Reconstructing human contributions to accidents: the new view on error and performance.” (2014)
  27. 1. Lay Out Sequence of Events in Context-Specific Language
    Data about an incident reveals a sequence of activities: human observations, actions, assessments, decisions, as well as changes in the state of the process or system.
  28. 3. Find Out How the World Looked or Changed During Each Episode
    Find out what their process was doing and what data was available. Couple behaviour with situation.
  29. 4. Identify People's Goals, Focus of Attention and Knowledge Active at the Time
    What people know and what they try to accomplish (their goals) determine where they will look, hence the data that is observable to them.
  30. 5. Step Up to a Conceptual Description
    This step is crucial so that we can learn from failures and identify commonalities between different events.
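
As a rough illustration of steps 1 and 3 above (lay out the sequence of events, then couple behaviour with situation), here is a minimal Python sketch. Nothing in it comes from the talk or from Dekker's paper: the `Event` type, the `lay_out` and `couple_with_situation` helpers, the event kinds and the sample incident data are all hypothetical, loosely echoing the Users API / Check-out example from slide 13.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical event kinds, borrowed from the wording of slide 27.
HUMAN_KINDS = {"observation", "action", "assessment", "decision"}
STATE_CHANGE = "state_change"


@dataclass(frozen=True)
class Event:
    at: datetime   # when it happened
    kind: str      # one of HUMAN_KINDS or STATE_CHANGE
    actor: str     # person or system component (illustrative names)
    note: str      # context-specific description, in the words of those involved


def lay_out(events):
    """Step 1 (sketch): order the raw events into a single timeline."""
    return sorted(events, key=lambda e: e.at)


def couple_with_situation(timeline):
    """Step 3 (sketch): pair each human event with the latest system state
    change that preceded it, or None if no state change is known yet."""
    pairs = []
    last_state = None
    for event in timeline:
        if event.kind == STATE_CHANGE:
            last_state = event
        elif event.kind in HUMAN_KINDS:
            pairs.append((event, last_state))
    return pairs


# Made-up incident data, for illustration only.
events = [
    Event(datetime(2017, 10, 14, 10, 5), "action", "engineer", "restarts the database"),
    Event(datetime(2017, 10, 14, 10, 0), STATE_CHANGE, "Users API", "new version deployed"),
    Event(datetime(2017, 10, 14, 10, 3), "observation", "engineer", "Check-out errors rising"),
    Event(datetime(2017, 10, 14, 10, 4), STATE_CHANGE, "Check-out", "error rate above alert threshold"),
]

timeline = lay_out(events)
pairs = couple_with_situation(timeline)

for human, state in pairs:
    situation = f"{state.actor}: {state.note}" if state else "no known state change"
    print(f"{human.at:%H:%M} {human.actor} {human.kind}: {human.note} | situation: {situation}")
```

Keeping the notes in the words of the people involved, and showing each action next to what the system looked like at that moment, is meant to make it harder to judge the action with hindsight.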