DevOpsPorto Meetup9: There is no such thing as human error by João Miranda

DevOpsPorto
October 14, 2017

Talk delivered by João Miranda

Transcript

  1. Human Error
    There is no such thing as...

  2. Ego Self-Massage
    17 years in the IT world: developer, scrum master, ALM team
    lead, agile coach, solution architect, engineering manager
    Manages (huh... tries to cope with)
    10 Scrum teams
    Co-organizes DevOps Lisbon meetup
    Loves Software Engineering

  3. Favourite journalist question after an
    accident?

  4. “Was it a human or a technical error?”

  5. “Employing simplicity thinking and linear logic,
    the official findings and the judicial rulings
    determined that the train driver was
    “exclusively” responsible for the crash.”*
    * Disaster complexity and the Santiago de
    Compostela train derailment

  6. Amazon’s outage
    “Amazon’s massive
    AWS outage was
    caused by human
    error.
    One incorrect command
    and the whole internet
    suffers.”
    Recode. March 2, 2017

  7. “During the deployment of the new code, however, one of
    Knight’s technicians did not copy the new code to one of the
    eight SMARS computer servers. Knight did not have a
    second technician review this deployment (...)”
    Knightmare: A DevOps Cautionary Tale
    Knight Capital Loses $440 Million in 30 Minutes

  8. Consumer credit reporting agency.
    Info on 800 million consumers.

  9. Hackers exposed the Social Security
    numbers, driver's licenses and other
    sensitive info of 143 million customers.

  10. … yup… you’ve guessed it...

  11. “Former Equifax CEO says breach boiled
    down to one person not doing their job.”
    https://techcrunch.com/2017/10/03/former-equifax-ceo-says-breach-boiled-down-to-one-person-not-doing-their-job/

  12. “It’s well established that
    accidents cannot be attributed
    to a single cause or (...) a
    single individual”
    Industrial Accident Prevention, H.W. Heinrich, Dan Petersen, Nestor Roos, 1980 (5th
    edition), McGraw-Hill Book Company (ISBN 0-07-028061-4)

  13. Coping With Complexity
    Humans are a feature of complex systems. They solve the
    most complex issues (not computers), but they also have
    some blind spots.

  14. Cognitive Demands of a Domain
    ● Dynamism
    ● Number of parts and extensiveness of its interconnections
    ● Uncertainty
    ● Risk
    A domain is complex if high in all of these dimensions.
    * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)

  15. Failure to Adapt to New Events
    People may get fixated on initial assessments.
    * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)

  16. “…[people] have difficulty in dealing with exponential
    developments (hard to imagine how fast things can
    change, or accelerate).”
    Dietrich Dörner, “On The Difficulties People Have In Dealing With Complexity” (1980),
    via John Allspaw, “Resilience Engineering Part II: Lenses” (2012)

  17. Failure to Use External Guidance to
    Direct Focus
    E.g.: Start treating a cause before treating more pressing
    consequences.
    * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)

  18. Failures of Prospective Memory
    Forgetting to recall an intention for some future point in time.
    * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)

  19. Treating Interconnected Events as
    Independent
    E.g.: Failing to consider how a recently deployed change to
    the Users API may be causing the Check-out process to fail.
    * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)
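
One way to counter this blind spot is to check recent changes against everything the failing service depends on, not only the service itself. The sketch below is a minimal illustration, not part of the talk: the dependency map, the service names and the `upstream_of` helper are all hypothetical.

```python
def upstream_of(service, depends_on):
    """Return every service that `service` depends on, directly or transitively."""
    seen = set()
    stack = [service]
    while stack:
        current = stack.pop()
        for dependency in depends_on.get(current, ()):
            if dependency not in seen:
                seen.add(dependency)
                stack.append(dependency)
    return seen


# Hypothetical dependency map: service -> services it calls.
depends_on = {
    "checkout": ["cart", "payments"],
    "cart": ["users-api"],
    "payments": ["users-api", "fraud-check"],
}
# Hypothetical list of services deployed shortly before the failure.
recently_changed = {"users-api", "search"}

# Recently changed services that the failing service relies on.
suspects = upstream_of("checkout", depends_on) & recently_changed
print(sorted(suspects))
```

With this made-up data, the Users API shows up as a candidate even though Check-out never calls it directly. It is only a prompt for investigation, since a dependency link does not prove a causal one.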

  20. “…[people] tend to
    think in causal series as opposed to causal nets
    (A, therefore B) ->
    (A and B, therefore C and D, therefore E and A and F)”
    Dietrich Dörner, “On The Difficulties People Have In Dealing With Complexity” (1980),
    via John Allspaw, “Resilience Engineering Part II: Lenses” (2012)

  21. Over-Reliance on Familiar Signs
    “The site is so slow. It must be the database again.”
    * David D. Woods, “Coping with complexity: The psychology of human behaviour in complex systems” (1988)

  22. “Old” View Of Human Error
    ● Human error is cause of failure
    ● Engineered systems are safe
    ● Make progress by protecting systems from unreliable
    humans

  23. It’s so easy and tempting to point fingers
    and find scapegoats after the fact.
    But...

  24. ...we’re humans. We’re not rational or
    objective beings.
    Here’s why.

  25. Hindsight Bias
    “The inclination, after an event has occurred, to see the
    event as having been predictable, despite there having
    been little or no objective basis for predicting it.”
    “Hindsight bias”

  26. It’s so obvious! How could
    they have missed it?

  27. Fundamental Attribution Error
    “Our tendency to explain someone’s behaviour based on
    internal factors, such as personality or disposition, and to
    underestimate the influence that external factors, such as
    situational influences (...).”
    “Fundamental Attribution Error - Definition & Overview”

  28. It’s easier to change people than basic
    beliefs about a system.

  29. Counterfactuals
    “The human tendency to create possible alternatives to life
    events that have already occurred.
    They are thoughts that consist of ‘If I had only’.”
    “Counterfactual Thinking”

  30. Counterfactuals can affect people’s
    emotions, e.g.: regret, guilt or relief.
    They can also affect how they decide
    who deserves blame and responsibility.

  31. Stressing what was not done explains
    nothing about what actually happened,
    or why.

  32. What can we do?

  33. Local Rationality Principle
    “People do things that make sense to them given their
    goals, understanding of the situation and focus of attention
    at that time.
    Work needs to be understood from the local perspectives of
    those doing the work.”
    “Local Rationality”

  34. The local decision is always right.

  35. Normal people, doing normal things.

  36. So… can we really find “the” cause?

  37. (Image slide; no text in transcript)

  38. Scott A. Snook, Friendly Fire: The Accidental Shootdown of U.S. Black Hawks over Northern Iraq

  39. We don’t find
    cause.
    We select cause.

  40. “New” View On Human Error
    ● Human error as symptom of failure
    ● Safety is not inherent in systems
    ● Human error connected to features of people, tools, tasks
    and operating environment

  41. How Organizations Process Information
    Pathological                  | Bureaucratic              | Generative
    Power-oriented                | Rule-oriented             | Performance-oriented
    Low co-operation              | Modest co-operation       | High co-operation
    Messengers shot               | Messengers neglected      | Messengers trained
    Responsibilities shirked      | Narrow responsibilities   | Risks are shared
    Bridging discouraged          | Bridging tolerated        | Bridging encouraged
    Failure leads to scapegoating | Failure leads to justice  | Failure leads to inquiry
    Novelty crushed               | Novelty leads to problems | Novelty implemented
    Ron Westrum, “A typology of organisational cultures” (2004)

  42. Four Needs
    an accident report must fulfill
    Sidney Dekker, “The psychology of accident investigation: epistemological, preventive, moral and existential meaning-making” (2014)

  43. Epistemological
    Preventive
    Moral
    Existential
    Most of the time they are in conflict.

  44. The way we look at human error focuses
    on moral and existential needs.

  45. And what do we get by focusing on those
    two needs?

  46. Blame Culture
    Real or perceived. It doesn’t
    matter.

  47. Learning from failure
    is at least as important
    as fulfilling moral and
    existential needs.

  48. A Systematic Approach to Learn From
    Past Events
    Five steps: from context-specific to concept-dependent.
    Sidney Dekker, “Reconstructing human contributions to accidents: the new view on error and performance” (2002)

  49. 1. Lay Out Sequence of Events in
    Context-Specific Language
    Data about an incident reveals a sequence of
    activities — human observations, actions,
    assessments, decisions, as well as changes in the
    state of the process or system.

  50. 2. Divide Sequence of Events into
    Episodes
    If the accident evolves over a long period of time.
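
Steps 1 and 2 can be sketched in a few lines of code. This is only an illustration under stated assumptions, not Dekker's method: the `Event` record, the `split_into_episodes` helper, the sample timeline and the 30-minute gap used to cut episodes are all hypothetical. In a real investigation the episode boundaries are an analyst's judgment call, not a fixed time threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """One timeline entry, written in context-specific language."""
    at: datetime
    kind: str  # e.g. "observation", "action", "assessment", "decision", "state change"
    description: str


def split_into_episodes(events, max_gap=timedelta(minutes=30)):
    """Group time-ordered events into episodes.

    Assumption for this sketch: a new episode starts whenever more than
    `max_gap` passes between two consecutive events.
    """
    episodes = []
    for event in sorted(events, key=lambda e: e.at):
        if episodes and event.at - episodes[-1][-1].at <= max_gap:
            episodes[-1].append(event)
        else:
            episodes.append([event])
    return episodes


# Hypothetical timeline of an incident.
timeline = [
    Event(datetime(2017, 10, 14, 9, 0), "state change", "Deployment of a new release starts"),
    Event(datetime(2017, 10, 14, 9, 5), "observation", "On-call engineer sees a latency alert"),
    Event(datetime(2017, 10, 14, 9, 12), "assessment", "Engineer suspects the database"),
    Event(datetime(2017, 10, 14, 11, 40), "decision", "Team decides to roll back"),
]

episodes = split_into_episodes(timeline)
for number, episode in enumerate(episodes, start=1):
    print(f"Episode {number}: {len(episode)} event(s)")
```

Keeping each entry's `kind` separate from its free-text description makes it easier, in the later steps, to line up what people observed and decided with what the system was doing at the time.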

  51. 3. Find Out How the World Looked or
    Changed During Each Episode
    Find out what their process was doing and what data
    was available. Couple behaviour with situation.

  52. 4. Identify People's Goals, Focus of
    Attention and Knowledge Active at the
    Time
    What people know and what they try to accomplish
    (their goals) determine where they will look, hence
    the data that is observable to them.

  53. 5. Step Up to a Conceptual Description
    It’s crucial so that we can learn from failures and
    identify commonalities between different events.

  54. Now go and make your
    organization more
    humane...

  55. ...and Resilient!

  56. Human Factors & System Safety
    MSc and Learning Labs

  57. Q&A?
