The 5 Whys and Other Lies

Incidents and Accidents are only 5 WHYs away from never happening again! The secret to avoiding Human Error is to remove all the humans! These myths are unfortunately still incredibly popular in our industry. Let's bust them and talk about what cutting-edge post-incident reviews look like.

Andy Fleener

May 22, 2019

Transcript

  1. @andyfleener THE API AUTHENTICATOR GONE HAYWIRE

    ▸ Users and services were unable to make API requests through SportsEngine’s API
    ▸ Why? An API Authenticator Lambda function stopped working
    ▸ Why? A bad deploy of the function lost the NODE_ENV environment variable.
    ▸ Why? A developer added a new environment variable which overrode the existing NODE_ENV configuration
    ▸ Why? The developer didn’t know NODE_ENV was being set so he didn’t know it would override it.
    ▸ Why? This lambda function was his first real production-ready function and he wasn’t properly trained.
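One plausible mechanism behind the lost NODE_ENV, assuming the deploy supplied a new set of environment variables that replaced the function's existing set instead of being merged into it. The sketch below only models that difference in plain TypeScript and makes no AWS calls; the `Env` type, the function names, and the `API_TIMEOUT_MS` variable are hypothetical illustrations, not SportsEngine's deploy code.

```typescript
// Hypothetical model of a function's environment: variable name -> value.
type Env = Record<string, string>;

// Replace-style deploy: only the variables in `update` survive.
function replaceEnv(_current: Env, update: Env): Env {
  return { ...update };
}

// Merge-style deploy: existing variables are kept unless `update` overrides them.
function mergeEnv(current: Env, update: Env): Env {
  return { ...current, ...update };
}

// Names that were set before the deploy and are missing after it.
function droppedKeys(before: Env, after: Env): string[] {
  return Object.keys(before).filter((key) => !(key in after));
}

const deployed: Env = { NODE_ENV: "production" };
const added: Env = { API_TIMEOUT_MS: "3000" }; // hypothetical new variable

const replaced = replaceEnv(deployed, added);
const merged = mergeEnv(deployed, added);

console.log(droppedKeys(deployed, replaced)); // [ 'NODE_ENV' ]
console.log(droppedKeys(deployed, merged)); // []
```

A check like `droppedKeys` run before a deploy is one way such a loss could be surfaced to whoever is deploying, regardless of how familiar they are with the function.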
  3. @andyfleener Benefits of the 5 Whys

    ▸ Help identify the root cause of a problem.
    ▸ Determine the relationship between different root causes of a problem.
    ▸ One of the simplest tools; easy to complete without statistical analysis.

    When Is 5 Whys Most Useful?

    ▸ When problems involve human factors or interactions.
    ▸ In day-to-day business life; can be used within or without a Six Sigma project.
  5. @andyfleener CAUSE IS NOT SOMETHING YOU FIND. CAUSE IS SOMETHING YOU CONSTRUCT.

    Sidney Dekker, THE FIELD GUIDE TO UNDERSTANDING HUMAN ERROR
  6. @andyfleener ACCIDENTS EMERGE FROM A CONFLUENCE OF CONDITIONS AND OCCURRENCES THAT ARE USUALLY ASSOCIATED WITH THE PURSUIT OF SUCCESS, BUT IN THIS COMBINATION—EACH NECESSARY BUT ONLY JOINTLY SUFFICIENT—ABLE TO TRIGGER FAILURE INSTEAD.

    Dekker, Hollnagel, Woods, Cook, RESILIENCE ENGINEERING: NEW DIRECTIONS FOR MEASURING AND MAINTAINING SAFETY IN COMPLEX SYSTEMS
  7. @andyfleener THE REDUCTIONIST VIEW FALLACIES

    ▸ The system is made up of functioning components and it’s only when humans intervene that the systems malfunction
    ▸ We only have to try hard enough to understand exactly what happened. Complete knowledge is attainable.
    ▸ “cause-effect symmetry”: all effects have a cause and all causes have effects
    ▸ “root cause seduction”: identifying a single root cause creates a tendency to be overconfident about how much we know
  8. @andyfleener IN ORDER TO LEARN (WHICH SHOULD BE THE GOAL OF ANY RETROSPECTIVE OR POST-HOC INVESTIGATION) YOU WANT MULTIPLE AND DIVERSE PERSPECTIVES. YOU GET THESE BY ASKING PEOPLE FOR THEIR OWN NARRATIVES. EFFECTIVELY, YOU’RE ASKING “HOW?”

    John Allspaw, THE INFINITE HOWS
  9. @andyfleener CONDUCT INTERVIEWS

    ▸ Focus on the feelings!
    ▸ Treat people like the experts they are!
    ▸ Find the surprises. I didn’t know it worked like that! Ah hah!
    ▸ Stay away from counterfactuals. There’s no value in discussing things that never happened.
  10. @andyfleener CONTRIBUTING FACTORS

    ▸ API Authenticator is a relatively new type of service (Lambda-based), only the second Lambda function created at SE. As a result, the deploy process is not yet standardized across the platform.
    ▸ This was the first incident at SE that involved a Lambda function. We were surprised that the function's failure mode was to use the development environment configuration, which resulted in an impossible-to-complete request to an undefined DNS record. The function was timing out after 5000ms.
    ▸ During the incident we struggled with false positives: testing via the API Gateway console was giving us different results. Our unfamiliarity with both the Lambda and API Gateway consoles resulted in a lack of visibility, access and monitoring for AWS infrastructure.
    ▸ This particular failure mode (5s request timeouts) was not being monitored.
    ▸ Increased traffic due to a feature release was hard to detect because aggressive caching hid the true request volume. Post-release volume grew enough that, combined with this failure mode, API Gateway started throttling traffic.
    ▸ This incident came with added pressure from an unusually big feature launch.
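The surprise named in the second factor, a function quietly using development configuration once NODE_ENV was gone, matches a common pattern in which a missing NODE_ENV defaults to development. The slides do not show the function's code, so the following is only a sketch of that pattern next to a fail-fast alternative; the `Config` shape, the placeholder hostnames, and all function names are hypothetical.

```typescript
// Hypothetical configuration shape and placeholder hosts (not real endpoints).
type Config = { apiHost: string };
type EnvVars = Record<string, string | undefined>;

const DEVELOPMENT: Config = { apiHost: "api.dev.example.invalid" };
const PRODUCTION: Config = { apiHost: "api.example.invalid" };

// Look up the settings for a named environment, if it is a known one.
function configFor(name: string): Config | undefined {
  if (name === "development") return DEVELOPMENT;
  if (name === "production") return PRODUCTION;
  return undefined;
}

// Silent fallback: a missing or unknown NODE_ENV selects development settings.
function loadConfigWithDefault(env: EnvVars): Config {
  return configFor(env["NODE_ENV"] ?? "development") ?? DEVELOPMENT;
}

// Fail fast: a missing or unknown NODE_ENV is an immediate, visible error.
function loadConfigStrict(env: EnvVars): Config {
  const name = env["NODE_ENV"];
  if (name === undefined) {
    throw new Error("NODE_ENV is not set; refusing to guess an environment");
  }
  const config = configFor(name);
  if (config === undefined) {
    throw new Error(`Unknown NODE_ENV: ${name}`);
  }
  return config;
}

const afterBadDeploy: EnvVars = {}; // NODE_ENV was lost in the deploy

console.log(loadConfigWithDefault(afterBadDeploy).apiHost); // api.dev.example.invalid

try {
  loadConfigStrict(afterBadDeploy);
} catch (error) {
  console.log((error as Error).message); // NODE_ENV is not set; refusing to guess an environment
}
```

Under these assumptions, the strict variant would trade a slow request timeout against an unreachable host for an error at startup, which tends to be easier to notice and to monitor.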
  11. @andyfleener RECAP

    ▸ We can do so much better than the 5 whys
    ▸ Quality retrospectives are a learning exercise, bring your curiosity!
    ▸ Embrace the complexity, your goal should be to wade through it, not erase it.
    ▸ Be a systems detective! Conduct interviews, acknowledge incomplete truths, it’s ok not to know something!
    ▸ Tell the story!
  12. @andyfleener MASTERY HAS LESS TO DO WITH PUSHING LEVERAGE POINTS THAN IT DOES WITH STRATEGICALLY, PROFOUNDLY, MADLY, LETTING GO AND DANCING WITH THE SYSTEM.

    Donella H. Meadows, THINKING IN SYSTEMS