Slide 1

Slide 1 text

@andyfleener THE 5 WHY’S AND OTHER LIES BY ANDY FLEENER

Slide 2

Slide 2 text


Slide 3

Slide 3 text


Slide 4

Slide 4 text


Slide 5

Slide 5 text

WHAT ARE THE 5 WHYS?

Slide 6

Slide 6 text

REAL WORLD EXAMPLE TIME!

Slide 7

Slide 7 text

THE API AUTHENTICATOR GONE HAYWIRE
▸ Users and services were unable to make API requests through SportsEngine’s API.
▸ Why? An API Authenticator Lambda function stopped working.
▸ Why? A bad deploy of the function lost the NODE_ENV environment variable.
▸ Why? A developer added a new environment variable which overrode the existing NODE_ENV configuration.
▸ Why? The developer didn’t know NODE_ENV was being set so he didn’t know it would override it.
▸ Why? This lambda function was his first real production ready function and he wasn’t properly trained.
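
One way a newly added variable can end up displacing NODE_ENV is a deploy step that treats the supplied variables as the complete set instead of merging them into what is already configured. The sketch below models that difference with plain dictionaries. It is an illustration under assumptions, not the actual SportsEngine deploy code: the function names, the `FEATURE_FLAG` variable, and its value are all hypothetical.

```python
# Hypothetical sketch: models how a deploy step might handle environment
# variables, using plain dicts. All names and values are assumptions.

def replace_env(current: dict, supplied: dict) -> dict:
    """Treat the supplied variables as the full configuration."""
    return dict(supplied)


def merge_env(current: dict, supplied: dict) -> dict:
    """Keep existing variables and layer the supplied ones on top."""
    return {**current, **supplied}


deployed = {"NODE_ENV": "production"}
supplied = {"FEATURE_FLAG": "on"}  # hypothetical new variable

after_replace = replace_env(deployed, supplied)
after_merge = merge_env(deployed, supplied)

print(after_replace.get("NODE_ENV"))  # None: the existing value is gone
print(after_merge.get("NODE_ENV"))    # production
```

With replace semantics, nothing in the new configuration looks wrong on its own; the loss is only visible when compared against what was there before.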

Slide 9

Slide 9 text

Benefits of the 5 Whys
▸ Help identify the root cause of a problem.
▸ Determine the relationship between different root causes of a problem.
▸ One of the simplest tools; easy to complete without statistical analysis.

When Is 5 Whys Most Useful?
▸ When problems involve human factors or interactions.
▸ In day-to-day business life; can be used within or without a Six Sigma project.

Slide 11

Slide 11 text


Slide 12

Slide 12 text

FIND THE ROOT CAUSE!

Slide 13

Slide 13 text

THE DEVELOPER WASN’T PROPERLY TRAINED.

Slide 15

Slide 15 text


Slide 16

Slide 16 text

CAUSE IS NOT SOMETHING YOU FIND. CAUSE IS SOMETHING YOU CONSTRUCT.
Sidney Dekker, THE FIELD GUIDE TO UNDERSTANDING HUMAN ERROR

Slide 17

Slide 17 text


Slide 18

Slide 18 text

THE DEVELOPER INCORRECTLY UPDATED THE ENVIRONMENT VARIABLES.

Slide 19

Slide 19 text

ACCIDENTS EMERGE FROM A CONFLUENCE OF CONDITIONS AND OCCURRENCES THAT ARE USUALLY ASSOCIATED WITH THE PURSUIT OF SUCCESS, BUT IN THIS COMBINATION—EACH NECESSARY BUT ONLY JOINTLY SUFFICIENT—ABLE TO TRIGGER FAILURE INSTEAD.
Dekker, Hollnagel, Woods, Cook, RESILIENCE ENGINEERING: NEW DIRECTIONS FOR MEASURING AND MAINTAINING SAFETY IN COMPLEX SYSTEMS

Slide 20

Slide 20 text

WHY! IS! THE! WRONG! QUESTION
John Allspaw, THE INFINITE HOWS
* emphasis added by me

Slide 21

Slide 21 text


Slide 22

Slide 22 text

THE DEVELOPER INCORRECTLY UPDATED THE ENVIRONMENT VARIABLES.

Slide 23

Slide 23 text

HUMAN ERROR

Slide 24

Slide 24 text

WHY LEADS TO CAUSALITY. CAUSALITY LEADS TO HUMAN ERROR. WHY?

Slide 25

Slide 25 text


Slide 26

Slide 26 text

THE REDUCTIONIST VIEW FALLACIES
▸ The system is made up of functioning components, and it’s only when humans intervene that the system malfunctions.
▸ We only have to try hard enough to understand exactly what happened. Complete knowledge is attainable.
▸ “Cause-effect symmetry”: all effects have a cause and all causes have effects.
▸ “Root cause seduction”: identifying a single root cause creates a tendency to be overconfident about how much we know.

Slide 27

Slide 27 text


Slide 28

Slide 28 text

OK, I GOT IT. 5 WHYS ARE BAD.

Slide 29

Slide 29 text


Slide 30

Slide 30 text

IN ORDER TO LEARN (WHICH SHOULD BE THE GOAL OF ANY RETROSPECTIVE OR POST-HOC INVESTIGATION) YOU WANT MULTIPLE AND DIVERSE PERSPECTIVES. YOU GET THESE BY ASKING PEOPLE FOR THEIR OWN NARRATIVES. EFFECTIVELY, YOU’RE ASKING “HOW?”
John Allspaw, THE INFINITE HOWS

Slide 31

Slide 31 text

THE PRIME DIRECTIVE IS TO LEARN

Slide 33

Slide 33 text

BEHIND HUMAN ERROR SEARCH OUT SECOND STORIES

Slide 34

Slide 34 text

CONDUCT INTERVIEWS
▸ Focus on the feelings!
▸ Treat people like the experts they are!
▸ Find the surprises. “I didn’t know it worked like that!” “Ah hah!”
▸ Stay away from counterfactuals. There’s no value in discussing things that never happened.

Slide 35

Slide 35 text

CONTRIBUTING FACTORS
▸ API Authenticator is a relatively new type of service (Lambda-based), only the second Lambda function created at SE. As a result, the deploy process is not yet standardized across the platform.
▸ This was the first incident at SE that involved a Lambda function. We were surprised that the function’s failure mode was to use the development environment configuration, which resulted in an impossible-to-complete request to an undefined DNS record. The function was timing out after 5000ms.
▸ During the incident we struggled with false positives: testing via the API Gateway console was giving us different results. Our unfamiliarity with both the Lambda and API Gateway consoles resulted in a lack of visibility, access and monitoring for AWS infrastructure.
▸ This particular failure mode (5s request timeouts) was not being monitored.
▸ Increased traffic due to a feature release was hard to detect because aggressive caching hid the true request volume. Post-release volume grew enough that, combined with this failure mode, API Gateway started throttling traffic.
▸ This incident came with added pressure from an unusually big feature launch.
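
The surprise in the second factor, a missing NODE_ENV quietly selecting the development configuration, can be made concrete with a small sketch. It contrasts a lookup that falls back to a default with one that refuses to start. This is an illustration under assumptions, not the function's real code: the `CONFIGS` table, the host names, and both resolver functions are hypothetical.

```python
# Hypothetical sketch of environment selection; not SportsEngine's code.
# The config table and host names are made up for illustration.

CONFIGS = {
    "development": {"auth_host": "auth.dev.example.invalid"},
    "production": {"auth_host": "auth.example.invalid"},
}


def resolve_lenient(environ: dict) -> dict:
    """Fall back to the development config when NODE_ENV is missing."""
    return CONFIGS[environ.get("NODE_ENV", "development")]


def resolve_strict(environ: dict) -> dict:
    """Raise immediately when NODE_ENV is missing or unknown."""
    name = environ.get("NODE_ENV")
    if name not in CONFIGS:
        raise RuntimeError(f"NODE_ENV is missing or unknown: {name!r}")
    return CONFIGS[name]


env_after_deploy = {"FEATURE_FLAG": "on"}  # NODE_ENV is absent

print(resolve_lenient(env_after_deploy)["auth_host"])  # auth.dev.example.invalid

try:
    resolve_strict(env_after_deploy)
except RuntimeError as err:
    print(err)  # NODE_ENV is missing or unknown: None
```

The lenient version surfaces the problem late, as slow requests to a host that does not resolve. The strict version surfaces it at startup with an explicit message. Which trade-off suits a given service is a judgment call.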

Slide 36

Slide 36 text

TELL STORIES

Slide 37

Slide 37 text

RECAP
▸ We can do so much better than the 5 whys.
▸ Quality retrospectives are a learning exercise. Bring your curiosity!
▸ Embrace the complexity. Your goal should be to wade through it, not erase it.
▸ Be a systems detective! Conduct interviews and acknowledge incomplete truths. It’s OK not to know something!
▸ Tell the story!

Slide 38

Slide 38 text

“MASTERY HAS LESS TO DO WITH PUSHING LEVERAGE POINTS THAN IT DOES WITH STRATEGICALLY, PROFOUNDLY, MADLY, LETTING GO AND DANCING WITH THE SYSTEM.”
Donella H. Meadows, THINKING IN SYSTEMS

Slide 39

Slide 39 text
