Slide 1

Slide 1 text

The Detective Hat Investigating production issues 1 Jenny Bramble @jennydoesthings Adrian P. Dunston

Slide 2

Slide 2 text

Hey, dev. We're getting reports of timeouts in the logs. Looks like we got a dead connection. 2

Slide 3

Slide 3 text

I'm sorry to hear that. My condolences. 3

Slide 4

Slide 4 text

We seem to be getting a lot of dead connections recently. Seems like somebody should look into it. 4

Slide 5

Slide 5 text

Yeah, maybe somebody should. But I got my own problems. Besides, I don't do that sort of thing anymore. 5

Slide 6

Slide 6 text

Seems to me like, if a person lets too many dead connections go, eventually something inside them dies too. 6

Slide 7

Slide 7 text

Anyway, see you around... 7

Slide 8

Slide 8 text

I'm a developer. I've got a sweet little epic going. Lots of little tickets to take care of. But darnit she's right. It's a dirty world out there. 8

Slide 9

Slide 9 text

Time to put my detective hat on. 9

Slide 10

Slide 10 text

The Detective Hat Investigating production issues 10

Slide 11

Slide 11 text

Hi, I'm Jenny! ● Director of Quality Engineering at Papa ● Test-based human for most of my career ● Human interfacing is my favorite thing ● Two cats—Dante and Dax ● My pronouns are she/her 11

Slide 12

Slide 12 text

Adrian P. Dunston [email protected] @bitcapulet 12

Slide 13

Slide 13 text

Adrian P. Dunston [email protected] @bitcapulet 13

Slide 14

Slide 14 text

14

Slide 15

Slide 15 text

The purpose of this presentation is to help develop a mental model for tracking down production issues in your codebase. 15

Slide 16

Slide 16 text

The Detective Hat Prologue - Mindset 1. Taking the Case 2. Hitting the Streets 3. Chase Scene! 4. Putting them away… 16

Slide 17

Slide 17 text

Prologue Mindset 17

Slide 18

Slide 18 text

Confession Time… 18

Slide 19

Slide 19 text

We have no idea what police-work is like. But we've got a soft spot for noir stories and detective shows. 19

Slide 20

Slide 20 text

Also… we love production issues. 20

Slide 21

Slide 21 text

Production issues are a game you can win. The truth is out there. You can find it. 21

Slide 22

Slide 22 text

This is also another opportunity to do what Andy Knight was saying with "Shift Right." It's a chance to learn your system better and deepen your mental model. 22

Slide 23

Slide 23 text

Okay, my confession time. I love telling stories about old scars. They're shared history, even if you weren't there. You can also read the future in bumps, scars, and bruises. 23

Slide 24

Slide 24 text

Testing is expansive, checking for edge cases and fanning out. Production bughunts are contractive, drawing a box around it and narrowing down. 24

Slide 25

Slide 25 text

All production issues are a failure of process. 25

Slide 26

Slide 26 text

Part 1 Taking the Case 26

Slide 27

Slide 27 text

Someone comes in asking for help. There's something wrong, but they don't have a lot to go on, and there's little reward to be had from tracking it down. 27

Slide 28

Slide 28 text

Naturally your first instinct is to tell them to buzz off. This brings us to the first rule of the detective hat. 28

Slide 29

Slide 29 text

Believe the problem is real until you can prove it isn't. 29

Slide 30

Slide 30 text

Believe them on the first report. Maybe it's not like they think it is. Maybe the problem isn't really in your area. But there IS a problem. 30

Slide 31

Slide 31 text

Follow up on the impact and the reach of the problem. Maybe it doesn't need to be addressed. But remember, the decision isn't "is this really impacting people?" The decision is, "can I live with that impact?" 31

Slide 32

Slide 32 text

What if it's a false positive? Well then, the problem is that your system is sending false positives. 32

Slide 33

Slide 33 text

If they have a problem, and you tell them they don't, well now they have 2 problems. 33

Slide 34

Slide 34 text

Now that you've taken the case, it's time to take a deep breath. They're not going to give you much to go on. There aren't footprints leading to the killer, and there's no DNA evidence to follow up on. This is the crux of why the detective hat is different. 34

Slide 35

Slide 35 text

There's no design, no well-crafted tickets. It's not as simple as upper or lower bound. It's up to you, and you'd rather this wasn't happening at all. Assume no one is going to help you. 35

Slide 36

Slide 36 text

So we took the case. We took a deep breath. Let's conduct some interviews. 36

Slide 37

Slide 37 text

Somebody reported this problem. Ask them questions. When? How often? What does it look like to them? How can we make it happen again? Who else should I talk to? 37

Slide 38

Slide 38 text

If there are no clear clues there, ask around the neighborhood. Rule 1 is believe there is a problem, so start there. If this was real, how would I prove it? Who would see it? Be creative. Be bold. Knock on doors. Check the data, the logs, the event history. 38
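
One way to "check the logs" is to count how often the reported symptom shows up, and when. The following is a minimal sketch, not a definitive tool. It assumes a hypothetical log format in which each line starts with an ISO-8601 timestamp followed by a message; the sample lines and the `timeout` keyword are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample lines; a real investigation would read from your own log store.
SAMPLE_LOG = [
    "2024-03-01T09:15:02 INFO request ok",
    "2024-03-01T09:47:10 ERROR connection timeout",
    "2024-03-01T10:05:33 ERROR connection timeout",
    "2024-03-01T10:20:41 ERROR connection timeout",
    "2024-03-01T11:02:00 INFO request ok",
]


def count_by_hour(lines, keyword):
    """Count lines whose message contains `keyword`, grouped by the hour of the timestamp."""
    counts = Counter()
    for line in lines:
        # Split off the leading timestamp; the rest of the line is the message.
        timestamp_text, _, message = line.partition(" ")
        if keyword in message:
            hour = datetime.fromisoformat(timestamp_text).replace(minute=0, second=0)
            counts[hour] += 1
    return counts


timeouts_per_hour = count_by_hour(SAMPLE_LOG, "timeout")
for hour, count in sorted(timeouts_per_hour.items()):
    print(hour.isoformat(), count)
```

A count per hour answers "When?" and "How often?" with hard data rather than impressions, and it gives you something to compare against what the witnesses told you.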

Slide 39

Slide 39 text

Write down everything they tell you. We'll talk about your detective notebook later. 39

Slide 40

Slide 40 text

Write down everything they tell you. Don't believe a word of it. We always believe that there is a problem. We never believe anything else about it until we can prove it. On X-Files they say, Trust No One. 40

Slide 41

Slide 41 text

Write down everything they tell you. Don't believe a word of it. Verify the important facts. Hard data, that's what we want. Friday from Dragnet doesn't want to know how people feel about something. "Just the facts, ma'am." Find them. 41

Slide 42

Slide 42 text

Write down everything they tell you. Don't believe a word of it. Verify the important facts. We'll be talking more about this in the next section. Collect, Doubt, and Rule-it-out 42

Slide 43

Slide 43 text

Part 2 Hitting the Streets 43

Slide 44

Slide 44 text

Keep a record of the issue. Start it as soon as you start tracking. This is your incident narrative or "detective's notebook." Detective's Notebook 44

Slide 45

Slide 45 text

Start with the problem as your users are seeing it. They reported something; write down specifically what that was. Also note the reach. This can evolve over time. ● What's the problem? 45

Slide 46

Slide 46 text

Here's where that collect, doubt, and rule it out comes in. You'll come up with a bunch of things that MIGHT be causing this. Other people will come assuming they know what it is. Consider it all "potential" and put it in your notes. ● What's the problem? ● Potential causes 46

Slide 47

Slide 47 text

Similarly, there may be co-occurring oddities. Things that don't seem like a cause, but do seem related. This may help with triangulation later. Put it in the notes. ● What's the problem? ● Potential causes ● Potential accessories (after-the-fact) 47

Slide 48

Slide 48 text

Every fact is treasure, and you don’t know what it will be worth. Write them down as you go. Each step you take, each fact you find, each unanswered question. ● What's the problem? ● Potential causes ● Potential accessories (after-the-fact) ● Unanswered questions 48

Slide 49

Slide 49 text

At the bottom of your notebook, put in the things that aren't relevant, but will be later. When tracking one bug, we often find others. ● Tomorrow's problems ○ unrelated bugs 49

Slide 50

Slide 50 text

Maybe you can see the logs and the data but there's some crucial bit of observability you're lacking. Write it down and ask your team to make it available for next time. ● Tomorrow's problems ○ unrelated bugs ○ blind spots 50

Slide 51

Slide 51 text

"I'm not sure what the problem is yet, but this definitely wouldn't have happened if we'd only…" Sure. Good. Write it down, and head off that conversation. That's tomorrow's problem. ● Tomorrow's problems ○ unrelated bugs ○ blind spots ○ preventative steps 51

Slide 52

Slide 52 text

Here's your detective's notebook ● What's the problem? ● Potential causes ● Potential accessories (after-the-fact) ● Unanswered questions ● Tomorrow's problems 52
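
The notebook outline above can also be kept as a small data structure. This is a sketch under stated assumptions, not the authors' tooling: the class names, field names, and `render` method are hypothetical, chosen to mirror the headings on the slides, and the sample entries are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TomorrowsProblems:
    """Things that are not relevant to this case, but will be later."""
    unrelated_bugs: List[str] = field(default_factory=list)
    blind_spots: List[str] = field(default_factory=list)
    preventative_steps: List[str] = field(default_factory=list)


@dataclass
class DetectiveNotebook:
    """Incident narrative, with one field per heading in the outline."""
    problem: str
    potential_causes: List[str] = field(default_factory=list)
    potential_accessories: List[str] = field(default_factory=list)
    unanswered_questions: List[str] = field(default_factory=list)
    tomorrows_problems: TomorrowsProblems = field(default_factory=TomorrowsProblems)

    def render(self) -> str:
        """Return the notebook as a plain-text outline."""
        sections = [
            ("Potential causes", self.potential_causes),
            ("Potential accessories (after-the-fact)", self.potential_accessories),
            ("Unanswered questions", self.unanswered_questions),
            ("Tomorrow's problems: unrelated bugs", self.tomorrows_problems.unrelated_bugs),
            ("Tomorrow's problems: blind spots", self.tomorrows_problems.blind_spots),
            ("Tomorrow's problems: preventative steps", self.tomorrows_problems.preventative_steps),
        ]
        lines = ["What's the problem? " + self.problem]
        for title, entries in sections:
            lines.append(title)
            lines.extend("  - " + entry for entry in entries)
        return "\n".join(lines)


# Invented entries, loosely based on the dead-connection story that opens the talk.
notebook = DetectiveNotebook(problem="Timeouts reported in the logs; looks like dead connections")
notebook.potential_causes.append("Connection pool exhausted (unverified)")
notebook.unanswered_questions.append("How often does it happen?")
notebook.tomorrows_problems.blind_spots.append("No metric for open connections")
print(notebook.render())
```

Keeping "Tomorrow's problems" as its own nested structure makes it easy to hand that section off after the case closes without digging through the rest of the notes.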

Slide 53

Slide 53 text

Part 3 Chase Scene 53

Slide 54

Slide 54 text

Collect, Doubt, and Rule it Out 54

Slide 55

Slide 55 text

Compare, Contrast, and Find-out Fast 55

Slide 56

Slide 56 text

Round up the Usual Suspects (then let them go) 56

Slide 57

Slide 57 text

Learn about the users' situation and motivations. Hercule Poirot solves a mystery by understanding the psychology of the people he's investigating. Very often a production issue is a misunderstanding between the user's expectation and what we THINK they're expecting. Ask "what does this mean to you?" 57

Slide 58

Slide 58 text

Use a made-up example to illustrate where the problem might be. Go back and talk with your witnesses. Provide possible examples. What if it was like this? How would we know? 58

Slide 59

Slide 59 text

Build up your logging and observability. Surveillance You know where it's failing, but not why. Logging is cheap. Be creative and put some thought into how it would be most effective. 59
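
When you know where it fails but not why, one option is to wrap the suspect call and record what went in, what came out, and what blew up. The sketch below uses Python's standard `logging` and `json` modules. The helper names (`log_event`, `fetch_with_surveillance`), the event names, and the `connection_id` field are all hypothetical.

```python
import json
import logging

logger = logging.getLogger("surveillance")


def log_event(event, **context):
    """Log one event as a single JSON line so it can be searched and compared later."""
    record = {"event": event, **context}
    line = json.dumps(record, sort_keys=True, default=str)
    logger.info(line)
    return line


def fetch_with_surveillance(fetch, connection_id):
    """Call `fetch(connection_id)` and log the start, the result size, or the failure."""
    log_event("fetch.start", connection_id=connection_id)
    try:
        result = fetch(connection_id)
    except Exception as error:
        # Record the failure with its context, then re-raise so behavior is unchanged.
        log_event(
            "fetch.failed",
            connection_id=connection_id,
            error_type=type(error).__name__,
            error=str(error),
        )
        raise
    log_event("fetch.done", connection_id=connection_id, result_size=len(result))
    return result


logging.basicConfig(level=logging.INFO)
# Stand-in for the real call under suspicion.
print(fetch_with_surveillance(lambda connection_id: "payload", connection_id=7))
```

One JSON object per line keeps the context (here, the connection id) attached to every event, which makes it easier to line up a failure with the start event that preceded it.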

Slide 60

Slide 60 text

Make a change, throw the switch, and watch the metrics. The stakeout It's your codebase. If you have a theory about what's going on, why not act on that theory? Feature flags are your friends here. Design an experiment. Even if you don't find out where the bug is, you'll at least find out where it isn't. 60
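
A stakeout of this kind can be sketched as a flag that routes traffic through either the old path or the change under test, with outcomes counted per path. Everything here is a made-up illustration: the in-memory `FLAGS` dictionary stands in for whatever feature-flag system you use, and `old_connect` and `new_connect` are hypothetical stand-ins for the suspect code and the theory being tested.

```python
from collections import Counter

# Hypothetical in-memory flag store and metrics; real systems have their own.
FLAGS = {"use_new_connection_pool": False}
METRICS = Counter()


def old_connect(request_id):
    # Stand-in for the suspect path: pretend every third request times out.
    return request_id % 3 != 0


def new_connect(request_id):
    # Stand-in for the change under test: pretend it always succeeds.
    return True


def handle_request(request_id):
    """Route through the old or new path based on the flag and count the outcome per path."""
    variant = "new" if FLAGS["use_new_connection_pool"] else "old"
    connect = new_connect if variant == "new" else old_connect
    outcome = "ok" if connect(request_id) else "timeout"
    METRICS[(variant, outcome)] += 1
    return outcome


# Baseline: run with the flag off.
for request_id in range(1, 10):
    handle_request(request_id)

# Throw the switch, then watch the metrics.
FLAGS["use_new_connection_pool"] = True
for request_id in range(1, 10):
    handle_request(request_id)

for key, count in sorted(METRICS.items()):
    print(key, count)
```

Counting outcomes per variant is what turns the change into an experiment. If the timeout count does not move when the flag flips, the theory is ruled out and the flag can be switched back.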

Slide 61

Slide 61 text

Consider that some things are only possible at certain times or in certain conditions 61

Slide 62

Slide 62 text

Go back and ask the incisive follow-up question. "Oh and ahh, one more thing" Go back and re-interview the witness. Once you have verified facts in hand, the story they gave before will have new perspective. Listen again, and this time ask the follow-up questions. Lots of solutions to bugs start with "Oh, and one more thing…" 62

Slide 63

Slide 63 text

All of these techniques can be called modeling the incident. Modeling the Incident 63

Slide 64

Slide 64 text

In the noir stories, there's often a moment when the detective is warned off the case. "I'm telling you, Dick. Don't follow through on this one…" 64

Slide 65

Slide 65 text

There are things about this case you don't want to know. 65

Slide 66

Slide 66 text

Some of your team's code is broken in ways you never imagined. 66

Slide 67

Slide 67 text

Nope, I checked everything. You can't stop thinking you know what's right. But you can start adding "unless there's something I don't know about." ...unless there's something I don't know about. 67

Slide 68

Slide 68 text

The detective usually gets a little banged up during the chase. Be okay with being wrong. This is where those scars come from. This is what builds seniors. 68

Slide 69

Slide 69 text

Communicating Status In a slow-burn issue, you want to report where you're at once or twice a day. In a major production incident, this is every half hour. Whether you have something new to report or not. 69
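
The reporting cadence can be written down as a small lookup so nobody has to remember it mid-incident. This is a sketch only: the severity names are hypothetical, and the 12-hour interval is an assumption that takes "twice a day" as the slow-burn rate.

```python
from datetime import datetime, timedelta

# Intervals based on the slide: every half hour for a major incident;
# 12 hours assumes "twice a day" for a slow-burn issue.
UPDATE_INTERVALS = {
    "major": timedelta(minutes=30),
    "slow_burn": timedelta(hours=12),
}


def next_status_update(last_update, severity):
    """Return when the next status report is due, whether or not there is news."""
    return last_update + UPDATE_INTERVALS[severity]


last_update = datetime(2024, 3, 1, 9, 0)
print(next_status_update(last_update, "major"))
print(next_status_update(last_update, "slow_burn"))
```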

Slide 70

Slide 70 text

Part 4 Putting them away for a looong time… 70

Slide 71

Slide 71 text

Sherlock Holmes has a brother Mycroft. "He has no ambition and no energy. He will not even go out of his way to verify his own solutions, and would rather be considered wrong than take the trouble to prove himself right." Don't be Mycroft. 71

Slide 72

Slide 72 text

Don't stop until you can tell the whole story with supported facts. Handing it off to the D.A. 72

Slide 73

Slide 73 text

It's no use fixing a production issue if we don't learn from it. A production issue costs money. We paid for the thing. We may as well get our money's worth. 73

Slide 74

Slide 74 text

Sometimes this means holding a post-mortem. Sometimes it means writing documentation or adding process or guardrails to ensure this doesn't happen again. 74

Slide 75

Slide 75 text

Also review your blind spots. Where were those holes in observability? What are the tools you WISH you had while you were in the thick of it? 75

Slide 76

Slide 76 text

The Detective Hat Prologue - Mindset 1. Taking the Case 2. Hitting the Streets 3. Chase Scene! 4. Putting them away… 76

Slide 77

Slide 77 text

There is a certain joy in having been there, done that, and being too old for this. 77

Slide 78

Slide 78 text

Because all it takes is that one case your folks can't solve, and you're back in the game! 78

Slide 79

Slide 79 text

Good work, detective. Our users will sleep better at night knowing their connections are safe. 79

Slide 80

Slide 80 text

Thanks, I appreciate your support. There's just one thing where you're wrong. I quit the detective game. I'm a regular dev with regular dev problems. 80

Slide 81

Slide 81 text

Of course you are, dev. Of course you are... 81

Slide 82

Slide 82 text

…until the next time. 82

Slide 83

Slide 83 text

Fin? 83

Slide 84

Slide 84 text

84

Slide 85

Slide 85 text

The answer to "How would we know if this works in prod?" is VERY OFTEN "We wouldn't." and "It doesn't." 85

Slide 86

Slide 86 text

And the answer to "Why didn't you tell me?!" is often "We did." and "You weren't ready to hear it." 86

Slide 87

Slide 87 text

Don't trust your devs. Don't trust your unit tests. Don't trust your ability to run something and say "there's no problem here." Prove it. Verify in prod. Check prod data. Keep asking, "What if I'm wrong? How would I know?" We are wrong all the time. 87

Slide 88

Slide 88 text

Nobody expects code you tested in one environment to run the same in the world it's deployed to. And if that's your expectation, you're only going to get hurt. 88

Slide 89

Slide 89 text

But take heart! Every production bug is a failure in process. Whether designer, reviewer, tester, or operations expert, we're all here to make quality software. And if the system doesn't work, we're all responsible. This isn't about you. 89

Slide 90

Slide 90 text

We've been on the case a while now. We've asked questions, we've looked at logs, we've accepted hard truths. Now we're starting to get the picture, narrowing down suspects, closing in on the culprit. 90

Slide 91

Slide 91 text

There are generations of techniques we can bring to bear to flush them out from here. 91

Slide 92

Slide 92 text

Rubber Ducking 92

Slide 93

Slide 93 text

Continuous integration, Continuous Delivery, and Observability Tools. What about SQL? Tech Tools 93