Slide 1

Slide 1 text

A walk to remember: Debugging a distributed system failure

Slide 2

Slide 2 text

For attending; Still here; Feel free to interrupt; @flaper87

Slide 4

Slide 4 text

4 Main Topics

Slide 5

Slide 5 text

WDH: What are we doing here?

Slide 6

Slide 6 text

HLC: High to Low context

Slide 7

Slide 7 text

RCA: Root Cause Analysis

Slide 8

Slide 8 text

BIH: Bring it home

Slide 9

Slide 9 text

4 main topics: WDH (What are we doing here?), HLC (High to Low context), RCA (Root Cause Analysis), BIH (Bring it home)

Slide 10

Slide 10 text

Follow the acronyms

Slide 11

Slide 11 text

It’d be great to remember their meaning

Slide 12

Slide 12 text

I don’t think I remember it

Slide 13

Slide 13 text

That’s why I have so many slides on acronyms

Slide 14

Slide 14 text

Just making sure the context is set

Slide 15

Slide 15 text

I could go on forever

Slide 16

Slide 16 text

...but I won’t

Slide 17

Slide 17 text

WDH: What are we doing here?

Slide 18

Slide 18 text

WDH: What kind of issue are we facing?

Slide 19

Slide 19 text

WDH: Know what y’all are expected to do

Slide 20

Slide 20 text

WDH: Make sure the right people are involved

Slide 21

Slide 21 text

WDH: Don’t pull in the entire company

Slide 22

Slide 22 text

HLC: High to Low context

Slide 23

Slide 23 text

HLC: Assume you’re working in a low-context environment

Slide 24

Slide 24 text

HLC: Don’t make assumptions about the steps that have been taken

Slide 25

Slide 25 text

HLC: Every part of the system is guilty till proven innocent

Slide 26

Slide 26 text

HLC: Know the system’s topology
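
One way to make the topology usable during an incident is to keep it as data. The sketch below is only an illustration: the component names (`api`, `scheduler`, `db`, `mq`) and the `suspects` helper are hypothetical, not part of the talk. Given a map of who depends on whom, it lists everything a failing component relies on, directly or transitively, which is the set of parts still "guilty till proven innocent".

```python
from collections import deque


def suspects(topology, failing):
    """List every component `failing` depends on, in breadth-first order.

    `topology` maps a component name to the names it calls directly.
    """
    seen = []
    queue = deque([failing])
    while queue:
        node = queue.popleft()
        for dep in topology.get(node, []):
            if dep != failing and dep not in seen:
                seen.append(dep)
                queue.append(dep)
    return seen


# Hypothetical topology, for illustration only.
topology = {
    "api": ["scheduler", "db"],
    "scheduler": ["mq", "db"],
    "mq": [],
}
print(suspects(topology, "api"))
```

Nearest dependencies come first, so the list doubles as a rough order in which to check things.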

Slide 27

Slide 27 text

RCA: Root Cause Analysis

Slide 28

Slide 28 text

RCA: Have a list of steps to follow

Slide 29

Slide 29 text

RCA: Many times, systems are just misconfigured
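
A cheap first check for misconfiguration is to diff the settings a node is actually running with against the settings you expect. This is a minimal sketch that assumes both configurations have already been loaded into flat dictionaries; the option names (`rpc_timeout`, `broker`, `debug`) and the `config_drift` helper are made up for the example.

```python
MISSING = "<missing>"


def config_drift(expected, actual):
    """Return {option: (expected_value, actual_value)} for options that differ."""
    drift = {}
    for key in sorted(expected.keys() | actual.keys()):
        want = expected.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical settings, for illustration only.
expected = {"rpc_timeout": 60, "broker": "rabbit://mq1"}
actual = {"rpc_timeout": 6, "broker": "rabbit://mq1", "debug": True}
for option, (want, have) in config_drift(expected, actual).items():
    print(f"{option}: expected {want!r}, found {have!r}")
```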

Slide 30

Slide 30 text

RCA: Bottom-up debugging

Slide 31

Slide 31 text

RCA: Top-to-bottom debugging

Slide 32

Slide 32 text

RCA: Monkey debugging

Slide 33

Slide 33 text

RCA: Correlate your logs
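
Correlating logs usually means grouping lines from different services by some shared identifier. The sketch below assumes every service writes a request ID of the form `req-<hex>` into its log lines. That format, the service names, and the `correlate` helper are assumptions for illustration, not something the talk prescribes.

```python
import re
from collections import defaultdict

# Assumed request-ID format; adjust to whatever your services actually log.
REQ_ID = re.compile(r"req-[0-9a-f]+")


def correlate(logs_by_service):
    """Group log lines from all services by the request ID they mention."""
    grouped = defaultdict(list)
    for service, lines in logs_by_service.items():
        for line in lines:
            match = REQ_ID.search(line)
            if match:
                grouped[match.group(0)].append((service, line))
    return dict(grouped)


# Hypothetical log lines, for illustration only.
logs = {
    "api": ["req-1a accepted", "req-2b accepted"],
    "worker": ["req-1a failed: timeout", "heartbeat"],
}
for req_id, entries in correlate(logs).items():
    print(req_id, entries)
```

Lines without an ID (the `heartbeat` line here) are dropped, which is itself a hint about where logging needs more context.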

Slide 34

Slide 34 text

RCA: Trace events throughout the system

Slide 35

Slide 35 text

RCA: Timestamps are pretty much your life
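
Timestamps only help if they are comparable. A minimal sketch, assuming each service logs ISO 8601 timestamps with an explicit UTC offset: normalize everything to UTC, then sort into a single timeline. The sample events and the `merged_timeline` helper are hypothetical.

```python
from datetime import datetime, timezone


def merged_timeline(events):
    """Sort (iso_timestamp, service, message) tuples into one UTC timeline."""
    parsed = [
        (datetime.fromisoformat(ts).astimezone(timezone.utc), service, message)
        for ts, service, message in events
    ]
    return sorted(parsed, key=lambda event: event[0])


# Hypothetical events, for illustration only.
events = [
    ("2016-08-25T10:00:01+02:00", "worker", "job started"),
    ("2016-08-25T08:00:00+00:00", "api", "request received"),
]
for when, service, message in merged_timeline(events):
    print(when.isoformat(), service, message)
```

The resulting order is only as trustworthy as the clocks behind it, so unsynchronized hosts can still produce a misleading sequence.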

Slide 36

Slide 36 text

RCA: Compare executions
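
Comparing a known-good run against a failing one can be as simple as diffing their event sequences. The sketch below uses Python's standard `difflib`; the event names and the `diff_executions` helper are invented for the example, and it assumes you can already extract an ordered list of events for each run.

```python
import difflib


def diff_executions(good, bad):
    """Return unified-diff lines between two ordered event sequences."""
    return list(
        difflib.unified_diff(good, bad, fromfile="good", tofile="bad", lineterm="")
    )


# Hypothetical event sequences, for illustration only.
good_run = ["recv", "auth", "schedule", "done"]
bad_run = ["recv", "auth", "retry", "done"]
print("\n".join(diff_executions(good_run, bad_run)))
```

Lines prefixed with `-` happened only in the good run and lines prefixed with `+` only in the bad one, which narrows down where the two executions diverge.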

Slide 37

Slide 37 text

RCA: Visualization tools are quite handy

Slide 38

Slide 38 text

BIH: Bring it home

Slide 39

Slide 39 text

BIH: Some bugs just take longer to find

Slide 40

Slide 40 text

BIH: Describe the (real) problem: "WE F***ED this UP"

Slide 41

Slide 41 text

BIH: Build new tools for future cases

Slide 42

Slide 42 text

BIH: Build a knowledge base

Slide 43

Slide 43 text

Summary-ish
1. Have clear goals
2. Know the system’s topology
3. Keep a low-context environment
4. Don’t assume anything
5. Keep the time small and contextualized

Slide 44

Slide 44 text

Summary-ish
6. Build new debugging tools
7. Have a checklist
8. Check configuration files too
9. Dunno what to put here
10. … seriously, no clue

Slide 45

Slide 45 text

References
1. Distributed Debugging: http://bit.ly/2bDLXj3
2. Debugging Deployed Distributed Systems: http://bit.ly/2bDN6aj
3. The ETTO Principle: http://bit.ly/2bbZmvV
4. The programming Ape: https://vimeo.com/40988625
5. Blood and tears

Slide 46

Slide 46 text

Questions?