A walk to remember
Debugging a distributed system failure
Slide 2
Slide 2 text
For attending
Still here
feel free to interrupt
@flaper87
Slide 4
Slide 4 text
4 Main Topics
Slide 5
Slide 5 text
WDH What are
we doing here?
Slide 6
Slide 6 text
HLC High to Low
context
Slide 7
Slide 7 text
RCA Root Cause
Analysis
Slide 8
Slide 8 text
BIH Bring it
home
Slide 9
Slide 9 text
WDH What are
we doing here?
RCA Root Cause
Analysis
HLC High to Low
context
BIH Bring it
home
4 main topics
Slide 10
Slide 10 text
WDH What are
we doing here?
RCA Root Cause
Analysis
HLC High to Low
context
BIH Bring it
home
Follow the
acronyms
Slide 11
Slide 11 text
WDH What are
we doing here?
RCA Root Cause
Analysis
HLC High to Low
context
BIH Bring it
home
It’d be great to
remember their
meaning
Slide 12
Slide 12 text
WDH What are
we doing here?
RCA Root Cause
Analysis
HLC High to Low
context
BIH Bring it
home
I don’t think I
remember it
Slide 13
Slide 13 text
WDH What are
we doing here?
RCA Root Cause
Analysis
HLC High to Low
context
BIH Bring it
home
That’s why I’ve so
many slides on
acronyms
Slide 14
Slide 14 text
WDH What are
we doing here?
RCA Root Cause
Analysis
HLC High to Low
context
BIH Bring it
home
Just making sure
the context is set
Slide 15
Slide 15 text
WDH What are
we doing here?
RCA Root Cause
Analysis
HLC High to Low
context
BIH Bring it
home
I could go on
forever
Slide 16
Slide 16 text
WDH What are
we doing here?
RCA Root Cause
Analysis
HLC High to Low
context
BIH Bring it home
...but I won’t
Slide 17
Slide 17 text
WDH What are
we doing here?
Slide 18
Slide 18 text
What kind of issue are
we facing?
WDH
Slide 19
Slide 19 text
Know what y’all are
expected to do
WDH
Slide 20
Slide 20 text
Make sure the right
people are involved
WDH
Slide 21
Slide 21 text
Don’t pull in the entire
company
WDH
Slide 22
Slide 22 text
HLC High to Low
context
Slide 23
Slide 23 text
Assume you’re working
in a low-context
environment
HLC
Slide 24
Slide 24 text
Don’t make
assumptions about the
steps that have been
taken
HLC
Slide 25
Slide 25 text
Every part of the system
is guilty till proven
innocent
HLC
Slide 26
Slide 26 text
Know the system’s
topology
HLC
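One way to put "guilty till proven innocent" and "know the topology" into practice is to walk the dependency graph from the failing service and list every component it touches. This is only a sketch: the `topology` map and the service names below are hypothetical, not taken from a real system.

```python
from collections import deque


def suspects(topology, failing):
    """Return every component reachable from the failing one, nearest first."""
    seen = [failing]
    queue = deque([failing])
    while queue:
        node = queue.popleft()
        for dependency in topology.get(node, []):
            if dependency not in seen:
                seen.append(dependency)
                queue.append(dependency)
    return seen


# Hypothetical topology: each service maps to the services it depends on.
topology = {
    "api": ["queue", "db"],
    "queue": ["worker"],
    "worker": ["db", "storage"],
}
to_check = suspects(topology, "api")
```

Every name in `to_check` stays on the list until someone has actually verified it.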
Slide 27
Slide 27 text
RCA Root Cause
Analysis
Slide 28
Slide 28 text
Have a list of steps to
follow
RCA
Slide 29
Slide 29 text
Many times systems
are just misconfigured
RCA
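A cheap first check for misconfiguration is to compare the running configuration against the values you expect. A minimal sketch, assuming both are available as plain dictionaries; the keys and values here are made up for illustration.

```python
def config_drift(expected, actual):
    """Map each differing key to its (expected, actual) pair."""
    drift = {}
    for key in expected.keys() | actual.keys():
        if expected.get(key) != actual.get(key):
            # A key missing on either side shows up as None.
            drift[key] = (expected.get(key), actual.get(key))
    return drift


# Hypothetical settings, for illustration only.
expected = {"db_host": "db.internal", "timeout": 30, "retries": 3}
actual = {"db_host": "db.internal", "timeout": 3, "debug": True}
drift = config_drift(expected, actual)
```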
Slide 30
Slide 30 text
Bottom-up debugging
RCA
Slide 31
Slide 31 text
Top-to-bottom
debugging
RCA
Slide 32
Slide 32 text
Monkey debugging
RCA
Slide 33
Slide 33 text
Correlate your logs
RCA
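Correlating logs can start as simply as merging per-service logs into one timeline. The sketch below assumes each line begins with an ISO-8601 timestamp and that each service's log is already in time order; the service names and messages are invented.

```python
import heapq
from datetime import datetime


def parse(service, line):
    # Assumed line format: "<ISO-8601 timestamp> <message>"
    stamp, message = line.split(" ", 1)
    return datetime.fromisoformat(stamp), service, message


def correlate(logs):
    """Merge per-service logs (each already time-ordered) into one timeline."""
    streams = [[parse(name, line) for line in lines] for name, lines in logs.items()]
    return list(heapq.merge(*streams))


# Hypothetical logs from two services.
logs = {
    "api": ["2016-08-25T10:00:00 request received", "2016-08-25T10:00:03 request failed"],
    "worker": ["2016-08-25T10:00:01 job started", "2016-08-25T10:00:02 job crashed"],
}
timeline = correlate(logs)
```

Reading `timeline` top to bottom shows the worker crash sitting between the API's request and its failure.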
Slide 34
Slide 34 text
Trace events
throughout the system
RCA
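Tracing an event across services is much easier when every log line carries the same request identifier. This sketch assumes a hypothetical `req-<hex>` identifier in each line; real systems will use their own format.

```python
import re

# Assumed identifier format; adjust to whatever your services actually log.
REQUEST_ID = re.compile(r"\breq-[0-9a-f]+\b")


def trace(lines, request_id):
    """Keep only the lines that mention the given request identifier."""
    return [line for line in lines if request_id in REQUEST_ID.findall(line)]


# Hypothetical, already-merged log lines.
lines = [
    "api req-1a accepted",
    "api req-2b accepted",
    "worker req-1a started",
    "worker req-2b started",
    "storage req-1a write failed",
]
path = trace(lines, "req-1a")
```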
Slide 35
Slide 35 text
Timestamps are pretty
much your life
RCA
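Timestamps only help if they are comparable. A minimal sketch, assuming every timestamp carries a UTC offset: normalize to UTC before ordering events from hosts in different time zones. The sample values are invented.

```python
from datetime import datetime, timezone


def to_utc(stamp):
    """Parse an ISO-8601 timestamp with an offset and convert it to UTC."""
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {stamp}")
    return parsed.astimezone(timezone.utc)


# Hypothetical timestamps from two hosts in different time zones.
stamps = ["2016-08-25T10:00:05+00:00", "2016-08-25T12:00:01+02:00"]
ordered = sorted(stamps, key=to_utc)
```

Sorting the raw strings would put the first entry first, although the second event happened four seconds earlier.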
Slide 36
Slide 36 text
Compare executions
RCA
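Comparing a failing execution with a known-good one often points straight at the step where they diverge. A sketch using the standard library's `difflib`, with hypothetical event names for each run:

```python
import difflib


def compare_runs(good, bad):
    """Return only the events that differ between a good and a bad run."""
    diff = difflib.unified_diff(good, bad, "good", "bad", lineterm="")
    return [
        line
        for line in diff
        # Keep added/removed events, skip the "---"/"+++" file headers.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Hypothetical event sequences from two executions.
good = ["connect", "authenticate", "fetch", "commit"]
bad = ["connect", "authenticate", "fetch", "timeout", "rollback"]
changes = compare_runs(good, bad)
```

Lines starting with `-` appear only in the good run, and lines starting with `+` only in the bad one.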
Slide 37
Slide 37 text
Visualization tools are
quite handy
RCA
Slide 38
Slide 38 text
BIH Bring it
home
Slide 39
Slide 39 text
Some bugs just take
longer to find
BIH
Slide 40
Slide 40 text
Describe the
(real)
problem
WE F***ED this UP
BIH
Slide 41
Slide 41 text
Build new tools for
future cases
BIH
Slide 42
Slide 42 text
Build a knowledge
base
BIH
Slide 43
Slide 43 text
Summary-ish
1 Have clear goals
2 Know the system’s topology
3 Keep a low-context environment
4 Don’t assume anything
5 Keep the team small and contextualized
Slide 44
Slide 44 text
Summary-ish
6 Build new debugging tools
7 Have a checklist
8 Check configuration files too
9 Dunno what to put here
10 … seriously, no clue
Slide 45
Slide 45 text
References
1 Distributed Debugging: http://bit.ly/2bDLXj3
2 Debugging Deployed Distributed Systems: http://bit.ly/2bDN6aj
3 The ETTO Principle: http://bit.ly/2bbZmvV
4 The programming Ape: https://vimeo.com/40988625
5 Blood and tears