Talk given as a Jane Street Tech Talk. Video: https://www.youtube.com/watch?v=7AO4wz6gI3Q
Debugging system failure
• Failures are easiest to debug when they are explicit and fatal
• A system that fails fatally stops: it ceases to make forward
progress, leaving behind a snapshot of its state — a core dump
• Unfortunately, not all problems are of this kind…
• A broad class of problems are non-fatal: the system continues
to operate despite having failed, often destroying evidence
• Worst of all are those non-fatal failures that are also implicit
Implicit, non-fatal failure
• The most difficult, time-consuming bugs to debug are those in
which the system failure is unbeknownst to the system itself
• The system does the wrong thing or returns the wrong result or
has pathological side effects (e.g., resource leaks)
• Of these, the gnarliest class are those failures that are not,
strictly speaking, failures at all: the system is operating correctly,
but is failing to operate in a timely or efficient fashion
• That is, it just… sucks
The stack of abstraction
• Our software systems are built as stacks of abstraction
• These stacks allow us to stand on the shoulders of history — to
reuse components without rebuilding them
• We can do this because of the software paradox: software is
both information and machine, exhibiting properties of both
• Our stacks are higher and run deeper than we can see or know:
software is silent and opaque; the nature of abstraction is to
seal us from what runs beneath!
• They run so deep as to challenge our definition of software…
• When the stack of abstraction performs pathologically, its power
transmogrifies to peril: layering amplifies performance
pathologies but hinders insight
• Work amplifies as we go down the stack
• Latency amplifies as we go up the stack
• Seemingly minor issues in one layer can cascade into systemic
pathological performance
• These are the butterflies that cause hurricanes
Butterfly I: ARC-induced black hole
Butterfly II: Disk reader starvation
Butterfly III: Kernel page-table isolation
Data courtesy Scaleway, running a PHP workload with KPTI patches for Linux. Thank you Edouard Bonlieu and team!
• With pathologically performing systems, we are faced with
Leventhal’s Conundrum: given a hurricane, find the butterflies!
• This is excruciatingly difficult:
• Symptoms are often far removed from root cause
• There may not be a single root cause but several
• The system is dynamic and may change without warning
• Improvements to the system are hard to model and verify
• Emphatically, this is not “tuning” — it is debugging
• When we think of it as debugging, we can stop pretending that
understanding (and rectifying) pathological system performance
is rote or mechanical — or easy
• We can resist the temptation to be guided by folklore: just
because someone heard about something causing a problem
once doesn’t mean it’s the problem now!
• We can resist the temptation to change the system before
understanding it: just as you wouldn’t (or shouldn’t!) debug by
just changing code, you shouldn’t debug a pathologically
performing system by randomly altering it!
How do we debug?
• To debug methodically, we must resist the temptation of quick
hypotheses, focusing rather on questions and observations
• Iterating between questions and observations gathers the facts
that will constrain future hypotheses
• These facts can be used to disconfirm hypotheses!
• How do we ask questions?
• How do we make observations?
• For performance debugging, the initial question formulation is
particularly challenging: where does one start?
• Resource-centric methodologies like the USE Method
(Utilization/Saturation/Errors) can be excellent starting points…
• But keep these methodologies in their context: they provide
initial questions to ask — they are not recipes for debugging
arbitrary performance pathologies!
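As one illustration of how the USE Method yields questions rather than answers, here is a minimal sketch. The `ResourceSample` type, its field names, and the 70% utilization threshold are all assumptions made for this example; they are not part of the USE Method itself.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation window for a single resource (hypothetical type)."""
    name: str
    busy_seconds: float      # time the resource spent servicing work
    interval_seconds: float  # length of the observation window
    queued: int              # work that was waiting and could not be serviced
    errors: int              # error events seen in the window


def use_questions(sample, util_threshold=0.7):
    """Turn one sample into follow-up questions, not conclusions.

    The threshold is an arbitrary assumption for illustration only.
    """
    questions = []
    utilization = sample.busy_seconds / sample.interval_seconds
    if utilization >= util_threshold:
        questions.append(
            f"{sample.name}: {utilization:.0%} utilized; who is driving the load?")
    if sample.queued > 0:
        questions.append(
            f"{sample.name}: {sample.queued} queued; who is waiting, and for how long?")
    if sample.errors > 0:
        questions.append(
            f"{sample.name}: {sample.errors} errors; what failed, and was it retried?")
    return questions


if __name__ == "__main__":
    for q in use_questions(ResourceSample("disk0", 9.0, 10.0, 4, 0)):
        print(q)
```

The output is deliberately a list of questions: each one still has to be answered by observing the system.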
• Questions are answered through observation
• The observability of the system is paramount
• If the system cannot be observed, one is reduced to guessing,
making changes, and drawing inferences
• If it must be said, drawing inferences based only on change is
highly flawed: correlation does not imply causation!
• To be observable, systems must be instrumentable: they must
be able to be altered to emit a datum in the desired condition
Observability through instrumentation
• Static instrumentation modifies source to provide semantically
relevant information, e.g., via logging or counters
• Dynamic instrumentation allows for the system to be changed
while running to emit data, e.g. DTrace, OpenTracing
• Both mechanisms of instrumentation are essential!
• Static instrumentation provides the observations necessary for
early question formulation…
• Dynamic instrumentation answers deeper, ad hoc questions
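A minimal sketch of static instrumentation, assuming a toy request handler: the counters are written into the source ahead of time, so they can only answer the questions their author anticipated. The names `counters` and `handle_request` are hypothetical.

```python
import collections
import time

# Statically placed counters: these exist because the author decided,
# when writing the code, that they would be worth observing.
counters = collections.Counter()


def handle_request(payload):
    """Toy handler that counts requests, errors, and time spent."""
    start = time.monotonic_ns()
    counters["requests"] += 1
    try:
        if not payload:
            raise ValueError("empty payload")
        return payload.upper()
    except ValueError:
        counters["errors"] += 1
        raise
    finally:
        # Accumulated on both the success and the failure path.
        counters["busy_ns"] += time.monotonic_ns() - start


if __name__ == "__main__":
    handle_request("hello")
    try:
        handle_request("")
    except ValueError:
        pass
    print(dict(counters))
```

A question these counters were not built for (say, which caller sends empty payloads) is where dynamic instrumentation of the running system would be needed.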
Aside: Monitoring vs. observability
• Monitoring is an essential operational activity that can indicate a
pathologically performing system and provide initial questions
• But monitoring alone is often insufficient to completely debug a
pathologically performing system, because the questions that it
can answer are limited to that which is monitored
• As we increasingly deploy developed systems rather than
received ones, it is a welcome (and unsurprising!) development
to see the focus of monitoring expand to observability!
• When the system is instrumented, it can become overwhelmed
by the overhead of instrumentation
• Aggregation is essential for scalable, non-invasive
instrumentation — and is a first-class primitive in (e.g.) DTrace
• But aggregation also eliminates important dimensions of data,
especially with respect to time; some questions may only be
answered with disaggregated data!
• Use aggregation for performance debugging — but also
understand its limits!
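To make the limit of aggregation concrete, here is a small sketch of a power-of-two histogram, written in the spirit of DTrace's `quantize()` but not a reimplementation of it. The latency values are invented for illustration. Two traces with very different behavior over time collapse into the same aggregate.

```python
from collections import Counter


def quantize(values):
    """Bucket non-negative integers by power of two.

    A value v > 0 lands in the bucket labeled with the largest power
    of two that is <= v; zero lands in bucket 0.
    """
    hist = Counter()
    for v in values:
        bucket = 1 << (v.bit_length() - 1) if v > 0 else 0
        hist[bucket] += 1
    return dict(sorted(hist.items()))


# Invented latencies (microseconds), in arrival order.
steady = [100, 900, 100, 900, 100, 900]  # slow operations interleaved
burst = [100, 100, 100, 900, 900, 900]   # slow operations arrive together

if __name__ == "__main__":
    print(quantize(steady))
    print(quantize(burst))
```

Both traces aggregate to three operations in the 64 bucket and three in the 512 bucket; only the disaggregated, time-ordered data shows that one of them had a sustained slow period.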
• The visual cortex is unparalleled at detecting patterns
• The value of visualizing data is not merely providing answers,
but also (and especially) provoking new questions
• Our systems are so large, complicated and abstract that there is
not one way to visualize them, but many
• The visualization of systems and their representations is an
essential skill for performance debugging!
• Graphs are terrific — so much so that we should not restrict
ourselves to the captive graphs found in bundled software!
• An ad hoc plotting tool is essential for performance debugging;
and Gnuplot is an excellent (if idiosyncratic) one
• Gnuplot is easily combined with workhorses like awk or perl
• That Gnuplot is an essential tool helps to set expectations
around performance debugging tools: they are not magicians!
• Especially when trying to understand interplay between different
entities, it can be useful to visualize their state over time
• Time is the critical element here!
• We are experimenting with statemaps whereby state transitions
are instrumented (e.g., with DTrace) and then visualized
• This is not necessarily a new way of visualizing the system
(e.g., early thread debuggers often showed thread state over
time), but with a new focus on post hoc visualization
• Primordial implementation: https://github.com/joyent/statemap
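The idea of visualizing state over time can be sketched as follows: given instrumented state transitions, derive the per-entity time segments that a plot would draw. This is only an illustration of the concept; the tuple layout and the function name `state_segments` are assumptions and do not reflect the input format or code of the statemap tool.

```python
def state_segments(transitions, end_time):
    """Convert state transitions into drawable time segments.

    transitions: iterable of (time, entity, state) tuples.
    Returns a sorted list of (entity, start, end, state) tuples, where
    each segment lasts until that entity's next transition (or until
    end_time for its final state).
    """
    current = {}   # entity -> (time the state was entered, state)
    segments = []
    for t, entity, state in sorted(transitions):
        if entity in current:
            since, prev = current[entity]
            segments.append((entity, since, t, prev))
        current[entity] = (t, state)
    # Close out whatever state each entity was last seen in.
    for entity, (since, prev) in current.items():
        segments.append((entity, since, end_time, prev))
    return sorted(segments)


if __name__ == "__main__":
    events = [
        (0, "t1", "running"), (0, "t2", "blocked"),
        (5, "t1", "blocked"), (8, "t2", "running"),
    ]
    for seg in state_segments(events, end_time=10):
        print(seg)
```

Keeping the start and end of every segment, rather than summing time per state, preserves the time dimension that makes the interplay between entities visible.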
The hurricane’s butterfly
• Finding the source(s) of pathologically performing systems must
be thought of as debugging — albeit the hardest kind
• Debugging isn’t about making guesses; it’s about asking
questions and answering them with observations
• We must enshrine observability to assure debuggability!
• Debugging rewards persistence, grit, and resilience more than
intuition or insight — it is more perspiration than inspiration!
• We must have the faith that our systems are — in the end —
purely synthetic; we can find the hurricane’s butterfly!