The Hurricane's Butterfly: Debugging pathologically performing systems

Bryan Cantrill, CTO
@bcantrill

Debugging system failure
• Failures are easiest to debug when they are explicit and fatal
• A system that fails fatally stops: it ceases to make forward progress, leaving behind a snapshot of its state (a core dump)
• Unfortunately, not all problems are of this kind…
• A broad class of problems are non-fatal: the system continues to operate despite having failed, often destroying evidence
• Worst of all are those non-fatal failures that are also implicit

Implicit, non-fatal failure
• The most difficult, time-consuming bugs to debug are those in which the system failure is unbeknownst to the system itself
• The system does the wrong thing or returns the wrong result or has pathological side effects (e.g., resource leaks)
• Of these, the gnarliest class are those failures that are not, strictly speaking, failures at all: the system is operating correctly, but is failing to operate in a timely or efficient fashion
• That is, it just… sucks

The stack of abstraction
• Our software systems are built as stacks of abstraction
• These stacks allow us to stand on the shoulders of history and reuse components without rebuilding them
• We can do this because of the software paradox: software is both information and machine, exhibiting properties of both
• Our stacks are higher and run deeper than we can see or know: software is silent and opaque; the nature of abstraction is to seal us from what runs beneath!
• They run so deep as to challenge our definition of software…

The Butterflies
• When the stack of abstraction performs pathologically, its power transmogrifies to peril: layering amplifies performance pathologies but hinders insight
• Work amplifies as we go down the stack
• Latency amplifies as we go up the stack
• Seemingly minor issues in one layer can cascade into systemic pathological performance
• These are the butterflies that cause hurricanes

Butterfly I: ARC-induced black hole

Butterfly II: Disk reader starvation

Butterfly III: Kernel page-table isolation
Data courtesy Scaleway, running a PHP workload with KPTI patches for Linux. Thank you Edouard Bonlieu and team!

The Hurricane
• With pathologically performing systems, we are faced with Leventhal’s Conundrum: given a hurricane, find the butterflies!
• This is excruciatingly difficult:
    • Symptoms are often far removed from root cause
    • There may not be a single root cause but several
    • The system is dynamic and may change without warning
    • Improvements to the system are hard to model and verify
• Emphatically, this is not “tuning”; it is debugging

Performance debugging
• When we think of it as debugging, we can stop pretending that understanding (and rectifying) pathological system performance is rote or mechanical, or easy
• We can resist the temptation to be guided by folklore: just because someone heard about something causing a problem once doesn’t mean it’s the problem now!
• We can resist the temptation to change the system before understanding it: just as you wouldn’t (or shouldn’t!) debug by just changing code, you shouldn’t debug a pathologically performing system by randomly altering it!

How do we debug?
• To debug methodically, we must resist the temptation to form quick hypotheses, focusing rather on questions and observations
• Iterating between questions and observations gathers the facts that will constrain future hypotheses
• These facts can be used to disconfirm hypotheses!
• How do we ask questions?
• How do we make observations?

Asking questions
• For performance debugging, the initial question formulation is particularly challenging: where does one start?
• Resource-centric methodologies like the USE Method (Utilization/Saturation/Errors) can be excellent starting points…
• But keep these methodologies in their context: they provide initial questions to ask; they are not recipes for debugging arbitrary performance pathologies!
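As a rough illustration of how a resource-centric pass can seed questions rather than conclusions, here is a minimal Python sketch. The `ResourceSample` type, the `use_questions` function, and the 70% utilization threshold are all hypothetical and chosen only for this example; real samples would come from whatever the system already exposes.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One hypothetical observation of a resource over a sample interval."""
    name: str
    utilization: float  # fraction of the interval the resource was busy (0.0 to 1.0)
    saturation: float   # amount of queued work, e.g. a wait-queue length
    errors: int         # error events seen during the interval


def use_questions(samples, util_threshold=0.7):
    """Turn USE-style samples into initial questions (not diagnoses)."""
    questions = []
    for s in samples:
        if s.errors > 0:
            questions.append(f"{s.name}: what are the {s.errors} errors?")
        if s.saturation > 0:
            questions.append(
                f"{s.name}: what is queued behind it (saturation {s.saturation})?")
        if s.utilization >= util_threshold:  # threshold is an assumption
            questions.append(
                f"{s.name}: who is consuming it (utilization {s.utilization:.0%})?")
    return questions


samples = [
    ResourceSample("disk0", utilization=0.95, saturation=3, errors=0),
    ResourceSample("cpu", utilization=0.20, saturation=0, errors=0),
]
for q in use_questions(samples):
    print(q)
```

The output is deliberately a list of questions: each one still has to be answered by observing the system.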

Making observations
• Questions are answered through observation
• The observability of the system is paramount
• If the system cannot be observed, one is reduced to guessing, making changes, and drawing inferences
• If it must be said, drawing inferences based only on change is highly flawed: correlation does not imply causation!
• To be observable, systems must be instrumentable: they must be able to be altered to emit a datum in the desired condition

Observability through instrumentation
• Static instrumentation modifies source to provide semantically relevant information, e.g., via logging or counters
• Dynamic instrumentation allows for the system to be changed while running to emit data, e.g., DTrace, OpenTracing
• Both mechanisms of instrumentation are essential!
• Static instrumentation provides the observations necessary for early question formulation…
• Dynamic instrumentation answers deeper, ad hoc questions
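To make the static half concrete, here is a small Python sketch of instrumentation that is written into the source: counters that are always updated, plus log lines carrying semantically relevant detail. The handler, the logger name, and the counter names are hypothetical and exist only for this example.

```python
import logging
import time
from collections import Counter

log = logging.getLogger("request_path")  # hypothetical logger name
counters = Counter()                     # always-on, cheap to update


def handle_request(payload, backend):
    """Hypothetical request handler carrying static instrumentation."""
    counters["requests"] += 1
    start = time.monotonic()
    try:
        return backend(payload)
    except Exception:
        counters["errors"] += 1
        log.exception("backend failed for payload of %d bytes", len(payload))
        raise
    finally:
        # Recorded on success and on failure alike.
        elapsed_ms = (time.monotonic() - start) * 1000.0
        counters["busy_ms"] += elapsed_ms
        log.debug("request took %.3f ms", elapsed_ms)
```

Counters like these can prompt the early questions (is the error count rising? is busy time growing faster than the request count?); the deeper, ad hoc questions are left to dynamic instrumentation.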

Aside: Monitoring vs. observability
• Monitoring is an essential operational activity that can indicate a pathologically performing system and provide initial questions
• But monitoring alone is often insufficient to completely debug a pathologically performing system, because the questions that it can answer are limited to that which is monitored
• As we increasingly deploy developed systems rather than received ones, it is a welcome (and unsurprising!) development to see the focus of monitoring expand to observability!

Aggregation
• A system being instrumented can become overwhelmed by the overhead of instrumentation
• Aggregation is essential for scalable, non-invasive instrumentation, and is a first-class primitive in (e.g.) DTrace
• But aggregation also eliminates important dimensions of data, especially with respect to time; some questions may only be answered with disaggregated data!
• Use aggregation for performance debugging, but also understand its limits!
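The trade-off can be sketched in a few lines of Python. The example below folds latency events into power-of-two buckets (similar in spirit to a power-of-two distribution in DTrace, though this is not DTrace code) and then shows a question the histogram alone cannot answer. The event data and function names are made up for illustration.

```python
from collections import Counter


def pow2_bucket(value):
    """Return the largest power of two <= value (0 for values below 1)."""
    if value < 1:
        return 0
    return 1 << (int(value).bit_length() - 1)


def aggregate(latencies_us):
    """Fold raw latencies into a small power-of-two histogram."""
    hist = Counter()
    for v in latencies_us:
        hist[pow2_bucket(v)] += 1
    return dict(sorted(hist.items()))


# Hypothetical (timestamp_s, latency_us) events: a burst of slow I/O at t=2.
events = [(0, 100), (0, 120), (1, 110), (2, 9000), (2, 9500), (3, 105)]

# Aggregated: compact, and it shows the distribution is bimodal...
hist = aggregate(lat for _, lat in events)
print(hist)        # {64: 4, 8192: 2}

# ...but only the disaggregated events can say *when* the slow ones happened.
slow_times = sorted({t for t, lat in events if lat >= 8192})
print(slow_times)  # [2]
```

The histogram costs a handful of integers no matter how many events arrive, which is what makes it cheap; the price is that the time dimension is gone.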

Visualization
• The visual cortex is unparalleled at detecting patterns
• The value of visualizing data is not merely providing answers, but also (and especially) provoking new questions
• Our systems are so large, complicated and abstract that there is not one way to visualize them, but many
• The visualization of systems and their representations is an essential skill for performance debugging!

Visualization: Gnuplot
• Graphs are terrific, so much so that we should not restrict ourselves to the captive graphs found in bundled software!
• An ad hoc plotting tool is essential for performance debugging; and Gnuplot is an excellent (if idiosyncratic) one
• Gnuplot is easily combined with workhorses like awk or perl
• That Gnuplot is an essential tool helps to set expectations around performance debugging tools: they are not magicians!
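The slide names awk and perl as the usual companions; purely as an illustration of the same workflow, here is a Python sketch that reshapes samples into the whitespace-separated columns Gnuplot reads and builds a small plotting script alongside them. The file name, axis labels, and sample values are assumptions made for this example.

```python
def gnuplot_inputs(samples, data_path="latency.dat"):
    """Build a Gnuplot data file body and a matching script from
    (time_s, latency_us) pairs. The data path is a hypothetical default."""
    # One sample per line, columns separated by a space.
    data = "".join(f"{t} {v}\n" for t, v in samples)
    script = "\n".join([
        'set xlabel "time (s)"',
        'set ylabel "latency (us)"',
        "set logscale y",  # latency outliers often span orders of magnitude
        f'plot "{data_path}" using 1:2 with points title "latency"',
    ]) + "\n"
    return data, script


data, script = gnuplot_inputs([(0, 100), (1, 110), (2, 9000), (3, 105)])
print(data)
print(script)
```

Writing `data` to `latency.dat` and `script` to a file that is then passed to `gnuplot` would produce the plot; the point is how little glue an ad hoc graph needs.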

Visualization: Heatmaps

Visualization: Flamegraphs

Visualization: Statemaps
• Especially when trying to understand interplay between different entities, it can be useful to visualize their state over time
• Time is the critical element here!
• We are experimenting with statemaps whereby state transitions are instrumented (e.g., with DTrace) and then visualized
• This is not necessarily a new way of visualizing the system (e.g., early thread debuggers often showed thread state over time), but with a new focus on post hoc visualization
• Primordial implementation: https://github.com/joyent/statemap
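The underlying idea can be sketched without any particular tool. The Python below is not the statemap implementation and does not use its input format; it takes hypothetical (time, entity, state) transitions, converts them into per-entity intervals, and renders each entity as a row of characters so that state over time can be compared across entities.

```python
from collections import defaultdict


def state_intervals(transitions, end_time):
    """Convert time-sorted (time, entity, state) transitions into
    per-entity lists of (start, end, state) intervals."""
    current = {}                   # entity -> (state, since)
    intervals = defaultdict(list)
    for t, entity, state in transitions:
        if entity in current:
            prev_state, since = current[entity]
            intervals[entity].append((since, t, prev_state))
        current[entity] = (state, t)
    # Close the final, still-open state of each entity at end_time.
    for entity, (state, since) in current.items():
        intervals[entity].append((since, end_time, state))
    return dict(intervals)


def render_row(intervals, width, end_time, glyphs):
    """Render one entity as text, sampling the state at each column's midpoint."""
    row = []
    for col in range(width):
        t = (col + 0.5) * end_time / width
        state = next((s for start, end, s in intervals if start <= t < end), None)
        row.append(glyphs.get(state, " "))
    return "".join(row)


# Hypothetical transitions for two threads over ten seconds.
transitions = [
    (0.0, "thread-1", "running"), (0.0, "thread-2", "blocked"),
    (3.0, "thread-1", "blocked"), (4.0, "thread-2", "running"),
]
glyphs = {"running": "R", "blocked": "b"}
for entity, ivs in state_intervals(transitions, end_time=10.0).items():
    print(entity, render_row(ivs, width=10, end_time=10.0, glyphs=glyphs))
```

Even in this crude form, lining the rows up against a shared time axis makes it possible to see which entity was blocked while another was running.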

Visualization: Statemaps

The hurricane’s butterfly
• Finding the source(s) of pathologically performing systems must be thought of as debugging, albeit the hardest kind
• Debugging isn’t about making guesses; it’s about asking questions and answering them with observations
• We must enshrine observability to assure debuggability!
• Debugging rewards persistence, grit, and resilience more than intuition or insight; it is more perspiration than inspiration!
• We must have the faith that our systems are, in the end, purely synthetic; we can find the hurricane’s butterfly!
