The Hurricane's Butterfly: Debugging pathologically performing systems

Talk given as a Jane Street Tech Talk. Video: https://www.youtube.com/watch?v=7AO4wz6gI3Q

Bryan Cantrill

January 19, 2018

Transcript

  1. The Hurricane’s Butterfly
    Debugging pathologically performing systems
    CTO
    Bryan Cantrill
    @bcantrill

  2. Debugging system failure
    • Failures are easiest to debug when they are explicit and fatal
    • A system that fails fatally stops: it ceases to make forward
    progress, leaving behind a snapshot of its state — a core dump
    • Unfortunately, not all problems are like this…
    • A broad class of problems are non-fatal: the system continues
    to operate despite having failed, often destroying evidence
    • Worst of all are those non-fatal failures that are also implicit

  3. Implicit, non-fatal failure
    • The most difficult, time-consuming bugs to debug are those in
    which the system failure is unbeknownst to the system itself
    • The system does the wrong thing or returns the wrong result or
    has pathological side effects (e.g., resource leaks)
    • Of these, the gnarliest class is those failures that are not,
    strictly speaking, failures at all: the system is operating correctly,
    but is failing to operate in a timely or efficient fashion
    • That is, it just… sucks

  4. The stack of abstraction
    • Our software systems are built as stacks of abstraction
    • These stacks allow us to stand on the shoulders of history — to
    reuse components without rebuilding them
    • We can do this because of the software paradox: software is
    both information and machine, exhibiting properties of both
    • Our stacks are higher and run deeper than we can see or know:
    software is silent and opaque; the nature of abstraction is to
    seal us from what runs beneath!
    • They run so deep as to challenge our definition of software…

  5. The Butterflies
    • When the stack of abstraction performs pathologically, its power
    transmogrifies to peril: layering amplifies performance
    pathologies but hinders insight
    • Work amplifies as we go down the stack
    • Latency amplifies as we go up the stack
    • Seemingly minor issues in one layer can cascade into systemic
    pathological performance
    • These are the butterflies that cause hurricanes
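
The two amplification claims above can be made concrete with a toy model. This is only a sketch: the layer names and fan-out numbers are invented for illustration, and the latency half assumes that each layer issues its operations serially.

```python
from math import prod

# Hypothetical fan-out per layer: how many operations one layer issues to
# the layer below for each operation it receives (illustrative numbers).
FANOUT = {
    "application request -> database queries": 4,
    "database query -> filesystem reads": 8,
    "filesystem read -> disk I/Os": 2,
}


def work_at_bottom(requests: int) -> int:
    """Work amplifies going down the stack: the fan-outs multiply."""
    return requests * prod(FANOUT.values())


def latency_at_top(extra_ms_per_disk_io: float) -> float:
    """Latency amplifies going up the stack.

    Assumes every layer issues its operations serially, so a small delay
    added to each disk I/O is paid once per I/O by the top-level request.
    """
    return extra_ms_per_disk_io * prod(FANOUT.values())


if __name__ == "__main__":
    print(work_at_bottom(1))    # disk I/Os behind a single request
    print(latency_at_top(0.5))  # ms added to a request by 0.5 ms per I/O
```

Under these assumed numbers, one request becomes 64 disk I/Os, and half a millisecond of extra latency per I/O becomes 32 ms at the top of the stack.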

  6. Butterfly I: ARC-induced black hole

  7. Butterfly II: Disk reader starvation

  8. Butterfly III: Kernel page-table isolation
    Data courtesy Scaleway, running a PHP workload with KPTI patches for Linux. Thank you Edouard Bonlieu and team!

  9. The Hurricane
    • With pathologically performing systems, we are faced with
    Leventhal’s Conundrum: given a hurricane, find the butterflies!
    • This is excruciatingly difficult:
    • Symptoms are often far removed from root cause
    • There may not be a single root cause but several
    • The system is dynamic and may change without warning
    • Improvements to the system are hard to model and verify
    • Emphatically, this is not “tuning” — it is debugging

  10. Performance debugging
    • When we think of it as debugging, we can stop pretending that
    understanding (and rectifying) pathological system performance
    is rote or mechanical — or easy
    • We can resist the temptation to be guided by folklore: just
    because someone heard about something causing a problem
    once doesn’t mean it’s the problem now!
    • We can resist the temptation to change the system before
    understanding it: just as you wouldn’t (or shouldn’t!) debug by
    just changing code, you shouldn’t debug a pathologically
    performing system by randomly altering it!

  11. How do we debug?
    • To debug methodically, we must resist the temptation of quick
    hypotheses, focusing rather on questions and observations
    • Iterating between questions and observations gathers the facts
    that will constrain future hypotheses
    • These facts can be used to disconfirm hypotheses!
    • How do we ask questions?
    • How do we make observations?

  12. Asking questions
    • For performance debugging, the initial question formulation is
    particularly challenging: where does one start?
    • Resource-centric methodologies like the USE Method
    (Utilization/Saturation/Errors) can be excellent starting points…
    • But keep these methodologies in their context: they provide
    initial questions to ask — they are not recipes for debugging
    arbitrary performance pathologies!
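
One way to read the USE Method as a generator of initial questions is sketched below. The `ResourceSample` type, the 70% busy threshold, and the sample values are assumptions made for illustration; they are not part of the method or of the talk. The output is a list of questions to pursue with observation, not a diagnosis.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    name: str
    utilization: float  # fraction of time busy, 0.0 to 1.0
    saturation: int     # queued or waiting work, e.g. run-queue length
    errors: int         # error events seen in the sample window


def initial_questions(samples, busy_threshold=0.7):
    """Turn USE-style samples into questions to ask, not conclusions."""
    questions = []
    for s in samples:
        if s.errors > 0:
            questions.append(f"Why is {s.name} reporting {s.errors} errors?")
        if s.saturation > 0:
            questions.append(f"What is queueing on {s.name}?")
        if s.utilization >= busy_threshold:
            questions.append(
                f"What is keeping {s.name} {s.utilization:.0%} busy?"
            )
    return questions


if __name__ == "__main__":
    samples = [
        ResourceSample("disk0", utilization=0.95, saturation=12, errors=0),
        ResourceSample("cpu", utilization=0.20, saturation=0, errors=0),
    ]
    for question in initial_questions(samples):
        print(question)
```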

  13. Making observations
    • Questions are answered through observation
    • The observability of the system is paramount
    • If the system cannot be observed, one is reduced to guessing,
    making changes, and drawing inferences
    • If it must be said, drawing inferences based only on change is
    highly flawed: correlation does not imply causation!
    • To be observable, systems must be instrumentable: they must
    be able to be altered to emit a datum in the desired condition

  14. Observability through instrumentation
    • Static instrumentation modifies source to provide semantically
    relevant information, e.g., via logging or counters
    • Dynamic instrumentation allows for the system to be changed
    while running to emit data, e.g., DTrace, OpenTracing
    • Both mechanisms of instrumentation are essential!
    • Static instrumentation provides the observations necessary for
    early question formulation…
    • Dynamic instrumentation answers deeper, ad hoc questions

  15. Aside: Monitoring vs. observability
    • Monitoring is an essential operational activity that can indicate a
    pathologically performing system and provide initial questions
    • But monitoring alone is often insufficient to completely debug a
    pathologically performing system, because the questions that it
    can answer are limited to that which is monitored
    • As we increasingly deploy developed systems rather than
    received ones, it is a welcome (and unsurprising!) development
    to see the focus of monitoring expand to observability!

  16. Aggregation
    • When instrumented, the system can become overwhelmed
    by the overhead of instrumentation
    • Aggregation is essential for scalable, non-invasive
    instrumentation — and is a first-class primitive in (e.g.) DTrace
    • But aggregation also eliminates important dimensions of data,
    especially with respect to time; some questions may only be
    answered with disaggregated data!
    • Use aggregation for performance debugging — but also
    understand its limits!
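
A small sketch of both halves of this point follows: a power-of-two histogram, loosely modeled on the idea behind DTrace's quantize() aggregating function, compresses a stream of samples into a few buckets, but two streams with very different behavior over time aggregate identically. The latency values are invented for illustration.

```python
from collections import Counter


def quantize(values):
    """Power-of-two histogram: bucket b counts values v with b <= v < 2*b.

    Values below 1 fall into bucket 0. Assumes non-negative numbers.
    """
    buckets = Counter()
    for v in values:
        bucket = 0 if v < 1 else 1 << (int(v).bit_length() - 1)
        buckets[bucket] += 1
    return dict(sorted(buckets.items()))


# Hypothetical latencies in microseconds, in arrival order.
steady = [100, 100, 900, 100, 100, 900, 100, 100]  # outliers spread out
burst = [100, 100, 100, 100, 100, 100, 900, 900]   # outliers at the end

if __name__ == "__main__":
    print(quantize(steady))
    print(quantize(burst))
    # The aggregations match even though the time series differ.
    print(quantize(steady) == quantize(burst), steady == burst)
```

Whether the slow operations were spread evenly or arrived together is exactly the kind of question that only the disaggregated data can answer.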

  17. Visualization
    • The visual cortex is unparalleled at detecting patterns
    • The value of visualizing data is not merely providing answers,
    but also (and especially) provoking new questions
    • Our systems are so large, complicated and abstract that there is
    not one way to visualize them, but many
    • The visualization of systems and their representations is an
    essential skill for performance debugging!

  18. Visualization: Gnuplot
    • Graphs are terrific — so much so that we should not restrict
    ourselves to the captive graphs found in bundled software!
    • An ad hoc plotting tool is essential for performance debugging;
    and Gnuplot is an excellent (if idiosyncratic) one
    • Gnuplot is easily combined with workhorses like awk or perl
    • That Gnuplot is an essential tool helps to set expectations
    around performance debugging tools: they are not magicians!
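
The slide pairs Gnuplot with awk or perl; the sketch below plays the same role in Python, purely for illustration. It reduces raw event timestamps to per-second counts and produces a whitespace-separated data file plus a short Gnuplot script. The file names and the choice of per-second binning are assumptions.

```python
from collections import Counter


def per_second_counts(timestamps):
    """Bin event timestamps (in seconds) into (second, count) rows."""
    counts = Counter(int(t) for t in timestamps)
    return sorted(counts.items())


def gnuplot_data(rows):
    """Render rows as the whitespace-separated columns Gnuplot reads."""
    return "".join(f"{x} {y}\n" for x, y in rows)


def gnuplot_script(data_path, title):
    """Build a minimal Gnuplot script plotting column 2 against column 1."""
    return (
        f'set title "{title}"\n'
        'set xlabel "time (s)"\n'
        'set ylabel "events per second"\n'
        f'plot "{data_path}" using 1:2 with lines notitle\n'
    )


if __name__ == "__main__":
    rows = per_second_counts([0.1, 0.5, 1.2, 3.9])
    with open("events.dat", "w") as f:
        f.write(gnuplot_data(rows))
    with open("events.gp", "w") as f:
        f.write(gnuplot_script("events.dat", "Event rate"))
```

The resulting script can then be handed to the gnuplot command; changing the question usually means changing only the binning function.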

  19. Visualization: Heatmaps

  20. Visualization: Flamegraphs

  21. Visualization: Statemaps
    • Especially when trying to understand interplay between different
    entities, it can be useful to visualize their state over time
    • Time is the critical element here!
    • We are experimenting with statemaps whereby state transitions
    are instrumented (e.g., with DTrace) and then visualized
    • This is not necessarily a new way of visualizing the system
    (e.g., early thread debuggers often showed thread state over
    time), but with a new focus on post hoc visualization
    • Primordial implementation: https://github.com/joyent/statemap
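
The underlying idea, turning instrumented state transitions into a picture of state over time, can be sketched in a few lines. The tuple layout, entity names, states, and text rendering below are all assumptions for illustration; this is not the input format or the output of the statemap tool linked above.

```python
from collections import defaultdict


def intervals(transitions, end_time):
    """Turn time-sorted (time, entity, new_state) transitions into
    (entity, state, start, end) intervals, closing the last at end_time."""
    current = {}  # entity -> (state, time the state was entered)
    out = []
    for t, entity, state in transitions:
        if entity in current:
            prev_state, since = current[entity]
            out.append((entity, prev_state, since, t))
        current[entity] = (state, t)
    for entity, (state, since) in current.items():
        out.append((entity, state, since, end_time))
    return out


def render(transitions, end_time, glyphs):
    """One text row per entity, one character per integer time tick."""
    rows = defaultdict(lambda: [" "] * end_time)
    for entity, state, start, end in intervals(transitions, end_time):
        for tick in range(start, end):
            rows[entity][tick] = glyphs[state]
    return {entity: "".join(cells) for entity, cells in rows.items()}


if __name__ == "__main__":
    # Hypothetical transitions, as might be emitted by instrumentation.
    transitions = [
        (0, "thread-1", "running"),
        (0, "thread-2", "blocked"),
        (3, "thread-1", "blocked"),
        (3, "thread-2", "running"),
        (5, "thread-2", "blocked"),
    ]
    glyphs = {"running": "#", "blocked": "."}
    for entity, row in render(transitions, 6, glyphs).items():
        print(f"{entity} {row}")
```

Even in this crude form, lining the rows up against a shared time axis shows the interplay between entities: here, one thread starts running at the tick where the other blocks.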

  22. Visualization: Statemaps

  23. Visualization: Statemaps

  24. Visualization: Statemaps

  25. Visualization: Statemaps

  26. Visualization: Statemaps

  27. The hurricane’s butterfly
    • Finding the source(s) of pathologically performing systems must
    be thought of as debugging — albeit the hardest kind
    • Debugging isn’t about making guesses; it’s about asking
    questions and answering them with observations
    • We must enshrine observability to assure debuggability!
    • Debugging rewards persistence, grit, and resilience more than
    intuition or insight — it is more perspiration than inspiration!
    • We must have the faith that our systems are — in the end —
    purely synthetic; we can find the hurricane’s butterfly!
