Visualizing Systems with Statemaps

Talk given at the Observability Practitioners Summit at KubeCon in 2018. Video: https://www.youtube.com/watch?v=U4E0QxzswQc

Bryan Cantrill

December 10, 2018

Transcript

  1. Visualizing Systems with Statemaps
    Bryan Cantrill, CTO
    [email protected]
    @bcantrill


  2. The stack of abstraction
    • Our software systems are built as stacks of abstraction
    • These stacks allow us to stand on the shoulders of history — to
    reuse components without rebuilding them
    • We can do this because of the software paradox: software is
    both information and machine, exhibiting properties of both
    • Our stacks are higher and run deeper than we can see or know:
    software is opaque; the nature of abstraction is to seal us from
    what runs beneath!


  3. Run silent, run deep
    • Not only is the stack deep, it is silent
    • Running software emits neither light nor heat; it makes no
    sound; it attracts no mass; it (mostly) has no odor
    • Running software is — by all conventional notions — unseeable
    • This generally isn’t a bad thing, as long as it all works…


  4. Hurricanes from butterflies
    • When the stack of abstraction performs pathologically, its power
    transmogrifies to peril: layering amplifies performance
    pathologies but hinders insight
    • Work amplifies as we go down the stack
    • Latency amplifies as we go up the stack
    • Seemingly minor issues in one layer can cascade into systemic
    pathological performance…
    • As the system becomes dominated by its outliers, butterflies
    spawn hurricanes of pathological performance


  5. Debugging the hurricanes
    • Understanding a pathologically performing system is
    excruciatingly difficult:
    • Symptoms are often far removed from root cause
    • There may not be a single root cause but several
    • The system is dynamic and may change without warning
    • Improvements to the system are hard to model and verify
    • Emphatically, this is not “tuning” — it is debugging


  6. How do we debug?
    • To debug methodically, we must resist the temptation of quick
    hypotheses, focusing instead on questions and observations
    • Iterating between questions and observations gathers the facts
    that will constrain future hypotheses
    • These facts can be used to disconfirm hypotheses!
    • How do we ask questions?
    • How do we make observations?


  7. Asking questions
    • For performance debugging, the initial question formulation is
    particularly challenging: where does one start?
    • Resource-centric methodologies like the USE Method
    (Utilization/Saturation/Errors) can be excellent starting points…
    • But keep these methodologies in their context: they provide
    initial questions to ask — they are not recipes for debugging
    arbitrary performance pathologies!


  8. Making observations
    • Questions are answered through observation
    • But — reminder! — software cannot be conventionally seen!
    • It is up to the system itself to have the capacity to be seen
    • This capacity is the system’s observability — and without it, we
    are reduced to guessing
    • Do not conflate software observability with control theory’s
    definition of observability!
    • Software is observable when it can answer your question about
    its behavior — software observability is not a boolean!


  9. The pillars of observability
    • Much has been made of the so-called “pillars of observability”:
    monitoring, logging and instrumentation
    • Each of these is important, for each has within it the capacity to
    answer questions about the system
    • But each also has limitations!
    • Their shared limitation: each can only be as effective as the
    observer — they cannot answer questions not asked!
    • Observability seeks to answer questions asked and prompt new
    ones: the human is the foundation of observability!


  10. Observability through instrumentation
    • Static instrumentation modifies source to provide semantically
    relevant information, e.g., via logging or counters
    • Dynamic instrumentation allows the system to be changed while
    running to emit data, e.g., DTrace or OpenTracing
    • Both mechanisms of instrumentation are essential!
    • Static instrumentation provides the observations necessary for
    early question formulation…
    • Dynamic instrumentation answers deeper, ad hoc questions


  11. Aggregation
    • When instrumenting a system, it is easy to overwhelm it with the
    overhead of instrumentation
    • Aggregation is essential for scalable, non-invasive
    instrumentation — and is a first-class primitive in (e.g.) DTrace
    • But aggregation also eliminates important dimensions of the data,
    especially with respect to time; some questions may only be
    answered with disaggregated data (see the sketch below)!
    • Use aggregation for performance debugging — but also
    understand its limits!
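
    A minimal Python sketch of that tradeoff (the data is invented for
    illustration): aggregating latencies into a histogram answers "how
    many operations were slow?", but only the disaggregated,
    timestamped data can answer "were the slow ones clustered in time?"

    from collections import Counter

    # timestamped observations: (time, latency_ms); invented data
    samples = [(0, 1), (1, 1), (2, 50), (3, 51), (4, 1), (5, 1)]

    # the aggregation is compact, but the time dimension is gone
    histogram = Counter(latency for _, latency in samples)
    print(histogram)  # Counter({1: 4, 50: 1, 51: 1})

    # only the disaggregated data shows the outliers were adjacent in time
    print([t for t, latency in samples if latency >= 50])  # [2, 3]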


  12. Visualization
    • The visual cortex is unparalleled at detecting patterns
    • The value of visualizing data is not merely providing answers,
    but also (and especially) provoking new questions
    • Our systems are so large, complicated and abstract that there is
    not one way to visualize them, but many
    • The visualization of systems and their representations is an
    essential facet of system observability!


  13. Visualization: Gnuplot
    • Graphs are terrific — so much so that we should not restrict
    ourselves to the captive graphs found in bundled software!
    • An ad hoc plotting tool is essential for performance debugging;
    and Gnuplot is an excellent (if idiosyncratic) one
    • Gnuplot is easily combined with workhorses like awk or perl
    • That Gnuplot is an essential tool helps to set expectations
    around performance debugging tools: they are not magicians!


  14. Visualization: Heatmaps


  15. Visualization: Flamegraphs


  16. Visualization: Statemaps
    • Flamegraphs help understand the work a system is doing, but
    how does one visualize a system that isn’t doing work?
    • That is, idleness is a common pathology in a suboptimal
    system; there is a hidden bottleneck — but where?
    • To explore these kinds of problems, we have developed
    statemaps, a visualization of entity state over time


  17. Visualization: Statemaps


  18. Statemap input data
    • Statemaps operate on a payload of concatenated JSON where
    each line corresponds to a state transition for an entity:


    { "time": "52524411", "entity": "30080", "state": 0 }

    { "time": "52587486", "entity": "30137", "state": 0 }
    { "time": "52769425", "entity": "30080", "state": 4 }
    { "time": "52895402", "entity": "30137", "state": 1 }
    { "time": "53177670", "entity": "62308", "state": 0 }
    { "time": "53230742", "entity": "30137", "state": 0 }
    { "time": "53268043", "entity": "30137", "state": 1 }
    { "time": "53562441", "entity": "62308", "state": 4 }
    { "time": "53616633", "entity": "30137", "state": 0 }
    { "time": "53762283", "entity": "30137", "state": 6 }



  19. Statemap input data
    • States are described in a JSON metadata header, e.g.:

    {
      "start": [ 1544138397, 322335287 ],
      "title": "PostgreSQL statemap on HAB01436, by process ID",
      "host": "HAB01436",
      "entityKind": "Process",
      "states": {
        "on-cpu": { "value": 0, "color": "#DAF7A6" },
        "off-cpu-waiting": { "value": 1, "color": "#f9f9f9" },
        "off-cpu-semop": { "value": 2, "color": "#FF5733" },
        "off-cpu-blocked": { "value": 3, "color": "#C70039" },
        "off-cpu-zfs-read": { "value": 4, "color": "#FFC300" },
        "off-cpu-zfs-write": { "value": 5, "color": "#338AFF" },
        "off-cpu-zil-commit": { "value": 6, "color": "#66FFCC" },
        "off-cpu-tx-delay": { "value": 7, "color": "#CCFF00" },
        "off-cpu-dead": { "value": 8, "color": "#E0E0E0" },
        "wal-init": { "value": 9, "color": "#dd1871" },
        "wal-init-tx-delay": { "value": 10, "color": "#fd4bc9" }
      }
    }
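
    Such a header can be generated programmatically; here is a Python
    sketch, with the state names and colors taken from the example
    above (the helper itself is hypothetical):

    import json

    def make_header(title, host, start, states):
        """Build a statemap metadata header; `states` maps state name
        to color, and integer values are assigned in declaration
        order."""
        return {
            "start": start,  # [ seconds, nanoseconds ]
            "title": title,
            "host": host,
            "entityKind": "Process",
            "states": {
                name: {"value": value, "color": color}
                for value, (name, color) in enumerate(states.items())
            },
        }

    header = make_header(
        "PostgreSQL statemap on HAB01436, by process ID", "HAB01436",
        [1544138397, 322335287],
        {"on-cpu": "#DAF7A6", "off-cpu-waiting": "#f9f9f9"},
    )
    print(json.dumps(header))  # the header precedes the transitions

    The "value" assigned to each state here is the integer that the
    transition records carry in their "state" field.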


  20. Statemap output
    • Statemap rendering code processes the JSON stream and renders it
    into an SVG that is the actual statemap
    • SVG can be manipulated interactively (zoomed, panned,
    highlighted, etc.) but also stands independently
    • Statemaps are entirely neutral with respect to methodology!


  21. Instrumentation for statemaps
    • Statemaps themselves — like Gnuplot — are entirely generic with
    respect to input data: they visualize arbitrary state over arbitrary time
    • We have developed example statemap-generating dynamic
    instrumentation for database, CPU, I/O, and filesystem operations
    • The data rate in terms of state transitions per second varies
    based on what is being instrumented: from <10/sec to >1M/sec


  22. Coalescing states
    • For even modestly large inputs, adjacent states must be
    coalesced to allow for reasonable visualization
    • When this aggregation is required, the statemap rendering code
    coalesces the least significant two adjacent states — allowing
    for larger trends to stay intact
    • The threshold at which states are coalesced can be dynamically
    adjusted to allow for higher resolution
    • Importantly, the original data retains all state transitions!
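
    As a simplified illustration of the idea (not the renderer's actual
    implementation), the following Python sketch coalesces a list of
    (duration, state) intervals by repeatedly merging the adjacent pair
    with the smallest combined duration, so the intervals that dominate
    the timeline survive:

    def coalesce(intervals, target):
        """Coalesce (duration, state) intervals down to `target`
        entries by repeatedly merging the least significant adjacent
        pair; the merged interval keeps the state of its longer
        constituent."""
        intervals = list(intervals)
        while len(intervals) > target:
            # find the adjacent pair whose combined duration is smallest
            i = min(range(len(intervals) - 1),
                    key=lambda j: intervals[j][0] + intervals[j + 1][0])
            (d0, s0), (d1, s1) = intervals[i], intervals[i + 1]
            intervals[i:i + 2] = [(d0 + d1, s0 if d0 >= d1 else s1)]
        return intervals

    # brief on-CPU blips fold into the surrounding waiting state:
    # [(902, 'waiting'), (900, 'on-cpu')]
    print(coalesce([(500, "waiting"), (2, "on-cpu"), (400, "waiting"),
                    (900, "on-cpu")], 2))

    (In this sketch the shorter state is simply discarded; the actual
    renderer retains the time spent in each constituent state of a
    coalesced rectangle.)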


  23. Coalescing states


  24. Coalescing states


  25. Tagged statemaps
    • We have found it useful to be able to tag states with immutable
    information that describes the context around the state
    • For example, tagging a state for CPU execution with immutable
    context information (process, thread, etc.)
    • The tag is defined separately in the stream and referenced by
    subsequent state transitions, e.g.:

    { "state": 0, "tag": "d136827", "pid": "51943", "tid": "1",
      "execname": "postgres",
      "psargs": "/opt/postgresql/9.6.3/bin/postgres -D /manatee/pg/data" }

    …

    { "time": "330931", "entity": "12", "state": 0, "tag": "d136827" }


  26. Tagged statemaps


  27. Stacked statemaps
    • We have found it useful to be able to stack statemaps from
    either disjoint sources or disjoint machines
    • Allows for activity in one domain or machine to be tightly
    correlated with activity in another domain or machine
    • Across machines, statemaps can be subject to wall clock skew…
    • …but if wall clocks are skewing within the datacenter, there are
    likely bigger problems!


  28. Stacked statemaps across domains


  29. Stacked statemaps across machines


  30. Stacked statemaps across many machines?


  31. Statemaps
    • Statemaps provide a generic and system-neutral tool for
    visualizing system state over time
    • Statemaps use visualization to prompt questions
    • Statemaps work in concert with system observability facilities
    that can answer the questions that statemaps raise
    • We must keep the human in mind when developing for
    observability — the capacity to answer arbitrary questions is
    only as effective as the human asking them!
    • Statemap renderer: https://github.com/joyent/statemap
