In the Center of the Cyclone: Finding Sources of Resilience

Sustaining the potential to adapt to unforeseen situations (resilience) is a necessary element of complex systems; one could say that every successful endeavor requires it. But resilience is, in many ways, invisible and difficult to locate in concrete, grounded ways. By definition, understanding complex systems cannot rely on simple approaches.

“Monitoring,” “observability,” “culture,” “management,” “organizational design”: none of these terms, concepts, or approaches can help us in this area on its own. We’ll walk through empirically supported approaches that do.

John Allspaw

August 16, 2018

Transcript

  1. In the Center of the Cyclone
    Finding Sources of Resilience
    John Allspaw
    @allspaw
    Adaptive Capacity Labs

  5. RESILIENCE

  6. In the Center of the Cyclone
    Finding Sources of Resilience
    where to look?
    how to look?
    implies that they require effort to be identified

  7. sustained adaptive capacity

  8. “…the ability to recognize and adapt to handle unanticipated perturbations…” (Woods)
    “a resilient system must be both prepared, and be prepared to be unprepared.” (Pariès)

  9. unforeseen
    unanticipated
    unexpected
    fundamentally surprising

  10. where to look?
    how to look?

  12. externally sourced code (e.g. DB)
    results
    the using world
    delivery
    technology stack
    internally sourced code
    results

  13. macro descriptions
    externally sourced code (e.g. DB)
    results
    the using world
    delivery
    technology stack
    internally sourced code
    results

  14. code repositories
    macro descriptions
    testing/validation suites
    code
    code stuff
    meta rules
    scripts, rules, etc.
    test cases
    code generating tools
    testing tools
    deploy tools
    organization/encapsulation tools
    “monitoring” tools
    externally sourced code (e.g. DB)
    results
    the using world
    delivery
    technology stack
    internally sourced code
    results

  15. [Diagram repeated from slide 14, with partially visible labels: system; system framing; doing]

  16. Adding stuff to the running system
    Getting stuff ready to be part of the running system
    architectural & structural framing
    keeping track of what “the system” is doing
    code
    deploy
    organization/
    “monitoring”

  17. code generating tools
    testing tools
    deploy tools
    organization/encapsulation tools
    “monitoring” tools
    Adding stuff to the running system
    Getting stuff ready to be part of the running system
    architectural & structural framing
    keeping track of what “the system” is doing
    code repositories
    macro descriptions
    testing/validation suites
    code
    code stuff
    meta rules
    scripts, rules, etc.
    test cases
    externally sourced code (e.g. DB)
    results
    the using world
    delivery
    technology stack
    internally sourced code
    results

  18. What matters. Why what matters matters.
    code repositories
    macro descriptions
    testing/validation suites
    code
    code stuff
    meta rules
    scripts, rules, etc.
    test cases
    code generating tools
    testing tools
    deploy tools
    organization/encapsulation tools
    “monitoring” tools
    above the line
    below the line
    Why is it doing that?
    What needs to change?
    What does it mean?
    How should this work?
    What’s it doing?
    What does it mean?
    What is happening?
    What should be happening
    What does it mean?
    Adding stuff to the running system
    Getting stuff ready to be part of the running system
    architectural & structural framing
    keeping track of what “the system” is doing
    externally sourced code (e.g. DB)
    results
    the using world
    delivery
    technology stack
    internally sourced code
    results
    goals
    purposes
    risks
    cognition
    actions
    interactions
    speech
    gestures
    clicks
    signals
    representations
    artifacts
    the line of representation
    individuals have unique models of the “system”
    observing
    inferring
    anticipating
    planning
    troubleshooting
    diagnosing
    correcting
    modifying
    reacting
    Copyright © 2016 by R.I. Cook for ACL, all rights reserved
    ack: Michael Angeles http://konigi.com/tools/

  19. [Diagram repeated from slide 18]

  20. What matters. Why what matters matters.
    code
    deploy
    organization/encapsulation
    “monitoring”
    Why is it doing that?
    What needs to change?
    What does it mean?
    How should this work?
    What’s it doing?
    What does it mean?
    What is happening?
    What should be happening
    What does it mean?
    Adding stuff to the running system
    Getting stuff ready to be part of the running system
    architectural & structural framing
    keeping track of what “the system” is doing
    goals
    purposes
    risks
    cognition
    actions
    interactions
    speech
    gestures
    clicks
    signals
    representations
    observing
    inferring
    anticipating
    planning
    troubleshooting
    diagnosing
    correcting
    modifying
    reacting

  21. observing
    inferring
    anticipating
    planning
    troubleshooting
    diagnosing
    correcting
    modifying
    reacting
    …is what copes with and adapts to:
    unforeseen
    unanticipated
    unexpected
    fundamentally surprising

  22. [Diagram repeated from slide 18]
    RESILIENCE IS HERE (ABOVE THE LINE)

  23. sustained adaptive capacity = human performance (cognitive work)

  24. challenges and barriers to finding sources of resilience

  26. smoothing-the-messy-details-to-fit-a-model goggles

  27. detect the issue
    develop hypothes(es) for contributors
    (in)validate hypothes(es)
    fix issue

  28. enumerate possible causes
    use process of elimination
    collect more data
    refine remaining hypotheses
    prove remaining hypotheses
    cannot
    fix issue
    can

  29. ?

  30. ?

  31. ?

  32. this is what I want you to pay attention to
    true, but unrelated to this talk

  33. we need to understand actual work
    …how the CTO knew an esoteric trick to get things working again
    …the realization that yes, there IS actually a problem with AWS, and Lisa in your infrastructure team called someone she knows there to get the issue escalated
    …how Vanessa managed to improvise a script to piece together accidentally deleted data from Hadoop, Elasticsearch indexes, and the Wayback Machine before anyone notices
    …what Jenn in Security does to discern ‘normal’ bug behavior from signals of an attack
    …the realization that when a bottleneck appears in the analytics pipeline, there are some bits of data collection that can be shut off without severe impact

  34. we need to hear about
    …the red herrings, rabbit holes, and unproductive threads of activity
    …what sources of data or information people do not trust
    …how responders bring newcomers up to speed
    …what sacrifice decisions people are making
    …how specialists in one field communicate problem solving to another

  36. shallow data goggles

  37. site down site up
    29 min
    18 min 16 min

  38. Infra Eng1: 4 years
    DBA: 2.5 years
    DBA: 1 year
    App Eng1: 2 years
    Mobile Eng1: 2.5 years
    App Eng3: 3 years
    Infra Eng2: 1.5 years
    App Eng2: 1 year
    site down site up

  39. site down site up
    called in for specific expertise
    also providing updates to customer support in parallel to responding to incident
    in Budapest with poor conference wifi
    unknowingly looking at incorrect availability zone log data

  40. site down site up
    critical
    relayed observations
    stated hypotheses

  41. logs
    time of year
    day of year
    time of day
    observations and hypotheses others share
    what has been investigated thus far
    what’s been happening in the world (news, service provider outages, etc.)
    time-series data
    alerts
    tracing/observability tools
    recent changes in existing tech
    new dependencies
    who is on vacation, at a conference, traveling, etc.
    status of other ongoing work

  44. we have to take off our Goggles

  45. My experience is that our community has very little patience for looking closely and deeply at cognitive work.
    “What tool do I use?”

  46. 1999
    1978
    1996
    This will take time.

  47. http://resiliencepapers.club
    SRE Cognitive Work, Seeking SRE, O’Reilly
    http://stella.report

  49. THE END
