A walk to remember: Debugging a distributed system failure

flaper87
August 22, 2016

Debugging distributed systems comes with a different set of complications than other fields in our industry. Each system may behave differently depending on the environment it's running in, and this nondeterministic behavior makes the process more challenging. If the debugging happens in a production environment, the risk increases and nerves get the better of us.

The debugging process for a distributed system is hardly ever the same twice. Therefore, we need a tool belt ready to attack the issue from different fronts, but we also need to be ready to back off once we've gathered enough information to do a proper analysis.

This talk will walk you through the debugging process for an issue on an OpenStack deployment and the strategy used from a technical and non-technical perspective.

Transcript

  1. A walk to remember
    Debugging a distributed system failure

  2. For attending
    Still here
    feel free to interrupt
    @flaper87
    [email protected]

  4. 4 Main Topics

  5. WDH What are
    we doing here?

  6. HLC High to Low
    context

  7. RCA Root Cause
    Analysis

  8. BIH Bring it
    home

  9. WDH What are
    we doing here?
    RCA Root Cause
    Analysis
    HLC High to Low
    context
    BIH Bring it
    home
    4 main topics

  10. WDH What are
    we doing here?
    RCA Root Cause
    Analysis
    HLC High to Low
    context
    BIH Bring it
    home
    Follow the
    acronyms

  11. WDH What are
    we doing here?
    RCA Root Cause
    Analysis
    HLC High to Low
    context
    BIH Bring it
    home
    It’d be great to
    remember their
    meaning

  12. WDH What are
    we doing here?
    RCA Root Cause
    Analysis
    HLC High to Low
    context
    BIH Bring it
    home
    I don’t think I
    remember it

  13. WDH What are
    we doing here?
    RCA Root Cause
    Analysis
    HLC High to Low
    context
    BIH Bring it
    home
    That’s why I’ve so
    many slides on
    acronyms

  14. WDH What are
    we doing here?
    RCA Root Cause
    Analysis
    HLC High to Low
    context
    BIH Bring it
    home
    Just making sure
    the context is set

  15. WDH What are
    we doing here?
    RCA Root Cause
    Analysis
    HLC High to Low
    context
    BIH Bring it
    home
    I could go on
    forever

  16. WDH What are
    we doing here?
    RCA Root Cause
    Analysis
    HLC High to Low
    context
    BIH Bring it home
    ...but I won’t

  17. WDH What are
    we doing here?

  18. What kind of issue are
    we facing?
    WDH

  19. Know what y’all are
    expected to do
    WDH

  20. Make sure the right
    people are involved
    WDH

  21. Don’t pull in the entire
    company
    WDH

  22. HLC High to Low
    context

  23. Assume you’re working
    in a low-context
    environment
    HLC

  24. Don’t make
    assumptions about the
    steps that have been
    taken
    HLC

  25. Every part of the system
    is guilty till proven
    innocent
    HLC

  26. Know the system’s
    topology
    HLC

  27. RCA Root Cause
    Analysis

  28. Have a list of steps to
    follow
    RCA

  29. Many times systems
    are just misconfigured
    RCA
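
A quick way to rule misconfiguration in or out is to compare the same configuration file across two nodes. The sketch below is only an illustration: it assumes INI-style files, and the section names, option names, and values in the sample are made up.

```python
import configparser


def load(text):
    """Parse INI-style text into {section: {option: value}}."""
    # default_section is set to an unused name so that a literal [DEFAULT]
    # section is treated like any other section instead of being merged
    # into every other section.
    parser = configparser.ConfigParser(
        interpolation=None, default_section="__unused__")
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


def config_diff(left, right):
    """List (section, option, left_value, right_value) for every mismatch."""
    a, b = load(left), load(right)
    mismatches = []
    for section in sorted(set(a) | set(b)):
        opts_a, opts_b = a.get(section, {}), b.get(section, {})
        for option in sorted(set(opts_a) | set(opts_b)):
            if opts_a.get(option) != opts_b.get(option):
                mismatches.append(
                    (section, option, opts_a.get(option), opts_b.get(option)))
    return mismatches


# Hypothetical configuration from two nodes that should be identical.
node_a = "[database]\nmax_retries = 10\n[messaging]\ntimeout = 60\n"
node_b = "[database]\nmax_retries = 10\n[messaging]\ntimeout = 6\n"

for section, option, left, right in config_diff(node_a, node_b):
    print(f"[{section}] {option}: {left!r} != {right!r}")
```

An option missing on one side shows up with `None` as its value, which makes forgotten settings as visible as wrong ones.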

  30. Bottom-up debugging
    RCA

  31. Top-to-bottom
    debugging
    RCA

  32. Monkey debugging
    RCA

  33. Correlate your logs
    RCA

  34. Trace events
    throughout the system
    RCA

  35. Timestamps are pretty
    much your life
    RCA
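
Correlating logs, tracing an event, and leaning on timestamps can all be done in one small step: merge the per-service logs into a single timeline and keep only the lines that mention one request. This is a rough sketch, not a tool from the talk. The log line layout, the service names, and the `req-...` identifiers in the sample are assumptions made for illustration.

```python
import heapq
import re
from datetime import datetime

# Assumed line layout: "<YYYY-MM-DD HH:MM:SS.mmm> <message>"; adjust to your logs.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (.*)$")


def parse(service, lines):
    """Yield (timestamp, service, message) for every line that matches."""
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            yield stamp, service, match.group(2)


def timeline(logs, request_id):
    """Merge per-service logs by timestamp, keeping one request's events."""
    streams = [parse(service, lines) for service, lines in logs.items()]
    merged = heapq.merge(*streams)  # each stream must already be time-ordered
    return [event for event in merged if request_id in event[2]]


# Made-up sample data for two hypothetical services.
logs = {
    "api": [
        "2016-08-22 10:00:00.100 req-aaa accepted request",
        "2016-08-22 10:00:00.900 req-bbb accepted request",
    ],
    "worker": [
        "2016-08-22 10:00:00.450 req-aaa started job",
        "2016-08-22 10:00:01.200 req-aaa job failed",
    ],
}

for stamp, service, message in timeline(logs, "req-aaa"):
    print(stamp.isoformat(), service, message)
```

The merged order is only as trustworthy as the clocks behind it: timestamps from different hosts are comparable only if those hosts keep their clocks synchronized.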

  36. Compare executions
    RCA
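
One way to compare executions is to diff the log of a run that worked against the log of a run that failed, after stripping the parts that always differ. The sketch below assumes every line starts with a timestamp in the layout shown. The sample lines are invented.

```python
import difflib
import re

# Assumption: each line starts with a timestamp we want to ignore when comparing.
STAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} ")


def normalize(lines):
    """Drop the leading timestamp so only the sequence of events is compared."""
    return [STAMP_RE.sub("", line) for line in lines]


def compare_runs(good, bad):
    """Return a unified diff of two executions, timestamps removed."""
    return list(difflib.unified_diff(
        normalize(good), normalize(bad),
        fromfile="good", tofile="bad", lineterm="",
    ))


# Invented logs of a successful and a failing execution.
good_run = [
    "2016-08-22 10:00:00.100 connect to queue",
    "2016-08-22 10:00:00.200 publish message",
    "2016-08-22 10:00:00.300 receive ack",
]
bad_run = [
    "2016-08-22 11:30:00.100 connect to queue",
    "2016-08-22 11:30:00.200 publish message",
    "2016-08-22 11:30:05.200 timeout waiting for ack",
]

for line in compare_runs(good_run, bad_run):
    print(line)
```

Other values that change on every run, such as request identifiers or host names, would need the same normalization before the diff becomes readable.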

  37. Visualization tools are
    quite handy
    RCA

  38. BIH Bring it
    home

  39. Some bugs just take
    longer to find
    BIH

  40. Describe the
    (real)
    problem
    WE F***ED this UP
    BIH

  41. Build new tools for
    future cases
    BIH

  42. Build a knowledge
    base
    BIH

  43. Summary-ish
    1 Have clear goals
    2 Know system’s topology
    3 Keep a low-context environment
    4 Don’t assume anything
    5 Keep the time small and contextualized

  44. Summary-ish
    6 Build new debugging tools
    7 Have a check list
    8 Check configuration files too
    9 Dunno what to put here
    10 … seriously, no clue

  45. references
    1 Distributed Debugging: http://bit.ly/2bDLXj3
    2 Debugging Deployed Distributed Systems: http://bit.ly/2bDN6aj
    3 The ETTO Principle: http://bit.ly/2bbZmvV
    4 The programming Ape: https://vimeo.com/40988625
    5 Blood and tears

  46. Questions?
