A walk to remember: Debugging a distributed system failure

flaper87
August 22, 2016

Debugging distributed systems comes with a different set of complications than other fields in our industry. Each system may behave differently depending on the environment it's running in, and this nondeterministic behavior makes the process more challenging. If the debugging happens in a production environment, the risk increases and nerves get the better of us.

The debugging process for a distributed system is hardly ever the same twice. Therefore, we need a tool belt ready to attack the issue from different fronts, but we also need to be ready to back off once we've gathered enough information to do a proper analysis.

This talk will walk you through the debugging process for an issue on an OpenStack deployment and the strategy used from a technical and non-technical perspective.

Transcript

  1. A walk to remember: Debugging a distributed system failure

  2. For attending Still here feel free to interrupt @flaper87 flavio@redhat.com

  3. None
  4. 4 Main Topics

  5. WDH What are we doing here?

  6. HLC High to Low context

  7. RCA Root Cause Analysis

  8. BIH Bring it home

  9. 4 main topics: WDH (What are we doing here?), HLC (High to Low context), RCA (Root Cause Analysis), BIH (Bring it home)

  10. Follow the acronyms: WDH (What are we doing here?), HLC (High to Low context), RCA (Root Cause Analysis), BIH (Bring it home)

  11. It’d be great to remember their meaning: WDH (What are we doing here?), HLC (High to Low context), RCA (Root Cause Analysis), BIH (Bring it home)

  12. I don’t think I remember it: WDH (What are we doing here?), HLC (High to Low context), RCA (Root Cause Analysis), BIH (Bring it home)

  13. That’s why I’ve so many slides on acronyms: WDH (What are we doing here?), HLC (High to Low context), RCA (Root Cause Analysis), BIH (Bring it home)

  14. Just making sure the context is set: WDH (What are we doing here?), HLC (High to Low context), RCA (Root Cause Analysis), BIH (Bring it home)

  15. I could go on forever: WDH (What are we doing here?), HLC (High to Low context), RCA (Root Cause Analysis), BIH (Bring it home)

  16. ...but I won’t: WDH (What are we doing here?), HLC (High to Low context), RCA (Root Cause Analysis), BIH (Bring it home)

  17. WDH What are we doing here?

  18. What kind of issue are we facing? WDH

  19. Know what y’all are expected to do WDH

  20. Make sure the right people are involved WDH

  21. Don’t pull in the entire company WDH

  22. HLC High to Low context

  23. Assume you’re working in a low-context environment HLC

  24. Don’t make assumptions about the steps that have been taken HLC

  25. Every part of the system is guilty till proven innocent HLC

  26. Know the system’s topology HLC

  27. RCA Root Cause Analysis

  28. Have a list of steps to follow RCA

  29. Many times systems are just misconfigured RCA

  30. Bottom-up debugging RCA

  31. Top-to-bottom debugging RCA

  32. Monkey debugging RCA

  33. Correlate your logs RCA

  34. Trace events throughout the system RCA

  35. Timestamps are pretty much your life RCA

  36. Compare executions RCA

  37. Visualization tools are quite handy RCA

  38. BIH Bring it home

  39. Some bugs just take longer to find BIH

  40. Describe the (real) problem WE F***ED this UP BIH

  41. Build new tools for future cases BIH

  42. Build a knowledge base BIH

  43. Summary-ish

    1. Have clear goals
    2. Know system’s topology
    3. Keep a low-context environment
    4. Don’t assume anything
    5. Keep the time small and contextualized

  44. Summary-ish

    6. Build new debugging tools
    7. Have a check list
    8. Check configuration files too
    9. Dunno what to put here
    10. … seriously, no clue

  45. References

    1. Distributed Debugging: http://bit.ly/2bDLXj3
    2. Debugging Deployed Distributed Systems: http://bit.ly/2bDN6aj
    3. The ETTO Principle: http://bit.ly/2bbZmvV
    4. The programming Ape: https://vimeo.com/40988625
    5. Blood and tears

  46. Questions?
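
The RCA slides on correlating logs, tracing events, and relying on timestamps (33 to 35) can be illustrated with a small sketch. This is not taken from the talk: the log line layout, the `req-` request ID prefix, and the service names below are all assumptions made up for the example. It also assumes that each service's log is already in timestamp order and that the clocks of the machines involved are reasonably in sync.

```python
import heapq
import re
from datetime import datetime

# Assumed (hypothetical) line layout: "<ISO timestamp> <request id> <message>"
LINE_RE = re.compile(r"^(\S+) (req-\S+) (.*)$")


def parse(service, lines):
    """Yield (timestamp, service, request_id, message) for well-formed lines."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # silently skip lines that do not fit the assumed layout
            timestamp = datetime.fromisoformat(match.group(1))
            yield (timestamp, service, match.group(2), match.group(3))


def correlate(logs, request_id):
    """Merge per-service logs into a single timeline for one request.

    `logs` maps a service name to its log lines, each already sorted by time.
    """
    streams = [parse(service, lines) for service, lines in logs.items()]
    merged = heapq.merge(*streams)  # lazy merge of the sorted streams
    return [event for event in merged if event[2] == request_id]


# Made-up sample data, only to show the shape of the input.
sample_logs = {
    "api": [
        "2016-08-22T10:00:01 req-1 accepted",
        "2016-08-22T10:00:05 req-2 accepted",
    ],
    "scheduler": [
        "2016-08-22T10:00:02 req-1 picked host",
        "a line that does not match the layout",
    ],
    "compute": [
        "2016-08-22T10:00:03 req-1 boot failed",
    ],
}

timeline = correlate(sample_logs, "req-1")
for timestamp, service, _, message in timeline:
    print(timestamp.isoformat(), service, message)
```

Following one request ID across services turns three separate logs into one ordered story, which is also a convenient starting point for comparing a failing execution against a successful one. If the clocks are not in sync, the merged order is only approximate and should be treated with suspicion.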