A walk to remember: Debugging a distributed system failure

flaper87
August 22, 2016

Debugging distributed systems comes with a different set of complications than other fields in our industry. Each system may behave differently depending on the environment it's running in, and this nondeterministic behavior makes the process more challenging. If the debugging happens in a production environment, the risk increases and the nerves get to us.

The debugging process for a distributed system is hardly ever the same twice. Therefore, we need to have a toolbelt ready to attack the issue from different fronts, but we also need to be ready to back off when we've gathered enough information to do a proper analysis.

This talk will walk you through the debugging process for an issue on an OpenStack deployment and the strategy used from a technical and non-technical perspective.

Transcript

  1. WDH (What are we doing here?), RCA (Root Cause Analysis), HLC (High to Low context), BIH (Bring it home)

    4 main topics
  2. WDH (What are we doing here?), RCA (Root Cause Analysis), HLC (High to Low context), BIH (Bring it home)

    Follow the acronyms
  3. WDH (What are we doing here?), RCA (Root Cause Analysis), HLC (High to Low context), BIH (Bring it home)

    It’d be great to remember their meaning
  4. WDH (What are we doing here?), RCA (Root Cause Analysis), HLC (High to Low context), BIH (Bring it home)

    I don’t think I remember it
  5. WDH (What are we doing here?), RCA (Root Cause Analysis), HLC (High to Low context), BIH (Bring it home)

    That’s why I’ve so many slides on acronyms
  6. WDH (What are we doing here?), RCA (Root Cause Analysis), HLC (High to Low context), BIH (Bring it home)

    Just making sure the context is set
  7. WDH (What are we doing here?), RCA (Root Cause Analysis), HLC (High to Low context), BIH (Bring it home)

    I could go on forever
  8. WDH (What are we doing here?), RCA (Root Cause Analysis), HLC (High to Low context), BIH (Bring it home)

    ...but I won’t
  9. Summary-ish

    1. Have clear goals
    2. Know system’s topology
    3. Keep a low-context environment
    4. Don’t assume anything
    5. Keep the time small and contextualized
  10. Summary-ish

    6. Build new debugging tools
    7. Have a checklist
    8. Check configuration files too
    9. Dunno what to put here
    10. … seriously, no clue
  11. References

    1. Distributed Debugging: http://bit.ly/2bDLXj3
    2. Debugging Deployed Distributed Systems: http://bit.ly/2bDN6aj
    3. The ETTO Principle: http://bit.ly/2bbZmvV
    4. The programming Ape: https://vimeo.com/40988625
    5. Blood and tears