The Lifecycle of an Outage

When an incident occurs, we typically have an increased risk of an outage. How we structure our initial response, our decision-making process, and our communication directly affects the impact that the incident will have. We need to think critically about our ability to quickly resolve any problems and reduce the risk of future incidents.

Speaker Notes -- https://gist.github.com/jssjr/5957e9a5cc3ca4846e9c
Video -- http://vimeo.com/95245539

Scott Sanders

May 06, 2014

Transcript

  1. The Lifecycle of an Outage

  2. Scott Sanders github.com/jssjr @scott_sanders

  3. #monitoring ❤️

  4. tools + process = confidence

  5. graphite logstash kibana collectd flapjack riemann splunk diamond statsd newrelic pagerduty skyline grafana nagios icinga cacti ganglia really bad tag clouds zenoss

  6. Availability :)

  7. Outages :(

  8. What can we do?

  9. Human error is not random. It is systematically connected to features of people's tools, tasks and operating environment. -- Sidney Dekker

  10. The Trigger

  11. Detection & Notification

  12. avoid alert fatigue

  13. don’t fight sleep

  14. simplify overrides

  15. be persistent

  16. escalate quickly

  17. be loud

  18. create handoff reports

  19. Initial Response

  20. establish command & determine severity

  21. None
  22. None
  23. None
  24. +

  25. None
  26. collectd ~1,200 metrics/host

  27. statsd ~4,000,000 events/sec

  28. and … sFlow, SNMP, HTTP, etc

  29. graphite ~175,000 updates/sec

  30. logging scrolls, splunk, syslog-ng

  31. None
  32. build interfaces that fit your culture

  33. None
  34. None
  35. None
  36. None
  37. None
  38. Corrective Action

  39. collective knowledge & feedback loops

  40. None
  41. None
  42. None
  43. None
  44. None
  45. distribute knowledge

  46. tools make software less terrible

  47. None
  48. None
  49. None
  50. Follow Through

  51. persist the experience & influence your future

  52. None
  53. identify problems, involve many people, propose solutions

  54. None
  55. reduce risk & increase availability

  56. DDoS auto-mitigation, faster alerts

  57. External Probes nugget, thousandeyes

  58. Awareness attack surface monitoring

  59. None
  60. Your tools are complementary to your process, not the other way around

  61. Communication is the cornerstone for effective incident management

  62. Leverage the combination of process and tooling to enable confidence

  63. Never stop iterating on emergency response

  64. Thanks! github.com/jssjr @scott_sanders