The Lifecycle of an Outage

When an incident occurs, we typically have an increased risk of an outage. How we structure our initial response, our decision-making process, and our communication directly affects the impact the incident will have. We need to think critically about our ability to resolve problems quickly and reduce the risk of future incidents.

Speaker Notes -- https://gist.github.com/jssjr/5957e9a5cc3ca4846e9c
Video -- http://vimeo.com/95245539

Scott Sanders

May 06, 2014

Transcript

  1. graphite logstash kibana collectd flapjack riemann splunk diamond statsd newrelic pagerduty skyline grafana nagios icinga cacti ganglia really bad tag clouds zenoss
  2. Human error is not random. It is systematically connected to features of people's tools, tasks and operating environment. -- Sidney Dekker
  3. +