The Lifecycle of an Outage

Scott Sanders github.com/jssjr @scott_sanders

#monitoring ❤️

tools + process = confidence

graphite logstash kibana collectd flapjack riemann splunk diamond statsd newrelic
pagerduty skyline grafana nagios icinga cacti ganglia really bad tag clouds zenoss

Availability :)

Outages :(

What can we do?

Human error is not random. It is systematically connected to
features of people's tools, tasks and operating environment. — Sidney Dekker

The Trigger

Detection & Notification

avoid alert fatigue

don’t fight sleep

simplify overrides

be persistent

escalate quickly

be loud

create handoff reports
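The "be persistent" and "escalate quickly" points above can be sketched as a small escalation policy. This is only an illustration, not the tooling the talk describes: the tier names, the timings, and the `targets` and `should_notify` helpers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str           # who gets paged at this tier (hypothetical names)
    after_minutes: int  # minutes unacknowledged before this tier is added


# Hypothetical policy: short gaps between tiers ("escalate quickly").
POLICY = [
    Tier("primary on-call", 0),
    Tier("secondary on-call", 5),
    Tier("engineering manager", 15),
]

# Hypothetical repeat interval: re-notify until acknowledged ("be persistent").
REPEAT_EVERY_MINUTES = 5


def targets(minutes_unacked, policy=POLICY):
    """Return everyone who should have been paged by now for an unacknowledged alert."""
    return [t.name for t in policy if minutes_unacked >= t.after_minutes]


def should_notify(minutes_unacked, acknowledged):
    """Notify at minute 0 and at every repeat interval until someone acknowledges."""
    if acknowledged:
        return False
    return minutes_unacked % REPEAT_EVERY_MINUTES == 0
```

With this policy, an alert left unacknowledged for 7 minutes has reached the primary and secondary on-call, and the next repeat notification goes out at minute 10.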

Initial Response

establish command & determine severity

collectd ~1,200 metrics/host

statsd ~4,000,000 events/sec

and … sFlow, SNMP, HTTP, etc

graphite ~175,000 updates/sec

logging scrolls, splunk, syslog-ng
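For readers unfamiliar with the statsd and graphite figures above, here is a minimal sketch of what a single statsd event and a single Graphite update look like on the wire. The metric names are hypothetical, the host and port are assumed defaults (UDP 8125 for statsd), and nothing here reflects how the speaker's pipeline is actually configured.

```python
import socket
import time
from typing import Optional


def statsd_counter(name: str, value: int = 1) -> bytes:
    """Format one statsd counter datagram: '<name>:<value>|c'."""
    return f"{name}:{value}|c".encode("ascii")


def graphite_line(path: str, value: float, timestamp: Optional[int] = None) -> bytes:
    """Format one Graphite plaintext-protocol line: '<path> <value> <timestamp>\\n'."""
    ts = int(time.time()) if timestamp is None else timestamp
    return f"{path} {value} {ts}\n".encode("ascii")


def send_statsd(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; host and port are assumptions, not from the talk."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    print(statsd_counter("app.requests"))
    print(graphite_line("servers.web1.load", 0.5, 1400000000))
```

Each statsd event is a single small UDP datagram, so emitting one costs the application very little, and statsd aggregates events before flushing updates to a backend such as Graphite.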

build interfaces that fit your culture

Corrective Action

collective knowledge & feedback loops

distribute knowledge

tools make software less terrible

Follow Through

persist the experience & influence your future

identify problems, involve many people, propose solutions

reduce risk & increase availability

DDoS auto-mitigation, faster alerts

External Probes nugget, thousandeyes

Awareness attack surface monitoring

Your tools are complementary to your process, not the other
way around

Communication is the cornerstone for effective incident management

Leverage the combination of process and tooling to enable confidence

Never stop iterating on emergency response

Thanks! github.com/jssjr @scott_sanders
