Slide 1
Slide 1 text
The Lifecycle of an Outage
Slide 2
Slide 2 text
Scott Sanders github.com/jssjr @scott_sanders
Slide 3
Slide 3 text
#monitoring ❤️
Slide 4
Slide 4 text
tools + process = confidence
Slide 5
Slide 5 text
graphite logstash kibana collectd flapjack riemann splunk diamond statsd newrelic pagerduty skyline grafana nagios icinga cacti ganglia really bad tag clouds zenoss
Slide 6
Slide 6 text
Availability :)
Slide 7
Slide 7 text
Outages :(
Slide 8
Slide 8 text
What can we do?
Slide 9
Slide 9 text
Human error is not random. It is systematically connected to features of people's tools, tasks and operating environment. — Sidney Dekker
Slide 10
Slide 10 text
The Trigger
Slide 11
Slide 11 text
Detection & Notification
Slide 12
Slide 12 text
avoid alert fatigue
Slide 13
Slide 13 text
don’t fight sleep
Slide 14
Slide 14 text
simplify overrides
Slide 15
Slide 15 text
be persistent
Slide 16
Slide 16 text
escalate quickly
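One way to read "be persistent" and "escalate quickly" together is as a paging policy: keep re-paging until someone acknowledges, and widen the set of people paged as time passes. The sketch below is illustrative only. The class name `EscalationPolicy`, its fields, and the default timings are hypothetical and are not taken from the talk or from any paging product.

```python
from dataclasses import dataclass


@dataclass
class EscalationPolicy:
    """Hypothetical paging policy; names and defaults are assumptions."""

    responders: list[str]          # ordered, primary on-call first
    repeat_every_s: int = 300      # "be persistent": re-page interval
    escalate_after_s: int = 900    # "escalate quickly": time per level

    def who_to_page(self, seconds_unacked: int) -> list[str]:
        """Return everyone who should be paged for an alert that has
        gone unacknowledged for the given number of seconds."""
        level = min(seconds_unacked // self.escalate_after_s,
                    len(self.responders) - 1)
        return self.responders[: level + 1]

    def should_repage(self, seconds_since_last_page: int) -> bool:
        """True when enough time has passed to page again."""
        return seconds_since_last_page >= self.repeat_every_s


policy = EscalationPolicy(["primary", "secondary", "manager"])
```

Earlier responders stay on the page list after escalation, so the alert gets louder over time instead of moving from one person to the next.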
Slide 17
Slide 17 text
be loud
Slide 18
Slide 18 text
create handoff reports
Slide 19
Slide 19 text
Initial Response
Slide 20
Slide 20 text
establish command & determine severity
Slide 21
Slide 21 text
No content
Slide 22
Slide 22 text
No content
Slide 23
Slide 23 text
No content
Slide 24
Slide 24 text
+
Slide 25
Slide 25 text
No content
Slide 26
Slide 26 text
collectd ~1,200 metrics/host
Slide 27
Slide 27 text
statsd ~4,000,000 events/sec
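For readers unfamiliar with statsd, each event is a short text line of the form `bucket:value|type`, optionally followed by `|@sample_rate`, usually sent as a UDP datagram. The sketch below shows that format. The helper names, the bucket names, and the host and port defaults are assumptions for illustration, not details from the talk.

```python
import socket


def format_statsd(bucket: str, value: float, metric_type: str = "c",
                  sample_rate: float = 1.0) -> str:
    """Build one statsd line, e.g. 'deploys:1|c'.

    metric_type is 'c' (counter), 'ms' (timer), or 'g' (gauge).
    """
    line = f"{bucket}:{value}|{metric_type}"
    if sample_rate < 1.0:
        # Tell the server this event was sampled so it can scale counts.
        line += f"|@{sample_rate}"
    return line


def send_statsd(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; host and port are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

Because the transport is UDP, the sender never blocks on the metrics pipeline, which is part of why very high event rates are practical.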
Slide 28
Slide 28 text
and … sFlow, SNMP, HTTP, etc
Slide 29
Slide 29 text
graphite ~175,000 updates/sec
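Each Graphite update is a single data point. In Graphite's plaintext protocol a point is one line, `path value timestamp`, terminated by a newline and typically sent to the carbon listener over TCP. The sketch below illustrates that format. The function names, the metric path, and the host and port defaults are assumptions, not details from the talk.

```python
import socket
import time
from typing import Optional


def format_graphite(path: str, value: float,
                    timestamp: Optional[int] = None) -> str:
    """Build one plaintext-protocol line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_graphite(lines: list[str], host: str = "127.0.0.1",
                  port: int = 2003) -> None:
    """Send a batch of lines over one TCP connection (assumed defaults)."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

Batching many lines into one connection, as `send_graphite` does, keeps per-point overhead low when update volume is high.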
Slide 30
Slide 30 text
logging scrolls, splunk, syslog-ng
Slide 31
Slide 31 text
No content
Slide 32
Slide 32 text
build interfaces that fit your culture
Slide 33
Slide 33 text
No content
Slide 34
Slide 34 text
No content
Slide 35
Slide 35 text
No content
Slide 36
Slide 36 text
No content
Slide 37
Slide 37 text
No content
Slide 38
Slide 38 text
Corrective Action
Slide 39
Slide 39 text
collective knowledge & feedback loops
Slide 40
Slide 40 text
No content
Slide 41
Slide 41 text
No content
Slide 42
Slide 42 text
No content
Slide 43
Slide 43 text
No content
Slide 44
Slide 44 text
No content
Slide 45
Slide 45 text
distribute knowledge
Slide 46
Slide 46 text
tools make software less terrible
Slide 47
Slide 47 text
No content
Slide 48
Slide 48 text
No content
Slide 49
Slide 49 text
No content
Slide 50
Slide 50 text
Follow Through
Slide 51
Slide 51 text
persist the experience & influence your future
Slide 52
Slide 52 text
No content
Slide 53
Slide 53 text
identify problems, involve many people, propose solutions
Slide 54
Slide 54 text
No content
Slide 55
Slide 55 text
reduce risk & increase availability
Slide 56
Slide 56 text
DDoS auto-mitigation, faster alerts
Slide 57
Slide 57 text
External Probes nugget, thousandeyes
Slide 58
Slide 58 text
Awareness attack surface monitoring
Slide 59
Slide 59 text
No content
Slide 60
Slide 60 text
Your tools are complementary to your process, not the other way around
Slide 61
Slide 61 text
Communication is the cornerstone for effective incident management
Slide 62
Slide 62 text
Leverage the combination of process and tooling to enable confidence
Slide 63
Slide 63 text
Never stop iterating on emergency response
Slide 64
Slide 64 text
Thanks! github.com/jssjr @scott_sanders