Slide 1
Slide 1 text
Data Driven Monitoring
Daniel Schauenberg
[email protected]
@mrtazz
Slide 2
Slide 2 text
No content
Slide 3
Slide 3 text
No content
Slide 4
Slide 4 text
Item by TheBackPackShoppe
Slide 5
Slide 5 text
http://www.flickr.com/photos/brianglanz/1095706242
Slide 6
Slide 6 text
No content
Slide 7
Slide 7 text
How comfortable are you deploying a change right now?
Slide 8
Slide 8 text
“If this is your first day at Etsy, you deploy the site”
Slide 9
Slide 9 text
No content
Slide 10
Slide 10 text
No content
Slide 11
Slide 11 text
No content
Slide 12
Slide 12 text
Ganglia
• System level metrics
• Instance per DC/environment
• > 220k RRD files
• Fully configured through Chef role attributes
Slide 13
Slide 13 text
Rainbow Graphs!
Slide 14
Slide 14 text
StatsD
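StatsD accepts metrics as small plain-text UDP datagrams of the form `name:value|type`. A minimal sketch of emitting a counter from application code, assuming a StatsD daemon listening on the conventional UDP port 8125; the helper names and the metric name in the note below are hypothetical and not taken from the talk:

```python
import socket


def format_metric(name: str, value: int, metric_type: str = "c") -> bytes:
    """Encode one metric in the StatsD line format: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one datagram and return; UDP does not wait for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Calling `send_metric(format_metric("deploys.web", 1))` would increment a counter; passing `"ms"` as the type would record a timer instead.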
Slide 15
Slide 15 text
Graphite
• Application level metrics
• 96G RAM, 20 Cores, 7.3T SSD RAID 10
• 525k metrics/minute
• Mirrored Primary/Primary Setup
• Functionally sharded relays
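To put the ingest figure in perspective, and to show what a single datapoint looks like on the wire, here is a small sketch. It assumes Graphite's plaintext protocol (`<path> <value> <timestamp>`, one datapoint per line); the metric path and value are hypothetical:

```python
import time
from typing import Optional


def graphite_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Render one datapoint as '<path> <value> <timestamp>' plus a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def per_second(metrics_per_minute: int) -> float:
    """Convert a per-minute ingest rate to a per-second rate."""
    return metrics_per_minute / 60


# 525k metrics/minute (the figure from the slide) is 8750 datapoints per second.
print(per_second(525_000))
# Hypothetical metric path and value, with a fixed timestamp for reproducibility.
print(graphite_line("app.web.logins", 42, 1400000000), end="")
```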
Slide 16
Slide 16 text
No content
Slide 17
Slide 17 text
No content
Slide 18
Slide 18 text
nagios
Slide 19
Slide 19 text
<3 nagios
Slide 20
Slide 20 text
No content
Slide 21
Slide 21 text
Nagios
• 2 instances in each DC/environment
• Fully Chef generated configuration
• Service checks and contacts in git
• Notifications via email->SMS gateway
• ~75% ops on-call
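Generated configuration means the service checks live as data and the Nagios object definitions are rendered from it. The talk uses Chef for this; the sketch below only illustrates the idea in Python, and the host, check, and contact group names are hypothetical:

```python
def render_service(attrs: dict) -> str:
    """Render a dict as a Nagios 'define service' object definition."""
    width = max(len(key) for key in attrs)
    lines = ["define service {"]
    for key, value in attrs.items():
        # Pad directive names so the values line up in one column.
        lines.append(f"    {key.ljust(width)}  {value}")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical check; in a Chef-driven setup these values would come from attributes.
check = {
    "use": "generic-service",
    "host_name": "web01",
    "service_description": "HTTP",
    "check_command": "check_http",
    "contact_groups": "ops-oncall",
}
print(render_service(check))
```

Keeping the input as plain data is what makes it practical to store checks and contacts in git and review changes to them like any other code.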
Slide 22
Slide 22 text
github.com/lozzd/nagdash
Slide 23
Slide 23 text
No content
Slide 24
Slide 24 text
Much more…
• Syslog-ng
• Logstash
• Logster
• Supergrep
• Eventinator
Slide 25
Slide 25 text
Information Overload
Image by http://jasoncasteel.deviantart.com/
Slide 26
Slide 26 text
Alert Fatigue
Slide 27
Slide 27 text
We have the data
We can make it better
Item by PicksFromThePast
Slide 28
Slide 28 text
No content
Slide 29
Slide 29 text
nagios-herald
Slide 30
Slide 30 text
nagios-herald
Slide 31
Slide 31 text
nagios-herald
Slide 32
Slide 32 text
Failed Check -> nagios-herald (Formatter + Helpers: Graphite, Ganglia, Logstash) -> Message
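The flow on this slide (a failed check goes in, a formatter enriches it through helpers, a message comes out) can be sketched as follows. nagios-herald itself is written in Ruby; this Python sketch does not reproduce its API, and every name and value in it is hypothetical:

```python
from typing import Callable, Dict, List

# A helper takes the failed check and returns one line of context.
Helper = Callable[[Dict[str, str]], str]


def graphite_helper(check: Dict[str, str]) -> str:
    # Stand-in: a real helper would query Graphite for a graph of the metric.
    return f"Graphite: graph for {check['host']}.{check['service']}"


def logstash_helper(check: Dict[str, str]) -> str:
    # Stand-in: a real helper would search recent log lines for the host.
    return f"Logstash: recent log lines for {check['host']}"


def format_message(check: Dict[str, str], helpers: List[Helper]) -> str:
    """Formatter step: start from the raw alert, then append helper context."""
    lines = [f"{check['state']}: {check['service']} on {check['host']}", check["output"]]
    lines.extend(helper(check) for helper in helpers)
    return "\n".join(lines)


failed_check = {
    "host": "web01",
    "service": "disk_space",
    "state": "CRITICAL",
    "output": "DISK CRITICAL - free space: / 512 MB (2%)",
}
print(format_message(failed_check, [graphite_helper, logstash_helper]))
```

The point of the design is that the on-call engineer receives the context with the alert instead of having to look it up after being paged.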
Slide 33
Slide 33 text
No content
Slide 34
Slide 34 text
github.com/etsy/nagios-herald
Slide 35
Slide 35 text
opsweekly
Slide 36
Slide 36 text
No content
Slide 37
Slide 37 text
Opsweekly
Slide 38
Slide 38 text
Alert categorization
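Once every alert carries a category, finding the noisy checks becomes a counting problem. A minimal sketch, assuming each on-call entry is an (alert, category) pair; the category labels and alert names are hypothetical and not taken from Opsweekly:

```python
from collections import Counter

# Hypothetical on-call log: (alert name, category chosen by the on-call engineer).
alerts = [
    ("disk_space web01", "action taken"),
    ("load web02", "no action needed"),
    ("load web02", "no action needed"),
    ("http db01", "action taken"),
    ("load web02", "no action needed"),
]


def summarize(entries):
    """Count alerts per category and find the noisiest non-actionable alert."""
    by_category = Counter(category for _, category in entries)
    noisy = Counter(name for name, category in entries if category == "no action needed")
    return by_category, noisy.most_common(1)


by_category, noisiest = summarize(alerts)
print(by_category)
print(noisiest)
```

An alert that fires often and never needs action is a candidate for tuning or removal, which is one way to use the data against alert fatigue.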
Slide 39
Slide 39 text
Wearables!
Item by JennysTrinketShoppe
Slide 40
Slide 40 text
Sleep tracking
Slide 41
Slide 41 text
github.com/etsy/opsweekly
Slide 42
Slide 42 text
Summary
• Set of trusted tools for monitoring
• Always experiment
• Always learn
• Always improve
• Use the data, Luke
Slide 43
Slide 43 text
Shout out to @lozzd and @Ryan_Frantz
Slide 44
Slide 44 text
codeascraft.com
etsy.com/codeascraft/talks
etsy.github.com
etsy.com/careers
Slide 45
Slide 45 text
Questions?