Slide 1

Slide 1 text

Data Driven Monitoring
Daniel Schauenberg
@mrtazz

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Item by TheBackPackShoppe

Slide 5

Slide 5 text

http://www.flickr.com/photos/brianglanz/1095706242

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

How comfortable are you deploying a change right now?

Slide 8

Slide 8 text

“If this is your first day at Etsy, you deploy the site”

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Ganglia
• System level metrics
• Instance per DC/environment
• > 220k RRD files
• Fully configured through Chef role attributes

Slide 13

Slide 13 text

Rainbow Graphs!

Slide 14

Slide 14 text

StatsD
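For readers who have not used StatsD, here is a minimal sketch of how an application might emit metrics to it. It relies on the StatsD line format (`<name>:<value>|<type>`, sent over UDP, port 8125 by default). The metric names, the function names, and the host are illustrative assumptions and do not come from the talk.

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter increment in the StatsD line format: <name>:<value>|c"""
    return f"{name}:{value}|c".encode("ascii")


def format_timer(name: str, millis: int) -> bytes:
    """Encode a timing sample in milliseconds: <name>:<value>|ms"""
    return f"{name}:{millis}|ms".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric as a single UDP datagram (fire and forget).

    The host is an assumption for this sketch; 8125 is the usual StatsD port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A call such as `send_metric(format_counter("deploys.web"))` would count one deploy. Because the transport is UDP, the application does not block or fail when the StatsD daemon is unreachable.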

Slide 15

Slide 15 text

Graphite
• Application level metrics
• 96G RAM, 20 Cores, 7.3T SSD RAID 10
• 525k metrics/minute
• Mirrored Primary/Primary Setup
• Functionally sharded relays

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

nagios

Slide 19

Slide 19 text

<3 nagios

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Nagios
• 2 instances in each DC/environment
• Fully Chef generated configuration
• Service checks and contacts in git
• Notifications via email->SMS gateway
• ~75% ops on-call

Slide 22

Slide 22 text

github.com/lozzd/nagdash

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Much more…
• Syslog-ng
• Logstash
• Logster
• Supergrep
• Eventinator

Slide 25

Slide 25 text

Information Overload
Image by http://jasoncasteel.deviantart.com/

Slide 26

Slide 26 text

Alert Fatigue

Slide 27

Slide 27 text

We have the data
We can make it better
Item by PicksFromThePast

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

nagios-herald

Slide 30

Slide 30 text

nagios-herald

Slide 31

Slide 31 text

nagios-herald

Slide 32

Slide 32 text

Failed Check -> nagios-herald (Formatter; Helpers: Graphite, Ganglia, Logstash) -> Message
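The flow on this slide (a failed check passes through a formatter, which calls helpers for Graphite, Ganglia, and Logstash context before a message goes out) can be sketched roughly as below. This is a hypothetical illustration in Python and is not the nagios-herald API. Every class, function, and placeholder string here is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FailedCheck:
    """The facts a monitoring system knows about a failing check."""
    host: str
    service: str
    output: str


# A helper takes the failed check and returns one line of extra context.
Helper = Callable[[FailedCheck], str]


def graphite_helper(check: FailedCheck) -> str:
    # Placeholder: a real helper would fetch or link an application-level graph.
    return f"Graphite: <graph for {check.service} on {check.host}>"


def ganglia_helper(check: FailedCheck) -> str:
    # Placeholder: a real helper would fetch or link a system-level graph.
    return f"Ganglia: <graph for {check.host}>"


def logstash_helper(check: FailedCheck) -> str:
    # Placeholder: a real helper would run a log search for the host.
    return f"Logstash: <recent log lines for {check.host}>"


def format_message(check: FailedCheck, helpers: List[Helper]) -> str:
    """Build the alert body: the raw check result first, then helper context."""
    lines = [f"{check.service} on {check.host}: {check.output}"]
    lines.extend(helper(check) for helper in helpers)
    return "\n".join(lines)
```

The point of this shape is that the person on call receives the supporting data together with the alert, instead of having to look it up after being paged.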

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

github.com/etsy/nagios-herald

Slide 35

Slide 35 text

opsweekly

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Opsweekly

Slide 38

Slide 38 text

Alert categorization
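One way to picture what alert categorization makes possible is a simple tally of how many alerts in an on-call week needed action. The sketch below is an illustration only. The category labels and function names are hypothetical and are not taken from opsweekly.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize(alerts: Iterable[Tuple[str, str]]) -> Counter:
    """Count alerts per category; each alert is a (service, category) pair."""
    return Counter(category for _service, category in alerts)


def actionable_ratio(summary: Counter, actionable: str = "action taken") -> float:
    """Fraction of alerts tagged with the (hypothetical) actionable category."""
    total = sum(summary.values())
    return summary[actionable] / total if total else 0.0
```

A low ratio across several weeks would suggest that many alerts are noise and are candidates for tuning or removal.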

Slide 39

Slide 39 text

Wearables!
Item by JennysTrinketShoppe

Slide 40

Slide 40 text

Sleep tracking

Slide 41

Slide 41 text

github.com/etsy/opsweekly

Slide 42

Slide 42 text

Summary
• Set of trusted tools for monitoring
• Always experiment
• Always learn
• Always improve
• Use the data, Luke

Slide 43

Slide 43 text

Shout out to @lozzd and @Ryan_Frantz

Slide 44

Slide 44 text

codeascraft.com
etsy.com/codeascraft/talks
etsy.github.com
etsy.com/careers

Slide 45

Slide 45 text

Questions?

Slide 46

Slide 46 text

Data Driven Monitoring
Daniel Schauenberg
@mrtazz