Slide 1

Slide 1 text

Self Service Alerting for Everyone Dan Rowe Wayfair.com April 22, 2014

Slide 2

Slide 2 text

DRowe's Definitions:

Slide 3

Slide 3 text

Definitions
• Everyone
  1. Any user with a browser (within your organization).
  2. Yes, everyone!
• Self Service Alerting
  1. What it is not:
    • Committing and pushing Nagios configs, and then restarting Nagios.
    • Needing a master's degree in Zabbix to set up a trigger, action, email template, etc.
  2. What it is:
    • The ability for everyone to create a basic alert and get notified by it.
• Key Performance Indicator
  1. A metric that is a proxy of success for
    • a company
    • a particular department
    • a system

Slide 4

Slide 4 text

KPIs
What KPIs should I track?
Don’t ask me, ask SMEs.

Slide 5

Slide 5 text

KPIs
• One person's KPIs are another's noise.
• WebOps = Web Site Performance / Availability
• ServerEng = CPU / Mem / Disk
• NetEng = Bandwidth / packets per sec
• OMS = Order Count
• Communications = Max concurrent calls
• Customer Service = Calls answered in under X seconds
• Finance = Time it takes to close an accounting period
Sometimes you’re lucky. Sometimes you’re not.

Slide 6

Slide 6 text

Alerts
What should I alert on?
Déjà vu? Isn’t this the same (or a similar) exercise as KPIs?
Reminder: Don’t ask me, ask SMEs.

Slide 7

Slide 7 text

Alerts
• Who should alerts be created by?
  • Everyone
• Why?
  • Alerting should be tuned by the people who care about it most.
  • They can adjust it to a threshold they are happy with.
  • They should be able to see how it performs over time so they can improve it.
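As a rough illustration of the kind of basic, owner-tunable alert described above, the sketch below checks the most recent value of a metric against a warning and a critical threshold. The `BasicAlert` class, its field names, the metric name, and the email address are all hypothetical and not taken from any tool mentioned in this talk. The sample data uses `(value, timestamp)` pairs, the shape Graphite's JSON render output uses for datapoints, with `None` standing in for gaps.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed datapoint shape: (value, unix_timestamp); value is None when data is missing.
Datapoint = Tuple[Optional[float], int]


@dataclass
class BasicAlert:
    """Hypothetical self-service alert: one metric, two thresholds, one owner."""
    metric: str
    warn: float
    crit: float
    owner_email: str

    def evaluate(self, datapoints: List[Datapoint]) -> str:
        """Return 'ok', 'warn', 'crit', or 'no_data' based on the latest non-null value."""
        values = [value for value, _ts in datapoints if value is not None]
        if not values:
            return "no_data"
        latest = values[-1]
        if latest >= self.crit:
            return "crit"
        if latest >= self.warn:
            return "warn"
        return "ok"


# The owner picks (and later tunes) the thresholds; these numbers are made up.
alert = BasicAlert("web.checkout.response_ms", warn=500, crit=1000,
                   owner_email="owner@example.com")
sample = [(320.0, 1398170000), (None, 1398170060), (640.0, 1398170120)]
state = alert.evaluate(sample)
print(state)
```

Because the thresholds are plain fields on the alert, the person who owns it can change `warn` and `crit` without touching anything else, which is the tuning loop the slide argues for.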

Slide 8

Slide 8 text

Alerts
This is a DevOps talk: no throwing things over the fence!
We aren’t saying to start sending all alerts straight to the beeper.
The alert creator should also be the first to receive and react to the alert.
This helps with awareness and tuning.

Slide 9

Slide 9 text

Analogy
Email Routing and Filtering as an Analogy
Who sets up Outlook rules? (answer: Everyone)
Who sets up Exchange / Postfix rules? (answer: Fewer people than Everyone)

Slide 10

Slide 10 text

Promotion of Alerts
Great, now we have all these user-created alerts. Now what?
The alert history can be reviewed for signal-to-noise ratio.
After an alert has proven itself useful to a user or team, promote it!
Promoting it might mean migrating the alert to your normal alerting system, or it could be as simple as expanding the subscription list of the current alert.
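One way to review alert history for signal to noise is sketched below. It assumes, hypothetically, that each firing in the history has been marked as actionable or not by whoever received it; the function names, the record shape, and the cutoffs (`min_firings`, `min_ratio`) are illustrative choices, not something prescribed by this talk or by any of the tools it mentions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed history record: (alert_name, was_actionable)
Firing = Tuple[str, bool]


def actionable_ratio(history: List[Firing]) -> Dict[str, float]:
    """Map each alert name to the fraction of its firings marked actionable."""
    fired: Dict[str, int] = defaultdict(int)
    useful: Dict[str, int] = defaultdict(int)
    for name, was_actionable in history:
        fired[name] += 1
        if was_actionable:
            useful[name] += 1
    return {name: useful[name] / fired[name] for name in fired}


def promotion_candidates(history: List[Firing],
                         min_firings: int = 3,
                         min_ratio: float = 0.8) -> List[str]:
    """Return alerts that fired often enough and were mostly signal, sorted by name."""
    counts: Dict[str, int] = defaultdict(int)
    for name, _was_actionable in history:
        counts[name] += 1
    ratios = actionable_ratio(history)
    return sorted(name for name, ratio in ratios.items()
                  if counts[name] >= min_firings and ratio >= min_ratio)


# Made-up history for three user-created alerts.
history = (
    [("order_count_low", True)] * 4
    + [("disk_usage_high", True)] + [("disk_usage_high", False)] * 3
    + [("call_queue_long", True)] * 2
)
candidates = promotion_candidates(history)
print(candidates)
```

In this made-up history only `order_count_low` qualifies: `disk_usage_high` is mostly noise, and `call_queue_long` has not fired often enough to judge.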

Slide 11

Slide 11 text

Rainbows and Unicorns
Great, another talk about rainbows and unicorns.
Actually it’s not: you can and should start doing this now!

Slide 12

Slide 12 text

Metrics
One prerequisite for this is having a lot of good metrics (or at least the ability to easily add them).
We are DevOps! So we're already graphing / logging all the things (right?)

Slide 13

Slide 13 text

Pick a tool
Back in my day… you had to write your own self service alerting tool:
https://github.com/wayfair/Graphite-Tattle

Slide 14

Slide 14 text

Pick a tool
These days there is a list of tools in this category.
Spin the wheel, or try a few and pick the one you like.

Slide 15

Slide 15 text

Pick a tool
Examples are:
https://github.com/scobal/seyren
And
https://github.com/arachnys/cabot
Somewhere in between:
https://github.com/livingsocial/rearview/

Slide 16

Slide 16 text

Done?
Great, now we’re managing alerting like a superstar.
Pro tip: Set up alerts and dashboards for the number of alerts you have and the number of alerts that are being sent out.
Someone has to watch the watchers.
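A minimal sketch of "watching the watchers" is to publish the alerting system's own counts as metrics, so they can be graphed and alerted on like anything else. The function name, the `alerting` prefix, and the metric names below are hypothetical; the output lines follow the plain-text StatsD format (`name:value|g` for a gauge, `name:value|c` for a counter), and actually sending them to a StatsD daemon is left out.

```python
from typing import List


def alerting_meta_metrics(alerts_defined: int,
                          notifications_sent: int,
                          prefix: str = "alerting") -> List[str]:
    """Format the alerting system's own counts as StatsD-style metric lines."""
    return [
        # Gauge: how many alerts currently exist.
        f"{prefix}.alerts_defined:{alerts_defined}|g",
        # Counter: how many notifications went out since the last report.
        f"{prefix}.notifications_sent:{notifications_sent}|c",
    ]


# Made-up numbers for one reporting interval.
lines = alerting_meta_metrics(alerts_defined=42, notifications_sent=7)
for line in lines:
    print(line)
```

Once these two series exist, a threshold alert on `notifications_sent` can flag a sudden flood of notifications, and a dashboard of `alerts_defined` shows how quickly self-service alerts are accumulating.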

Slide 17

Slide 17 text

Next?
Where do we go from here?
• Infrastructure and systems have more than just Graphite data to share.
• Expand beyond Graphite to alert on other systems with Self Service tools.

Slide 18

Slide 18 text

Example New Frontier
Example of a new frontier:
Logstash + Elasticsearch + Kibana = Logging done well.
It's similar to Statsd + Graphite = Metrics done well.
But it has a gap similar to the one the metrics stack had early on.

Slide 19

Slide 19 text

Caution
Rainbows and Unicorns ahead

Slide 20

Slide 20 text

Elasticsearch Alert UI
We needed a Tattle for Elasticsearch.
We are currently starting to develop and test one.

Slide 21

Slide 21 text

Elasticsearch Alert UI
And it’s an easy page that Everyone (in our organization) can use.

Slide 22

Slide 22 text

Yes, we are hiring: http://www.wayfair.com/careers
Tell them DRowe sent you.

Slide 23

Slide 23 text

Image credits in order of appearance:
• http://commons.wikimedia.org/wiki/File:Who_is_responsible_not_me.jpg
• http://commons.wikimedia.org/wiki/File:Venn_diagram_ABC_BW.png
• http://en.wikipedia.org/wiki/File:Mond-vergleich.svg
• https://www.flickr.com/photos/gedankenstuecke/108894568
• http://commons.wikimedia.org/wiki/File:Pager_1.jpg
• http://www.jungleredwriters.com/2011/01/all-good-things-come-in-threes.html
• http://ocw.mit.edu/courses/special-programs/sp-2322-unicorns-and-rainbows-a-seminar-fall-2014/
• http://en.wikipedia.org/wiki/File:WheelUK2001Round1.jpg
• https://www.flickr.com/photos/rileyroxx/151985627/
• http://en.wikipedia.org/wiki/File:Compass_align.jpg
• http://www.ebay.com/bhp/unicorn-sign
• http://ih2.redbubble.net/image.9886230.9887/sticker,375x360.png
• http://en.wikipedia.org/wiki/File:Ap_16_view_of_Earth_during_TLC.jpg