Finding signal in the monitoring noise with Flapjack

What is flapjack?

Monitoring alert routing system

Composable

Rollup

Alert routing

• event
• ↪ notify?
• ↪ who?
• ↪ how?

API driven

No restarts required

Developed + used in production at: Bulletproof

Developers: Ali Graham, Jesse Reynolds
Project manager: Lindsay Holmwood

• Technology:
• Ruby
• Redis
• EventMachine*

* Replaced in 2.0 with Ruby threads

Designed for humans

Considerate of: alert fatigue, normalcy bias, confirmation bias

Why flapjack?

Specific use cases

Multi-tenant

Segregated responsibility

Check engine independence

Killer features

Self-checking

event producers → flapjack → oobetet

Rollup (alert summarisation)

Per-media thresholds

• Contact
• has many
• Media
• has one
• Summary Threshold
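
The per-media summary threshold above can be sketched roughly as follows. This is an illustration only, not Flapjack's implementation: the `Medium` class, the `rollup_threshold` field, and the `should_rollup` helper are hypothetical names, and the sketch assumes rollup begins once the number of a contact's alerting checks reaches the threshold set on that medium.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Medium:
    """One way of reaching a contact (hypothetical model, not Flapjack's schema)."""
    kind: str                        # e.g. "email" or "sms"
    address: str
    rollup_threshold: Optional[int]  # None means: never summarise on this medium


def should_rollup(medium: Medium, alerting_checks: int) -> bool:
    """Return True when alerts on this medium should be sent as one summary.

    Assumption: rollup starts once the number of alerting checks reaches
    the medium's threshold.
    """
    if medium.rollup_threshold is None:
        return False
    return alerting_checks >= medium.rollup_threshold


email = Medium("email", "alice@example.com", rollup_threshold=10)
sms = Medium("sms", "+61400000000", rollup_threshold=3)

# With 5 failing checks, SMS is summarised while email still gets individual alerts.
decisions = {m.kind: should_rollup(m, alerting_checks=5) for m in (email, sms)}
```

Keeping the threshold on the medium rather than on the contact lets a noisy channel such as SMS collapse into a summary earlier than email.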

Tagging

How does it work?

Data model

Contact
Media
Notification Rules
Checks
Entities
History (maintenance, acks, state changes)

Architecture

event producers → processor → notifier → gateways

Event producers: Nagios, Icinga, Sensu, Cron

Nagios → nagios-receiver → processor → notifier → gateways

What is an event?

{ "entity": ENTITY, "check": CHECK, "type": EVENT_TYPE, "state": STATE, "time": TIMESTAMP, "summary": SUMMARY }

ENTITY (string): Name of the relevant entity (e.g. FQDN)

CHECK (string): The check name ('service description' in Nagios lingo)

EVENT_TYPE (string): One of 'service' or 'action'

STATE (string): One of 'ok', 'warning', 'critical', 'unknown', or 'acknowledgement'

TIMESTAMP (string): UNIX timestamp of the event's creation

SUMMARY (string): The check output in the case of a service event, otherwise a message created for an acknowledgement, or similar
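
As a rough sketch, an event with the six fields described above could be assembled and serialised like this. The `build_event` helper and the example entity and check names are hypothetical; only the field names and the allowed values for `type` and `state` come from the slides. How the payload is delivered to Flapjack is not covered here.

```python
import json
import time

VALID_TYPES = {"service", "action"}
VALID_STATES = {"ok", "warning", "critical", "unknown", "acknowledgement"}


def build_event(entity, check, state, summary, event_type="service", timestamp=None):
    """Build an event dict with the six fields from the slides (hypothetical helper)."""
    if event_type not in VALID_TYPES:
        raise ValueError(f"unknown event type: {event_type!r}")
    if state not in VALID_STATES:
        raise ValueError(f"unknown state: {state!r}")
    if timestamp is None:
        timestamp = int(time.time())
    return {
        "entity": entity,
        "check": check,
        "type": event_type,
        "state": state,
        "time": str(int(timestamp)),  # the slides list TIMESTAMP as a string
        "summary": summary,
    }


event = build_event("app-01.example.com", "HTTP", "critical",
                    "Connection refused", timestamp=1390000000)
payload = json.dumps(event)
```

Validating `type` and `state` at the producer keeps malformed events out of the pipeline before they ever reach a processor.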

How are alerts routed?

event → filters (find failing events) → notification

map: Find people interested in entity
[ alice, bob, carol ]

map: Find media owned by people
[ [ alice, email ], [ alice, sms ], [ bob, email ], [ bob, sms ], [ carol, sms ] ]

reduce: Delete media based on tags, severity, time of day
[ [ alice, email ], [ alice, sms ], [ bob, sms ] ]

reduce: Delete media based on blackholes
[ [ alice, sms ], [ bob, sms ] ]

reduce: Delete media based on notification intervals
[ [ alice, sms ], [ bob, sms ] ]

→ alert, alert
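
The map and reduce steps above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not Flapjack's routing code: the `route` function, the lookup tables, and the three filter predicates are hypothetical, and the data is chosen only to reproduce the alice/bob/carol walk-through from the slides.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (contact, medium)


def route(entity: str,
          interested: Dict[str, List[str]],
          media: Dict[str, List[str]],
          filters: List[Callable[[Pair], bool]]) -> List[Pair]:
    """Return the (contact, medium) pairs that should receive an alert."""
    # map: find people interested in the entity
    contacts = interested.get(entity, [])
    # map: find media owned by those people
    pairs = [(c, m) for c in contacts for m in media.get(c, [])]
    # reduce: each filter returns True for the pairs that survive
    for keep in filters:
        pairs = [p for p in pairs if keep(p)]
    return pairs


# Hypothetical data chosen to reproduce the walk-through.
interested = {"app-01.example.com": ["alice", "bob", "carol"]}
media = {"alice": ["email", "sms"], "bob": ["email", "sms"], "carol": ["sms"]}

dropped_by_rules = {("bob", "email"), ("carol", "sms")}  # tags, severity, time of day
blackholed = {("alice", "email")}                        # blackhole rules
notified_recently: set = set()                           # notification intervals

filters = [
    lambda p: p not in dropped_by_rules,
    lambda p: p not in blackholed,
    lambda p: p not in notified_recently,
]

alerts = route("app-01.example.com", interested, media, filters)
```

Each reduce step only ever removes pairs, so the order of the filters changes how much work is done but not the final set of alerts.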

Nagios → nagios-receiver → processor → notifier → gateways

Gateways: Email, SMS, Jabber, PagerDuty, Web API

Things that may surprise you

Constant heartbeat

No one-off events

How long has a check been failing?

NOT "How many times has the check failed?"

No HARD/SOFT states

Broadcast delay
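
One way to picture "how long has a check been failing?" combined with a broadcast delay is the sketch below. It is an assumption-laden illustration, not Flapjack's code: the `FailureTracker` class and its method are hypothetical, the 30 second value is the fixed delay quoted later in the talk, and the key format is invented for the example.

```python
from typing import Dict

BROADCAST_DELAY = 30  # seconds; the fixed value quoted later in the talk

FAILING_STATES = {"warning", "critical", "unknown"}


class FailureTracker:
    """Track when each check started failing (hypothetical helper)."""

    def __init__(self, delay: int = BROADCAST_DELAY) -> None:
        self.delay = delay
        self.failing_since: Dict[str, int] = {}

    def should_alert(self, key: str, state: str, now: int) -> bool:
        """Record one heartbeat event; True once the failure has outlasted the delay."""
        if state not in FAILING_STATES:
            self.failing_since.pop(key, None)  # recovery resets the clock
            return False
        self.failing_since.setdefault(key, now)
        return now - self.failing_since[key] >= self.delay


tracker = FailureTracker()
key = "app-01.example.com:HTTP"
results = [
    tracker.should_alert(key, "critical", now=0),   # just started failing
    tracker.should_alert(key, "critical", now=10),  # still inside the delay
    tracker.should_alert(key, "critical", now=30),  # failing for 30 seconds
    tracker.should_alert(key, "ok", now=40),        # recovered
]
```

The tracker stores a start time rather than a failure count, which is why it depends on a constant heartbeat of events and needs no HARD/SOFT distinction. Suppressing repeat alerts after the first one is left to the notification-interval step of routing.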

Alert summarisation (Rollup)

"Nagios as a dumb check executor"

No notifications

No acknowledgements

No downtime

No parenting

Just checking

Shards of Nagios

Scale horizontally

Nagios shared state

Case study

Production use at Bulletproof since December 2012

As of November 2013: 896 entities, 5778 checks*

* some of these are stale

As of January 2014: Processing ~60 events/second

/self_stats.json

Manage (customer portal)

manage-flapjack-sync

manage (source of truth) → manage-flapjack-sync → flapjack
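
A one-way sync from a source of truth can be sketched as a diff between two sets of records. This is not how manage-flapjack-sync is implemented; the `plan_sync` function, the record shape, and the ids are hypothetical and only illustrate the idea that the downstream system is made to match the upstream one.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def plan_sync(source: Dict[str, Record],
              target: Dict[str, Record]) -> Tuple[List[str], List[str], List[str]]:
    """Return (to_create, to_update, to_delete) ids so that target matches source."""
    to_create = sorted(k for k in source if k not in target)
    to_update = sorted(k for k in source if k in target and source[k] != target[k])
    to_delete = sorted(k for k in target if k not in source)
    return to_create, to_update, to_delete


# Hypothetical contact records keyed by id.
manage = {"1": {"name": "Alice"}, "2": {"name": "Bob"}}
flapjack = {"2": {"name": "Robert"}, "3": {"name": "Carol"}}

plan = plan_sync(manage, flapjack)
```

Because the source always wins, any change made directly on the downstream side is overwritten or deleted on the next sync.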

Flapjack, Manage, collectd, Nagios, ???

ENTITY (string): Name of the relevant entity (e.g. FQDN). Bulletproof uses the FQDN.

Shortcomings

Fixed broadcast delay (30s)

Single external source of truth

Contact import/export

Bulletproof-ism

Open Source

• github.com/flpjck/flapjack

Release planning in the open: github.com/flpjck/flapjack/wiki/Releasing

Policy for triaging bugs + features: github.com/flpjck/flapjack/wiki/Releasing

Semantic versioning (2.0.0)

We write tests! Unit + Integration

We write tests! ~80% coverage

We run tests!

• github.com/flpjck
• github.com/flpjck/flapjack
• github.com/flpjck/flapjack-diner
• github.com/flpjck/vagrant-flapjack
• github.com/flpjck/omnibus-flapjack
• github.com/flpjck/packages.flapjack.io

packages.flapjack.io

fat packages (omnibus)

.deb provided .rpm in the works

Quality documentation: github.com/flpjck/flapjack/wiki

Bad documentation? BUG

Bad first experience? BUG

Thank you! Liked the talk? Let @auxesis + @jessereynolds know!
flapjack.io • github.com/flpjck/flapjack

Credits:
http://www.flickr.com/photos/lizadaly/4373330774
http://www.flickr.com/photos/meltwater/420749031
http://www.flickr.com/photos/whatknot/8642836187
http://www.flickr.com/photos/jonmould/5393395335
http://www.flickr.com/photos/thomasforsyth/4313764488
http://www.flickr.com/photos/rubodewig/5161937181
http://www.flickr.com/photos/ronwls/7001551988
http://www.flickr.com/photos/sparktography/83217827
http://www.flickr.com/photos/sdphotography/1570906849
http://tosbourn.com/wp-content/uploads/2013/12/redis-logo.png
http://www.flickr.com/photos/derekskey/9530097369
http://giphy.com/gifs/yeUxljCJjH1rW
http://www.flickr.com/photos/karen_d/8448507872
http://www.flickr.com/photos/buzzhoffman/4127280540
http://i.imgur.com/2UduUZ5.gif
