Finding signal in the monitoring noise with Flapjack

Working in operations in 2014 is hard.*

More applications are running in the cloud, the infrastructures we manage are getting bigger and bigger, and responsibility for that is being divided up across multiple teams.

Then something breaks. All hell breaks loose. Your on-call engineer receives 900 SMS in 30 seconds. Her phone melts. You can’t distinguish the signal from the noise. It takes an hour to fix the problem.

Weren’t computers meant to solve these problems?

Enter Flapjack: a distributed event processing + monitoring alert routing system. Flapjack sits at the end of your monitoring pipeline and works out who it should send alerts to. Sounds pretty simple? Flapjack tries to make it so.

There are still really hard problems to solve when working out who to notify about a detected failure, and what to do when lots of things fail simultaneously.

You should be interested in Flapjack if:

- You want to track down failures faster by rolling up your alerts across multiple monitoring systems.
- You monitor large infrastructures that have multiple teams responsible for keeping them up.
- You want to dip your toe in the water and try alternative check execution engines like Sensu in parallel to Nagios.

In this talk, Jesse Reynolds and Lindsay Holmwood will take you on a whirlwind tour of Flapjack - what it is, how it solves problems, where it’s going - with a hands-on lab that you can start applying in your organisation tomorrow.

*Disclaimer: this abstract was written in 2013. Things may have since gotten awesome and we’re all sitting on the beach in the Bahamas drinking piña coladas. But this is highly unlikely.

Lindsay Holmwood

January 08, 2014

Transcript

  1. Developed + used in production at: [not captured in transcript]
    Developers: Ali Graham, Jesse Reynolds
    Project manager: Lindsay Holmwood
  2. [Diagram] Contact; Media; Notification Rules; Entities; Checks; History (maintenance, acks, state changes)
  3. { "state": STATE }
    STATE (string): One of 'ok', 'warning', 'critical', 'unknown', or 'acknowledgement'
  4. { "summary": SUMMARY }
    SUMMARY (string): The check output in the case of a service event, otherwise a message created for an acknowledgement, or similar
  5. Event filters: Find failing events. Notification: map (Find people interested in entity).
    [ alice, bob, carol ]
  6. Event filters: Find failing events. Notification: map (Find people interested in entity), map (Find media owned by people).
    [ [ alice, email ], [ alice, sms ], [ bob, email ], [ bob, sms ], [ carol, sms ] ]
  7. Event filters: Find failing events. Notification: map (Find people interested in entity), map (Find media owned by people), reduce (Delete media based on tags, severity, time of day).
    [ [ alice, email ], [ alice, sms ], [ bob, sms ] ]
  8. Event filters: Find failing events. Notification: map (Find people interested in entity), map (Find media owned by people), reduce (Delete media based on tags, severity, time of day), reduce (Delete media based on blackholes).
    [ [ alice, sms ], [ bob, sms ] ]
  9. Event filters: Find failing events. Notification: map (Find people interested in entity), map (Find media owned by people), reduce (Delete media based on tags, severity, time of day), reduce (Delete media based on blackholes), reduce (Delete media based on notification intervals).
    [ [ alice, sms ], [ bob, sms ] ]
  10. Event filters: Find failing events. Notification: map (Find people interested in entity), map (Find media owned by people), reduce (Delete media based on tags, severity, time of day), reduce (Delete media based on blackholes), reduce (Delete media based on notification intervals), alert, alert.
    [ [ alice, sms ], [ bob, sms ] ]
  11. { "entity": ENTITY, "check": CHECK, "type": EVENT_TYPE, "state": STATE, "time": TIMESTAMP, "summary": SUMMARY }
    ENTITY (string): Name of the relevant entity (e.g. FQDN). Bulletproof uses the FQDN.
  12. Thank you! Liked the talk? Let @auxesis + @jessereynolds know!
    flapjack.io • github.com/flpjck/flapjack
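
The event format on slides 3, 4 and 11 and the routing steps on slides 5 to 10 can be sketched in Python. This is a minimal illustration under stated assumptions, not Flapjack's implementation. The function names, the contact structure, the "service" event type value, the set of states treated as failing, the severity-only rule check (tags and time of day are left out), and the 300-second notification interval are all invented for the example.

```python
# Hypothetical sketch of the map/reduce routing shown on the slides.
# Only the event field names and the state values come from the talk.

VALID_STATES = {"ok", "warning", "critical", "unknown", "acknowledgement"}
REQUIRED_FIELDS = ("entity", "check", "type", "state", "time", "summary")
FAILING_STATES = {"warning", "critical", "unknown"}  # assumption


def validate_event(event):
    """Check that the event has the slide-11 fields and a known state."""
    missing = [key for key in REQUIRED_FIELDS if key not in event]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if event["state"] not in VALID_STATES:
        raise ValueError(f"invalid state: {event['state']!r}")
    return event


def find_people(event, contacts):
    """Map: contacts interested in the event's entity."""
    return [c for c in contacts if event["entity"] in c["entities"]]


def find_media(people):
    """Map: expand each contact into (name, medium) pairs."""
    return [(c["name"], medium) for c in people for medium in c["media"]]


def drop_by_severity(pairs, event, contacts):
    """Reduce: keep media whose accepted severities include the event state."""
    accepted = {(c["name"], m): sev
                for c in contacts for m, sev in c["media"].items()}
    return [p for p in pairs if event["state"] in accepted[p]]


def drop_blackholed(pairs, blackholes):
    """Reduce: drop media that are currently blackholed."""
    return [p for p in pairs if p not in blackholes]


def drop_recently_notified(pairs, last_sent, now, interval):
    """Reduce: drop media alerted less than `interval` seconds ago."""
    return [p for p in pairs
            if now - last_sent.get(p, float("-inf")) >= interval]


def route(event, contacts, blackholes, last_sent, interval=300):
    """Return the (name, medium) pairs that should receive an alert."""
    validate_event(event)
    if event["state"] not in FAILING_STATES:  # event filter stage
        return []
    pairs = find_media(find_people(event, contacts))
    pairs = drop_by_severity(pairs, event, contacts)
    pairs = drop_blackholed(pairs, blackholes)
    return drop_recently_notified(pairs, last_sent, event["time"], interval)


# Example data chosen to reproduce the lists shown on slides 5 to 10.
ENTITY = "app-01.example.com"
EVENT = {"entity": ENTITY, "check": "HTTP", "type": "service",
         "state": "critical", "time": 1389139200,
         "summary": "Connection refused"}
CONTACTS = [
    {"name": "alice", "entities": {ENTITY},
     "media": {"email": {"warning", "critical"}, "sms": {"critical"}}},
    {"name": "bob", "entities": {ENTITY},
     "media": {"email": {"warning"}, "sms": {"critical"}}},
    {"name": "carol", "entities": {ENTITY},
     "media": {"sms": {"warning"}}},
]
BLACKHOLES = {("alice", "email")}

if __name__ == "__main__":
    print(route(EVENT, CONTACTS, BLACKHOLES, last_sent={}))
    # prints [('alice', 'sms'), ('bob', 'sms')]
```

Each stage takes a list and returns a list, so further filters can be added to the chain without touching the others. The two pairs left at the end correspond to the two alerts on slide 10.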