flapjack - monitoring notification system

This talk was given at the Open Source Monitoring Conference (OSMC 2013) in Nuremberg.

To quote the flapjack website (http://flapjack.io/): “Flapjack is a highly scalable and distributed monitoring notification system. It sits on top of existing monitoring engines like Nagios or Sensu, and does event processing & notifications.”

Birger Schmidt

October 23, 2013

Transcript

  1. ✤ since mid 2012 R&D engineer at Bulletproof, Sydney, Australia

    ✤ 2008 - 2012 consultant / trainer at NETWAYS GmbH ✤ Dipl. Inf. Uni Rostock and HU zu Berlin ✤ since the beginning of the 90s working in ✤ IT infrastructure, operations and development, today called DevOps Birger Schmidt
  2. ✤ Bulletproof v1.0 ✤ Founded 2000 as a Managed Service

    Provider, providing managed networking services ✤ Bulletproof v2.0 ✤ First VMware based public Cloud service in Australia in 2006 ✤ Bulletproof v3.0 ✤ First to launch Managed AWS in Australia in 2012, and most successful provider currently Bulletproof
  3. agenda ✤ 1. who (already done) and what ✤ 2.

    motivation ✤ 3. history ✤ 4. surrounding facts ✤ 5. background ✤ 6. architecture, documentation ✤ 7. demo
  4. 23.10.2013 “Flapjack is a highly scalable and distributed monitoring notification

    system. It sits on top of existing monitoring engines like Nagios or Sensu, and does event processing & notifications.” http://flapjack.io/
  5. ✤ we want to monitor things in an insane way

    ✤ like mad ✤ existing systems are not fully designed for that ✤ there is demand for that Motivation I
  6. Movember is one of our biggest customers; they want

    the numbers and not only the bold ones that I’ll show you now
  7. ✤ we do not want to be bothered with ✤

    too many alerts ✤ configuration of dependencies like ✤ parent / child ✤ host or service ✤ we cannot afford to restart monitoring (on notification rule changes) Motivation II
  8. ✤ Lindsay Holmwood and Matt Moor wanted to build the

    “next generation monitoring system” ✤ simple to set up & operate ✤ with obvious paths to scale History 2008
  9. ✤ Lindsay Holmwood started hacking on Flapjack ✤ a working

    prototype that runs basic monitoring checks was ready by mid 2009 ✤ idea was simple: decouple the check execution from the alerting and notification, and use message queues to distribute the check execution across lots of machines ✤ but... History 2009
  10. ✤ Flapjack was considered a dead project ** WARNING: Flapjack

     is no longer under active development. ** ** Feel free to fork it, but I don't have the bandwidth to work on this project any longer. ** ✤ but there were plenty of other interesting projects like Sensu with similar goals making excellent progress ✤ later on, Mod-Gearman and others moved in the same direction ✤ no secret, other people were doing awesome work in the monitoring space History 2010 and 2011
  11. ✤ Flapjack was rebooted with a significantly altered focus ✤

    check execution is no longer part of it ✤ it is now focused on ✤ Event processing ✤ Correlation & rollup ✤ API driven configuration History 2012
  12. ✤ we have used Flapjack in production since January 2013 ✤

    and we still heavily develop it ✤ the rollup just got in but is not entirely finished yet Present / current state 2013
  13. ✤ fully open source ✤ MIT license ✤ development sponsored

    by Bulletproof ✤ up to 3 full time engineers working on it ✤ the team... surrounding facts I
  14. ✤ based on redis, eventmachine, and written in ruby (which

    could change later for some components) surrounding facts II EventMachine*
  15. ✤ notify a dynamic group of people in different ways

    ✤ Bulletproof has thousands of people to be notified ✤ each of those individuals can have different notification settings based on time of day or week, the type of service affected, or the severity of the failure ✤ on-call staff are just one group among those people Background / Problem I
  16. ✤ alert, but make sure nobody gets bombarded during outages

    ✤ do the above in an API driven way, and it must work in a multitenant environment with strong segregation between customers, and integrate with an existing monitoring & customer self-service stack Background / Problem II
  17. ✤ in the end human beings are always the receiver

    of alerts ✤ Lindsay had given some great talks about the psychological and human mechanisms of processing alerts ✤ hindsight bias / confirmation bias, or why we don’t accept being wrong (and how that causes the inability to see mistakes) ✤ rate of alerts that humans can (or cannot) handle Background / Humans
  18. ✤ Lindsay Holmwood talked about Mountainwest RubyConf 04/2013 Escalating complexity:

    DevOps learnings from Air France 447 devops downunder 07/2013 Monitorama EU 09/2013 Background / Humans
  19. ✤ event producer (fj nagios receiver) ✤ flapjack executive (fj

    executive) ✤ flapjack notifier (as of now part of the fj executive) ✤ gateways (implemented as Pikelets, say flapjack components which can be run within the same ruby process, or as separate processes) ✤ unidirectional: Email, SMS ✤ bidirectional: XMPP, PagerDuty ✤ diagram... architecture
  20. Nagios Nagios nagios-receiver Redis nagios-receiver events state lookup alerts processor

    notifier processor notifier PagerDuty SMS Email Executive Gateways
  21. Nagios Nagios nagios-receiver Redis nagios-receiver events state lookup alerts processor

    notifier processor notifier PagerDuty SMS Email Executive Gateways
  22. event processing Find people interested in entity map [ alice

    bob, carol ] notification event filters Find failing events
  23. event processing Find people interested in entity map map Find

    media owned by people [ [ alice, email ], [ alice, sms ], [ bob, email ], [ bob, sms ], [ carol, sms ], ] notification event filters Find failing events
  24. event processing Find people interested in entity map map reduce

    Find media owned by people Delete media based on tags, severity, time of day [ [ alice, email ], [ alice, sms ], [ bob, sms ], ] notification event filters Find failing events
  25. event processing Find people interested in entity map map reduce

    reduce Find media owned by people Delete media based on tags, severity, time of day Delete media based on blackholes [ [ alice, sms ], [ bob, sms ], ] notification event filters Find failing events
  26. event processing Find people interested in entity map map reduce

    reduce reduce Find media owned by people Delete media based on tags, severity, time of day Delete media based on blackholes Delete media based on notification intervals notification event filters Find failing events [ [ alice, sms ], [ bob, sms ], ]
  27. event processing Find people interested in entity map map reduce

    reduce reduce Find media owned by people Delete media based on tags, severity, time of day Delete media based on blackholes Delete media based on notification intervals notification event filters Find failing events alert alert [ [ alice, sms ], [ bob, sms ], ]
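The map / reduce steps on slides 22 to 27 can be sketched roughly as follows. This is only an illustration of the idea, not Flapjack's code (Flapjack itself is written in Ruby): the function names, the contact data layout, the entity name `web-01` and the three rejection sets are all assumptions made up for this example.

```python
# Hypothetical sketch of the notification pipeline from the slides.
# Nothing here is Flapjack's real API; all names are illustrative.

def candidate_media(contacts, entity):
    """Map steps: find people interested in the entity, then the media they own."""
    return [
        (name, medium)
        for name, info in contacts.items()
        if entity in info["entities"]
        for medium in info["media"]
    ]


def apply_filters(pairs, *filters):
    """Reduce steps: each filter deletes the (person, medium) pairs it rejects."""
    for keep in filters:
        pairs = [pair for pair in pairs if keep(pair)]
    return pairs


# Assumed example data, chosen to reproduce the lists shown on the slides.
contacts = {
    "alice": {"entities": {"web-01"}, "media": ["email", "sms"]},
    "bob": {"entities": {"web-01"}, "media": ["email", "sms"]},
    "carol": {"entities": {"web-01"}, "media": ["sms"]},
    "dave": {"entities": {"db-01"}, "media": ["email"]},
}
rejected_by_rules = {("bob", "email"), ("carol", "sms")}  # tags, severity, time of day
blackholed = {("alice", "email")}                         # blackhole rules
recently_notified = set()                                 # notification intervals

pairs = candidate_media(contacts, "web-01")
alerts = apply_filters(
    pairs,
    lambda pair: pair not in rejected_by_rules,
    lambda pair: pair not in blackholed,
    lambda pair: pair not in recently_notified,
)
print(alerts)  # [('alice', 'sms'), ('bob', 'sms')]
```

Modelling every reduce step as a plain predicate keeps the steps independent of each other, so a further filter can be appended without touching the existing ones.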
  28. ✤ filter on status (warning, critical, recovery) ✤ (new entities

    won't trigger alerts) ✤ maintenance ✤ ack ✤ delay (think of soft states in nagios) Event filters
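As a rough sketch, the event filters on slide 28 can be read as a chain of checks that an event has to pass before any notification is considered. The `CheckState` fields, the 30 second default delay and the exact handling of acknowledgements and recoveries are assumptions for illustration only; the slide does not specify them.

```python
from dataclasses import dataclass

# Hypothetical model of the per-check state; not Flapjack's data model.
@dataclass
class CheckState:
    seen_before: bool = True        # False for a brand new entity
    in_maintenance: bool = False    # scheduled or unscheduled maintenance
    acknowledged: bool = False      # someone has acked the current failure
    failing_for_seconds: int = 0    # how long the check has been failing

ALERTING_STATUSES = {"warning", "critical", "recovery"}


def should_notify(status, state, delay_seconds=30):
    """Return True if the event passes every filter (assumed semantics)."""
    if status not in ALERTING_STATUSES:
        return False                # filter on status
    if not state.seen_before:
        return False                # new entities won't trigger alerts
    if state.in_maintenance:
        return False                # maintenance suppresses alerts
    if status != "recovery":
        if state.acknowledged:
            return False            # acked failures stay quiet
        if state.failing_for_seconds < delay_seconds:
            return False            # delay, similar to Nagios soft states
    return True


print(should_notify("critical", CheckState(failing_for_seconds=60)))  # True
print(should_notify("critical", CheckState(failing_for_seconds=5)))   # False
```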
  29. ✤ are event driven ✤ flapjack needs the flow of

    incoming check results / events ✤ are time based ✤ flapjack alerts or holds back alerts according to time constraints ✤ and tags... Notifications
  30. ✤ right now you can use tags in notification rules

    to match against failing checks ✤ flapjack creates auto-tags from all the words in the entity name, check name, and the summary output from the monitoring check ✤ you can add custom tags to entities ✤ tag a host with 'top-priority' and set your notification rules to alert via sms, email, pagerduty ✤ or tag with 'nobody-cares-except-fred' and then everyone but fred could set a notification rule to match this tag and blackhole all matching notifications Tags
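The tagging idea on slide 30 can be sketched as below. How Flapjack actually splits names into words is not stated on the slide, so the lowercase alphanumeric tokenisation here is an assumption, as are the function names, the sample host `app-01.example.com` and the rule that a notification rule matches when all of its tags are present on the event.

```python
import re

# Hypothetical helpers; they illustrate the slide, not Flapjack's implementation.

def auto_tags(entity, check, summary):
    """Derive tags from all the words in entity name, check name and summary."""
    text = f"{entity} {check} {summary}".lower()
    return set(re.findall(r"[a-z0-9]+", text))


def rule_matches(rule_tags, event_tags):
    """Assumed semantics: a rule matches if every one of its tags is on the event."""
    return set(rule_tags) <= set(event_tags)


# Auto-tags plus one custom tag added to the entity.
tags = auto_tags("app-01.example.com", "HTTP", "CRITICAL - connection refused")
tags |= {"top-priority"}

print(sorted(tags))
print(rule_matches({"top-priority"}, tags))              # True
print(rule_matches({"nobody-cares-except-fred"}, tags))  # False
```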
  31. ✤ improve rollup ✤ we try to make you feel

    welcome as much as we can ✤ documentation ✤ vagrant-flapjack - go test for yourself ✤ omd? future