Slide 1

Slide 1 text

23.10.2013 Flapjack monitoring notification system

Slide 2

Slide 2 text

Who is that guy? and what company is he working for?

Slide 3

Slide 3 text

Birger Schmidt since mid 2012 R&D engineer at

Slide 4

Slide 4 text

✤ since mid 2012 R&D engineer at Bulletproof, Sydney, Australia ✤ 2008 - 2012 consultant / trainer at NETWAYS GmbH ✤ Dipl. Inf. Uni Rostock and HU zu Berlin ✤ since the beginning of the 90s working in ✤ IT infrastructure, operations and development, today called DevOps Birger Schmidt

Slide 5

Slide 5 text

✤ Bulletproof v1.0 ✤ Founded 2000 as a Managed Service Provider, providing managed networking services ✤ Bulletproof v2.0 ✤ First VMware based public Cloud service in Australia in 2006 ✤ Bulletproof v3.0 ✤ First to launch Managed AWS in Australia in 2012, and currently the most successful provider Bulletproof

Slide 6

Slide 6 text

Bulletproof Customers

Slide 7

Slide 7 text

agenda

Slide 8

Slide 8 text

agenda ✤ 1. who (already done) and what ✤ 2. motivation ✤ 3. history ✤ 4. surrounding facts ✤ 5. background ✤ 6. architecture, documentation ✤ 7. demo

Slide 9

Slide 9 text

“Flapjack is a highly scalable and distributed monitoring notification system. It sits on top of existing monitoring engines like Nagios or Sensu, and does event processing & notifications.” http://flapjack.io/

Slide 10

Slide 10 text

motivation

Slide 11

Slide 11 text

✤ we want to monitor things in an insane way ✤ like mad ✤ existing systems are not fully designed for that ✤ there is demand for that Motivation I

Slide 12

Slide 12 text

Movember is one of our biggest customers. They want the numbers, and not only the bold ones that I’ll show you now

Slide 13

Slide 13 text

Movember is a charity organization that raises money to fund research for men's health

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

✤ we do not want to be bothered with ✤ too many alerts ✤ configuration of dependencies like ✤ parent / child ✤ host or service ✤ we cannot afford to restart monitoring (on notification rule changes) Motivation II

Slide 17

Slide 17 text

history

Slide 18

Slide 18 text

✤ Lindsay Holmwood and Matt Moor wanted to build the “next generation monitoring system” ✤ simple to set up & operate ✤ with obvious paths to scale History 2008

Slide 19

Slide 19 text

✤ Lindsay Holmwood started hacking on Flapjack ✤ a working prototype that runs basic monitoring checks was ready by mid 2009 ✤ idea was simple: decouple the check execution from the alerting and notification, and use message queues to distribute the check execution across lots of machines ✤ but... History 2009

Slide 20

Slide 20 text

✤ Flapjack was considered a dead project ** WARNING: Flapjack is no longer under active development. ** ** Feel free to fork it, but I don't have the bandwidth to work on this project any longer. ** ✤ but there were plenty of other interesting projects like Sensu with similar goals making excellent progress ✤ later on, Mod-Gearman and others moved in the same direction ✤ it is no secret that other people were doing awesome work in the monitoring space History 2010 and 2011

Slide 21

Slide 21 text

✤ Flapjack was rebooted with a significantly altered focus ✤ check execution is no longer part of it ✤ it is now focused on ✤ Event processing ✤ Correlation & rollup ✤ API driven configuration History 2012

Slide 22

Slide 22 text

✤ we have been using Flapjack in production since January 2013 ✤ and we are still developing it heavily ✤ the rollup just got in but is not entirely finished yet Present / current state 2013

Slide 23

Slide 23 text

surrounding flapjack facts

Slide 24

Slide 24 text

✤ fully open source ✤ MIT license ✤ development sponsored by Bulletproof ✤ up to 3 full time engineers working on it ✤ the team... surrounding facts I

Slide 25

Slide 25 text

✤ Lindsay Holmwood ✤ Jesse Reynolds ✤ Ali Graham ✤ me, you? distributed team

Slide 26

Slide 26 text

✤ based on Redis and EventMachine, and written in Ruby (which could change later for some components) surrounding facts II

Slide 27

Slide 27 text

background

Slide 28

Slide 28 text

✤ notify a dynamic group of people in different ways ✤ Bulletproof has thousands of people to be notified ✤ each of those individuals can have different notification settings based on time of day or week, the type of service affected, or the severity of the failure ✤ on-call staff are just one kind of those people Background / Problem I

Slide 29

Slide 29 text

✤ alert, but make sure nobody gets bombarded during outages ✤ do the above in an API driven way, and it must work in a multitenant environment with strong segregation between customers, and integrate with an existing monitoring & customer self-service stack Background / Problem II

Slide 30

Slide 30 text

✤ in the end human beings are always the receivers of alerts ✤ Lindsay has given some great talks about the psychological and human mechanisms of processing alerts ✤ hindsight bias / confirmation bias, or why we don’t accept being wrong (and how that causes the inability to see mistakes) ✤ rate of alerts that humans can (or cannot) handle Background / Humans

Slide 31

Slide 31 text

✤ Lindsay Holmwood talked about Mountainwest RubyConf 04/2013 Escalating complexity: DevOps learnings from Air France 447 devops downunder 07/2013 Monitorama EU 09/2013 Background / Humans

Slide 32

Slide 32 text

architecture

Slide 33

Slide 33 text

✤ event producer (fj nagios receiver) ✤ flapjack executive (fj executive) ✤ flapjack notifier (as of now part of the fj executive) ✤ gateways (implemented as Pikelets, i.e. flapjack components which can be run within the same ruby process, or as separate processes) ✤ unidirectional: Email, SMS ✤ bidirectional: XMPP, PagerDuty ✤ diagram... architecture
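
A rough sketch of the component flow listed above, assuming an in-memory queue as a stand-in for the Redis event queue. Every name in it (`receive_from_nagios`, `EmailGateway`, `run_executive`, `contacts_for`) is hypothetical and chosen for illustration; none of them is Flapjack's actual API, which is written in Ruby.

```python
from collections import deque

# Stand-in for the Redis-backed event queue (assumption: a simple FIFO).
event_queue = deque()


def receive_from_nagios(entity, check, state, summary):
    """Event producer: push one check result onto the queue."""
    event_queue.append(
        {"entity": entity, "check": check, "state": state, "summary": summary}
    )


class EmailGateway:
    """Unidirectional gateway: it only sends, it never receives."""

    def __init__(self):
        self.sent = []

    def deliver(self, contact, event):
        subject = f"{event['state'].upper()}: {event['entity']}:{event['check']}"
        self.sent.append((contact, subject))


def run_executive(gateways, contacts_for):
    """Executive: pop events, decide who to notify, hand off to gateways."""
    while event_queue:
        event = event_queue.popleft()
        if event["state"] == "ok":
            continue  # this simplified sketch only alerts on failures
        for contact, medium in contacts_for(event):
            gateways[medium].deliver(contact, event)


email = EmailGateway()
receive_from_nagios("web01", "HTTP", "critical", "connection refused")
receive_from_nagios("web01", "PING", "ok", "rtt 1ms")
run_executive({"email": email}, lambda event: [("alice", "email")])
print(email.sent)
```

The point of the split is that producers, the executive and the gateways only share the queue, so each part could run in its own process.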

Slide 34

Slide 34 text

[Diagram: Nagios → nagios-receiver → Redis (events, state lookup, alerts) → Executive (processor, notifier) → Gateways (PagerDuty, SMS, Email)]

Slide 36

Slide 36 text

✤ events are processed as pictured in the following diagram event processing

Slide 37

Slide 37 text

event processing

Slide 38

Slide 38 text

event processing event

Slide 39

Slide 39 text

event processing event filters Find failing events

Slide 40

Slide 40 text

event processing notification event filters Find failing events

Slide 41

Slide 41 text

event processing Find people interested in entity map [ alice, bob, carol ] notification event filters Find failing events

Slide 42

Slide 42 text

event processing Find people interested in entity map map Find media owned by people [ [ alice, email ], [ alice, sms ], [ bob, email ], [ bob, sms ], [ carol, sms ], ] notification event filters Find failing events

Slide 43

Slide 43 text

event processing Find people interested in entity map map reduce Find media owned by people Delete media based on tags, severity, time of day [ [ alice, email ], [ alice, sms ], [ bob, sms ], ] notification event filters Find failing events

Slide 44

Slide 44 text

event processing Find people interested in entity map map reduce reduce Find media owned by people Delete media based on tags, severity, time of day Delete media based on blackholes [ [ alice, sms ], [ bob, sms ], ] notification event filters Find failing events

Slide 45

Slide 45 text

event processing Find people interested in entity map map reduce reduce reduce Find media owned by people Delete media based on tags, severity, time of day Delete media based on blackholes Delete media based on notification intervals notification event filters Find failing events [ [ alice, sms ], [ bob, sms ], ]

Slide 46

Slide 46 text

event processing Find people interested in entity map map reduce reduce reduce Find media owned by people Delete media based on tags, severity, time of day Delete media based on blackholes Delete media based on notification intervals notification event filters Find failing events alert alert [ [ alice, sms ], [ bob, sms ], ]
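
The build-up on the preceding slides can be sketched roughly as follows. This is an illustration under assumptions, not Flapjack's implementation: the lookup tables, function names and the idea of passing each reduce step a set of pairs to drop are all hypothetical. The "reduce" steps on the slides behave like filters here.

```python
# Hypothetical lookup tables mirroring the example data on the slides.
INTERESTED = {"web01": ["alice", "bob", "carol"]}
MEDIA = {"alice": ["email", "sms"], "bob": ["email", "sms"], "carol": ["sms"]}


def find_people(entity):
    """map: entity -> people interested in it."""
    return INTERESTED.get(entity, [])


def find_media(people):
    """map: people -> (person, medium) pairs."""
    return [(person, medium) for person in people for medium in MEDIA.get(person, [])]


def notification_targets(entity, rule_excluded, blackholed, recently_notified):
    """Run both map steps, then the three reduce steps in slide order."""
    pairs = find_media(find_people(entity))
    # reduce x3: tags/severity/time of day, blackholes, notification intervals
    for excluded in (rule_excluded, blackholed, recently_notified):
        pairs = [pair for pair in pairs if pair not in excluded]
    return pairs


targets = notification_targets(
    "web01",
    rule_excluded={("bob", "email"), ("carol", "sms")},
    blackholed={("alice", "email")},
    recently_notified=set(),
)
print(targets)
```

With the sets chosen here, the intermediate lists match the ones shown on the slides, and the remaining pairs are the two alerts that get sent.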

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

event filters

Slide 49

Slide 49 text

✤ filter on status (warning, critical, recovery) ✤ (new entities won't trigger alerts) ✤ maintenance ✤ ack ✤ delay (think of soft states in nagios) Event filters
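
A minimal sketch of how such a filter chain could look, assuming each check has a stored previous state. The field names (`status`, `in_maintenance`, `acknowledged`, `failing_since`), the 30 second delay and the filter order are assumptions for illustration, not Flapjack's actual data model.

```python
FAILURE_DELAY = 30  # seconds a check must keep failing before alerting (hypothetical)


def first_blocking_filter(event, previous):
    """Return the name of the first filter that blocks the event, or None."""
    if previous is None:
        return "new_entity"  # new entities won't trigger alerts
    failing = event["status"] != "ok"
    if not failing and previous["status"] == "ok":
        return "status"  # still ok, nothing to report
    if previous.get("in_maintenance"):
        return "maintenance"
    if failing and previous.get("acknowledged"):
        return "ack"
    failing_for = event["time"] - previous.get("failing_since", event["time"])
    if failing and failing_for < FAILURE_DELAY:
        return "delay"  # comparable to a soft state in Nagios
    return None


print(first_blocking_filter({"status": "critical", "time": 100}, None))
print(first_blocking_filter(
    {"status": "critical", "time": 100},
    {"status": "critical", "failing_since": 95},
))
print(first_blocking_filter(
    {"status": "critical", "time": 100},
    {"status": "critical", "failing_since": 10},
))
```

An event that passes every filter (the function returns `None`) would go on to notification processing.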

Slide 50

Slide 50 text

notifications

Slide 51

Slide 51 text

✤ are event driven ✤ flapjack needs the flow of incoming check results / events ✤ are time based ✤ flapjack sends or holds back alerts according to time constraints ✤ and tags... Notifications
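
One way to picture the time-based part is a per-contact notification interval: the first failure alerts immediately, and repeats are held back until the interval has elapsed. The function name and the 15 minute interval below are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def should_notify(now, last_notified, interval):
    """Alert on the first event, then hold back until the interval has passed."""
    return last_notified is None or now - last_notified >= interval


interval = timedelta(minutes=15)  # hypothetical setting
first = datetime(2013, 10, 23, 9, 0)

print(should_notify(first, None, interval))                           # first alert
print(should_notify(first + timedelta(minutes=5), first, interval))   # held back
print(should_notify(first + timedelta(minutes=15), first, interval))  # sent again
```

Because the decision is made whenever an event arrives, this only works while check results keep flowing in, which is why the two points above go together.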

Slide 52

Slide 52 text

✤ you can use tags to direct notifications Tags

Slide 53

Slide 53 text

✤ right now you can use tags in notification rules to match against failing checks ✤ flapjack creates auto-tags from all the words in the entity name, check name, and the summary output from the monitoring check ✤ you can add custom tags to entities ✤ tag a host with 'top-priority' and set your notification rules to alert via sms, email, pagerduty ✤ or tag with 'nobody-cares-except-fred' and then everyone but fred could set a notification rule to match this tag and blackhole all matching notifications Tags
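
A small sketch of the tag matching described above. How Flapjack splits names into words is not stated on the slide, so the tokenisation here (lowercase alphanumeric runs) is an assumption, as are the function names and the shape of a rule.

```python
import re


def auto_tags(entity, check, summary):
    """Derive tags from every word in the entity name, check name and summary."""
    return set(re.findall(r"[a-z0-9]+", f"{entity} {check} {summary}".lower()))


def media_for(tags, rules):
    """Return the media to alert on, or nothing if a matching rule blackholes."""
    matching = [rule for rule in rules if rule["tags"] <= tags]
    if any(rule["blackhole"] for rule in matching):
        return []
    return sorted({medium for rule in matching for medium in rule["media"]})


tags = auto_tags("db01.example.com", "MySQL", "connection refused")
tags.add("top-priority")  # custom tag added to the entity

urgent = {"tags": {"top-priority"}, "media": ["sms", "email", "pagerduty"], "blackhole": False}
ignore = {"tags": {"nobody-cares-except-fred"}, "media": [], "blackhole": True}

print(media_for(tags, [urgent, ignore]))
print(media_for(tags | {"nobody-cares-except-fred"}, [urgent, ignore]))
```

In the second call the contact's blackhole rule matches, so that contact gets nothing, while fred (who would not have that rule) would still be notified.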

Slide 54

Slide 54 text

web interface

Slide 55

Slide 55 text

Web interface

Slide 56

Slide 56 text

excellent documentation

Slide 57

Slide 57 text

Documentation https://github.com/flpjck/flapjack/wiki

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

We care about documentation!

Slide 62

Slide 62 text

Bad documentation? BUG

Slide 63

Slide 63 text

future

Slide 64

Slide 64 text

✤ improve rollup ✤ we try to make you feel welcome as much as we can ✤ documentation ✤ vagrant-flapjack - go test for yourself ✤ omd? future

Slide 65

Slide 65 text

Talk is cheap, demo it!

Slide 66

Slide 66 text

Thank you! https://github.com/flpjck [email protected]

Slide 67

Slide 67 text

feedback https://github.com/flpjck [email protected]

Slide 68

Slide 68 text

Questions? https://github.com/flpjck [email protected]

Slide 69

Slide 69 text

https://github.com/flpjck