23.10.2013
Flapjack
monitoring notification system
Slide 2
Who is that guy?
and what company is he working for?
Slide 3
Birger Schmidt
since mid 2012 R&D engineer at
Slide 4
✤ since mid 2012 R&D engineer at Bulletproof, Sydney, Australia
✤ 2008 - 2012 consultant / trainer at NETWAYS GmbH
✤ Dipl. Inf. Uni Rostock and HU zu Berlin
✤ since the early 90s working in IT infrastructure, operations and development,
today called DevOps
Birger Schmidt
Slide 5
✤ Bulletproof v1.0
✤ Founded in 2000 as a Managed Service Provider, providing managed
networking services
✤ Bulletproof v2.0
✤ First VMware based public Cloud service in Australia in 2006
✤ Bulletproof v3.0
✤ First to launch Managed AWS in Australia in 2012, and currently
the most successful provider
Bulletproof
Slide 6
Bulletproof Customers
Slide 7
agenda
Slide 8
agenda
✤ 1. who (already done) and what
✤ 2. motivation
✤ 3. history
✤ 4. surrounding facts
✤ 5. background
✤ 6. architecture, documentation
✤ 7. demo
Slide 9
“Flapjack is a highly scalable and
distributed monitoring notification
system.
It sits on top of existing
monitoring engines like Nagios or
Sensu, and does event processing &
notifications.”
http://flapjack.io/
Slide 10
motivation
Slide 11
✤ we want to monitor things in an insane way
✤ like mad
✤ existing systems are not fully designed for that
✤ there is demand for that
Motivation I
Slide 12
Movember
is one of our biggest
customers
they want the numbers
and not only the bold ones
that I’ll show you now
Slide 13
Movember
is a charity organization
that raises money
to fund research into
men's health
Slide 16
✤ we do not want to be bothered with
✤ too many alerts
✤ configuration of dependencies like
✤ parent / child
✤ host or service
✤ we cannot afford to restart monitoring
(on notification rule changes)
Motivation II
Slide 17
history
Slide 18
✤ Lindsay Holmwood and Matt Moor wanted to build the
“next generation monitoring system”
✤ simple to set up & operate
✤ with obvious paths to scale
History 2008
Slide 19
✤ Lindsay Holmwood started hacking on Flapjack
✤ a working prototype that runs basic monitoring checks
was ready by mid 2009
✤ idea was simple:
decouple the check execution from the alerting and notification,
and use message queues to distribute the check execution
across lots of machines
✤ but...
History 2009
Slide 20
✤ Flapjack was considered a dead project
** WARNING: Flapjack is no longer under active development. **
** Feel free to fork it, but I don't have the bandwidth to work on this project any longer. **
✤ but there were plenty of other interesting projects
like Sensu with similar goals making excellent progress
✤ later on, Mod-Gearman and others moved in the same direction
✤ no secret,
other people were doing awesome work in the monitoring space
History 2010 and 2011
Slide 21
✤ Flapjack was rebooted with a significantly altered focus
✤ check execution is no longer part of it
✤ it is now focused on
✤ Event processing
✤ Correlation & rollup
✤ API driven configuration
History 2012
Slide 22
✤ we have used Flapjack in production since January 2013
✤ and we are still developing it heavily
✤ the rollup just got in
but is not entirely finished yet
Present / current state 2013
Slide 23
surrounding flapjack facts
Slide 24
✤ fully open source
✤ MIT license
✤ development sponsored by Bulletproof
✤ up to 3 full time engineers working on it
✤ the team...
surrounding facts I
Slide 25
✤ Lindsay Holmwood
✤ Jesse Reynolds
✤ Ali Graham
✤ me, you?
distributed
team
Slide 26
✤ based on Redis and EventMachine,
and written in Ruby (which could change later for some components)
surrounding facts II
EventMachine*
Slide 27
background
Slide 28
✤ notify a dynamic group of people in different ways
✤ Bulletproof has thousands of people to be notified
✤ each of those individuals can have different notification settings
based on time of day or week, the type of service affected,
or the severity of the failure
✤ on-call staff are just one group of those people
Background / Problem I
Slide 29
✤ alert, but make sure nobody gets bombarded during outages
✤ do the above in an API driven way,
and it must work in a multitenant environment with
strong segregation between customers,
and integrate with an existing
monitoring & customer self-service stack
Background / Problem II
Slide 30
✤ in the end human beings are always the receivers of alerts
✤ Lindsay has given some great talks about the psychological and
human mechanisms of processing alerts
✤ hindsight bias / confirmation bias, or why we don't accept being
wrong (and how that causes the inability to see mistakes)
✤ rate of alerts that humans can (or cannot) handle
Background / Humans
Slide 31
✤ Lindsay Holmwood talked about
Mountainwest RubyConf 04/2013
Escalating complexity: DevOps learnings from Air France 447
devops downunder 07/2013 Monitorama EU 09/2013
Background / Humans
Slide 32
architecture
Slide 33
✤ event producer (fj nagios receiver)
✤ flapjack executive (fj executive)
✤ flapjack notifier (as of now part of the fj executive)
✤ gateways (implemented as Pikelets, i.e. Flapjack components which
can be run within the same Ruby process, or as separate processes)
✤ unidirectional: Email, SMS
✤ bidirectional: XMPP, PagerDuty
✤ diagram...
architecture
event processing
Find people interested in entity map
[
alice,
bob,
carol
]
notification
event
filters
Find failing events
Slide 42
event processing
Find people interested in entity map
map
Find media owned by people
[
[ alice, email ],
[ alice, sms ],
[ bob, email ],
[ bob, sms ],
[ carol, sms ],
]
notification
event
filters
Find failing events
Slide 43
event processing
Find people interested in entity map
map
reduce
Find media owned by people
Delete media based on tags, severity, time of day
[
[ alice, email ],
[ alice, sms ],
[ bob, sms ],
]
notification
event
filters
Find failing events
Slide 44
event processing
Find people interested in entity map
map
reduce
reduce
Find media owned by people
Delete media based on tags, severity, time of day
Delete media based on blackholes
[
[ alice, sms ],
[ bob, sms ],
]
notification
event
filters
Find failing events
Slide 45
event processing
Find people interested in entity map
map
reduce
reduce
reduce
Find media owned by people
Delete media based on tags, severity, time of day
Delete media based on blackholes
Delete media based on notification intervals
notification
event
filters
Find failing events
[
[ alice, sms ],
[ bob, sms ],
]
Slide 46
event processing
Find people interested in entity map
map
reduce
reduce
reduce
Find media owned by people
Delete media based on tags, severity, time of day
Delete media based on blackholes
Delete media based on notification intervals
notification
event
filters
Find failing events
alert alert
[
[ alice, sms ],
[ bob, sms ],
]
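The map and reduce steps in the diagram can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `notify_targets`, the entity name `app-01`, and the three exclusion sets are hypothetical and are not Flapjack's actual data model or API. The sample data is taken from the slides (alice, bob, carol and their media).

```python
def notify_targets(entity, interested, media_by_person, reducers):
    """Return the (person, medium) pairs that should be alerted for an entity."""
    # Find people interested in the entity.
    people = interested.get(entity, [])
    # map: expand every person into one pair per medium they own.
    media = [(person, medium)
             for person in people
             for medium in media_by_person.get(person, [])]
    # reduce: each step deletes the pairs its predicate rejects.
    for keep in reducers:
        media = [pair for pair in media if keep(pair)]
    return media

# Sample data from the slides; the entity name is made up.
interested = {"app-01": ["alice", "bob", "carol"]}
media_by_person = {
    "alice": ["email", "sms"],
    "bob": ["email", "sms"],
    "carol": ["sms"],
}

# Hypothetical outcomes of the three reduce steps, hard-coded for the example.
rule_mismatch = {("bob", "email"), ("carol", "sms")}  # tags, severity, time of day
blackholed = {("alice", "email")}                     # blackhole rules
recently_notified = set()                             # notification intervals

reducers = [
    lambda pair: pair not in rule_mismatch,
    lambda pair: pair not in blackholed,
    lambda pair: pair not in recently_notified,
]

targets = notify_targets("app-01", interested, media_by_person, reducers)
print(targets)
```

With this data the pipeline ends with alice and bob being alerted by sms, matching the final list on the slide.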
Slide 48
event filters
Slide 49
✤ filter on status (warning, critical, recovery)
✤ (new entities won't trigger alerts)
✤ maintenance
✤ ack
✤ delay (think of soft states in nagios)
Event filters
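A rough sketch of how such a filter chain could look, written in Python for illustration. The `CheckState` fields, the 30 second delay, the order of the filters, and the choice to let recoveries bypass the ack and delay filters are all assumptions, not Flapjack's actual behaviour.

```python
from dataclasses import dataclass
from typing import Optional

NOTIFIABLE_STATUSES = {"warning", "critical", "recovery"}
FAILURE_DELAY = 30  # seconds a failure must persist before alerting (assumed value)

@dataclass
class CheckState:
    """Hypothetical per-check bookkeeping; the field names are assumptions."""
    seen_before: bool = False              # False for a brand-new entity
    in_maintenance: bool = False
    acknowledged: bool = False
    failing_since: Optional[float] = None  # timestamp of the first failure

def should_notify(status, check, now):
    """Return True if an event with this status may generate a notification."""
    if status not in NOTIFIABLE_STATUSES:
        return False  # status filter
    if not check.seen_before:
        return False  # new entities do not trigger alerts
    if check.in_maintenance:
        return False  # maintenance filter
    if status == "recovery":
        return True   # assumption: recoveries skip the ack and delay filters
    if check.acknowledged:
        return False  # ack filter
    if check.failing_since is None or now - check.failing_since < FAILURE_DELAY:
        return False  # delay filter, comparable to soft states in Nagios
    return True

long_failure = CheckState(seen_before=True, failing_since=0.0)
print(should_notify("critical", long_failure, now=100.0))
```

Every filter can only block an event, never re-enable it, so the order matters only for which filter gets to reject the event first.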
Slide 50
notifications
Slide 51
✤ are event driven
✤ flapjack needs the flow of incoming check results / events
✤ are time based
✤ flapjack sends or holds back alerts
according to time constraints
✤ and tags...
Notifications
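The time-based part could be sketched as below, in Python for illustration. It assumes two kinds of time constraints, a time-of-day window and a minimum interval between notifications. The function names, the office-hours window and the 15 minute interval are hypothetical and do not describe Flapjack's real rule format.

```python
from datetime import datetime, time, timedelta

def within_window(now, start, end):
    """True if the time of day of `now` is in [start, end); the window may wrap midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end

def may_send(now, last_sent, interval, start, end):
    """Hold back an alert outside the window or within `interval` of the last one."""
    if not within_window(now, start, end):
        return False
    return last_sent is None or now - last_sent >= interval

# Example: office-hours rule with at most one notification every 15 minutes.
office_hours = (time(9, 0), time(17, 0))
now = datetime(2013, 10, 23, 14, 0)
recent = now - timedelta(minutes=10)
print(may_send(now, recent, timedelta(minutes=15), *office_hours))
```

Because the decision is made when an event arrives, a held-back alert only goes out once a later event arrives after the constraint has lifted, which is why the flow of incoming check results is needed.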
Slide 52
✤ you can use
tags to direct
notifications
Tags
Slide 53
✤ right now you can use tags in notification rules to
match against failing checks
✤ flapjack creates auto-tags from all the words in the entity name,
check name, and the summary output from the monitoring check
✤ you can add custom tags to entities
✤ tag a host with 'top-priority'
and set your notification rules to alert via sms, email, pagerduty
✤ or tag with 'nobody-cares-except-fred'
and then everyone but fred could set a notification rule to match
this tag and blackhole all matching notifications
Tags
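A small Python sketch of tag matching, for illustration only. How Flapjack actually splits names into words is not stated on the slides, so the tokenisation in `auto_tags` is an assumption, and the rule dictionaries (`tags`, `media`, `blackhole`), the function `media_for` and the host `db-01.example.com` are hypothetical.

```python
import re

def auto_tags(entity, check, summary):
    """Build auto-tags from the words in the entity name, check name and summary."""
    text = " ".join([entity, check, summary]).lower()
    # Assumption: a "word" is a run of letters and digits.
    return set(re.findall(r"[a-z0-9]+", text))

def media_for(event_tags, rules):
    """Return the media one person is alerted on, given that person's rules."""
    # A rule matches when all of its tags are present on the failing check.
    matching = [rule for rule in rules if rule["tags"] <= event_tags]
    # A matching blackhole rule swallows the notification entirely.
    if any(rule["blackhole"] for rule in matching):
        return []
    media = []
    for rule in matching:
        for medium in rule["media"]:
            if medium not in media:
                media.append(medium)
    return media

# Auto-tags plus a custom tag added to the entity.
tags = auto_tags("db-01.example.com", "Disk Usage", "CRITICAL: /var is 95% full")
tags |= {"top-priority"}

alice_rules = [{"tags": {"top-priority"}, "media": ["sms", "email"], "blackhole": False}]
bob_rules = [{"tags": {"nobody-cares-except-fred"}, "media": [], "blackhole": True}]

print(media_for(tags, alice_rules))
print(media_for(tags | {"nobody-cares-except-fred"}, bob_rules))
```

Here alice is alerted by sms and email for the top-priority host, while bob's blackhole rule drops everything tagged 'nobody-cares-except-fred'.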