Finding signal in the monitoring noise with Flapjack

Working in operations in 2014 is hard.*

More applications are running in the cloud, the infrastructures we manage are getting bigger and bigger, and responsibility for them is being divided up across multiple teams.

Then something breaks. All hell breaks loose. Your on-call engineer receives 900 SMS in 30 seconds. Her phone melts. You can’t distinguish the signal from the noise. It takes an hour to fix the problem.

Weren’t computers meant to solve these problems?

Enter Flapjack: a distributed event processing + monitoring alert routing system. Flapjack sits at the end of your monitoring pipeline and works out who it should send alerts to. Sounds pretty simple? Flapjack tries to make it so.

There are still really hard problems to solve when working out who to notify about a detected failure, and what to do when lots of things fail simultaneously.

You should be interested in Flapjack if:

- You want to track down failures faster by rolling up your alerts across multiple monitoring systems.
- You monitor large infrastructures that have multiple teams responsible for keeping them up.
- You want to dip your toe in the water and try alternative check execution engines like Sensu in parallel with Nagios.

In this talk, Jesse Reynolds and Lindsay Holmwood will take you on a whirlwind tour of Flapjack - what it is, how it solves problems, where it’s going - with a hands-on lab that you can start applying in your organisation tomorrow.

*Disclaimer: this abstract was written in 2013. Things may have since gotten awesome and we’re all sitting on the beach in the Bahamas drinking piña coladas. But this is highly unlikely.

Lindsay Holmwood

January 08, 2014

Transcript

  1. Finding signal in the monitoring noise

  2. What is flapjack?

  3. Monitoring alert routing system

  4. Composable

  5. Rollup

  6. Alert routing

  11. • event • ↪ notify? • ↪ who? • ↪ how?

  12. API driven

  13. No restarts required

  14. Developed + used in production at:

  15. Developers: Ali Graham, Jesse Reynolds. Project manager: Lindsay Holmwood

  21. Technology: Ruby, Redis, EventMachine* (*replaced in 2.0 with Ruby threads)

  22. Designed for humans

  23. Considerate of: Alert fatigue, Normalcy bias, Confirmation bias

  25. Why flapjack?

  26. Specific use cases

  27. Multi-tenant

  28. Segregated responsibility

  29. Check engine independence

  31. Killer features

  32. Self-checking

  36. event producers, flapjack, oobetet

  38. Rollup (alert summarisation)

  39. Per-media thresholds

  43. Contact has many Media; each Media has one Summary Threshold

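
The per-media summary threshold can be illustrated with a short sketch. This is Python for illustration only, not Flapjack's own code; the class names, the `rollup_threshold` field and the `messages_for` helper are hypothetical. It assumes that once the number of failing checks reaches a medium's threshold, one summary is sent on that medium instead of one alert per check.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Media:
    """One way of reaching a contact, e.g. email or SMS."""
    kind: str
    address: str
    # Hypothetical field: number of failing checks at which alerts on
    # this medium are rolled up into one summary. None means never.
    rollup_threshold: Optional[int] = None


@dataclass
class Contact:
    name: str
    media: List[Media] = field(default_factory=list)


def messages_for(media: Media, failing_checks: List[str]) -> List[str]:
    """Return the messages to send on one medium for the failing checks."""
    threshold = media.rollup_threshold
    if threshold is not None and len(failing_checks) >= threshold:
        # Rolled up: one summary instead of one alert per check.
        return [f"{len(failing_checks)} checks are failing"]
    return [f"{check} is failing" for check in failing_checks]


alice = Contact("alice", [
    Media("sms", "+15550100", rollup_threshold=3),
    Media("email", "alice@example.com", rollup_threshold=10),
])
failing = [f"web{i}:HTTP" for i in range(5)]
sms_messages = messages_for(alice.media[0], failing)
email_messages = messages_for(alice.media[1], failing)
```

With five failing checks, the SMS medium (threshold 3) gets a single summary while email (threshold 10) still gets one message per check, which is the kind of difference a per-media threshold is meant to allow.
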
  44. Tagging

  46. How does it work?

  47. Data model

  53. Contact, Media, Notification Rules, Entities, Checks, History (maintenance, acks, state changes)

  54. Architecture

  58. event producers, processors, gateways

  59. processors gateways Nagios nagios-receiver Nagios nagios-receiver

  60. Event Producers Nagios Icinga Sensu Cron

  61. processors gateways Nagios nagios-receiver Nagios nagios-receiver

  62. processor gateways notifier Nagios nagios-receiver Nagios nagios-receiver

  64. What is an event?

  65. JSON

  66. { "entity": ENTITY, "check": CHECK, "type": EVENT_TYPE, "state": STATE, "time": TIMESTAMP, "summary": SUMMARY }

  67. "entity": ENTITY (string). Name of the relevant entity (e.g. FQDN)

  68. "check": CHECK (string). The check name ('service description' in Nagios lingo)

  69. "type": EVENT_TYPE (string). One of 'service' or 'action'

  70. "state": STATE (string). One of 'ok', 'warning', 'critical', 'unknown', or 'acknowledgement'

  71. "time": TIMESTAMP (string). UNIX timestamp of the event's creation

  72. "summary": SUMMARY (string). The check output in the case of a service event, otherwise a message created for an acknowledgement, or similar

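
As a rough illustration of the event format above, the following Python sketch builds one event and serialises it to JSON. The `build_event` helper and the example values are hypothetical and are not part of Flapjack. The timestamp is written as an integer number of epoch seconds, which is an assumption; the slides describe the field as a string. How the event is then handed to Flapjack is not covered here.

```python
import json
import time

# Allowed values as listed on the slides.
EVENT_TYPES = {"service", "action"}
STATES = {"ok", "warning", "critical", "unknown", "acknowledgement"}


def build_event(entity, check, state, summary,
                event_type="service", timestamp=None):
    """Build an event dict with the six fields shown on the slides."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type!r}")
    if state not in STATES:
        raise ValueError(f"unknown state: {state!r}")
    return {
        "entity": entity,      # e.g. the FQDN of the host
        "check": check,        # 'service description' in Nagios lingo
        "type": event_type,
        "state": state,
        # Assumption: epoch seconds as an integer.
        "time": int(time.time()) if timestamp is None else int(timestamp),
        "summary": summary,
    }


event = build_event("app1.example.com", "HTTP", "critical",
                    "Connection refused", timestamp=1389139200)
payload = json.dumps(event)
```

Validating `type` and `state` before serialising keeps malformed events out of the pipeline, which matters when several different event producers feed the same system.
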
  74. processor gateways notifier Nagios nagios-receiver Nagios nagios-receiver

  75. How are alerts routed?

  84. event → filters (find failing events) → notification

    map: find people interested in entity → [ alice, bob, carol ]
    map: find media owned by people → [ [ alice, email ], [ alice, sms ], [ bob, email ], [ bob, sms ], [ carol, sms ] ]
    reduce: delete media based on tags, severity, time of day → [ [ alice, email ], [ alice, sms ], [ bob, sms ] ]
    reduce: delete media based on blackholes → [ [ alice, sms ], [ bob, sms ] ]
    reduce: delete media based on notification intervals → [ [ alice, sms ], [ bob, sms ] ]
    → alert, alert

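
The map and reduce steps on these slides can be sketched in a few lines of Python. This is an illustration of the idea only, not Flapjack's implementation: the lookup tables, the rejection sets and every function name are hypothetical, and the filter outcomes are hard-coded so that the result matches the example on the slides.

```python
# Hypothetical lookup tables standing in for stored contact data.
INTERESTED = {"app1.example.com": ["alice", "bob", "carol"]}
MEDIA = {"alice": ["email", "sms"], "bob": ["email", "sms"], "carol": ["sms"]}

# Hypothetical filter outcomes, chosen to reproduce the slide example.
RULE_REJECTS = {("bob", "email"), ("carol", "sms")}  # tags, severity, time of day
BLACKHOLED = {("alice", "email")}
NOTIFIED_RECENTLY = set()  # still inside a notification interval


def people_for(entity):
    """Map step 1: find the people interested in an entity."""
    return INTERESTED.get(entity, [])


def media_for(people):
    """Map step 2: expand each person into (person, medium) pairs."""
    return [(person, medium)
            for person in people
            for medium in MEDIA.get(person, [])]


def drop(rejected):
    """Build a reduce step that deletes the rejected pairs."""
    return lambda pairs: [pair for pair in pairs if pair not in rejected]


STAGES = [drop(RULE_REJECTS), drop(BLACKHOLED), drop(NOTIFIED_RECENTLY)]


def route(entity):
    """Run the two map steps, then each reduce step in order."""
    pairs = media_for(people_for(entity))
    for stage in STAGES:
        pairs = stage(pairs)
    return pairs


alerts = route("app1.example.com")
```

Each remaining pair corresponds to one alert. Because each stage only removes pairs and never adds them, adding another filter in this sketch is a matter of appending one more entry to `STAGES`.
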
  85. processor gateways notifier Nagios nagios-receiver Nagios nagios-receiver

  87. processor notifier Nagios nagios-receiver Nagios nagios-receiver Email SMS Jabber PagerDuty Web API

  89. Things that may surprise you

  90. Constant heartbeat

  91. No one-off events

  92. How long has a check been failing?

  93. NOT "How many times has the check failed?"

  94. Why?

  95. No HARD/SOFT states

  96. Broadcast delay

  97. Alert summarisation (Rollup)
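
The difference between "how long" and "how many times" can be shown with a small sketch. It reads the broadcast delay as a hold-off period before the first notification, which is an interpretation of the slides rather than something they state; the 30 second value comes from the "Fixed broadcast delay (30s)" slide, and the `CheckState` class is hypothetical. The approach relies on a constant heartbeat of events, since the duration is only re-evaluated when an event arrives.

```python
BROADCAST_DELAY_SECONDS = 30  # value taken from the shortcomings slide


class CheckState:
    """Tracks when a check first started failing."""

    def __init__(self):
        self.failing_since = None

    def update(self, state, timestamp):
        """Record an event; return True if a notification is due."""
        if state == "ok":
            # Recovery resets the clock, however many failures came before.
            self.failing_since = None
            return False
        if self.failing_since is None:
            self.failing_since = timestamp
        return timestamp - self.failing_since >= BROADCAST_DELAY_SECONDS


check = CheckState()
results = [
    check.update("critical", 0),   # first failure, clock starts
    check.update("critical", 10),  # failing for 10s
    check.update("critical", 30),  # failing for 30s
    check.update("ok", 40),        # recovered
    check.update("critical", 50),  # new failure, clock restarts
]
```

In this sketch no retry counter is kept at all: only the time of the first failure matters, which is why a separate notion of HARD and SOFT states is not needed.
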

  99. "Nagios as a dumb check executor"

  100. No notifications

  101. No acknowledgements

  102. No downtime

  103. No parenting

  104. Just checking

  105. Shards of Nagios

  106. Scale horizontally

  107. Nagios shared state

  109. Case study

  110. Production use at Bulletproof since December 2012

  111. As of November 2013: 896 entities, 5778 checks* (*some of these are stale)

  112. As of January 2014: Processing ~60 events/second

  113. /self_stats.json

  114. Manage (customer portal)

  117. manage-flapjack-sync

  118. manage (source of truth) → manage-flapjack-sync → flapjack

  125. Flapjack, Manage, collectd, Nagios, ???

  127. { "entity": ENTITY, "check": CHECK, "type": EVENT_TYPE, "state": STATE, "time": TIMESTAMP, "summary": SUMMARY } ENTITY (string): name of the relevant entity (e.g. FQDN). Bulletproof uses the FQDN.

  129. Shortcomings

  130. Fixed broadcast delay (30s)

  131. Single external source of truth

  132. Contact import/export

  133. Bulletproof-ism

  135. Open Source

  136. • github.com/flpjck/flapjack

  137. Release planning in the open: github.com/flpjck/flapjack/wiki/Releasing

  138. Policy for triaging bugs + features: github.com/flpjck/flapjack/wiki/Releasing

  139. Semantic versioning (2.0.0)

  140. We write tests! Unit + Integration

  141. We write tests! ~80% coverage

  142. We run tests!

  150. github.com/flpjck • github.com/flpjck/flapjack • github.com/flpjck/flapjack-diner • github.com/flpjck/vagrant-flapjack • github.com/flpjck/omnibus-flapjack • github.com/flpjck/packages.flapjack.io

  151. packages.flapjack.io

  152. fat packages (omnibus)

  153. .deb provided, .rpm in the works

  154. Quality documentation github.com/flpjck/flapjack/wiki

  155. Bad documentation? BUG

  156. Bad first experience? BUG

  159. Thank you! Liked the talk? Let @auxesis + @jessereynolds know! flapjack.io • github.com/flpjck/flapjack

  160. Credits: http://www.flickr.com/photos/lizadaly/4373330774 http://www.flickr.com/photos/meltwater/420749031 http://www.flickr.com/photos/whatknot/8642836187 http://www.flickr.com/photos/jonmould/5393395335 http://www.flickr.com/photos/thomasforsyth/4313764488 http://www.flickr.com/photos/rubodewig/5161937181 http://www.flickr.com/photos/ronwls/7001551988 http://www.flickr.com/photos/sparktography/83217827 http://www.flickr.com/photos/sdphotography/1570906849

    http://tosbourn.com/wp-content/uploads/2013/12/redis-logo.png http://www.flickr.com/photos/derekskey/9530097369 http://giphy.com/gifs/yeUxljCJjH1rW http://www.flickr.com/photos/karen_d/8448507872 http://www.flickr.com/photos/buzzhoffman/4127280540 http://i.imgur.com/2UduUZ5.gif