Data-driven alerting with Flapjack + Puppet + Hiera

Working in operations in 2014 is hard.

The infrastructures we manage are growing rapidly, and responsibility is being divided up across multiple teams.

Then something breaks. Your on-call engineer receives 900 SMS in 30 seconds. Her phone melts. You can’t distinguish the signal from the noise. It takes an hour to fix the problem.

Enter Flapjack: an event processing & monitoring alert routing system. Flapjack sits at the end of your monitoring pipeline and sends alerts to the right person.

You should be interested in Flapjack if:

- You want to automatically configure how your on-call engineers are notified from within Puppet.
- You want to identify failures faster by rolling up your alerts across multiple monitoring systems.
- You monitor infrastructures that have multiple teams responsible for keeping them up.

In this talk you will learn how to set up Flapjack with Puppet, how to drive Flapjack configuration from data with Hiera, and how to use Puppet's metadata to route alerts to the people who can solve the problem.

## Credits

http://www.flickr.com/photos/lizadaly/4373330774
http://www.flickr.com/photos/meltwater/420749031
http://www.flickr.com/photos/whatknot/8642836187
http://www.flickr.com/photos/jonmould/5393395335
http://vmfarms.com/static/img/logos/ruby-logo.png
http://www.flickr.com/photos/l1v32r1d3bmx/3985457584
http://www.flickr.com/photos/thomasforsyth/4313764488
http://www.flickr.com/photos/rubodewig/5161937181
http://www.flickr.com/photos/ronwls/7001551988
http://www.flickr.com/photos/sparktography/83217827
http://www.flickr.com/photos/sdphotography/1570906849
http://tosbourn.com/wp-content/uploads/2013/12/redis-logo.png?e0df77
http://www.flickr.com/photos/derekskey/9530097369
http://giphy.com/gifs/yeUxljCJjH1rW
http://en.wikipedia.org/wiki/Broadcast_delay
http://www.flickr.com/photos/karen_d/8448507872
http://www.flickr.com/photos/buzzhoffman/4127280540
http://i.imgur.com/2UduUZ5.gif

Lindsay Holmwood

February 10, 2014

Transcript

  1. Data-driven alerting with Flapjack + Puppet + Hiera

  2. What is flapjack?

  3. Monitoring alert routing system

  4. Composable

  5. Rollup

  6. Alert routing

  7. None
  8. • event

  9. • event • ↪ notify?

  10. • event • ↪ notify? • ↪ who?

  11. • event • ↪ notify? • ↪ who? • ↪ how?
  12. API driven

  13. No restarts required

  14. Developed + used in production at:

  15. Developed + used in production at:
      Developers: Ali Graham, Jesse Reynolds
      Project manager: Lindsay Holmwood
  16. +

  17. Designed for humans

  18. None
  19. None
  20. Why flapjack?

  21. Specific use cases

  22. Multi-tenant

  23. Segregated responsibility

  24. Check engine independence

  25. None
  26. Killer features

  27. Self-checking

  28. None
  29. event producers

  30. event producers flapjack

  31. event producers flapjack oobetet

  32. None
  33. flapper

  34. event producers flapper

  35. event producers flapjack flapper

  36. event producers flapjack jabber room flapper

  37. event producers flapjack jabber room flapper oobetet

  38. event producers flapjack jabber room flapper oobetet PagerDuty

  39. Rollup (alert summarisation)

  40. Per-media thresholds

  41. None
  42. • Contact

  43. • Contact • has many • Media

  44. • Contact • has many • Media • has one • Summary Threshold
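
Slides 39 to 44 describe rollup as a per-media summary threshold on each contact. A minimal Python sketch of that idea follows. It is not Flapjack's code: the function name, the message wording, and the choice of "at least `rollup_threshold` failing checks" as the trigger are assumptions made for illustration.

```python
def notifications_for(medium, failing_checks, rollup_threshold=None):
    """Return the messages to send to one contact on one medium.

    Assumption: once the number of failing checks reaches the medium's
    rollup threshold, a single summary replaces the individual alerts.
    """
    count = len(failing_checks)
    if rollup_threshold is not None and count >= rollup_threshold:
        return [f"{medium}: {count} checks are failing"]
    return [f"{medium}: {check} is failing" for check in failing_checks]


# Hypothetical check names, six of them failing at once.
failing = [f"app-0{n}.example.com:HTTP" for n in range(1, 7)]

sms = notifications_for("sms", failing, rollup_threshold=5)    # one summary
email = notifications_for("email", failing)                    # no threshold set
print(len(sms), len(email))
```

With a threshold of 5 on SMS and none on email, the same six failures would produce one SMS summary but six individual emails, which is the point of keeping the threshold per medium.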
  45. Tagging

  46. None
  47. How does it work?

  48. Data model

  49. Contact

  50. Contact Checks Checks Media

  51. Contact Checks Checks Media Checks Checks Notification Rules

  52. Contact Checks Checks Media Checks Checks Notification Rules Checks Checks

    Entities
  53. Contact Checks Checks Media Checks Checks Notification Rules Checks Checks

    Checks Checks Checks Entities
  54. Contact Checks Checks Media Checks Checks Notification Rules History (maintenance,

    acks, state changes) Checks Checks Checks Checks Checks Entities
  55. Architecture

  56. None
  57. event producers

  58. event producers processors

  59. event producers processors gateways

  60. processors gateways Icinga flapjackfeeder Sensu jestin's thing

  61. Event Producers Icinga Sensu Cron Nagios

  62. processors gateways Icinga flapjackfeeder Sensu jestin's thing

  63. processor gateways notifier Icinga flapjackfeeder Sensu jestin's thing

  64. processor gateways notifier Icinga flapjackfeeder Sensu jestin's thing

  65. How are alerts routed?

  66. event

  67. event filters Find failing events

  68. notification event filters Find failing events

  69. event
      filters: Find failing events
      notification
      map: Find people interested in entity
      [ alice, bob, carol ]
  70. event
      filters: Find failing events
      notification
      map: Find people interested in entity
      map: Find media owned by people
      [ [ alice, email ], [ alice, sms ], [ bob, email ], [ bob, sms ], [ carol, sms ] ]
  71. event
      filters: Find failing events
      notification
      map: Find people interested in entity
      map: Find media owned by people
      reduce: Delete media based on tags, severity, time of day
      [ [ alice, email ], [ alice, sms ], [ bob, sms ] ]
  72. event
      filters: Find failing events
      notification
      map: Find people interested in entity
      map: Find media owned by people
      reduce: Delete media based on tags, severity, time of day
      reduce: Delete media based on blackholes
      [ [ alice, sms ], [ bob, sms ] ]
  73. event
      filters: Find failing events
      notification
      map: Find people interested in entity
      map: Find media owned by people
      reduce: Delete media based on tags, severity, time of day
      reduce: Delete media based on blackholes
      reduce: Delete media based on notification intervals
      [ [ alice, sms ], [ bob, sms ] ]
  74. event
      filters: Find failing events
      notification
      map: Find people interested in entity
      map: Find media owned by people
      reduce: Delete media based on tags, severity, time of day
      reduce: Delete media based on blackholes
      reduce: Delete media based on notification intervals
      [ [ alice, sms ], [ bob, sms ] ]
      alert, alert
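
The build-up in slides 65 to 74 reads as a small map/reduce pipeline, which can be sketched in a few lines of Python. This is only an illustration, not Flapjack's implementation: the lookup tables and the three predicate functions are hypothetical stand-ins for the real notification rules, blackholes and interval tracking, chosen so that the intermediate lists match the ones on the slides.

```python
# Hypothetical lookup tables; in Flapjack this data lives in its datastore.
INTERESTED = {"app-01.example.com": ["alice", "bob", "carol"]}
MEDIA = {"alice": ["email", "sms"], "bob": ["email", "sms"], "carol": ["sms"]}


def find_contacts(entity):
    """map: find people interested in the entity."""
    return INTERESTED.get(entity, [])


def find_media(contacts):
    """map: find media owned by those people, as (contact, medium) pairs."""
    return [(c, m) for c in contacts for m in MEDIA.get(c, [])]


# Each reduce step is a predicate that returns True to keep a pair.
def allowed_by_rules(pair):       # stands in for tags, severity, time of day
    return pair not in {("bob", "email"), ("carol", "sms")}


def not_blackholed(pair):         # stands in for blackhole rules
    return pair != ("alice", "email")


def interval_elapsed(pair):       # stands in for notification intervals
    return True


def route(entity, reducers):
    """Run the map steps, then apply each reduce step in order."""
    candidates = find_media(find_contacts(entity))
    for keep in reducers:
        candidates = [pair for pair in candidates if keep(pair)]
    return candidates


alerts = route("app-01.example.com",
               [allowed_by_rules, not_blackholed, interval_elapsed])
print(alerts)  # [('alice', 'sms'), ('bob', 'sms')]
```

Each surviving pair becomes one alert, so this example ends with two, as on slide 74.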
  75. processor gateways notifier Icinga flapjackfeeder Sensu jestin's thing

  76. processor notifier Icinga flapjackfeeder Sensu jestin's thing Email SMS Jabber

    PagerDuty Web API
  77. None
  78. Things that may surprise you

  79. Constant heartbeat

  80. No one-off events

  81. How long has a check been failing?

  82. NOT "How many times has the check failed?"

  83. Why?

  84. No HARD/SOFT states

  85. Broadcast delay
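
One reading of slides 79 to 85 is that, because event producers send a constant heartbeat of results, Flapjack can ask how long a check has been failing instead of counting failures, and hold an alert back until the failure has lasted a while. The Python sketch below illustrates that reading only. The class, the function and the 30 second delay are hypothetical, not Flapjack names or defaults.

```python
from datetime import datetime, timedelta


class CheckState:
    """Remembers when a check first started failing."""

    def __init__(self):
        self.failing_since = None

    def record(self, ok, at):
        """Feed in one heartbeat event for the check."""
        if ok:
            self.failing_since = None      # recovery resets the clock
        elif self.failing_since is None:
            self.failing_since = at        # first failure in this run


def should_alert(state, now, delay):
    """True once the check has been failing continuously for `delay`."""
    if state.failing_since is None:
        return False
    return now - state.failing_since >= delay


start = datetime(2014, 2, 10, 9, 0, 0)
delay = timedelta(seconds=30)              # hypothetical delay

state = CheckState()
state.record(ok=False, at=start)
state.record(ok=False, at=start + timedelta(seconds=10))
print(should_alert(state, start + timedelta(seconds=10), delay))  # False
print(should_alert(state, start + timedelta(seconds=45), delay))  # True
```

Counting failures would make the result depend on how often each producer runs its checks; measuring elapsed time does not, which may be why the deck contrasts the two questions.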

  86. Alert summarisation (Rollup)

  87. None
  88. None
  89. Integrating

  90. Configure Flapjack with Puppet

  91. Puppet as external source of truth

  92. None
  93. Contact

  94. Contact Checks Checks Media

  95. Contact Checks Checks Media Checks Checks Notification Rules

  96. Contact Checks Checks Media Checks Checks Notification Rules Checks Checks

    Entities
  97. How?

  98. None
  99. Puppet

  100. API Puppet

  101. API Puppet

  102. flapjack API Puppet

  103. flapjack API Puppet events

  104. flapjack API Puppet events notifications

  105. Puppet type + provider for flapjack

  106. Bootstrapping

  107. git clone https://github.com/flpjck/vagrant-flapjack.git
       cd vagrant-flapjack
       vagrant up

  108. manifests/site.pp

  109. node default {
         class {'icinga': }
         -> class {'nagios': }
         -> class {'flapjack': }
       }
  110. Contact Checks Checks Media Checks Checks Notification Rules Checks Checks

    Entities
  111. Contact Checks Checks Media Checks Checks Notification Rules Checks Checks

    Entities
  112. flapjack_contact { '[email protected]':
         ensure     => present,
         first_name => 'Ada',
         last_name  => 'Lovelace',
         timezone   => 'Europe/London',
       }
  113. Contact Checks Checks Media Checks Checks Notification Rules Checks Checks

    Entities
  114. Contact Checks Checks Media Checks Checks Notification Rules Checks Checks

    Entities
  115. flapjack_contact { '[email protected]':
         ensure     => present,
         first_name => 'Ada',
         last_name  => 'Lovelace',
         timezone   => 'Europe/London',
       }
  116. flapjack_contact { '[email protected]':
         ensure     => present,
         first_name => 'Ada',
         last_name  => 'Lovelace',
         timezone   => 'Europe/London',
         sms_media  => {
           address          => '+61412345678',
           interval         => '120',
           rollup_threshold => '5',
         },
       }
  117. flapjack_contact { '[email protected]':
         ensure      => present,
         first_name  => 'Ada',
         last_name   => 'Lovelace',
         timezone    => 'Europe/London',
         sms_media   => {
           address          => '+61412345678',
           interval         => '120',
           rollup_threshold => '5',
         },
         email_media => {
           address  => '[email protected]',
           interval => '1800',
         }
       }
  118. Contact Checks Checks Media Checks Checks Notification Rules Checks Checks

    Entities
  119. Contact Checks Checks Media Checks Checks Notification Rules Checks Checks

    Entities
  120. None
  121. flapjack_notification_rule { 'ada catchall':
         contact_id     => '[email protected]',
         warning_media  => [ 'email' ],
         critical_media => [ 'sms' ],
       }
  122. flapjack_notification_rule { 'ada app-01':
         contact_id     => '[email protected]',
         entities       => [ 'app-01.example.com' ],
         warning_media  => [ 'sms' ],
         critical_media => [ 'sms' ],
       }
       flapjack_notification_rule { 'ada catchall':
         contact_id     => '[email protected]',
         warning_media  => [ 'email' ],
         critical_media => [ 'sms' ],
       }
  123. flapjack_notification_rule { 'ada db':
         contact_id     => '[email protected]',
         entity_tags    => [ 'db' ],
         warning_media  => [ 'email' ],
         critical_media => [ ],
       }
       flapjack_notification_rule { 'ada app-01':
         contact_id     => '[email protected]',
         entities       => [ 'app-01.example.com' ],
         warning_media  => [ 'sms' ],
         critical_media => [ 'sms' ],
       }
       flapjack_notification_rule { 'ada catchall':
         contact_id     => '[email protected]',
         warning_media  => [ 'email' ],
         critical_media => [ 'sms' ],
       }
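
To see how the three rules on slide 123 might combine, here is a rough Python sketch. It rests on two assumptions that the slides do not state: that a rule naming entities or tags takes precedence over a catch-all rule, and that a tag rule matches when the entity carries every tag the rule lists. The function names and the `db-01` and `web-01` hostnames are made up for the example.

```python
# The three rules from slide 123, as plain dictionaries.
RULES = [
    {"name": "ada db", "entity_tags": ["db"],
     "warning_media": ["email"], "critical_media": []},
    {"name": "ada app-01", "entities": ["app-01.example.com"],
     "warning_media": ["sms"], "critical_media": ["sms"]},
    {"name": "ada catchall",
     "warning_media": ["email"], "critical_media": ["sms"]},
]


def is_specific(rule):
    """A rule is specific if it names entities or tags."""
    return bool(rule.get("entities") or rule.get("entity_tags"))


def matches(rule, entity, tags):
    """Assumed matching: by entity name, or by carrying all the rule's tags."""
    if entity in rule.get("entities", []):
        return True
    rule_tags = rule.get("entity_tags", [])
    return bool(rule_tags) and set(rule_tags) <= set(tags)


def media_for(rules, entity, tags, severity):
    """Media to notify on; specific rules are assumed to beat catch-alls."""
    specific = [r for r in rules if is_specific(r) and matches(r, entity, tags)]
    chosen = specific or [r for r in rules if not is_specific(r)]
    media = set()
    for rule in chosen:
        media.update(rule.get(f"{severity}_media", []))
    return sorted(media)


print(media_for(RULES, "app-01.example.com", [], "warning"))     # ['sms']
print(media_for(RULES, "db-01.example.com", ["db"], "critical"))  # []
print(media_for(RULES, "web-01.example.com", [], "critical"))     # ['sms']
```

Under these assumptions the empty `critical_media` list on the `db` rule silences critical alerts for tagged database hosts, while every other host falls through to the catch-all.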
  124. Hiera

  125. hiera_resources(['resources'])
       node default {
         # ...
       }

  126. resources:
         flapjack_contact:
           '[email protected]':
             ensure: present
             first_name: John
             last_name: Doe
             timezone: 'Australia/Sydney'
             sms_media:
               address: '+61431261000'
               interval: 120
               rollup_threshold: 5
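
The Hiera data on slide 126 is a hash of resource type, then resource title, then parameters. Assuming `hiera_resources` simply walks that hash and declares one resource per title, the expansion could be pictured with the Python sketch below. The `expand` function is hypothetical, and `contact-id` is a placeholder because the real title is obscured in the transcript.

```python
# The slide 126 data as a Python dictionary ('contact-id' is a placeholder).
resources = {
    "flapjack_contact": {
        "contact-id": {
            "ensure": "present",
            "first_name": "John",
            "last_name": "Doe",
            "timezone": "Australia/Sydney",
            "sms_media": {
                "address": "+61431261000",
                "interval": 120,
                "rollup_threshold": 5,
            },
        },
    },
}


def expand(resource_hash):
    """Flatten {type: {title: params}} into (type, title, params) tuples."""
    declarations = []
    for resource_type, instances in resource_hash.items():
        for title, params in instances.items():
            declarations.append((resource_type, title, params))
    return declarations


for resource_type, title, params in expand(resources):
    print(resource_type, title, sorted(params))
```

Adding another contact then means adding another key to the data, with no change to the manifest.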
  127. Open Source

  128. • github.com/flpjck/flapjack

  129. Quality documentation github.com/flpjck/flapjack/wiki

  130. Bad documentation? BUG

  131. Bad first experience? BUG

  132. None
  133. Thank you! flapjack.io github.com/flpjck/flapjack github.com/flpjck/flapjack-vagrant

  134. Thank you! Liked the talk? Let @auxesis know! flapjack.io github.com/flpjck/flapjack github.com/flpjck/flapjack-vagrant
  135. Credits