Flapjack tutorial - Velocity NY 2014 - with notes

In this tutorial, Jesse Reynolds and Lindsay Holmwood will take you on a whirlwind tour of Flapjack (what it is, how it solves problems, where it's going) with a hands-on lab that you can start applying in your organisation tomorrow.

Attendees of this tutorial will come away with an understanding of:

- How to install + configure Flapjack
- How to use Flapjack when migrating away from Nagios as a check execution engine
- How to work with Flapjack’s APIs to integrate with your existing systems

Lindsay Holmwood

September 15, 2014

Transcript

  1. • event • ↪ notify? • ↪ who? • ↪ how?

    At its core, Flapjack is simply asking questions about event data it observes:

    - do I actually need to notify someone about a problem?
    - who do I notify?
    - how do I notify them?
  2. Composable http://www.flickr.com/photos/lizadaly/4373330774/sizes/o

    Flapjack is composable: it sits downstream of your existing check execution engines, like Sensu or Nagios, and aggregates the state they sample.
  3. API driven http://www.flickr.com/photos/jonmould/5393395335/sizes/o

    Flapjack is API driven. We try to expose as much of Flapjack's configuration as possible via an API.
  4. flapjack-diner

    Flapjack provides a JSONAPI-compliant REST API.

    We ship a Ruby client for it called flapjack-diner.

    When you're starting out with Flapjack, you should use flapjack-diner to interact with the API.

    All of the examples we're running through today use flapjack-diner.
  5. Large numbers of potential contacts https://www.flickr.com/photos/nathaninsandiego/5364959065/sizes/o https://www.flickr.com/photos/warrenfree/4809625285/sizes/o

    If you work in an organisation with large numbers of people who care about monitoring of production systems, or you work at a service provider with many customers who want to be notified about problems with their service, Flapjack is an excellent fit.

    We developed Flapjack at Bulletproof because we have some unique requirements for our monitoring platform - there are up to 4,000 people who can be notified by our monitoring platform at any point in time.

    We needed an alerting system that would allow users to change their alerting configuration in real time, and also easily import data from other sources of truth, like databases of customer services.
  6. Multi-tenant http://www.flickr.com/photos/thomasforsyth/4313764488/sizes/o

    If you need to have alerting for multiple customers configured from one place, Flapjack is perfectly suited for this task.

    Because Flapjack makes few assumptions about your data, it provides a strong foundation for multitenancy. Bulletproof uses flapjack-diner heavily to consume data from Flapjack and display it to customers via our customer portal.
  7. Segregated responsibility http://www.flickr.com/photos/rubodewig/5161937181/sizes/o

    If you work in a really large company that has different teams responsible for different apps or parts of the infrastructure, Flapjack's multitenancy features are perfectly suited.
  8. Install Flapjack

        # Install Flapjack
        echo 'deb http://packages.flapjack.io/deb/v1 precise main' |
          sudo tee /etc/apt/sources.list.d/flapjack.list

        sudo apt-get update
        sudo apt-get install flapjack

        sudo service flapjack status
  9. fat packages (omnibus)

    We build packages with omnibus, which means they're fat packages that have everything needed to start using Flapjack out of the box.

    We opted for this approach because we want the first Flapjack experience for users to be awesome.

    Having to piece together build dependencies and runtime dependencies is not a fun experience - we want people to be up, running, and solving problems in a matter of minutes.
  10. + http://vmfarms.com/static/img/logos/ruby-logo.png

    Included in the packages are a self-contained version of Ruby, all the RubyGems required by Flapjack, compiled Go binaries, and our own version of Redis.
  11. Redis on 6380

    We run Redis on a different port, so it doesn't conflict with any Redis you already have installed.
  12. .deb provided

    Shipping a package that provides an awesome out-of-the-box experience is no easy feat.

    Omnibus helps a lot, and we've recently put a lot of engineering time into building a solid toolchain on top of Docker to make package building and distribution reliable.

    The good news with our recent Docker work is that it'll be pretty straightforward to build RPMs.

    If you're interested in contributing an awesome RPM experience, please come and chat afterwards.
  13. API driven http://www.flickr.com/photos/jonmould/5393395335/sizes/o

    Flapjack makes the majority of data you'll want to manipulate on a daily basis available via the API.

    This means you don't have to restart Flapjack when you're adding, removing, or changing contacts, or their notification rules.
  14. static vs dynamic (reload required) https://www.flickr.com/photos/fitzharris/12040967996/

    There is a small amount of static configuration required for spinning up the various components that make up the Flapjack stack.

    Any changes to this configuration file require a reload or restart of Flapjack. You'll be doing this very rarely.
  15. Edit the configuration

        cd vagrant-flapjack
        vagrant ssh
        sudo $EDITOR /etc/flapjack/flapjack_config.yaml
        # make changes
        sudo service flapjack restart
  16. common config

        ---
        production:
          pid_dir: /var/run/flapjack/
          log_dir: /var/log/flapjack/
          daemonize: yes
          logger:
            level: INFO
            syslog_errors: yes
          redis:
            host: 127.0.0.1
            port: 6380
            db: 0

    Flapjack configuration is split into environments.

    The default configuration file we ship has production, development, and test environments.

    Within each environment, there is common configuration that's shared across all of the bits that make up Flapjack under the hood.
  17. Before https://github.com/flpjck/flapjack/wiki/USING#processor-new_check_scheduled_maintenance_duration

        processor:
          enabled: yes
          queue: events
          notifier_queue: notifications
          archive_events: true
          events_archive_maxage: 10800
          new_check_scheduled_maintenance_duration: 100 years
          logger:
            level: INFO
            syslog_errors: yes
  18. After https://github.com/flpjck/flapjack/wiki/USING#processor-new_check_scheduled_maintenance_duration

        processor:
          enabled: yes
          queue: events
          notifier_queue: notifications
          archive_events: true
          events_archive_maxage: 10800
          new_check_scheduled_maintenance_duration: 0 seconds
          logger:
            level: INFO
            syslog_errors: yes
  19. Grace period on alerts

    new_check_scheduled_maintenance_duration is a grace period on alerts. A check that Flapjack is seeing for the first time won't alert until that duration expires.

    This gives you leeway to finish provisioning services before people start being notified about them.
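To make the grace period concrete, here is a minimal Ruby sketch of the idea. It is not Flapjack's implementation: the `GracePeriod` class and its method names are hypothetical, and it simply assumes that a check is suppressed from the moment it is first seen until the configured duration has elapsed.

```ruby
# Hypothetical model of a grace period for checks that have not been seen before.
class GracePeriod
  def initialize(duration_seconds)
    @duration = duration_seconds
    @first_seen = {} # check name => time the check was first seen
  end

  # True while the check is still inside its initial grace period.
  def suppressed?(check, now:)
    first = (@first_seen[check] ||= now)
    now - first < @duration
  end
end

ten_minutes = GracePeriod.new(600)
grace_start = ten_minutes.suppressed?('app-01:HTTP', now: 0)
grace_mid   = ten_minutes.suppressed?('app-01:HTTP', now: 599)
grace_over  = ten_minutes.suppressed?('app-01:HTTP', now: 600)

# A duration of zero (as in the "After" config) means no grace period at all.
no_grace = GracePeriod.new(0).suppressed?('app-02:HTTP', now: 0)

p [grace_start, grace_mid, grace_over, no_grace]
```

With a 600 second duration the check stays quiet for its first ten minutes; with a duration of zero it can alert straight away, which is what the tutorial wants for the lab.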
  20. bit.ly/flapjack-twilio

        sms_twilio:
          enabled: yes
          queue: sms_twilio_notifications
          account_sid: "<sid>"
          auth_token: "<token>"
          from: "<number>"
          logger:
            level: DEBUG

    The last thing we're going to set up is the Twilio gateway.

    You'll be using this later on to send yourself an SMS from Flapjack.
  21. components

        ---
        production:
          processor:
          notifier:
          nagios-receiver:
          nsca-receiver:

    Then there is configuration for all the different Flapjack components that are doing the event processing work.
  22. gateways

        ---
        production:
          gateways:
            email:
            sms:
            sms_twilio:
            sns:
            jabber:
            pagerduty:
            web:
            jsonapi:
            oobetet:

    Finally, there are the gateways.

    Gateways handle contacting a human once Flapjack has decided that someone must be notified.
  23. Architecture

    Now that we've had a glimpse of the different Flapjack components in the configuration file, this is a good time to look at how Flapjack hangs together at a really high level.
  24. event producers, processors, gateways

    Flapjack's architecture is really simple. There are three building blocks:

    - event producers
    - processors
    - gateways

    Event producers observe environment state and report samples as event heartbeats to the processors.

    Processors handle incoming events, update various counters to track state, and decide whether an alert should be sent and who it should be sent to.

    Gateways handle the actual delivery of messages to people.
  25. [Diagram: event producers (Sensu, Sensu::Extension::Flapjack, httpchecker, oneoff, httpbroker, Icinga, flapjackfeeder, Nagios, nagios-receiver), processors, gateways]

    There are several integrations to take events from existing check execution engines and feed them into Flapjack.

    Redis is core to all these integrations, and all of the integrations conceptually work the same way - they gather and push state to Flapjack via Redis.
  26. Event format

        {
          "entity": ENTITY,
          "check": CHECK,
          "tags": TAGS,
          "state": STATE,
          "time": TIMESTAMP,
          "summary": SUMMARY
        }

    Each event has these fields.

    Most of these fields are mandatory.
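As a rough illustration of the event format, the Ruby sketch below builds an event with the six fields from the slide and serialises it to JSON. The `build_event` helper, the example values, and the list of accepted states are assumptions made for this example; the talk does not spell out which states or fields are required, and pushing the payload to Redis is left out.

```ruby
require 'json'

# Field names come from the slide. The set of states is an assumption
# for illustration, not something stated in the talk.
ASSUMED_STATES = %w[ok warning critical unknown].freeze

# Build an event hash with the six fields shown on the slide (hypothetical helper).
def build_event(entity:, check:, state:, summary:, tags: [], time: Time.now.to_i)
  raise ArgumentError, "unexpected state: #{state}" unless ASSUMED_STATES.include?(state)

  {
    'entity'  => entity,
    'check'   => check,
    'tags'    => tags,
    'state'   => state,
    'time'    => time,
    'summary' => summary
  }
end

event = build_event(
  entity: 'app-01.example.com',
  check: 'HTTP',
  state: 'critical',
  summary: 'Connection refused',
  tags: %w[http],
  time: 1_410_784_200
)
payload = JSON.generate(event)
puts payload
```

An event producer would send a payload like this at a regular interval, whether or not the state has changed.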
  27. [Diagram: event producers (Sensu, Sensu::Extension::Flapjack, httpchecker, oneoff, httpbroker, Icinga, flapjackfeeder, Nagios, nagios-receiver), processors, gateways]

    Now we know what events and event producers are, we're going to connect one up to Flapjack.

    Quick show of hands: who here is using Nagios in production? Icinga? Sensu? Naemon? Something else?
  28. Edit the Nagios configuration

        vagrant ssh # in case you exited
        sudo $EDITOR /etc/nagios3/nagios.cfg
        sud
  29. # /etc/nagios3/nagios.cfg

        broker_module=/usr/local/lib/flapjackfeeder.o redis_host=localhost,redis_port=6380

    As Nagios is the unfortunate de facto standard for monitoring in our industry, we'll have a quick look at feeding events from Nagios into Flapjack.

    We ship two integrations for Nagios and its forks - flapjackfeeder, and nagios-receiver. flapjackfeeder is a Nagios Event Broker module written in C that pushes events, and nagios-receiver is a standalone tool that reads Nagios status from a named pipe.

    We recommend using flapjackfeeder, as Nagios has some pretty flaky behaviour when pushing status into a named pipe. Specifically, it'll fill up the pipe, and drop events on the floor.

    To plug Nagios into Flapjack, edit /etc/nagios3/nagios.cfg, uncomment this line, and restart Nagios. You should start seeing events come through to Flapjack.
  30. https://www.flickr.com/photos/lifesgood/3921282128/ https://www.flickr.com/photos/jessicafm/79895511/

    If you refresh the Flapjack "all entities" web page, you'll see results for both Nagios and Icinga in one place.

    This is one of the key things we want Flapjack to be - a single pane of glass for all your systems that can alert people.
  31. Constant heartbeat http://giphy.com/gifs/yeUxljCJjH1rW

    Flapjack expects a constant stream of heartbeat events.

    Your event producers need to be sending updates at a regular interval for Flapjack's alerting to kick in - even if your event producers haven't observed a change in state.
  32. [Diagram: event producers (Sensu, Sensu::Extension::Flapjack, httpchecker, oneoff, httpbroker, Icinga, flapjackfeeder, Nagios, nagios-receiver), processors, gateways]

    To illustrate this, you can see above that the event producers are submitting events to Flapjack at a regular interval, even though the individual event producers are performing their checks at different rates.
  33. How long has a check been failing?

    Flapjack asks a fundamentally different question to other alerting tools:

    how long has a check been failing?
  34. NOT "How many times has the check failed?"

    Not the question that Nagios and other tools of its ilk ask:

    how many times has the check failed?
  35. No SOFT / HARD states

    Soft/hard states are an alerting construct that Flapjack rejects.

    They only work if checks execute regularly, with no delay or latency. Our monitoring systems should aspire to consistent check execution throughput to ensure timely operator insight into potential failures, but that consistency is an unreliable thing to depend on for alerting.

    Any check execution latency impacts the timeliness of alerts.

    If you expect 100% of your checks to execute every 2 minutes, but a mass outage causes your checks to execute in twice the amount of time, you need to ensure 100% of your checks normally execute in under 1 minute.
  36. Flapjack cares about duration of a problem

    Flapjack will potentially alert a human if a problem has been detected for longer than 30 seconds.

    This duration is a global constant, and something we plan to make globally and per-check configurable in the 2.x series of Flapjack.
  37. Simple implementation (no timers, event driven)

    The main benefit of alerting based on problem duration is that it keeps the alerting logic inside Flapjack very simple, and thus easier for the operator to reason about.

    Flapjack reacts to each event as it comes in - it doesn't create any timers to check for event updates. Flapjack assumes it is being provided a constant stream of heartbeat events by upstream event producers.
  38. Works around check execution latency

    This means that when you do have latency in your upstream event producers, Flapjack will alert as soon as it's received more than 2 failure events in a window greater than 30 seconds.
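The duration-based approach can be sketched in a few lines of Ruby. This is an illustration under stated assumptions, not Flapjack's code: `DurationTracker` and its method names are hypothetical, the 30 second constant comes from the talk, and the sketch alerts whenever a failure event arrives for a check whose first failure is more than 30 seconds old.

```ruby
# A minimal sketch of duration-based alerting (hypothetical class).
class DurationTracker
  FAILURE_DELAY = 30 # seconds a problem must persist before alerting

  def initialize
    @failing_since = {} # check name => timestamp of the first failure event
  end

  # Handle one event and return true if a human should potentially be alerted.
  # No timers are involved: the decision is made only when an event arrives.
  def handle(check:, state:, time:)
    if state == 'ok'
      @failing_since.delete(check)
      return false
    end

    first_failure = (@failing_since[check] ||= time)
    time - first_failure > FAILURE_DELAY
  end
end

tracker = DurationTracker.new
# Events arrive at irregular intervals because the upstream checker is lagging.
results = [
  tracker.handle(check: 'app-01:HTTP', state: 'critical', time: 0),
  tracker.handle(check: 'app-01:HTTP', state: 'critical', time: 50),
  tracker.handle(check: 'app-01:HTTP', state: 'ok',       time: 70),
  tracker.handle(check: 'app-01:HTTP', state: 'critical', time: 80)
]
p results
```

Because the decision depends on elapsed time rather than on a count of failed executions, a delayed second event still triggers the alert as soon as it arrives.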
  39. Rollup http://www.flickr.com/photos/meltwater/420749031/sizes/o http://en.wikipedia.org/wiki/Broadcast_delay

    The focus on duration also makes it easier to implement rollup.

    Flapjack looks at the number of alerts that have been sent to each person within the alert summarisation window.
  40. No one-off events

    We said before that you can't really have any one-off events.

    We've recently added a mechanism in Flapjack to fake out one-off events.
  41. httpbroker

    The httpbroker provides a very simple HTTP endpoint you can post events to, and it handles sending a cached copy of those events to Flapjack at a regular interval.
  42. [Diagram: httpbroker, processors]

    The httpbroker:

    - accepts HTTP POSTs of state
    - caches them
    - emits the last cached value to Flapjack at a user-configurable interval
  43. TTLs

    The httpbroker also implements the concept of TTLs on user-submitted state.

    If the submitted state becomes older than the TTL, the httpbroker starts sending UNKNOWNs to Flapjack.

    This is useful if you have processes or systems that you expect to check in regularly, and you want to alert someone if this doesn't happen.

    For example, you might have a batch process that you know should take 3 hours to run. At the beginning of the batch process, you submit a state with a TTL of 12,600 seconds (3.5 hours). At the end of the process, you submit another state saying that the process completed.

    If the batch process doesn't complete within 3.5 hours, the TTL will expire, and anyone who has opted into alerts about that process will be notified.

    You can disable TTLs by setting the TTL value on state submission to 0.
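The caching and TTL behaviour described for the httpbroker can be modelled in plain Ruby. This is only a sketch of the idea: the `StateCache` class, its method names, and the lowercase state strings are assumptions made for the example, and the real httpbroker's HTTP interface is not shown.

```ruby
# Hypothetical model of a state cache with TTLs; not the real httpbroker.
class StateCache
  Entry = Struct.new(:state, :submitted_at, :ttl)

  def initialize
    @entries = {} # check name => last submitted Entry
  end

  # Record a submitted state. A TTL of 0 disables expiry, as in the talk.
  def submit(check, state, ttl:, now:)
    @entries[check] = Entry.new(state, now, ttl)
  end

  # The state that would be emitted to Flapjack at each interval.
  def emit(check, now:)
    entry = @entries.fetch(check)
    expired = entry.ttl > 0 && (now - entry.submitted_at) > entry.ttl
    expired ? 'unknown' : entry.state
  end
end

cache = StateCache.new
cache.submit('nightly-batch', 'ok', ttl: 12_600, now: 0) # batch starts

during_run  = cache.emit('nightly-batch', now: 3 * 3600) # 3 hours in
after_limit = cache.emit('nightly-batch', now: 4 * 3600) # 4 hours in, no completion submitted

cache.submit('no-expiry', 'ok', ttl: 0, now: 0)
never_expires = cache.emit('no-expiry', now: 1_000_000)

p [during_run, after_limit, never_expires]
```

The batch example from the notes maps directly onto this: the cached state keeps being emitted until 12,600 seconds have passed without a fresh submission, after which the emitted state flips to unknown.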
  44. Architecture

    Ok, let's take a step back and have a look at the architecture again.
  45. event producers ✔, processors, gateways

    So we've got three main building blocks:

    - event producers
    - processors
    - gateways

    We've looked at event producers in a bit of depth.

    Now we're going to have a look at the processors.
  46. [Diagram: event producers (Sensu, Sensu::Extension::Flapjack, httpchecker, oneoff, httpbroker, Icinga, flapjackfeeder, Nagios, nagios-receiver), processor, notifier, gateways]

    There are two types of processors:

    - processor
    - notifier

    Both rely heavily on Redis for doing their jobs, and communicating.
  47. [Diagram: event producers, processor (highlighted), notifier, gateways]

    • Processor: • Read events • Update state • Determine if notification should be sent

    The processor:

    - reads events from upstream event producers
    - updates internal Flapjack state, including various counters and event history
    - finally, determines if a notification should be sent
  48. [Diagram: event producers, processor, notifier (highlighted), gateways]

    • Notifier: • Reads notifications • Routes notifications • Dispatches alerts to gateways

    If a notification does need to be sent, the processor dispatches the notification to the notifier.

    The notifier:

    - reads notifications from the processor
    - routes notifications based on who is interested in the result, and how they want to be alerted about it
    - dispatches the alert to the gateways to do the actual message delivery
  49. Data model

    To understand the domain logic within the processors and notifiers, we need to take a quick high level look at how Flapjack models data internally.
  50. [Diagram: Contact, Media, Notification Rules, Entities, Checks, History (maintenance, acks, state changes)]

    Everything in Flapjack relates to contacts. A contact is a representation of a person (and potentially a group of people, if you're using PagerDuty as an alerting mechanism for Flapjack).

    Contacts have media. Media are mediums for contacting the contact.

    Notification rules encapsulate the logic for how people are alerted about failures.

    Entities are a grouping mechanism for checks. Entities can represent hosts, clusters, racks, datacenters, regions, galaxies, universes, realities.

    Checks are probes that determine apparent state.

    Flapjack retains history for all checks, acknowledgements, and maintenances.
  51. [Diagram: Processor → Notifier]

    Processor: find failing events, then emit a notification event.

    Notifier: run the notification event through the filters, then alert.

    - map: find people interested in entity
    - map: find media owned by people
    - reduce: delete media based on tags, severity, time of day
    - reduce: delete media based on blackholes
    - reduce: delete media based on notification intervals

    Intermediate results on the slide, narrowing as the filters run:

    - [ alice, bob, carol ]
    - [ [ alice, email ], [ alice, sms ], [ bob, email ], [ bob, sms ], [ carol, sms ] ]
    - [ [ alice, email ], [ alice, sms ], [ bob, sms ] ]
    - [ [ alice, sms ], [ bob, sms ] ]

    alert: [ [ alice, sms ], [ bob, sms ] ]
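The map and reduce stages on this slide can be sketched as a small Ruby pipeline. Everything here is made up to mirror the slide: the lookup tables, the `route` function, and the reasons each filter drops a pair are hypothetical, and real notification rules are far richer than these hard-coded lambdas.

```ruby
# Illustrative data only; names follow the slide.
INTERESTED = { 'app-01' => %w[alice bob carol] }.freeze
MEDIA = {
  'alice' => %w[email sms],
  'bob'   => %w[email sms],
  'carol' => %w[sms]
}.freeze

# Each filter takes a list of [contact, medium] pairs and returns a shorter list.
FILTERS = [
  # tags, severity, time of day: assume bob's email and carol's sms rules don't match
  ->(pairs) { pairs.reject { |pair| [%w[bob email], %w[carol sms]].include?(pair) } },
  # blackholes: assume none apply in this example
  ->(pairs) { pairs },
  # notification intervals: assume alice was emailed too recently
  ->(pairs) { pairs.reject { |pair| pair == %w[alice email] } }
].freeze

def route(entity)
  # map: entity -> interested people
  people = INTERESTED.fetch(entity, [])
  # map: people -> [contact, medium] pairs
  pairs = people.flat_map { |person| MEDIA.fetch(person, []).map { |medium| [person, medium] } }
  # reduce: run every filter in turn, dropping media as we go
  FILTERS.reduce(pairs) { |remaining, filter| filter.call(remaining) }
end

recipients = route('app-01')
p recipients
```

Modelling each rule as a function from a list of pairs to a shorter list keeps the routing logic easy to reason about: adding a new kind of filter means appending one more stage.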
  52. Get familiar with PRY

    PRY is a RubyGem that allows you to interrupt a running program and inspect its data.
  53. Start PRY

        vagrant ssh # in case you exited
        cd /vagrant/tutorial
        ./0-pry.rb
        # experiment with PRY
  54. Create the ALL entity

        ./1-create-entity-ALL.rb

    The ALL entity is a way of referring to all entities that Flapjack knows about.

    In Flapjack, contacts need to be linked to an entity to receive alerts for it.

    Maintaining contact-entity links can quickly get burdensome if you have large numbers of machines that come and go, say in an autoscaling group.

    Linking a contact to the ALL entity allows the contact to receive alerts for any entity Flapjack currently knows about.
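The effect of the ALL entity can be shown with a toy lookup in Ruby. This does not use flapjack-diner or touch Flapjack's data store: the `LINKS` table and `contacts_for` helper are hypothetical, and the sketch only assumes that a link to ALL matches every entity name.

```ruby
# Hypothetical link table: contact => entities the contact is linked to.
LINKS = {
  'ada' => ['ALL'],
  'bob' => ['app-01']
}.freeze

# Contacts who should hear about a given entity.
def contacts_for(entity)
  LINKS.select { |_contact, entities| entities.include?('ALL') || entities.include?(entity) }.keys
end

known_host = contacts_for('app-01')
new_host   = contacts_for('autoscaled-7f3a')
p known_host
p new_host
```

Ada hears about the freshly autoscaled host without anyone having to maintain a link for it, while Bob only hears about the one entity he is linked to.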
  55. Create a contact

        ./2-create-contact-ada.rb

    Now we're going to create a contact that represents Ada Lovelace.

    We have transported her forward in time, and given her your cell number.
  56. Add cell number

        ./3-update-contact-ada-sms.rb

    Now that Flapjack knows about Ada, we're going to assign your cell number to her.

    …Jesse: +1 347 784 4837

    => breakpoint 1: we've found the Ada contact, but there is no media associated yet

    => breakpoint 2: the media should now be set on the Ada contact
  57. Set up notification rules

        ./4-notification-rules.rb

    So now we have:

    - an ALL entity
    - a contact with a cell number, linked to the ALL entity

    The last piece of the puzzle is notification rules.

    Notification rules are used to refine which alerts end up being sent to the contact.

    => breakpoint 1: general_rules shows the general notification rule

    The general notification rule is a catchall rule for alerts that don't match any other notification rules.

    It's used for providing user-specific defaults for alerting.

    We can see there are no tags, entities, or regex tags or entities on this rule.

    => breakpoint 2: created_rule is a notification rule that matches things tagged with "http" and "response-time", from entities matching /-app-/.
  58. event producers ✔, processors ✔, gateways

    So we've got three main building blocks:

    - event producers
    - processors
    - gateways

    We've looked at event producers and processors in a bit of depth.

    Now we're going to have a look at the gateways.
  59. [Diagram: event producers (Sensu, Sensu::Extension::Flapjack, httpchecker, oneoff, httpbroker, Icinga, flapjackfeeder, Nagios, nagios-receiver), processor, notifier, gateways (Email, SMS, Jabber, PagerDuty, Web, API)]

    You can see the full array of event producers, and the different components that make up the processors.

    The last piece of the puzzle is the gateways.
  60. [Diagram: event producers, processor, notifier, gateways (Email, SMS, Jabber, PagerDuty, Web, API), Redis] http://tosbourn.com/wp-content/uploads/2013/12/redis-logo.png?e0df77
  61. Alert summarisation (rollup) http://www.flickr.com/photos/meltwater/420749031/sizes/o

    Flapjack also has basic rollup functionality.

    We call this alert summarisation, and it is configurable on a per-media, per-contact basis.
  62. • per media • per contact • thresholds http://www.flickr.com/photos/sdphotography/1570906849/sizes/o

    This means that every media can have different summarisation thresholds, which gives the operator the flexibility to tune how heavily they get notified when there are multiple failures.
  63. [Diagram: Contact, Media, Summary Threshold, Notification Rules, Entities, Checks, History (maintenance, acks, state changes)]

    The summary threshold is a piece of data that hangs off each media, so a contact has many summary thresholds through their media.
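A per-media threshold can be sketched in Ruby as follows. This is a simplified model, not Flapjack's logic: the `Medium` struct and `messages_for` helper are hypothetical, and the sketch assumes a medium switches to a single summary message once the number of failing checks reaches its threshold.

```ruby
# Hypothetical model: each medium carries its own rollup threshold.
Medium = Struct.new(:type, :rollup_threshold)

# Decide what to send on one medium given the checks currently alerting.
def messages_for(medium, failing_checks)
  return [] if failing_checks.empty?

  if medium.rollup_threshold && failing_checks.size >= medium.rollup_threshold
    ["#{medium.type}: #{failing_checks.size} checks are failing (summary)"]
  else
    failing_checks.map { |check| "#{medium.type}: #{check} is failing" }
  end
end

sms   = Medium.new('sms', 3)     # summarise once 3 or more checks are failing
email = Medium.new('email', nil) # no threshold: never rolled up
failing = %w[app-01:HTTP app-02:HTTP app-03:HTTP app-04:HTTP]

sms_messages   = messages_for(sms, failing)
email_messages = messages_for(email, failing)
puts sms_messages
puts email_messages
```

With four failing checks, the SMS medium gets one summary message while email still gets one message per check, which is the kind of per-media tuning the notes describe.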
  64. Hands on https://www.flickr.com/photos/andrearosephotography/5573685204/

    We're gonna do the final hands-on part of the tutorial now - simulating several failures, to trigger the alert summarisation for Ada in Flapjack.
  65. Trigger rollup

        ./5-rollup.rb

    This rollup demo is doing a couple of things:

    - simulates a single fail-and-recover for a check that matches tags that Ada is interested in
    - waits for 30 seconds
    - simulates 4 more fail-and-recovers for checks that match tags Ada is interested in

    After a little over a minute, you should start seeing summary alerts coming through, and a recovery.
  66. Links

    • github.com/flapjack
    • github.com/flapjack/flapjack
    • github.com/flapjack/flapjack-diner
    • github.com/flapjack/vagrant-flapjack
    • github.com/flapjack/omnibus-flapjack
    • github.com/flapjack/flapjack.io
  67. Performance

    How many events per second can Flapjack handle?

    Depends on the hardware. Flapjack's bottlenecks are in the processor and notifier.

    Because both of these components are written in Ruby and run on CRuby (2.1.2), they are limited to a single core.
  68. [Diagram: event producers, multiple processor instances, notifier, gateways]

    The good news is that Flapjack's architecture allows multiple instances of the processors and notifiers to run.

    You just spin more up, and they all talk to the same Redis.
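The "spin more up" idea is the competing consumers pattern: several workers pull from one shared queue, and each event is handled by exactly one of them. The Ruby sketch below is only an analogy under stated assumptions. An in-process `Queue` stands in for the shared Redis queue, and threads stand in for what would be separate processor processes; none of this is Flapjack code.

```ruby
require 'thread'

# Stand-in for the shared event queue (Redis in the real architecture).
events = Queue.new
20.times { |i| events << "event-#{i}" }

handled = Queue.new

# Three workers stand in for three processor instances.
workers = 3.times.map do |worker_id|
  Thread.new do
    loop do
      event = begin
        events.pop(true) # non-blocking pop; raises ThreadError when the queue is empty
      rescue ThreadError
        break
      end
      handled << [worker_id, event]
    end
  end
end
workers.each(&:join)

processed = Array.new(handled.size) { handled.pop }
puts "handled #{processed.size} events"
```

Which worker handles which event is not deterministic, but every event is taken off the queue once, so adding workers adds capacity without any coordination beyond the queue itself.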
  69. Future

    JRuby? Limited by EventMachine.

    EM has been dropped in the 2.0 branch in favour of pure Ruby threads.

    2.0 will be functionally equivalent to the 1.x series, just with a better technical foundation.
  70. [Diagram: event producers, multiple processor instances, notifier, gateways]

    Flapjack's architecture addresses many availability concerns at the same time as the performance concerns.

    Just spin up more processors, notifiers, and gateways for redundancy.

    They'll all operate in an active-active mode, so you need to be careful about the extra demand a component failure will place on the rest of the system.
  71. There is one glaring failure with Flapjack's architecture though: its reliance on Redis.

    Redis does not have a good story when it comes to availability, so we're essentially limited by Redis's capabilities.

    Longer term we're trying to reduce our reliance on Redis for everything but working memory. We're looking at other data stores for storing historical state, and potentially contacts and other core data model objects.