Sensu @ Yelp - A Guided Tour

Yelp uses Sensu for its monitoring needs to great effect.

Sensu provides an easy-to-use API to integrate with all our stuff, and makes it easy for developers to emit events without dealing with "Operations", amirite? (Spoiler alert: I'm on the Operations team.)

Video Part 1: https://vimeo.com/92770954
Video Part 2: https://vimeo.com/92838680

Kyle Anderson

April 24, 2014
Transcript

1. Disclaimer

I'm just a dude. I know that when I watch a presentation by a company that I recognize, I think to myself, "Hmm, $company, I've heard of them. They probably have their stuff together. Let's see what they do…" I'm here to describe, not persuade. I may not have everything together. Just because I have things with "Unit Tests" doesn't mean I'm "Right". Especially with a "framework" like Sensu, there can be more than one way to do things. The trick is figuring out what works for you. I hope that by giving a real, concrete example, you might be inspired to step up your monitoring game.
2. Outline

1. Overall Architecture
2. Sensu Server Setup
   a. Custom Base Handler
3. Client Configuration
   a. Sensu Check Puppet Wrapper
4. Yelp SOA Checks
5. AWS/Cloudwatch Checks
6. Dealing with Ephemeral EC2 Servers
7. Cron Job Monitoring
8. Future Work
3. Overall Architecture

• profile::sensu_client
  ◦ Sensu clients connect to RabbitMQ on one of the servers (DNS Round Robin)
• profile::sensu_server
  ◦ Base HAProxy install
  ◦ RabbitMQ in mirror mode, load balanced via HAProxy
  ◦ Redis in master/slave mode, load balanced via HAProxy (only the master passes the healthcheck)
  ◦ Sensu server installed, subscribes on RabbitMQ
  ◦ API load balanced via HAProxy
  ◦ Dashboard load balanced by HAProxy
4. Addressing Complexity

"Sensu has so many moving parts that I wouldn't be able to sleep at night unless I set up a Nagios instance to make sure they were all running."

Laurie Denness
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
5. Addressing Complexity

"I will be honest; I haven't used Sensu, because I'm in a happy place right now, but just the architectural diagram of how it works scares the shit out of me. When you need 7 arrow colours to describe where data is going in a monitoring system, I'm starting to fear it slightly. But hey, if it works, good on you guys. It just looks a lot like this. Nothing wrong with that, if you can make it stable and reliable."

Laurie Denness
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
6. Pop Quiz: Which Servers Are Puppetmasters?

• A: Puppet manifests (include puppetmaster)
• B: DNS (puppet.local A 10.5.x.x)
• C: update-live script (for Server in ….)
• D: The servers that have had the puppetmaster bootstrap script run on them
• E: What MCollective says (mco find -C puppetmaster)

Answer: All / None of the above!
7. Sensu Server Detection

# Use DNS to detect if this server is a sensu server
$local_sensu_server_array = gethostbyname2array("sensu.local-${::habitat}.yelpcorp.com")
$ip_address_array = split($::all_ipaddresses, ',')
validate_array($local_sensu_server_array)
validate_array($ip_address_array)
$array_intersection = intersection($ip_address_array, $local_sensu_server_array)
# If our ipaddresses are in the dns entries, we must be a sensu server!
if size($array_intersection) > 0 {
  $is_sensu_server = true
} else {
  $is_sensu_server = false
}
8. HAProxy

• Every server in the sensu cluster runs its own HAProxy
• HAProxy listens on the "standard" ports; individual instances listen on standard + 1
• Having an array of sensu servers from DNS allows us to grow the backends
• If HAProxy dies, clients will re-resolve and reconnect
9. RabbitMQ

• Every server in the sensu cluster runs a rabbitmq server in mirror mode (with autoheal for AP)
• Lots of individual clusters; not doing shoveling
• Client authentication via SSL client certs (controlled by puppet)
• Load balanced by HAProxy
• Sensu clients automatically reconnect on failure
10. Redis

• Redis is the persistent store used by Sensu to keep track of heartbeats, which alerts are silenced, how many times a check has failed, etc.
• Redis is set up in a cluster mode, with redis-sentinel doing automatic master/slave promotion (kinda CP)
• We use the redis-role haproxy master pattern suggested at http://failshell.io/sensu/high-availability-sensu/
11. Sensu API + Dashboard

• sensu-api provides a REST API with JSON output for integration
• sensu-cli is provided for easy command-line interactive use
• Both the API and Dashboard use basic auth internally (shared secret), and LDAP+SSL auth externally
• sensu-dashboard uses this API, and sits behind our external-facing apache for authentication
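To make that integration point concrete, here is a minimal sketch of pulling current events from the Sensu API's /events endpoint in Ruby. The hostname, port, and credentials are placeholder assumptions, not Yelp's real values.

#!/usr/bin/env ruby
# Minimal sketch: list current events via the Sensu API (0.x) /events endpoint.
require 'net/http'
require 'json'

uri = URI('http://sensu.example.com:4567/events')
request = Net::HTTP::Get.new(uri)
request.basic_auth('sensu', 'shared-secret') # hypothetical internal basic auth

response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
JSON.parse(response.body).each do |event|
  puts "#{event['client']['name']}/#{event['check']['name']}: #{event['check']['output']}"
end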
12. Sensu Servers

• Automatically does master election, good. Build for 3.
• Connects to RabbitMQ, pulls events off and acts on them
• Runs "handlers" on the event data
• That's kinda it
• Which leads to handlers…
13. Sensu Timing Tunables: Before/After

Custom check definition key-values: custom key-values can be added to a check definition and will be included in event data, enabling handler creativity. Common custom check definitions:
• interval: how frequently (in seconds) the check will be executed
• occurrences: number of event occurrences before the handler should take action
• refresh: number of seconds handlers should wait before taking a second action (relies on sensu-plugin)

Yelp monitoring check definition key-values, interpreted by the custom base handler:
• check_every = '5m'
• alert_after = '0s'
• realert_every = '1'
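For flavor, a sketch of the kind of conversion the wrapper performs to turn those human-readable durations into the interval seconds Sensu expects. The helper name and parsing here are assumptions for illustration, not Yelp's actual code.

# Hypothetical helper: convert '5m'-style durations to seconds.
UNIT_SECONDS = { 's' => 1, 'm' => 60, 'h' => 3600, 'd' => 86_400 }.freeze

def human_to_seconds(duration)
  match = /\A(\d+)([smhd])\z/.match(duration.to_s)
  raise ArgumentError, "bad duration: #{duration.inspect}" unless match
  match[1].to_i * UNIT_SECONDS[match[2]]
end

human_to_seconds('5m') # => 300, becomes the check's "interval"
human_to_seconds('4h') # => 14400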
14. Custom Base Handler

def filter_repeated
  interval = @event['check']['interval'] || 0
  alert_after = @event['check']['alert_after'] || 0
  realert_every = @event['check']['realert_every'] || 1
  failing_for = @event['occurrences'].to_i * @event['check']['interval'].to_i
  if failing_for < alert_after
    bail "Only failing for #{failing_for}, less than #{alert_after}. Not performing any action yet."
  elsif interval > 0 and @event['action'] == 'create'
    initial_failing_occurrences = alert_after.fdiv(interval).to_i
    number_of_failed_attempts = @event['occurrences'] - initial_failing_occurrences
    unless number_of_failed_attempts == 0 || number_of_failed_attempts % realert_every == 0
      bail "Only handling every #{realert_every} occurrences"
    end
  end
end
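A worked example of the filter above (values are illustrative): with interval = 60 (check_every '1m'), alert_after = 600, and realert_every = 3, initial_failing_occurrences is 600 / 60 = 10. The handler bails for the first nine failing occurrences, acts on the 10th, and then re-acts only on occurrences 13, 16, 19, and so on.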
15. Other Handlers In Use

• IRC (triaged by whoever is "on-point")
• Email (not a thing)
• Pagerduty (handled by "on-call")
• OpsGenie (trialing)
• aws_prune (only on ec2 nodes)
• motd (sensu-report; not really a handler, used for situational awareness)

Future Handlers
• JIRA (auto create/close a ticket after a while?)
• Flapjack?
16. Sensu Clients

• Almost every server @yelp runs the sensu client (thank you, omnibus packages!)
• They connect to the round-robin DNS entry local to their zone
• All checks are standalone, configured by puppet
17. Monitoring Check Puppet Wrapper

define monitoring_check (
  $command,
  $runbook,                      # mandatory runbook!
  $check_every   = '5m',         # human-readable time units!
  $alert_after   = '0s',
  $realert_every = '1',
  $irc_channels  = undef,
  $tip           = false,        # TIP: the one-line runbook for lazy humans!
  $page          = false,
  $wake          = true,
  $needs_sudo    = false,        # easy to add sudo rules!
  $sudo_user     = 'root',
  $team          = 'operations', # defaults to ops for convenience; usually set to $::profile::server::team
  $ensure        = 'present',
  $dependencies  = [],
  $sensu_custom  = {},
) {
  …… # Lots of validation. Lots of tests.
}
18. Monitoring Check Puppet Wrapper Example

# Make sure apt-mirroring is working by checking the age of the NEW file left over.
monitoring_check { 'apt-mirror':
  check_every => '4h',
  team        => 'operations',
  page        => false,
  runbook     => 'y/rb-package-mirroring',
  tip         => 'Talk to kwa. Check /var/spool/apt-mirror/var/cron.log, then /nail/apt-mirror/var/apt-mirror.lock.',
  command     => '/usr/lib/nagios/plugins/check_file_age /nail/apt-mirror/var/NEW -w 86400 -c 172800',
}
19. Why Not Use the Native Puppet Type?

• The wrapper reduces the boilerplate and gives good defaults
• Enforces site-specific policies and validation (team names, mandatory runbooks)
• Allows us to modify all puppet-controlled sensu checks in the future from a single spot
• Custom tests
• Allows us to be backend agnostic (maybe)
20. Yelp SOA Checks

• How do we (Yelp) empower our developers to monitor their services?
• How can we safely and conveniently allow devs to define checks within our SOA framework?
• How can devs avoid being blocked by Ops for service deployment?
21. Define the Meta Check

# Defined on all hosts that run yelp SOA infrastructure
monitoring_check { 'check-yelp_soa':
  check_every => '1m',
  alert_after => '10m',
  page        => true,
  runbook     => 'http://y/rb-check-yelpsoa',
  tip         => 'Run /etc/sensu/plugins/check-yelp_soa.rb --debug to see what is wrong?',
  command     => '/etc/sensu/plugins/check-yelp_soa.rb',
  require     => Class['::yelp_soa'],
}
22. check-yelp_soa.rb redux

def run
  # TODO: Parallelize?
  configs.each do |service, config|
    next unless services_that_run_here.include?(service)
    $log.debug "Processing #{service} as apparently it runs here"
    srv_configs = read_srv_configs(service)
    next unless srv_configs.include?('monitoring_check')
    monitoring_check = srv_configs['monitoring_check']
    if numeric?(config['port'])
      ...
      if command == 'check_http'
        url = monitoring_check['check_url'] || '/status'
        $log.debug "Making a http check for #{service}, team: #{team}, warn_timeout: #{warn_timeout}, crit_timeout: #{crit_timeout}"
        output, status = check_http(port, url, http_expect, warn_timeout, crit_timeout)
      elsif monitoring_check['command'] == 'check_tcp'
        $log.debug "Making a tcp check for #{service}, team: #{team}, warn_timeout: #{warn_timeout}, crit_timeout: #{crit_timeout}"
        output, status = check_tcp(port, warn_timeout, crit_timeout)
      else
        $log.debug "Not spawning a check for #{service} because I don't know how to run #{command}"
        next
      end
      send_result_to_sensu(service, status, output, team, runbook, tip, page, alert_after, realert_every, irc_channels)
      services_checked << service
    end # End port check
  end # End for loop
  ok "Finished run. Ran checks on #{services_checked}"
end
23. What Was That?

1. Iterate through the SOA services that are configured to run on a server.
2. Determine if that service has monitoring metadata defined by the authors.
3. Operate on that metadata to check it (usually check_http).
4. Send the results of the check to the localhost:3030 socket as a *different* check ("soa_$servicename").

See https://gist.github.com/joemiller/5806570 for another example.
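Step 4 relies on the sensu-client input socket, which accepts JSON check results on localhost:3030. A minimal sketch of that step; the service name, runbook link, and values are made up, and the extra keys ride along in event data for the custom base handler.

require 'json'
require 'socket'

# Illustrative result for a synthetic "soa_<servicename>" check.
result = {
  'name'    => 'soa_request_blocking',
  'status'  => 0, # 0 = OK, 1 = WARNING, 2 = CRITICAL
  'output'  => 'HTTP OK: /status answered in 0.12s',
  'team'    => 'infra',
  'runbook' => 'http://y/rb-soa', # hypothetical runbook link
  'page'    => false,
}

# sensu-client reads JSON results from its local TCP socket.
TCPSocket.open('localhost', 3030) { |sock| sock.puts(result.to_json) }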
24. An Example Service (request_blocking)

# from request_blocking.yaml
monitoring_check:
  team: 'infra'
  alert_after: 2m
  realert_every: 2
  irc_channels: 'infra'
  url: '/status'
  tip: "no tips yet"
  warn_timeout: 2.0
  crit_timeout: 5.0
25. AWS/Cloudwatch Checks

• Pretty much the same thing, except:
• Checks are executed on special monitoring hosts in the AZ (not on the ephemeral node)
• Runs graphite/check_data.rb against the provided metric name
• Written in python this time! (https://pypi.python.org/pypi/sensu)
26. Dealing with Ephemeral EC2 Nodes

• Yelp lives in a hybrid world; we have lots of "ephemeral" EC2 nodes that are baked and do NOT run puppet. Can Sensu still work on them?
• How do we prevent ourselves from being spammed when hosts go away "normally"?
• How do we know what a host is without logging into it? (EC2 metadata)
• Baking………..
27. EC2 Considerations

• We use puppet to bake AMIs for ELBs, so we can control (via puppet) how Sensu is configured at bake time.
• We can query the AWS API to know if a host has gone away, and prune it from the queue to squelch alerts.
• Using custom client metadata, we can add things like the puppet cert name, AMI ID, etc. at runtime with a special init script.
28. For Non-Ephemeral Instances

# Only EC2 servers need the special aws_prune handler.
if str2bool($::is_ec2) == true {
  $client_custom = {
    # instance_id is a Fact! Embed it for easy troubleshooting.
    'instance_id' => $::ec2_instanceid,
    'keepalive'   => {
      'handlers' => [ 'aws_prune', 'default' ],
      'team'     => $team,
      'page'     => true,
    },
  }
} else {
  $client_custom = {
    'team' => $team,
    'page' => true,
  }
}
29. For Ephemeral (Baked) Instances

description "Fix Sensu clientinfo on startup for baked ec2 instances"
author "Kyle Anderson <[email protected]>"
# Only run once, right before sensu-client starts
start on starting sensu-client
task
script
  # Real data from the metadata service. Can't lie.
  ADDRESS=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
  AMI_ID=$(curl -s http://169.254.169.254/latest/meta-data/ami-id)
  INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
  # jq FTW
  /usr/bin/jq ".client.name = \"$(/usr/local/sbin/puppet-certname)\" | .client.address = \"$ADDRESS\" | .client.instance_id = \"$INSTANCE_ID\" | .client.ami_id = \"$AMI_ID\"" /etc/sensu/conf.d/client.json > /etc/sensu/conf.d/newclient.json
  # Overwrite what we were baked with. It is wrong.
  mv /etc/sensu/conf.d/client.json /etc/sensu/conf.d/client.json.old
  mv /etc/sensu/conf.d/newclient.json /etc/sensu/conf.d/client.json
end script
30. Pruning Terminated EC2 Nodes

• Modification of https://github.com/sensu/sensu-community-plugins/blob/master/handlers/other/ec2_node.rb
• Instead, we use a cron job to cache the results of the API call into JSON, so we can be nice to AWS
• Then we can have *every* check use this handler, as it is easy to just check on disk whether the instance_id is active
• Use the instance_id from the client data to figure out who you are (which should be correct from the above)
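A sketch of what such a caching cron job might look like with fog (the gem the puppet on the next slide requires). Argument handling is simplified to environment variables, and the script keeps only the fields the handler later consumes; this is an illustration, not Yelp's actual cache_instance_list.rb.

#!/usr/bin/env ruby
# Hypothetical sketch: dump running EC2 instances to JSON on disk so the
# aws_prune handler never has to call AWS itself.
require 'fog'
require 'json'

compute = Fog::Compute.new(
  :provider              => 'AWS',
  :aws_access_key_id     => ENV.fetch('AWS_ACCESS_KEY'),
  :aws_secret_access_key => ENV.fetch('AWS_SECRET_KEY'),
  :region                => ENV.fetch('AWS_REGION', 'us-west-1'),
)

# Keep only the instance id and tags the handler consumes.
instances = compute.servers.select { |s| s.state == 'running' }
                           .map { |s| { 'id' => s.id, 'tags' => s.tags } }

# Path matches the staleness check on the next slide.
File.write('/var/cache/instance_list.json', JSON.pretty_generate(instances))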
31. What Does It Look Like?

file { '/etc/sensu/plugins/cache_instance_list.rb':
  owner  => 'root',
  group  => 'root',
  mode   => '0500',
  source => 'puppet:///modules/profile/sensu/handlers/cache_instance_list.rb',
} ->
cron::d { 'cache_instance_list':
  minute  => '*',
  user    => 'root',
  command => "/etc/sensu/plugins/cache_instance_list.rb -a ${access_key} -r ${region} -k ${secret_key}",
} ->
monitoring_check { 'cache_instance_list-staleness':
  check_every => '10m',
  alert_after => '1h',
  team        => 'test',
  runbook     => 'y/rb-aws-prune',
  command     => '/usr/lib/nagios/plugins/check_file_age /var/cache/instance_list.json -w 1800 -c 3600',
  page        => false,
}
32. The Handler (puppet)

$access_key = hiera('sensu::aws_key')
$secret_key = hiera('sensu::aws_secret')
$aws_config_hash = {
  access_key           => $access_key,
  secret_key           => $secret_key,
  region               => $region,
  blacklist_name_array => [ 'bake_soa_ami', 'Packer Builder' ],
}
sensu::handler { 'aws_prune':
  type    => 'pipe',
  source  => 'puppet:///modules/profile/sensu/handlers/aws_prune.rb',
  config  => $aws_config_hash,
  require => [
    Package['rubygem-fog'],
    Package['rubygem-sensu-plugin'],
    Package['rubygem-unf'],
  ],
}
33. The Handler (Ruby)

def ec2_node_exists?
  running_instances = load_instances_cache
  instance_ids = running_instances.collect { |s| Hash['id', s['id'], 'tags', s['tags']] }
  my_instance_id = @event['client']['instance_id']
  instance_ids.each do |instance|
    # YELP SPECIFIC CODE
    instance_name = instance['tags']['Name'].to_s
    # Yelp specific: pretend that a node does not exist if it is in our blacklist
    next if blacklist_name_array.include?(instance_name)
    return true if my_instance_id == instance['id']
  end
  return false # no match found, node doesn't exist
end
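load_instances_cache is not shown on the slide; a plausible sketch, assuming the cron job writes /var/cache/instance_list.json as above:

require 'json'

# Hypothetical helper: read the instance list cached on disk by cron.
def load_instances_cache
  JSON.parse(File.read('/var/cache/instance_list.json'))
end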
34. Cron Job Monitoring

• I believe cron sending emails is an anti-pattern and not *web-scale*
• Let's use Sensu to monitor our cron jobs!
• Use a combination of a cron puppet type wrapper and my Sensu-Shell-Helper
• The modified sensu-shell-helper includes fields for team and page for yelp-specific things: https://github.com/solarkennedy/sensu-shell-helper
35. What Does It Look Like?

$command = 'chgrp -R admin /nail/packages/'
cron::d { 'fix-packages-permissions':
  ensure  => 'present',
  mailto  => '',
  minute  => '10',
  user    => 'root',
  comment => 'Make permissions group writable for collaboration purposes',
  command => "sensu-shell-helper -n fix-packages-permissions -p false -t operations ${command}",
}

See https://github.com/torrancew/puppet-cron#cronjob for related work.
36. Future Work

• battle-test more of the pagerduty stuff (blocked on bogus aws nodes still)
• sort out AWS pruning, harder (#61626)
• make tools that work on nagios *and* sensu?
• really monitor the sensu instances in nagios with alerts (#60164)
• enable self-serve sensu alerts for services (#62201)
• make a library for sending passive checks (#62440)
• set up infrastructure for "aggregate" checks (cluster checks)
• better test the alerting tunables we have (#61628)
• enable sensu alerts for Asgardy services (#57450)
• set up easy-to-use metric-based alerting (like horsefly, blocked on #67000)
• write my sensu-downtime tool
• write a super-dashboard (hackathon)
• write the sensu archive service (sensu-db?)