How Yelp Uses Sensu to Monitor Things in a SOA World

Kyle Anderson
September 29, 2015

Video: https://vimeo.com/141231345

In this talk I describe how Yelp uses the dynamic monitoring abilities of Sensu to monitor services that are dynamically deployed by Mesos and dynamically routed by Smartstack.

Transcript

  1. Outline • Let’s visit the dark ages • How Sensu Works • Special (open source) Yelp + Sensu Sauce • Mini-Demo • How PaaSTA Uses Sensu • Second Demo
  2. The Dark Ages • One Word: Nagios • Monitoring for Services: “Also Nagios” • Alerts probably go to Ops anyway • Probably just making sure the LB is up • Very little developer visibility • Hard to articulate to Nagios what you want
  3. An Aside: Map Versus Territory • Territory: The actual things in production running right now • Map: What your monitoring system *thinks* is running right now • Who/What keeps these in sync?
  4. How Sensu Works (diagram) • Client, on some host: publishes check results to RabbitMQ • Server, reading from RabbitMQ: “Any events for me to handle?” • Clients execute checks • Servers don’t know what checks exist beforehand, they just operate on events
  5. How Sensu Works - In Words • Clients can schedule and execute checks, but just put the results on the queue • Servers handle results off the queue, route them to things like email, PagerDuty, JIRA, etc. • Also API, CLI, check history, silencing, dashboard, etc.
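
The client/server split on slides 4 and 5 can be illustrated with a small in-process simulation. This is only a sketch: `queue.Queue` stands in for RabbitMQ, and the names `client_run_check`, `server_drain`, and `email_handler` are made up for illustration and are not part of Sensu.

```python
import queue

# Stand-in for the RabbitMQ transport between clients and servers (assumption).
results = queue.Queue()

STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def client_run_check(name, check_fn, team):
    """Client side: run a check locally and publish the result to the queue."""
    status, output = check_fn()
    results.put({"name": name, "team": team, "status": status, "output": output})


def server_drain(handlers):
    """Server side: take results off the queue and pass each one to every handler.

    The server has no list of checks; it only reacts to whatever arrives.
    """
    handled = []
    while not results.empty():
        event = results.get()
        for handler in handlers:
            handled.append(handler(event))
    return handled


def email_handler(event):
    # Hypothetical handler: formats a message instead of actually sending mail.
    state = STATUS_NAMES.get(event["status"], "UNKNOWN")
    return f"email to {event['team']}: {event['name']} is {state}"


client_run_check("disk_usage", lambda: (2, "disk 97% full"), team="ops")
messages = server_drain([email_handler])
print(messages)
```

Because the server only sees events, a new check can appear on any client without the server being reconfigured, which is the property the rest of the talk builds on.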
  6. Special (Open Source) Yelp-Sensu Sauce • https://github.com/Yelp/sensu_handlers • “Smart” handlers that respond to Sensu events based on the event data • Team is the “primary key” when determining what to do
  7. Declare Your Teams

        sensu_handlers::teams:
          dev:
            pagerduty_api_key: 1234
            pages_irc_channel: 'dev1-pages'
            notifications_irc_channel: 'devs'
          ops:
            pagerduty_api_key: 78923
            pages_irc_channel: 'ops-pages'
            notifications_irc_channel: 'operations-notifications'
            notification_email: 'operations@localhost'
            project: OPS
          hardware:
            # Uses the ops Pagerduty service for page-worthy events,
            # but otherwise just jira tickets
            pagerduty_api_key: 78923
            project: METAL
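
To make “team is the primary key” concrete, here is a rough sketch of how a handler might turn an event plus the team data from slide 7 into a list of notifications. The routing rules in `plan_notifications` are an assumption made for illustration; the real logic lives in the sensu_handlers repository and may differ.

```python
# Team data copied from slide 7, expressed as a Python dict.
TEAMS = {
    "dev": {
        "pagerduty_api_key": 1234,
        "pages_irc_channel": "dev1-pages",
        "notifications_irc_channel": "devs",
    },
    "ops": {
        "pagerduty_api_key": 78923,
        "pages_irc_channel": "ops-pages",
        "notifications_irc_channel": "operations-notifications",
        "notification_email": "operations@localhost",
        "project": "OPS",
    },
    "hardware": {"pagerduty_api_key": 78923, "project": "METAL"},
}


def plan_notifications(event, teams=TEAMS):
    """Return (channel, target) pairs for an event. Hypothetical rules."""
    team = teams.get(event.get("team"))
    if team is None:
        return [("drop", "unknown team")]
    actions = []
    if event.get("page") and "pagerduty_api_key" in team:
        # Page-worthy: page the team and mention it in the pages channel.
        actions.append(("pagerduty", team["pagerduty_api_key"]))
        if "pages_irc_channel" in team:
            actions.append(("irc", team["pages_irc_channel"]))
    elif "notifications_irc_channel" in team:
        actions.append(("irc", team["notifications_irc_channel"]))
    if event.get("ticket") and "project" in team:
        actions.append(("jira", team["project"]))
    if "notification_email" in team:
        actions.append(("email", team["notification_email"]))
    return actions


print(plan_notifications({"team": "hardware", "page": False, "ticket": True}))
```

The event only needs to name its team; everything about where the alert goes is looked up from the shared team declaration.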
  8. Mini-Demo • What does it look like when you can dynamically define checks on Sensu clients in a team-centric way?
  9. What just happened?

        {
          "name": "test_alert_for_kwa",
          "team": "kwa",
          "irc_channels": [],
          "notification_email": "[email protected]",
          "ticket": false,
          "project": false,
          "page": false,
          "output": "Test output from send-test-sensu-alert",
          "status": 2,
          "command": "send-test-sensu-alert"
        }
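
An event like the one on slide 9 can be built and handed to a local Sensu client from any script. The sketch below assumes the Sensu client's local input socket is listening on 127.0.0.1:3030, which is an assumption about your setup; `build_event` and `send_event` are hypothetical helper names, not the `send-test-sensu-alert` tool from the demo.

```python
import json
import socket


def build_event(name, team, output, status, **overrides):
    """Build a check result with team-centric routing fields."""
    event = {
        "name": name,
        "team": team,
        "output": output,
        "status": status,  # 0 = OK, 1 = warning, 2 = critical
        "page": False,
        "ticket": False,
        "irc_channels": [],
    }
    event.update(overrides)
    return event


def send_event(event, host="127.0.0.1", port=3030):
    """Send the event as JSON to a local Sensu client socket (assumed address)."""
    payload = json.dumps(event).encode("utf-8")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


event = build_event(
    "test_alert_for_kwa", "kwa", "Test output from send-test-sensu-alert", 2
)
print(json.dumps(event, indent=2))
# send_event(event)  # uncomment on a host that runs a Sensu client
```

Nothing about this check was registered in advance: the client forwards the result and the handlers decide what to do from the fields in the event.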
  10. How PaaSTA Uses Sensu • Take advantage of Sensu’s ability to receive arbitrary events • We already know which team owns each service (started documenting that with the soa-configs) • We already know where services are deployed and what latency zones they are in
  11. Sensu + PaaSTA Demo • What if your monitoring system knew all about your services and how they are supposed to be deployed?
  12. What just happened? • We “went behind PaaSTA’s back” to simulate a failure of an AZ • We got a replication alert because one of the latency zones didn’t meet our expected replication count (0 out of 3) • We decided to “remediate” it by expanding our latency zone to “region” • PaaSTA “made it so”, and our alert resolved and the status command reflected the fact that we are expecting 6 in that one region
  13. How Did Sensu “Know”? • Sensu doesn’t “know” anything except for the “Teams” metadata hash • PaaSTA checks HAProxy in each latency zone because it can read the same SOA configs that SmartStack does! • PaaSTA “knows” which team owns each service because we told it in SOA configs! • Sensu just processes the event like normal
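
The replication alert from the demo boils down to comparing an observed instance count per latency zone against the expected count, then emitting an ordinary Sensu event. This is a simplified sketch, not PaaSTA's actual check: the observed counts are passed in directly (the real check reads them from HAProxy), and the function name, event name format, and zone names are made up.

```python
def check_replication(service, team, observed, expected_per_zone):
    """Return a Sensu-style event for a service's replication.

    observed maps each latency zone to the number of healthy instances
    seen there; expected_per_zone is the required minimum in every zone.
    """
    under = {zone: n for zone, n in observed.items() if n < expected_per_zone}
    if under:
        details = ", ".join(
            f"{zone}: {n} out of {expected_per_zone}"
            for zone, n in sorted(under.items())
        )
        status, output = 2, f"{service} is under-replicated ({details})"
    else:
        status = 0
        output = f"{service} has at least {expected_per_zone} instances in every zone"
    return {
        "name": f"replication_{service}",  # hypothetical naming scheme
        "team": team,  # looked up from service metadata in the real system
        "status": status,
        "output": output,
    }


event = check_replication(
    "example_service", "dev", {"zone-a": 3, "zone-b": 0}, expected_per_zone=3
)
print(event["output"])
```

Widening the latency zone to the whole region, as in the demo, corresponds to calling the check with a single zone and a larger expected count, for example `{"region": 6}` with `expected_per_zone=6`.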
  14. Conclusion • Use a monitoring system that can receive and process arbitrary events for easy integration (Sensu) • Keep service metadata in an easy-to-access place for pieces to integrate easily (SOA configs) • Monitor the exact thing you care about (replication in each latency zone)
  15. Reading Comprehension Question: (What was the purpose of this talk?) A. To describe how cool Sensu is B. To make viewers feel inadequate about their own Nagios installation C. To tease viewers about Sensu glue that is not open source yet D. To inspire viewers to build their own dynamic monitoring based on some of these ideas! E. Other?
  17. Tools Used: • Sensu: https://sensuapp.org/ • Yelp’s Sensu Handlers: https://github.com/Yelp/sensu_handlers • Mesos: http://mesos.apache.org/ • Marathon: https://mesosphere.github.io/marathon/ • Smartstack: http://nerds.airbnb.com/smartstack-service-discovery-cloud/