Who Watches The Watchmen?

This is a talk that I gave at DevOps Days Austin (and previously at the Advanced AWS meetup in SF) about how PagerDuty monitors PagerDuty.

Arup Chakrabarti

May 05, 2014

Transcript

  1. PagerDuty Arup Chakrabarti OPERATIONS ENGINEERING @arupchak arup@pagerduty.com DevOps Days Austin

  2. PagerDuty Who watches the Watchmen?

  3. PagerDuty Ops Guys know all too well... What is PagerDuty?

    • Alert and Incident Tracking
    • On-Call Management
    • Integrates with monitoring tools
    • Alert the right person, every time
  4. PagerDuty What is PagerDuty?

  5. PagerDuty What we will cover

    • What is PagerDuty? (DONE!)
    • Monitoring philosophies
    • Monitoring tools we use
    • Distributed Systems Monitoring
    • Security Monitoring
    • Dependency Monitoring
    • How we cheat by using Chef
    • How we validate our monitoring
    • Q and A
  6. PagerDuty I did not come up with everything Disclaimer

  7. PagerDuty Thou shall….. Monitoring Philosophies

    • Use the right tool
    • Not all monitoring tools are for everything
    • Avoid single host monitoring
  8. PagerDuty Thou shall….. Monitoring Philosophies

    • Alert on what your customers care about
    • Alert on expected thresholds (both high and low)
    • Make it as Self Service as possible
    • Validate that your alerts work
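
The "both high and low" rule above can be sketched as a band check. This is only an illustration: `Band`, `band_status`, and the request-rate numbers are hypothetical and do not come from the talk.

```ruby
# Hypothetical helper: flag a metric that leaves its expected band in
# either direction, not only when it climbs too high.
Band = Struct.new(:low, :high)

def band_status(value, band)
  return :too_low if value < band.low
  return :too_high if value > band.high

  :ok
end

# Made-up example: requests per minute normally sit between 200 and 5000.
requests_band = Band.new(200, 5000)
puts band_status(12, requests_band)   # a sudden drop is as suspicious as a spike
puts band_status(900, requests_band)
puts band_status(9000, requests_band)
```

A low-side alert is what catches cases where traffic silently stops arriving, which a high-only threshold would never see.
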
  9. PagerDuty New Relic Monitoring Tools

  10. PagerDuty New Relic Monitoring Tools

  11. PagerDuty New Relic Monitoring Tools

  12. PagerDuty New Relic Monitoring Tools

  13. PagerDuty New Relic Monitoring Tools

  14. PagerDuty New Relic Monitoring Tools

  15. PagerDuty Great for small envs New Relic / APMs in general

    • Pros
      • Great for new stacks
      • Helpful for tracing transactions
      • Gives a lot of data
  16. PagerDuty Not all roses Problems with APMs

    • Cons
      • They can be overly prescriptive
      • They can be hard to tune/customize
      • Gives a lot of data
  17. PagerDuty All hail self service metrics DataDog / StatsD

    • DataDog is the backend
    • StatsD is the client
    • Super easy to use
      • statsd.gauge(metric_name, val)
      • statsd.counter(metric_name)
      • statsd.histogram(metric_name, val)
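
To make the three calls on this slide concrete, here is a minimal sketch of what a StatsD-style client does under the hood: it formats a short text datagram and fires it over UDP. `TinyStatsd` is a hypothetical stand-in, not the client library used in the talk; the host and port defaults are assumptions, and the `h` (histogram) type is a DataDog extension to the basic StatsD line format.

```ruby
require 'socket'

# Hypothetical, illustrative client. Method names mirror the slide,
# not any particular real library.
class TinyStatsd
  def initialize(host = '127.0.0.1', port = 8125)
    @host = host
    @port = port
    @socket = UDPSocket.new
  end

  def gauge(name, value)
    send_line(self.class.format(name, value, 'g'))
  end

  def counter(name, by = 1)
    send_line(self.class.format(name, by, 'c'))
  end

  def histogram(name, value)
    send_line(self.class.format(name, value, 'h'))
  end

  # Wire format: <metric name>:<value>|<type>
  def self.format(name, value, type)
    "#{name}:#{value}|#{type}"
  end

  private

  # Fire and forget: a metrics failure should never break the caller.
  def send_line(line)
    @socket.send(line, 0, @host, @port)
  rescue StandardError
    nil
  end
end

puts TinyStatsd.format('web.requests', 1, 'c')
puts TinyStatsd.format('queue.depth', 42, 'g')
```

Because the transport is UDP, instrumenting a code path costs almost nothing and needs no coordination with an ops team, which is what makes this style of metric self service.
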
  18. PagerDuty Custom Metrics

  19. PagerDuty Custom Alerts

  20. PagerDuty Custom Notifications

    • PagerDuty Integration
    • Email
    • HipChat

  21. PagerDuty StatsD / DataDog

    • Pros
      • Very customizable for your business reqs
      • Can change as you grow
      • Self Service
  22. PagerDuty StatsD / DataDog

    • Cons
      • Need to have Configuration Management
      • Hard to ramp teams up
  23. PagerDuty Logging as Monitoring SumoLogic

    • We ship all of our critical apps' logs
    • Engineers set up alerts on patterns
      • “Too many 500s in the last 10m”
    • Somewhat self-service
    • Initial setup is in Chef
    • Hard to use for realtime debugging
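
A pattern alert such as "too many 500s in the last 10m" boils down to counting matching log lines inside a sliding window. The sketch below is not SumoLogic's query language; it assumes a made-up log format of `<ISO 8601 timestamp> <status> <path>`, and `count_recent_500s` and the window constant are hypothetical names.

```ruby
require 'time'

WINDOW_SECONDS = 600 # "the last 10m"

# Count HTTP 500 lines whose timestamp falls inside the window ending at `now`.
def count_recent_500s(lines, now)
  lines.count do |line|
    timestamp, status = line.split(' ')
    age = now - Time.iso8601(timestamp)
    status == '500' && age >= 0 && age <= WINDOW_SECONDS
  end
end

now = Time.iso8601('2014-05-05T12:10:00Z')
sample = [
  '2014-05-05T12:05:00Z 500 /api/incidents',
  '2014-05-05T11:50:00Z 500 /api/incidents', # outside the window
  '2014-05-05T12:09:00Z 200 /api/incidents'  # not an error
]
puts count_recent_500s(sample, now)
```

An alert would then fire when this count exceeds whatever threshold the owning team picks.
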
  24. PagerDuty Dumb health checks Simple External Monitoring

    • Wormly and Monitis
    • Completely bypass PagerDuty for backup alerts
    • Meant as a last ditch effort
    • Very naive in the health checks
    • Had to build out a smarter health check page
  25. PagerDuty Dumb health checks made smarter Simple External Monitoring

    • Health Check Page
      • Lightly touches internal services
      • Gives back an expected value for each service
      • Alert on non-expected value
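
The health check page idea can be sketched as a comparison between the value each service is expected to return and the value it actually returned. The service names, the `'ok'` values, and `unexpected_services` are all hypothetical; the talk does not describe how the real page is built.

```ruby
# Hypothetical expected values, one per internal service the page touches.
EXPECTED = { 'database' => 'ok', 'queue' => 'ok', 'cache' => 'ok' }.freeze

# Return the services whose reported value differs from the expected one.
# A service missing from `actual` counts as unexpected too.
def unexpected_services(expected, actual)
  expected.reject { |service, value| actual[service] == value }.keys
end

# Made-up probe result: the queue answers with an error and the cache
# did not answer at all.
actual = { 'database' => 'ok', 'queue' => 'timeout' }
failing = unexpected_services(EXPECTED, actual)
puts "ALERT: #{failing.join(', ')}" unless failing.empty?
```

An external checker such as the ones named on the previous slide then only needs to match the page body against a single known-good string.
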
  26. PagerDuty Treat security as monitoring Security Monitoring

    • Intrusion detection via OSSEC
      • Monitor logs / Checksum dirs
    • Port scanners via Gauntlt
      • Runs continuously
    • SQLMAP attacks
      • Not very useful against Rails
  27. PagerDuty PagerDuty at PagerDuty? YES

  28. PagerDuty The single host does not matter anymore Distributed Systems Monitoring

    • Alert on cluster level metrics
      • Overall number of 500s
      • % of nodes down
      • Overall latency
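
As a rough sketch of cluster-level alerting, the three metrics on this slide can be rolled up from per-node readings before any threshold is applied. `Node`, `cluster_summary`, and the sample numbers are hypothetical and only illustrate the aggregation step.

```ruby
# Hypothetical per-node reading collected by the monitoring system.
Node = Struct.new(:name, :up, :errors_500, :latency_ms)

# Roll individual nodes up into the cluster-level numbers worth alerting on.
def cluster_summary(nodes)
  raise ArgumentError, 'no nodes reported' if nodes.empty?

  up_nodes = nodes.select(&:up)
  {
    percent_down: 100.0 * (nodes.size - up_nodes.size) / nodes.size,
    total_500s: up_nodes.sum(&:errors_500),
    avg_latency_ms: up_nodes.empty? ? nil : up_nodes.sum(&:latency_ms).to_f / up_nodes.size
  }
end

nodes = [
  Node.new('web-1', true, 3, 100),
  Node.new('web-2', true, 1, 200),
  Node.new('web-3', false, 0, 0),
  Node.new('web-4', true, 0, 300)
]
p cluster_summary(nodes)
```

With this shape, one node going down changes `percent_down` but pages nobody until the cluster as a whole crosses a threshold.
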
  29. PagerDuty Cron should not be used for creating alerts Avoid single host alerting

    [Diagram labels: US West 1, US West 2, Linode, Highly available Monitoring system]
  30. PagerDuty Same model for service alerting

    [Diagram labels: Service A, Service B, Service C, Highly available Monitoring system]
  31. PagerDuty Stuff that you do not control Dependency Monitoring

    • Dependencies Everywhere
      • Operations
        • DNS
        • Monitoring Tools
        • Logging
  32. PagerDuty Stuff that you do not control Dependency Monitoring

    • How to monitor?
      • Operations
        • DNS -> Create/Delete records
        • Monitoring Tools -> Basic ping
        • Logging -> Validate that logs are being pushed
  33. PagerDuty What keeps us up at night Dependency Monitoring

    • Dependencies Everywhere
      • Software
        • Email
        • SMS
        • Phone
        • Push Notifications
  34. PagerDuty When SMS providers screw us over Quick Story

    • Primary SMS provider was “Up”
    • Customer was not getting their SMS
    • Found out in the worst way possible
      • Customer called us
    • Provider was working but T-Mobile prepaid was not passing our short code through
  35. PagerDuty aka how to abuse unlimited messaging plans End to End testing

    • Every minute we send an SMS alert
      • Every SMS provider we use
      • Main Carriers
        • Verizon
        • AT&T
        • T-Mobile
        • Sprint
    • Measure Response times
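
The end-to-end test described above can be sketched as a list of probes, one per provider and carrier pair, each recording when the SMS was sent and when (if ever) the test phone received it. `Probe`, `slow_or_missing`, the carrier labels, and the 60 second limit are hypothetical; the talk does not say how the real pipeline stores or evaluates these measurements.

```ruby
# Hypothetical record of one test SMS: when it was sent and when the
# device reported receiving it (nil if it never arrived).
Probe = Struct.new(:carrier, :sent_at, :received_at)

# Carriers whose test message was late or never showed up.
def slow_or_missing(probes, max_seconds)
  late = probes.select do |probe|
    probe.received_at.nil? || (probe.received_at - probe.sent_at) > max_seconds
  end
  late.map(&:carrier)
end

sent = Time.at(0)
probes = [
  Probe.new('carrier_a', sent, sent + 15),
  Probe.new('carrier_b', sent, sent + 200),
  Probe.new('carrier_c', sent, nil)
]
puts slow_or_missing(probes, 60).inspect
```

Treating a missing delivery the same as a slow one is what would have caught the short code story from the previous slide, where the provider itself looked healthy.
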
  36. PagerDuty “Device Lab”

  37. PagerDuty Sorry, cannot tell you which carrier is which Some stats (Averages)

    • Carrier A: 15 seconds
    • Carrier B: 60 seconds
    • Carrier C: 25 seconds
    • Carrier D: 200 seconds
  38. PagerDuty Kinda spikey….

  39. PagerDuty Automate all the things How we cheat using Chef

    • All monitoring data consumption is set up
      • New Relic
      • DataDog
      • SumoLogic
      • OSSEC
    • Wormly and Monitis are not automated
    • Cluster alert setup is not automated
  40. PagerDuty Catch the easy stuff DataDog Alert API

    pd_datadog_alert "File System Filling on #{host}" do
      metric_name "system.disk.in_use"
      function "avg"
      greater_than 0.85
      time_frame '1h'
      page [ 'ops' ]
    end
  41. PagerDuty Only if the environment is small enough Easy alerts

    • Load
    • CPU
    • Memory
    • Disk Utilization
  42. PagerDuty Sumologic Setup https://github.com/PagerDuty/chef-sumologic

    sumo_source 'syslog' do
      path '/var/log/syslog'
      category 'syslog'
    end
  43. PagerDuty Failure Friday How we validate our monitoring

    • We attack our own services
      • Process failure
      • Datacenter failure
      • Network failure
    • http://blog.pagerduty.com/2013/11/failure-friday-at-pagerduty/
  44. PagerDuty What we have learned

    • Process monitoring is co-mingled with the process running
    • Only localhost checks on service
    • Requires outbound network conn from Failure Friday
  45. PagerDuty arup@pagerduty.com Thank you We are hiring http://pagerduty.com/jobs Arup Chakrabarti

    OPERATIONS ENGINEERING @arupchak