Who Watches The Watchmen?

This is a talk that I gave at DevOps Days Austin (and previously at the Advanced AWS meetup in SF) about how PagerDuty monitors PagerDuty.

Arup Chakrabarti

May 05, 2014

Transcript

  1. PagerDuty Ops Guys know all too well... What is PagerDuty?

    • Alert and Incident Tracking • On-Call Management • Integrates with monitoring tools • Alert the right person, every time
  2. PagerDuty What we will cover

    • What is PagerDuty? (DONE!) • Monitoring philosophies • Monitoring tools we use • Distributed Systems Monitoring • Security Monitoring • Dependency Monitoring • How we cheat by using Chef • How we validate our monitoring • Q and A
  3. PagerDuty Thou shall….. Monitoring Philosophies

    • Use the right tool • Not all monitoring tools are for everything • Avoid single host monitoring
  4. PagerDuty Thou shall….. Monitoring Philosophies

    • Alert on what your customers care about • Alert on expected thresholds (both high and low) • Make it as Self Service as possible • Validate that your alerts work
  5. PagerDuty Great for small env’s New Relic / APMs in general

    • Pros • Great for new stacks • Helpful for tracing transactions • Gives a lot of data
  6. PagerDuty Not all roses Problems with APM’s

    • Cons • They can be overly prescriptive • They can be hard to tune/customize • Gives a lot of data
  7. PagerDuty All hail self service metrics DataDog / StatsD

    • DataDog is the backend • StatsD is the client • Super easy to use • statsd.gauge(metric_name, val) • statsd.counter(metric_name) • statsd.histogram(metric_name, val)
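
The client calls on this slide come down to small text datagrams sent over UDP. The following is a minimal sketch of that wire format, not the client library the talk used. The helper names `format_metric` and `send_metric`, the metric names, and the `127.0.0.1:8125` agent address are assumptions for illustration, and the `h` histogram type is a DogStatsD extension rather than part of plain StatsD.

```ruby
require "socket"

# Hypothetical helper: build a StatsD-style datagram, "<name>:<value>|<type>".
# Type codes: "g" gauge, "c" counter, "h" histogram (DogStatsD extension).
def format_metric(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Hypothetical helper: fire-and-forget send over UDP.
# Host and port are assumptions (8125 is the conventional StatsD port).
def send_metric(datagram, host: "127.0.0.1", port: 8125)
  socket = UDPSocket.new
  socket.send(datagram, 0, host, port)
ensure
  socket&.close
end

# Made-up metric names, one per metric type from the slide.
gauge     = format_metric("queue.depth", 42, "g")
counter   = format_metric("alerts.sent", 1, "c")
histogram = format_metric("sms.latency_seconds", 15.2, "h")
```

Calling `send_metric(gauge)` would hand the datagram to a local agent if one is listening. Because UDP is connectionless, the application does not block on the monitoring backend.
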
  8. PagerDuty StatsD / DataDog

    • Pros • Very customizable for your business reqs • Can change as you grow • Self Service
  9. PagerDuty StatsD / DataDog

    • Cons • Need to have Configuration Management • Hard to ramp teams up
  10. PagerDuty Logging as Monitoring SumoLogic

    • We ship all of our critical apps’ logs • Engineers set up alerts on patterns • “Too many 500’s in the last 10m” • Somewhat self-service • Initial setup is in Chef • Hard to use for realtime debugging
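
A pattern alert such as “Too many 500’s in the last 10m” can be pictured as a count over a sliding time window. This is a rough sketch of that idea only, not SumoLogic's query language or API. The log line format (`<ISO 8601 timestamp> <status> <path>`), the function names, and the default limit of 50 are all assumptions.

```ruby
require "time"

# Count HTTP 500 responses logged within `window` seconds before `now`.
# Assumed (hypothetical) line format: "<ISO 8601 timestamp> <status> <path>".
def count_recent_500s(lines, now, window: 600)
  lines.count do |line|
    timestamp, status, _path = line.split(" ", 3)
    status == "500" && (now - Time.iso8601(timestamp)) <= window
  end
end

# True when the count exceeds `limit` (a made-up default; tune per service).
def too_many_500s?(lines, now, window: 600, limit: 50)
  count_recent_500s(lines, now, window: window) > limit
end
```
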
  11. PagerDuty Dumb health checks Simple External Monitoring

    • Wormly and Monitis • Completely bypass PagerDuty for backup alerts • Meant as a last ditch effort • Very naive in the health checks • Had to build out smarter health check page
  12. PagerDuty Dumb health checks made smarter Simple External Monitoring

    • Health Check Page • Lightly touches internal services • Gives back an expected value for each service • Alert on non-expected value
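
One way to picture the health check page: run a cheap probe per internal service, compare each result with the value it is expected to return, and let the external monitor alert on any mismatch. The sketch below assumes that design. The function names, service names, and probe lambdas are hypothetical stand-ins, not PagerDuty's implementation.

```ruby
# Compare each probe's result with the value it is expected to return.
# `probes` maps a service name to [expected_value, callable].
def run_health_checks(probes)
  probes.each_with_object({}) do |(service, (expected, probe)), results|
    actual = begin
      probe.call
    rescue StandardError => e
      "error: #{e.class}"   # a failing probe counts as a non-expected value
    end
    results[service] = { expected: expected, actual: actual, ok: actual == expected }
  end
end

# Names of the services whose value differs from the expected one.
def failing_services(results)
  results.reject { |_service, result| result[:ok] }.keys
end

# Hypothetical probes standing in for light reads against real services.
probes = {
  "database" => [1, -> { 1 }],
  "queue"    => ["PONG", -> { "PONG" }],
  "cache"    => ["OK", -> { raise IOError, "connection refused" }]
}
results = run_health_checks(probes)
```

An external checker such as the ones named on the previous slide would then only need to fetch the page and look for anything other than the expected values.
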
  13. PagerDuty Treat security as monitoring Security Monitoring

    • Intrusion detection via OSSEC • Monitor logs / Checksum dir’s • Port scanners via Gauntlt • Runs continuously • SQLMAP attacks • Not very useful against Rails
  14. PagerDuty The single host does not matter anymore Distributed Systems Monitoring

    • Alert on cluster level metrics • Overall number of 500’s • % of nodes down • Overall latency
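
A cluster-level check such as “% of nodes down” aggregates per-node status before deciding whether to page anyone. The sketch below assumes node status arrives as a hash of hostname to `:up` or `:down`. The function names and the 25% default threshold are made up for illustration.

```ruby
# Percentage of nodes reported down across the whole cluster.
# `nodes` maps a hostname to :up or :down (hypothetical status source).
def percent_nodes_down(nodes)
  return 0.0 if nodes.empty?

  down = nodes.values.count(:down)
  (down * 100.0 / nodes.size).round(1)
end

# Page only when the cluster as a whole is degraded, not for one bad host.
def cluster_alert?(nodes, max_percent_down: 25.0)
  percent_nodes_down(nodes) > max_percent_down
end
```

With four nodes and one down, `percent_nodes_down` returns `25.0` and `cluster_alert?` stays false at the default threshold. A second node failing pushes it over.
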
  15. PagerDuty Cron should not be used for creating alerts Avoid single host alerting

    (Diagram labels: US West 1 • US West 2 • Linode • Highly available Monitoring system)
  16. PagerDuty Same model for service alerting

    (Diagram labels: Service A • Service B • Service C • Highly available Monitoring system)
  17. PagerDuty Stuff that you do not control Dependency Monitoring

    • Dependencies Everywhere • Operations • DNS • Monitoring Tools • Logging
  18. PagerDuty Stuff that you do not control Dependency Monitoring

    • How to monitor? • Operations • DNS -> Create/Delete records • Monitoring Tools -> Basic ping • Logging -> Validate that logs are being pushed
  19. PagerDuty What keeps us up at night Dependency Monitoring

    • Dependencies Everywhere • Software • Email • SMS • Phone • Push Notifications
  20. PagerDuty When SMS providers screw us over Quick Story

    • Primary SMS provider was “Up” • Customer was not getting their SMS • Found out in the worst way possible • Customer called us • Provider was working but T-Mobile prepaid was not passing our short code through
  21. PagerDuty aka how to abuse unlimited messaging plans End to End testing

    • Every minute we send an SMS alert • Every SMS provider we use • Main Carriers: Verizon, AT&T, T-Mobile, Sprint • Measure Response times
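
The end-to-end test can be thought of as a stream of probes, each recording when a test SMS was sent and when (if ever) a test handset received it. The sketch below keys results by provider and carrier together, since the earlier story shows a provider can look healthy while one carrier drops messages. The hash keys, function names, and the 300-second timeout are assumptions, not details from the talk.

```ruby
# Average delivery time in seconds per [provider, carrier] pair.
# Each probe is a hash with :provider, :carrier, :sent_at and :received_at
# (nil when the message never arrived). All names are hypothetical.
def average_latency(probes)
  delivered = probes.reject { |probe| probe[:received_at].nil? }
  delivered.group_by { |probe| [probe[:provider], probe[:carrier]] }.transform_values do |group|
    group.sum { |probe| probe[:received_at] - probe[:sent_at] } / group.size.to_f
  end
end

# Pairs with a message that is still missing after `timeout` seconds.
def undelivered(probes, now, timeout: 300)
  probes.select { |probe| probe[:received_at].nil? && now - probe[:sent_at] > timeout }
        .map { |probe| [probe[:provider], probe[:carrier]] }
        .uniq
end
```

Timestamps here can be plain numbers of seconds or `Time` objects, since both support subtraction.
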
  22. PagerDuty Sorry, cannot tell you which carrier is which Some stats (Averages)

    • Carrier A: 15 seconds • Carrier B: 60 seconds • Carrier C: 25 seconds • Carrier D: 200 seconds
  23. PagerDuty Automate all the things How we cheat using Chef

    • All monitoring data consumption is set up • New Relic • DataDog • SumoLogic • OSSEC • Wormly and Monitis are not automated • Cluster alert setup is not automated
  24. PagerDuty Catch the easy stuff DataDog Alert API

    pd_datadog_alert "File System Filling on #{host}" do
      metric_name "system.disk.in_use"
      function "avg"
      greater_than 0.85
      time_frame '1h'
      page [ 'ops' ]
    end
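
Read literally, the resource above asks for a page to `ops` when the one-hour average of `system.disk.in_use` exceeds 0.85. The sketch below shows that evaluation in plain Ruby to make the semantics concrete. It is an assumption about what the parameters mean, not the internals of `pd_datadog_alert` or DataDog, and the function name and sample layout are hypothetical.

```ruby
# Average the samples that fall inside the time frame and compare the
# result with the threshold. `samples` is a list of [unix_timestamp, value].
def avg_over_threshold?(samples, now:, time_frame: 3600, greater_than: 0.85)
  recent = samples.select { |timestamp, _value| now - timestamp <= time_frame }
  return false if recent.empty?   # no data: this sketch stays quiet

  average = recent.sum { |_timestamp, value| value } / recent.size.to_f
  average > greater_than
end
```

A production check would probably treat missing data as its own alert condition, in line with the earlier advice to alert on both high and low thresholds.
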
  25. PagerDuty Only if the environment is small enough Easy alerts

    • Load • CPU • Memory • Disk Utilization
  26. PagerDuty Failure Friday How we validate our monitoring

    • We attack our own services • Process failure • Datacenter failure • Network failure • http://blog.pagerduty.com/2013/11/failure-friday-at-pagerduty/
  27. PagerDuty What we have learned

    • Process monitoring is co-mingled with the process running • Only localhost checks on service • Requires outbound network conn from Failure Friday