
Who Watches The Watchmen?


This is a talk that I gave at DevOps Days Austin (and previously at the Advanced AWS meetup in SF) about how PagerDuty monitors PagerDuty.

Arup Chakrabarti

May 05, 2014


Transcript

  1. PagerDuty
    Arup Chakrabarti
    OPERATIONS ENGINEERING
    @arupchak
    [email protected]
    DevOps Days Austin


  2. PagerDuty
    Who watches the Watchmen?


  3. PagerDuty
    Ops Guys know all too well...
    What is PagerDuty?
    • Alert and Incident Tracking
    • On-Call Management
    • Integrates with monitoring tools
    • Alert the right person, every time


  4. PagerDuty
    What is PagerDuty?


  5. PagerDuty
    What we will cover
    • What is PagerDuty? (DONE!)
    • Monitoring philosophies
    • Monitoring tools we use
    • Distributed Systems Monitoring
    • Security Monitoring
    • Dependency Monitoring
    • How we cheat by using Chef
    • How we validate our monitoring
    • Q and A


  6. PagerDuty
    I did not come up with everything
    Disclaimer


  7. PagerDuty
    Thou shalt…
    Monitoring Philosophies
    • Use the right tool
    • Not all monitoring tools are for everything
    • Avoid single host monitoring


  8. PagerDuty
    Thou shalt…
    Monitoring Philosophies
    • Alert on what your customers care about
    • Alert on expected thresholds (both high and low)
    • Make it as Self Service as possible
    • Validate that your alerts work
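
A minimal sketch of the "both high and low" idea: a value is checked against an expected band rather than a single ceiling, so a metric that drops to zero is caught as well as one that spikes. The function name and the band of 50 to 500 alerts per minute are assumptions made for illustration, not values from the talk.

```ruby
# Returns :ok, :too_low, or :too_high for a value against an expected band.
def check_band(value, low:, high:)
  return :too_low if value < low
  return :too_high if value > high
  :ok
end

# Hypothetical band: a healthy system is assumed to send 50..500 alerts/min.
puts check_band(0, low: 50, high: 500)    # a silent pipeline is also a problem
puts check_band(120, low: 50, high: 500)
puts check_band(900, low: 50, high: 500)
```

The low bound is what catches the quiet failures, such as a pipeline that has stopped processing anything at all.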


  9. PagerDuty
    New Relic
    Monitoring Tools


  10. PagerDuty
    New Relic
    Monitoring Tools


  11. PagerDuty
    New Relic
    Monitoring Tools


  12. PagerDuty
    New Relic
    Monitoring Tools


  13. PagerDuty
    New Relic
    Monitoring Tools


  14. PagerDuty
    New Relic
    Monitoring Tools


  15. PagerDuty
    Great for small envs
    New Relic / APMs in general
    • Pros
      • Great for new stacks
      • Helpful for tracing transactions
      • Gives a lot of data


  16. PagerDuty
    Not all roses
    Problems with APMs
    • Cons
      • They can be overly prescriptive
      • They can be hard to tune/customize
      • Gives a lot of data


  17. PagerDuty
    All hail self service metrics
    DataDog / StatsD
    • DataDog is the backend
    • StatsD is the client
    • Super easy to use
    • statsd.gauge(metric_name, val)
    • statsd.counter(metric_name)
    • statsd.histogram(metric_name,val)
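
What the three calls above put on the wire can be sketched as follows. This assumes the plain-text StatsD line protocol (`name:value|type`) sent over UDP, with the histogram type being a DogStatsD extension. The `StatsdLine` module, the `send_metric` helper, and the metric names are hypothetical; real client libraries differ in their method names.

```ruby
require "socket"

# Formats metrics in the StatsD line protocol: "<name>:<value>|<type>".
# The module and method names here are illustrative, not a real client's API.
module StatsdLine
  def self.gauge(name, value)
    "#{name}:#{value}|g"       # gauge: last value wins
  end

  def self.counter(name, by = 1)
    "#{name}:#{by}|c"          # counter: increments by `by`
  end

  def self.histogram(name, value)
    "#{name}:#{value}|h"       # histogram: a DogStatsD extension
  end
end

# Sends one line over UDP to a local agent (host and port are assumptions;
# 8125 is the conventional StatsD port).
def send_metric(line, host = "127.0.0.1", port = 8125)
  socket = UDPSocket.new
  socket.send(line, 0, host, port)
ensure
  socket&.close
end

puts StatsdLine.gauge("queue.depth", 42)
puts StatsdLine.counter("alerts.sent")
puts StatsdLine.histogram("sms.latency_ms", 150)
```

Because the transport is fire-and-forget UDP, emitting a metric adds very little overhead to application code, which is part of what makes the approach self-service.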


  18. PagerDuty
    Custom Metrics


  19. PagerDuty
    Custom Alerts


  20. PagerDuty
    Custom Notifications
    • PagerDuty Integration
    • Email
    • HipChat


  21. PagerDuty
    StatsD / DataDog
    • Pros
      • Very customizable for your business requirements
      • Can change as you grow
      • Self Service


  22. PagerDuty
    StatsD / DataDog
    • Cons
      • Need to have Configuration Management
      • Hard to ramp teams up


  23. PagerDuty
    Logging as Monitoring
    SumoLogic
    • We ship all of our critical apps' logs
    • Engineers set up alerts on patterns
    • “Too many 500s in the last 10m”
    • Somewhat self-service
    • Initial setup is in Chef
    • Hard to use for realtime debugging
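
The "too many 500s in the last 10m" pattern boils down to counting matching log entries in a trailing window. Below is a rough sketch of that logic in plain Ruby; it is not SumoLogic's query language, and the function name, the entry shape, and the limits are assumptions.

```ruby
# Counts HTTP 500 responses seen in a trailing time window.
# Entries are [Time, status] pairs; the window and limit are assumptions.
def too_many_500s?(entries, now:, window_seconds: 600, limit: 100)
  cutoff = now - window_seconds
  recent = entries.count { |time, status| time >= cutoff && status == 500 }
  recent > limit
end

now = Time.at(1_000_000)
entries = [
  [now - 30, 500],
  [now - 120, 200],
  [now - 300, 500],
  [now - 900, 500]   # outside the 10 minute window, so not counted
]
puts too_many_500s?(entries, now: now, limit: 1)
puts too_many_500s?(entries, now: now, limit: 5)
```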


  24. PagerDuty
    Dumb health checks
    Simple External Monitoring
    • Wormly and Monitis
    • Completely bypass PagerDuty for backup alerts
    • Meant as a last ditch effort
    • Very naive in the health checks
    • Had to build out smarter health check page


  25. PagerDuty
    Dumb health checks made smarter
    Simple External Monitoring
    • Health Check Page
    • Lightly touches internal services
    • Gives back an expected value for each service
    • Alert on non-expected value
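
One way such a health check page could work is sketched below: each check lightly touches a service, and the page reports whether the returned value matches the expected one. The service names and probe lambdas are hypothetical stand-ins; a real page would make cheap calls to the actual services.

```ruby
# Each check lightly touches one internal service and returns a value; the
# page reports whether that value matches what is expected.
# Service names and probe lambdas are hypothetical stand-ins.
CHECKS = {
  "database" => { expected: 1,      probe: -> { 1 } },
  "queue"    => { expected: "PONG", probe: -> { "PONG" } },
  "cache"    => { expected: "ok",   probe: -> { raise "connection refused" } }
}

def run_health_checks(checks)
  checks.each_with_object({}) do |(name, check), report|
    actual = begin
      check[:probe].call
    rescue StandardError => e
      "error: #{e.message}"
    end
    report[name] = actual == check[:expected] ? "OK" : "FAIL (got #{actual.inspect})"
  end
end

def healthy?(report)
  report.values.all? { |status| status == "OK" }
end

report = run_health_checks(CHECKS)
report.each { |name, status| puts "#{name}: #{status}" }
puts "overall: #{healthy?(report) ? 'OK' : 'FAIL'}"
```

A naive external monitor only needs to look for the overall line, which keeps the external check dumb while the page itself carries the smarter logic.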


  26. PagerDuty
    Treat security as monitoring
    Security Monitoring
    • Intrusion detection via OSSEC
    • Monitor logs / Checksum directories
    • Port scanners via Gauntlt
    • Runs continuously
    • SQLMAP attacks
    • Not very useful against Rails


  27. PagerDuty
    PagerDuty at PagerDuty?
    YES


  28. PagerDuty
    The single host does not matter anymore
    Distributed Systems Monitoring
    • Alert on cluster level metrics
    • Overall number of 500s
    • % of nodes down
    • Overall latency
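
A rough sketch of cluster-level alerting: per-node samples are rolled up into the three signals above, and thresholds apply to the aggregate rather than to any single host. The struct fields, host names, sample numbers, and thresholds are all assumptions for illustration.

```ruby
# One sample per node; field names are illustrative.
Node = Struct.new(:name, :up, :requests, :errors_5xx, :latency_ms)

# Rolls per-node samples up into cluster-level signals.
def cluster_summary(nodes)
  live = nodes.select(&:up)
  requests = live.sum(&:requests)
  errors = live.sum(&:errors_5xx)
  weighted_latency = live.sum { |n| n.latency_ms * n.requests }
  {
    percent_down: 100.0 * (nodes.size - live.size) / nodes.size,
    error_rate: requests.zero? ? 0.0 : errors.to_f / requests,
    latency_ms: requests.zero? ? 0.0 : weighted_latency.to_f / requests
  }
end

# Thresholds are assumptions for illustration, not values from the talk.
def cluster_alerts(summary, max_percent_down: 25.0, max_error_rate: 0.01, max_latency_ms: 500.0)
  alerts = []
  alerts << "too many nodes down" if summary[:percent_down] > max_percent_down
  alerts << "5xx rate too high" if summary[:error_rate] > max_error_rate
  alerts << "latency too high" if summary[:latency_ms] > max_latency_ms
  alerts
end

nodes = [
  Node.new("web1", true, 1000, 5, 100),
  Node.new("web2", true, 1000, 55, 200),
  Node.new("web3", true, 2000, 0, 150),
  Node.new("web4", false, 0, 0, 0)
]
summary = cluster_summary(nodes)
puts summary
puts cluster_alerts(summary).inspect
```

In this sample one of four hosts is down, yet that alone does not trip an alert; only the elevated cluster-wide 5xx rate does.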


  29. PagerDuty
    Cron should not be used for creating alerts
    Avoid single host alerting
    US West 1
    US West 2
    Linode
    Highly available
    Monitoring system


  30. PagerDuty
    Same model for service alerting
    Service A
    Service B
    Service C
    Highly available
    Monitoring system


  31. PagerDuty
    Stuff that you do not control
    Dependency Monitoring
    • Dependencies Everywhere
    • Operations
    • DNS
    • Monitoring Tools
    • Logging


  32. PagerDuty
    Stuff that you do not control
    Dependency Monitoring
    • How to monitor?
    • Operations
    • DNS -> Create/Delete records
    • Monitoring Tools -> Basic ping
    • Logging -> Validate that logs are
    being pushed
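
For the logging dependency, one possible way to validate that logs are being pushed is a freshness check: compare the newest log timestamp seen for each source against the current time. The sketch below assumes those timestamps have already been fetched from the logging provider, which is out of scope here; the source names and the five-minute limit are hypothetical.

```ruby
# Returns the sources whose newest log line is older than max_age_seconds.
def stale_sources(last_seen, now:, max_age_seconds: 300)
  last_seen.select { |_source, seen_at| now - seen_at > max_age_seconds }.keys
end

now = Time.at(1_700_000_000)
last_seen = {
  "web-app" => now - 20,
  "billing-worker" => now - 1_800
}
puts stale_sources(last_seen, now: now).inspect
```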


  33. PagerDuty
    What keeps us up at night
    Dependency Monitoring
    • Dependencies Everywhere
    • Software
    • Email
    • SMS
    • Phone
    • Push Notifications


  34. PagerDuty
    When SMS providers screw us over
    Quick Story
    • Primary SMS provider was “Up”
    • Customer was not getting their SMS
    • Found out in the worst way possible
    • Customer called us
    • Provider was working but T-Mobile prepaid was not passing our short code through


  35. PagerDuty
    aka how to abuse unlimited messaging plans
    End to End testing
    • Every minute we send an SMS alert
    • Every SMS provider we use
    • Main Carriers
      • Verizon
      • AT&T
      • T-Mobile
      • Sprint
    • Measure Response times
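
The bookkeeping behind such an end-to-end test might look like the sketch below: each test message records when it was sent and when (if ever) it arrived on a test phone, and results are summarized per provider and carrier pair. The struct, the function name, and the sample data are hypothetical; actually sending and receiving SMS is not shown.

```ruby
# One test message: when it was sent and, if it arrived, when it was received.
# Timestamps are plain seconds to keep the sketch self-contained.
Probe = Struct.new(:provider, :carrier, :sent_at, :received_at)

# Returns [averages, missing]: mean delivery seconds per [provider, carrier]
# pair, and the pairs where at least one test message never arrived.
def summarize_probes(probes)
  averages = {}
  missing = []
  probes.group_by { |p| [p.provider, p.carrier] }.each do |pair, group|
    delivered = group.select(&:received_at)
    missing << pair if delivered.size < group.size
    next if delivered.empty?
    total = delivered.sum { |p| p.received_at - p.sent_at }
    averages[pair] = total.to_f / delivered.size
  end
  [averages, missing]
end

# Hypothetical sample data; provider and carrier names are placeholders.
probes = [
  Probe.new("provider_x", "carrier_a", 0, 12),
  Probe.new("provider_x", "carrier_a", 60, 78),
  Probe.new("provider_x", "carrier_b", 0, nil)
]
averages, missing = summarize_probes(probes)
averages.each { |(provider, carrier), secs| puts "#{provider}/#{carrier}: #{secs}s average" }
missing.each { |provider, carrier| puts "#{provider}/#{carrier}: message not delivered" }
```

Tracking undelivered messages separately matters because a provider can report itself as up while a specific carrier silently drops the messages.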


  36. PagerDuty
    “Device Lab”


  37. PagerDuty
    Sorry, cannot tell you which carrier is which
    Some stats (Averages)
    • Carrier A: 15 seconds
    • Carrier B: 60 seconds
    • Carrier C: 25 seconds
    • Carrier D: 200 seconds


  38. PagerDuty
    Kinda spiky…


  39. PagerDuty
    Automate all the things
    How we cheat using Chef
    • All monitoring data consumption is set up
      • New Relic
      • DataDog
      • SumoLogic
      • OSSEC
    • Wormly and Monitis are not automated
    • Cluster alert setup is not automated


  40. PagerDuty
    Catch the easy stuff
    DataDog Alert API
    pd_datadog_alert "File System Filling on #{host}" do
      metric_name "system.disk.in_use"
      function "avg"
      greater_than 0.85
      time_frame '1h'
      page [ 'ops' ]
    end


  41. PagerDuty
    Only if the environment is small enough
    Easy alerts
    • Load
    • CPU
    • Memory
    • Disk Utilization


  42. PagerDuty
    Sumologic Setup
    https://github.com/PagerDuty/chef-sumologic
    sumo_source 'syslog' do
      path '/var/log/syslog'
      category 'syslog'
    end


  43. PagerDuty
    Failure Friday
    How we validate our monitoring
    • We attack our own services
    • Process failure
    • Datacenter failure
    • Network failure
    • http://blog.pagerduty.com/2013/11/failure-friday-at-pagerduty/
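
The core loop of validating monitoring this way can be sketched as: inject a failure, wait for the expected alert, and always restore the service afterwards. This is a simplified illustration and not PagerDuty's actual tooling; `run_drill` and its three callables are hypothetical hooks supplied by the caller.

```ruby
# Runs one failure drill: inject a fault, poll until the expected alert
# fires or the attempts run out, and always restore the service afterwards.
# All callables are hypothetical hooks supplied by the caller.
def run_drill(inject:, alert_fired:, restore:, attempts: 10, wait: ->(_n) {})
  inject.call
  attempts.times do |n|
    return :alerted if alert_fired.call
    wait.call(n)
  end
  :missed
ensure
  restore.call
end

# Simulated drill: the "alert" shows up on the third poll.
polls = 0
events = []
result = run_drill(
  inject:      -> { events << "stopped worker process" },
  alert_fired: -> { (polls += 1) >= 3 },
  restore:     -> { events << "restarted worker process" }
)
puts result
puts events
```

A `:missed` result is the useful one: it means the failure happened and nothing paged, which points at a gap in the monitoring. In a real drill the `wait` hook would sleep between polls.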


  44. PagerDuty
    What we have learned
    • Process monitoring is co-mingled with the process running
    • Only localhost checks on service
    • Requires outbound network conn
    from Failure Friday


  45. PagerDuty
    [email protected]
    Thank you. We are hiring.
    http://pagerduty.com/jobs
    Arup Chakrabarti
    OPERATIONS ENGINEERING
    @arupchak
