PagerDuty
Ops Guys know all too well...
What is PagerDuty?
• Alert and Incident Tracking
• On-Call Management
• Integrates with monitoring tools
• Alert the right person, every time

What we will cover
• What is PagerDuty? (DONE!)
• Monitoring philosophies
• Monitoring tools we use
• Distributed Systems Monitoring
• Security Monitoring
• Dependency Monitoring
• How we cheat by using Chef
• How we validate our monitoring
• Q and A

Thou shalt...
Monitoring Philosophies
• Alert on what your customers care about
• Alert on expected thresholds (both high and low)
• Make it as self-service as possible
• Validate that your alerts work
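
To make the "both high and low" point concrete, here is a minimal sketch of a two-sided threshold check. The helper name `threshold_state` and the example bounds are assumptions for illustration, not something from the talk.

```ruby
# Hypothetical helper: classify a metric against a lower and an upper bound.
# A value that drops too low (e.g. traffic vanishing) is as alert-worthy as a spike.
def threshold_state(value, low:, high:)
  return :too_low if value < low
  return :too_high if value > high
  :ok
end

# Assumed example: requests per minute for a service that normally sees 100..5000.
[0, 1200, 9000].each do |rpm|
  puts "#{rpm} rpm -> #{threshold_state(rpm, low: 100, high: 5000)}"
end
```

The low bound is the one people tend to forget: a service that suddenly receives no traffic rarely trips a "too high" alert.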

All hail self-service metrics
DataDog / StatsD
• DataDog is the backend
• StatsD is the client
• Super easy to use
  • statsd.gauge(metric_name, val)
  • statsd.counter(metric_name)
  • statsd.histogram(metric_name, val)
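
For a feel of what calls like these do under the hood, here is a rough sketch of a StatsD-style client. `TinyStatsd` is a hypothetical class written for this example, not the real client library; the host and port defaults are assumptions (8125 is the conventional StatsD port), and the `name:value|type` datagram shape follows the StatsD/DogStatsD wire format.

```ruby
require "socket"

# Illustrative StatsD-style client. TinyStatsd is a hypothetical name, not the
# real Datadog or StatsD library; it only mirrors the three calls on the slide.
class TinyStatsd
  # Wire type codes used by StatsD/DogStatsD datagrams.
  TYPE_CODES = { gauge: "g", counter: "c", histogram: "h" }.freeze

  def initialize(host: "127.0.0.1", port: 8125, socket: UDPSocket.new)
    @host = host
    @port = port
    @socket = socket
  end

  # Build one datagram, e.g. "queue.depth:42|g".
  def self.datagram(name, value, type)
    "#{name}:#{value}|#{TYPE_CODES.fetch(type)}"
  end

  def gauge(name, value)
    emit(name, value, :gauge)
  end

  def counter(name, by = 1)
    emit(name, by, :counter)
  end

  def histogram(name, value)
    emit(name, value, :histogram)
  end

  private

  # Fire-and-forget UDP send, so the app does not wait on the metrics backend.
  def emit(name, value, type)
    @socket.send(self.class.datagram(name, value, type), 0, @host, @port)
  end
end

# Show the datagrams without touching the network.
puts TinyStatsd.datagram("queue.depth", 42, :gauge)
puts TinyStatsd.datagram("logins", 1, :counter)
puts TinyStatsd.datagram("request.time", 0.25, :histogram)
```

Because the whole interface is a metric name plus a number, any engineer can add a metric from application code without involving ops, which is the self-service point of the slide.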

Logging as Monitoring
SumoLogic
• We ship all of our critical apps' logs
• Engineers set up alerts on patterns
  • “Too many 500s in the last 10m”
• Somewhat self-service
• Initial setup is in Chef
• Hard to use for real-time debugging
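
The "too many 500s in the last 10m" pattern boils down to counting matching log lines inside a time window. The sketch below shows that idea in plain Ruby; the log line shape, the choice to count every 5xx status, and the threshold are all assumptions, and this is not how SumoLogic queries are written.

```ruby
require "time"

# Assumed log line shape: "<ISO 8601 timestamp> <HTTP status> <path>".
# Counts server errors (5xx) logged within the last `window_seconds`.
def count_server_errors(lines, now:, window_seconds: 600)
  cutoff = now - window_seconds
  lines.count do |line|
    stamp, status, _path = line.split(" ", 3)
    Time.iso8601(stamp) >= cutoff && status.start_with?("5")
  end
end

sample = [
  "2024-01-01T11:40:00Z 500 /login",   # outside the 10 minute window
  "2024-01-01T12:00:00Z 500 /login",
  "2024-01-01T12:05:00Z 200 /status",
  "2024-01-01T12:08:00Z 503 /alerts"
]

now = Time.iso8601("2024-01-01T12:09:00Z")
errors = count_server_errors(sample, now: now)
threshold = 1 # assumed threshold for the example
puts "ALERT: #{errors} server errors in the last 10m" if errors > threshold
```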

Dumb health checks
Simple External Monitoring
• Wormly and Monitis
• Completely bypass PagerDuty for backup alerts
• Meant as a last-ditch effort
• Very naive in the health checks
• Had to build out a smarter health check page

Dumb health checks made smarter
Simple External Monitoring
• Health Check Page
• Lightly touches internal services
• Gives back an expected value for each service
• Alert on non-expected value
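
One way to picture the "expected value per service" idea is the sketch below. The talk does not describe the page format, so the `service=value` lines, the service names, and the helper names are all hypothetical.

```ruby
# Hypothetical expected values; the real health check page format is not
# described in the talk, so "service=value" lines are an assumption.
EXPECTED = { "db" => "ok", "queue" => "ok", "cache" => "ok" }.freeze

# Parse the page body into a { service => value } hash, skipping blank lines.
def parse_health_page(body)
  body.each_line.each_with_object({}) do |line, reported|
    service, value = line.strip.split("=", 2)
    reported[service] = value unless service.nil? || service.empty?
  end
end

# Services whose reported value is missing or differs from the expected one.
def unexpected_services(body, expected = EXPECTED)
  reported = parse_health_page(body)
  expected.reject { |service, value| reported[service] == value }.keys
end

page = "db=ok\nqueue=timeout\n"
bad = unexpected_services(page)
puts "ALERT: unexpected health values for #{bad.join(', ')}" unless bad.empty?
```

Treating a missing service the same as a wrong value means a half-rendered page still raises an alert, which suits a dumb external checker that only compares text.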

Treat security as monitoring
Security Monitoring
• Intrusion detection via OSSEC
  • Monitor logs / checksum directories
• Port scanners via Gauntlt
  • Runs continuously
• SQLMAP attacks
  • Not very useful against Rails
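
The directory checksumming idea can be illustrated in a few lines: hash every file, keep a baseline, and report anything that differs later. This is only a sketch of the concept with hypothetical helper names; it is not how OSSEC implements its integrity checks.

```ruby
require "digest"

# Map every regular file under `dir` to its SHA-256 digest.
def checksum_dir(dir)
  files = Dir.glob(File.join(dir, "**", "*")).select { |path| File.file?(path) }
  files.sort.each_with_object({}) do |path, sums|
    sums[path] = Digest::SHA256.file(path).hexdigest
  end
end

# Paths that were added, removed, or modified between two snapshots.
def changed_paths(baseline, current)
  (baseline.keys | current.keys).reject { |path| baseline[path] == current[path] }.sort
end

# Example with hand-written snapshots (digests shortened for readability).
before = { "/etc/passwd" => "aaa", "/etc/hosts" => "bbb" }
after  = { "/etc/passwd" => "ccc", "/etc/hosts" => "bbb", "/etc/cron.d/x" => "ddd" }
changed_paths(before, after).each { |path| puts "ALERT: #{path} changed" }
```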

The single host does not matter anymore
Distributed Systems Monitoring
• Alert on cluster-level metrics
  • Overall number of 500s
  • % of nodes down
  • Overall latency
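
As a sketch of what cluster-level means here, the three metrics on the slide can be rolled up from per-node numbers as below. The `Node` fields, the sample values, and the 30% alert line are assumptions for illustration.

```ruby
# Hypothetical per-node sample; field names are assumptions for this example.
Node = Struct.new(:name, :up, :requests, :errors_500, :latency_ms_total)

# Roll per-node numbers up into the cluster-level metrics from the slide.
def cluster_summary(nodes)
  return { percent_down: 0.0, total_500s: 0, avg_latency_ms: 0.0 } if nodes.empty?

  total_requests = nodes.sum(&:requests)
  latency_total = nodes.sum(&:latency_ms_total)
  {
    percent_down: 100.0 * nodes.count { |n| !n.up } / nodes.size,
    total_500s: nodes.sum(&:errors_500),
    avg_latency_ms: total_requests.zero? ? 0.0 : latency_total.to_f / total_requests
  }
end

nodes = [
  Node.new("web1", true, 100, 1, 5_000),
  Node.new("web2", true, 100, 0, 7_000),
  Node.new("web3", true, 200, 3, 8_000),
  Node.new("web4", false, 0, 0, 0)
]

summary = cluster_summary(nodes)
puts summary
# One node down out of four is not worth a page by itself; the cluster as a whole is.
puts "ALERT: too many nodes down" if summary[:percent_down] > 30
```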

Stuff that you do not control
Dependency Monitoring
• How to monitor?
• Operations
  • DNS -> Create/Delete records
  • Monitoring Tools -> Basic ping
  • Logging -> Validate that logs are being pushed
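
The DNS create/delete check can be sketched as a round trip against the provider. The provider interface below (`create_record`, `lookup`, `delete_record`) is hypothetical, and `FakeDnsProvider` is an in-memory stand-in so the example runs without any real DNS account.

```ruby
# In-memory stand-in for a DNS provider API. The method names are hypothetical;
# a real check would call the provider's own client instead.
class FakeDnsProvider
  def initialize
    @records = {}
  end

  def create_record(name, value)
    @records[name] = value
  end

  def lookup(name)
    @records[name]
  end

  def delete_record(name)
    @records.delete(name)
  end
end

# Exercise the dependency end to end: create a canary record, confirm it
# exists, delete it, confirm it is gone. Any error counts as a failure.
def dns_round_trip_ok?(provider, name: "monitoring-canary.example.com", value: "192.0.2.1")
  provider.create_record(name, value)
  created = provider.lookup(name) == value
  provider.delete_record(name)
  deleted = provider.lookup(name).nil?
  created && deleted
rescue StandardError
  false
end

puts "ALERT: DNS provider failed create/delete round trip" unless dns_round_trip_ok?(FakeDnsProvider.new)
```

Running a real operation rather than polling a status page checks the thing you actually depend on, which is the same lesson as the SMS story.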

When SMS providers screw us over
Quick Story
• Primary SMS provider was “Up”
• Customer was not getting their SMS
• Found out in the worst way possible
  • Customer called us
• Provider was working but T-Mobile prepaid was not passing our short code through

aka how to abuse unlimited messaging plans
End to End testing
• Every minute we send an SMS alert
  • Every SMS provider we use
  • Main carriers
    • Verizon
    • AT&T
    • T-Mobile
    • Sprint
• Measure response times
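
A rough sketch of that probe loop: try every provider and carrier combination, record how long delivery took, and flag the pairs that never arrive. The provider names, the injected `send_and_wait` callable, and the 300 second cutoff are assumptions; the talk does not describe the real implementation.

```ruby
# Hypothetical provider names; the carriers are the ones listed on the slide.
PROVIDERS = ["provider_1", "provider_2"].freeze
CARRIERS = ["Verizon", "AT&T", "T-Mobile", "Sprint"].freeze

# send_and_wait is an injected callable (hypothetical). It sends one test SMS
# via `provider` to a test phone on `carrier` and returns the seconds until
# delivery, or nil if the message never showed up.
def probe_all(providers, carriers, send_and_wait, max_latency_s: 300)
  providers.product(carriers).map do |provider, carrier|
    latency = send_and_wait.call(provider, carrier)
    delivered = !latency.nil? && latency <= max_latency_s
    { provider: provider, carrier: carrier, latency_s: latency, delivered: delivered }
  end
end

# Fake transport for the demo: one provider/carrier pair silently drops messages.
fake_transport = lambda do |provider, carrier|
  provider == "provider_1" && carrier == "T-Mobile" ? nil : 20
end

results = probe_all(PROVIDERS, CARRIERS, fake_transport)
results.reject { |r| r[:delivered] }.each do |r|
  puts "ALERT: #{r[:provider]} -> #{r[:carrier]} is not delivering"
end
```

Probing per provider and per carrier is what would have caught the situation in the story, where the provider looked healthy but one carrier path was dropping messages.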

Sorry, cannot tell you which carrier is which
Some stats (Averages)
• Carrier A: 15 seconds
• Carrier B: 60 seconds
• Carrier C: 25 seconds
• Carrier D: 200 seconds

Automate all the things
How we cheat using Chef
• All monitoring data consumption is set up
  • New Relic
  • DataDog
  • SumoLogic
  • OSSEC
• Wormly and Monitis are not automated
• Cluster alert setup is not automated

Catch the easy stuff
DataDog Alert API

    pd_datadog_alert "File System Filling on #{host}" do
      metric_name "system.disk.in_use"
      function "avg"
      greater_than 0.85
      time_frame '1h'
      page [ 'ops' ]
    end

What we have learned from Failure Friday
• Process monitoring is co-mingled with the running process
• Only localhost checks on service
• Requires outbound network connection