well...
• Alert and Incident Tracking
• On-Call Management
• Integrates with monitoring tools
• Alert the right person, every time
Who Watches the Watchmen? DevOps Days Chicago 2014
• Philosophies
• Tools
• Security
• Distributed Systems
• Dependency
• How we cheat by using Chef
• Validation
• Q and A
• StatsD is the client
• DataDog is the backend
• Super easy to use
  • statsd.gauge(metric_name, val)
  • statsd.counter(metric_name)
  • statsd.histogram(metric_name, val)
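To make the three calls above concrete, here is a minimal sketch of what a StatsD-style client sends over the wire. It is not the client library used in the talk: the `TinyStatsd` class and `format_metric` helper are hypothetical names, and the host and port (localhost, UDP 8125) are assumed defaults. The `|h` histogram type is a DogStatsD extension to the plain StatsD format.

```python
import socket


def format_metric(name: str, value, metric_type: str) -> str:
    """Render one StatsD datagram, e.g. 'api.latency:12.5|h'."""
    return f"{name}:{value}|{metric_type}"


class TinyStatsd:
    """Fire-and-forget UDP emitter (illustrative, not a real client library)."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125):
        # Assumed agent address; adjust to wherever the agent listens.
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload: str) -> None:
        try:
            self.sock.sendto(payload.encode("ascii"), self.addr)
        except OSError:
            pass  # emitting a metric should never break the application

    def gauge(self, name: str, value) -> None:
        self._send(format_metric(name, value, "g"))

    def counter(self, name: str, delta: int = 1) -> None:
        self._send(format_metric(name, delta, "c"))

    def histogram(self, name: str, value) -> None:
        self._send(format_metric(name, value, "h"))


statsd = TinyStatsd()
statsd.gauge("queue.depth", 42)
statsd.counter("alerts.sent")
statsd.histogram("sms.delivery_seconds", 4.7)
```

Because the transport is UDP, a missing or slow agent costs the caller nothing, which is a large part of why this style of instrumentation is easy to adopt.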
Logs
• Engineers set up alerts on patterns
  • “Too many 500s in the last 10m”
• Somewhat self-service
• Initial setup is in Chef
• Hard to use for realtime debugging
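A pattern alert such as "too many 500s in the last 10m" boils down to a sliding-window count. The sketch below shows that idea only; it is not how the logging service implements it. The class name `ErrorRateAlert` and the default threshold of 100 are assumptions made for illustration.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when more than `threshold` 5xx responses fall inside the window."""

    def __init__(self, threshold: int = 100, window_seconds: float = 600.0):
        self.threshold = threshold            # assumed value, tune per service
        self.window_seconds = window_seconds  # 600 s = the "last 10m"
        self.events = deque()                 # timestamps of recent 5xx responses

    def record(self, timestamp: float, status: int) -> bool:
        """Record one response; return True if the alert should fire."""
        if 500 <= status <= 599:
            self.events.append(timestamp)
        # Drop errors that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events) > self.threshold
```

Feeding it parsed log lines in timestamp order is enough: each call both ages out old errors and answers whether the alert condition currently holds.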
and Monitis
• Simple tools
• Backup alerting
• Very naive in the health checks
• Had to build out smarter health check page
• Health Check Page
  • Lightly touches internal services
  • Gives back an expected value for each service
  • Alert on non-expected value
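The health check page described above can be sketched as a table of probes, each paired with the value it should return. Everything named here is hypothetical: the probe functions stand in for light touches on real internal services, and `run_health_checks` is not taken from the talk.

```python
from typing import Any, Callable, Dict, List, Tuple

# Each check pairs a cheap probe with the value it is expected to return.
Check = Tuple[Callable[[], Any], Any]


def run_health_checks(checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run every probe; a probe that raises counts as unhealthy."""
    results = {}
    for name, (probe, expected) in checks.items():
        try:
            results[name] = probe() == expected
        except Exception:
            results[name] = False
    return results


def failing(results: Dict[str, bool]) -> List[str]:
    """Names of services that returned a non-expected value."""
    return sorted(name for name, ok in results.items() if not ok)


# Hypothetical probes standing in for real service calls.
def ping_cache() -> str:
    return "PONG"


def count_queue_workers() -> int:
    raise ConnectionError("queue unreachable")


checks = {
    "cache": (ping_cache, "PONG"),
    "queue": (count_queue_workers, 3),
}
print(failing(run_health_checks(checks)))
```

An external checker then only needs to fetch the page and alert when the list of failing services is not empty, which keeps the external tool simple while the page itself carries the smarter logic.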
anymore
• Alert on cluster-level metrics
  • Overall number of 500s
  • % of nodes down
  • Overall latency
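As a rough illustration of the three cluster-level metrics listed above, the following sketch rolls per-node figures into one summary. The `NodeStats` fields and the `cluster_summary` function are assumptions made for the example, not the talk's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeStats:
    up: bool
    errors_500: int          # 500 responses served in the sample period
    requests: int            # total requests served in the sample period
    total_latency_ms: float  # summed latency of those requests


def cluster_summary(nodes: List[NodeStats]) -> Dict[str, float]:
    """Aggregate per-node numbers into the metrics worth alerting on."""
    total = len(nodes)
    down = sum(1 for n in nodes if not n.up)
    requests = sum(n.requests for n in nodes)
    latency = sum(n.total_latency_ms for n in nodes)
    return {
        "errors_500": sum(n.errors_500 for n in nodes),
        "pct_nodes_down": 100.0 * down / total if total else 0.0,
        "avg_latency_ms": latency / requests if requests else 0.0,
    }
```

Alerting on the summary rather than on each node means a single failed host raises the percentage of nodes down slightly instead of paging someone.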
• How to monitor?
• Operations
  • DNS -> Create/Delete records
  • Monitoring Tools -> Basic ping
  • Logging -> Validate that logs are being pushed
• Status Pages
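Two of the operations above lend themselves to small synthetic checks: exercising a create/delete round trip, and confirming that logs are still arriving. The sketch below assumes the caller supplies the provider-specific pieces; `create`, `delete`, the record name, and the five-minute staleness limit are all hypothetical.

```python
from typing import Callable


def create_delete_check(create: Callable[[str], None],
                        delete: Callable[[str], None],
                        name: str) -> bool:
    """Return True if a throwaway record can be created and then deleted."""
    try:
        create(name)
        delete(name)
    except Exception:
        return False
    return True


def logs_are_stale(last_seen_ts: float, now: float,
                   max_age_seconds: float = 300.0) -> bool:
    """Return True if the newest log line is older than the allowed age."""
    return (now - last_seen_ts) > max_age_seconds


# In-memory stand-in for a DNS provider API, for illustration only.
records = set()
ok = create_delete_check(records.add, records.remove, "canary.example.com")
print(ok, logs_are_stale(last_seen_ts=1000.0, now=1200.0))
```

Passing the operations in as callables keeps the check independent of any one vendor's API, so the same function can wrap whichever provider is in use.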
• Primary SMS provider was “Up”
• Customer was not getting their SMS
• Found out in the worst way possible
  • Customer called us
• Provider was working but T-Mobile prepaid was not passing our short code through
unlimited messaging plans
• Every minute we send an SMS alert
• Every SMS provider we use
• Main Carriers
  • Verizon
  • AT&T
  • T-Mobile
  • Sprint
• Measure response times
carrier is which
• Carrier A: 15 seconds
• Carrier B: 5 seconds
• Carrier C: 15 seconds
• Carrier D: 50 seconds
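One way to turn per-carrier response times into an alert is to compare each measured delivery against that carrier's usual figure. The sketch below assumes the numbers on the slide are typical delivery times and uses an arbitrary tolerance of twice the typical value; neither the tolerance nor the function name `slow_carriers` comes from the talk.

```python
from typing import Dict, List, Optional

# Typical delivery times in seconds, taken from the slide (carriers anonymized).
TYPICAL_SECONDS = {
    "Carrier A": 15,
    "Carrier B": 5,
    "Carrier C": 15,
    "Carrier D": 50,
}


def slow_carriers(measured: Dict[str, Optional[float]],
                  typical: Dict[str, float] = TYPICAL_SECONDS,
                  tolerance: float = 2.0) -> List[str]:
    """Carriers whose test SMS never arrived or took too long."""
    slow = []
    for carrier, baseline in typical.items():
        elapsed = measured.get(carrier)  # None means the SMS never arrived
        if elapsed is None or elapsed > baseline * tolerance:
            slow.append(carrier)
    return slow


# Hypothetical measurements from one minute's round of test messages.
measured = {"Carrier A": 12, "Carrier B": 11, "Carrier C": 30, "Carrier D": None}
print(slow_carriers(measured))
```

Using a per-carrier baseline matters because a 30-second delivery is normal for the slowest carrier but a clear regression for the fastest one.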
things
• Install all the agents
  • New Relic
  • DataDog – Easy alerts as well
  • SumoLogic
  • OSSEC
• Backup alerts are not automated
• Cluster alert setup is not automated
our own services
• Process failure
• Datacenter failure
• Network failure
• https://blog.pagerduty.com/failure-friday-at-pagerduty
monitoring is co-mingled with the process running
• Only localhost checks on service
• Alerts require outbound network connection