Who Watches The Watchmen?

Arup Chakrabarti, Operations Engineering, PagerDuty (@arupchak)
DevOps Days Austin

What is PagerDuty? Ops guys know all too well...
• Alert and incident tracking
• On-call management
• Integrates with monitoring tools
• Alert the right person, every time

What we will cover
• What is PagerDuty? (DONE!)
• Monitoring philosophies
• Monitoring tools we use
• Distributed systems monitoring
• Security monitoring
• Dependency monitoring
• How we cheat by using Chef
• How we validate our monitoring
• Q and A

Disclaimer: I did not come up with everything

Monitoring Philosophies: Thou shalt…
• Use the right tool
• Not all monitoring tools are for everything
• Avoid single host monitoring

Monitoring Philosophies: Thou shalt…
• Alert on what your customers care about
• Alert on expected thresholds (both high and low)
• Make it as self-service as possible
• Validate that your alerts work
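Sketch: alerting on both a high and a low threshold amounts to a band check. This is only an illustration; `Band`, `check`, and the sample numbers are hypothetical and do not come from the talk.

```ruby
# Hypothetical band check: a metric is healthy only inside [low, high].
Band = Struct.new(:low, :high) do
  # Returns :ok, :too_low, or :too_high for a numeric sample.
  def check(value)
    return :too_low if value < low
    return :too_high if value > high
    :ok
  end
end

# Assumed example: requests per minute normally sit between 200 and 5000.
requests_per_minute = Band.new(200, 5000)
puts requests_per_minute.check(12)    # traffic dried up
puts requests_per_minute.check(900)   # inside the band
puts requests_per_minute.check(9000)  # traffic spike
```

The low bound matters because a metric that drops to near zero (no traffic, no alerts sent) is often a worse sign than a spike.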

Monitoring Tools: New Relic

New Relic / APMs in general: Great for small environments
• Pros
  • Great for new stacks
  • Helpful for tracing transactions
  • Gives a lot of data

Problems with APMs: Not all roses
• Cons
  • They can be overly prescriptive
  • They can be hard to tune/customize
  • Gives a lot of data

DataDog / StatsD: All hail self-service metrics
• DataDog is the backend
• StatsD is the client
• Super easy to use
  • statsd.gauge(metric_name, val)
  • statsd.counter(metric_name)
  • statsd.histogram(metric_name, val)
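Sketch: what the three calls above put on the wire. StatsD metrics are plain-text datagrams of the form `<name>:<value>|<type>`, usually sent over UDP to port 8125. `TinyStatsd` is a hypothetical stand-in written for this illustration; it mirrors the method names on the slide and is not the API of any particular client library.

```ruby
require "socket"

# Hypothetical minimal client; real client libraries add tags, sampling, etc.
class TinyStatsd
  # sink: any object responding to call(String); defaults to a UDP sender.
  def initialize(host: "127.0.0.1", port: 8125, sink: nil)
    @sink = sink || begin
      socket = UDPSocket.new
      ->(datagram) { socket.send(datagram, 0, host, port) }
    end
  end

  def gauge(name, value)
    emit(name, value, "g")
  end

  def counter(name, by = 1)
    emit(name, by, "c")
  end

  # "h" is the histogram type understood by the Datadog agent's StatsD server.
  def histogram(name, value)
    emit(name, value, "h")
  end

  private

  # StatsD line format: <metric name>:<value>|<type>
  def emit(name, value, type)
    @sink.call("#{name}:#{value}|#{type}")
  end
end

# Capture datagrams in an array instead of sending them, to show the format.
sent = []
statsd = TinyStatsd.new(sink: ->(datagram) { sent << datagram })
statsd.gauge("queue.depth", 42)
statsd.counter("alerts.sent")
statsd.histogram("alert.delivery_seconds", 1.7)
puts sent
```

The metric names are made up. Because each call is a single fire-and-forget datagram, adding a metric costs one line of application code, which is what makes this style self-service.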

Custom Metrics

Custom Alerts

Custom Notifications
• PagerDuty Integration
• Email
• HipChat

StatsD / DataDog
• Pros
  • Very customizable for your business reqs
  • Can change as you grow
  • Self-service

StatsD / DataDog
• Cons
  • Need to have Configuration Management
  • Hard to ramp teams up

SumoLogic: Logging as Monitoring
• We ship all of our critical apps' logs
• Engineers set up alerts on patterns
  • “Too many 500s in the last 10m”
• Somewhat self-service
• Initial setup is in Chef
• Hard to use for realtime debugging
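Sketch: the logic behind a "too many 500s in the last 10m" pattern alert, written out as plain Ruby. This is not SumoLogic query syntax. The log line format, the method names, and the threshold are assumptions made for the illustration.

```ruby
require "time"

# Hypothetical log format: "<ISO 8601 timestamp> <HTTP status> <path>"
# Counts 5xx responses whose timestamp falls inside the trailing window.
def server_errors_in_window(lines, now:, window_seconds: 600)
  lines.count do |line|
    stamp, status, = line.split(" ", 3)
    age = now - Time.iso8601(stamp)
    status.to_i.between?(500, 599) && age >= 0 && age <= window_seconds
  end
end

def too_many_500s?(lines, now:, threshold:)
  server_errors_in_window(lines, now: now) > threshold
end

# Made-up sample data.
now = Time.iso8601("2014-05-06T12:10:00Z")
log = [
  "2014-05-06T11:55:00Z 500 /alerts",     # older than the 10-minute window
  "2014-05-06T12:01:00Z 500 /alerts",
  "2014-05-06T12:05:30Z 200 /health",
  "2014-05-06T12:09:59Z 503 /incidents",
]
puts too_many_500s?(log, now: now, threshold: 1)
```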

Simple External Monitoring: Dumb health checks
• Wormly and Monitis
• Completely bypass PagerDuty for backup alerts
• Meant as a last-ditch effort
• Very naive in the health checks
• Had to build out a smarter health check page

Simple External Monitoring: Dumb health checks made smarter
• Health check page
  • Lightly touches internal services
  • Gives back an expected value for each service
  • Alert on non-expected value
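Sketch: a health check page that touches each internal service lightly, reports one value per service, and leaves the comparison against expected values to the external checker. The service names, probes, and expected values below are hypothetical; the talk does not describe the actual page.

```ruby
# Hypothetical probes: each lambda lightly touches one internal service
# and returns a small value. Real probes would query the actual services.
PROBES = {
  "database" => -> { 1 },                           # e.g. a trivial query
  "queue"    => -> { "PONG" },                      # e.g. reply to a ping
  "search"   => -> { raise "connection refused" },  # simulated outage
}.freeze

EXPECTED = { "database" => 1, "queue" => "PONG", "search" => "green" }.freeze

# Runs every probe; a raised error becomes a reported value, not a crash.
def health_report(probes)
  probes.transform_values do |probe|
    begin
      probe.call
    rescue StandardError => e
      "error: #{e.message}"
    end
  end
end

# Names of services whose reported value differs from the expected one.
def unhealthy(report, expected)
  expected.reject { |name, want| report[name] == want }.keys
end

report = health_report(PROBES)
puts report
failing = unhealthy(report, EXPECTED)
puts "ALERT: #{failing.join(', ')}" unless failing.empty?
```

Matching on an exact expected value, rather than on "the page returned 200", is what lets a naive external checker catch a service that is up but answering wrongly.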

Security Monitoring: Treat security as monitoring
• Intrusion detection via OSSEC
  • Monitor logs / checksum directories
• Port scanners via Gauntlt
  • Runs continuously
• SQLMAP attacks
  • Not very useful against Rails

PagerDuty at PagerDuty? YES

Distributed Systems Monitoring: The single host does not matter anymore
• Alert on cluster-level metrics
  • Overall number of 500s
  • % of nodes down
  • Overall latency
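Sketch: rolling per-node samples up into the three cluster-level numbers listed above. The `Node` fields, the sample values, and the choice of a request-weighted mean for latency are assumptions for this illustration.

```ruby
# Hypothetical per-node samples for one reporting interval.
Node = Struct.new(:name, :up, :errors_500, :requests, :latency_ms)

def cluster_summary(nodes)
  live = nodes.select(&:up)
  requests = live.sum(&:requests)
  # Request-weighted mean so a busy node counts for more than an idle one.
  weighted_latency = live.sum { |n| n.latency_ms * n.requests }
  {
    percent_down: 100.0 * (nodes.size - live.size) / nodes.size,
    total_500s: live.sum(&:errors_500),
    mean_latency_ms: requests.zero? ? nil : weighted_latency / requests.to_f,
  }
end

nodes = [
  Node.new("app-1", true, 3, 1000, 40),
  Node.new("app-2", true, 1, 3000, 60),
  Node.new("app-3", false, 0, 0, 0),
  Node.new("app-4", true, 0, 1000, 50),
]
puts cluster_summary(nodes)
```

With alerts keyed on these aggregates, losing `app-3` alone changes `percent_down` but need not page anyone unless it crosses the cluster threshold.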

Avoid single host alerting: Cron should not be used for creating alerts
[Diagram: US West 1, US West 2, Linode; highly available monitoring system]

Same model for service alerting
[Diagram: Service A, Service B, Service C; highly available monitoring system]

Dependency Monitoring: Stuff that you do not control
• Dependencies everywhere
  • Operations
    • DNS
    • Monitoring tools
    • Logging

Dependency Monitoring: Stuff that you do not control
• How to monitor?
  • Operations
    • DNS -> Create/delete records
    • Monitoring tools -> Basic ping
    • Logging -> Validate that logs are being pushed
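Sketch: a create/delete canary for a DNS dependency. The check exercises the full write path (create, read back, delete, confirm gone) rather than only resolving an existing name. `FakeDnsProvider` is a hypothetical in-memory stand-in; a real check would wrap the provider's own API client behind the same three methods.

```ruby
# Hypothetical in-memory stand-in for a DNS provider's API client.
class FakeDnsProvider
  def initialize
    @records = {}
  end

  def create(name, value)
    @records[name] = value
  end

  def lookup(name)
    @records[name]
  end

  def delete(name)
    @records.delete(name)
  end
end

# Runs the create -> read -> delete cycle and reports the first failing step.
# The record name and address are placeholder documentation values.
def dns_canary(provider, name: "canary.example.com", value: "192.0.2.1")
  provider.create(name, value)
  return :create_failed unless provider.lookup(name) == value

  provider.delete(name)
  return :delete_failed unless provider.lookup(name).nil?

  :ok
rescue StandardError
  :provider_error
end

puts dns_canary(FakeDnsProvider.new)
```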

Dependency Monitoring: What keeps us up at night
• Dependencies everywhere
  • Software
    • Email
    • SMS
    • Phone
    • Push notifications

Quick Story: When SMS providers screw us over
• Primary SMS provider was “Up”
• Customer was not getting their SMS
• Found out in the worst way possible
  • Customer called us
• Provider was working, but T-Mobile prepaid was not passing our short code through

End to End testing: aka how to abuse unlimited messaging plans
• Every minute we send an SMS alert
• Every SMS provider we use
• Main carriers
  • Verizon
  • AT&T
  • T-Mobile
  • Sprint
• Measure response times
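Sketch: turning end-to-end SMS probes into per-carrier delivery stats and an alert list. The `Probe` record, the provider and carrier labels, and the 60-second limit are hypothetical; the talk does not describe how the results are stored or evaluated.

```ruby
# Hypothetical probe result: when a test SMS was sent and when (if ever)
# the handset on that carrier received it. Times are in seconds.
Probe = Struct.new(:provider, :carrier, :sent_at, :received_at) do
  # Delivery latency in seconds, or nil if the message never arrived.
  def latency
    received_at && received_at - sent_at
  end
end

# Average delivery latency per carrier, plus a count of lost messages.
def summarize(probes)
  probes.group_by(&:carrier).transform_values do |group|
    delivered = group.map(&:latency).compact
    {
      average_seconds: delivered.empty? ? nil : delivered.sum / delivered.size.to_f,
      undelivered: group.size - delivered.size,
    }
  end
end

# Flags carriers with any lost message or an average above the allowed delay.
def carriers_to_alert(summary, max_average_seconds:)
  flagged = summary.select do |_carrier, stats|
    average = stats[:average_seconds]
    stats[:undelivered] > 0 || average.nil? || average > max_average_seconds
  end
  flagged.keys
end

probes = [
  Probe.new("provider-1", "carrier-a", 0, 12),
  Probe.new("provider-2", "carrier-a", 0, 18),
  Probe.new("provider-1", "carrier-b", 0, nil),  # never arrived
  Probe.new("provider-2", "carrier-b", 0, 40),
]
summary = summarize(probes)
puts summary
puts carriers_to_alert(summary, max_average_seconds: 60).inspect
```

Tracking undelivered messages separately from latency is what would catch the case in the story above, where the provider reported success but one carrier dropped the message.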

“Device Lab”

Some stats (Averages): Sorry, cannot tell you which carrier is which
• Carrier A: 15 seconds
• Carrier B: 60 seconds
• Carrier C: 25 seconds
• Carrier D: 200 seconds

Kinda spiky…

How we cheat using Chef: Automate all the things
• All monitoring data consumption is set up
  • New Relic
  • DataDog
  • SumoLogic
  • OSSEC
• Wormly and Monitis are not automated
• Cluster alert setup is not automated

DataDog Alert API: Catch the easy stuff

    pd_datadog_alert "File System Filling on #{host}" do
      metric_name "system.disk.in_use"
      function "avg"
      greater_than 0.85
      time_frame '1h'
      page [ 'ops' ]
    end

Easy alerts: Only if the environment is small enough
• Load
• CPU
• Memory
• Disk utilization

SumoLogic Setup
https://github.com/PagerDuty/chef-sumologic

    sumo_source 'syslog' do
      path '/var/log/syslog'
      category 'syslog'
    end

How we validate our monitoring: Failure Friday
• We attack our own services
  • Process failure
  • Datacenter failure
  • Network failure
• http://blog.pagerduty.com/2013/11/failure-friday-at-pagerduty/
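Sketch: using an injected failure to validate monitoring. The drill breaks something on purpose, then checks that the matching alert fires before a deadline; if it does not, the drill has found a monitoring gap. `drill` and its parameters are hypothetical, and the callables stand in for whatever actually stops a process or queries the alerting system.

```ruby
# Hypothetical drill: inject a failure, then poll for the matching alert.
# inject:      callable that breaks something (e.g. stops a process)
# alert_fired: callable returning true once monitoring has raised the alert
# restore:     callable that undoes the failure; always runs at the end
# sleeper:     callable used to wait between polls (replaceable in tests)
def drill(inject:, alert_fired:, restore:,
          deadline_seconds: 300, poll_seconds: 5, sleeper: method(:sleep))
  inject.call
  waited = 0
  until alert_fired.call
    return :monitoring_gap if waited >= deadline_seconds

    sleeper.call(poll_seconds)
    waited += poll_seconds
  end
  :alerted
ensure
  restore.call
end

# Demo with fakes: the "alert" fires on the third poll.
state = { broken: false, polls: 0 }
result = drill(
  inject:      -> { state[:broken] = true },
  alert_fired: -> { (state[:polls] += 1) >= 3 },
  restore:     -> { state[:broken] = false },
  sleeper:     ->(_seconds) {}  # skip real waiting in this demo
)
puts result
```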

What we have learned
• Process monitoring is co-mingled with the process running
• Only localhost checks on service
• Requires outbound network connection from Failure Friday

Thank you
We are hiring: http://pagerduty.com/jobs
Arup Chakrabarti, Operations Engineering, @arupchak
