Who Watches The Watchmen (Updated)

This is a revised version of my talk on how we monitor PagerDuty, presented at DevOpsDays Chicago 2014. Video link:
https://www.youtube.com/watch?v=jUfd3XmJtfo

Arup Chakrabarti

October 08, 2014

Transcript

  1. @arupchak What is PagerDuty? 10/23/14

    Ops Guys know all too well...
    •  Alert and Incident Tracking
    •  On-Call Management
    •  Integrates with monitoring tools
    •  Alert the right person, every time
    Who Watches the Watchmen? DevOps Days Chicago 2014
  2. Why do we care about monitoring?

    Oct 2014 US East Outage – Outgoing Traffic
  3. Today’s talk is about:

    •  What is PagerDuty?
    •  Philosophies
    •  Tools
    •  Security
    •  Distributed Systems
    •  Dependency
    •  How we cheat by using Chef
    •  Validation
    •  Q and A
  4. Quick Disclaimer

    I did not come up with everything
    •  I work with smart people
    •  Slides will be posted
  5. Philosophies

    Thou Shall:
    •  Use the right tool
    •  Avoid single host monitoring
  6. Philosophies

    Thou Shall:
    •  Alert on what customers care about
  7. Philosophies

    Thou Shall:
    •  Alert on expected values
    •  High and Low
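Alerting on expected values, "high and low", can be sketched as a band check: a metric that falls below its normal floor (for example, outgoing traffic dropping toward zero) is as alarming as one that spikes. The function name and the thresholds below are hypothetical and only illustrate the idea; they are not PagerDuty's actual configuration.

```python
def check_expected_range(value, low, high):
    """Return an alert string if value falls outside [low, high], else None."""
    if value < low:
        return f"LOW: {value} is below expected minimum {low}"
    if value > high:
        return f"HIGH: {value} is above expected maximum {high}"
    return None


# Hypothetical thresholds for notifications sent per minute.
print(check_expected_range(0, low=50, high=5000))    # too quiet
print(check_expected_range(800, low=50, high=5000))  # within the expected band
```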
  8. New Relic / APMs

    Great for small env’s
    •  Pros
        •  Great for new stacks
        •  Helpful for tracing transactions
        •  Gives a lot of data
  9. Problem with APMs

    Not a silver bullet
    •  Cons
        •  They can be overly prescriptive
        •  They can be hard to tune/customize
        •  Gives a lot of data
  10. StatsD / DataDog

    All hail self service metrics
    •  StatsD is the client
    •  DataDog is the backend
    •  Super easy to use
        •  statsd.gauge(metric_name, val)
        •  statsd.counter(metric_name)
        •  statsd.histogram(metric_name, val)
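Client calls like the ones on this slide end up as small plain-text UDP datagrams. Here is a minimal sketch of that wire format using only the standard library. It assumes a StatsD-compatible agent listening on localhost port 8125 (the usual StatsD default); the function names are illustrative rather than any particular client library's API, and the `h` histogram type is a DataDog extension to plain StatsD.

```python
import socket

# StatsD datagram type codes: gauge, counter, histogram (DataDog extension).
TYPE_CODES = {"gauge": "g", "counter": "c", "histogram": "h"}


def format_metric(name, value, kind):
    """Build one StatsD datagram payload, e.g. 'queue.depth:42|g'."""
    return f"{name}:{value}|{TYPE_CODES[kind]}"


def send_metric(name, value, kind, host="127.0.0.1", port=8125):
    """Fire-and-forget: send one metric over UDP to the (assumed) local agent."""
    payload = format_metric(name, value, kind).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names, shown only to illustrate the payload format.
print(format_metric("queue.depth", 42, "gauge"))
print(format_metric("alerts.sent", 1, "counter"))
print(format_metric("sms.latency", 0.35, "histogram"))
```

Because the transport is UDP, instrumented code does not block or fail when the agent is down, which is part of what makes self-service metrics cheap to add.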
  11. StatsD / DataDog

    Custom Notifications
    •  PagerDuty Integration
    •  Email
    •  HipChat
  12. StatsD / DataDog

    Customize all the things
    •  Pros
        •  Very customizable
        •  Can change as you grow
        •  Self Service
  13. StatsD / DataDog

    Needs some hand holding
    •  Cons
        •  Need to have Configuration Management
        •  Hard to ramp teams up
  14. SumoLogic

    Logging as Monitoring
    •  Ship Critical App Logs
    •  Engineers set up alerts on patterns
        •  “Too many 500’s in the last 10m”
    •  Somewhat self-service
    •  Initial setup is in Chef
    •  Hard to use for realtime debugging
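A pattern alert such as "too many 500's in the last 10m" comes down to counting matches inside a sliding time window. In practice the log service evaluates such rules itself; the sketch below only illustrates the logic, and the class name and threshold are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


class ErrorRateAlert:
    """Fire when more than `threshold` HTTP 500s occur within `window`."""

    def __init__(self, threshold, window=timedelta(minutes=10)):
        self.threshold = threshold
        self.window = window
        self.timestamps = deque()

    def observe(self, timestamp, status):
        """Record one log entry; return True if the alert should fire."""
        if status == 500:
            self.timestamps.append(timestamp)
        # Drop 500s that have aged out of the sliding window.
        while self.timestamps and timestamp - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold


alert = ErrorRateAlert(threshold=3)
start = datetime(2014, 10, 23, 12, 0, 0)
for i in range(5):
    # Five 500s, one per minute: the fourth and fifth exceed the threshold.
    fired = alert.observe(start + timedelta(minutes=i), 500)
print(fired)
```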
  15. Simple External Monitoring

    Dumb health checks
    •  Wormly and Monitis
    •  Simple tools
    •  Backup alerting
    •  Very naive in the health checks
    •  Had to build out smarter health check page
  16. Simple External Monitoring

    Dumb health checks made smarter
    •  Health Check Page
        •  Lightly touches internal services
        •  Gives back an expected value for each service
        •  Alert on non-expected value
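The consumer side of such a health check page can be sketched as follows: compare each service's reported value against the value expected for it, and treat a missing service as a failure too. The JSON shape, the service names, and the "ok" values are assumptions made for illustration; the slides do not describe the page's actual format.

```python
import json

# Hypothetical expected values, one per internal service.
EXPECTED = {"database": "ok", "queue": "ok", "notifier": "ok"}


def unexpected_services(health_page_body, expected=EXPECTED):
    """Return {service: actual} for every service not reporting its expected value."""
    actual = json.loads(health_page_body)
    return {
        name: actual.get(name, "missing")
        for name, want in expected.items()
        if actual.get(name) != want
    }


# A page where one service is degraded and one is absent altogether.
body = '{"database": "ok", "queue": "degraded"}'
print(unexpected_services(body))
```

A simple external checker only needs to fetch the page and alert when this result is non-empty, which keeps the external tool dumb while the page itself carries the service knowledge.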
  17. Security Monitoring

    Why do security monitoring?
    •  Audits are tedious
    •  Continuous Audits
    •  Earlier alerting
    •  Easier fixing
  18. Security Monitoring

    Audits are tedious
    •  IDS via OSSEC
        •  Monitor Logs / Checksum Dir’s
    •  Port scanning
        •  nmap
    •  Scrape IPSec data
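Checksumming directories means recording a hash of every file and alerting when a later scan differs from the baseline. OSSEC performs this kind of integrity checking itself; the sketch below only illustrates the idea, and both function names are hypothetical.

```python
import hashlib
import os


def checksum_directory(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def changed_files(baseline, current):
    """Return sorted paths that were added, removed, or modified."""
    paths = set(baseline) | set(current)
    return sorted(p for p in paths if baseline.get(p) != current.get(p))
```

A typical use is to store the result of `checksum_directory` for a configuration directory as the baseline, rescan on a schedule, and raise an alert whenever `changed_files` returns a non-empty list.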
  19. Distributed Systems

    The single host does not matter anymore
    •  Alert on cluster level metrics
        •  Overall number of 500’s
        •  % of nodes down
        •  Overall latency
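Cluster-level alerting can be sketched as aggregating per-node data first and applying thresholds only to the aggregate, so that one dead host does not page anyone by itself. The node record shape and the threshold values are hypothetical.

```python
def cluster_alerts(nodes, max_down_pct=25.0, max_errors=100):
    """Evaluate cluster-wide thresholds instead of paging on any single host.

    `nodes` is a list of dicts like {"up": True, "errors_500": 3}.
    """
    total = len(nodes)
    down = sum(1 for node in nodes if not node["up"])
    # An empty cluster is treated as fully down.
    down_pct = 100.0 * down / total if total else 100.0
    errors = sum(node["errors_500"] for node in nodes)

    alerts = []
    if down_pct > max_down_pct:
        alerts.append(f"{down_pct:.0f}% of nodes down")
    if errors > max_errors:
        alerts.append(f"{errors} HTTP 500s across the cluster")
    return alerts


nodes = [
    {"up": True, "errors_500": 2},
    {"up": False, "errors_500": 0},
    {"up": True, "errors_500": 1},
    {"up": True, "errors_500": 0},
    {"up": True, "errors_500": 4},
]
# One of five nodes is down, which stays below the 25% threshold.
print(cluster_alerts(nodes))
```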
  20. Avoid Single Host Alerts

    Crons should not be used for creating alerts
    [Diagram labels: PAGERDUTY, US West 1, US West 2, Linode, Monitoring System]
  21. Same model for service alerting

    [Diagram labels: PAGERDUTY, Service A, Service B, Service C, Monitoring System]
  22. Dependency Monitoring

    Stuff that you do not control
    •  Dependencies Everywhere
    •  Operations
        •  DNS
        •  Monitoring Tools
        •  Logging
  23. Dependency Monitoring

    Stuff that you do not control
    •  How to monitor?
    •  Operations
        •  DNS -> Create/Delete records
        •  Monitoring Tools -> Basic ping
        •  Logging -> Validate that logs are being pushed
        •  Status Pages
  24. Dependency Monitoring

    What keeps us up at night
    •  Dependencies Everywhere
    •  Software
        •  Email
        •  SMS
        •  Phone
        •  Push Notifications
  25. Quick Story

    When SMS providers screw us over
    •  Primary SMS provider was “Up”
    •  Customer was not getting their SMS
    •  Found out in the worst way possible
        •  Customer called us
    •  Provider was working but T-Mobile prepaid was not passing our short code through
  26. End to End testing

    aka how to abuse unlimited messaging plans
    •  Every minute we send an SMS alert
    •  Every SMS provider we use
    •  Main Carriers
        •  Verizon
        •  AT&T
        •  T-Mobile
        •  Sprint
    •  Measure Response times
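The end-to-end test described here can be sketched as a probe that sends a real message through one route, waits for the test handset to report it, and records how long delivery took. The slides do not show how PagerDuty implements this, so `send` and `wait_for_delivery` are hypothetical callables standing in for a provider API and a device-lab receiver, and the timeout and latency limit are made-up values.

```python
import time


def probe_sms(send, wait_for_delivery, timeout=120.0, clock=time.monotonic):
    """Send one test SMS and return the delivery time in seconds.

    `send()` submits a test message through one provider and returns a message
    id; `wait_for_delivery(message_id, timeout)` blocks until the test handset
    reports the message and returns True, or returns False on timeout.
    Both are hypothetical callables supplied by the caller.
    Returns None when the message never arrives.
    """
    started = clock()
    message_id = send()
    if not wait_for_delivery(message_id, timeout):
        return None
    return clock() - started


def run_probes(routes, alert, max_seconds=60.0):
    """Probe every (provider, carrier) route; alert on slow or lost messages."""
    for route, (send, wait_for_delivery) in routes.items():
        elapsed = probe_sms(send, wait_for_delivery)
        if elapsed is None:
            alert(f"{route}: test SMS was not delivered")
        elif elapsed > max_seconds:
            alert(f"{route}: delivery took {elapsed:.0f}s")
```

Measuring at the handset is what catches the failure from the quick story above: a provider that reports itself as up while a carrier silently drops the messages.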
  27. “Device Lab”

    It looks more official now
  28. Some stats (Averages)

    Sorry, cannot tell you which carrier is which
    •  Carrier A: 15 Seconds
    •  Carrier B: 5 Seconds
    •  Carrier C: 15 Seconds
    •  Carrier D: 50 Seconds
  29. How to cheat with Chef

    Automate all the things
    •  Install all the agents
        •  New Relic
        •  DataDog – Easy alerts as well
        •  SumoLogic
        •  OSSEC
    •  Backup alerts are not automated
    •  Cluster alert setup is not automated
  30. How to Validate

    Failure Friday
    •  We attack our own services
        •  Process failure
        •  Datacenter failure
        •  Network failure
    •  https://blog.pagerduty.com/failure-friday-at-pagerduty
  31. What we have learned

    Failure Friday
    •  Process monitoring is co-mingled with the process running
    •  Only localhost checks on service
    •  Alerts require outbound network conn