• What is the problem? • How did we get into this mess? • How can we avoid the problem? • What can you start doing today to fix it? • What is coming in the future? • Q and A
24/7 service • Participates in on-call one week out of every four • Has proper monitoring and alerting for his service • Has proper de-duping on his alerts
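As a rough illustration of what "proper de-duping" on alerts might look like, here is a minimal sketch that suppresses repeat alerts for the same service and check inside a time window. The `Alert` type, the `dedupe` function, and the 300-second window are all hypothetical names and values chosen for this example; they are not taken from the talk or from any particular alerting product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape, for illustration only.
    service: str
    check: str
    timestamp: float  # seconds since some reference point


def dedupe(alerts, window_seconds=300.0):
    """Return only the alerts that should page.

    An alert is suppressed when the same (service, check) pair
    already paged less than `window_seconds` earlier.
    """
    last_paged = {}  # (service, check) -> timestamp of the last page sent
    to_page = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        previous = last_paged.get(key)
        if previous is None or alert.timestamp - previous >= window_seconds:
            to_page.append(alert)
            last_paged[key] = alert.timestamp
    return to_page


if __name__ == "__main__":
    incoming = [
        Alert("api", "latency", 0.0),
        Alert("api", "latency", 60.0),   # repeat inside the window
        Alert("db", "disk", 30.0),
        Alert("api", "latency", 400.0),  # outside the window, pages again
    ]
    for alert in dedupe(incoming):
        print(alert)
```

The window here is measured from the last page that was actually sent, so a check that keeps firing will page again once per window rather than being silenced indefinitely. Whether that is the right policy depends on the service.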
on metrics your customers care about • Resist the temptation to alert on everything • Customers do not care who gets paged: route alerts to the right people • Talk to your business guys for help
alerting data (PagerDuty!) • How many times are you getting paged per week? • Questions to ask per alert: • Was any action taken? • Was a customer affected? • Was this fully within my control?
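The per-alert questions above can be turned into a small audit over exported paging data. The sketch below assumes each page has been hand-tagged with answers to the three questions; the `PageRecord` fields and both function names are hypothetical and do not correspond to any PagerDuty export format.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PageRecord:
    # Hypothetical record: one page plus hand-tagged answers to the questions.
    triggered_at: datetime
    action_taken: bool
    customer_affected: bool
    within_my_control: bool


def pages_per_week(records):
    """Count pages per ISO (year, week) pair."""
    counts = Counter()
    for record in records:
        iso = record.triggered_at.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


def audit_summary(records):
    """Count pages that fail each of the three per-alert questions."""
    return {
        "total": len(records),
        "no_action_taken": sum(1 for r in records if not r.action_taken),
        "no_customer_affected": sum(1 for r in records if not r.customer_affected),
        "outside_my_control": sum(1 for r in records if not r.within_my_control),
    }


if __name__ == "__main__":
    sample = [
        PageRecord(datetime(2024, 1, 1, 3, 0), False, False, True),
        PageRecord(datetime(2024, 1, 2, 14, 0), True, True, True),
        PageRecord(datetime(2024, 1, 8, 3, 0), False, False, False),
    ]
    print(pages_per_week(sample))
    print(audit_summary(sample))
```

Pages that land in `no_action_taken` or `no_customer_affected` are the natural first candidates for tuning or removal, since they woke someone up without changing an outcome.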
quickly • Alert comes in at 3am • Engineer validates it is not an issue • Fixes root cause following day • Longer term, start tagging alerts for data collection