“Can you please do the thing?”
“Success! This data is poorly formatted and you can drop it safely”
Act II - “The Incident”
Narrator: It wasn’t poorly formatted
PagerDuty
• Only use PagerDuty’s APIs, nothing internal
• Do things as customers would
• Enable other teams at PagerDuty to write their own tests
• Bar to write tests is set extremely low
• Test failures page the on-call
• Discover and alert on silent failures
Scala’s “scalatest” library
• Only input is account data
• Cron runs tests occasionally depending on priority
• Updates Postgres with test results
• Alert on test failures
• Alert if no tests have run in a while
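The two alerting rules in the list above (alert on test failures, alert when nothing has run for a while) can be sketched in plain Scala. This is a minimal illustration, not the actual implementation: `TestRun`, `Alert`, `Watchdog`, and the `maxSilence` threshold are hypothetical names, and the rows that would normally be read from Postgres are passed in as an in-memory `Seq`.

```scala
import java.time.{Duration, Instant}

// Hypothetical shape of one stored test result (in the deck these live in Postgres).
final case class TestRun(name: String, finishedAt: Instant, passed: Boolean)

// Hypothetical alert types: one per failing test, one for "the suite went silent".
sealed trait Alert
final case class TestFailed(name: String) extends Alert
final case class NoRecentRuns(lastRun: Option[Instant]) extends Alert

object Watchdog {
  // Decide which alerts to raise from the stored runs.
  // maxSilence is an assumed, configurable limit on the gap since the last run.
  def alerts(runs: Seq[TestRun], now: Instant, maxSilence: Duration): Seq[Alert] = {
    // Only the most recent run of each test decides whether it is failing.
    val latestPerTest: Seq[TestRun] =
      runs.groupBy(_.name).values.map(_.maxBy(_.finishedAt.toEpochMilli)).toSeq

    val failures: Seq[Alert] =
      latestPerTest.filterNot(_.passed).sortBy(_.name).map(r => TestFailed(r.name))

    // Timestamp of the newest run of any test, if there is one.
    val lastRun: Option[Instant] =
      if (runs.isEmpty) None
      else Some(runs.maxBy(_.finishedAt.toEpochMilli).finishedAt)

    // Silent failure: no run at all, or the newest run is older than maxSilence.
    val stale: Boolean =
      lastRun.forall(t => Duration.between(t, now).compareTo(maxSilence) > 0)

    failures ++ (if (stale) Seq(NoRecentRuns(lastRun)) else Nil)
  }
}
```

Treating "no rows at all" as stale is a deliberate choice in this sketch: a cron job that never starts produces no failures, so the absence of results has to raise an alert on its own.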
• Only use PagerDuty’s APIs, nothing internal
• Do things as customers would
• Enable other teams at PagerDuty to write their own tests
• Bar to write tests is set extremely low
• Test failures page the on-call
• Many false positives lead to alert fatigue
• Did not alert adequately on major incidents
Act IV - The Dark Ages