Silent, but Deadly - Production End to End Testing

Peter Kennedy
November 13, 2017

Conference talk at Devopsdays Galway 2017.

Transcript

  1. 2 • Software Engineer at PagerDuty • Single digit years of experience • Really good at breaking Production • @peterjskennedy
  2. 4
  3. 6 Act I - Coming of Age • “We wake you up when shit breaks” • PagerDuty Inc.
  4. 8 Act I - Coming of Age • Customer’s Infrastructure • Customer’s Incident Response • “Notification pipeline”
  5. 9 Act I - Coming of Age • Something broke • Page engineer to fix • “Notification pipeline” • “Magic”
  6. 12 Act I - Coming of Age • “The Notification Pipeline” (ca. 2013-2014) • New Service A • New Service B • New Service C • New Service D • New Service E
  7. 13 Act I - Coming of Age • “The Notification Pipeline” (ca. 2013-2014) • New Service • Fully covered unit tests • Loads of integration tests • Stable
  8. 14 Act I - Coming of Age • “The Notification Pipeline” (ca. 2013-2014) • New Service A • New Service B • New Service C • New Service D • New Service E • TEAM B • TEAM C • TEAM D • TEAM A • TEAM E
  9. 15 Act I - Coming of Age • “The Notification Pipeline” (ca. 2013-2014) • New Service A • New Service B • New Service C • New Service D • New Service E • TEAM B • TEAM C • TEAM D • TEAM A • TEAM E
  10. 17 Act I - Coming of Age • Q: Who pages PagerDuty? • A: PagerDuty … and friends
  11. 18 Act I - Coming of Age • “The Notification Pipeline” (ca. 2013-2014) • New Service A • New Service B • New Service C • New Service D • … • PagerDuty + other alerting tools • You ok? • You ok? • You ok? • You ok?
  12. 19 Act I - Coming of Age • “Who watches the Watchmen” • Arup Chakrabarti
  13. 22 “The Notification Pipeline” (ca. 2014) • New Service A • New Service B • New Service C • New Service D • New Service E • TEAM B • TEAM C • TEAM D • TEAM A • TEAM E • Act II - “The Incident”
  14. 23 “The Notification Pipeline” (ca. 2014) • Service C • Service B • “Can you please do the thing?” • “Success! I did the thing” • Act II - “The Incident”
  15. 24 “The Notification Pipeline” (ca. 2014) • Service C • Service B • “Can you please do the thing?” • “Success! I did the thing” • Act II - “The Incident” • Narrator: It didn’t do the thing
  16. 25 “The Notification Pipeline” (ca. 2014) • Service C • Service B • “Can you please do the thing?” • “Success! This data is poorly formatted and you can drop it safely” • Act II - “The Incident”
  17. 26 “The Notification Pipeline” (ca. 2014) • Service C • Service B • “Can you please do the thing?” • “Success! This data is poorly formatted and you can drop it safely” • Act II - “The Incident” • Narrator: It wasn’t poorly formatted
  18. 27 Act II - “The Incident” • “Hey gang, I’m having trouble receiving alerts for this integration.” • “Anyone know what’s up?”
  19. 29 Act II - “The Incident” • The path for a “lightly used” integration was obstructed
  20. 30 Act II - “The Incident” • The path for a “lightly used” integration was obstructed • … without us knowing
  21. 31 Act II - “The Incident” • The path for a “lightly used” integration was obstructed • … without us knowing • … for several days
  22. 33 Act II - “The Incident” • Found we had gaps in our testing and deployment
  23. 34 Act II - “The Incident” • We stopped everything, until we could figure out what the hell was going on
  24. 36 Act III - “The Dog” • Notifications are mission critical • We needed a way to validate notifications are being sent
  25. 37 Something that requires a responder • Paging a responder • “Notification pipeline” • “Magic” • Act III - “The Dog”
  26. 38 Send in data • “Notification pipeline” • Act III - “The Dog” • Validate notifications are sent
  27. 40 Act III - “The Dog” • Goal: Test framework for PagerDuty • Only use PagerDuty’s APIs, nothing internal • Do things as customers would • Enable other teams at PagerDuty to write their own tests • Bar to write tests is set extremely low • Test failures page the on-call • Discover and alert on silent failures
  28. 41 Act III - “The Dog” • What Watchdog is: • Scala’s “scalatest” library • Only input is account data • Cron runs tests occasionally depending on priority • Updates Postgres with test results • Alert on test failures • Alert if no tests have run in a while
  29. 42 Act III - “The Dog” • What Watchdog is: • Send in data • Receive Notification • Validate data
  30. 43 Act III - “The Dog” • What Watchdog is: • Send in data • Receive Notification • Validate data • Alert PagerDuty Engineering
  31. 45 Act III - “The Dog” • Using a PagerDuty account • Create a PagerDuty service • Send an event from a JSON file • Verify an incident is created
  32. 49 Act IV - The Dark Ages • Watchdog was born • Development at PagerDuty resumed at full speed
  33. 50 Act IV - The Dark Ages • Watchdog was born • Development at PagerDuty resumed at full speed • Exponentially increased
  34. 51 Act IV - The Dark Ages • Our deploys would cause brief interruptions to customers • There were race conditions in our software
  35. 52 Act IV - The Dark Ages • Watchdog became integral to our development • Engineers would run Watchdog in pre-prod environments before deploying
  36. 56 Act IV - The Dark Ages • Watchdog would create resources in PagerDuty via the API, if required for a test
  37. 57 Act IV - The Dark Ages • Watchdog would create resources in PagerDuty via the API, if required for a test • … every time it runs
  38. 60 Act IV - The Dark Ages • 89% • Because it doesn’t clean up after itself • Often referred to as “Watchpuppy”
  39. 64 Act IV - The Dark Ages • Most engineering teams at PagerDuty had a stake in Watchdog • Each team would own their respective tests
  40. 65 Act IV - The Dark Ages • Most engineering teams at PagerDuty had a stake in Watchdog • Each team would own their respective tests • Who manages and maintains Watchdog as a service?
  41. 66 Test framework for PagerDuty • Only use PagerDuty’s APIs, nothing internal • Do things as customers would • Enable other teams at PagerDuty to write their own tests • Bar to write tests is set extremely low • Test failures page the on-call • Act IV - The Dark Ages
  42. 67 Test framework for PagerDuty • Only use PagerDuty’s APIs, nothing internal • Do things as customers would • Enable other teams at PagerDuty to write their own tests • Bar to write tests is set extremely low • Test failures page the on-call • Many false positives lead to alert fatigue • Did not alert adequately on major incidents • Act IV - The Dark Ages
  43. 68 Act IV - The Dark Ages • Watchdog is extremely useful at PagerDuty • However, as a monitoring tool, it did not scale with our business
  44. 71 Act V - Enlightenment • End-to-end testing makes sense in development • Maintaining it as an alerting tool in production does not
  45. 73 Act V - Enlightenment • If you think you need end-to-end monitoring in production • First consider how you’ve laid out your services
  46. 74 Act V - Enlightenment • Are all of your services functionally independent and opinionated?
  47. 76 Act V - Enlightenment • How do you deploy? • Can you catch failures before your users?
  48. 80 Act V - Enlightenment • Only the critical tests that matter • Re-use resources to prevent bloat • Containerized • Runs in Nomad
  49. 81 Act V - Enlightenment • Only the critical tests that matter • Re-use resources to prevent bloat • Containerized • Runs in Nomad • Going to be removed entirely
  50. 83 Act V - Enlightenment • In 2014 we weren’t capable of deriving system health • In 2017 we are
  51. 85 Act V - Enlightenment • It’s ok to kill your dog • You can derive system health without writing tons of code
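
The Watchdog flow described in Act III (create a service, send an event, verify an incident is created, record the result, alert on failures and on tests that have not run in a while) can be sketched roughly as follows. This is only an illustration under stated assumptions: the talk says Watchdog was built on Scala's "scalatest" library and stored results in Postgres, whereas this sketch uses Python and in-memory stand-ins so it runs without any network access. `FakeAlertingClient`, `ResultStore`, `check_event_creates_incident`, and `run_once` are hypothetical names, not PagerDuty's API or Watchdog's actual code.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, List


class FakeAlertingClient:
    """Hypothetical in-memory stand-in for a public API client (not a real SDK)."""

    def __init__(self, drop_events: bool = False) -> None:
        self.drop_events = drop_events
        self.incidents: Dict[str, List[str]] = {}

    def create_service(self, name: str) -> str:
        # Idempotent, so repeated runs reuse one service instead of piling up.
        self.incidents.setdefault(name, [])
        return name

    def send_event(self, service_id: str, dedup_key: str) -> None:
        # Simulated silent failure: the call returns normally but stores nothing.
        if not self.drop_events:
            self.incidents[service_id].append(dedup_key)

    def list_incident_keys(self, service_id: str) -> List[str]:
        return list(self.incidents[service_id])


@dataclass
class ResultStore:
    """Assumed stand-in for the results table the slides say lived in Postgres."""

    last_run: Dict[str, datetime] = field(default_factory=dict)
    failures: List[str] = field(default_factory=list)

    def record(self, test_name: str, passed: bool, now: datetime) -> None:
        self.last_run[test_name] = now
        if not passed:
            self.failures.append(test_name)

    def stale_tests(self, now: datetime, max_age: timedelta) -> List[str]:
        # Tests whose last recorded run is older than max_age.
        return sorted(n for n, ran in self.last_run.items() if now - ran > max_age)


def check_event_creates_incident(client: FakeAlertingClient) -> bool:
    """Send a uniquely keyed event and verify a matching incident exists."""
    service_id = client.create_service("watchdog-e2e")
    dedup_key = uuid.uuid4().hex  # unique per run, so old incidents cannot match
    client.send_event(service_id, dedup_key)
    return dedup_key in client.list_incident_keys(service_id)


def run_once(client: FakeAlertingClient, store: ResultStore, now: datetime) -> bool:
    passed = check_event_creates_incident(client)
    store.record("event_creates_incident", passed, now)
    return passed


if __name__ == "__main__":
    store = ResultStore()
    now = datetime.now(timezone.utc)
    print("healthy pipeline passed:", run_once(FakeAlertingClient(), store, now))
    print("dropping pipeline passed:", run_once(FakeAlertingClient(drop_events=True), store, now))
    print("failures to page on:", store.failures)
    print("stale after 2h:", store.stale_tests(now + timedelta(hours=2), timedelta(hours=1)))
```

Two design points from the slides show up here. Reusing a named service rather than creating a new one per run avoids the resource bloat described in Act IV, and the staleness check covers the case where the test runner itself stops silently. A real check would also need to poll with a timeout, since incident creation is asynchronous.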