Silent, but Deadly - Production End to End Testing

Peter Kennedy
November 13, 2017

Conference talk at Devopsdays Galway 2017.


Transcript

  1. 1 Silent, but Deadly A Comedy in Five Acts

  2. 2

    • Software Engineer at PagerDuty
    • Single-digit years of experience
    • Really good at breaking Production
    @peterjskennedy
  3. 3 Prologue

  4. 4

  5. 5 Act I Coming of Age

  6. 6 Act I - Coming of Age “We wake you

    up when shit breaks” PagerDuty Inc.
  7. 7 Act I - Coming of Age Customer’s Infrastructure Customer’s

    Incident Response
  8. 8 Act I - Coming of Age Customer’s Infrastructure Customer’s

    Incident Response “Notification pipeline”
  9. 9 Act I - Coming of Age Something broke Page

    engineer to fix “Notification pipeline” “Magic”
  10. 10 Act I - Coming of Age “The Notification Pipeline”

    (ca. 2010-2012)
  11. 11 Act I - Coming of Age “The Notification Pipeline”

    (ca. 2013ish) New Service A
  12. 12 Act I - Coming of Age “The Notification Pipeline”

    (ca. 2013-2014) New Service A New Service B New Service C New Service D New Service E
  13. 13 Act I - Coming of Age “The Notification Pipeline”

    (ca. 2013-2014) New Service • Fully covered unit tests • Loads of integration tests • Stable
  14. 14 Act I - Coming of Age “The Notification Pipeline”

    (ca. 2013-2014) New Service A New Service B New Service C New Service D New Service E TEAM B TEAM C TEAM D TEAM A TEAM E
  15. 15 Act I - Coming of Age “The Notification Pipeline”

    (ca. 2013-2014) New Service A New Service B New Service C New Service D New Service E TEAM B TEAM C TEAM D TEAM A TEAM E
  16. 16 Act I - Coming of Age Q: Who pages

    PagerDuty?
  17. 17 Act I - Coming of Age Q: Who pages

    PagerDuty? A: PagerDuty …. and friends
  18. 18 Act I - Coming of Age “The Notification Pipeline”

    (ca. 2013-2014) New Service A New Service B New Service C New Service D … PagerDuty + other alerting tools You ok? You ok? You ok? You ok?
  19. 19 Act I - Coming of Age • “Who watches

    the Watchmen” • Arup Chakrabarti
  20. 20 Act I - Coming of Age

  21. 21 Act II "The Incident"

  22. 22 “The Notification Pipeline” (ca. 2014) New Service A New

    Service B New Service C New Service D New Service E TEAM B TEAM C TEAM D TEAM A TEAM E Act II - “The Incident”
  23. 23 “The Notification Pipeline” (ca. 2014) Service C Service B

    “Can you please do the thing?” “Success! I did the thing” Act II - “The Incident”
  24. 24 “The Notification Pipeline” (ca. 2014) Service C Service B

    “Can you please do the thing?” “Success! I did the thing” Act II - “The Incident” Narrator: It didn’t do the thing
  25. 25 “The Notification Pipeline” (ca. 2014) Service C Service B

    “Can you please do the thing?” “Success! This data is poorly formatted and you can drop it safely” Act II - “The Incident”
  26. 26 “The Notification Pipeline” (ca. 2014) Service C Service B

    “Can you please do the thing?” “Success! This data is poorly formatted and you can drop it safely” Act II - “The Incident” Narrator: It wasn’t poorly formatted
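
The exchange on these slides can be sketched in a few lines. This is only an illustration of the failure shape, not PagerDuty's code: `ServiceB`, `service_c_forward`, `end_to_end_check` and the `"lightly_used"` integration name are all hypothetical. The point is that the caller sees "success" either way, so only a check that looks at what came out the far end notices the drop.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServiceB:
    """Hypothetical downstream service that always answers "success"."""
    processed: List[Dict] = field(default_factory=list)

    def handle(self, event: Dict) -> str:
        # Illustrative bug: events from one integration are wrongly judged
        # "poorly formatted" and dropped, yet the caller is still told "success".
        if event.get("integration") == "lightly_used":
            return "success"
        self.processed.append(event)
        return "success"


def service_c_forward(service_b: ServiceB, event: Dict) -> bool:
    """Service C only sees the response, so it believes every hand-off worked."""
    return service_b.handle(event) == "success"


def end_to_end_check(service_b: ServiceB, event: Dict) -> bool:
    """Pass only if the event was acknowledged AND actually came out the far side."""
    acknowledged = service_c_forward(service_b, event)
    return acknowledged and event in service_b.processed
```

Each service's own tests could pass here, since both sides behave exactly as they believe they should; the failure only shows up when the two are observed together.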
  27. 27 Act II - “The Incident” “Hey gang, I’m having

    trouble receiving alerts for this integration.” “Anyone know what’s up?”
  28. 28 Act II - “The Incident”

  29. 29 Act II - “The Incident” The path for a

    “lightly used” integration was obstructed
  30. 30 Act II - “The Incident” The path for a

    “lightly used” integration was obstructed … without us knowing
  31. 31 Act II - “The Incident” The path for a

    “lightly used” integration was obstructed … without us knowing … for several days
  32. 32 Act II - “The Incident” Impact was Low

  33. 33 Act II - “The Incident” Found we had gaps

    in our testing and deployment
  34. 34 Act II - “The Incident” We stopped everything, until

    we could figure out what the hell was going on
  35. 35 Act III “The Dog”

  36. 36 Act III - “The Dog” Notifications are mission critical

    We needed a way to validate notifications are being sent
  37. 37 Something that requires a responder Paging a responder “Notification

    pipeline” “Magic” Act III - “The Dog”
  38. 38 Send in data “Notification pipeline” Act III - “The

    Dog” Validate notifications are sent
  39. 39 Act III - “The Dog” Goal: Discover and alert

    on silent failures
  40. 40 Act III - “The Dog” Goal: Test framework for PagerDuty

    • Only use PagerDuty’s APIs, nothing internal
    • Do things as customers would
    • Enable other teams at PagerDuty to write their own tests
    • Bar to write tests is set extremely low
    • Test failures page the on-call
    • Discover and alert on silent failures
  41. 41 Act III - “The Dog” What Watchdog is:

    • Scala’s “scalatest” library
    • Only input is account data
    • Cron runs tests occasionally depending on priority
    • Updates Postgres with test results
    • Alert on test failures
    • Alert if no tests have run in a while
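
The scheduling and alerting behaviour listed on this slide might look roughly like the sketch below. It is written in Python for brevity, whereas Watchdog itself was built on scalatest, and every name in it (`Check`, `Runner`, `tick`, `check_staleness`) is made up for illustration. An in-memory list stands in for the Postgres results table, and "priority" is assumed to mean "how often a test runs".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Check:
    name: str
    interval_s: float            # assumption: higher priority = shorter interval
    run: Callable[[], bool]      # returns True when the end-to-end test passes


class Runner:
    def __init__(self, checks: List[Check], max_silence_s: float):
        self.checks = checks
        self.max_silence_s = max_silence_s
        self.last_run: Dict[str, float] = {}
        self.results: List[Tuple[str, float, bool]] = []  # stand-in for a results table
        self.alerts: List[str] = []

    def tick(self, now: float) -> None:
        """Called periodically (e.g. from cron); runs whichever checks are due."""
        for check in self.checks:
            last = self.last_run.get(check.name)
            if last is not None and now - last < check.interval_s:
                continue
            try:
                passed = check.run()
            except Exception:
                passed = False   # a crashing test counts as a failure
            self.last_run[check.name] = now
            self.results.append((check.name, now, passed))
            if not passed:
                self.alerts.append(f"test failed: {check.name}")

    def check_staleness(self, now: float) -> None:
        """Alert if nothing has run recently, so a dead runner is not silent too."""
        newest = max(self.last_run.values(), default=None)
        if newest is None or now - newest > self.max_silence_s:
            self.alerts.append("no tests have run recently")
```

The staleness check matters as much as the failure alert: without it, the tool meant to catch silent failures could itself fail silently.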
  42. 42 Act III - “The Dog” What Watchdog is: Send

    in data Receive Notification Validate data
  43. 43 Act III - “The Dog” What Watchdog is: Send

    in data Receive Notification Validate data Alert PagerDuty Engineering
  44. 44 Act III - “The Dog”

  45. 45 Act III - “The Dog” Using a PagerDuty account:

    • Create a PagerDuty service
    • Send an event from a JSON file
    • Verify an incident is created
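
A test following those three steps could be shaped like this. To keep the sketch runnable offline it talks to `FakePagerDutyClient`, an in-memory stand-in invented here; its methods (`create_service`, `send_event`, `list_incidents`) are hypothetical and do not mirror the real PagerDuty API. The polling helper reflects the assumption that incident creation is asynchronous, so the test has to wait with a deadline rather than assert immediately.

```python
import json
import time
from typing import Dict, List


class FakePagerDutyClient:
    """In-memory stand-in for a customer-facing API client (hypothetical)."""

    def __init__(self) -> None:
        self._services: Dict[str, str] = {}     # integration key -> service name
        self._incidents: List[Dict] = []

    def create_service(self, name: str) -> str:
        key = f"key-{len(self._services) + 1}"
        self._services[key] = name
        return key

    def send_event(self, integration_key: str, payload: Dict) -> None:
        service = self._services[integration_key]
        self._incidents.append({"service": service, "description": payload["description"]})

    def list_incidents(self, service_name: str) -> List[Dict]:
        return [i for i in self._incidents if i["service"] == service_name]


def wait_for_incident(client, service_name: str, description: str,
                      timeout_s: float = 5.0, poll_s: float = 0.1) -> bool:
    """Poll until a matching incident appears or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        incidents = client.list_incidents(service_name)
        if any(i["description"] == description for i in incidents):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def run_check(client, event_json: str, service_name: str = "watchdog-check") -> bool:
    """Create a service, send the event, verify an incident shows up."""
    payload = json.loads(event_json)
    key = client.create_service(service_name)
    client.send_event(key, payload)
    return wait_for_incident(client, service_name, payload["description"])
```

For example, `run_check(FakePagerDutyClient(), '{"description": "watchdog probe"}')` passes against the fake; against a real account the same shape would exercise the whole pipeline the way a customer does.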
  46. 46 Act III - “The Dog”

  47. 47 Act III - “The Dog”

  48. 48 Act IV The Dark Ages

  49. 49 Act IV - The Dark Ages Watchdog was born

    Development at PagerDuty resumed at full speed
  50. 50 Act IV - The Dark Ages Watchdog was born

    Development at PagerDuty resumed at full speed Exponentially increased
  51. 51 Act IV - The Dark Ages Our deploys would

    cause brief interruptions to customers There were race conditions in our software
  52. 52 Act IV - The Dark Ages Watchdog became integral

    to our development Engineers would run Watchdog in pre-prod environments before deploying
  53. 53 Act IV - The Dark Ages Watchdog itself had

    reliability problems
  54. 54 Act IV - The Dark Ages Watchdog itself was

    too slow
  55. 55 Act IV - The Dark Ages

  56. 56 Act IV - The Dark Ages Watchdog would create

    resources in PagerDuty via the API, if required for a test
  57. 57 Act IV - The Dark Ages Watchdog would create

    resources in PagerDuty via the API, if required for a test … every time it runs
  58. 58 Act IV - The Dark Ages 89%

  59. 59 Act IV - The Dark Ages 89% Often referred

    to as “Watchpuppy”
  60. 60 Act IV - The Dark Ages 89% Because it

    doesn’t clean up after itself Often referred to as “Watchpuppy”
  61. 61 Act IV - The Dark Ages

  62. 62 Act IV - The Dark Ages Scala is hard

  63. 63 Act IV - The Dark Ages

  64. 64 Act IV - The Dark Ages Most engineering teams

    at PagerDuty had a stake in Watchdog Each team would own their respective tests
  65. 65 Act IV - The Dark Ages Most engineering teams

    at PagerDuty had a stake in Watchdog Each team would own their respective tests Who manages and maintains Watchdog as a service?
  66. 66 Act IV - The Dark Ages Test framework for PagerDuty

    • Only use PagerDuty’s APIs, nothing internal
    • Do things as customers would
    • Enable other teams at PagerDuty to write their own tests
    • Bar to write tests is set extremely low
    • Test failures page the on-call
  67. 67 Act IV - The Dark Ages Test framework for PagerDuty

    • Only use PagerDuty’s APIs, nothing internal
    • Do things as customers would
    • Enable other teams at PagerDuty to write their own tests
    • Bar to write tests is set extremely low
    • Test failures page the on-call
    • Many false positives led to alert fatigue
    • Did not alert adequately on major incidents
  68. 68 Act IV - The Dark Ages Watchdog is extremely

    useful at PagerDuty However, as a monitoring tool, it did not scale with our business
  69. 69 Act V Enlightenment

  70. 70 Act V - Enlightenment End-to-end testing makes sense in

    development
  71. 71 Act V - Enlightenment End-to-end testing makes sense in

    development Maintaining it as an alerting tool in production does not
  72. 72 Act V - Enlightenment We can derive system health

    from our existing metrics
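
One way to read "derive system health from existing metrics" is to compare what goes into the pipeline with what comes out, per integration, over some window. The sketch below is an assumption about what such a check could look like, not a description of PagerDuty's monitoring: the function name, the counters and the thresholds are invented, and it assumes every received event should yield roughly one notification, which will not hold for every product.

```python
from typing import Dict, List


def silent_failure_suspects(received: Dict[str, int],
                            sent: Dict[str, int],
                            min_ratio: float = 0.99,
                            min_volume: int = 1) -> List[str]:
    """Return integrations whose outgoing volume lags their incoming volume.

    received / sent map an integration name to event counts in the same window.
    """
    suspects = []
    for integration, n_in in received.items():
        if n_in == 0 or n_in < min_volume:
            continue  # nothing came in, so nothing can be inferred
        n_out = sent.get(integration, 0)
        if n_out / n_in < min_ratio:
            suspects.append(integration)
    return sorted(suspects)
```

With `received = {"email": 1000, "lightly_used": 4}` and `sent = {"email": 998}`, only `lightly_used` is flagged: a low-traffic path that receives events but emits nothing is exactly the kind of obstruction described in Act II.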
  73. 73 Act V - Enlightenment If you think you need

    end-to-end monitoring in production First consider how you’ve laid out your services
  74. 74 Act V - Enlightenment Are all of your services

    functionally independent and opinionated?
  75. 75 Act V - Enlightenment Are all of your services

    tested fully?
  76. 76 Act V - Enlightenment How do you deploy? Can

    you catch failures before your users?
  77. 77 Act V - Enlightenment Do you have a QA

    process?
  78. 78 Act V - Enlightenment

  79. 79 Act V - Enlightenment What happens to our beloved

    Watchdog?
  80. 80 Act V - Enlightenment

    • Only the critical tests that matter
    • Re-use resources to prevent bloat
    • Containerized
    • Runs in Nomad
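
"Re-use resources to prevent bloat" amounts to a get-or-create step in front of every test, in place of creating fresh resources on each run. A minimal sketch, assuming a lookup by name is available; `FakeAccount` and its methods are hypothetical stand-ins for whatever the account API offers:

```python
from typing import Dict, Optional


class FakeAccount:
    """In-memory stand-in for the test account (hypothetical)."""

    def __init__(self) -> None:
        self.services: Dict[str, str] = {}   # service name -> service id
        self.create_calls = 0

    def find_service(self, name: str) -> Optional[str]:
        return self.services.get(name)

    def create_service(self, name: str) -> str:
        self.create_calls += 1
        service_id = f"svc-{self.create_calls}"
        self.services[name] = service_id
        return service_id


def get_or_create_service(account: FakeAccount, name: str) -> str:
    """Reuse the service from an earlier run if it exists; create it only once."""
    existing = account.find_service(name)
    if existing is not None:
        return existing
    return account.create_service(name)
```

Calling `get_or_create_service` on every run leaves `create_calls` at 1 no matter how often the tests execute, which is the opposite of the "Watchpuppy" behaviour from Act IV.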
  81. 81 Act V - Enlightenment

    • Only the critical tests that matter
    • Re-use resources to prevent bloat
    • Containerized
    • Runs in Nomad
    • Going to be removed entirely
  82. 82 Act V - Enlightenment In 2014 we weren’t capable

    of deriving system health
  83. 83 Act V - Enlightenment In 2014 we weren’t capable

    of deriving system health In 2017 we are
  84. 84 Act V - Enlightenment It’s ok to kill your

    dog
  85. 85 Act V - Enlightenment It’s ok to kill your

    dog You can derive system health without writing tons of code
  86. 86 Silent, but Deadly A Comedy in Five Acts @peterjskennedy