Slide 1

Slide 1 text

1 Silent, but Deadly A Comedy in Five Acts

Slide 2

Slide 2 text

2 • Software Engineer at PagerDuty • Single digit years of experience • Really good at breaking Production @peterjskennedy

Slide 3

Slide 3 text

3 Prologue

Slide 4

Slide 4 text

4

Slide 5

Slide 5 text

5 Act I Coming of Age

Slide 6

Slide 6 text

6 Act I - Coming of Age “We wake you up when shit breaks” PagerDuty Inc.

Slide 7

Slide 7 text

7 Act I - Coming of Age Customer’s Infrastructure Customer’s Incident Response

Slide 8

Slide 8 text

8 Act I - Coming of Age Customer’s Infrastructure Customer’s Incident Response “Notification pipeline”

Slide 9

Slide 9 text

9 Act I - Coming of Age Something broke Page engineer to fix “Notification pipeline” “Magic”

Slide 10

Slide 10 text

10 Act I - Coming of Age “The Notification Pipeline” (ca. 2010-2012)

Slide 11

Slide 11 text

11 Act I - Coming of Age “The Notification Pipeline” (ca. 2013ish) New Service A

Slide 12

Slide 12 text

12 Act I - Coming of Age “The Notification Pipeline” (ca. 2013-2014) New Service A New Service B New Service C New Service D New Service E

Slide 13

Slide 13 text

13 Act I - Coming of Age “The Notification Pipeline” (ca. 2013-2014) New Service • Fully covered unit tests • Loads of integration tests • Stable

Slide 14

Slide 14 text

14 Act I - Coming of Age “The Notification Pipeline” (ca. 2013-2014) New Service A New Service B New Service C New Service D New Service E TEAM B TEAM C TEAM D TEAM A TEAM E

Slide 15

Slide 15 text

15 Act I - Coming of Age “The Notification Pipeline” (ca. 2013-2014) New Service A New Service B New Service C New Service D New Service E TEAM B TEAM C TEAM D TEAM A TEAM E

Slide 16

Slide 16 text

16 Act I - Coming of Age Q: Who pages PagerDuty?

Slide 17

Slide 17 text

17 Act I - Coming of Age Q: Who pages PagerDuty? A: PagerDuty … and friends

Slide 18

Slide 18 text

18 Act I - Coming of Age “The Notification Pipeline” (ca. 2013-2014) New Service A New Service B New Service C New Service D … PagerDuty + other alerting tools You ok? You ok? You ok? You ok?

Slide 19

Slide 19 text

19 Act I - Coming of Age • “Who watches the Watchmen” • Arup Chakrabarti

Slide 20

Slide 20 text

20 Act I - Coming of Age

Slide 21

Slide 21 text

21 Act II "The Incident"

Slide 22

Slide 22 text

22 “The Notification Pipeline” (ca. 2014) New Service A New Service B New Service C New Service D New Service E TEAM B TEAM C TEAM D TEAM A TEAM E Act II - “The Incident”

Slide 23

Slide 23 text

23 “The Notification Pipeline” (ca. 2014) Service C Service B “Can you please do the thing?” “Success! I did the thing” Act II - “The Incident”

Slide 24

Slide 24 text

24 “The Notification Pipeline” (ca. 2014) Service C Service B “Can you please do the thing?” “Success! I did the thing” Act II - “The Incident” Narrator: It didn’t do the thing

Slide 25

Slide 25 text

25 “The Notification Pipeline” (ca. 2014) Service C Service B “Can you please do the thing?” “Success! This data is poorly formatted and you can drop it safely” Act II - “The Incident”

Slide 26

Slide 26 text

26 “The Notification Pipeline” (ca. 2014) Service C Service B “Can you please do the thing?” “Success! This data is poorly formatted and you can drop it safely” Act II - “The Incident” Narrator: It wasn’t poorly formatted
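
The exchange on these slides can be sketched in a few lines. This is only an illustration of the failure mode, not PagerDuty's code: `ServiceB`, `Reply`, `forward`, and the `integration` field are hypothetical names. The downstream service wrongly judges a valid payload to be malformed, drops it, and still answers "success", so the caller sees nothing wrong.

```python
from dataclasses import dataclass, field


@dataclass
class Reply:
    ok: bool
    dropped: bool = False  # True when the payload was discarded


@dataclass
class ServiceB:
    """Hypothetical stand-in for the downstream service."""
    processed: list = field(default_factory=list)

    def handle(self, payload: dict) -> Reply:
        # The bug being illustrated: a valid payload from a rarely used
        # integration is treated as malformed, yet the reply is still "ok".
        if payload.get("integration") == "lightly_used":
            return Reply(ok=True, dropped=True)
        self.processed.append(payload)
        return Reply(ok=True)


def forward(service: ServiceB, payload: dict) -> bool:
    """Service C's view: it only looks at the success flag."""
    return service.handle(payload).ok


b = ServiceB()
sent = [{"integration": "popular"}, {"integration": "lightly_used"}]
results = [forward(b, p) for p in sent]

print(results)           # every call looks successful to the caller
print(len(b.processed))  # but fewer payloads were processed than sent
```

Each service can pass its own tests here: the caller handles a success reply correctly, and the callee returns the reply its code intends. The failure only shows up when what went in is compared with what came out.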

Slide 27

Slide 27 text

27 Act II - “The Incident” “Hey gang, I’m having trouble receiving alerts for this integration.” “Anyone know what’s up?”

Slide 28

Slide 28 text

28 Act II - “The Incident”

Slide 29

Slide 29 text

29 Act II - “The Incident” The path for a “lightly used” integration was obstructed

Slide 30

Slide 30 text

30 Act II - “The Incident” The path for a “lightly used” integration was obstructed … without us knowing

Slide 31

Slide 31 text

31 Act II - “The Incident” The path for a “lightly used” integration was obstructed … without us knowing … for several days

Slide 32

Slide 32 text

32 Act II - “The Incident” Impact was Low

Slide 33

Slide 33 text

33 Act II - “The Incident” Found we had gaps in our testing and deployment

Slide 34

Slide 34 text

34 Act II - “The Incident” We stopped everything until we could figure out what the hell was going on

Slide 35

Slide 35 text

35 Act III “The Dog”

Slide 36

Slide 36 text

36 Act III - “The Dog” Notifications are mission critical We needed a way to validate notifications are being sent

Slide 37

Slide 37 text

37 Something that requires a responder Paging a responder “Notification pipeline” “Magic” Act III - “The Dog”

Slide 38

Slide 38 text

38 Send in data “Notification pipeline” Act III - “The Dog” Validate notifications are sent

Slide 39

Slide 39 text

39 Act III - “The Dog” Goal: Discover and alert on silent failures

Slide 40

Slide 40 text

40 Act III - “The Dog” Goal: Test framework for PagerDuty Only use PagerDuty’s APIs, nothing internal Do things as customers would Enable other teams at PagerDuty to write their own tests Bar to write tests is set extremely low Test failures page the on-call Discover and alert on silent failures

Slide 41

Slide 41 text

41 Act III - “The Dog” What Watchdog is: • Scala’s “scalatest” library • Only input is account data • Cron runs tests occasionally depending on priority • Updates Postgres with test results • Alert on test failures • Alert if no tests have run in a while
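
The last two bullets describe two different alerts: one when a test fails, and one when the tests stop running at all. A minimal sketch of that decision follows, in Python rather than Watchdog's Scala. The function name `check_status`, its parameters, and the 30-minute threshold are assumptions for illustration; in Watchdog the last-run data would come from the Postgres results table.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def check_status(
    last_run: Optional[datetime],
    last_passed: bool,
    now: datetime,
    max_age: timedelta,
) -> Optional[str]:
    """Return an alert reason, or None when nothing needs attention."""
    # Staleness is checked first: an old "pass" says nothing about today.
    if last_run is None or now - last_run > max_age:
        return "stale"
    if not last_passed:
        return "failed"
    return None


now = datetime(2017, 1, 1, 12, 0, tzinfo=timezone.utc)
max_age = timedelta(minutes=30)  # assumed threshold

status_fresh = check_status(now - timedelta(minutes=5), True, now, max_age)
status_failed = check_status(now - timedelta(minutes=5), False, now, max_age)
status_stale = check_status(now - timedelta(hours=2), True, now, max_age)
status_never = check_status(None, True, now, max_age)
```

The staleness branch is what keeps the checker itself from failing silently: if cron stops firing, "no failures recorded" would otherwise look the same as "everything is fine".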

Slide 42

Slide 42 text

42 Act III - “The Dog” What Watchdog is: Send in data Receive Notification Validate data

Slide 43

Slide 43 text

43 Act III - “The Dog” What Watchdog is: Send in data Receive Notification Validate data Alert PagerDuty Engineering

Slide 44

Slide 44 text

44 Act III - “The Dog”

Slide 45

Slide 45 text

45 Act III - “The Dog” Using a PagerDuty account Create a PagerDuty service Send an event from a JSON file Verify an incident is created
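
The four steps on this slide can be sketched as a single check. `FakeAccount` below is an in-memory stand-in invented for this example so the code runs on its own. It is not PagerDuty's API, and its method names are assumptions. The event is parsed from an inline JSON string instead of a file for the same reason.

```python
import json


class FakeAccount:
    """In-memory stand-in for an account; not the real PagerDuty API."""

    def __init__(self):
        self.services = {}   # integration key -> service name
        self.incidents = []

    def create_service(self, name):
        key = f"key-{len(self.services) + 1}"
        self.services[key] = name
        return key

    def send_event(self, service_key, event):
        if service_key not in self.services:
            raise KeyError(service_key)
        self.incidents.append(
            {"service": self.services[service_key], "summary": event["summary"]}
        )

    def list_incidents(self, service_name):
        return [i for i in self.incidents if i["service"] == service_name]


EVENT_JSON = '{"summary": "watchdog test event"}'


def event_creates_incident(account):
    # 1. create a service, 2. send an event, 3. verify an incident exists
    key = account.create_service("watchdog-e2e")
    account.send_event(key, json.loads(EVENT_JSON))
    incidents = account.list_incidents("watchdog-e2e")
    return len(incidents) == 1 and incidents[0]["summary"] == "watchdog test event"


passed = event_creates_incident(FakeAccount())
print(passed)
```

The check only touches what a customer could touch (create, send, list), which matches the goal of using public APIs and nothing internal.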

Slide 46

Slide 46 text

46 Act III - “The Dog”

Slide 47

Slide 47 text

47 Act III - “The Dog”

Slide 48

Slide 48 text

48 Act IV The Dark Ages

Slide 49

Slide 49 text

49 Act IV - The Dark Ages Watchdog was born Development at PagerDuty resumed at full speed

Slide 50

Slide 50 text

50 Act IV - The Dark Ages Watchdog was born Development at PagerDuty resumed at full speed Exponentially increased

Slide 51

Slide 51 text

51 Act IV - The Dark Ages Our deploys would cause brief interruptions to customers There were race conditions in our software

Slide 52

Slide 52 text

52 Act IV - The Dark Ages Watchdog became integral to our development Engineers would run Watchdog in pre-prod environments before deploying

Slide 53

Slide 53 text

53 Act IV - The Dark Ages Watchdog itself had reliability problems

Slide 54

Slide 54 text

54 Act IV - The Dark Ages Watchdog itself was too slow

Slide 55

Slide 55 text

55 Act IV - The Dark Ages

Slide 56

Slide 56 text

56 Act IV - The Dark Ages Watchdog would create resources in PagerDuty via the API, if required for a test

Slide 57

Slide 57 text

57 Act IV - The Dark Ages Watchdog would create resources in PagerDuty via the API, if required for a test … every time it runs

Slide 58

Slide 58 text

58 Act IV - The Dark Ages 89%

Slide 59

Slide 59 text

59 Act IV - The Dark Ages 89% Often referred to as “Watchpuppy”

Slide 60

Slide 60 text

60 Act IV - The Dark Ages 89% Because it doesn’t clean up after itself Often referred to as “Watchpuppy”
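
One common way to stop a test runner from leaving resources behind is to pair every create with a delete that runs even when the test fails. The sketch below shows that pattern under assumptions: `FakeApi` and `temporary_service` are hypothetical names, and the talk does not say how (or whether) Watchdog's cleanup was implemented.

```python
from contextlib import contextmanager


class FakeApi:
    """Hypothetical in-memory API used only for this illustration."""

    def __init__(self):
        self.services = set()

    def create_service(self, name):
        self.services.add(name)

    def delete_service(self, name):
        self.services.discard(name)


@contextmanager
def temporary_service(api, name):
    api.create_service(name)
    try:
        yield name
    finally:
        # Runs on success and on failure, so nothing is left behind.
        api.delete_service(name)


api = FakeApi()
try:
    with temporary_service(api, "watchdog-tmp") as name:
        assert name in api.services
        raise RuntimeError("simulated test failure")
except RuntimeError:
    pass

leftover = len(api.services)
print(leftover)
```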

Slide 61

Slide 61 text

61 Act IV - The Dark Ages

Slide 62

Slide 62 text

62 Act IV - The Dark Ages Scala is hard

Slide 63

Slide 63 text

63 Act IV - The Dark Ages

Slide 64

Slide 64 text

64 Act IV - The Dark Ages Most engineering teams at PagerDuty had a stake in Watchdog Each team would own their respective tests

Slide 65

Slide 65 text

65 Act IV - The Dark Ages Most engineering teams at PagerDuty had a stake in Watchdog Each team would own their respective tests Who manages and maintains Watchdog as a service?

Slide 66

Slide 66 text

66 Test framework for PagerDuty Only use PagerDuty’s APIs, nothing internal Do things as customers would Enable other teams at PagerDuty to write their own tests Bar to write tests is set extremely low Test failures page the on-call Act IV - The Dark Ages

Slide 67

Slide 67 text

67 Test framework for PagerDuty Only use PagerDuty’s APIs, nothing internal Do things as customers would Enable other teams at PagerDuty to write their own tests Bar to write tests is set extremely low Test failures page the on-call Many false positives lead to alert fatigue Did not alert adequately on major incidents Act IV - The Dark Ages

Slide 68

Slide 68 text

68 Act IV - The Dark Ages Watchdog is extremely useful at PagerDuty However, as a monitoring tool, it did not scale with our business

Slide 69

Slide 69 text

69 Act V Enlightenment

Slide 70

Slide 70 text

70 Act V - Enlightenment End-to-end testing makes sense in development

Slide 71

Slide 71 text

71 Act V - Enlightenment End-to-end testing makes sense in development Maintaining it as an alerting tool in production does not

Slide 72

Slide 72 text

72 Act V - Enlightenment We can derive system health from our existing metrics
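
The slides do not say which metrics are used, so the following is only one possible reading: compare how many events entered the pipeline with how many notifications left it over the same window. The function names, the counters, and the 0.99 threshold are all assumptions.

```python
def delivery_ratio(events_in: int, notifications_out: int) -> float:
    """Fraction of inbound events that produced a notification."""
    if events_in == 0:
        # No traffic in the window; treated as healthy in this sketch.
        return 1.0
    return notifications_out / events_in


def is_healthy(events_in: int, notifications_out: int, min_ratio: float = 0.99) -> bool:
    return delivery_ratio(events_in, notifications_out) >= min_ratio


normal = is_healthy(1000, 998)
degraded = is_healthy(1000, 700)
quiet = is_healthy(0, 0)
print(normal, degraded, quiet)
```

A real pipeline would need more care than this, since one event does not always map to exactly one notification and a window with no traffic may itself be a symptom. The point is only that the signal comes from counters the services already emit, not from synthetic tests.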

Slide 73

Slide 73 text

73 Act V - Enlightenment If you think you need end-to-end monitoring in production First consider how you’ve laid out your services

Slide 74

Slide 74 text

74 Act V - Enlightenment Are all of your services functionally independent and opinionated?

Slide 75

Slide 75 text

75 Act V - Enlightenment Are all of your services tested fully?

Slide 76

Slide 76 text

76 Act V - Enlightenment How do you deploy? Can you catch failures before your users?

Slide 77

Slide 77 text

77 Act V - Enlightenment Do you have a QA process?

Slide 78

Slide 78 text

78 Act V - Enlightenment

Slide 79

Slide 79 text

79 Act V - Enlightenment What happens to our beloved watchdog?

Slide 80

Slide 80 text

80 Act V - Enlightenment Only the critical tests that matter Re-use resources to prevent bloat Containerized Runs in Nomad
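
Re-using resources can be as simple as looking a resource up before creating it. The sketch below is a get-or-create helper over a hypothetical in-memory `FakeApi`; the names are assumptions and do not reflect Watchdog's actual code.

```python
class FakeApi:
    """Hypothetical in-memory API that counts how often it creates things."""

    def __init__(self):
        self.services = {}
        self.create_calls = 0

    def find_service(self, name):
        return self.services.get(name)

    def create_service(self, name):
        self.create_calls += 1
        self.services[name] = {"name": name}
        return self.services[name]


def get_or_create_service(api, name):
    # Reuse the existing resource when there is one; create only once.
    existing = api.find_service(name)
    return existing if existing is not None else api.create_service(name)


api = FakeApi()
for _ in range(3):  # three simulated test runs
    get_or_create_service(api, "watchdog-shared")

print(api.create_calls, len(api.services))
```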

Slide 81

Slide 81 text

81 Act V - Enlightenment Only the critical tests that matter Re-use resources to prevent bloat Containerized Runs in Nomad Going to be removed entirely

Slide 82

Slide 82 text

82 Act V - Enlightenment In 2014 we weren’t capable of deriving system health

Slide 83

Slide 83 text

83 Act V - Enlightenment In 2014 we weren’t capable of deriving system health In 2017 we are

Slide 84

Slide 84 text

84 Act V - Enlightenment It’s ok to kill your dog

Slide 85

Slide 85 text

85 Act V - Enlightenment It’s ok to kill your dog You can derive system health without writing tons of code

Slide 86

Slide 86 text

86 Silent, but Deadly A Comedy in Five Acts @peterjskennedy