Slide 1

Slide 1 text

Think Big: How to Chaos Test a Monolith
September 26, 2019
Caroline Dickey, Site Reliability Engineer
@CarolineEDickey

Slide 2

Slide 2 text

Agenda
- What’s a monolith?
- “We aren’t ready for chaos engineering”
- Three approaches to testing a monolith
- Three use cases for chaos engineering

Slide 3

Slide 3 text

[Diagram: Monolithic Architecture (UI, Business Logic, Data Access Layer) compared with Microservice Architecture (UI, Microservices)]

Slide 13

Slide 13 text

Some people don’t like monoliths.

Slide 14

Slide 14 text

22,977,131

Slide 20

Slide 20 text

- Our customers depend on us
- They don’t care about our tech
- They’ve trusted us - let’s not let them down
- Chaos engineering can help

Slide 21

Slide 21 text

“I don’t have a monolith. Why should I care?” - You, maybe

Slide 22

Slide 22 text

Agenda
- What’s a monolith?
- “We aren’t ready for chaos engineering”
- Three approaches to testing a monolith
- Three use cases for chaos engineering

Slide 23

Slide 23 text

“We aren’t ready for chaos engineering”

Slide 25

Slide 25 text

There will always be things that we can’t control in software engineering.
[Diagram: Can’t control / Can control]

Slide 26

Slide 26 text

We can’t control bad pushes

Slide 27

Slide 27 text

We can’t control database maintenance events

Slide 28

Slide 28 text

We can’t control backhoes

Slide 29

Slide 29 text

But there are also plenty of potential problem areas that we can do something about.
[Diagram: Can’t control / Can control]

Slide 30

Slide 30 text

We can control built-in redundancy

Slide 31

Slide 31 text

We can control error handling for failed dependencies

Slide 32

Slide 32 text

We can control our monitoring and alerting

Slide 33

Slide 33 text

Chaos engineering is all about validating our assumptions about these things.
[Diagram: Can’t control / Can control]

Slide 34

Slide 34 text

But it is also about making this bigger: the things we know about.
[Diagram: Can’t control / Can control / Things we know about]

Slide 35

Slide 35 text

Agenda
- What’s a monolith?
- “We aren’t ready for chaos engineering”
- Three approaches to testing a monolith
- Three use cases for chaos engineering

Slide 36

Slide 36 text

Three approaches to testing a monolith
- Use an architecture diagram
- Validate changes
- Test your dependencies

Slide 37

Slide 37 text

[Architecture diagram: User, CDN, Primary LB, Secondary LB (warm), App Servers, Primary, Secondary]

Slide 38

Slide 38 text

Scenario: Load balancer failover

Slide 39

Slide 39 text

[Architecture diagram: User, CDN, Primary LB, Secondary LB (warm), App Servers, Primary, Secondary; one component marked ✕]

Slide 45

Slide 45 text

[Architecture diagram: User, CDN, Primary LB, Secondary LB (warm), App Servers, Primary, Secondary; one component marked ✕]

Slide 46

Slide 46 text

[Architecture diagram: User, CDN, Primary LB, Secondary LB (warm), App Servers, Primary, Secondary]

Slide 47

Slide 47 text

Scenario: Make the database read-only

Slide 48

Slide 48 text

[Architecture diagram: User, CDN, Primary LB, Secondary LB (warm), App Servers, Primary, Secondary]

Slide 50

Slide 50 text

Outcome: Unexpected SQL error due to missing error handling in a legacy class
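
A minimal sketch of the kind of error handling this GameDay found missing. It assumes Python's built-in sqlite3 module as a stand-in for the production database (the talk does not name the database or the legacy class) and uses SQLite's query_only pragma to simulate the read-only state. The save_subscriber function and the subscribers table are hypothetical names.

```python
import sqlite3


def save_subscriber(conn, email):
    """Attempt a write; return False instead of leaking a raw SQL error."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO subscribers (email) VALUES (?)", (email,))
        return True
    except sqlite3.OperationalError:
        # Raised by sqlite3 when the database rejects the write as read-only.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscribers (email TEXT)")

ok_before = save_subscriber(conn, "a@example.com")  # normal operation

conn.execute("PRAGMA query_only = ON")   # inject the fault: reads only
ok_during = save_subscriber(conn, "b@example.com")
conn.execute("PRAGMA query_only = OFF")  # end of the experiment

row_count = conn.execute("SELECT COUNT(*) FROM subscribers").fetchone()[0]
```

During the experiment the write is refused and reported as a failed save rather than surfacing as an unhandled SQL error, and the data written beforehand is still there afterwards.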

Slide 52

Slide 52 text

Pro-tip: market chaos engineering internally with an email newsletter!

Slide 53

Slide 53 text

Three approaches to testing a monolith
- Use an architecture diagram
- Validate changes
- Test your dependencies

Slide 54

Slide 54 text

Scenario: New filesystem on application servers

Slide 55

Slide 55 text

Ceph: A versatile, distributed storage system that can be used for many different types of storage services.
Ceph-fuse: A FUSE (Filesystem in USErspace) client for the Ceph distributed file system. It mounts a Ceph file system.

Slide 59

Slide 59 text

Outcome: NO alerting, and a homegrown Python script shows up, attempts to remount, and fails.
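
A minimal sketch of a mount health check that raises an alert instead of failing silently. The check_mount function and the alert callback are hypothetical; the talk does not describe how the alerting gap was closed, or what the homegrown script did beyond attempting a remount.

```python
import os
import tempfile


def check_mount(path, alert):
    """Return True if path is a mount point; otherwise alert and return False."""
    if os.path.ismount(path):
        return True
    alert(f"ALERT: {path} is not mounted")
    return False


alerts = []
with tempfile.TemporaryDirectory() as plain_dir:
    # A freshly created temporary directory is not a mount point,
    # so it stands in for a filesystem that has dropped its mount.
    mounted = check_mount(plain_dir, alerts.append)
```

In a real setup the alert callback would page or post to a monitoring system; here it only appends to a list so the behavior is easy to inspect.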

Slide 61

Slide 61 text

Scenario: New caching library

Slide 62

Slide 62 text

Memcached: Memcached is a distributed memory caching system. It speeds up websites backed by large dynamic databases by storing database objects in memory.
Source: https://www.cloudways.com/blog/memcached-with-php/

Slide 63

Slide 63 text

We have Memcached caches on all of our application servers, which interact with each other to access cached data.

Slide 64

Slide 64 text

We made a fix to a new Memcached client library to prevent timeouts if a server is unreachable.
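
A minimal sketch of the behavior such a fix aims for: an unreachable cache server is treated as a cache miss rather than a failed request. This is a generic pattern, not the actual change made to the client library, which the talk does not show. UnreachableCache, cached_lookup, and the key format are hypothetical.

```python
class UnreachableCache:
    """Hypothetical stand-in for a cache client whose server is down."""

    def get(self, key):
        raise ConnectionError("cache server unreachable")


def cached_lookup(cache, key, load_from_source):
    """Read through the cache, treating an unreachable server as a miss."""
    try:
        value = cache.get(key)
    except (ConnectionError, TimeoutError):
        value = None  # degrade to a cache miss instead of failing the request
    if value is None:
        value = load_from_source(key)
    return value


result = cached_lookup(
    UnreachableCache(), "user:42", lambda key: f"db-value-for-{key}"
)
```

A GameDay can then validate the assumption directly: take one cache server away and confirm that requests still succeed, only more slowly.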

Slide 68

Slide 68 text

Three approaches to testing a monolith
- Use an architecture diagram
- Validate changes
- Test your dependencies

Slide 69

Slide 69 text

Scenario: Internal dependencies*
*Sometimes known as services/microservices

Slide 71

Slide 71 text

Requestmapper is a service that maps URLs from one format (Pretty Campaign URLs or custom landing pages) to their internal format.

Slide 73

Slide 73 text

[Graphic: an equation-style illustration involving RM (Requestmapper); images not preserved]

Slide 78

Slide 78 text

Pro-tip: market chaos engineering internally with an internal blog post!

Slide 79

Slide 79 text

Scenario: 3rd party API calls
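
A minimal sketch of how a third-party call could be chaos tested outside production: wrap the call so a chosen fraction of invocations fail, then check that the caller falls back gracefully. All names here (inject_faults, fetch_with_fallback, fake_api) are hypothetical, and the talk does not say which tooling or APIs were used.

```python
import random


def inject_faults(call, failure_rate, rng=random.random):
    """Wrap call so that a fraction of invocations raise TimeoutError."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise TimeoutError("injected fault")
        return call(*args, **kwargs)
    return wrapped


def fetch_with_fallback(fetch, fallback):
    """Call a third-party API, returning fallback if it times out."""
    try:
        return fetch()
    except TimeoutError:
        return fallback


def fake_api():
    # Hypothetical third-party call; a real one would go over the network.
    return {"status": "ok"}


healthy = fetch_with_fallback(inject_faults(fake_api, 0.0), {"status": "degraded"})
broken = fetch_with_fallback(inject_faults(fake_api, 1.0), {"status": "degraded"})
```

With a failure rate of 0.0 the wrapper is transparent, and with 1.0 every call fails, which makes the two ends of the experiment deterministic.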

Slide 84

Slide 84 text

Three (now four) approaches to testing a monolith
- Use an architecture diagram
- Validate changes
- Test your dependencies
- Carefully

Slide 85

Slide 85 text

Should you chaos test in production?

Slide 86

Slide 86 text

Our approach
- Default to testing in stage/dev
- Over-communicate about GameDays
- If anyone feels uncomfortable, we don’t proceed
- Don’t automate tests on anything that isn’t inherently resilient (like the monolith)

Slide 89

Slide 89 text

Our approach
- Default to testing in stage/dev
- Over-communicate about GameDays
- If anyone feels uncomfortable, we don’t proceed
- Build confidence by testing in production incrementally

Slide 90

Slide 90 text

Agenda
- What’s a monolith?
- “We aren’t ready for chaos engineering”
- Three approaches to testing a monolith
- Three use cases for chaos engineering

Slide 91

Slide 91 text

Three use cases for chaos engineering
- Training
- Post-mortem counterpart
- Application performance

Slide 92

Slide 92 text

Scenario: Team training

Slide 97

Slide 97 text

Three use cases for chaos engineering
- Training
- Post-mortem counterpart
- Application performance

Slide 102

Slide 102 text

- Incidents happen
- Blameless post-mortems ✔
- Explore human factors
- Root Cause Analysis can be limiting
- Use GameDays to fill in the gaps

Slide 103

Slide 103 text

Scenario: Recreate incidents

Slide 108

Slide 108 text

Three use cases for chaos engineering
- Training
- Post-mortem counterpart
- Application performance

Slide 109

Slide 109 text

[Diagram labels: Chaos, Performance, Training, Post-Mortem, Load Testing, Disaster Recovery, Incident Simulation, Support Tickets, Cross-Team Eng., Performance Research]

Slide 110

Slide 110 text

Application Performance GameDay: A time-boxed opportunity to dive deeply into a specific topic, system, or set of tickets.
Pulling from:
- Engineering reports
- Support tickets

Slide 114

Slide 114 text

Conclusions
- Chaos engineering can help make any application more resilient, regardless of architecture.
- If you don’t know where to start, try validating isolated parts of your infrastructure, application, or data storage.
- The best time to find a vulnerability is before an incident. The second best time is after an incident so that it never happens again.
- Chaos engineering is an effective tool for sharing knowledge and building empathy.

Slide 115

Slide 115 text

Conclusions
- Chaos engineering can help make any application more resilient, regardless of architecture.
- If you don’t know where to start, try looking at an architecture diagram or identifying some changes about to be released.
- The best time to find a vulnerability is before an incident. The second best time is after an incident so that it never happens again.
- Chaos engineering is an effective tool for sharing knowledge and building empathy.

Slide 118

Slide 118 text

Thank you!
Caroline Dickey
caroline@mailchimp.com
@CarolineEDickey