Slide 1

Slide 1 text

How Did I Get Here? Building Confidence in a Distributed Stream Processor [email protected] @SeanTAllen

Slide 2

Slide 2 text

Sean T. Allen

Slide 3

Slide 3 text

T

Slide 4

Slide 4 text

T

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Experience Report

Slide 10

Slide 10 text

Stateful Stream Processor

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Sendence Wallaroo

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Prototype Started January 2016

Slide 16

Slide 16 text

Prototype Started January 2016

Slide 17

Slide 17 text

Production Version 1 December 2016

Slide 18

Slide 18 text

Production Version 1 December 2016

Slide 19

Slide 19 text

America is all about speed. Hot, nasty, bad-ass speed. — Eleanor Roosevelt

Slide 20

Slide 20 text

Goals: High Throughput

Slide 21

Slide 21 text

Goals: Low Latency

Slide 22

Slide 22 text

Goals: Less Hardware

Slide 23

Slide 23 text

2014-class MacBook Pro: 220k events a second, 99.99% processed in less than 500 µs

Slide 24

Slide 24 text

America is all about data quality. Quiet, demure data quality. — Andrew Jackson

Slide 25

Slide 25 text

Goals: High Fidelity

Slide 26

Slide 26 text

Stream Processing

Slide 27

Slide 27 text

Message at a time

Slide 28

Slide 28 text

Never ending

Slide 29

Slide 29 text

Failure

Slide 30

Slide 30 text

Machine Failure

Slide 31

Slide 31 text

Slow Machine

Slide 32

Slide 32 text

Segfaulting Process

Slide 33

Slide 33 text

GC Pause

Slide 34

Slide 34 text

Network Error

Slide 35

Slide 35 text

Failure Happens

Slide 36

Slide 36 text

Delivery Guarantees

Slide 37

Slide 37 text

At-Most-Once

Slide 38

Slide 38 text

At-Most-Once: Best Effort

Slide 39

Slide 39 text

At-Least-Once

Slide 40

Slide 40 text

At-Least-Once: ACK or resend

Slide 41

Slide 41 text

Exactly-Once

Slide 42

Slide 42 text

Exactly-Once: At-Least-Once + Idempotence
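
A minimal Python sketch of the idea on this slide, not Wallaroo's implementation: a sender resends until it sees an ACK (at-least-once), and a receiver ignores message ids it has already applied (idempotence). The names `IdempotentReceiver`, `send_at_least_once`, and the `ack_loss` parameter are hypothetical, and lost ACKs are simulated with a seeded random generator.

```python
import random


class IdempotentReceiver:
    """Applies each message id at most once, however often it is delivered."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def deliver(self, msg_id, value):
        if msg_id not in self.seen:  # duplicate deliveries change nothing
            self.seen.add(msg_id)
            self.total += value
        return True  # ACK, including for duplicates


def send_at_least_once(receiver, messages, rng, ack_loss=0.3):
    """Resend every message until an ACK arrives; return the delivery count."""
    deliveries = 0
    for msg_id, value in messages:
        while True:
            acked = receiver.deliver(msg_id, value)
            deliveries += 1
            # Simulate the ACK being lost on its way back to the sender.
            if acked and rng.random() >= ack_loss:
                break
    return deliveries


receiver = IdempotentReceiver()
messages = [(1, 10), (2, 20), (3, 30)]
deliveries = send_at_least_once(receiver, messages, random.Random(42))
print(deliveries, receiver.total)
```

However many resends the simulated ACK loss causes, the receiver's total reflects each message exactly once.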

Slide 43

Slide 43 text

Exactly-Once

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

Confidence

Slide 56

Slide 56 text

Black Box Testing

Slide 57

Slide 57 text

Black Box Testing

Slide 58

Slide 58 text

Black Box Testing

Slide 59

Slide 59 text

Black Box Testing

Slide 60

Slide 60 text

Black Box Testing

Slide 61

Slide 61 text

Black Box Testing: System Under Test

Slide 62

Slide 62 text

Black Box Testing: Input Source

Slide 63

Slide 63 text

Black Box Testing: Output Receiver

Slide 64

Slide 64 text

Black Box Testing: because Unit Testing isn't enough

Slide 65

Slide 65 text

Black Box Testing: because Integration Testing isn't enough

Slide 66

Slide 66 text

Black Box Testing: because composed components have interesting new failure modes

Slide 67

Slide 67 text

Black Box Testing: Test The Entire System

Slide 68

Slide 68 text

Black Box Testing: Test The Entire System end to end

Slide 69

Slide 69 text

Black Box Testing: Test The Entire System end to end and verify your expectations

Slide 70

Slide 70 text

Wesley: Expectation verification for Buffy

Slide 71

Slide 71 text

Wesley

Slide 72

Slide 72 text

Wesley Input

Slide 73

Slide 73 text

Wesley Output

Slide 74

Slide 74 text

Wesley Input Output

Slide 75

Slide 75 text

Input Source Wesley

Slide 76

Slide 76 text

Input Source Wesley Output Receiver

Slide 77

Slide 77 text

Wesley: Input Source records sent data: 1,2,3,4

Slide 78

Slide 78 text

Wesley: Input Source records sent data: 1,2,3,4. Output Receiver records received data: 2,4,6,8
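
A rough Python sketch of this kind of expectation check, assuming the application under test doubles each input. The `verify` function and its report format are hypothetical and are not Wesley's actual interface. It compares multisets, so it assumes ordering does not matter.

```python
from collections import Counter


def verify(sent, received, expected_fn):
    """Compare what the output receiver saw against what the input implies."""
    expected = Counter(expected_fn(x) for x in sent)
    actual = Counter(received)
    missing = expected - actual      # expected but never received
    unexpected = actual - expected   # received but not expected (or duplicated)
    return {
        "passed": not missing and not unexpected,
        "missing": sorted(missing.elements()),
        "unexpected": sorted(unexpected.elements()),
    }


def double(x):
    return x * 2


report = verify([1, 2, 3, 4], [2, 4, 6, 8], double)
print(report)
```

A failing run reports which expected outputs never arrived and which arrived without being expected, rather than only a pass/fail flag.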

Slide 79

Slide 79 text

Wesley

Slide 80

Slide 80 text

Wesley Analyze!

Slide 81

Slide 81 text

Wesley

Slide 82

Slide 82 text

Wesley

Slide 83

Slide 83 text

Wesley

Slide 84

Slide 84 text

Wesley

Slide 85

Slide 85 text

Wesley

Slide 86

Slide 86 text

Wesley

Slide 87

Slide 87 text

Wesley

Slide 88

Slide 88 text

Wesley

Slide 89

Slide 89 text

Wesley It Works!

Slide 90

Slide 90 text

Spike: Fault injection for Buffy

Slide 91

Slide 91 text

Fault Injection

Slide 92

Slide 92 text

Lineage-driven fault injection

Slide 93

Slide 93 text

Spike: LDFI Start from a good result

Slide 94

Slide 94 text

Spike: LDFI Input

Slide 95

Slide 95 text

Spike: LDFI Output

Slide 96

Slide 96 text

Spike: LDFI Figure out what can go wrong

Slide 97

Slide 97 text

Spike: LDFI Each "wrong" is a possible Nemesis

Slide 98

Slide 98 text

Spike: LDFI Our first nemesis: The Network

Slide 99

Slide 99 text

Spike: Determinism is key

Slide 100

Slide 100 text

Spike: Repeated runs with different results == Mostly Useless

Slide 101

Slide 101 text

Spike

Slide 102

Slide 102 text

Spike: Inject failures as informed by TCP

Slide 103

Slide 103 text

Spike TCP Guarantees:

Slide 104

Slide 104 text

Spike TCP Guarantees: Per connection in order delivery

Slide 105

Slide 105 text

Spike TCP Guarantees: • Per connection in order delivery • Per connection duplicate detection

Slide 106

Slide 106 text

Spike TCP Guarantees: • Per connection in order delivery • Per connection duplicate detection • Per connection retransmission of lost data

Slide 107

Slide 107 text

TCP in Pony: Event Driven

Slide 108

Slide 108 text

TCP in Pony: Event Driven

Slide 109

Slide 109 text

TCP in Pony: Event Driven

Slide 110

Slide 110 text

TCP in Pony: Event Driven

Slide 111

Slide 111 text

TCP in Pony: Event Driven

Slide 112

Slide 112 text

Useless Notifier

Slide 113

Slide 113 text

Useless Notifier

Slide 114

Slide 114 text

Useless Notifier

Slide 115

Slide 115 text

Nemesis #1: Dropped Connections

Slide 116

Slide 116 text

Spike: Drop Connection

Slide 117

Slide 117 text

Spike: Drop Connection

Slide 118

Slide 118 text

Spike: Drop Connection

Slide 119

Slide 119 text

Spike: Drop Connection

Slide 120

Slide 120 text

Spike: Drop Connection

Slide 121

Slide 121 text

Spike: Drop Connection • Incoming connection accepted

Slide 122

Slide 122 text

Spike: Drop Connection • Incoming connection accepted • Attempting outgoing connection

Slide 123

Slide 123 text

Spike: Drop Connection • Incoming connection accepted • Attempting outgoing connection • Connection established

Slide 124

Slide 124 text

Spike: Drop Connection • Incoming connection accepted • Attempting outgoing connection • Connection established • Data sent

Slide 125

Slide 125 text

Spike: Drop Connection • Incoming connection accepted • Attempting outgoing connection • Connection established • Data sent • Data received
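
A hedged Python sketch of how a drop decision could be made at each of the points listed above while keeping runs repeatable. The event names, the `DropConnectionSpike` class, and the seed and probability parameters are illustrative assumptions and are not Spike's actual configuration. The key choice is a private, seeded random generator, so the same seed and the same event sequence always produce the same drops.

```python
import random

EVENTS = ("accepted", "connecting", "connected", "sent", "received")


class DropConnectionSpike:
    """Decides, at each notifier event, whether to drop the connection."""

    def __init__(self, seed, probability):
        self.rng = random.Random(seed)  # private, seeded generator
        self.probability = probability

    def should_drop(self, event):
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        return self.rng.random() < self.probability


def run(seed, probability, event_stream):
    """Replay a fixed sequence of events and record where drops happen."""
    spike = DropConnectionSpike(seed, probability)
    return [i for i, event in enumerate(event_stream) if spike.should_drop(event)]


stream = ["accepted", "connecting", "connected"] + ["sent", "received"] * 10
first = run(seed=7, probability=0.2, event_stream=stream)
second = run(seed=7, probability=0.2, event_stream=stream)
print(first == second)  # True: same seed, same events, same drops
```

This only stays deterministic if the sequence of notifier calls is itself the same on every run.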

Slide 126

Slide 126 text

Integrating Spike "Double and Halve" app

Slide 127

Slide 127 text

Integrating Spike "Double and Halve" app

Slide 128

Slide 128 text

Integrating Spike "Double and Halve" app

Slide 129

Slide 129 text

Integrating Spike "Double and Halve" app

Slide 130

Slide 130 text

Integrating Spike "Double and Halve" app

Slide 131

Slide 131 text

Integrating Spike "Double and Halve" app

Slide 132

Slide 132 text

Integrating Spike "Double and Halve" app

Slide 133

Slide 133 text

Integrating Spike "Double and Halve" app

Slide 134

Slide 134 text

Integrating Spike "Double and Halve" app

Slide 135

Slide 135 text

Integrating Spike "Double and Halve" app

Slide 136

Slide 136 text

Integrating Spike "Double and Halve" app • Easy to verify

Slide 137

Slide 137 text

Integrating Spike "Double and Halve" app • Easy to verify • Messages cross process boundary

Slide 138

Slide 138 text

Integrating Spike "Double and Halve" app • Easy to verify • Messages cross process boundary • Messages cross network boundary

Slide 139

Slide 139 text

Integrating Spike • Double and Halve App

Slide 140

Slide 140 text

Integrating Spike • Double and Halve App • No Spiking

Slide 141

Slide 141 text

Integrating Spike • Double and Halve App • No Spiking • Test, Test, Test

Slide 142

Slide 142 text

Integrating Spike • Double and Halve App • No Spiking • Test, Test, Test • Wesley: It passes! It passes! It passes!

Slide 143

Slide 143 text

Integrating Spike • Double and Halve App

Slide 144

Slide 144 text

Integrating Spike • Double and Halve App • Spike with “drop connection”

Slide 145

Slide 145 text

Integrating Spike • Double and Halve App • Spike with “drop connection” • Test, Test, Test

Slide 146

Slide 146 text

Integrating Spike • Double and Halve App • Spike with “drop connection” • Test, Test, Test • Wesley: It fails! It fails! It fails!

Slide 147

Slide 147 text

Integrating Spike

Slide 148

Slide 148 text

Integrating Spike == Session Recovery!

Slide 149

Slide 149 text

Integrating Spike • Double and Halve App

Slide 150

Slide 150 text

Integrating Spike • Double and Halve App • Spike with “drop connection”

Slide 151

Slide 151 text

Integrating Spike • Double and Halve App • Spike with “drop connection” • Test, Test, Test

Slide 152

Slide 152 text

Integrating Spike • Double and Halve App • Spike with “drop connection” • Test, Test, Test • Wesley: It passes! It passes! It passes!

Slide 153

Slide 153 text

Spike: Repeated runs with different results == Mostly Useless

Slide 154

Slide 154 text

Determinism & Spike

Slide 155

Slide 155 text

Determinism & Spike: It's easy to get wrong

Slide 156

Slide 156 text

Determinism & Spike TCP delivery is not deterministic

Slide 157

Slide 157 text

Determinism & Spike TCP guarantees: Per connection in order delivery

Slide 158

Slide 158 text

Determinism & Spike TCP guarantees: • Per connection in order delivery • Per connection duplicate detection

Slide 159

Slide 159 text

Determinism & Spike TCP guarantees: • Per connection in order delivery • Per connection duplicate detection • Per connection retransmission of lost data

Slide 160

Slide 160 text

Determinism & Spike TCP guarantees: • Per connection in order delivery • Per connection duplicate detection • Per connection retransmission of lost data but it doesn't guarantee determinism

Slide 161

Slide 161 text

Determinism & Spike TCP delivery is not deterministic

Slide 162

Slide 162 text

Determinism & Spike TCP delivery is not deterministic

Slide 163

Slide 163 text

Determinism & Spike TCP delivery is not deterministic

Slide 164

Slide 164 text

Determinism & Spike TCP delivery is not deterministic. Per method call Spiking won't work

Slide 165

Slide 165 text

Determinism & Spike TCP delivery is not deterministic. Per method call Spiking won't work unless we make it work…

Slide 166

Slide 166 text

Determinism & Spike TCP message framing

Slide 167

Slide 167 text

Determinism & Spike TCP message framing

Slide 168

Slide 168 text

Determinism & Spike TCP message framing

Slide 169

Slide 169 text

Determinism & Spike TCP message framing

Slide 170

Slide 170 text

Determinism & Spike TCP message framing

Slide 171

Slide 171 text

Determinism & Spike TCP message framing

Slide 172

Slide 172 text

Determinism & Spike TCP message framing

Slide 173

Slide 173 text

Determinism & Spike TCP message framing

Slide 174

Slide 174 text

Determinism & Spike TCP message framing

Slide 175

Slide 175 text

Determinism & Spike Expect in action

Slide 176

Slide 176 text

Determinism & Spike Expect in action

Slide 177

Slide 177 text

Determinism & Spike Expect in action

Slide 178

Slide 178 text

Determinism & Spike Expect in action

Slide 179

Slide 179 text

Determinism & Spike Expect in action

Slide 180

Slide 180 text

Determinism & Spike Expect in action

Slide 181

Slide 181 text

Determinism & Spike Expect in action

Slide 182

Slide 182 text

Determinism & Spike Expect in action

Slide 183

Slide 183 text

Determinism & Spike Expect in action

Slide 184

Slide 184 text

Determinism & Spike Expect in action

Slide 185

Slide 185 text

Determinism & Spike Expect makes received deterministic

Slide 186

Slide 186 text

Determinism & Spike Expect makes received deterministic

Slide 187

Slide 187 text

Determinism & Spike Expect makes received deterministic

Slide 188

Slide 188 text

Determinism & Spike Expect makes received deterministic

Slide 189

Slide 189 text

Determinism & Spike Expect makes received deterministic

Slide 190

Slide 190 text

Determinism & Spike Expect makes received deterministic

Slide 191

Slide 191 text

Determinism & Spike Expect makes received deterministic

Slide 192

Slide 192 text

Determinism & Spike Received gets called with

Slide 193

Slide 193 text

Determinism & Spike then…

Slide 194

Slide 194 text

Determinism & Spike and then another…

Slide 195

Slide 195 text

Determinism & Spike and finally…

Slide 196

Slide 196 text

Determinism & Spike: Same number of notifier method calls no matter how the data arrives
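
A small Python sketch of why an expect-style buffer makes the number of `received` calls independent of how TCP chunks the bytes. It is an analogy in Python, not Pony's TCP code. The framing is an assumption (a 4-byte big-endian length header followed by the payload), and `ExpectBuffer`, `frame`, and `calls_for` are hypothetical names.

```python
import struct

HEADER_SIZE = 4  # assumed framing: 4-byte big-endian length, then payload


def frame(payload):
    return struct.pack(">I", len(payload)) + payload


class ExpectBuffer:
    """Buffers raw chunks; calls `received` only with exactly the expected bytes."""

    def __init__(self, received):
        self.received = received
        self.buffer = bytearray()
        self.expecting = HEADER_SIZE
        self.in_header = True

    def feed(self, chunk):
        self.buffer.extend(chunk)
        while len(self.buffer) >= self.expecting:
            data = bytes(self.buffer[:self.expecting])
            del self.buffer[:self.expecting]
            self.received(data)
            if self.in_header:
                # The header says how many payload bytes to expect next.
                self.expecting = struct.unpack(">I", data)[0]
                self.in_header = False
            else:
                self.expecting = HEADER_SIZE
                self.in_header = True


def calls_for(stream, chunk_size):
    """Feed the stream in chunks of `chunk_size` and list every received call."""
    calls = []
    buf = ExpectBuffer(calls.append)
    for i in range(0, len(stream), chunk_size):
        buf.feed(stream[i:i + chunk_size])
    return calls


stream = frame(b"hello") + frame(b"stream") + frame(b"processing")
print(calls_for(stream, 1) == calls_for(stream, 7) == calls_for(stream, len(stream)))
```

Three framed messages always produce six calls (a header and a payload each), whether the bytes arrive one at a time or all at once.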

Slide 197

Slide 197 text

Determinism & Spike: Drop Connection & Expect, fast deterministic friends

Slide 198

Slide 198 text

Nemesis #2: Slow Connections

Slide 199

Slide 199 text

Spike: Delay

Slide 200

Slide 200 text

Spike: Delay

Slide 201

Slide 201 text

Spike: Delay

Slide 202

Slide 202 text

Spike: Delay

Slide 203

Slide 203 text

Spike: Delay Delay overrides expect

Slide 204

Slide 204 text

Spike: Delay Delay overrides expect and controls the flow of bytes

Slide 205

Slide 205 text

Spike: Delay Delay overrides expect and controls the flow of bytes to maintain determinism
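
One way to picture "controls the flow of bytes to maintain determinism", as a Python sketch under stated assumptions: the delayer ignores the sizes in which bytes arrive and forwards pieces whose sizes come from a seeded generator. The actual waiting is left out, and `DelaySpike`, `max_piece`, and `pieces_for` are hypothetical names rather than Spike's real design.

```python
import random


class DelaySpike:
    """Re-chunks a byte stream into seeded piece sizes, whatever the arrival pattern."""

    def __init__(self, seed, forward, max_piece=8):
        self.rng = random.Random(seed)
        self.forward = forward
        self.max_piece = max_piece
        self.buffer = bytearray()
        self.next_size = self._pick()

    def _pick(self):
        return self.rng.randint(1, self.max_piece)

    def feed(self, chunk):
        self.buffer.extend(chunk)
        # Forward only when a full piece of the chosen size is available.
        while len(self.buffer) >= self.next_size:
            piece = bytes(self.buffer[:self.next_size])
            del self.buffer[:self.next_size]
            self.forward(piece)  # a real delayer would wait before forwarding
            self.next_size = self._pick()

    def close(self):
        if self.buffer:  # flush whatever is left at end of stream
            self.forward(bytes(self.buffer))
            self.buffer.clear()


def pieces_for(data, chunk_size, seed):
    out = []
    spike = DelaySpike(seed, out.append)
    for i in range(0, len(data), chunk_size):
        spike.feed(data[i:i + chunk_size])
    spike.close()
    return out


data = bytes(range(50))
print(pieces_for(data, 1, seed=3) == pieces_for(data, 50, seed=3))
```

The forwarded pieces depend only on the seed and the bytes themselves, so downstream code sees the same sequence of calls on every run.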

Slide 206

Slide 206 text

Spike: Delay

Slide 207

Slide 207 text

Spike: Delay

Slide 208

Slide 208 text

Spike: Delay

Slide 209

Slide 209 text

Spike: Delay

Slide 210

Slide 210 text

Spike: Delay TCP Spike

Slide 211

Slide 211 text

Spike: Delay TCP Spike

Slide 212

Slide 212 text

Spike: Delay TCP Spike

Slide 213

Slide 213 text

Spike: Delay TCP

Slide 214

Slide 214 text

Spike: Delay TCP TCP Spike

Slide 215

Slide 215 text

Spike: Delay TCP TCP TCP Spike Spike

Slide 216

Slide 216 text

Results

Slide 217

Slide 217 text

Results Found… • Bugs in Session Recovery

Slide 218

Slide 218 text

Results Found… • Bugs in Session Recovery • Bug in Pony standard library

Slide 219

Slide 219 text

Results Found… • Bugs in Session Recovery • Bug in Pony standard library • Bugs in Spike

Slide 220

Slide 220 text

Results Found… • Bugs in Session Recovery • Bug in Pony standard library • Bugs in Spike • And more bugs…

Slide 221

Slide 221 text

Results Found… Determinism is key

Slide 222

Slide 222 text

Results Found… Determinism is key but hard to achieve

Slide 223

Slide 223 text

Data Lineage

Slide 224

Slide 224 text

WARNING!!! Vaporware ahead

Slide 225

Slide 225 text

Data Lineage: Output. How did I get here?

Slide 226

Slide 226 text

Data Lineage: Output

Slide 227

Slide 227 text

Data Lineage Input: 1,2,3

Slide 228

Slide 228 text

Data Lineage Input: 1,2,3 Expect: 2,4,6

Slide 229

Slide 229 text

Data Lineage Input: 1,2,3 Expect: 2,4,6 Get: 4,6

Slide 230

Slide 230 text

Data Lineage Input: 1,2,3 Expect: 2,4,6 Get: 4,6 These are not our beautiful results. How did we get here?

Slide 231

Slide 231 text

Data Lineage Input: 1,2,3

Slide 232

Slide 232 text

Data Lineage Input: 1,2,3 Expect: 2,4,6

Slide 233

Slide 233 text

Data Lineage Input: 1,2,3 Expect: 2,4,6 Get: 2,6,12

Slide 234

Slide 234 text

Data Lineage Input: 1,2,3 Expect: 2,4,6 Get: 2,6,12 ¯\_(ツ)_/¯

Slide 235

Slide 235 text

Data Lineage to the Rescue!

Slide 236

Slide 236 text

Data Lineage Externally verify determinism

Slide 237

Slide 237 text

Data Lineage Externally verify determinism is it REALLY deterministic?

Slide 238

Slide 238 text

Data Lineage Find incorrect executions

Slide 239

Slide 239 text

Data Lineage Find incorrect executions bugs in Wallaroo

Slide 240

Slide 240 text

Data Lineage Input: 1 Expected: 2 Got: 4 ¯\_(ツ)_/¯

Slide 241

Slide 241 text

Data Lineage Execution path was… when it should have been

Slide 242

Slide 242 text

Data Lineage Execution path was… when it should have been
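
The slides flag data lineage as vaporware, so the following is only a toy Python illustration of the idea: each step appends its name to a path carried with the value, and the recorded path is compared with the intended one. The step names, `traced`, `run_pipeline`, and the deliberately miswired pipeline are all hypothetical and do not reproduce the paths shown on the slides.

```python
def traced(name, fn):
    """Wrap a step so it records its name in the value's execution path."""
    def step(value, path):
        return fn(value), path + [name]
    return step


def run_pipeline(steps, value):
    path = []
    for step in steps:
        value, path = step(value, path)
    return value, path


double = traced("double", lambda x: x * 2)

expected_path = ["double"]
# Hypothetical bug: the message is routed through the doubling step twice.
value, path = run_pipeline([double, double], 1)
print(value, path, path == expected_path)
```

With the path recorded next to the result, "Input: 1 Expected: 2 Got: 4" stops being a mystery: the lineage shows the extra step.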

Slide 243

Slide 243 text

Data Lineage Useful outside of development

Slide 244

Slide 244 text

Data Lineage Production Debugging

Slide 245

Slide 245 text

Data Lineage Production Debugging how did I get here?

Slide 246

Slide 246 text

Data Lineage Audit Log

Slide 247

Slide 247 text

Data Lineage Audit Log why did you do that?

Slide 248

Slide 248 text

Data Lineage Hindsight Machine

Slide 249

Slide 249 text

Building Confidence is difficult

Slide 250

Slide 250 text

and frustrating

Slide 251

Slide 251 text

No content

Slide 252

Slide 252 text

Peter Alvaro @palvaro Lineage-driven Fault Injection: http://www.cs.berkeley.edu/~palvaro/molly.pdf Outwards from the Middle of the Maze: https://www.youtube.com/watch?v=ggCffvKEJmQ

Slide 253

Slide 253 text

Kyle Kingsbury @aphyr Jepsen: https://aphyr.com/tags/Jepsen

Slide 254

Slide 254 text

Will Wilson Testing Distributed Systems w/ Deterministic Simulation: https://www.youtube.com/watch?v=4fFDFbi3toc

Slide 255

Slide 255 text

Caitie McCaffrey @caitie The Verification of a Distributed System: A practitioner's guide to increasing confidence in system correctness http://queue.acm.org/detail.cfm?ref=rss&id=2889274 The Verification of a Distributed System: https://www.infoq.com/presentations/distributed-systems-verification

Slide 256

Slide 256 text

Inés Sombra @randommood Testing in a Distributed World: https://www.youtube.com/watch?v=KSdNYi55kjg

Slide 257

Slide 257 text

Chaos Engineering Principles of Chaos Engineering: http://principlesofchaos.org

Slide 258

Slide 258 text

No content

Slide 259

Slide 259 text

Thanks: Peter Alvaro, Sylvan Clebsch, Zeeshan Lakhani, John Mumm, Rob Roland, Andrew Turley

Slide 260

Slide 260 text

@SeanTAllen Note: The 'T' is very important