How did I get here? Building Confidence in a Distributed Stream Processor

Sean T Allen
November 04, 2016

When we build a distributed application, how can we be confident that our results are correct? We can test our business logic over and over, but if the engine executing it isn't trustworthy, we can't trust our results.

How can we build trust in our execution engines? We need to test them. It's hard enough to test a stream processor that runs on a single machine; it gets even more complicated with a distributed stream processor. As Kyle Kingsbury's Jepsen series has shown, we have a long way to go in creating tests that give us confidence that our systems are trustworthy.

At Sendence, we're building a distributed streaming data analytics engine that we want to prove is trustworthy. This talk focuses on the means we have come up with to create repeatable tests that let us start trusting that our system gives correct results. You’ll learn how to combine repeatable programmatic fault injection, message tracing, and auditing to create a trustworthy system. Together, we’ll move through the design process, repeatedly answering the questions “What do we have to do to trust this result?” and “If we get the wrong result, how can we determine what went wrong so we can fix it?” Hopefully you’ll leave this talk inspired to apply a similar process to your own projects.
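As a rough illustration of what "repeatable programmatic fault injection" combined with end-to-end verification can look like, here is a minimal Python sketch. It is not Sendence's implementation: the names `SeededFaultInjector`, `run_doubler`, and `verify` are hypothetical, and the "system under test" is a toy step that doubles each input. The point it tries to show is that driving faults from a fixed seed makes a failing run reproducible.

```python
import random


class DroppedConnection(Exception):
    """Simulated network fault raised by the hypothetical injector."""


class SeededFaultInjector:
    """Decides when to drop a connection, driven only by a fixed seed."""

    def __init__(self, seed, drop_probability):
        self._rng = random.Random(seed)
        self._drop_probability = drop_probability

    def maybe_drop(self):
        if self._rng.random() < self._drop_probability:
            raise DroppedConnection()


def run_doubler(inputs, injector, resend_on_drop):
    """Send each input through a toy 'double it' step behind the injector.

    Returns (sent, received): what the input source recorded as sent and
    what the output receiver recorded as received. When resending, keep
    drop_probability below 1.0 or the retry loop never finishes.
    """
    sent, received = [], []
    for value in inputs:
        sent.append(value)
        delivered = False
        while not delivered:
            try:
                injector.maybe_drop()
                received.append(value * 2)
                delivered = True
            except DroppedConnection:
                if not resend_on_drop:
                    break  # the message is lost
    return sent, received


def verify(sent, received):
    """Black-box check: output must be each sent value doubled, in order."""
    return received == [value * 2 for value in sent]
```

Running `run_doubler` twice with the same seed yields the same `(sent, received)` pair, so a failure found once can be replayed. With `resend_on_drop=True` the check passes regardless of where the drops land, which is the property a recovery mechanism is supposed to provide.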

Talk objectives:

- Understand the need for verification of distributed systems.
- Learn approaches and techniques for verifying distributed systems.
- Understand some of the challenges of, and solutions for, verifying stream processing systems.

Target audience:

- Developers and architects interested in practical approaches to verifying correctness in a distributed system.

Transcript

  1. How Did I Get Here? Building Confidence in a Distributed Stream Processor. sean@sendence.com @SeanTAllen
  2. Sean T. Allen

  3. T

  4. T

  9. Experience Report

  10. Stateful Stream Processor

  12. Sendence Wallaroo

  15. Prototype Started January 2016

  16. Prototype Started January 2016

  17. Production Version 1 December 2016

  18. Production Version 1 December 2016

  19. America is all about speed. Hot, nasty, bad-ass speed. —

    Eleanor Roosevelt
  20. Goals: High Throughput

  21. Goals: Low Latency

  22. Goals: Less Hardware

  23. 2014-class MacBook Pro: 220k events a second, 99.99% processed in less than 500 µs
  24. America is all about data quality. Quiet, demure data quality.

    — Andrew Jackson
  25. Goals: High Fidelity

  26. Stream Processing

  27. Message at a time

  28. Never ending

  29. Failure

  30. Machine Failure

  31. Slow Machine

  32. Segfaulting Process

  33. GC Pause

  34. Network Error

  35. Failure Happens

  36. Delivery Guarantees

  37. At-Most-Once

  38. At-Most-Once: Best Effort

  39. At-Least-Once

  40. At-Least-Once: ACK or resend

  41. Exactly-Once

  42. Exactly-Once: At-Least-Once + Idempotence

  43. Exactly-Once

  55. Confidence

  56. Black Box Testing

  57. Black Box Testing

  58. Black Box Testing

  59. Black Box Testing

  60. Black Box Testing

  61. Black Box Testing: System Under Test

  62. Black Box Testing: Input Source

  63. Black Box Testing: Output Receiver

  64. Black Box Testing: because Unit Testing isn't enough

  65. Black Box Testing: because Integration Testing isn't enough

  66. Black Box Testing: because composed components have interesting new failure modes

  67. Black Box Testing: Test The Entire System

  68. Black Box Testing: Test The Entire System end to end

  69. Black Box Testing: Test The Entire System end to end and verify your expectations

  70. Wesley: Expectation verification for Buffy

  71. Wesley

  72. Wesley Input

  73. Wesley Output

  74. Wesley Input Output

  75. Input Source Wesley

  76. Input Source Wesley Output Receiver

  77. Wesley: Input Source records sent data: 1,2,3,4

  78. Wesley: Input Source records sent data: 1,2,3,4. Output Receiver records received data: 2,4,6,8
  79. Wesley

  80. Wesley Analyze!

  81. Wesley

  82. Wesley

  83. Wesley

  84. Wesley

  85. Wesley

  86. Wesley

  87. Wesley

  88. Wesley

  89. Wesley It Works!

  90. Spike: Fault injection for Buffy

  91. Fault Injection

  92. Lineage-driven fault injection

  93. Spike: LDFI. Start from a good result

  94. Spike: LDFI. Input

  95. Spike: LDFI. Output

  96. Spike: LDFI. Figure out what can go wrong

  97. Spike: LDFI. Each "wrong" is a possible Nemesis

  98. Spike: LDFI. Our first nemesis: The Network

  99. Spike: Determinism is key

  100. Spike: Repeated runs with different results == Mostly Useless

  101. Spike

  102. Spike: Inject failures as informed by TCP

  103. Spike: TCP Guarantees:

  104. Spike: TCP Guarantees: per connection in-order delivery

  105. Spike: TCP Guarantees: per connection in-order delivery; per connection duplicate detection

  106. Spike: TCP Guarantees: per connection in-order delivery; per connection duplicate detection; per connection retransmission of lost data
  107. TCP in Pony: Event Driven

  108. TCP in Pony: Event Driven

  109. TCP in Pony: Event Driven

  110. TCP in Pony: Event Driven

  111. TCP in Pony: Event Driven

  112. Useless Notifier

  113. Useless Notifier

  114. Useless Notifier

  115. Nemesis #1: Dropped Connections

  116. Spike: Drop Connection

  117. Spike: Drop Connection

  118. Spike: Drop Connection

  119. Spike: Drop Connection

  120. Spike: Drop Connection

  121. Spike: Drop Connection • Incoming connection accepted

  122. Spike: Drop Connection • Incoming connection accepted • Attempting outgoing

    connection
  123. Spike: Drop Connection • Incoming connection accepted • Attempting outgoing

    connection • Connection established
  124. Spike: Drop Connection • Incoming connection accepted • Attempting outgoing

    connection • Connection established • Data sent
  125. Spike: Drop Connection • Incoming connection accepted • Attempting outgoing

    connection • Connection established • Data sent • Data received
  126. Integrating Spike "Double and Halve" app

  127. Integrating Spike "Double and Halve" app

  128. Integrating Spike "Double and Halve" app

  129. Integrating Spike "Double and Halve" app

  130. Integrating Spike "Double and Halve" app

  131. Integrating Spike "Double and Halve" app

  132. Integrating Spike "Double and Halve" app

  133. Integrating Spike "Double and Halve" app

  134. Integrating Spike "Double and Halve" app

  135. Integrating Spike "Double and Halve" app

  136. Integrating Spike: "Double and Halve" app • Easy to verify

  137. Integrating Spike: "Double and Halve" app • Easy to verify • Messages cross process boundary

  138. Integrating Spike: "Double and Halve" app • Easy to verify • Messages cross process boundary • Messages cross network boundary
  139. Integrating Spike • Double and Halve App

  140. Integrating Spike • Double and Halve App • No Spiking

  141. Integrating Spike • Double and Halve App • No Spiking

    • Test, Test, Test
  142. Integrating Spike • Double and Halve App • No Spiking

    • Test, Test, Test • Wesley: It passes! It passes! It passes!
  143. Integrating Spike • Double and Halve App

  144. Integrating Spike • Double and Halve App • Spike with

    “drop connection”
  145. Integrating Spike • Double and Halve App • Spike with

    “drop connection” • Test, Test, Test
  146. Integrating Spike • Double and Halve App • Spike with

    “drop connection” • Test, Test, Test • Wesley: It fails! It fails! It fails!
  147. Integrating Spike

  148. Integrating Spike == Session Recovery!

  149. Integrating Spike • Double and Halve App

  150. Integrating Spike • Double and Halve App • Spike with

    “drop connection”
  151. Integrating Spike • Double and Halve App • Spike with

    “drop connection” • Test, Test, Test
  152. Integrating Spike • Double and Halve App • Spike with

    “drop connection” • Test, Test, Test • Wesley: It passes! It passes! It passes!
  153. Spike: Repeated runs with different results == Mostly Useless

  154. Determinism & Spike

  155. Determinism & Spike: It's easy to get wrong

  156. Determinism & Spike: TCP delivery is not deterministic

  157. Determinism & Spike: TCP guarantees: per connection in-order delivery

  158. Determinism & Spike: TCP guarantees: per connection in-order delivery; per connection duplicate detection

  159. Determinism & Spike: TCP guarantees: per connection in-order delivery; per connection duplicate detection; per connection retransmission of lost data

  160. Determinism & Spike: TCP guarantees: per connection in-order delivery; per connection duplicate detection; per connection retransmission of lost data, but it doesn't guarantee determinism

  161. Determinism & Spike: TCP delivery is not deterministic

  162. Determinism & Spike: TCP delivery is not deterministic

  163. Determinism & Spike: TCP delivery is not deterministic

  164. Determinism & Spike: TCP delivery is not deterministic. Per method call Spiking won't work

  165. Determinism & Spike: TCP delivery is not deterministic. Per method call Spiking won't work unless we make it work…
  166. Determinism & Spike TCP message framing

  167. Determinism & Spike TCP message framing

  168. Determinism & Spike TCP message framing

  169. Determinism & Spike TCP message framing

  170. Determinism & Spike TCP message framing

  171. Determinism & Spike TCP message framing

  172. Determinism & Spike TCP message framing

  173. Determinism & Spike TCP message framing

  174. Determinism & Spike TCP message framing

  175. Determinism & Spike Expect in action

  176. Determinism & Spike Expect in action

  177. Determinism & Spike Expect in action

  178. Determinism & Spike Expect in action

  179. Determinism & Spike Expect in action

  180. Determinism & Spike Expect in action

  181. Determinism & Spike Expect in action

  182. Determinism & Spike Expect in action

  183. Determinism & Spike Expect in action

  184. Determinism & Spike Expect in action

  185. Determinism & Spike Expect makes received deterministic

  186. Determinism & Spike Expect makes received deterministic

  187. Determinism & Spike Expect makes received deterministic

  188. Determinism & Spike Expect makes received deterministic

  189. Determinism & Spike Expect makes received deterministic

  190. Determinism & Spike Expect makes received deterministic

  191. Determinism & Spike Expect makes received deterministic

  192. Determinism & Spike Received gets called with

  193. Determinism & Spike then…

  194. Determinism & Spike and then another…

  195. Determinism & Spike and finally…

  196. Determinism & Spike: Same number of notifier method calls no matter how the data arrives

  197. Determinism & Spike: Drop Connection & Expect: fast deterministic friends

  198. Nemesis #2: Slow Connections

  199. Spike: Delay

  200. Spike: Delay

  201. Spike: Delay

  202. Spike: Delay

  203. Spike: Delay. Delay overrides expect

  204. Spike: Delay. Delay overrides expect and controls the flow of bytes

  205. Spike: Delay. Delay overrides expect and controls the flow of bytes to maintain determinism
  206. Spike: Delay

  207. Spike: Delay

  208. Spike: Delay

  209. Spike: Delay

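  210. Spike: Delay (diagram labels: TCP, Spike)

  211. Spike: Delay (diagram labels: TCP, Spike)

  212. Spike: Delay (diagram labels: TCP, Spike)

  213. Spike: Delay (diagram labels: TCP)

  214. Spike: Delay (diagram labels: TCP, TCP, Spike)

  215. Spike: Delay (diagram labels: TCP, TCP, TCP, Spike, Spike)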

  216. Results

  217. Results: Found… • Bugs in Session Recovery

  218. Results: Found… • Bugs in Session Recovery • Bug in Pony standard library

  219. Results: Found… • Bugs in Session Recovery • Bug in Pony standard library • Bugs in Spike

  220. Results: Found… • Bugs in Session Recovery • Bug in Pony standard library • Bugs in Spike • And more bugs…

  221. Results: Found… Determinism is key

  222. Results: Found… Determinism is key, but hard to achieve

  223. Data Lineage

  224. WARNING!!! Vaporware ahead

  225. Output Data Lineage How did I get here?

  226. Output Data Lineage

  227. Data Lineage Input: 1,2,3

  228. Data Lineage Input: 1,2,3 Expect: 2,4,6

  229. Data Lineage Input: 1,2,3 Expect: 2,4,6 Get: 4,6

  230. Data Lineage Input: 1,2,3 Expect: 2,4,6 Get: 4,6 How did we get here? These are not our beautiful results
  231. Data Lineage Input: 1,2,3

  232. Data Lineage Input: 1,2,3 Expect: 2,4,6

  233. Data Lineage Input: 1,2,3 Expect: 2,4,6 Get: 2,6,12

  234. Data Lineage Input: 1,2,3 Expect: 2,4,6 Get: 2,6,12 ¯\_(ツ)_/¯

  235. Data Lineage to the Rescue!

  236. Data Lineage Externally verify determinism

  237. Data Lineage Externally verify determinism is it REALLY deterministic?

  238. Data Lineage Find incorrect executions

  239. Data Lineage Find incorrect executions bugs in Wallaroo

  240. Data Lineage Input: 1 Expected: 2 Got: 4 ¯\_(ツ)_/¯

  241. Data Lineage Execution path was… when it should have been

  242. Data Lineage when it should have been Execution path was…

  243. Data Lineage Useful outside of development

  244. Data Lineage Production Debugging

  245. Data Lineage Production Debugging how did I get here?

  246. Data Lineage Audit Log

  247. Data Lineage Audit Log why did you do that?

  248. Data Lineage Hindsight Machine

  249. Building Confidence is difficult

  250. and frustrating

  251. None
  252. Peter Alvaro (@palvaro). Lineage-driven Fault Injection: http://www.cs.berkeley.edu/~palvaro/molly.pdf. Outwards from the Middle of the Maze: https://www.youtube.com/watch?v=ggCffvKEJmQ

  253. Kyle Kingsbury (@aphyr). Jepsen: https://aphyr.com/tags/Jepsen

  254. Will Wilson. Testing Distributed Systems w/ Deterministic Simulation: https://www.youtube.com/watch?v=4fFDFbi3toc

  255. Caitie McCaffrey (@caitie). The Verification of a Distributed System: A practitioner's guide to increasing confidence in system correctness: http://queue.acm.org/detail.cfm?ref=rss&id=2889274. The Verification of a Distributed System (talk): https://www.infoq.com/presentations/distributed-systems-verification

  256. Inés Sombra (@randommood). Testing in a Distributed World: https://www.youtube.com/watch?v=KSdNYi55kjg

  257. Principles of Chaos Engineering: http://principlesofchaos.org

  258. None
  259. Thanks: Peter Alvaro, Sylvan Clebsch, Zeeshan Lakhani, John Mumm, Rob Roland, Andrew Turley
  260. @SeanTAllen Note: The 'T' is very important
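
The "TCP message framing" and "Expect" slides (166 to 196) make the point that a notifier sees the same number of method calls no matter how the data arrives. The Python sketch below is one way to picture that idea, assuming a 4-byte big-endian length header in front of each message. The slides do not give the wire format, and `ExpectConnection`, `FramingNotifier`, `frame`, and `deliver` are hypothetical names for illustration, not Pony's API.

```python
import struct

HEADER_SIZE = 4  # assumed: 4-byte big-endian length prefix


class FramingNotifier:
    """Alternates between reading a header and reading a payload."""

    def __init__(self):
        self.calls = 0       # how many times received() fired
        self.messages = []   # complete payloads, in arrival order
        self._reading_header = True

    def received(self, conn, data):
        self.calls += 1
        if self._reading_header:
            # The header says how many payload bytes to wait for next.
            conn.expect(struct.unpack(">I", data)[0])
        else:
            self.messages.append(data)
            conn.expect(HEADER_SIZE)
        self._reading_header = not self._reading_header


class ExpectConnection:
    """Buffers bytes and calls received() only with exactly the
    number of bytes most recently requested via expect()."""

    def __init__(self, notifier):
        self._notifier = notifier
        self._buffer = bytearray()
        self._expected = HEADER_SIZE

    def expect(self, size):
        self._expected = size

    def on_bytes(self, chunk):
        self._buffer.extend(chunk)
        while len(self._buffer) >= self._expected:
            size = self._expected
            data = bytes(self._buffer[:size])
            del self._buffer[:size]
            self._notifier.received(self, data)


def frame(payload):
    """Prefix a payload with its length."""
    return struct.pack(">I", len(payload)) + payload


def deliver(stream, chunk_size):
    """Feed the stream to a fresh connection in chunk_size pieces."""
    notifier = FramingNotifier()
    conn = ExpectConnection(notifier)
    for start in range(0, len(stream), chunk_size):
        conn.on_bytes(stream[start:start + chunk_size])
    return notifier
```

Delivering `frame(b"hello") + frame(b"hi")` one byte at a time, three bytes at a time, or all at once produces four `received` calls in every case (one header and one payload per message). That fixed call count is what would let a fault injector key its decisions to "the Nth notifier call" and still get the same behavior on every run.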