How did I get here? Building confidence in a distributed stream processor

When we build a distributed application, how do we have confidence that our results are correct? We can test our business logic over and over, but if the engine executing it isn't trustworthy, we can't trust our results.

How can we build trust in our execution engines? We need to test them. Testing a stream processor that runs on a single machine is hard enough; testing a distributed one is harder still. As Kyle Kingsbury's Jepsen series has shown, we have a long way to go in creating tests that give us confidence that our systems are trustworthy.

At Sendence, we're building a distributed streaming data analytics engine that we want to prove is trustworthy. This talk will focus on the means we have come up with to create repeatable tests that let us start trusting that our system gives correct results. You’ll learn how to combine repeatable programmatic fault injection, message tracing, and auditing to create a trustworthy system. Together, we’ll move through the design process, repeatedly answering the questions “What do we have to do to trust this result?” and “If we get the wrong result, how can we determine what went wrong so we can fix it?” Hopefully you’ll leave this talk inspired to apply a similar process to your own projects.
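
As a rough illustration of how these pieces can fit together, here is a minimal Python sketch that combines a repeatable fault schedule, a per-message trace, and an end-to-end check. Everything in it is a hypothetical stand-in and not Sendence's implementation: the `run_pipeline` function, the fault names `drop_message` and `drop_ack`, and the doubling step are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Trace = List[Tuple[int, int, str]]  # (attempt number, message id, event)


def run_pipeline(inputs: List[int], faults: Dict[int, str]) -> Tuple[List[int], Trace]:
    """Push inputs through a doubling step over a simulated lossy link.

    `faults` maps a send-attempt number to "drop_message" or "drop_ack"
    (hypothetical fault names). The schedule is explicit, so the same
    schedule always replays the same run.
    """
    trace: Trace = []
    results: Dict[int, int] = {}  # receiver state, keyed by message id
    attempt = 0

    for msg_id, value in enumerate(inputs):
        acked = False
        while not acked:
            fault = faults.get(attempt)
            this_attempt = attempt
            attempt += 1

            if fault == "drop_message":
                trace.append((this_attempt, msg_id, "message dropped"))
                continue  # sender never sees an ack, so it resends

            if msg_id in results:
                # Idempotent receiver: a resend must not be processed twice.
                trace.append((this_attempt, msg_id, "duplicate ignored"))
            else:
                results[msg_id] = value * 2
                trace.append((this_attempt, msg_id, "processed"))

            if fault == "drop_ack":
                trace.append((this_attempt, msg_id, "ack dropped"))
                continue  # receiver has the message, sender does not know

            acked = True

    return [results[i] for i in range(len(inputs))], trace


if __name__ == "__main__":
    schedule = {1: "drop_message", 3: "drop_ack"}
    outputs, trace = run_pipeline([1, 2, 3, 4], schedule)
    assert outputs == [2, 4, 6, 8]  # "What do we have to do to trust this result?"
    for event in trace:
        print(event)
```

If the final check fails, the trace is what helps answer the second question: it records which attempt hit which fault and what the receiver did with each message.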

Sean T Allen

June 13, 2016

Transcript

  1. How Did I Get Here? Building Confidence in a Distributed

    Stream Processor
  2. Sean T. Allen

  3. T

  4. T

  5-8. None
  9. Experience Report

  10. Stream Processor

  11. None
  12-13. Prototype Started January 2016

  14-15. Production Started April 2016

  16. America is all about speed. Hot, nasty, bad-ass speed. —

    Eleanor Roosevelt
  17. High Throughput Buffy: Goals

  18. Low Latency Buffy: Goals

  19. Less Hardware Buffy: Goals

  20. America is all about data quality. Quiet, demure data quality.

    — Andrew Jackson
  21. High Fidelity Buffy: Goals

  22. Stream Processing

  23. Message at a time

  24. Never ending

  25. Failure

  26. Machine Failure

  27. Slow Machine

  28. Segfaulting Process

  29. GC Pause

  30. Network Error

  31. Failure Happens

  32. Delivery Guarantees

  33. At-Most-Once

  34. At-Most-Once Best Effort

  35. At-Least-Once

  36. At-Least-Once ACK or resend

  37. Exactly-Once

  38. Exactly-Once At-Least-Once + Idempotence

  39. Exactly-Once

  40-61. None
  62. Confidence

  63-67. Black Box Testing

  68. Black Box Testing: System Under Test

  69. Black Box Testing: Input Source

  70. Black Box Testing: Output Receiver

  71. Black Box Testing: because Unit Testing isn't enough

  72. Black Box Testing: because Integration Testing isn't enough

  73. Black Box Testing: because composed components have interesting new failure modes

  74. Black Box Testing: Test The Entire System

  75. Black Box Testing: Test The Entire System end to end

  76. Black Box Testing: Test The Entire System end to end and verify your expectations
  77. Wesley Expectation verification for Buffy

  78. Wesley

  79. Wesley Input

  80. Wesley Output

  81. Wesley Input Output

  82. Input Source Wesley

  83. Input Source Wesley Output Receiver

  84. Wesley: Input Source records sent data: 1,2,3,4

  85. Wesley: Input Source records sent data: 1,2,3,4. Output Receiver records received data: 2,4,6,8
  86. Wesley

  87. Wesley Analyze!

  88-95. Wesley

  96. Wesley It Works!

  97. Spike Fault injection for Buffy

  98. Fault Injection

  99. Lineage-driven fault injection

  100. Spike: LDFI - Start from a good result

  101. Spike: LDFI - Input

  102. Spike: LDFI - Output

  103. Spike: LDFI - Figure out what can go wrong

  104. Spike: LDFI - Each "wrong" is a possible Nemesis

  105. Spike: LDFI - Our first nemesis: The Network

  106. Spike: Determinism is key

  107. Spike: Repeated runs with different results == Mostly Useless

  108. Spike

  109. Spike Inject failures as informed by TCP

  110. Spike TCP Guarantees:

  111. Spike TCP Guarantees: Per connection in order delivery

  112. Spike TCP Guarantees: Per connection in order delivery • Per connection duplicate detection

  113. Spike TCP Guarantees: Per connection in order delivery • Per connection duplicate detection • Per connection retransmission of lost data
  114-118. TCP in Pony: Event Driven

  119-121. Useless Notifier

  122. Nemesis #1: Dropped Connections

  123-127. Spike: Drop Connection

  128. Spike: Drop Connection • Incoming connection accepted

  129. Spike: Drop Connection • Incoming connection accepted • Attempting outgoing

    connection
  130. Spike: Drop Connection • Incoming connection accepted • Attempting outgoing

    connection • Connection established
  131. Spike: Drop Connection • Incoming connection accepted • Attempting outgoing

    connection • Connection established • Data sent
  132. Spike: Drop Connection • Incoming connection accepted • Attempting outgoing

    connection • Connection established • Data sent • Data received
  133-142. Integrating Spike "Double and Halve" app

  143. Integrating Spike "Double and Halve" app • Easy to verify

  144. Integrating Spike "Double and Halve" app • Easy to verify • Messages cross process boundary

  145. Integrating Spike "Double and Halve" app • Easy to verify • Messages cross process boundary • Messages cross network boundary
  146. Integrating Spike • Double and Halve App

  147. Integrating Spike • Double and Halve App • No Spiking

  148. Integrating Spike • Double and Halve App • No Spiking

    • Test, Test, Test
  149. Integrating Spike • Double and Halve App • No Spiking

    • Test, Test, Test • Wesley: It passes! It passes! It passes!
  150. Integrating Spike • Double and Halve App

  151. Integrating Spike • Double and Halve App • Spike with

    “drop connection”
  152. Integrating Spike • Double and Halve App • Spike with

    “drop connection” • Test, Test, Test
  153. Integrating Spike • Double and Halve App • Spike with

    “drop connection” • Test, Test, Test • Wesley: It fails! It fails! It fails!
  154. Integrating Spike

  155. Integrating Spike == Session Recovery!

  156. Integrating Spike • Double and Halve App

  157. Integrating Spike • Double and Halve App • Spike with

    “drop connection”
  158. Integrating Spike • Double and Halve App • Spike with

    “drop connection” • Test, Test, Test
  159. Integrating Spike • Double and Halve App • Spike with

    “drop connection” • Test, Test, Test • Wesley: It passes! It passes! It passes!
  160. Spike: Repeated runs with different results == Mostly Useless

  161. Determinism & Spike

  162. It's easy to get wrong Determinism & Spike

  163. Determinism & Spike TCP delivery is not deterministic

  164. Determinism & Spike TCP guarantees: Per connection in order delivery

  165. Determinism & Spike TCP guarantees: Per connection in order delivery • Per connection duplicate detection

  166. Determinism & Spike TCP guarantees: Per connection in order delivery • Per connection duplicate detection • Per connection retransmission of lost data

  167. Determinism & Spike TCP guarantees: Per connection in order delivery • Per connection duplicate detection • Per connection retransmission of lost data, but it doesn't guarantee determinism

  168-170. Determinism & Spike TCP delivery is not deterministic

  171. Determinism & Spike TCP delivery is not deterministic Per method

    call Spiking won't work
  172. Determinism & Spike TCP delivery is not deterministic Per method

    call Spiking won't work unless we make it work…
  173-181. Determinism & Spike TCP message framing

  182-191. Determinism & Spike Expect in action

  192-198. Determinism & Spike Expect makes received deterministic

  199. Determinism & Spike Received gets called with

  200. Determinism & Spike then…

  201. Determinism & Spike and then another…

  202. Determinism & Spike and finally…

  203. Determinism & Spike: Same number of notifier method calls no matter how the data arrives

  204. Determinism & Spike: Drop Connection & Expect, fast deterministic friends
  205. Nemesis #1: Slow Connections

  206-209. Spike: Delay

  210. Spike: Delay Delay overrides expect

  211. Spike: Delay Delay overrides expect and controls the flow of

    bytes
  212. Spike: Delay Delay overrides expect and controls the flow of

    bytes to maintain determinism
  213-216. Spike: Delay

  217-219. Spike: Delay r TCP Spike

  220. Spike: Delay TCP

  221. Spike: Delay TCP TCP Spike

  222. Spike: Delay TCP TCP TCP Spike Spike

  223. Early Results

  224. Early Results: Found… • Bugs in Session Recovery

  225. Early Results: Found… • Bugs in Session Recovery • Bug in Pony standard library

  226. Early Results: Found… • Bugs in Session Recovery • Bug in Pony standard library • Bugs in Spike

  227. Early Results: Found… • Bugs in Session Recovery • Bug in Pony standard library • Bugs in Spike • And more bugs…

  228. Early Results: Found… Determinism is key

  229. Early Results: Found… Determinism is key but hard to achieve

  230. Data Lineage

  231. WARNING!!! Vaporware ahead

  232. Output Data Lineage How did I get here?

  233. Output Data Lineage

  234. Data Lineage Input: 1,2,3

  235. Data Lineage Input: 1,2,3 Expect: 2,4,6

  236. Data Lineage Input: 1,2,3 Expect: 2,4,6 Get: 4,6

  237. Data Lineage Input: 1,2,3 Expect: 2,4,6 Get: 4,6 How did

    we get here? these are not our beautiful results
  238. Data Lineage Input: 1,2,3

  239. Data Lineage Input: 1,2,3 Expect: 2,4,6

  240. Data Lineage Input: 1,2,3 Expect: 2,4,6 Get: 2,6,12

  241. Data Lineage Input: 1,2,3 Expect: 2,4,6 Get: 2,6,12 ¯\_(ツ)_/¯

  242. Data Lineage to the Rescue!

  243. Data Lineage Externally verify determinism

  244. Data Lineage Externally verify determinism is it REALLY deterministic?

  245. Data Lineage Find incorrect executions

  246. Data Lineage Find incorrect executions bugs in Buffy

  247. Data Lineage Input: 1 Expected: 2 Got: 4 ¯\_(ツ)_/¯

  248. Data Lineage Execution path was… when it should have been

  249. Data Lineage when it should have been Execution path was…

  250. Data Lineage Useful outside of development

  251. Data Lineage Production Debugging

  252. Data Lineage Production Debugging how did I get here?

  253. Data Lineage Audit Log

  254. Data Lineage Audit Log why did you do that?

  255. Data Lineage Hindsight Machine

  256. Building Confidence is difficult

  257. and frustrating

  258. None
  259. Don't be this dog

  260. Be this dog

  261. None
  262. Peter Alvaro (@palvaro): Lineage-driven Fault Injection (http://www.cs.berkeley.edu/~palvaro/molly.pdf), Outwards from the Middle of the Maze (https://www.youtube.com/watch?v=ggCffvKEJmQ)

  263. Kyle Kingsbury (@aphyr): Jepsen (https://aphyr.com/tags/Jepsen)

  264. Will Wilson: Testing Distributed Systems w/ Deterministic Simulation (https://www.youtube.com/watch?v=4fFDFbi3toc)

  265. Caitie McCaffrey (@caitie): The Verification of a Distributed System: A practitioner's guide to increasing confidence in system correctness (http://queue.acm.org/detail.cfm?ref=rss&id=2889274). 2:55 PM Tomorrow in Salon E

  266. Inés Sombra (@randommood): Testing in a Distributed World (https://www.youtube.com/watch?v=KSdNYi55kjg)

  267. Principles of Chaos Engineering (http://principlesofchaos.org)

  268. None
  269. Thanks: Peter Alvaro, Sylvan Clebsch, Zeeshan Lakhani, John Mumm, Rob Roland, Andrew Turley
  270. @SeanTAllen Note: The 'T' is very important
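
The "TCP message framing" and "Expect" slides (173 to 203) make the point that a notifier should be called the same number of times no matter how the bytes arrive. The Python sketch below shows one way that idea can work, assuming each message is prefixed with a 4-byte big-endian length header. The `Framer` class, its `feed` method, and the `encode` helper are hypothetical names invented for this illustration; they are not Pony's API and not the code shown in the talk.

```python
import struct
from typing import Callable, List


class Framer:
    """Buffers raw bytes and calls `on_message` once per complete payload."""

    HEADER_SIZE = 4  # assumed framing: 4-byte big-endian length, then payload

    def __init__(self, on_message: Callable[[bytes], None]) -> None:
        self._on_message = on_message
        self._buffer = bytearray()
        self._expect = self.HEADER_SIZE  # how many bytes we are waiting for
        self._reading_header = True

    def feed(self, chunk: bytes) -> None:
        """Accept whatever the network delivered, in any chunk size."""
        self._buffer.extend(chunk)
        # Only act when exactly the expected number of bytes is available.
        while len(self._buffer) >= self._expect:
            data = bytes(self._buffer[:self._expect])
            del self._buffer[:self._expect]
            if self._reading_header:
                (length,) = struct.unpack(">I", data)
                self._expect = length
                self._reading_header = False
            else:
                self._on_message(data)
                self._expect = self.HEADER_SIZE
                self._reading_header = True


def encode(payload: bytes) -> bytes:
    """Prefix a payload with its length header."""
    return struct.pack(">I", len(payload)) + payload


def deliver(chunks: List[bytes]) -> List[bytes]:
    """Feed chunks to a fresh Framer and collect the callback invocations."""
    received: List[bytes] = []
    framer = Framer(received.append)
    for chunk in chunks:
        framer.feed(chunk)
    return received


if __name__ == "__main__":
    stream = encode(b"1") + encode(b"22") + encode(b"333")
    all_at_once = deliver([stream])
    byte_by_byte = deliver([stream[i:i + 1] for i in range(len(stream))])
    # Same callbacks, in the same order, however the bytes were chunked.
    assert all_at_once == byte_by_byte == [b"1", b"22", b"333"]
```

Because the callback fires per message rather than per network read, a fault injector can count callbacks (for example, "drop the connection on the third message") and hit the same point in the stream on every run.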