How did I get here? Building Confidence in a Distributed Stream Processor

Sean T Allen
November 04, 2016

When we build a distributed application, how do we have confidence that our results are correct? We can test our business logic over and over, but if the engine executing it isn't trustworthy, we can't trust our results.

How can we build trust in our execution engines? We need to test them. It's hard enough to test a stream processor that runs on a single machine; it gets even more complicated when the stream processor is distributed. As Kyle Kingsbury's Jepsen series has shown, we have a long way to go in creating tests that can provide confidence that our systems are trustworthy.

At Sendence, we're building a distributed streaming data analytics engine that we want to prove is trustworthy. This talk will focus on the various means we have come up with to create repeatable tests that allow us to start trusting that our system will give correct results. You’ll learn how to combine repeatable programmatic fault injection, message tracing, and auditing to create a trustworthy system. Together, we’ll move through the design process, repeatedly answering the questions “What do we have to do to trust this result?” and “If we get the wrong result, how can we determine what went wrong so we can fix it?” Hopefully you’ll leave this talk inspired to apply a similar process to your own projects.

Talk objectives:

- Understand the need for verification of distributed systems.
- Learn approaches and techniques for verifying distributed systems.
- Understand some of the challenges of, and solutions for, verifying stream processing systems.

Target audience:

- Developers and Architects interested in practical approaches to verify correctness in a distributed system.

Transcript

  1. How Did I Get Here?
    Building Confidence in a Distributed Stream Processor
    @SeanTAllen

  2. Sean T. Allen

  3. T

  9. Experience Report

  10. Stateful Stream Processor

  12. Sendence Wallaroo

  15. Prototype
    Started January 2016

  17. Production
    Version 1 December 2016

  19. America
    is all about speed.
    Hot, nasty, bad-ass speed.
    — Eleanor Roosevelt

  20. Goals
    High Throughput

  21. Goals
    Low Latency

  22. Goals
    Less Hardware

  23. 2014 class MacBook Pro:
    220k events a second
    99.99% processed in less than 500 µs

  24. America
    is all about data quality.
    Quiet, demure data quality.
    — Andrew Jackson

  25. Goals
    High Fidelity

  26. Stream Processing

  27. Message at a time

  28. Never ending

  29. Failure

  30. Machine Failure

  31. Slow Machine

  32. Segfaulting Process

  33. GC Pause

  34. Network Error

  35. Failure Happens

  36. Delivery Guarantees

  38. At-Most-Once
    Best Effort

  40. At-Least-Once
    ACK or resend

  42. Exactly-Once
    At-Least-Once + Idempotence
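The "At-Least-Once + Idempotence" recipe from the delivery-guarantee slides can be illustrated with a small sketch. This is not Wallaroo code. The class `IdempotentSink`, the function `send_at_least_once`, and the `ack_lost` callback are hypothetical names invented for this example, and the "network" is just a function call whose ACK may be dropped.

```python
class IdempotentSink:
    """Applies each message once, keyed by a unique message id (assumed to exist)."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def deliver(self, msg_id, value):
        # A duplicate is acknowledged again but has no further effect.
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.total += value
        return True  # ACK


def send_at_least_once(sink, messages, ack_lost):
    """Resend every message until its ACK makes it back to the sender."""
    attempts = 0
    for msg_id, value in messages:
        while True:
            attempts += 1
            acked = sink.deliver(msg_id, value)
            # ack_lost(attempts) simulates the ACK being dropped in transit.
            if acked and not ack_lost(attempts):
                break
    return attempts


sink = IdempotentSink()
# Every odd-numbered attempt loses its ACK, so each message is sent twice.
attempts = send_at_least_once(
    sink, [(1, 10), (2, 20), (3, 30)], ack_lost=lambda n: n % 2 == 1
)
print(attempts, sink.total)
```

Each message is delivered twice, yet the sink's total reflects each value only once, which is the effect the slides call exactly-once.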

  55. Confidence

  56. Black Box Testing

  61. Black Box Testing
    System Under Test

  62. Black Box Testing
    Input Source

  63. Black Box Testing
    Output Receiver

  64. Black Box Testing
    because Unit Testing isn't enough

  65. Black Box Testing
    because Integration Testing isn't enough

  66. Black Box Testing
    because composed components have interesting new failure modes

  69. Black Box Testing
    Test The Entire System
    end to end
    and verify your expectations

  70. Wesley
    Expectation verification for Buffy

  74. Wesley
    Input, Output

  76. Wesley
    Input Source, Output Receiver

  78. Wesley
    Input Source: Records sent data 1,2,3,4
    Output Receiver: Records received data 2,4,6,8

  80. Wesley
    Analyze!

  89. Wesley
    It Works!
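The Wesley slides describe recording what the input source sent, recording what the output receiver got, and then analyzing the two. A minimal sketch of that comparison, assuming the application under test doubles each input, might look like the following. The function `analyze` and its report format are hypothetical and are not Wesley's actual interface.

```python
from collections import Counter


def analyze(sent, received, transform):
    """Compare received data against what `transform` says we should have got."""
    expected = [transform(x) for x in sent]
    expected_counts, received_counts = Counter(expected), Counter(received)
    return {
        "passed": expected == received,
        # Values we expected but never received.
        "missing": sorted((expected_counts - received_counts).elements()),
        # Values we received but did not expect.
        "unexpected": sorted((received_counts - expected_counts).elements()),
    }


def double(x):
    return x * 2


good = analyze([1, 2, 3, 4], [2, 4, 6, 8], double)
lost = analyze([1, 2, 3], [4, 6], double)
wrong = analyze([1, 2, 3], [2, 6, 12], double)
print(good, lost, wrong, sep="\n")
```

The report says that something went wrong and which values are affected. It does not say why, which is the gap the later Data Lineage slides address.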

  90. Spike
    Fault injection for Buffy

  91. Fault Injection

  92. Lineage-driven fault injection

  93. Spike: LDFI
    Start from a good result

  94. Spike: LDFI
    Input

  95. Spike: LDFI
    Output

  96. Spike: LDFI
    Figure out what can go wrong

  97. Spike: LDFI
    Each "wrong" is a possible Nemesis

  98. Spike: LDFI
    Our first nemesis: The Network

  99. Spike
    Determinism is key

  100. Spike
    Repeated runs with different results == Mostly Useless

  102. Spike
    Inject failures as informed by TCP

  106. Spike
    TCP Guarantees:
    Per connection in order delivery
    Per connection duplicate detection
    Per connection retransmission of lost data

  107. TCP in Pony: Event Driven

  112. Useless Notifier

  115. Nemesis #1:
    Dropped Connections

  116. Spike: Drop Connection

  125. Spike: Drop Connection
    • Incoming connection accepted
    • Attempting outgoing connection
    • Connection established
    • Data sent
    • Data received
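The slides stress that fault injection is only useful when repeated runs give the same result. One way to sketch that idea is to decide at each connection event whether to drop the connection, using a random number generator seeded per run. Everything here is an assumption made for illustration: the class `DropConnection`, the event names, and the seed and probability parameters are hypothetical and are not Spike's real implementation.

```python
import random

# Event names loosely follow the bullet list on the slide above.
EVENTS = ("accepted", "connecting", "connected", "sent", "received")


class DropConnection:
    """Decides, per notifier event, whether to drop the connection."""

    def __init__(self, seed, probability):
        self._rng = random.Random(seed)  # seeded, so decisions are repeatable
        self._probability = probability
        self.dropped = False
        self.log = []

    def on_event(self, event):
        """Return True if the connection survives this event."""
        if self.dropped:
            return False
        if self._rng.random() < self._probability:
            self.dropped = True
            self.log.append((event, "drop"))
            return False
        self.log.append((event, "pass"))
        return True


def run(seed, probability):
    spike = DropConnection(seed, probability)
    for event in EVENTS * 3:
        if not spike.on_event(event):
            break
    return spike.log


print(run(42, 0.3) == run(42, 0.3))
```

The repeatability only holds if the sequence of event calls is itself the same on every run, which is the problem the "Determinism & Spike" slides turn to next.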

  126. Integrating Spike
    "Double and Halve" app

  138. Integrating Spike
    "Double and Halve" app
    • Easy to verify
    • Messages cross process boundary
    • Messages cross network boundary

  142. Integrating Spike
    • Double and Halve App
    • No Spiking
    • Test, Test, Test
    • Wesley: It passes! It passes! It passes!

  146. Integrating Spike
    • Double and Halve App
    • Spike with “drop connection”
    • Test, Test, Test
    • Wesley: It fails! It fails! It fails!

  148. Integrating Spike
    == Session Recovery!

  152. Integrating Spike
    • Double and Halve App
    • Spike with “drop connection”
    • Test, Test, Test
    • Wesley: It passes! It passes! It passes!

  153. Spike
    Repeated runs with different results == Mostly Useless

  154. Determinism & Spike

  155. Determinism & Spike
    It's easy to get wrong

  156. Determinism & Spike
    TCP delivery is not deterministic

  160. Determinism & Spike
    TCP guarantees:
    Per connection in order delivery
    Per connection duplicate detection
    Per connection retransmission of lost data
    but it doesn't guarantee determinism

  165. Determinism & Spike
    TCP delivery is not deterministic
    Per method call Spiking won't work
    unless we make it work…

  166. Determinism & Spike
    TCP message framing

  175. Determinism & Spike
    Expect in action

  185. Determinism & Spike
    Expect makes received deterministic

  192. Determinism & Spike
    Received gets called with

  193. Determinism & Spike
    then…

  194. Determinism & Spike
    and then another…

  195. Determinism & Spike
    and finally…

  196. Determinism & Spike
    Same number of notifier method calls
    no matter how the data arrives

  197. Determinism & Spike
    Drop Connection & Expect
    fast deterministic friends
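The point of the framing and "expect" slides is that the receiver's callback should fire once per message, however TCP happens to chunk the bytes. The sketch below shows that idea in Python rather than Pony. It assumes a 4-byte big-endian length prefix, which the slides do not specify, and `FramedReceiver`, `frame`, and `deliver` are hypothetical names.

```python
import struct


class FramedReceiver:
    """Buffers raw bytes and invokes the callback once per complete frame."""

    HEADER = 4  # assumed: 4-byte big-endian payload length

    def __init__(self, on_message):
        self._buf = bytearray()
        self._on_message = on_message

    def feed(self, chunk):
        self._buf.extend(chunk)
        while len(self._buf) >= self.HEADER:
            (size,) = struct.unpack(">I", self._buf[: self.HEADER])
            end = self.HEADER + size
            if len(self._buf) < end:
                return  # wait for the rest of this frame
            payload = bytes(self._buf[self.HEADER : end])
            del self._buf[:end]
            self._on_message(payload)


def frame(payload):
    return struct.pack(">I", len(payload)) + payload


def deliver(stream, chunk_size):
    """Feed the stream in chunks of `chunk_size` bytes and record each callback."""
    calls = []
    receiver = FramedReceiver(calls.append)
    for i in range(0, len(stream), chunk_size):
        receiver.feed(stream[i : i + chunk_size])
    return calls


stream = frame(b"one") + frame(b"two") + frame(b"three")
print(deliver(stream, 1) == deliver(stream, 5) == deliver(stream, len(stream)))
```

Because the callback count no longer depends on how the bytes arrive, a fault injector that acts "on the Nth callback" hits the same message on every run.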

  198. Nemesis #2:
    Slow Connections

  199. Spike: Delay

  205. Spike: Delay
    Delay overrides expect
    and controls the flow of bytes
    to maintain determinism

  206. Spike: Delay
    (diagram slides; surviving labels: TCP, Spike)

  216. Results

  220. Results
    Found…
    • Bugs in Session Recovery
    • Bug in Pony standard library
    • Bugs in Spike
    • And more bugs…

  222. Results
    Found…
    Determinism is key
    but hard to achieve

  223. Data Lineage

  224. WARNING!!!
    Vaporware ahead

  225. Data Lineage
    Output
    How did I get here?

  230. Data Lineage
    Input: 1,2,3
    Expect: 2,4,6
    Get: 4,6
    How did we get here?
    these are not our beautiful results

  234. Data Lineage
    Input: 1,2,3
    Expect: 2,4,6
    Get: 2,6,12
    ¯\_(ツ)_/¯

  235. Data Lineage to the Rescue!

  237. Data Lineage
    Externally verify determinism
    is it REALLY deterministic?

  239. Data Lineage
    Find incorrect executions
    bugs in Wallaroo

  240. Data Lineage
    Input: 1
    Expected: 2
    Got: 4
    ¯\_(ツ)_/¯

  241. Data Lineage
    Execution path was…
    when it should have been
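The slides flag data lineage as vaporware, so the following is only a guess at the shape of the idea: each message carries a record of the steps it passed through, and a wrong output is explained by comparing that path with the expected one. The class `Traced`, the step names, and `first_divergence` are all hypothetical.

```python
class Traced:
    """A value plus the ordered names of the steps that produced it."""

    def __init__(self, value, path=()):
        self.value = value
        self.path = tuple(path)

    def through(self, step_name, fn):
        return Traced(fn(self.value), self.path + (step_name,))


def double(x):
    return x * 2


def identity(x):
    return x


def first_divergence(actual, expected):
    """Index of the first step where the paths differ, or None if identical."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    if len(actual) != len(expected):
        return min(len(actual), len(expected))
    return None


expected_path = ("source", "double", "sink")

# Simulated bug: the message is routed through "double" twice.
msg = (
    Traced(1)
    .through("source", identity)
    .through("double", double)
    .through("double", double)
    .through("sink", identity)
)
print(msg.value, msg.path, first_divergence(msg.path, expected_path))
```

With input 1, expected 2, and got 4, the recorded path shows the second `double` step as the point where the execution diverged from what it should have been.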

  243. Data Lineage
    Useful outside of development

  245. Data Lineage
    Production Debugging
    how did I get here?

  247. Data Lineage
    Audit Log
    why did you do that?

  248. Data Lineage
    Hindsight Machine

  249. Building Confidence
    is difficult

  250. and frustrating

  252. Peter Alvaro
    @palvaro
    Lineage-driven Fault Injection:
    http://www.cs.berkeley.edu/~palvaro/molly.pdf
    Outwards from the Middle of the Maze:
    https://www.youtube.com/watch?v=ggCffvKEJmQ

  253. Kyle Kingsbury
    @aphyr
    Jepsen:
    https://aphyr.com/tags/Jepsen

  254. Will Wilson
    Testing Distributed Systems w/ Deterministic Simulation:
    https://www.youtube.com/watch?v=4fFDFbi3toc

  255. Caitie McCaffrey
    @caitie
    The Verification of a Distributed System:
    A practitioner's guide to increasing confidence in system correctness
    http://queue.acm.org/detail.cfm?ref=rss&id=2889274
    The Verification of a Distributed System
    https://www.infoq.com/presentations/distributed-systems-verification

  256. Inés Sombra
    @randommood
    Testing in a Distributed World:
    https://www.youtube.com/watch?v=KSdNYi55kjg

  257. Chaos Engineering
    Principles of Chaos Engineering:
    http://principlesofchaos.org

  259. Thanks
    Peter Alvaro
    Sylvan Clebsch
    Zeeshan Lakhani
    John Mumm
    Rob Roland
    Andrew Turley

  260. @SeanTAllen
    Note:
    The 'T' is very important