How did I get here? Building Confidence in a Distributed Stream Processor

Sean T Allen
November 04, 2016

When we build a distributed application, how do we have confidence that our results are correct? We can test our business logic over and over, but if the engine executing it isn't trustworthy, we can't trust our results.

How can we build trust in our execution engines? We need to test them. It's hard enough to test a stream processor that runs on a single machine; it gets even more complicated with a distributed stream processor. As Kyle Kingsbury's Jepsen series has shown, we have a long way to go in creating tests that can provide confidence that our systems are trustworthy.

At Sendence, we're building a distributed streaming data analytics engine that we want to prove is trustworthy. This talk will focus on the various means we have come up with to create repeatable tests that allow us to start trusting that our system will give correct results. You’ll learn how to combine repeatable programmatic fault injection, message tracing, and auditing to create a trustworthy system. Together, we’ll move through the design process, repeatedly answering the questions “What do we have to do to trust this result?” and “If we get the wrong result, how can we determine what went wrong so we can fix it?” Hopefully you’ll leave this talk inspired to apply a similar process to your own projects.

Talk objectives:

- Understand the need for verification of distributed systems.
- Learn approaches and techniques for verification with distributed systems.
- Understand some of the different challenges and solutions for verification with stream processing systems.

Target audience:

- Developers and Architects interested in practical approaches to verify correctness in a distributed system.

Transcript

  1. How Did I Get Here?
    Building Confidence in a Distributed Stream Processor
    @SeanTAllen

  2. Sean T. Allen

  3. Experience Report

  4. Stateful Stream
    Processor

  5. Sendence Wallaroo

  6. Prototype
    Started January 2016

  8. Production
    Version 1 December 2016

  10. America
    is all about speed.
    Hot, nasty, bad-ass speed.
    — Eleanor Roosevelt

  11. High Throughput
    Goals

  12. Low Latency
    Goals

  13. Less Hardware
    Goals

  14. 2014 class MacBook Pro:
    220k events a second
    99.99% processed in less than 500 µs

  15. America
    is all about data quality.
    Quiet, demure data quality.
    — Andrew Jackson

  16. High Fidelity
    Goals

  17. Stream Processing

  18. Message at a time

  19. Never ending

  20. Machine Failure

  21. Slow Machine

  22. Segfaulting Process

  23. Network Error

  24. Failure Happens

  25. Delivery Guarantees

  26. At-Most-Once

  27. At-Most-Once
    Best Effort

  28. At-Least-Once

  29. At-Least-Once
    ACK or resend

  30. Exactly-Once

  31. Exactly-Once
    At-Least-Once + Idempotence
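
The equation on this slide can be illustrated with a minimal Python sketch. It is not Wallaroo's implementation: the message ids, the `IdempotentSink` class, and the simulated lost ACK are all assumptions made up for the example. The sender resends until it sees an ACK (at-least-once), and the receiver ignores ids it has already applied (idempotence), so each value is counted once.

```python
class IdempotentSink:
    """Hypothetical receiver that applies each message id at most once."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def deliver(self, msg_id, value):
        # A duplicate is acknowledged again but not re-applied.
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.total += value
        return True  # the return value plays the role of the ACK


def send_at_least_once(sink, messages, lose_ack_for):
    """Send every message, resending whenever the ACK is (simulated as) lost."""
    attempts = 0
    for msg_id, value in messages:
        acked = False
        while not acked:
            attempts += 1
            delivered = sink.deliver(msg_id, value)
            if msg_id in lose_ack_for:
                # Simulate losing the first ACK for this id: the sender must resend.
                lose_ack_for.discard(msg_id)
            else:
                acked = delivered
    return attempts


sink = IdempotentSink()
attempts = send_at_least_once(sink, [(1, 10), (2, 20), (3, 30)], lose_ack_for={2})
print(attempts, sink.total)
```

Message 2 is delivered twice because its first ACK is lost, yet the sink's total only counts it once.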

  32. Exactly-Once

  33. Black Box Testing

  38. System Under Test
    Black Box Testing

  39. Input Source
    Black Box Testing

  40. Output Receiver
    Black Box Testing

  41. Black Box Testing
    because Unit Testing isn't enough

  42. Black Box Testing
    because Integration Testing isn't enough

  43. Black Box Testing
    because composed components have interesting new failure modes

  44. Test The Entire System
    Black Box Testing

  45. Test The Entire System
    end to end
    Black Box Testing

  46. Test The Entire System
    end to end
    and verify your expectations
    Black Box Testing

  47. Wesley
    Expectation verification for Buffy

  48. Wesley
    Output

  49. Wesley
    Input Output

  50. Input Source
    Wesley

  51. Input Source
    Wesley
    Output Receiver

  52. Wesley
    Input Source
    Records sent data
    1,2,3,4

  53. Wesley
    Input Source: Records sent data (1,2,3,4)
    Output Receiver: Records received data (2,4,6,8)

  54. Wesley
    Analyze!
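
The analysis step can be sketched in Python as a comparison between what the input source recorded and what the output receiver recorded. This is only an illustration of the idea, not Wesley itself: the `verify` function, its report keys, and the assumption that the application under test doubles each input are made up for the example.

```python
from collections import Counter


def verify(sent, received, transform):
    """Compare received output against what the sent input implies."""
    expected = [transform(x) for x in sent]
    expected_counts, received_counts = Counter(expected), Counter(received)
    return {
        # Multiset differences, so duplicated or dropped values are both caught.
        "missing": sorted((expected_counts - received_counts).elements()),
        "unexpected": sorted((received_counts - expected_counts).elements()),
        "in_order": received == expected,
    }


def double(x):
    return x * 2


good = verify([1, 2, 3, 4], [2, 4, 6, 8], double)
bad = verify([1, 2, 3], [4, 6], double)
print(good)
print(bad)
```

The first call uses the sent and received records from the slides and reports nothing missing. The second call shows how a dropped message would surface as a `missing` entry.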

  55. Wesley
    It Works!

  56. Spike
    Fault injection for Buffy

  57. Fault Injection

  58. Lineage-driven
    fault injection

  59. Start from a good result
    Spike: LDFI

  60. Input
    Spike: LDFI

  61. Output
    Spike: LDFI

  62. Figure out what can go
    wrong
    Spike: LDFI

  63. Spike: LDFI
    Each "wrong" is a possible Nemesis

  64. Spike: LDFI
    Our first nemesis: The Network

  65. Determinism is key
    Spike

  66. Spike
    Repeated runs with different results == Mostly Useless

  67. Spike
    Inject failures as informed by TCP

  68. Spike
    TCP Guarantees:

  69. Spike
    TCP Guarantees:
    Per connection in order delivery

  70. Spike
    TCP Guarantees:
    Per connection in order delivery
    Per connection duplicate detection

  71. Spike
    TCP Guarantees:
    Per connection in order delivery
    Per connection duplicate detection
    Per connection retransmission of lost data

  72. TCP in Pony: Event Driven

  77. Useless Notifier

  80. Nemesis #1: Dropped Connections

  81. Spike: Drop Connection

  86. Spike: Drop Connection
    • Incoming connection accepted

  87. Spike: Drop Connection
    • Incoming connection accepted
    • Attempting outgoing connection

  88. Spike: Drop Connection
    • Incoming connection accepted
    • Attempting outgoing connection
    • Connection established

  89. Spike: Drop Connection
    • Incoming connection accepted
    • Attempting outgoing connection
    • Connection established
    • Data sent

  90. Spike: Drop Connection
    • Incoming connection accepted
    • Attempting outgoing connection
    • Connection established
    • Data sent
    • Data received
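
One way to picture repeatable "drop connection" injection at these five points is the Python sketch below. It is an assumption-heavy illustration, not Spike's code (Spike is written in Pony and its source is not shown in the transcript): the `DropConnectionSpike` class, the seed and probability parameters, and the trace format are all hypothetical. The point it shows is that a private, seeded random generator makes the fault schedule identical from run to run.

```python
import random

# Notifier events, named after the bullets on the slide above.
EVENTS = [
    "Incoming connection accepted",
    "Attempting outgoing connection",
    "Connection established",
    "Data sent",
    "Data received",
]


class DropConnectionSpike:
    """Hypothetical fault injector: decides per event whether to drop the connection."""

    def __init__(self, seed, probability):
        # A private, seeded generator keeps the fault schedule repeatable.
        self.rng = random.Random(seed)
        self.probability = probability

    def should_drop(self, event):
        # The decision depends only on the seed and the number of prior calls.
        return self.rng.random() < self.probability


def run(seed, probability, events=EVENTS):
    """Walk the events in order and stop at the first injected drop."""
    spike = DropConnectionSpike(seed, probability)
    trace = []
    for event in events:
        if spike.should_drop(event):
            trace.append((event, "DROPPED"))
            break
        trace.append((event, "ok"))
    return trace


first = run(seed=42, probability=0.3)
second = run(seed=42, probability=0.3)
print(first == second)
```

Because the decision depends on how many times the injector has been called, this scheme is only repeatable if the sequence of calls is itself repeatable, which is the concern the later "Determinism & Spike" slides raise.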

  91. Integrating Spike
    "Double and Halve" app

  101. Integrating Spike
    "Double and Halve" app
    • Easy to verify

  102. Integrating Spike
    "Double and Halve" app
    • Easy to verify
    • Messages cross process boundary

  103. Integrating Spike
    "Double and Halve" app
    • Easy to verify
    • Messages cross process boundary
    • Messages cross network boundary

  104. Integrating Spike
    • Double and Halve App

  105. Integrating Spike
    • Double and Halve App
    • No Spiking

  106. Integrating Spike
    • Double and Halve App
    • No Spiking
    • Test, Test, Test

  107. Integrating Spike
    • Double and Halve App
    • No Spiking
    • Test, Test, Test
    • Wesley: It passes! It passes! It passes!

  108. Integrating Spike
    • Double and Halve App

  109. Integrating Spike
    • Double and Halve App
    • Spike with “drop connection”

  110. Integrating Spike
    • Double and Halve App
    • Spike with “drop connection”
    • Test, Test, Test

  111. Integrating Spike
    • Double and Halve App
    • Spike with “drop connection”
    • Test, Test, Test
    • Wesley: It fails! It fails! It fails!

  112. Integrating Spike

  113. Integrating Spike
    == Session Recovery!

  114. Integrating Spike
    • Double and Halve App

  115. Integrating Spike
    • Double and Halve App
    • Spike with “drop connection”

  116. Integrating Spike
    • Double and Halve App
    • Spike with “drop connection”
    • Test, Test, Test

  117. Integrating Spike
    • Double and Halve App
    • Spike with “drop connection”
    • Test, Test, Test
    • Wesley: It passes! It passes! It passes!

  118. Spike
    Repeated runs with different results == Mostly Useless

  119. Determinism & Spike

  120. Determinism & Spike
    It's easy to get wrong

  121. Determinism & Spike
    TCP delivery is not deterministic

  122. Determinism & Spike
    TCP guarantees:
    Per connection in order delivery

  123. Determinism & Spike
    TCP guarantees:
    Per connection in order delivery
    Per connection duplicate detection

  124. Determinism & Spike
    TCP guarantees:
    Per connection in order delivery
    Per connection duplicate detection
    Per connection retransmission of lost data

  125. Determinism & Spike
    TCP guarantees:
    Per connection in order delivery
    Per connection duplicate detection
    Per connection retransmission of lost data
    but it doesn't guarantee determinism

  126. Determinism & Spike
    TCP delivery is not deterministic

  129. Determinism & Spike
    TCP delivery is not deterministic
    Per method call Spiking won't work

  130. Determinism & Spike
    TCP delivery is not deterministic
    Per method call Spiking won't work
    unless we make it work…

  131. Determinism & Spike
    TCP message framing

  140. Determinism & Spike
    Expect in action

  150. Determinism & Spike
    Expect makes received deterministic

  157. Determinism & Spike
    Received gets called with

  158. Determinism & Spike
    then…

  159. Determinism & Spike
    and then another…

  160. Determinism & Spike
    and finally…

  161. Determinism & Spike
    Same number of notifier method calls
    no matter how the data arrives
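
The claim on this slide can be demonstrated with a small Python model of length-prefixed framing. The code on the original slides did not survive in this transcript, so everything here is an assumption for illustration: the 4-byte big-endian length header, the `ExpectBuffer` and `FramedNotifier` classes, and the convention that `received` returns how many bytes to expect next. The buffer only invokes the notifier once it holds exactly the number of bytes asked for, so the calls are the same whether the bytes arrive one at a time or all at once.

```python
import struct


class FramedNotifier:
    """Hypothetical notifier: alternates between a 4-byte header and its payload."""

    HEADER = 4

    def __init__(self):
        self.calls = []
        self.in_header = True

    def received(self, data):
        # Record the call, then return how many bytes we expect next.
        self.calls.append(data)
        if self.in_header:
            (length,) = struct.unpack(">I", data)
            self.in_header = False
            return length
        self.in_header = True
        return self.HEADER


class ExpectBuffer:
    """Buffers incoming bytes and hands the notifier exactly `expect` bytes per call."""

    def __init__(self, notifier, first_expect):
        self.notifier = notifier
        self.expect = first_expect
        self.buf = bytearray()

    def feed(self, chunk):
        self.buf.extend(chunk)
        while len(self.buf) >= self.expect:
            data = bytes(self.buf[: self.expect])
            del self.buf[: self.expect]
            self.expect = self.notifier.received(data)


def frame(payload):
    return struct.pack(">I", len(payload)) + payload


def deliver(stream, chunk_size):
    """Feed the stream in chunks of the given size and return the notifier calls."""
    notifier = FramedNotifier()
    conn = ExpectBuffer(notifier, FramedNotifier.HEADER)
    for i in range(0, len(stream), chunk_size):
        conn.feed(stream[i : i + chunk_size])
    return notifier.calls


stream = frame(b"hello") + frame(b"stream processing")
calls_by_byte = deliver(stream, 1)
calls_at_once = deliver(stream, len(stream))
print(len(calls_by_byte), len(calls_at_once))
```

With a fixed number of notifier calls per message, a fault injector that counts calls sees the same sequence on every run, regardless of how the network split the bytes.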

  162. Determinism & Spike
    Drop Connection & Expect
    fast deterministic friends

  163. Nemesis #2: Slow Connections

  164. Spike: Delay

  168. Spike: Delay
    Delay overrides expect

  169. Spike: Delay
    Delay overrides expect
    and controls the flow of bytes

  170. Spike: Delay
    Delay overrides expect
    and controls the flow of bytes
    to maintain determinism

  171. Spike: Delay
    [Diagram build across slides 171-180; only the labels "Spike" and "TCP" survive]

  181. Results
    Found…
    • Bugs in Session Recovery

  182. Results
    Found…
    • Bugs in Session Recovery
    • Bug in Pony standard library

  183. Results
    Found…
    • Bugs in Session Recovery
    • Bug in Pony standard library
    • Bugs in Spike

  184. Results
    Found…
    • Bugs in Session Recovery
    • Bug in Pony standard library
    • Bugs in Spike
    • And more bugs…

  185. Results
    Found…
    Determinism is key

  186. Results
    Found…
    Determinism is key
    but hard to achieve

  187. Data Lineage

  188. WARNING!!!
    Vaporware ahead

  189. Data Lineage
    Output
    How did I get here?

  190. Data Lineage
    Output

  191. Data Lineage
    Input: 1,2,3

  192. Data Lineage
    Input: 1,2,3
    Expect: 2,4,6

  193. Data Lineage
    Input: 1,2,3
    Expect: 2,4,6
    Get: 4,6

  194. Data Lineage
    Input: 1,2,3
    Expect: 2,4,6
    Get: 4,6
    How did we get here?
    these are not our beautiful results

  195. Data Lineage
    Input: 1,2,3

  196. Data Lineage
    Input: 1,2,3
    Expect: 2,4,6

  197. Data Lineage
    Input: 1,2,3
    Expect: 2,4,6
    Get: 2,6,12

  198. Data Lineage
    Input: 1,2,3
    Expect: 2,4,6
    Get: 2,6,12
    ¯\_(ツ)_/¯

  199. Data Lineage to the Rescue!

  200. Data Lineage
    Externally verify determinism

  201. Data Lineage
    Externally verify determinism
    is it REALLY deterministic?

  202. Data Lineage
    Find incorrect executions

  203. Data Lineage
    Find incorrect executions
    bugs in Wallaroo

  204. Data Lineage
    Input: 1
    Expected: 2
    Got: 4
    ¯\_(ツ)_/¯

  205. Data Lineage
    Execution path was…
    when it should have been
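
The two execution paths on this slide were diagrams that did not survive in the transcript, and the talk itself flags this section as vaporware. As a purely hypothetical Python sketch of the idea, each message can carry a record of the steps it passed through, and that record can be compared with the path it should have taken. The step names, the `run_with_lineage` helper, and the invented bug (doubling applied twice) are all assumptions chosen to reproduce the "Input: 1, Expected: 2, Got: 4" example.

```python
from itertools import zip_longest


def double(x):
    return x * 2


def run_with_lineage(value, steps):
    """Apply each named step in turn and record the path the value took."""
    path = []
    for name, fn in steps:
        value = fn(value)
        path.append(name)
    return value, path


def first_divergence(actual_path, expected_path):
    """Return (index, actual step, expected step) at the first mismatch, else None."""
    for i, (actual, expected) in enumerate(zip_longest(actual_path, expected_path)):
        if actual != expected:
            return i, actual, expected
    return None


# Invented faulty routing: the message visits the doubling step twice.
got, path = run_with_lineage(1, [("double", double), ("double", double)])
divergence = first_divergence(path, ["double"])
print(got, path, divergence)
```

The output value alone says only that 4 is wrong. The recorded path shows where the execution left the expected route.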

  207. Data Lineage
    Useful outside of
    development

  208. Data Lineage
    Production Debugging

  209. Data Lineage
    Production Debugging
    how did I get here?

  210. Data Lineage
    Audit Log

  211. Data Lineage
    Audit Log
    why did you do that?

  212. Data Lineage
    Hindsight Machine

  213. Building Confidence
    is difficult

  214. and frustrating

  215. Peter Alvaro
    @palvaro
    Lineage-driven Fault Injection:
    http://www.cs.berkeley.edu/~palvaro/molly.pdf
    Outwards from the Middle of the Maze:
    https://www.youtube.com/watch?v=ggCffvKEJmQ

  216. Kyle Kingsbury
    @aphyr
    Jepsen:
    https://aphyr.com/tags/Jepsen

  217. Will Wilson
    Testing Distributed Systems w/ Deterministic Simulation:
    https://www.youtube.com/watch?v=4fFDFbi3toc

  218. Caitie McCaffrey
    @caitie
    The Verification of a Distributed System:
    A practitioner's guide to increasing confidence in system correctness
    http://queue.acm.org/detail.cfm?ref=rss&id=2889274
    The Verification of a Distributed System
    https://www.infoq.com/presentations/distributed-systems-verification

  219. Inés Sombra
    @randommood
    Testing in a Distributed World:
    https://www.youtube.com/watch?v=KSdNYi55kjg

  220. Chaos Engineering
    Principles of Chaos Engineering:
    http://principlesofchaos.org

  221. Thanks
    Peter Alvaro
    Sylvan Clebsch
    Zeeshan Lakhani
    John Mumm
    Rob Roland
    Andrew Turley

  222. @SeanTAllen
    Note:
    The 'T' is very important
