How did I get here? Building Confidence in a Distributed Stream Processor

When we build a distributed application, how do we have confidence that our results are correct? We can test our business logic over and over, but if the engine executing it isn't trustworthy, we can't trust our results.

How can we build trust in our execution engines? We need to test them. It's hard enough to test a stream processor that runs on a single machine; it gets even more complicated when you start talking about a distributed stream processor. As Kyle Kingsbury's Jepsen series has shown, we have a long way to go in creating tests that can provide confidence that our systems are trustworthy.

At Sendence, we're building a distributed streaming data analytics engine that we want to prove is trustworthy. This talk will focus on the various means we have come up with to create repeatable tests that allow us to start trusting that our system will give correct results. You’ll learn how to combine repeatable programmatic fault injection, message tracing, and auditing to create a trustworthy system. Together, we’ll move through the design process repeatedly answering the questions “What do we have to do to trust this result?” and “If we get the wrong result, how can we determine what went wrong so we can fix it?”. Hopefully you’ll leave this talk inspired to apply a similar process to your own projects.
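To make "repeatable programmatic fault injection" concrete, here is a minimal sketch in Python. It is not Sendence's implementation: `DropInjector`, `DroppedConnection`, and `run_double_and_halve` are hypothetical names, and the pipeline assumes a "double and halve" style check in which every value is doubled before crossing a faulty link and halved after it, so correct output equals the input. The point it illustrates is that seeding the source of randomness makes the same faults occur at the same points on every run.

```python
import random


class DroppedConnection(Exception):
    """Raised when the injector simulates a dropped connection."""


class DropInjector:
    """Hypothetical fault injector that drops sends with a fixed probability.

    The random source is seeded, so the sequence of drops is repeatable.
    """

    def __init__(self, seed, drop_probability):
        self._rng = random.Random(seed)
        self._drop_probability = drop_probability

    def send(self, message, deliver):
        # Decide, reproducibly, whether this send hits a fault.
        if self._rng.random() < self._drop_probability:
            raise DroppedConnection(message)
        deliver(message)


def run_double_and_halve(inputs, seed, drop_probability=0.3, max_retries=100):
    """Double each input, send it over the faulty link, halve it on arrival.

    A dropped send is retried. Returns (outputs, drop_log), where drop_log
    lists (input index, attempt number) for every injected drop.
    """
    injector = DropInjector(seed, drop_probability)
    outputs = []
    drop_log = []
    for index, value in enumerate(inputs):
        doubled = value * 2
        for attempt in range(max_retries):
            try:
                injector.send(doubled, lambda m: outputs.append(m // 2))
                break
            except DroppedConnection:
                drop_log.append((index, attempt))
        else:
            raise RuntimeError("gave up after max_retries drops")
    return outputs, drop_log
```

Running `run_double_and_halve` twice with the same seed yields the same `drop_log`, so a failing run can be replayed exactly while the bug is investigated.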

Talk objectives:

- Understand the need for verification of distributed systems.
- Learn approaches and techniques for verifying distributed systems.
- Understand some of the challenges of verifying stream processing systems, and some solutions to them.

Target audience:

- Developers and Architects interested in practical approaches to verify correctness in a distributed system.

Sean T Allen

March 07, 2017

Transcript

  3. 16 core AWS m4 instance: 3 million events a second, 99.99% processed in less than 1 ms
  4. Spike • TCP guarantees: • Per-connection in-order delivery • Per-connection duplicate detection • Per-connection retransmission of lost data
  5. Spike: Drop Connection • Incoming connection accepted • Attempting outgoing connection • Connection established • Data sent
  6. Spike: Drop Connection • Incoming connection accepted • Attempting outgoing connection • Connection established • Data sent • Data received
  7. Integrating Spike: "Double and Halve" app • Easy to verify • Messages cross process boundary • Messages cross network boundary
  8. Integrating Spike • Double and Halve App • No Spiking • Test, Test, Test • Wesley: It passes! It passes! It passes!
  9. Integrating Spike • Double and Halve App • Spike with “drop connection” • Test, Test, Test
  10. Integrating Spike • Double and Halve App • Spike with “drop connection” • Test, Test, Test • Wesley: It fails! It fails! It fails!
  11. Integrating Spike • Double and Halve App • Spike with “drop connection” • Test, Test, Test
  12. Integrating Spike • Double and Halve App • Spike with “drop connection” • Test, Test, Test • Wesley: It passes! It passes! It passes!
  13. Determinism & Spike • TCP guarantees: • Per-connection in-order delivery • Per-connection duplicate detection • Per-connection retransmission of lost data
  14. Determinism & Spike • TCP guarantees: • Per-connection in-order delivery • Per-connection duplicate detection • Per-connection retransmission of lost data • but it doesn't guarantee determinism
  15. Determinism & Spike • TCP delivery is not deterministic • Per method call Spiking won't work unless we make it work…
  16. Results • Found… • Bugs in Session Recovery • Bug in Pony standard library • Bugs in Spike
  17. Results • Found… • Bugs in Session Recovery • Bug in Pony standard library • Bugs in Spike • And more bugs…
  18. Data Lineage • Input: 1,2,3 • Expect: 2,4,6 • Get: 4,6 • How did we get here? These are not our beautiful results
  19. Caitie McCaffrey (@caitie) • The Verification of a Distributed System: A practitioner's guide to increasing confidence in system correctness http://queue.acm.org/detail.cfm?ref=rss&id=2889274 • The Verification of a Distributed System https://www.infoq.com/presentations/distributed-systems-verification
  20. Stay informed about developments & our open source release: Sendence.com • Follow us on Twitter: @SendenceEng • Join our mailing list: text WALLAROO to 444999
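
The data lineage slide (input 1,2,3; expect 2,4,6; get 4,6) raises the question of how to find out where a missing result went. The following is a minimal sketch of per-message tracing plus an audit pass, not the mechanism used in the talk: `Tracer`, `run_pipeline`, `audit`, the step names, and the `lose_at` parameter that simulates the bug are all assumptions made for illustration.

```python
from collections import defaultdict


class Tracer:
    """Hypothetical trace store: message id -> ordered list of steps reached."""

    def __init__(self):
        self.events = defaultdict(list)

    def record(self, message_id, step):
        self.events[message_id].append(step)


def run_pipeline(inputs, tracer, lose_at=None):
    """Double each input, recording every step a message passes through.

    `lose_at` simulates a bug: the message with that id is lost right
    after it has been received.
    """
    outputs = []
    for message_id, value in enumerate(inputs):
        tracer.record(message_id, "received")
        if message_id == lose_at:
            continue  # simulated loss of this message
        tracer.record(message_id, "doubled")
        outputs.append(value * 2)
        tracer.record(message_id, "emitted")
    return outputs


def audit(tracer, expected_steps=("received", "doubled", "emitted")):
    """Return {message id: last step reached} for messages that did not finish."""
    incomplete = {}
    for message_id, steps in tracer.events.items():
        if tuple(steps) != expected_steps:
            incomplete[message_id] = steps[-1]
    return incomplete
```

With `lose_at=0`, `run_pipeline([1, 2, 3], tracer)` returns `[4, 6]`, and `audit(tracer)` returns `{0: "received"}`: the first message was received but never doubled, which narrows the search to the step immediately after receipt.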