How did I get here? Building confidence in a distributed stream processor

Sean T. Allen

Experience Report

Stream Processor

Prototype Started January 2016

Production Started April 2016

America is all about speed. Hot, nasty, bad-ass speed. - Eleanor Roosevelt

Buffy: Goals
• High Throughput
• Low Latency
• Less Hardware

America is all about data quality. Quiet, demure data quality. - Andrew Jackson

Buffy: Goals
• High Fidelity

Stream Processing

Message at a time

Never ending

Failure

Machine Failure

Slow Machine

Segfaulting Process

GC Pause

Network Error

Failure Happens

Delivery Guarantees

At-Most-Once: Best Effort

At-Least-Once: ACK or resend

Exactly-Once: At-Least-Once + Idempotence
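The "At-Least-Once + Idempotence" recipe can be illustrated with a small sketch. This is not Buffy's implementation: the `IdempotentReceiver` class, the `send_at_least_once` helper, and the "first ACK is lost" fault are all hypothetical names and assumptions chosen for illustration. The sender resends until it sees an ACK, and the receiver remembers message ids so a resent message has no second effect.

```python
class IdempotentReceiver:
    """Applies each message at most once by remembering the ids it has seen."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def receive(self, msg_id, value):
        # A duplicate is still acknowledged so the sender stops resending,
        # but its effect is not applied a second time.
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.total += value
        return True  # the ACK


def send_at_least_once(receiver, messages, ack_lost, max_attempts=10):
    """Resend each message until an ACK makes it back to the sender.

    ack_lost(msg_id, attempt) returning True simulates an ACK dropped in transit.
    Returns the total number of deliveries, duplicates included.
    """
    deliveries = 0
    for msg_id, value in messages:
        for attempt in range(max_attempts):
            acked = receiver.receive(msg_id, value)
            deliveries += 1
            if acked and not ack_lost(msg_id, attempt):
                break
        else:
            raise RuntimeError(f"message {msg_id} was never acknowledged")
    return deliveries


messages = [(1, 10), (2, 20), (3, 30)]
receiver = IdempotentReceiver()
# Hypothetical fault: the first ACK for every message is lost.
deliveries = send_at_least_once(receiver, messages, lambda msg_id, attempt: attempt == 0)
print(deliveries, receiver.total)
```

Every message is delivered twice here, yet the receiver's total reflects each value once, which is the observable behavior "exactly-once" is after.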

Confidence

Black Box Testing
Input Source -> System Under Test -> Output Receiver

Black Box Testing
• because Unit Testing isn't enough
• because Integration Testing isn't enough
• because composed components have interesting new failure modes

Black Box Testing
Test The Entire System end to end and verify your expectations

Wesley: Expectation verification for Buffy

Wesley
• Takes the Input and the Output
• Input Source: records sent data (1,2,3,4)
• Output Receiver: records received data (2,4,6,8)
• Analyze!
• It Works!
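A minimal sketch of this kind of expectation check, assuming a doubling application like the one implied by the 1,2,3,4 -> 2,4,6,8 example. The `verify` function and its report format are hypothetical and are not Wesley's actual interface; the comparison is order-insensitive, which is also an assumption.

```python
from collections import Counter


def verify(sent, received, expected_fn):
    """Compare what was received against what the sent data should have produced."""
    expected = Counter(expected_fn(x) for x in sent)
    got = Counter(received)
    missing = sorted((expected - got).elements())      # expected but never received
    unexpected = sorted((got - expected).elements())   # received but never expected
    return {"ok": not missing and not unexpected,
            "missing": missing,
            "unexpected": unexpected}


def double(x):
    return 2 * x


good = verify([1, 2, 3, 4], [2, 4, 6, 8], double)
bad = verify([1, 2, 3], [4, 6], double)
print(good)
print(bad)
```

The second call reports the lost result as `missing`, which is the kind of signal an analysis step needs before anyone can say "It Works!".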

Spike: Fault injection for Buffy

Fault Injection

Lineage-driven fault injection

Spike: LDFI
• Start from a good result (Input, Output)
• Figure out what can go wrong
• Each "wrong" is a possible Nemesis
• Our first nemesis: The Network

Spike
• Determinism is key
• Repeated runs with different results == Mostly Useless
• Inject failures as informed by TCP

Spike: TCP Guarantees
• Per connection in order delivery
• Per connection duplicate detection
• Per connection retransmission of lost data

TCP in Pony: Event Driven

Useless Notifier

Nemesis #1: Dropped Connections

Spike: Drop Connection
• Incoming connection accepted
• Attempting outgoing connection
• Connection established
• Data sent
• Data received
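One way to picture a "drop connection" fault at these five points is a wrapper that sits in front of a connection notifier and decides, per event, whether to forward it or kill the connection. The sketch below is an assumption-laden illustration, not Spike's code: the event names are shorthand for the five bullets above, and `RecordingNotifier`, `DropConnection`, and `run` are hypothetical. A seeded generator makes the fault schedule repeatable.

```python
import random

# Shorthand labels for the five points listed on the slide.
EVENTS = ("accepted", "connecting", "connected", "sent", "received")


class RecordingNotifier:
    """Stands in for the application's notifier; just logs what it is told."""

    def __init__(self):
        self.log = []

    def on_event(self, event):
        self.log.append(event)


class DropConnection:
    """Wraps a notifier and may drop the connection instead of forwarding an event."""

    def __init__(self, inner, seed, probability):
        self.inner = inner
        self.rng = random.Random(seed)  # seeded, so the same seed gives the same faults
        self.probability = probability
        self.dropped = False

    def on_event(self, event):
        if self.dropped:
            return  # nothing is delivered on a dead connection
        if self.rng.random() < self.probability:
            self.dropped = True
            self.inner.on_event("closed")
            return
        self.inner.on_event(event)


def run(seed, probability):
    recorder = RecordingNotifier()
    spiked = DropConnection(recorder, seed, probability)
    for event in EVENTS:
        spiked.on_event(event)
    return recorder.log


print(run(42, 0.3))
print(run(42, 0.3) == run(42, 0.3))
```

Because the wrapper exposes the same interface as the notifier it wraps, the application under test does not need to know whether it is being spiked.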

Integrating Spike: "Double and Halve" app
• Easy to verify
• Messages cross process boundary
• Messages cross network boundary

Integrating Spike
• Double and Halve App
• No Spiking
• Test, Test, Test
• Wesley: It passes! It passes! It passes!

Integrating Spike
• Double and Halve App
• Spike with “drop connection”
• Test, Test, Test
• Wesley: It fails! It fails! It fails!

Integrating Spike == Session Recovery!
• Spike with “drop connection”
• Test, Test, Test
• Wesley: It passes! It passes! It passes!

Spike: Repeated runs with different results == Mostly Useless

Determinism & Spike
It's easy to get wrong

Determinism & Spike
TCP delivery is not deterministic

TCP guarantees:
• Per connection in order delivery
• Per connection duplicate detection
• Per connection retransmission of lost data
but it doesn't guarantee determinism

Determinism & Spike
TCP delivery is not deterministic
Per method call Spiking won't work unless we make it work…

Determinism & Spike
• TCP message framing
• Expect in action
• Expect makes received deterministic
• Received gets called with… then… and then another… and finally…
• Same number of notifier method calls no matter how the data arrives

Determinism & Spike
Drop Connection & Expect: fast deterministic friends
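The claim that framing yields the same number of notifier calls no matter how the data arrives can be sketched as follows. This assumes a 4-byte big-endian length header, which the slides do not specify, and `FramedReader`, `frame`, and `deliver` are hypothetical names. The reader always knows exactly how many bytes it expects next and only fires a callback once that many have accumulated.

```python
import struct


class FramedReader:
    """Calls on_message once per complete frame, however the bytes are chunked."""

    HEADER = 4  # assumed: 4-byte big-endian length prefix

    def __init__(self, on_message):
        self.on_message = on_message
        self.buffer = bytearray()
        self.expecting = self.HEADER  # how many bytes we need before acting
        self.in_header = True

    def feed(self, chunk):
        self.buffer.extend(chunk)
        while len(self.buffer) >= self.expecting:
            piece = bytes(self.buffer[:self.expecting])
            del self.buffer[:self.expecting]
            if self.in_header:
                (length,) = struct.unpack(">I", piece)
                self.expecting = length       # next, expect the payload
                self.in_header = False
            else:
                self.on_message(piece)
                self.expecting = self.HEADER  # next, expect another header
                self.in_header = True


def frame(payload):
    return struct.pack(">I", len(payload)) + payload


def deliver(stream, chunk_size):
    """Feed the stream in fixed-size chunks and return the callback arguments."""
    calls = []
    reader = FramedReader(calls.append)
    for i in range(0, len(stream), chunk_size):
        reader.feed(stream[i:i + chunk_size])
    return calls


stream = b"".join(frame(p) for p in (b"one", b"two", b"three"))
print(deliver(stream, 1) == deliver(stream, len(stream)))
```

Whether the bytes arrive one at a time or all at once, the callback sequence is identical, so a fault keyed to "the Nth callback" lands in the same place on every run.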

Nemesis #2: Slow Connections

Spike: Delay
Delay overrides expect and controls the flow of bytes to maintain determinism

Spike: Delay
[Diagram; surviving labels: TCP, Spike]
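A rough sketch of how a slow connection might be simulated reproducibly: a seeded generator decides how many bytes are released at each step, so the same seed always produces the same trickle. This is an assumption about the general idea only; the `trickle` function is hypothetical, it does not model wall-clock time, and it is not how Spike's delay is implemented.

```python
import random


def trickle(stream, seed, max_chunk=3):
    """Yield the stream in small chunks whose sizes are fixed by the seed."""
    rng = random.Random(seed)
    position = 0
    while position < len(stream):
        size = rng.randint(1, max_chunk)  # inclusive on both ends
        yield stream[position:position + size]
        position += size


data = b"abcdefghij"
first = list(trickle(data, seed=5))
second = list(trickle(data, seed=5))
print(first == second, b"".join(first) == data)
```

The bytes are never reordered or lost, only released slowly, which matches the slide's point that delay has to stay within what TCP guarantees.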

Early Results
Found…
• Bugs in Session Recovery
• Bug in Pony standard library
• Bugs in Spike
• And more bugs…

Early Results
Determinism is key but hard to achieve

Data Lineage

WARNING!!! Vaporware ahead

Data Lineage
Output: How did I get here?

Data Lineage
Input: 1,2,3
Expect: 2,4,6
Get: 4,6
How did we get here? these are not our beautiful results

Data Lineage
Input: 1,2,3
Expect: 2,4,6
Get: 2,6,12
¯\_(ツ)_/¯

Data Lineage to the Rescue!

Data Lineage
• Externally verify determinism: is it REALLY deterministic?
• Find incorrect executions: bugs in Buffy

Data Lineage
Input: 1
Expected: 2
Got: 4
¯\_(ツ)_/¯

Data Lineage
Execution path was… when it should have been…
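To make the "Input: 1, Expected: 2, Got: 4" case concrete, here is a small sketch of recording an execution path alongside a result. The slides do not say what the actual path was, so the bug below (a message routed through the doubling step twice) is purely hypothetical, as are the `run_with_lineage` function and the record format.

```python
def run_with_lineage(value, steps):
    """Apply steps in order, recording (step name, input, output) for each one."""
    lineage = []
    for name, fn in steps:
        result = fn(value)
        lineage.append((name, value, result))
        value = result
    return value, lineage


double = ("double", lambda x: x * 2)

expected_path = ["double"]
# Hypothetical bug: the message is routed through the doubling step twice.
buggy_steps = [double, double]

output, lineage = run_with_lineage(1, buggy_steps)
actual_path = [name for name, _, _ in lineage]

print(output)
print(lineage)
print(actual_path == expected_path)
```

With the lineage in hand, the wrong output stops being a shrug: the recorded path can be compared directly against the path the message should have taken.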

Data Lineage: Useful outside of development
• Production Debugging: how did I get here?
• Audit Log: why did you do that?
• Hindsight Machine

Building Confidence is difficult

and frustrating

Don't be this dog

Be this dog

Peter Alvaro (@palvaro)
Lineage-driven Fault Injection: http://www.cs.berkeley.edu/~palvaro/molly.pdf
Outwards from the Middle of the Maze: https://www.youtube.com/watch?v=ggCffvKEJmQ

Kyle Kingsbury (@aphyr)
Jepsen: https://aphyr.com/tags/Jepsen

Will Wilson
Testing Distributed Systems w/ Deterministic Simulation: https://www.youtube.com/watch?v=4fFDFbi3toc

Caitie McCaffrey (@caitie)
The Verification of a Distributed System: http://queue.acm.org/detail.cfm?ref=rss&id=2889274
Talk: The Verification of a Distributed System: A practitioner's guide to increasing confidence in system correctness, 2:55 PM Tomorrow in Salon E

Inés Sombra (@randommood)
Testing in a Distributed World: https://www.youtube.com/watch?v=KSdNYi55kjg

Chaos Engineering
Principles of Chaos Engineering: http://principlesofchaos.org

Thanks: Peter Alvaro, Sylvan Clebsch, Zeeshan Lakhani, John Mumm, Rob Roland, Andrew Turley

@SeanTAllen Note: The 'T' is very important
