Minimizing Faulty Executions of Distributed Systems

Colin Scott, Aurojit Panda, Vjekoslav Brajkovic, George Necula, Arvind Krishnamurthy, Scott Shenker

[Figure: a software developer facing GBs of output from Node1 … Node12]

49% of developers’ time spent on debugging! [1]
- Understanding how the bug is triggered
- Fixing the problematic code
[1] LaToza, Venolia, DeLine, ICSE ’06

Our Goal
Allow developers to focus on fixing the underlying bug

Problem Statement
Identify a minimal causal sequence of events that triggers the bug

Why Minimization?
Smaller event traces are easier to understand.
G. A. Miller. The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. Psychological Review ’56.

Outline
- Introduction
- Background
- Fuzz Testing w/ DEMi
- Computational Model
- Minimization
- Evaluation
- Conclusion
[Figure: Test Coordinator driving a QA Testbed (Node 1 … Node N) running the Software Under Test]

Computational Model
Distributed System: collection of N processes
Each process p:
- Has unbounded memory
- Starts in a known initial state
- Changes states deterministically
[Figure: processes a, b, c, d, e]

Computational Model
The network maintains a buffer of sent but not yet delivered messages
[Figure: processes a, b, c, d, e; buffer holds msg dst: d]

Computational Model
Message deliveries occur one at a time:
- destination enters a new state according to old state & message
- destination sends a finite set of messages to other processes*
*May include timer messages to be delivered to itself later
[Figure: processes a, b, c, d, e; buffer before the delivery: msg dst: d; buffer after: timer dst: d, msg dst: a]

Computational Model
Steps may also be external:
- External message is sent
- Process is created
- Process crash-recovers
[Figure: processes a, b, c, d, e; buffer holds timer dst: d, msg dst: a, and the newly sent msg dst: e]

Computational Model
A schedule τ is a sequence of events (either external or internal message deliveries) that can be applied in turn starting from the initial configuration.
[Figure: example schedule with events e1, i1, i2, i3, i4, e2, labeled as process start, message delivery, and external message]
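The model on the preceding slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Message`, `System`, and `counter_handler` are hypothetical names, not DEMi's API, and process state is reduced to a single integer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Message:
    dst: str
    payload: str


@dataclass
class System:
    # handler(state, payload) -> (new_state, messages_to_send); assumed deterministic
    handler: Callable[[int, str], Tuple[int, List[Message]]]
    states: Dict[str, int] = field(default_factory=dict)
    buffer: List[Message] = field(default_factory=list)  # sent but not yet delivered

    def start(self, name: str) -> None:
        """External event: create a process in its known initial state."""
        self.states[name] = 0

    def send_external(self, msg: Message) -> None:
        """External event: inject a message into the network buffer."""
        self.buffer.append(msg)

    def deliver(self, msg: Message) -> None:
        """Internal event: deliver one pending message to its destination."""
        self.buffer.remove(msg)  # raises ValueError if msg is not pending
        new_state, outgoing = self.handler(self.states[msg.dst], msg.payload)
        self.states[msg.dst] = new_state
        self.buffer.extend(outgoing)


def counter_handler(state: int, payload: str) -> Tuple[int, List[Message]]:
    # Toy process: count deliveries; a "ping" triggers a "pong" sent to b.
    out = [Message("b", "pong")] if payload == "ping" else []
    return state + 1, out


# A schedule: events applied in turn, starting from the initial configuration.
system = System(handler=counter_handler)
schedule = [
    ("start", "a"),
    ("start", "b"),
    ("external", Message("a", "ping")),
    ("deliver", Message("a", "ping")),
    ("deliver", Message("b", "pong")),
]
for kind, arg in schedule:
    if kind == "start":
        system.start(arg)
    elif kind == "external":
        system.send_external(arg)
    else:
        system.deliver(arg)

print(system.states, system.buffer)
```

Keeping the buffer explicit makes "pending" a checkable property: a delivery is only valid if its message is currently in the buffer.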

Invariant Checking
An invariant is a predicate P over the state of all processes.
A faulty execution is one that ends in an invariant violation.
[Figure: processes a, b, c, d, e checked against P (✔ / ✗); schedule e1 i1 i2 i3 i4 e2 ending in ✗]
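As a rough illustration of invariant checking, the sketch below applies a schedule event by event and reports the first point at which a predicate P fails. The names (`first_violation`, `at_most_one_leader`, `elect`) and the single-leader invariant are assumptions chosen for the example, not taken from the talk.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, dict]  # process name -> that process's local variables
Event = Callable[[State], None]
Invariant = Callable[[State], bool]


def at_most_one_leader(state: State) -> bool:
    """Example predicate P over the state of all processes."""
    return sum(1 for local in state.values() if local["leader"]) <= 1


def first_violation(state: State, schedule: List[Event], p: Invariant) -> Optional[int]:
    """Apply events in order; return the index of the first event after which P fails."""
    for index, event in enumerate(schedule):
        event(state)
        if not p(state):
            return index
    return None


def elect(name: str) -> Event:
    """Hypothetical event: the named process starts believing it is leader."""
    def apply(state: State) -> None:
        state[name]["leader"] = True
    return apply


state = {n: {"leader": False} for n in "abc"}
result = first_violation(state, [elect("a"), elect("b")], at_most_one_leader)
print(result)  # 1: the second event leaves two leaders
```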

Formal Problem Statement
Given: schedule τ that results in violation of P
Find: locally minimal reproducing sequence τ’:
- τ’ violates P, |τ’| ≤ |τ|
- τ’ contains a subsequence of the external events of τ
- if we remove any external event e from τ’, ¬∃ τ’’ containing the same external events minus e, s.t. τ’’ violates P
After finding τ’: remove extraneous message deliveries from τ’
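The local-minimality condition can be phrased as a check against an oracle. In this sketch `reproduces` is a hypothetical callback that answers "does some schedule with exactly these external events violate P?"; in practice answering that question is the hard part, since it hides a search over schedules.

```python
from typing import Callable, List, Sequence


def is_locally_minimal(
    externals: Sequence[str],
    reproduces: Callable[[List[str]], bool],
) -> bool:
    """True if the externals reproduce the violation and no single removal still does."""
    if not reproduces(list(externals)):
        return False
    for i in range(len(externals)):
        without = list(externals[:i]) + list(externals[i + 1:])
        if reproduces(without):
            return False
    return True


def toy_reproduces(events: List[str]) -> bool:
    # Hypothetical oracle: the bug needs e1 followed later by e3.
    return "e1" in events and "e3" in events and events.index("e1") < events.index("e3")


print(is_locally_minimal(["e1", "e3"], toy_reproduces))
print(is_locally_minimal(["e1", "e2", "e3"], toy_reproduces))
```

Note that this only tests removal of one external event at a time, which is what "locally minimal" asks for; it says nothing about a global minimum.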

Fuzz Testing with DEMi
[Figure sequence: three processes, each a stack of App / RPC lib / OS, with AspectJ interposed in each stack. Events shown across the builds: msg dst: b, message delivery, timer dst: b, msg dst: a, crash recovery]

Running Example: Raft Consensus
[Figure sequence: nodes a, b, c, d; votes: {a,b,c}; client request; ACK; commit; RequestVote; VoteGranted]

Minimization
Given: τ = e1 i1 i2 i4 e2 … en im ✗
Straightforward approach: enumerate all schedules τ’ with |τ’| ≤ |τ|, and pick the shortest sequence that reproduces ✗
[Figure: τ as one point in the schedule space]
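To see why the straightforward approach does not scale, here is a toy sketch. It simplifies heavily: every ordering of every subset of events is treated as a candidate schedule (real schedules must also be valid, i.e. deliver only pending messages), and `toy_reproduces` is a hypothetical oracle.

```python
from itertools import permutations
from math import perm
from typing import Callable, List, Optional, Sequence, Tuple


def schedule_space_size(n: int) -> int:
    """Number of ordered selections of up to n events out of n distinct events."""
    return sum(perm(n, k) for k in range(n + 1))


def brute_force_minimize(
    events: Sequence[str],
    reproduces: Callable[[Tuple[str, ...]], bool],
) -> Optional[List[str]]:
    """Try candidates from shortest to longest; return the first that reproduces."""
    for length in range(len(events) + 1):
        for candidate in permutations(events, length):
            if reproduces(candidate):
                return list(candidate)
    return None


def toy_reproduces(candidate: Tuple[str, ...]) -> bool:
    # Hypothetical oracle: the violation needs e1 delivered before e2.
    return (
        "e1" in candidate
        and "e2" in candidate
        and candidate.index("e1") < candidate.index("e2")
    )


shortest = brute_force_minimize(["e1", "i1", "i2", "e2"], toy_reproduces)
print(shortest)
print([schedule_space_size(n) for n in (3, 5, 10)])
```

Even under this simplified counting the candidate space grows factorially with the number of events, which motivates the observations that follow.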

Observation #1: many schedules are commutative
Condition: i2 ↛ i3, i3 ↛ i2, dst(i2) ≠ dst(i3)
[Figure: from step n, delivering i2 then i3, or i3 then i2, over steps n+1 and n+2]
Adopt DPOR: Dynamic Partial Order Reduction
C. Flanagan, P. Godefroid, “Dynamic Partial-Order Reduction for Model Checking Software”, POPL ’05
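The independence condition can be illustrated with a small sketch. This is not DPOR itself, only the pairwise test it builds on, and it assumes each delivery carries an `after` set holding the full (transitively closed) set of deliveries that happen before it. `Delivery`, `independent`, and `run` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Delivery:
    name: str
    dst: str
    after: FrozenSet[str] = frozenset()  # deliveries that happen-before this one


def independent(x: Delivery, y: Delivery) -> bool:
    """Neither happens-before the other, and they target different processes."""
    return x.dst != y.dst and x.name not in y.after and y.name not in x.after


def run(order: List[Delivery]) -> Dict[str, List[str]]:
    """Toy semantics: each process's state is the ordered log of what it received."""
    log: Dict[str, List[str]] = {}
    for d in order:
        log.setdefault(d.dst, []).append(d.name)
    return log


i2 = Delivery("i2", dst="b")
i3 = Delivery("i3", dst="c")
i5 = Delivery("i5", dst="c")

# Independent pair: both orders end in the same state, so one order suffices.
print(independent(i2, i3), run([i2, i3]) == run([i3, i2]))
# Same destination: the order is observable, so both orders must be explored.
print(independent(i3, i5), run([i3, i5]) == run([i5, i3]))
```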

O( !) n k

Approach: prioritize schedule space exploration
- Assume: fixed time budget
- Objective: quickly find small failing schedules

Given: prioritization function
Produce: a program under test and an initial execution s.t. prioritization makes scant progress

Conjecture: systems we care about exhibit program properties amenable to prioritization

[Figure: example process states {x=1,y=2}, {x=1,y=3}, {x=5,y=5}, {x=2,y=2}, {x=4,y=1}, {x=-1,y=-2}, {x=-1,y=-1}]
Program properties:
- Invariant defined over small subset of processes’ variables
- Each event affects a small subset of receiver’s variables
- Initial execution contains events that don’t affect invariant

Challenge: don’t know which events are important
Approach:
- experimentally “infer” important events
- stay close to the original execution

Observation #2: selectively mask original events
τ: e1 i1 i2 i4 e2 … en im ✗
ext: e1 e2 e3 e4 e5 … en
Apply Delta Debugging [1] to the external events, e.g. sub1: e4 e5 … en
Replay each external subsequence against τ:
    foreach i in τ:
      if i is pending:
        deliver i
      # ignore unexpected
sub1: e4 e5 … en; replay delivers i1 i4 … im and ends in ✗
sub2: e5 … en; replay delivers i1 i4 … im and ends in ✔
When replay ends in ✔: explore backtrack points until (i) ✗ or (ii) time budget for sub2 expired
[1] A. Zeller, R. Hildebrandt, “Simplifying and Isolating Failure-Inducing Input”, IEEE ’02
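A minimal sketch of this step, combining a delta-debugging loop over external events with the "deliver if pending, otherwise ignore" replay. Everything here is a toy assumption: `ORIGINAL`, `ENABLED_BY`, and the rule that delivering i4 is the violation are invented for the example, and the backtracking under a time budget described on the slide is omitted.

```python
from typing import Callable, Dict, List, Set

# Hypothetical original schedule and causal structure (not from the talk).
ORIGINAL = ["e1", "e2", "e3", "e4", "i1", "e5", "i2", "i4"]
ENABLED_BY: Dict[str, Set[str]] = {"i1": {"e4"}, "i2": {"e1"}, "i4": {"i1", "e5"}}


def replay(kept_externals: List[str]) -> bool:
    """Walk the original schedule, masking dropped externals; True if the bug reproduces."""
    done: Set[str] = set()
    for event in ORIGINAL:
        if event.startswith("e"):
            if event in kept_externals:
                done.add(event)
        elif ENABLED_BY[event] <= done:  # internal event is pending: deliver it
            done.add(event)
        # otherwise the message was never sent in this run: ignore it
    return "i4" in done  # toy violation: delivering i4 breaks the invariant


def ddmin(items: List[str], test: Callable[[List[str]], bool]) -> List[str]:
    """Delta debugging: shrink items while test(items) stays True."""
    n = 2
    while len(items) >= 2:
        chunk = max(1, len(items) // n)
        subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
        reduced = False
        for subset in subsets:  # try each subset on its own
            if test(subset):
                items, n, reduced = subset, 2, True
                break
        if not reduced:
            for subset in subsets:  # then try each complement
                complement = [x for x in items if x not in subset]
                if test(complement):
                    items, n, reduced = complement, max(n - 1, 2), True
                    break
        if not reduced:
            if n >= len(items):
                break  # finest granularity reached: no single removal works
            n = min(len(items), n * 2)
    return items


minimal = ddmin(["e1", "e2", "e3", "e4", "e5"], replay)
print(minimal)
```

In this toy run the search first drops e1 and e2, then e3, and stops at e4 and e5 because removing either one no longer reproduces the violation.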

Message contents may differ across executions!
Original message: type:t seq:3 src:a dst:d replicate: [1,2]
Replay: type:t seq:5 src:a dst:d replicate: [1,2]
Observation #3: some contents should be masked
[Figure: processes a, b, c, d, e; msg dst: d]

Observation #3: some contents should be masked
Phase 1: choose initial schedule
- Match messages by user-defined “fingerprint”
Phase 2: prioritize backtrack points
- Match messages by type only
- Backtrack whenever multiple pending messages match by type
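One way to picture the two matching modes, using the message from the previous slide. This is a sketch under assumptions: messages are plain dicts, `MASKED_FIELDS` stands in for whatever the user-defined fingerprint chooses to ignore, and `fingerprint` and `matching_pending` are hypothetical helpers rather than DEMi functions.

```python
from typing import Dict, List, Tuple

Message = Dict[str, object]

MASKED_FIELDS = {"seq"}  # assumption: fields expected to differ across executions


def fingerprint(msg: Message) -> Tuple[Tuple[str, str], ...]:
    """Comparable summary of a message with the masked fields left out."""
    return tuple(sorted((k, repr(v)) for k, v in msg.items() if k not in MASKED_FIELDS))


def matching_pending(expected: Message, pending: List[Message], by_type_only: bool = False) -> List[Message]:
    """Phase 1 matches by fingerprint; phase 2 relaxes the match to the type field."""
    if by_type_only:
        return [m for m in pending if m["type"] == expected["type"]]
    return [m for m in pending if fingerprint(m) == fingerprint(expected)]


original = {"type": "t", "seq": 3, "src": "a", "dst": "d", "replicate": [1, 2]}
replayed = {"type": "t", "seq": 5, "src": "a", "dst": "d", "replicate": [1, 2]}
other = {"type": "t", "seq": 9, "src": "b", "dst": "d", "replicate": [7]}

phase1 = matching_pending(original, [replayed, other])
phase2 = matching_pending(original, [replayed, other], by_type_only=True)
print(len(phase1), len(phase2))
```

With type-only matching both pending messages qualify, which is the situation where the slide says to set a backtrack point.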

Observation #4: shrink external message contents
[Figure: processes a, b, c, d, e; three external messages, each type:bootstrap peers: [a,b,c,d,e]]
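The slides do not spell out how contents are shrunk, so the following is only one plausible sketch: greedily drop entries from a bootstrap message's `peers` list while a hypothetical `reproduces` oracle still reports the violation.

```python
from typing import Callable, List


def shrink_peers(peers: List[str], reproduces: Callable[[List[str]], bool]) -> List[str]:
    """Greedily remove one peer at a time, keeping each removal that still reproduces."""
    current = list(peers)
    for peer in peers:
        candidate = [p for p in current if p != peer]
        if reproduces(candidate):
            current = candidate
    return current


def toy_reproduces(peers: List[str]) -> bool:
    # Hypothetical oracle: the violation only needs nodes a, b and c to know each other.
    return {"a", "b", "c"} <= set(peers)


bootstrap = {"type": "bootstrap", "peers": ["a", "b", "c", "d", "e"]}
shrunk = dict(bootstrap, peers=shrink_peers(bootstrap["peers"], toy_reproduces))
print(shrunk)
```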

Goal: find minimal schedule that produces violation
Approach: prioritize schedule space exploration
- Observation #1: many schedules are commutative
- Observation #2: selectively mask original events
- Observation #3: some contents should be masked
- Observation #4: shrink external message contents
Minimize internal events after externals minimized

Target Systems

How well does DEMi work?
[Bar chart: Total Events per case study, Initial Execution vs. After Minimization]
Case studies: raft-45, raft-46, raft-56, raft-58a, raft-58b, raft-42, raft-66, spark-2294, spark-3150, spark-9256
Initial Execution (bar values in chart order): 300, 600, 1000, 400, 1710, 1500, 2850, 2380, 1250, 2160
After Minimization (bar values in chart order): 11, 14, 40, 77, 180, 40, 226, 82, 35, 23
Found w/ fuzz testing!
80% - 97% reduction!

[Bar chart: Total Events per case study, After Minimization vs. Smallest Manual Trace]
Case studies: raft-45, raft-46, raft-56, raft-58a, raft-58b, raft-42, raft-66, spark-2294, spark-3150, spark-9256
After Minimization (bar values in chart order): 11, 14, 40, 77, 180, 40, 226, 82, 35, 23
Smallest Manual Trace (bar values in chart order): 11, 11, 25, 29, 39, 28, 51, 21, 23, 22
Factor of 1x - 5x from hand-crafted

How quickly does DEMi work?
[Bar chart: Runtime in Seconds per case study, overall minimization]
Case studies: raft-45, raft-46, raft-56, raft-58a, raft-58b, raft-42, raft-66, spark-2294, spark-3150, spark-9256
Overall Minimization (bar values in chart order, seconds): 210, 245, 427, 348, 10676 (~3 hours), 69, 43482 (~12 hours), 2132 (~35 minutes), 282, 170
<10 minutes except 3 cases

See the paper for…
- How we handle non-determinism
- Handling multithreaded processes
- Supporting other RPC libraries
- Sketch for minimizing production traces
- More in-depth evaluation
- Related work
- …

Conclusion
- Open source tool: github.com/NetSys/demi
- Read our paper! eecs.berkeley.edu/~rcs/research/nsdi_draft.pdf
- Optimistic that these techniques can be successfully applied more broadly

Past Work
- Internet Troubleshooting: NSDI ’10, SIGCOMM ’12
- SDN Troubleshooting: HotSDN ’13, PODC ’13, SIGCOMM ’14
- Middleboxes & Mobile Devices: SIGCOMM ’12, NSDI ’15
- CAP for Networks: HotSDN ’13

[Figure: Dst’Sys & Networking alongside SE and PL]
Tools for dst’sys lag sequential tools by ~10 years [1]
- Stable-Multithreading & PCT for dst’sys
- Routing convergence tradeoffs
- Testing & debugging async code (mobile, JS)
- Infer JS defer tags
- SAT/SMT solvers need systems techniques
- Test verified dst’sys
- Program properties for minimization
- ACID for SDN
- Synthesizing coordination
- HCI for configuration hell
[1] Parkinson, VSTTE ’10

Thanks for your time! Contact me! [email protected]

Attributions
Inspiration for slide design: Jay Lorch’s IronFleet slides
Graphic icons: thenounproject.org
- logfile: mantisshrimpdesign
- magnifying glass: Ricardo Moreira
- disk: Anton Outkine
- hook: Seb Cornelius
- bug report: Lemon Liu
- devil: Mourad Mokrane
- Putin: Remi Mercier

Production Traces
Model: feed partially ordered log into single machine DEMi
Require:
- Partial ordering of all message deliveries
- All crash-recoveries logged to disk

Instrumentation Complexity

Related Work
Thread Schedule Minimization
- Isolating Failure-Inducing Thread Schedules. SIGSOFT ’02.
- A Trace Simplification Technique for Effective Debugging of Concurrent Programs. FSE ’10.
Program Flow Analysis
- Enabling Tracing of Long-Running Multithreaded Programs via Dynamic Execution Reduction. ISSTA ’07.
- Toward Generating Reducible Replay Logs. PLDI ’11.
Best-Effort Replay of Field Failures
- A Technique for Enabling and Supporting Debugging of Field Failures. ICSE ’07.
- Triage: Diagnosing Production Run Failures at the User’s Site. SOSP ’07.

DDmin in more detail

DDmin assumptions

Local vs. global minima

Minimization Pace

Dealing With Threads
If you’re lucky: threads are largely independent (Spark)
If you’re unlucky, key insight: a write to shared memory is equivalent to a message delivery
Approach:
- interpose on virtual memory, thread scheduler
- pause a thread whenever it writes to shared memory / disk
Cf. “Enabling Tracing of Long-Running Multithreaded Programs via Dynamic Execution Reduction”, ISSTA ’07

Dealing With Non-Determinism
Interpose on:
- Timers
- Random number generators
- Unordered hash values
- ID allocation
Stop-gap: replay each schedule multiple times
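The stop-gap can be sketched as a thin wrapper around a single replay attempt. `replay_once` is a hypothetical zero-argument callable that runs one replay and reports whether the violation appeared, and the default of five attempts is an arbitrary choice for illustration.

```python
from typing import Callable


def reproduces_with_retries(replay_once: Callable[[], bool], attempts: int = 5) -> bool:
    """Treat a schedule as reproducing if any of several replays hits the violation."""
    return any(replay_once() for _ in range(attempts))


# Toy flaky replay: the violation only shows up on the third attempt.
outcomes = iter([False, False, True])
flaky = reproduces_with_retries(lambda: next(outcomes))

never = reproduces_with_retries(lambda: False, attempts=3)
print(flaky, never)
```

The trade-off is runtime: every candidate schedule may now cost up to `attempts` replays, so this only makes sense for the sources of non-determinism that cannot be interposed on.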

Complete Results

Runtime Breakdown

Integrating with other RPC libs
[Figure: three processes, each a stack of App / RPC lib / OS, connected to DEMi running in a JVM]
