Slide 1

Slide 1 text

Model Checking of Fault-Tolerant Distributed Algorithms
Part II: Modeling Fault-tolerant Distributed Algorithms

Annu Gmeiner, Igor Konnov, Ulrich Schmid, Helmut Veith, Josef Widder

TMPA 2014, Kostroma, Russia

Josef Widder (www.forsyte.at), Checking Fault-Tolerant Distributed Algos, TMPA’14, Nov. 2014

Slide 2

Slide 2 text

Why Modeling and Verification?

Let’s have a look at the recent technical report by amazon.com (C. Newcombe et al., Use of Formal Methods at Amazon Web Services, 2014):

"We have found that the standard verification techniques in industry are necessary but not sufficient. We use deep design reviews, code reviews, static code analysis, stress testing, fault-injection testing [...], but we still find that subtle bugs can hide in complex concurrent fault-tolerant systems."

"... We have found that testing the code is inadequate as a method to find subtle errors in design, as the number of reachable states of the code is astronomical."

Slide 3

Slide 3 text

Why Modeling and Verification? (cont.)

"... the final executable code is unambiguous, but contains an overwhelming amount of detail. We needed to be able to capture the essence of a design in a few hundred lines of precise description."

"Engineers naturally focus on designing the “happy case” for a system, i.e. the processing path in which no errors occur."

"... the shortest error trace exhibiting the bug contained 35 high level steps. The improbability of such compound events is not a defense against such bugs; historically, AWS has observed many combinations of events at least as complicated as those that could trigger this bug."

Slide 5

Slide 5 text

What are we doing?

The approach we have:
- A high-level description of a design, which is precise.
- Sound verification method, as complete as possible.

More specifically:
- modeling approach suitable for model checking
- automatic parameterized verification method
- targeting fault-tolerant distributed algorithms
(We are not interested in verifying mathematical toy examples.)

Slide 6

Slide 6 text

Why Model Checking?

- an alternative proof approach
- useful counter-examples
- ability to define and vary assumptions about the system and see why it breaks
- closer to code level
- good degree of automation

Slide 7

Slide 7 text

Distributed Algorithms: Model Checking Challenges

- unbounded data types
    unbounded number of rounds (round numbers part of messages)
- parameterization in multiple parameters
    among n processes f ≤ t are faulty with n > 3t
- contrast to concurrent programs
    diverse fault models (adverse environments)
- continuous time
    fault-tolerant clock synchronization
- degrees of concurrency
    synchronous, asynchronous
    partially synchronous: a process makes at most 5 steps between 2 steps of any other process

Slide 10

Slide 10 text

Fault-tolerant distributed algorithms

[Figure: n processes, of which at most t might be faulty and f are actually faulty]

- n processes communicate by messages
- all processes know that at most t of them might be faulty
- f are actually faulty

Slide 11

Slide 11 text

Challenge #1: fault models

- clean crashes (least severe): faulty processes prematurely halt after/before “send to all”
- crash faults: faulty processes prematurely halt (also) in the middle of “send to all”
- omission faults: faulty processes follow the algorithm, but some messages sent by them might be lost
- symmetric faults: faulty processes send arbitrarily to all or nobody
- Byzantine faults (most severe): faulty processes can do anything; this encompasses all behaviors of the above models

Slide 12

Slide 12 text

Challenges #2 & #3: Pseudo-code and Communication

Translate pseudo-code to a formal description that
- allows us to verify the algorithm and
- does not oversimplify the original algorithm.

Assumptions about the communication medium are usually
- written in plain English,
- spread across research papers,
- part of folklore knowledge.

Slide 14

Slide 14 text

Asynchronous Reliable Broadcast (Srikanth & Toueg, 87)

The core of the classic broadcast algorithm from the DA literature. It solves an agreement problem depending on the inputs v_i.

Variables of process i:
    v_i : {0, 1}, initially 0 or 1
    accept_i : {0, 1}, initially 0

An atomic step:
    if v_i = 1
      then send (echo) to all;
    if received (echo) from at least t + 1 distinct processes
       and not sent (echo) before
      then send (echo) to all;
    if received (echo) from at least n - t distinct processes
      then accept_i := 1;

- asynchronous
- t Byzantine faults
- correct if n > 3t
- the code is parameterized in n and t ⇒ process template P(n, t, f)
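The atomic step above can be sketched as executable code. This is a minimal sketch in Python (used here only because it runs directly), not the authors' model: the names `Proc`, `step`, and `received` are assumptions, the network is abstracted into the number of distinct senders seen so far, and the first rule is also guarded by "not sent before", which is harmless because only distinct senders are counted.

```python
from dataclasses import dataclass

@dataclass
class Proc:
    v: int               # input value, 0 or 1
    accept: int = 0      # becomes 1 once n - t echoes were received
    sent: bool = False   # has this process already sent (echo)?

def step(p: Proc, received: int, n: int, t: int) -> bool:
    """One atomic step of a correct process.

    `received` is the number of distinct processes from which (echo)
    has been received so far (an abstraction of the message buffer).
    Returns True if this step sends (echo) to all.
    """
    send_now = False
    # Rules 1 and 2: send (echo) if v = 1 or at least t + 1 echoes arrived.
    if not p.sent and (p.v == 1 or received >= t + 1):
        p.sent = True
        send_now = True
    # Rule 3: accept after at least n - t echoes.
    if received >= n - t:
        p.accept = 1
    return send_now
```

With n = 4 and t = 1, a process with v = 0 stays silent after one echo, relays after two, and accepts after three.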

Slide 16

Slide 16 text

Typical Structure of a Computation Step

1. receive messages
2. compute using messages and local variables (description in English with basic control flow if-then-else)
3. send messages

The whole step is atomic. (Slide annotations: "implicit", "pseudo-code".)

Slide 17

Slide 17 text

Challenge #4: Parameterized Model Checking

Parameterized model checking problem:
- given a process template P(n, t, f),
- resilience condition RC: n > 3t ∧ t ≥ f ≥ 0,
- fairness constraints Φ, e.g., “all messages will be delivered”,
- and an LTL-X formula ϕ,
show for all n, t, and f satisfying RC:

$(P(n, t, f))^{n-f} + f \text{ faults} \models (\Phi \rightarrow \varphi)$
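To make the quantification "for all n, t, and f satisfying RC" concrete, the following Python sketch lists the admissible parameter triples up to a bound. The function names and the bound `max_n` are assumptions for illustration; a parameterized method has to cover all infinitely many triples, which enumeration cannot do.

```python
def resilience_condition(n: int, t: int, f: int) -> bool:
    """RC from the slide: n > 3t and t >= f >= 0."""
    return n > 3 * t and t >= f >= 0

def instances(max_n: int):
    """All (n, t, f) with 1 <= n <= max_n that satisfy RC."""
    return [(n, t, f)
            for n in range(1, max_n + 1)
            for t in range(0, n)
            for f in range(0, t + 1)
            if resilience_condition(n, t, f)]

if __name__ == "__main__":
    for triple in instances(7):
        print(triple)
```

Fixed-parameter model checking picks one such triple at a time; the parameterized problem asks for a single argument covering the whole list for unbounded `max_n`.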

Slide 19

Slide 19 text

Challenge #5: Liveness in Distributed Algorithms

Interplay of safety and liveness is a central challenge in DAs:
- achieving safety and liveness is non-trivial
- asynchrony and faults lead to impossibility results (recall the first part of the lecture (Fischer et al., 1985))

Rich literature to verify safety (e.g. in concurrent systems)

Distributed algorithms perspective:
- “doing nothing is always safe”
- “tools verify algorithms that actually might do nothing”

Verification efforts often have to simplify assumptions

Slide 20

Slide 20 text

Summary

We have to model:
- faults,
- communication medium captured in English,
- algorithms written in pseudo-code,

and check:
- safety and liveness
- of parameterized systems
- with unbounded integers,
- non-standard fairness constraints.

Slide 21

Slide 21 text

Existing formalization frameworks

- TLA+/PlusCal: Design & Specification; Concurrent Alg.; Proving/TLC
- (Timed) IOA: Asynchronous DA; Proving/UPPAAL
- PVS: Theorem Proving
- ?: (Parameterized) Model Checking of FTDAs
- DISTAL: Simulation
- PBFT: Implementation

Slide 23

Slide 23 text

Alternative frameworks

TLA (temporal logic of actions):
- used to design (distributed) algorithms by refinement of the spec
- verification with proof assistants (low degree of automation)

Encodings of DA in proof assistant PVS (e.g., by Rushby):
- ad-hoc encoding
- found a bug in a published synchronous Byzantine Agreement algorithm (Lincoln & Rushby, 1993)

I/O-Automata:
- originally designed to write clearer hand-written proofs
- limited tool support, e.g., Veromodo toolset is still in beta
- suitable only for asynchronous distributed algorithms

Proof assistants are very general, but with a low degree of automation: “everything is possible, but nothing is easy”

Slide 24

Slide 24 text

Simulation and Implementation

Distal: Domain-specific language (Biely et al., 2013)
- Simulation and performance evaluation of fault-tolerant algorithms

Practical Byzantine Fault-Tolerance (Castro et al., 1999) and other practical algorithms:
- Implementation with optimizations
- Precise semantics is unclear
- The system is partially synchronous: non-divergent message delays are assumed

Slide 25

Slide 25 text

In this part

- We introduce an efficient encoding in PROMELA.
- Verify safety and liveness of fault-tolerant algorithms (fixed parameters).
- Find counterexamples for parameters known from the literature. This proves adequacy of our modeling.

Slide 26

Slide 26 text

Preliminaries: Promela

Slide 27

Slide 27 text

Promela

PROMELA ≡ PROcess MEta LAnguage
SPIN ≡ Simple Promela INterpreter (not that simple any more)

Here we give a short introduction and cover only the features important to our work.

Detailed documentation, tutorials, and books at: http://spinroot.com (Gerard Holzmann)

Slide 28

Slide 28 text

Top-level: global variables and processes

    /* global declarations visible to all processes */
    int x;                       /* a global integer (as in C) */
    mtype = { X, Y };            /* constant message types */
    /* a FIFO channel with at most 2 messages of type mtype */
    chan c = [2] of { mtype };

    active[2] proctype ProcA() { /* two processes are created at the initial state */
        ...
    }
    proctype ProcB() {           /* processes can be created later using: run ProcB() */
        ...
    }
    init {                       /* a special process, used to create other processes */
        run ProcB(); run ProcB();
    }

Slide 29

Slide 29 text

One process: Basics

    int x, y;
    active proctype ProcA() {
        int z;      /* declare a local variable */
        z = x;      /* assignment */
        x > y;      /* block until the expression evaluates to true */
        true;       /* one step to execute, no effect */
        z++;
        skip;       /* same as true */
    }

Slide 30

Slide 30 text

One process: Control flow

    int x, y;
    active proctype P() {
    main:
        if                      /* a guarded command */
        :: x == 0 -> x = 1;
        :: y == 0 -> y = 1;
        :: x == 1 && y == 1 -> x = 0; y = 0;
        fi;
        goto main;
    }

A guarded command
- non-deterministically selects an option whose first expression is not blocked, and
- continues executing the rest of the option step-by-step.

Slide 31

Slide 31 text

One process: Control flow (cont.)

    int x = 0, y = 0;
    active proctype P() {
    main:
        if
        :: x == 0 -> x = 1;
        :: y == 0 -> y = 1;
        :: x == 1 && y == 1 -> x = 0; y = 0;
        fi;
        goto main;
    }

    Run 1      Run 2      Run 3
    x=0,y=0    x=0,y=0    x=0,y=0
    x=1,y=0    x=0,y=1    x=1,y=0
    x=1,y=1    x=1,y=1    x=1,y=1
    x=0,y=0    x=0,y=0    x=0,y=0
    x=0,y=1    x=1,y=0    x=0,y=1
    x=1,y=1    x=1,y=1    x=1,y=1
    x=0,y=0    x=0,y=0    x=0,y=0
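The runs in the table can be reproduced by enumerating the non-deterministic choices. The Python sketch below is an illustration under one simplifying assumption: each option of the `if` is treated as a single step, as in the table, so the intermediate state between `x = 0` and `y = 0` is not shown. The function names are hypothetical.

```python
def enabled_options(x, y):
    """Successor states of every option whose guard is true in (x, y)."""
    succ = []
    if x == 0:
        succ.append((1, y))      # :: x == 0 -> x = 1
    if y == 0:
        succ.append((x, 1))      # :: y == 0 -> y = 1
    if x == 1 and y == 1:
        succ.append((0, 0))      # :: x == 1 && y == 1 -> x = 0; y = 0
    return succ

def runs(state, steps):
    """All runs with exactly `steps` option executions, starting in `state`."""
    if steps == 0:
        return [[state]]
    return [[state] + rest
            for nxt in enabled_options(*state)
            for rest in runs(nxt, steps - 1)]

if __name__ == "__main__":
    for run in runs((0, 0), 6):
        print(" ".join(f"x={x},y={y}" for x, y in run))
```

Every time the state returns to x=0,y=0 there are two enabled options, so the number of runs doubles with each round of three steps; the table shows three of the four runs of length six.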

Slide 32

Slide 32 text

One process: Loops

    int x;
    active proctype P() {
        do                      /* a do..od loop */
        :: x == 10 -> x = 0;
        :: x == 10 -> break;
        :: x < 10 -> x++;
        od;
    A:  if                      /* basically the same; goto A introduces one more step */
        :: x == 10 -> x = 0;
        :: x == 10 -> goto B;
        :: x < 10 -> x++;
        fi;
        goto A;
    B:  skip                    /* a label must be attached to a statement */
    }

Slide 33

Slide 33 text

Many Processes: Interleavings

- Pure interleaving semantics
- Every statement is executed atomically

    int x = 0, y = 1;
    active[2] proctype A() {
        x = 1 - x;
        y = 1 - y;
    }

[Figure: state graph of the interleavings of A[0] and A[1]. The red path is an example execution where the steps of processes 0 and 1 alternate.]

Slide 35

Slide 35 text

Many Processes: Atomics

- use atomic { ... } to make execution of a sequence indivisible.
- non-deterministic choice with if..fi is still allowed!

    int x = 0, y = 1;
    active[2] proctype A() {
        atomic {
            x = 1 - x;
            y = 1 - y;
        }
    }

[Figure: state graph of the executions of A[0] and A[1] with atomic steps]

Larger atomic steps lead to fewer possible paths and states.
Note: different degrees of atomicity may lead to different verification results.
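The effect of atomicity on the state space can be checked by brute force. The Python sketch below explores the two-process example with and without the atomic block; the state encoding (x, y, program counter of A[0], program counter of A[1]) is an assumption made for this illustration and is not how Spin represents states.

```python
def reachable(atomic: bool):
    """Reachable states of two copies of A(), as (x, y, pc0, pc1)."""
    init = (0, 1, 0, 0)
    seen, todo = {init}, [init]
    while todo:
        x, y, *pcs = todo.pop()
        for i in (0, 1):
            if pcs[i] == 2:          # process i has terminated
                continue
            nx, ny, npc = x, y, list(pcs)
            if atomic:               # both assignments in one indivisible step
                nx, ny, npc[i] = 1 - x, 1 - y, 2
            elif pcs[i] == 0:        # x = 1 - x
                nx, npc[i] = 1 - x, 1
            else:                    # y = 1 - y
                ny, npc[i] = 1 - y, 2
            state = (nx, ny, *npc)
            if state not in seen:
                seen.add(state)
                todo.append(state)
    return seen

if __name__ == "__main__":
    print(len(reachable(atomic=False)), len(reachable(atomic=True)))
```

In this encoding the interleaved version reaches nine states and the atomic version four, which illustrates the slide's remark that larger atomic steps shrink the state space.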

Slide 36

Slide 36 text

(Asynchronous) message passing

    mtype = { A, B };
    chan chan1 = [1] of { mtype };   /* queue of size 1 */
    chan chan2 = [1] of { mtype };

    active proctype Ping() {
        chan1!A;                     /* insert A into chan1 */
        do
        :: chan2?B ->                /* when B is at the head of chan2, remove it */
            chan1!A;                 /* and insert A into chan1 */
        od;
    }
    active proctype Pong() {
        do
        :: chan1?A -> chan2!B;
        od;
    }

Slide 37

Slide 37 text

Blocking receive

    mtype = { A, B };
    chan chan1 = [1] of { mtype };
    chan chan2 = [1] of { mtype };

    active proctype Ping() {
        chan1!A;
        do
        :: chan2?B ->        /* <- deadlock: Ping waits here for B */
            chan1!A;
        od;
    }
    active proctype Pong() {
        do
        :: chan1?A ->
            chan1?A;         /* <- deadlock: chan1 is empty, the receive blocks */
            chan2!B;
        od;
    }

Ping sends A, Pong receives A, and then the second chan1?A is blocked.

Slide 38

Slide 38 text

Blocking send

    mtype = { A, B };
    chan chan1 = [1] of { mtype };
    chan chan2 = [1] of { mtype };

    active proctype Ping() {
        chan1!A;
        do
        :: chan2?B ->
            chan1!A; chan1!A;
            chan1!A; chan1!A;
        od;
    }
    active proctype Pong() {
        do
        :: chan1?A -> chan2!B;
        od;
    }

- When chan1=[A] and chan2=[B], the system deadlocks: Ping is blocked on a send chan1!A and Pong is blocked on chan2!B.
- The shortest counter-example has 10 steps.
- Use Spin to find it.
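As a cross-check of the blocking-send example, the deadlock can also be found by a small breadth-first search. The Python sketch below is a hand-written abstraction, not Spin: each send or receive counts as one step, channels are reduced to fill counters (chan1 carries only A, chan2 only B), and the tables `PING` and `PONG` are an assumed encoding of the two processes.

```python
from collections import deque

CAP = 1  # capacity of chan1 and chan2

# Each entry: (operation, channel index, next program counter).
PING = [("send", 0, 1),                  # chan1!A
        ("recv", 1, 2),                  # :: chan2?B ->
        ("send", 0, 3), ("send", 0, 4),  # chan1!A; chan1!A;
        ("send", 0, 5), ("send", 0, 1)]  # chan1!A; chan1!A; back to the guard
PONG = [("recv", 0, 1),                  # :: chan1?A ->
        ("send", 1, 0)]                  # chan2!B; back to the guard

def successors(state):
    """States reachable in one step from (ping_pc, pong_pc, fill1, fill2)."""
    pcs, chans = state[:2], state[2:]
    for who, prog in enumerate((PING, PONG)):
        op, ch, nxt = prog[pcs[who]]
        if op == "send" and chans[ch] < CAP:
            delta = 1
        elif op == "recv" and chans[ch] > 0:
            delta = -1
        else:
            continue                     # this process is blocked
        new_pcs, new_ch = list(pcs), list(chans)
        new_pcs[who] = nxt
        new_ch[ch] += delta
        yield tuple(new_pcs + new_ch)

def shortest_deadlock():
    """Return (state, depth) of a deadlock at minimal depth, or None."""
    init = (0, 0, 0, 0)
    depth = {init: 0}
    queue = deque([init])
    while queue:
        s = queue.popleft()
        succ = list(successors(s))
        if not succ:
            return s, depth[s]
        for n in succ:
            if n not in depth:
                depth[n] = depth[s] + 1
                queue.append(n)
    return None
```

Under this way of counting steps, `shortest_deadlock()` reports a deadlock at depth 10 with both channels full, which agrees with the number on the slide.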

Slide 39

Slide 39 text

Promela vs. C

PROMELA looks like C. But it is not!
- Non-determinism in the if statements (internal non-determinism)
- Non-deterministic scheduler (external non-determinism)
- Atomic statements
- Message passing

PROMELA is a modeling language.

Slide 40

Slide 40 text

Preliminaries:
- Kripke Structures
- Linear Temporal Logic (LTL)
- Control Flow Automata (CFA)

Slide 41

Slide 41 text

Kripke structures

A Kripke structure is a tuple M = (S, S0, R, AP, L), where:
- S is a set of states,
- S0 ⊆ S is the set of initial states,
- R ⊆ S × S is a transition relation,
- AP is a set of atomic propositions,
- $L : S \to 2^{AP}$ is a state-labeling function.

[Figure: example Kripke structure with labeled states s0: {r}, s1: {y}, s2: {y}, s3: {r, y, g}, s4: {g}]
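The definition maps directly onto a small data type. The Python sketch below is an illustration only: the class `Kripke` and its methods are assumed names, the state labels come from the slide, and the transitions are hypothetical because the edges of the figure are not recoverable from the text.

```python
from dataclasses import dataclass

@dataclass
class Kripke:
    states: set        # S
    initial: set       # S0
    transitions: set   # R, as a set of (s, s') pairs
    ap: set            # AP
    label: dict        # L: state -> set of atomic propositions

    def well_formed(self) -> bool:
        """Check S0 ⊆ S, R ⊆ S × S, and L(s) ⊆ AP for every state."""
        return (self.initial <= self.states
                and all(s in self.states and d in self.states
                        for s, d in self.transitions)
                and all(self.label.get(s, set()) <= self.ap
                        for s in self.states))

    def successors(self, s):
        return {d for (src, d) in self.transitions if src == s}

# Labels as on the slide; the transition relation below is HYPOTHETICAL.
M = Kripke(
    states={"s0", "s1", "s2", "s3", "s4"},
    initial={"s0"},
    transitions={("s0", "s1"), ("s1", "s3"), ("s3", "s4"),
                 ("s4", "s2"), ("s2", "s0")},
    ap={"r", "y", "g"},
    label={"s0": {"r"}, "s1": {"y"}, "s2": {"y"},
           "s3": {"r", "y", "g"}, "s4": {"g"}},
)
```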

Slide 42

Slide 42 text

Linear Temporal Logic

An LTL formula is defined inductively w.r.t. atomic propositions AP:
- (base) p ∈ AP is an LTL formula,
- if ϕ and ψ are LTL formulas, then the following expressions are LTL formulas:
    Nexttime: X ϕ,
    Eventually: F ϕ,
    Globally: G ϕ,
    Until: ψ U ϕ,
    Boolean combinations: ϕ ∧ ψ, ϕ ∨ ψ, and ¬ϕ.

[Figure: paths s0 s1 s2 s3 s4 illustrating the temporal operators; for Until, ψ holds in s0 and s1 and ϕ holds in s2]
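The semantics of these operators can be tried out on ultimately periodic ("lasso") paths, which are finite to write down but denote infinite runs. The following Python sketch is an assumption-laden illustration, not part of the slides: formulas are nested tuples such as ("U", psi, phi), and a path is a list of label sets plus the index `loop_start` to which the path returns after its last position.

```python
def holds(f, i, word, loop_start):
    """Does formula f hold at position i of the lasso path `word`?"""
    def succ(k):
        return k + 1 if k + 1 < len(word) else loop_start

    def future(k):
        # Positions visited from k on, in order of first visit.
        seen = []
        while k not in seen:
            seen.append(k)
            k = succ(k)
        return seen

    def rec(g, k):
        return holds(g, k, word, loop_start)

    op = f[0]
    if op == "ap":
        return f[1] in word[i]
    if op == "not":
        return not rec(f[1], i)
    if op == "and":
        return rec(f[1], i) and rec(f[2], i)
    if op == "or":
        return rec(f[1], i) or rec(f[2], i)
    if op == "X":
        return rec(f[1], succ(i))
    if op == "F":
        return any(rec(f[1], k) for k in future(i))
    if op == "G":
        return all(rec(f[1], k) for k in future(i))
    if op == "U":                      # f[1] U f[2], i.e. psi U phi
        for k in future(i):
            if rec(f[2], k):
                return True
            if not rec(f[1], k):
                return False
        return False
    raise ValueError(f"unknown operator {op!r}")

r, y, g = ("ap", "r"), ("ap", "y"), ("ap", "g")
cycle = [{"r"}, {"r", "y"}, {"g"}, {"y"}]   # hypothetical path, loops to 0
```

On `cycle` with `loop_start=0`, the formula G F g holds at position 0 (g recurs forever), while F G g does not.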

Slide 43

Slide 43 text

Recall: Typical Structure of a Computation Step

1. receive messages
2. compute using messages and local variables (description in English with basic control flow if-then-else)
3. send messages

The whole step is atomic. (Slide annotations: "implicit", "pseudo-code".)

Slide 44

Slide 44 text

CFA: Intermediate representation

Intermediate representation of a loop body:
- a path from qI to qF encodes one iteration.
- Every variable is assigned at most once (SSA).

    active proctype P() {
        int x, y;
        do
        :: x == 0 -> x = 1;
        :: x == 1 -> x = 2;
        :: x == 2 -> x = 0;
        :: x == 1 -> x = 0; y = 1 - y;
        od;
    }

[Figure: control flow automaton with locations qI, q0, q1, q2, q3, q4, qF; the edges are labeled with the guards and assignments over x and y from the loop above]

Slide 45

Slide 45 text

Example: from a CFA to a Kripke structure

Kripke structure M(n, t, f) = (S, S0, R, AP, L) of N(n, t, f) processes:
- For a path π from qI to qF, construct a formula $\varphi_\pi(x, y, x', y')$.
- A state is a pair of $\mathbf{x} = (x_1, \dots, x_N) \in \mathbb{N}^N$ and $\mathbf{y} = (y_1, \dots, y_N) \in \mathbb{N}^N$, and the initial states are $S_0 = \{(0, \dots, 0)\}$.
- $((\mathbf{x}, \mathbf{y}), (\mathbf{x}', \mathbf{y}')) \in R$ iff there are a process index k, 1 ≤ k ≤ N, and a path π such that:
    [k moves]: $\varphi_\pi(x_k, y_k, x'_k, y'_k)$ holds,
    [others do not]: $\forall i \in \{1, \dots, N\} \setminus \{k\}.\ x'_i = x_i,\ y'_i = y_i$.
- Propositions $AP = \{[\exists i.\ y_i = 0], [\forall i.\ y_i = 0]\}$, and a state $(\mathbf{x}, \mathbf{y})$ is labeled as: $p \in L((\mathbf{x}, \mathbf{y}))$ iff $(\mathbf{x}, \mathbf{y}) \models p$.

[Figure: the control flow automaton of the previous slide, with locations qI, q0, q1, q2, q3, q4, qF]
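The construction can be made concrete for the loop of the CFA example. In the Python sketch below, `local_moves` plays the role of the path formulas (one entry per path from qI to qF), and `global_successors` implements "k moves, others do not". The function names are assumptions, and a global state is stored as a tuple of (x_i, y_i) pairs rather than as a pair of vectors, which is an equivalent encoding chosen for brevity.

```python
def local_moves(x, y):
    """Results (x', y') of one loop iteration, one per enabled path."""
    moves = []
    if x == 0:
        moves.append((1, y))          # x == 0 -> x = 1
    if x == 1:
        moves.append((2, y))          # x == 1 -> x = 2
    if x == 2:
        moves.append((0, y))          # x == 2 -> x = 0
    if x == 1:
        moves.append((0, 1 - y))      # x == 1 -> x = 0; y = 1 - y
    return moves

def global_successors(state):
    """Transition relation R: exactly one process k moves."""
    for k, (x, y) in enumerate(state):
        for nxt in local_moves(x, y):
            yield state[:k] + (nxt,) + state[k + 1:]

def reachable(n_procs):
    """States reachable from the initial state (all variables 0)."""
    init = ((0, 0),) * n_procs
    seen, todo = {init}, [init]
    while todo:
        s = todo.pop()
        for s2 in global_successors(s):
            if s2 not in seen:
                seen.add(s2)
                todo.append(s2)
    return seen

def label(state):
    """Labeling function L for the two propositions in AP."""
    props = set()
    if any(y == 0 for _, y in state):
        props.add("exists i. y_i = 0")
    if all(y == 0 for _, y in state):
        props.add("forall i. y_i = 0")
    return props
```

For a single process all six combinations of x ∈ {0, 1, 2} and y ∈ {0, 1} are reachable, and for N processes the number of reachable states grows as 6^N, which is the state explosion that fixed-parameter model checking has to cope with.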

Slide 46

Slide 46 text

Example: Properties of ST87 in LTL

Unforgeability (Safety). If v_i = 0 for all correct processes i, then for all correct processes j, accept_j remains 0 forever.

$\mathbf{G}\,\Bigl(\bigwedge_{i=1}^{n-f} v_i = 0 \;\rightarrow\; \mathbf{G} \bigwedge_{j=1}^{n-f} accept_j = 0\Bigr)$

Completeness (Liveness). If v_i = 1 for all correct processes i, then there is a correct process j that eventually sets accept_j to 1.

$\mathbf{G}\,\Bigl(\bigwedge_{i=1}^{n-f} v_i = 1 \;\rightarrow\; \mathbf{F} \bigvee_{j=1}^{n-f} accept_j = 1\Bigr)$

Relay (Liveness). If a correct process i sets accept_i to 1, then eventually all correct processes j set accept_j to 1.

$\mathbf{G}\,\Bigl(\bigvee_{i=1}^{n-f} accept_i = 1 \;\rightarrow\; \mathbf{F} \bigwedge_{j=1}^{n-f} accept_j = 1\Bigr)$

Slide 47

Slide 47 text

Model Checking Problems

Finite-state MC
- Input: a process template P, an LTL formula ϕ (including fairness), values of parameters n, t, and f.
- Problem: check whether M(n, t, f) |= ϕ.

Parameterized MC
- Input: a process template P, an LTL formula φ (including fairness) with atomic propositions of the form [∃i. x_i < y] and [∀i. x_i < y].
- Problem: check whether ∀n, t, f : n > 3t ∧ t ≥ f ∧ f ≥ 0. M(n, t, f) |= φ.

Slide 48

Slide 48 text

Parameterized modeling & Non-parameterized model checking

as in SPIN’13 (John et al., 2013)

Slide 50

Slide 50 text

Modeling of threshold-based algorithms in Promela...

- We introduce an efficient encoding of threshold-based fault-tolerant algorithms in PROMELA (with parametrization!)
- Verify safety and liveness of fault-tolerant algorithms (fixed parameters).
- Find counterexamples for parameters known from the literature. This proves adequacy of our modeling.

For our method, we exploit specifics of FTDAs:
1. central feature of the algorithms (message counting);
2. specific message passing (we do not need to know who sent but how many of them sent messages);
3. the way faults affect messages (again, counting messages).
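The counting idea can be illustrated on the running example with Byzantine faults. The Python sketch below is not the PROMELA encoding of the slides but a hand-written approximation under stated assumptions: only correct processes are modeled, `nsnt` is the number of correct processes that have sent (echo), and a process may receive up to `nsnt + f` echoes because each of the f Byzantine processes may or may not send one. The names `nsnt`, `nrcvd`, and `unforgeability_holds` are chosen for this illustration.

```python
def unforgeability_holds(n: int, t: int, f: int) -> bool:
    """Explore all reachable states when every correct process has v = 0.

    Returns False as soon as some correct process accepts, which would
    violate unforgeability.
    """
    correct = n - f
    init = ((False, 0, 0),) * correct      # (sent, accept, nrcvd) per process
    seen, todo = {init}, [init]
    while todo:
        state = todo.pop()
        if any(accept for _, accept, _ in state):
            return False
        nsnt = sum(sent for sent, _, _ in state)
        for i, (sent, accept, nrcvd) in enumerate(state):
            # Receive: the counter may grow up to nsnt + f (message counting).
            for new_rcvd in range(nrcvd, nsnt + f + 1):
                # Compute and send, with v_i = 0: only the thresholds matter.
                new_sent = sent or new_rcvd >= t + 1
                new_accept = 1 if new_rcvd >= n - t else accept
                nxt = (state[:i] + ((new_sent, new_accept, new_rcvd),)
                       + state[i + 1:])
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
    return True

if __name__ == "__main__":
    print(unforgeability_holds(4, 1, 1))   # resilience condition satisfied
    print(unforgeability_holds(4, 1, 2))   # f > t violates the condition
```

With n = 4, t = 1, f = 1 no correct process can ever count t + 1 echoes, so the property holds; with f = 2 > t the Byzantine echoes alone reach the threshold and the search finds a violating state, in line with the corner-case counterexamples reported for f > t.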

Slide 51

Slide 51 text

Case Studies

We consider a number of threshold-based algorithms.

Our running example ST87 for
1. Byzantine faults (BYZ)
2. omission faults (OMIT)
3. symmetric faults (SYMM)
4. clean crashes (CLEAN).

5. Folklore reliable broadcast for clean crashes [Chandra & Toueg 96, CT96]

(to be continued)

Slide 53

Slide 53 text

Characteristics of the FTDA by Srikanth & Toueg, 87

Variables of process i:
    v_i : {0, 1}, initially 0 or 1
    accept_i : {0, 1}, initially 0

An atomic step:
    if v_i = 1
      then send (echo) to all;
    if received (echo) from at least t + 1 distinct processes
       and not sent (echo) before
      then send (echo) to all;
    if received (echo) from at least n - t distinct processes
      then accept_i := 1;

- the algorithm consists of threshold-guarded commands, with only the thresholds t + 1 and n − t
- communication is by “send to all”
- how processes distinguish distinct senders is not part of the algorithm (i.e., the algorithm description is high level)

Slide 54

Slide 54 text

Case Studies (cont.): Larger Algorithms

More involved algorithms in the purely asynchronous setting:

6. Asynchronous Byzantine Agreement (Bracha & Toueg 85, BT85)
    - Byzantine faults
    - two phases and two message types
    - five status values
    - properties: unforgeability, correctness (liveness), agreement (liveness)

7. Condition-based Consensus (Mostéfaoui et al. 01, MRRR01)
    - crash faults
    - two phases and four message types
    - nine status variables
    - properties: validity, agreement, termination (liveness)

8. Fast Byzantine Consensus: common case (Martin, Alvisi 06, MA06)
    - Byzantine faults
    - the core part of the algorithm
    - no cryptography

Slide 55

Slide 55 text

Experimental Results at a Glance

    Algorithm   Fault  Parameters                          Resilience              Properties          Time
    1. ST87     BYZ    n = 7, t = 2, f = 2                 n > 3t                  U, C, R             6 sec.
    1. ST87     BYZ    n = 7, t = 3, f = 2                 n > 3t                  U, C, R             5 sec.
    1. ST87     BYZ    n = 7, t = 1, f = 2                 n > 3t                  U, C, R             1 sec.
    2. ST87     OMIT   n = 5, t = 2, f = 2                 n > 2t                  U, C, R             4 sec.
    2. ST87     OMIT   n = 5, t = 2, f = 3                 n > 2t                  U, C, R             5 sec.
    3. ST87     SYMM   n = 5, t = 1, f_p = 1, f_s = 0      n > 2t                  U, C, R             1 sec.
    3. ST87     SYMM   n = 5, t = 2, f_p = 3, f_s = 1      n > 2t                  U, C, R             1 sec.
    4. ST87     CLEAN  n = 3, t = 2, f_c = 2, f_nc = 0     n > t                   U, C, R             1 sec.
    5. CT96     CRASH  n = 2                               -                       U, C, R             1 sec.
    6. BT85     BYZ    n = 5, t = 1, f = 1                 n > 3t                  R                   131 sec.
    6. BT85     BYZ    n = 5, t = 1, f = 2                 n > 3t                  R                   1 sec.
    6. BT85     BYZ    n = 5, t = 2, f = 2                 n > 3t                  R                   1 sec.
    7. MRRR01   CRASH  n = 3, t = 1, f = 1                 n > 2t                  V0, V1, A, T        1 sec.
    7. MRRR01   CRASH  n = 3, t = 1, f = 2                 n > 2t                  V0, V1, A, T        1 sec.
    8. MA06     BYZ    p = 4, a = 6, l = 4, t = 1, f = 1   p > 3t, a > 5t, l > 3t  CS1, CS3, CL1, CL2  3 hrs.
    8. MA06     BYZ    p = 4, a = 5, l = 4, t = 1, f = 1   p > 3t, a > 5t, l > 3t  CS1, CS3, CL1, CL2  14 min.
    8. MA06     BYZ    p = 4, a = 6, l = 4, t = 1, f = 2   p > 3t, a > 5t, l > 3t  CS1, CS3, CL1, CL2  2 sec.

Slide 61

Slide 61 text

Experimental Results: ST87, the Byzantine Case

[Figure: two plots, time (sec, logscale) and memory (MB, logscale), each comparing no faults (f = 0) with two faults (f = 2).]

The more faults we have, the easier the problem is:
- Two faults: we can check systems of up to nine processes.
- No faults: we can check systems of up to seven processes.

Precision of modeling: we found counterexamples for the corner cases n = 3t and f > t, where the resilience condition is violated. (June 2013: somebody wrote on Wikipedia that n = 3t should work :-)

Slide 62

Slide 62 text

Discussion of the specifications

Unforgeability. If $v_i = 0$ for all correct processes i, then for all correct processes j, $\mathit{accept}_j$ remains 0 forever.

$$\mathbf{G}\left(\bigwedge_{i=1}^{n-f} v_i = 0 \;\rightarrow\; \mathbf{G} \bigwedge_{j=1}^{n-f} \mathit{accept}_j = 0\right)$$

The specifications of Byzantine FTDAs have the following features:
- Only the states of correct processes are evaluated. Faulty processes may be Byzantine (no assumption on their behavior).
- Specifications do not talk about individual processes. Only global safety and progress are important.

Indexed temporal logic is not required! Quantification over processes is on the level of atomic propositions.
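To make the shape of the unforgeability property concrete, here is a minimal Python sketch that evaluates it on a finite trace. The trace representation (a list of global states, each a list of `(v, accept)` pairs for the correct processes only) and the function name are assumptions made for illustration; a model checker works on infinite executions, so this is only a finite approximation.

```python
def unforgeability_holds(trace):
    """Check G( (all v_i = 0) -> G (all accept_j = 0) ) on a finite trace.

    trace: list of global states; each state is a list of (v, accept)
    pairs, one pair per correct process (hypothetical representation).
    """
    for k, state in enumerate(trace):
        # The quantification over processes is a plain atomic proposition.
        if all(v == 0 for v, _ in state):
            # From position k on, no correct process may accept.
            for later in trace[k:]:
                if any(accept != 0 for _, accept in later):
                    return False
    return True


good = [[(0, 0), (0, 0)], [(0, 0), (0, 0)]]
bad = [[(0, 0), (0, 0)], [(0, 0), (0, 1)]]      # accept forged
other = [[(1, 0), (0, 0)], [(1, 1), (0, 1)]]    # some v_i = 1: no obligation
print(unforgeability_holds(good), unforgeability_holds(bad), unforgeability_holds(other))
```

Note that the check never refers to an individual process index, which mirrors the remark that indexed temporal logic is not needed.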

Slide 65

Slide 65 text

Threshold-Guarded Distributed Algorithms

Standard construct: quantified guards (t = f = 0)
- Existential guard: if received m from some process then ...
- Universal guard: if received m from all processes then ...

What if faults might occur?

Fault-tolerant algorithms: n processes, at most t are Byzantine
- Threshold guard: if received m from n − t processes then ...
  (the processes cannot refer to f!)
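The difference between the three guards can be sketched in a few lines of Python. The function names and the representation of received messages as a set of sender ids are assumptions made for this illustration; they do not come from the talk.

```python
def existential_guard(senders):
    """Received m from some process."""
    return len(senders) >= 1


def universal_guard(senders, n):
    """Received m from all n processes."""
    return len(senders) >= n


def threshold_guard(senders, n, t):
    """Received m from at least n - t distinct processes.

    Only n and t appear: the actual number of faults f is unknown
    to the processes.
    """
    return len(senders) >= n - t


# Hypothetical scenario: n = 4, t = 1, and the single Byzantine
# process (id 3) stays silent, so only ids 0..2 ever send m.
n, t = 4, 1
senders = {0, 1, 2}
print(existential_guard(senders))      # True
print(universal_guard(senders, n))     # False: would wait forever
print(threshold_guard(senders, n, t))  # True
```

In this scenario the universal guard can never fire, while the threshold guard tolerates the silent process.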

Slide 68

Slide 68 text

Counting Argument in Threshold-Guarded Algorithms

[Figure: a bar of n processes with the segments t, f, and t + 1 marked.]

Correct processes count incoming messages from distinct processes. Among t + 1 messages from distinct processes, at least one was sent by a non-faulty process.
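The counting argument can be checked by brute force for small parameters. The Python sketch below is an illustration under an arbitrary labeling assumption: processes 0..f−1 are the faulty ones and the rest are correct. The function name is hypothetical.

```python
from itertools import combinations


def every_sender_set_has_correct(n, f, size):
    """True if every set of `size` distinct senders among n processes
    contains at least one correct process, given f faulty processes.

    Labeling assumption: ids 0..f-1 are faulty, ids f..n-1 are correct.
    """
    faulty = set(range(f))
    return all(not set(group) <= faulty
               for group in combinations(range(n), size))


n, t = 7, 2
# With f <= t, any t + 1 distinct senders include a correct process.
print(all(every_sender_set_has_correct(n, f, t + 1) for f in range(t + 1)))
# With f > t, the argument breaks: t + 1 faulty senders exist.
print(every_sender_set_has_correct(n, t + 1, t + 1))
```

The second call returning False corresponds to the corner case f > t, where the resilience assumption is violated.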

Slide 71

Slide 71 text

Modeling threshold-based algorithms in Promela

As the distributed algorithms are given in pseudo-code, we have to decide how to encode the following in PROMELA:
- send to all and receive
- counting expressions such as "received from n − t distinct processes"
- faults

In what follows, we compare two solutions side by side:
- Solution 1: a straightforward encoding with PROMELA channels and explicit representation of faulty processes.
- Solution 2: an advanced encoding with shared variables and fault injection.

To decouple the encoding of reliable message passing from the encoding of faults, we first consider message passing without faults, and then show how to encode faults.

Slide 72

Slide 72 text

Template in Promela

Each process repeats the following loop; one iteration is executed as a single atomic step:
1. receive messages
2. compute using messages and local variables (described in English with basic control flow, if-then-else)
3. send messages

The PROMELA template for this loop:

    /* shared state: a variable or a channel */
    active[N(n,t,f)] proctype P() {
        /* local variable to count messages from distinct processes */
        int nrcvd;
        /* initialization */
    loop:
        atomic {
            /* 1. receive and count messages
               2. compute using nrcvd
               3. send messages */
        }
        goto loop;
    }

Slide 73

Slide 73 text

Modeling Message Passing

All our case studies assume classic reliable asynchronous message passing as in (Fischer et al., 1985):
- Communication is non-blocking: the operations "receive" and "send" are executed immediately.
- If a message can be received now, it can also be received later: a process does not have to receive a message as soon as it is able to.
- Every sent message is eventually received, but there are no bounds on the delays.

Slide 74

Slide 74 text

Solution 1: Message Passing using Promela channels

A straightforward encoding using message channels:

    /* message type */
    mtype = { ECHO };
    /* point-to-point channels */
    chan p2p[N][N] = [1] of { mtype };
    /* tag received messages */
    bit rx[N][N];

Sending a message to all processes:

    for (i : 0 .. N-1) { p2p[_pid][i]!ECHO; }

Note: _pid denotes the process identifier in PROMELA (we use it solely to encode message passing). The two-dimensional indexing is slide shorthand: Spin arrays are one-dimensional, so a runnable model flattens the index, e.g. p2p[_pid * N + i].

Slide 75

Slide 75 text

Solution 1: Message Passing (cont.)

Receiving and counting messages from distinct processes (no faults yet):

    /* local */
    int nrcvd = 0;                       /* initially, no messages */
    ...
    i = 0;
    do
    /* is there a message from process i? */
    :: (i < N) && nempty(p2p[i][_pid]) ->
        p2p[i][_pid]?ECHO;               /* remove it */
        if
        :: !rx[i][_pid] ->               /* 1. the first time: */
            rx[i][_pid] = 1;             /* a. mark as received */
            nrcvd++; break;              /* b. increase local counter */
        :: rx[i][_pid];                  /* 2. ignore a duplicate */
        fi;
        i++;                             /* next process */
    :: (i < N) -> i++;                   /* receive nothing from i */
    :: i == N -> break;
    od
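The role of the rx flags, counting each sender at most once, can be illustrated outside PROMELA. The Python sketch below is an assumption-laden simplification: it processes a given list of sender ids in order and ignores the nondeterministic choice to skip a message. The function name is hypothetical.

```python
def count_distinct_senders(incoming, n):
    """Count messages from distinct senders, ignoring duplicates.

    incoming: sequence of sender ids in 0..n-1, in arrival order.
    """
    rx = [False] * n   # rx[i]: a message from process i was already counted
    nrcvd = 0
    for sender in incoming:
        if not rx[sender]:
            rx[sender] = True   # mark as received
            nrcvd += 1          # increase the local counter
        # otherwise: a duplicate, ignored
    return nrcvd


# Processes 0 and 2 send twice; only three distinct senders are counted.
print(count_distinct_senders([0, 2, 2, 0, 3], 4))  # 3
```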

Slide 76

Slide 76 text

Solution 2: Simulating message passing with variables

Keeping the number of send-to-all's by (correct) processes:

    int nsnt;   /* shared variable: number of send-to-all's sent by correct processes */

Sending a message to all:

    nsnt++;

Receiving and counting messages from distinct processes (no faults):

    if   /* pick a larger value ≤ nsnt */
    :: ((nrcvd + 1) <= nsnt) -> nrcvd++;   /* one more message */
    :: skip;                               /* or nothing */
    fi;

Reliable communication as a fairness property:

$$\mathbf{F}\,\mathbf{G}\,[\forall i.\ \mathit{nrcvd}_i \ge \mathit{nsnt}]$$
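The shared-counter encoding can be simulated to see what it does and does not guarantee. The Python sketch below is an illustration under assumptions: a random scheduler stands in for nondeterminism, each process sends to all at most once, and the receive guard admits values up to nsnt, as the comment "pick a larger value ≤ nsnt" suggests. The function name and parameters are hypothetical.

```python
import random


def simulate(n, steps, seed=0):
    """Random run of n correct processes with a shared counter nsnt."""
    rng = random.Random(seed)
    nsnt = 0              # shared: number of send-to-all's so far
    sent = [False] * n    # each process sends to all at most once
    nrcvd = [0] * n       # local counters
    for _ in range(steps):
        p = rng.randrange(n)
        # Receive: take one more message, or nothing.
        if nrcvd[p] + 1 <= nsnt and rng.random() < 0.5:
            nrcvd[p] += 1
        # Send to all: increment the shared counter.
        if not sent[p] and rng.random() < 0.5:
            sent[p] = True
            nsnt += 1
        # Safety invariant: nobody receives more than was sent.
        assert all(r <= nsnt for r in nrcvd)
    return nsnt, nrcvd


nsnt, nrcvd = simulate(n=4, steps=200, seed=1)
print(nsnt, nrcvd)
```

The invariant is enforced by the guard alone. That every counter eventually catches up with nsnt is not guaranteed by the code; it is exactly what the fairness property has to add.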

Slide 77

Slide 77 text

Solution 2: Some questions you might ask

Q1: Instead of

    if
    :: ((nrcvd + 1) <= nsnt) -> nrcvd++;   /* one more message */
    :: skip;                               /* or nothing */
    fi;

why can't we just write

    nrcvd = nsnt;

A1: You can, but that will be a different model, not [FLP85]! [FLP85] only guarantees that every message is eventually received, whereas the assignment delivers every sent message immediately.

Slide 78

Slide 78 text

Solution 2: Some questions you might ask (cont.)

Reliable communication: every sent message is eventually received.

Q2: Why do we write

$$\mathbf{F}\,\mathbf{G}\,[\forall i.\ \mathit{nrcvd}_i \ge \mathit{nsnt}] \qquad (1)$$

instead of

$$\forall i.\ \mathbf{G}\,\mathbf{F}\,[\mathit{nrcvd}_i \ge \mathit{nsnt}] \qquad (2)$$

A2: We would like to write (2), but it requires another logic, indexed LTL, which causes problems in the parameterized case. For threshold-based algorithms, the value of nsnt changes at most n times. Under this assumption, (2) is equivalent to (1).
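One way to see the equivalence claimed in A2 is the following sketch. It relies on two assumptions that fit the encoding but are not stated on the slide: each counter $\mathit{nrcvd}_i$ is only ever incremented, and the number of processes is finite.

From (1) to (2): if from some point on all counters satisfy $\mathit{nrcvd}_i \ge \mathit{nsnt}$, then for each fixed $i$ the inequality holds infinitely often.

From (2) to (1): since $\mathit{nsnt}$ changes at most $n$ times, there is a time $T$ and a value $s$ with $\mathit{nsnt} = s$ at all times $k \ge T$. For each $i$, (2) yields some $T_i \ge T$ with $\mathit{nrcvd}_i \ge s$ at time $T_i$, and monotonicity extends this to all later times:

$$\forall k \ge T_i.\ \mathit{nrcvd}_i(k) \ge \mathit{nrcvd}_i(T_i) \ge s = \mathit{nsnt}(k)$$

Taking $T^{*} = \max_i T_i$, which exists because there are finitely many processes, gives

$$\forall k \ge T^{*}.\ \forall i.\ \mathit{nrcvd}_i(k) \ge \mathit{nsnt}(k)$$

which is (1).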

Slide 79

Slide 79 text

Solution 1 (cont.): Explicit Modeling of Faults

(Lamport et al., 1982) introduce Byzantine processes, which can do virtually anything. In our case, Byzantine behavior boils down to sending ECHO to some of the correct processes and not sending ECHO to the others:

    active[F] proctype Byz() {
        byte i;
    step:
        atomic {
            i = 0;
            do
            :: i < N -> p2p[_pid][i]!ECHO; i++;   /* send ECHO to process i */
            :: i < N -> i++;                      /* or not */
            :: i == N -> break;
            od
        };
        goto step;
    }

Slide 81

Slide 81 text

Solution 2 (cont.): Injecting Faults into Message Counters

We instantiate n − f correct processes and no faulty processes. Instead, we say that the correct processes may receive up to f additional messages due to faults:

    if
    :: ((nrcvd + 1) <= nsnt + f) -> nrcvd++;   /* receive one more message */
    :: skip;                                   /* or nothing */
    fi;

Fairness still forces the processes to receive all the messages sent by the correct processes:

$$\mathbf{F}\,\mathbf{G}\,[\forall i.\ \mathit{nrcvd}_i \ge \mathit{nsnt}]$$

Note: each correct process sends at most one ECHO message.

Slide 84

Slide 84 text

Solution 2 (cont.): Modeling different kinds of faults

Byzantine faults (previous slide):
- create only correct processes, i.e., n − f processes
- only they have to satisfy the spec
- extra messages from Byzantine processes: ((nrcvd + 1) <= nsnt + f)
- fairness (reliable communication): $\mathbf{F}\,\mathbf{G}\,[\forall i.\ \mathit{nrcvd}_i \ge \mathit{nsnt}]$

Omission faults (processes fail to send messages):
- create all processes, i.e., n processes
- all of them are mentioned in the specification
- no additional messages: ((nrcvd + 1) <= nsnt)
- fairness (with possible message loss due to faults): $\mathbf{F}\,\mathbf{G}\,[\forall i.\ \mathit{nrcvd}_i \ge \mathit{nsnt} - f]$

Crash faults: similar to omissions, with a crash control state added.
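The two fault models differ in only three places: how many processes are instantiated, the upper bound in the receive guard, and the lower bound in the fairness constraint. The Python sketch below tabulates these differences. The function names and the string labels "byzantine" and "omission" are assumptions for illustration, and the receive guard is taken to admit values up to the stated bound.

```python
def instantiated_processes(n, f, model):
    """Number of process instances created in the model."""
    return n - f if model == "byzantine" else n


def may_receive(nrcvd, nsnt, f, model):
    """Receive guard: may the local counter grow by one?"""
    extra = f if model == "byzantine" else 0   # messages injected by faults
    return nrcvd + 1 <= nsnt + extra


def fairness_bound(nsnt, f, model):
    """Value that every nrcvd must eventually reach and keep."""
    lost = f if model == "omission" else 0     # messages that may never arrive
    return nsnt - lost


n, f, nsnt = 4, 1, 2
for model in ("byzantine", "omission"):
    print(model,
          instantiated_processes(n, f, model),
          may_receive(2, nsnt, f, model),
          fairness_bound(nsnt, f, model))
```

With nsnt = 2 and f = 1, a process that has already counted two messages may count a third only in the Byzantine model, while only the omission model relaxes the fairness bound to 1.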

Slide 85

Slide 85 text

Experiments: Solution 1 vs. Solution 2

[Figure: two plots against the number of processes N (3 to 8): number of states (logscale) and memory in MB (logscale, ≤ 12 GB).]

- Solution 1: channels + explicit Byzantine processes (blue)
- Solution 2: shared variables + fault injection (red)

Both are measured in the presence of one Byzantine faulty process (f = 1); the case f = 2 runs out of memory too fast.

Slide 86

Slide 86 text

Summary

- We show how to model threshold-based fault-tolerant algorithms, starting with an imprecise description.
- We create PROMELA models using expert advice.
- The tool demonstrates that the model behaves as predicted by theory (for concrete values of parameters).
- This reference implementation allows us to optimize the encoding ...
- ... and to make the model amenable to parameterized verification.

Slide 87

Slide 87 text

Tomorrow: parameterized model checking

Slide 88

Slide 88 text

References I

Biely, M., Delgado, P., Milosevic, Z., & Schiper, A. 2013 (June). Distal: A framework for implementing fault-tolerant distributed algorithms. Pages 1–8 of: Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on.

Castro, Miguel, Liskov, Barbara, et al. 1999. Practical Byzantine fault tolerance. Pages 173–186 of: OSDI, vol. 99.

Fischer, Michael J., Lynch, Nancy A., & Paterson, M. S. 1985. Impossibility of Distributed Consensus with one Faulty Process. J. ACM, 32(2), 374–382. http://doi.acm.org/10.1145/3149.214121.

John, Annu, Konnov, Igor, Schmid, Ulrich, Veith, Helmut, & Widder, Josef. 2013. Towards Modeling and Model Checking Fault-Tolerant Distributed Algorithms. Pages 209–226 of: SPIN. LNCS, vol. 7976.

Lamport, Leslie, Shostak, Robert E., & Pease, Marshall C. 1982. The Byzantine Generals Problem. ACM Trans. Program. Lang. Syst., 4(3), 382–401.

Slide 89

Slide 89 text

References II

Lincoln, P., & Rushby, J. 1993. A formally verified algorithm for interactive consistency under a hybrid fault model. Pages 402–411 of: FTCS-23. http://dx.doi.org/10.1109/FTCS.1993.627343.

Slide 91

Slide 91 text

Folklore Reliable Broadcast (e.g., Chandra & Toueg, 96)

Correct processes agree on value $v_i$ in the presence of crash faults.

Variables of process i:

    v_i: {0, 1}, initially 0 or 1
    accept_i: {0, 1}, initially 0

An atomic step:

    if (v_i = 1 or received from some process) and accept_i = 0
    then begin
        send to all;      /* when crashing it sends to a subset of processes */
        accept_i := 1;    /* it can also crash here */
    end

Slide 94

Slide 94 text

Verification Problem as in Distributed Computing

Given a distributed algorithm A and specifications $\varphi_U$, $\varphi_C$, $\varphi_R$: fix n and t with n > 3t, and show that every execution of A(n, t) satisfies $\varphi_U$, $\varphi_C$, $\varphi_R$.

In every execution:
- the number of faulty processes is restricted, i.e., f ≤ t;
- processes can use n and t in the code, but not f;
- f is constant (if a process fails late, its "correct" behavior was a Byzantine trick).

[Figure: a distributed system A(n, t) with its instances for f = 0, ..., f = t.]

Counterexamples when f > t?
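For fixed parameters, the verification problem becomes a finite list of instances to check. The Python sketch below enumerates the admissible triples (n, t, f) with n > 3t and f ≤ t up to a bound on n. The generator name and the bound `max_n` are assumptions for illustration; each yielded triple would correspond to one model-checker run.

```python
def admissible_instances(max_n):
    """Yield (n, t, f) with n > 3t and 0 <= f <= t, for n up to max_n."""
    for n in range(1, max_n + 1):
        for t in range(n):
            if n > 3 * t:                # resilience condition
                for f in range(t + 1):   # actual number of faults, f <= t
                    yield (n, t, f)


print(list(admissible_instances(4)))
# [(1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0), (4, 1, 0), (4, 1, 1)]
```

Instances outside this set, such as n = 3t or f > t, are the ones where counterexamples are expected.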

Slide 95

Slide 95 text

Experiments: Channels vs. Shared Variables

Enumerating reachable states in SPIN with partial order reduction (POR) and state compression.

[Figure: two plots against the number of processes N (3 to 8): number of states (logscale) and memory in MB (logscale, limit of 12 GB).]