
Introduction into Fault-tolerant Distributed Algorithms and their Modeling (Part 2)

Exactpro
November 24, 2014

Josef Widder, Vienna University of Technology

Transcript

  1. Model Checking of Fault-Tolerant Distributed Algorithms. Part II: Modeling
     Fault-Tolerant Distributed Algorithms. Annu Gmeiner, Igor Konnov, Ulrich Schmid,
     Helmut Veith, Josef Widder. TMPA 2014, Kostroma, Russia.
     Josef Widder (www.forsyte.at), Checking Fault-Tolerant Distributed Algos,
     TMPA'14, Nov. 2014
  2. Why Modeling and Verification? Let's have a look at the recent technical
     report by amazon.com [1]: "We have found that the standard verification
     techniques in industry are necessary but not sufficient. We use deep design
     reviews, code reviews, static code analysis, stress testing, fault-injection
     testing [...], but we still find that subtle bugs can hide in complex
     concurrent fault-tolerant systems. ... We have found that testing the code is
     inadequate as a method to find subtle errors in design, as the number of
     reachable states of the code is astronomical."
     [1] C. Newcombe et al., Use of Formal Methods at Amazon Web Services, 2014
  3. Why Modeling and Verification? (cont.) "... the final executable code is
     unambiguous, but contains an overwhelming amount of detail. We needed to be
     able to capture the essence of a design in a few hundred lines of precise
     description. Engineers naturally focus on designing the 'happy case' for a
     system, i.e. the processing path in which no errors occur. ... the shortest
     error trace exhibiting the bug contained 35 high-level steps. The
     improbability of such compound events is not a defense against such bugs;
     historically, AWS has observed many combinations of events at least as
     complicated as those that could trigger this bug."
  4. What are we doing? Our approach: a high-level description of a design that
     is precise, and a sound verification method that is as complete as possible.
     More specifically: a modeling approach suitable for model checking; an
     automatic parameterized verification method; targeting fault-tolerant
     distributed algorithms. (We are not interested in verifying mathematical toy
     examples.)
  6. Why Model Checking? An alternative proof approach; useful counter-examples;
     the ability to define and vary assumptions about the system and to see why
     it breaks; closer to the code level; a good degree of automation.
  7. Distributed Algorithms: Model-Checking Challenges. Unbounded data types; an
     unbounded number of rounds (round numbers are part of messages);
     parameterization in multiple parameters: among n processes, f ≤ t are
     faulty with n > 3t (in contrast to concurrent programs); diverse fault
     models (adverse environments); continuous time (fault-tolerant clock
     synchronization); degrees of concurrency: synchronous, asynchronous,
     partially synchronous (e.g., a process makes at most 5 steps between 2
     steps of any other process).
  8. Fault-tolerant distributed algorithms: n processes communicate by messages;
     all processes know that at most t of them might be faulty; f are actually
     faulty.
  11. Challenge #1: fault models. Clean crashes (least severe): faulty processes
     prematurely halt after/before a "send to all". Crash faults: faulty
     processes prematurely halt, possibly in the middle of a "send to all".
     Omission faults: faulty processes follow the algorithm, but some messages
     sent by them might be lost. Symmetric faults: faulty processes send
     arbitrarily, to all or to nobody. Byzantine faults (most severe): faulty
     processes can do anything; this encompasses all behaviors of the above
     models.
  12. Challenges #2 & #3: Pseudo-code and Communication. Translate pseudo-code
     into a formal description that allows us to verify the algorithm and does
     not oversimplify the original algorithm. Assumptions about the
     communication medium are usually written in plain English, are spread
     across research papers, and constitute folklore knowledge.
  13. Asynchronous Reliable Broadcast (Srikanth & Toueg, 87). The core of the
     classic broadcast algorithm from the DA literature. It solves an agreement
     problem depending on the inputs v_i.
     Variables of process i:
       v_i      : {0, 1}  initially 0 or 1
       accept_i : {0, 1}  initially 0
     An atomic step:
       if v_i = 1
         then send (echo) to all;
       if received (echo) from at least t + 1 distinct processes
          and not sent (echo) before
         then send (echo) to all;
       if received (echo) from at least n - t distinct processes
         then accept_i := 1;
     The setting is asynchronous with t Byzantine faults; the algorithm is
     correct if n > 3t. The code is parameterized in n and t, which gives a
     process template P(n, t, f).
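The threshold logic above can be sketched in a few lines of Python. This is an illustrative simulation, not the authors' Promela model; the function and variable names are ours, and the faulty processes are abstracted by the number of echoes they contribute:

```python
# Sketch (ours) of the Srikanth & Toueg broadcast core: n - f correct
# processes, f Byzantine processes that may or may not send <echo>.
# The thresholds t + 1 and n - t are taken from the slide.

def run_st87(n, t, f, v, byz_echo):
    """v: initial values of the n - f correct processes.
    byz_echo: how many of the f faulty processes send <echo>."""
    sent = [vi == 1 for vi in v]        # correct process i has sent <echo>
    accept = [False] * (n - f)
    changed = True
    while changed:                      # iterate to a fixpoint
        changed = False
        echoes = sum(sent) + byz_echo   # echoes every process may receive
        for i in range(n - f):
            if not sent[i] and echoes >= t + 1:
                sent[i] = True
                changed = True
            if not accept[i] and echoes >= n - t:
                accept[i] = True
                changed = True
    return accept

# Unforgeability: all correct start with 0, faulty echo anyway -> nobody accepts
assert not any(run_st87(7, 2, 2, [0] * 5, byz_echo=2))
# Completeness: all correct start with 1 -> everybody accepts
assert all(run_st87(7, 2, 2, [1] * 5, byz_echo=0))
```

The sketch deliberately ignores message ordering and asynchrony; it only shows why the two thresholds interact with the bound n > 3t.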
  15. Typical Structure of a Computation Step: receive messages; compute using
     the messages and local variables (described in English with basic
     if-then-else control flow); send messages. The whole step is atomic, which
     is implicit in the pseudo-code.
  17. Challenge #4: Parameterized Model Checking. The parameterized model
     checking problem: given a process template P(n, t, f), a resilience
     condition RC: n > 3t ∧ t ≥ f ≥ 0, fairness constraints Φ (e.g., "all
     messages will be delivered"), and an LTL-X formula ϕ, show for all n, t,
     and f satisfying RC:
       (P(n, t, f))^(n−f) composed with f faulty processes |= (Φ → ϕ)
  18. Challenge #5: Liveness in Distributed Algorithms. The interplay of safety
     and liveness is a central challenge in DAs: achieving both safety and
     liveness is non-trivial, and asynchrony and faults lead to impossibility
     results (recall the first part of the lecture: Fischer et al., 1985).
     There is a rich literature on verifying safety (e.g., in concurrent
     systems). From the distributed-algorithms perspective: "doing nothing is
     always safe", so "tools verify algorithms that actually might do nothing".
     Verification efforts often have to simplify assumptions.
  20. Summary. We have to model: faults; the communication medium (captured in
     English); algorithms written in pseudo-code. And we have to check: safety
     and liveness of parameterized systems with unbounded integers and
     non-standard fairness constraints.
  21. Existing formalization frameworks:
       TLA+/PlusCal: design & specification of concurrent algorithms; proving / TLC
       (Timed) IOA: asynchronous DA; proving / UPPAAL
       PVS: theorem proving
       ?: (parameterized) model checking of FTDAs
       DISTAL: simulation
       PBFT: implementation
  22. Alternative frameworks. TLA (temporal logic of actions): used to design
     (distributed) algorithms by refinement of the specification; verification
     with proof assistants (low degree of automation). Encodings of DAs in the
     proof assistant PVS (e.g., by Rushby): ad-hoc encodings; found a bug in a
     published synchronous Byzantine agreement algorithm (Lincoln & Rushby,
     1993). I/O automata: originally designed for writing clearer hand-written
     proofs; limited tool support (e.g., the Veromodo toolset is still in
     beta); suitable only for asynchronous distributed algorithms. Proof
     assistants are very general but offer a low degree of automation:
     "everything is possible, but nothing is easy".
  24. Simulation and Implementation. Distal: a domain-specific language (Biely
     et al., 2013) to simulate and to evaluate the performance of
     fault-tolerant algorithms. Practical Byzantine Fault Tolerance (Castro et
     al., 1999) and other practical algorithms: implementations with
     optimizations; the precise semantics is unclear; the system is partially
     synchronous (non-divergent message delays are assumed).
  25. In this part we introduce an efficient encoding in PROMELA, verify safety
     and liveness of fault-tolerant algorithms (for fixed parameters), and find
     counterexamples for parameters known from the literature. This shows the
     adequacy of our modeling.
  26. Promela. PROMELA ≡ PROcess MEta LAnguage. SPIN ≡ Simple Promela
     INterpreter (not that simple any more). Here we give a short introduction
     and cover only the features important to our work. Detailed documentation,
     tutorials, and books by Gerard Holzmann and others: http://spinroot.com
  27. Top-level: global variables and processes.
       /* global declarations visible to all processes */
       int x;             /* a global integer (as in C) */
       mtype = { X, Y };  /* constant message types */
       /* a FIFO channel with at most 2 messages of type mtype */
       chan c = [2] of { mtype };
       active[2] proctype ProcA() {  /* two processes created in the initial state */
         ...
       }
       proctype ProcB() {  /* processes can be created later using: run ProcB() */
         ...
       }
       init {  /* a special process, used to create other processes */
         run ProcB();
         run ProcB();
       }
  28. One process: basics.
       int x, y;
       active proctype ProcA() {
         int z;   /* declare a local variable */
         z = x;   /* assignment */
         x > y;   /* block until the expression evaluates to true */
         true;    /* one step to execute, no effect */
         z++;
         skip;    /* same as true */
       }
  29. One process: control flow.
       int x, y;
       active proctype P() {
       main:
         if  /* a guarded command */
         :: x == 0 -> x = 1;
         :: y == 0 -> y = 1;
         :: x == 1 && y == 1 -> x = 0; y = 0;
         fi;
         goto main;
       }
     The if statement non-deterministically selects an option whose first
     expression is not blocked, and continues executing the rest of that option
     step by step.
  30. One process: control flow (cont.)
       int x = 0, y = 0;
       active proctype P() {
       main:
         if
         :: x == 0 -> x = 1;
         :: y == 0 -> y = 1;
         :: x == 1 && y == 1 -> x = 0; y = 0;
         fi;
         goto main;
       }
     Three example runs (values of x,y after each step):
       Run 1: 0,0  1,0  1,1  0,0  0,1  1,1  0,0
       Run 2: 0,0  0,1  1,1  0,0  1,0  1,1  0,0
       Run 3: 0,0  1,0  1,1  0,0  0,1  1,1  0,0
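All of these runs move through the same four states. A small brute-force enumeration in Python (our sketch, not part of Spin) confirms which states the guarded-command loop can reach:

```python
# Sketch (ours): enumerate the reachable (x, y) states of the loop above.
def step(x, y):
    succ = []
    if x == 0:
        succ.append((1, y))            # :: x == 0 -> x = 1
    if y == 0:
        succ.append((x, 1))            # :: y == 0 -> y = 1
    if x == 1 and y == 1:
        succ.append((0, 0))            # :: x == 1 && y == 1 -> x = 0; y = 0
    return succ

seen, frontier = {(0, 0)}, [(0, 0)]
while frontier:                        # simple graph search from the initial state
    s = frontier.pop()
    for s2 in step(*s):
        if s2 not in seen:
            seen.add(s2)
            frontier.append(s2)

assert seen == {(0, 0), (1, 0), (0, 1), (1, 1)}
```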
  31. One process: loops.
       int x;
       active proctype P() {
         do  /* a do..od loop */
         :: x == 10 -> x = 0;
         :: x == 10 -> break;
         :: x < 10 -> x++;
         od;
       A:
         if  /* basically the same; goto A introduces one more step */
         :: x == 10 -> x = 0;
         :: x == 10 -> goto B;
         :: x < 10 -> x++;
         fi;
         goto A;
       B:
       }
  32. Many processes: interleavings. Pure interleaving semantics: every
     statement is executed atomically.
       int x = 0, y = 1;
       active[2] proctype A() {
         x = 1 - x;
         y = 1 - y;
       }
     (The figure shows the interleaved steps of A[0] and A[1]; the red path is
     an example execution where the steps of processes 0 and 1 alternate.)
  33. Many processes: atomics. Use atomic { ... } to make the execution of a
     sequence indivisible; non-deterministic choice with if..fi is still
     allowed inside!
       int x = 0, y = 1;
       active[2] proctype A() {
         atomic {
           x = 1 - x;
           y = 1 - y;
         }
       }
     Larger atomic steps lead to fewer possible paths and states. Note:
     different degrees of atomicity may lead to different verification results.
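The state-space reduction can be illustrated with a brute-force enumeration in Python (our sketch; the pc entries model each process's program counter, which is not spelled out on the slide):

```python
# Sketch (ours): reachable global states of two copies of
#   x = 1 - x; y = 1 - y;
# with and without atomic { }. A state is (x, y, pc[0], pc[1]);
# pc = 2 means the process has terminated.
def reachable(atomic):
    init = (0, 1, 0, 0)
    seen, frontier = {init}, [init]
    while frontier:
        x, y, *pc = frontier.pop()
        for i in (0, 1):
            if pc[i] == 2:               # process i is done
                continue
            nx, ny, npc = x, y, list(pc)
            if atomic:                   # both assignments in one step
                nx, ny, npc[i] = 1 - x, 1 - y, 2
            elif pc[i] == 0:             # first statement: x = 1 - x
                nx, npc[i] = 1 - x, 1
            else:                        # second statement: y = 1 - y
                ny, npc[i] = 1 - y, 2
            s = (nx, ny, npc[0], npc[1])
            if s not in seen:
                seen.add(s)
                frontier.append(s)
    return seen

# atomic { } leaves strictly fewer reachable states than free interleaving
assert len(reachable(atomic=True)) < len(reachable(atomic=False))
```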
  35. (Asynchronous) message passing.
       mtype = { A, B };
       chan chan1 = [1] of { mtype };  /* a queue of size 1 */
       chan chan2 = [1] of { mtype };
       active proctype Ping() {
         chan1!A;  /* insert A into chan1 */
         do
         /* when B is at the head of chan2, remove it and insert A into chan1 */
         :: chan2?B -> chan1!A;
         od;
       }
       active proctype Pong() {
         do
         :: chan1?A -> chan2!B;
         od;
       }
  36. Blocking receive.
       mtype = { A, B };
       chan chan1 = [1] of { mtype };
       chan chan2 = [1] of { mtype };
       active proctype Ping() {
         chan1!A;
         do
         :: chan2?B ->   /* <-- deadlock! */
            chan1!A;
         od;
       }
       active proctype Pong() {
         do
         :: chan1?A ->
            chan1?A;     /* <-- deadlock! */
            chan2!B;
         od;
       }
     Ping sends A and Pong receives it; then Pong's second chan1?A blocks
     forever, while Ping is blocked on chan2?B.
  37. Blocking send.
       mtype = { A, B };
       chan chan1 = [1] of { mtype };
       chan chan2 = [1] of { mtype };
       active proctype Ping() {
         chan1!A;
         do
         :: chan2?B ->
            chan1!A;
            chan1!A;
            chan1!A;    /* <-- deadlock! */
            chan1!A;
         od;
       }
       active proctype Pong() {
         do
         :: chan1?A -> chan2!B;  /* <-- deadlock! */
         od;
       }
     When chan1 = [A] and chan2 = [B], the system deadlocks. The shortest
     counter-example has 10 steps; use Spin to find it.
  38. Promela vs. C. PROMELA looks like C, but it is not C: non-determinism in
     if statements (internal non-determinism); a non-deterministic scheduler
     (external non-determinism); atomic statements; message passing. PROMELA is
     a modeling language.
  39. Preliminaries: Kripke structures, Linear Temporal Logic (LTL), and
     Control-Flow Automata (CFA).
  40. Kripke structures. A Kripke structure is a tuple M = (S, S0, R, AP, L),
     where: S is a set of states; S0 ⊆ S is the set of initial states;
     R ⊆ S × S is a transition relation; AP is a set of atomic propositions;
     and L : S → 2^AP is a state-labeling function. (The example in the figure
     labels the states s0 : {r}, s1 : {y}, s2 : {y}, s3 : {r, y, g}, s4 : {g}.)
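The definition translates directly into plain Python data. The concrete states, transitions, and labels below are illustrative (smaller than the figure's example), not taken from the slide:

```python
# Sketch (ours): a Kripke structure M = (S, S0, R, AP, L) as plain data.
S = {"s0", "s1", "s2"}
S0 = {"s0"}
R = {("s0", "s1"), ("s1", "s2"), ("s2", "s0")}
AP = {"r", "y", "g"}
L = {"s0": {"r"}, "s1": {"y"}, "s2": {"g"}}   # L : S -> 2^AP

def reachable(S0, R):
    """All states reachable from S0 along transitions in R."""
    seen, frontier = set(S0), list(S0)
    while frontier:
        s = frontier.pop()
        for (a, b) in R:
            if a == s and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen

# in this toy structure every state is reachable from the initial state
assert reachable(S0, R) == S
```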
  41. Linear Temporal Logic. An LTL formula is defined inductively w.r.t. the
     atomic propositions AP: (base) every p ∈ AP is an LTL formula; if ϕ and ψ
     are LTL formulas, then the following are LTL formulas: Next-time X ϕ,
     Eventually F ϕ, Globally G ϕ, Until ψ U ϕ, and the Boolean combinations
     ϕ ∧ ψ, ϕ ∨ ψ, and ¬ϕ.
  42. Recall: Typical Structure of a Computation Step. Receive messages;
     compute using the messages and local variables (described in English with
     basic if-then-else control flow); send messages. The whole step is atomic,
     which is implicit in the pseudo-code.
  43. CFA: intermediate representation. An intermediate representation of a
     loop body: a path from qI to qF encodes one iteration. Every variable is
     assigned at most once (SSA).
       active proctype P() {
         int x, y;
         do
         :: x == 0 -> x = 1;
         :: x == 1 -> x = 2;
         :: x == 2 -> x = 0;
         :: x == 1 -> x = 0; y = 1 - y;
         od;
       }
     (The CFA has the nodes qI, q0, ..., q4, qF, with edges labeled by the
     guards x = 0, x = 1, x = 2 and the assignments x = 1, x = 2, x = 0,
     y = 1 − y.)
  44. Example: from a CFA to a Kripke structure. The Kripke structure
     M(n, t, f) = (S, S0, R, AP, L) of N(n, t, f) processes. For a path π from
     qI to qF, construct a formula φπ(x, y, x', y'). A state is a pair of
     x = (x1, ..., xN) ∈ N^N and y = (y1, ..., yN) ∈ N^N, and the initial
     states are S0 = {(0, ..., 0)}. ((x, y), (x', y')) ∈ R iff there are a
     process index k, 1 ≤ k ≤ N, and a path π such that:
       [k moves]: φπ(xk, yk, x'k, y'k) holds;
       [others do not]: ∀i ∈ {1, ..., N} \ {k}: x'i = xi and y'i = yi.
     The propositions are AP = {[∃i. yi = 0], [∀i. yi = 0]}, and a state
     (x, y) is labeled by p ∈ L((x, y)) iff (x, y) |= p.
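The interleaving rule, one process k moves while all others keep their variables, can be sketched in Python. The local transition relation below is illustrative (ours), not derived from the CFA in the figure:

```python
# Sketch (ours) of the interleaving rule: a global transition moves exactly
# one process k along a local step; every other process keeps its state.
def local_steps(x):
    """Successor values of one process's local variable (illustrative)."""
    return {0: [1], 1: [2, 0], 2: [0]}[x]

def global_successors(state):
    succ = []
    for k, xk in enumerate(state):   # choose the process k that moves
        for nxt in local_steps(xk):
            s = list(state)
            s[k] = nxt               # [k moves]; [others do not]
            succ.append(tuple(s))
    return succ

assert (1, 0, 0) in global_successors((0, 0, 0))      # process 0 moved
assert (1, 1, 0) not in global_successors((0, 0, 0))  # two moved at once: illegal
```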
  45. Example: properties of ST87 in LTL.
     Unforgeability (safety). If vi = 0 for all correct processes i, then for
     all correct processes j, acceptj remains 0 forever:
       G ( (∧_{i=1..n−f} vi = 0) → G (∧_{j=1..n−f} acceptj = 0) )
     Completeness (liveness). If vi = 1 for all correct processes i, then there
     is a correct process j that eventually sets acceptj to 1:
       G ( (∧_{i=1..n−f} vi = 1) → F (∨_{j=1..n−f} acceptj = 1) )
     Relay (liveness). If a correct process i sets accepti to 1, then
     eventually all correct processes j set acceptj to 1:
       G ( (∨_{i=1..n−f} accepti = 1) → F (∧_{j=1..n−f} acceptj = 1) )
  46. Model Checking Problems.
     Finite-state MC. Input: a process template P, an LTL formula ϕ (including
     fairness), and values of the parameters n, t, and f. Problem: check
     whether M(n, t, f) |= ϕ.
     Parameterized MC. Input: a process template P and an LTL formula φ
     (including fairness) with atomic propositions of the form [∃i. xi < y] and
     [∀i. xi < y]. Problem: check whether for all n, t, and f with
     n > 3t ∧ t ≥ f ∧ f ≥ 0 it holds that M(n, t, f) |= φ.
  47. Parameterized modeling & non-parameterized model checking, as in SPIN'13
     (John et al., 2013).
  48. Modeling of threshold-based algorithms in Promela. We introduce an
     efficient encoding of threshold-based fault-tolerant algorithms in PROMELA
     (with parameterization!), verify safety and liveness of fault-tolerant
     algorithms (for fixed parameters), and find counterexamples for parameters
     known from the literature; this shows the adequacy of our modeling. For
     our method, we exploit specifics of FTDAs: (1) the central feature of the
     algorithms: message counting; (2) the specific message passing: we do not
     need to know who sent messages, only how many processes sent them; (3) the
     way faults affect messages (again, counting messages).
  50. Case Studies. We consider a number of threshold-based algorithms: our
     running example ST87 for (1) Byzantine faults (BYZ), (2) omission faults
     (OMIT), (3) symmetric faults (SYMM), and (4) clean crashes (CLEAN); and
     (5) the folklore reliable broadcast for clean crashes [Chandra & Toueg 96,
     CT96]. (To be continued.)
  51. Characteristics of the FTDA by Srikanth & Toueg, 87.
     Variables of process i:
       v_i      : {0, 1}  initially 0 or 1
       accept_i : {0, 1}  initially 0
     An atomic step:
       if v_i = 1
         then send (echo) to all;
       if received (echo) from at least t + 1 distinct processes
          and not sent (echo) before
         then send (echo) to all;
       if received (echo) from at least n - t distinct processes
         then accept_i := 1;
     The algorithm consists of threshold-guarded commands with only the
     thresholds t + 1 and n − t; communication is by "send to all"; how
     processes distinguish distinct senders is not part of the algorithm (i.e.,
     the algorithm description is high-level).
  53. Case Studies (cont.): larger algorithms. More involved algorithms in the
     purely asynchronous setting:
     (6) Asynchronous Byzantine Agreement (Bracha & Toueg 85, BT85): Byzantine
     faults; two phases and two message types; five status values; properties:
     unforgeability, correctness (liveness), agreement (liveness).
     (7) Condition-based Consensus (Mostéfaoui et al. 01, MRRR01): crash
     faults; two phases and four message types; nine status variables;
     properties: validity, agreement, termination (liveness).
     (8) Fast Byzantine Consensus, common case (Martin & Alvisi 06, MA06):
     Byzantine faults; the core part of the algorithm; no cryptography.
  54. Experimental Results at a Glance.
     Algorithm   Fault  Parameters                         Resilience              Properties          Time
     1. ST87     BYZ    n = 7, t = 2, f = 2                n > 3t                  U, C, R             6 sec.
     1. ST87     BYZ    n = 7, t = 3, f = 2                n > 3t                  U, C, R             5 sec.
     1. ST87     BYZ    n = 7, t = 1, f = 2                n > 3t                  U, C, R             1 sec.
     2. ST87     OMIT   n = 5, t = 2, f = 2                n > 2t                  U, C, R             4 sec.
     2. ST87     OMIT   n = 5, t = 2, f = 3                n > 2t                  U, C, R             5 sec.
     3. ST87     SYMM   n = 5, t = 1, fp = 1, fs = 0       n > 2t                  U, C, R             1 sec.
     3. ST87     SYMM   n = 5, t = 2, fp = 3, fs = 1       n > 2t                  U, C, R             1 sec.
     4. ST87     CLEAN  n = 3, t = 2, fc = 2, fnc = 0      n > t                   U, C, R             1 sec.
     5. CT96     CRASH  n = 2                              —                       U, C, R             1 sec.
     6. BT85     BYZ    n = 5, t = 1, f = 1                n > 3t                  R                   131 sec.
     6. BT85     BYZ    n = 5, t = 1, f = 2                n > 3t                  R                   1 sec.
     6. BT85     BYZ    n = 5, t = 2, f = 2                n > 3t                  R                   1 sec.
     7. MRRR01   CRASH  n = 3, t = 1, f = 1                n > 2t                  V0, V1, A, T        1 sec.
     7. MRRR01   CRASH  n = 3, t = 1, f = 2                n > 2t                  V0, V1, A, T        1 sec.
     8. MA06     BYZ    p = 4, a = 6, l = 4, t = 1, f = 1  p > 3t, a > 5t, l > 3t  CS1, CS3, CL1, CL2  3 hrs.
     8. MA06     BYZ    p = 4, a = 5, l = 4, t = 1, f = 1  p > 3t, a > 5t, l > 3t  CS1, CS3, CL1, CL2  14 min.
     8. MA06     BYZ    p = 4, a = 6, l = 4, t = 1, f = 2  p > 3t, a > 5t, l > 3t  CS1, CS3, CL1, CL2  2 sec.
  60. Experimental Results: ST87, the Byzantine case. (The plots show time in
     seconds and memory in MB, logscale, for no faults f = 0 and two faults
     f = 2.) The more faults we have, the easier the problem is: with two
     faults we can check systems of up to nine processes; with no faults,
     systems of up to seven processes. Precision of modeling: we found
     counter-examples for the corner cases n = 3t and f > t, where the
     resilience condition is violated. (June 2013: somebody wrote on Wikipedia
     that n = 3t should work :-)
61. Discussion of the specifications

Unforgeability. If vi = 0 for all correct processes i, then for all correct processes j, acceptj remains 0 forever:

  G [ (∧_{i=1..n−f} vi = 0)  →  G (∧_{j=1..n−f} acceptj = 0) ]

The specifications of Byzantine FTDAs have the following features:
- Only the states of correct processes are evaluated. Faulty processes may be Byzantine (no assumption on their behavior).
- Specifications do not talk about individual processes. Only global safety and progress are important.
- Indexed temporal logic is not required! Quantification over processes is at the level of atomic propositions.
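As an illustration (my addition, not from the slides), unforgeability can be checked mechanically on a finite trace of global states. The Python sketch below is hypothetical: each state records the v and accept values of the n − f correct processes only.

```python
# Hypothetical sketch: checking Unforgeability on a finite trace.
# A state is a pair (v, accept), each a tuple over the correct processes only.

def unforgeable(trace):
    """G (all v_i = 0) -> G (all accept_j = 0), evaluated on a finite trace."""
    all_v_zero = all(all(x == 0 for x in v) for v, _ in trace)
    if not all_v_zero:
        return True  # premise is false, the property holds vacuously
    return all(all(a == 0 for a in accept) for _, accept in trace)

# Correct processes start with v = 0 and never accept: the property holds.
ok = [((0, 0, 0), (0, 0, 0)), ((0, 0, 0), (0, 0, 0))]
# Some process accepts although all v_i = 0: a forged acceptance.
bad = [((0, 0, 0), (0, 0, 0)), ((0, 0, 0), (0, 1, 0))]

print(unforgeable(ok), unforgeable(bad))  # True False
```

Note how the faulty processes simply do not appear in the trace, matching the first feature above.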
64. Threshold-Guarded Distributed Algorithms

Standard construct: quantified guards (t = f = 0)

Existential Guard
  if received m from some process then ...

Universal Guard
  if received m from all processes then ...

What if faults might occur?

Fault-Tolerant Algorithms: n processes, at most t are Byzantine

Threshold Guard
  if received m from n − t processes then ...
  (the processes cannot refer to f!)
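To make the threshold guard concrete, here is a small Python sketch (my own illustration, not from the slides): a process counts messages from distinct senders and the guard fires once it has heard from n − t of them.

```python
# Illustration: evaluating a threshold guard "received m from n - t processes".
# Senders are counted once each, so duplicates from the same process don't help.

def threshold_guard_fires(received_from, n, t):
    return len(set(received_from)) >= n - t

n, t = 4, 1
print(threshold_guard_fires([0, 1, 2], n, t))     # True: 3 distinct senders >= n - t = 3
print(threshold_guard_fires([0, 0, 0, 1], n, t))  # False: only 2 distinct senders
```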
67. Counting Argument in Threshold-Guarded Algorithms

[Figure: n processes, of which at most t may be faulty (f of them actually are); any set of t + 1 received messages overlaps the correct processes, so at least one non-faulty process sent the message.]

Correct processes count incoming messages from distinct processes.
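The pigeonhole argument behind the figure can be spelled out in a few lines of Python (an illustration I added, under the slide's assumption f ≤ t): among any t + 1 distinct senders, at least one must be correct.

```python
# Counting argument: if at most f <= t processes are faulty, then any set of
# t + 1 distinct senders contains at least one correct (non-faulty) sender.

def min_correct_senders(num_distinct_senders, t):
    # Worst case: all t potentially faulty processes are among the senders.
    return max(0, num_distinct_senders - t)

t = 2
assert min_correct_senders(t + 1, t) >= 1  # t + 1 senders: at least one correct
assert min_correct_senders(t, t) == 0      # t senders could all be faulty
print("counting argument checked for t =", t)
```

This is why guards like "received m from t + 1 processes" let a correct process conclude that the message was not forged by the Byzantine processes alone.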
70. Modeling threshold-based algorithms in Promela

As the distributed algorithms are given in pseudo-code, we have to decide how to encode in PROMELA:
- send to all and receive,
- counting expressions "received <m> from n − t distinct processes",
- faults.

In what follows, we compare two solutions side by side:
- A straightforward encoding with PROMELA channels and an explicit representation of faulty processes. [Solution 1]
- An advanced encoding with shared variables and fault injection. [Solution 2]

To decouple the encoding of reliable message passing from the encoding of faults, we first consider message passing without faults, and then show how to encode faults.
71. Template in Promela

We implement the following loop:
  receive messages;
  compute using messages and local variables
  (description in English with basic control flow if-then-else);
  send messages.

/* shared state: a variable or a channel */
active[N(n,t,f)] proctype P() {
  /* local variable to count messages from distinct processes */
  int nrcvd;
  /* initialization */
loop:
  atomic {
    /* 1. receive and count messages */
    /* 2. compute using nrcvd       */
    /* 3. send messages             */
  }
  goto loop;
}
72. Modeling Message Passing

All our case studies are designed with the assumption of classic reliable asynchronous message passing as in (Fischer et al., 1985):
- non-blocking communication: the operations "receive" and "send" are executed immediately;
- if a message can be received now, it may also be received later; a process does not have to receive a message as soon as it is able to;
- every sent message is eventually received, but there are no bounds on the delays.
73. Solution 1: Message Passing using Promela channels

A straightforward encoding using message channels:

/* message type */
mtype = { ECHO };
/* point-to-point channels */
chan p2p[N][N] = [1] of { mtype };
/* tag received messages */
bit rx[N][N];

Sending a message to all processes:

for (i : 0 .. N - 1) { p2p[_pid][i]!ECHO; }

Note: _pid denotes the process identifier in PROMELA (we use it solely to encode message passing).
74. Solution 1: Message Passing (cont.)

Receiving and counting messages from distinct processes (no faults yet):

/* local */
int nrcvd = 0;  /* initially, no messages */
...
i = 0;
do
  /* is there a message from process i? */
  :: (i < N) && nempty(p2p[i][_pid]) ->
        p2p[i][_pid]?ECHO;        /* remove it */
        if
        :: !rx[i][_pid] ->        /* 1. the first time: */
              rx[i][_pid] = 1;    /*    a. mark as received */
              nrcvd++;            /*    b. increase local counter */
              break;
        :: rx[i][_pid];           /* 2. ignore a duplicate */
        fi;
        i++;                      /* next process */
  :: (i < N) -> i++;              /* receive nothing from i */
  :: i == N -> break;
od
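Stripped of Promela syntax, the dedup logic of this receive step boils down to the following Python paraphrase (my illustration, not part of the slides):

```python
# Paraphrase of Solution 1's receive step: count each sender at most once,
# so nrcvd is the number of *distinct* processes we have heard from.

def receive_all(incoming, rx, nrcvd):
    """incoming: sender ids with a pending ECHO for this process."""
    for sender in incoming:
        if not rx[sender]:    # first ECHO from this sender
            rx[sender] = True
            nrcvd += 1
        # duplicates are removed from the channel but not counted
    return rx, nrcvd

N = 4
rx, nrcvd = receive_all([0, 2, 0, 2, 3], [False] * N, 0)
print(nrcvd)  # 3: distinct senders are 0, 2, 3
```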
75. Solution 2: Simulating message passing with variables

Keep a count of the send-to-all's by (correct) processes:

int nsnt; /* shared variable: number of send-to-all's by correct processes */

Sending a message to all:

nsnt++;

Receiving and counting messages from distinct processes (no faults):

if /* pick a larger value ≤ nsnt */
:: (nrcvd + 1 <= nsnt) -> nrcvd++; /* one more message */
:: skip;                           /* or nothing */
fi;

Reliable communication as a fairness property: F G [∀i. nrcvdi ≥ nsnt]
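One way to see why this abstraction is sound (my own sketch, not the authors' tooling): replace the nondeterministic choice by random steps and check the invariant nrcvd ≤ nsnt along sampled runs.

```python
import random

# Random simulation of Solution 2's abstraction: at each step the system either
# performs a send-to-all (nsnt++), lets a process receive one more message
# (only if nrcvd + 1 <= nsnt), or idles.

def simulate(steps, seed):
    rng = random.Random(seed)
    nsnt, nrcvd = 0, 0
    for _ in range(steps):
        action = rng.choice(["send", "recv", "idle"])
        if action == "send":
            nsnt += 1
        elif action == "recv" and nrcvd + 1 <= nsnt:
            nrcvd += 1
        assert nrcvd <= nsnt  # a process never receives more than was sent
    return nsnt, nrcvd

for seed in range(100):
    simulate(1000, seed)
print("invariant nrcvd <= nsnt holds on all sampled runs")
```

Random simulation of course only samples runs; the model checker explores all of them.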
76. Solution 2: Some questions you might ask

Q1: instead of

if
:: (nrcvd + 1 <= nsnt) -> nrcvd++; /* one more message */
:: skip;                           /* or nothing */
fi;

why can't we just write:

nrcvd = nsnt;

A1: You can, but that would be another model, not [FLP85]! [FLP85] only guarantees that every message is eventually received.
77. Solution 2: Some questions you might ask (cont.)

Reliable communication: every sent message is eventually received.

Q2: Why do we write

  F G [∀i. nrcvdi ≥ nsnt]    (1)

instead of

  ∀i. G F [nrcvdi ≥ nsnt]    (2)

A2: We would like to write (2), but it requires another logic, called indexed LTL, which causes problems in the parameterized case. For threshold-based algorithms, the value of nsnt changes at most n times. Under this assumption, (2) is equivalent to (1).
78. Solution 1 (cont.): Explicit Modeling of Faults

(Lamport et al., 1982) introduced Byzantine processes that can virtually do anything. In our case, Byzantine behavior boils down to sending ECHO to some of the correct processes and not sending ECHO to the others:

active[F] proctype Byz() {
step:
  atomic {
    i = 0;
    do
    :: i < N -> p2p[_pid][i]!ECHO; i++;  /* send ECHO to process i */
    :: i < N -> i++;                     /* or not */
    :: i == N -> break;
    od
  };
  goto step;
}
80. Solution 2 (cont.): Injecting Faults into Message Counters

We instantiate n − f correct processes and no faulty processes. Instead, we say that the correct processes may receive up to f additional messages due to faults:

if
:: (nrcvd + 1 <= nsnt + f) -> nrcvd++; /* receive one more message */
:: skip;                               /* or nothing */
fi;

The fairness constraint still forces the processes to receive all the messages sent by the correct processes: F G [∀i. nrcvdi ≥ nsnt]

Note: each correct process sends at most one ECHO message.
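The same random-sampling sanity check as before (again my illustration) shows the effect of fault injection: the counter is now bounded by nsnt + f rather than nsnt.

```python
import random

# With fault injection: correct processes may additionally receive up to f
# messages "sent" by the Byzantine processes, which are not instantiated.

def simulate_with_faults(steps, f, seed):
    rng = random.Random(seed)
    nsnt, nrcvd = 0, 0
    for _ in range(steps):
        action = rng.choice(["send", "recv", "idle"])
        if action == "send":
            nsnt += 1
        elif action == "recv" and nrcvd + 1 <= nsnt + f:
            nrcvd += 1
        assert nrcvd <= nsnt + f  # at most f extra messages from faults
    return nsnt, nrcvd

for seed in range(100):
    simulate_with_faults(1000, f=1, seed=seed)
print("invariant nrcvd <= nsnt + f holds on all sampled runs")
```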
83. Solution 2 (cont.): Modeling different kinds of faults

Byzantine faults (previous slide):
- create only the correct processes, i.e., n − f processes;
- only they have to satisfy the specification;
- extra messages from Byzantine processes: (nrcvd + 1 <= nsnt + f);
- fairness (reliable communication): F G [∀i. nrcvdi ≥ nsnt]

Omission faults (processes fail to send messages):
- create all processes, i.e., n processes;
- all of them are mentioned in the specification;
- no additional messages: (nrcvd + 1 <= nsnt);
- fairness (with possible message loss due to faults): F G [∀i. nrcvdi ≥ nsnt − f]

Crash faults: similar to omissions, with a crash control state added.
84. Experiments: Solution 1 vs. Solution 2

[Plots: number of states (logscale) and memory in MB (logscale, limit of 12 GB) against the number of processes N = 3..8.]

Solution 1: channels + explicit Byzantine processes (blue)
Solution 2: shared variables + fault injection (red)
in the presence of one Byzantine faulty process (f = 1)
(the case f = 2 runs out of memory too fast)
85. Summary

We show how to model threshold-based fault-tolerant algorithms:
- Starting with an imprecise description, we create PROMELA models using expert advice.
- The tool demonstrates that the model behaves as predicted by theory (for concrete values of the parameters).
- This reference implementation allows us to optimize the encoding...
- ...and to make the model amenable to parameterized verification.
86. References I

Biely, M., Delgado, P., Milosevic, Z., & Schiper, A. 2013. DISTAL: A framework for implementing fault-tolerant distributed algorithms. Pages 1–8 of: 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

Castro, M., & Liskov, B. 1999. Practical Byzantine fault tolerance. Pages 173–186 of: OSDI, vol. 99.

Fischer, M. J., Lynch, N. A., & Paterson, M. S. 1985. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2), 374–382. http://doi.acm.org/10.1145/3149.214121.

John, A., Konnov, I., Schmid, U., Veith, H., & Widder, J. 2013. Towards modeling and model checking fault-tolerant distributed algorithms. Pages 209–226 of: SPIN. LNCS, vol. 7976.

Lamport, L., Shostak, R. E., & Pease, M. C. 1982. The Byzantine generals problem. ACM Trans. Program. Lang. Syst., 4(3), 382–401.

87. References II

Lincoln, P., & Rushby, J. 1993. A formally verified algorithm for interactive consistency under a hybrid fault model. Pages 402–411 of: FTCS-23. http://dx.doi.org/10.1109/FTCS.1993.627343.
89. Folklore Reliable Broadcast (e.g., Chandra & Toueg, 1996)

Correct processes agree on value vi in the presence of crash faults.

Variables of process i:
  vi : {0, 1}       initially 0 or 1
  accepti : {0, 1}  initially 0

An atomic step:
  if (vi = 1 or received <echo> from some process)
     and accepti = 0
  then begin
    send <echo> to all;  /* when crashing, it sends to a subset of processes */
    accepti := 1;        /* it can also crash here */
  end
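To see the crash behavior sketched in the comments, here is a toy round-based simulation in Python (my illustration, not the authors' model): a crashing broadcaster delivers its <echo> to only a subset of processes, yet every correct process that receives an echo relays it, so the correct processes accept together.

```python
# Toy simulation of the folklore reliable broadcast: process `crashed` crashes
# while broadcasting and reaches only `crash_delivers_to`; correct processes
# relay <echo> reliably, so either all correct processes accept or none does.

def run_broadcast(n, v, crashed, crash_delivers_to):
    echoes = set()                       # processes that received an <echo>
    accept = [0] * n
    # the crashing process sends to a subset of the processes, then stops
    if v[crashed] == 1:
        echoes |= set(crash_delivers_to)
    changed = True
    while changed:                       # run correct processes to a fixpoint
        changed = False
        for i in range(n):
            if i == crashed or accept[i]:
                continue
            if v[i] == 1 or i in echoes:
                echoes |= set(range(n))  # send <echo> to all, reliably
                accept[i] = 1
                changed = True
    return [accept[i] for i in range(n) if i != crashed]

print(run_broadcast(4, v=[1, 0, 0, 0], crashed=0, crash_delivers_to=[1]))
# all correct processes accept: [1, 1, 1]
```

With v = [0, 0, 0, 0] and no deliveries, the same function returns [0, 0, 0]: no correct process forges an acceptance.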
92. Verification Problem as in Distributed Computing

Given a distributed algorithm A and specifications ϕU, ϕC, ϕR: fix n and t with n > 3t, and show that every execution of A(n, t) satisfies ϕU, ϕC, ϕR.

In every execution:
- the number of faulty processes is restricted, i.e., f ≤ t;
- processes can use n and t in the code, but not f;
- f is constant (if a process fails late, its "correct" behavior so far was a Byzantine trick).

A distributed system A(n, t) thus covers all the cases f = 0, ..., f = t.

Counterexamples when f > t?
93. Experiments: Channels vs. Shared Variables

Enumerating reachable states in SPIN with partial-order reduction and state compression.

[Plots: number of reachable states (logscale) and memory in MB (logscale, limit of 12 GB) against the number of processes N = 3..8.]