Slide 1

Slide 1 text

Joseph Blomstedt (@jtuple)
Basho Technologies
[email protected]
Test-First Construction of Distributed Systems
Erlang Factory SF, March 2012
Thursday, March 29, 2012

Slide 2

Slide 2 text

Basho makes a distributed, scalable, and highly available data store.

Slide 3

Slide 3 text

Basho is a start-up

Slide 4

Slide 4 text

Start-up: Ship Quickly, Iterate, Agility
Enterprise: Ship Correctly, Highly Available, Fault Tolerant

Slide 5

Slide 5 text

Start-up: Ship Quickly, Iterate, Agility
Enterprise: Ship Correctly, Highly Available, Fault Tolerant
Strive to reduce the gap

Slide 6

Slide 6 text

Erlang Is Indispensable
• Built-in concurrency and distributed programming
• Fault-tolerant just-crash / supervisor mentality
• Ability to inspect VM state
• Hot code loading

Slide 7

Slide 7 text

Result?

Slide 8

Slide 8 text

Majority of bugs are concurrent logic errors

Slide 9

Slide 9 text

Testing Tools
• Quickcheck: property-based testing
• Pulse: randomizing Erlang scheduler
• McErlang / Concuerror: model checkers

Slide 10

Slide 10 text

Quickcheck

my_test() ->
    eqc:quickcheck(reverse_prop()).

reverse_prop() ->
    ?FORALL(L, list(int()),
            begin
                lists:reverse(lists:reverse(L)) == L
            end).
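
The property above needs the Quickcheck library to run. As a rough illustration of the same idea using only the standard library, the sketch below hand-rolls a random list generator with `rand` and checks the property over many cases. The module name `reverse_sketch` and the helpers `random_int_list/0` and `check/1` are hypothetical and are not part of Quickcheck; unlike Quickcheck, this sketch does no shrinking.

```erlang
-module(reverse_sketch).
-export([prop_reverse/1, check/1]).

%% The property from the slide: reversing twice yields the original list.
prop_reverse(L) ->
    lists:reverse(lists:reverse(L)) == L.

%% Hand-rolled stand-in for list(int()): 0..20 integers in -100..100.
random_int_list() ->
    Len = rand:uniform(21) - 1,
    [rand:uniform(201) - 101 || _ <- lists:seq(1, Len)].

%% Run the property against N random lists; true only if every case passes.
check(N) ->
    lists:all(fun(_) -> prop_reverse(random_int_list()) end,
              lists:seq(1, N)).
```

Calling `reverse_sketch:check(100)` runs one hundred random cases.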

Slide 11

Slide 11 text

Quickcheck eqc_statem
Generate command sequence, run against stateful code, verify postconditions

Slide 12

Slide 12 text

Quickcheck eqc_statem

command(State) ->
    %% Commands to run against stateful system
    oneof(Cmds).

precondition(State, Cmd) ->
    %% Return true if cmd is valid in current state.

next_state(State, Result, Cmd) ->
    %% Update test state after a given cmd.

postcondition(State, Cmd, Result) ->
    %% Test postconditions.

Slide 13

Slide 13 text

Testing Issues
• Building a test from an implementation is often not straightforward
• Testing concurrent interleaving requires a different approach
• Building a great implementation of a broken algorithm is disheartening

Slide 14

Slide 14 text

Test First Construction

Slide 15

Slide 15 text

Build testable model

Slide 16

Slide 16 text

Test, Iterate, Gain Confidence

Slide 17

Slide 17 text

Convert model into implementation

Slide 18

Slide 18 text

Verify implementation against model

Slide 19

Slide 19 text

History
• First built a testable model for the new clustering subsystem for Riak 1.0
• Model built on top of eqc_statem
• The test itself was the model of the system and tested properties against itself
• Somewhat ad-hoc, but it worked

Slide 20

Slide 20 text

eqc_system (1/2)
• Refactored the approach into a general-purpose framework based on lessons learned
• Events: external events, timers, things you do not care to model
• Calls/casts: similar to OTP gen_server
• Calls/casts map to simulated receive/reply semantics

Slide 21

Slide 21 text

eqc_system (2/2)
• Test consists of a test module and a set of node modules
• Callbacks:
  handle_event, handle_call, handle_cast
  after_event, after_call, after_cast
  post_event, post_call, post_cast, always
• Test module can generate events and test properties against global test state
• Node modules generate events, calls, and casts, and test local properties

Slide 22

Slide 22 text

Simple Example
• Nodes join together and form a cluster
• Nodes periodically gossip membership state to other known nodes

Slide 23

Slide 23 text

Test Module

events(#state{nodes=Nodes}) ->
    ?EVENT(join, [elements(Nodes), elements(Nodes)]).

precondition(_, S, join, [Node,[OtherNode]]) ->
    Singleton = S#state.singleton,
    all([Node /= OtherNode,
         lists:member(Node, Singleton),
         (Singleton == S#state.nodes) or lists:member(OtherNode, Singleton)]);

after_event(_Nodes, S, {join, [OtherNode]}, Node, _NodeState) ->
    Singleton = S#state.singleton -- [Node, OtherNode],
    S#state{singleton=Singleton};

Slide 24

Slide 24 text

Test Node Module (1/3)

events(Node, #state{members=Members}) ->
    ?EVENT(gossip, [Node, [elements(Members)]]).

precondition(S, gossip, [Node, [OtherNode]]) ->
    all([lists:member(OtherNode, S#state.members),
         Node /= OtherNode]);

Slide 25

Slide 25 text

Test Node Module (2/3)

handle_event({join, [OtherNode]}, State) ->
    call(State, OtherNode, get_members,
         fun(Members) ->
                 Members2 = lists:sort([State#state.id | Members]),
                 State2 = State#state{members=Members2},
                 {noreply, State2}
         end);
handle_event({gossip, [OtherNode]}, State) ->
    cast(OtherNode, {gossip, State#state.members}),
    {ok, State}.

Slide 26

Slide 26 text

Test Node Module (3/3)

handle_call(get_members, _From, State) ->
    {reply, State#state.members, State}.

handle_cast({gossip, OtherMembers}, State) ->
    %% Merge the sender's view into this node's own membership list.
    Members2 = merge(State#state.members, OtherMembers),
    {noreply, State#state{members=Members2}}.
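
The `merge` helper called by `handle_cast` is not shown on the slides. A minimal sketch, assuming membership views are plain lists of node ids and that merging is a sorted set union (consistent with the `lists:usort` call used later in the extended example); the module name `merge_sketch` is hypothetical:

```erlang
-module(merge_sketch).
-export([merge/2]).

%% Union of two membership views, sorted with duplicates removed.
%% Assumes node ids are comparable terms such as integers.
merge(Members, OtherMembers) ->
    lists:usort(Members ++ OtherMembers).
```

Under this assumption, a node holding `[1]` that receives gossip `[1,3]` ends up with `[1,3]`, and repeated delivery of the same gossip changes nothing.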

Slide 27

Slide 27 text

Example Command Sequence

[{init,{sys_state,undefined,undefined,rc,0,[],undefined,undefined,model}},
 {set,{var,1},{call,eqc_sys,init_dynamic,[]}},
 {set,{var,2},{call,eqc_sys,init_system,[rc]}},
 {set,{var,3},{call,rc,join,[3,[1]]}},
 {set,{var,4},{call,eqc_sys,rcvmsg,[1,{3,{call,get_members}}]}},
 {set,{var,5},{call,rc,join,[2,[1]]}},
 {set,{var,6},{call,eqc_sys,rcvreply,[3,{1,[1]}]}},
 {set,{var,7},{call,eqc_sys,rcvmsg,[1,{2,{call,get_members}}]}},
 {set,{var,8},{call,eqc_sys,rcvreply,[2,{1,[1]}]}},
 {set,{var,9},{call,rc_node,send_gossip,[3,[1]]}},
 {set,{var,10},{call,rc_node,send_gossip,[3,[1]]}},
 {set,{var,11},{call,rc_node,send_gossip,[2,[1]]}},
 {set,{var,12},{call,rc_node,send_gossip,[3,[1]]}},
 {set,{var,13},{call,eqc_sys,rcvmsg,[1,{3,{cast,{gossip,[1,3]}}}]}},
 {set,{var,14},{call,eqc_sys,rcvmsg,[1,{3,{cast,{gossip,[1,3]}}}]}}]

Slide 28

Slide 28 text

Extended Example
• Cluster maintains a weak leader
  Lowest node id in the cluster is considered the leader
  No actual leader election or failure detection
• Property we care about
  At all times, only one node believes it is the leader of a cluster

Slide 29

Slide 29 text

Extended Node Module (1/3)

-record(state, {id, members, leader}).

handle_event({join, [OtherNode]}, _Node, State) ->
    call(State, OtherNode, get_state,
         fun(#state{members=Members, leader=Leader}) ->
                 Members2 = lists:sort([State#state.id | Members]),
                 {noreply, State#state{members=Members2, leader=Leader}}
         end);
handle_event({send_gossip, [OtherNode]}, _Node, State) ->
    cast(OtherNode, {gossip, State}),
    {ok, State};

Slide 30

Slide 30 text

Extended Node Module (2/3)

handle_call(get_state, _From, _Node, State) ->
    {{reply, State}, State};

handle_cast({gossip, #state{members=Members, leader=Leader}},
            _From, _Node, State) ->
    Members2 = lists:usort(State#state.members ++ Members),
    Leader2 = case State#state.id == State#state.leader of
                  true  -> hd(lists:sort(Members2));
                  false -> Leader
              end,
    {noreply, State#state{members=Members2, leader=Leader2}};

Slide 31

Slide 31 text

Extended Node Module (3/3)

get_leader(S) ->
    S#state.leader.

get_members(S) ->
    S#state.members.

Slide 32

Slide 32 text

Extended Test Module

always(Nodes, S) ->
    all([begin
             Members = nodecall(Nodes, Node, get_members, []),
             one_leader(Nodes, Members)
         end || Node <- S#state.nodes]).

one_leader(Nodes, Members) ->
    Leaders = [Leader || Node <- Members,
                         Leader <- [nodecall(Nodes, Node, get_leader, [])],
                         Leader == Node],
    length(lists:usort(Leaders)) < 2.

Slide 33

Slide 33 text

Counterexample

[{init,{sys_state,undefined,undefined,rc,0,[],undefined,undefined,model}},
 {set,{var,1},{call,eqc_sys,init_dynamic,[]}},
 {set,{var,2},{call,eqc_sys,init_system,[rc]}},
 {set,{var,3},{call,rc,join,[1,[3]]}},
 {set,{var,4},{call,eqc_sys,rcvmsg,[3,{1,{call,get_state}}]}},
 {set,{var,5},{call,eqc_sys,rcvreply,[1,{3,{state,3,[3],3}}]}},
 {set,{var,6},{call,rc_node,send_gossip,[1,[3]]}},
 {set,{var,7},{call,eqc_sys,rcvmsg,[3,{1,{cast,{gossip,{state,1,[1,3],3}}}}]}},
 {set,{var,8},{call,rc_node,send_gossip,[3,[1]]}},
 {set,{var,9},{call,rc_node,send_gossip,[1,[3]]}},
 {set,{var,16},{call,eqc_sys,rcvmsg,[3,{1,{cast,{gossip,{state,1,[1,3],3}}}}]}},
 {set,{var,18},{call,eqc_sys,rcvmsg,[1,{3,{cast,{gossip,{state,3,[1,3],1}}}}]}}]
{postcondition,false}
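
To see why this sequence breaks the one-leader property, it can help to replay it by hand. The sketch below restates the join and gossip handlers from the extended node module as pure functions, without the eqc_sys framework, and applies the messages in the order the counterexample delivers them. The module and function names (`leader_replay`, `join_reply/2`, `on_gossip/2`, `replay/0`) are hypothetical, and the initial state of node 1 is assumed to mirror the `{state,3,[3],3}` reply shown for node 3.

```erlang
-module(leader_replay).
-export([join_reply/2, on_gossip/2, replay/0]).

-record(state, {id, members, leader}).

%% Joiner adopts the contacted node's members and leader.
join_reply(State, #state{members = Members, leader = Leader}) ->
    State#state{members = lists:sort([State#state.id | Members]),
                leader = Leader}.

%% Unversioned gossip handler, same logic as the slide.
on_gossip(State, #state{members = Members, leader = Leader}) ->
    Members2 = lists:usort(State#state.members ++ Members),
    Leader2 = case State#state.id == State#state.leader of
                  true  -> hd(Members2);
                  false -> Leader
              end,
    State#state{members = Members2, leader = Leader2}.

%% Returns the ids of the nodes that believe they are the leader.
replay() ->
    N1a = #state{id = 1, members = [1], leader = 1},  % assumed initial state
    N3a = #state{id = 3, members = [3], leader = 3},
    N1b = join_reply(N1a, N3a),   % node 1 joins via 3, adopts leader 3
    N3b = on_gossip(N3a, N1b),    % gossip 1 -> 3 delivered: 3 picks leader 1
    G2 = N3b,                     % gossip 3 -> 1 sent, still in flight
    G3 = N1b,                     % gossip 1 -> 3 sent, carries stale leader 3
    N3c = on_gossip(N3b, G3),     % node 3 reverts to leader 3
    N1c = on_gossip(N1b, G2),     % node 1 adopts leader 1
    [Id || #state{id = Id, leader = L} <- [N1c, N3c], Id == L].
```

Here `leader_replay:replay()` returns both node ids: node 3 accepts a stale leader value from node 1 while node 1 accepts the newer value from node 3, so each ends up naming itself.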

Slide 34

Slide 34 text

Versioned leader state
• Add version number to gossiped state
• Leader increments version when changed
• Node updates leader only if newer version
• After changes, model passes without issue
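
The versioned model itself is not shown on the slides. One possible reading of the three rules above, written as pure functions outside the eqc_sys framework, is sketched below. The `vsn` field, the module name `leader_versioned`, and the function names are assumptions made for illustration. The sketch only replays the message order from the slide 33 counterexample; it is not evidence that the scheme is correct in general.

```erlang
-module(leader_versioned).
-export([join_reply/2, on_gossip/2, replay/0]).

%% vsn is a hypothetical version counter attached to the leader value.
-record(state, {id, members, leader, vsn = 0}).

%% Joiner adopts the contacted node's members, leader, and version.
join_reply(State, #state{members = Members, leader = Leader, vsn = Vsn}) ->
    State#state{members = lists:sort([State#state.id | Members]),
                leader = Leader, vsn = Vsn}.

on_gossip(State, #state{members = Members, leader = Leader, vsn = Vsn}) ->
    S1 = State#state{members = lists:usort(State#state.members ++ Members)},
    case S1#state.id == S1#state.leader of
        true ->
            %% Current leader: recompute, bump the version only on change.
            case hd(S1#state.members) of
                Lowest when Lowest == S1#state.leader -> S1;
                Lowest -> S1#state{leader = Lowest, vsn = S1#state.vsn + 1}
            end;
        false when Vsn > S1#state.vsn ->
            %% Non-leader: accept only a strictly newer leader value.
            S1#state{leader = Leader, vsn = Vsn};
        false ->
            S1
    end.

%% Same delivery order as the counterexample; returns self-believed leaders.
replay() ->
    N1a = #state{id = 1, members = [1], leader = 1},  % assumed initial state
    N3a = #state{id = 3, members = [3], leader = 3},
    N1b = join_reply(N1a, N3a),
    N3b = on_gossip(N3a, N1b),    % node 3 hands leadership to 1, vsn 0 -> 1
    N3c = on_gossip(N3b, N1b),    % stale gossip (vsn 0) is ignored
    N1c = on_gossip(N1b, N3b),    % newer gossip (vsn 1) is accepted
    [Id || #state{id = Id, leader = L} <- [N1c, N3c], Id == L].
```

With the version check in place, the stale gossip from node 1 no longer overwrites node 3's newer leader value, and only node 1 ends up believing it is the leader.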

Slide 35

Slide 35 text

Convert to Implementation

Slide 36

Slide 36 text

Convert to Implementation
• Convert model into actual implementation
• Majority of code reused
  eqc_sys designed to mirror OTP code
• Update the model as necessary and iterate again

Slide 37

Slide 37 text

Test Implementation

Slide 38

Slide 38 text

Recall model design
• Events: commands that trigger system transitions
• Calls/casts: emulated as commands for testing purposes

Slide 39

Slide 39 text

Testing Approach #1
• Quickcheck generates event sequences, not calls/casts
• Events are mapped to equivalent implementation constructs
• Erlang tracing is used to capture the actual calls/casts that occurred
• Verify events + observed calls/casts against the model and final cluster state

Slide 40

Slide 40 text

Testing Approach #2
• Modify implementation to enable controlling message interleaving
• Implemented as a proxy process that delays forwarding messages until told to do so by test module
• Investigating parse_transform option
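
The proxy idea can be sketched with plain processes. The code below is an illustration only, not the implementation used in the talk: the module name `delay_proxy`, the `'$release'` control message, and the `{msg, _}` wrapper used in the demo are all hypothetical. The proxy queues everything sent to it and forwards one held message to the real target each time the test asks it to.

```erlang
-module(delay_proxy).
-export([start/1, release/1, demo/0]).

%% Start a proxy in front of Target. Messages sent to the proxy are held.
start(Target) ->
    spawn(fun() -> loop(Target, queue:new()) end).

%% Ask the proxy to forward its oldest held message.
%% Returns ok if a message was forwarded, empty if nothing was held.
release(Proxy) ->
    Ref = make_ref(),
    Proxy ! {'$release', self(), Ref},
    receive {Ref, Result} -> Result end.

loop(Target, Held) ->
    receive
        {'$release', From, Ref} ->
            case queue:out(Held) of
                {{value, Msg}, Rest} ->
                    Target ! Msg,
                    From ! {Ref, ok},
                    loop(Target, Rest);
                {empty, _} ->
                    From ! {Ref, empty},
                    loop(Target, Held)
            end;
        Msg ->
            loop(Target, queue:in(Msg, Held))
    end.

%% Nothing reaches the target until the test releases it.
demo() ->
    Proxy = start(self()),
    Proxy ! {msg, a},
    Proxy ! {msg, b},
    Before = receive {msg, M0} -> M0 after 50 -> none end,
    ok = release(Proxy),
    First = receive {msg, M1} -> M1 after 1000 -> timeout end,
    ok = release(Proxy),
    Second = receive {msg, M2} -> M2 after 1000 -> timeout end,
    {Before, First, Second}.
```

This sketch releases messages in arrival order. A test module that wants to explore interleavings would instead choose which held message to release next, for example by sender.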

Slide 41

Slide 41 text

Interacting with other tools
• Pulse, McErlang, Concuerror: all aimed at concurrency debugging
• Testing approach #1 works well with these tools
  Generate event sequences + trace, but allow scheduling tools to force interleavings
• Tested with Pulse and Concuerror
• Even more confidence in model/code

Slide 42

Slide 42 text

Limitations
• eqc_sys is entirely random and may not hit a lurking bad interleaving
• Pulse is also random
• McErlang / Concuerror: state space usually too large

Slide 43

Slide 43 text

Coq Proof Assistant
• Working on using Coq to prove the model
• Coq script is similar to the Quickcheck model
  Commands are represented as a list constructed from a generator
  The model is a set of functions that operate over the list, producing state
  Properties are checked against state
  Prove: for all commands, properties always hold

Slide 44

Slide 44 text

Coq Challenges (1/2)
• Writing Coq scripts
  Syntax (Basho is an Erlang company)
  Semantics (mapping Erlang ideas to Coq)
• Working on an Erlang-to-Coq generator that works on the subset of Erlang used in my models
  Solves syntax issues
  Semantics are trickier, but approached as encountered

Slide 45

Slide 45 text

Coq Challenges (2/2)
• Proving in Coq is not automatic
• Tedious process, not a Basho specialty
• Working on a domain-specific proof tactic and library of lemmas to enable automation
• Inspired by Professor Chlipala’s book http://adam.chlipala.net/cpdt
• Possibly hear more later this year
  Personal project, so progress is slow

Slide 46

Slide 46 text

Model, Test, Implement, Verify

Slide 47

Slide 47 text

Ship Quickly, Ship Correctly
Getting a little closer

Slide 48

Slide 48 text

Questions? [email protected] @jtuple