Slide 1

Slide 1 text

Live Testing of Distributed System Fault Tolerance With Fault Injection Techniques
Alexey Vasyukov, Inventa
Vadim Zherder, Moscow Exchange

Slide 2

Slide 2 text

Plan
• Introduction: Distributed Trading System concepts
• Distributed Consensus protocol
• MOEX Fault Injection testing framework

Slide 3

Slide 3 text

Trading System concepts

Slide 4

Slide 4 text

Trading System

Slide 5

Slide 5 text

Transaction Processing
• One incoming stream
• Strictly ordered FIFO incoming stream
• SeqNum
• TimeStamp
• Strictly ordered FIFO outgoing stream
• Each transaction should get a reply
• No loss, no duplicates
• Multistage processing
• No. 1 – finished, No. 2, 3 – in process, No. 4 – incoming
• Transaction processing result
• ErrorCode, Message, Statuses, …
• TransLog = Transaction Log
• Transaction buffer
• Large but finite. When it is exhausted, the TS stops processing transactions and becomes “unavailable”
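To make the stream and buffer concepts concrete, here is a minimal Python sketch; the Transaction and TransLog names, fields, and the buffer limit are illustrative assumptions, not the actual MOEX data structures.

```python
from dataclasses import dataclass
from time import time

@dataclass
class Transaction:
    seq_num: int        # strictly increasing position in the FIFO stream
    timestamp: float    # assigned when the transaction enters the stream
    payload: bytes

class TransLog:
    """Illustrative transaction log with a large but finite buffer."""

    def __init__(self, buffer_size=1_000_000):
        self.buffer_size = buffer_size
        self.entries = []

    def append(self, tx):
        # Enforce strict FIFO ordering: no loss, no duplicates, no gaps.
        expected = self.entries[-1].seq_num + 1 if self.entries else 1
        if tx.seq_num != expected:
            raise ValueError(f"out-of-order transaction: got {tx.seq_num}, expected {expected}")
        if len(self.entries) >= self.buffer_size:
            return False  # buffer exhausted: the TS becomes "unavailable"
        self.entries.append(tx)
        return True

log = TransLog(buffer_size=3)
for i in range(1, 4):
    assert log.append(Transaction(seq_num=i, timestamp=time(), payload=b"order"))
```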

Slide 6

Slide 6 text

Trading System: what’s behind

Slide 7

Slide 7 text

Messaging
UP – transactions
DOWN – responses and messages to the nodes of the cluster
Use CRC (cyclic redundancy check) to control the transaction flow
Full history: “late join” is possible
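A small sketch of the CRC idea using Python's zlib.crc32; the message framing below is an assumption for illustration, not the real wire format.

```python
import zlib

def frame(payload: bytes) -> bytes:
    # Append a CRC32 checksum so the receiver can detect corrupted transactions.
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")

def unframe(message: bytes) -> bytes:
    payload, received_crc = message[:-4], int.from_bytes(message[-4:], "big")
    if zlib.crc32(payload) != received_crc:
        raise ValueError("wrong CRC: escalate / request retransmission")
    return payload

assert unframe(frame(b"NEW_ORDER seq=42")) == b"NEW_ORDER seq=42"
```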

Slide 8

Slide 8 text

Role: “Main”
“Main” = the main TS instance
• Get new transactions from the incoming stream
• Broadcast transactions within the cluster
• Process transactions
• Check the transaction result (compare to results obtained from other nodes)
• Broadcast results within the cluster
• Publish replies to clients

Slide 9

Slide 9 text

Role: “Backup”
“Backup” = a special state of a TS instance
• Can be switched to Main quickly
• Gets all transactions from Main
• Processes transactions independently
• Checks the transaction result (compares to results from other nodes)
• Writes its own TransLog
• Does not send replies to clients

Slide 10

Slide 10 text

Role: “Backup”
Two modes: SYNC (“Hot Backup”) and ASYNC (“Warm Backup”)
• If Main fails, a SYNC can switch to Main automatically
• A SYNC publishes transaction results
• If a SYNC fails, an ASYNC can switch to SYNC automatically
• An ASYNC does not publish transaction results
• An ASYNC can be switched to Main manually by the Operator
• The number of SYNCs is a static parameter determined by the Operator
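A sketch of the switching rules described above; the Role enum and the transition table are illustrative, not the real implementation.

```python
from enum import Enum, auto

class Role(Enum):
    GOVERNOR = auto()
    MAIN = auto()
    BACKUP_SYNC = auto()
    BACKUP_ASYNC = auto()

# (from_role, to_role) -> how the switch may happen
TRANSITIONS = {
    (Role.BACKUP_SYNC, Role.MAIN): "automatic (if Main failed)",
    (Role.BACKUP_ASYNC, Role.BACKUP_SYNC): "automatic (if a SYNC failed)",
    (Role.BACKUP_ASYNC, Role.MAIN): "manual (by the Operator only)",
}

def switch_allowed(current, target):
    # The Governor role is assigned only by the Operator and never changes.
    if Role.GOVERNOR in (current, target):
        return None
    return TRANSITIONS.get((current, target))

print(switch_allowed(Role.BACKUP_SYNC, Role.MAIN))    # automatic (if Main failed)
print(switch_allowed(Role.BACKUP_ASYNC, Role.MAIN))   # manual (by the Operator only)
```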

Slide 11

Slide 11 text

Role: Governor
The Governor can
• Force a node to change its role
• Force a node to stop
• Start Elections to assign a new Main
Only one Governor in the cluster
The Governor can be assigned only by the Operator
The Governor role cannot be changed
If a node asks the Governor while it is unavailable, that node stalls until it recovers connectivity to the Governor
The Governor can be recovered or restarted only by the Operator

Slide 12

Slide 12 text

Roles Summary

                               Governor   Main   SYNC Backup   ASYNC Backup
Send Table of states              V        V         V             V
Get Client Transaction                     V
Broadcast Transaction                      V
Process Transaction                        V         V             V
Broadcast Transaction result               V         V
Compare transaction results                V         V             V
Send replies to clients                    V
Can switch to                                        Main          SYNC

Slide 13

Slide 13 text

If something goes wrong…
IF we detect
• A mismatch in transaction results
• A node does not respond
• No new transactions incoming
• A wrong CRC
• The Governor does not respond
• A mismatch in the tables of states
• …
THEN ask the Governor
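A sketch of this escalation rule; the anomaly flags and the governor client below are hypothetical placeholders.

```python
import time

def detect_anomaly(node_view):
    # Each check mirrors one of the bullets above; the real checks live inside the TS.
    if node_view.get("result_mismatch"):
        return "mismatch in transaction results"
    if node_view.get("peer_silent"):
        return "a node does not respond"
    if node_view.get("crc_error"):
        return "wrong CRC"
    return None

def ask_governor(governor, reason, retry_s=0.5):
    # If the Governor is unavailable, the node stalls until connectivity recovers.
    while True:
        try:
            return governor.ask(reason)  # hypothetical client call
        except ConnectionError:
            time.sleep(retry_s)
```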

Slide 14

Slide 14 text

Elections
• Elections start to assign a new Main
• Transaction processing stops
• Two-part Generation counter (G:S)
• Initial value (0:0)
• Every successful election increases G and drops S to 0 (G:0); every round of elections increases S
• Example: (1:0) -> (1:1) -> (1:2) -> (2:0)
• The Generation counter is included in every message to/from the Governor
• Two-phase commit approach: the Governor sends the new table of states and waits for confirmation from all nodes
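A minimal sketch of the two-part generation counter; class and method names are illustrative.

```python
class GenerationCounter:
    """Two-part counter (G:S): G counts successful elections, S counts rounds."""

    def __init__(self):
        self.g, self.s = 0, 0            # initial value (0:0)

    def start_round(self):
        self.s += 1                      # every round of elections increases S

    def commit_election(self):
        self.g, self.s = self.g + 1, 0   # a successful election increases G and drops S to 0

    def __repr__(self):
        return f"({self.g}:{self.s})"

gc = GenerationCounter()
gc.commit_election()   # (1:0)
gc.start_round()       # (1:1)
gc.start_round()       # (1:2)
gc.commit_election()   # (2:0), matching the example above
print(gc)
```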

Slide 15

Slide 15 text

Distributed consensus protocol

Slide 16

Slide 16 text

MOEX Consensus Protocol (by Sergey Kostanbaev, MOEX)
We must ensure that the Tables of States are consistent at all nodes during normal operation
We must ensure that the Tables of States become consistent again after some nodes have failed
Table of States at each node: every node (Node 1 … Node 4) keeps a table mapping the node uuids (uuid1 … uuid4) to their states: S_MAIN, S_GOVERNOR, S_BACKUP_SYNC, S_BACKUP_ASYNC
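A sketch of the table of states and the consistency requirement, assuming a simple dictionary representation.

```python
TABLE_OF_STATES = {
    "uuid1": "S_MAIN",
    "uuid2": "S_GOVERNOR",
    "uuid3": "S_BACKUP_SYNC",
    "uuid4": "S_BACKUP_ASYNC",
}

def tables_consistent(tables_per_node):
    # During normal operation every node must hold the same table of states.
    views = list(tables_per_node.values())
    return all(view == views[0] for view in views)

cluster_view = {f"node{i}": dict(TABLE_OF_STATES) for i in range(1, 5)}
assert tables_consistent(cluster_view)
```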

Slide 17

Slide 17 text

MOEX Consensus Protocol
Thus, it is an example of a distributed consensus protocol
Other examples:
• Paxos, 1998, 2001, …: Lamport, L. Paxos Made Simple. ACM SIGACT News 32, 4 (Dec. 2001), 18–25.
• Raft, 2014: Ongaro, D., and Ousterhout, J. In Search of an Understandable Consensus Algorithm. In Proc. ATC’14, USENIX Annual Technical Conference (2014), USENIX. https://raft.github.io/raft.pdf
• DNCP, 2016: https://tools.ietf.org/html/rfc7787
Open question: is MOEX CP equivalent to any of the known protocols?
Hypotheses on MOEX CP features:
H1. Byzantine fault tolerance
H2. Safety
H3. No liveness

Slide 18

Slide 18 text

Cluster Normal State Requirements
• There is exactly one Governor in the cluster
• There is exactly one Main in the cluster
• The tables of states at all nodes are consistent
• All active nodes in the cluster have the same value of the Generation counter
• The cluster is available (for client connections) and processes transactions
• All nodes process the same sequence of transactions
• Either the number of SYNCs equals the predefined value, or it is less than the predefined value and there are no ASYNCs
• …

Slide 19

Slide 19 text

Main “Theorem”
• Assume that the cluster was in the Normal state and one of the Main or Backup nodes fails. Then the cluster returns to the Normal state within finite time.

Slide 20

Slide 20 text

MOEX CP Testing
Investigate
• Fault detection
• Implementation correctness
• Timing
• Dependence on the load profile
• Dependence on the environment configuration
• Statistics
Integration with CI/CD processes

Slide 21

Slide 21 text

Typical Test Scenario
1. Start all nodes
2. Wait for the normal state
3. Start the transaction generator
4. Keep the transaction flow running for some time
5. Fault injection – emulate a fault (single or multiple)
6. Wait for the normal state (check the timeout)
7. Check the state at each node
8. Collect artifacts
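A hypothetical Python test following this scenario; the cluster, generator, and inject helpers stand in for the framework's domain-specific libraries and are not its real API.

```python
def test_kill_sync_backup(cluster, generator, inject):
    cluster.start_all()                                   # 1. start all nodes
    cluster.wait_for_normal_state(timeout=60)             # 2. wait for the normal state
    generator.start()                                     # 3. start the transaction generator
    generator.run_for(seconds=30)                         # 4. keep the transaction flow for some time
    inject.kill(node="backup_sync_1", signal="SIGKILL")   # 5. emulate a fault
    cluster.wait_for_normal_state(timeout=120)            # 6. wait for the normal state (with timeout)
    for node in cluster.alive_nodes():                    # 7. check the state at each node
        assert node.validations_passed()
    cluster.collect_artifacts("artifacts/")               # 8. collect artifacts
```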

Slide 22

Slide 22 text

References
• Widder, J. Introduction into Fault-Tolerant Distributed Algorithms and Their Modeling. TMPA (2014).
• Lamport, L. Paxos Made Simple. ACM SIGACT News 32, 4 (Dec. 2001), 18–25.
• Ongaro, D., and Ousterhout, J. In Search of an Understandable Consensus Algorithm. In Proc. ATC’14, USENIX Annual Technical Conference (2014), USENIX. https://raft.github.io/raft.pdf
• Ongaro, D. Consensus: Bridging Theory and Practice. Doctoral dissertation, Stanford University, 2014.

Slide 23

Slide 23 text

MOEX Fault Injection Testing Framework

Slide 24

Slide 24 text

Fault Injection: Testing Implementation

Slide 25

Slide 25 text

MOEX Fault Injection Framework Concepts
• End-to-end testing of the cluster implementation
• Starts the complete real system on real infrastructure
• Provides modules to inject predictable faults on selected servers
• Provides domain-specific libraries to write tests
• System, network, and application issues are injected directly
• Misconfiguration problems are tested indirectly (real infrastructure, config push before the test starts)

Slide 26

Slide 26 text

Architecture

Slide 27

Slide 27 text

Injection Techniques
OS processes
• Kill (SIGKILL)
• Hang (SIGSTOP for N seconds + SIGCONT)
Network
• Interface “blink” (DROP 100% of packets for N seconds)
• Interface “noise” (DROP X% of packets for N seconds)
• Content filtering – allows “smart” injection into the protocol, dropping selected messages from the flow
Application
• Data corruption (with a gdb script) – emulates application-level issues caused by incorrect calculations
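A sketch of the process-level and network-level injections; it assumes root privileges, a known PID, and a known interface name, and it does not reflect the framework's real modules.

```python
import os
import signal
import subprocess
import time

def kill_process(pid):
    os.kill(pid, signal.SIGKILL)          # "Kill" inject

def hang_process(pid, seconds):
    os.kill(pid, signal.SIGSTOP)          # freeze the process
    time.sleep(seconds)
    os.kill(pid, signal.SIGCONT)          # let it continue: "Hang" inject

def blink_interface(iface, seconds):
    # Drop 100% of incoming packets on the interface for a while (requires root).
    rule = ["iptables", "-A", "INPUT", "-i", iface, "-j", "DROP"]
    subprocess.run(rule, check=True)
    time.sleep(seconds)
    subprocess.run(["iptables", "-D", "INPUT", "-i", iface, "-j", "DROP"], check=True)
```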

Slide 28

Slide 28 text

Basic Cluster State Validations

#   Code                  Description
00  ALIVE_ON_START        Cluster nodes should start correctly
01  SINGLE_MAIN           Only one node should consider itself MAIN
02  GW_OK                 All gateways should be connected to the correct MAIN
03  GEN_OK                All active cluster nodes should have the same generation
04  TE_VIEW_OK            The current MAIN should be connected to all alive nodes
05  CLU_VIEW_CONSISTENT   All alive nodes should have the same cluster view
06  ELECTIONS_OK          The election count during the test should match the inject scenario
07  DEAD_NODES_OK         The number of lost nodes should match the inject scenario
08  CLIENTS_ALIVE         Clients should not notice any issue; the fault-handling logic is completely hidden from them
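Two of these validations as a Python sketch, assuming node states are collected into simple dictionaries.

```python
def single_main(node_states):
    # 01 SINGLE_MAIN: only one alive node may consider itself MAIN
    mains = [n for n in node_states if n["alive"] and n["role"] == "S_MAIN"]
    return len(mains) == 1

def gen_ok(node_states):
    # 03 GEN_OK: all active nodes must report the same generation counter
    generations = {n["generation"] for n in node_states if n["alive"]}
    return len(generations) == 1

nodes = [
    {"alive": True, "role": "S_MAIN", "generation": (2, 0)},
    {"alive": True, "role": "S_GOVERNOR", "generation": (2, 0)},
    {"alive": False, "role": "S_BACKUP_SYNC", "generation": (1, 2)},
]
assert single_main(nodes) and gen_ok(nodes)
```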

Slide 29

Slide 29 text

Test Targets
• Basic system faults
• Multiple system faults on different nodes
• Application-level faults
• Random network instabilities
• Recovery after faults
• Governor stability (failures, restarts, failures during elections)

Slide 30

Slide 30 text

Test Summary
• Logs from all nodes for root cause analysis
• Cluster state validation summary
• Cluster node states (in the example shown: the Sync Backup is dead, the Async Backup switched to Sync)

Slide 31

Slide 31 text

Basic Fault: Overall System Behavior
Event log timeline: BS died, elections started → elections, no transactions → resumed operation

Slide 32

Slide 32 text

Restore After Fault: Overall System Behavior
Event log timeline: BS hung, elections started → elections, no transactions → resumed operation → BS is alive again → BS rejoins the cluster, receiving the missed transactions

Slide 33

Slide 33 text

Performance Metrics
• Key performance data from all cluster nodes
• How do faults influence service quality for consumers?
• Compare configurations (indirectly, together with config push)

Slide 34

Slide 34 text

Domain-Specific Language
• Useful for ad-hoc tests and quick analysis
• Complements the set of ‘default’ tests (written in Python)

Slide 35

Slide 35 text

Statistics
• Multiple runs to identify problems without stable reproducers
• A heatmap to quickly analyze both which tests and which validations fail
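A minimal matplotlib sketch of such a heatmap; the test names, validation codes, and failure counts are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

tests = ["kill_main", "hang_sync", "net_blink", "governor_restart"]
validations = ["SINGLE_MAIN", "GEN_OK", "ELECTIONS_OK", "CLIENTS_ALIVE"]
failures = np.random.randint(0, 5, size=(len(tests), len(validations)))  # failed runs per cell

plt.imshow(failures, cmap="Reds", aspect="auto")
plt.xticks(range(len(validations)), validations, rotation=45, ha="right")
plt.yticks(range(len(tests)), tests)
plt.colorbar(label="failed runs")
plt.title("Which tests and which validations fail")
plt.tight_layout()
plt.show()
```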

Slide 36

Slide 36 text

References
Similar tools:
1. Netflix Simian Army. http://techblog.netflix.com/2011/07/netflix-simian-army.html
2. Jepsen. https://jepsen.io/
Reading:
1. McCaffrey, C. 2015. The Verification of a Distributed System. ACM Queue 13, 9 (December 2015). DOI: http://dx.doi.org/10.1145/2857274.2889274
2. Alvaro, P., Rosen, J., and Hellerstein, J. M. 2015. Lineage-Driven Fault Injection. http://www.cs.berkeley.edu/~palvaro/molly.pdf
3. Yuan, D., Luo, Y., Zhuang, X., Rodrigues, G. R., Zhao, X., Zhang, Y., Jain, P. U., and Stumm, M. 2014. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan
4. Ghosh, S., et al. 1997. Software Fault Injection Testing on a Distributed System – A Case Study.
5. Lai, M.-Y., and Wang, S. Y. 1995. Software Fault Insertion Testing for Fault Tolerance. In Software Fault Tolerance, edited by Lyu, Chapter 13.

Slide 37

Slide 37 text

Questions

Slide 38

Slide 38 text

Thank you!