Consensus Inside

Tudor David
December 11, 2014

Presented at Middleware 2014; paper available at https://dl.acm.org/citation.cfm?id=2663321


Transcript

  1. Consensus Inside

    Tudor David, Rachid Guerraoui, Maysam Yabandeh

  2. Consensus Inside

    Consensus ~ non-blocking agreement between distributed processes on one out of possibly multiple proposed values, in a single machine
    Why is this an interesting problem?
  3. Future multi-cores: challenges for software

    Scalability
    [Figure: CLH lock throughput (Mops/s) vs. number of cores]
    Hardware diversity
    •  memory hierarchy
    •  core design
    •  interconnects
    Proposed solution: view the multi-core as a distributed system
  4. The multi-core as a distributed system

    •  Implicit communication (shared memory)
    •  Explicit communication (message passing)
    •  Replicated state
    •  Locally cached data
    •  State machine replication
    •  Total ordering of updates
    •  Agreement
    •  High availability, high scalability
    How should we do message-passing agreement in a multi-core?
  5. Outline

    §  The multi-core as a distributed system
    §  Towards an agreement protocol for multi-cores
    §  1Paxos
    §  Evaluation
  6. Existing approaches

    Two-Phase Commit (2PC)
    1. broadcast Prepare
    2. wait for Acks
    3. broadcast Commit/Rollback
    4. wait for Acks
    Blocking, all messages go through coordinator
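The four 2PC steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation evaluated in the talk: the `Participant` type, the `two_phase_commit` function, and the message counter are hypothetical names, and message delivery is modeled as a plain function call.

```python
from typing import Callable, Dict, List

# Hypothetical participant model: a callable that receives a message name
# and returns True as its ack (or "yes" vote, for a Prepare message).
Participant = Callable[[str], bool]


def two_phase_commit(participants: List[Participant]) -> Dict[str, object]:
    """Run one 2PC round from the coordinator's point of view."""
    messages = 0

    # Steps 1 and 2: broadcast Prepare, then wait for an ack from everyone.
    # The coordinator cannot move on until the slowest participant replies.
    votes = []
    for participant in participants:
        messages += 1                      # Prepare
        votes.append(participant("prepare"))
        messages += 1                      # Ack
    decision = "commit" if all(votes) else "rollback"

    # Steps 3 and 4: broadcast the decision, then wait for acks again.
    for participant in participants:
        messages += 1                      # Commit/Rollback
        participant(decision)
        messages += 1                      # Ack

    return {"decision": decision, "messages": messages}


def always_yes(message: str) -> bool:
    return True


def refuses_prepare(message: str) -> bool:
    return message != "prepare"


print(two_phase_commit([always_yes, always_yes, always_yes]))
print(two_phase_commit([always_yes, refuses_prepare, always_yes]))
```

In this model every message passes through the coordinator, so one round with n participants costs 4n messages, and a single slow participant stalls both phases.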
  7. Is a blocking protocol appropriate?

    “Latency numbers every programmer should know”
    L1 cache reference                          0.5 ns
    Branch mispredict                             5 ns
    L2 cache reference                            7 ns
    Mutex lock/unlock                            25 ns
    Main memory reference                       100 ns
    Compress 1K bytes                         3 000 ns
    Send 1K bytes over 1 Gbps network        10 000 ns
    Read 4K randomly from SSD               150 000 ns
    Read 1 MB sequentially from memory      250 000 ns
    Round trip within datacenter            500 000 ns
    Read 1 MB sequentially from SSD       1 000 000 ns
    Disk seek                            10 000 000 ns
    Read 1 MB sequentially from disk     20 000 000 ns
    Send packet CA->Netherlands->CA     150 000 000 ns
    Source: Jeff Dean
    Blocking agreement – only as fast as the slowest participant
    •  Scheduling?
    •  I/O?
    Use a non-blocking protocol
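A small worked example makes the "only as fast as the slowest participant" point concrete. The sketch below assumes three participants whose reply delays are taken from the table above (two answer in about one main memory reference, one is stalled on a random 4K SSD read), and compares waiting for all replies with waiting for a majority. The function names are hypothetical, and the majority rule is used here only as one example of a non-blocking wait.

```python
from typing import List


def blocking_round_ns(reply_delays_ns: List[int]) -> int:
    """A blocking round needs every reply, so the slowest one sets the pace."""
    return max(reply_delays_ns)


def majority_round_ns(reply_delays_ns: List[int]) -> int:
    """A majority-based round finishes once more than half of the replies are in."""
    needed = len(reply_delays_ns) // 2 + 1
    return sorted(reply_delays_ns)[needed - 1]


# Assumed scenario: two fast participants (100 ns) and one stalled on I/O (150 000 ns).
delays_ns = [100, 100, 150_000]

print(blocking_round_ns(delays_ns))   # time to hear from everyone
print(majority_round_ns(delays_ns))   # time to hear from 2 out of 3
```

With these assumed delays the blocking round takes 150 000 ns while the majority round takes 100 ns, which is the gap a descheduled or I/O-bound core can introduce.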
  8. Non-blocking agreement protocols

    Paxos
    •  Tolerates non-malicious faults or unresponsive nodes: in multi-cores, slow cores
    •  Needs a majority of responses to progress (tolerates partitions)
    Phase 1: prepare
    Phase 2: accept
    Roles:
    •  Proposer
    •  Acceptor
    •  Learner
    Lots of variations and optimizations: CheapPaxos, MultiPaxos, FastPaxos, etc.
    Usually – all roles on a physical node (Collapsed Paxos)
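The two phases and the majority rule can be illustrated with a minimal single-decree Paxos sketch. It follows the textbook protocol rather than any code from the talk: the `Acceptor` class, the `propose` function, and the idea of passing only the currently reachable acceptors are assumptions made for illustration, and messages are modeled as direct method calls.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                    # highest proposal number promised
    accepted_pn: int = -1                 # proposal number of the accepted value
    accepted_value: Optional[str] = None

    def on_prepare(self, pn: int) -> Optional[Tuple[int, Optional[str]]]:
        """Phase 1: promise to ignore lower-numbered proposals."""
        if pn > self.promised:
            self.promised = pn
            return (self.accepted_pn, self.accepted_value)
        return None                       # reject an outdated proposal

    def on_accept(self, pn: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if pn >= self.promised:
            self.promised = pn
            self.accepted_pn = pn
            self.accepted_value = value
            return True
        return False


def propose(reachable: List[Acceptor], n: int, pn: int, value: str) -> Optional[str]:
    """Try to get a value chosen; n is the total number of acceptors."""
    majority = n // 2 + 1

    # Phase 1: prepare. Unreachable (slow) acceptors simply never answer.
    promises = [r for r in (a.on_prepare(pn) for a in reachable) if r is not None]
    if len(promises) < majority:
        return None

    # If some acceptor already accepted a value, that value must be re-proposed.
    prev_pn, prev_value = max(promises, key=lambda r: r[0])
    chosen = prev_value if prev_pn >= 0 else value

    # Phase 2: accept.
    acks = sum(a.on_accept(pn, chosen) for a in reachable)
    return chosen if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 3, pn=1, value="x"))        # all three respond
print(propose(acceptors[:2], 3, pn=2, value="y"))    # one slow node: still progresses
print(propose(acceptors[:1], 3, pn=3, value="z"))    # no majority: no progress
```

The second call shows both properties from the slide: progress with one unresponsive node, and the earlier value `"x"` being preserved rather than overwritten by `"y"`.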
  9. MultiPaxos

    [Diagram: three nodes, each hosting a proposer (P), an acceptor (A), and a learner (L)]
    •  Unless failed, keep same leader in subsequent rounds
  10. Does MultiPaxos scale in a multi-core?

    [Figure: MultiPaxos, 3 replicas. Throughput vs. number of clients; series: Multi-core, Large area network]
    Limited scalability in the multi-core environment
  11. A closer look at the multi-core environment

    Where does time go when sending a message?
    [Figure: % of time spent on processing vs. propagation etc.; bars: Multi-core (< 1 us), LAN (~100 us)]
    Large networks: minimize the number of rounds/instance
    Multi-core: minimize the number of messages
  12. Can we adapt Paxos to this scenario?

    [Diagram: three nodes, each hosting a proposer (P), an acceptor (A), and a learner (L)]
    •  Replication of data (reliability): long-term memory
    •  Replication of service (availability): advocate client commands
    •  Resolve contention between proposers, short-term memory (reliability, availability)
    Using one acceptor significantly reduces the number of messages
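To see why a single acceptor reduces message counts, a rough counting model helps. The model below is an assumption made for illustration and is not the accounting used in the paper: it considers only the steady state with a stable leader (no prepare phase), treats roles co-located on the leader's node as free, has the leader collect acceptor replies and then notify remote learners in the multi-acceptor case, and places the single active acceptor on a different node than the leader.

```python
def multi_acceptor_messages(n: int) -> int:
    """Messages per agreed value with n collapsed nodes and n acceptors.

    Assumed pattern: the leader sends an accept request to each remote
    acceptor, each replies, and the leader then notifies each remote learner.
    """
    remote = n - 1
    return remote + remote + remote


def single_acceptor_messages(n: int) -> int:
    """Messages per agreed value with one active acceptor.

    Assumed pattern: the leader sends one accept request to the active
    acceptor, which then broadcasts the value to the n - 1 remote learners.
    """
    return 1 + (n - 1)


for n in (3, 5, 9):
    print(n, multi_acceptor_messages(n), single_acceptor_messages(n))
```

Under these assumptions the multi-acceptor pattern costs 3(n - 1) messages per value and the single-acceptor pattern costs n, so the gap widens as replicas are added.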
  13. Outline

    §  The multi-core as a distributed system
    §  Towards an agreement protocol for multi-cores
    §  1Paxos
    §  Evaluation
  14. 1Paxos: The failure-free case

    [Diagram: nodes 1, 2, 3, each hosting a proposer (P), an acceptor (A), and a learner (L)]
    1. P2: obtains active acceptor A1 and sends prepare_request(pn)
    2. A1: if pn is the highest proposal number received so far, replies to P2 with ack
    3. P2 -> A1: accept_request(pn, value)
    4. A1 broadcasts value to learners
    Common case: only steps 3 and 4
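The failure-free flow on this slide can be sketched as follows. This is a simplified illustration, not the code from the project repository: the class names and the `is_leader` flag are hypothetical, messages are direct method calls, and the sketch leaves out instance numbering, PaxosUtility, and all failure handling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Learner:
    learned: List[str] = field(default_factory=list)


@dataclass
class ActiveAcceptor:
    learners: List[Learner]
    max_pn: int = -1                      # highest proposal number received so far

    def prepare_request(self, pn: int) -> bool:
        # Step 2: ack only if pn is the highest proposal number seen so far.
        if pn > self.max_pn:
            self.max_pn = pn
            return True
        return False

    def accept_request(self, pn: int, value: str) -> bool:
        # Step 3 is received here; outdated proposers are rejected.
        if pn < self.max_pn:
            return False
        # Step 4: broadcast the value to all learners.
        for learner in self.learners:
            learner.learned.append(value)
        return True


@dataclass
class Proposer:
    pn: int
    acceptor: ActiveAcceptor
    is_leader: bool = False               # hypothetical flag: prepare already acked

    def propose(self, value: str) -> bool:
        if not self.is_leader:
            # Steps 1 and 2: needed only the first time this proposer leads.
            self.is_leader = self.acceptor.prepare_request(self.pn)
            if not self.is_leader:
                return False
        # Common case: only steps 3 and 4.
        return self.acceptor.accept_request(self.pn, value)


learners = [Learner() for _ in range(3)]
a1 = ActiveAcceptor(learners)
p2 = Proposer(pn=2, acceptor=a1)
p2.propose("v1")                          # runs steps 1 to 4
p2.propose("v2")                          # runs steps 3 and 4 only
print([learner.learned for learner in learners])
```

Once the prepare has been acked, each further value costs one accept request plus the broadcast to learners, which is the "common case" the slide refers to.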
  15. 1Paxos: Switching the acceptor

    [Diagram: nodes 1, 2, 3, each hosting a proposer (P), an acceptor (A), and a learner (L)]
    1. P2 leader?
    2. PaxosUtility: P2 proposes
       •  A3 active acceptor
       •  Uncommitted proposed values
    3. P2 -> A3: prepare_request
  16. 1Paxos: Switching the leader

    [Diagram: nodes 1, 2, 3, each hosting a proposer (P), an acceptor (A), and a learner (L)]
    1. A1 – active acceptor?
    2. PaxosUtility: P3 new leader and A1 active acceptor
    3. P3 -> A1: prepare_request
  17. Switching leader and acceptor

    The trade-off: while leader and active acceptor non-responsive at the same time
    ✖ liveness
    ✔ safety
    [Diagram: three nodes, each hosting a proposer (P), an acceptor (A), and a learner (L)]
    Why is it reasonable?
    •  small probability event
    •  no network partitions
    •  if nodes not crashed, but slow -> system becomes responsive after a while
  18. Outline

    §  The multi-core as a distributed system
    §  Towards an agreement protocol for multi-cores
    §  1Paxos
    §  Evaluation
  19. Latency and throughput

    [Figure: latency (seconds) vs. throughput (updates/second), 3 replicas; series: 2PC, MultiPaxos, 1Paxos; point labels: 45 clients, 6 clients, 7 clients, 13 clients]
    1Paxos provides smaller latency and increased throughput
  20. Degree of replication

    [Figure: throughput (updates/second) vs. number of replicas; series: 2PC, MultiPaxos, 1Paxos]
    Smaller # of messages -> tolerance to more replication
  21. Slow leader

    [Figure: 1Paxos throughput (updates/sec) vs. time (10s of ms); annotation: leader becomes unresponsive]
    1Paxos - small recovery time
  22. Summary and Conclusions

    Multi-core – message passing distributed system, but distributed algorithm implementations different
    Agreement in multi-cores
    •  non-blocking
    •  reduced # of messages
    Use one acceptor: 1Paxos
    •  reduced latency
    •  increased throughput
    Source code: github.com/lpd-epfl/consensusinside
    Proof: infoscience.epfl.ch/record/201600
    Wandida video intro: wandida.com/en/archives/1832
    Thank you!