Iago: High Availability Postgres with etcd (CoreOS Fest 2017)

Chain’s enterprise blockchain networks rely on etcd and Postgres for storage. We use Postgres to store blockchain data, and rely on Postgres’s replication features to ensure that clients can always access their assets and create transactions. I wrote a service that manages Postgres replicas. This service was heavily inspired by Joyent’s Manatee project, but uses different technology; namely, Go and etcd instead of Node and ZooKeeper.

In this talk, I’ll introduce blockchains and a couple of different consensus algorithms before covering how we use etcd to coordinate replicas in our high-availability management system. I’ll talk about how we configured both etcd and Postgres, walk through the high-level design as well as some code, demonstrate failover, and show how blockchain services communicate with this system.

Tess Rinearson

June 01, 2017

Transcript

  1. Iago: High-Availability Databases with etcd. CoreOS Fest, June 2017. Tess Rinearson, Software Engineer at Chain (@_tessr)
  2. • Blockchains and Network Consensus • Postgres Replication • etcd and Raft • Iago Itself • What happens when things go wrong? • Future Work • Why “Iago”
  3. 4 Block 1 tx tx tx tx tx Block 2

    tx tx prev block hash tx tx tx
  4. 4 Block 1 tx tx tx tx tx Block 2

    tx tx prev block hash tx tx tx tx tx prev block hash tx tx tx
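
The diagrams above hinge on the “prev block hash” field: each block commits to its transactions and to the hash of the block before it, so altering any earlier block changes every later hash. Below is a minimal Go sketch of that linkage; the struct layout is illustrative, not Chain’s actual block format.

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // block is a toy block: the hash of the previous block plus some transactions.
    type block struct {
        PrevHash [32]byte
        Txs      []string
    }

    // hash commits to the previous block's hash and to every transaction.
    func (b block) hash() [32]byte {
        h := sha256.New()
        h.Write(b.PrevHash[:])
        for _, tx := range b.Txs {
            h.Write([]byte(tx))
        }
        var out [32]byte
        copy(out[:], h.Sum(nil))
        return out
    }

    func main() {
        b1 := block{Txs: []string{"tx1", "tx2"}}
        b2 := block{PrevHash: b1.hash(), Txs: []string{"tx3"}}
        fmt.Printf("block 2 hash: %x\n", b2.hash())
    }
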
  5. 6

  6. 6 block block block block block block block block block

    block block block block block block block block block
  7. 8

  8. 8

  9. Blockchain Attributes: Pick n 1.Replicated Ledger 2.Cryptographic Commitment History 3.Transactions

    Authenticated by Public Key Crypto 4.Accounting Rules (UTXO model) 9
  10. Blockchain Attributes: Pick n 1.Replicated Ledger 2.Cryptographic Commitment History 3.Transactions

    Authenticated by Public Key Crypto 4.Accounting Rules (UTXO model) 5.Shared Access 9
  11. Blockchain Attributes: Pick n 1.Replicated Ledger 2.Cryptographic Commitment History 3.Transactions

    Authenticated by Public Key Crypto 4.Accounting Rules (UTXO model) 5.Shared Access 6.Proof of Work 9
  12. 10

  13. 10

  14. 12

  15. 13

  16. Federated Consensus • Designed for permissioned blockchains • One node is the generator: • Aggregates transactions and proposes blocks to the signer nodes • Some number of nodes are signers • Verify that block is valid and doesn’t conflict with another block at the same height, and add their own signatures to the block • Network requires m-of-n signers to sign each block (m,n specified by consensus program) • Remaining nodes are participants • Can audit blocks but network doesn’t rely on them
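
To make the m-of-n rule concrete, here is a minimal sketch (not Chain’s actual code) of the acceptance check: count valid signatures from the authorized signers and require at least m of them. The type names, signature scheme, and helper layout are all assumptions for illustration.

    package federated

    import "crypto/ed25519"

    // Block carries the hash being signed and the signatures collected so far,
    // keyed by signer ID.
    type Block struct {
        Hash       []byte
        Signatures map[string][]byte
    }

    // enoughSignatures reports whether block carries valid signatures from at
    // least m of the authorized signers.
    func enoughSignatures(block Block, signers map[string]ed25519.PublicKey, m int) bool {
        valid := 0
        for id, pub := range signers {
            sig, ok := block.Signatures[id]
            if ok && ed25519.Verify(pub, block.Hash, sig) {
                valid++
            }
        }
        return valid >= m
    }
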
  17. Federated Consensus • Safety Guarantees: • There is only one valid block at a given height, assuming that no more than (2M-N-1) signers violate protocol • e.g. for a 3-of-4 network, 2 signers would have to maliciously fail for network failure • Protocol supports consensus improvements
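
The (2M-N-1) bound is just counting: two conflicting m-of-n blocks need 2m signatures in total, so at least 2m-n signers must have signed both, and safety holds as long as no more than 2m-n-1 misbehave. A tiny helper (mine, not from the deck) makes the 3-of-4 example checkable:

    package federated

    // maxProtocolViolators returns the largest number of signers that can violate
    // the protocol without making two conflicting blocks possible at one height,
    // given that each block needs m of n signatures. For m=3, n=4 it returns 1:
    // two signers would have to double-sign to fork the network.
    func maxProtocolViolators(m, n int) int {
        return 2*m - n - 1
    }
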
  18. 19

  19. _.---.._ _ _.-' \ \ ''-. .' '-,_.-' / /

    / '''. ( _ o : '._ .-' '-._ \ \- ---] '-.___.-') )..-' (_/ 26
  20. etcd • Distributed key-value store built by CoreOS ❤ • Uses Raft consensus algorithm: https://raft.github.io/raft.pdf • Not Byzantine Fault Tolerant • …but that’s okay, because this is internal to a single node
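
The code on the following slides uses the etcd v2 Go client (client.KeysAPI). As a minimal, hedged example of the two operations Iago leans on, assuming a local etcd at 127.0.0.1:2379 and made-up key names: write a key with a TTL, then watch it for changes.

    package main

    import (
        "context"
        "log"
        "time"

        "github.com/coreos/etcd/client"
    )

    func main() {
        c, err := client.New(client.Config{
            Endpoints: []string{"http://127.0.0.1:2379"},
        })
        if err != nil {
            log.Fatal(err)
        }
        kapi := client.NewKeysAPI(c)
        ctx := context.Background()

        // Write a key with a TTL; unless it is refreshed, etcd expires it.
        if _, err := kapi.Set(ctx, "/iago/nodes/a", "alive", &client.SetOptions{TTL: 10 * time.Second}); err != nil {
            log.Fatal(err)
        }

        // Block until the key changes (set, update, expire, delete).
        w := kapi.Watcher("/iago/nodes/a", nil)
        resp, err := w.Next(ctx)
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("action=%s key=%s", resp.Action, resp.Node.Key)
    }
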
  21. Raft • Consensus: “How do we ensure that multiple nodes

    apply state changes in the same order?” • Breaks “consensus” into three independent subproblems: • Leader election • Log replication • Safety 29
  22. Raft • Consensus: “How do we ensure that multiple nodes

    apply state changes in the same order?” • Breaks “consensus” into three independent subproblems: • Leader election • Log replication • Safety 29 out of scope
  23. // RunPeer runs a single Iago peer. It will wait until at least two other
      // peers are also online before it launches Postgres and starts its healthcheck.
      // PostgresPort and DatabaseName must be set before calling RunPeer.
      func RunPeer(ctx context.Context, kapi client.KeysAPI) error {
          err := createEphemeralNode(ctx, kapi)
          if err != nil {
              return err
          }

          // refreshEphemeralNode is called at increments of nodeTTL / 2, so the
          // ephemeral node's TTL is refreshed after half of it has passed.
          go func() {
              for range time.Tick(nodeTTL / 2) {
                  log.Println("refreshing ttl")
                  refreshEphemeralNode(ctx, kapi)
              }
          }()

          return runPeer(ctx, kapi)
      }
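
etcd v2 has no native ephemeral nodes (unlike ZooKeeper, which Manatee uses), so the natural way to implement the helpers RunPeer calls is a TTL’d key that the peer keeps refreshing; if the process dies, the key simply expires. This is a hedged sketch of what createEphemeralNode and refreshEphemeralNode might look like; the key path, TTL value, and the choice to re-set the value rather than use etcd’s refresh option are assumptions.

    package iago

    import (
        "context"
        "time"

        "github.com/coreos/etcd/client"
    )

    const nodeTTL = 10 * time.Second

    var nodeKey = "/iago/nodes/peer-a" // hypothetical per-peer key, set at startup

    // createEphemeralNode registers this peer. PrevNoExist makes the call fail if
    // another process already owns the key.
    func createEphemeralNode(ctx context.Context, kapi client.KeysAPI) error {
        _, err := kapi.Set(ctx, nodeKey, "alive", &client.SetOptions{
            TTL:       nodeTTL,
            PrevExist: client.PrevNoExist,
        })
        return err
    }

    // refreshEphemeralNode pushes the TTL forward by re-setting the key. PrevExist
    // ensures we only refresh a registration we still hold.
    func refreshEphemeralNode(ctx context.Context, kapi client.KeysAPI) error {
        _, err := kapi.Set(ctx, nodeKey, "alive", &client.SetOptions{
            TTL:       nodeTTL,
            PrevExist: client.PrevExist,
        })
        return err
    }
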
  24. (Diagram: Postgres replication roles: Primary, Sync Rep, Async Rep, Async Rep 2)
  25. Cluster State • Primary • Sync rep • Ordered list of async reps • Generation • Initial WAL position
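
One plausible way to represent that cluster state is a small struct stored under a well-known etcd key so every peer sees the same view. The Primary and generation field names appear in the code on the next slides; the remaining names, and the single-key layout, are guesses for illustration.

    package iago

    type clusterState struct {
        Primary    string   // peer currently acting as primary
        SyncRep    string   // synchronous replica
        AsyncReps  []string // ordered list of asynchronous replicas
        generation int      // bumped on every failover
        initWALPos string   // WAL position recorded at the start of this generation
    }
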
  26. func runSyncPeerMonitor(ctx context.Context, kapi client.KeysAPI, p *pg.PG, cluster *clusterState) error {
          primaryWatch, genWatch := watchAll(ctx, kapi, cluster.Primary, cluster.generation)
          for {
              select {
              // The primary's ephemeral key expired: promote ourselves and take over.
              case primaryChange := <-primaryWatch:
                  if isExpiry(primaryChange) {
                      logChange(primaryChange, "primary has gone offline")
                      p.Promote(ctx)
                      return runPrimaryMonitor(ctx, kapi, p, cluster)
                  }
              // Someone started a new generation: rejoin from scratch.
              case genChange := <-genWatch:
                  logChange(genChange, "generation changed; restarting")
                  return runPeer(ctx, kapi)
              // Local Postgres failed its healthcheck: step down.
              case pgErr := <-p.Err:
                  logError(pgErr, "failed the pg healthcheck")
                  stepDownFatal(ctx, kapi, p, "old sync peer failed the pg healthcheck")
              }
          }
      }
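
The monitor above reacts to three kinds of events. A hedged sketch of the helpers it relies on: watchAll turns two etcd watches (the primary’s ephemeral key and the generation key) into channels, and isExpiry checks whether a watch event was a TTL expiry or deletion. The key paths, and leaving the generation argument unused, are assumptions of this sketch.

    package iago

    import (
        "context"

        "github.com/coreos/etcd/client"
    )

    // watchAll returns channels of watch events for the current primary's
    // ephemeral key and for the cluster's generation key. The generation argument
    // is unused in this sketch; the real code might use it to scope the watch.
    func watchAll(ctx context.Context, kapi client.KeysAPI, primary string, generation int) (<-chan *client.Response, <-chan *client.Response) {
        return watchKey(ctx, kapi, "/iago/nodes/"+primary),
            watchKey(ctx, kapi, "/iago/cluster/generation")
    }

    func watchKey(ctx context.Context, kapi client.KeysAPI, key string) <-chan *client.Response {
        ch := make(chan *client.Response)
        w := kapi.Watcher(key, nil)
        go func() {
            for {
                resp, err := w.Next(ctx)
                if err != nil {
                    return // context cancelled or watch failed; caller restarts
                }
                select {
                case ch <- resp:
                case <-ctx.Done():
                    return
                }
            }
        }()
        return ch
    }

    // isExpiry reports whether the watched key disappeared, either because its
    // TTL ran out or because it was deleted explicitly.
    func isExpiry(resp *client.Response) bool {
        return resp.Action == "expire" || resp.Action == "delete"
    }
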
  30. If the Primary goes offline… 40 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 A, B, C, D
  31. If the Primary goes offline… 40 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 A, B, C, D
  32. If the Primary goes offline… 41 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D
  33. If the Primary goes offline… 41 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D
  34. If the Primary goes offline… 42 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D
  35. If the Primary goes offline… 42 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D
  36. If the Primary goes offline… 42 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D Primary Sync Rep Async Reps Generation Ephemeral Nodes B C D 2 , B, C, D
  37. If the Primary goes offline… 42 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D Primary Sync Rep Async Reps Generation Ephemeral Nodes B C D 2 , B, C, D
  38. If the Primary goes offline… 43 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D Primary Sync Rep Async Reps Generation Ephemeral Nodes B C D 2 , B, C, D
  39. If the Primary goes offline… 43 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D Primary Sync Rep Async Reps Generation Ephemeral Nodes B C D 2 , B, C, D
  40. If the Primary goes offline… 43 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D Primary Sync Rep Async Reps Generation Ephemeral Nodes B C D 2 , B, C, D
  41. If the Primary goes offline… 44 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D Primary Sync Rep Async Reps Generation Ephemeral Nodes B C D 2 , B, C, D
  42. If the Primary goes offline… 44 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D Primary Sync Rep Async Reps Generation Ephemeral Nodes B C D 2 , B, C, D
  43. If the Primary goes offline… 44 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D Primary Sync Rep Async Reps Generation Ephemeral Nodes B C D 2 , B, C, D
  44. If the Primary goes offline… 45 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D Primary Sync Rep Async Reps Generation Ephemeral Nodes B C D 2 , B, C, D
  45. If the Primary goes offline… 46 cored cored A B

    C D Primary Sync Rep Async Reps Generation Ephemeral Nodes A B C, D 1 , B, C, D Primary Sync Rep Async Reps Generation Ephemeral Nodes B C D 2 B, C, D
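
In the sequence above, A’s ephemeral key expires, B promotes itself, and generation 2 replaces generation 1 in etcd. To make sure exactly one peer wins that race, a natural mechanism is an etcd compare-and-swap on the generation key; the sketch below shows the idea, though the key path and the exact shape of Iago’s write are assumptions.

    package iago

    import (
        "context"
        "strconv"

        "github.com/coreos/etcd/client"
    )

    // bumpGeneration tries to start generation oldGen+1. PrevValue turns the write
    // into a compare-and-swap: it succeeds only if the key still holds oldGen, so
    // a peer acting on stale information loses the race instead of clobbering a
    // newer generation.
    func bumpGeneration(ctx context.Context, kapi client.KeysAPI, oldGen int) error {
        _, err := kapi.Set(ctx,
            "/iago/cluster/generation",
            strconv.Itoa(oldGen+1),
            &client.SetOptions{PrevValue: strconv.Itoa(oldGen)},
        )
        return err
    }
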
  46. 47 cored cored A B C D Primary Sync Rep

    Async Reps Generation Ephemeral Nodes A B C, D 1 A, B, C, D What if the Sync Rep goes offline (but doesn’t know it)?
  47. 47 cored cored A B C D Primary Sync Rep

    Async Reps Generation Ephemeral Nodes A B C, D 1 A, B, C, D What if the Sync Rep goes offline (but doesn’t know it)?
  48. 48 cored cored A B C D Primary Sync Rep

    Async Reps Generation Ephemeral Nodes A B C, D 1 A, , C, D What if the Sync Rep goes offline (but doesn’t know it)?
  49. 48 cored cored A B C D Primary Sync Rep

    Async Reps Generation Ephemeral Nodes A B C, D 1 A, , C, D What if the Sync Rep goes offline (but doesn’t know it)?
  50. 49 cored cored A B C D Primary Sync Rep

    Async Reps Generation Ephemeral Nodes A C D 2 A, , C, D What if the Sync Rep goes offline (but doesn’t know it)?
  51. 50 cored cored A B C D Primary Sync Rep

    Async Reps Generation Ephemeral Nodes A C D 2 A, , C, D What if the Sync Rep goes offline (but doesn’t know it)?
  52. 50 cored cored A B C D Primary Sync Rep

    Async Reps Generation Ephemeral Nodes A C D 2 A, , C, D What if the Sync Rep goes offline (but doesn’t know it)?
  53. 50 cored cored A B C D Primary Sync Rep

    Async Reps Generation Ephemeral Nodes A C D 2 A, , C, D What if the Sync Rep goes offline (but doesn’t know it)?
  54. 51 cored cored A B C D Primary Sync Rep

    Async Reps Generation Ephemeral Nodes A C D 2 A, , C, D What if the Sync Rep goes offline (but doesn’t know it)?
  55. 51 cored cored A B C D Primary Sync Rep

    Async Reps Generation Ephemeral Nodes A C D 2 A, , C, D What if the Sync Rep goes offline (but doesn’t know it)?
  56. 52 cored cored A B C D Primary Sync Rep

    Async Reps Generation Ephemeral Nodes A C D 2 A, B, C, D What if the Sync Rep goes offline (but doesn’t know it)?
  57. 52 cored cored A B C D Primary Sync Rep

    Async Reps Generation Ephemeral Nodes A C D 2 A, B, C, D What if the Sync Rep goes offline (but doesn’t know it)?
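
The sequence above is why a peer that has silently lost its place (here, a sync rep whose ephemeral key expired while it still thought it was healthy) must stop serving as soon as it finds out. A hedged sketch of what stepDownFatal, called from the monitor loop earlier, might do; the Postgres interface, the key handling, and the exit-and-restart behavior are all assumptions.

    package iago

    import (
        "context"
        "log"
        "os"

        "github.com/coreos/etcd/client"
    )

    // stopper stands in for the *pg.PG type from the slides; stepping down only
    // needs the ability to stop Postgres.
    type stopper interface {
        Stop(ctx context.Context) error
    }

    // stepDownFatal is called when this peer discovers it is no longer a
    // legitimate member of the current generation: stop Postgres so it cannot
    // accept writes, remove the ephemeral key (nodeKey, from the earlier sketch)
    // so the cluster sees the departure immediately, and exit so a supervisor can
    // restart the peer from a clean state.
    func stepDownFatal(ctx context.Context, kapi client.KeysAPI, p stopper, reason string) {
        log.Printf("stepping down: %s", reason)
        if err := p.Stop(ctx); err != nil {
            log.Printf("error stopping postgres: %v", err)
        }
        if _, err := kapi.Delete(ctx, nodeKey, nil); err != nil {
            log.Printf("error deleting ephemeral node: %v", err)
        }
        os.Exit(1)
    }
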
  58. A Very Hairy Failure Case 53 cored cored A B

    C D Primary Sync Rep Async Reps Gen Ephemeral Nodes A B C, D 1 A, B, C, D
  59. A Very Hairy Failure Case 53 cored cored A B

    C D Primary Sync Rep Async Reps Gen Ephemeral Nodes A B C, D 1 A, B, C, D tx
  60. A Very Hairy Failure Case 53 cored cored A B

    C D Primary Sync Rep Async Reps Gen Ephemeral Nodes A B C, D 1 A, B, C, D tx tx
  61. A Very Hairy Failure Case 53 cored cored A B

    C D Primary Sync Rep Async Reps Gen Ephemeral Nodes A B C, D 1 A, B, C, D tx tx ack
  62. A Very Hairy Failure Case 53 cored cored A B

    C D Primary Sync Rep Async Reps Gen Ephemeral Nodes A B C, D 1 A, B, C, D tx tx ack ack
  63. A Very Hairy Failure Case 53 cored cored A B

    C D Primary Sync Rep Async Reps Gen Ephemeral Nodes A B C, D 1 A, B, C, D tx tx ack ack X
  64. A Very Hairy Failure Case 53 cored cored A B

    C D Primary Sync Rep Async Reps Gen Ephemeral Nodes A B C, D 1 A, B, C, D tx tx ack ack X
  65. A Very Hairy Failure Case 54 cored cored A B

    C D tx tx ack ack Primary Sync Rep Async Reps Gen Ephemeral Nodes B C D 2 B, C, D
  66. A Very Hairy Failure Case 54 cored cored A B

    C D tx tx ack ack Primary Sync Rep Async Reps Gen Ephemeral Nodes B C D 2 B, C, D
  67. Cluster State, revisited • Primary • Sync rep • Ordered list of async reps • Generation • Initial WAL position
  68. A Very Hairy Failure Case, with the WAL 56 cored

    cored A
 n B n C n D n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position A B C, D 1 A, B, C, D n Local WAL Position
  69. A Very Hairy Failure Case, with the WAL 56 cored

    cored A
 n B n C n D n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position A B C, D 1 A, B, C, D n tx Local WAL Position
  70. A Very Hairy Failure Case, with the WAL 56 cored

    cored A
 n B n C n D n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position A B C, D 1 A, B, C, D n tx tx Local WAL Position
  71. A Very Hairy Failure Case, with the WAL 56 cored

    cored A
 n B n C n D n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position A B C, D 1 A, B, C, D n tx tx ack Local WAL Position
  72. 57 cored cored A
 n+1 B n+1 C n D

    n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position A B C, D 1 A, B, C, D n tx tx ack Local WAL Position A Very Hairy Failure Case, with the WAL
  73. 58 cored cored A
 n+1 B n+1 C n D

    n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position A B C, D 1 A, B, C, D n tx tx ack Local WAL Position A Very Hairy Failure Case, with the WAL
  74. 58 cored cored A
 n+1 B n+1 C n D

    n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position A B C, D 1 A, B, C, D n tx tx ack Local WAL Position ack A Very Hairy Failure Case, with the WAL
  75. 58 cored cored A
 n+1 B n+1 C n D

    n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position A B C, D 1 A, B, C, D n tx tx ack Local WAL Position ack A Very Hairy Failure Case, with the WAL
  76. 59 cored cored A
 n+1 B n+1 C n D

    n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position A B C, D 1 , B, C, D n tx tx ack Local WAL Position ack A Very Hairy Failure Case, with the WAL
  77. 59 cored cored A
 n+1 B n+1 C n D

    n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position A B C, D 1 , B, C, D n tx tx ack Local WAL Position ack A Very Hairy Failure Case, with the WAL
  78. 60 cored cored A
 n+1 B n+1 C n D

    n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position B C D 2 B, C, D n+1 tx tx ack Local WAL Position ack A Very Hairy Failure Case, with the WAL
  79. 61 cored cored A
 n+1 B n+1 C n D

    n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position B C D 2 B, C, D n+1 tx tx ack Local WAL Position ack A Very Hairy Failure Case, with the WAL
  80. 61 cored cored A
 n+1 B n+1 C n D

    n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position B C D 2 B, C, D n+1 tx tx ack Local WAL Position ack A Very Hairy Failure Case, with the WAL
  81. 61 cored cored A
 n+1 B n+1 C n D

    n Primary Sync Rep Async Reps Gen Ephemeral Nodes WAL Position B C D 2 B, C, D n+1 tx tx ack Local WAL Position ack A Very Hairy Failure Case, with the WAL
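
The WAL positions in the table above are what make it possible to reason about a peer that comes back after a failover: its local WAL position can be compared with the initial WAL position recorded for the current generation. One plausible shape for that check, using the Postgres 9.6 function names current when this talk was given (Postgres 10 renamed pg_xlog_location_diff and pg_last_xlog_replay_location to pg_wal_lsn_diff and pg_last_wal_replay_lsn); the connection string and the recorded position are placeholders.

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq"
    )

    func main() {
        db, err := sql.Open("postgres", "host=localhost sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        initWALPos := "0/3000060" // would come from the cluster state in etcd

        // Positive diff: this peer's replayed WAL is ahead of the position recorded
        // for the generation it missed, so its history may have diverged and it
        // should be rebuilt from the new primary rather than simply reattached.
        var diff float64
        err = db.QueryRow(
            `SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), $1::pg_lsn)::float8`,
            initWALPos,
        ).Scan(&diff)
        if err != nil {
            log.Fatal(err)
        }
        if diff > 0 {
            log.Println("local WAL is ahead of the recorded position; re-clone needed")
        } else {
            log.Println("safe to rejoin as a replica")
        }
    }
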
  82. (Diagram: “Iago without etcd”, future work: the Iago sitter and cored, with cored carrying raft; endpoints /create-asset, /build-transaction, /raft/join, /raft/msg)
  83. 67