
Deniz Altınbüken on Chain Replication (old and new)

Chain Replication (CR) is a variant of Primary-Backup Replication that supports high throughput and fast recovery from failures. CR has been widely used in both commercial systems and academic research prototypes. Through this use, various shortcomings of the original CR protocol have come to light. In this talk, I will summarize these findings and present a new version of CR that addresses the shortcomings. Our improved CR protocol supports different consistency guarantees, avoids the tail bottleneck for reads, and introduces autonomous reconfiguration of the system without requiring an external master. Additionally, we have developed a formal end-to-end specification of the protocol, including the actions of clients, detailing reconfiguration and linearizable execution of client requests. Through this specification, we are able to reason about the new protocol more precisely and implement the protocol effortlessly. Lastly, I will contrast our approach with the related work.

Papers_We_Love

July 27, 2016

Transcript

  1. Chain Replication
     Robbert van Renesse and Fred B. Schneider. 2004. Chain replication for supporting high throughput and availability. In Proceedings of OSDI'04.
  2. replication
     • failure models: fail-stop failure model, crash failure model, byzantine failure model
     • replication techniques: quorum replication, stake replication, broker replication, primary-backup replication, state machine replication, chain replication, etc.
     • consistency models: strong consistency, sequential consistency, eventual consistency, causal consistency, read-your-writes consistency, monotonic read consistency, etc.
  4. primary-backup replication
     [diagram: client with query and reply; replicas R1, R2, R3 and Rprimary]
     Primary has to make sure that all updates prior to this query are done!
  5. chain replication
     [diagram: chain Rhead, R2, R3, Rtail; client with query and reply]
     Higher throughput! Tail can respond directly!
  6. related work
     • Sage A. Weil, Andrew W. Leung, Scott A. Brandt, and Carlos Maltzahn. 2007. RADOS: a scalable, reliable storage service for petabyte-scale storage clusters. In Proceedings of the 2nd international workshop on Petascale data storage: held in conjunction with Supercomputing '07 (PDSW '07). ACM, New York, NY, USA, 35-44.
     • Jeff Terrace and Michael J. Freedman. 2009. Object storage on CRAQ: high-throughput chain replication for read-mostly workloads. In Proceedings of the 2009 conference on USENIX Annual technical conference (USENIX'09). USENIX Association, Berkeley, CA, USA, 11-11.
     • David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. 2009. FAWN: a fast array of wimpy nodes. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SOSP '09). ACM, New York, NY, USA, 1-14.
     • Scott Lystig Fritchie. 2010. Chain replication in theory and in practice. In Proceedings of the 9th ACM SIGPLAN workshop on Erlang (Erlang '10). ACM, New York, NY, USA, 33-44.
     • Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen. 2011. Don't settle for eventual: scalable causal consistency for wide-area storage with COPS. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP '11). ACM, New York, NY, USA, 401-416.
     • Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, and John D. Davis. 2012. CORFU: a shared log design for flash clusters. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 1-1.
     • Guy Laden, Roie Melamed, and Ymir Vigfusson. 2012. Adaptive and dynamic funnel replication in clouds. SIGOPS Oper. Syst. Rev. 46, 1 (February 2012), 40-46.
     • Sérgio Almeida, João Leitão, and Luís Rodrigues. 2013. ChainReaction: a causal+ consistent datastore based on chain replication. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys '13). ACM, New York, NY, USA, 85-98.
     • Hussam Abu-Libdeh, Robbert van Renesse, and Ymir Vigfusson. 2013. Leveraging sharding in the design of scalable replication protocols. In Proceedings of the 4th annual Symposium on Cloud Computing (SOCC '13). ACM, New York, NY …
  7. chain replication limitations
     • tail is a bottleneck for queries.
       • CRAQ: read from “clean” nodes.
     • supports only strong consistency.
       • CRAQ: eventual consistency
       • Chain Reaction: causal consistency
     • requires a master to reconfigure.
  8. motivation
     • explain why suggested improvements work.
     • find further improvements.
     • make reconfiguration easier and cleaner.
     • create complete specifications.
     • prove chain replication works.
  9-30. [diagram sequence: chain Rhead, R2, R3, Rtail; every node keeps a speculative history and a stable history]
     • slide 10: R2 is the predecessor of R3; R3 is the successor of R2
     • slide 12: update
     • slide 15: propagation message
     • slides 22-23: reply
     • slide 25: acknowledgment message
  31-33. [diagram: chain Rhead, R2, R3, Rtail with several update and reply messages in flight]
     Multiple updates are handled simultaneously.
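The update, propagation and acknowledgment steps shown in the frames above can be sketched as a small step-by-step simulation. This is only an illustration under assumptions that are not in the slides: histories are modeled as Python lists, and the names `Node`, `client_update`, `propagate`, `stabilize_tail` and `acknowledge` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    speculative: list = field(default_factory=list)  # received, maybe not yet stable
    stable: list = field(default_factory=list)       # known to have reached the tail


def client_update(chain, update):
    """A client sends an update to the head, which records it speculatively."""
    chain[0].speculative.append(update)


def propagate(chain, i):
    """Node i forwards the oldest update that node i+1 has not seen yet."""
    node, succ = chain[i], chain[i + 1]
    if len(succ.speculative) < len(node.speculative):
        succ.speculative.append(node.speculative[len(succ.speculative)])


def stabilize_tail(chain):
    """The tail has no successor, so everything it has received is stable."""
    chain[-1].stable = list(chain[-1].speculative)


def acknowledge(chain, i):
    """Node i learns from node i+1 which updates are stable."""
    chain[i].stable = list(chain[i + 1].stable)


chain = [Node(n) for n in ("Rhead", "R2", "R3", "Rtail")]
client_update(chain, "u1")
propagate(chain, 0)          # u1: Rhead -> R2
client_update(chain, "u2")   # u2 arrives while u1 is still in flight
propagate(chain, 1)          # u1: R2 -> R3
propagate(chain, 0)          # u2: Rhead -> R2
propagate(chain, 2)          # u1: R3 -> Rtail
stabilize_tail(chain)        # the tail could now reply for u1
acknowledge(chain, 2)        # acknowledgment for u1: Rtail -> R3
```

At the end of this run u1 is stable at Rtail and R3 while u2 is still speculative at Rhead and R2, which is one way to picture several updates being in the chain at once.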
  34. [diagram: chain Rhead, R2, R3, Rtail; ⊆ and ⊇ relations drawn between the speculative and stable histories of neighboring nodes]
  35. [diagram: R2 and R3 with their speculative and stable histories]
     • The speculative history of a node’s successor is a subset of that node’s speculative history.
     • The speculative history of a node is a superset of its stable history.
     • The stable history of a node’s successor is a superset of that node’s stable history.
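The three relations on slide 35 can be written as a checker over a whole chain. A minimal sketch, assuming histories are ordered lists and reading "subset" as "prefix" (an assumption on my part; the slides only say subset and superset). The names `is_prefix` and `invariants_hold` are hypothetical.

```python
def is_prefix(short, long):
    """True if the list `short` is a prefix of the list `long`."""
    return long[:len(short)] == short


def invariants_hold(chain):
    """chain: list of (speculative, stable) pairs ordered from head to tail."""
    # A node's speculative history extends its own stable history.
    per_node = all(is_prefix(stable, spec) for spec, stable in chain)
    # Successor: smaller (or equal) speculative history, larger (or equal) stable history.
    per_link = all(
        is_prefix(succ_spec, spec) and is_prefix(stable, succ_stable)
        for (spec, stable), (succ_spec, succ_stable) in zip(chain, chain[1:])
    )
    return per_node and per_link


good = [(["a", "b", "c"], ["a"]), (["a", "b"], ["a"]), (["a", "b"], ["a", "b"])]
bad = [(["a"], []), (["a", "b"], [])]  # successor has seen more than its predecessor
```

Here `invariants_hold(good)` is true and `invariants_hold(bad)` is false, because in `bad` the successor's speculative history is not contained in its predecessor's.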
  36-39. [diagram sequence: chain Rhead, R2, R3, Rtail with speculative and stable histories; a query arrives and a reply is returned]
  40-41. [diagram: the chain with updates, replies, and a query with its reply]
     The tail is the point of linearization!
  42-46. head failure [diagram sequence: chain Rhead, R2, R3, Rtail with speculative and stable histories; slide 44 shows an update]
  47-51. middle node failure [diagram sequence: chain Rhead, R2, R3, Rtail with speculative and stable histories]
  52-54. tail failure [diagram sequence: chain Rhead, R2, R3, Rtail with speculative and stable histories]
  55. adding a new node [diagram: Rnew, with its own speculative and stable history, next to the chain Rhead, R2, R3, Rtail]
  56-58. adding a new node [diagram: add( ) update, new tail]
     • new nodes are added to the chain with special configuration updates that are added to the history: add(nodeid)
     • by looking at the order of these updates, a node can determine the configuration of the chain
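Reading the chain configuration off the history, as the bullets on slides 56-57 describe, might look like the sketch below. It assumes that every member, including the initial ones, entered through an `add(nodeid)` entry and that each added node becomes the new tail; the tuple encoding and the `("put", ...)` data updates are hypothetical, and removals are not modeled.

```python
def add(node_id):
    """A configuration update, stored in the history like any other update."""
    return ("add", node_id)


def configuration(history):
    """Replay the history in order; each add() appends a node at the tail."""
    chain = []
    for entry in history:
        if entry[0] == "add":
            chain.append(entry[1])
    return chain


history = [
    add("Rhead"), add("R2"),
    ("put", "x", 1),
    add("R3"), add("Rtail"),
    ("put", "y", 2),
    add("Rnew"),
]
```

Any node holding this history derives the same chain, so under these assumptions no external master is needed to tell it who the tail is: `configuration(history)[-1]` is the current tail.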
  59-63. adding a new node [diagram: Rnew becomes the new tail after Rtail; ⊆, ⊇ and = relations between their histories]
     • stable history of new tail should be a superset of the stable history of tail.
     • speculative history of new tail should be a superset of its stable history.
     • speculative and stable histories of new tail should be equal to the speculative history of tail.
     • old tail should not answer queries when the new tail should.
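The conditions for the new tail listed on slides 59-63 can be illustrated with a small hand-over function. This is a sketch under assumptions: node state is a plain dictionary, histories are lists, "superset" is read as "has as a prefix", and `hand_over` and the `serves_queries` flag are hypothetical names.

```python
def is_prefix(short, long):
    """True if the list `short` is a prefix of the list `long`."""
    return long[:len(short)] == short


def hand_over(old_tail):
    """Create the new tail's state from the old tail and stop the old tail's reads."""
    snapshot = list(old_tail["speculative"])
    new_tail = {
        "speculative": snapshot,      # equal to the old tail's speculative history
        "stable": list(snapshot),     # stable and speculative start out equal
        "serves_queries": True,
    }
    old_tail["serves_queries"] = False  # only one node answers as the tail
    return new_tail


old = {"speculative": ["u1", "u2"], "stable": ["u1"], "serves_queries": True}
new = hand_over(old)
```

After the call, the new tail's stable history extends the old tail's stable history, and its speculative history trivially extends its own stable history, which matches the first two bullets.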
  64-69. adding a new node [diagram sequence: add( ) update, new tail; slide 67 shows a reply]
  70-75. strong consistency
     after an update completes, any subsequent query by any client will return the updated value.
     • tail can reply to queries.
     • any node can record its speculative history when it receives a query and reply to the client when its stable history becomes equal to it.
     • nodes whose speculative and stable histories are equal can reply to queries. (clean vs dirty nodes in CRAQ)
     [diagram sequence: chain Rhead, R2, R3, Rtail with speculative and stable histories; query and reply]
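A non-tail node answering strongly consistent queries, per the bullets above, could be sketched as follows. The assumptions are mine: histories are append-only lists, acknowledgments arrive in order, and the class `Replica` with its `query` and `on_ack` methods is hypothetical.

```python
class Replica:
    def __init__(self):
        self.speculative = []
        self.stable = []
        self.pending = []  # (recorded_length, reply_callback) for deferred queries

    def query(self, reply):
        if self.stable == self.speculative:
            # Clean node: both histories are equal, answer immediately.
            reply(list(self.stable))
        else:
            # Dirty node: record the speculative history seen at query time.
            self.pending.append((len(self.speculative), reply))

    def on_ack(self, update):
        self.stable.append(update)
        ready = [p for p in self.pending if p[0] <= len(self.stable)]
        self.pending = [p for p in self.pending if p[0] > len(self.stable)]
        for recorded_length, reply in ready:
            # The stable history has caught up with the recorded history.
            reply(self.stable[:recorded_length])


answers = []
r = Replica()
r.speculative = ["u1"]        # u1 received but not yet acknowledged
r.query(answers.append)       # deferred: the node is dirty
r.on_ack("u1")                # the deferred query is answered now
r.query(answers.append)       # clean: answered immediately
```

The design choice here is to store only the length of the recorded history; with append-only histories the first `recorded_length` stable entries are exactly the speculative history seen at query time.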
  76. sequential consistency
     queries might return stale values, as long as they are not reordered.
     • any node can reply to query messages with its stable history.
     • the stable history of any node is a prefix of the history at tail.
     [diagram: chain Rhead, R2, R3, Rtail with speculative and stable histories; two query/reply pairs]
  77-78. eventual consistency
     if no new updates are made, eventually all queries will return a history including the last update.
     • any node can reply to query messages with its speculative history
     • the speculative history includes the history at tail and a sequence of updates that have been invoked but not yet stabilized (used in CRAQ)
     [diagram: chain Rhead, R2, R3, Rtail with speculative and stable histories; two query/reply pairs]
  79-81. causal consistency
     if client A has communicated to client B that it has completed an update, a subsequent query by client B will return that completed update.
     • requires modeling communication between clients
     • if a client receives a query reply from a node, the same client can only read from this node’s predecessors until all updates in the reply are stabilized (used in Chain Reaction)
     [diagram: chain Rhead, R2, R3, Rtail with speculative and stable histories; query1 with its reply, then query2 with its reply]
  82-83. read-your-writes consistency
     if a client’s update completes, that client will never see an older version of the history. this is a special case of the causal consistency model.
     • requires modeling client-side.
     • on the client-side, the proxy should ensure that a history returned by a query includes all updates that have been completed.
     • the proxy keeps track of updates that are invoked and completed.
     [diagram: chain Rhead, R2, R3, Rtail with speculative and stable histories; query and reply]
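The client-side proxy for read-your-writes can be sketched as a filter on query replies. This is an illustration only: the class `ReadYourWritesProxy`, its methods, and the idea of passing nodes as zero-argument callables that return a history are all hypothetical.

```python
class ReadYourWritesProxy:
    def __init__(self):
        self.completed = []  # updates this client has seen complete

    def update_completed(self, update):
        self.completed.append(update)

    def accept(self, history):
        """True if a query reply contains every update this client completed."""
        return all(u in history for u in self.completed)

    def query(self, reads):
        """Try each read in order; return the first acceptable history, else None."""
        for read in reads:
            history = read()
            if self.accept(history):
                return history
        return None


proxy = ReadYourWritesProxy()
proxy.update_completed("u2")


def stale_node():
    return ["u1"]            # has not seen u2 yet


def fresh_node():
    return ["u1", "u2"]
```

With this setup `proxy.query([stale_node, fresh_node])` skips the stale reply and returns the history that contains u2, while `proxy.query([stale_node])` returns `None` so the caller can retry.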
  84. monotonic read consistency
     if a client has issued a query and received h as a response, all following queries will receive a response with a history that has h as a prefix.
  85. monotonic read consistency
     if a client has seen a particular update, any subsequent queries will never return any previous state.
  86. monotonic read consistency
     • any given client only queries a single node (used in CRAQ)
     [diagram: chain Rhead, R2, R3, Rtail with speculative and stable histories; query and reply]
  87. monotonic read consistency
     • requires modeling client-side.
     • on the client-side, the proxy should ensure that a history returned by a query always has the histories returned by previous queries as prefixes
     • the proxy keeps track of queries that are invoked and completed
     [diagram: chain Rhead, R2, R3, Rtail with speculative and stable histories; query and reply]
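A monotonic-read proxy can be sketched in the same style, using the definition from slide 84: every accepted history must have the previously returned history h as a prefix. Histories as lists and the name `MonotonicReadProxy` are assumptions made for illustration.

```python
class MonotonicReadProxy:
    def __init__(self):
        self.last_seen = []  # the most recent history returned to this client

    def accept(self, history):
        """Accept a reply only if it extends the last history the client saw."""
        ok = history[:len(self.last_seen)] == self.last_seen
        if ok:
            self.last_seen = list(history)
        return ok


p = MonotonicReadProxy()
results = [
    p.accept(["u1"]),         # first reply
    p.accept(["u1", "u2"]),   # extends the previous reply
    p.accept(["u1"]),         # older state, rejected
    p.accept(["u1", "u2"]),   # same as the last accepted reply
]
```

A rejected reply leaves `last_seen` untouched, so the client never observes a state older than one it has already seen.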
  88. objective
     • A linearizable data store replicated with chain replication should look like a centralized data store to all clients.
     • A centralized data store has a single history.
     • Make sure the data store replicated with chain replication looks like it has a single history.
     • Prove it :)
     • Write the specification for the centralized data store.
     • Write the specification for the replicated data store.
     • Show the replicated specification refines the centralized specification.
  89. conclusion
     • we have created a formal end-to-end specification of chain replication
     • through this specification we can reason about how chain replication works
     • chain replication is easy to understand and implement
     • it can support different consistency models
     • reconfiguration can be done without requiring a master
  90. TODO
     • open-source chain replication implementations
     • java and python in progress
     • chain replication wikipedia page :)
     website: http://www.cs.cornell.edu/~deniz
     e-mail: [email protected]
     denizalti