RICON West 2012: Bringing Consistency to Riak (Part 2)

Joseph Blomstedt

October 30, 2013

Transcript

  1. Joseph Blomstedt (@jtuple)
    Basho Technologies
    Bringing Consistency To Riak (Part 2)
    Tuesday, October 29, 13

  2. CAP Theorem
    2

  3. 3
    Partition-tolerance
    Consistency
    Availability

  4. 4
    Partition-tolerance
    Consistency Availability

  5. 5
    Partition-tolerance
    Consistency Availability
    CP AP

  6. 6
    Partition-tolerance
    Consistency Availability
    CP AP

  7. 7
    Partition-tolerance
    Consistency Availability
    CP AP

  8. 8
    C/P
    Strict Quorum A/P
    Sloppy Quorum A/P

  9. 9
    C/P
    Strict Quorum A/P
    Sloppy Quorum A/P

  10. 10
    Node 1 Node 2 Node 3 Node 4 Node 5
    client client
    client

  11. 11
    Node 1 Node 2 Node 3 Node 4 Node 5
    client client
    client

  12. 12
    Node 1 Node 2 Node 3 Node 4 Node 5
    client client
    client

  13. 13
    Node 1 Node 2 Node 3 Node 4 Node 5
    client client
    client

  14. 14
    C/P
    Strict Quorum A/P
    Sloppy Quorum A/P

  15. 15
    Node 1 Node 2 Node 3 Node 4 Node 5
    client client
    client

  16. 16
    Node 1 Node 2 Node 3 Node 4 Node 5
    client client
    client

  17. 17
    Node 1 Node 2 Node 3 Node 4 Node 5
    client client
    client

  18. 18
    Node 1 Node 2 Node 3 Node 4 Node 5
    client client
    client

  19. 19
    C/P
    Strict Quorum A/P
    Sloppy Quorum A/P

  20. 20
    Node 1 Node 2 Node 3 Node 4 Node 5
    client client
    client

  21. 21
    Node 1 Node 2 Node 3 Node 4 Node 5
    client client
    client

  22. 22
    Node 1 Node 2 Node 3 Node 4 Node 5
    client client
    client

  23. 23
    Node 1 Node 2 Node 3 Node 4 Node 5
    client client
    client

  24. 24
    Node 1 Node 2 Node 3 Node 4 Node 5
    client client
    client client client

  25. Eventual Consistency
    25

  26. 26
    A A A

  27. 27
    A A A

  28. 28
    A A A
    B

  29. 29
    A A A
    B

  30. 30
    A A A
    B
    B B B

  31. 31
    A A A

  32. 32
    A A A
    B C

  33. 33
    A A A
    B C

  34. 34
    A A A
    B
    {B,C} {B,C} {B,C}
    C

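The difference between slides 26 to 30 (a single write B replaces A) and slides 31 to 34 (concurrent writes B and C become the sibling set {B,C}) can be sketched with vector clocks. This is a minimal illustration, not Riak's implementation: the `descends` and `reconcile` helpers and the actor names `n1`, `client1`, `client2` are assumptions made for the example.

```python
def descends(a, b):
    """True if vector clock `a` has seen every event recorded in `b`."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def reconcile(versions):
    """Keep every (clock, value) pair that no other version strictly dominates."""
    survivors = []
    for clock, value in versions:
        dominated = any(
            descends(other, clock) and other != clock
            for other, _ in versions
        )
        if not dominated and (clock, value) not in survivors:
            survivors.append((clock, value))
    return survivors


# Hypothetical clocks: B and C were both derived from A, independently.
A = ({"n1": 1}, "A")
B = ({"n1": 1, "client1": 1}, "B")
C = ({"n1": 1, "client2": 1}, "C")

sequential = reconcile([A, B])     # B descends from A, so only B survives
concurrent = reconcile([A, B, C])  # neither B nor C descends from the other
```

In the concurrent case both B and C survive as siblings, and one of the strategies listed on the following slides has to pick or merge them.
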
  35. 35
    Write Once
    Immutable
    Last Write Wins
    Business Rules
    CRDTs/Monotonicity

  36. 36
    Write Once
    Immutable
    Last Write Wins
    Business Rules
    CRDTs/Monotonicity

  37. 37
    Write Once
    Immutable
    Last Write Wins
    Business Rules
    CRDTs/Monotonicity

  38. 38
    Write Once
    Immutable
    Last Write Wins
    Business Rules
    CRDTs/Monotonicity

  39. 39
    Write Once
    Immutable
    Last Write Wins
    Business Rules
    CRDTs/Monotonicity

  40. 40
    Write Once
    Immutable
    Last Write Wins
    Business Rules
    CRDTs/Monotonicity

  41. Strong Consistency
    41

  42. Strong Consistency
    42
    Why?

  43. Strong Consistency
    43
    Recency

  44. Strong Consistency
    44
    Recency
    Partial Writes

  45. Strong Consistency
    45
    Recency
    Partial Writes
    Atomicity

  46. 46
    Recency
    Partial Writes
    Atomicity

  47. 47
    Recency
    Partial Writes
    Atomicity

  48. 48
    Eventual consistency
    is great

  49. 49
    But, when is eventual?

  50. 50
    Do I have the
    most recent
    value?

  51. 51
    CRDTs don’t help

  52. 52
    (a,1) (a,1) (a,1)
    =1

  53. 53
    (a,1) (a,1) (a,1)

  54. 54
    (a,1)
    +1 +3
    (a,1) (a,1)
    (a,2) (a,1),(b,3)
    =2 =4

  55. 55
    (a,1)
    +1 +3
    (a,1) (a,1)
    (a,2) (a,1),(b,3)

  56. 56
    (a,1)
    +1 +3
    (a,1) (a,1)
    (a,2) (a,1),(b,3)
    (a,2),(b,3) (a,2),(b,3) (a,2),(b,3)

  57. 57
    (a,1)
    +1 +3
    (a,1) (a,1)
    (a,2) (a,1),(b,3)
    (a,2),(b,3) (a,2),(b,3) (a,2),(b,3)
    =5

  58. 58
    (a,1)
    +1 +3
    (a,1) (a,1)
    (a,2) (a,1),(b,3)
    =2 =4

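The counter walkthrough on slides 52 to 58 can be replayed with a small grow-only counter. This is a minimal sketch, not Riak's datatype implementation: the `GCounter` class and its method names are made up for illustration, and the actors `a` and `b` are taken from the slides' `(a,1)`, `(b,3)` notation.

```python
class GCounter:
    """Grow-only counter: one non-negative entry per actor."""

    def __init__(self, entries=None):
        self.entries = dict(entries or {})

    def increment(self, actor, amount=1):
        # Each actor only ever bumps its own entry.
        self.entries[actor] = self.entries.get(actor, 0) + amount

    def value(self):
        return sum(self.entries.values())

    def merge(self, other):
        # Per-actor maximum: commutative, associative and idempotent.
        actors = set(self.entries) | set(other.entries)
        return GCounter({
            actor: max(self.entries.get(actor, 0), other.entries.get(actor, 0))
            for actor in actors
        })


# All three replicas start at (a,1), which reads as 1.
r1, r2, r3 = GCounter({"a": 1}), GCounter({"a": 1}), GCounter({"a": 1})

r1.increment("a", 1)  # replica 1 holds (a,2) and reads 2
r3.increment("b", 3)  # replica 3 holds (a,1),(b,3) and reads 4

merged = r1.merge(r2).merge(r3)  # (a,2),(b,3) reads 5
```

Before the merge a reader sees 1, 2 or 4 depending on which replica answers, and nothing in the counter says whether 5 is still on its way. The replicas converge, but the recency question from slide 50 stays open.
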
  59. 59
    Recency
    Partial Writes
    Atomicity

  60. 60
    A
    write B (fail)
    A A
    B A A

  61. 61
    B A A

  62. 62
    B A A
    read A
    read A
    read A

  63. 63
    B A A
    read A
    read A
    read A

  64. 64
    B A A
    read A
    read A
    read A
    read B

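The partial-write problem on slides 60 to 64 can be sketched as follows. The write of B was reported as failed but still landed on one replica, leaving the state B A A. The sketch assumes each stored value carries a version number and that a read consults two of the three replicas and returns the newer value; both are simplifications for illustration, and read repair is deliberately left out.

```python
from itertools import combinations

# (version, value) per replica after the failed write of B.
replicas = [(2, "B"), (1, "A"), (1, "A")]


def quorum_read(replicas, responders):
    """Return the newest value among the replicas that answered."""
    return max(replicas[i] for i in responders)[1]


# Try every possible pair of responding replicas.
results = {
    pair: quorum_read(replicas, pair)
    for pair in combinations(range(len(replicas)), 2)
}
```

Depending on which replicas answer, the same key reads as A or as B, even though the client that wrote B was told the write failed.
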
  65. 65
    Recency
    Partial Writes
    Atomicity

  66. Strong Consistency
    66

  67. Strong Consistency
    67
    What does
    mean for Riak 2.0?

  68. 68
    Conditional
    single key
    atomic operations

  69. 69
    No siblings

  70. 70
    get sees
    most recent put

  71. 71
    get/modify/put
    fails if object changed

  72. 72
    get/modify/put
    fails if object changed
    (e.g. concurrent put)

  73. 73
    put w/o vclock
    fails if object exists

  74. 74
    partial writes
    resolved on read

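The semantics listed on slides 68 to 73 amount to a conditional put on a single key. The toy in-memory store below is only a sketch of those rules: the `ConditionalStore` class is hypothetical, and an integer `version` stands in for the vclock that a real client would pass back.

```python
class ConditionalStore:
    """Single-key conditional puts, with an integer standing in for the vclock."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        # Returns None if the key is absent, else (version, value).
        return self._data.get(key)

    def put(self, key, value, version=None):
        current = self._data.get(key)
        if version is None:
            # Put without context: only allowed when the key does not exist.
            if current is not None:
                return False
            self._data[key] = (1, value)
            return True
        # Put with context: fails if the object changed since it was read.
        if current is None or current[0] != version:
            return False
        self._data[key] = (version + 1, value)
        return True


store = ConditionalStore()
missing = store.get("key")                   # nothing stored yet
created = store.put("key", "1")              # succeeds, key did not exist
blind = store.put("key", "2")                # fails, key exists and no context given
version, _ = store.get("key")
first = store.put("key", "2", version)       # succeeds, context is current
second = store.put("key", "22", version)     # fails, context is now stale
```

There is never more than one value per key, which is the "no siblings" property from slide 69.
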
  75. 75
    Consensus

  76. 76
    Paxos

  77. 77
    Node Node Node
    N
    prepare(N)
    promise(N, V_B)
    promise(N, V_C)
    V_N = f(V_A, V_B, V_C)
    commit(N, V_N)
    accept(N)

  78. 78
    Rinse/repeat for
    each request

  79. 79
    2 round trips/request

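The two round trips per request can be seen in a stripped-down model of a single round: one exchange to prepare and collect promises, one to commit and collect accepts. This is a sketch shaped like the diagram on slide 77, not a complete Paxos implementation. The `Acceptor` class and `run_round` function are illustrative names, message loss and competing proposers are not modelled, and the new value is computed from the most recently accepted value only.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0   # highest ballot number promised so far
        self.accepted = 0   # ballot number of the currently accepted value
        self.value = None

    def prepare(self, n):
        # Promise only if the ballot is newer than anything seen before.
        if n > self.promised:
            self.promised = n
            return (self.accepted, self.value)
        return None

    def commit(self, n, value):
        # Accept unless a newer ballot has been promised in the meantime.
        if n >= self.promised:
            self.promised = self.accepted = n
            self.value = value
            return True
        return False


def run_round(acceptors, n, f):
    quorum = len(acceptors) // 2 + 1
    # Round trip 1: prepare / promise.
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < quorum:
        return None
    _, latest = max(promises, key=lambda p: p[0])
    value = f(latest)
    # Round trip 2: commit / accept.
    acks = sum(a.commit(n, value) for a in acceptors)
    return value if acks >= quorum else None


nodes = [Acceptor() for _ in range(3)]
first = run_round(nodes, 1, lambda old: "X")
stale = run_round(nodes, 1, lambda old: "Z")       # ballot 1 was already used
second = run_round(nodes, 2, lambda old: old + "Y")
```

Every request pays for both exchanges, which is the cost the Multi-Paxos slides that follow set out to reduce.
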
  80. 80
    Multi-Paxos

  81. 81
    First Request

  82. 82
    Node Node Node
    N, I
    prepare(N, I)
    promise(N, I, V_B)
    promise(N, I, V_C)
    V_N = f(V_A, V_B, V_C)
    commit(N, I, V_N)
    accept(N, I)

  83. 83
    Each Additional Request

  84. 84
    Node Node Node
    I
    commit(N, I, V)
    accept(N, I)

  85. 85
    1 round trip/request
    (common case)

  86. 86
    Problem
    Shipping entire state
    each request is
    expensive

  87. 87
    Solution
    Paxos
    +
    Replicated Log

  88. 88
    Problem
    Now I have
    N problems

  89. 89
    Log recovery
    Log trimming
    Rollup
    Snapshots
    Fault Recovery

  90. 90
    Choose your own
    adventure...

  91. 91
    Better Solution
    Build log replication
    into protocol

  92. 92
    Better Solution
    ZK Atomic Broadcast
    Raft

  93. Zab
    93

  94. 94

  95. 95

  96. 96

  97. 97

  98. Raft
    98

  99. 99

  100. 100
    raftconsensus.github.io

  101. 101

  102. Back to Riak
    102

  103. 103
    Key/Value
    Keys are independent
    Active Anti-Entropy
    Tunable backends

  104. 104
    Each key is
    independent state

  105. 105
    Simple multi-paxos
    per key

  106. 106
    1B keys
    =
    1B consensus groups?

  107. 107
    No

  108. 108
    Consensus group
    per preflist (replica set)

  109. 109
    Emulate paxos per key

  110. Node 0
    Node 1
    Node 2

  111. 111
    1 2
    3
    4
    5
    6
    7
    123

  112. 112
    1 2
    3
    4
    5
    6
    7
    123
    234

  113. 113
    1 2
    3
    4
    5
    6
    7
    123
    234
    345

  114. 114
    1 2
    3
    4
    5
    6
    7
    123
    234
    345
    456

  115. 115
    1 2
    3
    4
    5
    6
    7
    123
    234
    345
    456
    567
    ...

  116. 116
    1 2
    3
    4
    5
    6
    7
    123
    234
    345
    456
    567
    Ensembles
    ...

  117. 117
    64 partition ring
    =
    64 ensembles

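The mapping from ring partitions to ensembles on slides 111 to 117 can be sketched in a few lines. The `ensembles` function is a hypothetical helper, and a replication factor of 3 is assumed from the three-member groups (123, 234, 345, ...) shown on the slides.

```python
def ensembles(ring_size, n_val=3):
    """One ensemble per preflist: each partition plus its n_val - 1 successors.

    Partitions are numbered from 1 and wrap around the ring.
    """
    return [
        tuple((start + i) % ring_size + 1 for i in range(n_val))
        for start in range(ring_size)
    ]


small_ring = ensembles(7)
default_ring = ensembles(64)
```

For the seven-partition ring drawn on the slides this yields (1,2,3), (2,3,4), (3,4,5), (4,5,6), (5,6,7), (6,7,1) and (7,1,2). The number of consensus groups grows with the ring size, not with the number of keys.
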
  118. 118
    Each Ensemble
    Elects leader
    Establishes epoch
    Supports get/put ops

  119. 119
    Establish a new epoch

  120. 120
    Node Node Node
    N, I
    prepare(N, I)
    promise(N, I, V_B)
    promise(N, I, V_C)
    V_N = f(V_A, V_B, V_C)
    commit(N, I, V_N)
    accept(N, I)

  121. 121
    consensus state
    epoch
    sequence
    membership
    leader

  122. 122
    K/V objects
    epoch
    sequence
    key
    value

  123. 123
    GET
    leader reads local object
    if obj.epoch old: refresh
    reply w/ val

  124. 124
    Node Node Node
    obj.epoch < epoch
    get(Key)
    reply(Epoch_B, Seq_B, Val_B)
    reply(Epoch_C, Seq_C, Val_C)
    Val = latest(Val_A, Val_B, Val_C)
    Val.epoch = epoch
    write(Epoch, Seq, Val)
    ack(Epoch, Seq)

  125. 125
    Node Node Node
    obj.epoch = epoch
    Reply = local_get(Key)

  126. 126
    2 roundtrips/get (worst)
    0 roundtrips/get (best)

  127. 127
    PUT
    leader reads local object
    if obj.epoch old: refresh
    if modify(obj) false: fail
    commit modified obj
    reply ok

  128. 128
    Node Node Node
    obj.epoch < epoch
    get(Key)
    reply(Epoch_B, Seq_B, Val_B)
    reply(Epoch_C, Seq_C, Val_C)
    Latest = latest(Val_A, Val_B, Val_C)
    Val = modify(Latest)
    write(Epoch, Seq, Val)
    ack(Epoch, Seq)

  129. 129
    Node Node Node
    obj.epoch = epoch
    Latest = local_get(Key)
    Val = modify(Latest)
    write(Epoch, Seq, Val)
    ack(Epoch, Seq)

  130. 130
    2 roundtrips/put (worst)
    1 roundtrip/put (best)

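The GET and PUT rules on slides 123 to 130 can be traced with a toy single-process model that counts round trips. This is a sketch under strong assumptions: the `Ensemble` class is hypothetical, replica 0 always plays the leader, every replica always answers, `new_epoch` stands in for a leader election, and a `modify` function signals failure by returning `None`.

```python
class Ensemble:
    def __init__(self, n=3):
        self.replicas = [dict() for _ in range(n)]  # key -> (epoch, seq, value)
        self.epoch = 1
        self.seq = 0
        self.round_trips = 0

    def new_epoch(self):
        # Stand-in for electing a leader and establishing a new epoch.
        self.epoch += 1
        self.seq = 0

    def _write(self, key, value):
        # One round trip: write(Epoch, Seq, Val) out, ack(Epoch, Seq) back.
        self.seq += 1
        obj = (self.epoch, self.seq, value)
        for replica in self.replicas:
            replica[key] = obj
        self.round_trips += 1
        return obj

    def _current(self, key):
        local = self.replicas[0].get(key)
        if local is not None and local[0] == self.epoch:
            return local  # local copy is from this epoch: no round trip
        # Missing or from an old epoch: one round trip to collect copies.
        self.round_trips += 1
        seen = [r[key] for r in self.replicas if key in r]
        if not seen:
            return None
        latest = max(seen, key=lambda o: (o[0], o[1]))
        return self._write(key, latest[2])  # rewrite under the current epoch

    def get(self, key):
        obj = self._current(key)
        return None if obj is None else obj[2]

    def put(self, key, modify):
        obj = self._current(key)
        new_value = modify(None if obj is None else obj[2])
        if new_value is None:
            return False
        self._write(key, new_value)
        return True


e = Ensemble()
e.put("k", lambda old: "X")
after_first_put = e.round_trips        # refresh + write
e.get("k")
after_local_get = e.round_trips        # served from the leader's local copy
e.put("k", lambda old: old + "Y")
after_second_put = e.round_trips       # write only
e.new_epoch()
value = e.get("k")
after_stale_get = e.round_trips        # refresh + rewrite
```

The counters line up with the best and worst cases on slides 126 and 130: a get costs 0 or 2 round trips, a put costs 1 or 2.
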
  131. 131
    Leader abandons
    leadership if any quorum
    operation ever fails

  132. 132
    Which forces new epoch
    to be established

  133. 133
    Partial Writes

  134. failed partial write
    X
    (2)
    X
    (2)
    X
    (2)
    X
    (2)
    X
    (2)
    Y
    (2)
    epoch
    2
    epoch
    3

  135. read / rewrite / reply X
    X
    (2)
    X
    (2)
    Y
    (2)
    X
    (3)
    X
    (3)
    Y
    (2)
    epoch
    3
    epoch
    3

  136. X
    (3)
    X
    (3)
    Y
    (2)
    X
    (3)
    X
    (3)
    X
    (3)
    read / repair / reply X
    epoch
    3
    epoch
    3

  137. Usage
    137

  138. 138
    AP or CP per bucket type

  139. 139
    consistent = true

  140. 140
    $ riak-admin bucket-type create strong \
    '{"props": {"consistent": true}}'
    strong created

  141. 141
    $ riak-admin bucket-type activate strong
    strong has been activated

  142. 142
    > riakc_pb_socket:get(Socket,
    {<<"strong">>, <<"bucket">>},
    <<"key">>).
    {error,notfound}

  143. 143
    > Obj = riakc_obj:new({<<"strong">>, <<"bucket">>},
    <<"key">>,
    <<"1">>).
    > riakc_pb_socket:put(Socket, Obj).
    ok

  144. 144
    > Obj2 = riakc_obj:new({<<"strong">>, <<"bucket">>},
    <<"key">>,
    <<"2">>).
    > riakc_pb_socket:put(Socket, Obj2).
    {error,<<"failed">>}

  145. 145
    {ok, Obj3} =
    riakc_pb_socket:get(Socket,
    {<<"strong">>, <<"bucket">>},
    <<"key">>).

  146. 146
    Obj4 = riakc_obj:update_value(Obj3, <<"2">>).

  147. 147
    Obj5 = riakc_obj:update_value(Obj3, <<"22">>).

  148. 148
    > riakc_pb_socket:put(Socket, Obj4).
    ok

  149. 149
    > riakc_pb_socket:put(Socket, Obj5).
    {error,<<"failed">>}

  150. 150
    Your client may vary

  151. 151
    Your client may vary
    We’re working on it

  152. Tech Preview
    152

  153. 153
    No AAE syncing
    No 2i
    No stats

  154. 154
    Will be in 2.0 final

  155. Coming Soon
    155

  156. 156
    Datatypes
    Multi-DC
    Lightweight Tx?
    Perf benchmarks

  157. 157
    Datatypes
    Multi-DC
    Lightweight Tx?
    Perf benchmarks

  158. 158
    Datatypes
    Multi-DC
    Lightweight Tx?
    Perf benchmarks

  159. 159
    Datatypes
    Multi-DC
    Lightweight Tx?
    Perf benchmarks

  160. 160
    Datatypes
    Multi-DC
    Lightweight Tx?
    Perf benchmarks

  161. Questions?
    161