A Brief History of Time in Riak

Sean Cribbs
October 28, 2014

"Time, it will not erase me! For what is time? It's just passing by." -- Lost In The Trees (2007)

What is the meaning of time in a distributed system? How can we make sense of events separated by great spans of space-time? Why do I need to give Riak a vector clock with every write? How do vector clocks even work?

This talk is the history of how we have tried to make sense of time, ordering, and causality in Riak, the trade-offs associated with each solution, and representations of logical time we are investigating for future versions.

Transcript

  1. A Brief History of Time in Riak. Sean Cribbs. Image: CC-BY Judy Schmidt [geckzilla.com]
  2. Lost in the Trees: “Time, it will not erase me, for what is time? It’s just passing by.” Photo: CC BY-SA “Tabercil” (Wikimedia Commons)
  3. Measuring Time with Physical Clocks
  4. Measuring Time with Physical Clocks: many light-years
  5. Measuring Time with Physical Clocks
  6. Measuring Time with Physical Clocks: SF 14ms NY
  7. Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport, 1978
  8. Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport, 1978
  9. Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport, 1978
  10. Detection of Mutual Inconsistency in Distributed Systems. D. Stott Parker Jr., et al., 1983
  11. Kyle Kingsbury (@aphyr): “Do you even know how vector clocks work?”
  12. Kyle Kingsbury (@aphyr): “Do you even know how vector clocks work?” [version vectors]
  13. Version Vectors Are Not Vector Clocks. Carlos Baquero, 2011. http://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/
  14. Version Vectors: A C B
  15. Version Vectors: A C B {a, 1} {b, 1} {c, 1} {a, 2} [ ] {a, 2}, {b, 1}, {c, 1}
  16. Version Vectors: A C B {a, 1} {b, 1} {c, 1} {a, 2} [ ] {a, 2}, {b, 1}, {c, 1}
  17. Version Vectors: A C B {a, 1} {b, 1} {c, 1} {a, 2} [ ] {a, 2},{b, 1},{c, 1}
  18. Version Vectors: A C B {a, 1} {b, 1} {c, 1} {a, 2} [{a,2}, {b,1}, {c,1}]
  19. Version Vectors: [{a,2}, {b,1}, {c,1}]
  20. Update Version Vector: [{a,2}, {b,1}, {c,1}]
  21. Update Version Vector: [{a,2}, {b,2}, {c,1}]
  22. Update Version Vector: [{a,2}, {b,3}, {c,1}]
  23. Update Version Vector: [{a,2}, {b,3}, {c,2}]
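The update steps on slides 20 to 23 can be sketched in a few lines of Python. This is only an illustration: the dict representation and the `increment` helper are assumptions made here, not Riak's Erlang implementation.

```python
def increment(vv, actor):
    """Return a copy of the version vector with the actor's counter advanced by one."""
    updated = dict(vv)
    updated[actor] = updated.get(actor, 0) + 1
    return updated

# Replay the slide sequence: b updates twice, then c updates once.
vv = {"a": 2, "b": 1, "c": 1}
vv = increment(vv, "b")
vv = increment(vv, "b")
vv = increment(vv, "c")
print(vv)
```

An actor that has never written simply starts at 1, so no entry needs to exist up front.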
  24. Comparing Version Vectors
      • Descends: A >= B
      • Dominates: A > B
      • Concurrent: A | B
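The three comparisons can be written down directly if a version vector is modelled as a dict from actor to counter. The function names follow the slide; the representation is an assumption made for this sketch.

```python
def descends(a, b):
    """A >= B: a has seen every event that b records."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def dominates(a, b):
    """A > B: a descends b and the two are not equal."""
    return descends(a, b) and not descends(b, a)

def concurrent(a, b):
    """A | B: neither vector descends the other."""
    return not descends(a, b) and not descends(b, a)

newer = {"a": 2, "b": 1}
older = {"a": 1, "b": 1}
left = {"a": 2}
right = {"b": 1}
print(descends(newer, older), dominates(newer, older), concurrent(left, right))
```

A missing actor is treated as a counter of zero, which is why `left` and `right` come out concurrent rather than incomparable.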
  25. Merging Version Vectors
      • A ⊔ B = C :: pairwise maximum of each actor
      • C >= A and C >= B
      • A | B ⟹ C > A and C > B
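A small sketch of the merge, again assuming a dict-based representation, makes the three properties on this slide easy to check by hand.

```python
def descends(a, b):
    """A >= B: a has seen every event that b records."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def merge(a, b):
    """A ⊔ B: pairwise maximum of each actor's counter."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in a.keys() | b.keys()}

# Two concurrent vectors: each has seen something the other has not.
a = {"a": 2, "b": 1}
b = {"a": 1, "b": 3, "c": 1}
c = merge(a, b)
print(c)
```

Because `a` and `b` are concurrent, `c` strictly dominates both: it descends each input, and neither input descends it.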
  26. Syntactic Merging
      • Discards seen values
      • Retains concurrent values
      • Merges divergent version vectors
  27. Now You Even Know How Version Vectors Work
  28. A Brief History of Causality in Riak
  29. Client-Side IDs: Riak 0.1 - 0.14
  30. Client-Side IDs (Riak 0.x)
      • Client IDs are used in the version vector
      • Riak returns VV on fetch
      • Riak increments at API layer on PUT
      • Syntactic merge / store
      • I >= L ⟹ overwrite
      • I =< L ⟹ ignore
      • I | L ⟹ add sibling and merge clocks
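The store rule above can be sketched as follows. This is a simplified model, not Riak's code: it assumes I is the incoming clock and L the replica's local clock, keeps one clock per object with a flat list of sibling values, and uses hypothetical helper names (`api_put`, `store`).

```python
def descends(a, b):
    """a >= b: a has seen everything recorded in b."""
    return all(a.get(actor, 0) >= n for actor, n in b.items())

def merge(a, b):
    """Pairwise maximum of each actor's counter."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def api_put(client_id, context, value):
    """Increment the client's own entry once, at the API layer."""
    clock = dict(context)
    clock[client_id] = clock.get(client_id, 0) + 1
    return clock, value

def store(local, incoming):
    """Apply one (clock, value) write to a replica; local is None or (clock, siblings)."""
    in_clock, value = incoming
    if local is None:
        return in_clock, [value]
    local_clock, siblings = local
    if descends(in_clock, local_clock):    # I >= L: overwrite
        return in_clock, [value]
    if descends(local_clock, in_clock):    # I =< L: ignore
        return local
    # I | L: add sibling and merge clocks
    return merge(in_clock, local_clock), siblings + [value]

# Clients x and y both write without having read anything first.
rita = api_put("x", {}, "Rita")
sue = api_put("y", {}, "Sue")
state = store(None, rita)
state = store(state, sue)        # concurrent with Rita: becomes a sibling
replayed = store(state, sue)     # the same write again changes nothing
bob = api_put("x", state[0], "Bob")
resolved = store(state, bob)     # descends both siblings: overwrite
print(state, resolved)
```

Replaying `sue` leaves the state unchanged because the merged clock already descends it, which is the idempotence listed among the benefits on the next slide.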
  31. Client-Side IDs: Benefits
      • Any node can take updates (no forwarding)
      • Idempotent writes
      • No sibling explosion
  32. Client-Side IDs: Drawbacks
      • Actor explosion (Charron-Bost result)
      • Client application manages ID
      • RYOW required for correctness
  33. “Vnode Vclocks”: Riak 1.0 - 1.4
  34. Vnode Vclocks (Riak 1.x)
      • Use the Virtual Node (vnode) as the Actor
      • Coordinated writes with forwarding
      • Addition of false concurrency
  35. Vnode Vclocks, False Concurrency: C1 C2 Riak GET Foo GET Foo
  36. Vnode Vclocks, False Concurrency: C1 C2 Riak [{a,1},{b,4}]->“bob” [{a,1},{b,4}]->“bob”
  37. Vnode Vclocks, False Concurrency: C1 C2 Riak PUT [{a,1},{b,4}]=“Rita” PUT [{a,1},{b,4}]=“Sue”
  38. Vnode Vclocks, False Concurrency: C1 C2 PUT FSM 1 PUT FSM 2 VNODE Q RITA SUE VNODE
  39. Vnode Vclocks, False Concurrency: VNODE Q RITA VNODE A [{a,2},{b,4}]=“SUE” [{a,1},{b,4}]
  40. Vnode Vclocks, False Concurrency: VNODE Q [{a,3},{b,4}]=[RITA,SUE] VNODE A [{a,2},{b,4}]=“SUE”
  41. Vnode Vclocks (Riak 1.x)
      • I >= L ⟹ increment, overwrite, replicate
      • Not I >= L ⟹ merge, increment, add sibling, replicate
  42. Vnode Vclocks: Benefits
      • Far fewer actors
      • Simpler for users
      • Contextless writes are OK
  43. Vnode Vclocks: Drawbacks
      • Forwarding adds latency
      • Writes not idempotent
      • Sibling explosion
  44. Sibling Explosion: false concurrency + retries = large objects
  45. Sibling Explosion: C1 C2 Riak GET Foo GET Foo
  46. Sibling Explosion: C1 C2 Riak not found not found
  47. Sibling Explosion: C1 Riak PUT []=“Rita” [{a,1}]->“Rita”
  48. Sibling Explosion: C2 Riak PUT []=“Sue” [{a,2}]->[“Rita”, “Sue”]
  49. Sibling Explosion: C1 Riak PUT [{a, 1}]=“Bob” [{a,3}]->[“Rita”, “Sue”, “Bob”]
  50. Sibling Explosion: C2 Riak PUT [{a,2}]=“Babs” [{a,4}]->[“Rita”, “Sue”, “Bob”, “Babs”]
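The exchange on slides 45 to 50 can be replayed with a toy model of the Riak 1.x coordinator rule (I >= L: increment, overwrite; otherwise merge, increment, add sibling). The `vnode_put` helper and the dict representation are assumptions made for this sketch, and replication is left out.

```python
def descends(a, b):
    """a >= b: a has seen everything recorded in b."""
    return all(a.get(actor, 0) >= n for actor, n in b.items())

def merge(a, b):
    """Pairwise maximum of each actor's counter."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def increment(clock, actor):
    """Copy of clock with the actor's counter advanced by one."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated

def vnode_put(actor, local, context, value):
    """Coordinate one write on a vnode; local is (clock, siblings)."""
    local_clock, siblings = local
    if descends(context, local_clock):
        # I >= L: increment, overwrite
        return increment(context, actor), [value]
    # Not I >= L: merge, increment, add sibling
    return increment(merge(context, local_clock), actor), siblings + [value]

state = ({}, [])                                   # key not found
state = vnode_put("a", state, {}, "Rita")          # C1, no context
state = vnode_put("a", state, {}, "Sue")           # C2, no context
state = vnode_put("a", state, {"a": 1}, "Bob")     # C1, stale context
state = vnode_put("a", state, {"a": 2}, "Babs")    # C2, stale context
print(state)
```

Each client's context is always one step behind the vnode's clock, so no write ever descends the local clock and every value is kept as a sibling.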
  51. Vnode Vclocks + Dots: Riak 2.0
  52. Dotted Version Vectors: Efficient Causality Tracking for Distributed Key-Value Stores. Preguiça, Baquero, et al., 2012
  53. What Even Is a Dot? A C B {a, 1} {b, 1} {c, 1} {a, 2} [{a,2}, {b,1}, {c,1}]
  54. What Even Is a Dot? A C B {a, 1} {b, 1} {c, 1} {a, 2} [{a,2}, {b,1}, {c,1}]
  55. Vnode Vclocks + Dots
      • I >= L ⟹ increment, apply dot, overwrite, replicate
      • Not I >= L ⟹ merge, increment, apply dot, add sibling, replicate
  56. Vnode Vclocks + Dots: [{a, 4}] Rita Sue Babs Bob [{a, 3}] Pete
  57. Vnode Vclocks + Dots: [{a, 4}] Rita Sue Babs Bob [{a, 3}] Pete {a,1} {a,2} {a,3} {a,4}
  58. Vnode Vclocks + Dots: [{a, 4}] Babs [{a, 3}] Pete {a,4}
  59. Vnode Vclocks + Dots: [{a, 5}] Babs Pete {a,4} {a,5}
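Slides 56 to 59 can be reproduced with a short sketch in which every sibling carries the dot (actor, counter) of the write that created it. The `dotted_put` helper is hypothetical, and folding both branches of the rule into one "drop what the context has seen" step is a simplification made here, not necessarily how Riak structures it.

```python
def merge(a, b):
    """Pairwise maximum of each actor's counter."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def increment(clock, actor):
    """Copy of clock with the actor's counter advanced by one."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated

def dotted_put(actor, local_clock, siblings, context, value):
    """siblings is a list of (dot, value) pairs, where dot = (actor, counter)."""
    # Keep only siblings whose dot the incoming context has NOT seen.
    survivors = [(dot, v) for dot, v in siblings
                 if context.get(dot[0], 0) < dot[1]]
    clock = increment(merge(context, local_clock), actor)
    new_dot = (actor, clock[actor])
    return clock, survivors + [(new_dot, value)]

local_clock = {"a": 4}
siblings = [(("a", 1), "Rita"), (("a", 2), "Sue"),
            (("a", 3), "Bob"), (("a", 4), "Babs")]
# Pete's write was made with context [{a,3}], so it has seen Rita, Sue and Bob.
clock, values = dotted_put("a", local_clock, siblings, {"a": 3}, "Pete")
print(clock, values)
```

Only Babs, whose dot {a,4} lies outside the context, survives next to Pete. Without dots, all four earlier values would have been kept.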
  60. Replica Merge + Dots: [{a, 4}, {b,1}] Babs Bob [{a,3}, {b,2}] Pete {a,3} {a,4} Phil {b,1} Sue {a,2} {b,2} Bob {a,3}
  61. Replica Merge + Dots: [{a,3}, {b,2}] [{a, 4}, {b,2}] Babs Bob Pete {a,3} {a,4} {b, 2} Bob
  62. Riak-KV #679: Infamous Data-Loss Bug
  63. KV679: Lingering Tombstone
      • Delete key (tombstone) with at least one fallback
      • Primaries read and reap tombstone
      • Recreate key (new context)
      • “Doomstone” handed off
      • New data lost!
  64. KV679: Backup/Restore
      • Back up data
      • Delete key and reap
      • Recreate key (new context)
      • Restore from backup
      • New data lost!
  65. KV679: Failed Local Read
      • Local data corrupted / lost
      • Local read doesn’t get context, create new
      • Replicate write where higher contexts exist
      • New data lost!
  66. Core Problem: Time Goes Backwards
  67. Current Solution
      • Add an “epoch” to actor in VV
      • Increment epoch when getting not_found or error
      • New actor for resurrected key = disjoint history
      • Better actor ID generation
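Why an epoch helps can be shown with a toy comparison. The actor layout used here, a (vnode id, epoch) pair, and the `first_write_clock` helper are assumptions made for illustration. The slides do not say how Riak encodes the epoch.

```python
def descends(a, b):
    """a >= b: a has seen everything recorded in b."""
    return all(a.get(actor, 0) >= n for actor, n in b.items())

def first_write_clock(vnode_id, epoch):
    """Clock of the first write after a local not_found, under actor (vnode_id, epoch)."""
    return {(vnode_id, epoch): 1}

# A lingering tombstone still carries the old history of the deleted key.
doomstone_clock = {("a", 0): 3}

# Same actor reused: the recreated key restarts at 1 and looks older than the tombstone.
without_epoch = first_write_clock("a", 0)
# Epoch bumped on not_found: the recreated key is written under a new actor.
with_epoch = first_write_clock("a", 1)

lost = descends(doomstone_clock, without_epoch)
kept = not descends(doomstone_clock, with_epoch)
print(lost, kept)
```

With the bumped epoch the two histories are disjoint, so the tombstone can no longer dominate the recreated value when it is handed off.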
  68. Conclusions & Future: Riak 2.x+
  69. Conclusions
      • Understand causality and partial-ordering
      • Our understanding has evolved over time
      • Causality tracking has lots of room for optimization
      • Positive interaction with academia
  70. Future Possibilities
      • Global version vectors
      • Riak KV “as a CRDT”
      • More efficient object representations
  71. Huge Thanks: Russell Brown, CRDT guru, KV679 killer
  72. Acknowledgments
      • Jon Meredith, Christopher Meiklejohn, Jordan West, Tyler Hannan, Carlos Baquero (and team), Nuno Preguiça
      • NASA Astronomy Picture of the Day (APOD), Judy Schmidt