

Consistency, Availability, Partition: Make Your Choice

Shared-data systems strive to satisfy data consistency, system availability and tolerance to network partitions. In a distributed system it is impossible to provide all three of these guarantees simultaneously at any given moment in time. The purpose of this talk is to show the mechanisms used by data storage systems such as Dynamo and BigTable to satisfy two of these guarantees at a time.

Andrea Giuliano

February 18, 2015



Transcript

1. MAKE YOUR CHOICE: CONSISTENCY, AVAILABILITY, PARTITION. Andrea Giuliano (@bit_shark)
2. DISTRIBUTED SYSTEMS
3. WHAT A DISTRIBUTED SYSTEM IS. “A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages”
4. DISTRIBUTED SYSTEMS: EXAMPLES
5. DISTRIBUTED SYSTEMS: REPLICATION
6. REPLICATED SERVICE PROPERTIES: CONSISTENCY, AVAILABILITY
7. CONSISTENCY. The result of operations will be predictable.
8. CONSISTENCY. Strong consistency: all replicas return the same value for the same object.
9. CONSISTENCY. Strong consistency: all replicas return the same value for the same object. Weak consistency: different replicas can return different values for the same object.
10. STRONG VS WEAK CONSISTENCY
11. STRONG VS WEAK CONSISTENCY. Strong consistency: an atomic, consistent, isolated, durable (ACID) database. Weak consistency: a Basically Available, Soft-state, Eventually consistent (BASE) database.
12. EXAMPLE: CONSISTENCY. put(price, 10)
13. EXAMPLE: CONSISTENCY. get(price) returns price = 10
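To make the put/get example concrete, here is a toy Python illustration (the replica layout is made up): after put(price, 10), a strongly consistent store has every replica answer 10, while a weakly consistent one may still serve the old value from a replica the update has not reached.

    # Three replicas after put(price, 10); the third has not applied the update yet.
    replicas = [{"price": 10}, {"price": 10}, {"price": 0}]

    def get(replica_id, key):
        # Under weak consistency, the answer depends on which replica you hit.
        return replicas[replica_id][key]

    values = {get(i, "price") for i in range(len(replicas))}
    # Strong consistency would require a single value across all replicas.
    print("strongly consistent" if len(values) == 1 else "weak: stale read possible")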
14. EXAMPLE: AVAILABILITY
15. PARTITION TOLERANCE. Continue to operate even in the presence of partitions.
16. PARTITION TOLERANCE. Network failure: groups form at each side of a faulty network entity (switch, backbone). Process failure: the system splits into two groups, the correct nodes and the crashed nodes.
17. CAP THEOREM. “Of three properties of shared-data systems (data consistency, system availability and tolerance to network partitions) only two can be achieved at any given moment in time.”
18. CAP THEOREM: THE PROOF. At time t1, put(price, 10) is applied in partition 1. At time t2, a get(price) served from partition 2 either returns the stale value price = 0 (not consistent) or returns no response (not available).
19. CAP THEOREM IN PRACTICE. Consistency + partition tolerance, giving up availability: ➡ distributed databases ➡ distributed locking ➡ majority protocol ➡ active/passive replication ➡ quorum-based systems. Example: BigTable.
20. CAP THEOREM. Availability + partition tolerance, giving up consistency: ➡ web caches ➡ stateless systems ➡ DNS. Example: DynamoDB.
21. CAP THEOREM. Consistency + availability, giving up partition tolerance: ➡ single-site databases ➡ cluster databases ➡ LDAP.
22. REQUIREMENTS (DYNAMO). “customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.”
23. REQUIREMENTS (DYNAMO). “customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.” ➡ reliable ➡ highly scalable ➡ always available
24. A SIMPLE INTERFACE (DYNAMO). get(key): returns the object associated with the key, either a single object or a list of objects with conflicting versions, along with a context. put(key, context, object): determines where the replicas of the object should be placed based on the associated key; the context includes information such as the version of the object.
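A hedged sketch of this two-operation interface in Python. The signatures follow the slide; the single-replica body and the Store name are illustrative stand-ins, not Amazon's implementation.

    class Store:
        """Toy single-replica stand-in for Dynamo's get/put interface."""

        def __init__(self):
            self._data = {}   # key -> (object, version)

        def get(self, key):
            # Real Dynamo may return several conflicting versions plus a
            # context; this toy keeps at most one version per key.
            obj, version = self._data.get(key, (None, 0))
            objects = [obj] if obj is not None else []
            return objects, {"version": version}

        def put(self, key, context, obj):
            # The context carries version information from the preceding
            # get(), mimicking Dynamo's read-then-write cycle.
            self._data[key] = (obj, context["version"] + 1)

    store = Store()
    carts, ctx = store.get("cart:alice")
    store.put("cart:alice", ctx, (carts[0] if carts else []) + ["book-123"])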
25. REPLICATION: THE CHOICE (DYNAMO). Synchronous replica coordination ‣ strong consistency ‣ availability trade-off. Optimistic replication ‣ high availability ‣ probability of conflicts.
26. CONFLICTS: WHEN (DYNAMO). At write time ‣ probability of rejected writes. At read time ‣ an “always writable” datastore.
27. CONFLICTS: WHO (DYNAMO). The data store ‣ e.g. a “last write wins” policy. The application ‣ resolution as an implementation detail.
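Datastore-side “last write wins” fits in a couple of lines; the version names and timestamps below are made up for the example.

    # Conflicting versions of the same key, each tagged with a write timestamp.
    versions = [("cart-v1", 1424255000), ("cart-v2", 1424258600)]

    # "Last write wins": keep the version with the newest timestamp. An
    # application-side resolver could instead merge the conflicting values.
    winner = max(versions, key=lambda pair: pair[1])
    print(winner[0])   # cart-v2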
28. A RING TO RULE THEM ALL (DYNAMO)
29. PARTITIONING: THE RING (DYNAMO). Diagram: nodes A through G placed on a ring; data is assigned a position by hashing its key.
30. REPLICATION (DYNAMO). On the ring of nodes A through G with N = 3, node D will store the keys in the ranges (A, B], (B, C] and (C, D].
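A toy version of the ring in Python, assuming MD5 as the hash function; the Ring and preference_list names are illustrative, not Dynamo's API. Each key is hashed onto the ring and stored on the first node clockwise from its position, plus the next N-1 nodes.

    import hashlib
    from bisect import bisect_right

    class Ring:
        def __init__(self, nodes, n=3):
            self.n = n
            # Place each node on the ring at the hash of its name.
            self.ring = sorted((self._hash(node), node) for node in nodes)

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def preference_list(self, key):
            # The N nodes responsible for `key`: the first node clockwise
            # from hash(key), then the next N-1 nodes around the ring.
            positions = [pos for pos, _ in self.ring]
            start = bisect_right(positions, self._hash(key)) % len(self.ring)
            return [self.ring[(start + i) % len(self.ring)][1]
                    for i in range(self.n)]

    ring = Ring("ABCDEFG", n=3)
    print(ring.preference_list("price"))   # three consecutive nodes, e.g. ['E', 'F', 'G']

The real system additionally places multiple virtual nodes per physical host and skips duplicates when building the preference list; both refinements are omitted here.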
31. DATA VERSIONING (DYNAMO). put() may return before the update has been propagated to all replicas, so a subsequent get() may return an object that does not have the latest update.
32. RECONCILIATION (DYNAMO)
33. RECONCILIATION (DYNAMO). Syntactic reconciliation ‣ the new version subsumes the previous one. Semantic reconciliation ‣ conflicting versions of the same object.
34. VECTOR CLOCK (DYNAMO)
35. VECTOR CLOCK (DYNAMO). Definition ‣ a list of (node, counter) pairs.
36. VECTOR CLOCK (DYNAMO). D1 [Sx,1]: a write handled by Sx.
37. VECTOR CLOCK (DYNAMO). D2 [Sx,2]: another write handled by Sx.
38. VECTOR CLOCK (DYNAMO). D3 [Sx,2],[Sy,1]: a write handled by Sy.
39. VECTOR CLOCK (DYNAMO). D4 [Sx,2],[Sz,1]: a write handled by Sz, concurrent with D3.
40. VECTOR CLOCK (DYNAMO). D5 [Sx,3],[Sy,1],[Sz,1]: D3 and D4 reconciled and written by Sx.
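The D1 to D5 story above fits in a few lines of Python. A minimal sketch with illustrative names (VectorClock, descends, merge), not Dynamo's actual API: descends detects when one version subsumes another (syntactic reconciliation), and when it fails in both directions the versions conflict and need semantic reconciliation.

    class VectorClock:
        """A list of (node, counter) pairs, stored as a dict."""

        def __init__(self, counters=None):
            self.counters = dict(counters or {})

        def increment(self, node):
            # Record a write handled by `node`, returning a new clock.
            clock = VectorClock(self.counters)
            clock.counters[node] = clock.counters.get(node, 0) + 1
            return clock

        def descends(self, other):
            # True if self subsumes other: syntactic reconciliation applies.
            return all(self.counters.get(node, 0) >= count
                       for node, count in other.counters.items())

        def merge(self, other):
            # Pairwise maximum of the two clocks, used when reconciling.
            merged = dict(self.counters)
            for node, count in other.counters.items():
                merged[node] = max(merged.get(node, 0), count)
            return VectorClock(merged)

    # The slides' scenario:
    d1 = VectorClock().increment("Sx")                   # D1 [Sx,1]
    d2 = d1.increment("Sx")                              # D2 [Sx,2]
    d3 = d2.increment("Sy")                              # D3 [Sx,2],[Sy,1]
    d4 = d2.increment("Sz")                              # D4 [Sx,2],[Sz,1]
    assert d2.descends(d1)                               # D2 subsumes D1
    assert not d3.descends(d4) and not d4.descends(d3)   # D3 and D4 conflict
    d5 = d3.merge(d4).increment("Sx")                    # D5 [Sx,3],[Sy,1],[Sz,1]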
41. PUT() AND GET() (DYNAMO). R ‣ the minimum number of nodes that must participate in a read operation. W ‣ the minimum number of nodes that must participate in a successful write operation.
42. PUT() AND GET() (DYNAMO). put() ‣ the coordinator generates the vector clock for the new version and writes the new version locally ‣ the new version is sent to the N nodes ‣ the write is successful if W-1 nodes respond. get() ‣ the coordinator requests all existing versions of the data ‣ waits for R responses before returning the result ‣ returns all causally unrelated versions ‣ divergent versions are reconciled and written back.
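A sketch of the coordinator's R/W bookkeeping under the common configuration N = 3, R = W = 2; the fake transport and all names are illustrative, not Dynamo's code.

    import random

    N, R, W = 3, 2, 2                       # quorum configuration with R + W > N
    replicas = [dict() for _ in range(N)]   # stand-ins for the preference-list nodes

    def send(replica, key, value):
        # Fake transport: a replica occasionally fails to acknowledge.
        if random.random() < 0.2:
            return False
        replica[key] = value
        return True

    def put(key, value):
        # The coordinator writes locally and to the other replicas, succeeding
        # once W acknowledgements are in (its own write plus W-1 responses).
        acks = sum(send(replica, key, value) for replica in replicas)
        return acks >= W

    def get(key):
        # The coordinator asks every replica and waits for R responses; any
        # causally-unrelated versions found would be returned for reconciliation.
        responses = [replica[key] for replica in replicas if key in replica]
        return responses[:R] if len(responses) >= R else None

    if put("price", 10):
        print(get("price"))

With R + W > N, the read and write sets overlap in at least one replica, so a successful read observes at least one copy of the latest successful write.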
43. SLOPPY QUORUM (DYNAMO). Diagram: the ring of nodes A through G with N = 3; reads and writes are performed on the first N healthy nodes, which may not be the first N nodes met while walking the ring.
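Under a sloppy quorum, a write meant for a failed preference-list node lands on the next healthy node with a hint attached, to be handed back when the node recovers (hinted handoff, described in the Dynamo paper). A toy sketch; the node names and hint format are made up.

    nodes = {name: {} for name in "ABCDEFG"}   # each node's local storage
    down = {"B"}                                # B is currently unreachable

    def sloppy_put(preference_list, key, value):
        used = []
        for intended in preference_list:
            target, hint = intended, None
            if intended in down:
                # Walk the ring to the next healthy node outside the
                # preference list and tag the write with a hint.
                target = next(n for n in nodes
                              if n not in down and n not in used
                              and n not in preference_list)
                hint = f"return-to:{intended}"
            nodes[target][key] = (value, hint)
            used.append(target)
        return used

    print(sloppy_put(["A", "B", "C"], "price", 10))   # ['A', 'D', 'C']: D holds B's write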
44. WHY IS IT AP? (DYNAMO) ‣ requests are served even if some replicas are not available ‣ if a node is down, the write is stored on another node ‣ consistency conflicts are resolved at read time or in the background ‣ eventually, all replicas converge ‣ concurrent read/write operations can make distinct clients see distinct versions of the same key.
45. REQUIREMENTS (GOOGLE BIGTABLE) ‣ scale to petabytes of data ‣ thousands of machines ‣ high availability ‣ high performance.
46. DATA MODEL (GOOGLE BIGTABLE) ‣ a sparse, distributed, persistent, multi-dimensional sorted map: (row: string, column: string, time: int64) → string.
47. ROWS (GOOGLE BIGTABLE) ‣ arbitrary strings ‣ read/write operations are atomic ‣ data is maintained in lexicographic order by row key ‣ each row range is called a tablet. Example: the page maps.google.com is stored under the reversed row key com.google.maps.
48. COLUMNS (GOOGLE BIGTABLE) ‣ column keys are grouped into sets called column families ‣ a column family must be created before data can be stored under any column key in that family ‣ column keys are named family:qualifier ‣ access control and both disk and memory accounting are performed at the column-family level.
49. TIMESTAMPS (GOOGLE BIGTABLE). Diagram: the contents: column of row com.example holds two versions of the page (<html>…) at timestamps t1 and t2.
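The map signature from slide 46 and the timestamped versions above can be modelled directly in Python; table, put_cell and get_cell are made-up names for a toy in-memory dict, not Bigtable's storage format. get_cell returns the newest version, which is Bigtable's default read behaviour.

    # (row: string, column: string, time: int64) -> string, as a Python dict.
    table = {}

    def put_cell(row, column, timestamp, value):
        table[(row, column, timestamp)] = value

    def get_cell(row, column):
        # Return the most recent version of the cell (Bigtable's default).
        versions = {t: v for (r, c, t), v in table.items()
                    if (r, c) == (row, column)}
        return versions[max(versions)] if versions else None

    # Two versions of com.example's contents: column, at t1 = 1 and t2 = 2.
    put_cell("com.example", "contents:", 1, "<html>…")
    put_cell("com.example", "contents:", 2, "<html>… (newer crawl)")
    put_cell("com.cnn.www", "anchor:cnnsi.com", 1, "cnn")   # family:qualifier key

    print(get_cell("com.example", "contents:"))   # the t2 version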
50. DATA MODEL: EXAMPLE (GOOGLE BIGTABLE). Rows sorted by key, with column families language:, contents:, anchor:cnnsi.com and anchor:my.look.ca: row com.example has language: “en” and contents: “<!DOCTYPE html PUBLIC …”; row com.cnn.www has language: “en”, contents: “<!DOCTYPE html PUBLIC …”, anchor:cnnsi.com = “cnn” and anchor:my.look.ca = “cnn.com”; row com.cnn.www/foo has language: “en” and contents: “<!DOCTYPE html PUBLIC …”.
51. DIFFERENCES WITH RDBMS (GOOGLE BIGTABLE). An RDBMS offers a query language, joins, referential integrity and explicit sorting; BigTable offers a specific API, no joins or referential integrity, and sorting defined a priori in the column family.
52. ARCHITECTURE (GOOGLE BIGTABLE). Google File System (GFS) ‣ stores data files and logs. Google SSTable ‣ stores BigTable data. Chubby ‣ a highly available distributed lock service.
53. COMPONENTS (GOOGLE BIGTABLE). A library ‣ linked into every client. One master server ‣ assigns tablets to tablet servers ‣ detects the addition and expiration of tablet servers ‣ balances tablet-server load ‣ garbage-collects files in GFS ‣ handles schema changes. Many tablet servers ‣ each manages 10 to 1,000 tablets ‣ handles read and write requests to its tablets ‣ splits tablets that have grown too large.
54. COMPONENTS (GOOGLE BIGTABLE). Diagram: a client performs metadata operations against the master server and read/write operations directly against the tablet servers.
55. STARTUP AND GROWTH (GOOGLE BIGTABLE). Diagram of the tablet location hierarchy: a Chubby file points to the root tablet (the first METADATA tablet), which points to the other METADATA tablets, which in turn point to the tablets of the user tables (UserTable1 … UserTableN).
56. TABLET ASSIGNMENT (GOOGLE BIGTABLE). Tablet server ‣ when started, creates and acquires a lock in Chubby. Master ‣ grabs a unique master lock in Chubby ‣ scans Chubby to find live tablet servers ‣ asks each tablet server to discover its tablets ‣ scans the METADATA table to learn the full set of tablets ‣ builds a set of unassigned tablets, for future tablet assignment.
57. WHY IS IT CP? (GOOGLE BIGTABLE) ‣ if the master dies, the service no longer functions ‣ if a tablet server dies, its tablets become unavailable ‣ if Chubby dies, BigTable can neither execute synchronization operations nor serve client requests ‣ the Google File System is itself a CP system.
58. REFERENCES. G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store”. F. Chang et al., “Bigtable: A Distributed Storage System for Structured Data”. Assets:
https://farm1.staticflickr.com/41/86744006_0026864df8_b_d.jpg
https://farm9.staticflickr.com/8305/7883634326_4e51a1a320_b_d.jpg
https://farm5.staticflickr.com/4145/4958650244_65b2eddffc_b_d.jpg
https://farm4.staticflickr.com/3677/10023456065_e54212c52e_b_d.jpg
https://farm4.staticflickr.com/3076/2871264822_261dafa44c_o_d.jpg
https://farm1.staticflickr.com/7/6111406_30005bdae5_b_d.jpg
https://farm4.staticflickr.com/3928/15416585502_92d5e608c7_b_d.jpg
https://farm8.staticflickr.com/7046/6873109431_d3b5199f7d_b_d.jpg
https://farm4.staticflickr.com/3007/2835755867_c530b0e0c6_o_d.jpg
https://farm3.staticflickr.com/2788/4202444169_2079db9580_o_d.jpg
https://farm1.staticflickr.com/55/129619657_907b480c7c_b_d.jpg
https://farm5.staticflickr.com/4046/4368269562_b3e05e3f06_b_d.jpg
https://farm8.staticflickr.com/7344/12137775834_d0cecc5004_k_d.jpg
https://farm5.staticflickr.com/4073/4895191036_1cb9b58d75_b_d.jpg
https://farm4.staticflickr.com/3144/3025249284_b77dec2d29_o_d.jpg
https://www.flickr.com/photos/avardwoolaver/7137096221