
Consistency, Availability, Partition: Make Your Choice

Shared-data systems strive to satisfy data consistency, system availability, and tolerance to network partitions. In a distributed system it is impossible to provide all three of these guarantees simultaneously at any given moment in time. The purpose of this talk is to show the mechanisms used by data storage systems such as Dynamo and BigTable to satisfy two of the guarantees at a time.

Andrea Giuliano

February 18, 2015

Transcript

  1. MAKE YOUR CHOICE
    CONSISTENCY, AVAILABILITY, PARTITION
    Andrea Giuliano
    @bit_shark

  2. DISTRIBUTED SYSTEMS

  3. WHAT A DISTRIBUTED SYSTEM IS
    “A distributed system is a software system in which components located on
    networked computers communicate and coordinate their actions by passing messages”

  4. DISTRIBUTED SYSTEMS
    EXAMPLES

  5. DISTRIBUTED SYSTEMS
    REPLICATION

  6. REPLICATED SERVICE
    PROPERTIES
    CONSISTENCY
    AVAILABILITY

  7. CONSISTENCY
    The result of operations will be predictable

  8. CONSISTENCY
    Strong consistency
    all replicas return the same value for the same object

  9. CONSISTENCY
    Strong consistency
    all replicas return the same value for the same object
    Weak consistency
    different replicas can return different values for the same object

  10. STRONG VS WEAK
    CONSISTENCY

  11. STRONG VS WEAK
    CONSISTENCY
    Strong consistency
    an Atomic, Consistent, Isolated, Durable (ACID) database
    Weak consistency
    a Basically Available, Soft-state, Eventually consistent (BASE) database

  12. EXAMPLE
    CONSISTENCY
    put(price, 10)

  13. EXAMPLE
    CONSISTENCY
    get(price)
    price = 10

  14. AVAILABILITY

  15. EXAMPLE
    AVAILABILITY

  16. COMMUNICATION

  17. PARTITION TOLERANCE
    continue to operate even in the presence of partitions

  18. PARTITION TOLERANCE
    Network failure
    groups form on each side of a faulty network element (switch, backbone)
    Process failure
    the system splits into two groups: correct nodes and crashed nodes

  19. CAP THEOREM
    “Of three properties of shared-data systems (data consistency, system availability
    and tolerance to network partitions) only two can be achieved at any given
    moment in time.”

  20. THE PROOF
    CAP THEOREM
    t1: put(price, 10) is issued in partition 1; the update cannot cross to partition 2
    t2: get(price) in partition 2 either returns the stale value price = 0 (not consistent)
    or gets no response (not available)
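
A minimal Python sketch of this scenario (the Replica class and the partitioned flag are illustrative assumptions, not from the talk): once the partition appears, a read must either return stale data or no data at all.

# Two replicas hold "price"; a partition stops replication between them.
class Replica:
    def __init__(self):
        self.data = {"price": 0}

r1, r2 = Replica(), Replica()
partitioned = True              # the link between r1 and r2 is down

def put(replica, key, value):
    replica.data[key] = value
    if not partitioned:         # normally the update would reach the other replica
        (r2 if replica is r1 else r1).data[key] = value

def get(replica, key, favour_availability=True):
    if partitioned and not favour_availability:
        raise TimeoutError("no response")       # CP choice: not available
    return replica.data[key]                    # AP choice: possibly stale

put(r1, "price", 10)        # t1: the write lands only in partition 1
print(get(r2, "price"))     # t2: returns 0 -> not consistent
# get(r2, "price", favour_availability=False) would raise -> not available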

  21. CAP THEOREM
    IN PRACTICE
    consistency + partition tolerance:
    ➡ distributed databases
    ➡ distributed locking
    ➡ majority protocol
    ➡ active/passive replication
    ➡ quorum-based systems
    e.g. BigTable

  22. CAP THEOREM
    availability + partition tolerance:
    ➡ web caches
    ➡ stateless systems
    ➡ DNS
    e.g. DynamoDB

  23. CAP THEOREM
    consistency + availability:
    ➡ single-site databases
    ➡ cluster databases
    ➡ LDAP

  24. REQUIREMENTS
    DYNAMO
    “customers should be able to view and add items to their shopping cart even if
    disks are failing, network routes are flapping, or data centers are being
    destroyed by tornados.”

  25. REQUIREMENTS
    DYNAMO
    “customers should be able to view and add items to their shopping cart even if
    disks are failing, network routes are flapping, or data centers are being
    destroyed by tornados.”
    ➡ reliable
    ➡ highly scalable
    ➡ always available

  26. SIMPLE INTERFACE
    DYNAMO
    get(key)
    locates the object replicas associated with the key and returns a single object,
    or a list of objects with conflicting versions, along with a context.
    put(key, context, object)
    determines where the replicas of the object should be placed based on the
    associated key. The context includes information such as the version of the object.
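
A minimal, in-memory Python sketch of this interface; the ToyDynamo class, its version bookkeeping and return shapes are illustrative assumptions, not Dynamo's actual API.

# Illustrative toy version of Dynamo's get/put interface.
class ToyDynamo:
    def __init__(self):
        self._store = {}            # key -> list of (version, object) pairs

    def get(self, key):
        """Return (objects, context); conflicting versions come back as a list."""
        versions = self._store.get(key, [])
        objects = [obj for _, obj in versions]
        context = {"versions": [v for v, _ in versions]}
        return objects, context

    def put(self, key, context, obj):
        """Store obj under key; the context carries version info from a prior get()."""
        version = max(context.get("versions", [0]), default=0) + 1
        # a real Dynamo would route the write to N replicas chosen on the ring
        self._store[key] = [(version, obj)]

db = ToyDynamo()
db.put("cart:alice", {"versions": []}, ["book-42"])
cart, ctx = db.get("cart:alice")
print(cart, ctx)    # [['book-42']] {'versions': [1]}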

  27. REPLICATION: THE CHOICE
    DYNAMO
    Synchronous replica coordination
    ‣ strong consistency
    ‣ availability tradeoff
    Optimistic replication technique
    ‣ high availability
    ‣ probability of conflicts

  28. CONFLICTS: WHEN
    DYNAMO
    At write time
    ‣ probability of rejecting writes
    At read time
    ‣ an “always writable” datastore

  29. CONFLICTS: WHO
    DYNAMO
    The data store
    ‣ e.g. a “last write wins” policy
    The application
    ‣ resolution as an implementation detail
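
A minimal sketch, assuming timestamped versions, of the “last write wins” policy a data store can apply on its own (the application path would instead merge the conflicting values).

# Toy "last write wins" reconciliation: keep the version with the latest timestamp.
def last_write_wins(versions):
    """versions: iterable of (timestamp, value); return the most recent value."""
    return max(versions, key=lambda tv: tv[0])[1]

print(last_write_wins([(100, {"cart": ["a"]}), (105, {"cart": ["b"]})]))  # {'cart': ['b']}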

  30. A RING TO RULE THEM ALL
    DYNAMO

  31. PARTITIONING: THE RING
    DYNAMO
    [figure: nodes A–G arranged on a ring; data is mapped onto the ring by hashing its key]

  32. REPLICATION
    DYNAMO
    [figure: nodes A–G on the ring, data mapped by hashing its key]
    N = 3: node D stores the keys in the ranges (A, B], (B, C] and (C, D]
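
A minimal consistent-hashing sketch of this ring; the node names come from the slide, while the MD5 hash, ring size and example output are assumptions for illustration.

# Toy consistent hashing: each node owns the arc of the ring ending at its position;
# a key is stored on its coordinator plus the next N-1 clockwise nodes (the preference list).
import hashlib
from bisect import bisect_right

RING_SIZE = 2 ** 32
NODES = ["A", "B", "C", "D", "E", "F", "G"]

def ring_position(name):
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % RING_SIZE

positions = sorted((ring_position(n), n) for n in NODES)

def preference_list(key, n_replicas=3):
    """Return the n_replicas nodes responsible for key, walking the ring clockwise."""
    key_pos = ring_position(key)
    idx = bisect_right([p for p, _ in positions], key_pos) % len(positions)
    return [positions[(idx + i) % len(positions)][1] for i in range(n_replicas)]

print(preference_list("price"))     # e.g. ['D', 'E', 'F'], depending on the hash values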

  33. DATA VERSIONING
    DYNAMO
    put()
    may return before the update has been propagated to all replicas.
    get()
    a subsequent get() may return an object that does not have the latest update.

  34. RECONCILIATION
    DYNAMO

  35. RECONCILIATION
    DYNAMO
    Syntactic reconciliation
    ‣ the new version subsumes the previous one
    Semantic reconciliation
    ‣ conflicting versions of the same object

  36. VECTOR CLOCK
    DYNAMO

  37. VECTOR CLOCK
    DYNAMO
    Definition
    ‣ list of (node, counter) pairs

  38. VECTOR CLOCK
    DYNAMO
    Definition
    ‣ list of (node, counter) pairs
    write handled by Sx → D1 ([Sx,1])

  39. VECTOR CLOCK
    DYNAMO
    Definition
    ‣ list of (node, counter) pairs
    write handled by Sx → D1 ([Sx,1])
    write handled by Sx → D2 ([Sx,2])

  40. VECTOR CLOCK
    DYNAMO
    Definition
    ‣ list of (node, counter) pairs
    write handled by Sx → D1 ([Sx,1])
    write handled by Sx → D2 ([Sx,2])
    write handled by Sy → D3 ([Sx,2], [Sy,1])

  41. VECTOR CLOCK
    DYNAMO
    Definition
    ‣ list of (node, counter) pairs
    write handled by Sx → D1 ([Sx,1])
    write handled by Sx → D2 ([Sx,2])
    write handled by Sy → D3 ([Sx,2], [Sy,1])
    write handled by Sz → D4 ([Sx,2], [Sz,1])

  42. VECTOR CLOCK
    DYNAMO
    Definition
    ‣ list of (node, counter) pairs
    write handled by Sx → D1 ([Sx,1])
    write handled by Sx → D2 ([Sx,2])
    write handled by Sy → D3 ([Sx,2], [Sy,1])
    write handled by Sz → D4 ([Sx,2], [Sz,1])
    reconciled and written by Sx → D5 ([Sx,3], [Sy,1], [Sz,1])
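
A minimal sketch, representing the clocks above as Python dicts (an assumption), of how two versions are compared: one subsumes the other when its clock dominates on every node, otherwise the versions conflict and need semantic reconciliation.

# Vector clocks as {node: counter} dicts; a clock "descends" from another when it is
# greater-or-equal on every node. Neither descending from the other means a conflict.
def descends(a, b):
    """True if the version with clock a causally follows (or equals) the one with clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

d2 = {"Sx": 2}
d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}
d5 = {"Sx": 3, "Sy": 1, "Sz": 1}

print(descends(d3, d2))                       # True: D3 subsumes D2 (syntactic reconciliation)
print(descends(d3, d4), descends(d4, d3))     # False, False: D3 and D4 conflict (semantic)
print(descends(d5, d3) and descends(d5, d4))  # True: D5 reconciles both branches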

  43. PUT() AND GET()
    DYNAMO
    R
    ‣ minimum number of nodes that must participate in a successful read operation
    W
    ‣ minimum number of nodes that must participate in a successful write operation

  44. PUT() AND GET()
    DYNAMO
    put()
    ‣ the coordinator generates the vector clock for the new version and
    writes the new version locally
    ‣ the new version is sent to the N nodes
    ‣ the write is successful if at least W-1 nodes respond
    get()
    ‣ the coordinator requests all existing versions of the data
    ‣ the coordinator waits for R responses before returning the result
    ‣ the coordinator returns all the versions that are causally unrelated
    ‣ the divergent versions are reconciled and written back
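
A minimal sketch of the quorum arithmetic behind R and W (the values below are illustrative): choosing R + W > N makes every read quorum overlap every write quorum, so a read contacts at least one replica that saw the latest successful write.

# Quorum arithmetic for N replicas: a write succeeds after W acks, a read after R replies.
N, R, W = 3, 2, 2

def quorums_overlap(n, r, w):
    """True if any read quorum of size r must intersect any write quorum of size w."""
    return r + w > n

print(quorums_overlap(N, R, W))     # True: 2 + 2 > 3, reads observe the latest write
print(quorums_overlap(3, 1, 1))     # False: such settings trade consistency for latency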

  45. SLOPPY QUORUM
    DYNAMO
    [figure: the ring of nodes A–G, with N = 3]

  46. WHY IS IT AP?
    DYNAMO
    ‣ requests are served even if some replicas are not available
    ‣ if a node is down, the write is stored on another node
    ‣ consistency conflicts are resolved at read time or in the background
    ‣ eventually, all the replicas converge
    ‣ concurrent read/write operations can make distinct clients see distinct
    versions of the same key

  47. BIGTABLE

  48. REQUIREMENTS
    GOOGLE BIGTABLE
    ‣ scale to petabytes of data
    ‣ thousands of machines
    ‣ high availability
    ‣ high performance

  49. DATA MODEL
    GOOGLE BIGTABLE
    ‣ a sparse, distributed, persistent, multi-dimensional sorted map
    (row: string, column: string, time: int64) → string
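
A minimal sketch of that map, using plain nested Python dicts as an assumption (not BigTable's storage format): each value is addressed by a row key, a column key and a timestamp.

# BigTable's data model as a toy nested map:
#   table[row_key][column_key][timestamp] -> string
table = {
    "com.example": {
        "contents:": {
            2: "<html>... (newer)",
            1: "<html>... (older)",
        },
        "anchor:cnnsi.com": {1: "cnn"},
    }
}

def read(table, row, column, timestamp=None):
    """Return the value at the given timestamp, or the most recent one."""
    versions = table[row][column]
    ts = timestamp if timestamp is not None else max(versions)
    return versions[ts]

print(read(table, "com.example", "contents:"))   # newest version of the page contents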

  50. ROWS
    GOOGLE BIGTABLE
    ‣ arbitrary strings
    ‣ read/write operations are atomic
    ‣ data is maintained in lexicographic order by row key
    ‣ each row range is called a tablet
    e.g. maps.google.com is keyed as the reversed domain com.google.maps

  51. COLUMNS
    GOOGLE BIGTABLE
    ‣ column keys are grouped into sets called column families
    ‣ a column family must be created before data can be stored under any
    column key in that family
    ‣ a column key is named as family:qualifier
    ‣ access control and both disk and memory accounting are performed at the
    column-family level

  52. TIMESTAMPS
    GOOGLE BIGTABLE
    [figure: the contents: column of row com.example holds two <html>… versions,
    stored at timestamps t1 and t2]

  53. DATA MODEL: EXAMPLE
    GOOGLE BIGTABLE
    row keys (rows kept sorted): com.example, com.cnn.www, com.cnn.www/foo
    column families: language:, contents:, anchor:cnnsi.com, anchor:my.look.ca
    each row has language: “en” and a contents: value “<!DOCTYPE html PUBLIC …”;
    com.cnn.www also has anchor:cnnsi.com = “cnn” and anchor:my.look.ca = “cnn.com”

  54. DIFFERENCES WITH RDBMS
    GOOGLE BIGTABLE
    RDBMS: query language        BIGTABLE: specific API
    RDBMS: joins                 BIGTABLE: no referential integrity
    RDBMS: explicit sorting      BIGTABLE: sorting defined a priori in the column family

  55. ARCHITECTURE
    GOOGLE BIGTABLE
    Google File System (GFS)
    ‣ stores data files and logs
    Google SSTable
    ‣ stores BigTable data
    Chubby
    ‣ highly available distributed lock service

  56. COMPONENTS
    GOOGLE BIGTABLE
    library
    ‣ linked into every client
    one master server
    ‣ assigning tablets to tablet servers
    ‣ detecting the addition and expiration of tablet servers
    ‣ balancing tablet-server load
    ‣ garbage collection of files in GFS
    ‣ handling schema changes
    many tablet servers
    ‣ each manages 10 to 100 tablets
    ‣ handles read and write requests to its tablets
    ‣ splits tablets that have grown too large

  57. COMPONENTS
    GOOGLE BIGTABLE
    [figure: a client, the master server and three tablet servers; metadata goes through
    the master, reads and writes go to the tablet servers]

  58. STARTUP AND GROWTH
    GOOGLE BIGTABLE
    [figure: location hierarchy: a Chubby file points to the root tablet, the root tablet
    points to the other metadata tablets, and these point to the user tablets
    (UserTable1 … UserTableN)]
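
A minimal sketch of resolving a row key through that hierarchy; the dict-based levels, key ranges and server names are assumptions for illustration.

# Toy three-level lookup: Chubby file -> root tablet -> metadata tablet -> user tablet.
chubby_file = "root-tablet@serverA"     # Chubby stores the location of the root tablet

root_tablet = {                         # root tablet: names the metadata tablets
    "meta1": "serverB",
    "meta2": "serverC",
}

metadata_tablets = {                    # each metadata tablet maps row ranges to servers
    "meta1": {("UserTable1", "a", "m"): "serverD"},
    "meta2": {("UserTable1", "m", "z"): "serverE"},
}

def locate(table, row_key):
    """Find the tablet server holding row_key by walking the hierarchy."""
    for meta_name in root_tablet:                           # 1) root tablet (via Chubby)
        for (tbl, lo, hi), tablet_server in metadata_tablets[meta_name].items():
            if tbl == table and lo <= row_key < hi:         # 2) matching metadata entry
                return tablet_server                        # 3) server of the user tablet

print(locate("UserTable1", "com.example"))   # serverD: 'com.example' falls in the a-m range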

  59. TABLET ASSIGNMENT
    GOOGLE BIGTABLE
    tablet server
    ‣ when started, creates and acquires a lock on a file in Chubby
    master
    ‣ grabs a unique master lock in Chubby
    ‣ scans Chubby to find the live tablet servers
    ‣ asks each tablet server which tablets it already serves
    ‣ scans the Metadata table to learn the full set of tablets
    ‣ builds the set of unassigned tablets, for future tablet assignment

  60. WHY IS IT CP?
    GOOGLE BIGTABLE
    ‣ if the master dies, its services stop functioning
    ‣ if a tablet server dies, its tablets become unavailable
    ‣ if Chubby dies, BigTable can no longer execute synchronization operations
    or serve client requests
    ‣ the Google File System is itself a CP system

  61. $ whoami
    Andrea Giuliano
    @bit_shark
    www.andreagiuliano.it

  62. joind.in/13224
    Please rate the talk!

  63. REFERENCES
    G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store”
    F. Chang et al., “Bigtable: A Distributed Storage System for Structured Data”
    Assets:
    https://farm1.staticflickr.com/41/86744006_0026864df8_b_d.jpg
    https://farm9.staticflickr.com/8305/7883634326_4e51a1a320_b_d.jpg
    https://farm5.staticflickr.com/4145/4958650244_65b2eddffc_b_d.jpg
    https://farm4.staticflickr.com/3677/10023456065_e54212c52e_b_d.jpg
    https://farm4.staticflickr.com/3076/2871264822_261dafa44c_o_d.jpg
    https://farm1.staticflickr.com/7/6111406_30005bdae5_b_d.jpg
    https://farm4.staticflickr.com/3928/15416585502_92d5e608c7_b_d.jpg
    https://farm8.staticflickr.com/7046/6873109431_d3b5199f7d_b_d.jpg
    https://farm4.staticflickr.com/3007/2835755867_c530b0e0c6_o_d.jpg
    https://farm3.staticflickr.com/2788/4202444169_2079db9580_o_d.jpg
    https://farm1.staticflickr.com/55/129619657_907b480c7c_b_d.jpg
    https://farm5.staticflickr.com/4046/4368269562_b3e05e3f06_b_d.jpg
    https://farm8.staticflickr.com/7344/12137775834_d0cecc5004_k_d.jpg
    https://farm5.staticflickr.com/4073/4895191036_1cb9b58d75_b_d.jpg
    https://farm4.staticflickr.com/3144/3025249284_b77dec2d29_o_d.jpg
    https://www.flickr.com/photos/avardwoolaver/7137096221