Slide 1

Slide 1 text

MAKE YOUR CHOICE
CONSISTENCY, AVAILABILITY, PARTITION
Andrea Giuliano @bit_shark

Slide 2

Slide 2 text

DISTRIBUTED SYSTEMS

Slide 3

Slide 3 text

WHAT A DISTRIBUTED SYSTEM IS
“A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages”

Slide 4

Slide 4 text

DISTRIBUTED SYSTEMS EXAMPLES

Slide 5

Slide 5 text

DISTRIBUTED SYSTEMS REPLICATION

Slide 6

Slide 6 text

REPLICATED SERVICE PROPERTIES
‣ CONSISTENCY
‣ AVAILABILITY

Slide 7

Slide 7 text

CONSISTENCY
The result of operations will be predictable

Slide 8

Slide 8 text

CONSISTENCY
Strong consistency
‣ all replicas return the same value for the same object

Slide 9

Slide 9 text

CONSISTENCY
Strong consistency
‣ all replicas return the same value for the same object
Weak consistency
‣ different replicas can return different values for the same object

Slide 10

Slide 10 text

STRONG VS WEAK CONSISTENCY

Slide 11

Slide 11 text

STRONG VS WEAK CONSISTENCY
Strong consistency
‣ Atomic, Consistent, Isolated, Durable (ACID) database
Weak consistency
‣ Basically Available, Soft-state, Eventually consistent (BASE) database

Slide 12

Slide 12 text

EXAMPLE (CONSISTENCY)
put(price, 10)

Slide 13

Slide 13 text

EXAMPLE (CONSISTENCY)
get(price)
price = 10

Slide 14

Slide 14 text

AVAILABILITY

Slide 15

Slide 15 text

EXAMPLE (AVAILABILITY)

Slide 16

Slide 16 text

COMMUNICATION

Slide 17

Slide 17 text

PARTITION TOLERANCE
The system continues to operate even in the presence of partitions

Slide 18

Slide 18 text

PARTITION TOLERANCE
Network failure
‣ nodes are split into groups on each side of a faulty network entity (switch, backbone)
Process failure
‣ the system is split into two groups: correct nodes and crashed nodes

Slide 19

Slide 19 text

CAP THEOREM
“Of three properties of shared-data systems (data consistency, system availability and tolerance to network partitions) only two can be achieved at any given moment in time.”

Slide 20

Slide 20 text

THE PROOF (CAP THEOREM)
[Diagram: two replicas separated into partition 1 and partition 2, both starting with price = 0]
t1: put(price, 10) reaches one side of the partition
t2: get(price) reaches the other side
‣ price = 0: not consistent
‣ no response: not available
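The argument on this slide can be replayed as a toy simulation. This is only a sketch under stated assumptions: the class names, the "AP"/"CP" mode switch, and the choice that writes land on side 1 while reads land on side 2 are illustrative and do not come from the slide.

```python
class Side:
    """One copy of the data; every copy starts with price = 0."""

    def __init__(self):
        self.store = {"price": 0}


class ReplicatedService:
    """Two copies of the data that replicate writes unless partitioned."""

    def __init__(self, mode):
        if mode not in ("AP", "CP"):
            raise ValueError("mode must be 'AP' or 'CP'")
        self.mode = mode
        self.side_1 = Side()  # assumption: writes arrive here
        self.side_2 = Side()  # assumption: reads arrive here
        self.partitioned = False

    def put(self, key, value):
        self.side_1.store[key] = value
        if not self.partitioned:
            # Replication only succeeds while the network is healthy.
            self.side_2.store[key] = value

    def get(self, key):
        if self.partitioned and self.mode == "CP":
            # Refuse to answer rather than risk a stale value.
            return None
        # Answer from the local copy, which may be stale.
        return self.side_2.store[key]


def run(mode):
    service = ReplicatedService(mode)
    service.partitioned = True
    service.put("price", 10)      # t1
    return service.get("price")   # t2
```

With a partition in place, run("AP") answers with the stale value 0 (available, not consistent), while run("CP") gives no answer (consistent, not available).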

Slide 21

Slide 21 text

CAP THEOREM IN PRACTICE
CONSISTENCY AVAILABILITY PARTITION TOLERANCE
➡ distributed databases
➡ distributed locking
➡ majority protocol
➡ active/passive replication
➡ quorum-based systems
BigTable

Slide 22

Slide 22 text

CAP THEOREM
CONSISTENCY AVAILABILITY PARTITION TOLERANCE
➡ web caches
➡ stateless systems
➡ DNS
DynamoDB

Slide 23

Slide 23 text

CAP THEOREM
CONSISTENCY AVAILABILITY PARTITION TOLERANCE
➡ single-site databases
➡ cluster databases
➡ LDAP

Slide 24

Slide 24 text

DYNAMO

Slide 25

Slide 25 text

REQUIREMENTS (DYNAMO)
“customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.”

Slide 26

Slide 26 text

REQUIREMENTS (DYNAMO)
“customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.”
➡ reliable
➡ highly scalable
➡ always available

Slide 27

Slide 27 text

SIMPLE INTERFACE (DYNAMO)
get(key)
‣ locates the object associated with the key and returns a single object, or a list of objects with conflicting versions, along with a context.
put(key, context, object)
‣ determines where the replicas of the object should be placed based on the associated key. The context includes information such as the version of the object.
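To make the role of the context concrete, here is a minimal single-process sketch of the two calls. Everything in it is an assumption made for illustration: the ToyStore class, the integer version numbers, and the shape of the context dictionary are hypothetical. Dynamo itself keeps vector clocks in the context and spreads replicas over many nodes.

```python
class ToyStore:
    """In-memory stand-in for the get/put interface (hypothetical)."""

    def __init__(self):
        self._data = {}  # key -> list of (version, object)

    def get(self, key):
        """Return (objects, context); several objects means a conflict."""
        versions = self._data.get(key, [])
        objects = [obj for _, obj in versions]
        context = {"versions": [version for version, _ in versions]}
        return objects, context

    def put(self, key, context, obj):
        """Write obj; versions the writer never saw survive as siblings."""
        seen = set(context["versions"]) if context else set()
        current = self._data.get(key, [])
        survivors = [(v, o) for v, o in current if v not in seen]
        new_version = max((v for v, _ in current), default=0) + 1
        self._data[key] = survivors + [(new_version, obj)]


store = ToyStore()
store.put("cart", None, ["book"])
_, stale_context = store.get("cart")

# Two clients write using the same stale context: a conflict appears.
store.put("cart", stale_context, ["book", "pen"])
store.put("cart", stale_context, ["book", "lamp"])
conflicting, context = store.get("cart")

# A client that read both versions writes back a reconciled object.
store.put("cart", context, ["book", "pen", "lamp"])
resolved, _ = store.get("cart")
```

The second write with the stale context does not overwrite the first one; both versions come back from get() until a client that has seen both writes a merged object.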

Slide 28

Slide 28 text

REPLICATION: THE CHOICE (DYNAMO)
Synchronous replica coordination
‣ strong consistency
‣ availability tradeoff
Optimistic replication technique
‣ high availability
‣ conflicts are possible

Slide 29

Slide 29 text

CONFLICTS: WHEN (DYNAMO)
At write time
‣ writes may be rejected
At read time
‣ “always writable” datastore

Slide 30

Slide 30 text

CONFLICTS: WHO (DYNAMO)
The data store
‣ e.g. “last write wins” policy
The application
‣ resolution as implementation detail
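The two options can be contrasted with a short sketch. The function names and the (timestamp, cart) tuples are assumptions made for illustration; the merge-by-union rule is one possible application-level policy for a shopping cart, not the only one.

```python
def last_write_wins(versions):
    """Data-store policy: keep only the value with the newest timestamp."""
    return max(versions, key=lambda version: version[0])[1]


def merge_carts(versions):
    """Application policy (hypothetical): union the items of every version."""
    merged = set()
    for _, cart in versions:
        merged |= set(cart)
    return merged


# Two conflicting versions of the same cart, as (timestamp, items).
conflicting = [(1, {"book"}), (2, {"pen"})]

kept_by_store = last_write_wins(conflicting)
kept_by_application = merge_carts(conflicting)
```

Under "last write wins" the book added at timestamp 1 is lost, while the application-level merge keeps both items because it knows what a cart means.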

Slide 31

Slide 31 text

A RING TO RULE THEM ALL (DYNAMO)

Slide 32

Slide 32 text

PARTITIONING: THE RING (DYNAMO)
[Diagram: nodes A, B, C, D, E, F, G placed on a ring; DATA is hashed to a position on the ring]

Slide 33

Slide 33 text

REPLICATION (DYNAMO)
[Diagram: nodes A, B, C, D, E, F, G placed on a ring; DATA is hashed to a position on the ring]
N = 3
D will store keys in the ranges (A, B], (B, C], (C, D]
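The ring and the N = 3 rule can be sketched as follows. The node positions, the 32-bit ring size, and the Ring class are assumptions chosen to keep the example small; only the idea (hash the key, walk clockwise, take the next N nodes) comes from the slide.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # assumption: positions live in [0, 2**32)

# Assumption: seven evenly spaced nodes, in the order drawn on the slide.
NODE_POSITIONS = {
    name: (index + 1) * RING_SIZE // 8 for index, name in enumerate("ABCDEFG")
}


def key_position(key):
    """Hash a key to a position on the ring (first 4 bytes of its MD5)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Ring:
    def __init__(self, node_positions, n=3):
        self.n = n
        self.ring = sorted((pos, node) for node, pos in node_positions.items())
        self.positions = [pos for pos, _ in self.ring]

    def replicas_at(self, position):
        """First node at or after `position`, plus its successors (N total)."""
        start = bisect.bisect_left(self.positions, position) % len(self.ring)
        count = min(self.n, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def replicas_for(self, key):
        return self.replicas_at(key_position(key))


ring = Ring(NODE_POSITIONS, n=3)
just_after_a = NODE_POSITIONS["A"] + 1
```

A key that hashes just after A belongs to the range (A, B], so ring.replicas_at(just_after_a) lists B, C and D. That is why D ends up holding the three ranges named on the slide.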

Slide 34

Slide 34 text

DATA VERSIONING (DYNAMO)
put()
‣ may return before the update has been propagated to all replicas.
get()
‣ a subsequent get() may return an object that does not have the latest update

Slide 35

Slide 35 text

RECONCILIATION (DYNAMO)

Slide 36

Slide 36 text

RECONCILIATION (DYNAMO)
Syntactic reconciliation
‣ new version subsumes the previous
Semantic reconciliation
‣ conflicting versions of the same object

Slide 37

Slide 37 text

VECTOR CLOCK (DYNAMO)

Slide 38

Slide 38 text

VECTOR CLOCK (DYNAMO)
Definition
‣ list of (node, counter) pairs

Slide 39

Slide 39 text

VECTOR CLOCK (DYNAMO)
Definition
‣ list of (node, counter) pairs
D1 [Sx,1]: write handled by Sx

Slide 40

Slide 40 text

VECTOR CLOCK (DYNAMO)
Definition
‣ list of (node, counter) pairs
D1 [Sx,1]: write handled by Sx
D2 [Sx,2]: write handled by Sx

Slide 41

Slide 41 text

VECTOR CLOCK (DYNAMO)
Definition
‣ list of (node, counter) pairs
D1 [Sx,1]: write handled by Sx
D2 [Sx,2]: write handled by Sx
D3 [Sx,2], [Sy,1]: write handled by Sy

Slide 42

Slide 42 text

VECTOR CLOCK (DYNAMO)
Definition
‣ list of (node, counter) pairs
D1 [Sx,1]: write handled by Sx
D2 [Sx,2]: write handled by Sx
D3 [Sx,2], [Sy,1]: write handled by Sy
D4 [Sx,2], [Sz,1]: write handled by Sz

Slide 43

Slide 43 text

VECTOR CLOCK (DYNAMO)
Definition
‣ list of (node, counter) pairs
D1 [Sx,1]: write handled by Sx
D2 [Sx,2]: write handled by Sx
D3 [Sx,2], [Sy,1]: write handled by Sy
D4 [Sx,2], [Sz,1]: write handled by Sz
D5 [Sx,3], [Sy,1], [Sz,1]: reconciled and written by Sx
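The D1 to D5 sequence can be reproduced with a few helper functions. This is a minimal sketch: clocks are plain dictionaries mapping node name to counter, and the function names are illustrative rather than taken from Dynamo.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if version `a` has seen every update recorded in `b`."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def concurrent(a, b):
    """True if neither version subsumes the other (a real conflict)."""
    return not descends(a, b) and not descends(b, a)


def merge(a, b):
    """Element-wise maximum: the clock of a reconciled version."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


d1 = increment({}, "Sx")               # write handled by Sx
d2 = increment(d1, "Sx")               # write handled by Sx
d3 = increment(d2, "Sy")               # write handled by Sy
d4 = increment(d2, "Sz")               # write handled by Sz
d5 = increment(merge(d3, d4), "Sx")    # reconciled and written by Sx
```

D2 descends from D1, so D1 can be dropped without asking anyone (syntactic reconciliation). D3 and D4 are concurrent, so both must be kept until someone merges them (semantic reconciliation), which produces D5.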

Slide 44

Slide 44 text

PUT() AND GET() (DYNAMO)
R
‣ minimum number of nodes that must participate in a successful read operation
W
‣ minimum number of nodes that must participate in a successful write operation

Slide 45

Slide 45 text

PUT() AND GET() (DYNAMO)
put()
‣ the coordinator generates the vector clock for the new version and writes the new version locally
‣ the new version is sent to N nodes
‣ the write is successful if W-1 nodes respond
get()
‣ the coordinator requests all existing versions of the data
‣ the coordinator waits for R responses before returning the result
‣ the coordinator returns all the versions that are causally unrelated
‣ the divergent versions are reconciled and written back
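The steps above can be sketched with in-memory replicas. To stay short, this sketch makes several simplifying assumptions that are not in the slides: versions are plain integers instead of vector clocks, the first replica in the list plays the coordinator and is assumed to be up, and the Replica class and function names are hypothetical.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}  # key -> (version, value)

    def store(self, key, version, value):
        """Apply a write if it is newer; return True as an acknowledgement."""
        if not self.up:
            return False
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)
        return True


def put(replicas, w, key, version, value):
    # The coordinator's local write is the first acknowledgement, so the
    # write succeeds once W-1 further replicas have responded.
    acks = sum(1 for replica in replicas if replica.store(key, version, value))
    return acks >= w


def get(replicas, r, key):
    reachable = [replica for replica in replicas if replica.up]
    if len(reachable) < r:
        raise RuntimeError("fewer than R replicas responded")
    answered = reachable[:r]
    found = [replica.data[key] for replica in answered if key in replica.data]
    if not found:
        return None
    newest = max(found, key=lambda item: item[0])
    for replica in answered:
        # Write the newest version back to replicas that were behind.
        replica.store(key, *newest)
    return newest[1]


a, b, c = Replica("A"), Replica("B"), Replica("C")  # N = 3
nodes = [a, b, c]
c.up = False
write_ok = put(nodes, 2, "price", 1, 10)   # W = 2: A and B acknowledge
c.up, b.up = True, False
value = get(nodes, 2, "price")             # R = 2: A and C answer
```

Here C missed the write while it was down, yet the read still returns 10 because A answered too, and C is brought up to date as a side effect of the read.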

Slide 46

Slide 46 text

SLOPPY QUORUM (DYNAMO)
[Diagram: nodes A, B, C, D, E, F, G placed on a ring]
N = 3
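The slide only shows the ring, so the following sketch leans on the description in the Dynamo paper listed in the references: operations go to the first N healthy nodes, and a stand-in node keeps a hint naming the node it is covering for. The function name and the (node, hinted_for) tuples are assumptions made for illustration.

```python
def sloppy_preference_list(ring_order, start, n, down):
    """Pick N healthy nodes walking the ring clockwise from index `start`.

    Returns a list of (node, hinted_for) pairs; hinted_for is None for a
    regular replica, or the name of the unreachable node being covered.
    """
    size = len(ring_order)
    intended = [ring_order[(start + i) % size] for i in range(n)]
    missing = [node for node in intended if node in down]
    chosen = [(node, None) for node in intended if node not in down]

    # Keep walking past the intended replicas to find healthy stand-ins.
    step = n
    while missing and step < size:
        candidate = ring_order[(start + step) % size]
        if candidate not in down:
            chosen.append((candidate, missing.pop(0)))
        step += 1
    return chosen


ring_order = list("ABCDEFG")
healthy = sloppy_preference_list(ring_order, start=0, n=3, down=set())
a_is_down = sloppy_preference_list(ring_order, start=0, n=3, down={"A"})
```

With every node up, the replicas are A, B and C. With A down, D takes the third copy and remembers that it belongs to A, so it can hand the data back once A recovers.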

Slide 47

Slide 47 text

WHY IS IT AP? (DYNAMO)
‣ requests are served even if some replicas are not available
‣ if a node is down, the write is stored on another node
‣ consistency conflicts are resolved at read time or in the background
‣ eventually, all the replicas will converge
‣ concurrent read/write operations can make distinct clients see distinct versions of the same key

Slide 48

Slide 48 text

BIGTABLE

Slide 49

Slide 49 text

REQUIREMENTS (GOOGLE BIGTABLE)
‣ scale to petabytes of data
‣ thousands of machines
‣ high availability
‣ high performance

Slide 50

Slide 50 text

DATA MODEL (GOOGLE BIGTABLE)
‣ sparse, distributed, persistent multi-dimensional sorted map
(row: string, column: string, time: int64) → string
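The map signature on this slide can be mimicked with nested dictionaries. This is a single-process sketch, so "distributed" and "persistent" are out of scope, and the SparseTable class with its method names is hypothetical.

```python
class SparseTable:
    """Toy (row, column, timestamp) -> string map; absent cells cost nothing."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Newest value, or the newest value written at or before `timestamp`."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        if timestamp is None:
            return versions[max(versions)]
        eligible = [t for t in versions if t <= timestamp]
        return versions[max(eligible)] if eligible else None

    def rows(self):
        """Row keys in lexicographic order, as in a sorted map."""
        return sorted({row for row, _ in self._cells})


table = SparseTable()
table.put("com.example", "contents:", 1, "<html>v1")
table.put("com.example", "contents:", 2, "<html>v2")
table.put("com.cnn.www", "language:", 1, "en")
```

Reading without a timestamp returns the newest version of a cell, while passing a timestamp reads the cell as it was at that time.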

Slide 51

Slide 51 text

ROWS (GOOGLE BIGTABLE)
‣ row keys are arbitrary strings
‣ read/write operations on a row are atomic
‣ data is maintained in lexicographic order by row key
‣ each row range is called a tablet
maps.google.com → com.google.maps
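Reversing the hostname is what makes lexicographic ordering useful: pages of the same domain end up next to each other. A small sketch, where the row_key helper is a hypothetical name:

```python
def row_key(hostname, path=""):
    """Reverse the dot-separated hostname components and append the path."""
    return ".".join(reversed(hostname.split("."))) + path


keys = sorted(
    [
        row_key("maps.google.com"),
        row_key("www.cnn.com", "/foo"),
        row_key("example.com"),
        row_key("www.cnn.com"),
    ]
)
```

After sorting, both cnn.com rows are adjacent, so a scan over one contiguous row range covers the whole domain.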

Slide 52

Slide 52 text

COLUMNS (GOOGLE BIGTABLE)
‣ column keys are grouped into sets: column families
‣ a column family must be created before data can be stored under any column key in that family
‣ a column key is named family:qualifier
‣ access control and both disk and memory accounting are performed at the column-family level
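The "family must exist first" rule and the family:qualifier naming can be illustrated with a toy table. The FamilyTable class and its error message are assumptions for illustration, not Bigtable's client API.

```python
class FamilyTable:
    """Toy table that only accepts column keys from declared families."""

    def __init__(self, families):
        self.families = set(families)
        self.cells = {}  # (row, family, qualifier) -> value

    def put(self, row, column_key, value):
        # A column key has the form family:qualifier (qualifier may be empty).
        family, _, qualifier = column_key.partition(":")
        if family not in self.families:
            raise KeyError(f"column family {family!r} has not been created")
        self.cells[(row, family, qualifier)] = value


webtable = FamilyTable(["anchor", "contents"])
webtable.put("com.cnn.www", "anchor:cnnsi.com", "CNN")
webtable.put("com.cnn.www", "contents:", "<html>…")
```

New qualifiers such as cnnsi.com can appear at any time, but writing to an undeclared family (for example language:) is rejected.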

Slide 53

Slide 53 text

TIMESTAMPS (GOOGLE BIGTABLE)
[Diagram: row com.example, column contents:, holding two versions of <html>… at timestamps t1 and t2]

Slide 54

Slide 54 text

DATA MODEL: EXAMPLE (GOOGLE BIGTABLE)
row key         | language: | contents:               | anchor:cnnsi.com | anchor:mylook.ca
com.example     | en        | <!DOCTYPE html PUBLIC … |                  |
com.cnn.www     | en        | <!DOCTYPE html PUBLIC … | "cnn"            | "cnn.com"
com.cnn.www/foo | en        | <!DOCTYPE html PUBLIC … |                  |
Diagram labels: column families, row keys, sorted rows

Slide 55

Slide 55 text

DIFFERENCES WITH RDBMS (GOOGLE BIGTABLE)
RDBMS            | BIGTABLE
query language   | specific API
joins            | no referential integrity
explicit sorting | sorting defined a priori in the column family

Slide 56

Slide 56 text

ARCHITECTURE (GOOGLE BIGTABLE)
Google File System (GFS)
‣ stores data files and logs
Google SSTable
‣ stores BigTable data
Chubby
‣ highly available distributed lock service

Slide 57

Slide 57 text

COMPONENTS (GOOGLE BIGTABLE)
library
‣ linked into every client
one master server
‣ assigns tablets to tablet servers
‣ detects the addition and expiration of tablet servers
‣ balances tablet-server load
‣ garbage-collects files in GFS
‣ handles schema changes
many tablet servers
‣ each manages 10 to 100 tablets
‣ each handles read and write requests to its tablets
‣ each splits tablets that have grown too large

Slide 58

Slide 58 text

COMPONENTS (GOOGLE BIGTABLE)
[Diagram: a Client, the Master server and three Tablet servers, with connections labelled "Metadata" and "read/write"]

Slide 59

Slide 59 text

STARTUP AND GROWTH (GOOGLE BIGTABLE)
[Diagram: Chubby file → Root tablet (1st Metadata tablet) → other metadata tablets → UserTable1 … UserTableN]

Slide 60

Slide 60 text

TABLET ASSIGNMENT (GOOGLE BIGTABLE)
tablet server
‣ when started, creates and acquires a lock in Chubby
master
‣ grabs a unique master lock in Chubby
‣ scans Chubby to find live tablet servers
‣ asks each tablet server to discover its tablets
‣ scans the Metadata table to learn the full set of tablets
‣ builds a set of unassigned tablets, for future tablet assignment
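The last three master steps boil down to a set difference: everything listed in the Metadata table minus everything a live tablet server already reports. A minimal sketch with made-up server and tablet names; Chubby and the Metadata scan are replaced by plain Python values.

```python
def find_unassigned(metadata_tablets, server_reports):
    """Tablets known to the Metadata table that no live server is serving.

    server_reports maps a live tablet server to the tablets it reports.
    """
    assigned = set()
    for tablets in server_reports.values():
        assigned.update(tablets)
    return set(metadata_tablets) - assigned


# Hypothetical state found by the master at startup.
metadata_tablets = ["users[a-m]", "users[n-z]", "orders[0-4]", "orders[5-9]"]
server_reports = {
    "tablet-server-1": ["users[a-m]"],
    "tablet-server-2": ["orders[0-4]", "orders[5-9]"],
}

unassigned = find_unassigned(metadata_tablets, server_reports)
```

In this example one tablet has no server, so it lands in the unassigned set and is eligible for assignment to a server with spare capacity.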

Slide 61

Slide 61 text

WHY IS IT CP? (GOOGLE BIGTABLE)
‣ if the master dies, its services stop functioning
‣ if a tablet server dies, its tablets become unavailable
‣ if Chubby dies, BigTable cannot execute synchronization operations or serve client requests
‣ Google File System is a CP system

Slide 62

Slide 62 text

$ WHOAMI
Andrea Giuliano
@bit_shark
www.andreagiuliano.it

Slide 63

Slide 63 text

joind.in/13224 Please rate the talk!

Slide 64

Slide 64 text

REFERENCES
G. DeCandia et al. “Dynamo: Amazon’s Highly Available Key-value Store”
F. Chang et al. “Bigtable: A Distributed Storage System for Structured Data”
Assets:
https://farm1.staticflickr.com/41/86744006_0026864df8_b_d.jpg
https://farm9.staticflickr.com/8305/7883634326_4e51a1a320_b_d.jpg
https://farm5.staticflickr.com/4145/4958650244_65b2eddffc_b_d.jpg
https://farm4.staticflickr.com/3677/10023456065_e54212c52e_b_d.jpg
https://farm4.staticflickr.com/3076/2871264822_261dafa44c_o_d.jpg
https://farm1.staticflickr.com/7/6111406_30005bdae5_b_d.jpg
https://farm4.staticflickr.com/3928/15416585502_92d5e608c7_b_d.jpg
https://farm8.staticflickr.com/7046/6873109431_d3b5199f7d_b_d.jpg
https://farm4.staticflickr.com/3007/2835755867_c530b0e0c6_o_d.jpg
https://farm3.staticflickr.com/2788/4202444169_2079db9580_o_d.jpg
https://farm1.staticflickr.com/55/129619657_907b480c7c_b_d.jpg
https://farm5.staticflickr.com/4046/4368269562_b3e05e3f06_b_d.jpg
https://farm8.staticflickr.com/7344/12137775834_d0cecc5004_k_d.jpg
https://farm5.staticflickr.com/4073/4895191036_1cb9b58d75_b_d.jpg
https://farm4.staticflickr.com/3144/3025249284_b77dec2d29_o_d.jpg
https://www.flickr.com/photos/avardwoolaver/7137096221