
Consistency, Availability, Partition: Make Your Choice

Shared-data systems strive to satisfy data consistency, system availability, and tolerance to network partitions. In a distributed system it is impossible to provide all three of these guarantees simultaneously at any given moment in time. The purpose of this talk is to show the mechanisms used by data storage systems such as Dynamo and BigTable to satisfy two of the guarantees at a time.

Andrea Giuliano

February 18, 2015

Transcript

  1. MAKE YOUR CHOICE
    CONSISTENCY, AVAILABILITY, PARTITION
    Andrea Giuliano
    @bit_shark

  2. DISTRIBUTED SYSTEMS

  3. WHAT A DISTRIBUTED SYSTEM IS
    “A distributed system is a software system in which components located on
    networked computers communicate and coordinate their actions by passing messages”

  4. DISTRIBUTED SYSTEMS
    EXAMPLES

  5. DISTRIBUTED SYSTEMS
    REPLICATION

  6. REPLICATED SERVICE
    PROPERTIES
    CONSISTENCY
    AVAILABILITY

  7. CONSISTENCY
    The result of operations will be predictable

  8. CONSISTENCY
    Strong consistency
    all replicas return the same value for the same object

  9. CONSISTENCY
    Strong consistency
    all replicas return the same value for the same object
    Weak consistency
    different replicas can return different values for the same object

  10. STRONG VS WEAK
    CONSISTENCY

  11. STRONG VS WEAK
    CONSISTENCY
    Strong consistency
    an Atomic, Consistent, Isolated, Durable (ACID) database
    Weak consistency
    a Basically Available, Soft-state, Eventually consistent (BASE) database

  12. EXAMPLE
    CONSISTENCY
    put(price, 10)

  13. EXAMPLE
    CONSISTENCY
    get(price)
    price = 10

  14. AVAILABILITY

  15. EXAMPLE
    AVAILABILITY

  16. COMMUNICATION

  17. PARTITION TOLERANCE
    continue to operate even in the presence of partitions

  18. PARTITION TOLERANCE
    Network failure
    groups form on each side of a faulty network element (switch, backbone)
    Process failure
    the system splits into two groups: correct nodes and crashed nodes

  19. CAP THEOREM
    “Of three properties of shared-data systems (data consistency, system availability
    and tolerance to network partitions) only two can be achieved at any given
    moment in time.”

  20. THE PROOF
    CAP THEOREM
    t1: put(price, 10) is issued in partition 1; the update cannot cross to partition 2
    t2: get(price) in partition 2 either returns the stale value price = 0 (not consistent)
    or gets no response (not available)
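
A minimal Python sketch of this scenario (the Replica class and the partitioned flag are illustrative assumptions, not from the talk): once the partition appears, a read must either return stale data or no data at all.

# Two replicas hold "price"; a partition stops replication between them.
class Replica:
    def __init__(self):
        self.data = {"price": 0}

r1, r2 = Replica(), Replica()
partitioned = True              # the link between r1 and r2 is down

def put(replica, key, value):
    replica.data[key] = value
    if not partitioned:         # normally the update would reach the other replica
        (r2 if replica is r1 else r1).data[key] = value

def get(replica, key, favour_availability=True):
    if partitioned and not favour_availability:
        raise TimeoutError("no response")       # CP choice: not available
    return replica.data[key]                    # AP choice: possibly stale

put(r1, "price", 10)        # t1: the write lands only in partition 1
print(get(r2, "price"))     # t2: returns 0 -> not consistent
# get(r2, "price", favour_availability=False) would raise -> not available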

  21. CAP THEOREM
    IN PRACTICE
    consistency + partition tolerance:
    ➡ distributed databases
    ➡ distributed locking
    ➡ majority protocol
    ➡ active/passive replication
    ➡ quorum-based systems
    e.g. BigTable

  22. CAP THEOREM
    availability + partition tolerance:
    ➡ web caches
    ➡ stateless systems
    ➡ DNS
    e.g. DynamoDB

  23. CAP THEOREM
    consistency + availability:
    ➡ single-site databases
    ➡ cluster databases
    ➡ LDAP

  24. REQUIREMENTS
    DYNAMO
    “customers should be able to view and add items to their shopping cart even if
    disks are failing, network routes are flapping, or data centers are being
    destroyed by tornados.”

  25. REQUIREMENTS
    DYNAMO
    “customers should be able to view and add items to their shopping cart even if
    disks are failing, network routes are flapping, or data centers are being
    destroyed by tornados.”
    ➡ reliable
    ➡ highly scalable
    ➡ always available

  26. SIMPLE INTERFACE
    DYNAMO
    get(key)
    locates the object replicas associated with the key and returns a single object,
    or a list of objects with conflicting versions, along with a context.
    put(key, context, object)
    determines where the replicas of the object should be placed based on the
    associated key. The context includes information such as the version of the object.
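
A minimal, in-memory Python sketch of this interface; the ToyDynamo class, its version bookkeeping and return shapes are illustrative assumptions, not Dynamo's actual API.

# Illustrative toy version of Dynamo's get/put interface.
class ToyDynamo:
    def __init__(self):
        self._store = {}            # key -> list of (version, object) pairs

    def get(self, key):
        """Return (objects, context); conflicting versions come back as a list."""
        versions = self._store.get(key, [])
        objects = [obj for _, obj in versions]
        context = {"versions": [v for v, _ in versions]}
        return objects, context

    def put(self, key, context, obj):
        """Store obj under key; the context carries version info from a prior get()."""
        version = max(context.get("versions", [0]), default=0) + 1
        # a real Dynamo would route the write to N replicas chosen on the ring
        self._store[key] = [(version, obj)]

db = ToyDynamo()
db.put("cart:alice", {"versions": []}, ["book-42"])
cart, ctx = db.get("cart:alice")
print(cart, ctx)    # [['book-42']] {'versions': [1]}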

  27. REPLICATION: THE CHOICE
    DYNAMO
    Synchronous replica coordination
    ‣ strong consistency
    ‣ availability tradeoff
    Optimistic replication technique
    ‣ high availability
    ‣ probability of conflicts

  28. CONFLICTS: WHEN
    DYNAMO
    At write time
    ‣ probability of rejecting writes
    At read time
    ‣ an “always writable” datastore

  29. CONFLICTS: WHO
    DYNAMO
    The data store
    ‣ e.g. a “last write wins” policy
    The application
    ‣ resolution as an implementation detail
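
A minimal sketch, assuming timestamped versions, of the “last write wins” policy a data store can apply on its own (the application path would instead merge the conflicting values).

# Toy "last write wins" reconciliation: keep the version with the latest timestamp.
def last_write_wins(versions):
    """versions: iterable of (timestamp, value); return the most recent value."""
    return max(versions, key=lambda tv: tv[0])[1]

print(last_write_wins([(100, {"cart": ["a"]}), (105, {"cart": ["b"]})]))  # {'cart': ['b']}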

  30. A RING TO RULE THEM ALL
    DYNAMO

  31. PARTITIONING: THE RING
    DYNAMO
    [figure: nodes A–G arranged on a ring; data is mapped onto the ring by hashing its key]

  32. REPLICATION
    DYNAMO
    [figure: nodes A–G on the ring, data mapped by hashing its key]
    N = 3: node D stores the keys in the ranges (A, B], (B, C] and (C, D]
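
A minimal consistent-hashing sketch of this ring; the node names come from the slide, while the MD5 hash, ring size and example output are assumptions for illustration.

# Toy consistent hashing: each node owns the arc of the ring ending at its position;
# a key is stored on its coordinator plus the next N-1 clockwise nodes (the preference list).
import hashlib
from bisect import bisect_right

RING_SIZE = 2 ** 32
NODES = ["A", "B", "C", "D", "E", "F", "G"]

def ring_position(name):
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % RING_SIZE

positions = sorted((ring_position(n), n) for n in NODES)

def preference_list(key, n_replicas=3):
    """Return the n_replicas nodes responsible for key, walking the ring clockwise."""
    key_pos = ring_position(key)
    idx = bisect_right([p for p, _ in positions], key_pos) % len(positions)
    return [positions[(idx + i) % len(positions)][1] for i in range(n_replicas)]

print(preference_list("price"))     # e.g. ['D', 'E', 'F'], depending on the hash values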

  33. DATA VERSIONING
    DYNAMO
    put()
    may return before the update has been propagated to all replicas.
    get()
    a subsequent get() may return an object that does not have the latest update.

  34. RECONCILIATION
    DYNAMO

  35. RECONCILIATION
    DYNAMO
    Syntactic reconciliation
    ‣ the new version subsumes the previous one
    Semantic reconciliation
    ‣ conflicting versions of the same object

  36. VECTOR CLOCK
    DYNAMO

  37. VECTOR CLOCK
    DYNAMO
    Definition
    ‣ list of (node, counter) pairs

  38. VECTOR CLOCK
    DYNAMO
    Definition
    ‣ list of (node, counter) pairs
    write handled by Sx → D1 ([Sx,1])

  39. VECTOR CLOCK
    DYNAMO
    Definition
    ‣ list of (node, counter) pairs
    write handled by Sx → D1 ([Sx,1])
    write handled by Sx → D2 ([Sx,2])

  40. VECTOR CLOCK
    DYNAMO
    Definition
    ‣ list of (node, counter) pairs
    write handled by Sx → D1 ([Sx,1])
    write handled by Sx → D2 ([Sx,2])
    write handled by Sy → D3 ([Sx,2], [Sy,1])

  41. VECTOR CLOCK
    DYNAMO
    Definition
    ‣ list of (node, counter) pairs
    write handled by Sx → D1 ([Sx,1])
    write handled by Sx → D2 ([Sx,2])
    write handled by Sy → D3 ([Sx,2], [Sy,1])
    write handled by Sz → D4 ([Sx,2], [Sz,1])

  42. VECTOR CLOCK
    DYNAMO
    Definition
    ‣ list of (node, counter) pairs
    write handled by Sx → D1 ([Sx,1])
    write handled by Sx → D2 ([Sx,2])
    write handled by Sy → D3 ([Sx,2], [Sy,1])
    write handled by Sz → D4 ([Sx,2], [Sz,1])
    reconciled and written by Sx → D5 ([Sx,3], [Sy,1], [Sz,1])
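
A minimal sketch, representing the clocks above as Python dicts (an assumption), of how two versions are compared: one subsumes the other when its clock dominates on every node, otherwise the versions conflict and need semantic reconciliation.

# Vector clocks as {node: counter} dicts; a clock "descends" from another when it is
# greater-or-equal on every node. Neither descending from the other means a conflict.
def descends(a, b):
    """True if the version with clock a causally follows (or equals) the one with clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

d2 = {"Sx": 2}
d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}
d5 = {"Sx": 3, "Sy": 1, "Sz": 1}

print(descends(d3, d2))                       # True: D3 subsumes D2 (syntactic reconciliation)
print(descends(d3, d4), descends(d4, d3))     # False, False: D3 and D4 conflict (semantic)
print(descends(d5, d3) and descends(d5, d4))  # True: D5 reconciles both branches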

  43. PUT() AND GET()
    DYNAMO
    R
    ‣ minimum number of nodes that must participate in a successful read operation
    W
    ‣ minimum number of nodes that must participate in a successful write operation

  44. PUT() AND GET()
    DYNAMO
    put()
    ‣ the coordinator generates the vector clock for the new version and
    writes the new version locally
    ‣ the new version is sent to the N nodes
    ‣ the write is successful if at least W-1 nodes respond
    get()
    ‣ the coordinator requests all existing versions of the data
    ‣ the coordinator waits for R responses before returning the result
    ‣ the coordinator returns all the versions that are causally unrelated
    ‣ the divergent versions are reconciled and written back
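
A minimal sketch of the quorum arithmetic behind R and W (the values below are illustrative): choosing R + W > N makes every read quorum overlap every write quorum, so a read contacts at least one replica that saw the latest successful write.

# Quorum arithmetic for N replicas: a write succeeds after W acks, a read after R replies.
N, R, W = 3, 2, 2

def quorums_overlap(n, r, w):
    """True if any read quorum of size r must intersect any write quorum of size w."""
    return r + w > n

print(quorums_overlap(N, R, W))     # True: 2 + 2 > 3, reads observe the latest write
print(quorums_overlap(3, 1, 1))     # False: such settings trade consistency for latency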

  45. SLOPPY QUORUM
    DYNAMO
    [figure: the ring of nodes A–G, with N = 3]

  46. WHY IS IT AP?
    DYNAMO
    ‣ requests are served even if some replicas are not available
    ‣ if a node is down, the write is stored on another node
    ‣ consistency conflicts are resolved at read time or in the background
    ‣ eventually, all the replicas converge
    ‣ concurrent read/write operations can make distinct clients see distinct
    versions of the same key

  47. BIGTABLE

  48. REQUIREMENTS
    GOOGLE BIGTABLE
    ‣ scale to petabytes of data
    ‣ thousands of machines
    ‣ high availability
    ‣ high performance

  49. DATA MODEL
    GOOGLE BIGTABLE
    ‣ a sparse, distributed, persistent, multi-dimensional sorted map
    (row: string, column: string, time: int64) → string
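
A minimal sketch of that map, using plain nested Python dicts as an assumption (not BigTable's storage format): each value is addressed by a row key, a column key and a timestamp.

# BigTable's data model as a toy nested map:
#   table[row_key][column_key][timestamp] -> string
table = {
    "com.example": {
        "contents:": {
            2: "<html>... (newer)",
            1: "<html>... (older)",
        },
        "anchor:cnnsi.com": {1: "cnn"},
    }
}

def read(table, row, column, timestamp=None):
    """Return the value at the given timestamp, or the most recent one."""
    versions = table[row][column]
    ts = timestamp if timestamp is not None else max(versions)
    return versions[ts]

print(read(table, "com.example", "contents:"))   # newest version of the page contents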

  50. ROWS
    GOOGLE BIGTABLE
    ‣ arbitrary strings
    ‣ read/write operations are atomic
    ‣ data is maintained in lexicographic order by row key
    ‣ each row range is called a tablet
    e.g. maps.google.com is keyed as the reversed domain com.google.maps

  51. COLUMNS
    GOOGLE BIGTABLE
    ‣ column keys are grouped into sets called column families
    ‣ a column family must be created before data can be stored under any
    column key in that family
    ‣ a column key is named as family:qualifier
    ‣ access control and both disk and memory accounting are performed at the
    column-family level

  52. TIMESTAMPS
    GOOGLE BIGTABLE
    [figure: the contents: column of row com.example holds two <html>… versions,
    stored at timestamps t1 and t2]

  53. DATA MODEL: EXAMPLE
    GOOGLE BIGTABLE
    row keys (rows kept sorted): com.example, com.cnn.www, com.cnn.www/foo
    column families: language:, contents:, anchor:cnnsi.com, anchor:my.look.ca
    each row has language: “en” and a contents: value “<!DOCTYPE html PUBLIC …”;
    com.cnn.www also has anchor:cnnsi.com = “cnn” and anchor:my.look.ca = “cnn.com”

  54. DIFFERENCES WITH RDBMS
    GOOGLE BIGTABLE
    RDBMS: query language        BIGTABLE: specific API
    RDBMS: joins                 BIGTABLE: no referential integrity
    RDBMS: explicit sorting      BIGTABLE: sorting defined a priori in the column family

  55. ARCHITECTURE
    GOOGLE BIGTABLE
    Google File System (GFS)
    ‣ stores data files and logs
    Google SSTable
    ‣ stores BigTable data
    Chubby
    ‣ highly available distributed lock service

  56. COMPONENTS
    GOOGLE BIGTABLE
    library
    ‣ linked into every client
    one master server
    ‣ assigning tablets to tablet servers
    ‣ detecting the addition and expiration of tablet servers
    ‣ balancing tablet-server load
    ‣ garbage collection of files in GFS
    ‣ handling schema changes
    many tablet servers
    ‣ each manages 10 to 100 tablets
    ‣ handles read and write requests to its tablets
    ‣ splits tablets that have grown too large

  57. COMPONENTS
    GOOGLE BIGTABLE
    [figure: a client, the master server and three tablet servers; metadata goes through
    the master, reads and writes go to the tablet servers]

  58. STARTUP AND GROWTH
    GOOGLE BIGTABLE
    [figure: location hierarchy: a Chubby file points to the root tablet, the root tablet
    points to the other metadata tablets, and these point to the user tablets
    (UserTable1 … UserTableN)]
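
A minimal sketch of resolving a row key through that hierarchy; the dict-based levels, key ranges and server names are assumptions for illustration.

# Toy three-level lookup: Chubby file -> root tablet -> metadata tablet -> user tablet.
chubby_file = "root-tablet@serverA"     # Chubby stores the location of the root tablet

root_tablet = {                         # root tablet: names the metadata tablets
    "meta1": "serverB",
    "meta2": "serverC",
}

metadata_tablets = {                    # each metadata tablet maps row ranges to servers
    "meta1": {("UserTable1", "a", "m"): "serverD"},
    "meta2": {("UserTable1", "m", "z"): "serverE"},
}

def locate(table, row_key):
    """Find the tablet server holding row_key by walking the hierarchy."""
    for meta_name in root_tablet:                           # 1) root tablet (via Chubby)
        for (tbl, lo, hi), tablet_server in metadata_tablets[meta_name].items():
            if tbl == table and lo <= row_key < hi:         # 2) matching metadata entry
                return tablet_server                        # 3) server of the user tablet

print(locate("UserTable1", "com.example"))   # serverD: 'com.example' falls in the a-m range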

  59. TABLET ASSIGNMENT
    GOOGLE BIGTABLE
    tablet server
    ‣ when started, creates and acquires a lock on a file in Chubby
    master
    ‣ grabs a unique master lock in Chubby
    ‣ scans Chubby to find the live tablet servers
    ‣ asks each tablet server which tablets it already serves
    ‣ scans the Metadata table to learn the full set of tablets
    ‣ builds the set of unassigned tablets, for future tablet assignment

  60. WHY IS IT CP?
    GOOGLE BIGTABLE
    ‣ if the master dies, its services stop functioning
    ‣ if a tablet server dies, its tablets become unavailable
    ‣ if Chubby dies, BigTable can no longer execute synchronization operations
    or serve client requests
    ‣ the Google File System is itself a CP system

  61. $ whoami
    Andrea Giuliano
    @bit_shark
    www.andreagiuliano.it

  62. joind.in/13224
    Please rate the talk!

  63. REFERENCES
    G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store”
    F. Chang et al., “Bigtable: A Distributed Storage System for Structured Data”
    Assets:
    https://farm1.staticflickr.com/41/86744006_0026864df8_b_d.jpg
    https://farm9.staticflickr.com/8305/7883634326_4e51a1a320_b_d.jpg
    https://farm5.staticflickr.com/4145/4958650244_65b2eddffc_b_d.jpg
    https://farm4.staticflickr.com/3677/10023456065_e54212c52e_b_d.jpg
    https://farm4.staticflickr.com/3076/2871264822_261dafa44c_o_d.jpg
    https://farm1.staticflickr.com/7/6111406_30005bdae5_b_d.jpg
    https://farm4.staticflickr.com/3928/15416585502_92d5e608c7_b_d.jpg
    https://farm8.staticflickr.com/7046/6873109431_d3b5199f7d_b_d.jpg
    https://farm4.staticflickr.com/3007/2835755867_c530b0e0c6_o_d.jpg
    https://farm3.staticflickr.com/2788/4202444169_2079db9580_o_d.jpg
    https://farm1.staticflickr.com/55/129619657_907b480c7c_b_d.jpg
    https://farm5.staticflickr.com/4046/4368269562_b3e05e3f06_b_d.jpg
    https://farm8.staticflickr.com/7344/12137775834_d0cecc5004_k_d.jpg
    https://farm5.staticflickr.com/4073/4895191036_1cb9b58d75_b_d.jpg
    https://farm4.staticflickr.com/3144/3025249284_b77dec2d29_o_d.jpg
    https://www.flickr.com/photos/avardwoolaver/7137096221