Silence is Golden: Coordination-Avoiding Systems Design

pbailis
August 21, 2015

MesosCon 2015 Keynote
26 August 2015
Seattle, WA

Talk video: https://www.youtube.com/watch?v=EYJnWttrC9k
More information: http://bailis.org/

Abstract:

Computer networks make it difficult to design scalable, robust distributed systems that exhibit good performance. Networks can be slow, have limited capacity, and are often unreliable. In an ideal world, we'd build systems that don't rely on the network at all. Unfortunately, as a slew of negative results like the CAP Theorem illustrate, this isn't always possible. Traditional systems abstractions like ACID transactions fundamentally require synchronous communication, or coordination, to implement. As a result, coordination-free system designs often forgo many programmer-friendly abstractions. These systems leave the task of reasoning about correctness to the application developer or, worse, to the end user.

In this talk, I'll discuss an alternative: system designs that coordinate only when necessary to guarantee application correctness. This coordination avoidance maximizes scalability and robustness by minimizing reliance on the network. To illustrate the power of coordination-avoiding systems design, I'll present several case studies from our research spanning database isolation guarantees, indexes and constraints, and open source applications. Perhaps surprisingly, even though traditional implementations of these tasks rely on coordination, many of these tasks don't actually require coordination for correctness. The resulting systems are among the fastest prototypes ever built and operated at scale. Based on these case studies, I'll provide concrete and practical design principles for reasoning about and applying coordination avoidance in the wild.

Transcript

  1. Reasoning about Distribution is Hard

    Attendee Login, Room Reservations, Social Media Monitoring, Database
    • Should you and I be able to simultaneously reserve rooms?
    • Can you reserve a room while I log in?
    • Can you tweet while I change my username?
  2. Simple, classic strategy: Hide concurrency by coordinating

    Abstraction: Serial access to state
    Mechanisms: Consensus (Paxos, VR, Raft); Replicated State Machines; Zookeeper, etcd, Doozer; ACID transactions
  3. Coordination is expensive

    Processes cannot make progress independently
    This limits: 1.) Scalability 2.) Throughput 3.) Low Latency 4.) Availability
  14. DISTRIBUTED TRANSACTIONS (EC2)

    [Figure: Throughput (txns/s), LOG SCALE, versus Number of Servers (Items) Accessed per Transaction (1 to 7); COORDINATED transactions compared with IN-MEMORY LOCKING, annotated -398x. Diagram labels: A B C D E F G H.]
  22. Simple, classic strategy: Hide concurrency by coordinating

    Abstraction: Serial access to state
    High cost! Fundamental penalties to: Scalability, Throughput, Latency, Availability
  23. Why do we feel it's necessary to yak in order to be comfortable? That's when you know you've found somebody really special: when you can just shut up for a minute and comfortably share silence.
  25. Scalable systems can just shut up and comfortably share silence

    This talk:
    1.) Why is shutting up good for systems?
    2.) When can systems comfortably share silence?
  28. DISTRIBUTED TRANSACTIONS (EC2)

    [Figure: Throughput (txns/s) versus Number of Servers (Items) Accessed per Transaction (1 to 7); COORDINATED and COORDINATION-FREE transactions compared with IN-MEMORY LOCKING, annotated -398x. Diagram labels: A B C D E F G H.]
  29. Why is shutting up good?

    Coordination-free systems:
    1.) Enable infinite scale-out 2.) Improve throughput 3.) Ensure low latency 4.) Improve availability
  33. Why is shutting up good?

    Coordination-free systems:
    1.) Enable infinite scale-out 2.) Improve throughput 3.) Ensure low latency 4.) Guarantee “always on” response
    Silence is key to scalability!
  34. Scalable systems can just shut up and comfortably share silence

    This talk:
    1.) Why is shutting up good for systems?
    2.) When can systems comfortably share silence?
  37. COMPOSED (“merged”) IN A WAY THAT MAKES “SENSE” (invariants over state will hold)

    Composed: 1+1=2; {“a”}+{“b”}={“a”, “b”}
    Invariants: Counters are positive; No two talks share a timeslot; No NULL values; Usernames are unique
  38. Key question: Can invariants be violated by merging independent operations?

    ICT: Invariant Confluence Test [VLDB 2015]
  44. Key question: Can invariants be violated by merging independent operations?

    ICT: Invariant Confluence Test [VLDB 2015]
    INVARIANT: User IDs are positive
    OPERATION: Save new user
    MERGE: Add both records to DB
    Starting from {}, two independent operations run: add {Stu,ID=1} and add {Ann,ID=1}. MERGE gives {{Stu,ID=1}, {Ann,ID=1}}. Invariant holds!
  47. Key question: Can invariants be violated by merging independent operations?

    ICT: Invariant Confluence Test [VLDB 2015]
    INVARIANT: User IDs are unique
    OPERATION: Save new user
    MERGE: Add both records to DB
    Starting from {}, two independent operations run: add {Stu,ID=1} and add {Ann,ID=1}. MERGE gives {{Stu,ID=1}, {Ann,ID=1}}. Invariant broken!
  50. Key question: Can invariants be violated by merging independent operations?

    ICT: Invariant Confluence Test [VLDB 2015]
    ICT passes? Coordination not required
    ICT fails? Coordination required
  51. THOSE LIGHT CONES

    If operations happen concurrently… …ensure their side-effects can be COMPOSED IN A WAY THAT MAKES “SENSE” (formalized by ICT)
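
The two user-ID examples from the slides can be replayed in a few lines of Python. This is only a sketch of one divergence scenario, not the formal test from the VLDB 2015 paper: the function names (`merge`, `merge_preserves`, `add_user`) are hypothetical, records are modeled as `(name, id)` tuples, and MERGE is taken to be set union as the slides describe.

```python
def merge(a, b):
    """MERGE from the slides: add both sides' records to the DB (set union)."""
    return a | b


def ids_positive(db):
    """INVARIANT: User IDs are positive."""
    return all(uid > 0 for _, uid in db)


def ids_unique(db):
    """INVARIANT: User IDs are unique."""
    ids = [uid for _, uid in db]
    return len(ids) == len(set(ids))


def add_user(name, uid):
    """OPERATION: Save new user. Returns a function from DB state to DB state."""
    return lambda db: db | {(name, uid)}


def merge_preserves(invariant, start, op_a, op_b):
    """Apply two operations independently to the same start state, then merge.

    Returns True if the merged state still satisfies the invariant.
    Both diverged states must be valid on their own before merging.
    """
    left, right = op_a(start), op_b(start)
    if not (invariant(left) and invariant(right)):
        raise ValueError("each side must satisfy the invariant before merging")
    return invariant(merge(left, right))


start = frozenset()                 # the empty database {}
add_stu = add_user("Stu", 1)
add_ann = add_user("Ann", 1)

positive_ok = merge_preserves(ids_positive, start, add_stu, add_ann)  # True
unique_ok = merge_preserves(ids_unique, start, add_stu, add_ann)      # False
```

A single passing scenario does not prove an invariant is safe for all executions; a single failing scenario, as with uniqueness here, is enough to show that coordination is needed.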
  54. When can we comfortably share silence? When operations are composable

    Attendee Login, Room Reservations, Social Media Monitoring, Database
    • Can we simultaneously reserve rooms?
    • Can I log in while you reserve a room?
    • Can I tweet while you change your username?
  56. Typical database constraints and operations (SQL) [VLDB 2015]

    Constraint            Operation    Passes ICT?
    Equality, Inequality  Any          Y
    Generate unique ID    Any          Y
    Specify unique ID     Insert       N
    >                     Increment    Y
    >                     Decrement    N
    <                     Decrement    Y
    <                     Increment    N
    Foreign Key           Insert       Y
    Foreign Key           Delete       Y*
    Secondary Indexing    Any          Y
    Materialized Views    Any          Y
    AUTO_INCREMENT        Insert       N
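
The ">" rows in the constraint list above can be illustrated with a counter. The sketch below assumes a threshold of zero and a merge that applies both sides' deltas to the common starting value; both choices, and all names, are illustrative assumptions rather than anything stated on the slide.

```python
def merge_counter(base, left, right):
    """Merge two diverged counter values by applying both deltas to the base."""
    return base + (left - base) + (right - base)


def greater_than_zero(x):
    """Assumed invariant for the '>' rows: the counter stays above zero."""
    return x > 0


def increment(x):
    return x + 1


def decrement(x):
    return x - 1


def survives_merge(base, op_a, op_b, invariant=greater_than_zero):
    """Run two operations independently from the same base, merge, re-check."""
    left, right = op_a(base), op_b(base)
    if not (invariant(left) and invariant(right)):
        raise ValueError("each side must satisfy the invariant before merging")
    return invariant(merge_counter(base, left, right))


# Two concurrent increments from 2: each side sees 3, the merge is 4.
increments_ok = survives_merge(2, increment, increment)   # True

# Two concurrent decrements from 2: each side sees 1, but the merge is 0.
decrements_ok = survives_merge(2, decrement, decrement)   # False
```

Each decrement is valid in isolation, which is why the violation only appears after merging: neither side can detect it without talking to the other.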
  57. adopt-a-hydrant alchemy_cms amahi bostonrb boxroom brevidy browsercms bucketwise calagator canvas-lms carter chiliproject citizenry comas comfortable-mexican-sofa communityengine copycopter-server danbooru diaspora discourse enki fat_free_crm fedena forem fulcrum gitlab-ci gitlabhq govsgo heaven inkwell insoshi jobsworth juvia kandan linuxfr.org lobsters lovd-by-less nimbleshop obtvse onebody opal opencongress opengovernment openproject piggybak publify radiant railscollab redmine refinerycms ror_ecommerce rucksack saasy salor-retail selfstarter sharetribe skyline spot-us spree sprintapp squaresquash sugar teambox tracks tryshoppe wallgig zena
  59. Always coordinating is inefficient!

    67 projects; 1.77M LoC; 1957 tables; 9986 total, avg. 5.1 per table
    86.9% PASS ICT [SIGMOD 2015]
  61. Legacy Implementations Overcoordinate

    Everything Happens At Once
    Read Committed (RDBMS): Users never read intermediate data
    w(name=“peter”); w(name=“pbailis”); commit;
    r(name=“peter”); commit;
    Classic implementation: lock records during access
  73. Legacy Implementations Overcoordinate

    Read Committed (RDBMS)
    w(name=“peter”); w(name=“pbailis”); commit;
    r(name=“peter”); commit;
    Classic implementation: lock records during access (name/record: “pbailis”)
    Better implementation: use multi-versioning, commit tag (name/record: “peter”, “pbailis”; OK)
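
A rough way to picture the multi-versioning alternative: writers stage new versions without taking locks, and readers only ever see the version that has been marked committed. The class below is a single-process sketch under that assumption; `VersionedRecord` and its methods are hypothetical names, and the split between pending and committed values stands in for the "commit tag" on the slide.

```python
class VersionedRecord:
    """One record that keeps uncommitted writes separate from the committed value."""

    def __init__(self, initial=None):
        self._committed = initial   # the only value readers can observe
        self._pending = {}          # txn_id -> latest uncommitted value

    def write(self, txn_id, value):
        # No lock is taken: the write is staged and invisible to readers.
        self._pending[txn_id] = value

    def commit(self, txn_id):
        # Publishing the staged value plays the role of the commit tag.
        if txn_id in self._pending:
            self._committed = self._pending.pop(txn_id)

    def read(self):
        # Readers never block and never see intermediate data.
        return self._committed


name = VersionedRecord()
observed = []

name.write(1, "peter")
observed.append(name.read())     # None: "peter" is not committed

name.write(1, "pbailis")
observed.append(name.read())     # None: still nothing committed

name.commit(1)
observed.append(name.read())     # "pbailis": only the final value is published
```

The intermediate value "peter" is never visible to a reader, which is the Read Committed guarantee from the slide, and no reader had to wait for the writer.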
  77. Next Level Technique: RAMP Transactions

    Desired property: see all updates, or see none (used in indexing, materialized views, foreign keys)
    w(status=“talking”); w(loc=“seattle”); commit;
    Classic implementation: lock records
    Result: typically implemented incorrectly at scale
  88. Next Level Technique: RAMP Transactions

    Desired property: see all updates, or see none
    w(status=“talking”); w(loc=“seattle”); commit;
    RAMP: multi-versioning with intention metadata [SIGMOD 2014]
    status/record: “talking” (@t=10, also loc)
    loc/record: “seattle” (@t=10, also status)
    OK
    Key: Prevent read stalls; Compact metadata
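
The metadata on the RAMP slides (a timestamp plus the names of the other keys written by the same transaction) is enough for a reader to repair a partially visible update without blocking. The following is a simplified, single-process sketch of that idea, not the algorithm as published in the SIGMOD 2014 paper; `Version`, `Partition`, `ramp_write`, and `ramp_read` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    ts: int                  # transaction timestamp, e.g. @t=10
    siblings: frozenset      # other keys written by the same transaction


class Partition:
    """Stores every version of one key and the timestamp of the last commit."""

    def __init__(self):
        self.versions = {}   # ts -> Version (prepared or committed)
        self.last_commit = 0 # 0 means nothing has been committed yet

    def prepare(self, version):
        self.versions[version.ts] = version

    def commit(self, ts):
        self.last_commit = max(self.last_commit, ts)

    def get_latest(self):
        return self.versions.get(self.last_commit)

    def get_by_ts(self, ts):
        return self.versions[ts]


def ramp_write(store, ts, writes):
    """Two phases: place every version first, then mark each one committed."""
    keys = frozenset(writes)
    for key, value in writes.items():
        store[key].prepare(Version(value, ts, keys - {key}))
    for key in writes:
        store[key].commit(ts)


def ramp_read(store, keys):
    """Read the latest committed versions, then fetch any missing siblings."""
    first = {k: store[k].get_latest() for k in keys}
    required = {k: (v.ts if v else 0) for k, v in first.items()}
    for v in first.values():
        if v is None:
            continue
        for sibling in v.siblings:
            if sibling in required:
                required[sibling] = max(required[sibling], v.ts)
    result = {}
    for k in keys:
        v = first[k]
        if required[k] > (v.ts if v else 0):
            # A sibling was visible, so this version must already be prepared.
            v = store[k].get_by_ts(required[k])
        result[k] = v.value if v else None
    return result


# Simulate a writer that has committed "status" but not yet "loc".
store = {"status": Partition(), "loc": Partition()}
store["status"].prepare(Version("talking", 10, frozenset({"loc"})))
store["loc"].prepare(Version("seattle", 10, frozenset({"status"})))
before_commit = ramp_read(store, ["status", "loc"])
store["status"].commit(10)
mid_commit = ramp_read(store, ["status", "loc"])
```

In this run `before_commit` sees neither write and `mid_commit` sees both, even though the writer is only halfway through committing. The reader never waits on the writer; it uses the sibling metadata to ask for the exact version it is missing.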
  89. TPC-C: 14/16 INVARIANTS PASS ICT; scale to over 25x best listed result

    [Figure: Total Throughput (txn/s) and Throughput (txn/s/server) versus Number of Servers (0 to 200).]
    [Figure: Throughput (txns/s) versus Number of Warehouses (8 to 64), Coordination-Avoiding compared with Serializable (2PL); 6-11x faster than ACID/serializability.]
  92. Key Design Patterns

    • Datatype libraries can automatically merge operations, e.g., Bloom^L, CRDTs
    • Multi-versioning can prevent stalls during partial updates, e.g., RAMP, COPS, SwiftCloud
    • When you must coordinate, distribute as little as possible, e.g., Transaction Chopping
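
As one concrete instance of a datatype that merges automatically, here is a sketch of a grow-only counter, a commonly described CRDT. It is not taken from the talk or from any particular library; the class name and replica IDs are made up for illustration.

```python
class GCounter:
    """Grow-only counter: each replica only increments its own slot."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}     # replica_id -> increments seen from that replica

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas can exchange state in any order, any number of times.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


a, b = GCounter("a"), GCounter("b")
a.increment()
a.increment()        # replica a counts 2 without contacting b
b.increment()        # replica b counts 1 without contacting a

a.merge(b)
b.merge(a)
a.merge(b)           # merging again changes nothing

total_a, total_b = a.value(), b.value()   # both are 3
```

Neither replica coordinates while incrementing, yet both converge to the same total after exchanging state, which is the composability property the design patterns rely on.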
  93. Rethink The API

    Read/Write Transaction, Distributed Log, Consensus Object: Are too low level!
  94. WHAT THE APPLICATION SAYS: “post on timeline”, “accept friend request”

    WHAT THE SYSTEM HEARS: [diagram: an interleaved sequence of individual reads and writes]
  95. WHAT THE APPLICATION SAYS: “post on timeline”, “accept friend request”

    WHAT THE SYSTEM HEARS: [diagram: reads and writes, alongside “post on timeline” and “accept friend request”]
  96. The Good Stuff (Papers)

    ICT in theory and practice: SIGMOD 2015, VLDB 2015
    Coordination-avoiding analytics: CIDR 2015
    Index, graph, and view maintenance: SIGMOD 2014
    Transaction isolation: VLDB 2014
    Upgrading existing stores: SIGMOD 2013
    Quantifying visibility: VLDB 2012, VLDBJ 2014
  97. To avoid coordination, maximize composability of operations

    Scalable systems can comfortably share silence
    Joint work with Ali Ghodsi, Alan Fekete, Joe Hellerstein, Ion Stoica, and many others (see bailis.org)
  98. Many illustrations by the Noun Project (CC-Attribution):

    surprised by Julian Derveaux; world by Wayne Tyler Sall; database by Austin Condiff; earth by Martin Vanco; Woman by Simon Child; Man by Simon Child; Doctor by Simon Child; David-Hockney by Simon Child; Server by Simon Child; clock by christoph robausch