
Coordination and the Art of Scaling

pbailis
CloudantCON 2014, 17 June 2014
http://www.cloudantcon.com/#schedule

For more information/details/nuance (!):
http://www.bailis.org/blog/
http://www.bailis.org/pubs.html
@pbailis

Transcript

  1. COORDINATION AND THE ART OF SCALING Peter Bailis • UC

    Berkeley • @pbailis CloudantCON 2014
  2. A distributed system is one in which the failure of

    a computer you didn't even know existed can render your own computer unusable. —Leslie Lamport 2013 Turing Award Winner
  3. THE NETWORK INCURS LATENCY THE NETWORK IS UNRELIABLE SO HOW

    CAN WE BUILD ROBUST AND SCALABLE DISTRIBUTED SYSTEMS?
  4. THE SIMPLE ANSWER: SINGLE-SYSTEM IMAGE TIME Impose a total order

    on events in the system Ask Amanda: “how’s the weather on the farm?” Amanda replies: “Let me check with the tractor.” Amanda replies: “It’s a beautiful day!” Tractor replies: current temperature is 75°F
  5. THE SIMPLE ANSWER: SINGLE-SYSTEM IMAGE Impose a total order on

    events in the system TIME Illusion created by a partially ordered protocol
  6. THE SIMPLE ANSWER: SINGLE-SYSTEM IMAGE TIME Impose a total order

    on events in the system Illusion created by a partially ordered protocol Remarkably powerful abstraction core to ACID transactions
  7. THE SIMPLE ANSWER: SINGLE-SYSTEM IMAGE TIME Impose a total order

    on events in the system Illusion created by a partially ordered protocol Remarkably powerful abstraction This is the way you’d want to program distributed systems, but… core to ACID transactions
  8. THE SIMPLE ANSWER: SINGLE-SYSTEM IMAGE TIME Impose a total order

    on events in the system Illusion created by a partially ordered protocol COST:
  9. THE SIMPLE ANSWER: SINGLE-SYSTEM IMAGE TIME Impose a total order

    on events in the system Illusion created by a partially ordered protocol COST: COORDINATION (BLOCKING COMMUNICATION)
  10. [Plot: Throughput (txns/s), log scale, vs. Number of Items per Transaction (1-7)] SERIALIZABLE TRANSACTIONS ON EC2, IN-MEMORY LOCKING “Coordination-Avoiding Database Systems” arXiv:1402.2237
  11. [Same plot; curve labeled COORDINATED] SERIALIZABLE TRANSACTIONS ON EC2, IN-MEMORY LOCKING “Coordination-Avoiding Database Systems” arXiv:1402.2237
  12. [Same plot; COORDINATED vs. COORDINATION-FREE curves] SERIALIZABLE TRANSACTIONS ON EC2, IN-MEMORY LOCKING “Coordination-Avoiding Database Systems” arXiv:1402.2237
  13. [Same plot] SINGLE SERVER: 10x faster (multi-core parallelism); MULTI-SERVER: ~1000x faster “Coordination-Avoiding Database Systems” arXiv:1402.2237
  14. SSI/serializability support, 18 databases surveyed: Actian Ingres YES, Aerospike NO, Persistit NO, Clustrix NO, Greenplum YES, IBM DB2 YES, IBM Informix YES, MySQL YES, MemSQL NO, MS SQL Server YES, NuoDB NO, Oracle 11g NO, Oracle BDB YES, Oracle BDB JE YES, Postgres 9.2.2 YES, SAP HANA NO, ScaleDB NO, VoltDB YES. 8/18 databases surveyed did not support SSI/serializability; 15/18 used weaker models by default. “Highly Available Transactions: Virtues and Limitations” VLDB 2014
  17. COORDINATION REQUIRED? COORDINATION FREE? Throughput: 1/delay Limited by physical resources

    Latency: 1+ RTT Can return immediately SINGLE DC: 0.5 ms on public cloud, 5 µs on Infiniband
  18. COORDINATION REQUIRED? COORDINATION FREE? Throughput: 1/delay Limited by physical resources

    Latency: 1+ RTT Can return immediately SINGLE DC: 0.5 ms on public cloud, 5 µs on Infiniband MULTI-DC?
  19. COORDINATION REQUIRED? COORDINATION FREE? Throughput: 1/delay Limited by physical resources

    Latency: 1+ RTT Can return immediately Unavailable during failures Progress despite failures
  20. COORDINATION REQUIRED? COORDINATION FREE? Throughput: 1/delay Limited by physical resources

    Latency: 1+ RTT Can return immediately Unavailable during failures Progress despite failures WHEN DO WE HAVE TO COORDINATE?
  22. COORDINATION REQUIRED? COORDINATION FREE? Throughput: 1/delay Limited by physical resources

    Latency: 1+ RTT Can return immediately Unavailable during failures Progress despite failures CAP Theorem (for recency guarantees) FLP result (for consensus; e.g., Paxos) WHEN DO WE HAVE TO COORDINATE? Davidson result (for SSI)
  23. COORDINATION REQUIRED? COORDINATION FREE? Throughput: 1/delay Limited by physical resources

    Latency: 1+ RTT Can return immediately Unavailable during failures Progress despite failures CAP Theorem (for recency guarantees) FLP result (for consensus; e.g., Paxos) BUT DO APPS ALWAYS HAVE TO COORDINATE? WHEN DO WE HAVE TO COORDINATE? Davidson result (for SSI)
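The throughput and latency trade-off in the slides above can be made concrete with a back-of-the-envelope calculation. This is an illustrative sketch; it assumes each conflicting operation pays exactly one coordination round trip before the next can proceed:

```python
# Back-of-the-envelope bound (illustrative): if every conflicting operation
# on an item must complete one coordination round trip (RTT) before the
# next can proceed, throughput on that item is capped at 1/delay.

def max_coordinated_throughput(rtt_seconds: float) -> float:
    """At most one coordinated operation completes per round trip."""
    return 1.0 / rtt_seconds

# Figures quoted on the slide: ~0.5 ms RTT in a single public-cloud DC,
# ~5 microseconds on Infiniband.
print(max_coordinated_throughput(0.5e-3))  # ~2,000 ops/s per contended item
print(max_coordinated_throughput(5e-6))    # ~200,000 ops/s per contended item
```

Coordination-free operations face no such network-imposed cap; as the slide notes, they are limited only by physical resources.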
  24. INVARIANT: TICKET IDs SHOULD BE UNIQUE TICKET 241 TICKET 242

    PRE-PARTITION ID SPACE (1,4,…) (2,5,…) (3,6,…)
  25. INVARIANT: TICKET IDs SHOULD BE NON-NEGATIVE COORDINATION-FREE! INVARIANT: TICKET IDs

    SHOULD BE UNIQUE PRE-PARTITION ID SPACE INVARIANT: TICKET IDs SHOULD BE SEQUENTIAL COORDINATION REQUIRED!
  26. INVARIANT: TICKET IDs SHOULD BE NON-NEGATIVE COORDINATION-FREE! INVARIANT: TICKET IDs

    SHOULD BE UNIQUE PRE-PARTITION ID SPACE INVARIANT: TICKET IDs SHOULD BE SEQUENTIAL COORDINATION REQUIRED! WHEN DO WE HAVE TO COORDINATE? DEPENDS ON APPLICATION SAFE ANSWER: ALWAYS COORDINATE
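The ticket-ID example above can be sketched in a few lines (illustrative code, not from the talk): pre-partitioning the ID space gives uniqueness with zero per-ID communication, whereas sequential IDs would require agreeing on a total order.

```python
import itertools

class PartitionedIdGenerator:
    """Coordination-free unique IDs (hypothetical sketch): server i of n
    hands out IDs i, i+n, i+2n, ... Disjoint ID spaces mean no two servers
    can ever collide, so no per-ID communication is needed."""

    def __init__(self, server_index: int, num_servers: int):
        # e.g. with 3 servers: (1,4,7,...), (2,5,8,...), (3,6,9,...)
        self._counter = itertools.count(server_index, num_servers)

    def next_id(self) -> int:
        return next(self._counter)

servers = [PartitionedIdGenerator(i, 3) for i in (1, 2, 3)]
ids = [g.next_id() for g in servers for _ in range(3)]
assert len(ids) == len(set(ids))  # unique, with zero coordination
# By contrast, SEQUENTIAL IDs (1, 2, 3, ... in global request order) would
# force all servers to agree on a total order -- i.e., to coordinate.
```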
  27. WHEN DO WE HAVE TO COORDINATE? SAFE ANSWER: ALWAYS COORDINATE

    BETTER ANSWER: (YOUR TAX DOLLARS AT WORK)
  28. WHEN DO WE HAVE TO COORDINATE? SAFE ANSWER: ALWAYS COORDINATE

    BETTER ANSWER: COORDINATION AVOIDANCE COORDINATE ONLY WHEN STRICTLY NECESSARY MOVE COMMUNICATION TO BACKGROUND “Coordination-Avoiding Database Systems” arXiv:1402.2237
  29. Invariant Confluence is necessary and sufficient for ensuring safety, convergence,

    availability, and coordination-free execution. Invariant Confluence holds?! A safe, c-free execution strategy exists. Invariant Confluence fails?! No safe, c-free mechanism exists. “Coordination-Avoiding Database Systems” arXiv:1402.2237
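A minimal sketch of the Invariant Confluence intuition follows. The merge function (set union) and the invariants are illustrative choices, not the paper's formalism:

```python
# Sketch: two replicas diverge from a common invariant-satisfying state,
# then merge. If every such merge still satisfies the invariant, the
# invariant can be enforced without coordination.

def i_confluent(initial, ops_a, ops_b, invariant, merge):
    state_a = state_b = initial
    for op in ops_a:                 # replica A proceeds locally...
        state_a = op(state_a)
        assert invariant(state_a)    # ...never violating the invariant
    for op in ops_b:                 # replica B does the same, unaware of A
        state_b = op(state_b)
        assert invariant(state_b)
    return invariant(merge(state_a, state_b))  # does the merge stay safe?

add = lambda x: (lambda s: s | {x})            # an op: insert one element
union = lambda a, b: a | b                     # merge: set union
non_negative = lambda s: all(v >= 0 for v in s)
at_most_one = lambda s: len(s) <= 1

# "All IDs non-negative" survives any union merge: coordination-free.
print(i_confluent(frozenset(), [add(1)], [add(2)], non_negative, union))  # True
# "At most one row exists" fails once both sides insert: must coordinate.
print(i_confluent(frozenset(), [add(1)], [add(2)], at_most_one, union))   # False
```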
  30. Typical DB operations and invariants (SQL): which are coordination-free (C.F.)?

    Invariant            | Operation | C.F.
    Equality, Inequality | Any       | ???
    Generate unique ID   | Any       | ???
    Specify unique ID    | Insert    | ???
    >                    | Increment | ???
    >                    | Decrement | ???
    <                    | Decrement | ???
    <                    | Increment | ???
    Foreign Key          | Insert    | ???
    Foreign Key          | Delete    | ???
    Secondary Indexing   | Any       | ???
    Materialized Views   | Any       | ???
    AUTO_INCREMENT       | Insert    | ???
    “Coordination-Avoiding Database Systems” arXiv:1402.2237
  31. Typical DB operations and invariants (SQL): which are coordination-free (C.F.)?

    Invariant            | Operation | C.F.
    Equality, Inequality | Any       | Y
    Generate unique ID   | Any       | Y
    Specify unique ID    | Insert    | N
    >                    | Increment | Y
    >                    | Decrement | N
    <                    | Decrement | Y
    <                    | Increment | N
    Foreign Key          | Insert    | Y
    Foreign Key          | Delete    | Y*
    Secondary Indexing   | Any       | Y
    Materialized Views   | Any       | Y
    AUTO_INCREMENT       | Insert    | N
    “Coordination-Avoiding Database Systems” arXiv:1402.2237
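The ">" rows of the table can be illustrated with a toy bank balance (a hypothetical example, not from the talk): under the invariant "balance >= 0", concurrent increments are always safe, but decrements that each pass a local check can jointly violate the invariant.

```python
# Toy example (hypothetical) for the ">" rows: under "balance >= 0",
# increments commute safely (C.F. = Y), while concurrent decrements
# can jointly violate the invariant once merged (C.F. = N).

def deposit(state, amount):
    # Increment: can only move the balance further from zero -- safe to
    # apply on any replica without coordination.
    return state + amount

def withdraw(state, amount):
    # Decrement: the check below is only valid against the LOCAL state.
    assert state - amount >= 0
    return state - amount

balance = 100
replica_a = withdraw(balance, 80)  # 20: invariant holds locally on A
replica_b = withdraw(balance, 80)  # 20: invariant holds locally on B
merged = balance - 80 - 80         # combined effect: -60, invariant violated
print(replica_a, replica_b, merged)
```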
  32. Test fails? Cannot avoid coordination. [Same table as slide 31] “Coordination-Avoiding Database Systems” arXiv:1402.2237
  33. MANY TRADITIONAL DB APPS OK [Same table as slide 31] “Coordination-Avoiding Database Systems” arXiv:1402.2237
  35. FRIENDS as FOREIGN KEY DEPENDENCIES “TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013
  36. FRIENDS as FOREIGN KEY DEPENDENCIES: Denormalized friend list. Fast reads… …multi-entity updates “TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013
  39. FRIENDS as FOREIGN KEY DEPENDENCIES: Denormalized friend list. Fast reads… …multi-entity updates. Not cleanly partitionable “TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013
  40. NEED ATOMIC VISIBILITY SEE ALL OF A TXN’S UPDATES, OR

    NONE OF THEM FOREIGN KEY DEPENDENCIES “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
  41. NEED ATOMIC VISIBILITY SEE ALL OF A TXN’S UPDATES, OR

    NONE OF THEM FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
  42. STRAWMAN: LOCKING X=1 Y=1 W(X=1) W(Y=1) R(X=1) R(Y=1) “Scalable Atomic

    Visibility with RAMP Transactions” SIGMOD 2014
  43. Y=0 STRAWMAN: LOCKING X=1 W(X=1) W(Y=1) R(X=?) R(Y=?) “Scalable Atomic

    Visibility with RAMP Transactions” SIGMOD 2014
  44. Y=0 STRAWMAN: LOCKING X=1 W(X=1) W(Y=1) R(X=?) R(Y=?) ATOMIC VISIBILITY

    COUPLED WITH MUTUAL EXCLUSION “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
  45. STRAWMAN: LOCKING X=1 W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) ATOMIC VISIBILITY

    COUPLED WITH MUTUAL EXCLUSION: SLOW, UNAVAILABLE “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
  46. RAMP TRANSACTIONS: Read Atomic Multi-Partition “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
  48. BASIC IDEA W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) X=1 “Scalable Atomic

    Visibility with RAMP Transactions” SIGMOD 2014
  50. BASIC IDEA W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) LET CLIENTS RACE,

    but HAVE READERS “CLEAN UP” X=1 “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
  51. BASIC IDEA W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) LET CLIENTS RACE,

    but HAVE READERS “CLEAN UP” X=1 LIMITED MULTI-VERSIONING + METADATA “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
  52. BASIC IDEA LET CLIENTS RACE, but HAVE READERS “CLEAN UP”

    LIMITED MULTI-VERSIONING + METADATA X=0 Y=0 W(X=1) W(Y=1) “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
  53. BASIC IDEA LET CLIENTS RACE, but HAVE READERS “CLEAN UP”

    X=1 LIMITED MULTI-VERSIONING + METADATA X=0 Y=0 W(X=1) W(Y=1) “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
  54. BASIC IDEA LET CLIENTS RACE, but HAVE READERS “CLEAN UP”

    X=1 LIMITED MULTI-VERSIONING + METADATA X=0 Y=1 Y=0 W(X=1) W(Y=1) “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
  57. BASIC IDEA W(X=1) W(Y=1) R(X=?) R(Y=?) LET CLIENTS RACE, but

    HAVE READERS “CLEAN UP” X=1 [t=124, {Y}] LIMITED MULTI-VERSIONING + METADATA X=0 [t=0, {}] Y=1 [t=124, {X}] Y=0 [t=0, {}] R(X=1) “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
  58. BASIC IDEA W(X=1) W(Y=1) R(X=?) R(Y=?) LET CLIENTS RACE, but

    HAVE READERS “CLEAN UP” X=1 [t=124, {Y}] LIMITED MULTI-VERSIONING + METADATA X=0 [t=0, {}] Y=1 [t=124, {X}] Y=0 [t=0, {}] R(Y=0) R(X=1) “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
  60. BASIC IDEA W(X=1) W(Y=1) R(X=?) R(Y=?) LET CLIENTS RACE, but

    HAVE READERS “CLEAN UP” X=1 [t=124, {Y}] LIMITED MULTI-VERSIONING + METADATA X=0 [t=0, {}] Y=1 [t=124, {X}] Y=0 [t=0, {}] R(Y=0) ITEM HIGHEST TS X 124 Y 124 R(X=1) “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
  61. BASIC IDEA W(X=1) W(Y=1) R(X=?) R(Y=?) LET CLIENTS RACE, but

    HAVE READERS “CLEAN UP” X=1 [t=124, {Y}] LIMITED MULTI-VERSIONING + METADATA X=0 [t=0, {}] Y=1 [t=124, {X}] Y=0 [t=0, {}] R(Y=0) ITEM HIGHEST TS X 124 Y 124 R(X=1) R(Y=1) “Scalable Atomic Visibility with RAMP Transactions” SIGMOD 2014
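The read path shown in the slides above can be sketched as follows. This is a hedged illustration of the "readers clean up" idea: writes are stored as versions tagged [timestamp, sibling items], and a reader that observes a fractured read uses the metadata to fetch the missing versions in a second round. The data structures and names here are illustrative, not the paper's API.

```python
from collections import defaultdict

class Partition:
    """One partition's multi-versioned store (illustrative sketch)."""
    def __init__(self):
        self.versions = defaultdict(dict)   # item -> {ts: (value, siblings)}
        self.latest_committed = {}          # item -> highest committed ts
    def prepare(self, item, ts, value, siblings):
        self.versions[item][ts] = (value, siblings)
    def commit(self, item, ts):
        self.latest_committed[item] = max(self.latest_committed.get(item, 0), ts)
    def get(self, item, ts=None):
        ts = ts if ts is not None else self.latest_committed.get(item, 0)
        value, siblings = self.versions[item].get(ts, (None, frozenset()))
        return ts, value, siblings

def write_all(parts, ts, updates):
    items = frozenset(updates)
    for item, value in updates.items():   # phase 1: prepare everywhere
        parts[item].prepare(item, ts, value, items - {item})
    for item in updates:                  # phase 2: commit (may race readers)
        parts[item].commit(item, ts)

def ramp_read(parts, items):
    first = {i: parts[i].get(i) for i in items}       # round 1
    required = {i: 0 for i in items}                  # highest ts seen per item
    for ts, _, siblings in first.values():
        for sib in siblings & set(items):
            required[sib] = max(required[sib], ts)
    result = {}
    for i in items:                                   # round 2 only if stale
        ts, value, _ = first[i]
        if ts < required[i]:
            ts, value, _ = parts[i].get(i, required[i])
        result[i] = value
    return result

parts = {"X": Partition(), "Y": Partition()}
write_all(parts, 0, {"X": 0, "Y": 0})
# Simulate the race on the slide: t=124 prepared on both, only X committed.
parts["X"].prepare("X", 124, 1, frozenset({"Y"}))
parts["Y"].prepare("Y", 124, 1, frozenset({"X"}))
parts["X"].commit("X", 124)
print(ramp_read(parts, {"X", "Y"}))  # reader repairs Y: both X and Y read as 1
```

Note how atomic visibility is achieved without mutual exclusion: writers never block readers; the reader detects the fractured read from the sibling metadata and repairs it itself.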
  62. New-Order Transactions/s (EC2 cr1.8xlarge, 8 servers): Serializable locking: 47,852 (bottlenecks on coordination over the network); Coordination-avoiding implementation (RAMP with fast ID assignment): 632,589 (bottlenecks on CPU) “Coordination-Avoiding Database Systems” arXiv:1402.2237
  63. [Plot: Total Throughput (txn/s) vs. Number of Servers (0-200), scaling to ~14M txn/s]
  64. [Same plot] INDUSTRY-STANDARD TRANSACTIONAL WORKLOADS CAN SCALE JUST FINE*
  65. INDUSTRY-STANDARD TRANSACTIONAL WORKLOADS CAN SCALE JUST FINE* GIVEN THE RIGHT

    SYSTEM DESIGN CONCURRENCY PRIMITIVES ATTENTION TO SCALE MANY
  66. INDUSTRY-STANDARD TRANSACTIONAL WORKLOADS CAN SCALE JUST FINE* GIVEN THE RIGHT

    SYSTEM DESIGN CONCURRENCY PRIMITIVES ATTENTION TO SCALE LEVEL OF COORDINATION MANY
  67. THE NETWORK INCURS LATENCY THE NETWORK IS UNRELIABLE SO HOW

    CAN WE BUILD ROBUST AND SCALABLE DISTRIBUTED SYSTEMS?
  68. THE NETWORK INCURS LATENCY THE NETWORK IS UNRELIABLE SO HOW

    CAN WE BUILD ROBUST AND SCALABLE DISTRIBUTED SYSTEMS? UNDERSTAND COORDINATION
  69. COORDINATION AVOIDANCE: UNDERSTAND IF/WHEN COORDINATION IS REQUIRED
    INVARIANT CONFLUENCE (arXiv 2014): necessary and sufficient condition for c-free operation
    HIGHLY AVAILABLE TRANSACTIONS (CACM, VLDB 2014): what database isolation levels are coordination-free?
    RAMP ATOMIC VISIBILITY (SIGMOD 2014): fast and intuitive multi-put, multi-get, indexing
    BLOOM and BLAZES (ICDE 2014): language-level automated coordination analysis
    CRDTs and BLOOM^L (SoCC 2013, USENIX ATC 2014): correct-by-design distributed data types
    PBS INCONSISTENCY (VLDBJ 2014): how stale is data if we don't coordinate?
  70. Traditional distributed systems designs suffer from coordination bottlenecks. By understanding application requirements, we can avoid coordination. We can build systems that actually scale while providing correct behavior. Thanks! [email protected] @pbailis http://bailis.org/ http://amplab.cs.berkeley.edu/