Bad as I Wanna Be: Coordination and Consistency in Distributed Databases

pbailis
October 29, 2013

RICON West 2013

Talk video at: http://www.youtube.com/watch?v=_rAdJkAbGls

Transcript

  1. BAD AS I WANNA BE: Coordination and Consistency in Distributed Databases. Peter Bailis, UC Berkeley, @pbailis
  2. A portrait of big services: a stateless tier (horizontally scalable, soft state) and a stateful tier (concurrent, durable)
  3. Users have application-level properties that the database should maintain: “account balances should be positive,” “every patient should have a primary care physician,” “usernames should be unique.”
    Linearizability, causal consistency, PRAM, regular semantics, timeline consistency, and eventual consistency are not application properties.
    This talk: consistency is an application-level invariant over data.
  4. How do we maintain invariants despite concurrent accesses? Across multiple copies? Despite failures?
    This talk: consistency is an application-level invariant over data.
  5. Outline: Consistency is about applications; SSI’s Synchronization Shackles; Scaling: Just Do It; Freedom in a Database; “I Love Pain”; You and the Future
  6. Outline: Consistency is about applications; SSI’s Synchronization Shackles; Scaling: Just Do It; Freedom in a Database; “I Love Pain”; You and the Future
  7. Traditional answer: use a single system image
  8. Traditional answer: use a single system image (SSI), i.e., an equivalent serial execution. In ACID terms, Isolation provides Consistency: if users maintain invariants in isolation, consistency is guaranteed during execution. Serializable (SSI) execution means users implicitly maintain database state.
  9. Conflict serializability requires reasoning about low-level read/write traces.
  10. Given only reads and writes, two operations on the same key might be a problem (a conflict) if they are two writes, or a read and a write.
    [Diagram: conflict graph over transactions T1, T2, T3 with edges ww(c), rw(a), rw(c), rw(b); example operations write(a=1) and read(a=0).]
    Serializable isolation prevents these anomalies, but conflict serializability requires reasoning about low-level read/write traces.
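A minimal sketch of the textbook precedence-graph test behind conflict serializability, assuming a history is given as a list of hypothetical (transaction, op, key) tuples in execution order. Two operations conflict under the rule above (different transactions, same key, at least one write); the history is conflict serializable exactly when the resulting graph has no cycle.

```python
from collections import defaultdict


def conflict_edges(history):
    """Return precedence edges (Ti, Tj) for a history of (txn, op, key)
    tuples in execution order, where op is "r" or "w"."""
    edges = set()
    for i, (ti, op_i, key_i) in enumerate(history):
        for tj, op_j, key_j in history[i + 1:]:
            # Conflict: different transactions, same key, at least one write.
            if ti != tj and key_i == key_j and "w" in (op_i, op_j):
                edges.add((ti, tj))
    return edges


def is_conflict_serializable(history):
    """A history is conflict serializable iff its precedence graph is acyclic."""
    graph = defaultdict(set)
    for a, b in conflict_edges(history):
        graph[a].add(b)
    nodes = {txn for txn, _, _ in history}
    state = {n: "new" for n in nodes}  # DFS colors: new -> active -> done

    def acyclic_from(n):
        state[n] = "active"
        for m in graph[n]:
            if state[m] == "active":
                return False  # back edge: cycle found
            if state[m] == "new" and not acyclic_from(m):
                return False
        state[n] = "done"
        return True

    return all(acyclic_from(n) for n in sorted(nodes) if state[n] == "new")
```

For example, the interleaving r1(a) w2(a) r2(b) w1(b) yields edges T1→T2 (on a) and T2→T1 (on b) and is rejected, while running T1 entirely before T2 is accepted.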
  11. Problem: SSI requires coordination. One (or both) of these users must stall to preserve serializability.
  12. Problem: SSI requires coordination.
    Synchronous coordination = stalls during network partitions; RTT latency during operations; possible stalls during concurrent access.
    No (or asynchronous) coordination = “high availability” as defined by Gilbert and Lynch; low latency (no RTT); indefinite horizontal scaling (even for a single record; not a BS “scalability” claim). These benefits also apply to concurrent access in single-node systems.
  13. Minimum coordination means maximum scalability. How much coordination is necessary?
    DB result: serializability requires coordination. CAP result: linearizability requires coordination. (Fun fact: there are infinitely many models on either side of this trade-off. Less fun fact: there are many existing models to choose from.)
    Are these models always required for maintaining application-level consistency?
  14. Databases that do not support SSI/serializability: 8/18 databases surveyed did not, and 15/18 used weaker models by default.
    Serializability supported (YES): Actian Ingres, Greenplum, IBM DB2, IBM Informix, MySQL, MS SQL Server, Oracle BDB, Oracle BDB JE, Postgres 9.2.2, VoltDB.
    Not supported (NO): Aerospike, Persistit, Clustrix, MemSQL, NuoDB, Oracle 11G, SAP Hana, ScaleDB.
    Source: “Highly Available Transactions: Virtues and Limitations,” VLDB 2014
  15. Outline: Consistency is about applications; SSI’s Synchronization Shackles; Scaling: Just Do It; Freedom in a Database; “I Love Pain”; You and the Future
  16. When can we safely forgo coordination? Which anomalies matter? Answering requires information about the application: invariants I(DB) → {True, False} and operations T(DB) → DB.
  17. Invariant: each employee is in a department. Operations: add employees.
    Anomaly (to avoid): l_emp = employees.find(id=“louise”); l_dept = dept.find(l_emp.dept) → ENORECORD
  18. Invariant: each employee is in a department. Operations: add employees.
    Initial state: employees = {}, dept = {{“ops”:1}, {“dev”:2}}.
    One replica runs d2 = dept.find(“dev”); employees.add({“Sue”:d2}). Another concurrently runs d1 = dept.find(“ops”); employees.add({“Harry”:d1}).
    Merged state: employees = {{“Harry”:1}, {“Sue”:2}}, dept = {{“ops”:1}, {“dev”:2}}. Invariant holds!
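A small simulation of this example, assuming database states are Python dicts and that merging two replicas is a key-wise union (here neither replica writes a key the other writes). `invariant`, `add_employee`, and `merge` are hypothetical helpers, not an API from the talk.

```python
def invariant(db):
    """Every employee references an existing department ID."""
    dept_ids = set(db["dept"].values())
    return all(d in dept_ids for d in db["employees"].values())


def add_employee(db, name, dept_name):
    """Look up the department locally, then add the employee (no coordination)."""
    d = db["dept"][dept_name]
    return {"dept": dict(db["dept"]), "employees": {**db["employees"], name: d}}


def merge(a, b):
    """Assumed merge: key-wise union of both replicas' records."""
    return {"dept": {**a["dept"], **b["dept"]},
            "employees": {**a["employees"], **b["employees"]}}


initial = {"dept": {"ops": 1, "dev": 2}, "employees": {}}
replica1 = add_employee(initial, "Sue", "dev")    # runs on one replica
replica2 = add_employee(initial, "Harry", "ops")  # runs concurrently on another
merged = merge(replica1, replica2)
```

Both replicas and the merged state satisfy the invariant: adding employees only ever references departments that already exist, and no operation removes one.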
  19. Invariant: only one ops on staff at a time. Operations: change staffing.
    Anomaly (to avoid): on_duty = employees.find(staffed=“T”); assert(len(on_duty) == 1) → ASSERTION FAILS
  20. Invariant: only one ops on staff at a time. Operations: change staffing.
    Initial state: staff = {“Laura”:T, “Harry”:F, “Gary”:F}.
    One replica runs staff.set({“Laura”:F, “Gary”:T}); another concurrently runs staff.set({“Laura”:F, “Harry”:T}).
    Merged state: staff = {“Laura”:F, “Harry”:T, “Gary”:T}. Invariant violated!
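The same style of simulation for the staffing example, assuming hypothetical helpers: each replica's `staff.set` is modeled as a dict of writes, and the merged state applies the union of both replicas' writes to the initial state. Each replica is valid on its own; the merge is not.

```python
def one_on_duty(staff):
    """Invariant: exactly one person is on duty."""
    return sum(1 for on in staff.values() if on) == 1


def apply_writes(state, writes):
    """Return a new state with the given key -> value writes applied."""
    return {**state, **writes}


initial = {"Laura": True, "Harry": False, "Gary": False}
w1 = {"Laura": False, "Gary": True}   # replica 1: hand off to Gary
w2 = {"Laura": False, "Harry": True}  # replica 2: hand off to Harry
r1 = apply_writes(initial, w1)
r2 = apply_writes(initial, w2)
merged = apply_writes(initial, {**w1, **w2})  # union of both replicas' writes
```

The two write sets do not conflict key by key (both set Laura to F), so the merge is well defined; the invariant still breaks because it constrains the relationship between keys.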
  22. SAFETY: invariants hold across all states. LIVENESS: database states eventually agree (converge).
  23. Invariant $I$ and set of operations $T$ are coordination-free if, given initial state $D_i$, every pair of states $D_j$ and $D_k$ resulting from any two valid series of operations in $T$ applied to $D_i$ can be merged into a valid database state.*
    Single-step case (from diagram): $I$ and $T$ are coordination-free if $\forall t_1, t_2 \in T:\ I(D) \land I(t_1(D)) \land I(t_2(D)) \Rightarrow I(t_1(D) \sqcup t_2(D))$.
    Coordination-freedom is required for simultaneously maintaining application-level consistency, availability, and convergence.
    “I don’t live my life by anybody else’s clock. If I feel like doing something, I don’t care what time it is. I just do it.” --Dennis Rodman, Bad As I Wanna Be. New York: Delacorte Press, 1996. Print.
    *N.B. for readers concerned with the formalism: assume that database states are sets of mutations, such that ⊔ is set union and each operation simply adds to the set of mutations (a bit more on ⊔ is coming up). Also, this formalism as presented is overdone and not particularly elegant; with more space/time, it can be simplified. Feel free to email me.
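The single-step condition can be checked by brute force over a small, hand-picked set of states. This is a sketch under the N.B.'s assumptions (states are sets of mutations, ⊔ is set union); `coordination_free` and the username example are hypothetical, and a finite sample of states can only find counterexamples, not prove coordination-freedom in general.

```python
from itertools import combinations_with_replacement


def coordination_free(invariant, operations, states):
    """Single-step check of I(D) ∧ I(t1(D)) ∧ I(t2(D)) ⇒ I(t1(D) ⊔ t2(D)),
    with states as frozensets of mutations and ⊔ as set union.
    Returns False on the first counterexample found in `states`."""
    for d in states:
        if not invariant(d):
            continue  # the condition only constrains valid starting states
        for t1, t2 in combinations_with_replacement(operations, 2):
            d1, d2 = t1(d), t2(d)
            if invariant(d1) and invariant(d2) and not invariant(d1 | d2):
                return False
    return True


def unique_names(db):
    """Invariant: no two records share a username."""
    names = [name for _, name in db]
    return len(names) == len(set(names))


def insert(row_id, name):
    """Operation that adds one (row_id, name) mutation to the state."""
    return lambda db: db | {(row_id, name)}


states = [frozenset(), frozenset({(0, "bob")})]
# Clients choose the username: two replicas may both pick "alice".
specific = [insert(1, "alice"), insert(2, "alice")]
# The system derives the name from a unique ID: names never collide.
generated = [insert(1, "user-1"), insert(2, "user-2")]
```

With `specific`, each replica's insert is valid alone but their union holds two “alice” records, so the check fails; with `generated`, no sampled pair violates the invariant.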
  24. To maintain consistency... (sufficient? / necessary? / application-level?)
    Conflict serializability: sufficient yes; necessary no; application-level no.
    State-based commutativity: sufficient yes*; necessary no; application-level depends.
    Coordination-freedom: sufficient yes; necessary yes; application-level yes.
    Coordination-freedom is required for simultaneously maintaining application-level consistency, availability, and convergence.
    “Everybody wants to stop Dennis Rodman.” --Dennis Rodman, Bad As I Wanna Be
  25. A formal framework for reasoning about application coordination requirements. Coordination depends on the combination of (1) the expressiveness of operations and (2) the strength of invariants.
    [Diagram: axes STRENGTH OF INVARIANTS and EXPRESSIVENESS OF OPERATIONS, with regions COORDINATION-FREE and COORDINATION REQUIRED.]
    *Okay, so this is simplified, and there isn’t really a linear order on either axis (rather, it’s more about equivalence classes), but humor me here...
  26. Outline: Consistency is about applications; SSI’s Synchronization Shackles; Scaling: Just Do It; Freedom in a Database; “I Love Pain”; You and the Future
  27. Constraint: record IDs are unique. DECLARE TABLE users (ID int UNIQUE, FirstName string, LastName string)
    Anomaly:
  28. Constraint: record IDs are unique. DECLARE TABLE users (ID int UNIQUE, FirstName string, LastName string)
    Operation: insert record with specific ID. INSERT INTO users (ID, firstname, lastname) VALUES (1, “Leslie”, “Lamport”) → NOT C-FREE
    Operation: insert record with sequential ID. DECLARE TABLE users (ID int UNIQUE AUTO_INCREMENT ...) → NOT C-FREE (don’t have to abort, just have to coordinate on commit)
    Operation: insert record. INSERT INTO users (firstname, lastname) VALUES (“Leslie”, “Lamport”); let the DB decide the ID, using a node ID or UUID → C-FREE!
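One way to let the database choose IDs without coordination, sketched under the assumption that every node is assigned a distinct `node_id` up front (the class and function names are hypothetical, and the sketch is single-threaded). IDs from different nodes can never collide, but they are not dense or globally sequential, which is what AUTO_INCREMENT would require.

```python
import itertools
import uuid


class IdGenerator:
    """Coordination-free unique IDs as (node_id, local_counter) pairs.
    Assumes node_id values are distinct across nodes."""

    def __init__(self, node_id):
        self.node_id = node_id
        self._counter = itertools.count()  # purely local, never shared

    def next_id(self):
        return (self.node_id, next(self._counter))


def random_id():
    """Alternative: a random UUID. Collisions are possible in principle
    but vanishingly unlikely, so no node has to ask any other."""
    return uuid.uuid4()
```

Two nodes can each call `next_id()` during a partition; because the node ID is part of every identifier, their IDs remain distinct when the partition heals.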
  29. Foreign key constraints:
    DECLARE TABLE users (U_ID int UNIQUE, D_ID int, UserName string, FOREIGN KEY (D_ID) REFERENCES department(D_ID))
    DECLARE TABLE department (D_ID int UNIQUE, DeptName string)
    NEW_D_ID = INSERT INTO department VALUES (“badass division”);
    INSERT INTO users (D_ID, UserName) VALUES (NEW_D_ID, “lamport”);
    Anomalies: an EMPTY “badass division” department; “lamport” has no department.
  30. Foreign key constraints (schema and transaction as in slide 29), split across a users shard and a department shard. The department shard stores (342, “badass division”) and the users shard stores (402, 342, “lamport”), both tagged txid=5; each write is first not yet visible to all readers, then visible to all readers.
    2 RTT writes (prepare and make visible); between 1 and 2 RTTs for reads.
    Magic trick: store metadata to record sibling writes.
  31. ATOMICALLY VISIBLE MULTI-PUT and -GET ACROSS MULTIPLE SHARDS without LOCKING or BLOCKING (RAMP: Read Atomic Multi-Partition Transactions).
    2 RTT writes (prepare and make visible); between 1 and 2 RTTs for reads. Magic trick: store metadata to record sibling writes.
    N.B.: This leverages 2PC protocols, but it’s more than 2PC. Individual rounds can block, but readers resolve incomplete commits autonomously.
    Also applicable to: distributed secondary indexing; materialized views (e.g., pre-computed aggregates, alerts); multi-entity update (e.g., Tao, Espresso, PNUTS); cheap snapshot reads aligned along transaction boundaries.
    Key: design with coordination-freedom as the primary goal.
    Interested parties: paper in pipeline; contact me. (O(1) to O(N) metadata-efficiency trade-off.) http://www.bailis.org/blog/non-blocking-transactional-atomicity/
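A single-process sketch of the “store metadata to record sibling writes” idea, loosely in the spirit of RAMP rather than the published algorithm. Assumptions: one hypothetical `Shard` object per key, transaction IDs that are unique and increasing, and no garbage collection of old versions. Writers prepare on every shard before committing anywhere, so any reader that sees one committed sibling can fetch the others' prepared versions in a second round.

```python
class Shard:
    """Stores every prepared version plus the latest committed txid per key."""

    def __init__(self):
        self.versions = {}   # (key, txid) -> (value, sibling_keys)
        self.committed = {}  # key -> highest committed txid

    def prepare(self, key, txid, value, siblings):
        self.versions[(key, txid)] = (value, siblings)  # stored, not yet visible

    def commit(self, key, txid):
        if txid > self.committed.get(key, -1):
            self.committed[key] = txid

    def get(self, key, txid=None):
        if txid is None:  # default: latest committed version
            txid = self.committed.get(key)
            if txid is None:
                return None
        value, siblings = self.versions[(key, txid)]
        return txid, value, siblings


def put_all(shards, txid, writes):
    siblings = frozenset(writes)  # metadata: every key this transaction wrote
    for key, value in writes.items():  # round 1: prepare everywhere
        shards[key].prepare(key, txid, value, siblings)
    for key in writes:  # round 2: make visible
        shards[key].commit(key, txid)


def get_all(shards, keys):
    first = {k: shards[k].get(k) for k in keys}  # round 1: committed versions
    # Sibling metadata says which txid each requested key must at least reflect.
    required = {k: -1 for k in keys}
    for result in first.values():
        if result is None:
            continue
        txid, _, siblings = result
        for s in siblings:
            if s in required:
                required[s] = max(required[s], txid)
    values = {}
    for k in keys:
        result = first[k]
        seen = result[0] if result else -1
        if required[k] > seen:  # round 2: fetch the missing sibling version
            result = shards[k].get(k, required[k])
        values[k] = result[1] if result else None
    return values
```

If a writer has committed txid 5 on the users shard but not yet on the department shard, a reader still returns both “lamport” and “badass division”: the users row's metadata names the department key, and the prepared version is already stored there.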
  32. LIVENESS: CRDTs, CALM, and immutability guarantee a well-defined merge (sometimes with a deterministic outcome)... but few safety guarantees (e.g., can’t safely read)!
    SAFETY (invariants hold across all states) + LIVENESS
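A state-based PN-counter illustrates the point: merge is well defined and replicas converge, but a threshold check made locally can still be wrong after merging. This is the standard CRDT construction, sketched with hypothetical class and method names.

```python
class PNCounter:
    """State-based PN-counter: per-replica tallies of increments and decrements."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.p = {}  # replica -> total increments
        self.n = {}  # replica -> total decrements

    def increment(self, amount=1):
        self.p[self.replica_id] = self.p.get(self.replica_id, 0) + amount

    def decrement(self, amount=1):
        self.n[self.replica_id] = self.n.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for mine, theirs in ((self.p, other.p), (self.n, other.n)):
            for r, c in theirs.items():
                mine[r] = max(mine.get(r, 0), c)
```

If replica a increments once and replica b merges that state, both read 1. Concurrent decrements then leave each replica at a local value of 0, which passes a `value() >= 0` check; after merging in both directions, both converge to -1 and the check no longer holds.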
  33. A formal framework for reasoning about application coordination requirements. Invariant / operation: coordination-free (C.F.)?
    Equality, inequality / any: Y
    Generate unique ID / any: Y
    Specify unique ID / insert: N
    > (greater than) / increment: Y
    > / decrement: N
    < (less than) / decrement: Y
    < / increment: N
    Foreign key / insert: Y
    Foreign key / delete: Y*
    Secondary indexing / any: Y
    Materialized views / any: Y
    AUTO_INCREMENT / insert: N
    Annotations: “Use that sweet Riak AP,” “RAMP Transaction,” “Check those CRDTs.” Remainder: get yourself a CAS.
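To see why the table marks > with increment as coordination-free but > with decrement as not, model the counter as a set of mutations, so that merging two replicas that started from base value v yields v plus both replicas' changes. With threshold k and replica deltas δ1, δ2:

```latex
\begin{aligned}
&\text{Increments } (\delta_1, \delta_2 \ge 0):\quad
  v + \delta_1 > k \;\Rightarrow\; v + \delta_1 + \delta_2 \ge v + \delta_1 > k. \\
&\text{Decrements } (\delta_1, \delta_2 \le 0):\quad
  v = 2,\ k = 0,\ \delta_1 = \delta_2 = -1 \text{ gives } v + \delta_1 = v + \delta_2 = 1 > 0, \\
&\qquad \text{but } v + \delta_1 + \delta_2 = 0 \not> 0.
\end{aligned}
```

The < rows follow by symmetry: decrements can only move the merged value further below the bound, while increments from two replicas can add up past it.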
  34. Outline: Consistency is about applications; SSI’s Synchronization Shackles; Scaling: Just Do It; Freedom in a Database; “I Love Pain”; You and the Future
  36. “I’m no good working from a comfort zone. I need pain. I love pain.” --Dennis Rodman, Bad As I Wanna Be
  37. TPC-C: combine foreign keys with sequence number insert on commit...
  38. TPC-C New-Order:
    Pre-materialized aggregates (e.g., W_YTD = SUM(orders for warehouse)): RAMP transaction on a counter CRDT.
    [Diagram: warehouse, district, orders, neworders; “+100” and “insert 100”.]
  39. TPC-C New-Order:
    Foreign key insert (e.g., NewOrder, Orders tables): RAMP transaction across tables.
    Pre-materialized aggregates (e.g., W_YTD = SUM(orders for warehouse)): RAMP transaction on a counter CRDT.
    [Diagram: warehouse, district, orders, neworders; insert O_ID.]
  40. TPC-C New-Order:
    Foreign key insert (e.g., NewOrder, Orders tables): RAMP transaction across tables.
    Pre-materialized aggregates (e.g., W_YTD = SUM(orders for warehouse)): RAMP transaction on a counter CRDT.
    Sequence number ID assignment (i.e., D_NEXT_O_ID): a deferred atomic incrementAndGet() on commit assigns the new O_ID.
    [Diagram: warehouse, district, orders, neworders; insert O_ID.]
  41. TPC-C New-Order:
    Foreign key insert (e.g., NewOrder, Orders tables): RAMP transaction across tables.
    Pre-materialized aggregates (e.g., W_YTD = SUM(orders for warehouse)): RAMP transaction on a counter CRDT.
    Sequence number ID assignment (i.e., D_NEXT_O_ID): a deferred atomic incrementAndGet() on commit assigns the new O_ID.
    Temp IDs: rewrite FK references to point to a temp unique ID; create a local index from temp unique ID to sequence ID.
    [Diagram: warehouse, district, orders, neworders; insert O_ID; tmp ID.]
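A single-process sketch of this New-Order rewrite, under stated assumptions: `DistrictSequence` and `District` are hypothetical, a `threading.Lock` stands in for whatever linearizable master serves D_NEXT_O_ID, the RAMP and CRDT writes are elided, and `District` itself is single-threaded. Rows are written against a temp UUID, and only the commit step calls `increment_and_get()`.

```python
import threading
import uuid


class DistrictSequence:
    """Hypothetical stand-in for the D_NEXT_O_ID counter. The lock models
    the one operation that still needs synchronous coordination."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment_and_get(self):
        with self._lock:
            self._value += 1
            return self._value


class District:
    """Single-threaded sketch of one district's New-Order path."""

    def __init__(self):
        self.seq = DistrictSequence()
        self.orders = {}      # temp ID -> order row
        self.new_orders = []  # NewOrder rows: the FK points at the temp ID
        self.tmp_to_oid = {}  # local index: temp ID -> sequential O_ID

    def new_order(self, c_id, qty):
        tmp = uuid.uuid4().hex  # unique without coordination
        # Coordination-free phase: write rows whose FK references use tmp.
        self.orders[tmp] = {"C_ID": c_id, "O_QTY": qty}
        self.new_orders.append(tmp)
        # Commit: the only synchronous step assigns the sequential O_ID.
        o_id = self.seq.increment_and_get()
        self.tmp_to_oid[tmp] = o_id
        return o_id

    def o_id_of(self, tmp):
        """Translate a temp ID to its sequential O_ID via the local index."""
        return self.tmp_to_oid[tmp]
```

The design choice is to keep the critical section as small as possible: everything that touches other tables happens before the sequence number exists, so the lock is held only for a single increment.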
  42. TPC-C New-Order:
    Foreign key insert (e.g., NewOrder, Orders tables): RAMP transaction across tables.
    Pre-materialized aggregates (e.g., W_YTD = SUM(orders for warehouse)): RAMP transaction on a counter CRDT.
    Sequence number ID assignment (i.e., D_NEXT_O_ID): a deferred atomic incrementAndGet() on commit assigns the new O_ID.
    Temp IDs: rewrite FK references to point to a temp unique ID; create a local index from temp unique ID to sequence ID.
    The deferred incrementAndGet() is the ONLY SYNCH COORDINATION REQUIRED.
  43. “You can like me or hate me. But all I can say is, when I get on that damn floor, all I’m going to do is get solid.” --Dennis Rodman, Bad As I Wanna Be
  44. Linear Scaling via Minimized Coordination. Coordination need not be a bottleneck (if implemented in a coordination-free manner).
    Setup: UC Berkeley database prototype (~2,500 lines of Java; no magic in implementation) on 100 EC2 CC2.8xlarge instances (thank you, AWS folks!); linearizable masters; CPU-bound on in-memory data; 120 clients/warehouse, 5 warehouses/machine, no THINK TIME (i.e., more contention than the stock configuration).
    Only one non-C.F. (blocking) operation: incrementAndGet() on the D_NEXT_O_ID (“district next order ID”) key.
    Single-node performance is currently poor, but that is unimportant if you can scale out (for the time being).
  45. Outline: Consistency is about applications; SSI’s Synchronization Shackles; Scaling: Just Do It; Freedom in a Database; “I Love Pain”; You and the Future
  46. Standard SQL with extensions and analysis:
    CREATE TABLE Orders (O_ID int AUTO_INCREMENT, C_ID int, O_QTY int, DATE datetime NOT NULL, PRIMARY KEY (O_ID), FOREIGN KEY (C_ID) REFERENCES Customers(C_ID), CONSTRAINT [O_QTY > 0])
    CREATE PROCEDURE CreateOrder(@C_ID int, @O_QTY int) AS INSERT INTO Orders (C_ID, O_QTY, DATE) VALUES (@C_ID, @O_QTY, NOW()); GO
    > WARNING: Orders.O_ID requires coordination! INSERT found in CreateOrder
    > WARNING: CreateOrder requires remote check for @C_ID!
  47. How do I web scale?
  48. 1.) Maximize safe concurrency: analyze operations and invariants for coordination-freedom; delay synchronization when possible. Next level: automated analysis.
    2.) Minimize distribution of conflicts: resolve conflicts using as few servers (space) and with as short a critical section (time) as possible. Use pessimistic locking, optimistic execution with validation, or rewrite queries to be coordination-free. Next level: automated conflict avoidance, rewriting.
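For the operations that do need coordination, optimistic execution with validation keeps the critical section to a single validate-and-write step. A minimal in-memory sketch with hypothetical names (`VersionedStore`, `increment_and_get`); a real system would validate on the server that owns each key.

```python
import threading


class VersionedStore:
    """Minimal OCC store: reads return (version, value); commit validates
    that every version read is still current before applying writes."""

    def __init__(self):
        self._data = {}                # key -> (version, value)
        self._lock = threading.Lock()  # short critical section: validate + write

    def read(self, key):
        return self._data.get(key, (0, None))

    def commit(self, read_set, writes):
        """read_set: key -> version observed; writes: key -> new value.
        Returns True if committed, False if validation failed."""
        with self._lock:
            for key, version in read_set.items():
                if self._data.get(key, (0, None))[0] != version:
                    return False  # someone committed in between: abort
            for key, value in writes.items():
                old_version = self._data.get(key, (0, None))[0]
                self._data[key] = (old_version + 1, value)
            return True


def increment_and_get(store, key):
    while True:  # retry until validation succeeds
        version, value = store.read(key)
        new = (value or 0) + 1
        if store.commit({key: version}, {key: new}):
            return new
```

Work outside `commit` proceeds without holding the lock; conflicting transactions pay only a retry, which suits the low-contention case and degrades under heavy contention on a single key.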
  49. http://martinfowler.com/articles/nosql-intro-original.pdf
  50. Get a CAS/OCC/lock manager. Get some CRDTs/RAMPs/EC. Lots of re-use: query model, local persistence, cluster membership, sharding protocol, failure detection, metrics, monitoring, administration. Use one system!
  51. “Polyglot Persistence” is apt for 2013, but... it’s an anti-pattern: it introduces unnecessary complexity, the fundamental differences are small, and it is symptomatic of an immature ecosystem.
    ...if Polyglot Persistence for online data serving is still standard for non-legacy apps in 2023, the OSS DB community will have failed. Building correct, reliable, and high-performance databases is hard and takes time, lol.
  52. Joint work with great folks including: Aaron Davidson, Ali Ghodsi, Alan Fekete, Mike Franklin, Joe Hellerstein, Ion Stoica. Special thanks to: Peter Alvaro, Phil Bernstein, Dan Bruckner, Neil Conway, Robert Hodges, Evan Sparks, Doug Terry.
    “The game matters... This is a great game.” --Dennis Rodman, Bad As I Wanna Be
  53. Coordination-freedom is a necessary and sufficient condition for availability and indefinite scalability (more precisely: for validity, availability, and convergence). Application-level correctness criteria are often, but not always, maintainable without coordination.
    Your (future) database can manage this for you. Reason about your application, not your database replication protocol. Hint: you probably won’t need a different database for each case.
    Know what “consistency” means to your application. Hint: linearizability is not an application-level concept. Hint: you can’t “beat” CAP when “C” means SSI or linearizability.
    There is a fundamental trade-off between limited coordination and application-level consistency.