balances should be positive”
“every patient should have a primary care physician”
“usernames should be unique”

Linearizability, causal consistency, PRAM, regular semantics, timeline consistency, and eventual consistency are not application properties.

This talk: Consistency is an application-level invariant over data
ACID: Isolation provides Consistency

users maintain invariants in isolation, consistency is guaranteed during execution

serializable (SSI) execution means users implicitly maintain database state
same key - any read and write to same key might be a problem (conflict)

[Diagram: read/write trace over T1, T2, T3 (e.g., write(a=1), read(a=0)) and its conflict graph with edges ww(c), rw(a), rw(c), rw(b)]

Conflict serializability requires reasoning about low-level read/write traces

serializable isolation prevents anomalies
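The conflict rule on this slide (two operations from different transactions on the same key, at least one of them a write) can be turned into a small checker. The following is a minimal sketch, not taken from the talk: the trace format and the function names `conflict_edges`, `has_cycle`, and `is_conflict_serializable` are assumptions made for illustration.

```python
from itertools import combinations


def conflict_edges(trace):
    """Build conflict-graph edges from a trace of (txn, op, key) tuples.

    `op` is "r" or "w". Two operations conflict when they come from
    different transactions, touch the same key, and at least one is a write.
    The edge points from the earlier operation's transaction to the later one.
    """
    edges = set()
    for (_, (t1, op1, k1)), (_, (t2, op2, k2)) in combinations(enumerate(trace), 2):
        if t1 != t2 and k1 == k2 and "w" in (op1, op2):
            edges.add((t1, t2))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed conflict graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    visiting, done = set(), set()

    def dfs(node):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        if any(dfs(nxt) for nxt in graph[node]):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(node) for node in graph)


def is_conflict_serializable(trace):
    """A trace is conflict serializable iff its conflict graph is acyclic."""
    return not has_cycle(conflict_edges(trace))


# T1 reads a before T2 writes it, but T2 reads b before T1 writes it: a cycle.
interleaved = [("T1", "r", "a"), ("T2", "w", "a"), ("T2", "r", "b"), ("T1", "w", "b")]
# The same operations run one transaction at a time.
serial = [("T1", "r", "a"), ("T1", "w", "b"), ("T2", "w", "a"), ("T2", "r", "b")]
```

The checker only looks at keys and operation types, which is the point the slide makes: it reasons about read/write traces and knows nothing about the application's invariants.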
during operations = possible stall during concurrent access

no (or asynchronous) coordination
= Gilbert and Lynch “High Availability”
= low latency (no RTT)
= indefinite horizontal scaling (even for a single record; not BS “scalability” claim)

benefits also apply to concurrent access in single-node systems

Problem: SSI requires coordination
DB results: serializability requires coordination
CAP result: linearizability requires coordination

(fun fact: infinite number of models on either side of this trade-off; less fun fact: there are many existing models to choose from)

Are these models always required for maintaining application-level consistency?
Database           Serializability available?
Persistit          NO
Clustrix           NO
Greenplum          YES
IBM DB2            YES
IBM Informix       YES
MySQL              YES
MemSQL             NO
MS SQL Server      YES
NuoDB              NO
Oracle 11G         NO
Oracle BDB         YES
Oracle BDB JE      YES
Postgres 9.2.2     YES
SAP Hana           NO
ScaleDB            NO
VoltDB             YES

8/18 databases surveyed did not
15/18 used weaker models by default

“Highly Available Transactions: Virtues and Limitations,” VLDB 2014
{}
dept = {{“ops”:1}, {“dev”:2}}

Invariant: each employee is in a department
Operations: add employees

d2 = dept.find(“dev”)
employees.add({“Sue”:d2})

d1 = dept.find(“ops”)
employees.add({“Harry”:d1})

Invariant holds!
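The employee example can be replayed as two replicas that each add an employee without talking to each other and then merge. This is a minimal sketch under assumptions: `dept` is modeled as a plain dict rather than the slide's set of records, and `invariant`, `replica_a`, `replica_b`, and `merged` are names introduced here for illustration.

```python
# Departments, keyed by name (a simplification of the slide's dept structure).
dept = {"ops": 1, "dev": 2}


def invariant(employees, departments):
    """Each employee is in a department that exists."""
    return all(d in departments.values() for d in employees.values())


# Each replica starts from the same empty employee set and adds one employee.
replica_a = {"Sue": dept["dev"]}
replica_b = {"Harry": dept["ops"]}

# Merge is a union of the two replicas' additions.
merged = {**replica_a, **replica_b}
```

Both replicas satisfy the invariant on their own, and so does the merged state, because adding an employee never removes a department that another employee depends on.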
“I don’t live my life by anybody else’s clock. If I feel like doing something, I don’t care what time it is. I just do it.”
--Dennis Rodman, Bad As I Wanna Be. New York: Delacorte Press, 1996. Print.

Coordination-freedom is required for simultaneously maintaining application-level consistency, availability, and convergence

Invariant $I$ and set of operations $T$ are coordination-free if, given initial state $D_i$, every pair of states $D_j$ and $D_k$ resulting from any two valid series of operations in $T$ applied to $D_i$ can be merged into a valid database state*

Single-step(s) case (from diagram): Invariant $I$ and set of operations $T$ are coordination-free if

$$\forall t_1, t_2 \in T:\; I(D) \wedge I(t_1(D)) \wedge I(t_2(D)) \Rightarrow I(t_1(D) \sqcup t_2(D))$$

*states are sets of mutations, such that $\sqcup$ is set union and each operation simply adds to the set of mutations; a bit more on $\sqcup$ is coming up; also, this formalism as presented is overdone and not particularly elegant; with more space/time, this can be simplified--feel free to email me)
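The single-step condition can be checked by brute force for one starting state. The sketch below assumes, as the footnote does, that states are sets of mutations and that merge is set union; the helper names (`is_coordination_free`, `register`, `deposit`, and the two invariants) are hypothetical, and checking a single state D is weaker than the definition, which quantifies over all reachable states.

```python
def is_coordination_free(invariant, operations, state):
    """Check I(D) and I(t1(D)) and I(t2(D)) implies I(t1(D) | t2(D)) for one D."""
    if not invariant(state):
        return True  # the implication holds vacuously
    for t1 in operations:
        for t2 in operations:
            s1, s2 = t1(state), t2(state)
            if invariant(s1) and invariant(s2) and not invariant(s1 | s2):
                return False
    return True


def usernames_unique(state):
    """Invariant over (username, owner) mutations: no username appears twice."""
    names = [name for name, _owner in state]
    return len(names) == len(set(names))


def register(name, owner):
    """Operation: add a (username, owner) mutation to the state."""
    return lambda state: state | {(name, owner)}


def deposits_positive(state):
    """Invariant over (account, amount) mutations: every amount is positive."""
    return all(amount > 0 for _account, amount in state)


def deposit(account, amount):
    """Operation: add an (account, amount) mutation to the state."""
    return lambda state: state | {(account, amount)}


empty = frozenset()
unique_ok = is_coordination_free(
    usernames_unique, [register("bob", "alice"), register("bob", "carol")], empty
)
deposits_ok = is_coordination_free(
    deposits_positive, [deposit("acct1", 3), deposit("acct2", 7)], empty
)
```

Registering the same username from two places is valid on each side but invalid once merged, so `unique_ok` is False. The deposit invariant constrains each mutation on its own, so no merge of valid states can break it and `deposits_ok` is True.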
To maintain consistency...

Yes* No Depends
Coordination-Freedom: Yes Yes Yes

“Everybody wants to stop Dennis Rodman.”
--Dennis Rodman, Bad As I Wanna Be
on combination of:
- expressiveness of operations
- strength of invariants

[Figure: plane with axes STRENGTH OF INVARIANTS and EXPRESSIVENESS OF OPERATIONS, split into a COORDINATION REQUIRED region and a COORDINATION-FREE region*]

*Okay, so this is simplified, and there isn’t really a linear order on either axis (rather, it’s more about equivalence classes), but humor me here...
Constraint: record IDs are unique

DECLARE TABLE users (
  ID int UNIQUE,
  FirstName string,
  LastName string
)

Operation: insert record (UUID): C-FREE!
  INSERT INTO users (firstname, lastname) VALUES (“Leslie”, “Lamport”)

Operation: insert record with specific ID: NOT C-FREE
  INSERT INTO users (ID, firstname, lastname) VALUES (1, “Leslie”, “Lamport”)

Operation: insert record with sequential ID: NOT C-FREE
  DECLARE TABLE users ( ID int UNIQUE AUTO_INCREMENT ...
  don’t have to abort, just have to coordinate on commit
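The difference between the insert variants shows up as soon as two replicas insert concurrently and merge. A minimal sketch, assuming each replica is a dict from ID to record; `insert_with_uuid`, `insert_with_sequential_id`, and `merge` are illustrative helpers, not part of any database API, and the second record's name is made up for the example.

```python
import uuid


def insert_with_uuid(replica, first, last):
    """Generate the ID locally; no other replica needs to be consulted."""
    replica[uuid.uuid4()] = (first, last)


def insert_with_sequential_id(replica, first, last):
    """Pick the next ID from local state only, as an uncoordinated AUTO_INCREMENT would."""
    next_id = max(replica, default=0) + 1
    replica[next_id] = (first, last)


def merge(a, b):
    """Union two replicas and report IDs claimed by different records."""
    conflicts = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return {**a, **b}, conflicts


# Two replicas insert concurrently with locally generated UUIDs.
a, b = {}, {}
insert_with_uuid(a, "Leslie", "Lamport")
insert_with_uuid(b, "Dennis", "Rodman")
uuid_merged, uuid_conflicts = merge(a, b)

# The same inserts with sequential IDs chosen without coordination.
c, d = {}, {}
insert_with_sequential_id(c, "Leslie", "Lamport")
insert_with_sequential_id(d, "Dennis", "Rodman")
seq_merged, seq_conflicts = merge(c, d)
```

With UUIDs both records survive the merge. With sequential IDs both replicas pick ID 1, so the uniqueness constraint can only be kept by coordinating on commit.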
  D_ID int,
  UserName string,
  FOREIGN KEY (D_ID) REFERENCES department(D_ID)
)

DECLARE TABLE department (
  D_ID int UNIQUE,
  DeptName string
)

NEW_D_ID = INSERT INTO department VALUES (“badass division”);
INSERT INTO users (D_ID, UserName) VALUES (NEW_D_ID, “lamport”);

Anomalies:
- EMPTY “badass division” department
- “lamport” has no department
  U_ID int UNIQUE,
  D_ID int,
  UserName string,
  FOREIGN KEY (D_ID) REFERENCES department(D_ID)
)

DECLARE TABLE department (
  D_ID int UNIQUE,
  DeptName string
)

NEW_D_ID = INSERT INTO department VALUES (“badass division”);
INSERT INTO users (D_ID, UserName) VALUES (NEW_D_ID, “lamport”);

[Diagram: users shard and department shard, each holding versions marked “Visible to all readers” and “Not yet visible to all readers”; the new record (402, 342, “lamport”) is tagged txid=5 on both shards]

- 2 RTT writes (prepare and make visible)
- Between 1-2 RTTs for reads
- Magic trick: store metadata to record sibling writes
ATOMICALLY VISIBLE MULTI-PUT, -GET ACROSS MULTIPLE SHARDS without LOCKING, BLOCKING
(RAMP: Read Atomic Multi-Partition Transactions)

Magic trick: store metadata to record sibling writes

Also applicable to:
--Distributed secondary indexing
--Materialized views (e.g., pre-computed aggregates, alerts)
--Multi-entity update (e.g., Tao, Espresso, PNUTS)
--Cheap snapshot reads aligned along transaction boundaries

Key: design with coordination-freedom as primary goal

N.B.: This leverages 2PC protocols, but it’s more than 2PC. Individual rounds can block, but readers resolve incomplete commits autonomously

Interested parties: paper in pipeline; contact me ([O(1) to O(N) metadata-efficiency trade-off])
http://www.bailis.org/blog/non-blocking-transactional-atomicity/
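The "store metadata to record sibling writes" idea can be illustrated with a single-process simulation. This is a simplified sketch, not the talk's implementation: the `Shard` class, its method names, the key names, and the use of integer txids as version timestamps are all assumptions. Writes take two rounds (prepare, then make visible), and a reader that sees one write of a transaction uses its sibling metadata to fetch the matching version from a shard where the commit has not yet landed.

```python
class Shard:
    """One partition holding versioned values plus sibling-write metadata."""

    def __init__(self):
        self.versions = {}     # (key, txid) -> (value, sibling keys of that txn)
        self.last_commit = {}  # key -> newest txid visible to all readers

    def prepare(self, key, txid, value, siblings):
        """Round 1: store the version without making it visible."""
        self.versions[(key, txid)] = (value, frozenset(siblings))

    def commit(self, key, txid):
        """Round 2: make the prepared version visible."""
        self.last_commit[key] = max(self.last_commit.get(key, 0), txid)

    def get_latest(self, key):
        txid = self.last_commit.get(key, 0)
        value, siblings = self.versions.get((key, txid), (None, frozenset()))
        return txid, value, siblings

    def get_version(self, key, txid):
        return self.versions[(key, txid)][0]


def atomic_read(shards, keys):
    """Read `keys` so that a transaction's writes are seen all together or not at all."""
    first = {k: shards[k].get_latest(k) for k in keys}  # first round
    result = {}
    for k in keys:
        txid, value, _ = first[k]
        # Newest txid among the returned versions that also wrote k.
        required = max([t for t, _, sibs in first.values() if k in sibs], default=0)
        if required > txid:
            value = shards[k].get_version(k, required)  # second round
        result[k] = value
    return result


users, department = Shard(), Shard()
shards = {"users:402": users, "department:342": department}
siblings = set(shards)

# txid=5 prepares on both shards but has only been made visible on the users shard.
users.prepare("users:402", 5, "lamport", siblings)
department.prepare("department:342", 5, "badass division", siblings)
users.commit("users:402", 5)

result = atomic_read(shards, ["users:402", "department:342"])
```

The reader never waits for the writer: the prepared version is already on the department shard, and the sibling metadata tells the reader which version to ask for. That is how reads end up costing between one and two round trips.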
Invariant               Operation    C.F.?
Equality, Inequality    Any          Y
Generate unique ID      Any          Y
Specify unique ID       Insert       N
>                       Increment    Y
>                       Decrement    N
<                       Decrement    Y
<                       Increment    N
Foreign Key             Insert       Y
Foreign Key             Delete       Y*
Secondary Indexing      Any          Y
Materialized Views      Any          Y
AUTO_INCREMENT          Insert       N

Callouts: yourself a CAS; Use that sweet Riak AP; RAMP Transaction; Check those CRDTs
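The ">" rows of the table can be checked with a few lines of arithmetic. A minimal sketch, assuming a counter whose merged value applies every replica's deltas exactly once (in the spirit of a counter CRDT); the function names and the numbers are illustrative.

```python
def apply_all(initial, deltas):
    """Value seen by one replica after applying its own deltas."""
    return initial + sum(deltas)


def merge_counters(initial, deltas_a, deltas_b):
    """Merged value: each replica's deltas are applied exactly once."""
    return initial + sum(deltas_a) + sum(deltas_b)


def stays_positive(value):
    """Invariant of the form 'counter > 0'."""
    return value > 0


initial = 10

# Concurrent increments: valid on each replica and valid after merge.
inc_a, inc_b = [5], [5]
increments_merged = merge_counters(initial, inc_a, inc_b)  # 20

# Concurrent decrements: valid on each replica (2 > 0) but not after merge.
dec_a, dec_b = [-8], [-8]
decrements_merged = merge_counters(initial, dec_a, dec_b)  # -6
```

Increments can only move the value away from the lower bound, so no replica needs to ask permission. Each decrement looks safe locally, yet together they drive the merged value below zero, which is why that row is marked N.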
aggregates (e.g., W_YTD=SUM(orders for warehouse))
Sequence number ID assignment (i.e., D_NEXT_O_ID)

RAMP transaction on counter CRDT
RAMP transaction across tables

- rewrite FK references to point to temp unique ID
- create local index from temp unique ID to sequence ID
- deferred atomic incrementAndGet() on commit: assign new O_ID

ONLY SYNCH COORDINATION REQUIRED

[Diagram: warehouse, district, orders, and neworders tables; O_ID is inserted into orders and neworders via a tmp ID]
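The temp-ID rewrite can be sketched as follows. This is an illustration under assumptions, not the prototype's code: `DistrictCounter` stands in for the D_NEXT_O_ID counter, and `Store`, `new_order`, and `order_id` are hypothetical names. Rows reference a locally generated temp ID, and the sequential O_ID is assigned by one deferred `increment_and_get()` call at commit.

```python
import itertools
import uuid


class DistrictCounter:
    """Stand-in for D_NEXT_O_ID; the only call that needs synchronous coordination."""

    def __init__(self):
        self._counter = itertools.count(1)

    def increment_and_get(self):
        return next(self._counter)


class Store:
    def __init__(self):
        self.orders = {}         # tmp ID -> order row
        self.new_orders = set()  # FK references, expressed as tmp IDs
        self.tmp_to_seq = {}     # local index: tmp ID -> sequential O_ID


def new_order(store, counter, items):
    """Run a new-order transaction and return its tmp ID."""
    tmp_id = uuid.uuid4()                    # generated without coordination
    store.orders[tmp_id] = {"items": items}  # inserts reference the tmp ID
    store.new_orders.add(tmp_id)
    # Deferred to commit: assign the sequential O_ID and index it locally.
    store.tmp_to_seq[tmp_id] = counter.increment_and_get()
    return tmp_id


def order_id(store, tmp_id):
    """Resolve a tmp ID to its sequential O_ID through the local index."""
    return store.tmp_to_seq[tmp_id]


store, counter = Store(), DistrictCounter()
first = new_order(store, counter, ["widget"])
second = new_order(store, counter, ["gadget", "gizmo"])
```

Everything before the commit step uses only locally generated identifiers, so the critical section shrinks to the single counter increment.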
perf is poor (2.5K lines Java) but only one non-CF operation: incrementAndGet() on D_NEXT_O_ID

Coordination need not be a bottleneck (if implemented in a coordination-free manner):

UC Berkeley database prototype
- 100 EC2 CC2.8xlarge instances (thank you AWS folks! currently poor single-node performance, but unimportant if you can scale out [for the time being])
- linearizable masters
- only blocking coordination: incrementAndGet for “district next order ID” key
- CPU-bound on in-memory data
- ~2500 lines Java
- 120 clients/warehouse, 5 warehouses/machine, no THINK TIME (i.e., more contention than stock configuration)
delay synchronization when possible
next level: automated analysis

2.) Minimize distribution of conflicts
Resolve conflicts using as few servers (space) and with as short a critical section (time) as possible
next level: automated conflict avoidance, rewriting
use pessimistic locking, optimistic execution with validation, or rewrite queries to be coordination-free
“Polyglot Persistence” is apt for 2013, but...

symptomatic of an immature ecosystem

...if Polyglot Persistence for online data serving is still standard for non-legacy apps in 2023, the OSS DB community will have failed

Building correct, reliable, and high-performance databases is hard and takes time, lol
Alan Fekete, Mike Franklin, Joe Hellerstein, Ion Stoica

Special thanks to: Peter Alvaro, Phil Bernstein, Dan Bruckner, Neil Conway, Robert Hodges, Evan Sparks, Doug Terry

“The game matters... This is a great game.”
--Dennis Rodman, Bad As I Wanna Be
indefinite scalability (more precisely: for validity, availability, and convergence)

Application-level correctness criteria are often but not always maintainable without coordination

Your (future) database can manage this for you. Reason about your application, not your database replication protocol.
Hint: you probably won’t need a different database for each case

Know what “consistency” means to your application.
Hint: linearizability is not an application-level concept.
Hint: you can’t “beat” CAP when “C” means SSI or linearizability.

There is a fundamental trade-off between limited coordination and application-level consistency