Bad as I Wanna Be: Coordination and Consistency in Distributed Databases

BAD AS I WANNA BE: Coordination and Consistency in Distributed Databases
Peter Bailis, UC Berkeley, @pbailis

A portrait of big services
[Figure labels: stateless, horizontally scalable, soft state; stateful, concurrent, durable]

Users have application-level properties that the database should maintain:
- “account balances should be positive”
- “every patient should have a primary care physician”
- “usernames should be unique”
Linearizability, causal consistency, PRAM, regular semantics, timeline consistency, and eventual consistency are not application properties
This talk: Consistency is an application-level invariant over data

How do we maintain invariants:
- despite concurrent accesses?
- across multiple copies?
- despite failures?

Consistency is about applications SSI’s Synchronization Shackles Scaling: Just Do
It Freedom in a Database “I Love Pain” You and the Future

Traditional answer: use a single system image (SSI)
Equivalent serial execution: if users maintain invariants in isolation, consistency is guaranteed during execution
Isolation provides Consistency (the I and the C in ACID): serializable (SSI) execution means users implicitly maintain a consistent database state

Conflict serializability requires reasoning about low-level read/write traces
Given only reads and writes, two kinds of access might be a problem (conflict):
- two writes to the same key
- any read and write to the same key
[Figure: conflict graph over transactions T1, T2, T3 with edges ww(c), rw(a), rw(c), rw(b); example operations write(a=1) and read(a=0)]
Serializable isolation prevents anomalies
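The conflict rule on this slide can be turned into a small check. The following is a minimal sketch, not any particular database's implementation: it assumes a schedule is given as a list of (transaction, operation, key) tuples in execution order, builds the conflict graph, and reports whether that graph is acyclic. The function names and the tuple encoding are illustrative assumptions.

```python
from itertools import combinations


def conflict_edges(schedule):
    """Return the set of (earlier_txn, later_txn) conflict edges.

    `schedule` is a list of (txn, op, key) tuples in execution order,
    where op is "r" (read) or "w" (write).
    """
    edges = set()
    # combinations() preserves list order, so the first element ran earlier.
    for (t1, op1, k1), (t2, op2, k2) in combinations(schedule, 2):
        # Conflict: different transactions, same key, at least one write.
        if t1 != t2 and k1 == k2 and "w" in (op1, op2):
            edges.add((t1, t2))
    return edges


def is_conflict_serializable(schedule):
    """A schedule is conflict serializable iff its conflict graph is acyclic."""
    edges = conflict_edges(schedule)
    remaining = {txn for txn, _, _ in schedule}
    while remaining:
        # Transactions with no incoming edge from another remaining transaction.
        sources = {
            n for n in remaining
            if not any(dst == n and src in remaining for src, dst in edges)
        }
        if not sources:
            return False  # every remaining transaction sits on a cycle
        remaining -= sources
    return True
```

For example, two transactions that each read and then write key "a" are serializable when run back to back, but not when both reads happen before either write (the classic lost update).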

Problem: SSI requires coordination One (or both) of these users
must stall to preserve serializability

Synchronous coordination:
= stalls during network partitions
= RTT latency during operations
= possible stall during concurrent access
No (or asynchronous) coordination:
= Gilbert and Lynch “High Availability”
= low latency (no RTT)
= indefinite horizontal scaling (even for a single record; not a BS “scalability” claim)
Benefits also apply to concurrent access in single-node systems

Minimum coordination means maximum scalability
How much coordination is necessary?
- DB results: serializability requires coordination
- CAP result: linearizability requires coordination
(fun fact: infinite number of models on either side of this trade-off; less fun fact: there are many existing models to choose from)
Are these models always required for maintaining application-level consistency?

8/18 databases surveyed did not support SSI/serializability; 15/18 used weaker models by default
Database          Serializability supported?
Actian Ingres     YES
Aerospike         NO
Persistit         NO
Clustrix          NO
Greenplum         YES
IBM DB2           YES
IBM Informix      YES
MySQL             YES
MemSQL            NO
MS SQL Server     YES
NuoDB             NO
Oracle 11G        NO
Oracle BDB        YES
Oracle BDB JE     YES
Postgres 9.2.2    YES
SAP Hana          NO
ScaleDB           NO
VoltDB            YES
Source: “Highly Available Transactions: Virtues and Limitations,” VLDB 2014

When can we safely forgo coordination? Which anomalies matter?
Requires information about the application:
- invariants I(DB)→{True, False}
- operations T(DB)→DB

Invariant: each employee is in a department
Operations: add employees
Anomaly (to avoid):
    l_emp = employees.find(id="louise")
    l_dept = dept.find(l_emp.dept)
    ENORECORD

Invariant: each employee is in a department
Operations: add employees
Initial state:
    employees = {}
    dept = {{"ops":1}, {"dev":2}}
Concurrent operations:
    d1 = dept.find("ops"); employees.add({"Harry":d1})
    d2 = dept.find("dev"); employees.add({"Sue":d2})
Merged state:
    employees = {{"Harry":1}, {"Sue":2}}
    dept = {{"ops":1}, {"dev":2}}
Invariant holds!

Invariant: only one ops on staff at a time
Operations: change staffing
Anomaly (to avoid):
    on_duty = employees.find(staffed="T")
    assert(len(on_duty) == 1)
    ASSERTION FAILS

Invariant: only one ops on staff at a time
Operations: change staffing
Initial state:
    staff = {"Laura":T, "Harry":F, "Gary":F}
Concurrent operations:
    staff.set({"Laura":F}, {"Gary":T})
    staff.set({"Laura":F}, {"Harry":T})
Merged state:
    staff = {"Laura":F, "Harry":T, "Gary":T}
Invariant violated!
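The staffing anomaly can be replayed in a few lines. This is a minimal sketch under an assumed model: each replica applies a set of (name, on_duty) mutations to the shared initial state, and merging applies the union of both mutation sets. The helper names `invariant` and `apply_mutations` are hypothetical.

```python
def invariant(staff):
    """Exactly one person is on duty."""
    return sum(staff.values()) == 1


def apply_mutations(state, mutations):
    """Return a new state with each (name, on_duty) mutation applied."""
    new_state = dict(state)
    new_state.update(mutations)
    return new_state


initial = {"Laura": True, "Harry": False, "Gary": False}

# Two replicas change staffing concurrently, without coordinating.
m1 = {("Laura", False), ("Gary", True)}
m2 = {("Laura", False), ("Harry", True)}

replica1 = apply_mutations(initial, m1)
replica2 = apply_mutations(initial, m2)
merged = apply_mutations(initial, m1 | m2)  # merge = union of mutations

print(invariant(replica1), invariant(replica2), invariant(merged))
```

Each replica preserves the invariant on its own; only the merged state breaks it, which is why this invariant and operation pair needs coordination.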

SAFETY: invariants hold across all states
LIVENESS: database states eventually agree (converge)

“I don’t live my life by anybody else’s clock. If I feel like doing something, I don’t care what time it is. I just do it.” --Dennis Rodman, Bad As I Wanna Be. New York: Delacorte Press, 1996. Print.
Invariant I and set of operations T are coordination-free if, given initial state Di, every pair of states Dj and Dk resulting from any two valid series of operations in T applied to Di can be merged into a valid database state*
Single-step(s) case (from diagram): Invariant I and set of operations T are coordination-free if
    ∀ t1, t2 ∈ T: I(D) ∧ I(t1(D)) ∧ I(t2(D)) ⇒ I(t1(D) ⊔ t2(D))
Coordination-freedom is required for simultaneously maintaining application-level consistency, availability, and convergence
*(N.B. for readers concerned with the formalism: assume that database states are sets of mutations, such that ⊔ is set union and each operation simply adds to the set of mutations; a bit more on ⊔ is coming up; also, this formalism as presented is overdone and not particularly elegant; with more space/time, this can be simplified--feel free to email me)
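The single-step condition can be checked mechanically for a concrete state and a finite list of operations. The sketch below follows the N.B. on this slide (states are sets of mutations, ⊔ is set union, each operation adds a mutation); the tuple encoding of mutations and all function names are assumptions made for illustration. It tests one state and one list of operations, so a True result is evidence rather than a proof.

```python
from itertools import combinations


def merge(d1, d2):
    """⊔ as set union over mutation sets."""
    return d1 | d2


def add(mutation):
    """Build an operation t(D) that adds one mutation to the state."""
    return lambda state: state | {mutation}


def single_step_coordination_free(invariant, operations, state):
    """Check I(D) ∧ I(t1(D)) ∧ I(t2(D)) ⇒ I(t1(D) ⊔ t2(D)) for every pair."""
    if not invariant(state):
        return True  # the implication holds vacuously
    for t1, t2 in combinations(operations, 2):
        s1, s2 = t1(state), t2(state)
        if invariant(s1) and invariant(s2) and not invariant(merge(s1, s2)):
            return False
    return True


def every_employee_has_department(state):
    depts = {m[1] for m in state if m[0] == "dept"}
    return all(m[2] in depts for m in state if m[0] == "emp")


def ids_unique(state):
    ids = [m[1] for m in state if m[0] == "user"]
    return len(ids) == len(set(ids))


base = frozenset({("dept", "ops"), ("dept", "dev")})
add_employees = [add(("emp", "Harry", "ops")), add(("emp", "Sue", "dev"))]
insert_with_id = [add(("user", 1, "Leslie")), add(("user", 1, "Lamport"))]
```

With these definitions, adding employees to existing departments passes the check, while two inserts that both specify ID 1 fail it.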

To maintain consistency...
                            Sufficient?  Necessary?  App-Level?
Conflict Serializability    Yes          No          No
State-based Commutativity   Yes*         No          Depends
Coordination-Freedom        Yes          Yes         Yes
“Everybody wants to stop Dennis Rodman.” --Dennis Rodman, Bad As I Wanna Be

Formal framework for reasoning about application coordination requirements
Coordination depends on combination of:
- expressiveness of operations
- strength of invariants
[Figure: plot with axes STRENGTH OF INVARIANTS and EXPRESSIVENESS OF OPERATIONS, divided into COORDINATION-FREE and COORDINATION REQUIRED regions]
*Okay, so this is simplified, and there isn’t really a linear order on either axis (rather, it’s more about equivalence classes), but humor me here...

Constraint: record IDs are unique
    DECLARE TABLE users (
      ID int UNIQUE,
      FirstName string,
      LastName string
    )
Anomaly:

Constraint: record IDs are unique
    DECLARE TABLE users (
      ID int UNIQUE,
      FirstName string,
      LastName string
    )
Operation: insert record
    INSERT INTO users (firstname, lastname) VALUES ('Leslie', 'Lamport')
  C-FREE! Let the DB decide the ID; use node ID or UUID
Operation: insert record with specific ID
    INSERT INTO users (ID, firstname, lastname) VALUES (1, 'Leslie', 'Lamport')
  NOT C-FREE
Operation: insert record with sequential ID
    DECLARE TABLE users ( ID int UNIQUE AUTO_INCREMENT ...
  NOT C-FREE: don’t have to abort, just have to coordinate on commit
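The difference between letting the database pick the ID and specifying it can be seen by merging two replicas. This is a minimal sketch in which a replica is modeled as a plain dict from ID to record; `insert_generated`, `insert_with_id`, and `merge_tables` are hypothetical helper names, and `uuid.uuid4()` stands in for the "node ID or UUID" suggestion on the slide.

```python
import uuid


def insert_generated(table, first, last):
    """Insert with a locally generated ID: no other replica needs to be consulted."""
    record_id = uuid.uuid4()
    table[record_id] = (first, last)
    return record_id


def insert_with_id(table, record_id, first, last):
    """Insert with a caller-specified ID: unique only within this replica."""
    if record_id in table:
        raise KeyError(f"duplicate ID {record_id}")
    table[record_id] = (first, last)


def merge_tables(a, b):
    """Union two replicas and report IDs that map to different records."""
    conflicts = {k for k in a.keys() & b.keys() if a[k] != b[k]}
    return {**a, **b}, conflicts


# Generated IDs: concurrent inserts on two replicas merge cleanly.
r1, r2 = {}, {}
insert_generated(r1, "Leslie", "Lamport")
insert_generated(r2, "Barbara", "Liskov")
merged, conflicts = merge_tables(r1, r2)

# Specified IDs: both replicas accept ID 1 locally, and the merge conflicts.
s1, s2 = {}, {}
insert_with_id(s1, 1, "Leslie", "Lamport")
insert_with_id(s2, 1, "Barbara", "Liskov")
_, specified_conflicts = merge_tables(s1, s2)
```

The uniqueness check inside `insert_with_id` passes on each replica in isolation, which is exactly why it is not enough without coordination.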

Foreign key constraints
    DECLARE TABLE users (
      U_ID int UNIQUE,
      D_ID int,
      UserName string,
      FOREIGN KEY (D_ID) REFERENCES department(D_ID)
    )
    DECLARE TABLE department (
      D_ID int UNIQUE,
      DeptName string
    )
    NEW_D_ID = INSERT INTO department VALUES ('badass division');
    INSERT INTO users (D_ID, UserName) VALUES (NEW_D_ID, 'lamport');
Anomalies:
- EMPTY “badass division” department
- “lamport” has no department

Foreign key constraints (same schema and inserts as the previous slide)
[Figure: users shard and department shard, each with versions labeled “Visible to all readers” and “Not yet visible to all readers”; the writes (342, “badass division”) and (402, 342, “lamport”) are both tagged txid=5]
- 2 RTT writes (prepare and make visible)
- Between 1-2 RTTs for reads
- Magic trick: store metadata to record sibling writes

ATOMICALLY VISIBLE MULTI-PUT, -GET ACROSS MULTIPLE SHARDS without LOCKING, BLOCKING (RAMP: Read Atomic Multi-Partition Transactions)
Also applicable to:
--Distributed secondary indexing
--Materialized views (e.g., pre-computed aggregates, alerts)
--Multi-entity update (e.g., Tao, Espresso, PNUTS)
--Cheap snapshot reads aligned along transaction boundaries
Key: design with coordination-freedom as primary goal
N.B.: This leverages 2PC protocols, but it’s more than 2PC. Individual rounds can block, but readers resolve incomplete commits autonomously
Interested parties: paper in pipeline; contact me ([O(1) to O(N) metadata-efficiency trade-off])
http://www.bailis.org/blog/non-blocking-transactional-atomicity/
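The "store metadata to record sibling writes" trick can be illustrated with a heavily simplified, single-process sketch. It is not the author's implementation: it assumes one partition per key, positive integer timestamps chosen by the writer, no concurrency control, and no garbage collection. The class and function names (`Partition`, `put_all`, `get_all`) are hypothetical.

```python
class Partition:
    def __init__(self):
        self.versions = {}     # (key, ts) -> (value, sibling_keys)
        self.last_commit = {}  # key -> highest timestamp made visible

    def prepare(self, key, ts, value, siblings):
        """Round 1 of a write: store the version, not yet visible."""
        self.versions[(key, ts)] = (value, frozenset(siblings))

    def commit(self, key, ts):
        """Round 2 of a write: make the version visible."""
        self.last_commit[key] = max(self.last_commit.get(key, 0), ts)

    def get(self, key, ts=None):
        """Return (ts, value, siblings); latest visible version if ts is None."""
        if ts is None:
            ts = self.last_commit.get(key, 0)
        value, siblings = self.versions.get((key, ts), (None, frozenset()))
        return ts, value, siblings


def put_all(partitions, ts, writes):
    """2 RTT write: prepare on every partition, then make visible on every partition."""
    keys = set(writes)
    for key, value in writes.items():
        partitions[key].prepare(key, ts, value, keys - {key})
    for key in writes:
        partitions[key].commit(key, ts)


def get_all(partitions, keys):
    """1-2 RTT read: fetch latest visible versions, then repair from sibling metadata."""
    first = {k: partitions[k].get(k) for k in keys}
    result = {}
    for k in keys:
        # Highest timestamp among returned versions that list k as a sibling.
        needed = max((ts for ts, _, sibs in first.values() if k in sibs), default=0)
        ts, value, _ = first[k]
        if needed > ts:
            # Second round: fetch the prepared sibling version by timestamp.
            ts, value, _ = partitions[k].get(k, needed)
        result[k] = value
    return result
```

If a reader arrives after the department write is visible but before the users write is, the sibling metadata on the department version tells it which users version to fetch, so it sees both writes or neither.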

LIVENESS: CRDTs, CALM, Immutability guarantee well-defined merge (sometimes deterministic outcome) ...but few safety guarantees (e.g., can’t safely read)!
+
SAFETY: invariants hold across all states
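The gap between convergence and safety shows up even with a textbook counter CRDT. Below is a minimal PN-counter sketch (one increment map and one decrement map per replica ID, merged by per-replica maximum); the account scenario and the "balance must stay non-negative" invariant are assumptions chosen for illustration.

```python
class PNCounter:
    """Counter CRDT: per-replica increment and decrement totals."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.incs = {}
        self.decs = {}

    def increment(self, n=1):
        self.incs[self.replica_id] = self.incs.get(self.replica_id, 0) + n

    def decrement(self, n=1):
        self.decs[self.replica_id] = self.decs.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.incs.values()) - sum(self.decs.values())

    def merge(self, other):
        """Well-defined merge: take the per-replica maximum of each total."""
        for mine, theirs in ((self.incs, other.incs), (self.decs, other.decs)):
            for rid, count in theirs.items():
                mine[rid] = max(mine.get(rid, 0), count)


a, b = PNCounter("a"), PNCounter("b")
a.increment(100)   # deposit 100 at replica a
b.merge(a)         # replica b learns about the deposit

# Each replica checks the balance locally, then withdraws 70.
ok_a = a.value() >= 70
ok_b = b.value() >= 70
a.decrement(70)
b.decrement(70)

a.merge(b)
b.merge(a)
```

After the merges both replicas agree on the same value (liveness), but that value is negative, so the invariant was not preserved (safety). This matches the table entry that decrement under a ">" invariant is not coordination-free.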

Formal framework for reasoning about application coordination requirements
Invariant              Operation   C.F.?
Equality, Inequality   Any         Y
Generate unique ID     Any         Y
Specify unique ID      Insert      N
>                      Increment   Y
>                      Decrement   N
<                      Decrement   Y
<                      Increment   N
Foreign Key            Insert      Y
Foreign Key            Delete      Y*
Secondary Indexing     Any         Y
Materialized Views     Any         Y
AUTO_INCREMENT         Insert      N
Slide callouts: Use that sweet Riak AP; RAMP Transaction; Check those CRDTs; Remainder: Get yourself a CAS

“I’m no good working from a comfort zone. I need
pain. I love pain.” --Dennis Rodman, Bad As I Wanna Be

TPC-C: Combine fkeys with sequence number insert on commit...

TPC-C New-Order
- Foreign key insert (e.g., NewOrder, Orders tables): RAMP transaction across tables
- Pre-materialized aggregates (e.g., W_YTD=SUM(orders for warehouse)): RAMP transaction on counter CRDT
- Sequence number ID assignment (i.e., D_NEXT_O_ID): deferred atomic incrementAndGet() on commit; this is the ONLY SYNCH COORDINATION REQUIRED
  - rewrite FK references to point to temp unique ID
  - create local index from temp unique ID to sequence ID
[Figure: warehouse, district, orders, and neworders tables; +100 applied to the warehouse aggregate; insert O_ID into orders and neworders under a tmp ID; assign new O_ID on commit]
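The deferred ID assignment can be sketched as follows. This is an illustration under stated assumptions, not the prototype's code: `SequenceCounter` is a local stand-in for the one coordinated incrementAndGet() on D_NEXT_O_ID, rows are plain dicts, and `NewOrderTxn`, `id_index`, and `resolve` are hypothetical names.

```python
import threading
import uuid


class SequenceCounter:
    """Stand-in for the single coordinated operation: atomic incrementAndGet()."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment_and_get(self):
        with self._lock:
            self._value += 1
            return self._value


class NewOrderTxn:
    """Builds all rows against a temp unique ID, with no coordination."""

    def __init__(self):
        self.tmp_id = uuid.uuid4()
        self.order = {"o_id": self.tmp_id}
        self.new_order = {"o_id": self.tmp_id}  # FK reference points at the temp ID
        self.order_lines = []

    def add_line(self, item, qty):
        self.order_lines.append({"o_id": self.tmp_id, "item": item, "qty": qty})

    def commit(self, counter, id_index):
        """Only here do we coordinate: assign the sequential O_ID."""
        o_id = counter.increment_and_get()
        id_index[self.tmp_id] = o_id  # local index: temp unique ID -> sequence ID
        return o_id


def resolve(row, id_index):
    """Read path: translate a row's temp ID into its sequence ID."""
    return {**row, "o_id": id_index[row["o_id"]]}
```

A transaction does all of its inserts under `tmp_id`, and the counter is touched once, at commit. Readers go through `resolve` so they see the sequential O_ID.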

“You can like me or hate me. But all I can say is, when I get on that damn floor, all I’m going to do is get solid.” --Dennis Rodman, Bad As I Wanna Be

Linear Scaling via Minimized Coordination
No magic in implementation: single-node perf is poor (2.5K lines Java), but only one non-CF operation: incrementAndGet() on D_NEXT_O_ID
Coordination need not be a bottleneck (if implemented in a coordination-free manner)
Setup:
- UC Berkeley database prototype, ~2500 lines Java
- 100 EC2 CC2.8xlarge instances (thank you AWS folks! currently poor single-node performance, but unimportant if you can scale out [for the time being])
- linearizable masters
- only blocking coordination: incrementAndGet for “district next order ID” key
- CPU-bound on in-memory data
- 120 clients/warehouse, 5 warehouses/machine, no THINK TIME (i.e., more contention than stock configuration)

Standard SQL with extensions and analysis
    CREATE TABLE Orders (
      O_ID int AUTO_INCREMENT,
      C_ID int,
      O_QTY int,
      DATE datetime NOT NULL,
      PRIMARY KEY (O_ID),
      FOREIGN KEY (C_ID) REFERENCES Customers(C_ID),
      CONSTRAINT [O_QTY > 0]
    )
    CREATE PROCEDURE CreateOrder(@C_ID int, @O_QTY int) AS
      INSERT INTO Orders (C_ID, O_QTY, DATE) VALUES (@C_ID, @O_QTY, NOW());
    GO
Analysis output:
    > WARNING: Orders.O_ID requires coordination! INSERT found in CreateOrder
    > WARNING: CreateOrder requires remote check for @C_ID!
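A toy version of this kind of analysis is a lookup over the invariant/operation table from earlier in the talk. The sketch below is only that: the table entries are transcribed from the slide, while the function names, the lowercase string encoding, and the warning format are assumptions. It does not parse SQL and says nothing about how the real analysis works.

```python
# Transcribed from the talk's Invariant / Operation / C.F.? table.
# True = coordination-free, False = coordination required.
COORDINATION_FREE = {
    ("equality", "any"): True,
    ("inequality", "any"): True,
    ("generate unique id", "any"): True,
    ("specify unique id", "insert"): False,
    (">", "increment"): True,
    (">", "decrement"): False,
    ("<", "decrement"): True,
    ("<", "increment"): False,
    ("foreign key", "insert"): True,
    ("foreign key", "delete"): True,  # listed as Y*; the caveat is not spelled out on the slide
    ("secondary indexing", "any"): True,
    ("materialized views", "any"): True,
    ("auto_increment", "insert"): False,
}


def is_coordination_free(invariant, operation):
    """Return True/False from the table, or None if the pair is not listed."""
    inv, op = invariant.lower(), operation.lower()
    if (inv, op) in COORDINATION_FREE:
        return COORDINATION_FREE[(inv, op)]
    return COORDINATION_FREE.get((inv, "any"))


def coordination_warnings(procedure, column_invariants, operation):
    """List warnings for columns whose invariant needs coordination under `operation`."""
    return [
        f"WARNING: {column} requires coordination! {operation.upper()} found in {procedure}"
        for column, invariant in column_invariants
        if is_coordination_free(invariant, operation) is False
    ]


warnings = coordination_warnings(
    "CreateOrder",
    [("Orders.O_ID", "AUTO_INCREMENT"), ("Orders.C_ID", "foreign key")],
    "insert",
)
```

Pairs that the table does not list return None rather than a guess, so an unknown combination is never reported as safe.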

How do I web scale?

1.) Maximize safe concurrency
Analyze operations, invariants for coordination-freedom; delay synchronization when possible
next level: automated analysis
2.) Minimize distribution of conflicts
Resolve conflicts using as few servers (space) and with as short a critical section (time) as possible
use pessimistic locking, optimistic execution with validation, or rewrite queries to be coordination-free
next level: automated conflict avoidance, rewriting

http://martinfowler.com/articles/nosql-intro-original.pdf

Use one system!
Get a: CAS/OCC/Lock Mgr
Get some: CRDTs/RAMPs/EC
Lots of re-use: query model, local persistence, cluster membership, sharding protocol, failure detection, metrics, monitoring, administration

“Polyglot Persistence” is apt for 2013, but...
- it’s an anti-pattern
- introduces unnecessary complexity
- fundamental differences are small
- symptomatic of an immature ecosystem
Building correct, reliable, and high-performance databases is hard and takes time, lol
...if Polyglot Persistence for online data serving is still standard for non-legacy apps in 2023, the OSS DB community will have failed

Joint work with great folks including: Aaron Davidson, Ali Ghodsi, Alan Fekete, Mike Franklin, Joe Hellerstein, Ion Stoica
Special thanks to: Peter Alvaro, Phil Bernstein, Dan Bruckner, Neil Conway, Robert Hodges, Evan Sparks, Doug Terry
“The game matters... This is a great game.” --Dennis Rodman, Bad As I Wanna Be

Coordination-freedom is a necessary and sufficient condition for availability and indefinite scalability (more precisely: for validity, availability, and convergence)
Application-level correctness criteria are often but not always maintainable without coordination
Your (future) database can manage this for you.
Reason about your application, not your database replication protocol. Hint: you probably won’t need a different database for each case
Know what “consistency” means to your application. Hint: linearizability is not an application-level concept. Hint: you can’t “beat” CAP when “C” means SSI or linearizability.
There is a fundamental trade-off between limited coordination and application-level consistency
