Slide 1

Slide 1 text

BAD AS I WANNA BE: Coordination and Consistency in Distributed Databases
Peter Bailis, UC Berkeley, @pbailis

Slide 2

Slide 2 text

A portrait of big services
- stateless, horizontally scalable, soft state
- stateful, concurrent, durable

Slide 3

Slide 3 text

This talk: Consistency is an application-level invariant over data
Users have application-level properties that the database should maintain:
- “account balances should be positive”
- “every patient should have a primary care physician”
- “usernames should be unique”
Linearizability, causal consistency, PRAM, regular semantics, timeline consistency, and eventual consistency are not application properties

Slide 4

Slide 4 text

This talk: Consistency is an application-level invariant over data
How do we maintain invariants:
- despite concurrent accesses?
- across multiple copies?
- despite failures?

Slide 5

Slide 5 text

Consistency is about applications SSI’s Synchronization Shackles Scaling: Just Do It Freedom in a Database “I Love Pain” You and the Future

Slide 6

Slide 6 text

Consistency is about applications SSI’s Synchronization Shackles Scaling: Just Do It Freedom in a Database “I Love Pain” You and the Future

Slide 7

Slide 7 text

Traditional answer: use single system image

Slide 8

Slide 8 text

Traditional answer: use single system image
Equivalent Serial Execution
ACID: Isolation provides Consistency
if users maintain invariants in isolation, consistency is guaranteed during execution
serializable (SSI) execution means users implicitly maintain database state

Slide 9

Slide 9 text

Conflict serializability requires reasoning about low-level read/write traces

Slide 10

Slide 10 text

Conflict serializability requires reasoning about low-level read/write traces
Given only reads and writes:
- two writes to the same key
- any read and write to same key
might be a problem (conflict)
[Diagram: example trace with write(a=1) and read(a=0); conflict graphs over T1, T2, T3 with edges ww(c), rw(a), rw(c), rw(b)]
serializable isolation prevents anomalies
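
The conflict rule on this slide can be sketched as a small graph check. This is a minimal illustration, assuming a schedule is given as a list of (transaction, operation, key) tuples in execution order; the function names and the example schedules are hypothetical and do not come from the talk.

```python
from itertools import combinations


def conflict_edges(schedule):
    """schedule: list of (txn, op, key) in execution order, op is "r" or "w".

    Two operations conflict if they come from different transactions,
    touch the same key, and at least one of them is a write.
    """
    edges = set()
    for (t1, op1, k1), (t2, op2, k2) in combinations(schedule, 2):
        if t1 != t2 and k1 == k2 and "w" in (op1, op2):
            edges.add((t1, t2, op1 + op2, k1))  # e.g. ("T1", "T2", "rw", "a")
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the conflict graph."""
    graph = {}
    for src, dst, _, _ in edges:
        graph.setdefault(src, set()).add(dst)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        cyclic = any(visit(nxt) for nxt in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return cyclic

    return any(visit(node) for node in list(graph))


def conflict_serializable(schedule):
    """A schedule is conflict serializable iff its conflict graph is acyclic."""
    return not has_cycle(conflict_edges(schedule))


# Hypothetical schedules: the first is serial, the second interleaves
# so that T1 -> T2 (rw on a) and T2 -> T1 (rw on b) form a cycle.
serial = [("T1", "r", "a"), ("T1", "w", "b"), ("T2", "w", "a"), ("T2", "r", "b")]
interleaved = [("T1", "r", "a"), ("T2", "w", "a"), ("T2", "r", "b"), ("T1", "w", "b")]
```

Note that the check only ever sees reads and writes; it has no notion of what the application considers correct, which is the limitation the slide points at.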

Slide 11

Slide 11 text

Problem: SSI requires coordination One (or both) of these users must stall to preserve serializability

Slide 12

Slide 12 text

Problem: SSI requires coordination
synchronous coordination
= stalls during network partitions
= RTT latency during operations
= possible stall during concurrent access
no (or asynchronous) coordination
= Gilbert and Lynch “High Availability”
= low latency (no RTT)
= indefinite horizontal scaling (even for a single record; not BS “scalability” claim)
benefits also apply to concurrent access in single-node systems

Slide 13

Slide 13 text

Minimum coordination means maximum scalability
How much coordination is necessary?
DB results: serializability requires coordination
CAP result: linearizability requires coordination
(fun fact: infinite number of models on either side of this trade-off; less fun fact: there are many existing models to choose from)
Are these models always required for maintaining application-level consistency?

Slide 14

Slide 14 text

Database          SSI/serializability supported?
Actian Ingres     YES
Aerospike         NO
Persistit         NO
Clustrix          NO
Greenplum         YES
IBM DB2           YES
IBM Informix      YES
MySQL             YES
MemSQL            NO
MS SQL Server     YES
NuoDB             NO
Oracle 11G        NO
Oracle BDB        YES
Oracle BDB JE     YES
Postgres 9.2.2    YES
SAP Hana          NO
ScaleDB           NO
VoltDB            YES
8/18 databases surveyed did not support SSI/serializability
15/18 used weaker models by default
“Highly Available Transactions: Virtues and Limitations,” VLDB 2014

Slide 15

Slide 15 text

Consistency is about applications SSI’s Synchronization Shackles Scaling: Just Do It Freedom in a Database “I Love Pain” You and the Future

Slide 16

Slide 16 text

When can we safely forgo coordination? Which anomalies matter?
Requires information about application:
- invariants I(DB)→{True, False}
- operations T(DB)→DB

Slide 17

Slide 17 text

Invariant: each employee is in a department
Operations: add employees
Anomaly (to avoid):
l_emp = employees.find(id=“louise”)
l_dept = dept.find(l_emp.dept)
ENORECORD

Slide 18

Slide 18 text

Invariant: each employee is in a department
Operations: add employees
Initial state: employees = {} dept = {{“ops”:1}, {“dev”:2}}
Concurrent operations:
- d2 = dept.find(“dev”) employees.add({“Sue”:d2})
- d1 = dept.find(“ops”) employees.add({“Harry”:d1})
Merged state: employees = {{“Harry”:1}, {“Sue”:2}} dept = {{“ops”:1}, {“dev”:2}}
Invariant holds!
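
The slide's example can be replayed in a few lines. This is a minimal sketch, assuming each replica holds plain dictionaries and that merging is a union of records; `has_department`, `add_employee`, and `merge` are hypothetical helper names, not an API from the talk.

```python
def has_department(db):
    """Invariant: every employee points at a department ID that exists."""
    dept_ids = set(db["dept"].values())
    return all(d in dept_ids for d in db["employees"].values())


def add_employee(db, name, dept_name):
    """Operation: look up the department, then add the employee (returns a new state)."""
    dept_id = db["dept"][dept_name]
    return {"dept": dict(db["dept"]), "employees": {**db["employees"], name: dept_id}}


def merge(a, b):
    """Assumed merge: union of the records held by both replicas."""
    return {
        "dept": {**a["dept"], **b["dept"]},
        "employees": {**a["employees"], **b["employees"]},
    }


initial = {"employees": {}, "dept": {"ops": 1, "dev": 2}}
replica_1 = add_employee(initial, "Sue", "dev")    # runs without talking to replica_2
replica_2 = add_employee(initial, "Harry", "ops")  # runs without talking to replica_1
merged = merge(replica_1, replica_2)
```

Both inserts only add records that reference departments already present in the initial state, so the merged state still satisfies the invariant without any coordination between the replicas.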

Slide 19

Slide 19 text

Invariant: only one ops on staff at a time
Operations: change staffing
Anomaly (to avoid):
on_duty = employees.find(staffed=“T”)
assert(len(on_duty) == 1)
ASSERTION FAILS

Slide 20

Slide 20 text

Invariant: only one ops on staff at a time
Operations: change staffing
Initial state: staff = {“Laura”:T, “Harry”:F, “Gary”:F}
Concurrent operations:
- staff.set({“Laura”:F}, {“Gary”:T})
- staff.set({“Laura”:F}, {“Harry”:T})
Merged state: staff = {“Laura”:F, “Harry”:T, “Gary”:T}
Invariant violated!
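
The same replay for the staffing example shows the failure. This is a sketch under the assumption that the merge keeps every change a replica made relative to the shared starting state; `one_on_duty`, `set_staff`, and `merge_changes` are hypothetical names.

```python
def one_on_duty(staff):
    """Invariant: exactly one person is on staff."""
    return sum(1 for on_duty in staff.values() if on_duty) == 1


def set_staff(staff, changes):
    """Operation: apply a staffing change (returns a new state)."""
    return {**staff, **changes}


def merge_changes(base, a, b):
    """Assumed merge: keep each replica's changes relative to the common base."""
    merged = dict(base)
    for replica in (a, b):
        for name, on_duty in replica.items():
            if on_duty != base[name]:
                merged[name] = on_duty
    return merged


base = {"Laura": True, "Harry": False, "Gary": False}
replica_1 = set_staff(base, {"Laura": False, "Gary": True})
replica_2 = set_staff(base, {"Laura": False, "Harry": True})
merged = merge_changes(base, replica_1, replica_2)
```

Each replica's state is valid on its own, but the merged state has two people on duty, so this invariant and operation pair cannot be maintained without coordination.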

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

SAFETY: invariants hold across all states
LIVENESS: database states eventually agree (converge)

Slide 23

Slide 23 text

“I don’t live my life by anybody else’s clock. If I feel like doing something, I don’t care what time it is. I just do it.” --Dennis Rodman, Bad As I Wanna Be. New York: Delacorte Press, 1996. Print.
Coordination-freedom is required for simultaneously maintaining application-level consistency, availability, and convergence
Invariant I and set of operations T are coordination-free if, given initial state Di, every pair of states Dj and Dk resulting from any two valid series of operations in T applied to Di can be merged into a valid database state*
Single-step(s) case (from diagram): Invariant I and set of operations T are coordination-free if ∀ t1, t2 ∈ T: I(D) ∧ I(t1(D)) ∧ I(t2(D)) ⇒ I(t1(D) ⊔ t2(D))
*(N.B. for readers concerned with the formalism: assume that database states are sets of mutations, such that ⊔ is set union and each operation simply adds to the set of mutations; a bit more on ⊔ is coming up; also, this formalism as presented is overdone and not particularly elegant; with more space/time, this can be simplified--feel free to email me)
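
The single-step condition can be checked mechanically for a given starting state. The sketch below follows the footnote's assumption that a database state is a set of mutations and that ⊔ is set union; the function `coordination_free`, the `insert` helper, and the username example are illustrative assumptions. It tests only one starting state D, not every reachable state.

```python
from itertools import product


def coordination_free(invariant, state, operations):
    """Check I(D) and I(t1(D)) and I(t2(D)) implies I(t1(D) | t2(D)) for all t1, t2."""
    for t1, t2 in product(operations, repeat=2):
        d1, d2 = t1(state), t2(state)
        if invariant(state) and invariant(d1) and invariant(d2):
            if not invariant(d1 | d2):  # merge is set union
                return False
    return True


def insert(user_id, name):
    """Operation factory: add one (user_id, name) mutation to the state."""
    return lambda db: db | {(user_id, name)}


def unique_names(db):
    """Invariant: usernames are unique across all records."""
    return len({name for _, name in db}) == len(db)


empty = frozenset()
distinct = [insert("u1", "bob"), insert("u3", "alice")]
clashing = [insert("u1", "bob"), insert("u2", "bob")]
```

With the distinct names every pairwise merge stays valid; with the clashing names each insert is valid alone but their union is not, which matches the “usernames should be unique” example from the start of the talk.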

Slide 24

Slide 24 text

To maintain consistency...
                            Sufficient?   Necessary?   App-Level?
Conflict Serializability    Yes           No           No
State-based Commutativity   Yes*          No           Depends
Coordination-Freedom        Yes           Yes          Yes
“Everybody wants to stop Dennis Rodman.” --Dennis Rodman, Bad As I Wanna Be

Slide 25

Slide 25 text

Formal framework for reasoning about application coordination requirements
Coordination depends on combination of:
- expressiveness of operations
- strength of invariants
[Diagram: axes STRENGTH OF INVARIANTS and EXPRESSIVENESS OF OPERATIONS, split into COORDINATION REQUIRED and COORDINATION-FREE regions*]
*Okay, so this is simplified, and there isn’t really a linear order on either axis (rather, it’s more about equivalence classes), but humor me here...

Slide 26

Slide 26 text

Consistency is about applications SSI’s Synchronization Shackles Scaling: Just Do It Freedom in a Database “I Love Pain” You and the Future

Slide 27

Slide 27 text

Constraint: record IDs are unique
DECLARE TABLE users ( ID int UNIQUE, FirstName string, LastName string )
Anomaly: [shown as a diagram on the slide]

Slide 28

Slide 28 text

Constraint: record IDs are unique
DECLARE TABLE users ( ID int UNIQUE, FirstName string, LastName string )
Operation: insert record with specific ID
INSERT INTO users (ID, firstname, lastname) VALUES (1, “Leslie”, “Lamport”)
NOT C-FREE
Operation: insert record
INSERT INTO users (firstname, lastname) VALUES (“Leslie”, “Lamport”)
let the DB decide the ID; use node ID or UUID
C-FREE!
Operation: insert record with sequential ID
DECLARE TABLE users ( ID int UNIQUE AUTO_INCREMENT ...
NOT C-FREE
don’t have to abort, just have to coordinate on commit
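
Letting the database pick the ID can be sketched in two ways, both of which avoid any cross-node communication. This is an illustration only: the `Node` class and `random_id` function are hypothetical, and the (node ID, local counter) pairing is one possible reading of the slide's “use node ID or UUID”.

```python
import uuid
from itertools import count


class Node:
    """Generates IDs that are unique across nodes as long as node_id is unique."""

    def __init__(self, node_id):
        self.node_id = node_id
        self._local = count(1)  # purely local counter, never shared

    def next_id(self):
        # The node ID namespaces the counter, so two nodes can never collide.
        return (self.node_id, next(self._local))


def random_id():
    """Alternative: a random UUID, unique with overwhelming probability."""
    return uuid.uuid4().hex


node_a, node_b = Node("a"), Node("b")
ids = [node_a.next_id() for _ in range(1000)] + [node_b.next_id() for _ in range(1000)]
```

Neither scheme produces dense, sequential IDs, which is why the AUTO_INCREMENT case on the slide still needs coordination.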

Slide 29

Slide 29 text

Foreign key constraints
DECLARE TABLE users ( U_ID int UNIQUE, D_ID int, UserName string, FOREIGN KEY (D_ID) REFERENCES department(D_ID) )
DECLARE TABLE department ( D_ID int UNIQUE, DeptName string )
NEW_D_ID = INSERT INTO department VALUES (“badass division”);
INSERT INTO users (D_ID, UserName) VALUES (NEW_D_ID, “lamport”);
Anomalies:
- EMPTY “badass division” department
- “lamport” has no department

Slide 30

Slide 30 text

Foreign key constraints
DECLARE TABLE users ( U_ID int UNIQUE, D_ID int, UserName string, FOREIGN KEY (D_ID) REFERENCES department(D_ID) )
DECLARE TABLE department ( D_ID int UNIQUE, DeptName string )
NEW_D_ID = INSERT INTO department VALUES (“badass division”);
INSERT INTO users (D_ID, UserName) VALUES (NEW_D_ID, “lamport”);
[Diagram: the department shard holds (342, “badass division”) and the users shard holds (402, 342, “lamport”), both tagged txid=5; each write is first “Not yet visible to all readers”, then “Visible to all readers”]
2 RTT writes (prepare and make visible)
Between 1-2 RTTs for reads
Magic trick: store metadata to record sibling writes

Slide 31

Slide 31 text

ATOMICALLY VISIBLE MULTI-PUT, -GET ACROSS MULTIPLE SHARDS without LOCKING, BLOCKING (RAMP: Read Atomic Multi-Partition Transactions)
2 RTT writes (prepare and make visible)
Between 1-2 RTTs for reads
Magic trick: store metadata to record sibling writes
N.B.: This leverages 2PC protocols, but it’s more than 2PC. Individual rounds can block, but readers resolve incomplete commits autonomously
Also applicable to:
--Distributed secondary indexing
--Materialized views (e.g., pre-computed aggregates, alerts)
--Multi-entity update (e.g., Tao, Espresso, PNUTS)
--Cheap snapshot reads aligned along transaction boundaries
Key: design with coordination-freedom as primary goal
Interested parties: paper in pipeline; contact me ([O(1) to O(N) metadata-efficiency trade-off])
http://www.bailis.org/blog/non-blocking-transactional-atomicity/
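
The “prepare, make visible, and record sibling writes” idea can be sketched in memory. This is a simplified illustration in the spirit of the slide, not the authors' implementation: the `Shard`, `Version`, `put_all`, and `get_all` names are hypothetical, shards are plain objects rather than remote servers, and `shards` maps each key directly to the shard that owns it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: object
    ts: int               # transaction ID, e.g. txid=5 on the slide
    siblings: frozenset   # metadata: other keys written by the same transaction


class Shard:
    def __init__(self):
        self.versions = {}     # (key, ts) -> Version, includes not-yet-visible writes
        self.last_commit = {}  # key -> ts of the newest visible version

    def prepare(self, key, version):
        self.versions[(key, version.ts)] = version

    def commit(self, key, ts):
        self.last_commit[key] = max(self.last_commit.get(key, 0), ts)

    def get(self, key, ts=None):
        if ts is None:  # first-round read: newest visible version
            ts = self.last_commit.get(key)
        return self.versions.get((key, ts))


def put_all(shards, writes, ts):
    """Two rounds: prepare on every shard, then make visible on every shard."""
    keys = frozenset(writes)
    for key, value in writes.items():
        shards[key].prepare(key, Version(value, ts, keys - {key}))
    for key in writes:
        shards[key].commit(key, ts)


def get_all(shards, keys):
    """One round normally; a second round only to repair a partially visible write."""
    result = {key: shards[key].get(key) for key in keys}
    required = {}  # key -> newest ts that some sibling says must be visible
    for version in result.values():
        if version is None:
            continue
        for sibling in version.siblings:
            if sibling in keys:
                required[sibling] = max(required.get(sibling, 0), version.ts)
    for key, ts in required.items():
        current = result[key]
        if current is None or current.ts < ts:
            result[key] = shards[key].get(key, ts)  # fetch the prepared version by ts
    return result
```

The reader never waits for the writer: because every shard has prepared before any shard commits, a reader that sees one committed write can always fetch its siblings by transaction ID.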

Slide 32

Slide 32 text

LIVENESS: CRDTs, CALM, Immutability guarantee well-defined merge (sometimes deterministic outcome) ...but few safety guarantees (e.g., can’t safely read)!
+
SAFETY: invariants hold across all states

Slide 33

Slide 33 text

Formal framework for reasoning about application coordination requirements
Invariant              Operation   C.F.?
Equality, Inequality   Any         Y
Generate unique ID     Any         Y
Specify unique ID      Insert      N
>                      Increment   Y
>                      Decrement   N
<                      Decrement   Y
<                      Increment   N
Foreign Key            Insert      Y
Foreign Key            Delete      Y*
Secondary Indexing     Any         Y
Materialized Views     Any         Y
AUTO_INCREMENT         Insert      N
Callouts on the slide: RAMP Transaction; Check those CRDTs; Use that sweet Riak AP; Remainder: Get yourself a CAS
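
The “>” rows of the table can be illustrated with a counter whose state is a set of deltas, merged by set union. This is a sketch under that assumption; the account balance, the delta tags, and the helper names are made up for the example.

```python
def value(deltas):
    """Counter value: sum of all recorded deltas."""
    return sum(amount for _, amount in deltas)


def positive(deltas):
    """Invariant: the counter stays above zero."""
    return value(deltas) > 0


initial = frozenset({("init", 100)})

# Concurrent increments on two replicas: each valid, and so is their union.
inc_a = initial | {("a", +50)}
inc_b = initial | {("b", +70)}

# Concurrent decrements: each valid alone, but together they overdraw.
dec_a = initial | {("a", -60)}
dec_b = initial | {("b", -70)}
```

Here `value(inc_a | inc_b)` is 220 and `value(dec_a | dec_b)` is -30, so increments preserve the “> 0” invariant under merge while decrements may not, matching the Y and N entries in the table.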

Slide 34

Slide 34 text

Consistency is about applications SSI’s Synchronization Shackles Scaling: Just Do It Freedom in a Database “I Love Pain” You and the Future

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

“I’m no good working from a comfort zone. I need pain. I love pain.” --Dennis Rodman, Bad As I Wanna Be

Slide 37

Slide 37 text

TPC-C: combine fkeys with sequence number insert on commit...

Slide 38

Slide 38 text

TPC-C New-Order
- Pre-materialized aggregates (e.g., W_YTD=SUM(orders for warehouse)): RAMP transaction on counter CRDT
[Diagram: warehouse, district, orders, neworders tables; labels “+100” and “insert 100”]

Slide 39

Slide 39 text

TPC-C New-Order
- Foreign key insert (e.g., NewOrder, Orders tables): RAMP transaction across tables
- Pre-materialized aggregates (e.g., W_YTD=SUM(orders for warehouse)): RAMP transaction on counter CRDT
[Diagram: warehouse, district, orders, neworders tables; “insert O_ID” on both orders and neworders]

Slide 40

Slide 40 text

TPC-C New-Order
- Foreign key insert (e.g., NewOrder, Orders tables): RAMP transaction across tables
- Pre-materialized aggregates (e.g., W_YTD=SUM(orders for warehouse)): RAMP transaction on counter CRDT
- Sequence number ID assignment (i.e., D_NEXT_O_ID): deferred atomic incrementAndGet() on commit; assign new O_ID
[Diagram: warehouse, district, orders, neworders tables; “insert O_ID” on both orders and neworders]

Slide 41

Slide 41 text

TPC-C New-Order
- Foreign key insert (e.g., NewOrder, Orders tables): RAMP transaction across tables
- Pre-materialized aggregates (e.g., W_YTD=SUM(orders for warehouse)): RAMP transaction on counter CRDT
- Sequence number ID assignment (i.e., D_NEXT_O_ID): deferred atomic incrementAndGet() on commit; assign new O_ID
  - rewrite FK references to point to temp unique ID
  - create local index from temp unique ID to sequence ID
[Diagram: warehouse, district, orders, neworders tables; “insert O_ID” on both orders and neworders, with a tmp ID]

Slide 42

Slide 42 text

TPC-C New-Order
- Foreign key insert (e.g., NewOrder, Orders tables): RAMP transaction across tables
- Pre-materialized aggregates (e.g., W_YTD=SUM(orders for warehouse)): RAMP transaction on counter CRDT
- Sequence number ID assignment (i.e., D_NEXT_O_ID): deferred atomic incrementAndGet() on commit; assign new O_ID (ONLY SYNCH COORDINATION REQUIRED)
  - rewrite FK references to point to temp unique ID
  - create local index from temp unique ID to sequence ID
[Diagram: warehouse, district, orders, neworders tables; “insert O_ID” on both orders and neworders, with a tmp ID]
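
The deferred ID assignment can be sketched as follows. This is a single-process illustration, assuming a lock-protected counter stands in for the coordinated incrementAndGet() on D_NEXT_O_ID; the `SequenceCounter` and `NewOrderTxn` classes and their field names are hypothetical.

```python
import threading
import uuid


class SequenceCounter:
    """Stand-in for the one coordinated operation: an atomic increment-and-get."""

    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def increment_and_get(self):
        with self._lock:
            self._value += 1
            return self._value


class NewOrderTxn:
    """Builds its rows against a temporary unique ID, with no coordination."""

    def __init__(self):
        self.tmp_id = uuid.uuid4().hex
        self.order_row = {"id": self.tmp_id}
        self.new_order_row = {"order_ref": self.tmp_id}  # FK rewritten to the temp ID

    def commit(self, counter, index):
        # The only synchronous step: obtain the next sequence number at commit time,
        # then record the temp-ID-to-sequence-ID mapping in a local index.
        sequence_id = counter.increment_and_get()
        index[self.tmp_id] = sequence_id
        return sequence_id


counter, index = SequenceCounter(), {}
txns = [NewOrderTxn() for _ in range(3)]
assigned = [txn.commit(counter, index) for txn in txns]
```

Everything before `commit` runs against the temporary ID, so the critical section shrinks to a single counter increment per transaction.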

Slide 43

Slide 43 text

“You can like me or hate me. But all I can say is, when I get on that damn floor, all I’m going to do is get solid.” --Dennis Rodman, Bad As I Wanna Be

Slide 44

Slide 44 text

Linear Scaling via Minimized Coordination
Coordination need not be a bottleneck (if implemented in a coordination-free manner)
No magic in implementation
Single-node perf is poor (2.5K lines Java) but only one non-CF operation: incrementAndGet() on D_NEXT_O_ID
Setup: UC Berkeley database prototype, 100 EC2 CC2.8xlarge instances (thank you AWS folks! currently poor single-node performance, but unimportant if you can scale out [for the time being]), linearizable masters, only blocking coordination: incrementAndGet for “district next order ID” key, CPU-bound on in-memory data; ~2500 lines Java; 120 clients/warehouse, 5 warehouses/machine, no THINK TIME (i.e., more contention than stock configuration)

Slide 45

Slide 45 text

Consistency is about applications SSI’s Synchronization Shackles Scaling: Just Do It Freedom in a Database “I Love Pain” You and the Future

Slide 46

Slide 46 text

Standard SQL with extensions and analysis
CREATE TABLE Orders (
  O_ID int AUTO_INCREMENT,
  C_ID int,
  O_QTY int,
  DATE datetime NOT NULL,
  PRIMARY KEY (O_ID),
  FOREIGN KEY (C_ID) REFERENCES Customers(C_ID),
  CONSTRAINT [O_QTY > 0]
)
CREATE PROCEDURE CreateOrder(@C_ID int, @O_QTY int) AS
  INSERT INTO Orders (C_ID, O_QTY, DATE) VALUES (@C_ID, @O_QTY, NOW());
GO
> WARNING: Orders.O_ID requires coordination! INSERT found in CreateOrder
> WARNING: CreateOrder requires remote check for @C_ID!
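
One way to picture such an analysis is a lookup against the invariant/operation table from earlier in the talk. This is a toy sketch, not the tool the slide describes: the `RULES` dictionary copies a few rows of that table, while the `analyze` function, its input format, and the warning text are assumptions made for illustration.

```python
# (constraint kind, operation) -> coordination-free?  Rows taken from the talk's table.
RULES = {
    ("AUTO_INCREMENT", "insert"): False,
    ("SPECIFIED UNIQUE ID", "insert"): False,
    ("GENERATED UNIQUE ID", "insert"): True,
    ("FOREIGN KEY", "insert"): True,
    ("FOREIGN KEY", "delete"): True,   # marked Y* in the table
    (">", "increment"): True,
    (">", "decrement"): False,
    ("<", "decrement"): True,
    ("<", "increment"): False,
}


def analyze(constraints, operation):
    """constraints: list of (column, constraint kind) pairs declared on a table.

    Returns a warning for every constraint the table marks as not
    coordination-free for this operation; unknown pairs are reported too.
    """
    warnings = []
    for column, kind in constraints:
        verdict = RULES.get((kind, operation))
        if verdict is False:
            warnings.append(f"WARNING: {column} requires coordination on {operation}")
        elif verdict is None:
            warnings.append(f"NOTE: no rule for {kind} on {operation} ({column})")
    return warnings


orders_constraints = [("Orders.O_ID", "AUTO_INCREMENT"), ("Orders.C_ID", "FOREIGN KEY")]
report = analyze(orders_constraints, "insert")
```

For the Orders table this flags only the AUTO_INCREMENT column; the foreign key insert is coordination-free per the table, although the slide still warns that it needs a remote check.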

Slide 47

Slide 47 text

How do I web scale?

Slide 48

Slide 48 text

1.) Maximize safe concurrency
Analyze operations, invariants for coordination-freedom; delay synchronization when possible
next level: automated analysis
2.) Minimize distribution of conflicts
Resolve conflicts using as few servers (space) and with as short a critical section (time) as possible
use pessimistic locking, optimistic execution with validation, or rewrite queries to be coordination-free
next level: automated conflict avoidance, rewriting

Slide 49

Slide 49 text

http://martinfowler.com/articles/nosql-intro-original.pdf

Slide 50

Slide 50 text

Use one system!
Get a: CAS/OCC/Lock Mgr
Get some: CRDTs/RAMPs/EC
Lots of re-use: Query model, local persistence, cluster membership, sharding protocol, failure detection, metrics, monitoring, administration

Slide 51

Slide 51 text

“Polyglot Persistence” is apt for 2013, but...
- it’s an anti-pattern
- introduces unnecessary complexity
- fundamental differences are small
- symptomatic of an immature ecosystem
Building correct, reliable, and high-performance databases is hard and takes time, lol
...if Polyglot Persistence for online data serving is still standard for non-legacy apps in 2023, the OSS DB community will have failed

Slide 52

Slide 52 text

Joint work with great folks including: Aaron Davidson, Ali Ghodsi, Alan Fekete, Mike Franklin, Joe Hellerstein, Ion Stoica
Special thanks to: Peter Alvaro, Phil Bernstein, Dan Bruckner, Neil Conway, Robert Hodges, Evan Sparks, Doug Terry
“The game matters... This is a great game.” --Dennis Rodman, Bad As I Wanna Be

Slide 53

Slide 53 text

There is a fundamental trade-off between limited coordination and application-level consistency
Coordination-freedom is a necessary and sufficient condition for availability and indefinite scalability (more precisely: for validity, availability, and convergence)
Application-level correctness criteria are often but not always maintainable without coordination
Your (future) database can manage this for you.
Reason about your application, not your database replication protocol.
Hint: you probably won’t need a different database for each case
Know what “consistency” means to your application.
Hint: linearizability is not an application-level concept.
Hint: you can’t “beat” CAP when “C” means SSI or linearizability.