• Database = linked groups of records ("CODASYL")
  – Pointers were physical names; today we hash
• Programmer as "navigator" through the links
  – Similar to a DOM engine, the WWW, graph DBs
• Used for its high performance, but…
  – Hard to program and maintain
  – Hard to evolve the schema (embedded in code)
• Wikipedia: "IDMS"
• Separate the data from the code
  – SQL is the (only) API
• Data outlasts any particular implementation
  – because the model doesn't change
• Goal: implement the top-down model well
  – Led to transactions as a tool
  – Declarative language leaves room for optimization
• "The most important role of the system is to provide a file system" – original 1974 Unix paper
• Bottom-up world view
  – Few, simple, efficient mechanisms
  – Layers and composition
  – "navigational"
  – Evolution comes from APIs, encapsulation
• NoSQL is in this Unix tradition
  – Examples: dbm (1979 KV store), gdbm, Berkeley DB, JDBM (example below)
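The dbm lineage is still visible in Python's standard library, which wraps the same minimal interface; a quick illustration of how little this tradition asks of the programmer:

```python
import dbm

# A file-backed hash of bytes keys to bytes values, and nothing more:
# the Unix KV tradition the slide describes.
with dbm.open("example.db", "c") as db:   # "c" creates the file if missing
    db[b"user:1"] = b"Alice"
    print(db[b"user:1"])                  # b'Alice'
```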
Database View
• Top down
  – Clean model
  – ACID transactions
• Two kinds of developers
  – DB authors
  – SQL programmers
• Values
  – Clean semantics
  – Set operations
  – Easy long-term evolution
• Venues: SIGMOD, VLDB

Systems View
• Bottom up
  – Build on top
  – Evolve modules
• One kind of programmer
  – Integrated use
• Values
  – Good APIs
  – Flexibility
  – Range of possible programs
• Venues: SOSP, OSDI
• Systems values:
  – Layered, ideally modular APIs
  – Enable a range of systems and semantics
• Some things to build on top over time:
  – Multi-component transactions
  – Secondary indices
  – An evolution story
  – Returning sets of data, not just values
• Each object class maps to a table
  – Object ID is the primary key
  – Object fields are the columns
• Update key =>
  – create a SQL query to UPDATE a row
  – execute the query (sketch below)
• Typical consequences:
  – Extra copies, poor use of RAM
    • One copy for the app, one for the DB buffer manager
  – Inheritance and evolution are messy
  – Performance fine for Ruby on Rails, but heavyweight
• The "Vietnam of Computer Science"
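A minimal sketch of the object-to-row mapping described above, using sqlite3; the User class and users table are hypothetical examples:

```python
import sqlite3

class User:
    def __init__(self, oid, name, email):
        self.oid, self.name, self.email = oid, name, email

def save(conn, user):
    # Object fields become columns; the object ID is the primary key.
    conn.execute(
        "UPDATE users SET name = ?, email = ? WHERE oid = ?",
        (user.name, user.email, user.oid),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (oid INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'a', 'a@x')")
# Note the extra copy: `user` lives in the app's heap while the row's
# bytes live again in the DB engine's buffer pool.
save(conn, User(1, "Alice", "alice@example.com"))
```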
• Tail latency of reads is king
  – (writes are async, so write tail latency is OK)
• Consequences:
  – Minimize seeks for individual reads
  – Optimize data read together
    • Caching
    • Denormalize data (i.e., copy fields to multiple places)
1. Multiple copies have to be kept in sync (sketch below)
  • Slows updates to make reads faster
2. Significant added complexity
  • Really prefer a single master copy (modulo replication)
• Both SQL and NoSQL have this problem:
  – SQL:
    • Denormalized schemas, consistency constraints
    • Materialized views = cached virtual tables with invalidation
  – NoSQL: the app has to track invalidation/updates
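A toy illustration of the sync burden, assuming a hypothetical layout where the author's name is copied into every post so a post renders with one read:

```python
users = {1: {"name": "Alice"}}
posts = {
    10: {"author_id": 1, "author_name": "Alice", "text": "hi"},
    11: {"author_id": 1, "author_name": "Alice", "text": "bye"},
}

def rename_user(uid, new_name):
    users[uid]["name"] = new_name
    # The slow part: every denormalized copy must be found and updated
    # (or invalidated) — exactly the sync cost named in point 1.
    for post in posts.values():
        if post["author_id"] == uid:
            post["author_name"] = new_name

rename_user(1, "Alicia")
```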
• Seeks dominate read latency
  – Tend to want key data collocated for ≤ 1 seek
• Many NoSQL systems driven by this
  – Airline reservations: Sabre (pre-SQL until recently)
  – Inktomi search engine
  – Amazon's Dynamo
  – Google's BigTable, Spanner
• Open question: do SSDs => normalization OK?
• SQL: joins are set operations
  – Normally iterate through results
  – Places an emphasis on locality of sets
• NoSQL often returns a single value
  – Denormalize if needed to get a "complete" value
  – No joins
  – Some small sets; a search engine returns k values
    • One seek per value is OK as long as they are parallel
  – Later: iteration over snapshots
• Assume all keys fit into an in-memory hash table
  – All values go to a log; the index points into the log
• Simple durability, mostly sequential writes
• All reads take at most one seek
  – 0 if cached
  – Hash the key, follow the pointer into the log
• Compact the log to reclaim dead space
• Recovery is easy: checkpoint + scan the log (sketch below)
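A minimal sketch of this design (a Bitcask-style store); the tab-separated record format is a toy encoding that assumes keys and values contain no tabs or newlines:

```python
import os

class LogKV:
    def __init__(self, path):
        self.log = open(path, "a+b")   # append-only log on disk
        self.index = {}                # key -> offset of latest value

    def put(self, key, value):
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        self.log.write(key.encode() + b"\t" + value.encode() + b"\n")
        self.log.flush()               # mostly sequential writes
        self.index[key] = offset       # index points into the log

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)          # at most one seek (zero if cached)
        k, v = self.log.readline().rstrip(b"\n").split(b"\t", 1)
        return v.decode()

    def recover(self):
        # Recovery: scan the log from the start, rebuilding the index;
        # later records for a key simply overwrite earlier index entries.
        self.index.clear()
        self.log.seek(0)
        while True:
            offset = self.log.tell()
            line = self.log.readline()
            if not line:
                break
            self.index[line.split(b"\t", 1)[0].decode()] = offset
```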
• SQL databases:
  – Write-ahead log
  – …but later overwrite values in place
  – Focus on future read locality
• NoSQL (Bitcask, BigTable, Spanner, …)
  – The log is the final location
  – Compact the log to recover space (sketch below)
  – Limited multi-key locality after compaction
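A sketch of compaction for the LogKV store above: copy only the live version of each key into a fresh log, then swap. Keys come out in hash-table order, which is exactly the limited multi-key locality the slide notes:

```python
def compact(store, new_path):
    """Reclaim dead space: keep only the latest version of each key."""
    new_log = open(new_path, "w+b")
    new_index = {}
    for key in list(store.index):
        value = store.get(key)              # latest version only
        new_index[key] = new_log.tell()
        new_log.write(key.encode() + b"\t" + value.encode() + b"\n")
    new_log.flush()
    store.log.close()
    store.log, store.index = new_log, new_index
```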
1. KV workloads:
  – Need low tail latency
  – Do not need to return sets
  – (Update in place helps with sets)
2. Don't overwrite the current version
  – Undo logs are bad for whole-value writes
    • Write the value twice (but in the same log)
    • Blob support in DBs typically avoids undo logs
  – Undo logs are much better for:
    • Logical operations such as increment
    • Partial updates (avoid writing the whole object)
• No overwrites => the log retains multiple versions
  – All (recent) versions are in the log
• Solves the iteration problem:
  – Problem: need a self-consistent set
  – DB solution: a large read lock, which blocks writes
  – DB solution 2: "snapshot isolation" (Oracle)
    • All reads at a timestamp taken at the beginning of the transaction
  – Spanner: "snapshot reads" pick a timestamp (sketch below)
    • Use the older versions in the log
    • Extra indexing (similar to BigTable)
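A minimal sketch of snapshot reads over retained versions; timestamps here are a simple counter, standing in for the real timestamp machinery:

```python
import itertools

class MVStore:
    def __init__(self):
        self.versions = {}              # key -> [(ts, value), ...]
        self.clock = itertools.count(1)

    def put(self, key, value):
        ts = next(self.clock)
        self.versions.setdefault(key, []).append((ts, value))
        return ts

    def snapshot_read(self, key, ts):
        # No read locks: old versions are immutable, so a read at
        # timestamp ts always sees a self-consistent set.
        for vts, value in reversed(self.versions.get(key, [])):
            if vts <= ts:
                return value
        return None

s = MVStore()
s.put("x", "old")
t = s.put("y", "first")
s.put("x", "new")                       # later write invisible at snapshot t
assert s.snapshot_read("x", t) == "old"
```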
• Multi-key transactions via the log:
  – Begin => log "begin xid"
  – Commit => log "commit xid" + checksum
  – Abort => do nothing, or log "abort xid"
  – Include the xid in constituent updates
• Recovery:
  – Only replay valid committed transactions (sketch below)
  – Ensures all-or-nothing multi-key updates
• Commit also installs index updates atomically
  – Easy, since they are in memory
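A sketch of recovery under this scheme: replay only updates whose xid has a commit record (the checksum validation mentioned above is omitted):

```python
log = [
    ("begin",  "t1"),
    ("update", "t1", "k1", "v1"),
    ("update", "t1", "k2", "v2"),
    ("commit", "t1"),
    ("begin",  "t2"),
    ("update", "t2", "k3", "v3"),   # t2 never committed: discarded
]

def recover(log):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    state = {}
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, xid, key, value = rec
            state[key] = value       # all-or-nothing multi-key updates
    return state

assert recover(log) == {"k1": "v1", "k2": "v2"}
```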
• Two-phase commit:
  – End of phase 1 => log "prepare xid"
    • Really the same state as commit, but not committed yet
  – After the vote RPC, log the commit
  – Easy to do because of the no-overwrite policy
• This also enables KV updates to be part of multi-system transactions (participant sketch below)
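A participant-side sketch of this protocol on top of the same log; the prepare/decide method names are hypothetical:

```python
class Participant:
    def __init__(self):
        self.log = []

    def prepare(self, xid, updates):
        # Phase 1: persist everything needed to commit, but don't
        # commit — the "prepare" record marks the in-doubt state.
        for key, value in updates:
            self.log.append(("update", xid, key, value))
        self.log.append(("prepare", xid))
        return "yes"                 # vote returned to the coordinator

    def decide(self, xid, commit):
        # Phase 2: the coordinator's decision is logged either way.
        self.log.append(("commit" if commit else "abort", xid))
```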
• For secondary indices, transactions and logging work the same
• Will need to lock both indices sometimes
  – Both are in memory
  – Can use a single write lock for both if most updates change both indices (sketch below)
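A sketch of the single-lock variant, assuming a hypothetical secondary index on an email field:

```python
import threading

class DualIndex:
    def __init__(self):
        self.primary = {}        # id -> record
        self.by_email = {}       # secondary index: email -> id
        self.lock = threading.Lock()

    def put(self, rec_id, record):
        # One write lock covers both indices, so an update that
        # touches both installs atomically.
        with self.lock:
            old = self.primary.get(rec_id)
            if old is not None:
                self.by_email.pop(old["email"], None)
            self.primary[rec_id] = record
            self.by_email[record["email"]] = rec_id
```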
• Replication is straightforward given 2PC
  – Various quorum approaches, as in Dynamo
• Recovery can be simplified
  – Can get a lost index from replicas
• More complex:
  – Getting independent replicas
  – Consistent hashing to vary nodes/system (sketch below)
    • Alternative: use a trie; see Gribble's SDDS, OSDI 2000
    • A trie supports range queries
  – Micro-sharding for parallel recovery
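A minimal consistent-hashing sketch (no virtual nodes or replication, which a real system like Dynamo would add): nodes and keys hash onto a ring, and a key is owned by the first node clockwise from its hash, so changing the node set only moves the keys in one arc.

```python
import bisect, hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def owner(self, key):
        # First node clockwise from the key's position on the ring.
        hashes = [p for p, _ in self.points]
        i = bisect.bisect(hashes, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))
```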
• 2PC needs all k nodes to be up
  – Prob(up) = Prob(single node up)^k [= small] (worked example below)
• Spanner solution:
  – Each replica is actually a Paxos group
    • Each Paxos group is local to one datacenter
  – 2PC among the Paxos groups
  – Drastically improves Prob(single node up)
• Layering hides the complexity of Paxos
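To make the probability concrete, a worked instance; the availability figures 0.99 and 0.99999 are illustrative assumptions, not numbers from the talk:

```latex
\Pr[\text{all } k \text{ up}] = p^{k}
\qquad p = 0.99,\; k = 5 \;\Rightarrow\; 0.99^{5} \approx 0.951
\qquad p = 0.99999 \;\Rightarrow\; 0.99999^{5} \approx 0.99995
```

With bare nodes at 99% availability, some participant is down about 5% of the time; replacing each node with a five-nines Paxos group keeps the product near five nines.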
• Need a version # for each compacted file
  – The new log is always in the current version
  – Compaction always writes out the new version
• Two options:
  – Recovery can read old versions
  – Converters from version n to n+1 (e.g., Microsoft Office)
• Compacted files are immutable
  – Enables one-time batch conversion
• Keeping cached copies and materialized views up to date
• Basic solution:
  – Need "commit hooks"
  – On commit, notify listeners via pub-sub (sketch below)
  – Listeners invalidate their copies
    • Or choose to serve the stale version while updating
• This can be a broadly useful building block
  – E.g., memcache
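A sketch of the commit-hook pattern with a toy in-process pub-sub bus; a real system would publish over the network to caches such as memcache:

```python
class Bus:
    def __init__(self):
        self.subs = {}       # key -> [callback, ...]

    def subscribe(self, key, cb):
        self.subs.setdefault(key, []).append(cb)

    def publish(self, key):
        for cb in self.subs.get(key, []):
            cb(key)

cache = {"user:1": "cached bytes"}
bus = Bus()
bus.subscribe("user:1", lambda k: cache.pop(k, None))  # listener invalidates

def commit(store, key, value):
    store[key] = value
    bus.publish(key)         # commit hook: notify listeners

store = {}
commit(store, "user:1", "new bytes")
assert "user:1" not in cache
```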
• Easier:
  – Ordered keys, as in BigTable
  – Merge equi-join across stores (sketch below)
  – Roughly how Inktomi worked (sorted by doc id)
• Some apps essentially do the joins themselves
• Harder:
  – Joining against a secondary index
  – Non-equal key joins
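A sketch of a merge equi-join over two stores that both scan in key order, as a BigTable-style scan would; duplicate keys are not handled in this toy version:

```python
def merge_join(left, right):
    """left, right: iterables of (key, value) sorted by key."""
    li, ri = iter(left), iter(right)
    l, r = next(li, None), next(ri, None)
    while l and r:
        if l[0] == r[0]:
            yield l[0], l[1], r[1]          # matching keys: emit joined row
            l, r = next(li, None), next(ri, None)
        elif l[0] < r[0]:
            l = next(li, None)              # advance the smaller side
        else:
            r = next(ri, None)

docs  = [(1, "doc one"), (2, "doc two"), (4, "doc four")]
ranks = [(1, 0.9), (3, 0.5), (4, 0.2)]
print(list(merge_join(docs, ranks)))        # joins on doc ids 1 and 4
```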
• Stasis: a library for building these kinds of systems
  – One attempt at layering
  – Provides transactional logging and recovery
  – Can support update in place, compaction, or a mix
  – Handles the ORM problem cleanly
  – Supports 2PC (but not on top of Paxos)
• Rusty Sears' PhD topic
  – Open source on GitHub
• Traditional engines collocate the log manager, buffer manager, and transaction manager on one node
  – Reason: they hold mutexes across calls
    • Fundamental to the use of LSNs on pages
• One use of Stasis: break apart these pieces
  – LSN-free pages => no locks across calls
  – Instead, pipeline async calls among modules
    • Enables modules to be on different machines
  – Enables a new approach for large-scale DBs
    • Sharding is no longer the only option for a larger DB
• Two world views from the 1970s
  – Continue to converge in the cloud
• Possible outcome:
  – A layered, modular system
  – … with great flexibility
  – … used to build a variety of systems and semantics
  – … including a full SQL DBMS