• Database = linked groups of records ("CODASYL")
  – Pointers were physical names; today we hash
• Programmer as "navigator" through the links
  – Similar to a DOM engine, the WWW, graph DBs
• Used for its high performance, but…
  – Hard to program and maintain
  – Hard to evolve the schema (embedded in code)
• Wikipedia: "IDMS"
• Separate the data from the code
  – SQL is the (only) API
• Data outlasts any particular implementation
  – because the model doesn't change
• Goal: implement the top-down model well
  – Led to transactions as a tool
  – Declarative language leaves room for optimization
• "The most important role of the system is to provide a file system" – original 1974 Unix paper
• Bottom-up world view
  – Few, simple, efficient mechanisms
  – Layers and composition
  – "navigational"
  – Evolution comes from APIs, encapsulation
• NoSQL is in this Unix tradition
  – Examples: dbm (1979 KV store), gdbm, Berkeley DB, JDBM (example below)
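The dbm lineage is still visible in Python's standard library, which wraps the same minimal interface; a quick illustration of how little this tradition asks of the programmer:

```python
import dbm

# A file-backed hash of bytes keys to bytes values, and nothing more:
# the Unix KV tradition the slide describes.
with dbm.open("example.db", "c") as db:   # "c" creates the file if missing
    db[b"user:1"] = b"Alice"
    print(db[b"user:1"])                  # b'Alice'
```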
Database View
• Top down
  – Clean model
  – ACID transactions
• Two kinds of developers
  – DB authors
  – SQL programmers
• Values
  – Clean semantics
  – Set operations
  – Easy long-term evolution
• Venues: SIGMOD, VLDB

Systems View
• Bottom up
  – Build on top
  – Evolve modules
• One kind of programmer
  – Integrated use
• Values
  – Good APIs
  – Flexibility
  – Range of possible programs
• Venues: SOSP, OSDI
• Systems values:
  – Layered, ideally modular APIs
  – Enable a range of systems and semantics
• Some things to build on top over time:
  – Multi-component transactions
  – Secondary indices
  – An evolution story
  – Returning sets of data, not just values
• Each object class maps to a table
  – Object ID is the primary key
  – Object fields are the columns
• Update key =>
  – create a SQL query to UPDATE a row
  – execute the query (sketch below)
• Typical consequences:
  – Extra copies, poor use of RAM
    • One copy for the app, one for the DB buffer manager
  – Inheritance and evolution are messy
  – Performance fine for Ruby on Rails, but heavyweight
• The "Vietnam of Computer Science"
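A minimal sketch of the object-to-row mapping described above, using sqlite3; the User class and users table are hypothetical examples:

```python
import sqlite3

class User:
    def __init__(self, oid, name, email):
        self.oid, self.name, self.email = oid, name, email

def save(conn, user):
    # Object fields become columns; the object ID is the primary key.
    conn.execute(
        "UPDATE users SET name = ?, email = ? WHERE oid = ?",
        (user.name, user.email, user.oid),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (oid INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'a', 'a@x')")
# Note the extra copy: `user` lives in the app's heap while the row's
# bytes live again in the DB engine's buffer pool.
save(conn, User(1, "Alice", "alice@example.com"))
```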
• Tail latency of reads is king
  – (writes are async, so write tail latency is OK)
• Consequences:
  – Minimize seeks for individual reads
  – Optimize data read together
    • Caching
    • Denormalize data (i.e., copy fields to multiple places)
1. Multiple copies have to be kept in sync (sketch below)
  • Slows updates to make reads faster
2. Significant added complexity
  • Really prefer a single master copy (modulo replication)
• Both SQL and NoSQL have this problem:
  – SQL:
    • Denormalized schemas, consistency constraints
    • Materialized views = cached virtual tables with invalidation
  – NoSQL: the app has to track invalidation/updates
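A toy illustration of the sync burden, assuming a hypothetical layout where the author's name is copied into every post so a post renders with one read:

```python
users = {1: {"name": "Alice"}}
posts = {
    10: {"author_id": 1, "author_name": "Alice", "text": "hi"},
    11: {"author_id": 1, "author_name": "Alice", "text": "bye"},
}

def rename_user(uid, new_name):
    users[uid]["name"] = new_name
    # The slow part: every denormalized copy must be found and updated
    # (or invalidated) — exactly the sync cost named in point 1.
    for post in posts.values():
        if post["author_id"] == uid:
            post["author_name"] = new_name

rename_user(1, "Alicia")
```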
• Seeks dominate read latency
  – Tend to want key data collocated for ≤ 1 seek
• Many NoSQL systems driven by this
  – Airline reservations: Sabre (pre-SQL until recently)
  – Inktomi search engine
  – Amazon's Dynamo
  – Google's BigTable, Spanner
• Open question: do SSDs => normalization OK?
• SQL: joins are set operations
  – Normally iterate through results
  – Places an emphasis on locality of sets
• NoSQL often returns a single value
  – Denormalize if needed to get a "complete" value
  – No joins
  – Some small sets; a search engine returns k values
    • One seek per value is OK as long as they are parallel
  – Later: iteration over snapshots
• Assume all keys fit into an in-memory hash table
  – All values go to a log; the index points into the log
• Simple durability, mostly sequential writes
• All reads take at most one seek
  – 0 if cached
  – Hash the key, follow the pointer into the log
• Compact the log to reclaim dead space
• Recovery is easy: checkpoint + scan the log (sketch below)
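A minimal sketch of this design (a Bitcask-style store); the tab-separated record format is a toy encoding that assumes keys and values contain no tabs or newlines:

```python
import os

class LogKV:
    def __init__(self, path):
        self.log = open(path, "a+b")   # append-only log on disk
        self.index = {}                # key -> offset of latest value

    def put(self, key, value):
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        self.log.write(key.encode() + b"\t" + value.encode() + b"\n")
        self.log.flush()               # mostly sequential writes
        self.index[key] = offset       # index points into the log

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)          # at most one seek (zero if cached)
        k, v = self.log.readline().rstrip(b"\n").split(b"\t", 1)
        return v.decode()

    def recover(self):
        # Recovery: scan the log from the start, rebuilding the index;
        # later records for a key simply overwrite earlier index entries.
        self.index.clear()
        self.log.seek(0)
        while True:
            offset = self.log.tell()
            line = self.log.readline()
            if not line:
                break
            self.index[line.split(b"\t", 1)[0].decode()] = offset
```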
• SQL databases:
  – Write-ahead log
  – …but later overwrite values in place
  – Focus on future read locality
• NoSQL (Bitcask, BigTable, Spanner, …)
  – The log is the final location
  – Compact the log to recover space (sketch below)
  – Limited multi-key locality after compaction
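A sketch of compaction for the LogKV store above: copy only the live version of each key into a fresh log, then swap. Keys come out in hash-table order, which is exactly the limited multi-key locality the slide notes:

```python
def compact(store, new_path):
    """Reclaim dead space: keep only the latest version of each key."""
    new_log = open(new_path, "w+b")
    new_index = {}
    for key in list(store.index):
        value = store.get(key)              # latest version only
        new_index[key] = new_log.tell()
        new_log.write(key.encode() + b"\t" + value.encode() + b"\n")
    new_log.flush()
    store.log.close()
    store.log, store.index = new_log, new_index
```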
1. KV workloads:
  – Need low tail latency
  – Do not need to return sets
  – (Update in place helps with sets)
2. Don't overwrite the current version
  – Undo logs are bad for whole-value writes
    • Write the value twice (but in the same log)
    • Blob support in DBs typically avoids undo logs
  – Undo logs are much better for:
    • Logical operations such as increment
    • Partial updates (avoid writing the whole object)
• No overwrites => the log retains multiple versions
  – All (recent) versions are in the log
• Solves the iteration problem:
  – Problem: need a self-consistent set
  – DB solution: a large read lock, which blocks writes
  – DB solution 2: "snapshot isolation" (Oracle)
    • All reads at a timestamp taken at the beginning of the transaction
  – Spanner: "snapshot reads" pick a timestamp (sketch below)
    • Use the older versions in the log
    • Extra indexing (similar to BigTable)
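A minimal sketch of snapshot reads over retained versions; timestamps here are a simple counter, standing in for the real timestamp machinery:

```python
import itertools

class MVStore:
    def __init__(self):
        self.versions = {}              # key -> [(ts, value), ...]
        self.clock = itertools.count(1)

    def put(self, key, value):
        ts = next(self.clock)
        self.versions.setdefault(key, []).append((ts, value))
        return ts

    def snapshot_read(self, key, ts):
        # No read locks: old versions are immutable, so a read at
        # timestamp ts always sees a self-consistent set.
        for vts, value in reversed(self.versions.get(key, [])):
            if vts <= ts:
                return value
        return None

s = MVStore()
s.put("x", "old")
t = s.put("y", "first")
s.put("x", "new")                       # later write invisible at snapshot t
assert s.snapshot_read("x", t) == "old"
```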
• Multi-key transactions via the log:
  – Begin => log "begin xid"
  – Commit => log "commit xid" + checksum
  – Abort => do nothing, or log "abort xid"
  – Include the xid in constituent updates
• Recovery:
  – Only replay valid committed transactions (sketch below)
  – Ensures all-or-nothing multi-key updates
• Commit also installs index updates atomically
  – Easy, since they are in memory
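A sketch of recovery under this scheme: replay only updates whose xid has a commit record (the checksum validation mentioned above is omitted):

```python
log = [
    ("begin",  "t1"),
    ("update", "t1", "k1", "v1"),
    ("update", "t1", "k2", "v2"),
    ("commit", "t1"),
    ("begin",  "t2"),
    ("update", "t2", "k3", "v3"),   # t2 never committed: discarded
]

def recover(log):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    state = {}
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, xid, key, value = rec
            state[key] = value       # all-or-nothing multi-key updates
    return state

assert recover(log) == {"k1": "v1", "k2": "v2"}
```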
• Two-phase commit:
  – End of phase 1 => log "prepare xid"
    • Really the same state as commit, but not committed yet
  – After the vote RPC, log the commit
  – Easy to do because of the no-overwrite policy
• This also enables KV updates to be part of multi-system transactions (participant sketch below)
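A participant-side sketch of this protocol on top of the same log; the prepare/decide method names are hypothetical:

```python
class Participant:
    def __init__(self):
        self.log = []

    def prepare(self, xid, updates):
        # Phase 1: persist everything needed to commit, but don't
        # commit — the "prepare" record marks the in-doubt state.
        for key, value in updates:
            self.log.append(("update", xid, key, value))
        self.log.append(("prepare", xid))
        return "yes"                 # vote returned to the coordinator

    def decide(self, xid, commit):
        # Phase 2: the coordinator's decision is logged either way.
        self.log.append(("commit" if commit else "abort", xid))
```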
• For secondary indices, transactions and logging work the same
• Will need to lock both indices sometimes
  – Both are in memory
  – Can use a single write lock for both if most updates change both indices (sketch below)
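A sketch of the single-lock variant, assuming a hypothetical secondary index on an email field:

```python
import threading

class DualIndex:
    def __init__(self):
        self.primary = {}        # id -> record
        self.by_email = {}       # secondary index: email -> id
        self.lock = threading.Lock()

    def put(self, rec_id, record):
        # One write lock covers both indices, so an update that
        # touches both installs atomically.
        with self.lock:
            old = self.primary.get(rec_id)
            if old is not None:
                self.by_email.pop(old["email"], None)
            self.primary[rec_id] = record
            self.by_email[record["email"]] = rec_id
```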
• Replication is straightforward given 2PC
  – Various quorum approaches, as in Dynamo
• Recovery can be simplified
  – Can get a lost index from replicas
• More complex:
  – Getting independent replicas
  – Consistent hashing to vary nodes/system (sketch below)
    • Alternative: use a trie; see Gribble's SDDS, OSDI 2000
    • A trie supports range queries
  – Micro-sharding for parallel recovery
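A minimal consistent-hashing sketch (no virtual nodes or replication, which a real system like Dynamo would add): nodes and keys hash onto a ring, and a key is owned by the first node clockwise from its hash, so changing the node set only moves the keys in one arc.

```python
import bisect, hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def owner(self, key):
        # First node clockwise from the key's position on the ring.
        hashes = [p for p, _ in self.points]
        i = bisect.bisect(hashes, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))
```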
• 2PC needs all k nodes to be up
  – Prob(up) = Prob(single node up)^k [= small] (worked example below)
• Spanner solution:
  – Each replica is actually a Paxos group
    • Each Paxos group is local to one datacenter
  – 2PC among the Paxos groups
  – Drastically improves Prob(single node up)
• Layering hides the complexity of Paxos
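To make the probability concrete, a worked instance; the availability figures 0.99 and 0.99999 are illustrative assumptions, not numbers from the talk:

```latex
\Pr[\text{all } k \text{ up}] = p^{k}
\qquad p = 0.99,\; k = 5 \;\Rightarrow\; 0.99^{5} \approx 0.951
\qquad p = 0.99999 \;\Rightarrow\; 0.99999^{5} \approx 0.99995
```

With bare nodes at 99% availability, some participant is down about 5% of the time; replacing each node with a five-nines Paxos group keeps the product near five nines.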
• Need a version # for each compacted file
  – The new log is always in the current version
  – Compaction always writes out the new version
• Two options:
  – Recovery can read old versions
  – Converters from version n to n+1 (e.g., Microsoft Office)
• Compacted files are immutable
  – Enables one-time batch conversion
• Keeping cached copies and materialized views up to date
• Basic solution:
  – Need "commit hooks"
  – On commit, notify listeners via pub-sub (sketch below)
  – Listeners invalidate their copies
    • Or choose to serve the stale version while updating
• This can be a broadly useful building block
  – E.g., memcache
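A sketch of the commit-hook pattern with a toy in-process pub-sub bus; a real system would publish over the network to caches such as memcache:

```python
class Bus:
    def __init__(self):
        self.subs = {}       # key -> [callback, ...]

    def subscribe(self, key, cb):
        self.subs.setdefault(key, []).append(cb)

    def publish(self, key):
        for cb in self.subs.get(key, []):
            cb(key)

cache = {"user:1": "cached bytes"}
bus = Bus()
bus.subscribe("user:1", lambda k: cache.pop(k, None))  # listener invalidates

def commit(store, key, value):
    store[key] = value
    bus.publish(key)         # commit hook: notify listeners

store = {}
commit(store, "user:1", "new bytes")
assert "user:1" not in cache
```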
• Easier:
  – Ordered keys, as in BigTable
  – Merge equi-join across stores (sketch below)
  – Roughly how Inktomi worked (sorted by doc id)
• Some apps essentially do the joins themselves
• Harder:
  – Joining against a secondary index
  – Non-equal key joins
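A sketch of a merge equi-join over two stores that both scan in key order, as a BigTable-style scan would; duplicate keys are not handled in this toy version:

```python
def merge_join(left, right):
    """left, right: iterables of (key, value) sorted by key."""
    li, ri = iter(left), iter(right)
    l, r = next(li, None), next(ri, None)
    while l and r:
        if l[0] == r[0]:
            yield l[0], l[1], r[1]          # matching keys: emit joined row
            l, r = next(li, None), next(ri, None)
        elif l[0] < r[0]:
            l = next(li, None)              # advance the smaller side
        else:
            r = next(ri, None)

docs  = [(1, "doc one"), (2, "doc two"), (4, "doc four")]
ranks = [(1, 0.9), (3, 0.5), (4, 0.2)]
print(list(merge_join(docs, ranks)))        # joins on doc ids 1 and 4
```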
• Stasis: a library for building these kinds of systems
  – One attempt at layering
  – Provides transactional logging and recovery
  – Can support update in place, compaction, or a mix
  – Handles the ORM problem cleanly
  – Supports 2PC (but not on top of Paxos)
• Rusty Sears' PhD topic
  – Open source on GitHub
• Traditional engines collocate the log manager, buffer manager, and transaction manager on one node
  – Reason: they hold mutexes across calls
    • Fundamental to the use of LSNs on pages
• One use of Stasis: break apart these pieces
  – LSN-free pages => no locks across calls
  – Instead, pipeline async calls among modules
    • Enables modules to be on different machines
  – Enables a new approach for large-scale DBs
    • Sharding is no longer the only option for a larger DB
• Two world views from the 1970s
  – Continue to converge in the cloud
• Possible outcome:
  – A layered, modular system
  – … with great flexibility
  – … used to build a variety of systems and semantics
  – … including a full SQL DBMS