Slide 1

Advancing Distributed Systems
Eric Brewer
Professor, UC Berkeley
VP Infrastructure, Google

RICON 2012
October 11, 2012

Slide 2

Charles Bachman, 1973 Turing Award
Integrated Datastore (IDS)
(very) Early "No SQL" database

Slide 3

"Navigational" Database
•  Tight integration between code and data
   – Database = linked groups of records ("CODASYL")
      •  Pointers were physical names; today we hash
   – Programmer as "navigator" through the links
   – Similar to DOM engine, WWW, graph DBs
•  Used for its high performance, but…
   – Hard to program and maintain
   – Hard to evolve the schema (embedded in code)
Wikipedia: "IDMS"

Slide 4

Why Relational? (1970s)
•  Need a high-level model (sets)
•  Separate the data from the code
   – SQL is the (only) API
•  Data outlasts any particular implementation
   – because the model doesn't change
•  Goal: implement the top-down model well
   – Led to transactions as a tool
   – Declarative language leaves room for optimization

Slide 5

Also 1970s: Unix
"The most important job of UNIX is to provide a file system"
   – original 1974 Unix paper
•  Bottom-up world view
   – Few, simple, efficient mechanisms
   – Layers and composition
   – "navigational"
   – Evolution comes from APIs, encapsulation
•  NoSQL is in this Unix tradition
   – Examples: dbm (1979 kv), gdbm, Berkeley DB, JDBM

Slide 6

Two Valid World Views

Relational View
•  Top Down
   – Clean model
   – ACID transactions
•  Two kinds of developers
   – DB authors
   – SQL programmers
•  Values
   – Clean semantics
   – Set operations
   – Easy long-term evolution
•  Venues: SIGMOD, VLDB

Systems View
•  Bottom Up
   – Build on top
   – Evolve modules
•  One kind of programmer
   – Integrated use
•  Values
   – Good APIs
   – Flexibility
   – Range of possible programs
•  Venues: SOSP, OSDI

Slide 7

NoSQL in Context
•  Large reusable storage component
•  Systems values:
   – Layered, ideally modular APIs
   – Enable a range of systems and semantics
•  Some things to build on top over time:
   – Multi-component transactions
   – Secondary indices
   – Evolution story
   – Returning sets of data, not just values

Slide 8

Part 2: Some Differences

Slide 9

Three Interesting Differences
1.  Integration into the larger application
2.  Read/write ratio and latencies
3.  Sets vs. values

Slide 10

1) Object-Relational Mapping Problem
•  Map application objects to a table
   – Object ID is the primary key
   – Object fields are the columns
•  Update a key =>
   – create SQL query to UPDATE a row
   – execute the query
•  Typical consequences:
   – Extra copies, poor use of RAM
      •  One copy for the app, one for the DB buffer manager
   – Inheritance, evolution are messy
   – Performance fine for Ruby on Rails, but heavyweight
"Vietnam of Computer Science"
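
To make the mapping concrete, here is a minimal ORM-style sketch, assuming a hypothetical `User` class, a `users` table, and SQLite as the backing store (none of these names are from the talk): the object ID becomes the primary key, its fields become columns, and a field update turns into an UPDATE statement, with one copy of the data in the app and another in the database's buffer manager.

```python
# Hypothetical minimal ORM sketch; illustrates the mapping, not any real ORM.
import sqlite3

class User:
    def __init__(self, id, name, email):
        self.id, self.name, self.email = id, name, email

def save(conn, obj, table):
    """Write every field back as one UPDATE row, keyed by the object ID."""
    fields = {k: v for k, v in vars(obj).items() if k != "id"}
    assignments = ", ".join(f"{k} = ?" for k in fields)       # name = ?, email = ?
    conn.execute(f"UPDATE {table} SET {assignments} WHERE id = ?",
                 (*fields.values(), obj.id))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@example.com')")

u = User(1, "Ada", "ada@example.com")   # one copy lives in the app...
u.email = "ada@newhost.com"             # ...a second in the DB's buffer manager
save(conn, u, "users")
print(conn.execute("SELECT email FROM users WHERE id = 1").fetchone())
```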

Slide 11

2) Read Latency
•  For live Internet services:
   – Tail latency of reads is king
   – (writes are async and tail latency is OK)
•  Consequences:
   – Minimize seeks for individual reads
   – Optimize data read together
      •  Caching
      •  Denormalize data (i.e. copy fields to multiple places)

Slide 12

Denormalizing for Latency
•  Two basic problems:
   1.  Multiple copies have to be kept in sync
      •  Slows updates to make reads faster
   2.  Significant added complexity
      •  Really prefer a single master copy (modulo replication)
•  Both SQL and NoSQL have this problem:
   – SQL:
      •  Denormalized schemas, consistency constraints
      •  Materialized views = cached virtual tables with invalidation
   – NoSQL: the app has to track invalidation/updates
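
As an illustration of the NoSQL case (app-managed copies), here is a toy sketch using a plain dict as a stand-in for a KV store; the record layout and key names are invented for the example. The author's display name is copied into each post so a post can be rendered with one read, and the rename path shows the write-side burden of keeping every copy in sync.

```python
store = {}                                   # stand-in for a KV store's get/put

def create_post(post_id, author_id, body):
    author = store[f"user:{author_id}"]
    store[f"post:{post_id}"] = {
        "author_id": author_id,
        "author_name": author["name"],       # denormalized copy for 1-read renders
        "body": body,
    }
    author.setdefault("post_ids", []).append(post_id)

def rename_user(author_id, new_name):
    # The write-side burden: the app must find and update every copy itself.
    user = store[f"user:{author_id}"]
    user["name"] = new_name
    for pid in user.get("post_ids", []):
        store[f"post:{pid}"]["author_name"] = new_name

store["user:1"] = {"name": "Ada"}
create_post(100, 1, "hello")
rename_user(1, "Ada L.")
assert store["post:100"]["author_name"] == "Ada L."
```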

Slide 13

Denormalization Differences
•  Key difference: NoSQL tends to care more
   – Use in high-performance live services
   – Read-mostly usage => OK to burden writes
•  NoSQL typically missing invalidation support
   – SQL materialized views automate cache invalidation
•  Counter-example: Google's Percolator
   – Incrementally update many denormalized tables
   – Dependency flows (think Excel cell updates)

Slide 14

Read Latency Summary
•  Live services push hard on read latency
   – Tend to want key data collocated for ≤ 1 seek
•  Many NoSQL systems driven by this
   – Airline reservations: Sabre (pre-SQL until recently)
   – Inktomi search engine
   – Amazon's Dynamo
   – Google's BigTable, Spanner
•  Open question: do SSDs => normalization OK?

Slide 15

Sets vs. Values
•  SQL returns sets
   – Joins are set operations
   – Normally iterate through results
   – Places an emphasis on locality of sets
•  NoSQL often returns a single value
   – Denormalize if needed to get a "complete" value
   – No joins
   – Some small sets; a search engine returns k values
      •  One seek per value is OK as long as they are parallel
   – Later: iteration over snapshots

Slide 16

Bitcask 101
•  Simple single-node KV store
   – All keys fit into an in-memory hash table
   – All values go to a log; the index points into the log
•  Simple durability, mostly sequential writes
•  All reads take at most one seek
   – 0 if cached
   – Hash the key, follow the pointer to the log
•  Compact the log to reclaim dead space
•  Recovery is easy: checkpoint + scan the log
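
A toy Python sketch of this design (not Riak/Bitcask code; the record format and names are invented, and compaction and checkpointing are omitted): an append-only log on disk plus an in-memory hash mapping each key to its latest log offset, so a get is one hash lookup and at most one seek, and recovery is a scan of the log.

```python
import os

class TinyBitcask:
    def __init__(self, path):
        self.index = {}                      # key -> (offset, length) into the log
        self.log = open(path, "ab+")         # append-only log file

    def put(self, key, value):
        rec = f"{key}\t{value}\n".encode()
        self.log.seek(0, os.SEEK_END)
        off = self.log.tell()
        self.log.write(rec)                  # mostly-sequential write
        self.log.flush()
        self.index[key] = (off, len(rec))    # in-memory hash points into the log

    def get(self, key):
        off, length = self.index[key]        # one hash lookup...
        self.log.seek(off)                   # ...then at most one seek
        _, value = self.log.read(length).decode().rstrip("\n").split("\t", 1)
        return value

    def recover(self):
        # Recovery: rebuild the in-memory index by scanning the log.
        self.index.clear()
        self.log.seek(0)
        off = 0
        for line in self.log:
            key, _ = line.decode().rstrip("\n").split("\t", 1)
            self.index[key] = (off, len(line))
            off += len(line)

# (Compaction — copying only live records to a new log — is omitted here.)
db = TinyBitcask("/tmp/tiny.log")
db.put("a", "1")
db.put("a", "2")                             # the old "a" record is now dead space
assert db.get("a") == "2"
```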

Slide 17

Another difference: update in place
•  Classic DB
   – Write-ahead log
   – …but later overwrite values in place
   – Focus on future read locality
•  NoSQL (Bitcask, BigTable, Spanner, …)
   – The log is the final location
   – Compact the log to recover space
   – Limited multi-key locality after compaction

Slide 18

Why compaction?
1.  Follows from single-value read latency
   – Need low tail latency
   – Do not need to return sets
   – (Update in place helps with sets)
2.  Don't overwrite the current version
   – Undo logs bad for whole-value writes
      •  Write the value twice (but in the same log)
      •  Blob support in DBs typically avoids undo logs
   – Undo logs much better for:
      •  Logical operations such as increment
      •  Partial updates (avoid writing the whole object)

Slide 19

Why Compaction? (continued)
3.  Easy to keep multiple versions
   – All (recent) versions are in the log

Solves the iteration problem:
   – Problem: need a self-consistent set
   – DB solution: large read lock, blocks writes
   – DB solution 2: "snapshot isolation" (Oracle)
      •  All reads at the timestamp at the beginning of the transaction
   – Spanner: "snapshot reads" pick a timestamp
      •  Use the older versions in the log
      •  Extra indexing (similar to BigTable)
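
A small illustrative sketch of the multi-version idea (not Spanner or BigTable code; a simple counter stands in for real timestamps): every put appends a new version rather than overwriting, and a snapshot read at a chosen version sees a self-consistent set without blocking writers.

```python
# Illustrative multi-version store: put() never overwrites, snapshot_get()
# reads "as of" a version so iteration over many keys is self-consistent.
from collections import defaultdict

class MVStore:
    def __init__(self):
        self.versions = defaultdict(list)   # key -> [(version, value), ...]
        self.clock = 0                      # stand-in for a real timestamp

    def put(self, key, value):
        self.clock += 1
        self.versions[key].append((self.clock, value))
        return self.clock

    def snapshot_get(self, key, ts):
        """Newest value written at version <= ts, or None."""
        val = None
        for v, x in self.versions[key]:
            if v > ts:
                break
            val = x
        return val

s = MVStore()
s.put("x", 1)
snap = s.put("y", 2)                        # pick a snapshot timestamp here
s.put("x", 99)                              # later writes don't disturb the snapshot
assert (s.snapshot_get("x", snap), s.snapshot_get("y", snap)) == (1, 2)
```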

Slide 20

Part 3: Building Up

Slide 21

Atomic transactions?
•  Easy to add for the compaction approach
   – Begin => log "begin xid"
   – Commit => log "commit xid" + checksum
   – Abort => do nothing or log "abort xid"
   – Include the xid in constituent updates
•  Recovery:
   – Only replay valid committed transactions
   – Ensures all-or-nothing multi-key updates
•  Commit also installs index updates atomically
   – Easy, since they are in memory
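
A minimal sketch of the recovery rule (the record format is invented, not from any particular system): constituent updates carry their xid, and replay applies an update only if that transaction's commit record made it into the log, which gives all-or-nothing multi-key updates.

```python
def recover(log_records):
    """Rebuild state from a log, replaying only committed transactions."""
    committed = {xid for op, xid, *_ in log_records if op == "commit"}
    state = {}
    for op, xid, *rest in log_records:
        if op == "put" and xid in committed:   # skip uncommitted or aborted txns
            key, value = rest
            state[key] = value
    return state

log = [
    ("begin", 1), ("put", 1, "a", 10), ("put", 1, "b", 20), ("commit", 1),
    ("begin", 2), ("put", 2, "a", 99),         # crash before "commit 2" was logged
]
assert recover(log) == {"a": 10, "b": 20}      # all-or-nothing multi-key update
```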

Slide 22

Multi-node Transactions?
•  Need to add support for two-phase commit
   – End of phase 1 => log "prepare xid"
      •  Really the same state as commit, but not yet committed
   – After the vote RPC, log the commit
   – Easy to do because of the no-overwrite policy
•  This also enables KV updates to be part of multi-system transactions
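
A toy two-phase-commit sketch matching the slide (illustrative in-process "nodes"; real 2PC adds RPCs, timeouts, and failure handling): each participant logs "prepare xid" at the end of phase 1, and the coordinator tells everyone to commit only if all votes are yes.

```python
class Participant:
    def __init__(self, name):
        self.name, self.log, self.pending, self.state = name, [], {}, {}

    def prepare(self, xid, updates):
        self.pending[xid] = updates
        self.log.append(("prepare", xid, updates))   # durable "yes" vote
        return True

    def commit(self, xid):
        self.log.append(("commit", xid))
        self.state.update(self.pending.pop(xid))

    def abort(self, xid):
        self.log.append(("abort", xid))
        self.pending.pop(xid, None)

def two_phase_commit(xid, parts_and_updates):
    votes = [p.prepare(xid, ups) for p, ups in parts_and_updates]   # phase 1
    if all(votes):
        for p, _ in parts_and_updates:                              # phase 2
            p.commit(xid)
        return "committed"
    for p, _ in parts_and_updates:
        p.abort(xid)
    return "aborted"

a, b = Participant("a"), Participant("b")
print(two_phase_commit(1, [(a, {"x": 1}), (b, {"y": 2})]))   # committed
```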

Slide 23

Secondary Indices?
•  Add a second in-memory index
•  Transactions and logging the same
•  Will need to lock both indices sometimes
   – Both are in memory
   – Can use a single write lock for both if most updates change both indices
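
A sketch of the single-write-lock idea (illustrative; field names and the choice of a secondary key are invented): a primary in-memory index by ID plus a secondary index by another field, both updated under one lock since most writes touch both.

```python
import threading

class TwoIndexStore:
    def __init__(self):
        self.by_id, self.by_email = {}, {}   # primary + secondary in-memory index
        self.lock = threading.Lock()

    def put(self, rec_id, email, value):
        with self.lock:                      # one write lock covers both indices
            old = self.by_id.get(rec_id)
            if old:
                self.by_email.pop(old["email"], None)
            rec = {"email": email, "value": value}
            self.by_id[rec_id] = rec
            self.by_email[email] = rec

    def get_by_email(self, email):
        with self.lock:
            return self.by_email.get(email)

s = TwoIndexStore()
s.put(1, "ada@example.com", "v1")
assert s.get_by_email("ada@example.com")["value"] == "v1"
```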

Slide 24

Replication [mostly done]
•  Many possibilities
   – Relatively straightforward given 2PC
   – Various quorum approaches, as in Dynamo
•  Recovery can be simplified
   – Can get a lost index from replicas
•  More complex:
   – Getting independent replicas
   – Consistent hashing to vary nodes/system
      •  Alternative: use a trie; see Gribble SDDS, OSDI 2000
      •  A trie supports range queries
   – Micro-sharding for parallel recovery
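
For the consistent-hashing point, a bare-bones ring sketch (illustrative only; Dynamo-style systems add virtual nodes, N-way replication, and membership): a key maps to the first node clockwise from its hash, so changing the node count moves only nearby keys.

```python
# Minimal consistent-hash ring: node_for(key) returns the owning node.
import bisect, hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        hashes = [p for p, _ in self.points]
        i = bisect.bisect(hashes, h(key)) % len(self.points)   # wrap around the ring
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
```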

Slide 25

2PC hurts availability…
•  Problem: 2PC depends on all k nodes to be up
   – Prob(up) = Prob(single node up)^k   [= small]
•  Spanner solution:
   – Each replica is actually a Paxos group
      •  Each Paxos group is local to one datacenter
   – 2PC among the Paxos groups
   – Drastically improves Prob(single node up)
•  Layering hides the complexity of Paxos
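
Back-of-the-envelope numbers for this argument (the figures are illustrative, not from the talk): assume each node is up with probability 0.99 and a transaction spans k = 5 participants; with 5-replica Paxos groups, a participant counts as up whenever a majority of its replicas are.

```python
from math import comb

p, k = 0.99, 5                                  # assumed per-node uptime, participants

plain_2pc = p ** k                              # every participant node must be up
group_up = sum(comb(5, i) * p**i * (1 - p)**(5 - i) for i in range(3, 6))
paxos_2pc = group_up ** k                       # each participant is a 5-replica
                                                # Paxos group needing a majority
print(f"2PC over single nodes : {plain_2pc:.4f}")    # ~0.951
print(f"2PC over Paxos groups : {paxos_2pc:.6f}")    # ~0.99995
```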

Slide 26

Evolution?
•  Sometimes want to change the schema
•  Need a version # for each compacted file
   – New log is always in the current version
   – Compaction always writes out the new version
•  Two options:
   – Recovery can read old versions
   – Converters from n to n+1 (e.g. Microsoft Office)
•  Compacted files immutable
   – Enables one-time batch conversion
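
A sketch of the n-to-n+1 converter option (the record shape and version numbers are invented for the example): each compacted file records its schema version, and old records are upgraded through a chain of converters, either at read time or in a one-time batch pass over the immutable files.

```python
# Illustrative converter chain: upgrade a record from its file's version to
# the current schema version one step at a time.
CURRENT_VERSION = 3

def v1_to_v2(r):                      # v1 had separate first/last name fields
    r = dict(r)
    r["name"] = f"{r.pop('first')} {r.pop('last')}"
    return r

def v2_to_v3(r):                      # v3 normalizes email addresses
    r = dict(r)
    r["email"] = r.get("email", "").lower()
    return r

CONVERTERS = {1: v1_to_v2, 2: v2_to_v3}

def upgrade(record, version):
    while version < CURRENT_VERSION:
        record = CONVERTERS[version](record)
        version += 1
    return record

old = {"first": "Ada", "last": "Lovelace", "email": "ADA@Example.com"}
print(upgrade(old, 1))    # {'name': 'Ada Lovelace', 'email': 'ada@example.com'}
```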

Slide 27

Consistent Caching
•  Fundamentally complex
   – Enables denormalization, materialized views
•  Basic solution
   – Need to have "commit hooks"
   – On commit, notify listeners via pub-sub
   – Listeners invalidate their copies
      •  Or choose to serve the stale version while updating
•  This can be a broadly useful building block
   – E.g. memcache
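
A sketch of the commit-hook idea (illustrative; the "pub-sub" here is an in-process callback list rather than a real messaging system): the store calls registered hooks on commit, and a cache listener invalidates the keys it holds, or could instead mark them stale and refresh asynchronously.

```python
class Store:
    def __init__(self):
        self.data, self.hooks = {}, []

    def on_commit(self, hook):
        self.hooks.append(hook)

    def commit(self, updates):
        self.data.update(updates)
        for hook in self.hooks:              # "publish" the committed keys
            hook(set(updates))

class Cache:
    def __init__(self, store):
        self.copies = {}
        store.on_commit(self.invalidate)     # subscribe to commit notifications

    def invalidate(self, keys):
        for k in keys:
            self.copies.pop(k, None)         # or: serve stale while refreshing

store = Store()
cache = Cache(store)
cache.copies["x"] = "old"
store.commit({"x": "new"})
assert "x" not in cache.copies
```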

Slide 28

What about joins?
•  Current somewhat low-hanging fruit
   – Ordered keys, as in BigTable
   – Merge equi-join across stores
   – Roughly how Inktomi worked (sorted by doc id)
•  Some apps essentially do the joins themselves
•  Harder:
   – Joining against a secondary index
   – Non-equal-key joins
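
A sketch of a merge equi-join over two stores that both return rows sorted by the join key (the "ordered keys, as in BigTable" case; the data and unique-key assumption are for illustration).

```python
def merge_join(left, right):
    """left and right are lists of (key, value) sorted by key; equi-join on key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk == rk:
            out.append((lk, left[i][1], right[j][1]))
            i += 1                       # assumes unique keys within each store
            j += 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return out

docs  = [(1, "doc one"), (2, "doc two"), (4, "doc four")]
links = [(2, 17), (3, 5), (4, 9)]
assert merge_join(docs, links) == [(2, "doc two", 17), (4, "doc four", 9)]
```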

Slide 29

A Plug for Stasis
Stasis is a framework for building these kinds of systems
   – One attempt at layering
   – Provides transactional logging and recovery
   – Can support update in place, compaction, or a mix
   – Handles the ORM problem cleanly
   – Supports 2PC (but not on top of Paxos)
Rusty Sears' PhD topic
   – Open source on GitHub

Slide 30

Stasis and the Cloud
•  Traditional DB model:
   – The log manager, buffer manager, and transaction manager have to be collocated on one node
   – Reason: hold mutexes across calls
      •  Fundamental to the use of LSNs on pages
•  One use of Stasis: break apart these pieces
   – LSN-free pages => no locks across calls
   – Instead, pipeline async calls among modules
      •  Enables modules to be on different machines
   – Enables a new approach for large-scale DBs
      •  Sharding no longer the only option for a larger DB

Slide 31

Conclusion
•  Two valid world views
   – Difference dates from the 1970s
   – Continue to converge in the cloud
•  Possible outcome:
   – Layered, modular system
   – … with great flexibility
   – … used to build a variety of systems and semantics
   – … including a full SQL DBMS