
RICON 2012 Keynote

Eric Brewer
October 11, 2012

NoSQL in context and relative to relational databases, plus thoughts on evolving a layered, modular distributed system.

Transcript

  1. Advancing Distributed Systems
     Eric Brewer
     Professor, UC Berkeley
     VP Infrastructure, Google
     RICON 2012, October 11, 2012

  2. Charles Bachman, 1973 Turing Award
     Integrated Data Store (IDS)
     A (very) early “No SQL” database

  3. “Navigational” Database
     • Tight integration between code and data
       – Database = linked groups of records (“CODASYL”)
     • Pointers were physical names; today we hash
       – Programmer as “navigator” through the links
       – Similar to DOM engine, WWW, graph DBs
     • Used for its high performance, but…
       – Hard to program and maintain
       – Hard to evolve the schema (embedded in code)
     Wikipedia: “IDMS”

  4. Why Relational? (1970s)
     • Need a high-level model (sets)
     • Separate the data from the code
       – SQL is the (only) API
     • Data outlasts any particular implementation
       – because the model doesn’t change
     • Goal: implement the top-down model well
       – Led to transactions as a tool
       – Declarative language leaves room for optimization

  5. Also 1970s: Unix
     “The most important job of UNIX is to provide a file system”
       – original 1974 Unix paper
     • Bottom-up world view
       – Few, simple, efficient mechanisms
       – Layers and composition
       – “Navigational”
       – Evolution comes from APIs, encapsulation
     • NoSQL is in this Unix tradition
       – Examples: dbm (1979 KV), gdbm, Berkeley DB, JDBM

  6. Two Valid World Views
     Relational View
     • Top Down
       – Clean model
       – ACID transactions
     • Two kinds of developers
       – DB authors
       – SQL programmers
     • Values
       – Clean semantics
       – Set operations
       – Easy long-term evolution
     • Venues: SIGMOD, VLDB
     Systems View
     • Bottom Up
       – Build on top
       – Evolve modules
     • One kind of programmer
       – Integrated use
     • Values
       – Good APIs
       – Flexibility
       – Range of possible programs
     • Venues: SOSP, OSDI

  7. NoSQL in Context
     • Large reusable storage component
     • Systems values:
       – Layered, ideally modular APIs
       – Enable a range of systems and semantics
     • Some things to build on top over time:
       – Multi-component transactions
       – Secondary indices
       – Evolution story
       – Returning sets of data, not just values

  8. Three Interesting Differences
     1. Integration into the larger application
     2. Read/write ratio and latencies
     3. Sets vs. values

  9. 1) Object-Relational Mapping Problem
     • Map application objects to a table
       – Object ID is the primary key
       – Object fields are the columns
     • Update key =>
       – create SQL query to UPDATE a row
       – execute the query
     • Typical consequences:
       – Extra copies, poor use of RAM
         • One copy for the app, one for the DB buffer manager
       – Inheritance, evolution are messy
       – Performance fine for Ruby on Rails, but heavyweight
     “Vietnam of Computer Science”

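To make the round trip concrete, here is a minimal sketch of the mapping the slide describes (the User class, table, and column names are hypothetical, not from the talk):

    import sqlite3

    # Hypothetical application object mapped to a "users" table:
    # the object ID is the primary key, each field a column.
    class User:
        def __init__(self, user_id, name, email):
            self.user_id = user_id
            self.name = name
            self.email = email

    def save(db, user):
        # Update key => build a SQL UPDATE for the row, then execute it.
        # Note the extra copies: the fields are serialized into the
        # statement, and the DB buffer manager keeps its own copy of the row.
        db.execute(
            "UPDATE users SET name = ?, email = ? WHERE id = ?",
            (user.name, user.email, user.user_id),
        )
        db.commit()

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    db.execute("INSERT INTO users VALUES (1, 'ada', 'ada@example.com')")
    save(db, User(1, "ada", "ada@lovelace.org"))
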
  10. 2) Read Latency
      • For live Internet services:
        – Tail latency of reads is king
        – (writes are async and their tail latency is OK)
      • Consequences:
        – Minimize seeks for individual reads
        – Optimize data read together
          • Caching
          • Denormalize data (i.e. copy fields to multiple places)

  11. Denormalizing for Latency
      • Two basic problems:
        1. Multiple copies have to be kept in sync
           • Slows updates to make reads faster
        2. Significant added complexity
           • Really prefer a single master copy (modulo replication)
      • Both SQL and NoSQL have this problem:
        – SQL:
          • Denormalized schemas, consistency constraints
          • Materialized views = cached virtual tables with invalidation
        – NoSQL: the app has to track invalidation/updates

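A minimal sketch of that sync burden, using a hypothetical KV layout (all keys and fields invented for illustration): the author's name is copied into each post so a post renders with one read, but a rename must rewrite every copy.

    # Author name denormalized into every post for one-seek reads.
    store = {
        "user:1": {"name": "ada"},
        "post:10": {"author_id": 1, "author_name": "ada", "body": "hello"},
        "post:11": {"author_id": 1, "author_name": "ada", "body": "world"},
    }

    def rename_user(user_id, new_name):
        store[f"user:{user_id}"]["name"] = new_name
        # The app, not the store, tracks and updates every copy.
        for key, value in store.items():
            if key.startswith("post:") and value["author_id"] == user_id:
                value["author_name"] = new_name

    rename_user(1, "lovelace")  # one logical write, three physical updates
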
  12. Denormalization Differences
      • Key difference: NoSQL tends to care more
        – Use in high-performance live services
        – Read-mostly usage => OK to burden writes
      • NoSQL typically missing invalidation support
        – SQL materialized views automate cache invalidation
      • Counterexample: Google’s Percolator
        – Incrementally update many denormalized tables
        – Dependency flows (think Excel cell updates)

  13. Read Latency Summary
      • Live services push hard on read latency
        – Tend to want key data collocated for ≤ 1 seek
      • Many NoSQL systems driven by this
        – Airline reservations: Sabre (pre-SQL until recently)
        – Inktomi search engine
        – Amazon’s Dynamo
        – Google’s BigTable, Spanner
      • Open question: do SSDs => normalization OK?

  14. Sets vs. Values
      • SQL returns sets
        – Joins are set operations
        – Normally iterate through results
        – Places an emphasis on locality of sets
      • NoSQL often returns a single value
        – Denormalize if needed to get a “complete” value
        – No joins
        – Some small sets; a search engine returns k values
          • One seek per value is OK as long as they are parallel
        – Later: iteration over snapshots

  15. Bitcask 101
      • Simple single-node KV store
        – All keys fit into an in-memory hash table
        – All values go to a log; the index points into the log
      • Simple durability, mostly sequential writes
      • All reads take at most one seek
        – 0 if cached
        – Hash the key, follow the pointer into the log
      • Compact the log to reclaim dead space
      • Recovery is easy: checkpoint + scan the log

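A minimal sketch of that design, assuming a toy record format (real Bitcask also keeps sizes, timestamps, and CRCs): an append-only log plus an in-memory hash from key to log offset.

    import os

    class TinyBitcask:
        def __init__(self, path):
            self.log = open(path, "a+b")  # append-only log file
            self.index = {}               # key -> offset of record in the log

        def put(self, key, value):
            self.log.seek(0, os.SEEK_END)
            offset = self.log.tell()
            self.log.write(key.encode() + b"\x00" + value.encode() + b"\n")
            self.log.flush()              # simple durability, sequential writes
            self.index[key] = offset      # the index points into the log

        def get(self, key):
            offset = self.index[key]      # hash the key...
            self.log.seek(offset)         # ...follow the pointer: <= 1 seek
            _key, value = self.log.readline().rstrip(b"\n").split(b"\x00", 1)
            return value.decode()

    kv = TinyBitcask("data.log")
    kv.put("color", "red")
    kv.put("color", "blue")  # the old record is dead space until compaction
    print(kv.get("color"))   # -> "blue"
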
  16. Another Difference: Update in Place
      • Classic DB
        – Write-ahead log
        – …but later overwrite values in place
        – Focus on future read locality
      • NoSQL (Bitcask, BigTable, Spanner, …)
        – The log is the final location
        – Compact the log to recover space
        – Limited multi-key locality after compaction

  17. Why Compaction?
      1. Follows from single-value read latency
         – Need low tail latency
         – Do not need to return sets
         – (Update in place helps with sets)
      2. Don’t overwrite the current version
         – Undo logs are bad for whole-value writes
           • Write the value twice (but in the same log)
           • Blob support in DBs typically avoids undo logs
         – Undo logs are much better for:
           • Logical operations such as increment
           • Partial updates (avoid writing the whole object)

  18. Why Compaction? (continued)
      3. Easy to keep multiple versions
         – All (recent) versions are in the log
      Solves the iteration problem:
      – Problem: need a self-consistent set
      – DB solution: large read lock, blocks writes
      – DB solution 2: “snapshot isolation” (Oracle)
        • All reads at the timestamp of the beginning of the transaction
      – Spanner: “snapshot reads” pick a timestamp
        • Use the older versions in the log
        • Extra indexing (similar to BigTable)

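A minimal sketch of a snapshot read over log-kept versions (the history layout is invented for illustration; Spanner's real timestamps come from TrueTime): a read at time t returns the newest version written at or before t, giving a self-consistent view without blocking writers.

    # key -> list of (timestamp, value), sorted by timestamp, as kept in the log
    history = {
        "balance": [(10, "100"), (20, "80"), (30, "95")],
    }

    def snapshot_read(key, t):
        # Return the newest version at or before timestamp t.
        result = None
        for ts, value in history[key]:
            if ts <= t:
                result = value
            else:
                break
        return result

    print(snapshot_read("balance", 25))  # -> "80", the state as of time 25
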
  19. Atomic Transactions?
      • Easy to add with the compaction approach
        – Begin => log “begin xid”
        – Commit => log “commit xid” + checksum
        – Abort => do nothing, or log “abort xid”
        – Include the xid in constituent updates
      • Recovery:
        – Only replay valid committed transactions
        – Ensures all-or-nothing multi-key updates
      • Commit also installs index updates atomically
        – Easy, since they are in memory

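A minimal sketch of that recovery rule, with an invented record format (the slide's commit checksum is omitted): replay only updates whose xid has a matching commit record.

    log = [
        ("begin", 1), ("put", 1, "a", "1"), ("put", 1, "b", "2"), ("commit", 1),
        ("begin", 2), ("put", 2, "a", "999"),       # crashed: no commit record
    ]

    def recover(log):
        committed = {rec[1] for rec in log if rec[0] == "commit"}
        store = {}
        for rec in log:
            if rec[0] == "put" and rec[1] in committed:  # committed only
                _, _, key, value = rec
                store[key] = value
        return store

    print(recover(log))  # -> {'a': '1', 'b': '2'}: all-or-nothing across keys
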
  20. Multi-node Transactions?
      • Need to add support for two-phase commit
        – End of phase 1 => log “prepare xid”
          • Really the same state as commit, but not yet committed
        – After the vote RPC, log commit
        – Easy to do because of the no-overwrite policy
      • This also enables KV updates to be part of multi-system transactions

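A minimal sketch of a participant built on the same log (all names are hypothetical): prepare makes the updates durable exactly as a commit would, so the decision is just one more log record, and nothing is ever overwritten.

    class Participant:
        def __init__(self):
            self.log = []

        def prepare(self, xid, updates):
            for key, value in updates:
                self.log.append(("put", xid, key, value))
            self.log.append(("prepare", xid))  # durable, not yet committed
            return "yes"                       # vote sent to the coordinator

        def decide(self, xid, commit):
            # Phase 2: the coordinator sends the outcome of the vote.
            self.log.append(("commit", xid) if commit else ("abort", xid))

    p1, p2 = Participant(), Participant()
    votes = [p1.prepare(42, [("a", "1")]), p2.prepare(42, [("b", "2")])]
    decision = all(v == "yes" for v in votes)
    p1.decide(42, decision)
    p2.decide(42, decision)
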
  21. Secondary Indices?
      • Add a second in-memory index
      • Transactions and logging stay the same
      • Will sometimes need to lock both indices
        – Both are in memory
        – Can use a single write lock for both if most updates change both indices

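A minimal sketch of the single-lock approach (field names are hypothetical): the primary and secondary indices are both in memory, and one write lock keeps them consistent.

    import threading

    lock = threading.Lock()
    primary = {}   # user_id -> record
    by_email = {}  # email -> user_id (secondary index)

    def put_user(user_id, record):
        with lock:  # one lock, since most updates touch both indices
            old = primary.get(user_id)
            if old is not None:
                del by_email[old["email"]]
            primary[user_id] = record
            by_email[record["email"]] = user_id

    put_user(1, {"email": "ada@example.com"})
    put_user(1, {"email": "ada@lovelace.org"})
    print(by_email)  # -> {'ada@lovelace.org': 1}
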
  22. Replication [mostly done]
      • Many possibilities
        – Relatively straightforward given 2PC
        – Various quorum approaches, as in Dynamo
      • Recovery can be simplified
        – Can get a lost index from replicas
      • More complex:
        – Getting independent replicas
        – Consistent hashing to vary nodes/system
          • Alternative: use a trie; see Gribble SDDS, OSDI 2000
          • A trie supports range queries
        – Micro-sharding for parallel recovery

  23. 2PC Hurts Availability…
      • Problem: 2PC depends on all k nodes being up
        – Prob(up) = Prob(single node up)^k   [= small]
      • Spanner solution:
        – Each replica is actually a Paxos group
          • Each Paxos group is local to one datacenter
        – 2PC among the Paxos groups
        – Drastically improves Prob(single node up)
      • Layering hides the complexity of Paxos

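Illustrative numbers (assumed, not from the talk) showing why the layering helps: plain 2PC availability decays exponentially in k, while a 5-replica Paxos group that needs only a majority makes each "participant" far more available.

    from math import comb

    p = 0.99                 # assumed availability of a single node
    k = 5                    # participants in the 2PC
    print(p ** k)            # plain 2PC over 5 nodes: ~0.951

    # A Paxos group of 5 is up if any majority (>= 3 replicas) is up.
    group_up = sum(comb(5, i) * p**i * (1 - p)**(5 - i) for i in range(3, 6))
    print(group_up)          # ~0.99999: the "node" that 2PC sees is a group
    print(group_up ** k)     # 2PC over 5 Paxos groups: ~0.99995
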
  24. Evolution?
      • Sometimes want to change the schema
      • Need a version # for each compacted file
        – The new log is always in the current version
        – Compaction always writes out the new version
      • Two options:
        – Recovery can read old versions
        – Converters from n to n+1 (e.g. Microsoft Office)
      • Compacted files are immutable
        – Enables one-time batch conversion

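A minimal sketch of the n-to-n+1 converter option (versions and fields are invented): any record from an old compacted file is upgraded by chaining converters until it reaches the current version.

    CURRENT_VERSION = 3

    def v1_to_v2(rec):
        rec["email"] = rec.pop("mail")   # field renamed in version 2
        return rec

    def v2_to_v3(rec):
        rec.setdefault("created_at", 0)  # field added in version 3
        return rec

    CONVERTERS = {1: v1_to_v2, 2: v2_to_v3}

    def upgrade(version, rec):
        while version < CURRENT_VERSION:
            rec = CONVERTERS[version](rec)
            version += 1
        return rec

    print(upgrade(1, {"mail": "ada@example.com"}))
    # -> {'email': 'ada@example.com', 'created_at': 0}
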
  25. Consistent Caching
      • Fundamentally complex
        – Enables denormalization, materialized views
      • Basic solution
        – Need to have “commit hooks”
        – On commit, notify listeners via pub-sub
        – Listeners invalidate their copies
          • Or choose to serve a stale version while updating
      • This can be a broadly useful building block
        – E.g. memcache

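A minimal sketch of the commit-hook pattern (all names invented): the store publishes each committed key, and a listening cache invalidates its copy.

    listeners = []

    def subscribe(callback):
        listeners.append(callback)

    def commit(store, updates):
        store.update(updates)
        for key in updates:          # commit hook: publish via pub-sub
            for notify in listeners:
                notify(key)

    store, cache = {}, {"color": "red"}
    subscribe(lambda key: cache.pop(key, None))  # listener invalidates

    commit(store, {"color": "blue"})
    print(cache)  # -> {}: the next read must fetch the committed value
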
  26. What about joins?
      • Current somewhat low-hanging fruit
        – Ordered keys, as in BigTable
        – Merge equi-join across stores
        – Roughly how Inktomi worked (sorted by doc id)
      • Some apps essentially do the joins themselves
      • Harder:
        – Joining against a secondary index
        – Non-equal-key joins

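A minimal sketch of a merge equi-join over two stores whose keys are already ordered (the data is invented; doc-id-sorted lists echo the Inktomi example): one linear pass, no index lookups.

    def merge_join(left, right):
        # left, right: lists of (key, value) sorted by unique key.
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i][0] < right[j][0]:
                i += 1
            elif left[i][0] > right[j][0]:
                j += 1
            else:  # keys match: emit the joined row, advance both sides
                yield left[i][0], left[i][1], right[j][1]
                i += 1
                j += 1

    docs  = [(1, "doc one"), (2, "doc two"), (4, "doc four")]
    ranks = [(1, 0.9), (3, 0.5), (4, 0.2)]
    print(list(merge_join(docs, ranks)))
    # -> [(1, 'doc one', 0.9), (4, 'doc four', 0.2)]
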
  27. A Plug for Stasis
      Stasis is a framework for building these kinds of systems
      – One attempt at layering
      – Provides transactional logging and recovery
      – Can support update in place, compaction, or a mix
      – Handles the ORM problem cleanly
      – Supports 2PC (but not on top of Paxos)
      Rusty Sears’s PhD topic
      – Open source on GitHub

  28. Stasis and the Cloud
      • Traditional DB model:
        – The log manager, buffer manager, and transaction manager are collocated on one node
        – Reason: hold mutexes across calls
          • Fundamental to the use of LSNs on pages
      • One use of Stasis: break these pieces apart
        – LSN-free pages => no locks across calls
        – Instead, pipeline async calls among modules
          • Enables modules to be on different machines
        – Enables a new approach for large-scale DBs
          • Sharding is no longer the only option for a larger DB

  29. Conclusion
      • Two valid world views
        – The difference dates from the 1970s
        – They continue to converge in the cloud
      • Possible outcome:
        – A layered, modular system
        – … with great flexibility
        – … used to build a variety of systems and semantics
        – … including a full SQL DBMS