Slide 1

Advancing Distributed Systems
Eric Brewer
Professor, UC Berkeley
VP Infrastructure, Google

RICON 2012
October 11, 2012

Slide 2

Charles Bachman, 1973 Turing Award
Integrated Datastore (IDS)
(very) Early "No SQL" database

Slide 3

"Navigational" Database
•  Tight integration between code and data
   – Database = linked groups of records ("CODASYL")
      •  Pointers were physical names; today we hash
   – Programmer as "navigator" through the links
   – Similar to DOM engine, WWW, graph DBs
•  Used for its high performance, but…
   – Hard to program and maintain
   – Hard to evolve the schema (embedded in code)
Wikipedia: "IDMS"

Slide 4

Why Relational? (1970s)
•  Need a high-level model (sets)
•  Separate the data from the code
   – SQL is the (only) API
•  Data outlasts any particular implementation
   – because the model doesn't change
•  Goal: implement the top-down model well
   – Led to transactions as a tool
   – Declarative language leaves room for optimization

Slide 5

Also 1970s: Unix
"The most important job of UNIX is to provide a file system"
   – original 1974 Unix paper
•  Bottom-up world view
   – Few, simple, efficient mechanisms
   – Layers and composition
   – "navigational"
   – Evolution comes from APIs, encapsulation
•  NoSQL is in this Unix tradition
   – Examples: dbm (1979 kv), gdbm, Berkeley DB, JDBM

Slide 6

Two Valid World Views

Relational View
•  Top Down
   – Clean model
   – ACID transactions
•  Two kinds of developers
   – DB authors
   – SQL programmers
•  Values
   – Clean semantics
   – Set operations
   – Easy long-term evolution
•  Venues: SIGMOD, VLDB

Systems View
•  Bottom Up
   – Build on top
   – Evolve modules
•  One kind of programmer
   – Integrated use
•  Values
   – Good APIs
   – Flexibility
   – Range of possible programs
•  Venues: SOSP, OSDI

Slide 7

NoSQL in Context
•  Large reusable storage component
•  Systems values:
   – Layered, ideally modular APIs
   – Enable a range of systems and semantics
•  Some things to build on top over time:
   – Multi-component transactions
   – Secondary indices
   – Evolution story
   – Returning sets of data, not just values

Slide 8

Part 2: Some Differences

Slide 9

Three Interesting Differences
1.  Integration into the larger application
2.  Read/write ratio and latencies
3.  Sets vs. values

Slide 10

1) Object-Relational Mapping Problem
•  Map application objects to a table
   – Object ID is the primary key
   – Object fields are the columns
•  Update a key =>
   – create SQL query to UPDATE a row
   – execute the query
•  Typical consequences:
   – Extra copies, poor use of RAM
      •  One copy for the app, one for the DB buffer manager
   – Inheritance, evolution are messy
   – Performance fine for Ruby on Rails, but heavyweight
"Vietnam of Computer Science"
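
To make the mapping concrete, here is a minimal ORM-style sketch, assuming a hypothetical `User` class, a `users` table, and SQLite as the backing store (none of these names are from the talk): the object ID becomes the primary key, its fields become columns, and a field update turns into an UPDATE statement, with one copy of the data in the app and another in the database's buffer manager.

```python
# Hypothetical minimal ORM sketch; illustrates the mapping, not any real ORM.
import sqlite3

class User:
    def __init__(self, id, name, email):
        self.id, self.name, self.email = id, name, email

def save(conn, obj, table):
    """Write every field back as one UPDATE row, keyed by the object ID."""
    fields = {k: v for k, v in vars(obj).items() if k != "id"}
    assignments = ", ".join(f"{k} = ?" for k in fields)       # name = ?, email = ?
    conn.execute(f"UPDATE {table} SET {assignments} WHERE id = ?",
                 (*fields.values(), obj.id))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@example.com')")

u = User(1, "Ada", "ada@example.com")   # one copy lives in the app...
u.email = "ada@newhost.com"             # ...a second in the DB's buffer manager
save(conn, u, "users")
print(conn.execute("SELECT email FROM users WHERE id = 1").fetchone())
```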

Slide 11

2) Read Latency
•  For live Internet services:
   – Tail latency of reads is king
   – (writes are async and tail latency is OK)
•  Consequences:
   – Minimize seeks for individual reads
   – Optimize data read together
      •  Caching
      •  Denormalize data (i.e. copy fields to multiple places)

Slide 12

Denormalizing for Latency
•  Two basic problems:
   1.  Multiple copies have to be kept in sync
      •  Slows updates to make reads faster
   2.  Significant added complexity
      •  Really prefer a single master copy (modulo replication)
•  Both SQL and NoSQL have this problem:
   – SQL:
      •  Denormalized schemas, consistency constraints
      •  Materialized views = cached virtual tables with invalidation
   – NoSQL: the app has to track invalidation/updates
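
As an illustration of the NoSQL case (app-managed copies), here is a toy sketch using a plain dict as a stand-in for a KV store; the record layout and key names are invented for the example. The author's display name is copied into each post so a post can be rendered with one read, and the rename path shows the write-side burden of keeping every copy in sync.

```python
store = {}                                   # stand-in for a KV store's get/put

def create_post(post_id, author_id, body):
    author = store[f"user:{author_id}"]
    store[f"post:{post_id}"] = {
        "author_id": author_id,
        "author_name": author["name"],       # denormalized copy for 1-read renders
        "body": body,
    }
    author.setdefault("post_ids", []).append(post_id)

def rename_user(author_id, new_name):
    # The write-side burden: the app must find and update every copy itself.
    user = store[f"user:{author_id}"]
    user["name"] = new_name
    for pid in user.get("post_ids", []):
        store[f"post:{pid}"]["author_name"] = new_name

store["user:1"] = {"name": "Ada"}
create_post(100, 1, "hello")
rename_user(1, "Ada L.")
assert store["post:100"]["author_name"] == "Ada L."
```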

Slide 13

Denormalization Differences
•  Key difference: NoSQL tends to care more
   – Use in high-performance live services
   – Read-mostly usage => OK to burden writes
•  NoSQL typically missing invalidation support
   – SQL materialized views automate cache invalidation
•  Counter-example: Google's Percolator
   – Incrementally update many denormalized tables
   – Dependency flows (think Excel cell updates)

Slide 14

Read Latency Summary
•  Live services push hard on read latency
   – Tend to want key data collocated for ≤ 1 seek
•  Many NoSQL systems driven by this
   – Airline reservations: Sabre (pre-SQL until recently)
   – Inktomi search engine
   – Amazon's Dynamo
   – Google's BigTable, Spanner
•  Open question: do SSDs => normalization OK?

Slide 15

Sets vs. Values
•  SQL returns sets
   – Joins are set operations
   – Normally iterate through results
   – Places an emphasis on locality of sets
•  NoSQL often returns a single value
   – Denormalize if needed to get a "complete" value
   – No joins
   – Some small sets; a search engine returns k values
      •  One seek per value is OK as long as they are parallel
   – Later: iteration over snapshots

Slide 16

Bitcask 101
•  Simple single-node KV store
   – All keys fit into an in-memory hash table
   – All values go to a log; the index points into the log
•  Simple durability, mostly sequential writes
•  All reads take at most one seek
   – 0 if cached
   – Hash the key, follow the pointer to the log
•  Compact the log to reclaim dead space
•  Recovery is easy: checkpoint + scan the log
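
A toy Python sketch of this design (not Riak/Bitcask code; the record format and names are invented, and compaction and checkpointing are omitted): an append-only log on disk plus an in-memory hash mapping each key to its latest log offset, so a get is one hash lookup and at most one seek, and recovery is a scan of the log.

```python
import os

class TinyBitcask:
    def __init__(self, path):
        self.index = {}                      # key -> (offset, length) into the log
        self.log = open(path, "ab+")         # append-only log file

    def put(self, key, value):
        rec = f"{key}\t{value}\n".encode()
        self.log.seek(0, os.SEEK_END)
        off = self.log.tell()
        self.log.write(rec)                  # mostly-sequential write
        self.log.flush()
        self.index[key] = (off, len(rec))    # in-memory hash points into the log

    def get(self, key):
        off, length = self.index[key]        # one hash lookup...
        self.log.seek(off)                   # ...then at most one seek
        _, value = self.log.read(length).decode().rstrip("\n").split("\t", 1)
        return value

    def recover(self):
        # Recovery: rebuild the in-memory index by scanning the log.
        self.index.clear()
        self.log.seek(0)
        off = 0
        for line in self.log:
            key, _ = line.decode().rstrip("\n").split("\t", 1)
            self.index[key] = (off, len(line))
            off += len(line)

# (Compaction — copying only live records to a new log — is omitted here.)
db = TinyBitcask("/tmp/tiny.log")
db.put("a", "1")
db.put("a", "2")                             # the old "a" record is now dead space
assert db.get("a") == "2"
```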

Slide 17

Another difference: update in place
•  Classic DB
   – Write-ahead log
   – …but later overwrite values in place
   – Focus on future read locality
•  NoSQL (Bitcask, BigTable, Spanner, …)
   – The log is the final location
   – Compact the log to recover space
   – Limited multi-key locality after compaction

Slide 18

Why compaction?
1.  Follows from single-value read latency
   – Need low tail latency
   – Do not need to return sets
   – (Update in place helps with sets)
2.  Don't overwrite the current version
   – Undo logs bad for whole-value writes
      •  Write the value twice (but in the same log)
      •  Blob support in DBs typically avoids undo logs
   – Undo logs much better for:
      •  Logical operations such as increment
      •  Partial updates (avoid writing the whole object)

Slide 19

Why Compaction? (continued)
3.  Easy to keep multiple versions
   – All (recent) versions are in the log

Solves the iteration problem:
   – Problem: need a self-consistent set
   – DB solution: large read lock, blocks writes
   – DB solution 2: "snapshot isolation" (Oracle)
      •  All reads at the timestamp at the beginning of the transaction
   – Spanner: "snapshot reads" pick a timestamp
      •  Use the older versions in the log
      •  Extra indexing (similar to BigTable)
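
A small illustrative sketch of the multi-version idea (not Spanner or BigTable code; a simple counter stands in for real timestamps): every put appends a new version rather than overwriting, and a snapshot read at a chosen version sees a self-consistent set without blocking writers.

```python
# Illustrative multi-version store: put() never overwrites, snapshot_get()
# reads "as of" a version so iteration over many keys is self-consistent.
from collections import defaultdict

class MVStore:
    def __init__(self):
        self.versions = defaultdict(list)   # key -> [(version, value), ...]
        self.clock = 0                      # stand-in for a real timestamp

    def put(self, key, value):
        self.clock += 1
        self.versions[key].append((self.clock, value))
        return self.clock

    def snapshot_get(self, key, ts):
        """Newest value written at version <= ts, or None."""
        val = None
        for v, x in self.versions[key]:
            if v > ts:
                break
            val = x
        return val

s = MVStore()
s.put("x", 1)
snap = s.put("y", 2)                        # pick a snapshot timestamp here
s.put("x", 99)                              # later writes don't disturb the snapshot
assert (s.snapshot_get("x", snap), s.snapshot_get("y", snap)) == (1, 2)
```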

Slide 20

Part 3: Building Up

Slide 21

Atomic transactions?
•  Easy to add for the compaction approach
   – Begin => log "begin xid"
   – Commit => log "commit xid" + checksum
   – Abort => do nothing or log "abort xid"
   – Include the xid in constituent updates
•  Recovery:
   – Only replay valid committed transactions
   – Ensures all-or-nothing multi-key updates
•  Commit also installs index updates atomically
   – Easy, since they are in memory
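
A minimal sketch of the recovery rule (the record format is invented, not from any particular system): constituent updates carry their xid, and replay applies an update only if that transaction's commit record made it into the log, which gives all-or-nothing multi-key updates.

```python
def recover(log_records):
    """Rebuild state from a log, replaying only committed transactions."""
    committed = {xid for op, xid, *_ in log_records if op == "commit"}
    state = {}
    for op, xid, *rest in log_records:
        if op == "put" and xid in committed:   # skip uncommitted or aborted txns
            key, value = rest
            state[key] = value
    return state

log = [
    ("begin", 1), ("put", 1, "a", 10), ("put", 1, "b", 20), ("commit", 1),
    ("begin", 2), ("put", 2, "a", 99),         # crash before "commit 2" was logged
]
assert recover(log) == {"a": 10, "b": 20}      # all-or-nothing multi-key update
```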

Slide 22

Multi-node Transactions?
•  Need to add support for two-phase commit
   – End of phase 1 => log "prepare xid"
      •  Really the same state as commit, but not yet committed
   – After the vote RPC, log the commit
   – Easy to do because of the no-overwrite policy
•  This also enables KV updates to be part of multi-system transactions
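
A toy two-phase-commit sketch matching the slide (illustrative in-process "nodes"; real 2PC adds RPCs, timeouts, and failure handling): each participant logs "prepare xid" at the end of phase 1, and the coordinator tells everyone to commit only if all votes are yes.

```python
class Participant:
    def __init__(self, name):
        self.name, self.log, self.pending, self.state = name, [], {}, {}

    def prepare(self, xid, updates):
        self.pending[xid] = updates
        self.log.append(("prepare", xid, updates))   # durable "yes" vote
        return True

    def commit(self, xid):
        self.log.append(("commit", xid))
        self.state.update(self.pending.pop(xid))

    def abort(self, xid):
        self.log.append(("abort", xid))
        self.pending.pop(xid, None)

def two_phase_commit(xid, parts_and_updates):
    votes = [p.prepare(xid, ups) for p, ups in parts_and_updates]   # phase 1
    if all(votes):
        for p, _ in parts_and_updates:                              # phase 2
            p.commit(xid)
        return "committed"
    for p, _ in parts_and_updates:
        p.abort(xid)
    return "aborted"

a, b = Participant("a"), Participant("b")
print(two_phase_commit(1, [(a, {"x": 1}), (b, {"y": 2})]))   # committed
```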

Slide 23

Secondary Indices?
•  Add a second in-memory index
•  Transactions and logging the same
•  Will need to lock both indices sometimes
   – Both are in memory
   – Can use a single write lock for both if most updates change both indices
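
A sketch of the single-write-lock idea (illustrative; field names and the choice of a secondary key are invented): a primary in-memory index by ID plus a secondary index by another field, both updated under one lock since most writes touch both.

```python
import threading

class TwoIndexStore:
    def __init__(self):
        self.by_id, self.by_email = {}, {}   # primary + secondary in-memory index
        self.lock = threading.Lock()

    def put(self, rec_id, email, value):
        with self.lock:                      # one write lock covers both indices
            old = self.by_id.get(rec_id)
            if old:
                self.by_email.pop(old["email"], None)
            rec = {"email": email, "value": value}
            self.by_id[rec_id] = rec
            self.by_email[email] = rec

    def get_by_email(self, email):
        with self.lock:
            return self.by_email.get(email)

s = TwoIndexStore()
s.put(1, "ada@example.com", "v1")
assert s.get_by_email("ada@example.com")["value"] == "v1"
```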

Slide 24

Replication [mostly done]
•  Many possibilities
   – Relatively straightforward given 2PC
   – Various quorum approaches, as in Dynamo
•  Recovery can be simplified
   – Can get a lost index from replicas
•  More complex:
   – Getting independent replicas
   – Consistent hashing to vary nodes/system
      •  Alternative: use a trie; see Gribble SDDS, OSDI 2000
      •  A trie supports range queries
   – Micro-sharding for parallel recovery
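
For the consistent-hashing point, a bare-bones ring sketch (illustrative only; Dynamo-style systems add virtual nodes, N-way replication, and membership): a key maps to the first node clockwise from its hash, so changing the node count moves only nearby keys.

```python
# Minimal consistent-hash ring: node_for(key) returns the owning node.
import bisect, hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        hashes = [p for p, _ in self.points]
        i = bisect.bisect(hashes, h(key)) % len(self.points)   # wrap around the ring
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
```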

Slide 25

2PC hurts availability…
•  Problem: 2PC depends on all k nodes to be up
   – Prob(up) = Prob(single node up)^k   [= small]
•  Spanner solution:
   – Each replica is actually a Paxos group
      •  Each Paxos group is local to one datacenter
   – 2PC among the Paxos groups
   – Drastically improves Prob(single node up)
•  Layering hides the complexity of Paxos
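
Back-of-the-envelope numbers for this argument (the figures are illustrative, not from the talk): assume each node is up with probability 0.99 and a transaction spans k = 5 participants; with 5-replica Paxos groups, a participant counts as up whenever a majority of its replicas are.

```python
from math import comb

p, k = 0.99, 5                                  # assumed per-node uptime, participants

plain_2pc = p ** k                              # every participant node must be up
group_up = sum(comb(5, i) * p**i * (1 - p)**(5 - i) for i in range(3, 6))
paxos_2pc = group_up ** k                       # each participant is a 5-replica
                                                # Paxos group needing a majority
print(f"2PC over single nodes : {plain_2pc:.4f}")    # ~0.951
print(f"2PC over Paxos groups : {paxos_2pc:.6f}")    # ~0.99995
```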

Slide 26

Evolution?
•  Sometimes want to change the schema
•  Need a version # for each compacted file
   – New log is always in the current version
   – Compaction always writes out the new version
•  Two options:
   – Recovery can read old versions
   – Converters from n to n+1 (e.g. Microsoft Office)
•  Compacted files immutable
   – Enables one-time batch conversion
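
A sketch of the n-to-n+1 converter option (the record shape and version numbers are invented for the example): each compacted file records its schema version, and old records are upgraded through a chain of converters, either at read time or in a one-time batch pass over the immutable files.

```python
# Illustrative converter chain: upgrade a record from its file's version to
# the current schema version one step at a time.
CURRENT_VERSION = 3

def v1_to_v2(r):                      # v1 had separate first/last name fields
    r = dict(r)
    r["name"] = f"{r.pop('first')} {r.pop('last')}"
    return r

def v2_to_v3(r):                      # v3 normalizes email addresses
    r = dict(r)
    r["email"] = r.get("email", "").lower()
    return r

CONVERTERS = {1: v1_to_v2, 2: v2_to_v3}

def upgrade(record, version):
    while version < CURRENT_VERSION:
        record = CONVERTERS[version](record)
        version += 1
    return record

old = {"first": "Ada", "last": "Lovelace", "email": "ADA@Example.com"}
print(upgrade(old, 1))    # {'name': 'Ada Lovelace', 'email': 'ada@example.com'}
```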

Slide 27

Consistent Caching
•  Fundamentally complex
   – Enables denormalization, materialized views
•  Basic solution
   – Need to have "commit hooks"
   – On commit, notify listeners via pub-sub
   – Listeners invalidate their copies
      •  Or choose to serve the stale version while updating
•  This can be a broadly useful building block
   – E.g. memcache
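
A sketch of the commit-hook idea (illustrative; the "pub-sub" here is an in-process callback list rather than a real messaging system): the store calls registered hooks on commit, and a cache listener invalidates the keys it holds, or could instead mark them stale and refresh asynchronously.

```python
class Store:
    def __init__(self):
        self.data, self.hooks = {}, []

    def on_commit(self, hook):
        self.hooks.append(hook)

    def commit(self, updates):
        self.data.update(updates)
        for hook in self.hooks:              # "publish" the committed keys
            hook(set(updates))

class Cache:
    def __init__(self, store):
        self.copies = {}
        store.on_commit(self.invalidate)     # subscribe to commit notifications

    def invalidate(self, keys):
        for k in keys:
            self.copies.pop(k, None)         # or: serve stale while refreshing

store = Store()
cache = Cache(store)
cache.copies["x"] = "old"
store.commit({"x": "new"})
assert "x" not in cache.copies
```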

Slide 28

What about joins?
•  Current somewhat low-hanging fruit
   – Ordered keys, as in BigTable
   – Merge equi-join across stores
   – Roughly how Inktomi worked (sorted by doc id)
•  Some apps essentially do the joins themselves
•  Harder:
   – Joining against a secondary index
   – Non-equal-key joins
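
A sketch of a merge equi-join over two stores that both return rows sorted by the join key (the "ordered keys, as in BigTable" case; the data and unique-key assumption are for illustration).

```python
def merge_join(left, right):
    """left and right are lists of (key, value) sorted by key; equi-join on key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk == rk:
            out.append((lk, left[i][1], right[j][1]))
            i += 1                       # assumes unique keys within each store
            j += 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return out

docs  = [(1, "doc one"), (2, "doc two"), (4, "doc four")]
links = [(2, 17), (3, 5), (4, 9)]
assert merge_join(docs, links) == [(2, "doc two", 17), (4, "doc four", 9)]
```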

Slide 29

A Plug for Stasis
Stasis is a framework for building these kinds of systems
   – One attempt at layering
   – Provides transactional logging and recovery
   – Can support update in place, compaction, or a mix
   – Handles the ORM problem cleanly
   – Supports 2PC (but not on top of Paxos)
Rusty Sears' PhD topic
   – Open source on GitHub

Slide 30

Stasis and the Cloud
•  Traditional DB model:
   – The log manager, buffer manager, and transaction manager have to be collocated on one node
   – Reason: hold mutexes across calls
      •  Fundamental to the use of LSNs on pages
•  One use of Stasis: break apart these pieces
   – LSN-free pages => no locks across calls
   – Instead, pipeline async calls among modules
      •  Enables modules to be on different machines
   – Enables a new approach for large-scale DBs
      •  Sharding no longer the only option for a larger DB

Slide 31

Conclusion
•  Two valid world views
   – Difference dates from the 1970s
   – Continue to converge in the cloud
•  Possible outcome:
   – Layered, modular system
   – … with great flexibility
   – … used to build a variety of systems and semantics
   – … including a full SQL DBMS