Slide 1

Slide 1 text

Execu&ng  Queries  on  a  Sharded   Database   Neha  Narula   September  25,  2012  

Slide 2

Slide 2 text

Choosing  and  Scaling  Your   Datastore   Neha  Narula   September  25,  2012  

Slide 3

Slide 3 text

Who  Am  I?   Froogle   Blobstore   Na&ve  Client  

Slide 4

Slide 4 text

In  This  Talk   •  What  to  think  about  when  choosing  a   datastore  for  a  web  applica&on   •  Myths  and  legends   •  Execu&ng  distributed  queries   •  My  research:    a  consistent  cache  

Slide 5

Slide 5 text

Every  so  oPen…   Friends  ask  me  for  advice  when  they  are   building  a  new  applica&on  

Slide 6

Slide 6 text

Friend  

Slide 7

Slide 7 text

“Hi  Neha!    I  am  making  a  new  applica&on.      

Slide 8

Slide 8 text

I  have  heard  MySQL  sucks  and  I  should  use  NoSQL.      

Slide 9

Slide 9 text

I  am  going  to  be  itera&ng  on  my  app  a  lot  

Slide 10

Slide 10 text

I  don't  have  any  customers  yet  

Slide 11

Slide 11 text

and  my  current  data  set  could  fit  on  a  thumb  drive   from  2004,  but…  

Slide 12

Slide 12 text

 Can  you  tell  me  which  NoSQL  database  I   should  use?  

Slide 13

Slide 13 text

And  how  to  shard  it?"    

Slide 14

Slide 14 text

Neha  

Slide 15

Slide 15 text

h[p://knowyourmeme.com/memes/facepalm  

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

What  to  Think  about  When  You’re   Choosing  a  Datastore   Hint:  Not  Scaling  

Slide 19

Slide 19 text

Development  Cycle   •  Prototyping   – Ease  of  use   •  Developing   – Flexibility,  mul&ple  developers   •  Running  a  real  produc&on  system   – Reliability   •  SUCCESS!    Sharding  and  specialized   datastorage       – We  should  only  be  so  lucky  

Slide 20

Slide 20 text

Gecng  Started   First  five  minutes  –  what’s  it  like?   Idea  credit:  Adam  Marcus,  The  NoSQL  Ecosystem,  HPTS  2011     via  Jus&n  Sheehy  from  Basho    

Slide 21

Slide 21 text

First  Five  Minutes:  Redis   h[p://simonwillison.net/2009/Oct/22/redis/  

Slide 22

Slide 22 text

First  Five  Minutes:  MySQL  

Slide 23

Slide 23 text

Developing   •  Mul&ple  people  working  on  the  same  code   •  Tes&ng  new  features  =>  new  access  pa[erns   •  New  person  comes  along…  

Slide 24

Slide 24 text

Redis   Time  to  go  get   lunch  

Slide 25

Slide 25 text

MySQL  

Slide 26

Slide 26 text

Ques&ons   •  What  is  your  mix  of  reads  and  writes?   •  How  much  data  do  you  have?   •  Do  you  need  transac&ons?   •  What  are  your  access  pa[erns?   •  What  will  grow  and  change?   •  What  do  you  already  know?  

Slide 27

Slide 27 text

Reads  and  Writes   •  Write  op&mized  vs.  Read  op&mized   •  MongoDB  has  a  global  write  lock  per  process   •  No  concurrent  writes!  

Slide 28

Slide 28 text

Size  of  Data   •  Does  it  fit  in  memory?   •  Disk-­‐based  solu&ons  will  be  slower   •  Worst  case,  Redis  needs  2X  the  memory  of   your  data!   – It  forks  with  copy-­‐on-­‐write   sudo echo 1 > /proc/sys/vm/overcommit_memory!  

Slide 29

Slide 29 text

Requirements   •  Performance   – Latency  tolerance   •  Durability   – Data  loss  tolerance   •  Freshness   – Staleness  tolerance   •  Up&me   – Down&me  tolerance  

Slide 30

Slide 30 text

Performance   •  Rela&onal  databases  are  considered  slow   •  But  they  are  doing  a  lot  of  work!   •  Some&mes  you  need  that  work,  some&mes   you  don’t.  

Slide 31

Slide 31 text

Simplest  Scenario   •  Responding  to  user  requests  in  real  &me   •  Frequently  read  data  fits  in  memory   •  High  read  rate   •  No  need  to  go  to  disk  or  scan  lots  of  data   Datastore  CPU  is  the  bo[leneck  

Slide 32

Slide 32 text

Cost  of  Execu&ng  a  Primary  Key  Lookup   Query   Cost   •  Receiving  message   •  Instan&a&ng  thread   •  Parsing  query   •  Op&mizing   •  Take  out  read  locks   •  (Some&mes)  lookup  in   an  index  btree   •  Responding  to  the  client   Actually  retrieving  data  

Slide 33

Slide 33 text

Op&ons  to  Make  This  Fast   •  Query  cache   •  Prepared  statements   •  Handler  Socket   – 750K  lookups/sec  on  a  single  server   Lesson:    Primary  key  lookups  can  be  fast  no   ma[er  what  datastore  you  use  

Slide 34

Slide 34 text

Flexibility  vs.  Performance   •  We  might  want  to  pay  the  overhead  for  query   flexibility   •  In  a  primary  key  datastore,  we  can  only  ask   queries  on  primary  key   •  SQL  gives  us  flexibility  to  change  our  queries  

Slide 35

Slide 35 text

Durability   •  Persistent  datastores   –  Client:    write   –  Server:  flush  to  disk,  then  send  “I  completed  your   write”   –  CRASH   –  Recover:  See  the  write   •  By  default,  MongoDB  does  not  fsync()  before   returning  to  client  on  write   –  Need  j:true   •  By  default,  MySQL  uses  MyISAM  instead  of   InnoDB  

Slide 36

Slide 36 text

Specializa&on   •  You  know  your     – query  access  pa[erns  and  traffic   – consistency  requirements   •  Design  specialized  lookups   – Transac&onal  datastore  for  consistent  data   – Memcached  for  sta&c,  mostly  unchanging  content   – Redis  for  a  data  processing  pipeline   •  Know  what  tradeoffs  to  make  

Slide 37

Slide 37 text

Ways  to  Scale   •  Reads   – Cache   – Replicate   – Par&&on   •  Writes   – Par&&on  data  amongst  mul&ple  servers    

Slide 38

Slide 38 text

Lots  of  Folklore  

Slide 39

Slide 39 text

Myths  of  Sharded  Datastores   •  NoSQL  scales  be[er  than  a  rela&onal  database   •  You  can’t  do  a  JOIN  on  a  sharded  datastore  

Slide 40

Slide 40 text

MYTH:    NoSQL  scales  be[er  than  a   rela&onal  database   •  Scaling  isn’t  about  the  datastore,  it’s  about   the  applica&on   – Examples:  FriendFeed,  Quora,  Facebook   •  Complex  queries  that  go  to  all  shards  don’t   scale   •  Simple  queries  that  par&&on  well  and  use  only   one  shard  do  scale    

Slide 41

Slide 41 text

Problem:    Applica&ons  Look  Up   Data  in  Different  Ways  

Slide 42

Slide 42 text

Post  Page   SELECT * ! FROM comments! WHERE post_id = 100! !   zrange comments:100 0 -1!   HA!  

Slide 43

Slide 43 text

Example  Par&&oned   Database   Database   Database   Database   comments table post_id! 100-199! 0-99! 200-199! Webservers  

Slide 44

Slide 44 text

MySQL MySQL MySQL Query  Goes  to  One   Par&&on   MySQL MySQL MySQL Comments   on  post  100   100-199! 0-99! 200-299!

Slide 45

Slide 45 text

MySQL MySQL MySQL Many  Concurrent   Queries   MySQL MySQL MySQL Comments   on  post  100   100-199! 0-99! 200-299! Comments   on  post  52   Comments   on  post  289  

Slide 46

Slide 46 text

User  Comments  Page   Fetch  all  of  a  user's  comments:   SELECT * FROM comments WHERE user = 'sam'!

Slide 47

Slide 47 text

Query  Goes  to  All   Par&&ons   MySQL MySQL MySQL Query  goes  to  all  servers   100-199! 0-99! 200-299! Sam’s   comments  

Slide 48

Slide 48 text

Costs  for  Query  Go  Up   When  Adding  a  New   Server   MySQL 0-99! Sam’s   comments   MySQL 100-199! MySQL 200-299! MySQL 300-399!

Slide 49

Slide 49 text

CPU  Cost  on  Server  of  Retrieving  One  Row   Two  costs:    Query  overhead  and  row  retrieval   MySQL Query   Overhead   97%  

Slide 50

Slide 50 text

Idea:  Mul&ple  Par&&onings   Par&&on  comments  table  on  post_id  and  user.     Reads  can  choose  appropriate  copy  so  queries  go  to   only  one  server.     Writes  go  to  all  copies.  

Slide 51

Slide 51 text

Mul&ple  Par&&onings   Database   Database   Database   comments table   post_id! 100-199! 0-99! 200-199! user! K-R! A-J! S-Z!

Slide 52

Slide 52 text

All  Read  Queries  Go  To  Only   One  Par&&on   MySQL MySQL MySQL 100-199! 0-99! S-Z! Sam’s   comments   Comments  on   post  100  

Slide 53

Slide 53 text

Writes  Go  to  Both  Table   Copies   MySQL MySQL MySQL 100-199! 0-99! S-Z! Write  a  new   comment  by   Vicki  on  post   100  

Slide 54

Slide 54 text

Scaling   •  Reads  scale  by  N  –  the  number  of  servers   •  Writes  are  slowed  down  by  T  –  the  number  of   table  copies   •  Big  win  if  N  >>  T   •  Table  copies  use  more  memory   – OPen  don’t  have  to  copy  large  tables  –  usually   metadata   •  How  to  execute  these  queries?  

Slide 55

Slide 55 text

Dixie   •  Is  a  query  planner  which  executes  SQL  queries  over  a   par&&oned  database     •  Op&mizes  read  queries  to  minimize  query  overhead  and   data  retrieval  costs   •  Uses  a  novel  cost  formula  which  takes  advantage  of  data   par&&oned  in  different  ways  

Slide 56

Slide 56 text

0 5000 10000 15000 20000 25000 1 2 5 10 QPS Number of Partitions One copy of page table Dixie + page.title copy Wikipedia  Workload  with  Dixie   3.2 X Each   query   going  to  1   server   Many   queries   going  to  all   servers   Wikipedia   •  Database  dump  from  2008   •  Real  HTTP  traces   •  Sharded  across  1,  2,  5,  and  10  servers   •  Compare  by  adding  a  copy  of  the  page   table  (<  100  MB),  sharded  another  way    

Slide 57

Slide 57 text

Myths  of  Sharded  Datastores   •  NoSQL  scales  be[er  than  a  rela&onal  database   •  You  can’t  do  a  JOIN  on  a  sharded  datastore  

Slide 58

Slide 58 text

MYTH:    You  can’t  execute  a  JOIN  on  a   sharded  database   •  What  are  the  JOIN  keys?    What  are  your   par&&on  keys?   •  Bad  to  move  lots  of  data   •  Not  bad  to  lookup  a  few  pieces  of  data   •  Index  lookup  JOINs  expressed  as  primary  key   lookups  are  fast  

Slide 59

Slide 59 text

Join  Query   Fetch  all  of  Alice's  comments  on  Max's  posts.   3 Alice First post! comments table post_id user text 6 7 22 Alice Alice Alice Like. You think? Nice work! 1 Max http://… posts table id author link 3 22 37 Max Max Max www. Ask Hacker http://…

Slide 60

Slide 60 text

Join  Query   Fetch  all  of  Alice's  comments  on  Max's  posts.   3 Alice First post! comments table post_id user text 6 7 22 Alice Alice Alice Like. You think? Nice work! 1 Max http://… posts table id author link 3 22 37 Max Max Max www. Ask Hacker http://…

Slide 61

Slide 61 text

Distributed  Query  Planning   •  R*   •  Shore   •  The  State  of  the  Art  in  Distributed  Query   Processing,  by  Donald  Kossman   •  Gizzard  

Slide 62

Slide 62 text

Conven&onal  Wisdom   Par&&on  on  JOIN  keys,  Send  query   as  is  to  each  server.   0-99! 0-99! 100- 199! 100- 199! 200- 299! 200- 299! SELECT comments.text ! FROM posts, comments! WHERE comments.post_id = posts.id! AND posts.author = 'Max'! AND comments.user = 'Alice'!

Slide 63

Slide 63 text

… Conven&onal  Plan  Scaling   MySQL Alice’s   comments  on   Max’s  posts   MySQL MySQL

Slide 64

Slide 64 text

Index  Lookup  Join   Par&&on  on  filter  keys,  retrieve  Max's   posts,  then  retrieve  Alice's  comments  on   Max's  posts.   0-99! 100- 199! 200- 299! J-R! S-Z! 100- 199! 200- 299! J-R! S-Z! A-I! 0-99! A-I! SELECT posts.id ! FROM posts! WHERE posts.author = 'Max’!

Slide 65

Slide 65 text

0-99! 100- 199! 200- 299! J-R! S-Z! 100- 199! 200- 299! J-R! S-Z! A-I! 0-99! A-I! ids of Max’s posts Index  Lookup  Join     Par&&on  on  filter  keys,  retrieve  Max's   posts,  then  retrieve  Alice's  comments  on   Max's  posts.  

Slide 66

Slide 66 text

0-99! 100- 199! 200- 299! J-R! S-Z! 100- 199! 200- 299! A-I! J-R! S-Z! 0-99! A-I! SELECT comments.text ! FROM comments! WHERE comments.post_id IN […]! AND comments.user = 'Alice'! Index  Lookup  Join   Par&&on  on  filter  keys,  retrieve  Max's   posts,  then  retrieve  Alice's  comments  on   Max's  posts.  

Slide 67

Slide 67 text

0-99! 100- 199! 200- 299! J-R! S-Z! 100- 199! 200- 299! A-I! J-R! S-Z! 0-99! A-I! Alice’s  comments  on   Max’s  posts   Index  Lookup  Join   Par&&on  on  filter  keys,  retrieve  Max's   posts,  then  retrieve  Alice's  comments  on   Max's  posts.  

Slide 68

Slide 68 text

… MySQL Alice’s comments on Max’s posts MySQL MySQL Index  Lookup  Join  Plan  Scaling    

Slide 69

Slide 69 text

Comparison   Conven&onal  Plan   Index  Lookup  Plan   Intermediate   data,  Max’s   posts  Alice   DIDN’T   comment  on  

Slide 70

Slide 70 text

Comparison   Conven&onal  Plan   Index  Lookup  Plan   Intermediate   data,  Max’s   posts  Alice   DIDN’T   comment  on  

Slide 71

Slide 71 text

Dixie   •  Cost  model  and  query  op&mizer  which   predicts  the  costs  of  the  two  plans   •  Query  executor  which  executes  the  original   query,  designed  for  a  single  database,  on  a   par&&oned  database  with  table  copies  

Slide 72

Slide 72 text

Dixie's  Predic&ons  for  the  Two  Plans   0 2000 4000 6000 8000 10000 0 50 100 150 Queries/Sec Posts/Author 2-Step Join Estimate Pushdown Join Estimate •  Varying  the  size  of  the  intermediate  data   •  Max’s  posts,  Alice  didn’t  comment  on   •  Fixing  the  number  of  results  returned   •  Par&&oned  over  10  servers     Conven&onal   Plan,  each   query  going   to  all  servers   Index  Lookup   Plan,  each   query  going  to   two  servers  

Slide 73

Slide 73 text

Performance  of  the  Two  Plans   0 2000 4000 6000 8000 10000 0 50 100 150 Queries/Sec Posts/Author 2-Step Join Pushdown Join 2-Step Join Estimate Pushdown Join Estimate Blue  beats  yellow  aPer  this.   Dixie  predicts  it!  

Slide 74

Slide 74 text

JOINs  Are  Not  Evil   •  When  only  transferring  small  amounts  data   •  And  with  carefully  par&&oned  data  

Slide 75

Slide 75 text

Lessons  Learned   •  Don’t  worry  at  the  beginning:  Use  what  you   know   •  Not  about  SQL  vs.  NoSQL  systems  –  you  can   scale  any  system   •  More  about  complex  vs.  simple  queries   •  When  you  do  scale,  try  to  make  every  query   use  one  (or  a  few)  shards  

Slide 76

Slide 76 text

Research:    Caching   •  Applica&ons  use  a  cache  to  store  results   computed  from  a  webserver  or  database  rows   •  Expira&on  or  invalida&on?   •  Annoying  and  difficult  for  app  developers  to   invalidate  cache  items  correctly   Work  in  progress  with  Bryan  Kate,  Eddie  Kohler,  Yandong  Mao,  and  Robert  Morris   Harvard  and  MIT  

Slide 77

Slide 77 text

Solu&on:  Dependency  Tracking  Cache   •  Applica&on  indicates  what  data  went  into  a   cached  result   •  The  system  will  take  care  of  invalida&ons   •  Don’t  need  to  recalculate  expired  data  if  it  has   not  changed   •  Always  get  the  freshest  data  when  data  has   changed   •  Track  ranges,  even  if  empty!  

Slide 78

Slide 78 text

Example  Applica&on:  Twi[er   •  Cache  a  user’s  view  of  their  twi[er  homepage   (&meline)   •  Only  invalidate  when  a  followed  user  tweets   •  Even  if  many  followed  users  tweet,  only   recalculate  the  &meline  once  when  read   •  No  need  to  recalculate  on  read  if  no  one  has   tweeted  

Slide 79

Slide 79 text

Challenges   •  Invalida&on  +  expira&on  &mes   – Reading  stale  data   •  Distributed  caching   •  Cache  coherence  on  the  original  database   rows  

Slide 80

Slide 80 text

Benefits   •  Applica&on  gets  all  the  benefits  of  caching   – Saves  on  computa&on   – Less  traffic  to  the  backend  database   •  Doesn’t  have  to  worry  about  freshness  

Slide 81

Slide 81 text

Summary   •  Choosing  a  datastore   •  Dixie   •  Dependency  tracking  cache  

Slide 82

Slide 82 text

Thanks!     [email protected]   h[p://nehanaru.la   @neha