
Executing Queries on a Sharded Database

Neha
September 25, 2012


Determining a data storage solution as your web application scales can be the most difficult part of web development, and takes time away from developing application features. MongoDB, Redis, Postgres, Riak, Cassandra, Voldemort, NoSQL, MySQL, NewSQL — the options are overwhelming, and all claim to be elastic, fault-tolerant, and durable, and to deliver great performance for both reads and writes. In the first portion of this talk I’ll discuss these different storage solutions and explain what is really important when choosing a datastore — your application data schema and feature requirements.

No matter what datastore you choose, you will eventually have to consider sharding your data store to support growing traffic. Two key challenges arise: (1) web workloads often do not have one clear partitioning and (2) it is challenging to determine how to efficiently execute queries over partitioned tables.

To address these challenges I present Dixie, a SQL query planner, optimizer, and executor for databases horizontally partitioned over multiple servers. Dixie shows that we shouldn’t give up on SQL databases just yet. Dixie automates the exploitation of tables with multiple copies partitioned in different ways, in order to increase throughput by expanding the set of queries that need not be sent to all servers. Central to Dixie’s design are a cost model and plan generator that are mindful of queries small enough that query overhead may dominate the cost. For a large class of joins, which traditional wisdom suggests require tables partitioned on the join keys, Dixie can find higher-performance plans using other partitionings.

Presented at Strange Loop, St. Louis, MO.


Transcript

1. In This Talk
   • What to think about when choosing a datastore for a web application
   • Myths and legends
   • Executing distributed queries
   • My research: a consistent cache
2. Every so often…
   Friends ask me for advice when they are building a new application
3. Development Cycle
   • Prototyping
     – Ease of use
   • Developing
     – Flexibility, multiple developers
   • Running a real production system
     – Reliability
   • SUCCESS! Sharding and specialized data storage
     – We should only be so lucky
4. Getting Started
   First five minutes – what's it like?
   Idea credit: Adam Marcus, The NoSQL Ecosystem, HPTS 2011, via Justin Sheehy from Basho
5. Developing
   • Multiple people working on the same code
   • Testing new features => new access patterns
   • New person comes along…
6. Questions
   • What is your mix of reads and writes?
   • How much data do you have?
   • Do you need transactions?
   • What are your access patterns?
   • What will grow and change?
   • What do you already know?
7. Reads and Writes
   • Write optimized vs. read optimized
   • MongoDB has a global write lock per process
   • No concurrent writes!
8. Size of Data
   • Does it fit in memory?
   • Disk-based solutions will be slower
   • Worst case, Redis needs 2X the memory of your data!
     – It forks with copy-on-write
     echo 1 | sudo tee /proc/sys/vm/overcommit_memory
9. Requirements
   • Performance
     – Latency tolerance
   • Durability
     – Data loss tolerance
   • Freshness
     – Staleness tolerance
   • Uptime
     – Downtime tolerance
10. Performance
   • Relational databases are considered slow
   • But they are doing a lot of work!
   • Sometimes you need that work, sometimes you don't.
11. Simplest Scenario
   • Responding to user requests in real time
   • Frequently read data fits in memory
   • High read rate
   • No need to go to disk or scan lots of data
   Datastore CPU is the bottleneck
12. Cost of Executing a Primary Key Lookup
   Query cost:
   • Receiving message
   • Instantiating thread
   • Parsing query
   • Optimizing
   • Take out read locks
   • (Sometimes) lookup in an index btree
   • Responding to the client
   vs. actually retrieving the data
13. Options to Make This Fast
   • Query cache
   • Prepared statements
   • HandlerSocket
     – 750K lookups/sec on a single server
   Lesson: Primary key lookups can be fast no matter what datastore you use
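As a concrete illustration of the prepared-statement option, here is a minimal Python sketch using mysql-connector-python; the connection settings, the comments table, and its id column are assumptions for illustration, not part of the talk. The statement is parsed and planned once, so repeated primary-key lookups pay mostly the row-retrieval cost.

    # Minimal sketch: a primary-key lookup issued as a server-side prepared
    # statement. Connection settings and the comments table are assumed.
    import mysql.connector

    conn = mysql.connector.connect(host="127.0.0.1", user="app",
                                   password="secret", database="blog")
    cur = conn.cursor(prepared=True)

    # Prepared on the first execute, reused on every later execute.
    stmt = "SELECT text FROM comments WHERE id = %s"
    for comment_id in (100, 101, 102):
        cur.execute(stmt, (comment_id,))
        print(cur.fetchall())

    cur.close()
    conn.close()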
14. Flexibility vs. Performance
   • We might want to pay the overhead for query flexibility
   • In a primary key datastore, we can only query by primary key
   • SQL gives us flexibility to change our queries
15. Durability
   • Persistent datastores
     – Client: write
     – Server: flush to disk, then send "I completed your write"
     – CRASH
     – Recover: see the write
   • By default, MongoDB does not fsync() before returning to the client on a write
     – Need j:true
   • By default, MySQL uses MyISAM instead of InnoDB
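To make the MongoDB point concrete, here is a minimal pymongo sketch that requests journaled acknowledgment (the modern spelling of j:true); the mongod address and the blog.comments collection are assumptions for illustration.

    # Minimal sketch: ask MongoDB to journal the write before acknowledging it.
    from pymongo import MongoClient, WriteConcern

    client = MongoClient("mongodb://localhost:27017")
    comments = client.blog.get_collection(
        "comments", write_concern=WriteConcern(j=True))

    # The acknowledgment now implies the write survives a mongod crash.
    comments.insert_one({"post_id": 100, "user": "sam", "text": "Nice work!"})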
16. Specialization
   • You know your
     – query access patterns and traffic
     – consistency requirements
   • Design specialized lookups
     – Transactional datastore for consistent data
     – Memcached for static, mostly unchanging content
     – Redis for a data processing pipeline
   • Know what tradeoffs to make
17. Ways to Scale
   • Reads
     – Cache
     – Replicate
     – Partition
   • Writes
     – Partition data amongst multiple servers
18. Myths of Sharded Datastores
   • NoSQL scales better than a relational database
   • You can't do a JOIN on a sharded datastore
19. MYTH: NoSQL scales better than a relational database
   • Scaling isn't about the datastore, it's about the application
     – Examples: FriendFeed, Quora, Facebook
   • Complex queries that go to all shards don't scale
   • Simple queries that partition well and use only one shard do scale
20. Post Page
   SELECT * FROM comments WHERE post_id = 100
   zrange comments:100 0 -1
   HA!
21. Example Partitioned Database
   Webservers in front of three databases, each holding a range of the comments
   table partitioned by post_id: 0-99, 100-199, 200-299
22. Query Goes to One Partition
   Three MySQL servers (0-99, 100-199, 200-299); the query for comments on post
   100 goes only to the 100-199 server
23. Many Concurrent Queries
   Comments on post 100, post 52, and post 289 each go to a different server
   (100-199, 0-99, 200-299)
24. User Comments Page
   Fetch all of a user's comments:
   SELECT * FROM comments WHERE user = 'sam'
25. Query Goes to All Partitions
   The query for Sam's comments goes to all servers (0-99, 100-199, 200-299)
26. Costs for Query Go Up When Adding a New Server
   Sam's comments must now be fetched from four servers (0-99, 100-199,
   200-299, 300-399)
27. CPU Cost on Server of Retrieving One Row
   Two costs: query overhead and row retrieval; for MySQL, query overhead is
   97% of the cost
28. Idea: Multiple Partitionings
   Partition the comments table on post_id and on user. Reads can choose the
   appropriate copy so queries go to only one server. Writes go to all copies.
29. Multiple Partitionings
   Three databases, each holding a range of both copies of the comments table:
   partitioned by post_id (0-99, 100-199, 200-299) and by user (A-J, K-R, S-Z)
30. All Read Queries Go To Only One Partition
   Comments on post 100 go to the 100-199 (post_id) shard; Sam's comments go to
   the S-Z (user) shard
31. Writes Go to Both Table Copies
   A new comment by Vicki on post 100 is written to the 100-199 (post_id) copy
   and to the S-Z (user) copy
32. Scaling
   • Reads scale by N, the number of servers
   • Writes are slowed down by T, the number of table copies
   • Big win if N >> T
   • Table copies use more memory
     – Often don't have to copy large tables, usually metadata
   • How to execute these queries?
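Back-of-envelope arithmetic for the N versus T claim, with illustrative numbers rather than measurements from the talk:

    # Illustrative arithmetic, not a benchmark: each read touches one server, so
    # read capacity grows with N servers; each write is applied to T table
    # copies, so write capacity is divided by T.
    def read_capacity(per_server_qps, n_servers):
        return per_server_qps * n_servers

    def write_capacity(per_server_qps, n_servers, t_copies):
        return per_server_qps * n_servers / t_copies

    # Example: 10 servers, 2 copies of the comments table, 5,000 queries/sec each.
    print(read_capacity(5000, 10))       # 50000: reads scale ~10x
    print(write_capacity(5000, 10, 2))   # 25000.0: writes pay a factor of T = 2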
33. Dixie
   • Is a query planner which executes SQL queries over a partitioned database
   • Optimizes read queries to minimize query overhead and data retrieval costs
   • Uses a novel cost formula which takes advantage of data partitioned in
     different ways
34. Wikipedia Workload with Dixie
   [Graph: QPS vs. number of partitions (1, 2, 5, 10), one copy of the page
   table vs. Dixie + a page.title copy; 3.2X gain when each query goes to one
   server instead of many queries going to all servers]
   • Database dump from 2008
   • Real HTTP traces
   • Sharded across 1, 2, 5, and 10 servers
   • Compare by adding a copy of the page table (< 100 MB), sharded another way
35. Myths of Sharded Datastores
   • NoSQL scales better than a relational database
   • You can't do a JOIN on a sharded datastore
36. MYTH: You can't execute a JOIN on a sharded database
   • What are the JOIN keys? What are your partition keys?
   • Bad to move lots of data
   • Not bad to look up a few pieces of data
   • Index lookup JOINs expressed as primary key lookups are fast
37. Join Query
   Fetch all of Alice's comments on Max's posts.
   comments table (post_id, user, text):
     3   Alice  First post!
     6   Alice  Like.
     7   Alice  You think?
     22  Alice  Nice work!
   posts table (id, author, link):
     1   Max  http://…
     3   Max  www.
     22  Max  Ask Hacker
     37  Max  http://…
38. Join Query
   Fetch all of Alice's comments on Max's posts.
   (same comments and posts tables as above)
39. Distributed Query Planning
   • R*
   • Shore
   • The State of the Art in Distributed Query Processing, by Donald Kossmann
   • Gizzard
40. Conventional Wisdom
   Partition on JOIN keys, send the query as-is to each server.
   (posts and comments partitioned the same way: 0-99, 100-199, 200-299)
   SELECT comments.text
   FROM posts, comments
   WHERE comments.post_id = posts.id
   AND posts.author = 'Max'
   AND comments.user = 'Alice'
41. Index Lookup Join
   Partition on filter keys, retrieve Max's posts, then retrieve Alice's
   comments on Max's posts.
   (each table has a copy partitioned by id range 0-99/100-199/200-299 and a
   copy partitioned by name A-I/J-R/S-Z)
   SELECT posts.id
   FROM posts
   WHERE posts.author = 'Max'
42. Index Lookup Join
   Partition on filter keys, retrieve Max's posts, then retrieve Alice's
   comments on Max's posts.
   (the ids of Max's posts flow from the posts copy to the comments copy)
43. Index Lookup Join
   Partition on filter keys, retrieve Max's posts, then retrieve Alice's
   comments on Max's posts.
   SELECT comments.text
   FROM comments
   WHERE comments.post_id IN […]
   AND comments.user = 'Alice'
44. Index Lookup Join
   Partition on filter keys, retrieve Max's posts, then retrieve Alice's
   comments on Max's posts.
   (result: Alice's comments on Max's posts)
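Sketched in Python, the index-lookup plan from slides 41-44 becomes two single-shard queries. The run(server, sql, params) helper and the shard names are placeholders invented for illustration; this is not Dixie's executor.

    # Sketch of the two-step index-lookup join; run() is a placeholder helper.
    def run(server, sql, params=()):
        """Placeholder: send one SQL query to one shard and return its rows."""
        raise NotImplementedError

    def alices_comments_on_maxs_posts():
        # Step 1: 'Max' falls in the J-R shard of the author-partitioned posts copy.
        post_ids = [row[0] for row in run(
            "posts_by_author_J_R",
            "SELECT posts.id FROM posts WHERE posts.author = %s", ("Max",))]
        if not post_ids:
            return []
        # Step 2: 'Alice' falls in the A-I shard of the user-partitioned comments copy.
        placeholders = ", ".join(["%s"] * len(post_ids))
        return run(
            "comments_by_user_A_I",
            "SELECT comments.text FROM comments "
            "WHERE comments.post_id IN (" + placeholders + ") "
            "AND comments.user = %s",
            (*post_ids, "Alice"))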
45. Comparison
   Conventional Plan vs. Index Lookup Plan
   Intermediate data: Max's posts Alice DIDN'T comment on
46. Comparison
   Conventional Plan vs. Index Lookup Plan
   Intermediate data: Max's posts Alice DIDN'T comment on
47. Dixie
   • Cost model and query optimizer which predicts the costs of the two plans
   • Query executor which executes the original query, designed for a single
     database, on a partitioned database with table copies
48. Dixie's Predictions for the Two Plans
   [Graph: Queries/sec vs. posts per author, 2-Step Join Estimate vs. Pushdown
   Join Estimate]
   • Varying the size of the intermediate data (Max's posts Alice didn't
     comment on)
   • Fixing the number of results returned
   • Partitioned over 10 servers
   Conventional plan: each query goes to all servers. Index lookup plan: each
   query goes to two servers.
49. Performance of the Two Plans
   [Graph: Queries/sec vs. posts per author, measured 2-Step Join and Pushdown
   Join plotted against their estimates]
   Blue beats yellow after this. Dixie predicts it!
50. JOINs Are Not Evil
   • When only transferring small amounts of data
   • And with carefully partitioned data
51. Lessons Learned
   • Don't worry at the beginning: use what you know
   • Not about SQL vs. NoSQL systems – you can scale any system
   • More about complex vs. simple queries
   • When you do scale, try to make every query use one (or a few) shards
52. Research: Caching
   • Applications use a cache to store results computed from a webserver or
     database rows
   • Expiration or invalidation?
   • Annoying and difficult for app developers to invalidate cache items
     correctly
   Work in progress with Bryan Kate, Eddie Kohler, Yandong Mao, and Robert
   Morris (Harvard and MIT)
53. Solution: Dependency Tracking Cache
   • Application indicates what data went into a cached result
   • The system will take care of invalidations
   • Don't need to recalculate expired data if it has not changed
   • Always get the freshest data when data has changed
   • Track ranges, even if empty!
54. Example Application: Twitter
   • Cache a user's view of their Twitter homepage (timeline)
   • Only invalidate when a followed user tweets
   • Even if many followed users tweet, only recalculate the timeline once when
     read
   • No need to recalculate on read if no one has tweeted
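A toy sketch of the dependency-tracking idea (an invented API, not the system being built): the application records which data keys a cached result was computed from, and a write to any of those keys invalidates the entry, so the timeline above is recomputed only when a followed user actually tweets.

    # Toy dependency-tracking cache: invented API for illustration only.
    class DepCache:
        def __init__(self):
            self.values = {}   # cache key -> cached result
            self.deps = {}     # data key  -> set of cache keys built from it

        def put(self, key, value, depends_on):
            self.values[key] = value
            for d in depends_on:
                self.deps.setdefault(d, set()).add(key)

        def get(self, key):
            return self.values.get(key)   # None means recompute on this read

        def data_changed(self, data_key):
            # The cache, not the application, invalidates dependent entries.
            for cache_key in self.deps.pop(data_key, set()):
                self.values.pop(cache_key, None)

    # A cached timeline depends on the tweets of the users it shows.
    cache = DepCache()
    cache.put("timeline:alice", ["tweet 7", "tweet 9"],
              depends_on=["tweets:max", "tweets:sam"])
    cache.data_changed("tweets:max")             # a followed user tweets
    assert cache.get("timeline:alice") is None   # recomputed on the next read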
55. Challenges
   • Invalidation + expiration times
     – Reading stale data
   • Distributed caching
   • Cache coherence on the original database rows
56. Benefits
   • Application gets all the benefits of caching
     – Saves on computation
     – Less traffic to the backend database
   • Doesn't have to worry about freshness
57. Summary
   • Choosing a datastore
   • Dixie
   • Dependency tracking cache