Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Executing Queries on a Sharded Database

Neha
September 25, 2012

Executing Queries on a Sharded Database

Determining a data storage solution as your web application scales can be the most difficult part of web development, and takes time away from developing application features. MongoDB, Redis, Postgres, Riak, Cassandra, Voldemort, NoSQL, MySQL, NewSQL — the options are overwhelming, and all claim to be elastic, fault-tolerant, durable, and give great performance for both reads and writes. In the first portion of this talk I’ll discuss these different storage solutions and explain what is really important when choosing a datastore — your application data schema and feature requirements.

No matter what datastore you choose, you will eventually have to consider sharding your data store to support growing traffic. Two key challenges arise: (1) web workloads often do not have one clear partitioning and (2) it is challenging to determine how to efficiently execute queries over partitioned tables.

To address these challenges I present Dixie, a SQL query planner, optimizer, and executor for databases horizontally partitioned over multiple servers. Dixie shows that we shouldn’t give up on SQL databases just yet. Dixie automates the exploitation of tables with multiple copies partitioned in different ways, in order to increase throughput by expanding the set of queries that need not be sent to all servers. Central to Dixie’s design are a cost model and plan generator that are mindful of queries small enough that query overhead may dominate the cost. For a large class of joins, which traditional wisdom suggests require tables partitioned on the join keys, Dixie can find higher-performance plans using other partitionings.

Presented at Strange Loop, St. Louis, MO.

Neha

September 25, 2012
Tweet

More Decks by Neha

Other Decks in Programming

Transcript

  1. Execu&ng  Queries  on  a  Sharded  
    Database  
    Neha  Narula  
    September  25,  2012  

    View Slide

  2. Choosing  and  Scaling  Your  
    Datastore  
    Neha  Narula  
    September  25,  2012  

    View Slide

  3. Who  Am  I?  
    Froogle  
    Blobstore  
    Na&ve  Client  

    View Slide

  4. In  This  Talk  
    •  What  to  think  about  when  choosing  a  
    datastore  for  a  web  applica&on  
    •  Myths  and  legends  
    •  Execu&ng  distributed  queries  
    •  My  research:    a  consistent  cache  

    View Slide

  5. Every  so  oPen…  
    Friends  ask  me  for  advice  when  they  are  
    building  a  new  applica&on  

    View Slide

  6. Friend  

    View Slide

  7. “Hi  Neha!    I  am  making  a  new  applica&on.      

    View Slide

  8. I  have  heard  MySQL  sucks  and  I  should  use  NoSQL.      

    View Slide

  9. I  am  going  to  be  itera&ng  on  my  app  a  lot  

    View Slide

  10. I  don't  have  any  customers  yet  

    View Slide

  11. and  my  current  data  set  could  fit  on  a  thumb  drive  
    from  2004,  but…  

    View Slide

  12.  Can  you  tell  me  which  NoSQL  database  I  
    should  use?  

    View Slide

  13. And  how  to  shard  it?"  
     

    View Slide

  14. Neha  

    View Slide

  15. h[p://knowyourmeme.com/memes/facepalm  

    View Slide

  16. View Slide

  17. View Slide

  18. What  to  Think  about  When  You’re  
    Choosing  a  Datastore  
    Hint:  Not  Scaling  

    View Slide

  19. Development  Cycle  
    •  Prototyping  
    – Ease  of  use  
    •  Developing  
    – Flexibility,  mul&ple  developers  
    •  Running  a  real  produc&on  system  
    – Reliability  
    •  SUCCESS!    Sharding  and  specialized  
    datastorage      
    – We  should  only  be  so  lucky  

    View Slide

  20. Gecng  Started  
    First  five  minutes  –  what’s  it  like?  
    Idea  credit:  Adam  Marcus,  The  NoSQL  Ecosystem,  HPTS  2011  
     
    via  Jus&n  Sheehy  from  Basho    

    View Slide

  21. First  Five  Minutes:  Redis  
    h[p://simonwillison.net/2009/Oct/22/redis/  

    View Slide

  22. First  Five  Minutes:  MySQL  

    View Slide

  23. Developing  
    •  Mul&ple  people  working  on  the  same  code  
    •  Tes&ng  new  features  =>  new  access  pa[erns  
    •  New  person  comes  along…  

    View Slide

  24. Redis  
    Time  to  go  get  
    lunch  

    View Slide

  25. MySQL  

    View Slide

  26. Ques&ons  
    •  What  is  your  mix  of  reads  and  writes?  
    •  How  much  data  do  you  have?  
    •  Do  you  need  transac&ons?  
    •  What  are  your  access  pa[erns?  
    •  What  will  grow  and  change?  
    •  What  do  you  already  know?  

    View Slide

  27. Reads  and  Writes  
    •  Write  op&mized  vs.  Read  op&mized  
    •  MongoDB  has  a  global  write  lock  per  process  
    •  No  concurrent  writes!  

    View Slide

  28. Size  of  Data  
    •  Does  it  fit  in  memory?  
    •  Disk-­‐based  solu&ons  will  be  slower  
    •  Worst  case,  Redis  needs  2X  the  memory  of  
    your  data!  
    – It  forks  with  copy-­‐on-­‐write  
    sudo echo 1 > /proc/sys/vm/overcommit_memory!
     

    View Slide

  29. Requirements  
    •  Performance  
    – Latency  tolerance  
    •  Durability  
    – Data  loss  tolerance  
    •  Freshness  
    – Staleness  tolerance  
    •  Up&me  
    – Down&me  tolerance  

    View Slide

  30. Performance  
    •  Rela&onal  databases  are  considered  slow  
    •  But  they  are  doing  a  lot  of  work!  
    •  Some&mes  you  need  that  work,  some&mes  
    you  don’t.  

    View Slide

  31. Simplest  Scenario  
    •  Responding  to  user  requests  in  real  &me  
    •  Frequently  read  data  fits  in  memory  
    •  High  read  rate  
    •  No  need  to  go  to  disk  or  scan  lots  of  data  
    Datastore  CPU  is  the  bo[leneck  

    View Slide

  32. Cost  of  Execu&ng  a  Primary  Key  Lookup  
    Query  
    Cost  
    •  Receiving  message  
    •  Instan&a&ng  thread  
    •  Parsing  query  
    •  Op&mizing  
    •  Take  out  read  locks  
    •  (Some&mes)  lookup  in  
    an  index  btree  
    •  Responding  to  the  client  
    Actually  retrieving  data  

    View Slide

  33. Op&ons  to  Make  This  Fast  
    •  Query  cache  
    •  Prepared  statements  
    •  Handler  Socket  
    – 750K  lookups/sec  on  a  single  server  
    Lesson:    Primary  key  lookups  can  be  fast  no  
    ma[er  what  datastore  you  use  

    View Slide

  34. Flexibility  vs.  Performance  
    •  We  might  want  to  pay  the  overhead  for  query  
    flexibility  
    •  In  a  primary  key  datastore,  we  can  only  ask  
    queries  on  primary  key  
    •  SQL  gives  us  flexibility  to  change  our  queries  

    View Slide

  35. Durability  
    •  Persistent  datastores  
    –  Client:    write  
    –  Server:  flush  to  disk,  then  send  “I  completed  your  
    write”  
    –  CRASH  
    –  Recover:  See  the  write  
    •  By  default,  MongoDB  does  not  fsync()  before  
    returning  to  client  on  write  
    –  Need  j:true  
    •  By  default,  MySQL  uses  MyISAM  instead  of  
    InnoDB  

    View Slide

  36. Specializa&on  
    •  You  know  your    
    – query  access  pa[erns  and  traffic  
    – consistency  requirements  
    •  Design  specialized  lookups  
    – Transac&onal  datastore  for  consistent  data  
    – Memcached  for  sta&c,  mostly  unchanging  content  
    – Redis  for  a  data  processing  pipeline  
    •  Know  what  tradeoffs  to  make  

    View Slide

  37. Ways  to  Scale  
    •  Reads  
    – Cache  
    – Replicate  
    – Par&&on  
    •  Writes  
    – Par&&on  data  amongst  mul&ple  servers  
     

    View Slide

  38. Lots  of  Folklore  

    View Slide

  39. Myths  of  Sharded  Datastores  
    •  NoSQL  scales  be[er  than  a  rela&onal  database  
    •  You  can’t  do  a  JOIN  on  a  sharded  datastore  

    View Slide

  40. MYTH:    NoSQL  scales  be[er  than  a  
    rela&onal  database  
    •  Scaling  isn’t  about  the  datastore,  it’s  about  
    the  applica&on  
    – Examples:  FriendFeed,  Quora,  Facebook  
    •  Complex  queries  that  go  to  all  shards  don’t  
    scale  
    •  Simple  queries  that  par&&on  well  and  use  only  
    one  shard  do  scale  
     

    View Slide

  41. Problem:    Applica&ons  Look  Up  
    Data  in  Different  Ways  

    View Slide

  42. Post  Page  
    SELECT * !
    FROM comments!
    WHERE post_id = 100!
    !
     
    zrange comments:100 0 -1!
     
    HA!  

    View Slide

  43. Example  Par&&oned  
    Database  
    Database  
    Database  
    Database  
    comments table
    post_id!
    100-199!
    0-99!
    200-199!
    Webservers  

    View Slide

  44. MySQL
    MySQL
    MySQL
    Query  Goes  to  One  
    Par&&on  
    MySQL
    MySQL
    MySQL
    Comments  
    on  post  100  
    100-199!
    0-99!
    200-299!

    View Slide

  45. MySQL
    MySQL
    MySQL
    Many  Concurrent  
    Queries  
    MySQL
    MySQL
    MySQL
    Comments  
    on  post  100  
    100-199!
    0-99!
    200-299!
    Comments  
    on  post  52  
    Comments  
    on  post  289  

    View Slide

  46. User  Comments  Page  
    Fetch  all  of  a  user's  comments:  
    SELECT * FROM comments WHERE user = 'sam'!

    View Slide

  47. Query  Goes  to  All  
    Par&&ons  
    MySQL
    MySQL
    MySQL
    Query  goes  to  all  servers  
    100-199!
    0-99!
    200-299!
    Sam’s  
    comments  

    View Slide

  48. Costs  for  Query  Go  Up  
    When  Adding  a  New  
    Server  
    MySQL
    0-99!
    Sam’s  
    comments  
    MySQL
    100-199!
    MySQL
    200-299!
    MySQL
    300-399!

    View Slide

  49. CPU  Cost  on  Server  of  Retrieving  One  Row  
    Two  costs:    Query  overhead  and  row  retrieval  
    MySQL
    Query  
    Overhead  
    97%  

    View Slide

  50. Idea:  Mul&ple  Par&&onings  
    Par&&on  comments  table  on  post_id  and  user.  
     
    Reads  can  choose  appropriate  copy  so  queries  go  to  
    only  one  server.  
     
    Writes  go  to  all  copies.  

    View Slide

  51. Mul&ple  Par&&onings  
    Database  
    Database  
    Database  
    comments table  
    post_id!
    100-199!
    0-99!
    200-199!
    user!
    K-R!
    A-J!
    S-Z!

    View Slide

  52. All  Read  Queries  Go  To  Only  
    One  Par&&on  
    MySQL
    MySQL
    MySQL
    100-199!
    0-99!
    S-Z!
    Sam’s  
    comments  
    Comments  on  
    post  100  

    View Slide

  53. Writes  Go  to  Both  Table  
    Copies  
    MySQL
    MySQL
    MySQL
    100-199!
    0-99!
    S-Z!
    Write  a  new  
    comment  by  
    Vicki  on  post  
    100  

    View Slide

  54. Scaling  
    •  Reads  scale  by  N  –  the  number  of  servers  
    •  Writes  are  slowed  down  by  T  –  the  number  of  
    table  copies  
    •  Big  win  if  N  >>  T  
    •  Table  copies  use  more  memory  
    – OPen  don’t  have  to  copy  large  tables  –  usually  
    metadata  
    •  How  to  execute  these  queries?  

    View Slide

  55. Dixie  
    •  Is  a  query  planner  which  executes  SQL  queries  over  a  
    par&&oned  database  
     
    •  Op&mizes  read  queries  to  minimize  query  overhead  and  
    data  retrieval  costs  
    •  Uses  a  novel  cost  formula  which  takes  advantage  of  data  
    par&&oned  in  different  ways  

    View Slide

  56. 0
    5000
    10000
    15000
    20000
    25000
    1 2 5 10
    QPS
    Number of Partitions
    One copy of page table
    Dixie + page.title copy
    Wikipedia  Workload  with  Dixie  
    3.2 X
    Each  
    query  
    going  to  1  
    server  
    Many  
    queries  
    going  to  all  
    servers  
    Wikipedia  
    •  Database  dump  from  2008  
    •  Real  HTTP  traces  
    •  Sharded  across  1,  2,  5,  and  10  servers  
    •  Compare  by  adding  a  copy  of  the  page  
    table  (<  100  MB),  sharded  another  way  
     

    View Slide

  57. Myths  of  Sharded  Datastores  
    •  NoSQL  scales  be[er  than  a  rela&onal  database  
    •  You  can’t  do  a  JOIN  on  a  sharded  datastore  

    View Slide

  58. MYTH:    You  can’t  execute  a  JOIN  on  a  
    sharded  database  
    •  What  are  the  JOIN  keys?    What  are  your  
    par&&on  keys?  
    •  Bad  to  move  lots  of  data  
    •  Not  bad  to  lookup  a  few  pieces  of  data  
    •  Index  lookup  JOINs  expressed  as  primary  key  
    lookups  are  fast  

    View Slide

  59. Join  Query  
    Fetch  all  of  Alice's  comments  on  Max's  posts.  
    3 Alice First post!
    comments table
    post_id user text
    6
    7
    22
    Alice
    Alice
    Alice
    Like.
    You think?
    Nice work!
    1 Max http://…
    posts table
    id author link
    3
    22
    37
    Max
    Max
    Max
    www.
    Ask Hacker
    http://…

    View Slide

  60. Join  Query  
    Fetch  all  of  Alice's  comments  on  Max's  posts.  
    3 Alice First post!
    comments table
    post_id user text
    6
    7
    22
    Alice
    Alice
    Alice
    Like.
    You think?
    Nice work!
    1 Max http://…
    posts table
    id author link
    3
    22
    37
    Max
    Max
    Max
    www.
    Ask Hacker
    http://…

    View Slide

  61. Distributed  Query  Planning  
    •  R*  
    •  Shore  
    •  The  State  of  the  Art  in  Distributed  Query  
    Processing,  by  Donald  Kossman  
    •  Gizzard  

    View Slide

  62. Conven&onal  Wisdom  
    Par&&on  on  JOIN  keys,  Send  query  
    as  is  to  each  server.  
    0-99! 0-99!
    100-
    199!
    100-
    199!
    200-
    299!
    200-
    299!
    SELECT comments.text !
    FROM posts, comments!
    WHERE comments.post_id = posts.id!
    AND posts.author = 'Max'!
    AND comments.user = 'Alice'!

    View Slide


  63. Conven&onal  Plan  Scaling  
    MySQL
    Alice’s  
    comments  on  
    Max’s  posts  
    MySQL MySQL

    View Slide

  64. Index  Lookup  Join  
    Par&&on  on  filter  keys,  retrieve  Max's  
    posts,  then  retrieve  Alice's  comments  on  
    Max's  posts.  
    0-99!
    100-
    199!
    200-
    299!
    J-R!
    S-Z!
    100-
    199!
    200-
    299!
    J-R!
    S-Z!
    A-I!
    0-99!
    A-I!
    SELECT posts.id !
    FROM posts!
    WHERE posts.author = 'Max’!

    View Slide

  65. 0-99!
    100-
    199!
    200-
    299!
    J-R!
    S-Z!
    100-
    199!
    200-
    299!
    J-R!
    S-Z!
    A-I!
    0-99!
    A-I!
    ids of Max’s
    posts
    Index  Lookup  Join  
     
    Par&&on  on  filter  keys,  retrieve  Max's  
    posts,  then  retrieve  Alice's  comments  on  
    Max's  posts.  

    View Slide

  66. 0-99!
    100-
    199!
    200-
    299!
    J-R!
    S-Z!
    100-
    199!
    200-
    299!
    A-I!
    J-R!
    S-Z!
    0-99!
    A-I!
    SELECT comments.text !
    FROM comments!
    WHERE comments.post_id IN […]!
    AND comments.user = 'Alice'!
    Index  Lookup  Join  
    Par&&on  on  filter  keys,  retrieve  Max's  
    posts,  then  retrieve  Alice's  comments  on  
    Max's  posts.  

    View Slide

  67. 0-99!
    100-
    199!
    200-
    299!
    J-R!
    S-Z!
    100-
    199!
    200-
    299!
    A-I!
    J-R!
    S-Z!
    0-99!
    A-I!
    Alice’s  comments  on  
    Max’s  posts  
    Index  Lookup  Join  
    Par&&on  on  filter  keys,  retrieve  Max's  
    posts,  then  retrieve  Alice's  comments  on  
    Max's  posts.  

    View Slide


  68. MySQL
    Alice’s
    comments
    on Max’s
    posts
    MySQL MySQL
    Index  Lookup  Join  Plan  Scaling  
     

    View Slide

  69. Comparison  
    Conven&onal  Plan   Index  Lookup  Plan  
    Intermediate  
    data,  Max’s  
    posts  Alice  
    DIDN’T  
    comment  on  

    View Slide

  70. Comparison  
    Conven&onal  Plan   Index  Lookup  Plan  
    Intermediate  
    data,  Max’s  
    posts  Alice  
    DIDN’T  
    comment  on  

    View Slide

  71. Dixie  
    •  Cost  model  and  query  op&mizer  which  
    predicts  the  costs  of  the  two  plans  
    •  Query  executor  which  executes  the  original  
    query,  designed  for  a  single  database,  on  a  
    par&&oned  database  with  table  copies  

    View Slide

  72. Dixie's  Predic&ons  for  the  Two  Plans  
    0
    2000
    4000
    6000
    8000
    10000
    0 50 100 150
    Queries/Sec
    Posts/Author
    2-Step Join
    Estimate
    Pushdown
    Join Estimate
    •  Varying  the  size  of  the  intermediate  data  
    •  Max’s  posts,  Alice  didn’t  comment  on  
    •  Fixing  the  number  of  results  returned  
    •  Par&&oned  over  10  servers  
     
    Conven&onal  
    Plan,  each  
    query  going  
    to  all  servers  
    Index  Lookup  
    Plan,  each  
    query  going  to  
    two  servers  

    View Slide

  73. Performance  of  the  Two  Plans  
    0
    2000
    4000
    6000
    8000
    10000
    0 50 100 150
    Queries/Sec
    Posts/Author
    2-Step Join
    Pushdown Join
    2-Step Join
    Estimate
    Pushdown Join
    Estimate
    Blue  beats  yellow  aPer  this.  
    Dixie  predicts  it!  

    View Slide

  74. JOINs  Are  Not  Evil  
    •  When  only  transferring  small  amounts  data  
    •  And  with  carefully  par&&oned  data  

    View Slide

  75. Lessons  Learned  
    •  Don’t  worry  at  the  beginning:  Use  what  you  
    know  
    •  Not  about  SQL  vs.  NoSQL  systems  –  you  can  
    scale  any  system  
    •  More  about  complex  vs.  simple  queries  
    •  When  you  do  scale,  try  to  make  every  query  
    use  one  (or  a  few)  shards  

    View Slide

  76. Research:    Caching  
    •  Applica&ons  use  a  cache  to  store  results  
    computed  from  a  webserver  or  database  rows  
    •  Expira&on  or  invalida&on?  
    •  Annoying  and  difficult  for  app  developers  to  
    invalidate  cache  items  correctly  
    Work  in  progress  with  Bryan  Kate,  Eddie  Kohler,  Yandong  Mao,  and  Robert  Morris  
    Harvard  and  MIT  

    View Slide

  77. Solu&on:  Dependency  Tracking  Cache  
    •  Applica&on  indicates  what  data  went  into  a  
    cached  result  
    •  The  system  will  take  care  of  invalida&ons  
    •  Don’t  need  to  recalculate  expired  data  if  it  has  
    not  changed  
    •  Always  get  the  freshest  data  when  data  has  
    changed  
    •  Track  ranges,  even  if  empty!  

    View Slide

  78. Example  Applica&on:  Twi[er  
    •  Cache  a  user’s  view  of  their  twi[er  homepage  
    (&meline)  
    •  Only  invalidate  when  a  followed  user  tweets  
    •  Even  if  many  followed  users  tweet,  only  
    recalculate  the  &meline  once  when  read  
    •  No  need  to  recalculate  on  read  if  no  one  has  
    tweeted  

    View Slide

  79. Challenges  
    •  Invalida&on  +  expira&on  &mes  
    – Reading  stale  data  
    •  Distributed  caching  
    •  Cache  coherence  on  the  original  database  
    rows  

    View Slide

  80. Benefits  
    •  Applica&on  gets  all  the  benefits  of  caching  
    – Saves  on  computa&on  
    – Less  traffic  to  the  backend  database  
    •  Doesn’t  have  to  worry  about  freshness  

    View Slide

  81. Summary  
    •  Choosing  a  datastore  
    •  Dixie  
    •  Dependency  tracking  cache  

    View Slide

  82. Thanks!  
     
    [email protected]  
    h[p://nehanaru.la  
    @neha  

    View Slide