Choosing a data storage solution as your web application scales can be the most difficult part of web development, and it takes time away from developing application features. MongoDB, Redis, Postgres, Riak, Cassandra, Voldemort, NoSQL, MySQL, NewSQL: the options are overwhelming, and all claim to be elastic, fault-tolerant, and durable, and to give great performance for both reads and writes. In the first portion of this talk I’ll discuss these different storage solutions and explain what is really important when choosing a datastore: your application's data schema and feature requirements.
No matter what datastore you choose, you will eventually have to consider sharding it to support growing traffic. Two key challenges arise: (1) web workloads often do not have one clear partitioning, and (2) it is hard to determine how to execute queries efficiently over partitioned tables.
To address these challenges, I present Dixie, a SQL query planner, optimizer, and executor for databases horizontally partitioned over multiple servers. Dixie shows that we shouldn’t give up on SQL databases just yet. Dixie automatically exploits tables with multiple copies partitioned in different ways, increasing throughput by expanding the set of queries that need not be sent to all servers. Central to Dixie’s design are a cost model and plan generator that account for queries small enough that per-query overhead may dominate the cost. For a large class of joins, which conventional wisdom suggests require tables partitioned on the join keys, Dixie can find higher-performance plans using other partitionings.
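To make the idea of choosing among differently partitioned copies concrete, here is a minimal toy sketch in Python. It is not Dixie's actual planner or cost model: the class `TableCopy`, the functions, the cluster size, and the two cost constants are all hypothetical, chosen only to show how a fixed per-server overhead can dominate the cost of a small query.

```python
from dataclasses import dataclass

# All values below are illustrative assumptions, not figures from Dixie.
NUM_SERVERS = 4             # hypothetical cluster size
PER_SERVER_OVERHEAD = 1.0   # assumed fixed cost of sending a query to one server
PER_ROW_COST = 0.01         # assumed cost of processing one row


@dataclass(frozen=True)
class TableCopy:
    """One copy of a table, horizontally partitioned on a single column."""
    table: str
    partition_column: str


def servers_contacted(copy, equality_predicates):
    """One server if the query pins the partition column, otherwise all servers."""
    return 1 if copy.partition_column in equality_predicates else NUM_SERVERS


def plan_cost(copy, equality_predicates, rows_examined):
    """Fixed overhead per server contacted plus work proportional to rows examined."""
    overhead = servers_contacted(copy, equality_predicates) * PER_SERVER_OVERHEAD
    return overhead + rows_examined * PER_ROW_COST


def choose_copy(copies, equality_predicates, rows_examined):
    """Pick the copy with the lowest estimated cost for this query."""
    return min(copies, key=lambda c: plan_cost(c, equality_predicates, rows_examined))


# Two copies of the same hypothetical table, partitioned in different ways.
copies = [TableCopy("users", "id"), TableCopy("users", "username")]

# Toy stand-in for: SELECT * FROM users WHERE username = 'alice'
query = {"username": "alice"}

best = choose_copy(copies, query, rows_examined=1)
print(best.partition_column)
```

With only the copy partitioned on `id`, this query would have to go to every server; the copy partitioned on `username` lets it go to one. Because the query touches a single row, almost all of its estimated cost in this sketch is per-server overhead, which is why the choice of copy matters so much for small queries.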
Presented at Strange Loop, St. Louis, MO.