Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Column Oriented DBs

Ben Darfler
September 17, 2013

Column Oriented DBs

A very brief overview of why column oriented dbs are useful.

Ben Darfler

September 17, 2013
Tweet

More Decks by Ben Darfler

Other Decks in Technology

Transcript

  1. Row Oriented Databases Application Date User ID Action Country Browser

    123 2013-09-01 ace234d session_start us Chrome 456 2013-09-01 135cea5 session_end tr Firefox 456 2013-09-02 dfe235f6 button_click tr Firefox 123 2013-09-02 ace234d button_click us Chrome 456 2013-09-03 dfe235f6 session_end tr Firefox 123 2013-09-03 ace234d session_end us Chrome
  2. • Perform a table scan ◦ Cost: 6 columns *

    6 rows = 36 i/o points • How about an index? ◦ Reduces the # of rows we have to read ◦ Adds additional storage overhead ◦ Creates a lot of random i/o (disks are bad at this) • What if we have 6 trillion rows? ◦ Reading all of those unneeded columns adds up ◦ Additional index overhead is now significant ▪ Even 1% of 100 TB is 1 TB SELECT count(distinct(user_id)) WHERE appid = 123
  3. Column Oriented Databases Application Date User ID Action Country Browser

    123 2013-09-01 ace234d session_start us Chrome 123 2013-09-02 ace234d button_click us Chrome 123 2013-09-03 ace234d session_end us Chrome 456 2013-09-01 135cea5 session_end tr Firefox 456 2013-09-02 dfe235f6 button_click tr Firefox 456 2013-09-03 dfe235f6 session_end tr Firefox
  4. • Perform a column scan ◦ Cost: 1 column *

    6 rows = 6 i/o points • Sorting instead of index ◦ Reduces the # of rows we have to read ◦ Adds overhead during insertion ◦ No need for additional index storage ◦ All sequential i/o (disks are great at this) • Compression ◦ Values in a given column tend to look very similar ◦ Compression works really well on similar data ◦ Greatly reduced i/o (even vs compressed rows) SELECT count(distinct(user_id)) WHERE appid = 123
  5. Run Length Encoding • Works well for sorted columns •

    Highly compressed • Can be queried without decompression • Use binary search to find matching records ◦ Same time complexity as B-Tree Index
  6. Additional Resources • C-Store: A Column-oriented DBMS ◦ http://db.csail.mit.edu/projects/cstore/vldb.pdf •

    The Vertica Analytic Database: C-Store 7 Years Later ◦ http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf • Cloudera Impala ◦ http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/ • Parquet ◦ http://parquet.io/ • Apache Drill ◦ http://incubator.apache.org/drill/ • Google Dremel ◦ http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google. com/en/us/pubs/archive/36632.pdf • Amazon Redshift ◦ http://aws.amazon.com/redshift/