Column Oriented DBs

Column Oriented DBs Keeping Queries Fast at Terabyte Scale

Row Oriented Databases Application Date User ID Action Country Browser
123 2013-09-01 ace234d session_start us Chrome 456 2013-09-01 135cea5 session_end tr Firefox 456 2013-09-02 dfe235f6 button_click tr Firefox 123 2013-09-02 ace234d button_click us Chrome 456 2013-09-03 dfe235f6 session_end tr Firefox 123 2013-09-03 ace234d session_end us Chrome

• Perform a table scan ◦ Cost: 6 columns *
6 rows = 36 i/o points • How about an index? ◦ Reduces the # of rows we have to read ◦ Adds additional storage overhead ◦ Creates a lot of random i/o (disks are bad at this) • What if we have 6 trillion rows? ◦ Reading all of those unneeded columns adds up ◦ Additional index overhead is now significant ▪ Even 1% of 100 TB is 1 TB SELECT count(distinct(user_id)) WHERE appid = 123

Column Oriented Databases Application Date User ID Action Country Browser
123 2013-09-01 ace234d session_start us Chrome 123 2013-09-02 ace234d button_click us Chrome 123 2013-09-03 ace234d session_end us Chrome 456 2013-09-01 135cea5 session_end tr Firefox 456 2013-09-02 dfe235f6 button_click tr Firefox 456 2013-09-03 dfe235f6 session_end tr Firefox

• Perform a column scan ◦ Cost: 1 column *
6 rows = 6 i/o points • Sorting instead of index ◦ Reduces the # of rows we have to read ◦ Adds overhead during insertion ◦ No need for additional index storage ◦ All sequential i/o (disks are great at this) • Compression ◦ Values in a given column tend to look very similar ◦ Compression works really well on similar data ◦ Greatly reduced i/o (even vs compressed rows) SELECT count(distinct(user_id)) WHERE appid = 123

Run Length Encoding Application (123, 10M) (234, 5M) (345, 100M)
(456, 1K) (567, 10B) (678, 546)

Run Length Encoding • Works well for sorted columns •
Highly compressed • Can be queried without decompression • Use binary search to find matching records ◦ Same time complexity as B-Tree Index

Additional Resources • C-Store: A Column-oriented DBMS ◦ http://db.csail.mit.edu/projects/cstore/vldb.pdf •
The Vertica Analytic Database: C-Store 7 Years Later ◦ http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf • Cloudera Impala ◦ http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/ • Parquet ◦ http://parquet.io/ • Apache Drill ◦ http://incubator.apache.org/drill/ • Google Dremel ◦ http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google. com/en/us/pubs/archive/36632.pdf • Amazon Redshift ◦ http://aws.amazon.com/redshift/

Column Oriented DBs

Column Oriented DBs

Ben Darfler

More Decks by Ben Darfler

Other Decks in Technology

Featured

Transcript

Column Oriented DBs Keeping Queries Fast at Terabyte Scale

Row Oriented Databases Application Date User ID Action Country Browser

• Perform a table scan ◦ Cost: 6 columns *

Column Oriented Databases Application Date User ID Action Country Browser

• Perform a column scan ◦ Cost: 1 column *

Run Length Encoding Application (123, 10M) (234, 5M) (345, 100M)

Run Length Encoding • Works well for sorted columns •

Additional Resources • C-Store: A Column-oriented DBMS ◦ http://db.csail.mit.edu/projects/cstore/vldb.pdf •