MapReduce and Columnar DB's

samant
April 02, 2014


Transcript

  1. MapReduce - Definition
    • One of Google’s greatest contributions to computer science
    • MapReduce is an algorithmic framework for executing jobs in parallel over several nodes
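A minimal sketch of the map, shuffle, and reduce steps, using word counting as the job. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a local thread pool stands in for the cluster nodes; this is an illustration of the idea, not how any particular framework implements it.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Each chunk stands in for the data held by one node; the thread
    # pool is an assumption that stands in for the cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
```

Because each chunk is mapped independently and each key is reduced independently, both phases can be spread over as many workers as there are chunks or keys.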
  2. MapReduce - Major Implementation
    • Almost always based on Hadoop, an Apache-supported framework for the storage and processing of large-scale distributed data
    • Itself inspired by Google’s MapReduce and Google File System papers (Google BigTable inspired HBase, the column store that runs on top of Hadoop)
  3. Columnar DBs - Definition
    Columnar databases are so named because the important aspect of their design is that data from a given column is stored together. (By contrast, a row-oriented database keeps information about a row together.) In column-oriented databases, adding columns is quite inexpensive.
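The difference between the two layouts can be sketched with plain Python structures. The sample rows, the `to_columns` helper, and the choice of `id` as row key are all assumptions made for illustration; real column stores use on-disk formats that this does not model.

```python
# Row-oriented layout: everything known about one row is kept together.
rows = [
    {"id": 1, "name": "ann", "age": 31},
    {"id": 2, "name": "bob", "age": 27},
    {"id": 3, "name": "cy", "age": 45},
]


def to_columns(rows, key="id"):
    """Regroup the rows so that all values of one column are kept together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            if name != key:
                columns.setdefault(name, {})[row[key]] = value
    return columns


columns = to_columns(rows)

# A query over a single column reads only that column's data.
average_age = sum(columns["age"].values()) / len(columns["age"])

# Adding a column creates one new, empty structure and rewrites no rows.
columns["city"] = {}
columns["city"][2] = "Oslo"  # rows without a city store nothing at all
```

In this sketch a missing value is simply an absent key, which is one way to see why nulls need not cost storage in such a layout.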
  4. Columnar DBs - Queries
    get 't1', 'r1', {COLUMN => 'c1'}
    get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
    get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
    get 't1', 'r1', {COLUMN => 'c1', TIMERANGE => [ts1, ts2], VERSIONS => 4}
    get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
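The queries above use HBase shell syntax. What the `TIMESTAMP`, `TIMERANGE`, and `VERSIONS` options select can be sketched with a small in-memory model. `VersionedTable` is a hypothetical class written for this illustration and is not an HBase client API; the sketch assumes that a time range includes its start and excludes its end, and that versions come back newest first.

```python
class VersionedTable:
    """Toy model of a table whose cells keep several timestamped versions."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None, timerange=None, versions=1):
        cell = self._cells.get((row, column), {})
        stamps = sorted(cell, reverse=True)  # newest version first
        if timestamp is not None:
            stamps = [ts for ts in stamps if ts == timestamp]
        if timerange is not None:
            start, end = timerange  # assumed half-open: start <= ts < end
            stamps = [ts for ts in stamps if start <= ts < end]
        return [(ts, cell[ts]) for ts in stamps[:versions]]


table = VersionedTable()
for ts in (10, 20, 30, 40, 50):
    table.put("r1", "c1", ts, f"v{ts}")

latest = table.get("r1", "c1")
at_20 = table.get("r1", "c1", timestamp=20)
in_range = table.get("r1", "c1", timerange=(20, 50), versions=4)
```

Here `latest` holds only the newest version, `at_20` holds the single version written at timestamp 20, and `in_range` holds up to four versions from the requested window.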
  5. Columnar DBs - Supporting Companies
    • Facebook • Yahoo • eBay • Twitter • Amazon • Google • ...
  6. Columnar DBs - Pros
    • Horizontal scalability (replication and partitioning)
    • Versioning is trivial
    • No real storage cost for null values
    • Used mainly for Big Data / data mining / Business Intelligence analysis
  7. Columnar DBs - Cons
    • Complexity (installation, infrastructure and usage)
    • The schema must be designed around how you plan to query the data
    • Some operations are very time-consuming
  8. Facebook Messaging Index Table

    |              | Keyword #1            | Keyword #2            | Keyword #3            | Keyword #...          |
    | User ID #1   | Timestamp, Message_id | Timestamp, Message_id | Timestamp, Message_id | Timestamp, Message_id |
    | User ID #2   | Timestamp, Message_id | Timestamp, Message_id | Timestamp, Message_id | Timestamp, Message_id |
    | User ID #... | Timestamp, Message_id | Timestamp, Message_id | Timestamp, Message_id | Timestamp, Message_id |
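One way to read this table: the row key is a user ID, each column is a keyword, and each cell version maps a timestamp to the ID of a message containing that keyword. The sketch below models that reading with nested dictionaries. The `index_message` and `search` helpers and the sample messages are hypothetical, and the tokenization is an assumption; it is not Facebook's implementation.

```python
import re
from collections import defaultdict

# index[user_id][keyword][timestamp] = message_id
index = defaultdict(lambda: defaultdict(dict))


def index_message(user_id, message_id, timestamp, text):
    """Record the message under every distinct keyword it contains."""
    for keyword in set(re.findall(r"\w+", text.lower())):
        index[user_id][keyword][timestamp] = message_id


def search(user_id, keyword, limit=10):
    """Return message IDs for one user's keyword, newest first."""
    versions = index.get(user_id, {}).get(keyword.lower(), {})
    newest_first = sorted(versions, reverse=True)
    return [versions[ts] for ts in newest_first[:limit]]


index_message("u1", "m1", 100, "Lunch tomorrow?")
index_message("u1", "m2", 200, "lunch moved to Friday")
index_message("u2", "m3", 150, "Lunch plans")

u1_lunch = search("u1", "lunch")
```

A search touches a single row (the user) and a single column (the keyword), which matches the advice to design the schema around the queries you plan to run.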
  9. References
    Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement, by Eric Redmond and Jim R. Wilson