Slide 1

Slide 1 text

MapReduce and Columnar DB’s
Amant Stéphane (@stephamant)

Slide 2

Slide 2 text

Summary
● MapReduce
● Columnar DB’s
● Practical Use Case

Slide 3

Slide 3 text

MapReduce

Slide 4

Slide 4 text

MapReduce - Definition
● One of Google’s greatest contributions to computer science
● MapReduce is an algorithmic framework that splits a job into map and reduce steps so it can be executed in parallel across several nodes
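To make the definition concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using word counting as the job. It is not how Hadoop or Google implement it: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are made up for illustration, and a thread pool merely stands in for the nodes of a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for the data held by one node (assumption);
    # the thread pool stands in for the cluster.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
```

The map step never looks at more than one chunk and the reduce step never looks at more than one key, which is what lets a real framework spread both steps over many machines.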

Slide 5

Slide 5 text

MapReduce

Slide 6

Slide 6 text

MapReduce

Slide 7

Slide 7 text

MapReduce - Major Implementation
● Almost always based on Hadoop, an Apache framework for the storage and processing of large-scale distributed data
● Hadoop itself was inspired by Google’s MapReduce and Google File System papers (BigTable is the model for HBase)

Slide 8

Slide 8 text

Columnar DB’s

Slide 9

Slide 9 text

Columnar DB’s - Definition
Columnar databases are so named because the important aspect of their design is that data from a given column is stored together. (By contrast, a row-oriented database keeps information about a row together.) In column-oriented databases, adding columns is quite inexpensive.
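The difference between the two layouts can be sketched with plain Python structures. This is only an in-memory illustration, not the storage format of any particular database; the sample records and the helper `to_columns` are invented for the example.

```python
# Row-oriented layout: everything known about one row is kept together.
rows = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob", "age": 40},
    {"id": 3, "name": "carol", "age": 50},
]


def to_columns(records):
    """Pivot row records into a column-oriented layout: one list per column.

    Assumes every record has the same keys (hypothetical helper).
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column reads a single list and ignores the others.
average_age = sum(columns["age"]) / len(columns["age"])

# Adding a column creates one new list; the existing columns are not rewritten.
columns["city"] = ["Paris", "Lyon", "Lille"]
```

In the row layout, the same aggregate would have to visit every record, and adding a column would mean touching each of them.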

Slide 10

Slide 10 text

Columnar DB’s - Definition

Slide 11

Slide 11 text

Columnar DB’s - Definition

Slide 12

Slide 12 text

Columnar DB’s - Queries
get 't1', 'r1', {COLUMN => 'c1'}
get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
get 't1', 'r1', {COLUMN => 'c1', TIMERANGE => [ts1, ts2], VERSIONS => 4}
get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
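What these queries select can be approximated with a toy in-memory model in which every cell keeps a history of timestamped values. This is a sketch, not the HBase client API: `ToyTable` and its methods are hypothetical, and the model assumes that a time range is half-open (start included, end excluded) and that the version limit keeps the newest entries first.

```python
class ToyTable:
    """Toy model of one table: row key -> column -> {timestamp: value}."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, value, timestamp):
        self._cells.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, columns, timestamp=None, timerange=None, versions=1):
        result = {}
        for column in columns:
            history = self._cells.get(row, {}).get(column, {})
            stamps = sorted(history, reverse=True)  # newest first
            if timestamp is not None:
                # Keep only the version written at exactly this timestamp.
                stamps = [ts for ts in stamps if ts == timestamp]
            if timerange is not None:
                start, end = timerange  # assumed half-open: start <= ts < end
                stamps = [ts for ts in stamps if start <= ts < end]
            selected = stamps[:versions]
            if selected:  # columns with no stored value are simply absent
                result[column] = [(ts, history[ts]) for ts in selected]
        return result


table = ToyTable()
for ts, value in [(1, "a"), (2, "b"), (3, "c")]:
    table.put("r1", "c1", value, ts)
table.put("r1", "c2", "x", 2)

latest = table.get("r1", ["c1"])
several = table.get("r1", ["c1", "c2", "c3"])
at_ts = table.get("r1", ["c1"], timestamp=2)
in_range = table.get("r1", ["c1"], timerange=(1, 3), versions=4)
```

Column `c3` was never written for row `r1`, so it occupies no space in the model and is missing from the result rather than returned as a null.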

Slide 13

Slide 13 text

Columnar DB’s - Major Implementation
● Cassandra
● Hypertable
● HBase

Slide 14

Slide 14 text

Columnar DBs - Supporting Companies
● Facebook
● Yahoo
● eBay
● Twitter
● Amazon
● Google
● ...

Slide 15

Slide 15 text

Columnar DB’s - Pros
● Horizontal scalability (replication and partitioning)
● Versioning is trivial
● No real storage cost for null values
● Used mainly for Big Data / data mining / Business Intelligence analysis

Slide 16

Slide 16 text

Columnar DB’s - Cons
● Complexity (installation, infrastructure, and usage)
● The schema must be designed around how you plan to query the data
● Some operations are very expensive in time

Slide 17

Slide 17 text

Practical Use Case

Slide 18

Slide 18 text

Facebook Messaging Index Table

             | Keyword #1           | Keyword #2           | Keyword #3           | Keyword #...
User ID #1   | Timestamp Message_id | Timestamp Message_id | Timestamp Message_id | Timestamp Message_id
User ID #2   | Timestamp Message_id | Timestamp Message_id | Timestamp Message_id | Timestamp Message_id
User ID #... | Timestamp Message_id | Timestamp Message_id | Timestamp Message_id | Timestamp Message_id
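One way to read this table is as an inverted index: one row per user, one column per keyword, and each cell holding timestamped message IDs. The sketch below builds such an index in memory under that reading. The functions `index_message` and `search`, the whitespace tokenization, and the sample messages are all assumptions made for illustration, not Facebook's implementation.

```python
from collections import defaultdict


def new_index():
    """user_id -> keyword -> list of (timestamp, message_id) entries."""
    return defaultdict(lambda: defaultdict(list))


def index_message(index, user_id, message_id, timestamp, text):
    """Record the message under every distinct keyword it contains."""
    for keyword in set(text.lower().split()):
        index[user_id][keyword].append((timestamp, message_id))


def search(index, user_id, keyword):
    """Return one user's messages containing the keyword, newest first."""
    entries = index.get(user_id, {}).get(keyword.lower(), [])
    return sorted(entries, reverse=True)


index = new_index()
index_message(index, "user1", "m1", 100, "lunch tomorrow")
index_message(index, "user1", "m2", 200, "Lunch today")
index_message(index, "user2", "m3", 150, "lunch")

hits = search(index, "user1", "lunch")
```

A search touches a single row (the user) and a single column (the keyword), which matches the earlier point that the schema is designed around the query it has to serve.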

Slide 19

Slide 19 text

References Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond and Jim R. Wilson

Slide 20

Slide 20 text

Thank you