Schema Design with MongoDB - Tony Hannan, Software Engineer, 10gen

MongoDB Schema Design Tony Hannan [email protected] MongoDallas Nov 2011

Document model not too different from Relational model
RDBMS                MongoDB
Table                Collection
Flat record (row)    Full record (document)
Index                Index
Join                 Embed or client-side join
Transaction          Single-document transaction

RDBMS schema design
• ER → Relational → Physical Design & Indexes
  1. Every entity gets a table
  2. Every many-to-many relationship gets a table
  3. Denormalize to remove joins
  4. Cluster tables to speed up joins
  5. Index to speed up selection and joins
• Steps 3 & 4 are optional but recommended

MongoDB schema design
• ER → Relational → Physical Design & Indexes
  1. Every entity gets a table
  2. Every many-to-many relationship gets a table
  3. Denormalize to remove joins
  4. Cluster tables to remove joins
  5. Index to speed up selection and client-side joins
• Steps 3 & 4 are mandatory because MongoDB has no multi-object transactions and no (server-side) joins

Client-side join
• Schema
    A {id, a, bid, c} → B {id, x, y}
• Join query
    as = db.A.find( {a: "foo"} )
    bs = db.B.find( {id: {$in: distinct(map(bid, as))}} )
    abs = [{id, a, {x, y}, c} | {id, a, b, c} <- as, {bid, x, y} <- bs, b == bid]
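The join query above can be sketched in plain JavaScript. This is a minimal illustration, not the speaker's code: in-memory arrays stand in for the results of the two find() calls, and the names referencedIds, clientSideJoin, aDocs, and bDocs, as well as the sample data, are assumptions made for the example.

```javascript
// Client-side join of A {id, a, bid, c} with B {id, x, y}.
// Plain arrays stand in for query results; all sample data is made up.

// Distinct bid values: the list that would be passed to $in in the second query.
function referencedIds(aDocs) {
  return [...new Set(aDocs.map((a) => a.bid))];
}

// Inner join: each A document gets the {x, y} of the B it references.
function clientSideJoin(aDocs, bDocs) {
  // Index B documents by id so each A is matched with one map lookup.
  const bById = new Map(bDocs.map((b) => [b.id, b]));
  return aDocs
    .filter((a) => bById.has(a.bid)) // drop As whose B was not fetched
    .map(({ id, a, bid, c }) => {
      const { x, y } = bById.get(bid);
      return { id, a, b: { x, y }, c };
    });
}

// Stand-in for db.A.find( {a: "foo"} )
const aDocs = [
  { id: 1, a: "foo", bid: 10, c: "c1" },
  { id: 2, a: "foo", bid: 11, c: "c2" },
  { id: 3, a: "foo", bid: 10, c: "c3" },
];

// Stand-in for collection B and for the $in query against it
const allBs = [
  { id: 10, x: "x10", y: "y10" },
  { id: 11, x: "x11", y: "y11" },
  { id: 12, x: "x12", y: "y12" },
];
const wantedIds = referencedIds(aDocs);
const bDocs = allBs.filter((b) => wantedIds.includes(b.id));

const joined = clientSideJoin(aDocs, bDocs);
console.log(joined);
```

Building the lookup map first keeps the join linear in the number of documents, instead of scanning bDocs once per A document.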

Denormalization
    A {id, a, bid, c} → B {id, x, y}
  Embed:
    A {id, a, b: {x, y}, c}
  or Duplicate:
    A {id, a, b: {id, x}, c} → B {id, x, y}

Clustered tables in RDBMS (not clustered indexes)
• Schema
    A {id, a, c} <-->> B {id, aid, x, y}
• Layout on disk
    A {id: 1, a, c}
      B {id, aid: 1, x, y}
      B {id, aid: 1, x, y}
    A {id: 2, a, c}
      B {id, aid: 2, x, y}
    ...

Clustered tables = embedded array in MongoDB
    A {id, a, c} <-->> B {id, aid, x, y}
    A {id, a, bs: [ {x, y} ], c}
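As a rough illustration of this mapping, the sketch below folds child rows B {id, aid, x, y} into their parents as an embedded bs array. The function name embedChildren and the sample rows are assumptions for the example; it works on in-memory arrays rather than on a live database.

```javascript
// Convert the one-to-many schema  A {id, a, c} <-->> B {id, aid, x, y}
// into the embedded form          A {id, a, bs: [ {x, y} ], c}.
function embedChildren(parents, children) {
  // Group the {x, y} part of each child under its parent id (aid).
  const byParent = new Map();
  for (const { aid, x, y } of children) {
    if (!byParent.has(aid)) byParent.set(aid, []);
    byParent.get(aid).push({ x, y });
  }
  // Parents without children get an empty array.
  return parents.map(({ id, a, c }) => ({
    id,
    a,
    bs: byParent.get(id) || [],
    c,
  }));
}

// Made-up sample rows
const parentRows = [
  { id: 1, a: "a1", c: "c1" },
  { id: 2, a: "a2", c: "c2" },
];
const childRows = [
  { id: 100, aid: 1, x: 1, y: 2 },
  { id: 101, aid: 1, x: 3, y: 4 },
  { id: 102, aid: 2, x: 5, y: 6 },
];

const embedded = embedChildren(parentRows, childRows);
console.log(JSON.stringify(embedded, null, 2));
```

The child's own id and its aid foreign key disappear in the embedded form, since position inside the parent document now carries that relationship.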

Art of schema design
• What to …
  • Embed (both single document and array of documents)
  • Duplicate
  • Keep as client-side join
  • Index

Embed dependent entities
• Embed entities that always appear with their parent
• Examples
  • Address
  • Comments
  • Items in a shopping cart
    Customer {
      address: {street, city, state, zip},
      cart: [ {productId, quantity} ]
      …
    }

Growing embedded arrays
• Documents have to move when they grow beyond their current allocated space
  – There is padding so they don't move on every insert
• Regularly growing/moving large objects slows down updates
• The alternative is to not embed; however, dependent entities are then not colocated but interleaved with other, unrelated dependents (resulting in slow retrieval)
• A hybrid approach is to bucket dependents, so every N dependents reside together
    A {id, a, bs: [ {x, y} ], c}
    A {id, a, c} <-->> B {id, aid, bs: [ {x, y} ]}
  – Add new B bucket for every N inserts of {x, y}
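The bucketing rule can be sketched as follows. This is a minimal in-memory model under stated assumptions: the bucket size of 3, the function name addDependent, and the sequential bucket ids are all invented for illustration, and a plain array stands in for collection B.

```javascript
// Bucketed dependents: B {id, aid, bs: [ {x, y} ]}, at most N entries per bucket.
const BUCKET_SIZE = 3; // N, chosen arbitrarily for this example

// Append one dependent {x, y} for parent `aid`, opening a new bucket
// whenever the parent's most recent bucket is full (or none exists yet).
function addDependent(buckets, aid, dependent, bucketSize = BUCKET_SIZE) {
  // The most recent bucket for this parent is the last one in insertion order.
  let bucket = null;
  for (const b of buckets) {
    if (b.aid === aid) bucket = b;
  }
  if (bucket === null || bucket.bs.length >= bucketSize) {
    bucket = { id: buckets.length + 1, aid, bs: [] };
    buckets.push(bucket);
  }
  bucket.bs.push(dependent);
  return bucket;
}

// Four inserts for parent 1 fill one bucket and start a second;
// one insert for parent 2 gets its own bucket.
const bucketCollection = [];
for (let i = 0; i < 4; i++) {
  addDependent(bucketCollection, 1, { x: i, y: i * 10 });
}
addDependent(bucketCollection, 2, { x: 99, y: 990 });
console.log(JSON.stringify(bucketCollection));
```

Each bucket stops growing once it holds N entries, so only the newest bucket per parent is ever rewritten, while older dependents stay colocated in groups of N.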

Duplicate
• Duplicate where the benefit of removing the client-side join outweighs the cost of maintaining duplicates
• You won't be able to update both copies atomically, because transactions are single-object only

Indexing
• Index where the speed benefit outweighs the space and update cost
• You may index embedded fields, even inside arrays
  • Every element of the array gets indexed
• Example
    Schema: Customer {id, address, cart: [ {productId, quantity} ]}
    Query:  db.Customer.find( {"cart.productId": 1234} )
    Index:  db.Customer.ensureIndex( {"cart.productId": 1} )
• See next talk "Indexing and Query Optimization" for more
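To make "every element of the array gets indexed" concrete, here is a toy in-memory model of such an index. It is only a conceptual sketch and says nothing about how MongoDB stores indexes internally; the names buildCartProductIndex and findCustomersByProduct and the sample customers are assumptions for the example.

```javascript
// Toy model of an index on cart.productId: one index entry per distinct
// productId found in each customer's cart array.
function buildCartProductIndex(customers) {
  const index = new Map(); // productId -> array of customer ids
  for (const customer of customers) {
    const seen = new Set(); // avoid listing a customer twice for one key
    for (const item of customer.cart) {
      if (seen.has(item.productId)) continue;
      seen.add(item.productId);
      if (!index.has(item.productId)) index.set(item.productId, []);
      index.get(item.productId).push(customer.id);
    }
  }
  return index;
}

// Lookup that plays the role of find( {"cart.productId": productId} ).
function findCustomersByProduct(index, productId) {
  return index.get(productId) || [];
}

// Made-up sample data
const sampleCustomers = [
  { id: 1, cart: [{ productId: 1234, quantity: 1 }, { productId: 5, quantity: 2 }] },
  { id: 2, cart: [{ productId: 1234, quantity: 3 }] },
  { id: 3, cart: [] },
];

const cartIndex = buildCartProductIndex(sampleCustomers);
console.log(findCustomersByProduct(cartIndex, 1234));
```

The model also shows where the update cost comes from: every change to a cart array means adding or removing index entries for that customer.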

Dynamic schema (schemaless)
• Easy schema evolution
  • Can add/remove fields on the fly
• Mixed types in same collection, good for subtypes
  • E.g. Employee collection holding Hourly and Salary employees
      {name, address, department, salary, ...}
      {name, address, department, hourlyRate, ...}
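With mixed subtypes in one collection, application code typically branches on which fields a document carries. The sketch below is one hedged way to do that; the function weeklyPay, the assumption that salary is an annual figure paid over 52 weeks, and the sample employees are all invented for illustration.

```javascript
// Hourly and Salary employees share one collection and are told apart
// by the field each document carries (hourlyRate vs. salary).
function weeklyPay(employee, hoursWorked) {
  if ("hourlyRate" in employee) {
    return employee.hourlyRate * hoursWorked;
  }
  if ("salary" in employee) {
    // Assumption for this example: salary is annual, paid over 52 weeks.
    return employee.salary / 52;
  }
  throw new Error("unknown employee subtype");
}

// Made-up documents, as they might come back from one Employee collection
const employees = [
  { name: "Ann", department: "Support", hourlyRate: 20 },
  { name: "Bob", department: "Sales", salary: 52000 },
];

for (const e of employees) {
  console.log(e.name, weeklyPay(e, 40));
}
```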

Single-object transactions
• Embedding dependents likely makes basic transactions hit just one document (object)
• If you still have transactions that span multiple documents, then
  • Consider if you can live without the transaction semantics
  • Use compensating transactions over single documents
  • Implement application-level transactions using single-object transactions as the locking primitive

findAndModify
• Combined query and update of a single object, atomically
• Example: Priority queue
  • Schema: Queue {_id, priority, …}
  • Get and remove highest priority item on queue (atomically)
      x = db.Queue.findAndModify( {query: {}, sort: {priority: -1}, remove: true} )
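The effect of that call can be modelled in a few lines of plain JavaScript. This is only an in-memory sketch of the semantics (select by descending priority, remove, return), with an array standing in for the Queue collection; the name popHighestPriority and the sample items are assumptions, and the database's atomicity guarantee is not something this model demonstrates.

```javascript
// In-memory model of: findAndModify( {query: {}, sort: {priority: -1}, remove: true} )
// Removes and returns the item with the highest priority, or null if empty.
function popHighestPriority(queue) {
  if (queue.length === 0) return null;
  let best = 0;
  for (let i = 1; i < queue.length; i++) {
    if (queue[i].priority > queue[best].priority) best = i;
  }
  // splice removes the chosen item from the array and hands it back.
  return queue.splice(best, 1)[0];
}

// Made-up queue contents
const jobQueue = [
  { _id: 1, priority: 2 },
  { _id: 2, priority: 9 },
  { _id: 3, priority: 5 },
];

const firstJob = popHighestPriority(jobQueue);
console.log(firstJob, jobQueue.length);
```

In the real call, the find and the remove happen as one step on the server, so two consumers cannot both receive the same item.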

Compare & swap using single-object transaction
• Update object as long as it hasn't changed (optimistic transaction)
  1. Get object
       x = db.X.findOne({_id: 1234})
  2. Edit object
  3. Save object as long as it hasn't been changed by someone else
       x.version++
       db.X.update({_id: 1234, version: x.version - 1}, x)
       r = db.getLastErrorObj()
       if (r.n == 0) throw tryAgain(x)
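The three steps can be wrapped in a retry loop. The sketch below is a hedged, self-contained model: makeStore is a hypothetical in-memory stand-in for a collection whose update() reports the number of documents it changed (playing the role of r.n), and compareAndSwap, maxAttempts, and the sample document are likewise assumptions for illustration.

```javascript
// True when every field in `query` equals the same field in `doc`.
function matches(doc, query) {
  return Object.keys(query).every((k) => doc[k] === query[k]);
}

// Hypothetical in-memory stand-in for a collection.
function makeStore(docs) {
  const data = docs.map((d) => ({ ...d }));
  return {
    findOne(query) {
      const hit = data.find((d) => matches(d, query));
      return hit ? { ...hit } : null; // return a copy, like a fetched document
    },
    update(query, replacement) {
      const i = data.findIndex((d) => matches(d, query));
      if (i === -1) return { n: 0 }; // nothing matched, nothing written
      data[i] = { ...replacement };
      return { n: 1 };
    },
  };
}

// Optimistic update: re-read and retry whenever the version moved underneath us.
function compareAndSwap(store, id, edit, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const x = store.findOne({ _id: id }); // 1. get object
    if (x === null) throw new Error("document not found");
    const expectedVersion = x.version;
    edit(x); // 2. edit object
    x.version = expectedVersion + 1;
    // 3. save only if the stored version is still the one we read
    const r = store.update({ _id: id, version: expectedVersion }, x);
    if (r.n === 1) return x;
  }
  throw new Error("too many conflicting updates");
}

const counterStore = makeStore([{ _id: 1234, version: 0, count: 0 }]);
const updated = compareAndSwap(counterStore, 1234, (doc) => {
  doc.count += 1;
});
console.log(updated);
```

Putting the version in the update's query is what makes the check and the write a single-document operation; a bounded retry count keeps a hot document from looping forever.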
