Slide 1

Slide 1 text


Slide 2

Slide 2 text

Directory Layout
•  Separate files per database
•  Aggressive preallocation
•  Files contain one or more extents

-rw------- 1 ben ben  64M May 1 19:14 test.0
-rw------- 1 ben ben 128M May 1 19:14 test.1
-rw------- 1 ben ben 256M May 1 18:25 test.2
-rw------- 1 ben ben 512M May 1 19:14 test.3
-rw------- 1 ben ben 1.0G May 1 19:14 test.4
-rw------- 1 ben ben 2.0G May 1 18:58 test.5
-rw------- 1 ben ben  16M May 1 19:14 test.ns
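In the listing, each data file is twice the size of the one before it, starting at 64MB. A minimal sketch of that pattern follows, assuming the doubling stops at the 2GB shown for test.5. `dataFileSize` is a hypothetical helper for illustration, not a MongoDB function.

```javascript
const MB = 1024 * 1024;
const FIRST_FILE_SIZE = 64 * MB;   // size of test.0 in the listing
const MAX_FILE_SIZE = 2048 * MB;   // assumption: cap at the 2.0G shown for test.5

// Hypothetical helper: size in bytes of data file <db>.<fileNo>.
function dataFileSize(fileNo) {
  return Math.min(FIRST_FILE_SIZE * 2 ** fileNo, MAX_FILE_SIZE);
}

for (let n = 0; n <= 5; n++) {
  console.log(`test.${n}: ${dataFileSize(n) / MB}MB`);
}
```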

Slide 3

Slide 3 text

Memory Mapping
[Diagram: MONGOD process virtual memory, from 0x0 (NULL) up to 0x7fffffffffff, containing HEAP, the mapped files test.ns, test.0, test.1, …, LIBS, and STACK. A document { … } on disk is mapped into the process virtual memory.]

Slide 4

Slide 4 text

Data Structures
•  DiskLoc
    •  Stores file number and offset of data on disk
    •  Record *r = mmap base + DiskLoc.offset
    •  Max offset is 2^31 (2GB)
•  NamespaceDetails
    •  Stores collection metadata
•  Extent
    •  Stores contiguous blocks within a namespace
    •  Max extent size is 2GB
•  Record
    •  Holds a BSON document or B-tree bucket
    •  DeletedRecord overwrites a Record
    •  Includes padding
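The DiskLoc lookup ("Record *r = mmap base + DiskLoc.offset") can be sketched in a few lines. This is an illustration only: the `mappedFiles` array of Buffers stands in for the memory-mapped data files, and `parseDiskLoc` and `resolve` are hypothetical names. The "file:hexOffset" text format is the one printed by validate() elsewhere in this deck.

```javascript
// Hypothetical stand-in for the memory-mapped data files: one Buffer per file number.
const mappedFiles = [Buffer.alloc(64), Buffer.alloc(128)];

// Parse a "file:hexOffset" string into a file number and a byte offset.
function parseDiskLoc(text) {
  const [fileNo, offset] = text.split(":");
  return { fileNo: parseInt(fileNo, 10), offset: parseInt(offset, 16) };
}

// Counterpart of "mmap base + DiskLoc.offset":
// return a view of the mapped file that starts at the record's offset.
function resolve(loc, files) {
  const base = files[loc.fileNo];
  if (!base || loc.offset >= base.length) {
    throw new RangeError(`invalid DiskLoc ${loc.fileNo}:${loc.offset.toString(16)}`);
  }
  return base.subarray(loc.offset);
}

mappedFiles[1].write("record", 0x20);   // pretend a record lives at file 1, offset 0x20
const loc = parseDiskLoc("1:20");
console.log(loc, resolve(loc, mappedFiles).toString("utf8", 0, 6));
```

Because `subarray` returns a view rather than a copy, writes through the returned Buffer land in the "mapped file", which is the property the pointer arithmetic on the slide relies on.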

Slide 5

Slide 5 text

Namespace Details
•  Holds metadata about a collection or index
•  Stored in 1KB buckets in .ns file
•  .ns file fixed size of 16MB
•  Maintains document count
•  Contains heads of linked lists

[Diagram: NamespaceDetails with fields firstExtent, lastExtent, _indexes[], stats, freeList[]]

Slide 6

Slide 6 text

Extent Structure
[Diagram: two Extent boxes, each with fields length, xNext, xPrev, firstRecord, lastRecord]

Slide 7

Slide 7 text

Extents
> db.foo.validate( { full : true } ).extents.forEach(
      function(z){ print( z.loc + "\t\t" + z.size ); } )
0:3000      20480
0:12000     81920
0:26000     327680
0:76000     1310720
0:1da000    5242880
0:76a000    6291456
0:d6a000    7553024
0:16de000   9064448
0:1f83000   10878976
0:29e3000   13058048
1:2000      15671296
1:ef4000    18808832
1:29e4000   22573056
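The sizes in this output grow in a visible pattern. The sketch below computes the ratio between consecutive extent sizes. It only describes this particular output and is not a statement of the allocator's actual rule.

```javascript
// Extent sizes in bytes, copied from the validate() output above.
const sizes = [20480, 81920, 327680, 1310720, 5242880, 6291456, 7553024,
               9064448, 10878976, 13058048, 15671296, 18808832, 22573056];

// Ratio of each extent's size to the previous extent's size, rounded to 2 decimals.
const ratios = sizes.slice(1).map((size, i) => Number((size / sizes[i]).toFixed(2)));
console.log(ratios.join(" "));
```

For this collection the first four steps are 4x each, after which every new extent is roughly 1.2x the size of the previous one.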

Slide 8

Slide 8 text

Index Extents
> db.system.namespaces.find()
{ "name" : "test.foo" }
{ "name" : "test.system.indexes" }
{ "name" : "test.foo.$_id_" }

> db["foo.$_id_"].validate( { full : true } ).extents.forEach(
      function(z){ print( z.loc + "\t\t" + z.size ); } )
0:9000      36864
0:1b6000    147456
0:6da000    589824
0:149e000   2359296
1:20e4000   9437184

Slide 9

Slide 9 text

Extents and Records
[Diagram: an Extent (length, xNext, xPrev, firstRecord, lastRecord) containing a Data Record (length, rNext, rPrev) that holds a Document { _id: "foo", ... }]

Slide 11

Slide 11 text

Extents and Records
[Diagram: an Extent (length, xNext, xPrev, firstRecord, lastRecord) containing two Data Records, each with length, rNext, rPrev and a Document { _id: "foo", ... }]
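The linked structure in these diagrams lends itself to a simple full scan: follow xNext from extent to extent, and rNext from record to record inside each extent. The sketch below is a simplified in-memory model. Object references stand in for DiskLocs, and it assumes each extent's record chain ends at the extent boundary. All helper names are hypothetical.

```javascript
// Build a record; rNext/rPrev are object references here, standing in for DiskLocs.
function makeRecord(document) {
  return { rNext: null, rPrev: null, document };
}

// Build an extent from a list of documents and link its records together.
function makeExtent(documents) {
  const records = documents.map((doc) => makeRecord(doc));
  records.forEach((rec, i) => {
    rec.rPrev = records[i - 1] || null;
    rec.rNext = records[i + 1] || null;
  });
  return {
    xNext: null,
    xPrev: null,
    firstRecord: records[0] || null,
    lastRecord: records[records.length - 1] || null,
  };
}

// Link extents into a chain and return the first one.
function linkExtents(extents) {
  extents.forEach((ext, i) => {
    ext.xPrev = extents[i - 1] || null;
    ext.xNext = extents[i + 1] || null;
  });
  return extents[0] || null;
}

// Visit every document: follow xNext across extents, rNext within each extent.
function* scan(firstExtent) {
  for (let ext = firstExtent; ext; ext = ext.xNext) {
    for (let rec = ext.firstRecord; rec; rec = rec.rNext) {
      yield rec.document;
    }
  }
}

const first = linkExtents([
  makeExtent([{ _id: "foo" }, { _id: "bar" }]),
  makeExtent([{ _id: "baz" }]),
]);
const ids = [...scan(first)].map((doc) => doc._id);
console.log(ids);
```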

Slide 12

Slide 12 text

BSON Format
{ hello: "world" }

\x16\x00\x00\x00 \x02hello\x00 \x06\x00\x00\x00 world\x00\x00

\x16\x00\x00\x00   Doc Length
\x02               Value Type
\x06\x00\x00\x00   Value Length
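To make the byte layout concrete, here is a minimal sketch of an encoder that handles only string-valued fields, which is enough to reproduce the bytes on this slide. It runs under Node.js and uses the built-in Buffer. `encodeStringDoc` is a hypothetical name and covers only a small part of BSON.

```javascript
// Encode a document whose values are all strings into BSON bytes.
function encodeStringDoc(doc) {
  const parts = [];
  for (const [key, value] of Object.entries(doc)) {
    const keyBytes = Buffer.from(key + "\0", "utf8");    // field name, NUL-terminated
    const valBytes = Buffer.from(value + "\0", "utf8");  // string value, NUL-terminated
    const valLen = Buffer.alloc(4);
    valLen.writeInt32LE(valBytes.length, 0);             // value length includes the NUL
    parts.push(Buffer.from([0x02]), keyBytes, valLen, valBytes); // 0x02 = string type
  }
  const body = Buffer.concat([...parts, Buffer.from([0x00])]);   // document terminator
  const docLen = Buffer.alloc(4);
  docLen.writeInt32LE(body.length + 4, 0);               // total length includes itself
  return Buffer.concat([docLen, body]);
}

const bytes = encodeStringDoc({ hello: "world" });
console.log(bytes.toString("hex"));
```

The document length comes out as 0x16 (22 bytes): 4 for the length itself, 1 for the type byte, 6 for "hello" plus its terminator, 4 for the value length, 6 for "world" plus its terminator, and 1 for the document terminator.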

Slide 13

Slide 13 text

Index Extents
[Diagram: an Extent (length, xNext, xPrev, firstRecord, lastRecord) containing two Index Records, each with length, rNext, rPrev and a Bucket (parent, numKeys); a key K in a bucket points to a { Document }]

Slide 14

Slide 14 text

Index Extents
[Diagram: an Extent (length, xNext, xPrev, firstRecord, lastRecord) containing two Index Records, each with length, rNext, rPrev and a Bucket (parent, numKeys); a key K in a bucket points to a { Document }. Example B-tree with buckets holding keys 4 9; 1 3; 5 6 8; A B]
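The example tree in the diagram can be used to sketch how a lookup descends through buckets. This is a simplified model and not the on-disk bucket format: buckets are plain objects, and the keys shown as A and B are assumed to stand for values larger than 9 and are written here as 10 and 11. `makeBucket` and `findKey` are hypothetical names.

```javascript
// Build a bucket and set the parent pointer on each child (numKeys is keys.length).
function makeBucket(keys, children = []) {
  const bucket = { parent: null, keys, children };
  for (const child of children) child.parent = bucket;
  return bucket;
}

// Descend from the root: in each bucket, find the first key >= the target,
// then either report a match or follow the child to the left of that key.
function findKey(bucket, key) {
  let depth = 0;
  while (bucket) {
    let i = 0;
    while (i < bucket.keys.length && key > bucket.keys[i]) i++;
    if (i < bucket.keys.length && bucket.keys[i] === key) {
      return { bucket, index: i, depth };
    }
    bucket = bucket.children[i] || null;
    depth++;
  }
  return null; // key is not in the tree
}

// Root bucket [4, 9] with children [1, 3], [5, 6, 8] and [A, B] (assumed 10, 11).
const root = makeBucket([4, 9], [
  makeBucket([1, 3]),
  makeBucket([5, 6, 8]),
  makeBucket([10, 11]),
]);

const hit = findKey(root, 6);
console.log(hit && hit.bucket.keys, hit && hit.index, findKey(root, 7));
```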

Slide 15

Slide 15 text

Journaling
•  Write-ahead logging
•  Operations written to journal before memory-mapped regions
•  Private view
•  Shared view
•  Once journal written, data safe unless hardware problem
•  By default, journal flushed every 100ms, every 100MB of writes, or on write concern of j=true
•  User configurable with --journalCommitInterval
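The three default flush triggers listed above can be written as a single predicate. This is a sketch of the rule as stated on the slide, not mongod's actual code. The function name, the parameter names, and the choice of 1024 * 1024 bytes per MB are assumptions.

```javascript
const COMMIT_INTERVAL_MS = 100;           // default interval from the slide
const COMMIT_BYTES = 100 * 1024 * 1024;   // 100MB of pending writes (assumed binary MB)

// Hypothetical predicate: should the pending journal writes be flushed now?
function shouldCommit({ msSinceLastCommit, pendingBytes, jRequested }) {
  return msSinceLastCommit >= COMMIT_INTERVAL_MS ||
         pendingBytes >= COMMIT_BYTES ||
         jRequested === true;
}

console.log(shouldCommit({ msSinceLastCommit: 20, pendingBytes: 4096, jRequested: false }));
console.log(shouldCommit({ msSinceLastCommit: 20, pendingBytes: 4096, jRequested: true }));
```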

Slide 16

Slide 16 text

Journal Format
•  Section contains single group commit
•  Applied all-or-nothing

JHeader
JSectHeader [LSN 3]
    DurOp
    DurOp
    DurOp
JSectFooter
JSectHeader [LSN 7]
    DurOp
    DurOp
    DurOp
JSectFooter
…

[Diagram: the DurOps of one section expanded as an Op_DbContext (set database context for subsequent operations) followed by three write operations, each with length, offset, fileNo, data[length]]
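The all-or-nothing rule can be illustrated with a small replay loop. This is a sketch under stated assumptions and not the real recovery code: sections are plain objects, a hypothetical `footerOk` flag stands in for a valid JSectFooter, data files are Buffers keyed by "db.fileNo", and replay is assumed to stop at the first section without a valid footer.

```javascript
// Apply one section's operations to the data files (Map of "db.fileNo" -> Buffer).
function applySection(section, files) {
  let db = null;
  for (const op of section.ops) {
    if (op.type === "dbContext") {
      db = op.db;   // Op_DbContext: applies to the write operations that follow
      continue;
    }
    const file = files.get(`${db}.${op.fileNo}`);
    if (!file) throw new Error(`unknown data file ${db}.${op.fileNo}`);
    op.data.copy(file, op.offset, 0, op.length);   // write data[length] at offset
  }
}

// Replay sections in order. A section is applied in full or not at all;
// replay stops at the first section whose footer is missing or invalid.
function replay(sections, files) {
  const applied = [];
  for (const section of sections) {
    if (!section.footerOk) break;
    applySection(section, files);
    applied.push(section.lsn);
  }
  return applied;
}

const files = new Map([["test.0", Buffer.alloc(8)]]);
const sections = [
  { lsn: 3, footerOk: true, ops: [
    { type: "dbContext", db: "test" },
    { type: "write", fileNo: 0, offset: 2, length: 2, data: Buffer.from("ab") },
  ] },
  { lsn: 7, footerOk: false, ops: [
    { type: "dbContext", db: "test" },
    { type: "write", fileNo: 0, offset: 4, length: 2, data: Buffer.from("cd") },
  ] },
];
const appliedLsns = replay(sections, files);
console.log(appliedLsns, files.get("test.0"));
```

In this run only the section with LSN 3 is applied. The write from the incomplete section never reaches the data file.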

Slide 17

Slide 17 text

Journal Performance
•  On 99.9% read systems, no impact
•  Write performance degraded 5-30% when journal on same drive
•  Separate drive as low as 3%

Slide 18

Slide 18 text

Journal Admin
•  Journal stored in /dbpath/journal folder
•  If faster, three 1GB files may be preallocated
•  Can symlink to a different spindle
•  --journalCommitInterval* (2ms - 300ms)
•  When to journal
    •  Single node: required for data integrity
    •  Replica set: at least 1 node
    •  All nodes: removes possible need to resync

Slide 19

Slide 19 text

Fragmentation
•  Files may become fragmented over time if documents change size
•  Free lists also contribute to fragmentation
•  2.0 reduced scanning to reasonable amounts
•  2.2 will change allocation strategy
•  Need to re-write free list to do online compaction

Slide 20

Slide 20 text

Compaction
•  1.8 and previous: repairDatabase
•  2.0+: compact command
•  Currently resets paddingFactor, but can be changed
•  Index (re)generation is now concurrent, so compaction can be N times faster
•  Generally causes some extra allocation
•  Does not delete or truncate files

Slide 21

Slide 21 text

Planned Changes
•  Split data and indexes into different files
•  Indexes could be symlinked to a different drive (SSD)
•  Improved allocation strategy

Slide 22

Slide 22 text

Download MongoDB
http://www.mongodb.org/downloads

Ben Becker