MongoDB Storage & Journaling

mongodb
May 07, 2012

MongoDB supports write-ahead journaling (by default) to facilitate fast crash recovery and consistency in database files after that crash. In this session, we'll give an overview of on-disk persistence with MongoDB, journaling, and discuss the internals of journaling and the storage engine.

Transcript

  1. Directory Layout
     •  Separate files per database
     •  Aggressive preallocation
     •  Files contain one or more extents

         -rw------- 1 ben ben  64M May 1 19:14 test.0
         -rw------- 1 ben ben 128M May 1 19:14 test.1
         -rw------- 1 ben ben 256M May 1 18:25 test.2
         -rw------- 1 ben ben 512M May 1 19:14 test.3
         -rw------- 1 ben ben 1.0G May 1 19:14 test.4
         -rw------- 1 ben ben 2.0G May 1 18:58 test.5
         -rw------- 1 ben ben  16M May 1 19:14 test.ns
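The listing above suggests that each new data file is twice the size of the previous one, starting at 64MB and stopping at 2GB. The following is a minimal sketch of that pattern, not MongoDB's allocation code. The helper name `dataFileSizeMB` is hypothetical, and the assumption that files after `test.5` stay at 2GB is inferred from the 2GB limit mentioned on the Data Structures slide.

```javascript
// Hypothetical helper: size in MB of data file <dbname>.<fileNo>.
// Assumption: sizes double from 64MB and are capped at 2GB (2048MB),
// which matches test.0 through test.5 in the listing.
function dataFileSizeMB(fileNo) {
  const FIRST_MB = 64;
  const MAX_MB = 2048;
  return Math.min(FIRST_MB * Math.pow(2, fileNo), MAX_MB);
}

// Sizes for test.0 .. test.6 (test.6 is not in the listing; it is an extrapolation).
const sizes = [0, 1, 2, 3, 4, 5, 6].map(dataFileSizeMB);
console.log(sizes.join(" "));
```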
  2. Memory Mapping
     [Diagram: process virtual memory from 0x7fffffffffff down to 0x0, containing STACK, …, LIBS, …, test.ns, test.0, test.1, …, HEAP, MONGOD, NULL. The mapped file regions correspond to files on disk, which hold documents { … }.]
  3. Data Structures
     •  DiskLoc
        •  Stores file number and offset of data on disk
        •  Record *r = mmap base + DiskLoc.offset
        •  Max offset is 2^31 (2GB)
     •  NamespaceDetails
        •  Stores collection metadata
     •  Extent
        •  Stores contiguous blocks within a namespace
        •  Max extent size is 2GB
     •  Record
        •  Holds a BSON document or B-tree bucket
        •  DeletedRecord overwrites a Record
        •  Includes padding
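The pointer arithmetic on this slide (`Record *r = mmap base + DiskLoc.offset`) can be illustrated with a small sketch. It uses Node.js `Buffer` objects as stand-ins for memory-mapped files. The names `mappedFiles`, `makeDiskLoc`, and `resolve` are hypothetical and are not MongoDB identifiers.

```javascript
// Stand-ins for memory-mapped data files: array index = file number.
// Real files are far larger; 64 bytes is enough for illustration.
const mappedFiles = [Buffer.alloc(64), Buffer.alloc(64)];

const MAX_OFFSET = Math.pow(2, 31); // 2GB limit from the slide

// Hypothetical DiskLoc: a (file number, offset) pair.
function makeDiskLoc(fileNo, offset) {
  if (offset < 0 || offset >= MAX_OFFSET) {
    throw new RangeError("offset out of range");
  }
  return { fileNo, offset };
}

// Analogue of "mmap base + DiskLoc.offset": return a view that starts
// at the offset inside the mapped file, without copying any bytes.
function resolve(loc) {
  const base = mappedFiles[loc.fileNo];
  return base.subarray(loc.offset);
}

// Place some bytes in file 1 at offset 16, then look them up by location.
mappedFiles[1].write("record", 16, "ascii");
const rec = resolve(makeDiskLoc(1, 16));
console.log(rec.toString("ascii", 0, 6));
```

Because the view shares memory with the underlying buffer, a write through `rec` changes the mapped file's contents, which mirrors how a pointer into a memory-mapped region behaves.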
  4. Namespace Details
     •  Holds metadata about a collection or index
     •  Stored in 1KB buckets in <dbname>.ns file
     •  .ns file fixed size of 16MB
     •  Maintains document count
     •  Contains heads of linked lists
     [Diagram: NamespaceDetails with fields firstExtent, lastExtent, _indexes[], stats, freeList[]]
  5. Extent Structure
     [Diagram: two linked Extent boxes, each with fields length, xNext, xPrev, firstRecord, lastRecord]
  6. Extents

         > db.foo.validate( { full : true } ).extents.forEach(
               function(z){ print( z.loc + "\t\t" + z.size ); } )
         0:3000      20480
         0:12000     81920
         0:26000     327680
         0:76000     1310720
         0:1da000    5242880
         0:76a000    6291456
         0:d6a000    7553024
         0:16de000   9064448
         0:1f83000   10878976
         0:29e3000   13058048
         1:2000      15671296
         1:ef4000    18808832
         1:29e4000   22573056
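The output above can be post-processed to see how much space the collection's extents occupy and how they spread across data files. The sketch below assumes each `loc` has the form `<file number>:<hexadecimal offset>`, which is consistent with values such as `1da000` but is not stated on the slide. `parseExtent` is a hypothetical helper.

```javascript
// Output lines copied from the slide: "<fileNo>:<offset> <size in bytes>".
const lines = [
  "0:3000 20480", "0:12000 81920", "0:26000 327680", "0:76000 1310720",
  "0:1da000 5242880", "0:76a000 6291456", "0:d6a000 7553024",
  "0:16de000 9064448", "0:1f83000 10878976", "0:29e3000 13058048",
  "1:2000 15671296", "1:ef4000 18808832", "1:29e4000 22573056",
];

// Assumption: the offset is hexadecimal and the file number is decimal.
function parseExtent(line) {
  const [loc, size] = line.trim().split(/\s+/);
  const [fileNo, offset] = loc.split(":");
  return {
    fileNo: parseInt(fileNo, 10),
    offset: parseInt(offset, 16),
    size: parseInt(size, 10),
  };
}

const extents = lines.map(parseExtent);

// Total bytes allocated to the collection's extents.
const totalBytes = extents.reduce((sum, e) => sum + e.size, 0);

// Number of extents per data file.
const perFile = {};
for (const e of extents) perFile[e.fileNo] = (perFile[e.fileNo] || 0) + 1;

console.log(totalBytes, perFile);
```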
  7. Index Extents

         > db.system.namespaces.find()
         { "name" : "test.foo" }
         { "name" : "test.system.indexes" }
         { "name" : "test.foo.$_id_" }

         > db["foo.$_id_"].validate( { full : true } ).extents.forEach(
               function(z){ print( z.loc + "\t\t" + z.size ); } )
         0:9000      36864
         0:1b6000    147456
         0:6da000    589824
         0:149e000   2359296
         1:20e4000   9437184
  8. Extents and Records
     [Diagram: an Extent (length, xNext, xPrev, firstRecord, lastRecord) pointing to a Data Record (length, rNext, rPrev) that holds the Document { _id: "foo", ... }]
  9. Extents and Records
     [Diagram: an Extent (length, xNext, xPrev, firstRecord, lastRecord) pointing to a Data Record (length, rNext, rPrev) that holds the Document { _id: "foo", ... }]
  10. Extents and Records
      [Diagram: an Extent (length, xNext, xPrev, firstRecord, lastRecord) pointing to two Data Records, each with fields length, rNext, rPrev and each holding a Document { _id: "foo", ... }]
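The diagrams describe two levels of doubly linked lists: extents chained by `xNext`/`xPrev`, and records inside an extent chained by `rNext`/`rPrev`. The sketch below walks both lists in order. It uses plain in-memory object references where the real structures hold on-disk locations, and `makeRecord`, `makeExtent`, and `scan` are hypothetical names.

```javascript
// In-memory stand-ins; the real fields hold on-disk locations, not references.
function makeRecord(doc) {
  return { doc, rNext: null, rPrev: null };
}

// Link the given records into a doubly linked list owned by one extent.
function makeExtent(records) {
  for (let i = 1; i < records.length; i++) {
    records[i - 1].rNext = records[i];
    records[i].rPrev = records[i - 1];
  }
  return {
    xNext: null,
    xPrev: null,
    firstRecord: records[0] || null,
    lastRecord: records[records.length - 1] || null,
  };
}

// Visit every document: follow xNext between extents, rNext within an extent.
function* scan(firstExtent) {
  for (let e = firstExtent; e !== null; e = e.xNext) {
    for (let r = e.firstRecord; r !== null; r = r.rNext) {
      yield r.doc;
    }
  }
}

const e1 = makeExtent([makeRecord({ _id: "a" }), makeRecord({ _id: "b" })]);
const e2 = makeExtent([makeRecord({ _id: "c" })]);
e1.xNext = e2;
e2.xPrev = e1;

const ids = Array.from(scan(e1), (d) => d._id);
console.log(ids.join(","));
```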
  11. BSON Format

          { hello: "world" }

          \x16\x00\x00\x00 \x02hello\x00 \x06\x00\x00\x00 world\x00\x00

      •  \x16\x00\x00\x00: Doc Length
      •  \x02: Value Type
      •  \x06\x00\x00\x00: Value Length
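The byte layout on this slide can be reproduced with a few lines of code. The following is a minimal sketch that handles string values only, not a general BSON encoder. `encodeStringDoc` is a hypothetical name, and real applications would use a BSON library instead.

```javascript
// Minimal sketch: encode a document whose values are all strings.
function encodeStringDoc(doc) {
  const parts = [];
  for (const [key, value] of Object.entries(doc)) {
    const str = Buffer.from(value + "\0", "utf8"); // value plus NUL terminator
    const len = Buffer.alloc(4);
    len.writeInt32LE(str.length, 0);               // value length includes the NUL
    parts.push(Buffer.from([0x02]));               // value type 0x02 = string
    parts.push(Buffer.from(key + "\0", "utf8"));   // field name as a C string
    parts.push(len, str);
  }
  parts.push(Buffer.from([0x00]));                 // end-of-document marker
  const body = Buffer.concat(parts);
  const total = Buffer.alloc(4);
  total.writeInt32LE(body.length + 4, 0);          // doc length counts its own 4 bytes
  return Buffer.concat([total, body]);
}

const bson = encodeStringDoc({ hello: "world" });
console.log(bson.length, bson.toString("hex"));
```

For `{ hello: "world" }` the result is 22 bytes long, which is the `\x16` (decimal 22) at the start of the slide's byte string.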
  12. Index Extents
      [Diagram: an Extent (length, xNext, xPrev, firstRecord, lastRecord) pointing to Index Records (length, rNext, rPrev), each holding a Bucket with fields parent and numKeys; a key K in a bucket points to a { Document }]
  13. Index Extents
      [Diagram: an Extent (length, xNext, xPrev, firstRecord, lastRecord) pointing to Index Records (length, rNext, rPrev), each holding a Bucket with fields parent and numKeys; a key K in a bucket points to a { Document }. Example key values shown in the buckets: 4, 9, 1, 3, 5, 6, 8, A, B]
  14. Journaling
      •  Write ahead logging
      •  Operations written to journal before memory mapped regions
      •  Private view
      •  Shared view
      •  Once journal written, data safe unless hardware problem
      •  By default, journal flushed every 100ms, 100MB of writes, or on write concern of j=true
      •  User configurable with --journalCommitInterval
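The three flush triggers listed on this slide (elapsed time, volume of pending writes, or a write waiting on `j=true`) can be expressed as one predicate. This is a sketch of the decision only, not MongoDB's implementation. The `state` object, its field names, and `shouldCommit` are hypothetical, and treating 100MB as 100 × 1024 × 1024 bytes is an assumption.

```javascript
const COMMIT_INTERVAL_MS = 100;          // default interval from the slide
const COMMIT_BYTES = 100 * 1024 * 1024;  // 100MB of pending writes (assumed binary MB)

// Hypothetical predicate: should buffered operations be written to the journal now?
function shouldCommit(state, nowMs) {
  return (
    state.waitingOnJ ||                                   // a write asked for j=true
    state.pendingBytes >= COMMIT_BYTES ||                 // too much unjournaled data
    nowMs - state.lastCommitMs >= COMMIT_INTERVAL_MS      // commit interval elapsed
  );
}

const idle = { waitingOnJ: false, pendingBytes: 0, lastCommitMs: 0 };
console.log(shouldCommit(idle, 50));   // interval not yet reached
console.log(shouldCommit(idle, 100));  // interval reached
```

A shorter interval narrows the window of writes that could be lost in a crash at the cost of more frequent journal writes, which is the trade-off `--journalCommitInterval` exposes.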
  15. Journal Format
      •  Section contains single group commit
      •  Applied all-or-nothing
      [Diagram: journal file layout: JHeader; JSectHeader [LSN 3], DurOp, DurOp, DurOp, JSectFooter; JSectHeader [LSN 7], DurOp, DurOp, DurOp, JSectFooter; … A DurOp is either an Op_DbContext (sets the database context for subsequent operations) or a write operation with fields length, offset, fileNo, data[length].]
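The all-or-nothing rule for journal sections can be sketched as a replay loop that applies a section's write operations only when the whole section is present. The structures below are hypothetical stand-ins for a parsed journal. The `hasFooter` flag (meaning the section's JSectFooter was found) and the choice to stop at the first incomplete section are assumptions made for illustration.

```javascript
// Stand-ins for data files, keyed by file number.
const dataFiles = { 0: Buffer.alloc(16) };

// Hypothetical parsed journal: each section holds the write operations
// (fileNo, offset, data) of one group commit.
const sections = [
  {
    lsn: 3,
    hasFooter: true,
    ops: [
      { fileNo: 0, offset: 0, data: Buffer.from("abc") },
      { fileNo: 0, offset: 4, data: Buffer.from("de") },
    ],
  },
  {
    lsn: 7,
    hasFooter: false, // e.g. the process crashed while writing this section
    ops: [{ fileNo: 0, offset: 8, data: Buffer.from("zz") }],
  },
];

// Apply complete sections in order; skip an incomplete section entirely.
function replay(journalSections, files) {
  let applied = 0;
  for (const s of journalSections) {
    if (!s.hasFooter) break; // assumption: nothing after a torn section is usable
    for (const op of s.ops) {
      op.data.copy(files[op.fileNo], op.offset);
    }
    applied++;
  }
  return applied;
}

const appliedSections = replay(sections, dataFiles);
console.log(appliedSections, dataFiles[0].toString("ascii", 0, 6));
```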
  16. Journal Performance
      •  On 99.9% read systems, no impact
      •  Write performance degraded 5-30% when journal on same drive
      •  Separate drive as low as 3%
  17. Journal Admin
      •  Journal stored in /dbpath/journal folder
      •  If faster, three 1GB files may be preallocated
      •  Can symlink to a different spindle
      •  --journalCommitInterval* (2ms - 300ms)
      •  When to journal
         •  Single node: required for data integrity
         •  Replica set: at least 1 node
         •  All nodes: removes possible need to resync
  18. Fragmentation
      •  Files may become fragmented over time if documents change size
      •  Free lists also contribute to fragmentation
      •  2.0 reduced scanning to reasonable amounts
      •  2.2 will change allocation strategy
      •  Need to re-write free list to do online compaction
  19. Compaction
      •  1.8 and previous: repairDatabase
      •  2.0+: compact command
      •  Currently resets paddingFactor, but can be changed.
      •  Index (re)generation is now concurrent, so compaction can be N times faster
      •  Generally causes some extra allocation
      •  Does not delete or truncate files
  20. Planned Changes
      •  Split data and indexes into different files
      •  Indexes could be symlinked to a different drive (SSD)
      •  Improved allocation strategy