Journaling and the Storage Engine - Kyle Banker, Software Engineer, 10gen

mongodb
February 13, 2012

MongoDB Boulder 2012

MongoDB supports write-ahead journaling (on by default) to enable fast crash recovery and keep database files consistent after a crash. In this session, we'll give an overview of on-disk persistence with MongoDB and discuss the internals of journaling and the storage engine.


Transcript

  1. Directory Layout

    -rw------- 1 kb admin  64M Jun 26 00:15 test.0
    -rw------- 1 kb admin 128M Jun 21 00:20 test.1
    -rw------- 1 kb admin 256M Jun 26 00:15 test.2
    -rw------- 1 kb admin 512M Jun 21 00:20 test.3
    -rw------- 1 kb admin 1.0G Jun 26 00:15 test.4
    -rw------- 1 kb admin 2.0G Jun 25 23:08 test.5
    -rw------- 1 kb admin  16M Jun 26 00:15 test.ns

    • Separate files per database
    • Aggressive preallocation
    • Always a spare file
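The doubling pattern in the listing above can be sketched in a few lines. This is only a toy model: it assumes file N is 64 MB times 2^N, capped at 2 GB, which matches the sizes shown for test.0 through test.5. The helper name `dataFileSize` is hypothetical and is not part of MongoDB.

```javascript
// Toy model of the preallocation pattern in the directory listing.
// Assumption: data file N is 64 MB * 2^N, capped at 2 GB.
// dataFileSize is a hypothetical helper, not a MongoDB API.
const MB = 1024 * 1024;
const MAX_FILE_SIZE = 2048 * MB;

function dataFileSize(n) {
  return Math.min(64 * MB * Math.pow(2, n), MAX_FILE_SIZE);
}

// Print the modelled size of test.0 .. test.6 in MB.
for (let n = 0; n <= 6; n++) {
  console.log("test." + n + "  " + dataFileSize(n) / MB + " MB");
}
```

Under this model, test.5 is the first file to reach the 2 GB cap, and every later file stays at that size.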
  2. Internal File Format

    • Files broken into extents
    • A collection has one or more extents
    • Grow exponentially from 64 MB to 2 GB (max file size as well)
    • Indexes have their own extents
  3. Sample Extents

    > db.foo.validate( { full : true } ).extents.forEach(
        function(z){ print( z.loc + "\t\t" + z.size ); } )
    0:3000        20480
    0:12000       81920
    0:26000       327680
    0:76000       1310720
    0:1da000      5242880
    0:76a000      6291456
    0:d6a000      7553024
    0:16de000     9064448
    0:1f83000     10878976
    0:29e3000     13058048
    1:2000        15671296
    1:ef4000      18808832
    1:29e4000     22573056
    1:3f6b000     27090944
    1:5941000     32509952
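To see how fast the extents in the output above grow, one can divide each size by the previous one. The sketch below does only that arithmetic on the sizes printed on the slide. `growthRatios` is a hypothetical helper, and the result describes this sample only, not the allocator in general.

```javascript
// Extent sizes in bytes, copied from the validate() output above.
const extentSizes = [
  20480, 81920, 327680, 1310720, 5242880,
  6291456, 7553024, 9064448, 10878976, 13058048,
  15671296, 18808832, 22573056, 27090944, 32509952,
];

// Ratio of each extent's size to the size of the extent before it.
// growthRatios is a hypothetical helper for illustration only.
function growthRatios(sizes) {
  const ratios = [];
  for (let i = 1; i < sizes.length; i++) {
    ratios.push(sizes[i] / sizes[i - 1]);
  }
  return ratios;
}

const ratios = growthRatios(extentSizes);
ratios.forEach((r, i) => {
  console.log(extentSizes[i + 1] + "  x" + r.toFixed(2));
});
```

In this sample the first four steps are each 4x the previous extent, after which the growth settles at roughly 1.2x per extent.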
  4. Index Extents

    > db.system.namespaces.find()
    { "name" : "test.foo" }
    { "name" : "test.system.indexes" }
    { "name" : "test.foo.$_id_" }
    > db["foo.$_id_"].validate( { full : true } ).extents.forEach(
        function(z){ print( z.loc + "\t\t" + z.size ); } )
    0:9000        36864
    0:1b6000      147456
    0:6da000      589824
    0:149e000     2359296
    1:20e4000     9437184
  5. Memory Mapped

    • All data files memory mapped (mmap)
    • Virtual size = total data size + overhead
    • Journaled virtual size = ( total data size * 2 ) + overhead
    • fsync every 60 seconds (--syncdelay)
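The two virtual-size formulas above are simple enough to check by hand. The sketch below applies them to the file sizes from the directory listing on slide 1. The overhead figure of 100 MB is a placeholder assumption (the slide gives no number), and `virtualSizeMB` is a hypothetical helper.

```javascript
// Sizes in MB of test.0 .. test.5 plus test.ns, from the slide 1 listing.
const dataFilesMB = [64, 128, 256, 512, 1024, 2048, 16];

// Virtual size per the slide's formulas.
// overheadMB is an assumed placeholder; the slide does not quantify it.
function virtualSizeMB(filesMB, overheadMB, journaled) {
  const total = filesMB.reduce((sum, size) => sum + size, 0);
  return (journaled ? 2 * total : total) + overheadMB;
}

const assumedOverheadMB = 100;
console.log("no journal: " + virtualSizeMB(dataFilesMB, assumedOverheadMB, false) + " MB");
console.log("journaled:  " + virtualSizeMB(dataFilesMB, assumedOverheadMB, true) + " MB");
```

With these inputs the mapped data totals 4048 MB, so the journaled figure is roughly double the non-journaled one.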
  6. Planned Changes

    • Split data and indexes into different files
    • Indexes could be symlinked to a different drive (SSD)
  7. Journaling

    • Write-ahead log
    • Operations written to journal before memory mapped regions
    • Once journal written, data safe unless hardware problem
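The write-ahead ordering described above can be illustrated with a toy in-memory model. This is not how MongoDB implements journaling. It only shows why writing to the journal first makes recovery possible. The class `ToyJournaledStore` and all of its methods are hypothetical.

```javascript
// Toy write-ahead log. "journal" and "dataFiles" stand in for durable
// storage; "mapped" stands in for the memory mapped view lost on a crash.
// All names here are hypothetical, for illustration only.
class ToyJournaledStore {
  constructor() {
    this.journal = [];          // durable, append-only
    this.dataFiles = new Map(); // durable, updated only on flush
    this.mapped = new Map();    // volatile in-memory view
  }

  write(key, value) {
    this.journal.push({ key, value }); // 1. journal first
    this.mapped.set(key, value);       // 2. then the mapped view
  }

  // Stand-in for the periodic fsync of the data files.
  flushDataFiles() {
    this.dataFiles = new Map(this.mapped);
    this.journal = []; // flushed entries are no longer needed
  }

  // Simulate a crash: the volatile view is lost.
  crash() {
    this.mapped = new Map();
  }

  // Recovery: start from the data files and replay the journal.
  recover() {
    this.mapped = new Map(this.dataFiles);
    for (const { key, value } of this.journal) {
      this.mapped.set(key, value);
    }
  }
}

const store = new ToyJournaledStore();
store.write("a", 1);
store.flushDataFiles();
store.write("b", 2); // journaled, but not yet in the data files
store.crash();
store.recover();
console.log(store.mapped.get("a"), store.mapped.get("b"));
```

The write of "b" never reached the data files before the crash, yet it survives because the journal entry was recorded before the mapped view was touched.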
  8. When is Data Written

    • Journal flushed every 100 ms or 100 MB written
    • db.getLastError( { j: true } ) to force a journal flush
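The "100 ms or 100 MB" rule above is a time-or-size trigger, which can be sketched as a small simulation. The model below checks the rule only when a write arrives, which is a simplification. The thresholds are taken from the slide, and the function names `shouldFlushJournal` and `countFlushes` are hypothetical.

```javascript
// Thresholds from the slide; everything else is an illustrative assumption.
const MB = 1024 * 1024;
const FLUSH_INTERVAL_MS = 100;
const FLUSH_BYTES = 100 * MB;

// Flush when enough time has passed or enough bytes are pending.
function shouldFlushJournal(msSinceLastFlush, pendingBytes) {
  return msSinceLastFlush >= FLUSH_INTERVAL_MS || pendingBytes >= FLUSH_BYTES;
}

// Count flushes for a series of writes: [{ atMs, bytes }, ...].
// Simplification: the rule is only evaluated when a write arrives.
function countFlushes(writes) {
  let lastFlushMs = 0;
  let pendingBytes = 0;
  let flushes = 0;
  for (const w of writes) {
    pendingBytes += w.bytes;
    if (shouldFlushJournal(w.atMs - lastFlushMs, pendingBytes)) {
      flushes += 1;
      pendingBytes = 0;
      lastFlushMs = w.atMs;
    }
  }
  return flushes;
}

const sampleWrites = [
  { atMs: 0, bytes: 10 * MB },
  { atMs: 50, bytes: 10 * MB },
  { atMs: 120, bytes: 10 * MB },  // 120 ms elapsed: time trigger
  { atMs: 130, bytes: 200 * MB }, // 200 MB pending: size trigger
];
console.log("flushes:", countFlushes(sampleWrites));
```

In the sample, the third write trips the time threshold and the fourth trips the size threshold, so two flushes occur.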
  9. Journal Admin

    • /journal subdirectory in dbpath
    • 1 GB files, rotated
    • Can symlink to a different volume
    • --journalCommitInterval (2 ms - 300 ms)
  10. Performance

    • On 99.9% read systems, no impact
    • Write performance 5-30% slowdown on same drive
    • Using separate drive as low as 3%
  11. When to use

    • Single node - required for any data integrity
    • Replica Set - at least 1 node
    • All nodes for large data sets
  12. Changes in 2.0

    • Writes to journal occur outside of lock
    • Journal is compressed so more fits in 3 GB and is faster to write
    • On by default on 64-bit systems
  13. Fragmentation

    • Files can get fragmented over time if documents change size
    • Need to improve free list
    • 2.0 reduced scanning to reasonable amounts
    • 2.2 will change allocation strategy
    • Need to re-write free list to do online compaction
  14. Compaction

    • 1.8 and previous: repairDatabase
    • 2.0+: compact command
    • Only needs 2 GB extra space
    • Can be N times faster where N = number of indexes
  15. Updates and moves

    • Updates can make documents bigger
    • Moves are more expensive than other operations
  16. Download MongoDB

    http://www.mongodb.org
    and let us know what you think
    @hwaet @mongodb
    10gen is hiring! http://www.10gen.com/jobs