Journaling and the Storage Engine Spencer Brody, Software Engineer, 10gen

mongodb
April 20, 2012

At MongoDB Stockholm 2012 Spencer Brody presented on Journaling and the Storage Engine.

MongoDB supports write-ahead journaling (on by default) to facilitate fast crash recovery and keep database files consistent after a crash. In this session, we'll give an overview of on-disk persistence with MongoDB and journaling, and discuss the internals of journaling and the storage engine.

Transcript

  1. Directory Layout

    -rw------- 1 spencer admin  64M Jun 26 00:15 test.0
    -rw------- 1 spencer admin 128M Jun 21 00:20 test.1
    -rw------- 1 spencer admin 256M Jun 26 00:15 test.2
    -rw------- 1 spencer admin 512M Jun 21 00:20 test.3
    -rw------- 1 spencer admin 1.0G Jun 26 00:15 test.4
    -rw------- 1 spencer admin 2.0G Jun 25 23:08 test.5
    -rw------- 1 spencer admin 2.0G Jun 25 24:04 test.6
    -rw------- 1 spencer admin  16M Jun 26 00:15 test.ns
    • Separate files per database
    • Aggressive preallocation
    • Always a spare file
    Friday, April 20, 12
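The listing above suggests that each new data file is twice the size of the previous one, starting at 64 MB and capping at 2 GB. The following is a minimal sketch of that pattern, inferred only from the sizes in the listing. The name `dataFileSizeMB` is hypothetical, and this is not MongoDB's actual allocation code.

```javascript
// Hypothetical helper: size in MB of data file <db>.<fileNumber>,
// inferred from the listing (64M, 128M, 256M, 512M, 1.0G, 2.0G, 2.0G).
function dataFileSizeMB(fileNumber) {
  const FIRST_FILE_MB = 64;  // size of test.0 in the listing
  const MAX_FILE_MB = 2048;  // size at which test.5 and test.6 stop growing
  return Math.min(FIRST_FILE_MB * Math.pow(2, fileNumber), MAX_FILE_MB);
}

// Sizes for test.0 through test.6.
const fileSizesMB = [0, 1, 2, 3, 4, 5, 6].map(dataFileSizeMB);
const totalDataFilesMB = fileSizesMB.reduce((sum, mb) => sum + mb, 0);

console.log(fileSizesMB.join(", "));
console.log("total preallocated for data files (MB):", totalDataFilesMB);
```

Because the last file in the sequence is always a spare, the space on disk can be well ahead of the data actually stored.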
  2. Internal File Format

    • Files broken into extents
    • A collection has 1 or more extents
    • Extents grow exponentially up to 2 GB (also the max file size)
    • Indexes have different extents than data
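One way to see the growth pattern is to compute the ratio between consecutive extent sizes. The sketch below uses the extent sizes printed on the Sample Extents slide of this deck. In that sample the first few extents quadruple and later ones grow by roughly 1.2x. That is an observation about this one sample, not a statement of the allocator's rules.

```javascript
// Extent sizes in bytes, copied from the Sample Extents slide of this deck.
const extentSizes = [
  20480, 81920, 327680, 1310720, 5242880, 6291456, 7553024, 9064448,
  10878976, 13058048, 15671296, 18808832, 22573056, 27090944, 32509952,
];

// Ratio of each extent to the one before it.
const growthRatios = extentSizes
  .slice(1)
  .map((size, i) => size / extentSizes[i]);

growthRatios.forEach((ratio, i) => {
  console.log(`extent ${i + 1} -> ${i + 2}: x${ratio.toFixed(3)}`);
});
```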
  3. Sample Extents

    > db.foo.validate( { full : true } ).extents.forEach(
        function(z){ print( z.loc + "\t\t" + z.size ); } )
    0:3000        20480
    0:12000       81920
    0:26000       327680
    0:76000       1310720
    0:1da000      5242880
    0:76a000      6291456
    0:d6a000      7553024
    0:16de000     9064448
    0:1f83000     10878976
    0:29e3000     13058048
    1:2000        15671296
    1:ef4000      18808832
    1:29e4000     22573056
    1:3f6b000     27090944
    1:5941000     32509952
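The `loc` values above can be read as `<file number>:<offset>`, with the offset in hexadecimal. That reading is an assumption here, but the arithmetic supports it: most extents start exactly where the previous one ends. The sketch below parses the locations and reports the gaps between consecutive data extents in the same file. `parseLoc` is a hypothetical helper, not a MongoDB API.

```javascript
// Data extents copied from the validate() output above: [loc, sizeInBytes].
const dataExtents = [
  ["0:3000", 20480], ["0:12000", 81920], ["0:26000", 327680],
  ["0:76000", 1310720], ["0:1da000", 5242880], ["0:76a000", 6291456],
  ["0:d6a000", 7553024], ["0:16de000", 9064448], ["0:1f83000", 10878976],
  ["0:29e3000", 13058048], ["1:2000", 15671296], ["1:ef4000", 18808832],
  ["1:29e4000", 22573056], ["1:3f6b000", 27090944], ["1:5941000", 32509952],
];

// Assumption: loc is "<file number>:<offset in hex>".
function parseLoc(loc) {
  const [file, hexOffset] = loc.split(":");
  return { file: parseInt(file, 10), offset: parseInt(hexOffset, 16) };
}

const placedExtents = dataExtents.map(([loc, size]) => {
  const { file, offset } = parseLoc(loc);
  return { loc, file, start: offset, end: offset + size };
});

// Collect the space between consecutive data extents within the same file.
const extentGaps = [];
for (let i = 1; i < placedExtents.length; i++) {
  const prev = placedExtents[i - 1];
  const cur = placedExtents[i];
  if (prev.file === cur.file && cur.start !== prev.end) {
    extentGaps.push({ after: prev.loc, before: cur.loc, bytes: cur.start - prev.end });
  }
}

extentGaps.forEach((g) => console.log(`${g.after} -> ${g.before}: ${g.bytes} bytes`));
```

Four of the five gaps equal extent sizes in the index extent listing on the following slide (147456, 589824, 2359296, 9437184), which suggests that data and index extents for the collection are interleaved in the same files.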
  4. Index Extents

    > db.system.namespaces.find()
    { "name" : "test2.foo" }
    { "name" : "test2.system.indexes" }
    { "name" : "test2.foo.$_id_" }
    > db["foo.$_id_"].validate( { full : true } ).extents.forEach(
        function(z){ print( z.loc + "\t\t" + z.size ); } )
    0:9000        36864
    0:1b6000      147456
    0:6da000      589824
    0:149e000     2359296
    1:20e4000     9437184
  5. Memory Mapped

    • All data files memory mapped
    • Virtual size = total data size + overhead
    • Journaled virtual size = ( total data size * 2 ) + overhead
    • fsync every 60 seconds (--syncdelay)
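The two virtual-size formulas above can be written as a single function. This is a direct transcription of the slide, with hypothetical names and example figures. The overhead value is a placeholder, since the slide does not quantify it.

```javascript
// Expected virtual size per the slide:
//   without journaling: total data size + overhead
//   with journaling:    (total data size * 2) + overhead
function expectedVirtualSize(totalDataBytes, overheadBytes, journaled) {
  const mappedBytes = journaled ? totalDataBytes * 2 : totalDataBytes;
  return mappedBytes + overheadBytes;
}

const GIB = 1024 ** 3;
const exampleDataBytes = 10 * GIB;     // hypothetical: 10 GB of data files
const exampleOverheadBytes = 1 * GIB;  // hypothetical overhead figure

console.log("no journal:", expectedVirtualSize(exampleDataBytes, exampleOverheadBytes, false) / GIB, "GB");
console.log("journaled: ", expectedVirtualSize(exampleDataBytes, exampleOverheadBytes, true) / GIB, "GB");
```

A journaled process reporting roughly twice the data size in virtual memory is therefore expected and is not by itself a sign of a leak.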
  6. Planned Changes

    • Split data and indexes into different files
    • Indexes could be symlinked to a different drive (SSD)
  7. Journaling

    • Write-ahead log
    • Operations written to the journal before the memory-mapped regions
    • Once the journal is written, data is safe unless there is a hardware problem
  8. A “Section” contains all of the information for a single group commit. Group commits are applied all-or-nothing.

    • Specifies a database for subsequent operations (until the next DbContext op)
    • A basic write op
    • Other op types exist too (e.g., createfile, dropdb)
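To make the all-or-nothing behavior concrete, here is a toy model of a journal made of sections. It is a sketch under simplifying assumptions, not MongoDB's on-disk format. The `ToyJournal` class, the `complete` flag standing in for whatever marks a fully written section, and the `recover` function are all hypothetical.

```javascript
// Toy write-ahead journal: ops are buffered, then committed together as one section.
class ToyJournal {
  constructor() {
    this.sections = []; // sections that reached "disk"
    this.pending = [];  // ops waiting for the next group commit
  }
  write(db, key, value) {
    this.pending.push({ db, key, value });
  }
  commit() {
    if (this.pending.length === 0) return;
    this.sections.push({ ops: this.pending, complete: true });
    this.pending = [];
  }
}

// Recovery replays only complete sections; a torn section is dropped whole.
function recover(sections) {
  const data = {};
  for (const section of sections) {
    if (!section.complete) break; // stop at the first incomplete section
    for (const op of section.ops) {
      data[op.db] = data[op.db] || {};
      data[op.db][op.key] = op.value;
    }
  }
  return data;
}

const journal = new ToyJournal();
journal.write("test", "a", 1);
journal.write("test", "b", 2);
journal.commit();

// Simulate a crash partway through writing a second section.
const tornSection = { ops: [{ db: "test", key: "c", value: 3 }], complete: false };
const recovered = recover(journal.sections.concat([tornSection]));
console.log(JSON.stringify(recovered));
```

In this model the write of `c` is lost as a unit: recovery never sees a half-applied group commit.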
  9. When is Data Written

    • Journal flushed every 100 ms or 100 MB written
    • j=true flag to force a journal flush
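The flush conditions above amount to a simple predicate. The sketch below is illustrative only: the function and constant names are hypothetical, treating 100 MB as 100 * 1024 * 1024 bytes is an assumption, and the real server logic is not shown in the slides.

```javascript
const FLUSH_INTERVAL_MS = 100;               // from the slide: every 100 ms
const FLUSH_THRESHOLD_BYTES = 100 * 1024 * 1024; // from the slide: 100 MB (binary MB assumed)

// Returns true when the journal should be flushed:
// a writer asked for j=true, the interval elapsed, or enough data accumulated.
function shouldFlushJournal(msSinceLastFlush, unflushedBytes, writerWaitingOnJournal) {
  return (
    writerWaitingOnJournal ||
    msSinceLastFlush >= FLUSH_INTERVAL_MS ||
    unflushedBytes >= FLUSH_THRESHOLD_BYTES
  );
}

console.log(shouldFlushJournal(20, 4096, false));              // small, recent write
console.log(shouldFlushJournal(20, 4096, true));               // j=true requested
console.log(shouldFlushJournal(150, 4096, false));             // interval elapsed
console.log(shouldFlushJournal(20, 200 * 1024 * 1024, false)); // size threshold reached
```

The practical consequence is that, without j=true, a write acknowledged by the server may sit in memory for up to one flush interval before it is in the journal.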
  10. Journal Admin

    • /journal subdirectory in <dbpath> (/data/db)
    • Three 1 GB files that get rotated
    • Can symlink to a different spindle
    • --journalCommitInterval (2 ms - 300 ms)
  11. Performance

    • On 99.9% read systems, no impact
    • Write performance 5-30% slowdown on same drive
    • Using separate drive as low as 3%
  12. When to use

    • Single node - required for any data integrity
    • Replica Set - at least 1 node
    • All nodes for large data sets removes need for large resyncs
  13. Changes in 2.0

    • Writes to journal outside of lock
    • Journal is compressed so more fits in 3 GB and is faster to write
    • On by default on 64-bit systems
  14. Fragmentation

    • Files can get fragmented over time if documents change size
    • Need to improve free list
    • 2.0 reduced scanning to reasonable amounts
    • 2.2 will change allocation strategy
    • Need to re-write free list to do online compaction
  15. Compaction

    • 1.8 and previous: repairDatabase
    • 2.0+: compact command
    • Only needs 2 GB extra space
    • Can be N times faster where N = number of indexes
  16. Updates and moves

    • Updates can make documents bigger
    • Moves are more expensive than other operations
  17. Padding

    • Adaptive padding factor between 1.0 and 2.0
    • Manual control coming in 2.2
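A rough sketch of how a padding factor in the 1.0 to 2.0 range could work: each record is allocated at document size times the factor, the factor rises when an update forces a document to move, and it decays when updates fit in place. The step sizes and function names below are hypothetical, chosen for illustration. Only the 1.0 and 2.0 bounds come from the slide.

```javascript
const MIN_PADDING = 1.0; // lower bound from the slide
const MAX_PADDING = 2.0; // upper bound from the slide

// Space reserved for a document of docBytes at the current padding factor.
function allocationSize(docBytes, paddingFactor) {
  return Math.ceil(docBytes * paddingFactor);
}

// Hypothetical adaptation rule: step sizes are illustrative, not MongoDB's values.
function adjustPadding(paddingFactor, documentMoved) {
  const STEP_UP = 0.5;    // hypothetical increase after a move
  const STEP_DOWN = 0.01; // hypothetical decay after an in-place update
  const next = documentMoved ? paddingFactor + STEP_UP : paddingFactor - STEP_DOWN;
  return Math.min(MAX_PADDING, Math.max(MIN_PADDING, next));
}

let factor = 1.0;
factor = adjustPadding(factor, true); // an update outgrew its record and moved
console.log("factor after a move:", factor);
console.log("bytes reserved for a 1000-byte document:", allocationSize(1000, factor));
```

The trade-off is disk space against moves: more padding wastes space on documents that never grow, while less padding makes growing updates relocate documents more often.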
  18. Download MongoDB

    http://www.mongodb.org and let us know what you think
    @stbrody @mongodb
    10gen is hiring! http://www.10gen.com/jobs