A Brief Tour of MongoDB: Intros, Ops & Internals (Mongo Munich, June 2011)

Brendan McAdams

July 04, 2011

Transcript

  1. Mongo Munich Meetup
    A Brief Tour of MongoDB: Intros, Ops & Internals
    Brendan McAdams - 10gen, Inc.
    [email protected] @rit
    Monday, July 4, 2011
  2. Introductions
    • Brendan McAdams <[email protected]>
    • Started using MongoDB (in production) ~Feb. 2009
    • Engineer at 10gen ("The Company")
    • Scala support (maintain and develop drivers, improve and assist third-party frameworks, community steering)
    • Java support (contribute to driver maintenance and features; focus is on improving integration of non-Java JVM languages with our Java toolchain)
    • Hadoop support (develop & maintain MongoDB's Hadoop integration layers, assist deployments)
    • Support (free community & commercial)
    • Community outreach (meetups, conferences)
    • Training & consulting
  3. Stuffing an object graph into a relational model is like fitting a square peg into a round hole.
  5. Sure, we can use an ORM. But who are we really fooling?
    ... and who/what are we going to wake up next to in the morning?
  6. Data Models
    • Key / Value: memcached, Dynamo
    • Tabular: BigTable
    • Document Oriented: MongoDB, CouchDB
  7. Flexible "Schemas"
    { "author": "brendan", "text": "..." }
    { "author": "brendan", "text": "...", "tags": ["mongodb", "nosql"] }
  8. Here is a "simple" SQL Model
    mysql> select * from book;
    +----+----------------------------------------------------------+
    | id | title                                                    |
    +----+----------------------------------------------------------+
    |  1 | The Demon-Haunted World: Science as a Candle in the Dark |
    |  2 | Cosmos                                                   |
    |  3 | Programming in Scala                                     |
    +----+----------------------------------------------------------+
    3 rows in set (0.00 sec)

    mysql> select * from bookauthor;
    +---------+-----------+
    | book_id | author_id |
    +---------+-----------+
    |       1 |         1 |
    |       2 |         1 |
    |       3 |         2 |
    |       3 |         3 |
    |       3 |         4 |
    +---------+-----------+
    5 rows in set (0.00 sec)

    mysql> select * from author;
    +----+-----------+------------+-------------+-------------+---------------+
    | id | last_name | first_name | middle_name | nationality | year_of_birth |
    +----+-----------+------------+-------------+-------------+---------------+
    |  1 | Sagan     | Carl       | Edward      | NULL        |          1934 |
    |  2 | Odersky   | Martin     | NULL        | DE          |          1958 |
    |  3 | Spoon     | Lex        | NULL        | NULL        |          NULL |
    |  4 | Venners   | Bill       | NULL        | NULL        |          NULL |
    +----+-----------+------------+-------------+-------------+---------------+
    4 rows in set (0.00 sec)
  14. Joins are great and all ...
    • Potentially organizationally messy
    • Structure of a single object is NOT immediately clear to someone glancing at the shell data
    • We have to flatten our object out into three tables
    • 7 separate inserts just to add "Programming in Scala" (1 book row, 3 author rows, 3 bookauthor rows)
    • Once we turn the relational data back into objects ...
    • We still need to convert it to data for our frontend
    • I don't know about you, but I have better things to do with my time.
  15. The Same Data in MongoDB
    > db.books.find().forEach(printjson)
    {
      "_id" : ObjectId("4dfa6baa9c65dae09a4bbda3"),
      "title" : "The Demon-Haunted World: Science as a Candle in the Dark",
      "author" : [
        { "first_name" : "Carl", "last_name" : "Sagan", "middle_name" : "Edward", "year_of_birth" : 1934 }
      ]
    }
    {
      "_id" : ObjectId("4dfa6baa9c65dae09a4bbda4"),
      "title" : "Cosmos",
      "author" : [
        { "first_name" : "Carl", "last_name" : "Sagan", "middle_name" : "Edward", "year_of_birth" : 1934 }
      ]
    }
  16. The Same Data in MongoDB (Part 2)
    {
      "_id" : ObjectId("4dfa6baa9c65dae09a4bbda5"),
      "title" : "Programming in Scala",
      "author" : [
        { "first_name" : "Martin", "last_name" : "Odersky", "nationality" : "DE", "year_of_birth" : 1958 },
        { "first_name" : "Lex", "last_name" : "Spoon" },
        { "first_name" : "Bill", "last_name" : "Venners" }
      ]
    }
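Moving from the three tables of slide 8 to these embedded documents amounts to doing the join once, at write time. The sketch below shows that conversion in plain JavaScript. It assumes the rows have already been fetched into arrays. `embedAuthors` and `withoutNulls` are hypothetical helpers for illustration, not part of MongoDB or any driver, and `_id` generation is left out.

```javascript
// Rows from slide 8's three tables (SQL NULL represented as null).
const books = [
  { id: 1, title: "The Demon-Haunted World: Science as a Candle in the Dark" },
  { id: 2, title: "Cosmos" },
  { id: 3, title: "Programming in Scala" },
];
const bookAuthor = [
  { book_id: 1, author_id: 1 },
  { book_id: 2, author_id: 1 },
  { book_id: 3, author_id: 2 },
  { book_id: 3, author_id: 3 },
  { book_id: 3, author_id: 4 },
];
const authors = [
  { id: 1, last_name: "Sagan", first_name: "Carl", middle_name: "Edward", nationality: null, year_of_birth: 1934 },
  { id: 2, last_name: "Odersky", first_name: "Martin", middle_name: null, nationality: "DE", year_of_birth: 1958 },
  { id: 3, last_name: "Spoon", first_name: "Lex", middle_name: null, nationality: null, year_of_birth: null },
  { id: 4, last_name: "Venners", first_name: "Bill", middle_name: null, nationality: null, year_of_birth: null },
];

// Hypothetical helper: copy every column except the surrogate key,
// skipping NULLs so that optional fields simply do not appear.
function withoutNulls(row) {
  const out = {};
  for (const [key, value] of Object.entries(row)) {
    if (key !== "id" && value !== null) out[key] = value;
  }
  return out;
}

// Hypothetical helper: resolve book -> bookauthor -> author once and
// nest each book's authors inside the book document.
function embedAuthors(books, bookAuthor, authors) {
  const authorById = new Map(authors.map((a) => [a.id, withoutNulls(a)]));
  return books.map((book) => ({
    title: book.title,
    author: bookAuthor
      .filter((link) => link.book_id === book.id)
      .map((link) => authorById.get(link.author_id)),
  }));
}

const docs = embedAuthors(books, bookAuthor, authors);
console.log(JSON.stringify(docs[2], null, 2));
```

Dropping NULL columns mirrors the documents above, where Spoon and Venners carry only the fields they actually have.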
  17. Access to the embedded objects is integral
    > db.books.find({"author.first_name": "Martin", "author.last_name": "Odersky"})
    {
      "_id" : ObjectId("4dfa6baa9c65dae09a4bbda5"),
      "title" : "Programming in Scala",
      "author" : [
        { "first_name" : "Martin", "last_name" : "Odersky", "nationality" : "DE", "year_of_birth" : 1958 },
        { "first_name" : "Lex", "last_name" : "Spoon" },
        { "first_name" : "Bill", "last_name" : "Venners" }
      ]
    }
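One subtlety in the query above: each dotted condition is tested against the `author` array independently. The two conditions may therefore be satisfied by different embedded authors. The toy matcher below makes this visible. `valuesAtPath` and `matches` are hypothetical helpers that handle only equality on dotted paths; this is not MongoDB's implementation.

```javascript
// Collect every value reachable at a dotted path, descending into arrays
// along the way (a simplified model of dotted paths over embedded arrays).
function valuesAtPath(value, path) {
  if (path.length === 0) return [value];
  if (Array.isArray(value)) {
    return value.flatMap((item) => valuesAtPath(item, path));
  }
  if (value === null || typeof value !== "object") return [];
  return valuesAtPath(value[path[0]], path.slice(1));
}

// Equality-only matcher: every condition must hold, but each one is
// checked against the whole document on its own.
function matches(doc, query) {
  return Object.entries(query).every(([path, expected]) =>
    valuesAtPath(doc, path.split(".")).some((v) => v === expected)
  );
}

const book = {
  title: "Programming in Scala",
  author: [
    { first_name: "Martin", last_name: "Odersky" },
    { first_name: "Lex", last_name: "Spoon" },
    { first_name: "Bill", last_name: "Venners" },
  ],
};

const exact = matches(book, { "author.first_name": "Martin", "author.last_name": "Odersky" });
// "Martin" comes from the first author and "Spoon" from the second, yet this matches too.
const mixed = matches(book, { "author.first_name": "Martin", "author.last_name": "Spoon" });
console.log(exact, mixed);
```

To require that a single embedded author satisfies both conditions, the shell form is `db.books.find({author: {$elemMatch: {first_name: "Martin", last_name: "Odersky"}}})`.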
  18. As is manipulation of the embedded data
    > db.books.update({"author.first_name": "Bill", "author.last_name": "Venners"},
    ... {$set: {"author.$.company": "Artima, Inc."}})
    > db.books.update({"author.first_name": "Martin", "author.last_name": "Odersky"},
    ... {$set: {"author.$.company": "Typesafe, Inc."}})
    > db.books.findOne({"title": /Scala$/})
    {
      "_id" : ObjectId("4dfa6baa9c65dae09a4bbda5"),
      "author" : [
        { "company" : "Typesafe, Inc.", "first_name" : "Martin", "last_name" : "Odersky", "nationality" : "DE", "year_of_birth" : 1958 },
        { "first_name" : "Lex", "last_name" : "Spoon" },
        { "company" : "Artima, Inc.", "first_name" : "Bill", "last_name" : "Venners" }
      ],
      "title" : "Programming in Scala"
    }
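The positional `$` in `"author.$.company"` stands for the index of the array element the query matched. `setPositional` below is a hypothetical, simplified model. It looks for the first element that satisfies all the conditions together and updates only that element. This is the behaviour one gets unambiguously in MongoDB by pinning the element with `$elemMatch`. It is not a reimplementation of the server's update logic.

```javascript
// Hypothetical helper modelling {$set: {"<arrayField>.$.<field>": value}}:
// find the first array element matching every condition, then set the
// field on that element only.
function setPositional(doc, arrayField, conditions, field, value) {
  const items = doc[arrayField];
  if (!Array.isArray(items)) return false;
  const index = items.findIndex((item) =>
    Object.entries(conditions).every(([key, expected]) => item[key] === expected)
  );
  if (index === -1) return false; // no matching element: document left unchanged
  items[index][field] = value;
  return true;
}

const book = {
  title: "Programming in Scala",
  author: [
    { first_name: "Martin", last_name: "Odersky", nationality: "DE", year_of_birth: 1958 },
    { first_name: "Lex", last_name: "Spoon" },
    { first_name: "Bill", last_name: "Venners" },
  ],
};

setPositional(book, "author", { first_name: "Bill", last_name: "Venners" }, "company", "Artima, Inc.");
setPositional(book, "author", { first_name: "Martin", last_name: "Odersky" }, "company", "Typesafe, Inc.");
console.log(JSON.stringify(book.author));
```

In the shell, the unambiguous equivalent of the first update is `db.books.update({author: {$elemMatch: {first_name: "Bill", last_name: "Venners"}}}, {$set: {"author.$.company": "Artima, Inc."}})`.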
  19. Operating System maps files on the Filesystem to Virtual Memory
    • 200 gigs of MongoDB files create 200 gigs of virtual memory
    • The OS controls which data is in RAM
    • When a piece of data isn't in RAM, a page fault occurs (expensive + locking!)
    • The OS goes to disk to fetch the data
    • Indexes are part of the regular database files
    • Deployment Trick: Pre-warm your database (i.e. warm your cache) to prevent cold-start slowdown
  20. Big Things To Watch For
    • % index miss
    • faults / sec
    • flushes / sec
  21. MongoDB will take advantage of multiple cores
    • For working set queries, CPU usage is typically low
  22. Full Tablescans
    • Surprise: Queries which don't hit indexes make heavy use of CPU & disk
    • Deployment Trick: Avoid counting & computing on the fly by caching & precomputing data
  23. DB Profiling is your Friend
    • Ensure your queries are being executed correctly
    • Enable profiling
      • db.setProfilingLevel(n)
      • n=1: slow operations, n=2: all operations
    • Viewing profile information
      • db.system.profile.find({info: /test.foo/})
      • http://www.mongodb.org/display/DOCS/Database+Profiler
    • Query execution plan:
      • db.xx.find({..}).explain()
      • http://www.mongodb.org/display/DOCS/Optimization
    • Make sure your queries are properly indexed.
    • Deployment Trick: Start mongod with --notablescan to disable tablescans
  24. Indexes
    • An index on "Foo, Bar, Baz" works for "Foo", "Foo, Bar" and "Foo, Bar, Baz"
    • The query optimizer figures out the order, but can't do things in reverse
    • You can pass hints to force a specific index: db.collection.find({username: 'foo', city: 'New York'}).hint({'username': 1})
    • Missing values are indexed as "null"
      • This includes unique indexes
    • Deployment Trick: 1.8 has Sparse and Covered Indexes!
    • system.indexes!
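The prefix rule in the first bullet can be written as a small check. `isIndexPrefix` is a hypothetical helper for illustration only. It asks whether the queried fields are exactly a leading prefix of a compound index, in any order. It ignores sorts and ranges, and it ignores the fact that a query such as "Foo, Baz" can still use the "Foo" part of the index and filter the rest.

```javascript
// True when the queried fields are exactly a leading prefix of the compound
// index, regardless of their order in the query (the optimizer reorders them).
function isIndexPrefix(indexFields, queryFields) {
  const queried = new Set(queryFields);
  if (queried.size === 0) return false;
  // Count how many leading index fields the query mentions.
  let prefixLength = 0;
  while (prefixLength < indexFields.length && queried.has(indexFields[prefixLength])) {
    prefixLength += 1;
  }
  // Every queried field must fall inside that leading run.
  return prefixLength === queried.size;
}

const index = ["foo", "bar", "baz"];
console.log(isIndexPrefix(index, ["foo"]));               // true
console.log(isIndexPrefix(index, ["bar", "foo"]));        // true: field order in the query is irrelevant
console.log(isIndexPrefix(index, ["foo", "bar", "baz"])); // true
console.log(isIndexPrefix(index, ["bar", "baz"]));        // false: no leading "foo"
console.log(isIndexPrefix(index, ["baz"]));               // false
```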
  25. Map Reduce
    • Currently single threaded; runs in parallel across shards
    • Deployment Trick: Use the new aggregation output options
  26. Working set is crucial!!!
    • Working set should be, as much as possible, in memory
    • Your entire dataset need not be!
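Slides 19 and 26 argue that the working set, not the whole dataset, is what has to fit in RAM. The toy model below shows why. It treats the OS page cache as a strict LRU cache of fixed-size pages, which is an assumption: real kernels only approximate LRU. `countPageFaults` is a hypothetical helper and the page counts are made up. Only the shape of the result matters.

```javascript
// Toy page cache: a Map keeps insertion order, so re-inserting a page on
// every access leaves the least recently used page as the first key.
function countPageFaults(accesses, ramPages) {
  const cache = new Map();
  let faults = 0;
  for (const page of accesses) {
    if (cache.has(page)) {
      cache.delete(page); // hit: move the page to the most-recently-used end
    } else {
      faults += 1; // miss: the "OS" has to read this page from disk
      if (cache.size >= ramPages) {
        const lru = cache.keys().next().value;
        cache.delete(lru); // evict the least recently used page
      }
    }
    cache.set(page, true);
  }
  return faults;
}

// Queries that keep touching the same 50 "hot" pages, 100 times over.
const hot = [];
for (let round = 0; round < 100; round++) {
  for (let page = 0; page < 50; page++) hot.push(page);
}

console.log(countPageFaults(hot, 64)); // 50: working set fits, only first touches fault
console.log(countPageFaults(hot, 32)); // 5000: working set does not fit, every access faults
```

The cliff between the two results is the point. Once the hot pages stop fitting, a cyclic access pattern makes LRU evict exactly the page needed next, so disk I/O (slide 27) takes over.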
  27. Disks & I/O
    • Disk I/O becomes the defining factor of performance for queries outside the working set
  28. Surprise: Faster disks are better than slow disks. More is also better
    • RAID is good for a variety of reasons
    • Our recommendations ...
  29. RAID 10 (mirrored sets inside a striped set; minimum 4 disks)
    • Improved write performance
    • Survives single disk failure
    • Downside: needs double the storage
      • e.g. four 20 gig disks give you 40 gigs of usable space
    • LVM of RAID 10 on EBS seems to smooth out performance and reliability best for MongoDB
  30. RAID 10 is NOT RAID 0+1
    • Striping on top of mirrors vs. mirrors on top of striping
    • (The order on diagrams can be confusing)
    • This is RAID 0+1, not RAID 10
  31. RAID 10 is NOT RAID 0+1
    • Striping on top of mirrors vs. mirrors on top of striping
    • (The order on diagrams can be confusing)
    • This is RAID 10, not RAID 0+1
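Why the order matters can be shown by enumerating every pair of failed disks in a four-disk array. The sketch assumes the textbook failure model, in which a RAID 0+1 stripe set goes offline as soon as any of its members fails; real controllers may behave differently. The disk labels A to D and the grouping are illustrative.

```javascript
const disks = ["A", "B", "C", "D"];

// RAID 10: a stripe across two mirrored pairs. Data is lost only if
// both disks of the same mirror fail.
const raid10Mirrors = [["A", "B"], ["C", "D"]];
function raid10Survives(failed) {
  return raid10Mirrors.every((pair) => pair.some((d) => !failed.has(d)));
}

// RAID 0+1: a mirror of two striped sets. A stripe set is lost if any of
// its disks fails; data survives while at least one stripe set is intact.
const raid01Stripes = [["A", "B"], ["C", "D"]];
function raid01Survives(failed) {
  return raid01Stripes.some((stripe) => stripe.every((d) => !failed.has(d)));
}

// Enumerate all 6 ways two of the four disks can fail.
const pairs = [];
for (let i = 0; i < disks.length; i++) {
  for (let j = i + 1; j < disks.length; j++) pairs.push(new Set([disks[i], disks[j]]));
}

const raid10Ok = pairs.filter((failed) => raid10Survives(failed)).length;
const raid01Ok = pairs.filter((failed) => raid01Survives(failed)).length;
console.log(`RAID 10 survives ${raid10Ok}/${pairs.length}, RAID 0+1 survives ${raid01Ok}/${pairs.length}`);
// RAID 10 survives 4/6, RAID 0+1 survives 2/6
```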
  32. RAID 5 or 6
    • 1 or 2 additional disks required for parity
    • Can survive 1 or 2 disk failures
    • Implementations seem inconsistent; buyer beware
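The capacity trade-offs on slides 29 and 32 reduce to simple arithmetic for n equal disks of size s: RAID 10 gives n·s/2, RAID 5 gives (n-1)·s, and RAID 6 gives (n-2)·s. The sketch below encodes that arithmetic. `usableCapacity` is a hypothetical helper that follows slide 29's four-disk minimum for RAID 10 and ignores filesystem, metadata, and hot-spare overhead.

```javascript
// Usable capacity of `disks` equal-sized disks of `sizeGb` each.
// Ignores filesystem, metadata, and hot-spare overhead.
function usableCapacity(level, disks, sizeGb) {
  switch (level) {
    case "10":
      if (disks < 4 || disks % 2 !== 0) throw new Error("RAID 10 needs an even number of disks, at least 4");
      return (disks / 2) * sizeGb; // every block is stored twice
    case "5":
      if (disks < 3) throw new Error("RAID 5 needs at least 3 disks");
      return (disks - 1) * sizeGb; // one disk's worth of parity
    case "6":
      if (disks < 4) throw new Error("RAID 6 needs at least 4 disks");
      return (disks - 2) * sizeGb; // two disks' worth of parity
    default:
      throw new Error(`unsupported RAID level: ${level}`);
  }
}

console.log(usableCapacity("10", 4, 20)); // 40, the slide's example
console.log(usableCapacity("5", 4, 20));  // 60
console.log(usableCapacity("6", 4, 20));  // 40
```

With four disks, RAID 6 and RAID 10 end up with the same usable space. They differ in which two-disk failures they survive and in write cost.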
  33. Flash (SSD)
    • Expensive, but getting cheaper
    • Significantly reduced seek time and increased I/O throughput
    • Random writes and sequential reads are still a weak point
  34. OS
    • For production: use a 64-bit OS and a 64-bit MongoDB build
    • 32-bit has a 2 gig limit, imposed by the operating system on memory-mapped files
    • Clients can be 32-bit
    • MongoDB supports (little endian only):
      • Linux, FreeBSD, OS X (on Intel, not PowerPC)
      • Windows
      • Solaris (Intel only; Joyent offers a cloud service which works for Mongo)
  35. MongoStat - free tool which comes with MongoDB
    • Shows I/O counters, time spent in locks, etc.
  36. Similarly, iostat ships on most Linux machines (or can be installed) [sysstat package on Ubuntu]
    • iostat [args] <seconds per poll>
      • -x for extended report
    • Disk can be a bottleneck in large datasets where working set > RAM
    • ~200-300Mb/s on XL EC2 instances, but YMMV (EBS is slower)
    • On Amazon, latency spikes of 400-600 ms are common (no, this is not a good thing)
  37. Use MongoDB's Built-in Profiler
    • Ensure your queries are being executed correctly
    • Enable profiling
      • db.setProfilingLevel(n)
      • n=1: slow operations, n=2: all operations
    • Viewing profile information
      • db.system.profile.find({info: /test.foo/})
      • http://www.mongodb.org/display/DOCS/Database+Profiler
    • Query execution plan:
      • db.xx.find({..}).explain()
      • http://www.mongodb.org/display/DOCS/Optimization
    • Deployment / Common Sense Trick: Make sure your queries are properly indexed!
  38. File Allocation
    All data & namespace files are stored in the 'data' directory (--dbpath)
    • You can create symbolic links to keep different databases on different disks
    • Best to aggregate your I/O across multiple disks
  39. _id
    • If not specified, drivers will add a default: ObjectId("4bface1a2231316e04f3c434")
    • Composed of: timestamp (4 bytes) | machine id (3 bytes) | process id (2 bytes) | counter (3 bytes)
    • http://www.mongodb.org/display/DOCS/Object+IDs
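Because the 12 bytes have a fixed layout, an ObjectId can be taken apart by slicing its hex string. `decodeObjectId` is a hypothetical helper that follows the original layout listed on the slide. Later ObjectId specifications replace the machine and process bytes with a random value. The timestamp and counter are stored big-endian, which is why ObjectIds sort roughly by creation time.

```javascript
// Split a 24-hex-character ObjectId into the fields listed on the slide:
// 4-byte timestamp, 3-byte machine id, 2-byte process id, 3-byte counter.
function decodeObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) throw new Error("expected 24 hex characters");
  const timestamp = parseInt(hex.slice(0, 8), 16); // seconds since the Unix epoch, big-endian
  return {
    timestamp,
    created: new Date(timestamp * 1000).toISOString(),
    machineId: hex.slice(8, 14),
    processId: hex.slice(14, 18),
    counter: parseInt(hex.slice(18, 24), 16),
  };
}

console.log(decodeObjectId("4bface1a2231316e04f3c434"));
```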
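  40. BSON Encoding
    { _id: ObjectId(XXXXXXXXXXXX), hello: "world" }
    \x27\x00\x00\x00                   document size: 0x27 = 39 bytes
    \x07 _ i d \x00                    type 0x07 (ObjectId), key "_id"
    X X X X X X X X X X X X            12-byte ObjectId value
    \x02 h e l l o \x00                type 0x02 (string), key "hello"
    \x06\x00\x00\x00 w o r l d \x00    string length 6 (including the trailing \x00), "world"
    \x00                               end of document
    http://bsonspec.org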
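The byte layout above can be reproduced with a few lines of JavaScript. `encodeDocument` is a hypothetical, minimal encoder that supports only the two element types on the slide, 0x07 (ObjectId) and 0x02 (string). It assumes ASCII keys and values, so one character is one byte; a real driver handles full UTF-8 and every BSON type.

```javascript
// Little-endian int32, as used for BSON sizes and string lengths.
function int32LE(n) {
  return [n & 0xff, (n >>> 8) & 0xff, (n >>> 16) & 0xff, (n >>> 24) & 0xff];
}

// Null-terminated string (ASCII only in this sketch).
function cstring(s) {
  return [...s].map((c) => c.charCodeAt(0)).concat(0x00);
}

function encodeDocument(elements) {
  const body = [];
  for (const { type, key, value } of elements) {
    body.push(type, ...cstring(key));
    if (type === 0x07) {
      body.push(...value); // ObjectId: 12 raw bytes, no length prefix
    } else if (type === 0x02) {
      const bytes = cstring(value);
      body.push(...int32LE(bytes.length), ...bytes); // length includes the trailing \x00
    } else {
      throw new Error(`type 0x${type.toString(16)} not supported in this sketch`);
    }
  }
  // Total size counts the 4-byte size field itself and the final \x00 terminator.
  return Uint8Array.from([...int32LE(body.length + 5), ...body, 0x00]);
}

const oid = Array.from({ length: 12 }, () => 0x58); // placeholder "X" bytes, as on the slide
const bson = encodeDocument([
  { type: 0x07, key: "_id", value: oid },
  { type: 0x02, key: "hello", value: "world" },
]);
console.log(bson.length, bson[0].toString(16)); // 39 27
```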
  41. Extent Allocation
    [Diagram: data files foo.0, foo.1, foo.2 with preallocated space; extents allocated per namespace (foo.$freelist, foo.baz, foo.bar, foo.test); namespace details stored in foo.ns]
  42. Record Allocation
    [Diagram: Deleted Record (Size, Offset, Next); Record = Header (Size, Offset, Next, Prev) + BSON Data + Padding]
  43. Insert Message (TCP / IP)
    message length     \x68\x00\x00\x00
    message id         \xXX\xXX\xXX\xXX
    response id        \x00\x00\x00\x00
    op code (insert)   \xd2\x07\x00\x00
    reserved           \x00\x00\x00\x00
    collection name    f o o . t e s t \x00
    document(s)        BSON Data
    http://www.mongodb.org/display/DOCS/Mongo+Wire+Protocol
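The message on the slide can be assembled byte by byte. The sketch uses the legacy insert opcode shown there (2002, i.e. `\xd2\x07\x00\x00`). Later server versions replaced these legacy opcodes with OP_MSG, so this is illustrative only. `buildInsertMessage` is a hypothetical helper, and the empty document `{}` stands in for real BSON data.

```javascript
const OP_INSERT = 2002; // legacy insert opcode from the slide

// Little-endian int32, used for every header field.
function int32LE(n) {
  return [n & 0xff, (n >>> 8) & 0xff, (n >>> 16) & 0xff, (n >>> 24) & 0xff];
}

function buildInsertMessage(requestId, fullCollectionName, bsonDocs) {
  const body = [
    ...int32LE(0), // "reserved" on the slide (later protocol docs describe it as flags)
    ...[...fullCollectionName].map((c) => c.charCodeAt(0)), 0x00, // cstring "db.collection"
    ...bsonDocs.flatMap((doc) => [...doc]), // one or more BSON documents, back to back
  ];
  const header = [
    ...int32LE(16 + body.length), // message length, including this 16-byte header
    ...int32LE(requestId), // message id
    ...int32LE(0), // response id: 0 for a client request
    ...int32LE(OP_INSERT), // op code
  ];
  return Uint8Array.from([...header, ...body]);
}

const emptyDoc = Uint8Array.from([0x05, 0x00, 0x00, 0x00, 0x00]); // BSON for {}
const msg = buildInsertMessage(1, "foo.test", [emptyDoc]);
console.log(msg.length); // 34 = 16 (header) + 4 (reserved) + 9 ("foo.test\0") + 5 ({})
console.log([...msg.slice(12, 16)].map((b) => b.toString(16).padStart(2, "0")).join(" ")); // d2 07 00 00
```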
  44. Logfiles
    • --logpath <file>
    • Rotation can be requested of MongoDB...
      • db.runCommand("logRotate") (run against the admin database)
      • kill -SIGUSR1 <mongod pid>
      • killall -SIGUSR1 mongod
    • Won't work for ./mongod > [file] syntax
  45. Filesystems
    • MongoDB is filesystem neutral
    • ext3, ext4 and XFS are most used
    • BUT...
    • ext4, XFS or any other filesystem with posix_fallocate() are preferred and best
  46. EC2
    • Many distros default to ext3 (but the Amazon AMI now uses ext4 by default)
    • For best performance, reformat to ext4 / XFS
      • Make sure you use a recent version of ext4
    • Striping (MDADM / LVM) aggregates I/O
    • See previous recommendations about RAID 10
  47. Maintenance
    • When doing a lot of updates or deletes...
      • Compaction may be needed occasionally on indices and datafiles
      • db.repairDatabase()
    • Replica Sets:
      • Rolling repairs: start nodes up with the --repair param
    • Deployment Trick: For large bulk data operations, consider removing indexes and re-adding them later! (Better: a new DB may help)
  48. Scale out
    [Diagram: write and read traffic; three mongos / config server nodes; three shards, each a replica set: shard1 (rep_a1, rep_b1, rep_c1), shard2 (rep_a2, rep_b2, rep_c2), shard3 (rep_a3, rep_b3, rep_c3)]
  49. Best driven from a slave
    • Eliminates impact on master during backup
    • Hidden Nodes in 1.8
  50. mongodump / mongorestore
    • Binary, compact object dump
    • Each object is written in a consistent state
    • NOT necessarily consistent from start to finish (unless you lock the database)
    • mongorestore to restore a binary dump
    • Deployment Trick #1: The database doesn't have to be up to restore; you can use --dbpath
    • Deployment Trick #2: mongodump with replSetName/<hostlist> will automatically read from a slave!
  51. filelock / fsync
    • lock: blocks writes
      • db.runCommand({fsync: 1, lock: 1})
    • fsync to flush buffers to disk
    • backup
    • then, unlock
      • db.$cmd.sys.unlock.findOne();
  52. DR / HA
    With journaling, you can run an LVM or EBS snapshot and recover later without locking
    • EBS can disappear (see: last week)
    • S3 for longer-term backups
    • USE AMAZON AVAILABILITY ZONES
  53. MongoDB is a Single-Master System
    • A database is served by members of a "replica set"
    • The system elects a primary (master)
    • Failure of the master is detected, and a new master is elected
    • Application writes get an error if there is no quorum to elect a new master
    • Reads continue to be fulfilled
  54. MongoDB Supports Sharding
    • A collection can be sharded
    • Each shard is served by its own replica set
    • New shards (each a replica set) can be added at any time
    • Shard key ranges are automatically balanced
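"Shard key ranges are automatically balanced" rests on each shard owning contiguous ranges of the shard key. The balancer moves ranges between shards; it does not scatter individual documents. The toy router below uses a hypothetical chunk table over a numeric shard key and a hypothetical `shardFor` lookup. In a real cluster this metadata lives on the config servers and is used by mongos, and the table here is invented for illustration.

```javascript
// Hypothetical chunk table for a numeric shard key (e.g. a user id).
// Each chunk owns the half-open range [min, max); chunks are sorted and contiguous.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shard1" },
  { min: 1000, max: 5000, shard: "shard2" },
  { min: 5000, max: Infinity, shard: "shard3" },
];

// Binary search for the chunk whose range contains `key`.
function shardFor(key, chunks) {
  let lo = 0;
  let hi = chunks.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const chunk = chunks[mid];
    if (key < chunk.min) hi = mid - 1;
    else if (key >= chunk.max) lo = mid + 1;
    else return chunk.shard;
  }
  throw new Error(`no chunk covers key ${key}`);
}

// A range query only needs the shards whose chunks overlap [from, to).
function shardsForRange(from, to, chunks) {
  const hit = chunks.filter((c) => c.min < to && from < c.max).map((c) => c.shard);
  return [...new Set(hit)];
}

console.log(shardFor(42, chunks));              // shard1
console.log(shardFor(1000, chunks));            // shard2 (min is inclusive)
console.log(shardsForRange(900, 1200, chunks)); // [ 'shard1', 'shard2' ]
```

Moving a chunk in this model means changing its `shard` field. Queries on the shard key still reach only the shards whose ranges overlap them.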
  55. MongoDB Storage Management
    • Data is kept in memory-mapped files
    • Servers should have a lot of memory
    • Files are allocated as needed
    • Documents in a collection are kept on a list using a geographical addressing scheme
    • Indexes (B*-trees) point to documents using geographical addresses
  56. MongoDB Server Management
    • Replica set members are aware of each other
    • A majority of votes is required to elect a new primary
    • Members can be assigned priorities to affect the election
      • e.g., an "invisible" replica can be created with zero priority for backup purposes
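The majority rule on slides 53 and 56 explains when writes stop. A primary can only be elected while a strict majority of the configured voting members can reach each other. The sketch assumes one vote per member and ignores priorities and the election protocol itself. `votesNeeded` and `canElectPrimary` are hypothetical helpers.

```javascript
// Strict majority of the configured voting members (not of the members currently up).
function votesNeeded(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// Can the members on one side of a failure or network partition elect a primary?
function canElectPrimary(votingMembers, reachableMembers) {
  return reachableMembers >= votesNeeded(votingMembers);
}

for (const [members, up] of [[3, 2], [3, 1], [4, 2], [5, 3]]) {
  console.log(`${up} of ${members} reachable -> need ${votesNeeded(members)}, primary possible: ${canElectPrimary(members, up)}`);
}
```

An even number of voting members adds no failure tolerance over the next smaller odd number. Four members tolerate one failure, just like three.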
  58. 2,000+ production deployments and growing.
    • NYTimes, MTV, Shutterfly, Foursquare, Craigslist, Disney, and more in production.
    • Real, full indexes including sparse, covered & geospatial.
  59. MongoDB Users
    • http://www.10gen.com/customers
    • http://www.10gen.com/presentations
    • craigslist: http://www.10gen.com/presentation/mongosf2011/craigslist
    • bit.ly: http://blip.tv/mongodb/bit-ly-user-history-auto-sharded-3723147
    • shutterfly: http://www.10gen.com/presentation/mongosv2010/shutterfly
  60. @mongodb
    • conferences, appearances, and meetups: http://www.10gen.com/events
    • http://bit.ly/mongoB  Facebook | Twitter | LinkedIn  http://linkd.in/joinmongo
    • download at mongodb.org
    • We're Hiring!
    • [email protected] (twitter: @rit)