
MongoNYC 2012: Scaling MongoDB at foursquare

mongodb
May 29, 2012

MongoNYC 2012: Scaling MongoDB at foursquare, Jon Hoffman, foursquare. At foursquare we store almost all of our live data in sharded MongoDB clusters. We handle thousands of requests per second via our website and API, and each of those may generate several dozen MongoDB queries. This talk will address the practical details of our configuration, optimizations we've found useful, and the operational techniques we use to keep things running smoothly.


Transcript

  1. Mongo on AWS: Stability in the midst of chaos
     MongoNYC, May 23, 2012
     Jon Hoffman, Server Engineer, @hoffrocket
  2. Agenda
     • A little bit about foursquare
     • How we use mongo
     • Making mongo robust across soft-failures
     • Questions?
  3. Data and Volume
     • A few thousand HTTP qps across web and API
     • Tens of thousands of back-end qps
     • Over 5 million new check-ins per day
       • Spikes to 100 per second
     • A check-in action performs 30 back-end queries before returning a result
     • Many more in asynchronous processing
  4. Server environment
     • Entirely on AWS EC2
     • Single region, mostly in one AZ
     • 7 sharded clusters, 3 small non-sharded
     • Check-ins stored as individual documents on the largest cluster of 14 shards
     • Replica set of 3 or 4 nodes per shard
     • m2.4xlarge 68GB nodes
     • RAID0 on 4 EBS drives
     • Dataset mostly kept in memory
  5. Favorite mongo features
     1. Auto sharding and balancing – scales writes & reads
     2. Replica set automated failover – scales reads
  6. Sharding
     • Chunks are non-contiguous on disk and moves can incur a lot of random IO
     • Shard before you need to shard
     • The balancer requires a lot of spare IO capacity
     • Spare capacity needs depend on your usage patterns
     • Use a balancing window to run off peak (a configuration sketch follows below)
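     The talk doesn't show the balancing window configuration itself. As a minimal sketch (not from the talk), the window can be set by updating the "balancer" document in the config database's settings collection; the mongos hostname and the hour values below are placeholder assumptions.

     # Sketch: restrict the balancer to an off-peak window via config.settings.
     # Assumes pymongo; "mongos-host" and the hours are placeholders.
     from pymongo import MongoClient

     client = MongoClient("mongos-host", 27017)
     client.config.settings.update_one(
         {"_id": "balancer"},
         {"$set": {"activeWindow": {"start": "01:00", "stop": "05:00"}}},
         upsert=True,
     )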
  7. Mongo on AWS
     • It's all about random IO
     • High Memory machines only have 2 ephemeral drives
       • Can't keep up with most write volumes
       • Even for low write volumes, the oplog fell way behind during chunk moves
     • RAID0 on 4 EBS drives
       • Decent steady-state performance, but extremely variable and unreliable
  8. Failure Modes
     • EBS is a network service
       • Variable bandwidth & latency
       • Rarely, IO will halt for seconds or minutes (rare * hundreds of devices == at least once per day)
     • Mongo starts queuing queries very quickly; mongos continues to route new queries
     • Degradation is not seen as a failure mode, so there's no automated failover
  9. Monitoring (non-exhaustive)
     • mongo (a serverStatus polling sketch follows below):
       • opcounters, lock %, r/w queue, faults, index misses
       • Per-collection opcounters ("top")
     • Machine:
       • CPU system, user, wio, load avg
       • Disk bandwidth, utilization
       • Network bandwidth, TCP retransmits
       • Memory cached, free, used
       • File descriptors
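     The slide doesn't include the collection code for these counters. A rough sketch of polling them from one mongod with pymongo follows; the hostname is a placeholder, and the field names are the stock serverStatus ones, which can vary by server version.

     # Sketch: poll serverStatus for a few of the mongo counters named above.
     import time
     from pymongo import MongoClient

     client = MongoClient("mongod-host", 27017)
     while True:
         status = client.admin.command("serverStatus")
         opcounters = status["opcounters"]                 # insert/query/update/delete/getmore/command
         queue = status["globalLock"]["currentQueue"]      # readers/writers queued on the lock
         faults = status.get("extra_info", {}).get("page_faults")  # page faults, platform dependent
         print(opcounters, queue, faults)
         time.sleep(10)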
  10. Detecting EBS disk halting
      • Learn to love iostat
        • Convenient wrapper around /proc/diskstats on Linux
      • Every time we noticed a severe IO problem, %util was at 100.00
      • %util: (time spent in IO operations) / (elapsed time) [roughly]

      > iostat -x 1
      Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await  svctm  %util
      xvdap1     0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
      xvdd       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
      xvde       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
      xvdc       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
      xvdb       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
      xvdi       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
      xvdk       0.00    0.00  0.00  0.00   0.00   0.00     0.00    49.00   0.00   0.00 100.00
      xvdj       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
      xvdl       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
      md0        0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00   0.00   0.00
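      As a hedged illustration of the %util ratio above, roughly the same number can be computed straight from /proc/diskstats (field 13 of each line is the total milliseconds the device has spent doing IO); the device name and sampling interval below are placeholders.

      # Sketch: approximate iostat's %util for one device from /proc/diskstats.
      import time

      def io_ms(device):
          with open("/proc/diskstats") as f:
              for line in f:
                  fields = line.split()
                  if fields[2] == device:
                      return int(fields[12])   # ms spent doing IO (field 13)
          raise ValueError(f"device not found: {device}")

      device, interval = "xvdk", 1.0            # placeholder device and sample window
      before = io_ms(device)
      time.sleep(interval)
      after = io_ms(device)
      util = 100.0 * (after - before) / (interval * 1000.0)
      print(f"{device} %util ~ {util:.2f}")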
  11. Simulating EBS halting with fusehalt
      • Linux Filesystem in Userspace (FUSE)
      • Modifications to the FUSE disk-proxy example code
      • Switched on/off via a watchfile
      • Mounts the root filesystem at a path of your choice
      • All IO system calls pass through callback functions

      static int read(const char *path, char *buf, size_t size, ...)
      {
          // a background pthread watches the watchfile and updates `halted`
          while (halted == 1) {
              sleep(1);
          }
          // ... system calls to read the file into the buffer ...
      }
  12. fusehalt

      fusehalt /mnt/haltroot /tmp/haltfile -f
      # /data/mongo now available at /mnt/haltroot/data/mongo

      mongod --dbpath /mnt/haltroot/data/mongo

      # halt:
      touch /tmp/haltfile
      # un-halt:
      rm /tmp/haltfile

      https://github.com/hoffrocket/fusehalt
  13. Disk health monitor
      • Hard to simulate %util spikes
      • Easy to simulate actual halting with fusehalt
      • Time periodic sparse writes to hit all drives in the RAID array in a Python script (an expanded sketch follows below)

      for i in range(0, raid_size):
          touchfile.seek(i * offset)
          touchfile.write('1111')

      https://gist.github.com/2773364
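      The linked gist isn't reproduced in the transcript. The following is a minimal sketch of the idea described on this slide and the next: time a few sparse writes that land on every drive in the array, and touch a killfile if they stall. All paths, sizes, and thresholds are placeholder assumptions, not the values from the real script.

      # Sketch of the disk health monitor: sparse writes across the RAID0
      # stripe, run in a worker thread so a fully halted disk still trips the
      # timeout, and a killfile touch when the timeout is breached.
      import os
      import threading
      import time

      TOUCH_PATH = "/data/mongo/disk_health_touchfile"   # file on the raid0 volume
      KILL_PATH = "/tmp/mongo_killfile"                   # watched by the modified mongod
      RAID_SIZE = 4              # number of EBS drives in the array
      CHUNK = 256 * 1024         # assumed stripe size, so each write hits a different drive
      TIMEOUT_SECS = 5.0         # how long the writes may take before we flag the disk

      def sparse_write():
          # One small write per stripe-sized offset, so every drive gets touched.
          with open(TOUCH_PATH, "w+b") as f:
              for i in range(RAID_SIZE):
                  f.seek(i * CHUNK)
                  f.write(b"1111")
              f.flush()
              os.fsync(f.fileno())

      while True:
          worker = threading.Thread(target=sparse_write, daemon=True)
          worker.start()
          worker.join(TIMEOUT_SECS)
          if worker.is_alive():
              open(KILL_PATH, "a").close()   # "touch" the killfile
          time.sleep(10)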
  14. Automated failover
      • Disk health monitor touches a "killfile" when a timeout is breached
      • Modified mongo to watch a configurable killfile
        • If primary, the replica will stepDown (a stand-in using the stock command is sketched below)
        • serverStatus has a new "health" block
        • mongos won't route slave_ok queries to a replica with bad health
      • Generally useful and lighter weight than a replica set reconfig

      https://github.com/foursquare/mongo/commits/r2.0.4-fs2
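      The killfile handling above lives in foursquare's patched mongod. As a stand-in for readers on stock MongoDB (not the talk's approach), a monitor process could watch the same killfile and ask a primary to step down using the ordinary replSetStepDown command; the path and address below are placeholders.

      # Sketch: approximate the killfile behavior from outside mongod by
      # issuing replSetStepDown on the local node when the killfile appears.
      import os
      import time
      from pymongo import MongoClient
      from pymongo.errors import AutoReconnect

      KILL_PATH = "/tmp/mongo_killfile"          # same placeholder path as the monitor sketch
      client = MongoClient("localhost", 27017)   # the local mongod

      while True:
          if os.path.exists(KILL_PATH):
              if client.admin.command("isMaster").get("ismaster"):
                  try:
                      # Step down for 300 seconds so a healthy secondary can take over.
                      client.admin.command("replSetStepDown", 300)
                  except AutoReconnect:
                      pass   # stepDown drops connections; this exception is expected
          time.sleep(5)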
  15. Bare metal future
      • Hybrid cloud and hosted
      • AWS Direct Connect
      • SSD drives
      • Predictable IO performance
      • No longer need to store all data in memory