
MongoNYC 2012: Stability in the midst of chaos

At foursquare we store almost all of our live data in sharded MongoDB clusters. We handle thousands of requests per second via our website and API, and each of those requests may generate several dozen MongoDB queries. This talk covers the practical details of our configuration, the optimizations we've found useful, and the operational techniques we use to keep things running smoothly.

Jon Hoffman

May 23, 2012

Transcript

1. Mongo on AWS: Stability in the midst of chaos
   MongoNYC, May 23, 2012
   Jon Hoffman, Server Engineer, @hoffrocket

2. Agenda
   • A little bit about foursquare
   • How we use mongo
   • Making mongo robust across soft-failures
   • Questions?

3. Data and Volume
   • A few thousand HTTP qps across web and API
   • Tens of thousands of back-end qps
   • Over 5 million new check-ins per day
   • Spikes to 100 check-ins per second
   • A check-in performs 30 back-end queries before returning a result
   • Many more in asynchronous processing

4. Server environment
   • Entirely on AWS EC2
   • Single region, mostly in one AZ
   • 7 sharded clusters, 3 small non-sharded ones
   • Check-ins stored as individual documents on the largest cluster of 14 shards
   • Replica set of 3 or 4 nodes per shard
   • m2.4xlarge nodes with 68 GB of RAM
   • RAID0 across 4 EBS volumes
   • Dataset mostly kept in memory

5. Favorite mongo features
   1. Auto sharding and balancing: scales writes and reads
   2. Replica set automated failover: scales reads

6. Sharding
   • Chunks are non-contiguous on disk, and chunk moves can incur a lot of random IO
   • Shard before you need to shard
   • The balancer requires a lot of spare IO capacity
   • How much spare capacity you need depends on your usage patterns
   • Use a balancing window so the balancer only runs off-peak (a sketch of setting the window follows below)

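A minimal PyMongo sketch of the balancing-window idea above: the window is stored in the config database's settings collection. The mongos hostname and the 23:00 to 06:00 window are assumptions for illustration.

    from pymongo import MongoClient

    # Connect through a mongos; the balancer reads its settings from the
    # config database's "settings" collection.
    client = MongoClient("mongos.example.com", 27017)

    # Restrict chunk migrations to an off-peak local-time window.
    client.config.settings.update_one(
        {"_id": "balancer"},
        {"$set": {"activeWindow": {"start": "23:00", "stop": "06:00"}}},
        upsert=True,
    )
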
7. Mongo on AWS
   • It's all about random IO
   • High-memory instances only have 2 ephemeral drives
   • Those can't keep up with most write volumes
   • Even at low write volumes, the oplog fell way behind during chunk moves (a lag check is sketched after this slide)
   • RAID0 across 4 EBS volumes
   • Decent steady-state performance, but extremely variable and unreliable

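One way to watch for the oplog falling behind is to compare optimes from replSetGetStatus. A hedged sketch, assuming a hypothetical shard member hostname; it is an illustration of the check, not foursquare's tooling.

    from pymongo import MongoClient

    client = MongoClient("shard1-a.example.com", 27017)
    status = client.admin.command("replSetGetStatus")

    # Compare each secondary's last applied op time against the primary's.
    primary = next(m for m in status["members"] if m["stateStr"] == "PRIMARY")
    for member in status["members"]:
        if member["stateStr"] == "SECONDARY":
            lag = (primary["optimeDate"] - member["optimeDate"]).total_seconds()
            print("%s is %.0f seconds behind the primary" % (member["name"], lag))
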
8. Failure Modes
   • EBS is a network service: variable bandwidth and latency
   • Rarely, IO will halt for seconds or minutes (rare * hundreds of devices == at least once per day)
   • Mongo starts queuing queries very quickly, and mongos continues to route new queries to the affected shard
   • Degradation is not seen as a failure mode, so there's no automated failover

9. Monitoring (non-exhaustive)
   • mongo: opcounters, lock %, r/w queue, faults, index misses (a small serverStatus poller is sketched after this slide)
   • Per-collection opcounters via "top"
   • Machine: CPU (system, user, wio), load average
   • Disk bandwidth and utilization
   • Network bandwidth, TCP retransmits
   • Memory: cached, free, used
   • File descriptors

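Most of the mongod counters on this slide come back from the serverStatus command. A minimal polling sketch, assuming a hypothetical host and a 10-second interval; field names follow the 2.0-era serverStatus layout.

    import time
    from pymongo import MongoClient

    client = MongoClient("shard1-a.example.com", 27017)

    while True:
        status = client.admin.command("serverStatus")
        print("opcounters:", status["opcounters"])                  # query/insert/update/delete/getmore/command
        print("r/w queue:", status["globalLock"]["currentQueue"])   # queued readers and writers
        print("page faults:", status.get("extra_info", {}).get("page_faults"))
        time.sleep(10)
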
10. Detecting EBS disk halting
   • Learn to love iostat
   • It's a convenient wrapper around /proc/diskstats on Linux (the sketch after this slide derives %util from it directly)
   • Every time we noticed a severe IO problem, %util was at 100.00
   • %util: (time spent in IO operations) / (elapsed time) [roughly]

   > iostat -x 1
   Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm   %util
   xvdap1     0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00   0.00    0.00
   xvdd       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00   0.00    0.00
   xvde       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00   0.00    0.00
   xvdc       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00   0.00    0.00
   xvdb       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00   0.00    0.00
   xvdi       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00   0.00    0.00
   xvdk       0.00    0.00  0.00  0.00   0.00   0.00      0.00     49.00   0.00   0.00  100.00
   xvdj       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00   0.00    0.00
   xvdl       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00   0.00    0.00
   md0        0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00   0.00    0.00

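Since iostat's %util is derived from /proc/diskstats, the same check can be scripted directly. A rough sketch: field 10 after the device name is the total time (ms) the device spent doing IO, so its delta over wall-clock time approximates utilization. The one-second sampling window is an arbitrary choice.

    import time

    def io_ticks_ms():
        ticks = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if len(fields) < 14:
                    continue
                # fields[2] is the device name; fields[12] is ms spent doing IO
                ticks[fields[2]] = int(fields[12])
        return ticks

    before, start = io_ticks_ms(), time.time()
    time.sleep(1)
    after, elapsed_ms = io_ticks_ms(), (time.time() - start) * 1000

    for dev in sorted(after):
        util = 100.0 * (after[dev] - before.get(dev, 0)) / elapsed_ms
        print("%-8s %%util %6.2f" % (dev, util))
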
11. Simulating EBS halting with fusehalt
   • Linux filesystem in userspace (FUSE)
   • Modifications to the FUSE disk-proxy example code
   • Switched on/off via a watchfile
   • Mounts the root filesystem at a path of your choice
   • All IO system calls pass through callback functions

   static int read(const char *path, char *buf, size_t size…) {
     while (halted == 1) {
       sleep(1);
     }
     // system calls to read the file into the buffer
     // a background pthread watches the watchfile and updates halted
   }

12. fusehalt
   fusehalt /mnt/haltroot /tmp/haltfile -f
   # /data/mongo now available at /mnt/haltroot/data/mongo
   mongod --dbpath /mnt/haltroot/data/mongo

   # halt:
   touch /tmp/haltfile
   # un-halt:
   rm /tmp/haltfile

   https://github.com/hoffrocket/fusehalt

13. Disk health monitor
   • Hard to simulate %util spikes
   • Easy to simulate actual halting with fusehalt
   • A Python script times periodic sparse writes that hit every drive in the RAID array (a fuller sketch follows after this slide)

   for i in range(0, raid_size):
       touchfile.seek(i * offset)
       touchfile.write('1111')

   https://gist.github.com/2773364

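A fuller sketch of the idea on this slide and the next: a writer thread keeps doing small writes at stripe-sized offsets so every drive in the RAID0 array is exercised, and the main loop touches a killfile if the writes stop completing in time. The paths, stripe size, drive count, and timeout are assumptions; this is an approximation of the approach, not the code in the gist.

    import os
    import threading
    import time

    RAID_SIZE = 4                          # drives in the RAID0 array (assumption)
    STRIPE = 256 * 1024                    # assumed stripe size, so each seek lands on a different drive
    TIMEOUT_SECS = 5
    KILLFILE = "/var/run/mongo-killfile"   # path the patched mongod watches (assumption)
    TESTFILE = "/data/mongo/healthcheck.dat"

    last_ok = time.time()

    def writer():
        # If EBS halts, write/fsync blocks here and last_ok stops advancing.
        global last_ok
        with open(TESTFILE, "w+b") as touchfile:
            while True:
                for i in range(RAID_SIZE):
                    touchfile.seek(i * STRIPE)
                    touchfile.write(b"1111")
                touchfile.flush()
                os.fsync(touchfile.fileno())
                last_ok = time.time()
                time.sleep(1)

    threading.Thread(target=writer, daemon=True).start()

    while True:
        if time.time() - last_ok > TIMEOUT_SECS:
            open(KILLFILE, "a").close()    # timeout breached: signal mongod via the killfile
        time.sleep(1)
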
14. Automated failover
   • The disk health monitor touches a "killfile" when a timeout is breached
   • We modified mongo to watch a configurable killfile (an external approximation is sketched below)
   • If primary, the replica will stepDown
   • serverStatus has a new "health" block
   • mongos won't route slave_ok queries to a replica with bad health
   • Generally useful and lighter weight than a replica set reconfig

   https://github.com/foursquare/mongo/commits/r2.0.4-fs2

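The killfile watch itself lives inside foursquare's patched mongod (the branch linked above). As an external approximation of the same behaviour, a sidecar could ask a primary to step down when the killfile appears. The killfile path and the 60-second step-down period are assumptions.

    import os
    import time
    from pymongo import MongoClient
    from pymongo.errors import AutoReconnect

    KILLFILE = "/var/run/mongo-killfile"
    client = MongoClient("localhost", 27017)

    while True:
        if os.path.exists(KILLFILE) and client.is_primary:
            try:
                # A primary closes connections when stepping down, so
                # AutoReconnect here just means the command took effect.
                client.admin.command("replSetStepDown", 60)
            except AutoReconnect:
                pass
        time.sleep(1)
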
15. Bare metal future
   • Hybrid cloud and hosted
   • AWS Direct Connect
   • SSD drives
   • Predictable IO performance
   • No longer need to store all data in memory