Slide 1

Mongo on AWS: Stability in the midst of chaos
MongoNYC, May 23, 2012
Jon Hoffman, Server Engineer
@hoffrocket

Slide 2

Agenda
• A little bit about foursquare
• How we use mongo
• Making mongo robust across soft-failures
• Questions?

Slide 3

Where do my friends go? Where should I go?
Social and Algorithmic Discovery

Slide 4

Growth
• 20,000,000+ people
• 2,000,000,000+ check-ins
• 30,000,000+ places

Slide 5

Data and Volume
• A few thousand HTTP qps across web and API
• Tens of thousands of back-end qps
• Over 5 million new check-ins per day
  • Spikes to 100 per second
• Check-in action performs 30 back-end queries before returning a result
  • Many more in asynchronous processing

Slide 6

Server environment
• Entirely on AWS EC2
• Single region, mostly in one AZ
• 7 sharded clusters, 3 small non-sharded
• Check-ins stored as individual documents on the largest cluster of 14 shards
• Replica set of 3 or 4 nodes per shard
• m2.4xlarge 68GB nodes
• RAID0 on 4 EBS drives
• Dataset mostly kept in memory

Slide 7

Favorite mongo features
1. Auto sharding and balancing – scales writes & reads
2. Replica set automated failover – scales reads

Slide 8

Sharding
• Chunks are non-contiguous on disk, and moves can incur a lot of random IO
• Shard before you need to shard
  • The balancer requires a lot of spare IO capacity
  • Spare capacity needs depend on your usage patterns
• Use a balancing window to run off peak (see the sketch below)
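
One way to restrict balancing to an off-peak window is to set activeWindow in the config database's settings collection through a mongos. A minimal sketch with pymongo; the host, port, and window hours here are illustrative placeholders:

from pymongo import MongoClient

# connect through a mongos (host/port are placeholders)
client = MongoClient('localhost', 27017)

# only allow the balancer to migrate chunks between 23:00 and 06:00
client.config.settings.update_one(
    {'_id': 'balancer'},
    {'$set': {'activeWindow': {'start': '23:00', 'stop': '06:00'}}},
    upsert=True)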

Slide 9

Mongo on AWS
• It’s all about random IO
• High Memory machines only have 2 ephemeral drives
  • Can’t keep up with most write volumes
  • Even for low write volumes, the oplog fell way behind during chunk moves
• RAID0 on 4 EBS drives
  • Decent steady-state performance, but extremely variable and unreliable

Slide 10

Failure Modes
• EBS is a network service
  • Variable bandwidth & latency
  • Rarely, IO will halt for seconds or minutes (rare * hundreds of devices == at least once per day)
• Mongo starts queuing queries very quickly, and mongos continues to route new queries
• Degradation is not seen as a failure mode, so there’s no automated failover

Slide 11

Monitoring (non-exhaustive)
• mongo (mostly from serverStatus; see the sketch below)
  • opcounters, lock %, r/w queue, faults, index misses
  • Per-collection opcounters (“top”)
• Machine
  • CPU system, user, wio, load avg
  • Disk bandwidth, utilization
  • Network bandwidth, TCP retransmits
  • Memory cached, free, used
  • File descriptors
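
Most of the mongo-side numbers come straight out of the serverStatus command, so they are easy to pull on a schedule. A minimal pymongo polling sketch; the host/port are placeholders and exact field names vary a bit between mongo versions:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)   # one mongod; run this against every node

status = client.admin.command('serverStatus')
sample = {
    'opcounters': status['opcounters'],                                # insert/query/update/delete/getmore/command
    'queued_readers': status['globalLock']['currentQueue']['readers'],
    'queued_writers': status['globalLock']['currentQueue']['writers'],
}

# per-collection op counts come from the "top" command
per_collection = client.admin.command('top')['totals']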

Slide 12

Mongo telemetry
• Websocket updates
• Cluster-wide aggregation of mongostat (a rough sketch follows)
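
mongostat is essentially serverStatus sampled once per second, so cluster-wide aggregation can be approximated by polling every node and summing the opcounter deltas. A rough sketch; the host list is hypothetical:

import time
from pymongo import MongoClient

HOSTS = ['shard1a:27017', 'shard1b:27017']   # hypothetical list of mongod nodes

clients = [MongoClient(host) for host in HOSTS]

def opcounters(client):
    return client.admin.command('serverStatus')['opcounters']

prev = [opcounters(c) for c in clients]
while True:
    time.sleep(1)
    cur = [opcounters(c) for c in clients]
    # ops per second summed across every node in the cluster
    totals = {op: sum(c[op] - p[op] for c, p in zip(cur, prev))
              for op in ('insert', 'query', 'update', 'delete')}
    print(totals)
    prev = cur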

Slide 13

Monitoring

Slide 14

Detecting EBS disk halting
• Learn to love iostat
  • Convenient wrapper around /proc/diskstats on linux
• Every time we noticed a severe IO problem, %util was at 100.00
• %util: (time spent in IO operations) / (elapsed time) [roughly]

> iostat -x 1
Device:  rrqm/s  wrqm/s   r/s   w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
xvdap1     0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
xvdd       0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
xvde       0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
xvdc       0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
xvdb       0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
xvdi       0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
xvdk       0.00    0.00  0.00  0.00    0.00    0.00      0.00     49.00   0.00   0.00 100.00
xvdj       0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
xvdl       0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
md0        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
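
The same %util number can be computed directly from /proc/diskstats: the 13th field is the cumulative milliseconds the device has spent with IO in flight, so its delta over an interval gives utilization. A small sketch; the device name and interval are illustrative:

import time

def io_ticks_ms(device):
    # /proc/diskstats: field 3 is the device name, field 13 is ms spent doing IO
    with open('/proc/diskstats') as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[12])
    raise ValueError('unknown device: %s' % device)

def util_percent(device, interval=1.0):
    before = io_ticks_ms(device)
    time.sleep(interval)
    after = io_ticks_ms(device)
    return 100.0 * (after - before) / (interval * 1000.0)

print(util_percent('xvdk'))   # sits near 100.0 while the volume is stalled with queued IO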

Slide 15

Simulating EBS Halting with fusehalt
• Linux Filesystem in Userspace (FUSE)
• Modifications to the fuse disk proxy example code
• Switched on/off via a watchfile
• Mounts the root filesystem at a path of your choice
• All IO system calls pass through callback functions

static int read(const char *path, char *buf, size_t size…)
{
    // spin while the watchfile says we are halted;
    // a background pthread watches the file and updates `halted`
    while (halted == 1) {
        sleep(1);
    }
    // ... system calls to read the file into buf ...
}

Slide 16

fusehalt

fusehalt /mnt/haltroot /tmp/haltfile -f
# /data/mongo now available at /mnt/haltroot/data/mongo
mongod --dbpath /mnt/haltroot/data/mongo

# halt:
touch /tmp/haltfile
# un-halt:
rm /tmp/haltfile

https://github.com/hoffrocket/fusehalt

Slide 17

Disk health monitor
• Hard to simulate %util spikes
• Easy to simulate actual halting with fusehalt
• Time periodic sparse writes that hit all drives in the raid array, in a python script

# seek offsets are spaced so each write lands on a different raid member
for i in range(0, raid_size):
    touchfile.seek(i * offset)
    touchfile.write('1111')

https://gist.github.com/2773364

Slide 18

Automated failover
• Disk health monitor touches a “killfile” when a timeout is breached (see the sketch below)
• Modified mongo to watch a configurable killfile
  • If primary, the replica will stepDown
  • serverStatus has a new “health” block
  • Mongos won’t route slave_ok queries to a replica with bad health
• Generally useful and lighter weight than a replica set reconfig

https://github.com/foursquare/mongo/commits/r2.0.4-fs2
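
The monitor side reduces to timing those sparse writes and touching the killfile when they take too long. A rough sketch under assumed names; the paths, timeout, and raid geometry below are illustrative, not the values from the gist:

import os
import time

KILLFILE = '/tmp/mongod_killfile'     # path the modified mongod watches (illustrative)
TOUCHFILE = '/data/mongo/.diskcheck'  # sparse file living on the raid0 array (illustrative)
RAID_SIZE = 4                         # number of EBS volumes in the array
OFFSET = 4 * 1024 * 1024              # spacing so each write lands on a different member
TIMEOUT = 5.0                         # seconds of blocked IO before declaring the disk unhealthy

while True:
    start = time.time()
    with open(TOUCHFILE, 'w+b') as touchfile:
        for i in range(0, RAID_SIZE):
            touchfile.seek(i * OFFSET)
            touchfile.write(b'1111')
        touchfile.flush()
        os.fsync(touchfile.fileno())      # make sure the writes actually reach the devices
    if time.time() - start > TIMEOUT:
        open(KILLFILE, 'a').close()       # touch the killfile; a primary mongod will stepDown
    time.sleep(10)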

Slide 19

No need to get out of bed

Slide 20

Bare metal future
• Hybrid cloud and hosted
  • AWS Direct Connect
• SSD drives
  • Predictable IO performance
  • No longer need to store all data in memory

Slide 21

Questions?

[email protected]
foursquare.com/jobs