hostnames • Distribute evenly across 3 AZs • Fail over automatically • Single source of truth • All inbound traffic through ELBs • Autoscaling groups Sunday, January 12, 14
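A minimal sketch of what "distribute evenly across 3 AZs" means in practice, assuming a simple round-robin placement. The function name and the host and AZ labels are hypothetical; in a real deployment the autoscaling group does this balancing for you.

```python
from itertools import cycle


def assign_azs(hosts, azs):
    """Place hosts round-robin so per-AZ counts differ by at most one."""
    placement = {az: [] for az in azs}
    for host, az in zip(hosts, cycle(azs)):
        placement[az].append(host)
    return placement


if __name__ == "__main__":
    # Hypothetical host and AZ names, for illustration only.
    print(assign_azs(["api1", "api2", "api3", "api4"], ["az-a", "az-b", "az-c"]))
```

With four hosts and three AZs, one AZ gets two hosts and the other two get one each, so losing any single AZ removes at most half of this small fleet.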
month • 700/sec steady state • Spikes to 10k/sec (15x burst) • PPNS holds sockets open to all Android devices • PDNS to serve Android phone-home IPs (only place we run our own DNS)
per rs • Running 2.4.6 - 2.4.8 (current) • Over 1M collections • Over 200k schemas • Used for structured mobile app data, real-time query analysis, backend billing & analytics, central routing • Fully instrumented provisioning with Chef
set before an election • monitor page faults • continuous compaction on snapshot nodes • perf goes off a cliff after an initial sync (for us -- not most people!)
sharding works ok for small # of large collections • We use a separate mongo replset for routing apps -> shards • Apps are local to a replset • lets us do things like balance performance • Transfer unused apps to graveyard
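The app-to-replset routing described on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: an in-memory dict stands in for the separate routing replset, and the class name, method names, and the "graveyard" label are hypothetical.

```python
class AppRouter:
    """Maps each app to the single replset that holds its data."""

    def __init__(self, routes, graveyard="graveyard"):
        # In production this mapping would live in the routing replset;
        # a plain dict is used here purely for illustration.
        self.routes = dict(routes)
        self.graveyard = graveyard

    def replset_for(self, app_id):
        """Look up which replset serves this app."""
        return self.routes[app_id]

    def move(self, app_id, target):
        """Repoint an app, e.g. to balance load between replsets."""
        self.routes[app_id] = target

    def retire(self, app_id):
        """Send an unused app to the graveyard replset."""
        self.move(app_id, self.graveyard)


if __name__ == "__main__":
    router = AppRouter({"app1": "rs1", "app2": "rs1"})
    router.move("app2", "rs2")   # rebalance a busy app
    router.retire("app1")        # park an unused app
    print(router.routes)
```

Because each app lives entirely on one replset, moving or retiring it only requires copying its data and updating one routing entry.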
logs • Each log event tagged with unique id • Custom service written in Go • Interfaces with Facebook’s Scribe endpoint • Scribe dumps data to daily Hive table partitions • Run aggregations on Hive • Load per-app aggregated data into mongo on AWS • 4k msg/sec steady, 10k msg/sec peak
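To show the shape of the per-app aggregation step, here is a small sketch. The real aggregations run on Hive; this Python version is only an illustration. The event fields (`id`, `app`) are hypothetical, and using the unique id to drop duplicate deliveries is an assumption about why events carry one.

```python
from collections import Counter


def aggregate_per_app(events):
    """Count events per app, skipping any event id already seen."""
    seen = set()
    counts = Counter()
    for event in events:
        if event["id"] in seen:
            continue  # assumed: a repeated id is a duplicate delivery
        seen.add(event["id"])
        counts[event["app"]] += 1
    return dict(counts)


if __name__ == "__main__":
    events = [
        {"id": "e1", "app": "app1"},
        {"id": "e2", "app": "app2"},
        {"id": "e1", "app": "app1"},  # duplicate, ignored
    ]
    print(aggregate_per_app(events))
```

The resulting per-app dictionary is the kind of compact record that would then be loaded into the per-app store.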
5mb! • log GC pauses • compaction strategy • restart every couple weeks to reduce heap contention • we are still on 1.1, so a lot of our issues are fixed in later versions
• ephemeral storage • Currently ~20k ops/sec, down from max ~40k ops/sec • Always disable flush to disk on primary • Use RPS to distribute eth0 software interrupts across multiple cores • watch cat /proc/interrupts, cat /proc/softirqs
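RPS is configured by writing a hexadecimal CPU bitmask to `/sys/class/net/eth0/queues/rx-0/rps_cpus`, and its effect shows up in the `NET_RX` row of `/proc/softirqs`. The sketch below builds such a mask and parses that row. The function names are hypothetical, and the mask helper assumes at most 32 CPUs.

```python
def rps_cpu_mask(cpus):
    """Return the hex bitmask (no 0x prefix) with one bit set per CPU index."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def net_rx_per_cpu(softirqs_text):
    """Extract per-CPU NET_RX counts from the text of /proc/softirqs."""
    lines = softirqs_text.strip().splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        if name.strip() == "NET_RX":
            return dict(zip(cpus, (int(v) for v in rest.split())))
    return {}


if __name__ == "__main__":
    print(rps_cpu_mask(range(4)))  # mask covering CPUs 0-3
    try:
        with open("/proc/softirqs") as f:
            print(net_rx_per_cpu(f.read()))
    except OSError:
        print("no /proc/softirqs on this system")
```

If one CPU's `NET_RX` count dwarfs the others, receive processing is concentrated on a single core, which is the situation RPS is meant to relieve.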
rid of it • ... but Rails • Considered RDS • No chained replication • Visibility is challenging • Even tiny periodic blips impact the API • ... but AZ failover would be sooo nice
• Single consistent source of truth about your infrastructure • (there are up-and-coming Raft alternatives, but none of them are production-ready)