"Beast List": create a checklist of all issues preventing us from having an uptime of 99.9+%. Came up with software/tools we could build. Some concrete issues: Unicorn, our Ruby HTTP server, was a resource hog. Long deploy times.
them. Connections consume precious memory on the DB. Proxies help manage connections effectively across app servers. Side effect: you can monitor your database performance from a central place. We wrote our own proxy for Mongo in Go: https://github.com/facebookgo/dvara
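The core idea behind such a proxy, capping how many backend connections many app-server clients can hold at once, can be sketched as below. This is an illustration only, not how dvara is implemented; `Limiter`, `peakWith`, and the sizes used are hypothetical names and numbers, and the database call is a placeholder.

```go
package main

import (
	"fmt"
	"sync"
)

// Limiter caps how many callers may hold a backend connection at once.
// (Hypothetical type for illustration; not part of dvara.)
type Limiter struct {
	slots chan struct{} // one token per allowed backend connection

	mu    sync.Mutex
	inUse int
	peak  int // highest concurrency observed, useful as a central metric
}

func NewLimiter(max int) *Limiter {
	return &Limiter{slots: make(chan struct{}, max)}
}

// Do blocks until a slot is free, runs fn while holding it, then releases it.
func (l *Limiter) Do(fn func()) {
	l.slots <- struct{}{}
	l.mu.Lock()
	l.inUse++
	if l.inUse > l.peak {
		l.peak = l.inUse
	}
	l.mu.Unlock()

	defer func() {
		l.mu.Lock()
		l.inUse--
		l.mu.Unlock()
		<-l.slots
	}()
	fn()
}

// Peak reports the highest number of slots held at the same time.
func (l *Limiter) Peak() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.peak
}

// peakWith pushes `clients` concurrent requests through a limiter of size max
// and returns the peak number of simultaneous backend "connections".
func peakWith(clients, max int) int {
	l := NewLimiter(max)
	var wg sync.WaitGroup
	for i := 0; i < clients; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.Do(func() { /* placeholder: talk to the database here */ })
		}()
	}
	wg.Wait()
	return l.Peak()
}

func main() {
	fmt.Println("peak backend connections:", peakWith(50, 5))
}
```

Because every request passes through one place, the same spot is a natural home for counters such as the peak value above.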
everything. We started with Ganglia -> UI froze after 100 metrics. We use Facebook's Scuba and ODS. Find a metrics service; hopefully you don't have to write one. https://www.facebook.com/notes/facebook-engineering/under-the-hood-data-diving-with-scuba/10150599692628920
show under live traffic. We had a mechanism to run live traffic on a test cluster. Tools to compare results from the test and prod clusters were invaluable. We shadowed traffic for months on some endpoints before releasing to 100% of users. Our setup -> a custom Go HTTP proxy that sends requests to both the test and prod clusters. Works great for read APIs. The setup for write APIs is more complicated: it needs a database snapshot and a DB compare.
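A minimal sketch of the shadowing idea for read APIs, assuming each cluster can be represented as an `http.Handler` (in a real proxy these could be `httputil.ReverseProxy` instances). `Shadow`, `OnDiff`, `fixed`, `runOnce`, and the `/read` path are hypothetical names, and `httptest.ResponseRecorder` is used only as a convenient response buffer. This is not the actual proxy described above.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// Shadow answers every request from Prod and replays a copy against Test.
type Shadow struct {
	Prod, Test http.Handler
	// OnDiff is called when status code or body differ between clusters.
	OnDiff func(r *http.Request, prod, test *httptest.ResponseRecorder)
}

// withBody clones the request and gives the clone its own copy of the body.
func withBody(r *http.Request, body []byte) *http.Request {
	c := r.Clone(r.Context())
	c.Body = io.NopCloser(bytes.NewReader(body))
	return c
}

func (s *Shadow) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "cannot read request body", http.StatusBadRequest)
		return
	}
	prod := httptest.NewRecorder()
	s.Prod.ServeHTTP(prod, withBody(r, body))

	// For brevity the replay runs inline; a real proxy would likely run it
	// off the request path so the test cluster cannot slow down production.
	test := httptest.NewRecorder()
	s.Test.ServeHTTP(test, withBody(r, body))

	if s.OnDiff != nil &&
		(prod.Code != test.Code || !bytes.Equal(prod.Body.Bytes(), test.Body.Bytes())) {
		s.OnDiff(r, prod, test)
	}

	// Only the production response ever reaches the caller.
	for k, v := range prod.Header() {
		w.Header()[k] = v
	}
	w.WriteHeader(prod.Code)
	w.Write(prod.Body.Bytes())
}

// fixed returns a stand-in cluster that always answers with the same body.
func fixed(body string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, body)
	})
}

// runOnce sends one request through a Shadow and reports what the client
// received and how many mismatches were flagged.
func runOnce(prodBody, testBody string) (clientBody string, diffs int) {
	s := &Shadow{
		Prod:   fixed(prodBody),
		Test:   fixed(testBody),
		OnDiff: func(*http.Request, *httptest.ResponseRecorder, *httptest.ResponseRecorder) { diffs++ },
	}
	rec := httptest.NewRecorder()
	s.ServeHTTP(rec, httptest.NewRequest("GET", "/read", nil))
	return rec.Body.String(), diffs
}

func main() {
	body, diffs := runOnce("v1", "v2")
	fmt.Println("client saw:", body, "diffs flagged:", diffs)
}
```

Write APIs do not fit this shape, since replaying a write changes state; that is where the snapshot and DB compare mentioned above come in.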
/ production hooks to control the rollout of new code to a fraction of users/traffic. Good way to gain confidence. Our in-house Go system is called Decider -> based on Redis. Important to clean up old code after the rollout to avoid code smells.
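A percentage rollout of this kind can be sketched as follows. This is not Decider itself: the `Rollout` type, the FNV hash bucketing, and the feature and user names are assumptions for illustration, and the percentages live in memory here where a Redis-backed system would read them from Redis.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// Rollout maps feature names to the percentage of users that should see them.
// (Hypothetical type; an in-memory stand-in for a Redis-backed store.)
type Rollout struct {
	mu      sync.RWMutex
	percent map[string]uint32
}

func NewRollout() *Rollout {
	return &Rollout{percent: make(map[string]uint32)}
}

// Set changes the rollout percentage (0-100) for a feature at runtime.
func (r *Rollout) Set(feature string, pct uint32) {
	if pct > 100 {
		pct = 100
	}
	r.mu.Lock()
	r.percent[feature] = pct
	r.mu.Unlock()
}

// Enabled reports whether the feature is on for this user. Hashing the
// feature together with the user ID keeps each user's answer stable across
// requests and gives each feature an independent slice of users.
func (r *Rollout) Enabled(feature, userID string) bool {
	r.mu.RLock()
	pct := r.percent[feature] // unknown features default to 0%
	r.mu.RUnlock()

	h := fnv.New32a()
	h.Write([]byte(feature))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") differ
	h.Write([]byte(userID))
	return h.Sum32()%100 < pct
}

func main() {
	r := NewRollout()
	r.Set("new-query-path", 10)
	for _, u := range []string{"user-1", "user-2", "user-3"} {
		fmt.Println(u, r.Enabled("new-query-path", u))
	}
}
```

Once a feature sits at 100% and is stable, the `Enabled` check and the old branch can both be deleted, which is the cleanup step noted above.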
engineer should deploy when needed. We moved away from a fixed Monday release to releasing all the time. Deploy many small changes as often as possible. Deployctl: in-house tool written in Python to deploy Go (ZooKeeper-based). Supports deploy locking and canarying.
every binary for debugging. Exposes health checks / Git version / build time / uptime. Can connect pprof to it (thanks, Golang). Can activate verbose logging on a particular server, which logs every request/response pair.
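A debug endpoint along these lines might look like the sketch below. The `/debug/status` path, the `status` fields, and the variable names are assumptions, not the authors' actual layout; the pprof handlers come from the standard `net/http/pprof` package, and the version strings are expected to be stamped in at build time with the linker's `-X` flag.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
	"time"
)

// Stamped at build time, for example:
//   go build -ldflags "-X main.gitVersion=$(git rev-parse HEAD)"
var (
	gitVersion = "unknown"
	buildTime  = "unknown"
	startTime  = time.Now()
)

// status is the (hypothetical) payload served by the debug endpoint.
type status struct {
	Healthy       bool    `json:"healthy"`
	GitVersion    string  `json:"git_version"`
	BuildTime     string  `json:"build_time"`
	UptimeSeconds float64 `json:"uptime_seconds"`
}

// newDebugMux builds the debug-only routes. Serve it on a private port.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/status", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(status{
			Healthy:       true, // a real check would probe dependencies
			GitVersion:    gitVersion,
			BuildTime:     buildTime,
			UptimeSeconds: time.Since(startTime).Seconds(),
		})
	})
	// Standard pprof handlers, so `go tool pprof` can attach over HTTP.
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// fetchStatus exercises the status route in-process, without a socket.
func fetchStatus() (int, status, error) {
	rec := httptest.NewRecorder()
	newDebugMux().ServeHTTP(rec, httptest.NewRequest("GET", "/debug/status", nil))
	var st status
	err := json.NewDecoder(rec.Body).Decode(&st)
	return rec.Code, st, err
}

func main() {
	code, st, err := fetchStatus()
	fmt.Println(code, st.GitVersion, st.Healthy, err)
	// In a real binary: go http.ListenAndServe("127.0.0.1:6060", newDebugMux())
}
```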
throughout our codebase. Context can be used to carry values such as ReqID and AppID. We use context to pass in a ReqID that is added to the query comment on Mongo, which helps us trace a slow query in the log back to the request that issued it. Support context objects when writing a new library. Golang has a great context package: golang.org/x/net/context.
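The ReqID flow can be sketched as below, using the standard library `context` package (the successor to the `golang.org/x/net/context` package). The helper names and the `reqid=` comment format are assumptions; the filter is shown as a plain map that carries MongoDB's `$comment` operator, and no driver calls are made.

```go
package main

import (
	"context"
	"fmt"
)

// ctxKey is unexported so other packages cannot collide with this key.
type ctxKey struct{}

// WithReqID returns a context that carries the request ID.
func WithReqID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, ctxKey{}, id)
}

// ReqID extracts the request ID, if one was attached.
func ReqID(ctx context.Context) (string, bool) {
	id, ok := ctx.Value(ctxKey{}).(string)
	return id, ok && id != ""
}

// findFilter copies the filter and, when the context carries a ReqID, adds
// it as a $comment so the ID shows up next to the query in the slow log.
// The "reqid=" format is an assumption for this sketch.
func findFilter(ctx context.Context, filter map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(filter)+1)
	for k, v := range filter {
		out[k] = v
	}
	if id, ok := ReqID(ctx); ok {
		out["$comment"] = "reqid=" + id
	}
	return out
}

// demoFilter builds a sample filter for a request with the given ID
// (an empty ID means no ReqID was attached).
func demoFilter(id string) map[string]interface{} {
	ctx := context.Background()
	if id != "" {
		ctx = WithReqID(ctx, id)
	}
	return findFilter(ctx, map[string]interface{}{"score": 10})
}

func main() {
	fmt.Println(demoFilter("req-42"))
}
```

Taking `ctx` as the first argument of every library call is what lets the ID travel from the HTTP handler down to the query without extra plumbing.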
later DB will be the bottleneck. Understand the internals of your database from the start: query planner, DB caches (row/block cache), indexing trade-offs, major locks. Start hacking on the DB codebase; you can add custom metrics, which is usually easier than it seems. The Parse.com and RocksDB teams at Facebook built a new storage engine for Mongo: Mongo-Rocks. http://blog.parse.com/announcements/mongodb-rocksdb-parse/
- only at boot time: https://github.com/facebookgo/inject. Lots of small libraries: https://github.com/facebookgo/. Try not to fork; we submit patches upstream.
tests. Our Go test suite takes less than 2 minutes to run. Parallel test runs are beautiful. We boot multiple mongo/memcache instances in memory in our test binary. http://github.com/facebookgo/mgotest https://github.com/facebookgo/mctest