SimpleGeo. Backend Engineer at Flickr before that. Backend and Frontend Engineer at Yahoo! Ops/Tools before that. Philosophy, Economics, and French major before that. @mihasya
unexpected sequences, and either not visible or not immediately comprehensible." Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 78). Kindle Edition.
of us. [...] As systems grow in size and in the number of diverse functions they serve, and are built to function in ever more hostile environments, increasing their ties to other systems, they experience more and more incomprehensible or unexpected interactions. They become more vulnerable to unavoidable system accidents." Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 72). Kindle Edition.
radioactive water was not traveling to the tank they intended, but because of complex flow and pressure interactions, was going to a different, wrong tank, which also overflowed, this time in the auxiliary building." Charles Perrow. Normal Accidents: Living with High-Risk Technologies (pp. 22-23). Kindle Edition.
executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network." "Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region" http://aws.amazon.com/message/65648/
most common source of unexpected interaction. Resist the temptation to double up on roles. Use queues and caches as buffers. NOTE: those are complex subsystems of their own.
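A minimal sketch of the "queue as a buffer" idea, assuming an in-process bounded queue from Python's standard library stands in for whatever real queueing system sits between two components. The helper names `offer` and `drain` are hypothetical, chosen only for illustration. The producer never blocks on the consumer: when the buffer is full it sheds load rather than letting the slowdown propagate upstream.

```python
import queue
import threading


def offer(buffer, item):
    """Try to enqueue without blocking; return True if the item was accepted."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        # The buffer is full: shed load instead of stalling the producer.
        return False


def drain(buffer, handle):
    """Consume items until a None sentinel arrives."""
    while True:
        item = buffer.get()
        if item is None:
            break
        handle(item)


# A small bound makes the overflow behaviour easy to see.
buffer = queue.Queue(maxsize=3)

# No consumer is running yet, so only the first three items fit.
accepted = [offer(buffer, n) for n in range(5)]

# Start the consumer afterwards; it works through whatever was buffered.
processed = []
worker = threading.Thread(target=drain, args=(buffer, processed.append))
worker.start()
buffer.put(None)  # sentinel: blocks until there is room, then stops the worker
worker.join()
```

The bound is the important design choice here: an unbounded buffer hides a slow consumer until memory runs out, which is the kind of not immediately visible interaction the quotes above warn about.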
with Configuration Management Management. Decouple from your platform (OS/kernel). Easy to test/bench potential candidates. Easy to migrate if you find a winner. This is especially important when dealing with the cloud. Automate as much of the deploy/bootstrap process as possible. Probably won't help much during a provider outage due to stampede. BUT: DirectConnect. You might not always be in the cloud...
Hot-hot keeps you on your toes. Simplifies, not just for the cloud. Yahoo! is now forgoing datacenter features like HVAC: "If it gets too hot in Washington, turn that DC off for a while." I'm sure they're not the only ones.
block for EC2. This is the level they (theoretically) decouple at. They are probably thinking along the same lines we are: must be able to turn off one AZ without impact in the other.
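A minimal sketch of "turn off one AZ without impact in the other", assuming a client-side router that round-robins over hosts grouped by zone. The class `ZoneRouter`, the zone names, and the host names are all hypothetical; this is not how EC2 or any particular load balancer implements it.

```python
import itertools


class ZoneRouter:
    """Round-robin over hosts, skipping any zone that has been turned off."""

    def __init__(self, zones):
        # zones: mapping of zone name -> list of host names (hypothetical data)
        self.zones = {name: list(hosts) for name, hosts in zones.items()}
        self.disabled = set()
        self._counter = itertools.count()

    def disable(self, zone):
        """Take a whole zone out of rotation."""
        self.disabled.add(zone)

    def enable(self, zone):
        """Put a zone back into rotation."""
        self.disabled.discard(zone)

    def live_hosts(self):
        """Hosts in enabled zones, in a stable (sorted-by-zone) order."""
        return [
            host
            for zone, hosts in sorted(self.zones.items())
            if zone not in self.disabled
            for host in hosts
        ]

    def pick(self):
        """Return the next host, or fail loudly if every zone is off."""
        hosts = self.live_hosts()
        if not hosts:
            raise RuntimeError("no zones available")
        return hosts[next(self._counter) % len(hosts)]


router = ZoneRouter({"zone-a": ["a1", "a2"], "zone-b": ["b1", "b2"]})
router.disable("zone-a")
picks = [router.pick() for _ in range(4)]
```

With `zone-a` disabled, every pick lands in `zone-b`. The sketch only works if each zone can carry the full load on its own, which is the capacity question the hot-hot setup forces you to answer ahead of time.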
Own Adventure. Node.js and Python. Some people just hate Node.js. Can be anything, as long as Gate can talk to it (another reason to decouple). Highly specialized.