
Building a cloud service on a cloud infrastructure. Also, cloud.

mihasya
October 21, 2011

Transcript

  1. Building a cloud service on a cloud infrastructure. Also, cloud.

    Mikhail Panchenko, Surge 2011
  2. Who Am I?

    Pancakes. Infrastructure Engineer at SimpleGeo. Backend Engineer at Flickr before that. Backend and Frontend Engineer at Yahoo! Ops/Tools before that. Philosophy, Economics, and French major before that. @mihasya [email protected]
  3. Tools for mobile/geo developers

    Primarily focused on services, with some data-oriented APIs. PaaS, I guess? I've lost track a bit. Availability and redundancy are part of the brand: our outage = your outage. No pressure.
  4. Agenda

    Goals. A little bit of theory. Challenges in The Cloud. General Architecture. Implementation Details.
  5. "Complex interactions are those of unfamiliar sequences, or unplanned and unexpected sequences, and either not visible or not immediately comprehensible."

    Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 78). Kindle Edition.
  6. "The notion of baffling interactions is increasingly familiar to all of us. [...] As systems grow in size and in the number of diverse functions they serve, and are built to function in ever more hostile environments, increasing their ties to other systems, they experience more and more incomprehensible or unexpected interactions. They become more vulnerable to unavoidable system accidents."

    Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 72). Kindle Edition.
  7. "The beauty of this is its simplicity. Once a plan gets too complex, everything can go wrong."

    Walter Sobchak, The Big Lebowski
  8. Three Mile Island

    "... they found that radioactive water was not traveling to the tank they intended, but because of complex flow and pressure interactions, was going to a different, wrong tank, which also overflowed, this time in the auxiliary building." Charles Perrow. Normal Accidents: Living with High-Risk Technologies (pp. 22-23). Kindle Edition.
  9. Amazon Web Services

    "The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network." "Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region" http://aws.amazon.com/message/65648/
  10. Common Theme

    Previously independent systems become coupled as a result of unanticipated interactions, leading to fundamentally surprising results.
  11. When pumping radioactive water into the wrong tank, the behavior of the program is undefined
  12. Tightly coupled to a complex system over which you have no control and into which you have no insight
  13. "The notion of baffling interactions is increasingly familiar to all of us. [...] As systems grow in size and in the number of diverse functions they serve, and are built to function in ever more hostile environments, increasing their ties to other systems, they experience more and more incomprehensible or unexpected interactions. They become more vulnerable to unavoidable system accidents."

    Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 72). Kindle Edition.
  14. Decouple Your Subsystems

    Shared resources are the most common source of unexpected interaction. Resist the temptation to double up on roles. Use queues and caches as buffers. NOTE: those are complex subsystems of their own.
  15. Decouple Your Subsystems

    Explicit decoupling: CPU affinity. Webserver on cores 1-7; SSH etc. on core 8. Crude, but gets the job done. More robust solution: containers.
  16. Decouple Your Functionality

    Service architecture: each service does one thing well. Easier to measure, understand, and accommodate resource demands. Reduces the potential for interactions and cross-functional failure.
  17. Decouple from Your Environment with Configuration Management

    Decouple from your platform (OS/kernel): easy to test/bench potential candidates, easy to migrate if you find a winner. This is especially important when dealing with the cloud. Automate as much of the deploy/bootstrap process as possible. It probably won't help much during a provider outage, due to the stampede. BUT: DirectConnect. You might not always be in the cloud...
  18. Decouple Your Datacenters

    The most robust redundancy mechanism. Hot-hot keeps you on your toes. It simplifies things, and not just in the cloud. Yahoo! is now forgoing datacenter features like HVAC: "If it gets too hot in Washington, turn that DC off for a while." I'm sure they're not the only ones.
  19. Decouple Your Datacenters

    "AZ" (Availability Zone): the basic building block for EC2. This is the level at which they (theoretically) decouple. They are probably thinking along the same lines we are: you must be able to turn off one AZ without impact on the other.
  20. Every datacenter as an independent microcosm of your overall architecture
  21. Really simple operational steps for stressful tasks & situations
  22. ELB (Elastic Load Balancing)

    Dynamic load balancing. Flexible virtual IP. Easy to add/remove AZs. Uses healthchecks to automatically evict nodes.
  23. Gate - "Layer 8 Proxy"

    Lightweight Node.js daemon. OAuth. Rate limiting. Basic routing to actual services.
  24. Services - Pick Your Own Adventure

    Node.js and Python (some people just hate Node.js). Can be anything, as long as Gate can talk to it (another reason to decouple). Highly specialized.
  25. RabbitMQ

    A grenade for our knife-fight. Very flexible, more than we need. Simplification candidate. New persister in >= 1.3: degradation over failure. See talk at 1:30PM.
  26. Cassandra

    A mostly textbook DHT. Homogeneous distributed model. Random load distribution. Partition tolerance. A perfect foundation for our architecture.
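Slide 14's advice to "use queues, caches as buffers" (and the "degradation over failure" theme of slide 25) can be illustrated with a bounded in-process queue. This is a minimal sketch, not anything from SimpleGeo's stack: the `BufferedWriter` class and its shed-when-full policy are hypothetical choices made for illustration. A full buffer drops new work and counts it instead of blocking the caller, so a slow consumer lowers throughput rather than stalling the producer.

```python
import queue


class BufferedWriter:
    """Hypothetical buffer between a request handler and a slower backend.

    The bounded queue absorbs bursts; when it is full, new work is shed
    (and counted) instead of blocking the caller.
    """

    def __init__(self, capacity: int) -> None:
        self._queue: "queue.Queue[str]" = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def submit(self, item: str) -> bool:
        """Enqueue without blocking; return False if the item was shed."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_items: int) -> list:
        """Consumer side: pull up to max_items items without blocking."""
        items = []
        while len(items) < max_items:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return items
```

Whether to shed, block, or spill to disk when the buffer fills is the real design decision; as the slide notes, a queue is a complex subsystem of its own, and its full-buffer behavior is exactly the kind of interaction worth deciding up front.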
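The CPU-affinity split on slide 15 ("Webserver on 1-7; SSH etc on 8") can be sketched with Python's Linux-only `os.sched_setaffinity`. This assumes an 8-core Linux host; Linux numbers those cores 0-7, so the slide's 1-7 and 8 become 0-6 and 7. The helper names `split_cores` and `pin_current_process` are hypothetical. From a shell, the util-linux `taskset -c 0-6 <command>` achieves the same pinning.

```python
import os


def split_cores(n_cpus: int, reserved: int = 1) -> tuple:
    """Split CPU ids 0..n_cpus-1 into (application cores, reserved cores).

    The reserved cores are the highest-numbered ones, e.g. on 8 cores the
    webserver gets {0..6} and SSH/admin tooling gets {7}.
    """
    if not 0 < reserved < n_cpus:
        raise ValueError("need at least one application core and one reserved core")
    app = set(range(n_cpus - reserved))
    admin = set(range(n_cpus - reserved, n_cpus))
    return app, admin


def pin_current_process(cores: set) -> bool:
    """Pin the calling process to `cores`; return False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):  # only available on Linux
        return False
    os.sched_setaffinity(0, cores)  # pid 0 means the calling process
    return True
```

As the slide says, this is crude: it isolates CPU only, while memory, disk, and network are still shared, which is why containers are the more robust option.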
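Slides 18 and 19 require being able to turn off one AZ without affecting the others, which has a direct capacity cost. If peak load $L$ is spread evenly over $n$ AZs and the system must survive losing $f$ of them, each AZ needs capacity $\lceil L / (n - f) \rceil$. The even-spread assumption and the function name below are illustrative, not from the talk.

```python
import math


def per_az_capacity(peak_load: float, n_azs: int, tolerated_losses: int = 1) -> int:
    """Capacity each AZ must be provisioned for so that the surviving AZs
    can absorb peak_load after `tolerated_losses` AZs are switched off.

    Assumes load spreads evenly across whichever AZs survive.
    """
    survivors = n_azs - tolerated_losses
    if survivors < 1:
        raise ValueError("cannot lose every AZ")
    return math.ceil(peak_load / survivors)
```

For a peak of 900 units across 3 AZs, each AZ needs 450 (1350 total, 50% headroom). With only 2 AZs, each needs 900 (100% headroom). Running hot-hot across more AZs lowers the overhead of keeping every datacenter a self-sufficient microcosm.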
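Slide 22 notes that ELB "uses healthchecks to automatically evict nodes." Classic ELB health checks are configured with healthy and unhealthy thresholds (consecutive check counts). The sketch below models that idea in a simplified form. It is not ELB's implementation, and the `Pool`/`Backend` names are hypothetical. A node leaves rotation after `unhealthy_threshold` consecutive failures and returns only after `healthy_threshold` consecutive passes.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    in_service: bool = True
    failures: int = 0   # consecutive failed checks
    successes: int = 0  # consecutive passed checks


@dataclass
class Pool:
    unhealthy_threshold: int = 2
    healthy_threshold: int = 3
    backends: dict = field(default_factory=dict)

    def add(self, name: str) -> None:
        self.backends[name] = Backend(name)

    def record_check(self, name: str, passed: bool) -> None:
        """Apply one health-check result, evicting or readmitting the node."""
        b = self.backends[name]
        if passed:
            b.successes += 1
            b.failures = 0
            if not b.in_service and b.successes >= self.healthy_threshold:
                b.in_service = True
        else:
            b.failures += 1
            b.successes = 0
            if b.in_service and b.failures >= self.unhealthy_threshold:
                b.in_service = False

    def routable(self) -> list:
        """Names of nodes currently eligible for traffic."""
        return sorted(n for n, b in self.backends.items() if b.in_service)
```

Requiring several consecutive passes before readmission keeps a flapping node from bouncing in and out of rotation.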
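Slide 23 lists rate limiting and basic routing among Gate's jobs but does not say how either works. The following is a hedged sketch of one common approach: a per-consumer token bucket plus a longest-prefix route table. The algorithm choice, the `ROUTES` table, and the `*.internal` hosts are all hypothetical, and OAuth verification is omitted. The clock is injectable so the limiter can be tested deterministically.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Allows `rate` requests per second on average, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Limiter:
    """One bucket per consumer key (e.g. the OAuth consumer)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._args = (rate, capacity, clock)
        self._buckets = {}

    def allow(self, consumer_key: str) -> bool:
        if consumer_key not in self._buckets:
            self._buckets[consumer_key] = TokenBucket(*self._args)
        return self._buckets[consumer_key].allow()


# Hypothetical route table: URL path prefix -> backend service address.
ROUTES = {
    "/places": "http://places.internal:8080",
    "/geocode": "http://geocode.internal:8080",
}


def route(path: str) -> Optional[str]:
    """Return the backend for the longest matching path prefix, or None."""
    matches = [p for p in ROUTES if path == p or path.startswith(p + "/")]
    return ROUTES[max(matches, key=len)] if matches else None
```

Keeping the proxy this thin is what lets the services behind it be written in anything, as slide 24 puts it, as long as Gate can talk to them.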
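Slide 26 calls Cassandra "a mostly textbook DHT" with a homogeneous model (every node plays the same role) and random load distribution. Below is a simplified token ring. Keys are hashed with MD5, in the spirit of Cassandra's RandomPartitioner. The first node whose token is greater than or equal to the key's token owns the key, wrapping around the ring, and the next nodes clockwise hold the additional replicas, much like SimpleStrategy. This is an illustrative sketch, not Cassandra's code. The real partitioner's token range and rack/datacenter-aware placement differ.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """128-bit MD5-based token (simplified RandomPartitioner-style hashing)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, node_tokens: dict) -> None:
        # Sort nodes by token so ownership can be found by binary search.
        pairs = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in pairs]
        self._nodes = [node for _, node in pairs]

    def replicas_for_token(self, token: int, rf: int) -> list:
        """Owner is the first node with token >= `token` (wrapping past the
        end); the next rf-1 nodes clockwise hold the other replicas."""
        n = len(self._nodes)
        if not 0 < rf <= n:
            raise ValueError("replication factor must be between 1 and the node count")
        start = bisect.bisect_left(self._tokens, token) % n
        return [self._nodes[(start + i) % n] for i in range(rf)]

    def replicas(self, key: str, rf: int = 3) -> list:
        return self.replicas_for_token(token_for(key), rf)
```

Because hashing spreads keys uniformly and every node runs the same logic, losing a node only shifts its range to its neighbors, which is the partition-tolerant, homogeneous foundation the slide describes.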