Building a Platform on AWS

Amazon Web Services (AWS) is widely used by technology companies to enable elastic workloads and store large amounts of data. When we started building on AWS, we expected to have issues but didn’t fully appreciate that operating in AWS at scale would mean new challenges in error handling and building for failure. Join us to hear all about our experience with the perils and pitfalls of starting and stopping hundreds of instances every day. We’ll pass on our hard-won knowledge so that you can make your own AWS deployment a little more consistent.

Philip Corliss

March 01, 2016

Transcript

  1. PLATFORMS ON AWS The good, the bad, and the eventually consistent. Slides: https://speakerdeck.com/pcorliss/
  2. WHO IS THIS? • Philip Corliss • @pcorliss (Gmail, Twitter, Github) • Cheese Enthusiast • Engineering Manager • Braintree
  3. Outages “Some internal issues which do not have a widespread effect on our customers are not posted on the status page. These are small scale issues which affected connectivity for ~10 minutes.” AWS Case #1614444181
  4. Random 500s “When we would be concerned is when the error rate for your requests gets too high (say 10% or more over an extended period of time like 15 minutes or so).” AWS Case #1637036441
  5. ELB TCP Reset “ELB team has identified an issue where connections in the middle of a TCP handshake are reset when backend instances are registered or deregistered… ELB team is currently testing a fix and hopes to apply it globally in the coming months.” AWS Case #1505350111
  6. Insufficient Capacity We currently do not have sufficient t2.micro capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity.
  7. Degraded Instances EC2 has detected degradation of the underlying hardware hosting one or more of your Amazon EC2 instances in the us-west-2 region. Due to this degradation, your instance(s) could already be unreachable. Running instances will be stopped or terminated…
  8. Other Stuff • Instance IDs (Eventual) • us-east-1 (Beta) • 4pm CST (Garbage Time) • CPU Credit Balance • Broken on Boot
  9. WHO’S THIS GUY? • Philip Corliss • @pcorliss (Gmail, Twitter, Github) • Cheese Enthusiast • Engineering Manager • Braintree
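
Several of the failure modes on the slides above (random 500s, instance IDs that are only eventually visible, insufficient capacity) are transient. One common way to cope with transient failures is to retry with exponential backoff and jitter. The Python sketch below is not taken from the talk: `TransientError`, `backoff_delays`, and `call_with_retries` are hypothetical names introduced for illustration, and the default base delay, cap, and retry count are assumptions rather than recommended values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. an HTTP 500)."""


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Yield one 'full jitter' delay per retry.

    Delay n is rng() * min(cap, base * 2**n), so with the default rng it is
    drawn uniformly from [0, min(cap, base * 2**n)).
    """
    for n in range(attempts):
        yield rng() * min(cap, base * (2 ** n))


def call_with_retries(operation, attempts=5, sleep=time.sleep, **backoff):
    """Call operation(); on TransientError, wait and retry up to `attempts` times.

    After the retries are used up, one last call is made and any error it
    raises propagates to the caller.
    """
    for delay in backoff_delays(attempts, **backoff):
        try:
            return operation()
        except TransientError:
            sleep(delay)
    return operation()
```

In practice, `operation` would wrap a single API call and translate only the errors known to be retryable into `TransientError`. Retrying everything blindly can hide real faults such as bad credentials or malformed requests.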