
Breaking Point: Building Scalable, Resilient APIs

Mark Hibberd
February 10, 2015

The default position of a distributed system is failure. Networks fail. Machines fail. Systems fail.

The problem is that APIs are, at their core, a complex distributed system. At some point in their lifetime, APIs will likely have to scale: maybe due to high volume, large data sets, or a high number of clients, or maybe just scale to rapid change. When this happens, we want our systems to bend, not break.

This talk is a tour of how systems fail, combining analysis of how complex systems break at scale with anecdotes capturing the lighter side of catastrophic failure. We will then ground this with a set of practical tools and techniques to deal with building and testing complex systems for reliability.




  1. “A common mistake that people make when trying to design something completely foolproof was to underestimate the ingenuity of complete fools.” Douglas Adams, Mostly Harmless (1992)
  2. OCTOBER 27, 2012. Bug Triggers Cascading Failures, Causes Major Amazon Outage. https://aws.amazon.com/message/680342/
  3. At 10:00AM PDT Monday, a small number of Amazon Elastic Block Store (EBS) volumes in one of our five Availability Zones in the US-East Region began seeing degraded performance, and in some cases, became “stuck”.
  4. 17%

  5. 17%. 100k req/s: 12.5k >>> 14.3k >>> 17k req/s. 4.5% extra traffic means a 36% load increase on each server.
  6. 25%. 100k req/s: 12.5k >>> 14.3k >>> 17k >>> 50k req/s. 12.5% extra traffic means a 300% load increase on each server.
  7. JUNE 11, 2010. A Perfect Storm.....of Whales: Heavy Loads Causes Series of Twitter Outages During World Cup. http://engineering.twitter.com/2010/06/perfect-stormof-whales.html
  8. Since Saturday, Twitter has experienced several incidences of poor site performance and a high number of errors due to one of our internal sub-networks being over-capacity.
  9. Critical licensing service: 100 million+ active users a day, millions of $$$. A couple of “simple” services. Thick clients, non-updatable, load-balanced on client.
  10. NOVEMBER 14, 2014. Link Imbalance: Mystery Issue Causes Metastable Failure State. https://code.facebook.com/posts/1499322996995183/solving-the-mystery-of-link-imbalance-a-metastable-failure-state-at-scale/
  11. A Good Incident Report Helps Others Learn From Your Mistake, And Ensures You Really Understand What Went Wrong
  12. 5 Steps To A Good Incident Report: 1. Summary & Impact; 2. Timeline; 3. Root Cause; 4. Resolution and Recovery; 5. Corrective and Preventative Measures. https://sysadmincasts.com/episodes/20-how-to-write-an-incident-report-postmortem
  13. “Anything that happens, happens. Anything that, in happening, causes something else to happen, causes something else to happen. Anything that, in happening, causes itself to happen again, happens again. It doesn’t necessarily do it in chronological order, though.” Douglas Adams, Mostly Harmless (1992)
  14. “A beach house isn't just real estate. It's a state of mind.” Douglas Adams, Mostly Harmless (1992)
  15. services: { auth: {…}, profile: {…}, recommend: {…}, friends: {…} } { version: {…}, stats: {…}, source: {…} }
  16. @ambiata: we deal with ingesting and processing lots of data, 100s of TB per day per customer. Scientific experiment and measurement is key; experiments affect users directly; researchers and non-specialist engineers produce code.
  17. [diagram: query /chord {id: ab123} → datastore; chord report; result] Behaviour change through in-production testing.
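The per-server arithmetic on slides 5 and 6 can be sketched as a short calculation. This is a minimal illustration under my own assumptions: a load-balanced pool that starts at 8 servers carrying 100k req/s evenly (12.5k each, matching the slide figures), with the function names being mine. It shows how per-server load grows nonlinearly as servers drop out; the slides additionally fold in a few percent of extra traffic, which this sketch omits.

```python
def per_server_load(total_rps: float, servers: int) -> float:
    """Requests per second each server handles when traffic is spread evenly."""
    return total_rps / servers

def load_increase_pct(total_rps: float, baseline_servers: int, alive_servers: int) -> float:
    """Percentage load increase on each survivor versus the healthy baseline."""
    baseline = per_server_load(total_rps, baseline_servers)
    current = per_server_load(total_rps, alive_servers)
    return (current / baseline - 1) * 100

# 100k req/s over 8 servers is 12.5k each. Losing servers concentrates
# the same traffic on the survivors: 7 alive -> ~14.3k, 6 alive -> ~16.7k,
# and by the time only 2 remain each carries 50k, a 300% increase.
for alive in (8, 7, 6, 2):
    rps = per_server_load(100_000, alive)
    bump = load_increase_pct(100_000, 8, alive)
    print(f"{alive} servers: {rps:,.0f} req/s each (+{bump:.0f}%)")
```

The point the slides make falls out directly: each failure raises the load on every survivor, which makes the next failure more likely — the feedback loop behind cascading failure.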