
Breaking Point: Building Scalable, Resilient APIs (ScalaSyd)

Mark Hibberd
September 09, 2015


The default position of a distributed system is failure. Networks fail. Machines fail. Systems fail.

The problem is that APIs are, at their core, complex distributed systems. At some point in their lifetime, APIs will likely have to scale, whether because of high volume, large data sets, a large number of clients, or simply rapid change. When this happens, we want our systems to bend, not break.

This talk is a tour of how systems fail, combining analysis of how complex systems break at scale with anecdotes capturing the lighter side of catastrophic failure. We will then ground this with a set of practical tools and techniques to deal with building and testing complex systems for reliability.




  1. “A common mistake that people make when trying to design something completely foolproof was to underestimate the ingenuity of complete fools.” Douglas Adams - Mostly Harmless (1992)
  2. OCTOBER 27, 2012 Bug triggers Cascading Failures Causes Major Amazon Outage https://aws.amazon.com/message/680342/
  3. At 10:00AM PDT Monday, a small number of Amazon Elastic Block Store (EBS) volumes in one of our five Availability Zones in the US-East Region began seeing degraded performance, and in some cases, became “stuck”
  4. 17%

  5. 17%. 100k req/s: 12.5k >>> 14.3k >>> 17k req/s. 4.5% extra traffic means a 36% load increase on each server.
  6. 25%. 100k req/s: 12.5k >>> 14.3k >>> 17k >>> 50k req/s. 12.5% extra traffic means a 300% load increase on each server.
  7. JUNE 11, 2010 A Perfect Storm.....of Whales Heavy Loads Causes Series of Twitter Outages During World Cup http://engineering.twitter.com/2010/06/perfect-stormof-whales.html
  8. Since Saturday, Twitter has experienced several incidences of poor site performance and a high number of errors due to one of our internal sub-networks being over-capacity.
  9. Critical licensing service, 100 million+ active users a day, millions of $$$. A couple of “simple” services. Thick clients, non-updatable, load-balanced on client.
  10. NOVEMBER 14, 2014 Link Imbalance Mystery Issue Causes Metastable Failure State https://code.facebook.com/posts/1499322996995183/solving-the-mystery-of-link-imbalance-a-metastable-failure-state-at-scale/
  11. “Anything that happens, happens. Anything that, in happening, causes something else to happen, causes something else to happen. Anything that, in happening, causes itself to happen again, happens again. It doesn’t necessarily do it in chronological order, though.” Douglas Adams - Mostly Harmless (1992)
  12. “A beach house isn't just real estate. It's a state of mind.” Douglas Adams - Mostly Harmless (1992)
  13. services: { auth: {…}, profile: {…}, recommend: {…}, friends: {…} } { version: {…}, stats: {…}, source: {…} }
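
The per-server figures on slides 5 and 6 (12.5k, 14.3k, 17k, 50k req/s) can be reproduced with a little arithmetic. The sketch below is one reading of those slides, not something the deck states outright: it assumes a fixed 100k req/s spread evenly across 8 servers (the server count is inferred from the 12.5k req/s starting point), and the function names are hypothetical. It does not model the "extra traffic" percentages shown on the slides.

```python
# Sketch: how per-server load grows as servers drop out of a pool.
# Assumptions (not from the deck): 8 servers to start, traffic is
# divided evenly, and total traffic stays constant at 100k req/s.

def per_server_load(total_rps: float, healthy_servers: int) -> float:
    """Requests per second each healthy server must absorb."""
    if healthy_servers <= 0:
        raise ValueError("at least one healthy server is required")
    return total_rps / healthy_servers


def load_increase_pct(total_rps: float, initial_servers: int,
                      healthy_servers: int) -> float:
    """Percentage increase in per-server load relative to the full pool."""
    baseline = per_server_load(total_rps, initial_servers)
    current = per_server_load(total_rps, healthy_servers)
    return (current / baseline - 1.0) * 100.0


if __name__ == "__main__":
    TOTAL_RPS = 100_000      # figure taken from the slides
    INITIAL_SERVERS = 8      # assumption inferred from 12.5k req/s each

    for healthy in (8, 7, 6, 2):
        load = per_server_load(TOTAL_RPS, healthy)
        increase = load_increase_pct(TOTAL_RPS, INITIAL_SERVERS, healthy)
        print(f"{healthy} servers: {load / 1000:.1f}k req/s each "
              f"(+{increase:.0f}% vs. full pool)")
```

Under these assumptions, losing two of eight servers raises per-server load by about a third, and dropping to two servers quadruples it (a 300% increase, matching slide 6). The 36% on slide 5 appears to correspond to the rounded 17k figure (17 / 12.5 = 1.36). The point is that each lost server pushes more load onto the survivors, which is how a small failure can cascade.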