Breaking Point: Building Scalable, Resilient APIs (ScalaSyd)

Mark Hibberd
September 09, 2015

The default position of a distributed system is failure. Networks fail. Machines fail. Systems fail.

The problem is that APIs are, at their core, complex distributed systems. At some point in their lifetime, most APIs will have to scale, whether because of high volume, large data sets, a high number of clients, or simply rapid change. When this happens, we want our systems to bend, not break.

This talk is a tour of how systems fail, combining analysis of how complex systems break at scale with anecdotes capturing the lighter side of catastrophic failure. We will then ground this in a set of practical tools and techniques for building and testing complex systems for reliability.

Transcript

  1. @markhibberd
    Breaking Point
    Building Scalable,
    Resilient APIs

  2. “A common mistake that people make when
    trying to design something completely
    foolproof is to underestimate the
    ingenuity of complete fools.”
    Douglas Adams -
    Mostly Harmless (1992)

  3. How Did We Get Here

  4. THE API
    G
    $ $ $ $

  5. THE API
    G
    $ $ $ $

  6. THE API
    G
    $ $ $ $

  7. failure
    is inevitable

  8. How Systems Fail

  9. “You live and learn. At any rate, you live.”
    Douglas Adams -
    Mostly Harmless (1992)

  10. The Crash
    one

  11. Systems Never
    Fail Cleanly

  12. Cascading Failures
    two

  13. OCTOBER 27, 2012
    Bug triggers
    Cascading Failures
    Causes Major Amazon Outage
    https://aws.amazon.com/message/680342/

  14. At 10:00AM PDT Monday, a small number of Amazon Elastic
    Block Store (EBS) volumes in one of our five Availability
    Zones in the US-East Region began seeing degraded
    performance, and in some cases, became “stuck”

  15. Can Be
    Triggered By
    As Little As A
    Performance
    Issue

  16. Don’t Listen To Programmers,
    Performance Matters

  17. (well, at least asymptotics matter)

  18. A Failure is Indistinguishable from a
    Slow Response

  19. Chain Reactions
    three

  20. 17%
    100k req/s
    12.5k >>> 14.3k >>> 17k req/s

  21. 17%
    100k req/s
    12.5k >>> 14.3k >>> 17k req/s
    4.5% extra traffic
    means a 36% load
    increase on each
    server

  22. 25%
    100k req/s
    12.5% extra traffic
    means a 300% load
    increase on each
    server
    12.5k >>> 14.3k >>> 17k >>> 50k req/s
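
The per-server figures on these slides come from spreading a fixed 100k req/s over a shrinking pool of servers. The following is a minimal sketch of that arithmetic, assuming a hypothetical cluster that starts with 8 servers (which is what 12.5k req/s per server implies) and perfectly even load balancing. The function names are illustrative and not from the talk.

```python
TOTAL_RPS = 100_000  # total traffic, taken from the slide


def per_server_load(total_rps: float, servers: int) -> float:
    """Load each surviving server carries, assuming perfectly even balancing."""
    if servers <= 0:
        raise ValueError("no servers left to carry the load")
    return total_rps / servers


def load_increase(total_rps: float, before: int, after: int) -> float:
    """Fractional per-server load increase when going from `before` to `after` servers."""
    old = per_server_load(total_rps, before)
    new = per_server_load(total_rps, after)
    return (new - old) / old


if __name__ == "__main__":
    # Assumed starting point: 8 servers, then progressively fewer survivors.
    for remaining in (8, 7, 6, 2):
        load = per_server_load(TOTAL_RPS, remaining)
        extra = load_increase(TOTAL_RPS, 8, remaining)
        print(f"{remaining} servers: {load / 1000:.1f}k req/s each ({extra:+.0%})")
```

With exact division, six survivors carry about 16.7k req/s each, roughly a third more than before. The slide's 36% appears to come from rounding that figure up to 17k. Two survivors carry 50k req/s each, which is the 300% increase on the slide.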

  23. Capacity Skew
    four

  24. JUNE 11, 2010
    A Perfect Storm.....of Whales
    Heavy Loads
    Causes Series of Twitter Outages
    During World Cup
    http://engineering.twitter.com/2010/06/perfect-stormof-whales.html

  25. Since Saturday, Twitter has experienced several
    incidences of poor site performance and a high number of
    errors due to one of our internal sub-networks being
    over-capacity.

  26. Self Denial of Service
    five

  27. Clients
    Server

  28. Clients
    Server

  29. Clients
    Server

  30. Clients
    Server

  31. Clients
    Server

  32. a very painful experience
    The Quiet Time

  33. critical licensing service, 100 million + active users a
    day, millions of $$$.
    A couple of “simple” services. Thick clients,
    non-updatable, load-balanced on client.

  34. server client

  35. /call
    server client
    on-demand

  36. /call
    server client
    on-demand

  37. /call
    server client
    /check
    on-demand
    periodically

  38. /call
    server client
    /check
    on-demand
    periodically

  39. /call
    server client
    /check
    on-demand
    periodically
    /check2
    /check2z
    /v3check

  40. /call
    server client
    /check
    on-demand
    periodically
    /check2
    /check2z
    /v3check

  41. /call
    server
    /check
    /check2
    /check2z
    /v3check

  42. System Collusion
    six

  43. NOVEMBER 14, 2014
    Link Imbalance
    Mystery Issue
    Causes Metastable Failure State
    https://code.facebook.com/posts/1499322996995183/solving-the-mystery-of-link-imbalance-a-metastable-failure-state-at-scale/

  44. Bonded Link
    Should Evenly
    Utilise Each
    Network Pipe

  45. multiple causes
    for imbalance

  46. systems couldn’t
    correct
    themselves

  47. individually each
    component was
    behaving correctly

  48. Temporary
    Latency To Db
    Caused Skew
    Connection Pool
    Started Favouring
    Overloaded Link

  49. failure
    is not clean

  50. How To Control Failure

  51. “Anything that happens, happens.
    Anything that, in happening, causes something
    else to happen, causes something else to happen.
    Anything that, in happening, causes itself to
    happen again, happens again.
    It doesn’t necessarily do it in chronological
    order, though.”
    Douglas Adams -
    Mostly Harmless (1992)

  52. bad things can happen…

  53. P(failure) = 0.1

  54. P(failure) = 0.1

  55. P(individual failure) = 0.1

  56. P(system failure) = 0.1^10

  57. are failures really independent?

  58. P(mutually assured destruction) = 1

  59. but if one goes…

  60. P(individual failure) = 0.1

  61. P(individual success) = 1 - 0.1 = 0.9

  62. P(all successes) = 0.9^10

  63. P(system failure) = 1 - 0.9^10

  64. P(system failure) = 1 - 0.9^10 = 0.65

  65. P(system failure) = 1 - 0.9^10 = 0.65
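
The two system-failure figures on these slides differ only in what counts as failure. The first treats the system as down only when all ten components fail, while the second treats it as down when any one of them fails. The following is a small sketch of both calculations, assuming ten independent components that each fail with probability 0.1. The function names are illustrative.

```python
def p_all_fail(p_fail: float, n: int) -> float:
    """Probability that every one of n independent components fails."""
    return p_fail ** n


def p_any_fail(p_fail: float, n: int) -> float:
    """Probability that at least one of n independent components fails."""
    return 1 - (1 - p_fail) ** n


if __name__ == "__main__":
    print(f"all ten fail:       {p_all_fail(0.1, 10):.1e}")
    print(f"at least one fails: {p_any_fail(0.1, 10):.2f}")
```

The second figure is the relevant one whenever a request has to touch every component, and it grows quickly as components are added.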

  66. Timeouts & Retries
    one

  67. Other Systems Will Always Be Your
    Most Vulnerable Failure Modes

  68. Never Make
    a Network Call, or Go to Disk,
    Without a Timeout

  69. Pay Particular Attention to
    Reusable & Pooled Resources

  70. Backoff Requests
    That Time Out
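
One way to combine these points is to give every remote call an explicit timeout and to space out retries with exponential backoff plus jitter. The following is a minimal sketch, not the speaker's implementation. `call_with_backoff`, its parameters and its defaults are hypothetical, and `operation` stands for any callable that accepts a timeout in seconds and raises `TimeoutError` when it expires.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[float], T],
    timeout_s: float = 1.0,
    max_attempts: int = 4,
    base_delay_s: float = 0.1,
    max_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation(timeout_s)`, retrying timeouts with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation(timeout_s)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller fail fast or degrade
            # Double the delay ceiling on each attempt, capped at max_delay_s,
            # and sleep for a random time below it so clients do not retry in step.
            ceiling = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("unreachable: the loop always returns or raises")
```

The jitter matters for the self denial of service case: clients that all retry on the same schedule tend to arrive together again.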

  71. Heartbeats
    two

  72. If You Know Something Is Failing,
    Fail Fast

  73. {
      "name": 123,
      "version": "mth",
      "stats": {…},
      "status": "ok"
    }
    /status
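
A heartbeat against a `/status` endpoint lets a caller know a dependency is unhealthy before sending it real traffic. The following is a minimal sketch under stated assumptions: `probe` is a hypothetical callable that fetches and parses the status document, only the `"status": "ok"` field from the slide is inspected, and the class and exception names are made up for illustration.

```python
import time
from typing import Callable, Optional


class DependencyDown(Exception):
    """Raised immediately when the last heartbeat says the dependency is unhealthy."""


class Heartbeat:
    def __init__(self, probe: Callable[[], dict], max_age_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._probe = probe            # hypothetical: returns the parsed /status document
        self._max_age_s = max_age_s    # how long one heartbeat result is trusted
        self._clock = clock
        self._healthy = False
        self._checked_at: Optional[float] = None

    def beat(self) -> None:
        """Run one probe; call this periodically from a timer or background thread."""
        try:
            self._healthy = self._probe().get("status") == "ok"
        except Exception:
            self._healthy = False      # a failed probe counts as unhealthy
        self._checked_at = self._clock()

    def require_healthy(self) -> None:
        """Fail fast instead of sending a request that is expected to fail."""
        stale = (self._checked_at is None
                 or self._clock() - self._checked_at > self._max_age_s)
        if stale or not self._healthy:
            raise DependencyDown("last heartbeat missing, stale, or not ok")
```

Callers would invoke `require_healthy()` before each outbound request, so a known-bad dependency costs an exception rather than a timeout.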

  74. Circuit Breakers
    three

  75. {
      "id": 123,
      "username": "mth",
      "profile": {
        "bio": "…",
        "image": "…"
      },
      "friends": [
        191,
        1
      ]
    }

  76. {
      "id": 123,
      "username": "mth",
      "profile": {
        "bio": "…",
        "image": "…"
      },
      "friends": [
        191,
        1
      ]
    }

  77. {
      "id": 123,
      "username": "mth",
      "profile": {
        "bio": "…",
        "image": "…"
      }
    }

  78. Requires Co-Ordination to Manage
    Degradation Of Service
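
The degraded response on the preceding slides, where `friends` simply disappears, is what a circuit breaker around the friends lookup could produce. The following is a minimal, single-threaded sketch of the pattern, assuming a consecutive-failure threshold and a fixed reset interval. The class, `build_profile` and all parameter names are hypothetical, and a real service would also need locking and agreement with clients on which fields may be omitted.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; allows a trial call after `reset_s`."""

    def __init__(self, max_failures: int = 3, reset_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._max_failures = max_failures
        self._reset_s = reset_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_s:
                return fallback()  # open: skip the dependency entirely
            # Half-open: fall through and let this one call probe the dependency.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


def build_profile(user: dict, fetch_friends: Callable[[], list],
                  breaker: CircuitBreaker) -> dict:
    """Return the user document, leaving out `friends` when that lookup is unavailable."""
    response = dict(user)
    friends = breaker.call(fetch_friends, fallback=lambda: None)
    if friends is not None:
        response["friends"] = friends
    return response
```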

  79. Partitioning
    four

  80. traffic
    Spike

  81. traffic
    Spike

  82. traffic
    Spike

  83. traffic
    Spike

  84. traffic
    Spike

  85. survives
    another
    day

  86. Partitioning Can Also Be Performed
    Within Services Via Limited Thread &
    Resource Pools

  87. multiple types
    of request
    /shout /download

  88. multiple types
    of request
    /shout /download

  89. multiple types
    of request
    /shout /download
    fast

  90. multiple types
    of request
    /shout /download
    fast slow

  91. multiple types
    of request
    /shout /download
    fast really
    really
    slow

  92. multiple types
    of request
    /shout /download

  93. multiple types
    of request
    /shout /download

  94. multiple types
    of request
    /shout /download
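
Partitioning inside a service can be sketched as one bounded pool of slots per request type, so that slow `/download` requests cannot use up the capacity that fast `/shout` requests need. The example below is an illustration only: the `Bulkhead` class, the `Rejected` exception and the pool sizes are assumptions, not details from the talk.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


class Rejected(Exception):
    """Raised when a route's pool has no free slot."""


class Bulkhead:
    def __init__(self, limits: Dict[str, int]):
        # One independent semaphore per route: exhausting one leaves the others untouched.
        self._slots = {route: threading.BoundedSemaphore(n) for route, n in limits.items()}

    @contextmanager
    def slot(self, route: str) -> Iterator[None]:
        semaphore = self._slots[route]
        if not semaphore.acquire(blocking=False):
            raise Rejected(f"{route} pool is full")  # shed load rather than queue
        try:
            yield
        finally:
            semaphore.release()


# Hypothetical sizing: many cheap slots for /shout, few for the slow /download path.
pools = Bulkhead({"/shout": 8, "/download": 2})

if __name__ == "__main__":
    with pools.slot("/download"):
        print("serving a download while /shout capacity stays untouched")
```

Rejecting immediately when a pool is full, instead of waiting for a slot, keeps a backlog on the slow route from tying up request threads.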

  95. Backpressure
    five

  96. Fail One Request,
    Instead of Failing All Requests

  97. Measure And
    Signal Slow
    Requests

  98. Upstream
    Drops Requests
    To Allow For
    Recovery
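
A bounded queue is one simple way to turn overload into a signal the upstream caller can act on. The following is a minimal sketch: `BoundedInbox` and `Overloaded` are hypothetical names, and the capacity would have to be chosen from measurements of the real service.

```python
import queue
from typing import Any


class Overloaded(Exception):
    """Signal to the upstream caller that this request was shed."""


class BoundedInbox:
    def __init__(self, capacity: int):
        self._queue: "queue.Queue[Any]" = queue.Queue(maxsize=capacity)

    def submit(self, request: Any) -> None:
        """Accept a request, or fail this one request when the backlog is full."""
        try:
            self._queue.put_nowait(request)
        except queue.Full:
            raise Overloaded("backlog full; retry later") from None

    def take(self) -> Any:
        """Remove the oldest queued request; raises queue.Empty if there is none."""
        return self._queue.get_nowait()
```

In an HTTP API, `Overloaded` would typically be translated into a 503 or 429 response, giving the upstream proxy or client a clear cue to drop or delay requests while the service recovers.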

  99. Taper Limits

  100. Taper Limits

  101. Taper Limits
    200k Db
    Requests

  102. Taper Limits
    200k Db
    Requests
    50k Server
    Requests

  103. Taper Limits
    200k Db
    Requests
    50k Server
    Requests
    45k Proxy
    Requests
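
One reading of these numbers is that limits should tighten from the datastore outwards, so the proxy starts shedding load before the servers saturate, and the servers before the database. The sketch below checks only that ordering. It assumes the three figures are directly comparable, which would not hold if one server request fans out into several database requests, and the function name is hypothetical.

```python
from typing import List, Tuple

# Limits from the slide, ordered from the outermost layer to the innermost.
LIMITS: List[Tuple[str, int]] = [("proxy", 45_000), ("server", 50_000), ("db", 200_000)]


def is_tapered(limits: List[Tuple[str, int]]) -> bool:
    """True if every layer admits no more than the layer directly behind it."""
    values = [limit for _, limit in limits]
    return all(outer <= inner for outer, inner in zip(values, values[1:]))


if __name__ == "__main__":
    print("limits taper correctly:", is_tapered(LIMITS))
```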

  104. failure
    can be mitigated

  105. How To Prevent Failure

  106. “A beach house isn't just real estate. It's a
    state of mind.”
    Douglas Adams -
    Mostly Harmless (1992)

  107. the testing you probably
    don’t want to do is the
    testing you need to do most

  108. Speed
    Scalability
    Stability

  109. tools help
    charles proxy
    ipfw / pf
    netem
    monitoring tools
    simian army
    ab / siege

  110. two
    Measure Everything

  111. every result computed
    should have traceability
    back to the code & data

  112. gather metadata for
    everything that
    touches a request

  113. services: {
    auth: {…}
    }

  114. services: {
    auth: {…},
    profile: {…},
    recommend: {…}
    }

  115. services: {
    auth: {…},
    profile: {…},
    recommend: {…},
    friends: {…}
    }

  116. services: {
    auth: {…},
    profile: {…},
    recommend: {…},
    friends: {…}
    }
    {
    version: {…},
    stats: {…},
    source: {…}
    }
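
The `services` structure on these slides suggests that every service touching a request leaves behind a small record of itself. The following is a minimal sketch of how such a record might be collected. The meaning given to `version`, `stats` and `source` here is an assumption, since the slides elide their contents, and `traced_call` is a hypothetical helper.

```python
import time
from typing import Any, Callable, Dict


def traced_call(services: Dict[str, dict], name: str, version: str, source: str,
                handler: Callable[[Any], Any], request: Any) -> Any:
    """Run `handler` and record who answered, from which build, and how long it took."""
    started = time.perf_counter()
    try:
        return handler(request)
    finally:
        # Recorded on success and on failure alike, so errors stay traceable too.
        services[name] = {
            "version": version,  # assumed: a build or release identifier
            "stats": {"elapsed_ms": (time.perf_counter() - started) * 1000},
            "source": source,    # assumed: e.g. a commit hash or dataset identifier
        }


if __name__ == "__main__":
    services: Dict[str, dict] = {}
    traced_call(services, "auth", "1.2.0", "abc123", lambda request: True, {"id": "ab123"})
    print(services)
```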

  117. statistics work,
    measurements over time
    will find errors

  118. deviation:

    percentiles:
    90:
    95:
    histogram:
    20x: 121
    30x: 12
    40x: 13
    50x: 121311313

  119. statistics work,
    we can use them to automate
    corrective actions
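
The deviation, percentile and status-histogram fields can be computed from raw measurements with the standard library. The following is a small sketch under stated assumptions: percentiles use the nearest-rank method, histogram buckets follow the slide's `20x` to `50x` labels, and the ejection rule with its 5% threshold is a hypothetical example of an automated corrective action.

```python
import math
import statistics
from collections import Counter
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(latencies_ms: List[float], status_codes: List[int]) -> Dict[str, object]:
    """Summary in the shape of the slide: deviation, percentiles, status histogram."""
    histogram = Counter(f"{code // 100}0x" for code in status_codes)
    return {
        "deviation": statistics.pstdev(latencies_ms),
        "percentiles": {90: percentile(latencies_ms, 90), 95: percentile(latencies_ms, 95)},
        "histogram": dict(histogram),
    }


def should_eject(summary: Dict[str, object], max_error_ratio: float = 0.05) -> bool:
    """Hypothetical corrective rule: take a node out of rotation when 50x responses dominate."""
    histogram = summary["histogram"]
    total = sum(histogram.values())
    return total > 0 and histogram.get("50x", 0) / total > max_error_ratio
```

A monitoring loop could call `summarise` over a sliding window of measurements and act on `should_eject` without waiting for a human to read a dashboard.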

  120. three
    Production In Development

  121. production quality data
    automation of environments
    lots of testing

  122. production quality data
    automation of environments
    lots of testing
    Rather Old Hat

  123. four
    Development in Production

  124. yes, really.
    i want to ship your worst,
    un-tried, experimental
    code to production

  125. query
    /chord {id: ab123}
    datastore
    ;chord

  126. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result

  127. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result
    /chord/ab123
    client

  128. split environments

  129. query
    /chord {id: ab123}

  130. query
    /chord {id: ab123}
    production:live

  131. /chord {id: ab123}
    production:live
    proxy
    query

  132. /chord {id: ab123}
    production:dev
    proxy
    query

  133. /chord {id: ab123}
    production:*
    proxy
    query query

  134. tandem deployments

  135. /chord {id: ab123}
    production:*
    proxy
    query query
    x x

  136. /chord {id: ab123}
    production:*
    proxy
    query query
    x x

  137. /chord {id: ab123}
    production:*
    proxy
    query query
    x x
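
A tandem deployment can be sketched as a proxy that sends each request to both the live and the experimental instance, serves only the live answer, and reports any disagreement. The example below is an illustration under assumptions: `tandem`, `live`, `dev` and `report` are hypothetical names, and a real proxy would normally run the shadow call off the request path so it cannot add latency.

```python
from typing import Any, Callable


def tandem(request: Any,
           live: Callable[[Any], Any],
           dev: Callable[[Any], Any],
           report: Callable[[Any, Any, Any], None]) -> Any:
    """Serve from `live`; shadow the same request to `dev` and report differences."""
    live_result = live(request)  # live errors propagate to the caller as usual
    try:
        dev_result = dev(request)
    except Exception as error:
        # Experimental code must never hurt the live path: record and move on.
        report(request, live_result, error)
        return live_result
    if dev_result != live_result:
        report(request, live_result, dev_result)
    return live_result
```

Because callers only ever see the live result, untried code can be exercised with production traffic while its mistakes end up in a report instead of a response.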

  138. staged deployments

  139. /chord {id: ab123}
    production:*
    proxy
    query query
    x x

  140. /chord {id: ab123}
    production:*
    proxy
    query
    x

  141. /chord {id: ab123}
    production:*
    proxy
    query
    x
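
One common way to stage a deployment at the proxy is to route a fixed, gradually increasing share of requests to the new instance, chosen deterministically from the request id. This is an assumption about how the staging could be done, not something the slides spell out. The backend labels reuse `production:live` and `production:dev` from the earlier slides, and the function names are hypothetical.

```python
import hashlib


def bucket(request_id: str, buckets: int = 100) -> int:
    """Map a request id to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def choose_backend(request_id: str, dev_percent: int) -> str:
    """Send roughly `dev_percent` percent of ids to the new deployment."""
    return "production:dev" if bucket(request_id) < dev_percent else "production:live"


if __name__ == "__main__":
    for percent in (0, 10, 50, 100):
        print(percent, choose_backend("ab123", percent))
```

Hashing the id rather than picking at random keeps a given id on the same backend at a given percentage, which makes differences between the two deployments easier to trace.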

  142. failure
    is inevitable

  143. failure
    is not clean

  144. failure
    can be mitigated

  145. Unmodified. CC BY 2.0 (https://creativecommons.org/licenses/by/2.0/)
    https://www.flickr.com/photos/timothymorgan/75288582/
    https://www.flickr.com/photos/timothymorgan/75288583/
    https://www.flickr.com/photos/timothymorgan/75294154/
    https://www.flickr.com/photos/timothymorgan/75593155/
