Breaking Point: Building Scalable, Resilient APIs

Mark Hibberd
February 10, 2015

The default position of a distributed system is failure. Networks fail. Machines fail. Systems fail.

The problem is that APIs are, at their core, complex distributed systems. At some point in their lifetime, APIs will likely have to scale: to high volume, large data sets, a large number of clients, or simply to rapid change. When this happens, we want our systems to bend, not break.

This talk is a tour of how systems fail, combining analysis of how complex systems break at scale with anecdotes capturing the lighter side of catastrophic failure. We then ground this in a set of practical tools and techniques for building and testing complex systems for reliability.

Transcript

  1. @markhibberd
    Breaking Point
    Building Scalable,
    Resilient APIs

  2. “A common mistake that people make when
    trying to design something completely
    foolproof was to underestimate the
    ingenuity of complete fools.”
    Douglas Adams,
    Mostly Harmless (1992)

  3. How Did We Get Here

  4. THE API

  5. THE API

  6. THE API

  7. THE API

  8. THE API

  9. THE API

  10. THE API

  11. THE API

  12. THE API
    G

  13. THE API
    G

  14. THE API
    G

  15. THE API
    G

  16. THE API
    G

  17. THE API
    G

  18. THE API
    G

  19. THE API
    G
    $ $ $ $

  20. THE API
    G
    $ $ $ $

  21. THE API
    G
    $ $ $ $

  22. failure
    is inevitable

  23. How Systems Fail

  24. “You live and learn. At any rate, you live.”
    Douglas Adams,
    Mostly Harmless (1992)

  25. The Crash
    one

  26.

  27.

  28.

  29.

  30. Systems Never
    Fail Cleanly

  31. Cascading Failures
    two

  32.

  33.

  34.

  35.

  36.

  37.

  38. OCTOBER 27, 2012
    Bug triggers
    Cascading Failures
    Causes Major Amazon Outage
    https://aws.amazon.com/message/680342/

  39. At 10:00AM PDT Monday, a small number of Amazon Elastic
    Block Store (EBS) volumes in one of our five Availability
    Zones in the US-East Region began seeing degraded
    performance, and in some cases, became “stuck”

  40. Can Be
    Triggered By
    As Little As A
    Performance
    Issue

  41. Don’t Listen To Programmers,
    Performance Matters

  42. (well, at least asymptotics matter)

  43. A Failure is Indistinguishable from a
    Slow Response

  44. Chain Reactions
    three

  45.

  46. 12.5%

  47. 14.3%

  48. 17%

  49. 17%
    100k req/s
    12.5k >>> 14.3k >>> 17k req/s

  50. 17%
    100k req/s
    12.5k >>> 14.3k >>> 17k req/s
    4.5% extra traffic
    means a 36% load
    increase on each
    server

  51. 25%
    100k req/s
    12.5% extra traffic
    means a 300% load
    increase on each
    server
    12.5k >>> 14.3k >>> 17k >>> 50k req/s
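
To make the chain-reaction arithmetic concrete, here is a minimal sketch (not from the talk) that recomputes the per-server share of a fixed 100k req/s load as an eight-server pool loses members. The 12.5k >>> 14.3k >>> 17k progression, and the growing burden on each survivor, fall straight out of the division.

    # Sketch: how per-server load grows as servers drop out of a fixed-traffic pool.
    # Assumes 100k req/s spread evenly over 8 identical servers (illustrative numbers).

    TOTAL_RPS = 100_000
    INITIAL_SERVERS = 8

    baseline = TOTAL_RPS / INITIAL_SERVERS  # 12.5k req/s per server

    for remaining in range(INITIAL_SERVERS, 1, -1):
        per_server = TOTAL_RPS / remaining
        share = per_server / TOTAL_RPS * 100
        increase = (per_server - baseline) / baseline * 100
        print(f"{remaining} servers: {per_server / 1000:.1f}k req/s each "
              f"({share:.1f}% of traffic, +{increase:.0f}% over baseline)")

    # 8 servers -> 12.5k each, 7 -> 14.3k, 6 -> 16.7k (the talk rounds this to 17k,
    # roughly a 36% increase); by the time only 2 remain, each carries 50k, +300%.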

  52. Capacity Skew
    four

  53.

  54.

  55.

  56.

  57. JUNE 11, 2010
    A Perfect Storm.....of Whales
    Heavy Loads
    Causes Series of Twitter Outages
    During World Cup
    http://engineering.twitter.com/2010/06/perfect-stormof-whales.html

  58. Since Saturday, Twitter has experienced several
    incidences of poor site performance and a high number of
    errors due to one of our internal sub-networks being
    over-capacity.

  59. Self Denial of Service
    five

  60. Clients
    Server

  61. Clients
    Server

  62. Clients
    Server

  63. Clients
    Server

  64. Clients
    Server

  65. a very painful experience
    The Quiet Time

  66. critical licensing service, 100 million+ active users a day,
    millions of $$$.
    A couple of “simple” services. Thick clients, non-updatable,
    load-balanced on client.

  67. server client

  68. /call
    server client
    on-demand

  69. /call
    server client
    on-demand

  70. /call
    server client
    /check
    on-demand
    periodically

  71. /call
    server client
    /check
    on-demand
    periodically

  72. /call
    server client
    /check
    on-demand
    periodically
    /check2
    /check2z
    /v3check

  73. /call
    server client
    /check
    on-demand
    periodically
    /check2
    /check2z
    /v3check

  74. /call
    server
    /check
    /check2
    /check2z
    /v3check

  75. System Collusion
    six

  76.

  77.

  78.

  79. NOVEMBER 14, 2014
    Link Imbalance
    Mystery Issue
    Causes Metastable Failure State
    https://code.facebook.com/posts/1499322996995183/solving-the-mystery-of-link-imbalance-a-metastable-failure-state-at-scale/

  80. Bonded Link
    Should Evenly
    Utilise Each
    Network Pipe

  81. multiple causes
    for imbalance

  82. systems couldn’t
    correct
    themselves

  83. individually each
    component was
    behaving correctly

  84. Temporary
    Latency To Db
    Caused Skew
    Connection Pool
    Started Favouring
    Overloaded Link

  85. failure
    is not clean

  86. Incident Reports Are For
    You and Your Customers

  87. A Good Incident Report Helps Others
    Learn From Your Mistake, And Ensures
    You Really Understand What Went
    Wrong

  88. 5 Steps To A Good Incident Report
    1. Summary & Impact
    2. Timeline
    3. Root Cause
    4. Resolution and Recovery
    5. Corrective and Preventative Measures
    https://sysadmincasts.com/episodes/20-how-to-write-an-incident-report-postmortem

  89. How To Control Failure

  90. “Anything that happens, happens.
    Anything that, in happening, causes something
    else to happen, causes something else to happen.
    Anything that, in happening, causes itself to
    happen again, happens again.
    It doesn’t necessarily do it in chronological
    order, though.”
    Douglas Adams,
    Mostly Harmless (1992)

  91. Timeouts
    one

  92. Other Systems Will Always Be Your
    Most Vulnerable Failure Modes

  93. Never Make
    a Network Call, or Go to Disk,
    Without a Timeout

  94. Pay Particular Attention to
    Reusable & Pooled Resources

  95. Backoff Requests
    That Time Out
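
A minimal sketch of the two rules above, using only the standard library (the URL and retry budget are placeholders, not anything prescribed by the talk): every call carries a hard timeout, and calls that fail back off exponentially with jitter instead of retrying immediately.

    # Sketch: never make a network call without a timeout, and back off requests
    # that time out so a fleet of clients does not hammer a struggling server.
    import random
    import time
    import urllib.request

    def call_with_backoff(url, attempts=4, timeout=2.0, base_delay=0.5):
        """GET with a hard timeout; on failure, sleep with exponential backoff plus jitter."""
        for attempt in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as response:
                    return response.read()
            except OSError:  # URLError, socket timeouts and connection errors all derive from OSError
                if attempt == attempts - 1:
                    raise  # out of retries: surface the failure to the caller
                delay = base_delay * (2 ** attempt)
                time.sleep(random.uniform(0, delay))  # jitter avoids synchronised retries

    # Usage (hypothetical endpoint):
    # body = call_with_backoff("https://api.example.com/check")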

  96. Heartbeats
    two

  97. If You Know Something Is Failing,
    Fail Fast

  98. /status
    {
      "name": 123,
      "version": "mth",
      "stats": {…},
      "status": "ok"
    }
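
As a sketch of how such a heartbeat can be served and used to fail fast, here is a minimal /status endpoint built on the standard library. The service name, version and the dependency check are illustrative placeholders; the important part is that an unhealthy node answers with a non-200 status so callers and load balancers stop sending it traffic.

    # Sketch: a /status heartbeat that fails fast when the node knows it is unhealthy.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def dependencies_healthy():
        """Placeholder: real checks would probe pools, disks and downstream services."""
        return True

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/status":
                self.send_error(404)
                return
            healthy = dependencies_healthy()
            body = json.dumps({
                "name": "licensing-api",   # illustrative service name
                "version": "1.0.0",        # illustrative version
                "stats": {},               # request counters, pool sizes, ...
                "status": "ok" if healthy else "failing",
            }).encode()
            self.send_response(200 if healthy else 503)  # fail fast: advertise ill health
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), StatusHandler).serve_forever()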

  99.

  100.

  101.

  102.

  103.

  104.

  105.

  106. Circuit Breakers
    three

  107.

  108.

  109.

  110. {
      "id": 123,
      "username": "mth",
      "profile": {
        "bio": "…",
        "image": "…"
      },
      "friends": [
        191,
        1
      ]
    }

  111. {
      "id": 123,
      "username": "mth",
      "profile": {
        "bio": "…",
        "image": "…"
      },
      "friends": [
        191,
        1
      ]
    }

  112. {
      "id": 123,
      "username": "mth",
      "profile": {
        "bio": "…",
        "image": "…"
      }
    }

  113. Requires Co-Ordination to Manage
    Degradation Of Service
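
A minimal circuit-breaker sketch in the spirit of the slides above: the breaker guards the friends service, and when it is open the profile is returned without the "friends" field (slide 112) rather than failing the whole request. Thresholds and the fetch functions are illustrative, and real deployments need the co-ordination the slide mentions so every caller degrades the same way.

    # Sketch: circuit breaker around one dependency, degrading the response when open.
    import time

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None  # None means closed: calls flow normally

        def allow(self):
            if self.opened_at is None:
                return True
            # Half-open: after the reset timeout, let a call through to probe recovery.
            return time.monotonic() - self.opened_at >= self.reset_timeout

        def record_success(self):
            self.failures = 0
            self.opened_at = None

        def record_failure(self):
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

    friends_breaker = CircuitBreaker()

    def get_profile(user_id, fetch_profile, fetch_friends):
        """Return the profile; drop 'friends' rather than fail the whole request."""
        profile = fetch_profile(user_id)
        if friends_breaker.allow():
            try:
                profile["friends"] = fetch_friends(user_id)
                friends_breaker.record_success()
            except Exception:
                friends_breaker.record_failure()  # degrade instead of propagating the error
        return profile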

  114. Partitioning
    four

  115.

  116. traffic
    Spike

  117. traffic
    Spike

  118. traffic
    Spike

  119.

  120.

  121. traffic
    Spike

  122. traffic
    Spike

  123. survives
    another
    day

  124. Partitioning Can Also Be Performed
    Within Services Via Limited Thread &
    Resource Pools
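
A sketch of that in-process partitioning (a bulkhead), assuming two request classes like the /shout and /download examples on the next slides: each class gets its own bounded thread pool, so slow downloads cannot starve the fast requests. Pool sizes and handlers are illustrative.

    # Sketch: separate, bounded worker pools per request class inside one service.
    from concurrent.futures import ThreadPoolExecutor

    POOLS = {
        "/shout": ThreadPoolExecutor(max_workers=32),    # fast, cheap requests
        "/download": ThreadPoolExecutor(max_workers=4),  # slow, heavy requests
    }

    def handle_shout(payload):
        return payload.upper()

    def handle_download(payload):
        return b"..."  # stand-in for a long-running transfer

    HANDLERS = {"/shout": handle_shout, "/download": handle_download}

    def dispatch(path, payload):
        """Submit work to the pool for its request class; each class queues and fails alone."""
        return POOLS[path].submit(HANDLERS[path], payload)

    # Even if every /download worker is busy, /shout still has its own 32 threads.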

  125. multiple types
    of request
    /shout /download

  126. multiple types
    of request
    /shout /download

  127. multiple types
    of request
    /shout /download
    fast

  128. multiple types
    of request
    /shout /download
    fast slow

  129. multiple types
    of request
    /shout /download
    fast really
    really
    slow

  130. multiple types
    of request
    /shout /download

  131. multiple types
    of request
    /shout /download

  132. multiple types
    of request
    /shout /download

  133. Backpressure
    five

  134. Fail One Request,
    Instead of Failing All Requests

  135.

  136. Measure And
    Signal Slow
    Requests

  137. Upstream
    Drops Requests
    To Allow For
    Recovery
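
A sketch of backpressure at a service boundary, with an illustrative in-flight limit: admission is bounded, and once the service is saturated new requests are rejected immediately (failing one request) so upstream callers can drop or retry later and the service gets room to recover.

    # Sketch: bounded admission; shed excess requests instead of letting everything slow down.
    import threading

    class AdmissionLimiter:
        def __init__(self, max_in_flight=100):
            self._slots = threading.BoundedSemaphore(max_in_flight)

        def try_handle(self, work, *args):
            # Non-blocking acquire: if no slot is free, signal "rejected" right away.
            if not self._slots.acquire(blocking=False):
                return ("rejected", None)
            try:
                return ("ok", work(*args))
            finally:
                self._slots.release()

    limiter = AdmissionLimiter(max_in_flight=100)

    def handle_request(query):
        return query[::-1]  # stand-in for real work

    status, result = limiter.try_handle(handle_request, "hello")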

  138. Taper Limits

  139. Taper Limits

  140. Taper Limits
    200k Db
    Requests

  141. Taper Limits
    200k Db
    Requests
    50k Server
    Requests

  142. Taper Limits
    200k Db
    Requests
    50k Server
    Requests
    45k Proxy
    Requests
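
The limits taper as you move towards the client: using the slide's numbers, the database can take 200k requests, the servers admit only 50k, and the proxy admits slightly less again at 45k, so overload is shed at the edge rather than deep in the stack. A tiny sketch of checking that property in configuration (the layer names come from the slides; the check itself is illustrative):

    # Sketch: each layer closer to the client admits less than the layer behind it.
    LIMITS = {              # requests/second each layer will admit
        "db": 200_000,
        "server": 50_000,
        "proxy": 45_000,
    }

    def check_taper(limits, order=("db", "server", "proxy")):
        for deeper, closer in zip(order, order[1:]):
            assert limits[closer] < limits[deeper], (
                f"{closer} limit must taper below {deeper} capacity")

    check_taper(LIMITS)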

  143. failure
    can be mitigated

  144.

  145. How To Prevent Failure

  146. “A beach house isn't just real estate. It's a
    state of mind.”
    Douglas Adams,
    Mostly Harmless (1992)

  147. one
    Testing

  148. the testing you probably
    don’t want to do is the
    testing you need to do most

  149. Speed
    Scalability
    Stability

  150. tools help:
    charles proxy
    ipfw / pf netem
    monitoring tools
    simian army
    ab / siege

  151. two
    Measure Everything

  152. every result computed
    should have traceability
    back to the code & data

  153. gather metadata for
    everything that
    touches a request

  154. services: {
      auth: {…}
    }

  155. services: {
      auth: {…},
      profile: {…},
      recommend: {…}
    }

  156. services: {
      auth: {…},
      profile: {…},
      recommend: {…},
      friends: {…}
    }

  157. services: {
      auth: {…},
      profile: {…},
      recommend: {…},
      friends: {…}
    }
    {
      version: {…},
      stats: {…},
      source: {…}
    }
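
A sketch of carrying that metadata with the response, in the shape of slides 154-157: every service that touches the request records which code version produced its part of the answer, some stats, and the data source it read, so any computed result can be traced back to code and data. Service names mirror the slides; the versions and sources are placeholders.

    # Sketch: per-service version/stats/source metadata attached to each response.
    import time

    def service_metadata(version, source):
        return {"version": version, "stats": {"started_at": time.time()}, "source": source}

    def handle(request_id):
        response = {"result": {"request_id": request_id}, "services": {}}
        # Each hop records what code (version) and what data (source) it used.
        response["services"]["auth"] = service_metadata("2.3.1", {"store": "users-db"})
        response["services"]["profile"] = service_metadata("1.9.0", {"store": "profiles-db"})
        response["services"]["recommend"] = service_metadata("0.4.2", {"model": "2015-02-01"})
        response["services"]["friends"] = service_metadata("3.0.0", {"store": "graph-db"})
        return response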

  158. statistics work,
    measurements over time
    will find errors

  159. deviation:
    percentiles:
    90:
    95:
    histogram:
    20x: 121
    30x: 12
    40x: 13
    50x: 121311313

  160. statistics work,
    we can use them to automate
    corrective actions
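
A sketch of the kind of rolling statistics behind slide 159, plus one automated corrective action; the window size and thresholds are illustrative, not from the talk.

    # Sketch: rolling latency percentiles and a status-class histogram, driving
    # an automated "fail fast" decision when 5xx responses dominate.
    from collections import Counter, deque
    from statistics import quantiles, stdev

    WINDOW = 1000
    latencies = deque(maxlen=WINDOW)   # seconds, most recent requests
    status_classes = Counter()         # "20x", "30x", "40x", "50x"

    def observe(latency, status_code):
        latencies.append(latency)
        status_classes[f"{status_code // 100}0x"] += 1

    def summary():
        p90 = p95 = None
        if len(latencies) >= 20:
            cuts = quantiles(latencies, n=20)   # 5% steps; cuts[17] is p90, cuts[18] is p95
            p90, p95 = cuts[17], cuts[18]
        return {
            "deviation": stdev(latencies) if len(latencies) > 1 else 0.0,
            "percentiles": {"90": p90, "95": p95},
            "histogram": dict(status_classes),
        }

    def should_fail_fast():
        # Corrective action: if most recent responses are 5xx, report unhealthy so
        # heartbeats fail and traffic is routed away while the service recovers.
        total = sum(status_classes.values())
        return total > 100 and status_classes["50x"] / total > 0.5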

  161. three
    Production In Development

  162. production quality data
    automation of environments
    lots of testing

  163. production quality data
    automation of environments
    lots of testing
    Rather Old Hat

  164. four
    Development in Production

  165. yes, really.
    i want to ship your worst,
    un-tried, experimental
    code to production

  166. @ambiata
    we deal with ingesting and processing lots of data
    100s TB / per day / per customer
    scientific experiment and measurement is key
    experiments affect users directly
    researchers / non-specialist engineers produce code

  167. query
    /chord

  168. query
    /chord {id: ab123}
    datastore
    ;chord

  169. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result

  170. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result
    /chord/ab123
    client

  171. split environments

  172. query
    /chord {id: ab123}

  173. query
    /chord {id: ab123}
    production:live

  174. /chord {id: ab123}
    production:live
    proxy
    query

  175. /chord {id: ab123}
    production:exp
    proxy
    query

  176. /chord {id: ab123}
    production:*
    proxy
    query query
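
A sketch of the routing idea on slides 174-176: the proxy serves every /chord request from the live query service and mirrors the same request to an experimental one running alongside it in production, but only the live answer is ever returned to the client. The backend and record functions are placeholders, not Ambiata's implementation.

    # Sketch: shadow experimental code with real production traffic, without
    # letting it affect the response the client sees.
    import threading

    def proxy_chord(request, live_backend, experimental_backend, record):
        def shadow():
            try:
                record(request, experimental_backend(request))  # kept for crosschecking only
            except Exception:
                record(request, None)  # experimental failures must never reach the client
        threading.Thread(target=shadow, daemon=True).start()
        return live_backend(request)  # the client only ever sees the live result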

  177. implemented through machine level acls
    experiment
    live
    control

  178. implemented through machine level acls
    experiment
    live
    control
    write
    read

  179. implemented through machine level acls
    experiment
    live
    control

  180. implemented through machine level acls
    experiment
    live
    control
    write
    read

  181. implemented through machine level acls
    experiment
    live
    control
    write
    read

  182. checkpoints

  183. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result
    /chord/ab123
    client
    x
    x

  184. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result
    /chord/ab123
    client
    x
    x

  185. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result
    /chord/ab123
    client
    x
    x

  186. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result
    x
    x
    behaviour change
    through in-production testing

  187. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result
    /chord/ab123
    client
    x
    x

  188. deep implementation,
    intra- and inter-process
    crosschecks
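
A sketch of a checkpoint-style crosscheck: results recorded from the live and experimental paths are compared field by field, and a divergence flags the experiment rather than ever affecting the client. The comparison rule and alert hook are illustrative.

    # Sketch: compare live and experimental results captured at a checkpoint.
    def crosscheck(request_id, live_result, experimental_result, alert):
        """Return True when the experimental path agrees with the live path."""
        if experimental_result is None:
            alert(request_id, "experiment produced no result")
            return False
        mismatched = [
            key for key in live_result
            if live_result[key] != experimental_result.get(key)
        ]
        if mismatched:
            alert(request_id, f"experiment diverged on {sorted(mismatched)}")
            return False
        return True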

  189. tandem deployments

  190. /chord {id: ab123}
    production:*
    proxy
    query query
    x x

  191. /chord {id: ab123}
    production:*
    proxy
    query query
    x x

  192. /chord {id: ab123}
    production:*
    proxy
    query query
    x x

  193. staged deployments

  194.

  195. /chord {id: ab123}
    production:*
    proxy
    query query
    x x

  196. /chord {id: ab123}
    production:*
    proxy
    query
    x

  197. /chord {id: ab123}
    production:*
    proxy
    query
    x

  198.

  199.

  200.

  201.

  202. failure
    is inevitable

  203. failure
    is not clean

  204. failure
    can be mitigated

  205.

  206. Unmodified. CC BY 2.0 (https://creativecommons.org/licenses/by/2.0/)
    https://www.flickr.com/photos/timothymorgan/75288582/
    https://www.flickr.com/photos/timothymorgan/75288583/
    https://www.flickr.com/photos/timothymorgan/75294154/
    https://www.flickr.com/photos/timothymorgan/75593155/