Breaking Point: Building Scalable, Resilient APIs (ScalaSyd)

Mark Hibberd
September 09, 2015

The default position of a distributed system is failure. Networks fail. Machines fail. Systems fail.

The problem is that APIs are, at their core, complex distributed systems. At some point in their lifetime, APIs will likely have to scale, maybe due to high volume, large data sets, a high number of clients, or maybe just to keep up with rapid change. When this happens, we want our systems to bend, not break.

This talk is a tour of how systems fail, combining analysis of how complex systems break at scale with anecdotes capturing the lighter side of catastrophic failure. We will then ground this with a set of practical tools and techniques to deal with building and testing complex systems for reliability.

Transcript

  1. @markhibberd Breaking Point Building Scalable, Resilient APIs

  2. “A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools.” Douglas Adams - Mostly Harmless (1992)
  3. How Did We Get Here

  4. THE API

  5. THE API

  6. THE API

  7. THE API

  8. THE API

  9. THE API

  10. THE API

  11. THE API

  12. THE API G

  13. THE API G

  14. THE API G

  15. THE API G

  16. THE API G

  17. THE API G

  18. THE API G

  19. THE API G $ $ $ $

  20. THE API G $ $ $ $

  21. THE API G $ $ $ $

  22. failure is inevitable

  23. How Systems Fail

  24. “You live and learn. At any rate, you live.” Douglas Adams - Mostly Harmless (1992)
  25. The Crash one

  26. None
  27. None
  28. None
  29. None
  30. Systems Never Fail Cleanly

  31. Cascading Failures two

  32. None
  33. None
  34. None
  35. None
  36. None
  37. None
  38. OCTOBER 27, 2012 Bug triggers Cascading Failures Causes Major Amazon Outage https://aws.amazon.com/message/680342/
  39. At 10:00AM PDT Monday, a small number of Amazon Elastic Block Store (EBS) volumes in one of our five Availability Zones in the US-East Region began seeing degraded performance, and in some cases, became “stuck”
  40. Can Be Triggered By As Little As A Performance Issue

  41. Don’t Listen To Programmers, Performance Matters

  42. (well, at least asymptotics matter)

  43. A Failure is Indistinguishable from a Slow Response

  44. Chain Reactions three

  45. None
  46. 12.5%

  47. 14.3%

  48. 17%

  49. 17% 100k req/s 12.5k >>> 14.3k >>> 17k req/s

  50. 17% 100k req/s 12.5k >>> 14.3k >>> 17k req/s 4.5% extra traffic means a 36% load increase on each server
  51. 25% 100k req/s 12.5% extra traffic means a 300% load increase on each server 12.5k >>> 14.3k >>> 17k >>> 50k req/s
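The arithmetic behind the chain-reaction slides can be sketched as follows. This is a minimal illustration, assuming (as the 12.5k figure suggests) eight servers sharing 100k req/s evenly; the function names are made up for the example. Computed exactly, going from eight servers to six is a 33% increase per server; the 36% on the slide follows from the rounded 17k figure.

```python
def per_server_load(total_rps: float, servers: int) -> float:
    """Requests per second each server sees when load is spread evenly."""
    if servers <= 0:
        raise ValueError("no servers left to take traffic")
    return total_rps / servers


def load_increase(total_rps: float, before: int, after: int) -> float:
    """Fractional increase in per-server load when the pool shrinks."""
    old = per_server_load(total_rps, before)
    new = per_server_load(total_rps, after)
    return (new - old) / old


if __name__ == "__main__":
    total = 100_000  # assumed total traffic, as on the slides
    for n in (8, 7, 6, 2):
        print(f"{n} servers: {per_server_load(total, n):,.0f} req/s each")
    print(f"8 -> 6 servers: +{load_increase(total, 8, 6):.0%} per server")
    print(f"8 -> 2 servers: +{load_increase(total, 8, 2):.0%} per server")
```

Each server that drops out hands its share to the survivors, so the per-server load grows faster with every loss. That is what turns one failure into a chain reaction.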
  52. Capacity Skew four

  53. None
  54. None
  55. None
  56. None
  57. JUNE 11, 2010 A Perfect Storm.....of Whales Heavy Loads Causes Series of Twitter Outages During World Cup http://engineering.twitter.com/2010/06/perfect-stormof-whales.html
  58. Since Saturday, Twitter has experienced several incidences of poor site performance and a high number of errors due to one of our internal sub-networks being over-capacity.
  59. Self Denial of Service five

  60. Clients Server

  61. Clients Server

  62. Clients Server

  63. Clients Server

  64. Clients Server

  65. a very painful experience The Quiet Time

  66. critical licensing service, 100 million + active users a day, millions of $$$. A couple of “simple” services. Thick clients, non-updatable, load-balanced on client.
  67. server client

  68. /call server client on-demand

  69. /call server client on-demand

  70. /call server client /check on-demand periodically

  71. /call server client /check on-demand periodically

  72. /call server client /check on-demand periodically /check2 /check2z /v3check

  73. /call server client /check on-demand periodically /check2 /check2z /v3check

  74. /call server /check /check2 /check2z /v3check

  75. System Collusion six

  76. None
  77. None
  78. None
  79. NOVEMBER 14, 2014 Link Imbalance Mystery Issue Causes Metastable Failure State https://code.facebook.com/posts/1499322996995183/solving-the-mystery-of-link-imbalance-a-metastable-failure-state-at-scale/
  80. Bonded Link Should Evenly Utilise Each Network Pipe

  81. multiple causes for imbalance

  82. systems couldn’t correct themselves

  83. individually each component was behaving correctly

  84. Temporary Latency To Db Caused Skew Connection Pool Started Favouring Overloaded Link
  85. failure is not clean

  86. How To Control Failure

  87. “Anything that happens, happens. Anything that, in happening, causes something else to happen, causes something else to happen. Anything that, in happening, causes itself to happen again, happens again. It doesn’t necessarily do it in chronological order, though.” Douglas Adams - Mostly Harmless (1992)
  88. bad things can happen…

  89. P(failure) = 0.1

  90. P(failure) = 0.1

  91. redundancy

  92. redundancy

  93. P(individual failure) = 0.1

  94. P(system failure) = 0.1^10

  95. are failures really independent?

  96. P(mutually assured destruction) = 1

  97. redundancy

  98. but if one goes…

  99. they all do

  100. P(individual failure) = 0.1

  101. P(individual success) = 1 - 0.1 = 0.9

  102. P(all successes) = 0.9^10

  103. P(system failure) = 1 - 0.9^10

  104. P(system failure) = 1 - 0.9^10 = 0.65

  105. None
  106. P(system failure) = 1 - 0.9^10 = 0.65
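The two probability calculations on the redundancy slides can be checked with a few lines of Python. This is a small sketch with made-up function names; it assumes ten components that each fail independently with probability 0.1, as on the slides.

```python
def p_all_fail(p_fail: float, n: int) -> float:
    """True redundancy: the system is down only if every component fails."""
    return p_fail ** n


def p_any_fail(p_fail: float, n: int) -> float:
    """Coupled components: the system is down if any single one fails."""
    return 1 - (1 - p_fail) ** n


if __name__ == "__main__":
    print(f"P(all 10 fail)     = {p_all_fail(0.1, 10):.1e}")
    print(f"P(at least 1 fail) = {p_any_fail(0.1, 10):.2f}")
```

The same ten components give either a system that almost never fails or one that fails about 65% of the time. Which one you get depends entirely on whether one failure takes the others down with it.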

  107. Timeouts & Retries one

  108. Other Systems Will Always Be Your Most Vulnerable Failure Modes

  109. Never Make a Network Call, or Go to Disk, Without a Timeout
  110. Pay Particular Attention to Reusable & Pooled Resources

  111. Backoff Requests That Time Out
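One way to combine the timeout and backoff advice above is a small retry wrapper. This is a sketch only: `call_with_backoff` and its parameters are hypothetical names, and it assumes the wrapped operation accepts a timeout in seconds and raises `TimeoutError` when that timeout expires. It uses exponential backoff with random jitter, which is one common choice and not necessarily what the talk had in mind.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[float], T],
    timeout_s: float = 1.0,
    max_attempts: int = 4,
    base_delay_s: float = 0.1,
    max_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(timeout_s); on TimeoutError wait, then try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        try:
            # Every attempt carries an explicit timeout.
            return operation(timeout_s)
        except TimeoutError:
            attempt += 1
            if attempt >= max_attempts:
                raise  # give up and surface the failure to the caller
            # The delay ceiling doubles per attempt, capped at max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

The bounded attempt count matters as much as the backoff. Unbounded retries from many clients are one route to the self denial of service described earlier in the talk.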

  112. Heartbeats two

  113. If You Know Something Is Failing, Fail Fast

  114. { “name”: 123, “version”: “mth”, “stats”: {…}, “status”: “ok” } /status
  115. None
  116. None
  117. None
  118. None
  119. None
  120. None
  121. None
  122. Circuit Breakers three

  123. None
  124. None
  125. None
  126. { “id”: 123, “username”: “mth”, “profile”: { “bio”: “…”, “image”: “…” }, “friends”: [ 191, 1 ] }
  127. { “id”: 123, “username”: “mth”, “profile”: { “bio”: “…”, “image”: “…” }, “friends”: [ 191, 1 ] }
  128. { “id”: 123, “username”: “mth”, “profile”: { “bio”: “…”, “image”: “…” } }
  129. Requires Co-Ordination to Manage Degradation Of Service
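The degraded response on the slides above (the same user document, minus `friends`) is the kind of behaviour a circuit breaker makes possible. The sketch below is a minimal, single-threaded illustration. `CircuitBreaker`, `user_response`, the threshold, and the reset interval are all assumptions made for the example, not the speaker's implementation.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised instead of calling a dependency that is known to be failing."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpen()  # fail fast without touching the dependency
            # Reset interval has passed: let a trial call through (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result


def user_response(user: dict, fetch_friends: Callable[[], list],
                  breaker: CircuitBreaker) -> dict:
    """Build the user document, dropping `friends` if that service is unavailable."""
    response = dict(user)
    try:
        response["friends"] = breaker.call(fetch_friends)
    except Exception:
        pass  # degrade: serve the profile without the friends list
    return response
```

The client still gets a useful answer while the friends service recovers, and the failing service is spared the extra traffic. As the slide notes, client and server have to agree that a missing field is an acceptable response.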

  130. Partitioning four

  131. None
  132. traffic Spike

  133. traffic Spike

  134. traffic Spike

  135. None
  136. None
  137. traffic Spike

  138. traffic Spike

  139. survives another day

  140. Partitioning Can Also Be Performed Within Services Via Limited Thread & Resource Pools
  141. multiple types of request /shout /download

  142. multiple types of request /shout /download

  143. multiple types of request /shout /download fast

  144. multiple types of request /shout /download fast slow

  145. multiple types of request /shout /download fast really really slow

  146. multiple types of request /shout /download

  147. multiple types of request /shout /download

  148. multiple types of request /shout /download
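Partitioning inside a service can be sketched with one bounded thread pool per request type, so that slow `/download` requests cannot use up the threads that `/shout` needs. This is a minimal illustration using Python's standard `concurrent.futures`; the `PartitionedExecutor` class and the pool sizes are assumptions made for the example.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class PartitionedExecutor:
    """One fixed-size thread pool per route, so routes cannot starve each other."""

    def __init__(self, limits: Dict[str, int]):
        self._pools = {
            route: ThreadPoolExecutor(max_workers=size)
            for route, size in limits.items()
        }

    def submit(self, route: str, handler: Callable, *args) -> Future:
        # Work for a route only ever runs on that route's own threads.
        return self._pools[route].submit(handler, *args)

    def shutdown(self) -> None:
        for pool in self._pools.values():
            pool.shutdown(wait=True)


if __name__ == "__main__":
    executor = PartitionedExecutor({"/shout": 8, "/download": 2})  # hypothetical sizes
    print(executor.submit("/shout", lambda: "hello").result())
    executor.shutdown()
```

`ThreadPoolExecutor` queues excess work without bound, so a production version would also want a limit on queued requests; this sketch only isolates the threads.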

  149. Backpressure five

  150. Fail One Request, Instead of Failing All Requests

  151. None
  152. Measure And Signal Slow Requests

  153. Upstream Drops Requests To Allow For Recovery

  154. Taper Limits

  155. Taper Limits

  156. Taper Limits 200k Db Requests

  157. Taper Limits 200k Db Requests 50k Server Requests

  158. Taper Limits 200k Db Requests 50k Server Requests 45k Proxy Requests
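The backpressure ideas above (fail one request instead of all, and taper limits so the outermost layer gives way first) can be sketched with non-blocking admission limits. Everything here is illustrative: `Limiter`, `Overloaded`, and `handle` are hypothetical names, and the limits count in-flight requests with tiny numbers, not the request volumes shown on the slides.

```python
import threading
from contextlib import contextmanager
from typing import Callable, TypeVar

T = TypeVar("T")


class Overloaded(Exception):
    """Raised when a layer is at its limit and sheds the request."""


class Limiter:
    def __init__(self, name: str, limit: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(limit)

    @contextmanager
    def admit(self):
        # Never queue: reject immediately so the caller can back off.
        if not self._slots.acquire(blocking=False):
            raise Overloaded(self.name)
        try:
            yield
        finally:
            self._slots.release()


def handle(proxy: Limiter, server: Limiter, db: Limiter, query: Callable[[], T]) -> T:
    """Admit a request through each layer in turn, outermost first."""
    with proxy.admit(), server.admit(), db.admit():
        return query()


if __name__ == "__main__":
    # Tapered limits: the proxy allows the least, the database the most.
    proxy, server, db = Limiter("proxy", 45), Limiter("server", 50), Limiter("db", 200)
    print(handle(proxy, server, db, lambda: "row"))
```

Because the proxy's limit is the tightest, overload shows up as a cheap rejection at the edge instead of a pile-up at the database.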
  159. failure can be mitigated

  160. How To Prevent Failure

  161. “A beach house isn't just real estate. It's a state of mind.” Douglas Adams - Mostly Harmless (1992)
  162. one Testing

  163. the testing you probably don’t want to do is the testing you need to do most
  164. Speed Scalability Stability

  165. tools help: charles proxy, ipfw / pf, netem, monitoring tools, simian army, ab / siege
  166. two Measure Everything

  167. every result computed should have traceability back to the code & data
  168. gather metadata for everything that touches a request

  169. services: { auth: {…} }

  170. services: { auth: {…}, profile: {…}, recommend: {…} }

  171. services: { auth: {…}, profile: {…}, recommend: {…}, friends: {…} }
  172. services: { auth: {…}, profile: {…}, recommend: {…}, friends: {…} } { version: {…}, stats: {…}, source: {…} }
  173. statistics work, measurements over time will find errors

  174. deviation: … percentiles: 90: 95: histogram: 20x: 121 30x: 12 40x: 13 50x: 121311313
  175. statistics work, we can use them to automate corrective actions
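A summary with the shape shown on the statistics slide (deviation, percentiles, and a histogram by status class) can be produced with Python's standard `statistics` module. This is a sketch: the `summarise` function and the `(status_code, latency_ms)` sample format are assumptions made for the example.

```python
import statistics
from collections import Counter
from typing import Iterable, Tuple


def summarise(samples: Iterable[Tuple[int, float]]) -> dict:
    """Summarise (status_code, latency_ms) samples; needs at least two samples."""
    samples = list(samples)
    latencies = [latency for _, latency in samples]
    # quantiles(n=100) returns 99 cut points; index 89 is p90, index 94 is p95.
    cuts = statistics.quantiles(latencies, n=100)
    return {
        "deviation": statistics.pstdev(latencies),
        "percentiles": {90: cuts[89], 95: cuts[94]},
        # Bucket status codes by class, labelled as on the slide: 200 -> "20x".
        "histogram": dict(Counter(f"{code // 100}0x" for code, _ in samples)),
    }


if __name__ == "__main__":
    data = [(200, 12.0), (200, 15.0), (302, 9.0), (404, 7.0), (500, 950.0)]
    print(summarise(data))
```

Tracked over time, a jump in the `50x` bucket or in the 95th percentile is the kind of signal that can trigger an automated corrective action.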

  176. three Production In Development

  177. production quality data automation of environments lots of testing

  178. production quality data automation of environments lots of testing Rather Old Hat
  179. four Development in Production

  180. yes, really. i want to ship your worst, un-tried, experimental code to production
  181. query /chord

  182. query /chord {id: ab123} datastore ;chord

  183. query /chord {id: ab123} datastore ;chord report ;result

  184. query /chord {id: ab123} datastore ;chord report ;result /chord/ab123 client

  185. split environments

  186. query /chord {id: ab123}

  187. query /chord {id: ab123} production:live

  188. /chord {id: ab123} production:live proxy query

  189. /chord {id: ab123} production:dev proxy query

  190. /chord {id: ab123} production:* proxy query query

  191. tandem deployments

  192. /chord {id: ab123} production:* proxy query query x x

  193. /chord {id: ab123} production:* proxy query query x x

  194. /chord {id: ab123} production:* proxy query query x x
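One possible reading of the tandem deployment slides is a proxy that sends each request to both the live and the experimental `query` service, answers from the live one, and records any disagreement. The sketch below assumes that reading; `tandem`, `live`, `candidate`, and `report` are hypothetical names, and real services would be network calls with timeouts, not plain functions.

```python
from typing import Any, Callable


def tandem(
    live: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    report: Callable[[Any, Any, Any], None],
) -> Callable[[Any], Any]:
    """Wrap two handlers: serve from `live`, compare against `candidate`."""

    def handle(request: Any) -> Any:
        response = live(request)
        try:
            shadow = candidate(request)
            if shadow != response:
                report(request, response, shadow)  # results disagree
        except Exception as error:
            # A crash in experimental code must never reach the client.
            report(request, response, error)
        return response

    return handle


if __name__ == "__main__":
    mismatches = []
    proxy = tandem(
        live=lambda req: {"id": req["id"], "chord": "Am"},
        candidate=lambda req: {"id": req["id"], "chord": "A minor"},
        report=lambda req, old, new: mismatches.append((req, old, new)),
    )
    print(proxy({"id": "ab123"}))
    print(len(mismatches), "mismatch recorded")
```

The client only ever sees the live answer, which is what makes it safe to run untried code against production traffic.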

  195. staged deployments

  196. None
  197. /chord {id: ab123} production:* proxy query query x x

  198. /chord {id: ab123} production:* proxy query x

  199. /chord {id: ab123} production:* proxy query x

  200. None
  201. None
  202. None
  203. None
  204. failure is inevitable

  205. failure is not clean

  206. failure can be mitigated

  207. None
  208. None
  209. None
  210. Unmodified. CC BY 2.0 (https://creativecommons.org/licenses/by/2.0/) https://www.flickr.com/photos/timothymorgan/75288582/ https://www.flickr.com/photos/timothymorgan/75288583/ https://www.flickr.com/photos/timothymorgan/75294154/ https://www.flickr.com/photos/timothymorgan/75593155/ https://www.flickr.com/photos/timothymorgan/75593155/