Breaking Point: Building Scalable, Resilient APIs

Mark Hibberd
February 10, 2015

The default position of a distributed system is failure. Networks fail. Machines fail. Systems fail.

The problem is that APIs are, at their core, complex distributed systems. At some point in their lifetime, most APIs will have to scale: to higher volumes, larger data sets, more clients, or simply to rapid change. When this happens, we want our systems to bend, not break.

This talk is a tour of how systems fail, combining analysis of how complex systems break at scale with anecdotes capturing the lighter side of catastrophic failure. We then ground this in a set of practical tools and techniques for building and testing complex systems for reliability.

Transcript

  1. @markhibberd Breaking Point: Building Scalable, Resilient APIs

  2. “A common mistake that people make when trying to design something completely foolproof was to underestimate the ingenuity of complete fools.” Douglas Adams, Mostly Harmless (1992)

  3. How Did We Get Here

  4. THE API

  5. THE API

  6. THE API

  7. THE API

  8. THE API

  9. THE API

  10. THE API

  11. THE API

  12. THE API G

  13. THE API G

  14. THE API G

  15. THE API G

  16. THE API G

  17. THE API G

  18. THE API G

  19. THE API G $ $ $ $

  20. THE API G $ $ $ $

  21. THE API G $ $ $ $

  22. failure is inevitable

  23. How Systems Fail

  24. “You live and learn. At any rate, you live.” Douglas Adams, Mostly Harmless (1992)

  25. The Crash one

  26. None
  27. None
  28. None
  29. None
  30. Systems Never Fail Cleanly

  31. Cascading Failures two

  32. None
  33. None
  34. None
  35. None
  36. None
  37. None
  38. OCTOBER 27, 2012: Bug Triggers Cascading Failures, Causes Major Amazon Outage. https://aws.amazon.com/message/680342/

  39. At 10:00AM PDT Monday, a small number of Amazon Elastic Block Store (EBS) volumes in one of our five Availability Zones in the US-East Region began seeing degraded performance, and in some cases, became “stuck”.

  40. Can Be Triggered By As Little As A Performance Issue

  41. Don’t Listen To Programmers, Performance Matters

  42. (well, at least asymptotics matter)

  43. A Failure is Indistinguishable from a Slow Response

  44. Chain Reactions three

  45. None
  46. 12.5%

  47. 14.3%

  48. 17%

  49. 17% 100k req/s 12.5k >>> 14.3k >>> 17k req/s

  50. 17% 100k req/s 12.5k >>> 14.3k >>> 17k req/s 4.5% extra traffic means a 36% load increase on each server

  51. 25% 100k req/s 12.5% extra traffic means a 300% load increase on each server 12.5k >>> 14.3k >>> 17k >>> 50k req/s

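The arithmetic behind this chain reaction is easy to lose in the slide build-up, so here is a rough sketch of it, using the figures from the slides (100k req/s over 8 servers; the deck rounds 16.7k up to 17k):

```python
# 100k req/s spread evenly over 8 servers; each crash pushes the same
# traffic onto fewer machines, so per-server load grows faster than expected.
total_rps = 100_000
servers = 8
baseline = total_rps / servers          # 12.5k req/s each

for alive in range(servers, 1, -1):
    per_server = total_rps / alive
    print(f"{alive} servers left: {per_server:8,.0f} req/s each "
          f"(+{(per_server / baseline - 1):.0%} load per server)")
```
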
  52. Capacity Skew four

  53. None
  54. None
  55. None
  56. None
  57. JUNE 11, 2010: A Perfect Storm.....of Whales. Heavy Loads Cause Series of Twitter Outages During World Cup. http://engineering.twitter.com/2010/06/perfect-stormof-whales.html

  58. Since Saturday, Twitter has experienced several incidences of poor site performance and a high number of errors due to one of our internal sub-networks being over-capacity.

  59. Self Denial of Service five

  60. Clients Server

  61. Clients Server

  62. Clients Server

  63. Clients Server

  64. Clients Server

  65. a very painful experience: The Quiet Time

  66. critical licensing service, 100 million+ active users a day, millions of $$$. A couple of “simple” services. Thick clients, non-updatable, load-balanced on client.

  67. server client

  68. /call server client on-demand

  69. /call server client on-demand

  70. /call server client /check on-demand periodically

  71. /call server client /check on-demand periodically

  72. /call server client /check on-demand periodically /check2 /check2z /v3check

  73. /call server client /check on-demand periodically /check2 /check2z /v3check

  74. /call server /check /check2 /check2z /v3check
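
The deck stops at showing the problem here and does not prescribe a fix. One common mitigation for this kind of synchronised self denial of service, sketched below purely as an illustration, is to add jitter to the periodic /check so a fleet of non-updatable thick clients does not phone home in lock-step:

```python
import random
import time

# Illustrative only: spread periodic /check calls out with jitter so every
# client does not hit the server at the same instant after a restart.
def periodic_check(do_check, base_interval=300.0, jitter=0.2):
    while True:                                      # runs for the client's lifetime
        do_check()                                   # e.g. GET /check
        spread = base_interval * jitter
        time.sleep(base_interval + random.uniform(-spread, spread))
```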

  75. System Collusion six

  76. None
  77. None
  78. None
  79. NOVEMBER 14, 2014: Link Imbalance Mystery Issue Causes Metastable Failure State. https://code.facebook.com/posts/1499322996995183/solving-the-mystery-of-link-imbalance-a-metastable-failure-state-at-scale/

  80. Bonded Link Should Evenly Utilise Each Network Pipe

  81. multiple causes for imbalance

  82. systems couldn’t correct themselves

  83. individually each component was behaving correctly

  84. Temporary Latency To Db Caused Skew; Connection Pool Started Favouring Overloaded Link

  85. failure is not clean

  86. Incident Reports Are For You and Your Customers

  87. A Good Incident Report Helps Others Learn From Your Mistake, And Ensures You Really Understand What Went Wrong

  88. 5 Steps To A Good Incident Report: 1. Summary & Impact, 2. Timeline, 3. Root Cause, 4. Resolution and Recovery, 5. Corrective and Preventative Measures. https://sysadmincasts.com/episodes/20-how-to-write-an-incident-report-postmortem

  89. How To Control Failure

  90. “Anything that happens, happens. Anything that, in happening, causes something else to happen, causes something else to happen. Anything that, in happening, causes itself to happen again, happens again. It doesn’t necessarily do it in chronological order, though.” Douglas Adams, Mostly Harmless (1992)

  91. Timeouts one

  92. Other Systems Will Always Be Your Most Vulnerable Failure Modes

  93. Never Make a Network Call, or Go to Disk, Without a Timeout

  94. Pay Particular Attention to Reusable & Pooled Resources

  95. Backoff Requests That Time Out
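
To make these two rules concrete, here is a minimal sketch using only the Python standard library: every call has a timeout, and a call that times out backs off before retrying. The URL and retry budget are illustrative assumptions, not taken from the deck.

```python
import time
import urllib.request

def call_with_timeout(url, attempts=3, timeout=2.0):
    """Never make a network call without a timeout; back off when it fires."""
    delay = 0.5
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError:                    # URLError and socket timeouts
            if attempt == attempts:
                raise                      # out of budget: fail, don't hang
            time.sleep(delay)              # back off before trying again
            delay *= 2
```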

  96. Heartbeats two

  97. If You Know Something Is Failing, Fail Fast

  98. { “name”: 123, “version”: “mth”, “stats”: {…}, “status”: “ok” } /status

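A sketch of serving that heartbeat: only the /status path and the response shape come from the slide; the handler, port and field values are illustrative assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class StatusHandler(BaseHTTPRequestHandler):
    """Minimal heartbeat endpoint so callers can fail fast on a sick service."""
    def do_GET(self):
        if self.path != "/status":
            self.send_error(404)
            return
        body = json.dumps({
            "name": "example-service",   # placeholder values
            "version": "1.0.0",
            "stats": {},
            "status": "ok",              # flip to "failing" so callers back off
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), StatusHandler).serve_forever()
```
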
  99. None
  100. None
  101. None
  102. None
  103. None
  104. None
  105. None
  106. Circuit Breakers three

  107. None
  108. None
  109. None
  110. { “id”: 123, “username”: “mth”, “profile”: { “bio”: “…”, “image”: “…” }, “friends”: [ 191, 1 ] }

  111. { “id”: 123, “username”: “mth”, “profile”: { “bio”: “…”, “image”: “…” }, “friends”: [ 191, 1 ] }

  112. { “id”: 123, “username”: “mth”, “profile”: { “bio”: “…”, “image”: “…” } }

  113. Requires Co-Ordination to Manage Degradation Of Service
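
A compact circuit-breaker sketch in the spirit of the preceding slides: after repeated failures the breaker opens and callers get a degraded response (for example a profile without the friends list) instead of waiting on the failing dependency. The thresholds and names are assumptions, not the talk's code.

```python
import time

class CircuitBreaker:
    """Trip after repeated failures, then fail fast until a cool-down elapses."""
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback            # open: degrade instead of calling
            self.opened_at = None          # half-open: let one call through
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback                # e.g. drop "friends" from the payload
        self.failures = 0
        return result
```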

  114. Partitioning four

  115. None
  116. traffic Spike

  117. traffic Spike

  118. traffic Spike

  119. None
  120. None
  121. traffic Spike

  122. traffic Spike

  123. survives another day

  124. Partitioning Can Also Be Performed Within Services Via Limited Thread & Resource Pools

  125. multiple types of request /shout /download

  126. multiple types of request /shout /download

  127. multiple types of request /shout /download fast

  128. multiple types of request /shout /download fast slow

  129. multiple types of request /shout /download fast really really slow

  130. multiple types of request /shout /download

  131. multiple types of request /shout /download

  132. multiple types of request /shout /download
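
A sketch of that limited-pool partitioning: /shout and /download each get their own bounded pool, so the really, really slow downloads cannot consume every worker and starve the fast requests. Pool sizes and handler names are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per request type: a bulkhead inside the service.
POOLS = {
    "/shout": ThreadPoolExecutor(max_workers=32),    # fast requests
    "/download": ThreadPoolExecutor(max_workers=4),  # really, really slow requests
}

def dispatch(path, handler, *args):
    """Slow downloads queue in their own pool instead of blocking /shout."""
    return POOLS[path].submit(handler, *args)
```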

  133. Backpressure five

  134. Fail One Request, Instead of Failing All Requests

  135. None
  136. Measure And Signal Slow Requests

  137. Upstream Drops Requests To Allow For Recovery

  138. Taper Limits

  139. Taper Limits

  140. Taper Limits 200k Db Requests

  141. Taper Limits 200k Db Requests 50k Server Requests

  142. Taper Limits 200k Db Requests 50k Server Requests 45k Proxy Requests

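A rough sketch of shedding load against a tapered limit: each tier admits slightly less than the tier behind it can actually serve, and a request over the limit is rejected on its own rather than dragging every other request down. The limiter class is an illustration; only the 45k/50k/200k taper echoes the slides.

```python
import threading

class Overloaded(Exception):
    pass

class InFlightLimit:
    """Fail one request, instead of failing all requests."""
    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, handler, *args):
        if not self._slots.acquire(blocking=False):
            raise Overloaded("shed this request so the rest keep flowing")
        try:
            return handler(*args)
        finally:
            self._slots.release()

# Taper: proxy (45k) < server (50k) < database (200k), leaving headroom to recover.
proxy_limit = InFlightLimit(45_000)
```
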
  143. failure can be mitigated

  144. None
  145. How To Prevent Failure

  146. “A beach house isn't just real estate. It's a state of mind.” Douglas Adams, Mostly Harmless (1992)

  147. one Testing

  148. the testing you probably don’t want to do is the testing you need to do most

  149. Speed Scalability Stability

  150. tools help: charles proxy, ipfw / pf, netem, monitoring tools, simian army, ab / siege

  151. two Measure Everything

  152. every result computed should have traceability back to the code & data

  153. gather metadata for everything that touches a request

  154. services: { auth: {…} }

  155. services: { auth: {…}, profile: {…}, recommend: {…} }

  156. services: { auth: {…}, profile: {…}, recommend: {…}, friends: {…} }

  157. services: { auth: {…}, profile: {…}, recommend: {…}, friends: {…} } { version: {…}, stats: {…}, source: {…} }

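One way to read these slides: every service that touches a request adds its own version, stats and source block to the response, so any result can be traced back to the code and data that produced it. A hedged sketch, with field and function names assumed:

```python
# Each hop annotates the response it passes on with its own metadata.
def annotate(response, service, version, stats, source):
    response.setdefault("services", {})[service] = {
        "version": version,   # traceability back to the code
        "stats": stats,       # e.g. timings for this hop
        "source": source,     # traceability back to the data
    }
    return response

response = {"id": 123, "username": "mth"}
annotate(response, "auth", {"build": "…"}, {"ms": 12}, {"dataset": "…"})
annotate(response, "profile", {"build": "…"}, {"ms": 31}, {"dataset": "…"})
```
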
  158. statistics work, measurements over time will find errors

  159. deviation: … percentiles: 90: 95: histogram: 20x: 121 30x: 12 40x: 13 50x: 121311313

  160. statistics work, we can use them to automate corrective actions
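
A small sketch of using those measurements to automate a corrective action; the thresholds and the open_circuit hook are assumptions for illustration only.

```python
import statistics

def check_window(latencies_ms, status_codes, open_circuit):
    """Measurements over a window find errors and can trigger a response."""
    p95 = statistics.quantiles(latencies_ms, n=20)[-1]        # 95th percentile
    error_rate = sum(c >= 500 for c in status_codes) / len(status_codes)
    if p95 > 500 or error_rate > 0.05:
        open_circuit()          # e.g. fail fast / shed load until it recovers
```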

  161. three Production In Development

  162. production quality data, automation of environments, lots of testing

  163. production quality data, automation of environments, lots of testing: Rather Old Hat

  164. four Development in Production

  165. yes, really. i want to ship your worst, un-tried, experimental code to production

  166. @ambiata: we deal with ingesting and processing lots of data, 100s of TB per day per customer; scientific experiment and measurement is key; experiments affect users directly; researchers / non-specialist engineers produce code

  167. query /chord

  168. query /chord {id: ab123} datastore ;chord

  169. query /chord {id: ab123} datastore ;chord report ;result

  170. query /chord {id: ab123} datastore ;chord report ;result /chord/ab123 client

  171. split environments

  172. query /chord {id: ab123}

  173. query /chord {id: ab123} production:live

  174. /chord {id: ab123} production:live proxy query

  175. /chord {id: ab123} production:exp proxy query

  176. /chord {id: ab123} production:* proxy query query
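
A hedged sketch of the proxy fan-out shown on these slides: the same query is duplicated to the live and experiment environments, only the live answer goes back to the client, and an experiment failure is invisible to users. The environment names and the send function are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

_exp_pool = ThreadPoolExecutor(max_workers=8)

def proxy_query(request, send):
    """Duplicate traffic: the experiment gets a copy, live answers the client."""
    _exp_pool.submit(send, "production:exp", request)   # fire and forget
    return send("production:live", request)             # only live is user-visible
```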

  177. implemented through machine level acls experiment live control

  178. implemented through machine level acls experiment live control write read

  179. implemented through machine level acls experiment live control

  180. implemented through machine level acls experiment live control write read

  181. implemented through machine level acls experiment live control write read

  182. checkpoints

  183. query /chord {id: ab123} datastore ;chord report ;result /chord/ab123 client x x

  184. query /chord {id: ab123} datastore ;chord report ;result /chord/ab123 client x x

  185. query /chord {id: ab123} datastore ;chord report ;result /chord/ab123 client x x

  186. query /chord {id: ab123} datastore ;chord report ;result x x behaviour change through in production testing

  187. query /chord {id: ab123} datastore ;chord report ;result /chord/ab123 client x x

  188. deep implementation, intra- and inter-process crosschecks
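
A sketch of a checkpoint crosscheck in that spirit: compare the experiment's output with the live output at a checkpoint, report any divergence, and keep serving the live result so behaviour change stays gated behind in-production testing. All names here are illustrative.

```python
def crosscheck(live_result, experiment_result, report):
    """Gate a behaviour change behind an in-production comparison."""
    if experiment_result != live_result:
        report("checkpoint mismatch",
               live=live_result, experiment=experiment_result)
        return live_result          # keep serving the known-good answer
    return experiment_result
```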

  189. tandem deployments

  190. /chord {id: ab123} production:* proxy query query x x

  191. /chord {id: ab123} production:* proxy query query x x

  192. /chord {id: ab123} production:* proxy query query x x

  193. staged deployments

  194. None
  195. /chord {id: ab123} production:* proxy query query x x

  196. /chord {id: ab123} production:* proxy query x

  197. /chord {id: ab123} production:* proxy query x

  198. None
  199. None
  200. None
  201. None
  202. failure is inevitable

  203. failure is not clean

  204. failure can be mitigated

  205. None
  206. Unmodified. CC BY 2.0 (https://creativecommons.org/licenses/by/2.0/): https://www.flickr.com/photos/timothymorgan/75288582/ https://www.flickr.com/photos/timothymorgan/75288583/ https://www.flickr.com/photos/timothymorgan/75294154/ https://www.flickr.com/photos/timothymorgan/75593155/