
Crossroads of asynchrony and graceful degradation at QCon SF 2015

Nitesh Kant
November 17, 2015

Netflix, with more than 60 million subscribers worldwide and accounting for a third of the internet traffic in the United States, is a highly available internet service. To keep the service highly available, we have architected our systems so that the different failure modes in distributed systems cause graceful degradation and not unavailability.

In our constant endeavor to improve the availability of our services, we are on a path to embrace asynchrony throughout our services, using libraries like RxJava and RxNetty. Moving from a synchronous world to asynchronous applications brings interesting challenges as well as novel solutions, specifically in handling the various failure modes of distributed systems such as latency, partial failures and abusive clients.

In this talk, Nitesh Kant describes how embracing asynchrony in our applications, from networking to business processing, creates gracefully degrading and highly resilient applications.

Presented at QCon SF 2015: https://qconsf.com/sf2015/presentation/crossroads-of-asynchrony-and-graceful-degradation

Video: https://www.infoq.com/presentations/netflix-asynchronous-apps

Transcript

  1. Crossroads of asynchrony and graceful degradation

    Nitesh Kant, Software Engineer, Netflix Edge Engineering. @NiteshKant
  2. None
  3. None
  4. None
  5. Nitesh Kant. Who Am I?

    ❖ Engineer, Edge Engineering, Netflix.
    ❖ Core contributor, RxNetty (https://github.com/ReactiveX/RxNetty)
    ❖ Contributor, Zuul (https://github.com/Netflix/zuul)
    @NiteshKant
  6. None
  7. How do systems fail?

  8. A simple example. Showing a movie on Netflix.

  9. Video Metadata

  10. Video Bookmark

  11. Video Rating

  12. public Movie getMovie(String movieId) {
          Metadata metadata = getMovieMetadata(movieId);
          Bookmark bookmark = getBookmark(movieId, userId);
          Rating rating = getRatings(movieId);
          return new Movie(metadata, bookmark, rating);
      }

    Disclaimer: This is an example and not an exact representation of the processing
  13. Synchronicity (getMovie code as on slide 12)

  14. The bigger picture. Price of being synchronous? (getMovie code as on slide 12)

  15. In a microservices world. Edge Service, Ratings Service, Video Metadata Service, Bookmarks Service

  16. In a microservices world. Edge Service: Server threadpool (5 threads). getMovieMetadata(movieId)

  17. In a microservices world. Edge Service: Server threadpool (5 threads). getMovieMetadata(movieId), getBookmark(movieId, userId)

  18. In a microservices world. Edge Service: Server threadpool (5 threads). getMovieMetadata(movieId), getBookmark(movieId, userId), getRatings(movieId)

  19. Busy thread time = Sum of the time taken to

    make all 3 service calls
  20. How do systems fail? 1. Latency Latency is your worst

    enemy in a synchronous world.
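
The busy-time rule on slide 19 can be turned into a small calculation. The sketch below is illustrative only: the latencies and the five-thread pool are assumed numbers, and `maxRequestsPerSecond` is an upper bound for a blocking thread-per-request server, not a measurement of any real system.

```java
import java.util.List;

class BusyThreadTime {
    // Synchronous calls run one after another, so the request thread is held for the sum.
    static long busyMillis(List<Long> callLatenciesMillis) {
        long total = 0;
        for (long latency : callLatenciesMillis) {
            total += latency;
        }
        return total;
    }

    // Upper bound on throughput when each request pins one thread for busyMillis.
    static double maxRequestsPerSecond(int threads, long busyMillis) {
        return threads * 1000.0 / busyMillis;
    }

    public static void main(String[] args) {
        // Hypothetical latencies for the metadata, bookmark and ratings calls.
        long healthy = busyMillis(List.of(20L, 30L, 50L));
        long slowRatings = busyMillis(List.of(20L, 30L, 2000L));
        System.out.println(healthy + " ms busy, at most "
                + maxRequestsPerSecond(5, healthy) + " req/s");
        System.out.println(slowRatings + " ms busy, at most "
                + maxRequestsPerSecond(5, slowRatings) + " req/s");
    }
}
```

With these assumed numbers, one slow dependency stretches the busy time from 100 ms to 2050 ms, and the same five threads can serve far fewer requests per second.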
  21. Edge Service: Server threadpool (5 threads). getRatings(movieId)

  22. Ratings Service. Edge Service: Server threadpool (5 threads). getRatings(movieId)

  23. Edge Service: getRatings(movieId). Server threadpool (5 threads)

  24. Edge Service: getRatings(movieId). Server threadpool (5 threads)

  25. Edge Service: Server threadpool (5 threads). Client Threadpool (5 threads). getRatings(movieId)

  26. Edge Service: Server threadpool (5 threads). Client Threadpool (5 threads). getRatings(movieId)

  27. Edge Service: Server threadpool (5 threads). Client Threadpool (5 threads). getRatings(movieId)

  28. Managing client thread pools. Client Threadpool (5 threads)

  29. Managing client thread pools

  30. Managing client thread pools

  31. Managing client thread pools

  32. Managing client thread pools

  33. Clients have become our babies

  34. Clients have become our babies. Edge Service: Server threadpool (5 threads). Three client threadpools (5 threads each). getMovieMetadata(movieId), getBookmark(movieId, userId), getRatings(movieId)

  35. Clients have become our babies

  36. Untuned/Wrongly tuned clients cause many outages.

  37. Have we exchanged a bigger problem for a smaller one?

  38. How do systems fail? 2. Overload Abusive clients, recovery spikes,

    special events ….
  39. We did a load test… https://github.com/Netflix-Skunkworks/WSPerfLab Hello Netflix!

  40. Detailed analysis available online: https://github.com/Netflix-Skunkworks/WSPerfLab/blob/master/test-results/RxNetty_vs_Tomcat_April2015.pdf

  41. None
  42. !Graceful. This isn’t graceful degradation!

  43. This happens at high CPU usage.

  44. This happens at high CPU usage. So, don’t let the

    system reach that limit…
  45. This happens at high CPU usage. So, don’t let the

    system reach that limit… a.k.a Throttling.
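
A minimal sketch of the throttling idea on slides 43 to 45, assuming the limit is expressed as a cap on in-flight requests. The `Throttle` class and its names are hypothetical and are not taken from the talk or from any Netflix library.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class Throttle {
    private final Semaphore permits;

    Throttle(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Runs the work if a permit is free; otherwise rejects immediately with the given value.
    <T> T call(Supplier<T> work, T rejected) {
        if (!permits.tryAcquire()) {
            return rejected;
        }
        try {
            return work.get();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        Throttle throttle = new Throttle(1);
        // The outer call holds the only permit, so the nested call is rejected.
        String result = throttle.call(
                () -> throttle.call(() -> "inner ran", "inner rejected"),
                "outer rejected");
        System.out.println(result); // prints "inner rejected"
    }
}
```

Rejecting early keeps the system below the point where it stops degrading gracefully, but a single shared limit like this one says nothing about fairness between request types, which is the concern slide 46 raises.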
  46. Fairness? One abusive request type can penalize other request paths.

  47. How do systems fail? 3. Thundering herds The failure after

    recovery….
  48. Retries. Edge Service, Video Metadata Service

  49. Retries. Edge Service, Video Metadata Service Cluster

  50. Retries. Edge Service, Video Metadata Service Cluster

  51. Retries. Edge Service, Video Metadata Service Cluster

  52. Retries are useful in steady state…. …but…

  53. Retries. Edge Service, Video Metadata Service Cluster

  54. None
  55. Our systems are missing empathy.

  56. Because they lack knowledge about the peers.

  57. Knowledge comes from various signals.

  58. Ability to adapt to those signals is important.

  59. This cannot adapt…

    public Movie getMovie(String movieId) {
        Metadata metadata = getMovieMetadata(movieId);
        Bookmark bookmark = getBookmark(movieId, userId);
        Rating rating = getRatings(movieId);
        return new Movie(metadata, bookmark, rating);
    }
  60. Asynchrony It is the key to success.

  61. What should be async?

  62. What should be async? Edge Service Video Metadata Service

  63. What should be async? Edge Service Video Metadata Service getMovieMetadata(movieId)

    getBookmark(movieId, userId) getRatings(movieId) Application logic
  64. What should be async? Edge Service Video Metadata Service getMovieMetadata(movieId)

    getBookmark(movieId, userId) getRatings(movieId) I/O I/O I/O Application logic I/O
  65. What should be async? Edge Service Video Metadata Service getMovieMetadata(movieId)

    getBookmark(movieId, userId) getRatings(movieId) I/O I/O I/O Application logic I/O Network protocol
  66. Key aspects of being async.

  67. Key aspects of being async. 1. Lifecycle control

  68. Lifecycle control Start processing Stop processing

  69. Key aspects of being async. 2. Flow control

  70. Flow control When How much
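
The two aspects named so far, lifecycle control (start and stop processing) and flow control (when and how much), can be sketched with the JDK's `java.util.concurrent.Flow` interfaces. This is a simplified, single-threaded illustration rather than a complete Reactive Streams implementation; `range` and `Collector` are hypothetical helpers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Flow;

class FlowControlSketch {
    // Hypothetical source: emits 1..count, but only as many items as were requested.
    static Flow.Publisher<Integer> range(int count) {
        return subscriber -> subscriber.onSubscribe(new Flow.Subscription() {
            private int next = 1;
            private boolean cancelled;
            private boolean done;

            @Override
            public void request(long n) {
                for (long i = 0; i < n && !cancelled && next <= count; i++) {
                    subscriber.onNext(next++);
                }
                if (!cancelled && !done && next > count) {
                    done = true;
                    subscriber.onComplete();
                }
            }

            @Override
            public void cancel() {
                cancelled = true;
            }
        });
    }

    static class Collector implements Flow.Subscriber<Integer> {
        final List<Integer> received = new ArrayList<>();
        Flow.Subscription subscription;
        boolean completed;

        @Override public void onSubscribe(Flow.Subscription s) { subscription = s; }
        @Override public void onNext(Integer item) { received.add(item); }
        @Override public void onError(Throwable t) { throw new IllegalStateException(t); }
        @Override public void onComplete() { completed = true; }
    }

    public static void main(String[] args) {
        Collector consumer = new Collector();
        range(10).subscribe(consumer);         // start processing: nothing flows yet
        consumer.subscription.request(3);      // when and how much: the consumer asks for 3
        consumer.subscription.cancel();        // stop processing
        consumer.subscription.request(5);      // ignored after cancel
        System.out.println(consumer.received); // prints [1, 2, 3]
    }
}
```

The consumer, not the producer, decides how many items flow and when the work ends.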

  71. Key aspects of being async. 3. Function composition

  72. Function composition

    public Movie getMovie(String movieId) {
        Metadata metadata = getMovieMetadata(movieId);
        Bookmark bookmark = getBookmark(movieId, userId);
        Rating rating = getRatings(movieId);
        return new Movie(metadata, bookmark, rating);
    }
  73. Function composition. Composing the processing of a method into a single control point.

    public Observable<Movie> getMovie(String movieId) {
        return Observable.zip(getMovieMetadata(movieId),
                              getBookmark(movieId, userId),
                              getRatings(movieId),
                              (meta, bmark, rating) -> new Movie(meta, bmark, rating));
    }
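
For readers without RxJava on the classpath, the same composition can be sketched with the JDK's `CompletableFuture`. This is a stand-in for the slide's `Observable.zip`, not the talk's implementation: the three service methods below are hypothetical stubs that return strings, and `userId` is passed explicitly.

```java
import java.util.concurrent.CompletableFuture;

class ComposeSketch {
    static final class Movie {
        final String metadata;
        final String bookmark;
        final String rating;

        Movie(String metadata, String bookmark, String rating) {
            this.metadata = metadata;
            this.bookmark = bookmark;
            this.rating = rating;
        }
    }

    // Hypothetical stubs; real calls would perform non-blocking I/O.
    static CompletableFuture<String> getMovieMetadata(String movieId) {
        return CompletableFuture.supplyAsync(() -> "metadata:" + movieId);
    }

    static CompletableFuture<String> getBookmark(String movieId, String userId) {
        return CompletableFuture.supplyAsync(() -> "bookmark:" + movieId + ":" + userId);
    }

    static CompletableFuture<String> getRatings(String movieId) {
        return CompletableFuture.supplyAsync(() -> "rating:" + movieId);
    }

    // All three calls are started before any result is awaited;
    // the returned future is the single handle for the whole operation.
    static CompletableFuture<Movie> getMovie(String movieId, String userId) {
        CompletableFuture<String> metadata = getMovieMetadata(movieId);
        CompletableFuture<String> bookmark = getBookmark(movieId, userId);
        CompletableFuture<String> rating = getRatings(movieId);
        return metadata
                .thenCombine(bookmark, (m, b) -> new String[] {m, b})
                .thenCombine(rating, (mb, r) -> new Movie(mb[0], mb[1], r));
    }

    public static void main(String[] args) {
        Movie movie = getMovie("42", "alice").join();
        System.out.println(movie.metadata + " " + movie.bookmark + " " + movie.rating);
    }
}
```

One difference is worth keeping in mind: cancelling the combined `CompletableFuture` does not cancel the three upstream calls, whereas unsubscribing from a zipped `Observable` propagates to its sources, which is what makes the Rx version suitable for the lifecycle control discussed on the preceding slides.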
  74. Composing the processing of a method into a single control

    point. Flow & Lifecycle Control with
  75. What should be async? Edge Service Video Metadata Service getMovieMetadata(movieId)

    getBookmark(movieId, userId) getRatings(movieId) I/O I/O I/O Application logic I/O Network protocol
  76. I/O. Edge Service: Server threadpool (5 threads). Client Threadpool (5 threads). getRatings(movieId)

  77. I/O. Edge Service: Eventloop (Inbound) with 5 connections. Eventloops = f (Number of cores)

  78. I/O. Edge Service: Eventloop (Inbound) with 5 connections. Connections multiplexed on a single eventloop.

  79. I/O. Edge Service: Eventloop (Inbound) with 5 connections. getMovieMetadata(movieId). Eventloop (Outbound) with 5 connections

  80. I/O. Edge Service: Eventloop (Inbound) with 5 connections. getMovieMetadata(movieId). Eventloop (Outbound) with 5 connections. Clients share the eventloops with the server.

  81. I/O. Edge Service: Eventloop (Inbound) with 5 connections. getMovieMetadata(movieId). Eventloop (Outbound) with 5 connections. All clients share the same eventloop

  82. Composing the processing of a service into a single control

    point. Flow & Lifecycle Control with
  83. What should be async? Edge Service Video Metadata Service getMovieMetadata(movieId)

    getBookmark(movieId, userId) getRatings(movieId) I/O I/O I/O Application logic I/O Network protocol
  84. Network Protocol HTTP/1.1?

  85. HTTP/1.1 GET /movie?id=1 HTTP/1.1

  86. HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=1 HTTP/1.1

  87. HTTP/1.1 GET /movie?id=3 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=1 HTTP/1.1

  88. HTTP/1.1 GET /movie?id=3 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=1 HTTP/1.1

    HTTP/1.1 200 OK ID: 1 …
  89. HTTP/1.1 GET /movie?id=3 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=1 HTTP/1.1

    HTTP/1.1 200 OK ID: 1 … HTTP/1.1 200 OK ID: 2 …
  90. HTTP/1.1 GET /movie?id=3 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=1 HTTP/1.1

    HTTP/1.1 200 OK ID: 1 … HTTP/1.1 200 OK ID: 2 … HTTP/1.1 200 OK ID: 3 …
  91. HTTP/1.1 GET /movie?id=3 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=1 HTTP/1.1

    HTTP/1.1 200 OK ID: 1 … HTTP/1.1 200 OK ID: 2 … HTTP/1.1 200 OK ID: 3 … Head Of Line Blocking => Synchronous
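
The effect of head-of-line blocking on a pipelined connection can be shown with a small calculation. The sketch assumes the server works on all three requests concurrently and ignores transmission time; the ready times are made-up numbers in which movie 1 is the slow one.

```java
import java.util.Arrays;

class HeadOfLineSketch {
    // HTTP/1.1 pipelining: responses are written in request order, so response i
    // cannot leave before every earlier response has left.
    static long[] inOrderDelivery(long[] readyAtMillis) {
        long[] delivered = new long[readyAtMillis.length];
        long latest = 0;
        for (int i = 0; i < readyAtMillis.length; i++) {
            latest = Math.max(latest, readyAtMillis[i]);
            delivered[i] = latest;
        }
        return delivered;
    }

    // Multiplexed protocol: each response is tagged with its stream
    // and leaves as soon as it is ready.
    static long[] multiplexedDelivery(long[] readyAtMillis) {
        return readyAtMillis.clone();
    }

    public static void main(String[] args) {
        long[] readyAt = {500, 10, 20}; // hypothetical: /movie?id=1 is slow
        System.out.println(Arrays.toString(inOrderDelivery(readyAt)));     // prints [500, 500, 500]
        System.out.println(Arrays.toString(multiplexedDelivery(readyAt))); // prints [500, 10, 20]
    }
}
```

In the in-order case the two fast responses wait behind the slow one, which is why the slide equates head-of-line blocking with synchronous behavior.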
  92. Network Protocol HTTP/1.1?

  93. Network Protocol We need a multiplexed bi-directional protocol

  94. Multiplexed GET /movie?id=1 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=3 HTTP/1.1

  95. Bi-directional GET /movie?id=1 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=3 HTTP/1.1

    CANCEL
  96. Composing the processing of the entire application into a single

    control point. Flow & Lifecycle Control with
  97. Edge Service, Video Metadata Service, Rating service, C* store, C* store. /movie?id=123

  98. Edge Service, Video Metadata Service, Rating service, C* store, C* store. /movie?id=123 Observable<Movie>

  99. Observable<Movie> Composing the processing of the entire application into a

    single control point.
  100. Revisiting the failure modes

  101. Latency. Edge Service: Eventloop (Inbound) with 5 connections. getMovieMetadata(movieId). Eventloop (Outbound) with 5 connections

  102. Latency. Edge Service: Eventloop (Inbound) with 5 connections. getMovieMetadata(movieId). Eventloop (Outbound) with 5 connections. Impact is localized to the connection.

  103. Latency. Edge Service: Eventloop (Inbound) with 5 connections. getMovieMetadata(movieId). Eventloop (Outbound) with 5 connections. Impact is localized to the connection. An outstanding request has little cost.

  104. An outstanding request has little cost. Request: GET /movie?id=1 HTTP/1.1. Response: HTTP/1.1 200 OK … Any state stored between request and response is costly.

  105. Outstanding requests have a low cost, so latency is a lesser evil in asynchronous systems.

  106. Overload & Thundering Herds. Edge Service: Eventloop (Inbound) with 5 connections. getMovieMetadata(movieId). Eventloop (Outbound) with 5 connections. Reduce work done when overloaded

  107. Overload & Thundering Herds. Edge Service: Eventloop (Inbound) with 5 connections. getMovieMetadata(movieId). Eventloop (Outbound) with 5 connections. Reduce work done when overloaded. Stop accepting new requests.

  108. Stop accepting new requests Non-blocking I/O gives better control

  109. Stop accepting new requests But … we are still “throttling”

  110. Stop accepting new requests Are we being empathetic?

  111. Request-leasing http://reactivesocket.io/

  112. Request-leasing Peer 1 Peer 2 Network connection

  113. Request-leasing Peer 1 Peer 2 “Lease” 5 requests for 1

    minute. Network connection
  114. Request-leasing Peer 1 Peer 2 “Lease” 5 requests for 1

    minute. GET /movie?id=1 HTTP/1.1 GET /movie?id=2 HTTP/1.1 Network connection
  115. Server Client 1 Client 2 Client 8

  116. Server Capacity: 100 RPM Client 1 Client 2 Client 8

    “Lease” 10 requests for 1 minute.
  117. Server Capacity: 100 RPM Client 1 Client 2 Client 8

    “Lease” 10 requests for 1 minute. “Lease” 10 requests for 1 minute. “Lease” 10 requests for 1 minute.
  118. Server Capacity: 100 RPM Client 1 Client 2 Client 8

    “Lease” 10 requests for 1 minute. “Lease” 10 requests for 1 minute. “Lease” 10 requests for 1 minute. Reserve Capacity: 20 RPM
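
The numbers on slides 116 to 118 follow from simple budgeting: 100 RPM of capacity, 20 RPM held in reserve, and 8 clients. The even split below is an assumption made for illustration; the slides do not say how the server chooses lease sizes.

```java
class LeaseBudget {
    // Splits (capacity - reserve) evenly across clients.
    // Integer division: any remainder stays with the server.
    static int requestsPerClient(int capacityPerMinute, int reservePerMinute, int clients) {
        if (clients <= 0) {
            throw new IllegalArgumentException("clients must be positive");
        }
        int leasable = Math.max(0, capacityPerMinute - reservePerMinute);
        return leasable / clients;
    }

    public static void main(String[] args) {
        // Capacity 100 RPM, reserve 20 RPM, 8 clients, as on the slides.
        System.out.println(requestsPerClient(100, 20, 8)); // prints 10
    }
}
```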
  119. Time bound lease. “Lease” 10 requests for 1 minute.

  120. Time bound lease. No extra work for cancelling leases. “Lease”

    10 requests for 1 minute.
  121. Time bound lease. No extra work for cancelling leases. Receiver

    controls the flow of requests “Lease” 10 requests for 1 minute.
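
A time-bound lease can be modeled on the sending side as a request budget plus an expiry time. The `Lease` class below is a hypothetical sketch of that idea, not the ReactiveSocket API; the clock is injected so that expiry can be exercised without waiting.

```java
import java.util.function.LongSupplier;

class Lease {
    private final long expiresAtMillis;
    private final LongSupplier clock;
    private int remaining;

    Lease(int allowedRequests, long ttlMillis, LongSupplier clock) {
        this.remaining = allowedRequests;
        this.clock = clock;
        this.expiresAtMillis = clock.getAsLong() + ttlMillis;
    }

    // Sender side: a request may go out only while the lease is unexpired and has budget left.
    synchronized boolean tryUse() {
        if (clock.getAsLong() >= expiresAtMillis || remaining == 0) {
            return false;
        }
        remaining--;
        return true;
    }

    public static void main(String[] args) {
        // "Lease" 10 requests for 1 minute.
        Lease lease = new Lease(10, 60_000, System::currentTimeMillis);
        int sent = 0;
        for (int i = 0; i < 12; i++) {
            if (lease.tryUse()) {
                sent++;
            }
        }
        System.out.println(sent); // prints 10
    }
}
```

Because the lease expires on its own, the receiver never has to send a message to revoke it, which matches the "no extra work for cancelling leases" point on slide 120.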
  122. When things go south

  123. Server Capacity: 20 RPM Client 1 Client 2 Client 8

    “Lease” 5 requests for 1 minute. “Lease” 2 requests for 1 minute. X No more “Lease”
  124. Server Capacity: 20 RPM Client 1 Client 2 Client 8

    “Lease” 5 requests for 1 minute. “Lease” 2 requests for 1 minute. X No more “Lease” Prioritization
  125. Managing client configs?

  126. Threadpools? Edge Service: Eventloop (Inbound) with 5 connections. getMovieMetadata(movieId). Eventloop (Outbound) with 5 connections. I/O is non-blocking.

  127. Threadpools? Edge Service: Eventloop (Inbound) with 5 connections. getMovieMetadata(movieId). Eventloop (Outbound) with 5 connections. Application code is non-blocking.

  128. Threadpools? Edge Service: Eventloop (Inbound) with 5 connections. getMovieMetadata(movieId). Eventloop (Outbound) with 5 connections. No blocking/Waiting => Only CPU work

  129. Threadpools? Edge Service: Eventloop (Inbound) with 5 connections. getMovieMetadata(movieId). Eventloop (Outbound) with 5 connections. No blocking/Waiting => Only CPU work. So, Eventloops = # of cores

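
A rough sketch of the sizing rule, using plain single-threaded executors in place of a real event loop. The class is hypothetical and does no I/O; it only shows one loop per core and each connection pinned to one loop.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class EventLoopGroupSketch {
    private final ExecutorService[] loops;

    EventLoopGroupSketch(int loopCount) {
        loops = new ExecutorService[loopCount];
        for (int i = 0; i < loopCount; i++) {
            loops[i] = Executors.newSingleThreadExecutor();
        }
    }

    // With no blocking calls there is only CPU work, so one loop per core is enough.
    static EventLoopGroupSketch onePerCore() {
        return new EventLoopGroupSketch(Runtime.getRuntime().availableProcessors());
    }

    // A connection stays on one loop for its lifetime, so its handlers never run concurrently.
    ExecutorService loopFor(int connectionId) {
        return loops[Math.floorMod(connectionId, loops.length)];
    }

    int size() {
        return loops.length;
    }

    void shutdown() {
        for (ExecutorService loop : loops) {
            loop.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        EventLoopGroupSketch group = onePerCore();
        Callable<String> whoAmI = () -> Thread.currentThread().getName();
        String thread = group.loopFor(7).submit(whoAmI).get();
        System.out.println(group.size() + " event loops; connection 7 runs on " + thread);
        group.shutdown();
    }
}
```

The only sizing input is the core count, so there is no per-client pool size to tune.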
  130. Tuning parameters?

  131. Tuning parameters? X

  132. Tuning parameters? X X

  133. Tuning parameters? X X ? ?

  134. Case for timeouts?

  135. Case for timeouts?

    Read Timeouts: ✤ Useful in unblocking threads on socket reads.
    Thread Timeouts: ✤ Business level SLA. ✤ Unblock the calling thread.
  136. Case for timeouts?

    Read Timeouts: ✤ Useful in unblocking threads on socket reads.
    Thread Timeouts: ✤ Business level SLA. ✤ Unblock the calling thread.
    X X X As there are no blocking calls.
  137. Case for timeouts?

    Read Timeouts: ✤ Useful in unblocking threads on socket reads.
    Thread Timeouts: ✤ Business level SLA. ✤ Unblock the calling thread.
  138. Business level SLA. Edge Service, Video Metadata Service

  139. Business level SLA. Edge Service, Video Metadata Service, Rating service, C* store, C* store

  140. Business level SLA. Edge Service, Video Metadata Service, Rating service, C* store, C* store

  141. Business level SLA. Edge Service, Video Metadata Service, Rating service, C* store, C* store. Thread timeouts are pretty invasive at every level

  142. Business level SLA. Edge Service, Video Metadata Service, Rating service, C* store, C* store. Thread timeouts are pretty invasive at every level. Do we need them at every step?

  143. Edge Service, Video Metadata Service, Rating service, C* store, C* store. /movie?id=123

  144. Edge Service, Video Metadata Service, Rating service, C* store, C* store. /movie?id=123 Business timeouts are for a client request.

  145. Edge Service, Video Metadata Service, Rating service, C* store, C* store. /movie?id=123

  146. Edge Service, Video Metadata Service, Rating service, C* store, C* store. /movie?id=123 X

  147. Edge Service, Video Metadata Service, Rating service, C* store, C* store. /movie?id=123 X X X X X

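
One way to read the crossed-out timeouts: keep a single business timeout where the client request enters the system and drop the per-hop ones. The sketch below uses `CompletableFuture.orTimeout` (Java 9 and later) as a stand-in; the method name, the 50 ms value and the fallback string are assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class EdgeTimeoutSketch {
    // One deadline for the whole client request, applied at the edge.
    static CompletableFuture<String> withBusinessTimeout(
            CompletableFuture<String> movie, long timeoutMillis, String fallback) {
        return movie
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(error -> fallback);
    }

    public static void main(String[] args) {
        // A downstream call that never answers.
        CompletableFuture<String> stuck = new CompletableFuture<>();
        String page = withBusinessTimeout(stuck, 50, "degraded movie page").join();
        System.out.println(page); // prints "degraded movie page"
    }
}
```

In the model the talk describes, the edge would also cancel the outstanding downstream work when this deadline fires. `CompletableFuture` does not propagate cancellation upstream, so that part is not shown here.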
  148. Tuning parameters? X X X X

  149. Less tuning

  150. None
  151. Edge Service, Video Metadata Service, Rating service, C* store, C* store. Request Leases, Cancellations, Observable<Movie>

  152. Edge Service, Video Metadata Service, Rating service, C* store, C* store. Request Leases, Cancellations, Observable<Movie>

  153. None
  154. public Movie getMovie(String movieId) {
           Metadata metadata = getMovieMetadata(movieId);
           Bookmark bookmark = getBookmark(movieId, userId);
           Rating rating = getRatings(movieId);
           return new Movie(metadata, bookmark, rating);
       }

    public Observable<Movie> getMovie(String movieId) {
        return Observable.zip(getMovieMetadata(movieId),
                              getBookmark(movieId, userId),
                              getRatings(movieId),
                              (meta, bmark, rating) -> new Movie(meta, bmark, rating));
    }
  155. Resources

    Asynchronous function composition: https://github.com/ReactiveX/RxJava
    I/O: https://github.com/ReactiveX/RxNetty
    Network protocol: http://reactivesocket.io/
  156. Nitesh Kant, Engineer, Netflix Edge Gateway @NiteshKant