Slide 1

Slide 1 text

Crossroads of asynchrony and graceful degradation Nitesh Kant, Software Engineer, Netflix Edge Engineering. @NiteshKant

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Nitesh Kant Who Am I? ❖ Engineer, Edge Engineering, Netflix. ❖ Core contributor, RxNetty* ❖ Contributor, Zuul** * https://github.com/ReactiveX/RxNetty ** https://github.com/Netflix/zuul @NiteshKant

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

How do systems fail?

Slide 8

Slide 8 text

A simple example. Showing a movie on Netflix.

Slide 9

Slide 9 text

Video Metadata

Slide 10

Slide 10 text

Video Bookmark

Slide 11

Slide 11 text

Video Rating

Slide 12

Slide 12 text

public Movie getMovie(String movieId) {
    Metadata metadata = getMovieMetadata(movieId);
    Bookmark bookmark = getBookmark(movieId, userId);
    Rating rating = getRatings(movieId);
    return new Movie(metadata, bookmark, rating);
}
Disclaimer: This is an example and not an exact representation of the processing

Slide 13

Slide 13 text

Synchronicity
public Movie getMovie(String movieId) {
    Metadata metadata = getMovieMetadata(movieId);
    Bookmark bookmark = getBookmark(movieId, userId);
    Rating rating = getRatings(movieId);
    return new Movie(metadata, bookmark, rating);
}

Slide 14

Slide 14 text

The bigger picture: Price of being synchronous?
public Movie getMovie(String movieId) {
    Metadata metadata = getMovieMetadata(movieId);
    Bookmark bookmark = getBookmark(movieId, userId);
    Rating rating = getRatings(movieId);
    return new Movie(metadata, bookmark, rating);
}

Slide 15

Slide 15 text

In a microservices world Edge Service Ratings Service Video Metadata Service Bookmarks Service

Slide 16

Slide 16 text

In a microservices world Edge Service Server threadpool Thread Thread Thread Thread Thread getMovieMetadata(movieId)

Slide 17

Slide 17 text

In a microservices world Edge Service Server threadpool Thread Thread Thread Thread Thread getMovieMetadata(movieId) getBookmark(movieId, userId)

Slide 18

Slide 18 text

In a microservices world Edge Service Server threadpool Thread Thread Thread Thread Thread getMovieMetadata(movieId) getBookmark(movieId, userId) getRatings(movieId)

Slide 19

Slide 19 text

Busy thread time = Sum of the time taken to make all 3 service calls
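To make the statement above concrete, here is a minimal sketch that compares how long a request thread is held when three blocking calls run one after another with how long it would wait if the same calls overlapped. The class name and the latency values are hypothetical and chosen only to illustrate the arithmetic.

```java
import java.util.Arrays;

// Hypothetical helper; names and latencies are illustrative, not from the talk.
final class BusyTime {
    // Blocking calls made one after another: the request thread is held for the sum.
    static long sequentialMillis(long... callLatencies) {
        return Arrays.stream(callLatencies).sum();
    }

    // If the same calls were issued concurrently, the wait is bounded by the slowest call.
    static long concurrentMillis(long... callLatencies) {
        return Arrays.stream(callLatencies).max().orElse(0L);
    }
}
```

With assumed latencies of 30 ms, 20 ms, and 50 ms, the sequential version holds the thread for 100 ms, while overlapping the calls would wait roughly 50 ms.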

Slide 20

Slide 20 text

How do systems fail? 1. Latency Latency is your worst enemy in a synchronous world.

Slide 21

Slide 21 text

Edge Service Server threadpool Thread Thread Thread Thread Thread getRatings(movieId)

Slide 22

Slide 22 text

Ratings Service Edge Service Server threadpool Thread Thread Thread Thread Thread getRatings(movieId)

Slide 23

Slide 23 text

Edge Service getRatings(movieId) Server threadpool Thread Thread Thread Thread Thread

Slide 24

Slide 24 text

Edge Service getRatings(movieId) Server threadpool Thread Thread Thread Thread Thread

Slide 25

Slide 25 text

Edge Service Server threadpool Thread Thread Thread Thread Thread Client Threadpool Thread Thread Thread Thread Thread getRatings(movieId)

Slide 26

Slide 26 text

Edge Service Server threadpool Thread Thread Thread Thread Thread Client Threadpool Thread Thread Thread Thread Thread getRatings(movieId)

Slide 27

Slide 27 text

Edge Service Server threadpool Thread Thread Thread Thread Thread Client Threadpool Thread Thread Thread Thread Thread getRatings(movieId)

Slide 28

Slide 28 text

Managing client thread pools Client Threadpool Thread Thread Thread Thread Thread

Slide 29

Slide 29 text

Managing client thread pools

Slide 30

Slide 30 text

Managing client thread pools

Slide 31

Slide 31 text

Managing client thread pools

Slide 32

Slide 32 text

Managing client thread pools

Slide 33

Slide 33 text

Clients have become our babies

Slide 34

Slide 34 text

Clients have become our babies Edge Service Server threadpool Thread Thread Thread Thread Thread getMovieMetadata(movieId) Client Threadpool Thread Thread Thread Thread Thread Client Threadpool Thread Thread Thread Thread Thread Client Threadpool Thread Thread Thread Thread Thread getBookmark(movieId, userId) getRatings(movieId)

Slide 35

Slide 35 text

Clients have become our babies

Slide 36

Slide 36 text

Untuned/Wrongly tuned clients cause many outages.

Slide 37

Slide 37 text

Have we exchanged a bigger problem with a smaller one?

Slide 38

Slide 38 text

How do systems fail? 2. Overload Abusive clients, recovery spikes, special events ….

Slide 39

Slide 39 text

We did a load test… https://github.com/Netflix-Skunkworks/WSPerfLab Hello Netflix!

Slide 40

Slide 40 text

Detailed analysis available online: https://github.com/Netflix-Skunkworks/WSPerfLab/blob/master/test-results/RxNetty_vs_Tomcat_April2015.pdf

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

!Graceful This isn’t graceful degradation!

Slide 43

Slide 43 text

This happens at high CPU usage.

Slide 44

Slide 44 text

This happens at high CPU usage. So, don’t let the system reach that limit…

Slide 45

Slide 45 text

This happens at high CPU usage. So, don’t let the system reach that limit… a.k.a. throttling.
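The talk does not say how throttling was implemented. As one possible sketch, a server could cap the number of in-flight requests and reject the rest before the system reaches its limit. The class name, the use of an in-flight count as a proxy for CPU load, and the limit value are all assumptions.

```java
import java.util.concurrent.Semaphore;

// Hypothetical concurrency-limit throttle; the limit is an assumed tuning value.
final class Throttle {
    private final Semaphore permits;

    Throttle(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Returns true if the request may proceed, false if it should be rejected.
    boolean tryAcquire() {
        return permits.tryAcquire();
    }

    // Must be called once for every request that was admitted.
    void release() {
        permits.release();
    }
}
```

Note that a single shared limit like this does nothing for fairness between request types.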

Slide 46

Slide 46 text

Fairness? One abusive request type can penalize other request paths.

Slide 47

Slide 47 text

How do systems fail? 3. Thundering herds The failure after recovery….

Slide 48

Slide 48 text

Retries Edge Service Video Metadata Service

Slide 49

Slide 49 text

Retries Edge Service Video Metadata Service Cluster

Slide 50

Slide 50 text

Retries Edge Service Video Metadata Service Cluster

Slide 51

Slide 51 text

Retries Edge Service Video Metadata Service Cluster

Slide 52

Slide 52 text

Retries are useful in steady state…. …but…

Slide 53

Slide 53 text

Retries Edge Service Video Metadata Service Cluster

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

Our systems are missing empathy.

Slide 56

Slide 56 text

Because they lack knowledge about the peers.

Slide 57

Slide 57 text

Knowledge comes from various signals.

Slide 58

Slide 58 text

Ability to adapt to those signals is important.

Slide 59

Slide 59 text

This cannot adapt…
public Movie getMovie(String movieId) {
    Metadata metadata = getMovieMetadata(movieId);
    Bookmark bookmark = getBookmark(movieId, userId);
    Rating rating = getRatings(movieId);
    return new Movie(metadata, bookmark, rating);
}

Slide 60

Slide 60 text

Asynchrony It is the key to success.

Slide 61

Slide 61 text

What should be async?

Slide 62

Slide 62 text

What should be async? Edge Service Video Metadata Service

Slide 63

Slide 63 text

What should be async? Edge Service Video Metadata Service getMovieMetadata(movieId) getBookmark(movieId, userId) getRatings(movieId) Application logic

Slide 64

Slide 64 text

What should be async? Edge Service Video Metadata Service getMovieMetadata(movieId) getBookmark(movieId, userId) getRatings(movieId) I/O I/O I/O Application logic I/O

Slide 65

Slide 65 text

What should be async? Edge Service Video Metadata Service getMovieMetadata(movieId) getBookmark(movieId, userId) getRatings(movieId) I/O I/O I/O Application logic I/O Network protocol

Slide 66

Slide 66 text

Key aspects of being async.

Slide 67

Slide 67 text

Key aspects of being async. 1. Lifecycle control

Slide 68

Slide 68 text

Lifecycle control Start processing Stop processing

Slide 69

Slide 69 text

Key aspects of being async. 2. Flow control

Slide 70

Slide 70 text

Flow control When How much
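Lifecycle control (start and stop) and flow control (when, and how much) can be sketched with the JDK's `java.util.concurrent.Flow` interfaces. The publisher below is a hypothetical, simplified example: it emits only as many items as the subscriber requests and stops when cancelled. It is not a fully spec-compliant Reactive Streams publisher and is not taken from RxJava.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Flow;

// Hypothetical publisher that emits 1..count, but only as many items as requested.
final class RangePublisher implements Flow.Publisher<Integer> {
    private final int count;

    RangePublisher(int count) { this.count = count; }

    @Override
    public void subscribe(Flow.Subscriber<? super Integer> subscriber) {
        subscriber.onSubscribe(new Flow.Subscription() {
            private int next = 1;
            private boolean done;

            @Override
            public void request(long n) {           // flow control: when, and how much
                while (n-- > 0 && !done) {
                    if (next > count) {             // nothing left: signal completion
                        done = true;
                        subscriber.onComplete();
                        break;
                    }
                    subscriber.onNext(next++);
                }
            }

            @Override
            public void cancel() { done = true; }   // lifecycle control: stop processing
        });
    }
}

// Collects items so the effect of request() and cancel() can be observed.
final class CollectingSubscriber implements Flow.Subscriber<Integer> {
    final List<Integer> received = new ArrayList<>();
    Flow.Subscription subscription;
    boolean completed;

    @Override public void onSubscribe(Flow.Subscription s) { subscription = s; } // nothing flows until request()
    @Override public void onNext(Integer item) { received.add(item); }
    @Override public void onError(Throwable t) { throw new RuntimeException(t); }
    @Override public void onComplete() { completed = true; }
}
```

In this sketch nothing is emitted at subscription time; the subscriber decides when to start by calling `request(n)`, how much to receive through `n`, and when to stop through `cancel()`.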

Slide 71

Slide 71 text

Key aspects of being async. 3. Function composition

Slide 72

Slide 72 text

Function composition
public Movie getMovie(String movieId) {
    Metadata metadata = getMovieMetadata(movieId);
    Bookmark bookmark = getBookmark(movieId, userId);
    Rating rating = getRatings(movieId);
    return new Movie(metadata, bookmark, rating);
}

Slide 73

Slide 73 text

Function composition: Composing the processing of a method into a single control point.
public Observable<Movie> getMovie(String movieId) {
    return Observable.zip(getMovieMetadata(movieId),
                          getBookmark(movieId, userId),
                          getRatings(movieId),
                          (meta, bmark, rating) -> new Movie(meta, bmark, rating));
}
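The slide uses RxJava's `Observable.zip`. To keep a runnable example free of external dependencies, the sketch below swaps in the JDK's `CompletableFuture` to show the same shape: three asynchronous calls combined into one result behind a single return value. The stub types, the stubbed service calls with their canned values, and the explicit `userId` parameter are assumptions for illustration.

```java
import java.util.concurrent.CompletableFuture;

// Stub types; hypothetical stand-ins for the names on the slide.
final class Metadata { final String title; Metadata(String t) { title = t; } }
final class Bookmark { final int positionSeconds; Bookmark(int p) { positionSeconds = p; } }
final class Rating { final double stars; Rating(double s) { stars = s; } }
final class Movie {
    final Metadata metadata; final Bookmark bookmark; final Rating rating;
    Movie(Metadata m, Bookmark b, Rating r) { metadata = m; bookmark = b; rating = r; }
}

final class MovieService {
    // Stubbed remote calls returning canned values asynchronously.
    static CompletableFuture<Metadata> getMovieMetadata(String movieId) {
        return CompletableFuture.supplyAsync(() -> new Metadata("title-" + movieId));
    }
    static CompletableFuture<Bookmark> getBookmark(String movieId, String userId) {
        return CompletableFuture.supplyAsync(() -> new Bookmark(42));
    }
    static CompletableFuture<Rating> getRatings(String movieId) {
        return CompletableFuture.supplyAsync(() -> new Rating(4.5));
    }

    // All three calls start immediately; the result completes when the last one does.
    static CompletableFuture<Movie> getMovie(String movieId, String userId) {
        CompletableFuture<Metadata> meta = getMovieMetadata(movieId);
        CompletableFuture<Bookmark> bmark = getBookmark(movieId, userId);
        CompletableFuture<Rating> rating = getRatings(movieId);
        return CompletableFuture.allOf(meta, bmark, rating)
                // join() does not block here: allOf guarantees all three are complete.
                .thenApply(ignored -> new Movie(meta.join(), bmark.join(), rating.join()));
    }
}
```

The caller holds one future for the whole operation, which is the "single control point" idea. Unlike an `Observable`, a `CompletableFuture` starts eagerly and does not propagate cancellation to the underlying calls, so this sketch covers only the composition aspect.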

Slide 74

Slide 74 text

Composing the processing of a method into a single control point. Flow & Lifecycle Control with

Slide 75

Slide 75 text

What should be async? Edge Service Video Metadata Service getMovieMetadata(movieId) getBookmark(movieId, userId) getRatings(movieId) I/O I/O I/O Application logic I/O Network protocol

Slide 76

Slide 76 text

I/O Edge Service Server threadpool Thread Thread Thread Thread Thread Client Threadpool Thread Thread Thread Thread Thread getRatings(movieId)

Slide 77

Slide 77 text

I/O Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection Eventloops = f (Number of cores)

Slide 78

Slide 78 text

I/O Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection Connections multiplexed on a single eventloop.

Slide 79

Slide 79 text

I/O Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection getMovieMetadata(movieId) Eventloop (Outbound) Connection Connection Connection Connection Connection

Slide 80

Slide 80 text

I/O Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection getMovieMetadata(movieId) Eventloop (Outbound) Connection Connection Connection Connection Connection Clients share the eventloops with the server.

Slide 81

Slide 81 text

I/O Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection getMovieMetadata(movieId) Eventloop (Outbound) Connection Connection Connection Connection Connection All clients share the same eventloop

Slide 82

Slide 82 text

Composing the processing of a service into a single control point. Flow & Lifecycle Control with

Slide 83

Slide 83 text

What should be async? Edge Service Video Metadata Service getMovieMetadata(movieId) getBookmark(movieId, userId) getRatings(movieId) I/O I/O I/O Application logic I/O Network protocol

Slide 84

Slide 84 text

Network Protocol HTTP/1.1?

Slide 85

Slide 85 text

HTTP/1.1 GET /movie?id=1 HTTP/1.1

Slide 86

Slide 86 text

HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=1 HTTP/1.1

Slide 87

Slide 87 text

HTTP/1.1 GET /movie?id=3 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=1 HTTP/1.1

Slide 88

Slide 88 text

HTTP/1.1 GET /movie?id=3 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=1 HTTP/1.1 HTTP/1.1 200 OK ID: 1 …

Slide 89

Slide 89 text

HTTP/1.1 GET /movie?id=3 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=1 HTTP/1.1 HTTP/1.1 200 OK ID: 1 … HTTP/1.1 200 OK ID: 2 …

Slide 90

Slide 90 text

HTTP/1.1 GET /movie?id=3 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=1 HTTP/1.1 HTTP/1.1 200 OK ID: 1 … HTTP/1.1 200 OK ID: 2 … HTTP/1.1 200 OK ID: 3 …

Slide 91

Slide 91 text

HTTP/1.1 GET /movie?id=3 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=1 HTTP/1.1 HTTP/1.1 200 OK ID: 1 … HTTP/1.1 200 OK ID: 2 … HTTP/1.1 200 OK ID: 3 … Head Of Line Blocking => Synchronous

Slide 92

Slide 92 text

Network Protocol HTTP/1.1?

Slide 93

Slide 93 text

Network Protocol We need a multiplexed bi-directional protocol

Slide 94

Slide 94 text

Multiplexed GET /movie?id=1 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=3 HTTP/1.1
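The difference between in-order HTTP/1.1 responses (head-of-line blocking) and a multiplexed protocol can be sketched with a small timing model. The class name and the model itself are hypothetical: `finishedAt[i]` is the assumed time, in milliseconds, at which the server finishes processing request `i`.

```java
// Hypothetical timing model; not a protocol implementation.
final class HeadOfLine {
    // HTTP/1.1 pipelining: responses must go out in request order,
    // so each response waits for every earlier response.
    static long[] inOrderDelivery(long[] finishedAt) {
        long[] delivered = new long[finishedAt.length];
        long latestSoFar = 0;
        for (int i = 0; i < finishedAt.length; i++) {
            latestSoFar = Math.max(latestSoFar, finishedAt[i]);
            delivered[i] = latestSoFar;
        }
        return delivered;
    }

    // Multiplexed protocol: each response is delivered as soon as it is ready.
    static long[] multiplexedDelivery(long[] finishedAt) {
        return finishedAt.clone();
    }
}
```

If the first of three requests takes 300 ms and the others take 10 ms and 20 ms, in-order delivery holds all three responses until 300 ms, while multiplexed delivery returns the fast ones at 10 ms and 20 ms.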

Slide 95

Slide 95 text

Bi-directional GET /movie?id=1 HTTP/1.1 GET /movie?id=2 HTTP/1.1 GET /movie?id=3 HTTP/1.1 CANCEL

Slide 96

Slide 96 text

Composing the processing of the entire application into a single control point. Flow & Lifecycle Control with

Slide 97

Slide 97 text

Edge Service Video Metadata Service Rating service C* store C* store /movie?id=123

Slide 98

Slide 98 text

Edge Service Video Metadata Service Rating service C* store C* store /movie?id=123 Observable

Slide 99

Slide 99 text

Observable Composing the processing of the entire application into a single control point.

Slide 100

Slide 100 text

Revisiting the failure modes

Slide 101

Slide 101 text

Latency Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection getMovieMetadata(movieId) Eventloop (Outbound) Connection Connection Connection Connection Connection

Slide 102

Slide 102 text

Latency Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection getMovieMetadata(movieId) Eventloop (Outbound) Connection Connection Connection Connection Connection Impact is localized to the connection.

Slide 103

Slide 103 text

Latency Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection getMovieMetadata(movieId) Eventloop (Outbound) Connection Connection Connection Connection Connection Impact is localized to the connection. An outstanding request has little cost.

Slide 104

Slide 104 text

An outstanding request has little cost. GET /movie?id=1 HTTP/1.1 … HTTP/1.1 200 OK … Any stored state between request and response is costly.

Slide 105

Slide 105 text

Outstanding requests have low cost, so latency is a lesser evil in asynchronous systems.

Slide 106

Slide 106 text

Overload & Thundering Herds Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection getMovieMetadata(movieId) Eventloop (Outbound) Connection Connection Connection Connection Connection Reduce work done when overloaded

Slide 107

Slide 107 text

Overload & Thundering Herds Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection getMovieMetadata(movieId) Eventloop (Outbound) Connection Connection Connection Connection Connection Reduce work done when overloaded Stop accepting new requests.

Slide 108

Slide 108 text

Stop accepting new requests Non-blocking I/O gives better control

Slide 109

Slide 109 text

Stop accepting new requests But … we are still “throttling”

Slide 110

Slide 110 text

Stop accepting new requests Are we being empathetic?

Slide 111

Slide 111 text

Request-leasing http://reactivesocket.io/

Slide 112

Slide 112 text

Request-leasing Peer 1 Peer 2 Network connection

Slide 113

Slide 113 text

Request-leasing Peer 1 Peer 2 “Lease” 5 requests for 1 minute. Network connection

Slide 114

Slide 114 text

Request-leasing Peer 1 Peer 2 “Lease” 5 requests for 1 minute. GET /movie?id=1 HTTP/1.1 GET /movie?id=2 HTTP/1.1 Network connection

Slide 115

Slide 115 text

Server Client 1 Client 2 Client 8

Slide 116

Slide 116 text

Server Capacity: 100 RPM Client 1 Client 2 Client 8 “Lease” 10 requests for 1 minute.

Slide 117

Slide 117 text

Server Capacity: 100 RPM Client 1 Client 2 Client 8 “Lease” 10 requests for 1 minute. “Lease” 10 requests for 1 minute. “Lease” 10 requests for 1 minute.

Slide 118

Slide 118 text

Server Capacity: 100 RPM Client 1 Client 2 Client 8 “Lease” 10 requests for 1 minute. “Lease” 10 requests for 1 minute. “Lease” 10 requests for 1 minute. Reserve Capacity: 20 RPM
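The numbers on this slide are consistent with a simple even split: with a capacity of 100 requests per minute, a reserve of 20, and 8 clients, each client can be leased 10 requests. The talk does not specify how lease sizes are computed, so the helper below is a hypothetical sketch of that arithmetic only.

```java
// Hypothetical even split of leasable capacity across clients.
final class LeaseBudget {
    // Requests per window each client may be leased after holding back a reserve.
    static int perClientLease(int capacityPerWindow, int reservePerWindow, int clients) {
        if (clients <= 0) {
            throw new IllegalArgumentException("clients must be positive");
        }
        int leasable = Math.max(0, capacityPerWindow - reservePerWindow);
        return leasable / clients;
    }
}
```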

Slide 119

Slide 119 text

Time bound lease. “Lease” 10 requests for 1 minute.

Slide 120

Slide 120 text

Time bound lease. No extra work for cancelling leases. “Lease” 10 requests for 1 minute.

Slide 121

Slide 121 text

Time bound lease. No extra work for cancelling leases. Receiver controls the flow of requests “Lease” 10 requests for 1 minute.
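A time-bound lease can be modeled on the sender side as a counter with an expiry. The sketch below is hypothetical and is not the ReactiveSocket API; time is passed in explicitly so the behavior is easy to follow.

```java
// Hypothetical lease held by the sending peer; not the ReactiveSocket API.
final class Lease {
    private int remaining;
    private final long expiresAtMillis;

    Lease(int allowedRequests, long issuedAtMillis, long ttlMillis) {
        this.remaining = allowedRequests;
        this.expiresAtMillis = issuedAtMillis + ttlMillis;
    }

    // The sender checks the lease before each request.
    // An expired lease simply stops granting; no explicit cancel is needed.
    synchronized boolean tryUse(long nowMillis) {
        if (nowMillis >= expiresAtMillis || remaining <= 0) {
            return false;
        }
        remaining--;
        return true;
    }
}
```

Because the receiver decides how many requests to lease and for how long, it controls the flow of requests; the sender only spends what it has been granted.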

Slide 122

Slide 122 text

When things go south

Slide 123

Slide 123 text

Server Capacity: 20 RPM Client 1 Client 2 Client 8 “Lease” 5 requests for 1 minute. “Lease” 2 requests for 1 minute. X No more “Lease”

Slide 124

Slide 124 text

Server Capacity: 20 RPM Client 1 Client 2 Client 8 “Lease” 5 requests for 1 minute. “Lease” 2 requests for 1 minute. X No more “Lease” Prioritization

Slide 125

Slide 125 text

Managing client configs?

Slide 126

Slide 126 text

Threadpools? Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection getMovieMetadata(movieId) Eventloop (Outbound) Connection Connection Connection Connection Connection I/O is non-blocking.

Slide 127

Slide 127 text

Threadpools? Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection getMovieMetadata(movieId) Eventloop (Outbound) Connection Connection Connection Connection Connection Application code is non-blocking.

Slide 128

Slide 128 text

Threadpools? Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection getMovieMetadata(movieId) Eventloop (Outbound) Connection Connection Connection Connection Connection No blocking/Waiting => Only CPU work

Slide 129

Slide 129 text

Threadpools? Edge Service Eventloop (Inbound) Connection Connection Connection Connection Connection getMovieMetadata(movieId) Eventloop (Outbound) Connection Connection Connection Connection Connection No blocking/Waiting => Only CPU work So, Eventloops = # of cores
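As a rough sketch of "event loops = number of cores", the class below creates one single-threaded executor per core and pins each connection to one of them. It is a hypothetical illustration using plain JDK executors, not how RxNetty or Netty implement their event loops.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical event loop group sized by core count.
final class EventLoops implements AutoCloseable {
    private final ExecutorService[] loops;

    EventLoops() {
        int cores = Runtime.getRuntime().availableProcessors();
        loops = new ExecutorService[cores];
        for (int i = 0; i < cores; i++) {
            loops[i] = Executors.newSingleThreadExecutor(); // one thread per loop
        }
    }

    int size() { return loops.length; }

    // A connection is pinned to one loop, so its events are handled by a single thread.
    ExecutorService loopFor(int connectionId) {
        return loops[Math.floorMod(connectionId, loops.length)];
    }

    @Override
    public void close() {
        for (ExecutorService loop : loops) {
            loop.shutdown();
        }
    }
}
```

The pool size comes from the hardware rather than from expected downstream latency, which is why it no longer needs to be tuned per client, provided nothing submitted to a loop blocks.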

Slide 130

Slide 130 text

Tuning parameters?

Slide 131

Slide 131 text

Tuning parameters? X

Slide 132

Slide 132 text

Tuning parameters? X X

Slide 133

Slide 133 text

Tuning parameters? X X ? ?

Slide 134

Slide 134 text

Case for timeouts?

Slide 135

Slide 135 text

Case for timeouts? Read Timeouts: ✤ Useful in unblocking threads on socket reads. Thread Timeouts: ✤ Business level SLA. ✤ Unblock the calling thread.

Slide 136

Slide 136 text

Case for timeouts? Read Timeouts: ✤ Useful in unblocking threads on socket reads. Thread Timeouts: ✤ Business level SLA. ✤ Unblock the calling thread. X X As there are no blocking calls. X

Slide 137

Slide 137 text

Case for timeouts? Read Timeouts: ✤ Useful in unblocking threads on socket reads. Thread Timeouts: ✤ Business level SLA. ✤ Unblock the calling thread.

Slide 138

Slide 138 text

Business level SLA Edge Service Video Metadata Service

Slide 139

Slide 139 text

Business level SLA Edge Service Video Metadata Service Rating service C* store C* store

Slide 140

Slide 140 text

Business level SLA Edge Service Video Metadata Service Rating service C* store C* store

Slide 141

Slide 141 text

Business level SLA Edge Service Video Metadata Service Rating service C* store C* store Thread timeouts are pretty invasive at every level

Slide 142

Slide 142 text

Business level SLA Edge Service Video Metadata Service Rating service C* store C* store Thread timeouts are pretty invasive at every level Do we need them at every step?

Slide 143

Slide 143 text

Edge Service Video Metadata Service Rating service C* store C* store /movie?id=123

Slide 144

Slide 144 text

Edge Service Video Metadata Service Rating service C* store C* store /movie?id=123 Business timeouts are for a client request.
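One way to read "business timeouts are for a client request" is that a single deadline is applied at the edge to the composed work, instead of a timeout at every hop. The sketch below shows that idea with the JDK's `CompletableFuture.completeOnTimeout` (Java 9+). The class name, the fallback value, and the deadline are assumptions, and propagating cancellation to downstream services is not modeled here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one deadline for the whole client request, applied at the edge.
final class EdgeDeadline {
    // Wraps the composed downstream work; inner calls carry no timeouts of their own.
    // If the work has not finished by the deadline, it completes with the fallback.
    static <T> CompletableFuture<T> withDeadline(CompletableFuture<T> work, long millis, T fallback) {
        return work.completeOnTimeout(fallback, millis, TimeUnit.MILLISECONDS);
    }
}
```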

Slide 145

Slide 145 text

Edge Service Video Metadata Service Rating service C* store C* store /movie?id=123

Slide 146

Slide 146 text

Edge Service Video Metadata Service Rating service C* store C* store /movie?id=123 X

Slide 147

Slide 147 text

Edge Service Video Metadata Service Rating service C* store C* store /movie?id=123 X X X X X

Slide 148

Slide 148 text

Tuning parameters? X X X X

Slide 149

Slide 149 text

Less tuning

Slide 150

Slide 150 text

No content

Slide 151

Slide 151 text

Edge Service Video Metadata Service Rating service C* store C* store Request Leases Cancellations Observable

Slide 152

Slide 152 text

Edge Service Video Metadata Service Rating service C* store C* store Request Leases Cancellations Observable

Slide 153

Slide 153 text

No content

Slide 154

Slide 154 text

public Movie getMovie(String movieId) {
    Metadata metadata = getMovieMetadata(movieId);
    Bookmark bookmark = getBookmark(movieId, userId);
    Rating rating = getRatings(movieId);
    return new Movie(metadata, bookmark, rating);
}
public Observable<Movie> getMovie(String movieId) {
    return Observable.zip(getMovieMetadata(movieId),
                          getBookmark(movieId, userId),
                          getRatings(movieId),
                          (meta, bmark, rating) -> new Movie(meta, bmark, rating));
}

Slide 155

Slide 155 text

Resources
Asynchronous function composition: https://github.com/ReactiveX/RxJava
I/O: https://github.com/ReactiveX/RxNetty
Network protocol: http://reactivesocket.io/

Slide 156

Slide 156 text

Nitesh Kant, Engineer, Netflix Edge Gateway @NiteshKant