Slide 1

Slide 1 text

Oleksiy Dyagilev, Lead Software Engineer

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

consider amazon.com

Slide 4

Slide 4 text

consider amazon.com

Slide 5

Slide 5 text

consider amazon.com

Slide 6

Slide 6 text

consider amazon.com

Slide 7

Slide 7 text

consider amazon.com

Slide 8

Slide 8 text

consider amazon.com

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

August 19, 2013: amazon.com went down around 1 p.m. Pacific time for 45 minutes. Estimated cost: $117,882.
Source: http://venturebeat.com/2013/08/19/amazon-website-down/

Slide 13

Slide 13 text

[Diagram: one of our production use cases. HTTP sessions are stored in a XAP in-memory datagrid. Each app container (jetty, tomcat, etc.) runs a Spring Session Filter that makes a network call to read and write the session object.]

Slide 14

Slide 14 text

[Diagram: one of our production use cases. HTTP sessions are stored in a XAP in-memory datagrid. Each app container (jetty, tomcat, etc.) runs a Spring Session Filter that makes a network call to read and write the session object.]

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

misconfiguration, bursty traffic, software bugs, hardware issues

Slide 17

Slide 17 text

misconfiguration, bursty traffic, software bugs, hardware issues

Slide 18

Slide 18 text

misconfiguration, bursty traffic, software bugs, hardware issues

Slide 19

Slide 19 text

misconfiguration, bursty traffic, software bugs, hardware issues

Slide 20

Slide 20 text

misconfiguration, bursty traffic, software bugs, hardware issues

Slide 21

Slide 21 text

[Diagram: app containers (jetty, tomcat, etc.), each with a Spring Session Filter, connected to the XAP in-memory datagrid. Failure causes shown: power failure, misconfiguration, firmware bugs, topology changes, cable damage, malicious traffic.]

Slide 22

Slide 22 text

If an application depends on 30 services, each with 99.99% uptime (4.3 mins downtime/month), its uptime is 0.9999^30 ≈ 99.7% (2.1 hours downtime/month).
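The arithmetic on this slide can be checked with a few lines of Java. This is a minimal sketch that assumes the 30 dependencies fail independently and that the application is down whenever any one of them is down; the class and method names are made up for illustration, and a month is taken as 30 days.

```java
class CompositeUptime {
    /** Availability of an app that needs all n dependencies up (independent failures assumed). */
    static double compositeAvailability(double perServiceAvailability, int n) {
        return Math.pow(perServiceAvailability, n);
    }

    /** Expected downtime in minutes over a 30-day month for a given availability. */
    static double downtimeMinutesPerMonth(double availability) {
        double minutesPerMonth = 30 * 24 * 60;
        return (1.0 - availability) * minutesPerMonth;
    }

    public static void main(String[] args) {
        double single = 0.9999;
        double combined = compositeAvailability(single, 30);
        System.out.printf("one service: %.2f min/month%n", downtimeMinutesPerMonth(single));
        System.out.printf("30 services: %.2f%% uptime, %.2f hours/month%n",
                combined * 100, downtimeMinutesPerMonth(combined) / 60);
    }
}
```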

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

o preventing any single dependency from using all container (Tomcat, etc.) user threads
o shedding load and failing fast instead of queueing
o providing fallbacks wherever feasible to protect users from failure
o real-time metrics and monitoring
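The first three points can be illustrated without any library. The sketch below is not how Hystrix does it internally; it is a hypothetical `Bulkhead` class that caps concurrent calls to one dependency with a `Semaphore`, rejects immediately when the cap is reached instead of queueing, and returns a fallback value on rejection or failure.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class Bulkhead<T> {
    private final Semaphore permits;

    /** maxConcurrentCalls bounds how many caller threads this dependency may occupy at once. */
    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    T call(Supplier<T> action, Supplier<T> fallback) {
        // Fail fast: tryAcquire() never blocks, so excess load is shed rather than queued.
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return action.get();
        } catch (RuntimeException e) {
            // Protect the user from the dependency's failure.
            return fallback.get();
        } finally {
            permits.release();
        }
    }
}
```

A caller would wrap each remote call, for example `bulkhead.call(() -> client.load(id), () -> defaultValue)`, where `client` and `defaultValue` are placeholders for the real dependency and its degraded response.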

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

public class CommandHelloWorld extends HystrixCommand<String> {

    private final String name;

    public CommandHelloWorld(String name) {
        super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
        this.name = name;
    }

    @Override
    protected String run() {
        // a real example would do work like a network call here
        return "Hello " + name + "!";
    }
}

@Test
public void testExecute() {
    assertEquals("Hello World!", new CommandHelloWorld("World").execute());
}

Slide 27

Slide 27 text

public class CommandHelloFailure extends HystrixCommand<String> {

    private final String name;

    public CommandHelloFailure(String name) {
        super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
        this.name = name;
    }

    @Override
    protected String run() {
        throw new RuntimeException("this command always fails");
    }

    // called when run() fails, times out, or is rejected
    @Override
    protected String getFallback() {
        return "Hello Failure " + name + "!";
    }
}

@Test
public void testSynchronous() {
    assertEquals("Hello Failure World!", new CommandHelloFailure("World").execute());
}

Slide 28

Slide 28 text

public class GetSessionHystrixCommand extends ConfigurableHystrixCommand {

    private static Logger log = LoggerFactory.getLogger(GetSessionHystrixCommand.class);

    private final String sessionId;
    private final GetSessionCommand getSessionCommand;

    public GetSessionHystrixCommand(String id, RestExecutionContext context, CommandSettings settings) {
        super(XAP_SESSION_COMMAND, settings);
        this.sessionId = id;
        this.getSessionCommand = new GetSessionCommand(id, context);
    }

    @Override
    protected XapSession run() throws Exception {
        try {
            return getSessionCommand.execute();
        } catch (Exception exception) {
            log.error("Failed to get session", exception);
            throw exception;
        }
    }

    @Override
    protected ExpiringSession getFallback() {
        log.error("Falling back on getting session due to {}", getExecutionEvents());
        return FailoverSession.create(sessionId);
    }
}

production code

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

[Diagram: the app container (jetty, tomcat, etc.) with the Spring Session Filter uses the primary datagrid and falls back to a secondary datagrid, which is fed by WAN replication.]
Sacrificing consistency for availability.

Slide 32

Slide 32 text

[Diagram: the app container (jetty, tomcat, etc.) with the Spring Session Filter uses the primary datagrid and falls back to a secondary datagrid, which is fed by WAN replication.]
Sacrificing consistency for availability.
You might not need this if the entire infrastructure is replicated in another DC.

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

http://martinfowler.com/bliki/CircuitBreaker.html
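The pattern described at that link is a small state machine: closed (calls pass through), open (calls are refused), and half-open (trial calls are allowed after a cool-down). The sketch below is a simplified illustration, not Hystrix's implementation, which works from rolling-window error percentages. The class name, the consecutive-failure threshold, and the injected clock are all assumptions made to keep the example short and testable.

```java
import java.util.function.LongSupplier;

class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that trip the breaker
    private final long openMillis;        // how long to stay open before allowing a trial
    private final LongSupplier clock;     // current time in millis, injectable for tests
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Returns false while the breaker is open; moves to HALF_OPEN once the cool-down has passed. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state != State.OPEN;
    }

    /** A success closes the breaker and resets the failure count. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    /** A failed trial call, or too many consecutive failures, opens the breaker. */
    synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

Callers check `allowRequest()` before touching the dependency and report the outcome with `recordSuccess()` or `recordFailure()`; in production code one would typically pass `System::currentTimeMillis` as the clock.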

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

.execute()

Slide 37

Slide 37 text

.execute() Circuit breaker open?

Slide 38

Slide 38 text

.execute() Circuit breaker open? Thread pool rejected? no

Slide 39

Slide 39 text

.execute() Circuit breaker open? .run() Thread pool rejected? no no

Slide 40

Slide 40 text

.execute() Circuit breaker open? .run() Thread pool rejected? execution fails? no no

Slide 41

Slide 41 text

.execute() Circuit breaker open? .run() Thread pool rejected? execution fails? timeout no no no

Slide 42

Slide 42 text

.execute() Circuit breaker open? .run() Return result of run() Thread pool rejected? execution fails? timeout no no no no

Slide 43

Slide 43 text

.execute() Circuit breaker open? .run() .getFallback() Return result of run() Thread pool rejected? execution fails? timeout no no yes, short-circuit yes, reject yes yes no no

Slide 44

Slide 44 text

.execute() Circuit breaker open? .run() .getFallback() Return result of run() Thread pool rejected? execution fails? timeout Fallback successful? no no yes, short-circuit yes, reject yes yes no no

Slide 45

Slide 45 text

.execute() Circuit breaker open? .run() .getFallback() Return result of fallback() Return result of run() Thread pool rejected? execution fails? timeout Fallback successful? no no yes, short-circuit yes, reject yes yes yes no no

Slide 46

Slide 46 text

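[Flowchart of .execute():]
1. Circuit breaker open? yes, short-circuit: go to .getFallback(). no: continue.
2. Thread pool rejected? yes, reject: go to .getFallback(). no: continue.
3. .run()
4. Execution fails? yes: go to .getFallback(). no: continue.
5. Timeout? yes: go to .getFallback(). no: return result of run().
6. .getFallback(): Fallback successful? yes: return result of fallback(). no: return exception.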
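The decision flow on this slide can be approximated in plain Java. The following is a sketch under stated assumptions, not Hystrix source: `GuardedCall` is a hypothetical class, the circuit breaker is reduced to a boolean check supplied by the caller, and isolation is done by submitting the work to an `ExecutorService` that may reject it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

class GuardedCall<T> {
    private final ExecutorService pool;        // dedicated pool for this dependency
    private final long timeoutMillis;
    private final BooleanSupplier circuitOpen; // stand-in for a real circuit breaker

    GuardedCall(ExecutorService pool, long timeoutMillis, BooleanSupplier circuitOpen) {
        this.pool = pool;
        this.timeoutMillis = timeoutMillis;
        this.circuitOpen = circuitOpen;
    }

    /** If the fallback itself throws, that exception reaches the caller ("return exception"). */
    T execute(Callable<T> run, Supplier<T> fallback) {
        if (circuitOpen.getAsBoolean()) {
            return fallback.get();             // circuit breaker open: short-circuit
        }
        Future<T> future;
        try {
            future = pool.submit(run);
        } catch (RejectedExecutionException e) {
            return fallback.get();             // thread pool rejected the work
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS); // result of run()
        } catch (TimeoutException e) {
            future.cancel(true);               // timeout: interrupt the worker and fall back
            return fallback.get();
        } catch (ExecutionException e) {
            return fallback.get();             // run() threw: execution failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback.get();
        }
    }
}
```

Using a separate pool per dependency is what keeps a slow dependency from exhausting the container's own request threads; the caller only ever waits up to `timeoutMillis`.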

Slide 47

Slide 47 text

Future<String> s = new CommandHelloWorld("Bob").queue();
Observable<String> s = new CommandHelloWorld("Bob").observe();

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

With 60 requests/second:
At the 90th percentile there is a cost of 3ms
At the 99th percentile there is a cost of 9ms

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

o Resilience can be a strong requirement
o Distributed systems are complex
o Isolate your dependencies
o It’s not only about microservices, but very applicable there
o Circuit Breaker is your friend
o Monitoring is a must
o Use

Slide 54

Slide 54 text

No content