
Resilient distributed systems with Netflix Hystrix

Hystrix is a latency and fault tolerance library designed to enable resilience in distributed systems where failure is inevitable.
Oleksiy will introduce this library and show how and why they use it in mission-critical projects.

Oleksii Diagiliev

August 06, 2015

Transcript

  1. August 19, 2013: amazon.com was down around 1 p.m. Pacific time for 45 mins; estimated cost $117,882.
     http://venturebeat.com/2013/08/19/amazon-website-down/
  2. One of our production use cases: HTTP sessions are kept in the XAP in-memory datagrid. Each app container (jetty, tomcat, etc) runs a Spring Session Filter, which makes a network call to the datagrid to read and write the session object.
  3. One of our production use cases: HTTP sessions are kept in the XAP in-memory datagrid. Each app container (jetty, tomcat, etc) runs a Spring Session Filter, which makes a network call to the datagrid to read and write the session object.
  4. Causes of failure between the app containers (jetty, tomcat, etc, each with a Spring Session Filter) and the XAP in-memory datagrid: power failure, misconfiguration, firmware bugs, topology changes, cable damage, malicious traffic.
  5. If an application depends on 30 services, each with 99.99% uptime (4.3 mins downtime/month), its own uptime is 0.9999^30 ≈ 0.997, i.e. 99.7% (about 2.2 hours downtime/month).
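
The arithmetic on this slide can be checked with a few lines of Java. This is only a sketch: it assumes the 30 dependencies fail independently and that a month has 30 days, and the class and method names are made up for illustration.

```java
// Hypothetical helper for the slide's arithmetic; not part of Hystrix.
class CompositeUptime {

    /** Availability of an application that needs all n independent dependencies up at once. */
    static double compositeAvailability(double perServiceAvailability, int n) {
        return Math.pow(perServiceAvailability, n);
    }

    /** Expected downtime in minutes over a month of the given length (assumption: fixed-length month). */
    static double downtimeMinutesPerMonth(double availability, int daysInMonth) {
        return (1.0 - availability) * daysInMonth * 24 * 60;
    }

    public static void main(String[] args) {
        double single = 0.9999;
        double combined = compositeAvailability(single, 30);
        System.out.printf("one service:  %.1f mins downtime/month%n",
                downtimeMinutesPerMonth(single, 30));
        System.out.printf("30 services:  %.1f%% uptime, %.1f hours downtime/month%n",
                combined * 100, downtimeMinutesPerMonth(combined, 30) / 60);
    }
}
```

The combined downtime comes out at a little over two hours a month, which is the point of the slide: many highly available dependencies still add up to noticeable downtime.
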
  6. o Preventing any single dependency from using all container (Tomcat, etc) user threads
     o Shedding load and failing fast instead of queueing
     o Providing fallbacks wherever feasible to protect users from failure
     o Real-time metrics and monitoring
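
The first three goals can be illustrated without Hystrix. The sketch below is an assumption-laden stand-in, not Hystrix code: it caps concurrent calls to one dependency with a semaphore, rejects immediately when the cap is reached instead of queueing, and returns a fallback on rejection or failure. The class name `DependencyGuard` is hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical guard around a single dependency; not part of Hystrix.
class DependencyGuard {
    private final Semaphore permits;

    DependencyGuard(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs work if a permit is free; otherwise, or if work throws, returns the fallback. */
    <T> T call(Supplier<T> work, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            // Dependency is saturated: shed load and fail fast, never queue the caller.
            return fallback.get();
        }
        try {
            return work.get();
        } catch (RuntimeException e) {
            // The dependency failed: protect the user with the fallback.
            return fallback.get();
        } finally {
            permits.release();
        }
    }
}
```

With `new DependencyGuard(10)`, at most 10 container threads can be stuck inside this dependency at any time; every other caller gets the fallback straight away. A semaphore cannot enforce timeouts, which is one reason to run the work on a separate thread pool instead.
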
  7. public class CommandHelloWorld extends HystrixCommand<String> {

         private final String name;

         public CommandHelloWorld(String name) {
             super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
             this.name = name;
         }

         @Override
         protected String run() {
             // a real example would do work like a network call here
             return "Hello " + name + "!";
         }
     }

     @Test
     public void testExecute() {
         assertEquals("Hello World!", new CommandHelloWorld("World").execute());
     }
  8. public class CommandHelloFailure extends HystrixCommand<String> {

         private final String name;

         public CommandHelloFailure(String name) {
             super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
             this.name = name;
         }

         @Override
         protected String run() {
             throw new RuntimeException("this command always fails");
         }

         @Override
         protected String getFallback() {
             return "Hello Failure " + name + "!";
         }
     }

     @Test
     public void testSynchronous() {
         assertEquals("Hello Failure World!", new CommandHelloFailure("World").execute());
     }
  9. public class GetSessionHystrixCommand extends ConfigurableHystrixCommand<ExpiringSession> {

         private static Logger log = LoggerFactory.getLogger(GetSessionHystrixCommand.class);

         private final String sessionId;
         private final GetSessionCommand getSessionCommand;

         public GetSessionHystrixCommand(String id, RestExecutionContext context, CommandSettings settings) {
             super(XAP_SESSION_COMMAND, settings);
             this.sessionId = id;
             this.getSessionCommand = new GetSessionCommand(id, context);
         }

         @Override
         protected XapSession run() throws Exception {
             try {
                 return getSessionCommand.execute();
             } catch (Exception exception) {
                 log.error("Failed to get session", exception);
                 throw exception;
             }
         }

         @Override
         protected ExpiringSession getFallback() {
             log.error("Falling back on getting session due to {}", getExecutionEvents());
             return FailoverSession.create(sessionId);
         }
     }

     (production code)
  10. Fallback: the Spring Session Filter in the app container (jetty, tomcat, etc) talks to the primary datagrid and falls back to a secondary datagrid, which is kept up to date by WAN replication. This sacrifices consistency for availability.
  11. Fallback: the Spring Session Filter in the app container (jetty, tomcat, etc) talks to the primary datagrid and falls back to a secondary datagrid, which is kept up to date by WAN replication. This sacrifices consistency for availability. You might not need this if the entire infrastructure is replicated in another DC.
  12. .execute() flow:
        Circuit breaker open?   no -> continue
        Thread pool rejected?   no -> .run()
        Execution fails?        no -> continue
        Timeout?                no -> Return result of run()
  13. .execute() flow:
        Circuit breaker open?   yes, short-circuit -> .getFallback(); no -> continue
        Thread pool rejected?   yes, reject -> .getFallback(); no -> .run()
        Execution fails?        yes -> .getFallback(); no -> continue
        Timeout?                yes -> .getFallback(); no -> Return result of run()
  14. .execute() flow:
        Circuit breaker open?   yes, short-circuit -> .getFallback(); no -> continue
        Thread pool rejected?   yes, reject -> .getFallback(); no -> .run()
        Execution fails?        yes -> .getFallback(); no -> continue
        Timeout?                yes -> .getFallback(); no -> Return result of run()
        Fallback successful?
  15. .execute() flow:
        Circuit breaker open?   yes, short-circuit -> .getFallback(); no -> continue
        Thread pool rejected?   yes, reject -> .getFallback(); no -> .run()
        Execution fails?        yes -> .getFallback(); no -> continue
        Timeout?                yes -> .getFallback(); no -> Return result of run()
        Fallback successful?    yes -> Return result of fallback()
  16. .execute() flow:
        Circuit breaker open?   yes, short-circuit -> .getFallback(); no -> continue
        Thread pool rejected?   yes, reject -> .getFallback(); no -> .run()
        Execution fails?        yes -> .getFallback(); no -> continue
        Timeout?                yes -> .getFallback(); no -> Return result of run()
        Fallback successful?    yes -> Return result of fallback(); no -> Return exception
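
The decision flow in the slides above can be sketched in plain Java. This is an illustration under simplifying assumptions, not Hystrix's implementation: the hypothetical `SketchCommand` opens its circuit after a fixed number of consecutive failures and never closes it again. It is only meant to show the order of the checks.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical class mirroring the flowchart; not part of Hystrix.
class SketchCommand {
    private final ThreadPoolExecutor pool;
    private final int failureThreshold;
    private final long timeoutMs;
    private int consecutiveFailures = 0; // guarded by "this"

    SketchCommand(int poolSize, int failureThreshold, long timeoutMs) {
        // No queue capacity: once all threads are busy, submit() is rejected.
        this.pool = new ThreadPoolExecutor(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(), r -> {
                    Thread t = new Thread(r);
                    t.setDaemon(true); // do not keep the JVM alive
                    return t;
                });
        this.failureThreshold = failureThreshold;
        this.timeoutMs = timeoutMs;
    }

    /** Simplified breaker (assumption): open once enough calls in a row have failed. */
    synchronized boolean circuitOpen() {
        return consecutiveFailures >= failureThreshold;
    }

    private synchronized void record(boolean success) {
        consecutiveFailures = success ? 0 : consecutiveFailures + 1;
    }

    /** If the fallback itself throws, that exception reaches the caller ("Return exception"). */
    <T> T execute(Callable<T> run, Supplier<T> fallback) {
        if (circuitOpen()) {
            return fallback.get();                       // yes, short-circuit
        }
        Future<T> future;
        try {
            future = pool.submit(run);
        } catch (RejectedExecutionException e) {
            record(false);
            return fallback.get();                       // yes, reject
        }
        try {
            T result = future.get(timeoutMs, TimeUnit.MILLISECONDS);
            record(true);
            return result;                               // result of run()
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true);
            record(false);
            return fallback.get();                       // timeout or execution failure
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            record(false);
            return fallback.get();
        }
    }
}
```

A call such as `new SketchCommand(10, 5, 1000).execute(() -> loadSession(id), () -> emptySession(id))` (both helpers hypothetical) would follow the same path as the flowchart: breaker check, pool admission, run with a timeout, then the fallback on any failure.
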
  17. With 60 requests/second: at the 90th percentile there is a cost of 3ms; at the 99th percentile there is a cost of 9ms.
  18. o Resilience can be a strong requirement
      o Distributed systems are complex
      o Isolate your dependencies
      o It’s not only about microservices, but very applicable there
      o Circuit Breaker is your friend
      o Monitoring is a must
      o Use