Resilient distributed systems with Netflix Hystrix

Hystrix is a latency and fault tolerance library designed to enable resilience in distributed systems where failure is inevitable.
Oleksiy will introduce this library and show how and why they use it in mission-critical projects.

Oleksii Diagiliev

August 06, 2015

Transcript

  1. Oleksiy Dyagilev
    Lead Software Engineer

  3. consider amazon.com

  12. August 19, 2013: amazon.com went down around 1 p.m. Pacific time
    for 45 mins. Estimated cost: $117,882.
    http://venturebeat.com/2013/08/19/amazon-website-down/

  13. One of our production use cases: HTTP session
    o App containers (jetty, tomcat, etc), each with a Spring Session Filter
    o XAP in-memory datagrid
    o network call: read, write session object

  16. misconfiguration
    bursty traffic
    software bugs
    hardware issues

  21. App containers (jetty, tomcat, etc), each with a Spring Session Filter
    XAP in-memory datagrid
    o power failure
    o misconfiguration
    o firmware bugs
    o topology changes
    o cable damage
    o malicious traffic

  22. If an application depends on 30 services where each has 99.99% uptime
    (4.3 mins downtime/month),
    its uptime is 99.99%^30 ≈ 99.7%
    (2.1 hours downtime/month).
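
The arithmetic on this slide can be reproduced with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, failures of the 30 services are assumed to be independent, and a month is assumed to be 30 days. Under those assumptions the combined downtime comes to a little over two hours per month, in line with the slide's figure.

```java
// Sketch: compound availability of an application that needs all of its dependencies up.
// Names are hypothetical; a 30-day month and independent failures are assumptions.
final class CompoundUptime {

    // Minutes in a 30-day month, used for the downtime figures.
    static final double MINUTES_PER_MONTH = 30 * 24 * 60;

    /** Availability of the whole application when every dependency must be up. */
    static double combinedUptime(double perServiceUptime, int services) {
        return Math.pow(perServiceUptime, services);
    }

    /** Expected downtime per month, in minutes, for a given availability. */
    static double downtimeMinutesPerMonth(double uptime) {
        return (1.0 - uptime) * MINUTES_PER_MONTH;
    }

    public static void main(String[] args) {
        double single = 0.9999;
        double combined = combinedUptime(single, 30);
        System.out.printf("single service: %.2f min/month%n",
                downtimeMinutesPerMonth(single));
        System.out.printf("30 services: %.2f%% uptime, %.2f hours/month%n",
                combined * 100, downtimeMinutesPerMonth(combined) / 60);
    }
}
```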

  24. o preventing any single dependency from using all container (Tomcat, etc.) user threads
    o shedding load and failing fast instead of queueing
    o providing fallbacks wherever feasible to protect users from failure
    o real-time metrics and monitoring
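
The first three goals above can be illustrated without Hystrix, using only java.util.concurrent. The sketch below is not how Hystrix is implemented; `IsolatedDependency` and its methods are hypothetical names. It gives one dependency its own small thread pool with no queue, so a slow dependency cannot occupy the container's threads, extra calls are rejected immediately instead of queueing, and every failure path returns a fallback.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper: isolates calls to one dependency in a dedicated, bounded pool.
final class IsolatedDependency<T> {
    private final ExecutorService pool;
    private final long timeoutMillis;

    IsolatedDependency(int maxConcurrent, long timeoutMillis) {
        // SynchronousQueue holds no tasks: when all threads are busy,
        // submit() is rejected right away instead of queueing.
        this.pool = new ThreadPoolExecutor(maxConcurrent, maxConcurrent,
                0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(), new ThreadPoolExecutor.AbortPolicy());
        this.timeoutMillis = timeoutMillis;
    }

    /** Runs work on the dedicated pool; returns the fallback on rejection, timeout or error. */
    T call(Callable<T> work, Supplier<T> fallback) {
        Future<T> future;
        try {
            future = pool.submit(work);
        } catch (RejectedExecutionException e) {
            return fallback.get();              // pool saturated: shed load, fail fast
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                // stop waiting on a slow dependency
            return fallback.get();
        } catch (ExecutionException e) {
            return fallback.get();              // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback.get();
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        IsolatedDependency<String> sessions = new IsolatedDependency<>(2, 100);
        String result = sessions.call(
                () -> { throw new IllegalStateException("datagrid unreachable"); },
                () -> "failover-session");
        System.out.println(result);
        sessions.shutdown();
    }
}
```

The pool size and timeout are the two knobs that matter here: the pool size caps how many caller threads one dependency can tie up, and the timeout caps how long each of them waits.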

  26. import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    public class CommandHelloWorld extends HystrixCommand<String> {
        private final String name;
        public CommandHelloWorld(String name) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.name = name;
        }
        @Override
        protected String run() {
            // a real example would do work like a network call here
            return "Hello " + name + "!";
        }
    }
    // JUnit test (fragment)
    @Test
    public void testExecute() {
        assertEquals("Hello World!", new CommandHelloWorld("World").execute());
    }

  27. import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    public class CommandHelloFailure extends HystrixCommand<String> {
        private final String name;
        public CommandHelloFailure(String name) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.name = name;
        }
        @Override
        protected String run() {
            // simulates a failing dependency
            throw new RuntimeException("this command always fails");
        }
        @Override
        protected String getFallback() {
            // returned by execute() when run() fails
            return "Hello Failure " + name + "!";
        }
    }
    // JUnit test (fragment)
    @Test
    public void testSynchronous() {
        assertEquals("Hello Failure World!", new CommandHelloFailure("World").execute());
    }

  28. public class GetSessionHystrixCommand extends ConfigurableHystrixCommand {
        private static Logger log = LoggerFactory.getLogger(GetSessionHystrixCommand.class);
        private final String sessionId;
        private final GetSessionCommand getSessionCommand;
        public GetSessionHystrixCommand(String id, RestExecutionContext context, CommandSettings settings) {
            super(XAP_SESSION_COMMAND, settings);
            this.sessionId = id;
            this.getSessionCommand = new GetSessionCommand(id, context);
        }
        @Override
        protected XapSession run() throws Exception {
            try {
                return getSessionCommand.execute();
            } catch (Exception exception) {
                log.error("Failed to get session", exception);
                throw exception;
            }
        }
        @Override
        protected ExpiringSession getFallback() {
            log.error("Falling back on getting session due to {}", getExecutionEvents());
            return FailoverSession.create(sessionId);
        }
    }
    production code

  32. Fallback: sacrificing consistency for availability
    o App container (jetty, tomcat, etc) with Spring Session Filter
    o Primary datagrid
    o Secondary datagrid (WAN replication), used as the fallback
    You might not need this if the entire infrastructure is replicated in another DC.

  34. http://martinfowler.com/bliki/CircuitBreaker.html

  46. Flow of .execute():
    o Circuit-breaker open? yes, short-circuit: .getFallback(); no: continue
    o Thread pool rejected? yes, reject: .getFallback(); no: .run()
    o Execution fails? yes: .getFallback(); no: continue
    o Timeout? yes: .getFallback(); no: return result of run()
    o Fallback successful? yes: return result of fallback(); no: return exception
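
The decision flow on this slide can be sketched in plain Java. This is a simplified illustration, not Hystrix's implementation: `SimpleCircuitBreaker` is a hypothetical class, it trips on a count of consecutive failures (an assumption made for brevity), and it leaves out the thread pool and timeout checks. The injected clock is also an assumption, added so the open and half-open states can be exercised without sleeping.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified circuit breaker following the slide's flow.
final class SimpleCircuitBreaker {
    private final int failureThreshold;   // consecutive failures that trip the breaker
    private final long openMillis;        // how long to short-circuit before a trial call
    private final LongSupplier clock;     // current time in milliseconds
    private int consecutiveFailures = 0;
    private long openedAt = -1;           // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** True while calls should be short-circuited. */
    synchronized boolean isOpen() {
        if (openedAt < 0) {
            return false;
        }
        // Once the window has elapsed, let a trial call through (half-open).
        return clock.getAsLong() - openedAt < openMillis;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong();   // trip, or re-trip after a failed trial call
        }
    }

    /** Circuit open? -> fallback. Otherwise run; on failure -> fallback. */
    <T> T execute(Callable<T> run, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get();          // short-circuit: run is never called
        }
        try {
            T result = run.call();
            recordSuccess();
            return result;
        } catch (Exception e) {
            recordFailure();
            return fallback.get();          // if the fallback throws, the caller sees the exception
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker =
                new SimpleCircuitBreaker(2, 1000, System::currentTimeMillis);
        for (int i = 0; i < 3; i++) {
            String result = breaker.execute(
                    () -> { throw new IllegalStateException("dependency down"); },
                    () -> "fallback");
            System.out.println(result + ", open=" + breaker.isOpen());
        }
    }
}
```

Once the breaker is open, calls return the fallback without touching the failing dependency, which is what gives it time to recover.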

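  47. // asynchronous execution: returns a java.util.concurrent.Future
    Future<String> fs = new CommandHelloWorld("Bob").queue();
    // reactive execution: returns an rx.Observable
    Observable<String> os = new CommandHelloWorld("Bob").observe();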

  50. With 60 requests/second:
    o at the 90th percentile there is a cost of 3ms
    o at the 99th percentile there is a cost of 9ms

  53. o Resilience can be a strong requirement
    o Distributed systems are complex
    o Isolate your dependencies
    o It’s not only about microservices, but very applicable there
    o Circuit Breaker is your friend
    o Monitoring is a must
    o Use
