Handling Failure in Microservice Architectures

mattheath
January 15, 2016

Presented at NDC London on 15th January 2016

Microservice architectures let us decompose domain logic into small services, each with a bounded context. This buys simplicity within services at the expense of complexity in the interactions between them.

However, any distributed system operating at scale will experience failure, and this interaction complexity makes dealing with failure harder. This is especially important when requests may traverse many systems, and the failure of a single component may cascade through several more. In this talk we look at a number of common patterns, from simple use of concurrency primitives and timeouts to control and throttle concurrency, to more complex patterns such as the circuit breaker, which can be used to prevent cascading failures and so increase the reliability of our systems.

Transcript

  1. Handling Failure in
    Microservice Architectures
    Matt Heath, Mondo
    #ndclondon

  2. @mattheath

  5. 1895

  6. monoliths
    traditional dev

  16. ?

  17. DATABASE
    APPLICATION

  18. DATABASE
    APPLICATION

  19. DATABASE
    DATABASES
    APPLICATION

  20. DATABASE
    DATABASES
    APPLICATION
    SEARCH

  21. DATABASE
    DATABASES
    APPLICATION
    CACHE
    SEARCH

  22. DATABASE
    DATABASES
    APPLICATION
    CACHE
    SEARCH
    CAT GIFS

  23. ALL
    HAIL
    THE
    MONOLITH

  24. DATABASE
    DATABASES
    APPLICATION
    CACHE
    SEARCH
    CAT GIFS

  25. APPLICATION

  28. Are your systems
    reliable?

  29. 2012 June - RBS - Batch processing causes 3 day outage
    2013 December - RBS - Card payments, cash withdrawals
    2015 June - RBS lose 600,000 payments
    2015 August - HSBC lose 275,000 payments
    2015 October - Barclays - Failure of accounts and cards
    2016 January - HSBC - 2 day outage of online systems

  30. Graceful handling
    of failure is essential

  31. Do microservices
    make this worse?

  32. The Fallacies of
    Distributed Computing

  33. Everything
    can and will fail

  34. Identifying
    Failure

  35. LOAD BALANCER
    HTTP API & ROUTING LAYER

  36. API
    SERVICE
    LOAD BALANCER
    HTTP API & ROUTING LAYER

  38. /webhooks -> Webhook API

  39. WEBHOOK
    API
    LOAD BALANCER
    HTTP API & ROUTING LAYER

  40. WEBHOOK
    API
    AUTH
    SERVICE
    WEBHOOK
    SERVICE
    LOAD BALANCER
    HTTP API & ROUTING LAYER

  41. WEBHOOK
    API
    AUTH
    SERVICE
    WEBHOOK
    SERVICE
    LOAD BALANCER
    HTTP API & ROUTING LAYER

  42. WEBHOOK
    API
    AUTH
    SERVICE
    WEBHOOK
    SERVICE
    LOAD BALANCER
    HTTP API & ROUTING LAYER

  43. WEBHOOK
    API
    AUTH
    SERVICE
    WEBHOOK
    SERVICE
    LOAD BALANCER
    HTTP API & ROUTING LAYER

  44. Error Tracking
    and Propagation

  45. api webhook api webhook service
    api webhook api webhook service

  46. api webhook api webhook service
    api webhook api webhook service

  47. api webhook api webhook service
    api webhook api webhook service
    error

  48. api webhook api webhook service
    api webhook api webhook service
    error
    SENTRY

  50. api webhook api webhook service
    api webhook api webhook service
    error
    SENTRY

  51. api webhook api webhook service
    api webhook api webhook service
    error
    SENTRY
    error
    partition

  52. api webhook api webhook service
    api webhook api webhook service
    ?????
    SENTRY
    error

  53. api webhook api webhook service
    api webhook api webhook service
    ?????
    SENTRY
    error
    timeout

  54. api webhook api webhook service
    api webhook api webhook service
    ?????
    SENTRY
    error
    timeout
    error
    timeout

  55. api webhook api webhook service
    api webhook api webhook service
    ?????
    SENTRY
    error
    error
    error
    error
    duplicated errors

  56. api webhook api webhook service
    api webhook api webhook service
    ?????
    SENTRY
    error
    error
    error
    error

  57. 8096820c-3b7b-47ec-bce6-1c239252ab40

  58. api webhook api webhook service
    api webhook api webhook service
    ?????
    SENTRY
    error
    error
    error
    error

  59. api webhook api webhook service
    api webhook api webhook service
    ?????
    SENTRY
    error
    error
    error
    error

  60. api webhook api webhook service
    api webhook api webhook service
    ?????
    SENTRY
    error
    error
    error
    error

  61. api webhook api webhook service
    api webhook api webhook service
    ?????
    SENTRY
    error
    error
    error
    error
    hash & deduplicate
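
The "hash & deduplicate" step can be sketched as follows. This is a minimal illustration rather than the speaker's implementation: the `ErrorReport` fields, the choice of SHA-256, and the idea that the UUID on slide 57 is a request ID propagated between services are all assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// ErrorReport is a hypothetical shape for an error leaving a service.
type ErrorReport struct {
	TraceID string // request-scoped ID propagated between services (assumed)
	Service string // service that observed the error
	Code    string // stable error code, e.g. "timeout"
}

// Fingerprint hashes the fields that identify the underlying failure.
// Service is deliberately left out so the same failure seen by several
// services along the call path collapses to one fingerprint.
func Fingerprint(r ErrorReport) string {
	sum := sha256.Sum256([]byte(r.TraceID + "\x00" + r.Code))
	return hex.EncodeToString(sum[:])
}

// Deduper remembers fingerprints it has already let through.
// The set grows without bound; a real system would expire old entries.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// ShouldReport returns true only the first time a fingerprint is seen.
func (d *Deduper) ShouldReport(r ErrorReport) bool {
	fp := Fingerprint(r)
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.seen[fp]; ok {
		return false
	}
	d.seen[fp] = struct{}{}
	return true
}

func main() {
	d := NewDeduper()
	trace := "8096820c-3b7b-47ec-bce6-1c239252ab40"
	reports := []ErrorReport{
		{trace, "webhook-service", "timeout"},
		{trace, "webhook-api", "timeout"},
		{trace, "api", "timeout"},
	}
	for _, r := range reports {
		fmt.Println(r.Service, "report:", d.ShouldReport(r))
	}
}
```

Only the first report for a given request and error code is forwarded; the copies raised by upstream services as the failure propagates are suppressed.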

  64. Service
    Healthchecks

  65. type Checker func() (error, map[string]string)
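
To make the `Checker` signature concrete, here is a small sketch of how such checks might be registered and run. Only the `Checker` type comes from the slide; `Result`, `RunAll`, `ErrDBDown` and the example checks are hypothetical names invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Checker is the signature shown on the slide: an error when the check
// fails, plus free-form measurements describing what the check saw.
type Checker func() (error, map[string]string)

// ErrDBDown is a hypothetical failure used by the example check.
var ErrDBDown = errors.New("connection refused")

// Result is a hypothetical record of one check run.
type Result struct {
	Name         string
	Healthy      bool
	Error        string
	Measurements map[string]string
}

// RunAll executes every registered check, in name order so that the
// output is stable, and collects the results.
func RunAll(checks map[string]Checker) []Result {
	names := make([]string, 0, len(checks))
	for name := range checks {
		names = append(names, name)
	}
	sort.Strings(names)

	results := make([]Result, 0, len(names))
	for _, name := range names {
		err, m := checks[name]()
		r := Result{Name: name, Healthy: err == nil, Measurements: m}
		if err != nil {
			r.Error = err.Error()
		}
		results = append(results, r)
	}
	return results
}

func main() {
	checks := map[string]Checker{
		"config": func() (error, map[string]string) {
			return nil, map[string]string{"loaded": "true"}
		},
		"db": func() (error, map[string]string) {
			return ErrDBDown, map[string]string{"host": "db-1"}
		},
	}
	for _, r := range RunAll(checks) {
		fmt.Printf("%s healthy=%v err=%q %v\n", r.Name, r.Healthy, r.Error, r.Measurements)
	}
}
```

Returning measurements alongside the error lets a failing check say why it failed, which is more useful to an operator than a bare up/down flag.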

  68. Endpoint error %
    DB Connection Status
    Configuration loaded?

  69. Instrumentation

  72. metrics.Counter(1.0, "cassandra.read.error", 1)

  73. metrics.Timing(1.0, "cassandra.read", time.Since(start))
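
The two calls above fit together roughly as sketched below. The `metrics` package on the slides is not shown, so this example defines its own `Metrics` interface with the same call shape; the meaning of the first argument (a sample rate) is inferred rather than confirmed, and `memMetrics`, `instrumentedRead` and `errReadFailed` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Metrics mirrors the two calls on the slides. The parameter meanings
// (sample rate, bucket name, value) are inferred, not confirmed.
type Metrics interface {
	Counter(rate float32, bucket string, n int)
	Timing(rate float32, bucket string, d time.Duration)
}

// memMetrics is an in-memory stand-in so the example runs without a
// metrics daemon. It ignores the sample rate and is not goroutine-safe.
type memMetrics struct {
	counts  map[string]int
	timings map[string][]time.Duration
}

func newMemMetrics() *memMetrics {
	return &memMetrics{
		counts:  make(map[string]int),
		timings: make(map[string][]time.Duration),
	}
}

func (m *memMetrics) Counter(_ float32, bucket string, n int) {
	m.counts[bucket] += n
}

func (m *memMetrics) Timing(_ float32, bucket string, d time.Duration) {
	m.timings[bucket] = append(m.timings[bucket], d)
}

// errReadFailed is a hypothetical error used to simulate a failed read.
var errReadFailed = errors.New("read failed")

// instrumentedRead times every call and counts the ones that fail.
func instrumentedRead(m Metrics, read func() error) error {
	start := time.Now()
	err := read()
	m.Timing(1.0, "cassandra.read", time.Since(start))
	if err != nil {
		m.Counter(1.0, "cassandra.read.error", 1)
	}
	return err
}

func main() {
	m := newMemMetrics()
	_ = instrumentedRead(m, func() error { return nil })
	_ = instrumentedRead(m, func() error { return errReadFailed })
	fmt.Println("reads timed:", len(m.timings["cassandra.read"]))
	fmt.Println("read errors:", m.counts["cassandra.read.error"])
}
```

Recording the timing for failed calls as well as successful ones keeps the latency data honest: slow failures are often the first sign of trouble.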

  74. STATSD
    SERVICE A UDP
    SERVICE B UDP
    HOST INSTANCES
    GRAFANA
    INFLUXDB
    METRICS
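
As a rough illustration of the "SERVICE -> UDP -> STATSD" leg of this diagram, the sketch below formats metrics in the plain-text StatsD line format and sends them as UDP datagrams. The address, the function names and `demoLatency` are assumptions for the example; the talk does not show which client library was used.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// demoLatency is an illustrative value used in main.
const demoLatency = 12 * time.Millisecond

// counterLine formats a StatsD counter datagram: <bucket>:<n>|c
func counterLine(bucket string, n int) string {
	return fmt.Sprintf("%s:%d|c", bucket, n)
}

// timingLine formats a StatsD timer datagram in whole milliseconds:
// <bucket>:<ms>|ms
func timingLine(bucket string, d time.Duration) string {
	return fmt.Sprintf("%s:%d|ms", bucket, d.Milliseconds())
}

// send writes one datagram. UDP is fire-and-forget: a missing or slow
// collector does not block the service, at the cost of possible loss.
func send(addr, line string) error {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(line))
	return err
}

func main() {
	// 8125 is the conventional StatsD port; the address is an assumption.
	addr := "127.0.0.1:8125"
	lines := []string{
		counterLine("cassandra.read.error", 1),
		timingLine("cassandra.read", demoLatency),
	}
	for _, line := range lines {
		if err := send(addr, line); err != nil {
			fmt.Println("send failed:", err)
		}
	}
}
```

A production client would normally reuse one connection and batch lines rather than dialing per metric; dialing each time just keeps the sketch short.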

  75. TELEGRAF w/ STATSD PLUGIN
    SERVICE A UDP
    SERVICE B UDP
    HOST INSTANCES
    GRAFANA
    INFLUXDB
    TAGGED METRICS

  77. Handling
    Failure

  78. Timing out
    and moving on
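
A minimal sketch of "timing out and moving on", using the standard library `context` package (part of the standard library since Go 1.7, so later than this talk). `callWithTimeout`, `slowService`, `timedOut` and the two durations are hypothetical names and values chosen for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Illustrative durations; sensible values should come from measurement.
const (
	shortWait = 10 * time.Millisecond
	longWait  = time.Second
)

// callWithTimeout runs fn and gives up once timeout has elapsed.
// fn receives the context so a well-behaved callee can stop early.
func callWithTimeout(timeout time.Duration, fn func(ctx context.Context) (string, error)) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	type result struct {
		val string
		err error
	}
	ch := make(chan result, 1) // buffered so the goroutine never blocks on send
	go func() {
		v, err := fn(ctx)
		ch <- result{v, err}
	}()

	select {
	case r := <-ch:
		return r.val, r.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// slowService is a hypothetical downstream call that takes d to respond
// and stops early if its context is cancelled.
func slowService(d time.Duration) func(context.Context) (string, error) {
	return func(ctx context.Context) (string, error) {
		select {
		case <-time.After(d):
			return "ok", nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// timedOut reports whether err was caused by the deadline expiring.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	v, err := callWithTimeout(longWait, slowService(shortWait))
	fmt.Println("fast call:", v, err)

	_, err = callWithTimeout(shortWait, slowService(longWait))
	fmt.Println("slow call timed out:", timedOut(err))
}
```

The caller stops waiting when the deadline passes, and because the same context is handed to the callee, the abandoned work can stop too instead of running to completion unobserved.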

  79. Sensible
    Timeouts?

  80. Measure
    EVERYTHING!

  82. TIMEOUT?

  83. api webhook api webhook service
    api webhook api webhook service

  84. api webhook api webhook service
    api webhook api webhook service
    Client
    Logic
    Server

  85. api webhook api webhook service
    api webhook api webhook service
    Client
    Logic
    Server

  87. Client
    Remote Server
    Circuit Breaker

  88. Client
    Remote Server
    Circuit Breaker
    Error! Error!

  89. Client
    Remote Server
    Circuit Breaker
    Timeout! Timeout!

  90. Client
    Remote Server
    Circuit Breaker
    Circuit
    Open!
    OPEN

  92. Client
    Remote Server
    Circuit Breaker
    Return
    Error
    or
    Cached
    Results
    OPEN
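
The circuit breaker sequence in the preceding slides (errors and timeouts accumulate, the circuit opens, the client gets an error or cached results) can be sketched as below. This is a deliberately small version under stated assumptions: it trips on consecutive failures, and `Breaker`, `ErrOpen`, `errRemote`, the threshold and `demoCooldown` are all hypothetical. Production libraries usually track error rates over a window instead.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned without calling the remote server while the
// circuit is open.
var ErrOpen = errors.New("circuit open")

// errRemote is a hypothetical remote failure used by the example.
var errRemote = errors.New("remote timeout")

// demoCooldown is an illustrative value, not a recommendation.
const demoCooldown = time.Minute

// Breaker trips after threshold consecutive failures and stays open
// for cooldown before letting calls through again.
type Breaker struct {
	mu        sync.Mutex
	threshold int
	cooldown  time.Duration
	failures  int
	open      bool
	openedAt  time.Time
}

func NewBreaker(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

// Call runs fn unless the circuit is open, in which case it fails fast.
// After the cooldown, calls are tried again: a success closes the
// circuit, a failure re-opens it. Under concurrency several trial calls
// may get through at once; this sketch does not limit them to one.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.open && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.failures, b.open = 0, false
		return nil
	}
	b.failures++
	if b.failures >= b.threshold {
		b.open, b.openedAt = true, time.Now()
	}
	return err
}

func main() {
	b := NewBreaker(2, demoCooldown)
	cached := "cached result"
	failing := func() error { return errRemote }

	for i := 0; i < 4; i++ {
		err := b.Call(failing)
		if errors.Is(err, ErrOpen) {
			fmt.Println("circuit open, serving:", cached)
			continue
		}
		fmt.Println("call failed:", err)
	}
}
```

Once the breaker is open the remote server receives no traffic from this client, which gives it room to recover and keeps the client from tying up resources on calls that are likely to fail.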

  93. Topology
    Management

  94. WEBHOOK
    API
    LOAD BALANCER
    HTTP API & ROUTING LAYER
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE

  95. WEBHOOK
    API
    LOAD BALANCER
    HTTP API & ROUTING LAYER
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE

  96. WEBHOOK
    API
    LOAD BALANCER
    HTTP API & ROUTING LAYER
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE

  97. WEBHOOK
    API
    LOAD BALANCER
    HTTP API & ROUTING LAYER
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE
    SLOW /
    ERRORS

  98. Fanout &
    Cancellation
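
One common form of fanout with cancellation is to send the same request to several instances, take the first successful answer and cancel the rest. The sketch below assumes that is the pattern meant here; the slide text alone does not confirm it. `firstSuccess`, `replica`, `demo` and the latencies are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type callFn func(ctx context.Context) (string, error)

// firstSuccess calls every backend concurrently, returns the first
// successful response and cancels the calls still in flight.
func firstSuccess(ctx context.Context, backends []callFn) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stops the remaining calls as soon as we return

	type result struct {
		val string
		err error
	}
	// Buffered so late finishers can send and exit without a receiver.
	results := make(chan result, len(backends))
	for _, call := range backends {
		go func(call callFn) {
			v, err := call(ctx)
			results <- result{v, err}
		}(call)
	}

	lastErr := errors.New("no backends")
	for range backends {
		r := <-results
		if r.err == nil {
			return r.val, nil
		}
		lastErr = r.err
	}
	return "", lastErr
}

// replica is a hypothetical backend that answers with its name after d,
// or stops early when its context is cancelled.
func replica(name string, d time.Duration) callFn {
	return func(ctx context.Context) (string, error) {
		select {
		case <-time.After(d):
			return name, nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// demo fans out to three replicas with different latencies.
func demo() (string, error) {
	return firstSuccess(context.Background(), []callFn{
		replica("a", 200*time.Millisecond),
		replica("b", 20*time.Millisecond),
		replica("c", 500*time.Millisecond),
	})
}

func main() {
	v, err := demo()
	fmt.Println("winner:", v, "err:", err)
}
```

Cancellation matters as much as the fanout: without it, every request would leave slower instances doing work whose result nobody will read.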

  99. WEBHOOK
    API
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE

  100. WEBHOOK
    API
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE

  101. WEBHOOK
    API
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE

  102. WEBHOOK
    API
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE

  103. WEBHOOK
    API
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE
    WEBHOOK
    SERVICE

  104. Event Driven
    Architectures

  105. API
    SERVICE
    SERVICE
    A
    SERVICE
    B
    LOAD BALANCER
    HTTP API & ROUTING LAYER

  106. API
    SERVICE
    SERVICE
    A
    SERVICE
    B
    LOAD BALANCER
    HTTP API & ROUTING LAYER

  107. API
    SERVICE
    SERVICE
    A
    SERVICE
    B
    LOAD BALANCER
    HTTP API & ROUTING LAYER
    SERVICE
    C
    SERVICE
    D
    E

  108. API
    SERVICE
    SERVICE
    A
    SERVICE
    B
    LOAD BALANCER
    HTTP API & ROUTING LAYER
    SERVICE
    C
    SERVICE
    D
    G
    E
    F

  109. Retry Strategies

  110. Bounded exponential
    backoff

  111. Bounded exponential
    backoff with Jitter
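
Bounded exponential backoff with jitter can be sketched in a few lines. The variant shown is "full jitter" (a uniformly random delay between zero and the capped exponential value), which is one of several jitter schemes and not necessarily the one used in the talk. `backoff`, `withJitter`, `retry`, `errTemporary` and the durations are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// Illustrative values only.
const (
	demoBase = time.Millisecond
	demoMax  = 8 * time.Millisecond
)

// errTemporary is a hypothetical transient failure.
var errTemporary = errors.New("temporary failure")

// backoff returns the delay cap before retry number attempt (0-based):
// base * 2^attempt, never more than max. That limit is the "bounded" part.
func backoff(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// withJitter picks a uniformly random delay in [0, d], so that clients
// which failed together do not all retry at the same instant.
// Before Go 1.20 the global source is deterministic unless seeded.
func withJitter(d time.Duration) time.Duration {
	if d <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(d) + 1))
}

// retry calls fn up to attempts times, sleeping between failures.
func retry(attempts int, base, max time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(withJitter(backoff(i, base, max)))
		}
	}
	return err
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		limit := backoff(attempt, demoBase, demoMax)
		fmt.Printf("attempt %d: cap %v, sleep %v\n", attempt, limit, withJitter(limit))
	}

	calls := 0
	err := retry(3, demoBase, demoMax, func() error {
		calls++
		if calls < 3 {
			return errTemporary
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

The bound keeps a long outage from producing absurd delays, and the jitter spreads retries out so a recovering service is not hit by a synchronized wave of them.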

  112. Encouraging
    Failure?

  113. Antifragility

  114. Chaos
    Engineering

  117. Load
    Failure
    Degradation

  118. Eliminate primaries
    and special nodes

  119. Putting it
    all together

  120. API card-api card-processing cards transactions balance transaction-enrichment merchant feed-generator feed apns
    API card-api card-processing cards transactions balance transaction-enrichment merchant feed-generator feed apns

  121. API card-api card-processing cards transactions balance transaction-enrichment merchant feed-generator feed apns
    API card-api card-processing cards transactions balance transaction-enrichment merchant feed-generator feed apns

  122. API card-api card-processing cards transactions balance transaction-enrichment merchant feed-generator feed apns
    API card-api card-processing cards transactions balance transaction-enrichment merchant feed-generator feed apns

  123. API card-api card-processing cards transactions balance transaction-enrichment merchant feed-generator feed apns
    API card-api card-processing cards transactions balance transaction-enrichment merchant feed-generator feed apns

  124. API card-api card-processing cards transactions balance transaction-enrichment merchant feed-generator feed apns
    API card-api card-processing cards transactions balance transaction-enrichment merchant feed-generator feed apns

  126. API card-api card-processing cards transactions balance transaction-enrichment merchant feed-generator feed apns
    API card-api card-processing cards transactions balance transaction-enrichment merchant feed-generator feed apns

  129. #ndclondon
    Thanks!
    @mattheath
    @getmondo

  130. Credits
    ATM: Thomas Hawk
    Bank of Commerce: ABQ Museum Archives
    IBM System/360: IBM
    Absorbed: Saxbald Photography
    Orbital Ion Cannon: www.rom.ac
