Expect the un-expected: How to handle errors gracefully

Even if you have tested your application thoroughly, errors and bugs will still happen in production, especially if other services or databases go down or are under high load. It is therefore very important to see errors happening and to be able to react to them quickly. In this talk I’ll introduce you to efficient ways to monitor and log errors and show how you can handle them when they happen, covering deployment strategies, intelligent circuit breakers, and gracefully reducing functionality.

Bastian Hofmann

April 08, 2017

Transcript

  1. Expect the un-expected: How to handle errors gracefully (@BastianHofmann)

  4. 12 million users

  5. 193 countries

  6. ~1800 requests/s

  7. lots of data

  8. >100 million publications

  9. ~140 components

  10. ~400 repositories

  11. ~80 engineers

  12. Growing Traffic

  13. Active development

  14. Continuous Delivery

  15. Bugs

  16. Performance regressions

  17. Database overloads

  18. Hardware failures

  19. Distributed Systems

  20. Microservices

  21. Multiple technology stacks

  22. $complexity++

  23. Errors happen

  24. How to detect them

  25. How to handle them

  26. Detecting if something breaks

  27. Logging

  28. Error Logs

  29. callable set_exception_handler(
        callable $exception_handler
    );

  30. bool error_log ( string $message [, int $message_type = 0
        [, string $destination [, string $extra_headers ]]] )

  31. ErrorLog /var/logs/apache/error.log

  33. Monolog

  34. Log additional info

  36. Request Information
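
The exception handler, error_log() and the "log additional info" idea from the slides above can be combined into one small piece of code. This is only a sketch: buildErrorRecord() and the field names are assumptions made for illustration, not something shown in the talk or provided by Monolog.

```php
<?php
// Hypothetical helper: turn an uncaught exception plus request data
// into one structured record that can be logged as a single line.
function buildErrorRecord(\Throwable $e, array $server): array
{
    return [
        'type'    => get_class($e),
        'message' => $e->getMessage(),
        'file'    => $e->getFile(),
        'line'    => $e->getLine(),
        'trace'   => $e->getTraceAsString(),
        // Request information that makes the entry easier to debug later.
        'request' => [
            'method' => $server['REQUEST_METHOD'] ?? null,
            'uri'    => $server['REQUEST_URI'] ?? null,
            'host'   => $server['HTTP_HOST'] ?? null,
        ],
    ];
}

// Called for every exception that nobody caught.
set_exception_handler(function (\Throwable $e) {
    // With the default message_type 0, error_log() writes to PHP's configured error log.
    error_log(json_encode(buildErrorRecord($e, $_SERVER)));
});
```

Keeping the record builder separate from the handler makes it easy to reuse the same fields when a library such as Monolog does the actual writing.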

  37. PHP has more than just Exceptions / Throwables

  38. Handling Notices, Warnings, Errors

  39. callable set_error_handler(
        callable $error_handler
        [, int $error_types = E_ALL | E_STRICT ]
    );

  40. set_error_handler(
        function ($errno, $msg, $file, $line) {
            // The exception is only created to capture the current stack trace.
            $e = new \Exception();
            $error = [
                'type' => $errno,
                'message' => $msg,
                'file' => $file,
                'line' => $line,
                'trace' => $e->getTrace(),
            ];
            // Trace arguments may not be encodable; log what can be encoded.
            error_log(json_encode($error, JSON_PARTIAL_OUTPUT_ON_ERROR));
            // true tells PHP not to run its built-in handler as well.
            return true;
        }
    );

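  41. Or directly turn them into Exceptions

  42. // WarningException, NoticeException and StrictException are custom
    // classes; they are assumed here to extend \ErrorException.
    class WarningException extends \ErrorException {}
    class NoticeException extends \ErrorException {}
    class StrictException extends \ErrorException {}

    set_error_handler(
        function ($errno, $msg, $file, $line) {
            switch ($errno) {
                case E_RECOVERABLE_ERROR:
                case E_USER_ERROR:
                    throw new \ErrorException($msg, 0, $errno, $file, $line);
                // E_CORE_WARNING and E_COMPILE_WARNING never reach a user handler.
                case E_WARNING:
                case E_USER_WARNING:
                    throw new WarningException($msg, 0, $errno, $file, $line);
                case E_NOTICE:
                case E_USER_NOTICE:
                    throw new NoticeException($msg, 0, $errno, $file, $line);
                case E_STRICT:
                    throw new StrictException($msg, 0, $errno, $file, $line);
            }
            // Anything else falls through to PHP's standard handler.
            return false;
        }
    );

    $a = [];
    try {
        // An undefined key is an E_NOTICE in PHP 7 and an E_WARNING in PHP 8.
        $b = $a['doesNotExist'];
    } catch (NoticeException | WarningException $e) {
        // Handle the missing key here.
    }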

  43. But what about fatal errors

  44. Use error_get_last() in a Shutdown Handler
    http://php.net/manual/en/function.error-get-last.php
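
A shutdown handler along these lines could catch the fatal errors that never reach set_error_handler(). This is a minimal sketch: isFatal() and the FATAL_TYPES constant are names made up for this example, and the choice of error types is an assumption.

```php
<?php
// Error types that end the script without ever reaching set_error_handler().
const FATAL_TYPES = E_ERROR | E_PARSE | E_CORE_ERROR | E_COMPILE_ERROR;

// Hypothetical helper; $error has the shape returned by error_get_last():
// an array with the keys type, message, file and line, or null.
function isFatal(?array $error): bool
{
    return $error !== null && ($error['type'] & FATAL_TYPES) !== 0;
}

// Runs when the script ends, whether it finished normally or died.
register_shutdown_function(function () {
    $error = error_get_last(); // null if no error occurred
    if (isFatal($error)) {
        error_log(json_encode($error));
    }
});
```

The type check matters because error_get_last() also returns the last notice or warning, which the regular error handler has usually logged already.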

  45. Log in a structured way

  46. JSON
    http://www.ietf.org/rfc/rfc4627.txt

  47. Logs from other services

  48. [Diagram: a user request reaches the web server, which calls several HTTP services; the web server and every service write a log]

  49. Correlation / Tracing ID

  50. [Diagram: the web server creates a unique trace_id for the user request and sends it along with each call to the HTTP services; the web server and every service write a log]

  51. X-Trace-Id: bbr8ehb984tbab894
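
One way to handle the trace ID in PHP is sketched below. The X-Trace-Id header name comes from the slide; the three helper functions, the accepted ID format and the length of generated IDs are assumptions for illustration only.

```php
<?php
// Reuse the trace ID of the incoming request, or create one at the edge.
function resolveTraceId(array $server): string
{
    // PHP exposes the X-Trace-Id request header as HTTP_X_TRACE_ID.
    $incoming = $server['HTTP_X_TRACE_ID'] ?? '';
    if (preg_match('/^[A-Za-z0-9]{8,64}$/', $incoming) === 1) {
        return $incoming;
    }
    return bin2hex(random_bytes(8)); // 16 hex characters
}

// Header to add to every call this request makes to another service.
function outgoingHeaders(string $traceId): array
{
    return ['X-Trace-Id: ' . $traceId];
}

// Every log line carries the ID so entries can be correlated later.
function logWithTrace(string $traceId, string $message): string
{
    return json_encode(['trace_id' => $traceId, 'message' => $message]);
}

$traceId = resolveTraceId($_SERVER);
error_log(logWithTrace($traceId, 'request started'));
```

Searching the central log store for one trace_id then returns the entries of all services that took part in a single user request.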

  52. Getting to the logs

  56. Put the logs in a central place

  57. Make them easily full-text searchable

  58. Make them aggregate-able

  59. Logstash + Elasticsearch + Kibana

  60. Full text search

  61. Structured Messages

  62. [Diagram: each webserver writes a log that a local logstash reads and sends to AMQP; from there Logstash feeds elasticsearch]

  63. logstash
    http://logstash.net/

  64. input -> filter -> output

  65. Very rich plugin system

  66. [Diagram repeated: webserver log -> logstash -> AMQP -> Logstash -> elasticsearch]

  67. input {
        file {
            type => "error"
            path => [ "/var/logs/php/*.log" ]
            add_field => [ "severity", "error" ]
        }
    }

  68. filter {
        json {
            source => "message"
        }
    }

  69. output {
        amqp {
            host => "amqp.host"
            exchange_type => "fanout"
            name => "logs"
        }
    }

  70. [Diagram repeated: webserver log -> logstash -> AMQP -> Logstash -> elasticsearch]

  71. input {
        rabbitmq {
            queue => "logs"
            host => "amqp.host"
            exchange => "ls_exchange"
            exclusive => false
        }
    }

  72. output {
        elasticsearch {
            embedded => false
            bind_host => "localhost"
            bind_port => "9305"
            host => "localhost"
            cluster => "logs"
        }
    }

  73. Kibana
    http://www.elasticsearch.org/overview/kibana/

  75. Always Log to file

  76. Monitoring

  77. Latency

  78. Availability

  79. Throughput

  80. Overall

  81. Granular

  82. Graphite
    http://graphite.wikidot.com/

  84. StatsD
    https://github.com/etsy/statsd/

  85. [Diagram: the webservers send metrics to statsd instances, which forward them to graphite]
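
Sending a metric to StatsD from PHP can be as small as the sketch below. It uses the plain-text StatsD line format "name:value|type" over UDP; the helper functions, the metric names and the host are assumptions, and 8125 is the port StatsD listens on by default.

```php
<?php
// Build one StatsD line, e.g. "web.requests:1|c" (c = counter, ms = timer).
function statsdLine(string $name, $value, string $type): string
{
    return sprintf('%s:%s|%s', $name, $value, $type);
}

// Fire and forget: a missing StatsD daemon must never break the request.
function statsdSend(string $line, string $host = '127.0.0.1', int $port = 8125): bool
{
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($socket === false) {
        return false;
    }
    $written = @fwrite($socket, $line);
    fclose($socket);
    return $written !== false;
}

$start = microtime(true);
// ... handle the request ...
$elapsedMs = (int) round((microtime(true) - $start) * 1000);

statsdSend(statsdLine('web.requests', 1, 'c'));                // throughput
statsdSend(statsdLine('web.response_time', $elapsedMs, 'ms')); // latency
```

Counters like web.requests cover throughput and timers like web.response_time cover latency; more granular metric names per component give the detailed view.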

  86. https://www.librato.com

  87. http://www.soasta.com/

  88. Dashboards

  92. Alerts

  94. Errors happen

  95. Handling Exceptions gracefully

  97. Component based frontend

  105. Degrading Functionality

  106. [Diagram: page component tree with Profile, Publications (three Publication components), AboutMe, LeftColumn, Image, Menu and Institution]

  107. [Diagram: the same component tree; one component throws an EXCEPTION]

  108. [Diagram: the same component tree without AboutMe; Profile, Publications, LeftColumn, Image, Menu and Institution remain]
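
The idea of dropping a failed component instead of failing the whole page might look like this in code. It is a sketch only: renderPage() and the component closures are hypothetical and much simpler than a real component framework.

```php
<?php
// Render each component on its own and leave out the ones that fail.
function renderPage(array $components, callable $log): string
{
    $html = '';
    foreach ($components as $name => $render) {
        try {
            $html .= $render();
        } catch (\Throwable $e) {
            // Log the failure, leave a gap on the page, keep rendering the rest.
            $log(sprintf('component %s failed: %s', $name, $e->getMessage()));
        }
    }
    return $html;
}

$page = renderPage([
    'Publications' => function () { return '<ul>publications</ul>'; },
    'AboutMe'      => function () { throw new \RuntimeException('service down'); },
    'Institution'  => function () { return '<p>institution</p>'; },
], 'error_log');
```

Here the page still contains Publications and Institution; only AboutMe is missing, and its failure ends up in the error log.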

  109. Deployments

  110. Errors happen

  111. Safe deployments

  112. Feature Flags / Toggles

  113. Release !== Deployment

  114. public function hasAccess(
        $accountId, array $roleNames
    ) {
        return featureFlag()->isActive(
            FeatureFlag::TEST_ONE
        );
    }

  117. Partial rollout
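
A partial rollout is often implemented by putting every account into a stable bucket. The function below is a sketch under that assumption; isEnabledFor() is a made-up name and not the feature flag API from the slide above.

```php
<?php
// Decide whether a flag is active for an account at a given rollout percentage.
function isEnabledFor(string $flag, int $accountId, int $percentage): bool
{
    // Hash flag and account together so every flag gets its own stable
    // set of accounts; 7 hex digits always fit into an integer.
    $bucket = hexdec(substr(md5($flag . ':' . $accountId), 0, 7)) % 100; // 0..99
    return $bucket < $percentage;
}

// Roll TEST_ONE out to 10% of accounts first, then raise the number.
$active = isEnabledFor('TEST_ONE', 42, 10);
```

Because the bucket depends only on the flag and the account, an account that sees the feature at 10% keeps it when the rollout grows to 30% or 100%.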

  118. Automation

  119. Canary environments

  120. [Diagram: four servers]

  121. [Diagram: four servers, annotated: Test with low amount of traffic]

  122. Fast rollbacks

  123. Oftentimes it’s not only one monolith

  125. Safe Service-to-Service Communication

  126. [Diagram: Service A calls Service B and gets 200 OK]

  127. Distributed Systems are hard

  128. Errors happen

  129. Failures

  130. [Diagram: Service A calls Service B and gets a 5xx]

  131. Timeouts

  132. [Diagram: Service A calls Service B and runs into a timeout]

  133. Low timeout settings

  134. [Diagram: Service A calls Service B and only waits for max. 50ms]
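
A low timeout can be set with the timeout option of PHP's http stream wrapper, as in the sketch below. The helper names are hypothetical; the 50 ms default mirrors the slide.

```php
<?php
// Stream context whose HTTP calls give up after the given number of seconds.
function timeoutContext(float $seconds)
{
    return stream_context_create(['http' => ['timeout' => $seconds]]);
}

// Returns the response body, or null if the call failed or took too long.
function fetchWithTimeout(string $url, float $seconds = 0.05): ?string
{
    // @ silences the warning; the caller only needs to know that no answer came.
    $body = @file_get_contents($url, false, timeoutContext($seconds));
    return $body === false ? null : $body;
}
```

A caller can then fall back to a default, for example `$body = fetchWithTimeout($url) ?? '';`, instead of keeping the user waiting for a slow service.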

  135. Don’t call service instances directly

  136. Use a Load Balancer

  137. [Diagram: Service A, Service B, two Service C instances and a load balancer: Choose one running instance]

  138. Health checks

  139. [Diagram: same setup; the load balancer asks: Are you still alive?]

  140. [Diagram: same setup; Are you still alive?]

  141. [Diagram: Service A, Service B, two Service C instances and the load balancer]
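
What the load balancer does with its health checks can be reduced to a few lines. This is an illustration, not a replacement for a real load balancer: chooseInstance() and the health check callback are hypothetical.

```php
<?php
// Keep only the instances that pass the health check and pick one of them.
function chooseInstance(array $instances, callable $isAlive): ?string
{
    $healthy = array_values(array_filter($instances, $isAlive));
    if ($healthy === []) {
        return null; // nothing to route to
    }
    return $healthy[random_int(0, count($healthy) - 1)];
}
```

In practice $isAlive would be backed by periodic "are you still alive?" requests, so a dead instance stops receiving traffic without any caller having to know about it.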

  142. Circuit Breakers

  143. [Diagram: Service A calls Service B through a circuit breaker and gets 200 OK. Status: closed, error rate: 0]

  144. [Diagram: the call to Service B returns an error. Status: -> open, error rate: > threshold]

  145. [Diagram: Service A, circuit breaker, Service B, no response shown. Status: -> open, error rate: > threshold]

  146. [Diagram: Test if still failing; the call to Service B returns an error. Status: -> open, error rate: > threshold]

  147. [Diagram: Test if still failing; Service B answers 200 OK. Status: -> closed, error rate: 0]
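
The closed, open and "test if still failing" states can be sketched as a small class. This is a simplification under stated assumptions: the class and its parameters are hypothetical, and it counts consecutive failures where the slides talk about an error rate.

```php
<?php
class CircuitBreaker
{
    private $failures = 0;
    private $openedAt = null; // null means the circuit is closed
    private $threshold;
    private $retryAfter;

    public function __construct(int $threshold = 3, float $retryAfter = 5.0)
    {
        $this->threshold = $threshold;
        $this->retryAfter = $retryAfter;
    }

    // $now can be injected to make the timing testable.
    public function call(callable $service, callable $fallback, ?float $now = null)
    {
        $now = $now ?? microtime(true);
        // Open: skip the service until the retry interval has passed.
        if ($this->openedAt !== null && $now - $this->openedAt < $this->retryAfter) {
            return $fallback();
        }
        try {
            // Closed, or a trial call that tests if the service is still failing.
            $result = $service();
            $this->failures = 0;
            $this->openedAt = null; // close the circuit again
            return $result;
        } catch (\Throwable $e) {
            $this->failures++;
            if ($this->failures >= $this->threshold) {
                $this->openedAt = $now; // open, or stay open after a failed trial
            }
            return $fallback();
        }
    }

    public function isOpen(): bool
    {
        return $this->openedAt !== null;
    }
}
```

While the circuit is open the failing service gets no traffic at all, which gives it room to recover, and callers get the fallback immediately instead of waiting for another timeout.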

  148. How do I handle traffic spikes?

  149. [Diagram: Service A calls Service B through a circuit breaker and gets 200 OK]

  150. [Diagram: Service A and Service C each have a circuit breaker in front of Service B; timeouts]

  151. [Diagram repeated: Service A, Service C, their circuit breakers and Service B; timeouts]

  152. Throttling

  153. [Diagram: same setup; only allow xx% of calls]

  155. Priority

  156. [Diagram: same setup with two labels: 100% of calls, 10% of calls]
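
Throttling by priority could be sketched as follows. The Throttle class, the caller names and the percentages are assumptions chosen to mirror the "100% of calls / 10% of calls" labels; the slides do not say which caller gets which share.

```php
<?php
class Throttle
{
    private $limits;        // caller name => percentage of calls to let through
    private $counters = []; // caller name => calls seen so far

    public function __construct(array $limits)
    {
        $this->limits = $limits;
    }

    public function allow(string $caller): bool
    {
        $percent = $this->limits[$caller] ?? 0; // unknown callers get nothing
        $n = $this->counters[$caller] ?? 0;
        $this->counters[$caller] = $n + 1;
        // Spread the allowed calls evenly: a call passes whenever the
        // running quota n * percent / 100 reaches the next whole number.
        return intdiv(($n + 1) * $percent, 100) > intdiv($n * $percent, 100);
    }
}

// The important caller keeps all of its calls, the other one every tenth.
$throttle = new Throttle(['service-a' => 100, 'service-c' => 10]);
```

During a traffic spike the percentages can be lowered for low-priority callers first, so the most important requests still get through.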

  157. https://github.com/Netflix/Hystrix

  158. https://github.com/odesk/phystrix

  160. [Diagram: Service A, Service B, Service C, Circuit Breaker, LinkerD]

  161. Test it

  162. http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html

  163. Summing it up

  164. Central Log Management

  165. Monitor and Measure everything

  166. Alerts

  167. Graceful Exception Handling

  168. Feature Flags

  169. Partial Rollouts

  170. Circuit Breakers

  171. http://speakerdeck.com/u/bastianhofmann

  172. http://twitter.com/BastianHofmann
    http://lanyrd.com/people/BastianHofmann
    http://speakerdeck.com/u/bastianhofmann