Expect the Un-expected: How to Handle Errors Gracefully

Even if you have tested your application thoroughly, errors and bugs will still happen in production, especially when other services or databases go down or are under high load. It is therefore very important to see errors as they happen and to be able to react to them quickly. In this session we'll introduce efficient ways to monitor and log errors and show how to handle them when they occur, covering deployment strategies, intelligent circuit breakers, and graceful degradation of functionality. The session gives examples and recommendations so that you can quickly get started implementing these.

Bastian Hofmann

October 24, 2017

Transcript

  1. Expect the un-expected How to handle errors gracefully @BastianHofmann

  4. 14 million users

  5. 193 countries

  6. ~1800 requests/s

  7. >100 million publications

  8. ~ 140 components

  9. ~ 400 repositories

  10. ~ 80 engineers

  11. Growing Traffic

  12. Active development

  13. Continuous Delivery

  14. Constant change

  15. Things will go wrong

  16. Bugs

  17. Performance regressions

  18. Database overloads

  19. Hardware failures

  20. Distributed Systems

  21. Network

  22. Microservices

  23. Multiple technology stacks

  24. $complexity++

  25. Errors happen

  26. How to detect them

  27. How to handle them

  28. … in a way that the user does not notice them

  29. Detecting if something breaks

  30. Logging

  31. Error Logs

  33. [24-Oct-2017 18:18:24 UTC] PHP Fatal error: Uncaught Exception: error in /Users/bastian/error_test_php/index.php:5
      Stack trace:
      #0 /Users/bastian/error_test_php/index.php(8): index()
      #1 {main}
      thrown in /Users/bastian/error_test_php/index.php on line 5
  34. try {
          // your complex potentially failing code
      } catch (\Throwable $e) {
          // do some error handling
      }
  35. try {
          // your complex potentially failing code
      } catch (SomeException $e) {
          // do some error handling
      }
  36. try {
          // your complex potentially failing code
      } catch (SomeException $e) {
          // do some error handling
      } catch (OtherError $e) {
          // do some error handling
      }
  37. try {
          // your complex potentially failing code
      } catch (SomeException | OtherError $e) {
          // do some error handling
      }
  38. callable set_exception_handler( callable $exception_handler );
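
As a rough illustration of the handler registered with `set_exception_handler()`, the sketch below logs every uncaught `Throwable` as JSON and shows a generic message to the user. The helper name `describeThrowable()` and the wording of the message are assumptions made for this example; they are not taken from the slides.

```php
<?php
// Hypothetical helper: turns any Throwable into a flat array for logging.
function describeThrowable(\Throwable $e): array
{
    return [
        'class'   => get_class($e),
        'message' => $e->getMessage(),
        'file'    => $e->getFile(),
        'line'    => $e->getLine(),
        'trace'   => $e->getTraceAsString(),
    ];
}

// Last line of defence: called for every Throwable that nobody caught.
set_exception_handler(function (\Throwable $e): void {
    error_log(json_encode(describeThrowable($e)));

    // Only set a status code when running behind a web server.
    if (PHP_SAPI !== 'cli' && !headers_sent()) {
        http_response_code(500);
    }
    echo "Something went wrong, please try again later.\n";
});
```

The handler runs after normal execution has already stopped, so it is a place for logging and a friendly error page, not for recovery.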

  39. bool error_log ( string $message [, int $message_type = 0 [, string $destination [, string $extra_headers ]]] )

  40. Monolog

  41. <?php
      // assumes Composer's autoloader has already been loaded
      use Monolog\Logger;
      use Monolog\Handler\StreamHandler;
      // create a log channel
      $log = new Logger('name');
      $log->pushHandler(new StreamHandler('path/to/your.log', Logger::WARNING));
      // add records to the log (PSR-3 method names; addWarning()/addError() only exist in Monolog 1.x)
      $log->warning('Foo');
      $log->error('Bar');
  43. Log additional info

  45. Request Information

  46. PHP has more than just Exceptions and Throwables

  47. Handling Notices, Warnings, Errors

  48. callable set_error_handler( callable $error_handler [, int $error_types = E_ALL | E_STRICT ] );

  49. set_error_handler(function ($errno, $msg, $file, $line) {
          $e = new \Exception();
          $error = [
              'type' => $errno,
              'message' => $msg,
              'file' => $file,
              'line' => $line,
              'trace' => $e->getTrace(),
          ];
          error_log(json_encode($error));
          return true;
      });
  50. Or directly turn them into Exceptions

  51. // WarningException, NoticeException and StrictException are custom classes that are not shown on the slide
      set_error_handler(function ($errno, $msg, $file, $line) {
          switch ($errno) {
              case E_RECOVERABLE_ERROR:
              case E_USER_ERROR:
                  throw new \ErrorException($msg, 0, $errno, $file, $line);
              case E_WARNING:
              case E_USER_WARNING:
              case E_CORE_WARNING:
              case E_COMPILE_WARNING:
                  throw new WarningException($msg, 0, $errno, $file, $line);
              case E_NOTICE:
              case E_USER_NOTICE:
                  throw new NoticeException($msg, 0, $errno, $file, $line);
              case E_STRICT:
                  throw new StrictException($msg, 0, $errno, $file, $line);
          }
          return true;
      });
      $a = [];
      try {
          // raises E_NOTICE on PHP 7; PHP 8 reports an undefined key as E_WARNING
          $b = $a['doesNotExist'];
      } catch (NoticeException $e) {
      }
  52. But what about fatal errors

  53. E.g. OutOfMemory

  54. Use error_get_last() in a Shutdown Handler http://php.net/manual/en/function.error-get-last.php
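
A minimal sketch of such a shutdown handler follows. The constant `FATAL_ERROR_TYPES` and the helper `isFatalError()` are names invented for this example; `register_shutdown_function()` and `error_get_last()` are the PHP built-ins the slide refers to.

```php
<?php
// Error types that end the script and never reach set_error_handler().
const FATAL_ERROR_TYPES = E_ERROR | E_PARSE | E_CORE_ERROR | E_COMPILE_ERROR;

// Hypothetical helper: decides whether an error_get_last() result was fatal.
function isFatalError(?array $error): bool
{
    return $error !== null && ($error['type'] & FATAL_ERROR_TYPES) !== 0;
}

// Runs at the end of every request, including after a fatal error.
register_shutdown_function(function (): void {
    $error = error_get_last();
    if (isFatalError($error)) {
        error_log(json_encode([
            'type'    => $error['type'],
            'message' => $error['message'],
            'file'    => $error['file'],
            'line'    => $error['line'],
        ]));
    }
});
```

For out-of-memory errors the handler itself still needs some memory to run; a common workaround is to reserve a small buffer at startup and free it as the first step inside the handler.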

  55. Log in a structured way

  56. JSON http://www.ietf.org/rfc/rfc4627.txt
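
To make "log in a structured way" and "request information" concrete, here is a small sketch that writes one JSON object per log line and attaches request details as context. The function name `formatLogLine()`, the field names, and the example message are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: builds one JSON log line from a message plus context.
function formatLogLine(string $level, string $message, array $context = []): string
{
    $record = [
        'timestamp' => gmdate('c'),
        'level'     => $level,
        'message'   => $message,
        'context'   => $context,
    ];

    return json_encode($record, JSON_UNESCAPED_SLASHES);
}

// Request details from $_SERVER; the keys are absent on the CLI, hence the defaults.
$requestInfo = [
    'method' => $_SERVER['REQUEST_METHOD'] ?? 'cli',
    'uri'    => $_SERVER['REQUEST_URI'] ?? '',
    'ip'     => $_SERVER['REMOTE_ADDR'] ?? '',
];

error_log(formatLogLine('error', 'Could not load publication', $requestInfo));
```

Because every line is valid JSON, a log shipper can parse it without custom patterns and the fields become searchable on their own.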

  57. Microservices

  58. Logs from many services

  59. [Diagram: a user request reaches the gateway service, which calls four http services; the gateway and each service write their own log]

  60. Correlation / Tracing ID

  61. [Diagram: the gateway service creates a unique trace_id for the user request and passes it to four http services; the gateway and each service write the trace_id to their log]

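
One way to implement the trace ID idea is sketched below: reuse the ID sent by the caller if there is one, otherwise create a new one, then attach it to every log line and every outgoing call. The header name `X-Trace-Id` and the function name `resolveTraceId()` are assumptions; any name works as long as all services agree on it.

```php
<?php
// Assumed header name; not prescribed by the slides.
const TRACE_HEADER = 'X-Trace-Id';

// Reuse the caller's trace ID if one was sent, otherwise create a new one.
function resolveTraceId(array $server): string
{
    // PHP exposes the header "X-Trace-Id" as $_SERVER['HTTP_X_TRACE_ID'].
    $key = 'HTTP_' . strtoupper(str_replace('-', '_', TRACE_HEADER));
    if (!empty($server[$key])) {
        return (string) $server[$key];
    }

    return bin2hex(random_bytes(16));
}

$traceId = resolveTraceId($_SERVER);

// Attach the ID to every log line ...
error_log(json_encode(['trace_id' => $traceId, 'message' => 'handling request']));

// ... and to every outgoing call, here as a header list for an HTTP client.
$outgoingHeaders = [TRACE_HEADER . ': ' . $traceId];
```

Searching the central log store for a single `trace_id` then returns the log lines of all services that took part in that request.
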
  62. Getting to the logs

  66. Put the logs in a central place

  67. Make them easily full-text searchable

  68. Make them aggregate-able

  69. Elasticsearch + Logstash + Kibana

  70. Full text search

  71. Structured Messages

  72. [Diagram: three webservers, each with a log and a logstash, connected through AMQP to Logstash and elasticsearch]

  73. logstash http://logstash.net/

  74. input filter output

  75. Very rich plugin system

  76. [Diagram: three webservers, each with a log and a logstash, connected through AMQP to Logstash and elasticsearch]
  77. input {
        file {
          type => "error"
          path => [ "/var/logs/php/*.log" ]
          add_field => [ "severity", "error" ]
        }
      }
  78. filter{ json { source => "message" } }

  79. output {
        amqp {
          host => "amqp.host"
          exchange_type => "fanout"
          name => "logs"
        }
      }
  80. [Diagram: three webservers, each with a log and a logstash, connected through AMQP to Logstash and elasticsearch]
  81. input {
        rabbitmq {
          queue => "logs"
          host => "amqp.host"
          exchange => "ls_exchange"
          exclusive => false
        }
      }
  82. output {
        elasticsearch {
          embedded => false
          bind_host => "localhost"
          bind_port => "9305"
          host => "localhost"
          cluster => "logs"
        }
      }
  83. Filebeat https://www.elastic.co/products/beats/filebeat

  84. Kibana http://www.elasticsearch.org/overview/kibana/

  86. Always Log to file

  87. Not only exceptions

  88. Performance regressions

  89. Conversion regressions

  90. Traffic regressions

  91. Monitoring

  92. Latency

  93. Availability

  94. Throughput

  95. Overall

  96. Granular

  97. Graphite http://graphite.wikidot.com/

  99. StatsD https://github.com/etsy/statsd/

  100. [Diagram: three webservers, each with a statsd, all reporting to graphite]
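
To show what reporting a metric to StatsD can look like from PHP, the sketch below formats metrics in the StatsD line protocol (`name:value|type`) and sends them over UDP. The function names, the metric names, and the host are assumptions for this example; 8125 is the port StatsD listens on by default.

```php
<?php
// Formats one metric in the StatsD line protocol: <name>:<value>|<type>
function statsdLine(string $name, $value, string $type): string
{
    return sprintf('%s:%s|%s', $name, $value, $type);
}

// Fire-and-forget over UDP; host and port are assumptions for this sketch.
function statsdSend(string $line, string $host = '127.0.0.1', int $port = 8125): void
{
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($socket === false) {
        return; // metrics must never break the request
    }
    @fwrite($socket, $line);
    fclose($socket);
}

$start = microtime(true);
// ... handle the request ...
$elapsedMs = (int) round((microtime(true) - $start) * 1000);

statsdSend(statsdLine('web.profile.response_time', $elapsedMs, 'ms')); // timer
statsdSend(statsdLine('web.profile.requests', 1, 'c'));               // counter
```

UDP keeps the overhead per request low, and a missing or overloaded StatsD daemon does not slow down the application.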

  101. https://www.librato.com

  102. http://www.soasta.com/

  103. Dashboards

  105. Alerts

  107. Errors happen

  108. Handling Exceptions gracefully

  110. Component-based frontend

  119. Degrading Functionality

  120. [Component tree: Profile, Publications, Publication (x3), AboutMe, LeftColumn, Image, Menu, Institution]

  121. [Component tree: Profile, Publications, Publication (x3), AboutMe, LeftColumn, Image, Menu, Institution; one component throws an EXCEPTION]

  122. [Component tree: Profile, Publications, Publication (x3), LeftColumn, Image, Menu, Institution; AboutMe is no longer part of the tree]
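
The component trees above suggest that a failing component is simply left out while the rest of the page still renders. A minimal sketch of that idea follows; the `Component` class, its `renderSafely()` method, and the HTML snippets are hypothetical and not the framework used in the talk.

```php
<?php
// Hypothetical component model: a name, a render callback and child components.
final class Component
{
    private $name;
    private $render;
    private $children;

    public function __construct(string $name, callable $render, array $children = [])
    {
        $this->name = $name;
        $this->render = $render;
        $this->children = $children;
    }

    // Renders this component and its children; a failing component is left out.
    public function renderSafely(): string
    {
        try {
            $html = ($this->render)();
        } catch (\Throwable $e) {
            error_log(sprintf('Component %s failed: %s', $this->name, $e->getMessage()));
            return ''; // drop this component, keep the rest of the page
        }

        foreach ($this->children as $child) {
            $html .= $child->renderSafely();
        }

        return $html;
    }
}

$profile = new Component('Profile', function () { return '<h1>Profile</h1>'; }, [
    new Component('Publications', function () { return '<ul>Publications</ul>'; }),
    new Component('AboutMe', function () { throw new \RuntimeException('service down'); }),
    new Component('Institution', function () { return '<p>Institution</p>'; }),
]);

$html = $profile->renderSafely();
echo $html, "\n";
```

The exception is still logged, so the failure shows up in monitoring even though the user only sees a page with one block missing.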

  123. Big cause for regressions

  124. Deployments

  125. Errors happen

  126. Safe deployments

  127. Feature Flags / Toggles

  128. Release !== Deployment

  129. public function hasAccess()
       {
           return featureFlag()->isActive(
               FeatureFlag::YOUR_FEATURE
           );
       }

  132. Partial rollout
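
A partial rollout can be built on top of a feature flag by activating the flag only for a percentage of users. The sketch below is one possible way to do that; the `FeatureFlags` class, the flag names, and the hashing scheme are assumptions for illustration and differ from the `featureFlag()` helper on the slide.

```php
<?php
// Hypothetical flag store: flag name => percentage of users who see the feature.
final class FeatureFlags
{
    private $rollout;

    public function __construct(array $rollout)
    {
        $this->rollout = $rollout;
    }

    public function isActive(string $flag, int $userId): bool
    {
        $percentage = $this->rollout[$flag] ?? 0; // unknown flags are off

        // Stable bucket from 0 to 99 per user and flag, so a user always gets the same answer.
        $bucket = hexdec(substr(md5($flag . ':' . $userId), 0, 6)) % 100;

        return $bucket < $percentage;
    }
}

$flags = new FeatureFlags([
    'new_profile_page' => 10,  // visible to roughly 10% of users
    'old_search'       => 100, // fully released
]);

$showNewProfile = $flags->isActive('new_profile_page', 4711);
```

Raising the percentage step by step releases the feature gradually, and setting it back to 0 acts as a fast rollback without a new deployment.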

  133. Fast rollbacks

  134. Microservices

  135. Safe Service-to-Service Communication

  136. Service A Service B 200 OK

  137. Distributed Systems are hard

  138. Errors happen

  139. Failures

  140. Service A Service B 5xx

  141. Very slow responses

  142. Service A Service B 2 seconds later

  143. One incoming request leads to multiple requests to services

  144. Low timeout settings

  145. Service A Service B Only wait for max. 50ms
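
A low timeout can be set directly on the HTTP call. The sketch below uses PHP's built-in HTTP stream wrapper with a 50 ms timeout and returns `null` when the call fails or takes too long, so the caller can fall back to a default. The function name `fetchWithTimeout()` is an assumption; note that the wrapper's `timeout` option applies to connecting and to each read, so it only approximates a hard 50 ms budget.

```php
<?php
// Hypothetical client helper: GET a URL, but give up quickly if the service is slow.
function fetchWithTimeout(string $url, float $timeoutSeconds = 0.05): ?string
{
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds], // seconds, fractions allowed
    ]);

    // Failures (timeout, refused connection, 4xx/5xx) are suppressed and reported as null.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}

// Usage idea (URL is a placeholder):
// $body = fetchWithTimeout('http://service-b.example/recommendations') ?? '[]';
```

Returning a neutral default instead of throwing keeps one slow dependency from blocking the whole page.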

  146. Don’t call service instances directly

  147. High availability

  148. Use a Load Balancer

  149. Service A Service B Service C Service C Load balancer Choose one running instance

  150. Health checks

  151. Service A Service B Service C Service C Load balancer Are you still alive?

  152. Service A Service B Service C Service C Load balancer Are you still alive?

  153. Service A Service B Service C Service C Load balancer
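
In practice the load balancer performs the health checks, but the selection logic it applies can be sketched in a few lines. The function `pickHealthyInstance()`, the instance addresses, and the fake health check below are all made up for illustration.

```php
<?php
// Picks one instance at random among those that pass the health check.
// Returns null when no instance is healthy.
function pickHealthyInstance(array $instances, callable $isAlive): ?string
{
    $healthy = array_values(array_filter($instances, $isAlive));
    if ($healthy === []) {
        return null;
    }

    return $healthy[array_rand($healthy)];
}

$instances = ['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.3:8080'];

// Stand-in for a real "are you still alive?" request to each instance.
$down = ['10.0.0.2:8080'];
$isAlive = function (string $instance) use ($down): bool {
    return !in_array($instance, $down, true);
};

$target = pickHealthyInstance($instances, $isAlive);
```

A real load balancer would cache the health status and check it periodically instead of probing on every request.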

  154. Even better

  155. Circuit Breakers

  156. Service A Service B 200 OK Circuit Breaker Status: closed Error rate: 0

  157. Service A Service B Error Circuit Breaker Status: -> open Error rate: > threshold

  158. Service A Service B Circuit Breaker Status: -> open Error rate: > threshold

  159. Service A Service B Error Circuit Breaker Status: -> open Error rate: > threshold Test if still failing

  160. Service A Service B 200 OK Circuit Breaker Status: -> close Error rate: 0 Test if still failing

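
The state changes on the preceding slides can be sketched as a small in-memory class. This is a simplified illustration, not Hystrix or Phystrix: it counts consecutive failures instead of tracking an error rate, keeps its state only for the current process, and all names and default values are assumptions.

```php
<?php
// Minimal in-memory circuit breaker sketch.
final class CircuitBreaker
{
    private $failureThreshold;
    private $retryAfterSeconds;
    private $failures = 0;
    private $openedAt = null;

    public function __construct(int $failureThreshold = 3, float $retryAfterSeconds = 5.0)
    {
        $this->failureThreshold = $failureThreshold;
        $this->retryAfterSeconds = $retryAfterSeconds;
    }

    // Open means: do not call the service, answer with the fallback right away.
    public function isOpen(): bool
    {
        if ($this->openedAt === null) {
            return false;
        }

        // Once the wait time has passed, let a call through to test if the service recovered.
        return (microtime(true) - $this->openedAt) < $this->retryAfterSeconds;
    }

    public function call(callable $operation, callable $fallback)
    {
        if ($this->isOpen()) {
            return $fallback();
        }

        try {
            $result = $operation();
        } catch (\Throwable $e) {
            $this->failures++;
            if ($this->failures >= $this->failureThreshold) {
                $this->openedAt = microtime(true); // open (or re-open) the circuit
            }
            return $fallback();
        }

        // Success closes the circuit and resets the failure count.
        $this->failures = 0;
        $this->openedAt = null;

        return $result;
    }
}

$breaker = new CircuitBreaker(3, 5.0);
$publications = $breaker->call(
    function () { throw new \RuntimeException('Service B is down'); }, // stands in for a real call
    function () { return []; }                                         // degraded answer
);
```

While the circuit is open, Service B receives no traffic from this caller at all, which gives it room to recover. A production setup would share the breaker state between requests, for example in APCu or a similar cache.
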
  161. Handling traffic spikes

  162. Service A Service B 200 OK Circuit Breaker

  163. Service A Service B Circuit Breaker Service C Circuit Breaker Timeouts

  164. Service A Service B Circuit Breaker Service C Circuit Breaker Timeouts

  165. Throttling

  166. Service A Service B Circuit Breaker Service C Circuit Breaker Only allow xx% of calls
  168. Priority

  169. Service A Service B Circuit Breaker Service C Circuit Breaker 100% of calls 10% of calls

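
Throttling by priority can be sketched as follows: every caller gets a percentage of calls it is allowed to make, and the remaining calls are answered with a fallback. The function names, the priority labels, and the random sampling approach are assumptions for this example; the 100% and 10% values mirror the slide.

```php
<?php
// Lets through roughly $percentage percent of calls, chosen at random.
function allowCall(int $percentage): bool
{
    return random_int(1, 100) <= $percentage;
}

// Runs the operation if the caller's priority allows it, otherwise the fallback.
function callWithThrottle(string $priority, array $limits, callable $operation, callable $fallback)
{
    $percentage = $limits[$priority] ?? 0; // unknown priorities get no calls

    return allowCall($percentage) ? $operation() : $fallback();
}

// Assumed priorities: important callers keep all calls, less important ones keep 10%.
$limits = ['high' => 100, 'low' => 10];

$result = callWithThrottle(
    'high',
    $limits,
    function () { return 'fresh data'; },   // stands in for the real service call
    function () { return 'cached data'; }   // degraded answer
);
```

During a traffic spike this keeps the important path fully served while shedding most of the load from less important callers.
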
  170. https://github.com/Netflix/Hystrix

  171. https://github.com/odesk/phystrix

  173. Service A Service B Service C Circuit Breaker LinkerD

  174. Server Service A Server Service B Service C Service C Linkerd Consul Linkerd Consul

  175. Test it

  176. http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html

  177. Summing it up

  178. Central Log Management

  179. Monitor and Measure everything

  180. Alerts

  181. Graceful Exception Handling

  182. Feature Flags

  183. Partial Rollouts

  184. Circuit Breakers

  185. http://speakerdeck.com/u/bastianhofmann

  186. http://twitter.com/BastianHofmann http://lanyrd.com/people/BastianHofmann http://speakerdeck.com/u/bastianhofmann mail@bastianhofmann.de