Expect the un-expected: How to handle errors gracefully

Even if you have tested your application thoroughly, errors and bugs will still happen in production, especially if other services or databases go down or are under high load. It is therefore very important to see errors happening and to be able to react to them quickly. In this talk I'll introduce you to efficient ways of monitoring and logging errors, and show how you can handle them when they happen, covering deployment strategies, intelligent circuit breakers, and gracefully reducing functionality.

Bastian Hofmann

January 28, 2017

Transcript

  1. Expect the un-expected How to handle errors gracefully @BastianHofmann

  2. None
  3. None
  4. 11 million users

  5. 193 countries

  6. ~1800 request/s

  7. lots of data

  8. >100 million publications

  9. ~ 140 components

  10. ~ 400 repositories

  11. [Architecture diagram: haproxy, node, memcache, postgresql, mongodb, solr, infinispan, hbase, community services]
  12. + async events, stream and batch processing

  13. Growing Traffic

  14. Active development

  15. Continuous Delivery

  16. Bugs

  17. Performance regressions

  18. Database overloads

  19. Hardware failures

  20. Distributed Systems

  21. Microservices

  22. Multiple technology stacks

  23. $complexity++

  24. Errors happen

  25. How to detect them

  26. How to handle them

  27. Detecting if something breaks

  28. Logging

  29. Error Logs

  30. ErrorLog /var/logs/apache/error.log

  31. callable set_exception_handler( callable $exception_handler );

  32. bool error_log ( string $message [, int $message_type = 0 [, string $destination [, string $extra_headers ]]] )
  33. https://github.com/Seldaek/monolog/

  34. None
  35. Log additional info

  36. None
  37. Request Information

  38. PHP has more than just Exceptions (Throwables)

  39. Handling Notices, Warnings, Errors

  40. callable set_error_handler( callable $error_handler [, int $error_types = E_ALL | E_STRICT ] );
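  41. set_error_handler(function ($errno, $msg, $file, $line) {
        // Create an exception only to capture the current stack trace.
        $e = new \Exception();
        $error = [
          'type' => $errno,
          'message' => $msg,
          'file' => $file,
          'line' => $line,
          'trace' => $e->getTrace(),
        ];
        error_log(json_encode($error));
        // Returning true keeps PHP's internal error handler from running.
        return true;
      });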
  42. Or directly turn them into Exceptions

  43. // WarningException, NoticeException and StrictException are custom
      // classes that are not shown on the slide.
      set_error_handler(function ($errno, $msg, $file, $line) {
        switch ($errno) {
          case E_RECOVERABLE_ERROR:
          case E_USER_ERROR:
            // Arguments: message, code, severity, file, line.
            throw new \ErrorException($msg, 0, $errno, $file, $line);
          case E_WARNING:
          case E_USER_WARNING:
          case E_CORE_WARNING:
          case E_COMPILE_WARNING:
            throw new WarningException($msg, 0, $errno, $file, $line);
          case E_NOTICE:
          case E_USER_NOTICE:
            throw new NoticeException($msg, 0, $errno, $file, $line);
          case E_STRICT:
            throw new StrictException($msg, 0, $errno, $file, $line);
        }
        return true;
      });

      $a = [];
      try {
        $b = $a['doesNotExist'];
      } catch (NoticeException $e) {
      }
  44. But what about fatal errors

  45. Use error_get_last() in a Shutdown Handler http://php.net/manual/en/function.error-get-last.php
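A minimal sketch of such a shutdown handler, assuming that only the fatal error types need to be caught here and that one JSON line per error is wanted. The helper name formatFatalError and the constant FATAL_ERROR_TYPES are made up for illustration; register_shutdown_function(), error_get_last() and error_log() are the built-in PHP functions.

```php
<?php
// Fatal error types that never reach a set_error_handler() callback.
const FATAL_ERROR_TYPES = E_ERROR | E_PARSE | E_CORE_ERROR | E_COMPILE_ERROR;

/**
 * Hypothetical helper: turn the array returned by error_get_last()
 * into a JSON log line, or return null if there is nothing fatal to report.
 */
function formatFatalError(?array $error): ?string
{
    if ($error === null || ($error['type'] & FATAL_ERROR_TYPES) === 0) {
        return null;
    }
    return json_encode([
        'type' => $error['type'],
        'message' => $error['message'],
        'file' => $error['file'],
        'line' => $error['line'],
    ]);
}

// Runs at the end of every request, including after a fatal error.
register_shutdown_function(function () {
    $line = formatFatalError(error_get_last());
    if ($line !== null) {
        error_log($line);
    }
});
```

Keeping the formatting in its own function makes it possible to check the handler without provoking a real fatal error.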

  46. Log in a structured way

  47. JSON http://www.ietf.org/rfc/rfc4627.txt

  48. <?php
      use Monolog\Logger;
      use Monolog\Handler\StreamHandler;
      use Monolog\Formatter\JsonFormatter;

      $log = new Logger('name');
      $handler = new StreamHandler('path/to/your.log', Logger::WARNING);
      $handler->setFormatter(new JsonFormatter());
      $log->pushHandler($handler);
  49. Debug Logs

  50. FingersCrossed Handler

  51. Logs from other services

  52. [Diagram: a user request reaches the web server, which calls several HTTP services; each component writes its own log]
  53. Correlation / Tracing ID

  54. [Diagram: the web server creates a unique trace_id for the user request and passes it to every HTTP service; each component writes the trace_id to its log]
  55. X-Trace-Id: bbr8ehb984tbab894
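One possible way to implement this in plain PHP is sketched below. The header name X-Trace-Id comes from the slide; the helper names resolveTraceId and traceHeaders, the ID format and the validation rule are assumptions made for this example.

```php
<?php
/**
 * Hypothetical helper: reuse the incoming X-Trace-Id header if present,
 * otherwise create a new random ID at the edge of the system.
 * $server is expected to look like $_SERVER.
 */
function resolveTraceId(array $server): string
{
    $incoming = (string) ($server['HTTP_X_TRACE_ID'] ?? '');
    // Accept only short alphanumeric IDs so nothing odd ends up in the logs.
    if (preg_match('/^[A-Za-z0-9]{8,64}$/', $incoming) === 1) {
        return $incoming;
    }
    return bin2hex(random_bytes(8)); // 16 hex characters
}

/** Hypothetical helper: headers to add to every outgoing call to another service. */
function traceHeaders(string $traceId): array
{
    return ['X-Trace-Id' => $traceId];
}

$traceId = resolveTraceId($_SERVER);
// Include the ID in every log line so entries from all services can be joined.
error_log(json_encode(['trace_id' => $traceId, 'message' => 'request started']));
```

Every service does the same: it reads the header, logs with the ID, and forwards it unchanged, so a search for one trace_id returns the log lines of the whole request.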

  56. Getting to the logs

  57. None
  58. None
  59. None
  60. Aggregate the logs in a central place

  61. Make them easily full-text searchable

  62. Make them aggregate-able

  63. Always Log to file

  64. Directly to a database

  65. [Diagram: three webservers writing their logs directly to one DB]

  66. Disadvantages

  67. Database is down?

  68. Database is slow?

  69. Database is full?

  70. How to integrate access logs?

  71. Influences application performance

  72. Better solutions?

  73. Logstash + Elasticsearch + Kibana

  74. Full text search

  75. Structured Messages

  76. Always Log to file

  77. [Diagram: webservers (log + logstash) -> AMQP -> Logstash -> elasticsearch]
  78. logstash http://logstash.net/

  79. input filter output

  80. Very rich plugin system

  81. [Diagram: webservers (log + logstash) -> AMQP -> Logstash -> elasticsearch]
  82. input {
        file {
          type => "error"
          path => [ "/var/logs/php/*.log" ]
          add_field => [ "severity", "error" ]
        }
        file {
          type => "access"
          path => [ "/var/logs/apache/*_access.log" ]
          add_field => [ "severity", "info" ]
        }
      }
  83. filter {
        grok {
          match => ["@source", "\/%{USERNAME:facility}\.log$"]
        }
        grok {
          type => "access"
          pattern => "^%{IP:OriginalIp} \s[a-zA-Z0-9_-]+\s[a-zA-Z0-9_-]+\s\[.*? \]\s\"%{DATA:Request}..."
        }
      }
  84. output {
        amqp {
          host => "amqp.host"
          exchange_type => "fanout"
          name => "logs"
        }
      }
  85. [Diagram: webservers (log + logstash) -> AMQP -> Logstash -> elasticsearch]
  86. input {
        rabbitmq {
          queue => "logs"
          host => "amqp.host"
          exchange => "ls_exchange"
          exclusive => false
        }
      }
  87. filter {
        grok {
          match => [ 'Type', 'EntityNotFoundException|TimeoutException' ]
          break_on_match => true
          add_tag => ['not_serious']
          tag_on_failure => ['serious']
        }
      }
  88. output {
        elasticsearch {
          embedded => false
          bind_host => "localhost"
          bind_port => "9305"
          host => "localhost"
          cluster => "logs"
        }
      }
  89. Kibana http://www.elasticsearch.org/overview/kibana/

  90. None
  91. Monitoring

  92. Latency

  93. Availability

  94. Throughput

  95. Overall

  96. Granular

  97. Graphite http://graphite.wikidot.com/

  98. None
  99. StatsD https://github.com/etsy/statsd/

  100. [Diagram: each webserver sends its metrics to a statsd instance; the statsd instances forward to graphite]
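As an illustration, a request timing could be sent to StatsD from PHP roughly like this. It is a sketch using only built-in functions rather than a StatsD client library; the helper names statsdTiming and sendMetric, the metric name and the host are assumptions (8125 is the default StatsD port).

```php
<?php
/** Format a timing in the StatsD line protocol: "<name>:<value>|ms". */
function statsdTiming(string $metric, int $milliseconds): string
{
    return sprintf('%s:%d|ms', $metric, $milliseconds);
}

/**
 * Fire-and-forget UDP send. Host and port are assumptions for this sketch.
 */
function sendMetric(string $payload, string $host = '127.0.0.1', int $port = 8125): void
{
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.05);
    if ($socket === false) {
        return; // monitoring must never break the request
    }
    @fwrite($socket, $payload);
    fclose($socket);
}

$start = microtime(true);
// ... handle the request ...
$elapsedMs = (int) round((microtime(true) - $start) * 1000);
sendMetric(statsdTiming('web.request.duration', $elapsedMs));
```

UDP is used because the sender does not wait for an answer, so a slow or missing StatsD daemon does not slow down the application.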

  101. https://www.librato.com

  102. http://www.soasta.com/

  103. http://www.monitor.us/

  104. Alerts

  105. None
  106. Dashboards

  107. None
  108. None
  109. None
  110. None
  111. Handling Exceptions gracefully

  112. None
  113. Component based frontend

  114. None
  115. None
  116. None
  117. None
  118. None
  119. None
  120. None
  121. Degrading Functionality

  122. [Component tree: Profile, Publications (Publication, Publication, Publication), AboutMe, LeftColumn, Image, Menu, Institution]

  123. [Component tree: Profile, Publications (Publication, Publication, Publication), AboutMe, LeftColumn, Image, Menu, Institution, with an EXCEPTION in one component]

  124. [Component tree: Profile, Publications (Publication, Publication, Publication), LeftColumn, Image, Menu, Institution; AboutMe is no longer rendered]
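The idea on these slides can be sketched as follows: every component is rendered inside its own try/catch, so one failing component is logged and left out while the rest of the page still renders. The function renderComponents and the component callables are hypothetical and only mirror the component names from the slides; they are not the actual frontend framework.

```php
<?php
/**
 * Render a list of components; a component that throws is logged and skipped.
 * $components maps a component name to a callable returning its HTML.
 */
function renderComponents(array $components): string
{
    $html = '';
    foreach ($components as $name => $render) {
        try {
            $html .= $render();
        } catch (\Throwable $e) {
            // Log the failure, then degrade: the page is built without this part.
            error_log(json_encode(['component' => $name, 'message' => $e->getMessage()]));
        }
    }
    return $html;
}

$page = renderComponents([
    'Publications' => function () { return '<ul>publications</ul>'; },
    'AboutMe' => function () { throw new \RuntimeException('service timeout'); },
    'Institution' => function () { return '<div>institution</div>'; },
]);
// $page contains Publications and Institution; AboutMe is left out.
```

Whether a component may be dropped silently is a product decision; for a component the page cannot work without, rethrowing and showing an error page is the safer choice.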

  125. Safe deployments

  126. Feature Flags/ Toggles

  127. Release !== Deployment

  128. public function hasAccess($accountId, array $roleNames)
       {
         return featureFlag()->isActive(FeatureFlag::TEST_ONE);
       }
  129. None
  130. None
  131. Partial rollout
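A partial rollout can be built on top of a feature flag by activating it only for a fixed percentage of accounts. The sketch below is one common way to do that and is not taken from the talk; the function isInRollout and its parameters are hypothetical.

```php
<?php
/**
 * Hypothetical helper: decide whether a flag is active for an account
 * when it is rolled out to only $percentage percent of all accounts.
 */
function isInRollout(string $flagName, int $accountId, int $percentage): bool
{
    // Hashing (flag, account) gives a stable bucket from 0 to 99, so an
    // account gets the same answer on every request, and different flags
    // do not always hit the same accounts.
    $bucket = hexdec(substr(md5($flagName . ':' . $accountId), 0, 7)) % 100;
    return $bucket < $percentage;
}

// Example: show the new feature to roughly 10% of accounts.
$showNewFeature = isInRollout('test_one', 4711, 10);
```

Raising the percentage step by step while watching the error logs and dashboards keeps the impact of a bad release small; setting it back to 0 works as a fast rollback.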

  132. Automation

  133. Canary environments

  134. Server Server Server Server

  135. Server Server Server Server Test with low amount of traffic

  136. Fast rollbacks

  137. None
  138. None
  139. Safe Service-to-Service Communication

  140. Service A Service B 200 OK

  141. One central HTTP library

  142. Guzzle

  143. Errors will happen

  144. Errors

  145. Service A Service B 5xx

  146. Timeouts

  147. Service A Service B Timeout

  148. Low timeout settings

  149. Service A Service B Only wait for max. 50ms

  150. Measure upper 9x and upper response times

  151. Use a Load Balancer

  152. [Diagram: Service A, Service B, Service C, Service C, Load balancer: choose one running instance]

  153. Health check

  154. [Diagram: Service A, Service B, Service C, Service C, Load balancer: "Are you still alive?"]

  155. [Diagram: Service A, Service B, Service C, Service C, Load balancer: "Are you still alive?"]

  156. [Diagram: Service A, Service B, Service C, Service C, Load balancer]

  157. Circuit Breakers

  158. [Diagram: Service A, Service B: 200 OK. Circuit Breaker status: closed. Error rate: 0]

  159. [Diagram: Service A, Service B: Error. Circuit Breaker status: -> open. Error rate: > threshold]

  160. [Diagram: Service A, Service B. Circuit Breaker status: -> open. Error rate: > threshold]

  161. [Diagram: Service A, Service B: Error. Circuit Breaker status: -> open. Error rate: > threshold. Test if still failing]

  162. [Diagram: Service A, Service B: 200 OK. Circuit Breaker status: -> closed. Error rate: 0. Test if still failing]
  163. How do I handle traffic spikes?

  164. Service A Service B 200 OK Circuit Breaker

  165. [Diagram: Service A, Service B, Circuit Breaker, Service C, Circuit Breaker: Timeouts]

  166. [Diagram: Service A, Service B, Circuit Breaker, Service C, Circuit Breaker: Timeouts]

  167. Throttling

  168. [Diagram: Service A, Service B, Circuit Breaker, Service C, Circuit Breaker: only allow xx% of calls]

  169. None
  170. Priority

  171. [Diagram: Service A, Service B, Circuit Breaker, Service C, Circuit Breaker: 100% of calls, 10% of calls]
  172. https://github.com/Netflix/Hystrix

  173. https://github.com/odesk/phystrix

  174. None
  175. Service A Service B Service C Circuit Breaker LinkerD

  176. Test it

  177. http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html

  178. Summing it up

  179. Central Log Management

  180. Monitor and Measure everything

  181. Alerts

  182. Graceful Exception Handling

  183. Feature Flags

  184. Partial Rollouts

  185. Circuit Breakers

  186. http://speakerdeck.com/u/bastianhofmann

  187. http://twitter.com/BastianHofmann http://lanyrd.com/people/BastianHofmann http://speakerdeck.com/u/bastianhofmann mail@bastianhofmann.de