
Expect the Exceptions: Elegant Error Handling

Even the best test coverage of a web application cannot prevent errors and bugs from occurring in production, especially with database problems or load spikes. That is why it is important to detect errors and problems early so that they can be fixed quickly. In this talk I will present efficient approaches such as deployment strategies, canary environments, and circuit breakers for measuring, logging, and then gracefully handling errors. The goal is to keep the impact on the application's overall stability and user experience as small as possible.

Bastian Hofmann

April 07, 2017

Transcript

  1. Expect the un-expected How to handle errors gracefully @BastianHofmann

  2. None
  3. None
  4. 12 million users

  5. 193 countries

  6. ~1800 requests/s

  7. lots of data

  8. >100 million publications

  9. ~ 140 components

  10. ~ 400 repositories

  11. Growing Traffic

  12. Active development

  13. Continuous Delivery

  14. Bugs

  15. Performance regressions

  16. Database overloads

  17. Hardware failures

  18. Distributed Systems

  19. Microservices

  20. Multiple technology stacks

  21. $complexity++

  22. Errors happen

  23. How to detect them

  24. How to handle them

  25. None
  26. Detecting if something breaks

  27. Logging

  28. Error Logs

  29. callable set_exception_handler( callable $exception_handler );

  30. bool error_log ( string $message [, int $message_type = 0 [, string $destination [, string $extra_headers ]]] )

  31. ErrorLog /var/logs/apache/error.log

  32. None
  33. Log additional info

  34. None
  35. Request Information
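
A minimal sketch of how the exception handler from the earlier slides could attach request information to each log entry. The function names and the record fields are illustrative assumptions and are not taken from the talk.

```php
<?php
// Sketch: build one structured log record per uncaught exception.
// Field names (time, type, request, ...) are assumptions for illustration.
function buildExceptionRecord(\Throwable $e, array $server): array
{
    return [
        'time'    => date('c'),
        'type'    => get_class($e),
        'message' => $e->getMessage(),
        'file'    => $e->getFile(),
        'line'    => $e->getLine(),
        'trace'   => $e->getTraceAsString(),
        // Request information: only the listed keys are copied, if present.
        'request' => array_intersect_key($server, array_flip([
            'REQUEST_METHOD', 'REQUEST_URI', 'HTTP_USER_AGENT', 'REMOTE_ADDR',
        ])),
    ];
}

// Hypothetical helper: call once during bootstrap to install the handler.
function registerExceptionLogger(): void
{
    set_exception_handler(function (\Throwable $e): void {
        error_log(json_encode(buildExceptionRecord($e, $_SERVER)));
    });
}
```

Keeping the record builder separate from the handler registration makes it easy to unit test and to reuse for caught exceptions as well.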

  36. PHP has more than just Exceptions Throwables

  37. Handling Notices, Warnings, Errors

  38. callable set_error_handler( callable $error_handler [, int $error_types = E_ALL | E_STRICT ] );

  39. set_error_handler(function ($errno, $msg, $file, $line) {
          // Create an exception only to capture the current stack trace.
          $e = new \Exception();
          $error = [
              'type' => $errno,
              'message' => $msg,
              'file' => $file,
              'line' => $line,
              'trace' => $e->getTrace(),
          ];
          error_log(json_encode($error));
          return true;
      });
  40. Or directly turn them into Exceptions

  41. // WarningException, NoticeException and StrictException are custom
      // classes (not built into PHP), assumed to extend \ErrorException.
      set_error_handler(function ($errno, $msg, $file, $line) {
          switch ($errno) {
              case E_RECOVERABLE_ERROR:
              case E_USER_ERROR:
                  // The second argument is the integer exception code.
                  throw new \ErrorException($msg, 0, $errno, $file, $line);
              case E_WARNING:
              case E_USER_WARNING:
              case E_CORE_WARNING:
              case E_COMPILE_WARNING:
                  throw new WarningException($msg, 0, $errno, $file, $line);
              case E_NOTICE:
              case E_USER_NOTICE:
                  throw new NoticeException($msg, 0, $errno, $file, $line);
              case E_STRICT:
                  throw new StrictException($msg, 0, $errno, $file, $line);
          }
          return true;
      });

      $a = [];
      try {
          // Undefined key: E_NOTICE up to PHP 7, E_WARNING since PHP 8.
          $b = $a['doesNotExist'];
      } catch (NoticeException $e) {
      }
  42. But what about fatal errors

  43. Use error_get_last() in a Shutdown Handler http://php.net/manual/en/function.error-get-last.php
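
The shutdown-handler approach might look roughly like the sketch below. `fatalErrorRecord()` and `registerFatalErrorLogger()` are hypothetical helper names, and the choice of which error types count as fatal is an assumption.

```php
<?php
// Error types that end the script without reaching set_error_handler().
const FATAL_ERROR_TYPES = E_ERROR | E_PARSE | E_CORE_ERROR | E_COMPILE_ERROR;

// Turns the result of error_get_last() into a log record, or null if the
// last error was not fatal (or there was no error at all).
function fatalErrorRecord(?array $lastError): ?array
{
    if ($lastError === null || ($lastError['type'] & FATAL_ERROR_TYPES) === 0) {
        return null;
    }
    return [
        'severity' => 'fatal',
        'type'     => $lastError['type'],
        'message'  => $lastError['message'],
        'file'     => $lastError['file'],
        'line'     => $lastError['line'],
    ];
}

// Hypothetical helper: call once during bootstrap.
function registerFatalErrorLogger(): void
{
    register_shutdown_function(function (): void {
        $record = fatalErrorRecord(error_get_last());
        if ($record !== null) {
            error_log(json_encode($record));
        }
    });
}
```

The shutdown function runs on every request, so the type check is what keeps ordinary notices and warnings from being logged a second time as fatal errors.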

  44. Log in a structured way

  45. JSON http://www.ietf.org/rfc/rfc4627.txt

  46. Logs from other services

  47. web server http service http service http service http service user request log log log log log

  48. Correlation / Tracing ID

  49. web server http service http service http service http service create unique trace_id for request user request trace_id trace_id trace_id trace_id log log log log log

  50. X-Trace-Id: bbr8ehb984tbab894
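
One way the trace ID could be created at the edge and passed along is sketched here. The `X-Trace-Id` header name comes from the slide; the helper names, the ID format, and the validation rule are assumptions.

```php
<?php
// Reuse the trace ID sent by the caller, or create a new one at the edge.
function resolveTraceId(array $server): string
{
    $incoming = $server['HTTP_X_TRACE_ID'] ?? '';
    // Accept only short alphanumeric IDs so the value is safe to log and forward.
    if (preg_match('/^[A-Za-z0-9]{8,64}$/', $incoming) === 1) {
        return $incoming;
    }
    return bin2hex(random_bytes(8)); // 16 hex characters
}

// Header to add to every outgoing call to another HTTP service.
function outgoingHeaders(string $traceId): array
{
    return ['X-Trace-Id: ' . $traceId];
}

// Every log line carries the same trace_id, so the logs of all services
// involved in one user request can be joined later.
function logLine(string $traceId, string $message): string
{
    return json_encode(['trace_id' => $traceId, 'message' => $message]);
}
```

A service would call `resolveTraceId($_SERVER)` once per request and use the result for both its own log entries and its outgoing calls.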

  51. Getting to the logs

  52. None
  53. None
  54. None
  55. Aggregate the logs in a central place

  56. Make them easily full-text searchable

  57. Make them aggregate-able

  58. Logstash + Elasticsearch + Kibana

  59. Full text search

  60. Structured Messages

  61. Always Log to file

  62. Logstash elasticsearch webserver webserver webserver AMQP log log log logstash logstash logstash

  63. logstash http://logstash.net/

  64. input filter output

  65. Very rich plugin system

  66. Logstash elasticsearch webserver webserver webserver AMQP log log log logstash logstash logstash

  67. input {
        file {
          type => "error"
          path => [ "/var/logs/php/*.log" ]
          add_field => [ "severity", "error" ]
        }
        file {
          type => "access"
          path => [ "/var/logs/apache/*_access.log" ]
          add_field => [ "severity", "info" ]
        }
      }
  68. filter {
        grok {
          match => ["@source", "\/%{USERNAME:facility}\.log$"]
        }
        grok {
          type => "access"
          pattern => "^%{IP:OriginalIp} \s[a-zA-Z0-9_-]+\s[a-zA-Z0-9_-]+\s\[.*? \]\s\"%{DATA:Request}..."
        }
      }
  69. output {
        amqp {
          host => "amqp.host"
          exchange_type => "fanout"
          name => "logs"
        }
      }
  70. Logstash elasticsearch webserver webserver webserver AMQP log log log logstash logstash logstash

  71. input {
        rabbitmq {
          queue => "logs"
          host => "amqp.host"
          exchange => "ls_exchange"
          exclusive => false
        }
      }
  72. filter {
        grok {
          match => [
            'Type', 'EntityNotFoundException|TimeoutException'
          ]
          break_on_match => true
          add_tag => ['not_serious']
          tag_on_failure => ['serious']
        }
      }
  73. output {
        elasticsearch {
          embedded => false
          bind_host => "localhost"
          bind_port => "9305"
          host => "localhost"
          cluster => "logs"
        }
      }
  74. Kibana http://www.elasticsearch.org/overview/kibana/

  75. None
  76. Monitoring

  77. Latency

  78. Availability

  79. Throughput

  80. Overall

  81. Granular

  82. Graphite http://graphite.wikidot.com/

  83. None
  84. StatsD https://github.com/etsy/statsd/

  85. webserver webserver webserver statsd statsd statsd graphite
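
For illustration, a webserver could emit StatsD metrics with a few lines of plain PHP. The `name:value|type` line format is StatsD's plain-text protocol; the metric names, the default host, and the default port 8125 are assumptions here.

```php
<?php
// Counter: increments the named metric by $delta.
function statsdCounter(string $name, int $delta = 1): string
{
    return sprintf('%s:%d|c', $name, $delta);
}

// Timer: records a duration in milliseconds.
function statsdTiming(string $name, int $milliseconds): string
{
    return sprintf('%s:%d|ms', $name, $milliseconds);
}

// Sends one metric line over UDP. UDP is fire-and-forget, so a missing
// StatsD daemon must never break the request; failures are ignored.
function statsdSend(string $packet, string $host = '127.0.0.1', int $port = 8125): void
{
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($socket !== false) {
        @fwrite($socket, $packet);
        fclose($socket);
    }
}
```

A request handler might call `statsdSend(statsdTiming('profile.render', 320))` after rendering and `statsdSend(statsdCounter('profile.errors'))` inside its error handler.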

  86. https://www.librato.com

  87. http://www.soasta.com/

  88. Dashboards

  89. None
  90. None
  91. None
  92. Alerts

  93. None
  94. Errors happen

  95. Handling Exceptions gracefully

  96. None
  97. Component-based frontend

  98. None
  99. None
  100. None
  101. None
  102. None
  103. None
  104. None
  105. Degrading Functionality

  106. Profile Publications Publication Publication Publication AboutMe LeftColumn Image Menu Institution

  107. Profile Publications Publication Publication Publication AboutMe LeftColumn Image Menu EXCEPTION Institution

  108. Profile Publications Publication Publication Publication LeftColumn Image Menu Institution
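
The three slides above show a component failing and the page rendering without it. A minimal sketch of that idea follows, assuming each component is a callable that returns HTML; `renderOptional()` and `renderProfile()` are hypothetical names, not the component framework from the talk.

```php
<?php
// Renders one component. If it throws, the error is logged and the
// fallback (by default nothing) is rendered instead of failing the page.
function renderOptional(callable $render, string $fallback = ''): string
{
    try {
        return $render();
    } catch (\Throwable $e) {
        error_log(json_encode([
            'type'    => get_class($e),
            'message' => $e->getMessage(),
        ]));
        return $fallback;
    }
}

// Renders all components of a page, skipping the ones that fail.
function renderProfile(array $components): string
{
    $html = '';
    foreach ($components as $render) {
        $html .= renderOptional($render);
    }
    return $html;
}
```

Whether a component may be dropped silently is a product decision: this sketch treats every component as optional, while a real page would probably let failures in its main content propagate.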

  109. Deployments

  110. Errors happen

  111. Safe deployments

  112. Feature Flags / Toggles

  113. Release !== Deployment

  114. public function hasAccess($accountId, array $roleNames)
       {
           return featureFlag()->isActive(FeatureFlag::TEST_ONE);
       }
  115. None
  116. None
  117. Partial rollout
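
A partial rollout can be built on top of a feature flag by activating it for a stable percentage of accounts. The sketch below uses a hash bucket for that; the function name and the bucketing scheme are assumptions, not the implementation shown in the talk.

```php
<?php
// Returns true if the flag is active for this account at the given
// rollout percentage (0 to 100).
function isInRollout(string $flag, int $accountId, int $percent): bool
{
    // Hash flag and account together so each flag gets its own, stable
    // sample of accounts. 7 hex digits always fit into a PHP integer.
    $bucket = hexdec(substr(md5($flag . ':' . $accountId), 0, 7)) % 100;
    return $bucket < $percent;
}
```

Because the bucket depends only on the flag and the account, an account that is included at 10% stays included when the rollout is raised to 50%, and setting the percentage to 0 switches the feature off for everyone without a deployment.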

  118. Automation

  119. Canary environments

  120. Server Server Server Server

  121. Server Server Server Server Test with low amount of traffic

  122. Fast rollbacks

  123. Oftentimes it’s not only one monolith

  124. None
  125. Safe Service-to-Service Communication

  126. Service A Service B 200 OK

  127. Distributed Systems are hard

  128. Errors happen

  129. Failures

  130. Service A Service B 5xx

  131. Timeouts

  132. Service A Service B Timeout

  133. Low timeout settings

  134. Service A Service B Only wait for max. 50ms
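
As a rough sketch, a low timeout can be set with PHP's built-in HTTP stream wrapper. The 50 ms default mirrors the slide; the helper names are assumptions, and a production client would more likely use cURL or an HTTP library with separate connect and read timeouts.

```php
<?php
// Stream context whose HTTP timeout is given in (fractional) seconds.
function timeoutContext(float $seconds)
{
    return stream_context_create([
        'http' => ['timeout' => $seconds],
    ]);
}

// Calls another service and returns the body, or null on any failure
// (timeout, refused connection, error status), so the caller can degrade.
function callService(string $url, float $timeoutSeconds = 0.05): ?string
{
    $body = @file_get_contents($url, false, timeoutContext($timeoutSeconds));
    return $body === false ? null : $body;
}
```

Returning null instead of throwing keeps the calling code simple: it either has a response or falls back to a degraded result.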

  135. Don’t call service instances directly

  136. Use a Load Balancer

  137. Service A Service B Service C Service C Load balancer Choose one running instance

  138. Health checks

  139. Service A Service B Service C Service C Load balancer Are you still alive?

  140. Service A Service B Service C Service C Load balancer Are you still alive?

  141. Service A Service B Service C Service C Load balancer

  142. Circuit Breakers

  143. Service A Service B 200 OK Circuit Breaker Status: closed Error rate: 0

  144. Service A Service B Error Circuit Breaker Status: -> open Error rate: > threshold

  145. Service A Service B Circuit Breaker Status: -> open Error rate: > threshold

  146. Service A Service B Error Circuit Breaker Status: -> open Error rate: > threshold Test if still failing

  147. Service A Service B 200 OK Circuit Breaker Status: -> closed Error rate: 0 Test if still failing
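
The state changes on the circuit breaker slides can be sketched as a small class. This is a simplified illustration, not the API of Hystrix or phystrix: it counts consecutive failures instead of measuring an error rate, and the class name, defaults, and injectable clock are assumptions.

```php
<?php
class CircuitOpenException extends \RuntimeException {}

class CircuitBreaker
{
    private $failures = 0;
    private $openedAt = null;
    private $threshold;
    private $retryAfter;
    private $clock;

    public function __construct(int $threshold = 5, float $retryAfter = 30.0, ?callable $clock = null)
    {
        $this->threshold = $threshold;    // failures before the circuit opens
        $this->retryAfter = $retryAfter;  // seconds until a trial call is allowed
        $this->clock = $clock ?? function (): float { return microtime(true); };
    }

    public function state(): string
    {
        if ($this->openedAt === null) {
            return 'closed';
        }
        $elapsed = ($this->clock)() - $this->openedAt;
        return $elapsed >= $this->retryAfter ? 'half-open' : 'open';
    }

    public function call(callable $operation)
    {
        if ($this->state() === 'open') {
            // Fail fast: do not touch the broken service at all.
            throw new CircuitOpenException('circuit open, call skipped');
        }
        try {
            $result = $operation();
        } catch (\Throwable $e) {
            $this->failures++;
            if ($this->openedAt !== null || $this->failures >= $this->threshold) {
                // Open the circuit, or re-open it after a failed trial call.
                $this->openedAt = ($this->clock)();
            }
            throw $e;
        }
        // Success (including a successful trial call) closes the circuit.
        $this->failures = 0;
        $this->openedAt = null;
        return $result;
    }
}
```

While the circuit is open, callers get an immediate `CircuitOpenException` and can degrade, which also gives the failing service time to recover.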
  148. How do I handle traffic spikes?

  149. Service A Service B 200 OK Circuit Breaker

  150. Service A Service B Circuit Breaker Service C Circuit Breaker Timeouts

  151. Service A Service B Circuit Breaker Service C Circuit Breaker Timeouts

  152. Throttling

  153. Service A Service B Circuit Breaker Service C Circuit Breaker Only allow xx% of calls

  154. None
  155. Priority

  156. Service A Service B Circuit Breaker Service C Circuit Breaker 100% of calls 10% of calls
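
Throttling by percentage, as on the slides above, could be sketched like this. The class name and the deterministic credit scheme are assumptions; a random draw per call would work as well.

```php
<?php
// Lets through the given percentage of calls, spread evenly over time.
class PercentageThrottle
{
    private $percent;
    private $credit = 0;

    public function __construct(int $percent)
    {
        $this->percent = max(0, min(100, $percent));
    }

    public function allow(): bool
    {
        // Each call earns $percent credits; a call passes once 100 are saved up.
        $this->credit += $this->percent;
        if ($this->credit >= 100) {
            $this->credit -= 100;
            return true;
        }
        return false;
    }
}
```

Priorities then map to different throttles per caller, for example `new PercentageThrottle(100)` for the important consumer and `new PercentageThrottle(10)` for a less critical one during a traffic spike.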
  157. https://github.com/Netflix/Hystrix

  158. https://github.com/odesk/phystrix

  159. None
  160. Service A Service B Service C Circuit Breaker LinkerD

  161. Test it

  162. http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html

  163. Summing it up

  164. Central Log Management

  165. Monitor and Measure everything

  166. Alerts

  167. Graceful Exception Handling

  168. Feature Flags

  169. Partial Rollouts

  170. Circuit Breakers

  171. http://speakerdeck.com/u/bastianhofmann

  172. http://twitter.com/BastianHofmann http://lanyrd.com/people/BastianHofmann http://speakerdeck.com/u/bastianhofmann mail@bastianhofmann.de