Expect the Un-expected: How to Handle Errors Gracefully

Expect the un-expected: How to handle errors gracefully (@BastianHofmann)

14 million users

193 countries

~1800 requests/s

>100 million publications

~ 140 components

~ 400 repositories

~ 80 engineers

Growing Traffic

Active development

Continuous Delivery

Constant change

Things will go wrong

Performance regressions

Database overloads

Hardware failures

Distributed Systems

Network

Microservices

Multiple technology stacks

$complexity++

Errors happen

How to detect them

How to handle them

… in a way that the user does not notice them

Detecting if something breaks

Logging

Error Logs

[24-Oct-2017 18:18:24 UTC] PHP Fatal error: Uncaught Exception: error in /Users/bastian/error_test_php/index.php:5
Stack trace:
#0 /Users/bastian/error_test_php/index.php(8): index()
#1 {main}
  thrown in /Users/bastian/error_test_php/index.php on line 5

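try {
    // your complex potentially failing code
} catch (\Throwable $e) {
    // do some error handling
}

try {
    // your complex potentially failing code
} catch (SomeException $e) {
    // do some error handling
}

try {
    // your complex potentially failing code
} catch (SomeException $e) {
    // do some error handling
} catch (OtherError $e) {
    // do some error handling
}

try {
    // your complex potentially failing code
} catch (SomeException | OtherError $e) {
    // do some error handling
}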

callable set_exception_handler( callable $exception_handler );
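As a minimal sketch of how such a global handler could look: the helper name `formatUncaught` and the generic error message are assumptions for illustration, not part of the talk.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: turns any Throwable into a single log line.
function formatUncaught(\Throwable $e): string
{
    return sprintf(
        'Uncaught %s: %s in %s:%d',
        get_class($e),
        $e->getMessage(),
        $e->getFile(),
        $e->getLine()
    );
}

// Called for every exception that no catch block handled.
set_exception_handler(function (\Throwable $e): void {
    error_log(formatUncaught($e));
    // Show a generic message instead of a stack trace.
    if (!headers_sent()) {
        http_response_code(500);
    }
    echo 'Something went wrong. Please try again later.';
});
```

The handler is a last line of defense: the script still ends after it ran, so it can only log and show a friendly message.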

bool error_log ( string $message [, int $message_type = 0 [, string $destination [, string $extra_headers ]]] )

Monolog

<?php
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
// create a log channel
$log = new Logger('name');
$log->pushHandler(new StreamHandler('path/to/your.log', Logger::WARNING));
// add records to the log (addWarning()/addError() were removed in Monolog 2)
$log->warning('Foo');
$log->error('Bar');

Log additional info

Request Information

PHP has more than just Exceptions/Throwables

Handling Notices, Warnings, Errors

callable set_error_handler( callable $error_handler [, int $error_types = E_ALL | E_STRICT ] );

set_error_handler(function ($errno, $msg, $file, $line) {
    $e = new \Exception();
    $error = [
        'type' => $errno,
        'message' => $msg,
        'file' => $file,
        'line' => $line,
        'trace' => $e->getTrace(),
    ];
    error_log(json_encode($error));
    return true;
});

Or directly turn them into Exceptions

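// WarningException, NoticeException and StrictException are custom classes, e.g.:
class WarningException extends \ErrorException {}
class NoticeException extends \ErrorException {}
class StrictException extends \ErrorException {}
set_error_handler(function ($errno, $msg, $file, $line) {
    switch ($errno) {
        case E_RECOVERABLE_ERROR:
        case E_USER_ERROR:
            // the second argument is the exception code and must be an int
            throw new \ErrorException($msg, 0, $errno, $file, $line);
        case E_WARNING:
        case E_USER_WARNING:
        case E_CORE_WARNING:
        case E_COMPILE_WARNING:
            throw new WarningException($msg, 0, $errno, $file, $line);
        case E_NOTICE:
        case E_USER_NOTICE:
            throw new NoticeException($msg, 0, $errno, $file, $line);
        case E_STRICT:
            throw new StrictException($msg, 0, $errno, $file, $line);
    }
    return true;
});
$a = [];
try {
    $b = $a['doesNotExist'];
} catch (NoticeException $e) {
    // PHP 7 raises a notice here; PHP 8 raises E_WARNING, so catch WarningException there
}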

But what about fatal errors

E.g. OutOfMemory

Use error_get_last() in a Shutdown Handler http://php.net/manual/en/function.error-get-last.php
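A minimal sketch of such a shutdown handler, assuming that logging the raw error array as JSON is enough; the constant `FATAL_ERROR_TYPES` and the helper `isFatalError` are hypothetical names.

```php
<?php
declare(strict_types=1);

// Error types that end the script before a regular error handler can run.
const FATAL_ERROR_TYPES = [E_ERROR, E_PARSE, E_CORE_ERROR, E_COMPILE_ERROR];

// Hypothetical helper: decides whether error_get_last() reported a fatal error.
function isFatalError(?array $error): bool
{
    return $error !== null && in_array($error['type'], FATAL_ERROR_TYPES, true);
}

// Runs at the end of every request, also after a fatal error.
register_shutdown_function(function (): void {
    $error = error_get_last();
    if (isFatalError($error)) {
        // $error contains the keys type, message, file and line.
        error_log((string) json_encode($error));
    }
});
```

Keep the handler small: after an out-of-memory error there is very little memory left to work with.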

Log in a structured way

JSON http://www.ietf.org/rfc/rfc4627.txt
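One possible shape for a structured log entry that also carries request information. This is a sketch: the field names and the helper `buildLogLine` are assumptions, not a format prescribed by the talk.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: builds one JSON log line from a Throwable and request data.
function buildLogLine(\Throwable $e, array $server): string
{
    $record = [
        'timestamp' => gmdate('c'),
        'severity'  => 'error',
        'type'      => get_class($e),
        'message'   => $e->getMessage(),
        'file'      => $e->getFile(),
        'line'      => $e->getLine(),
        // Request information helps to reproduce the error later.
        'request'   => [
            'method' => $server['REQUEST_METHOD'] ?? null,
            'uri'    => $server['REQUEST_URI'] ?? null,
            'host'   => $server['HTTP_HOST'] ?? null,
        ],
    ];

    // Substitute invalid UTF-8 instead of failing to encode the whole record.
    return (string) json_encode(
        $record,
        JSON_UNESCAPED_SLASHES | JSON_INVALID_UTF8_SUBSTITUTE
    );
}

// Usage: one JSON document per line.
// error_log(buildLogLine($exception, $_SERVER));
```

One JSON document per line keeps the log easy to parse for tools that read it line by line.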

Microservices

Logs from many services

[Diagram: a user request reaches the gateway service, which calls further services over HTTP; the gateway and every service write their own logs]

Correlation / Tracing ID

[Diagram: the gateway service creates a unique trace_id for each user request and passes it on with every HTTP call; each service writes the trace_id to its logs]
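A sketch of how a service could pick up or create the trace ID. The header name `X-Trace-Id` and both function names are assumptions; any name works as long as all services agree on it.

```php
<?php
declare(strict_types=1);

// Assumed header name, as it appears in $_SERVER for an incoming request.
const TRACE_HEADER = 'HTTP_X_TRACE_ID';

// Reuse the incoming trace ID, or create one if this is the first service (the gateway).
function resolveTraceId(array $server): string
{
    $incoming = $server[TRACE_HEADER] ?? '';
    if (is_string($incoming) && preg_match('/^[a-f0-9]{32}$/', $incoming) === 1) {
        return $incoming;
    }
    // 16 random bytes, hex encoded: 32 characters.
    return bin2hex(random_bytes(16));
}

// Headers to add to every outgoing call to another service.
function outgoingTraceHeaders(string $traceId): array
{
    return ['X-Trace-Id: ' . $traceId];
}
```

Every log entry written while handling the request then includes the same trace ID, so all entries for one user request can be found with a single search.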

Getting to the logs

Put the logs in a central place

Make them easily full-text searchable

Make them aggregate-able

Elasticsearch + Logstash + Kibana

Full text search

Structured Messages

[Diagram: each webserver writes logs that a local logstash ships via AMQP to a central Logstash, which stores them in elasticsearch]

logstash http://logstash.net/

input -> filter -> output

Very rich plugin system

logstash logstash

input {
  file {
    type => "error"
    path => [ "/var/logs/php/*.log" ]
    add_field => [ "severity", "error" ]
  }
}

filter {
  json {
    source => "message"
  }
}

output {
  amqp {
    host => "amqp.host"
    exchange_type => "fanout"
    name => "logs"
  }
}

logstash logstash

input {
  rabbitmq {
    queue => "logs"
    host => "amqp.host"
    exchange => "ls_exchange"
    exclusive => false
  }
}

output {
  elasticsearch {
    embedded => false
    bind_host => "localhost"
    bind_port => "9305"
    host => "localhost"
    cluster => "logs"
  }
}

Filebeat https://www.elastic.co/products/beats/filebeat

Kibana http://www.elasticsearch.org/overview/kibana/

Always Log to file

Not only exceptions

Performance regressions

Conversion regressions

Traffic regressions

Monitoring

Latency

Availability

Throughput

Overall

Granular

Graphite http://graphite.wikidot.com/

StatsD https://github.com/etsy/statsd/
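A sketch of sending a metric to StatsD from PHP. StatsD accepts plain text lines of the form `<name>:<value>|<type>` via UDP; the function names, the metric names in the usage comment, and the host are assumptions (8125 is the default StatsD port).

```php
<?php
declare(strict_types=1);

// Formats a metric in the StatsD line protocol: <name>:<value>|<type>
// Common types: "c" (counter), "ms" (timing), "g" (gauge).
function formatStatsdMetric(string $name, float $value, string $type): string
{
    return sprintf('%s:%s|%s', $name, $value, $type);
}

// Sends the metric via UDP (fire and forget); host and port are assumptions.
function sendStatsdMetric(string $line, string $host = '127.0.0.1', int $port = 8125): void
{
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($socket === false) {
        return; // Monitoring must never break the request.
    }
    @fwrite($socket, $line);
    fclose($socket);
}

// Usage (hypothetical metric names):
// sendStatsdMetric(formatStatsdMetric('profile.views', 1, 'c'));
// sendStatsdMetric(formatStatsdMetric('profile.render_time', 320, 'ms'));
```

UDP keeps the overhead low: the application does not wait for an answer and does not fail when StatsD is down.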

[Diagram: each webserver sends metrics to a statsd instance, which forwards them to graphite]

https://www.librato.com

http://www.soasta.com/

Dashboards

Alerts

Errors happen

Handling Exceptions gracefully

Component-based frontend

Degrading Functionality

[Component tree: Profile, Publications (Publication, Publication, Publication), AboutMe, LeftColumn, Image, Menu, Institution]

[Component tree: Profile, Publications (Publication, Publication, Publication), AboutMe (EXCEPTION), LeftColumn, Image, Menu, Institution]

[Component tree: Profile, Publications (Publication, Publication, Publication), LeftColumn, Image, Menu, Institution; the failing AboutMe component is left out]
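The idea of dropping only the failing component can be sketched like this. The function `renderComponents` and the array of callables are a hypothetical structure for illustration, not the framework used in the talk.

```php
<?php
declare(strict_types=1);

// Renders each component on its own; a failing component is logged and skipped.
// $components maps a component name to a callable that returns its HTML.
function renderComponents(array $components): string
{
    $html = '';
    foreach ($components as $name => $render) {
        try {
            $html .= $render();
        } catch (\Throwable $e) {
            error_log(sprintf('Component %s failed: %s', $name, $e->getMessage()));
            // Degrade: leave this component out instead of failing the whole page.
        }
    }
    return $html;
}
```

The user still gets the rest of the page, and the log entry tells you which component broke.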

Big cause for regressions

Deployments

Errors happen

Safe deployments

Feature Flags / Toggles

Release !== Deployment

public function hasAccess()
{
    return featureFlag()->isActive(FeatureFlag::YOUR_FEATURE);
}

Partial rollout
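One way a partial rollout could be decided per user is sketched below. The function `isActiveForUser` and the bucketing by `crc32` are assumptions; the talk does not show how its feature flag service does this.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: activates a feature for a stable percentage of users.
function isActiveForUser(string $feature, int $userId, int $rolloutPercent): bool
{
    // Hash feature + user so each user always lands in the same bucket (0-99).
    $bucket = abs(crc32($feature . ':' . $userId)) % 100;
    return $bucket < $rolloutPercent;
}

// Usage: start with 5% of users, then raise the percentage step by step.
// if (isActiveForUser('new_profile_header', $userId, 5)) { ... }
```

Because the bucket depends only on the feature name and the user ID, a user who sees the feature keeps seeing it while the percentage grows, and setting the percentage to 0 is a fast rollback.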

Fast rollbacks

Microservices

Safe Service-to-Service Communication

Service A -> Service B: 200 OK

Distributed Systems are hard

Errors happen

Failures

Service A -> Service B: 5xx

Very slow responses

Service A -> Service B: response 2 seconds later

One incoming request leads to multiple requests to services

Low timeout settings

Service A -> Service B: only wait for max. 50ms
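A minimal sketch of a call with a low timeout, using PHP's HTTP stream wrapper. The function name `fetchWithTimeout` and the URL in the usage comment are hypothetical.

```php
<?php
declare(strict_types=1);

// Fetches a URL but gives up after $timeoutSeconds; returns null on any failure.
function fetchWithTimeout(string $url, float $timeoutSeconds = 0.05): ?string
{
    $context = stream_context_create([
        'http' => [
            'timeout' => $timeoutSeconds, // read timeout in seconds (0.05 = 50ms)
        ],
    ]);
    // Suppress the warning; a failed call is an expected case here.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Usage: fall back to a default when Service B is too slow or failing.
// $json = fetchWithTimeout('http://service-b.internal/recommendations') ?? '[]';
```

The caller always gets an answer within a bounded time and decides what a sensible fallback is.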

Don’t call service instances directly

High availability

Use a Load Balancer

[Diagram: Service A and Service B call Service C through a load balancer, which chooses one running instance of Service C]

Health checks

Are you still alive?

Even better

Circuit Breakers

Service A -> Circuit Breaker -> Service B: 200 OK (status: closed, error rate: 0)

Service A -> Circuit Breaker -> Service B: Error (status: -> open, error rate: > threshold)

Service A -> Circuit Breaker, no response from Service B (status: -> open, error rate: > threshold)

Service A -> Circuit Breaker -> Service B: Error (status: -> open, error rate: > threshold, test if still failing)

Service A -> Circuit Breaker -> Service B: 200 OK (status: -> closed, error rate: 0, test if still failing)
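The state changes above can be sketched in a few lines of PHP. This is a simplified illustration, not the API of Hystrix or phystrix: the class and its parameters are hypothetical, it counts consecutive failures instead of an error rate, and it keeps its state only in the current process.

```php
<?php
declare(strict_types=1);

final class CircuitBreaker
{
    private int $failureThreshold;
    private float $retryAfterSeconds;
    private int $failures = 0;
    private ?float $openedAt = null;

    public function __construct(int $failureThreshold = 3, float $retryAfterSeconds = 5.0)
    {
        $this->failureThreshold = $failureThreshold;
        $this->retryAfterSeconds = $retryAfterSeconds;
    }

    // Runs $call while the circuit is closed; otherwise returns $fallback immediately.
    // $now is injectable so the timing can be tested without waiting.
    public function call(callable $call, $fallback, ?float $now = null)
    {
        $now = $now ?? microtime(true);

        if ($this->openedAt !== null && ($now - $this->openedAt) < $this->retryAfterSeconds) {
            return $fallback; // open: do not even try to reach the service
        }

        // Closed, or open long enough to let one test request through.
        try {
            $result = $call();
            $this->failures = 0;
            $this->openedAt = null; // close the circuit again
            return $result;
        } catch (\Throwable $e) {
            $this->failures++;
            if ($this->failures >= $this->failureThreshold) {
                $this->openedAt = $now; // open (or stay open after a failed test)
            }
            return $fallback;
        }
    }

    public function isOpen(): bool
    {
        return $this->openedAt !== null;
    }
}
```

In a real PHP setup the counters would have to live in shared storage, because every request starts with fresh process state.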

Handling traffic spikes

Service A -> Circuit Breaker -> Service B: 200 OK

[Diagram: Service A calls Service B and Service C, each through its own circuit breaker; timeouts]

Throttling

Only allow xx% of calls

Priority

100% of calls vs. 10% of calls, depending on priority
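Throttling by priority can be sketched as a simple percentage check. The priority names, the constant `ALLOWED_PERCENT` and the function `allowCall` are assumptions for illustration.

```php
<?php
declare(strict_types=1);

// Assumed priorities: share of calls (in percent) that may pass while throttling is active.
const ALLOWED_PERCENT = ['high' => 100, 'low' => 10];

// Decides whether one call may pass; $roll is injectable to keep the logic testable.
function allowCall(string $priority, ?int $roll = null): bool
{
    $allowed = ALLOWED_PERCENT[$priority] ?? 0;
    $roll = $roll ?? random_int(0, 99);
    return $roll < $allowed;
}

// Usage: skip less important calls during a traffic spike.
// if (allowCall('low')) { /* call the recommendation service */ }
```

Calls that are essential for the page keep working, while optional calls shed most of their load.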

https://github.com/Netflix/Hystrix

https://github.com/odesk/phystrix

[Diagram: Service A calls Service B and Service C through Linkerd, which acts as the circuit breaker]

[Diagram: two servers, one running Service A and one running Service B and instances of Service C; each server also runs Linkerd and Consul]

Test it

http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html

Summing it up

Central Log Management

Monitor and Measure everything

Alerts

Graceful Exception Handling

Feature Flags

Partial Rollouts

Circuit Breakers

http://speakerdeck.com/u/bastianhofmann

http://twitter.com/BastianHofmann
http://lanyrd.com/people/BastianHofmann
http://speakerdeck.com/u/bastianhofmann
