Slide 1

Slide 1 text

Handling errors at scale
Andrew Betts, Assanka

Slide 2

Slide 2 text

Embarrassing on live, often unhelpful on dev

Slide 3

Slide 3 text

Small parts…

Slide 4

Slide 4 text

of a bigger picture

Slide 5

Slide 5 text

Seeing the bigger picture
§  Error reporting
§  Logging
§  User feedback
§  Monitoring

Slide 6

Slide 6 text

Problems
§  Developers often leave non-fatal errors in code
    •  Error reporting set too low
    •  Open source projects are often riddled with them (e.g. WordPress)
§  Standard error output can be difficult to debug
§  Displaying raw errors to end users is really bad
§  Finding error patterns on live applications running across multiple servers requires tedious manual aggregation of logs
§  Matching errors to user feedback can be a pain
§  Finding evidence in logs can be very tough

Slide 7

Slide 7 text

A more elegant way forward for error handling, reporting and ‘problem management’

Slide 8

Slide 8 text

Errors

Slide 9

Slide 9 text

Principles: handling errors
§  No half measures: stop execution on every serious error
§  Display something sensible to the end user that they'll understand
§  React appropriately depending on the environment, dev vs live
    •  In dev, stop on EVERY error
§  Provide as much debug information as possible to help the developer solve the problem quickly

Slide 10

Slide 10 text

Displaying
Developer-facing, HTML page

Slide 11

Slide 11 text

Anatomy of an error report
§  Error title, line number and file
§  Code context (three lines either side of the error line)
§  Variable context (variables and objects defined in error scope)
§  Globals and superglobals
§  HTTP request details (GET, POST, headers)
§  Backtrace

Slide 12

Slide 12 text

Resolving references
Click
Open and highlight
Just one copy of each piece of data
Also handles recursive references

Slide 13

Slide 13 text

Abbreviating output Truncated Truncated

Slide 14

Slide 14 text

Other useful features
§  Exposing private members
§  Recognising time and date formats
§  Network tools for IP addresses

Slide 15

Slide 15 text

AJAX / CLI errors
Massive amounts of HTML would not be appreciated!

Slide 16

Slide 16 text

What we show the public
Customer-facing, HTML page
Hash: backtrace minus paths

Slide 17

Slide 17 text

Public, AJAX version

Slide 18

Slide 18 text

Error hashing
§  Take backtrace, strip embedded file paths
    •  Paths may vary for what is basically the same error
§  Serialise
§  Hash it
    •  8 character string of CRC32
    •  MD5 was too long
    •  Customer needs to be able to read it out over the phone
    •  No significant collisions yet (25,000 hashes recorded)
§  Same error on different servers will produce same hash
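
The hashing steps above can be sketched roughly as follows. This is a minimal illustration, not the handler's actual code: the function name errorHash is hypothetical, basename() is one assumed way of stripping paths, and hash('crc32b', ...) is an assumed choice of CRC32 variant that happens to return 8 hex characters.

```php
<?php
// Hypothetical helper: reduce a backtrace to a short, path-independent hash.
function errorHash(array $backtrace): string
{
    $frames = array();
    foreach ($backtrace as $frame) {
        $frames[] = array(
            // Assumption: keeping only the base name is enough to make the
            // hash independent of where the code is deployed.
            'file'     => isset($frame['file']) ? basename($frame['file']) : '',
            'line'     => isset($frame['line']) ? $frame['line'] : 0,
            'class'    => isset($frame['class']) ? $frame['class'] : '',
            'function' => isset($frame['function']) ? $frame['function'] : '',
        );
    }
    // Serialise, then hash. crc32b yields an 8 character hex string.
    return hash('crc32b', serialize($frames));
}

// Usage: hash the trace of the current call stack.
$hash = errorHash(debug_backtrace());
```

Function arguments are deliberately left out of the hashed frames in this sketch, since they usually differ between occurrences of what is basically the same error.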

Slide 19

Slide 19 text

Aggregate
§  Recognises project
§  Tracks response
§  Debug data
§  Occurrence graph
§  Comment history

Slide 20

Slide 20 text

Graphing (log scale)
Tends to happen 04:00 - 05:00

Slide 21

Slide 21 text

Debug logs viewable
No more than 1 per 4 hours per server

Slide 22

Slide 22 text

Similar tools: Zend Server

Slide 23

Slide 23 text

Feedback

Slide 24

Slide 24 text

Remember this?
Customer-facing, HTML page
Capture user diagnostics

Slide 25

Slide 25 text

User diagnostics
§  In addition to the error log, it is useful to know:
    •  User's cookies
    •  Screen size
    •  Viewport size
    •  Installed plugins
    •  Presence of proxy/firewall
    •  Browser and OS
    •  What they did
§  Often you need to know this stuff to resolve an issue that is not firing any errors (e.g. a layout issue)
§  Diagnostics app collects the data and files it with the bug report

Slide 26

Slide 26 text

Hide this if linked from error page

Slide 27

Slide 27 text

Appears on bug report User reports

Slide 28

Slide 28 text

Logging

Slide 29

Slide 29 text

CERN: 43 terabytes / day
Probably a bit high.

Slide 30

Slide 30 text

Probably a bit low.

Slide 31

Slide 31 text

Logging strategies
§  Log locally, read individual servers' logs when you need to
    •  Aggregate when necessary
§  Log locally, pull logs into central logging store on a schedule
§  Set up your own centralised remote logging service and log to it directly
    •  Third party tools include Splunk (www.splunk.com)
§  Use a third party remote logging service
    •  Loggly (www.loggly.com)
    •  Gmail? For the financially challenged! But has great search :)

Slide 32

Slide 32 text

Third party services: Loggly

Slide 33

Slide 33 text

Third party apps: Splunk

Slide 34

Slide 34 text

Monitoring

Slide 35

Slide 35 text

Monitoring tools
§  Self-hosted (all free)
    •  Zabbix (www.zabbix.com)
    •  Nagios (www.nagios.org)
    •  Munin (www.munin-monitoring.org)
§  Web services
    •  Serverdensity (www.serverdensity.com, £7/server/month)
§  Uptime reporting
    •  Pingdom (www.pingdom.com, $6/check/year)

Slide 36

Slide 36 text

Pingdom: not just uptime
Check that your APIs are working by submitting complex requests

Slide 37

Slide 37 text

Zabbix: not just CPU
Monitor any fluctuating numeric value associated with your application that might indicate a health problem. Choose a sensible sample rate.

Slide 38

Slide 38 text

Background and discussion (Implementation for PHP)

Slide 39

Slide 39 text

Errors happen
§  In development
§  On live systems

Slide 40

Slide 40 text

Types of error
Catchable (E_ERROR requires a hack)
§  E_USER_*
§  E_STRICT
§  E_NOTICE
§  E_DEPRECATED
§  E_WARNING
§  E_RECOVERABLE_ERROR
§  E_ERROR
Non-catchable
§  E_COMPILE_ERROR
§  E_COMPILE_WARNING
§  E_PARSE

Slide 41

Slide 41 text

Types of request
§  Web page
§  AJAX
§  CLI / standalone
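
One possible way to tell these three request types apart is sketched below. The function name requestType is hypothetical, and the AJAX check assumes the client sends an X-Requested-With header, which is a common library convention rather than a guarantee.

```php
<?php
// Hypothetical helper: classify the current request so the error handler can
// choose an output format. $server and $sapi are parameters (instead of
// reading $_SERVER and PHP_SAPI directly) so the function is easy to test.
function requestType(array $server, string $sapi = PHP_SAPI): string
{
    if ($sapi === 'cli') {
        return 'cli';
    }
    // Assumption: AJAX calls identify themselves with X-Requested-With.
    $requestedWith = isset($server['HTTP_X_REQUESTED_WITH'])
        ? $server['HTTP_X_REQUESTED_WITH']
        : '';
    if (strtolower($requestedWith) === 'xmlhttprequest') {
        return 'ajax';
    }
    return 'web';
}

// Usage inside a handler: pick plain text for CLI and AJAX, HTML otherwise.
$type = requestType($_SERVER);
```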

Slide 42

Slide 42 text

Plan an appropriate response in each possible case

Slide 43

Slide 43 text

Possible responses
§  Ignore it
§  Display it (and stop execution)
    •  Customer facing or developer facing?
    •  HTML or plain text?
    •  Has any output already been sent to the browser?
§  Log it
    •  Locally or remotely?
§  Report it
    •  Send debug data to a bug tracker

Slide 44

Slide 44 text

Ignoring
Try not to do that.

Slide 45

Slide 45 text

Logging
§  Writing to a local file
    •  log_errors directive
    •  error_log directive
    •  Or: DIY solution to log more detailed info
    •  We write one file per error occurrence, plus a summary line
§  Sending to a remote service
    •  Useful for multi-server setups
    •  Aggregate error occurrences from lots of servers
    •  UDP good for this - fire and forget
    •  Listen on more than one logging server for HA
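
The "one file per error occurrence, plus a summary line" idea might look something like the sketch below. The function name logOccurrence, the file naming scheme and the summary.log file name are all assumptions made for illustration.

```php
<?php
// Hypothetical sketch: write the full detail of one occurrence to its own
// file, then append a one-line summary to a shared log in the same directory.
function logOccurrence(string $dir, string $hash, string $summary, array $detail): string
{
    // uniqid() with extra entropy keeps file names distinct per occurrence.
    $file = $dir . '/' . $hash . '-' . uniqid('', true) . '.json';
    file_put_contents($file, json_encode($detail));

    $line = date('c') . ' ' . $hash . ' ' . $summary . ' ' . basename($file) . "\n";
    // FILE_APPEND | LOCK_EX stops concurrent requests interleaving lines.
    file_put_contents($dir . '/summary.log', $line, FILE_APPEND | LOCK_EX);

    return $file;
}
```

The summary file stays small enough to scan by eye or with grep, while the per-occurrence files hold the bulky debug data.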

Slide 46

Slide 46 text

Implementation

Slide 47

Slide 47 text

Defining a custom error handler
set_error_handler(array('ErrorHandler', 'reportError'));
set_exception_handler(array('ErrorHandler', 'reportException'));
register_shutdown_function(array('ErrorHandlerV5', 'fatalErrorShutdownHandler'));
§  Report both errors and unhandled exceptions

Slide 48

Slide 48 text

Setting up action rules
if ($_SERVER['IS_DEV']) {
    self::$action = array(
        'log' => E_ALL | E_STRICT,
        'stop' => E_ALL | E_STRICT,
        'index' => 0,
        'report' => 0
    );
} else {
    self::$action = array(
        'log' => E_ALL ^ (E_NOTICE | E_DEPRECATED | E_USER_DEPRECATED),
        'stop' => E_USER_ERROR | E_ERROR | E_WARNING,
        'index' => E_ALL ^ (E_NOTICE | E_DEPRECATED | E_USER_DEPRECATED),
        'report' => E_ALL ^ (E_NOTICE | E_DEPRECATED | E_USER_DEPRECATED)
    );
}
§  Action rules will map error types to actions for dev and live environments
§  $_SERVER['IS_DEV'] - Environment variable set on web server
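
Each rule is a bitmask of error levels, so deciding what to do with a given error is a bitwise AND against each mask. A minimal sketch of that lookup follows; the names actionsFor, $quiet and $live are illustrative and not part of the original handler.

```php
<?php
// Hypothetical helper: return the names of all actions whose bitmask
// includes the given error level.
function actionsFor(int $errno, array $rules): array
{
    $actions = array();
    foreach ($rules as $action => $mask) {
        if ($errno & $mask) {
            $actions[] = $action;
        }
    }
    return $actions;
}

// The live rules from the slide, rebuilt as a plain array.
$quiet = E_NOTICE | E_DEPRECATED | E_USER_DEPRECATED;
$live = array(
    'log'    => E_ALL ^ $quiet,
    'stop'   => E_USER_ERROR | E_ERROR | E_WARNING,
    'index'  => E_ALL ^ $quiet,
    'report' => E_ALL ^ $quiet,
);

// Usage: a warning on live triggers every action, a notice triggers none.
$forWarning = actionsFor(E_WARNING, $live);
$forNotice  = actionsFor(E_NOTICE, $live);
```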

Slide 49

Slide 49 text

The error handler
public static function reportError($errno, $errstr, $errfile, $errline, $context=array()) {

    // If the error has been suppressed using the @ operator, return
    if (error_reporting() == 0) return;

    // If there are no actions defined for this kind of error, return
    if (!($errno & (self::$action['log'] | self::$action['stop'] | self::$action['index'] | self::$action['report']))) return;

    $backtrace = debug_backtrace();
§  This is only called for errors, not Exceptions
§  Exceptions pass only an object, so we need to extract the data from the Exception object
§  Default Exception class does not capture context
§  So we need a custom Exception to capture context

Slide 50

Slide 50 text

Converting exceptions
class AssankaException extends Exception {

    public $context;

    public function __construct($message = '', $code = 0, Exception $previous = null, $context = null) {
        parent::__construct($message, $code, $previous);
        $this->context = $context;
    }

    public function getContext() {
        return $this->context;
    }
}

public static function reportException($ex) {
    self::reportError(E_ERROR, $ex->getMessage(), $ex->getFile(), $ex->getLine(), $ex);
}

Slide 51

Slide 51 text

Converting exceptions
// Generate a backtrace
if (is_object($context) and method_exists($context, 'getTrace')) {
    $backtrace = $context->getTrace();
} else {
    $backtrace = debug_backtrace();
}
§  Entire Exception object passed to the error handler as the context
§  Backtrace must be captured from the standard Exception method
§  Our context data is still within the custom Exception, and will be enumerated with the rest of the vars in the debug report
§  Standard Exceptions thereby also supported (without context)

Slide 52

Slide 52 text

Deal with E_ERROR
§  Can catch E_ERROR, but not using set_error_handler
§  Use register_shutdown_function to call a function before exit
§  This is called even if the exit is due to a fatal error
public static function fatalErrorShutdownHandler() {
    $error = error_get_last();
    if (!empty($error) and $error['type'] === E_ERROR) {
        self::reportError(E_ERROR, $error['message'], $error['file'], $error['line']);
    }
}

Slide 53

Slide 53 text

Creating a hash
// Remove includes from backtrace before hashing
$hashtrace = $backtrace;
$stopat = array("require", "include", "require_once", "include_once");
for ($i=0; $i

Slide 54

Slide 54 text

Logging remotely (indexing)
if ($errno & self::$action['index']) {
    $senddata = array("hash"=>$hash, "server"=>self::$server, "errno"=>$errno, "errstr"=>$errstr,
        "errfile"=>$errfile, "errline"=>$errline, "scriptname"=>$_SERVER["SCRIPT_NAME"]);
    $senddata = json_encode($senddata);
    $sock = socket_create(AF_INET, SOCK_DGRAM, SOL_UDP);
    socket_set_nonblock($sock);
    socket_set_option($sock, SOL_SOCKET, SO_BROADCAST, 1);
    socket_sendto($sock, $senddata, strlen($senddata), 0,
        self::UDP_MONITOR_HOST, self::UDP_MONITOR_PORT);
    socket_close($sock);
}
§  UDP server daemon increments count
§  Allows trend monitoring with external monitoring system
    •  Nagios, Zabbix, Munin (we use Zabbix)
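
The counting side of such a daemon could be as small as the sketch below. It covers only the "increments count" step for one received datagram; reading datagrams off the socket and exposing the totals to the monitoring system are left out. The function name countDatagram is hypothetical, and counting per hash is an assumption about what the daemon tallies.

```php
<?php
// Hypothetical sketch: update per-hash occurrence counts from one datagram
// payload, as sent by the JSON-encoding code above.
function countDatagram(string $payload, array &$counts): bool
{
    $data = json_decode($payload, true);
    // UDP offers no delivery or integrity guarantees at the application
    // level, so malformed payloads are simply ignored.
    if (!is_array($data) || !isset($data['hash']) || !is_string($data['hash'])) {
        return false;
    }
    $hash = $data['hash'];
    $counts[$hash] = (isset($counts[$hash]) ? $counts[$hash] : 0) + 1;
    return true;
}
```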

Slide 55

Slide 55 text

Creating debug data
§  Create debug data tree
    •  $_SERVER, $_GET, $_SESSION, backtrace, context, globals etc
§  Iterate over it recursively
§  Abbreviate and simplify
    •  Objects become arrays
    •  Shorten long values (and large arrays)
§  Identify references
    •  Set _errorhandler_objid property of all arrays and objects as they are processed
    •  If it’s already set, this is a reference to a value we’ve already indexed.
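
A rough sketch of the recursive abbreviation step follows. It is not the original implementation: instead of tagging values with an _errorhandler_objid property it uses spl_object_id() to recognise objects it has already visited, it tracks objects only (not arrays), and the name abbreviate, the limits and the _class, _id, _ref and _truncated keys are all assumptions.

```php
<?php
// Hypothetical sketch: convert a value into an abbreviated tree of arrays
// and scalars, with repeated objects replaced by a pointer to the first copy.
function abbreviate($value, array &$seen, int $depth = 0)
{
    $maxLen = 100;   // longest string kept intact
    $maxItems = 50;  // most array elements kept
    $maxDepth = 10;  // guards against self-referencing arrays

    if ($depth > $maxDepth) {
        return '*max depth reached*';
    }
    if (is_string($value)) {
        return strlen($value) > $maxLen ? substr($value, 0, $maxLen) . '...' : $value;
    }
    if (is_object($value)) {
        $id = spl_object_id($value);
        if (isset($seen[$id])) {
            return array('_ref' => $id); // already indexed: emit a reference
        }
        $seen[$id] = true;
        $out = array('_class' => get_class($value), '_id' => $id);
        // The (array) cast exposes private and protected members too; their
        // keys carry NUL-delimited prefixes, which are stripped here.
        foreach ((array) $value as $key => $item) {
            $parts = explode("\0", (string) $key);
            $out[end($parts)] = abbreviate($item, $seen, $depth + 1);
        }
        return $out;
    }
    if (is_array($value)) {
        $out = array();
        foreach (array_slice($value, 0, $maxItems, true) as $key => $item) {
            $out[$key] = abbreviate($item, $seen, $depth + 1);
        }
        if (count($value) > $maxItems) {
            $out['_truncated'] = count($value) - $maxItems;
        }
        return $out;
    }
    // Numbers, booleans and null pass through; anything else (for example a
    // resource) is replaced by its type name.
    return (is_scalar($value) || $value === null) ? $value : gettype($value);
}
```

Because every object appears in full only once, the resulting tree stays finite even when objects refer to each other.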

Slide 56

Slide 56 text

Sending report to bug tracker
§  Serialise abbreviated data as JSON
§  Save to a file
§  Fork a process (popen) to upload it to the bug tracker
    •  Ours sleeps if one is already running, waits for a gap
§  Upload using HTTP POST (cURL)

Slide 57

Slide 57 text

Bug tracker receiving the report
§  DON’T use the same error handler to handle errors from the bug tracker itself
§  Respond quickly to the reporting server – cut them off with:

Slide 58

Slide 58 text

Development tips
§  Always develop with the highest level of error reporting
§  Don’t use @ to suppress errors
§  Handle unexpected inputs with in-application errors; don’t fall back to an error handler
§  Use trigger_error to fire errors intentionally (e.g. for use of deprecated code)
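
As an illustration of the last tip, a deprecated function can announce itself with trigger_error while still doing its job. The function oldPriceFormat and the collecting handler below are hypothetical and exist only to make the example self-contained.

```php
<?php
// Hypothetical legacy function kept for backwards compatibility.
function oldPriceFormat(float $amount): string
{
    trigger_error('oldPriceFormat() is deprecated, use number_format()', E_USER_DEPRECATED);
    return number_format($amount, 2);
}

// Minimal handler for demonstration: collect errors instead of showing them.
$captured = array();
set_error_handler(function ($errno, $errstr) use (&$captured) {
    $captured[] = array($errno, $errstr);
    return true; // tell PHP the error has been handled
});

$price = oldPriceFormat(1234.5);
restore_error_handler();
```

With action rules like those shown earlier, an E_USER_DEPRECATED error would stop execution in dev and be ignored on live.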

Slide 59

Slide 59 text

Our most common errors
§  MySQL lock wait timeout
§  Unavailability of third party web services
§  File system permissions
§  Out of memory
§  Unexpected input / inadequate input validation
§  Manually triggered alerts