Simple Testing Can Prevent
Most Critical Failures:
An Analysis of Production Failures in
Distributed Data-Intensive Systems
Papers We Love New York - June 2016
Slide 2
Caitie McCaffrey
@caitie
Distributed Systems Engineer
CaitieM.com
Slide 5
Analyzed Failures in Real-World Systems
Slide 6
Complexity of Failures
“A majority (77%) of failures require more than one input event to manifest, but most of the failures (90%) require no more than 3”
Slide 7
Complexity of Failures
“The specific order of events is important in 88% of the failures that require multiple events”
Slide 8
Complexity of Failures
“3 nodes or fewer can reproduce 98% of failures”
Slide 9
Unit Tests
“A majority of production failures (77%) can be reproduced by a unit test”
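The findings above (few input events, order-sensitive, few nodes) suggest that many failures fit in a single-process test. The following is a minimal sketch of that idea, not an example from the paper: `ToyStore`, its methods, and the planted bug are all hypothetical, invented only to show an event sequence being replayed in a unit test.

```python
class ToyStore:
    """Hypothetical in-memory stand-in for a 3-node replicated store."""

    def __init__(self):
        self.leader = "n1"
        self.followers = ["n2", "n3"]
        self.unreplicated = []  # writes accepted by the leader only
        self.committed = []     # writes that reached the followers

    def write(self, value):
        self.unreplicated.append(value)

    def replicate(self):
        self.committed.extend(self.unreplicated)
        self.unreplicated = []

    def fail_leader(self):
        # Promote the next follower to leader.
        self.leader = self.followers.pop(0)
        # Deliberately planted bug: pending writes are silently dropped.
        self.unreplicated = []


def replay(events):
    """Apply a list of (method_name, *args) events to a fresh store."""
    store = ToyStore()
    for name, *args in events:
        getattr(store, name)(*args)
    return store.committed


def loses_write(events):
    """True if some written value never shows up as committed."""
    written = [args[0] for name, *args in events if name == "write"]
    return replay(events) != written


# Three events, in this order, trigger the failure...
failing = [("write", "a"), ("fail_leader",), ("replicate",)]
# ...while the same events in another order do not.
passing = [("fail_leader",), ("write", "a"), ("replicate",)]

print(loses_write(failing), loses_write(passing))
```

Replaying short event lists like this keeps the test deterministic, which is what makes order-dependent failures practical to pin down without a real cluster.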
Slide 10
Top Down Fault Injection & State Space Exploration is Expensive
Slide 11
Logging
• 76% of the failures print explicit failure-related error messages
• For 84% of the failures, all of the triggering events are logged
• Logs are noisy: each failure prints 824 log messages (median)
Slide 12
Catastrophic Failures
Slide 13
Error Handling
• 92% of catastrophic failures were the result of incorrect handling of non-fatal errors
• In 58% of catastrophic failures, the underlying faults could have been detected via simple testing of error handling code
• 35% of catastrophic failures were caused by bad practices in error handling code
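"Simple testing" of an error handler can be as small as forcing the non-fatal error in a unit test and checking what the handler does. The sketch below assumes a hypothetical `flush_with_retry` routine and `FlushError` exception (neither comes from the talk) and uses the standard library's `unittest.mock` to inject the fault.

```python
import logging
from unittest import mock

log = logging.getLogger("toy")


class FlushError(Exception):
    """Hypothetical non-fatal error raised by the storage layer."""


def flush_with_retry(write_block, data, attempts=3):
    """Hypothetical routine whose error handler we want covered by tests."""
    last_error = None
    for _ in range(attempts):
        try:
            write_block(data)
            return True
        except FlushError as err:  # catch only the error we expect
            last_error = err
            log.warning("flush failed, retrying: %s", err)
    log.error("giving up after %d attempts: %s", attempts, last_error)
    return False


def test_handler_retries_then_succeeds():
    # First call raises the injected fault, second call succeeds.
    disk = mock.Mock(side_effect=[FlushError("disk busy"), None])
    assert flush_with_retry(disk, b"x") is True
    assert disk.call_count == 2


def test_handler_gives_up():
    # Every call raises, so the handler must stop after `attempts` tries.
    disk = mock.Mock(side_effect=FlushError("disk full"))
    assert flush_with_retry(disk, b"x", attempts=2) is False
    assert disk.call_count == 2


test_handler_retries_then_succeeds()
test_handler_gives_up()
```

Both tests execute the `except` branch, so the handler is exercised before production does it for the first time.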
Slide 14
Bad Practices
• Error Handling Code is simply empty or only contains a log statement
• Error Handler aborts cluster on an overly general exception
• Error Handler contains comments like FIXME or TODO
Slide 15
Aspirator
Performs static analysis of Java bytecode to detect:
• error handler is empty
• error handler over-catches exceptions and aborts
• error handler contains phrases like “TODO” or “FIXME”
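To make the three checks concrete, here is a rough source-level analog written with Python's `ast` and `tokenize` modules. It is not Aspirator, which works on Java bytecode. The function names, the set of calls treated as "logging" or "abort", and the sample code are all assumptions chosen for illustration.

```python
import ast
import io
import tokenize

BROAD = {"Exception", "BaseException"}   # assumed "overly general" types
ABORT_CALLS = {"exit", "_exit", "abort"}  # assumed abort-style calls
LOG_CALLS = {"print", "debug", "info", "warning", "error", "exception"}


def _call_name(call):
    func = call.func
    if isinstance(func, ast.Attribute):
        return func.attr
    return getattr(func, "id", "")


def _is_log_only(stmt):
    """True for `pass` or a bare logging/print call."""
    if isinstance(stmt, ast.Pass):
        return True
    if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call):
        return _call_name(stmt.value) in LOG_CALLS
    return False


def _aborts(handler):
    return any(isinstance(n, ast.Call) and _call_name(n) in ABORT_CALLS
               for n in ast.walk(handler))


def check_handlers(source):
    """Return sorted (line, warning) pairs for suspicious except blocks."""
    comments = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments[tok.start[0]] = tok.string
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if all(_is_log_only(s) for s in node.body):
            findings.append((node.lineno, "empty or log-only handler"))
        caught = node.type
        broad = caught is None or (
            isinstance(caught, ast.Name) and caught.id in BROAD)
        if broad and _aborts(node):
            findings.append((node.lineno, "over-catches and aborts"))
        for line in range(node.lineno, node.end_lineno + 1):
            text = comments.get(line, "")
            if "TODO" in text or "FIXME" in text:
                findings.append((node.lineno, "handler contains TODO/FIXME"))
                break
    return sorted(findings)


SAMPLE = '''\
import sys

def load(path):
    try:
        return open(path).read()
    except OSError:
        pass

def serve(step):
    try:
        step()
    except Exception:
        sys.exit(1)

def sync(step):
    try:
        step()
    except ValueError:
        # TODO: decide how to recover
        raise
'''

for line, warning in check_handlers(SAMPLE):
    print(line, warning)
```

On the sample, the sketch flags one handler per pattern. A real tool would need far more care about false positives, which the results on the next slide show are substantial even for Aspirator.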
Slide 16
Aspirator Results
• 500 New Bugs & Bad Practices
• 115 False Positives
• 171 bugs reported
• 143 bugs confirmed or fixed
Slide 17
Developer Reactions
“I fail to see the reason to handle every exception”
-developer
Slide 18
“It is often much harder to reason about the correctness of a system’s abnormal path than its normal execution path”
Slide 19
Moving Forward
• Use a tool like Aspirator that is capable of identifying trivial bugs
• Enforce code reviews of error handling code
• High code coverage on error handling code