PWL NY: Simple Testing Can Prevent Most Critical Failures

Caitie McCaffrey

June 14, 2016

Transcript

  1. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems

    Papers We Love New York - June 2016
  2. Complexity of Failures

    “A majority (77%) of failures require more than one input event to manifest, but most of the failures (90%) require no more than 3”
  3. Complexity of Failures

    “The specific order of events is important in 88% of the failures that require multiple events”
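
    Taken together, the two observations above suggest that a test harness can cover a lot of ground by enumerating short, ordered sequences of input events. The following is only a minimal sketch of that idea, not something shown in the talk: the event names in `EVENTS` and the helper `ordered_event_sequences` are hypothetical.

```python
from itertools import permutations

# Hypothetical input events; a real harness would use events that are
# meaningful for the system under test.
EVENTS = ["start_node", "write_request", "node_crash", "network_partition"]


def ordered_event_sequences(events, max_len=3):
    """Yield every ordered sequence of distinct events of length 1..max_len."""
    for length in range(1, max_len + 1):
        # permutations, not combinations, because the order of events matters
        yield from permutations(events, length)


sequences = list(ordered_event_sequences(EVENTS))
print(len(sequences))  # 4 + 4*3 + 4*3*2 = 40
```

    Capping the length at 3 keeps the number of sequences manageable; with four events there are 40 sequences to replay.
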
  4. Logging

    • 76% of the failures print explicit failure-related error messages
    • For 84% of the failures, all of the triggering events are logged
    • Logs are noisy: each failure prints 824 log messages (median)
  5. Error Handling

    • 92% of failures were the result of incorrect handling of non-fatal errors
    • 58% of faults could have been detected via simple testing
    • 35% of failures caused by bad practices in error handling code
  6. Bad Practices

    • Error handling code is simply empty or only contains a log statement
    • Error handler aborts cluster on an overly general exception
    • Error handler contains comments like FIXME or TODO
  7. Aspirator

    Performs static analysis of Java bytecode to detect:
    • error handler is empty
    • error handler over-catches exceptions and aborts
    • error handler contains phrases like “TODO” or “FIXME”
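
    To make the three checks concrete, here is a rough analogue for Python source code built on the standard-library `ast` module. It is not Aspirator, which analyzes Java bytecode, and it does not reproduce its rules. The sample code in `SAMPLE`, the `ABORT_CALLS` and `BROAD` sets, and the function names are all assumptions made for illustration.

```python
import ast
import re

# Hypothetical code under inspection; it is only parsed, never executed.
SAMPLE = '''
import sys

def replicate(block):
    try:
        send(block)
    except IOError:
        pass

def flush(table):
    try:
        write(table)
    except Exception:
        sys.exit(1)

def rename(path):
    try:
        move(path)
    except OSError as err:
        # TODO: handle this properly
        log(err)
'''

ABORT_CALLS = {"exit", "_exit", "abort"}  # assumed "abort-like" call names
BROAD = {"Exception", "BaseException"}    # assumed "overly general" types


def call_name(call):
    """Return the simple name of a called function, e.g. 'exit' for sys.exit()."""
    func = call.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return None


def check_handlers(source):
    """Return sorted (line, message) pairs for suspicious except handlers."""
    lines = source.splitlines()
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # Check 1: the handler body consists only of `pass`.
        if all(isinstance(stmt, ast.Pass) for stmt in node.body):
            findings.append((node.lineno, "empty handler"))
        # Check 2: a bare or very general except whose body calls an abort.
        caught = node.type.id if isinstance(node.type, ast.Name) else None
        too_broad = node.type is None or caught in BROAD
        aborts = any(
            isinstance(n, ast.Call) and call_name(n) in ABORT_CALLS
            for stmt in node.body
            for n in ast.walk(stmt)
        )
        if too_broad and aborts:
            findings.append((node.lineno, "over-broad catch that aborts"))
        # Check 3: TODO/FIXME in the handler's source lines (comments are not
        # part of the AST, so the raw text is searched instead).
        text = "\n".join(lines[node.lineno - 1:node.end_lineno])
        if re.search(r"\b(TODO|FIXME)\b", text):
            findings.append((node.lineno, "TODO/FIXME in handler"))
    return sorted(findings)


findings = check_handlers(SAMPLE)
for line, message in findings:
    print(f"line {line}: {message}")
```

    The sketch flags one handler per rule in the sample. A comment placed after the last statement of a handler would be missed, because the handler's line range ends at its last statement.
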
  8. Aspirator Results

    • 500 New Bugs & Bad Practices
    • 115 False Positives
    • 171 bugs reported
    • 143 bugs confirmed or fixed
  9. “It is often much harder to reason about the correctness of a system’s abnormal path than its normal execution path”
  10. Moving Forward

    • Use a tool like Aspirator that is capable of identifying trivial bugs
    • Enforce code reviews of error handling code
    • High code coverage on error handling code
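
    One way to raise coverage of error handling code is to inject the error in a unit test and assert on what the handler does. The sketch below uses `unittest.mock` from the Python standard library. `Store` and `save_with_retry` are hypothetical and exist only to give the tests something to exercise.

```python
import unittest
from unittest import mock


class Store:
    """Hypothetical storage client, used only for illustration."""

    def write(self, key, value):
        raise NotImplementedError


def save_with_retry(store, key, value, attempts=3):
    """Write a value, retrying on a non-fatal IOError; return True on success."""
    for _ in range(attempts):
        try:
            store.write(key, value)
            return True
        except IOError:
            continue  # non-fatal: retry rather than swallow or abort
    return False


class SaveWithRetryErrorPathTest(unittest.TestCase):
    def test_recovers_after_transient_error(self):
        store = mock.Mock(spec=Store)
        # First call raises, second call succeeds.
        store.write.side_effect = [IOError("disk busy"), None]
        self.assertTrue(save_with_retry(store, "k", "v"))
        self.assertEqual(store.write.call_count, 2)

    def test_reports_failure_when_error_persists(self):
        store = mock.Mock(spec=Store)
        # Every call raises, so all attempts are used up.
        store.write.side_effect = IOError("disk busy")
        self.assertFalse(save_with_retry(store, "k", "v", attempts=3))
        self.assertEqual(store.write.call_count, 3)
```

    Saved in a module, the tests can be run with `python -m unittest`. Both tests drive execution through the `except` branch, which an ordinary happy-path test would never reach.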