Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
PWL NY: Simple Testing Can Prevent Most Critica...
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
Caitie McCaffrey
June 14, 2016
Technology
510
8
Share
PWL NY: Simple Testing Can Prevent Most Critical Failures
Caitie McCaffrey
June 14, 2016
More Decks by Caitie McCaffrey
See All by Caitie McCaffrey
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
333
22k
The Path Towards Simplifying Consistency in Distributed Systems
caitiem20
1
410
Argus Papers We Love
caitiem20
14
1.3k
The Verification of a Distributed System
caitiem20
22
2.3k
We Hear You Like Papers: Eventual Consistency
caitiem20
14
890
The Verification of a Distributed System
caitiem20
12
850
The Verification of a Distributed System
caitiem20
6
840
A Brief History of Distributed Programming: RPC
caitiem20
31
6.8k
Building Scalable Stateful Services
caitiem20
12
1.8k
Other Decks in Technology
See All in Technology
ソフトウェアサプライチェーン攻撃対策として今からサクッとできること
flatt_security
2
130
DI コンテナ自動生成ツールを実装してみた / intro-autodi
uhzz
0
870
脅威をエンジニアリングの糧にして:恐怖を乗り越えた先にあったもの / Turn threats into fuel for engineering: what lay beyond overcoming fear
nrslib
1
260
Amazon Bedrock 経由の Claude Cowork を試してみよう・MCP にも繋いでみよう
sugimomoto
0
170
エンジニアは生成AIと どのように向き合うべきか? ことばの意味という観点から
verypluming
2
170
AIが変えた"品質の守り方"
kkakizaki
4
1.7k
Anthropic AIネイティブ・スタートアップ構築のプレイブック を理解する
nagatsu
0
180
TSKaigi 2026 - 型プラグインシステムの実装に使われるテクニック
teamlab
PRO
2
390
A Harness for Behaviour: how to get AI to generate code that does what we intend, or "TDD in the age of AI"
xpmatteo
0
410
Claude Codeですべての日常業務を爆速化しよう!
minorun365
PRO
14
12k
Claude Code x Accounting
kawaguti
PRO
1
320
TypeScriptとAngular Signal で実現する保守性の高いアプリケーション設計 - 3層アーキテクチャによる責務分離の実践(たつかわ) https://2026.tskaigi.org/talks/10
nealle
1
340
Featured
See All Featured
The B2B funnel & how to create a winning content strategy
katarinadahlin
PRO
1
360
AI: The stuff that nobody shows you
jnunemaker
PRO
7
660
Google's AI Overviews - The New Search
badams
0
1k
Reflections from 52 weeks, 52 projects
jeffersonlam
356
21k
Navigating the moral maze — ethical principles for Al-driven product design
skipperchong
2
370
Statistics for Hackers
jakevdp
799
230k
Reality Check: Gamification 10 Years Later
codingconduct
0
2.2k
Optimizing for Happiness
mojombo
378
71k
What the history of the web can teach us about the future of AI
inesmontani
PRO
1
570
Introduction to Domain-Driven Design and Collaborative software design
baasie
1
790
The SEO identity crisis: Don't let AI make you average
varn
0
470
Rails Girls Zürich Keynote
gr2m
96
14k
Transcript
Simple Testing Can Prevent Most Critical Failures: An Analysis of
Production Failures in Distributed Data-Intensive Systems Papers We Love New York - June 2016
Caitie McCaffrey @caitie Distributed Systems Engineer CaitieM.com
None
None
Analyzed Failures in Real World Systems
“A majority (77%) of failures require more than one input
event to manifest, but most of the failures (90%) require no more than 3” Complexity of Failures
“The specific order of events is important in 88% of
the failures that require multiple events Complexity of Failures
“3 Nodes or less can reproduce 98% of Failures” Complexity
of Failures
Unit Tests “A majority of production failures (77%) can be
reproduced by a unit test”
Top Down Fault Injection & State Space Exploration is Expensive
Logging • 76% of the failures print explicit failure- related
error messages • For 84% of the failures, all of the triggering events are logged • Logs are noisy: each failure prints 824 log messages (median)
Catastrophic Failures
Error Handling • 92% of failures were the result of
incorrect handling of non-fatal errors • 58% of faults could have been detected via simple testing • 35% of failures caused by bad practices in error handling code
• Error Handling Code is simply empty or only contains
a Log statement • Error Handler aborts cluster on an overly general exception • Error Handler contains comments like FIXME or TODO Bad Practices
Aspirator Performs static analysis of Java bytecode to detect: •
error handler is empty • error handler over-catches exceptions and aborts • error handler contains phrases like “TODO” or “FIXME”
• 500 New Bugs & Bad Practices • 115 Fasle
Positives • 171 bugs reported • 143 bugs confirmed or fixed Aspirator Results
-developer “I fail to see the reason to handle every
exception” Developer Reactions
“It is often much harder to reason about the correctness
of a system’s abnormal path than its normal execution path ”
Moving Forward • Use a tool like Aspirator that is
capable of identifying trivial bugs • Enforce code reviews of error handling code • High code coverage on error handling code
Questions @caitie