Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
PWL NY: Simple Testing Can Prevent Most Critica...
Search
Caitie McCaffrey
June 14, 2016
Technology
8
490
PWL NY: Simple Testing Can Prevent Most Critical Failures
Caitie McCaffrey
June 14, 2016
Tweet
Share
More Decks by Caitie McCaffrey
See All by Caitie McCaffrey
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
333
22k
The Path Towards Simplifying Consistency in Distributed Systems
caitiem20
1
380
Argus Papers We Love
caitiem20
14
1.2k
The Verification of a Distributed System
caitiem20
22
2.3k
We Hear You Like Papers: Eventual Consistency
caitiem20
14
860
The Verification of a Distributed System
caitiem20
12
820
The Verification of a Distributed System
caitiem20
6
810
A Brief History of Distributed Programming: RPC
caitiem20
31
6.8k
Building Scalable Stateful Services
caitiem20
12
1.8k
Other Decks in Technology
See All in Technology
Kusakabe_面白いダッシュボードの表現方法
ykka
0
320
【Oracle Cloud ウェビナー】ランサムウェアが突く「侵入の隙」とバックアップの「死角」 ~ 過去の教訓に学ぶ — 侵入前提の防御とデータ保護 ~
oracle4engineer
PRO
0
150
AWSと生成AIで学ぶ!実行計画の読み解き方とSQLチューニングの実践
yakumo
2
610
フルカイテン株式会社 エンジニア向け採用資料
fullkaiten
0
10k
Web Intelligence and Visual Media Analytics
weblyzard
PRO
1
6.8k
純粋なイミュータブルモデルを設計してからイベントソーシングと組み合わせるDeciderの実践方法の紹介 /Introducing Decider Pattern with Event Sourcing
tomohisa
1
1.2k
OCI技術資料 : OS管理ハブ 概要
ocise
2
4.1k
プロンプトエンジニアリングを超えて:自由と統制のあいだでつくる Platform × Context Engineering
yuriemori
0
480
形式手法特論:コンパイラの「正しさ」は証明できるか? #burikaigi / BuriKaigi 2026
ytaka23
17
6.3k
Bill One 開発エンジニア 紹介資料
sansan33
PRO
4
17k
Kiro Power - Amazon Bedrock AgentCore を学ぶ、もう一つの方法
r3_yamauchi
PRO
0
110
AWS Network Firewall Proxyで脱Squid運用⁈
nnydtmg
1
120
Featured
See All Featured
Color Theory Basics | Prateek | Gurzu
gurzu
0
180
Balancing Empowerment & Direction
lara
5
840
Impact Scores and Hybrid Strategies: The future of link building
tamaranovitovic
0
190
Digital Ethics as a Driver of Design Innovation
axbom
PRO
0
140
Designing for humans not robots
tammielis
254
26k
How To Speak Unicorn (iThemes Webinar)
marktimemedia
1
370
How to Build an AI Search Optimization Roadmap - Criteria and Steps to Take #SEOIRL
aleyda
1
1.8k
Have SEOs Ruined the Internet? - User Awareness of SEO in 2025
akashhashmi
0
250
SERP Conf. Vienna - Web Accessibility: Optimizing for Inclusivity and SEO
sarafernandez
1
1.3k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
249
1.3M
職位にかかわらず全員がリーダーシップを発揮するチーム作り / Building a team where everyone can demonstrate leadership regardless of position
madoxten
54
49k
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
31
2.7k
Transcript
Simple Testing Can Prevent Most Critical Failures: An Analysis of
Production Failures in Distributed Data-Intensive Systems Papers We Love New York - June 2016
Caitie McCaffrey @caitie Distributed Systems Engineer CaitieM.com
None
None
Analyzed Failures in Real World Systems
“A majority (77%) of failures require more than one input
event to manifest, but most of the failures (90%) require no more than 3” Complexity of Failures
“The specific order of events is important in 88% of
the failures that require multiple events Complexity of Failures
“3 Nodes or less can reproduce 98% of Failures” Complexity
of Failures
Unit Tests “A majority of production failures (77%) can be
reproduced by a unit test”
Top Down Fault Injection & State Space Exploration is Expensive
Logging • 76% of the failures print explicit failure- related
error messages • For 84% of the failures, all of the triggering events are logged • Logs are noisy: each failure prints 824 log messages (median)
Catastrophic Failures
Error Handling • 92% of failures were the result of
incorrect handling of non-fatal errors • 58% of faults could have been detected via simple testing • 35% of failures caused by bad practices in error handling code
• Error Handling Code is simply empty or only contains
a Log statement • Error Handler aborts cluster on an overly general exception • Error Handler contains comments like FIXME or TODO Bad Practices
Aspirator Performs static analysis of Java bytecode to detect: •
error handler is empty • error handler over-catches exceptions and aborts • error handler contains phrases like “TODO” or “FIXME”
• 500 New Bugs & Bad Practices • 115 Fasle
Positives • 171 bugs reported • 143 bugs confirmed or fixed Aspirator Results
-developer “I fail to see the reason to handle every
exception” Developer Reactions
“It is often much harder to reason about the correctness
of a system’s abnormal path than its normal execution path ”
Moving Forward • Use a tool like Aspirator that is
capable of identifying trivial bugs • Enforce code reviews of error handling code • High code coverage on error handling code
Questions @caitie