Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
PWL NY: Simple Testing Can Prevent Most Critica...
Search
Caitie McCaffrey
June 14, 2016
Technology
8
470
PWL NY: Simple Testing Can Prevent Most Critical Failures
Caitie McCaffrey
June 14, 2016
Tweet
Share
More Decks by Caitie McCaffrey
See All by Caitie McCaffrey
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
333
22k
The Path Towards Simplifying Consistency in Distributed Systems
caitiem20
1
350
Argus Papers We Love
caitiem20
14
1.2k
The Verification of a Distributed System
caitiem20
22
2.3k
We Hear You Like Papers: Eventual Consistency
caitiem20
14
830
The Verification of a Distributed System
caitiem20
12
800
The Verification of a Distributed System
caitiem20
6
790
A Brief History of Distributed Programming: RPC
caitiem20
31
6.7k
Building Scalable Stateful Services
caitiem20
12
1.7k
Other Decks in Technology
See All in Technology
2つのフロントエンドと状態管理
mixi_engineers
PRO
3
110
現場で効くClaude Code ─ 最新動向と企業導入
takaakikakei
1
250
ブロックテーマ時代における、テーマの CSS について考える Toro_Unit / 2025.09.13 @ Shinshu WordPress Meetup
torounit
0
130
サラリーマンの小遣いで作るtoCサービス - Cloudflare Workersでスケールする開発戦略
shinaps
2
450
[ JAWS-UG 東京 CommunityBuilders Night #2 ]SlackとAmazon Q Developerで 運用効率化を模索する
sh_fk2
3
430
20250913_JAWS_sysad_kobe
takuyay0ne
2
220
DevIO2025_継続的なサービス開発のための技術的意思決定のポイント / how-to-tech-decision-makaing-devio2025
nologyance
1
400
LLMを搭載したプロダクトの品質保証の模索と学び
qa
0
1.1k
S3アクセス制御の設計ポイント
tommy0124
3
200
「Linux」という言葉が指すもの
sat
PRO
4
130
Agile PBL at New Grads Trainings
kawaguti
PRO
1
430
ZOZOマッチのアーキテクチャと技術構成
zozotech
PRO
4
1.6k
Featured
See All Featured
It's Worth the Effort
3n
187
28k
A designer walks into a library…
pauljervisheath
207
24k
Optimising Largest Contentful Paint
csswizardry
37
3.4k
BBQ
matthewcrist
89
9.8k
The Invisible Side of Design
smashingmag
301
51k
The Cult of Friendly URLs
andyhume
79
6.6k
Typedesign – Prime Four
hannesfritz
42
2.8k
Embracing the Ebb and Flow
colly
87
4.8k
Building a Modern Day E-commerce SEO Strategy
aleyda
43
7.6k
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
53
2.9k
The Illustrated Children's Guide to Kubernetes
chrisshort
48
50k
Raft: Consensus for Rubyists
vanstee
140
7.1k
Transcript
Simple Testing Can Prevent Most Critical Failures: An Analysis of
Production Failures in Distributed Data-Intensive Systems Papers We Love New York - June 2016
Caitie McCaffrey @caitie Distributed Systems Engineer CaitieM.com
None
None
Analyzed Failures in Real World Systems
“A majority (77%) of failures require more than one input
event to manifest, but most of the failures (90%) require no more than 3” Complexity of Failures
“The specific order of events is important in 88% of
the failures that require multiple events Complexity of Failures
“3 Nodes or less can reproduce 98% of Failures” Complexity
of Failures
Unit Tests “A majority of production failures (77%) can be
reproduced by a unit test”
Top Down Fault Injection & State Space Exploration is Expensive
Logging • 76% of the failures print explicit failure- related
error messages • For 84% of the failures, all of the triggering events are logged • Logs are noisy: each failure prints 824 log messages (median)
Catastrophic Failures
Error Handling • 92% of failures were the result of
incorrect handling of non-fatal errors • 58% of faults could have been detected via simple testing • 35% of failures caused by bad practices in error handling code
• Error Handling Code is simply empty or only contains
a Log statement • Error Handler aborts cluster on an overly general exception • Error Handler contains comments like FIXME or TODO Bad Practices
Aspirator Performs static analysis of Java bytecode to detect: •
error handler is empty • error handler over-catches exceptions and aborts • error handler contains phrases like “TODO” or “FIXME”
• 500 New Bugs & Bad Practices • 115 Fasle
Positives • 171 bugs reported • 143 bugs confirmed or fixed Aspirator Results
-developer “I fail to see the reason to handle every
exception” Developer Reactions
“It is often much harder to reason about the correctness
of a system’s abnormal path than its normal execution path ”
Moving Forward • Use a tool like Aspirator that is
capable of identifying trivial bugs • Enforce code reviews of error handling code • High code coverage on error handling code
Questions @caitie