PWL NY: Simple Testing Can Prevent Most Critical Failures
Caitie McCaffrey
June 14, 2016
Transcript
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
Papers We Love New York - June 2016
Caitie McCaffrey
@caitie
Distributed Systems Engineer
CaitieM.com
Analyzed Failures in Real World Systems
Complexity of Failures
“A majority (77%) of failures require more than one input event to manifest, but most of the failures (90%) require no more than 3”
Complexity of Failures
“The specific order of events is important in 88% of the failures that require multiple events”
Complexity of Failures
“3 Nodes or less can reproduce 98% of Failures”
Unit Tests
“A majority of production failures (77%) can be reproduced by a unit test”
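A minimal sketch of what such a unit test can look like, assuming a toy replication scheme that is not taken from the paper or the talk: the `Primary` and `Follower` classes, the `resync_on_rejoin` flag, and the `run_events` helper are all hypothetical. It uses three in-process "nodes" and three input events, and the bug only shows up for one specific ordering of those events.

```python
import unittest


class Follower:
    """Hypothetical in-memory replica; stands in for a real node."""

    def __init__(self):
        self.data = {}
        self.online = True


class Primary:
    """Hypothetical primary that replicates writes to online followers."""

    def __init__(self, followers, resync_on_rejoin):
        self.data = {}
        self.followers = followers
        self.resync_on_rejoin = resync_on_rejoin

    def write(self, key, value):
        self.data[key] = value
        for follower in self.followers:
            if follower.online:
                follower.data[key] = value

    def follower_down(self, follower):
        follower.online = False

    def follower_up(self, follower):
        follower.online = True
        if self.resync_on_rejoin:
            # Copy any writes the follower missed while it was offline.
            follower.data.update(self.data)


def run_events(order, resync_on_rejoin):
    """Apply three input events in `order`; return what follower 1 sees for "k"."""
    f1, f2 = Follower(), Follower()
    primary = Primary([f1, f2], resync_on_rejoin)
    events = {
        "down": lambda: primary.follower_down(f1),
        "write": lambda: primary.write("k", "v"),
        "up": lambda: primary.follower_up(f1),
    }
    for name in order:
        events[name]()
    return f1.data.get("k")


class RejoinTest(unittest.TestCase):
    def test_bug_needs_this_exact_order(self):
        # down -> write -> up: the follower rejoins without the write (the bug).
        self.assertIsNone(run_events(["down", "write", "up"], resync_on_rejoin=False))
        # Same three events in another order: the bug stays hidden.
        self.assertEqual(run_events(["write", "down", "up"], resync_on_rejoin=False), "v")

    def test_resync_on_rejoin_fixes_it(self):
        self.assertEqual(run_events(["down", "write", "up"], resync_on_rejoin=True), "v")
```

The point of the sketch is that no cluster, network, or fault-injection framework is needed: the events are ordinary method calls, so the ordering is fully under the test's control.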
Top Down Fault Injection & State Space Exploration is Expensive
Logging
• 76% of the failures print explicit failure-related error messages
• For 84% of the failures, all of the triggering events are logged
• Logs are noisy: each failure prints 824 log messages (median)
Catastrophic Failures
Error Handling
• 92% of catastrophic failures were the result of incorrect handling of non-fatal errors
• 58% of the underlying faults could have been detected via simple testing
• 35% of catastrophic failures were caused by bad practices in error handling code
Bad Practices
• Error Handling Code is simply empty or only contains a Log statement
• Error Handler aborts cluster on an overly general exception
• Error Handler contains comments like FIXME or TODO
Aspirator
Performs static analysis of Java bytecode to detect:
• error handler is empty
• error handler over-catches exceptions and aborts
• error handler contains phrases like “TODO” or “FIXME”
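The three checks can be illustrated with a small analogue for Python source. This is only a sketch built on the standard `ast` module and is not how Aspirator works (Aspirator analyzes Java bytecode). The `check` function, the `SAMPLE` snippet, and the name sets used to recognize logging and abort calls are assumptions made for the example. It needs Python 3.8 or later for `end_lineno`.

```python
import ast

# Assumed name sets for the sketch; a real tool would need something more robust.
ABORT_CALLS = {"exit", "_exit", "abort"}
BROAD_TYPES = {"Exception", "BaseException"}
LOG_CALLS = {"debug", "info", "warning", "error", "exception", "print"}

SAMPLE = '''
def flush(store):
    try:
        store.flush()
    except IOError:
        pass

def start(server):
    try:
        server.start()
    except Exception:
        raise SystemExit(1)

def split(region):
    try:
        region.split()
    except KeyError:
        # TODO: handle this properly
        region.rollback()
'''


def _tail_name(node):
    """Return the last identifier of a Name, Attribute, or Call, else None."""
    if isinstance(node, ast.Call):
        node = node.func
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    return None


def _is_empty(handler):
    """True if the handler body is only `pass` and/or log-style calls."""
    for stmt in handler.body:
        if isinstance(stmt, ast.Pass):
            continue
        is_call = isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)
        if is_call and _tail_name(stmt.value) in LOG_CALLS:
            continue
        return False
    return True


def _aborts(handler):
    """True if the handler raises SystemExit or calls an exit-like function."""
    for node in ast.walk(handler):
        if isinstance(node, ast.Raise) and _tail_name(node.exc) == "SystemExit":
            return True
        if isinstance(node, ast.Call) and _tail_name(node) in ABORT_CALLS:
            return True
    return False


def check(source):
    """Return sorted (line, warning) pairs for the except handlers in `source`."""
    lines = source.splitlines()
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if _is_empty(node):
            findings.append((node.lineno, "empty handler"))
        broad = node.type is None or _tail_name(node.type) in BROAD_TYPES
        if broad and _aborts(node):
            findings.append((node.lineno, "over-catch and abort"))
        # Comments are not part of the AST, so scan the handler's source lines.
        text = "\n".join(lines[node.lineno - 1:node.end_lineno])
        if "TODO" in text or "FIXME" in text:
            findings.append((node.lineno, "TODO/FIXME in handler"))
    return findings if not findings else sorted(findings)
```

Calling `check(SAMPLE)` flags one handler for each of the three patterns. The checks are deliberately shallow pattern matches, which is also why a tool of this kind will produce some false positives.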
Aspirator Results
• 500 New Bugs & Bad Practices
• 115 False Positives
• 171 bugs reported
• 143 bugs confirmed or fixed
Developer Reactions
“I fail to see the reason to handle every exception” - developer
“It is often much harder to reason about the correctness of a system’s abnormal path than its normal execution path”
Moving Forward
• Use a tool like Aspirator that is capable of identifying trivial bugs
• Enforce code reviews of error handling code
• High code coverage on error handling code
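One way to get coverage on error handling code is to force the error path in a test by injecting the fault. The sketch below uses `unittest.mock` from the Python standard library. The `save_snapshot` function, the `SnapshotError` type, and the `disk.write` interface are hypothetical names invented for the example, not anything from the talk.

```python
from unittest import mock


class SnapshotError(Exception):
    """Raised when a snapshot cannot be written after all retries."""


def save_snapshot(disk, payload, retries=2):
    """Write `payload` via the hypothetical `disk.write`, retrying on OSError."""
    last_error = None
    for _ in range(retries + 1):
        try:
            disk.write(payload)
            return True
        except OSError as err:  # catch only the non-fatal error we expect
            last_error = err
    # Surface the failure to the caller instead of swallowing or aborting.
    raise SnapshotError(f"gave up after {retries + 1} attempts") from last_error


def test_recovers_from_one_transient_error():
    disk = mock.Mock()
    # First call raises, second call succeeds.
    disk.write.side_effect = [OSError("disk busy"), None]
    assert save_snapshot(disk, b"state") is True
    assert disk.write.call_count == 2


def test_reports_persistent_error():
    disk = mock.Mock()
    disk.write.side_effect = OSError("disk full")  # every call raises
    try:
        save_snapshot(disk, b"state", retries=1)
    except SnapshotError as err:
        assert isinstance(err.__cause__, OSError)
    else:
        raise AssertionError("expected SnapshotError")
    assert disk.write.call_count == 2
```

Both tests exercise the `except` branch, which a test of only the success path would never reach.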
Questions @caitie