Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
PWL NY: Simple Testing Can Prevent Most Critica...
Search
Caitie McCaffrey
June 14, 2016
Technology
8
440
PWL NY: Simple Testing Can Prevent Most Critical Failures
Caitie McCaffrey
June 14, 2016
Tweet
Share
More Decks by Caitie McCaffrey
See All by Caitie McCaffrey
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
330
21k
The Path Towards Simplifying Consistency in Distributed Systems
caitiem20
1
290
Argus Papers We Love
caitiem20
13
1.2k
The Verification of a Distributed System
caitiem20
22
2.2k
We Hear You Like Papers: Eventual Consistency
caitiem20
14
790
The Verification of a Distributed System
caitiem20
12
750
The Verification of a Distributed System
caitiem20
6
750
A Brief History of Distributed Programming: RPC
caitiem20
31
6.5k
Building Scalable Stateful Services
caitiem20
12
1.5k
Other Decks in Technology
See All in Technology
Amazon Aurora バージョンアップについて、改めて理解する ~バージョンアップ手法と文字コードへの影響~
smt7174
1
420
[TechNight #86] Oracle GoldenGate - 23ai 最新情報&プロジェクトからの学び
oracle4engineer
PRO
1
220
オーティファイ会社紹介資料 / Autify Company Deck
autifyhq
10
120k
20250208_OpenAIDeepResearchがやばいという話
doradora09
PRO
0
140
ソフトウェアアーキテクトのための意思決定術: Software Architecture and Decision-Making
snoozer05
PRO
20
4.4k
EDRからERM: PFN-SIRTが関わるセキュリティとリスクへの取り組み
pfn
PRO
0
140
Autify Company Deck
autifyhq
2
41k
5分で紹介する生成AIエージェントとAmazon Bedrock Agents / 5-minutes introduction to generative AI agents and Amazon Bedrock Agents
hideakiaoyagi
0
140
まだ間に合う! エンジニアのための生成AIアプリ開発入門 on AWS
minorun365
PRO
4
480
Next Step: Play Time!
trishagee
2
140
Amazon GuardDuty Malware Protection for Amazon S3のここがすごい!
ryder472
1
110
生成AIの利活用を加速させるための取り組み「prAIrie-dog」/ Shibuya_AI_1
visional_engineering_and_design
1
120
Featured
See All Featured
[RailsConf 2023] Rails as a piece of cake
palkan
53
5.2k
Building Adaptive Systems
keathley
39
2.4k
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
29
2.4k
How GitHub (no longer) Works
holman
313
140k
The Straight Up "How To Draw Better" Workshop
denniskardys
232
140k
Designing Experiences People Love
moore
139
23k
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
29
2.2k
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
6
530
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
280
13k
Docker and Python
trallard
44
3.2k
It's Worth the Effort
3n
184
28k
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
31
2.1k
Transcript
Simple Testing Can Prevent Most Critical Failures: An Analysis of
Production Failures in Distributed Data-Intensive Systems Papers We Love New York - June 2016
Caitie McCaffrey @caitie Distributed Systems Engineer CaitieM.com
None
None
Analyzed Failures in Real World Systems
“A majority (77%) of failures require more than one input
event to manifest, but most of the failures (90%) require no more than 3” Complexity of Failures
“The specific order of events is important in 88% of
the failures that require multiple events Complexity of Failures
“3 Nodes or less can reproduce 98% of Failures” Complexity
of Failures
Unit Tests “A majority of production failures (77%) can be
reproduced by a unit test”
Top Down Fault Injection & State Space Exploration is Expensive
Logging • 76% of the failures print explicit failure- related
error messages • For 84% of the failures, all of the triggering events are logged • Logs are noisy: each failure prints 824 log messages (median)
Catastrophic Failures
Error Handling • 92% of failures were the result of
incorrect handling of non-fatal errors • 58% of faults could have been detected via simple testing • 35% of failures caused by bad practices in error handling code
• Error Handling Code is simply empty or only contains
a Log statement • Error Handler aborts cluster on an overly general exception • Error Handler contains comments like FIXME or TODO Bad Practices
Aspirator Performs static analysis of Java bytecode to detect: •
error handler is empty • error handler over-catches exceptions and aborts • error handler contains phrases like “TODO” or “FIXME”
• 500 New Bugs & Bad Practices • 115 Fasle
Positives • 171 bugs reported • 143 bugs confirmed or fixed Aspirator Results
-developer “I fail to see the reason to handle every
exception” Developer Reactions
“It is often much harder to reason about the correctness
of a system’s abnormal path than its normal execution path ”
Moving Forward • Use a tool like Aspirator that is
capable of identifying trivial bugs • Enforce code reviews of error handling code • High code coverage on error handling code
Questions @caitie