
Understanding and Detecting Software Upgrade Failures in Distributed Systems

Paper discussion of "Understanding and Detecting Software Upgrade Failures in Distributed Systems" by Zhang et al. for CS3551: Advanced Topics in Distributed Information Systems at the University of Pittsburgh.

The paper was originally presented at the 2021 ACM Symposium on Operating Systems Principles (SOSP '21).

CS3551 is a PhD-level seminar on distributed systems. In Fall 2024, the seminar's special topic was building reliable distributed systems.

Shinwoo Kim

November 05, 2024

Transcript

  1. Understanding and Detecting Software Upgrade Failures in Distributed Systems
    Yongle Zhang, Junwen Yang, Zhuqi Jin, Utsav Sethi, Kirk Rodrigues, Shan Lu, Ding Yuan
    ACM Symposium on Operating Systems Principles 2021
    Presented by Shinwoo Kim (cs.pitt.edu/~shk148), CS3551: Advanced Topics in Distributed Information Systems
  2. Background: Upgrading software is hard
    • Full-stop upgrades incur an outage in which users cannot access services during the upgrade process
  3. Background: Upgrading software is hard
    • Full-stop upgrades incur an outage in which users cannot access services during the upgrade process
    • Rolling upgrades allow nodes to take turns receiving the upgrade
      • But they are slow (hours to days)
      • Sometimes fast updates are desired to roll out patches, hot fixes, etc.
    (Figure: canary deployment, Version 1 and Version 2)
  4. Problem: Software Upgrade Failures
    (def.) Failures that occur only during a software upgrade
  5. Problem: Software Upgrade Failures
    (def.) Failures that occur only during a software upgrade
    • Happen due to interaction between two versions of the software
    • Not due to changes in configuration
    • Not bugs in the new version that would also occur on a clean install of the new version
    • Do not occur in 'normal' scenarios
  6. Problem: Software Upgrade Failures
    (def.) Failures that occur only during a software upgrade
    • Happen due to interaction between two versions of the software
    • Not due to changes in configuration
    • Not bugs in the new version that would also occur on a clean install of the new version
    • Do not occur in 'normal' scenarios
    • Problematic because:
      • Impacts can be large-scale and persistent
      • Failures are difficult to mask from users
      • Not tested in traditional or state-of-the-art testing frameworks
  7. Problem: Software Upgrade Failures
    (def.) Failures that occur only during a software upgrade
    • Happen due to interaction between two versions of the software
    • Not due to changes in configuration
    • Not bugs in the new version that would also occur on a clean install of the new version
    • Do not occur in 'normal' scenarios
    • Problematic because:
      • Impacts can be large-scale and persistent
      • Failures are difficult to mask from users
      • Not tested in traditional or state-of-the-art testing frameworks
  8. Contribution: This paper analyzes upgrade failures in real systems
  9. Contribution: This paper analyzes upgrade failures in real systems
    • In-depth analysis of 123 real-world upgrade failures
  10. Contribution: This paper analyzes upgrade failures in real systems
    • In-depth analysis of 123 real-world upgrade failures
    • In 8 widely used distributed systems
  11. Contribution: This paper analyzes upgrade failures in real systems
    • In-depth analysis of 123 real-world upgrade failures
    • In 8 widely used distributed systems
  12. Contribution: This paper analyzes upgrade failures in real systems
    • In-depth analysis of 123 real-world upgrade failures
    • In 8 widely used distributed systems
    • Use key findings from the analysis to develop two new tools
  13. Contribution: This paper analyzes upgrade failures in real systems
    • In-depth analysis of 123 real-world upgrade failures
    • In 8 widely used distributed systems
    • Use key findings from the analysis to develop two new tools
      • DUPTester
  14. Contribution: This paper analyzes upgrade failures in real systems
    • In-depth analysis of 123 real-world upgrade failures
    • In 8 widely used distributed systems
    • Use key findings from the analysis to develop two new tools
      • DUPTester
      • DUPChecker
  15. Key Findings
    Based on analysis of 123 real-world upgrade failures
  16. Results: Findings, Summarized
    • Upgrade failures have significantly higher priority than regular failures
  17. Results: Findings, Summarized
    • Upgrade failures have significantly higher priority than regular failures
      • 67% of upgrade failures are catastrophic (affecting all or a majority of users instead of a small group)
  18. Results: Findings, Summarized
    • Upgrade failures have significantly higher priority than regular failures
      • 67% of upgrade failures are catastrophic (affecting all or a majority of users instead of a small group)
      • 70% of upgrade failures have easy-to-observe symptoms like node crashes or fatal exceptions
  19. Results: Findings, Summarized
    • Upgrade failures have significantly higher priority than regular failures
      • 67% of upgrade failures are catastrophic (affecting all or a majority of users instead of a small group)
      • 70% of upgrade failures have easy-to-observe symptoms like node crashes or fatal exceptions
      • 63% of upgrade bugs were not caught before release
  20. Results: Findings, Summarized
    • Most upgrade failures are caused by two versions that hold incompatible assumptions about data syntax or semantics
  21. Results: Findings, Summarized
    • Most upgrade failures are caused by two versions that hold incompatible assumptions about data syntax or semantics
      • ≈ 20% of data syntax incompatibilities stem from data syntax defined by serialization libraries or Enum data types
  22. Results: Findings, Summarized
    • Most upgrade failures are caused by two versions that hold incompatible assumptions about data syntax or semantics
      • ≈ 20% of data syntax incompatibilities stem from data syntax defined by serialization libraries or Enum data types
      • ≈ 80% of data syntax incompatibilities are caused by missing or incomplete deserialization functions for system-specific data
  23. Results: Findings, Summarized
    • Most upgrade failures are caused by two versions that hold incompatible assumptions about data syntax or semantics
      • ≈ 20% of data syntax incompatibilities stem from data syntax defined by serialization libraries or Enum data types
      • ≈ 80% of data syntax incompatibilities are caused by missing or incomplete deserialization functions for system-specific data
      • ≈ 2/3 of data semantics incompatibilities are caused by incomplete version checking and handling
  24. Results: Findings, Summarized
    • Most upgrade failures are caused by two versions that hold incompatible assumptions about data syntax or semantics
      • ≈ 20% of data syntax incompatibilities stem from data syntax defined by serialization libraries or Enum data types
      • ≈ 80% of data syntax incompatibilities are caused by missing or incomplete deserialization functions for system-specific data
      • ≈ 2/3 of data semantics incompatibilities are caused by incomplete version checking and handling
    Examples (the Enum case is sketched after this slide):
      • Adding a new required field in the new version → the serialization library cannot find the field when reading data written by the old version → fix by making the field optional
      • Adding a member in the middle of an Enum increments the indices of all later members → fix by adding padding between members for future-proofing
      • Serialization and deserialization work, but the data is interpreted differently: Kafka 2.1.0 assumes retentionTime = DEFAULT → expireTimestamp = None; Kafka 0.11 does not follow this assumption [KAFKA-7403]
  25. Results: Findings, Summarized
    • Triggering upgrade failures is relatively easy
  26. Results: Findings, Summarized
    • Triggering upgrade failures is relatively easy
      • All but 14 upgrade failures can be triggered by upgrading between consecutive major or minor versions
  27. Results: Findings, Summarized
    • Triggering upgrade failures is relatively easy
      • All but 14 upgrade failures can be triggered by upgrading between consecutive major or minor versions
      • All upgrade failures require no more than 3 nodes to trigger
  28. Results: Findings, Summarized
    • Triggering upgrade failures is relatively easy
      • All but 14 upgrade failures can be triggered by upgrading between consecutive major or minor versions
      • All upgrade failures require no more than 3 nodes to trigger
      • Close to 90% of upgrade failures are deterministic, requiring no special timing to trigger
  29. Results: Findings, Summarized
    • Triggering upgrade failures is relatively easy
      • All but 14 upgrade failures can be triggered by upgrading between consecutive major or minor versions
      • All upgrade failures require no more than 3 nodes to trigger
      • Close to 90% of upgrade failures are deterministic, requiring no special timing to trigger
      • Many can be triggered using existing tests
  30. Results: Findings, Summarized
    • Most upgrade failures are caused by two versions that hold incompatible assumptions about data syntax or semantics
      • ≈ 20% of data syntax incompatibilities stem from data syntax defined by serialization libraries or Enum data types
      • ≈ 80% of data syntax incompatibilities are caused by missing or incomplete deserialization functions for system-specific data
      • ≈ 2/3 of data semantics incompatibilities are caused by incomplete version checking and handling (see the sketch after this slide)
    • Triggering upgrade failures is relatively easy
      • All but 14 upgrade failures can be triggered by upgrading between consecutive major or minor versions
      • All upgrade failures require no more than 3 nodes to trigger
      • Close to 90% of upgrade failures are deterministic, requiring no special timing to trigger
      • Many can be triggered using existing tests
    • Upgrade failures have significantly higher priority than regular failures
      • 67% of upgrade failures are catastrophic (affecting all or a majority of users instead of a small group)
      • 70% of upgrade failures have easy-to-observe symptoms like node crashes or fatal exceptions
      • 63% of upgrade bugs were not caught before release
  31. Approach & Techniques: Testing and Detecting Upgrade Failures (DUP = Distributed system UPgrade)
    • DUPTester
      • Adapts and reuses each distributed system's existing stress tests and unit test cases to systematically test the upgrade procedure
        • Stress-test reuse is easy, since they are sequences of client-side commands
        • Unit tests require translation into client-side scripts (may not always work)
      • Simulates a 3-node cluster using container orchestration under:
        • Full-stop upgrade (run the old version to completion, gracefully shut it down, then run the new version on the data produced by the old version); a sketch follows this slide
        • Rolling upgrade (the new version runs before the rolling upgrade of the old version finishes)
        • New node joining (nodes running the new version join a cluster of old-version nodes)
      • Upgrades are simulated by replacing containers
      • Can trigger failures regardless of cause, as long as the workload covers them
    • DUPChecker
      • Statically analyzes data syntax defined using standard serialization libraries and detects incompatibilities across versions
      • Focuses on Protocol Buffers and Apache Thrift
      • Covers only incompatibility-based failures, but can predict the exact symptom
  32. Results: DUP{Tester,Checker}
    • DUPTester revealed 20 previously unknown upgrade failures
    • DUPChecker revealed 800+ (300+ verified) previously unknown upgrade failures
  33. Paper Review: Strengths & Weaknesses
    Strengths
      • Evidence-driven approach
      • Practical applicability
      • Problem is of crucial importance in real-world systems
    Weaknesses
      • Methodology of the DUP analysis is unclear beyond: "independently by at least two inspectors. All inspectors used the same detailed written classification methodology, and any disagreement is discussed in the end to reach a consensus."
      • Limited "novelty" in the ideas presented
      • Limitations of conducting a manual study (under-reporting, selection bias, physical limitations of being human)
      • Limited analysis of DUPTester and DUPChecker
  34. Paper Review: Discussion Points
    • Application of other testing methodologies for triggering/discovering upgrade failures?
      • Fuzz testing, regression testing, other SWE tools?
    • Better data serialization
      • How to enforce compatibility?
    • Robust upgrade procedures
      • Netflix uses real-time data for testing