
Understanding and Detecting Software Upgrade Failures in Distributed Systems

Paper discussion of "Understanding and Detecting Software Upgrade Failures in Distributed Systems" by Zhang et al. for CS3551: Advanced Topics in Distributed Information Systems at the University of Pittsburgh.

The paper was originally presented at the 2021 ACM Symposium on Operating Systems Principles (SOSP '21).

CS3551 is a PhD-level seminar on distributed systems. In Fall 2024, the seminar's special topic was building reliable distributed systems.

Shinwoo Kim

November 05, 2024

Transcript

  1. Understanding and Detecting Software Upgrade Failures in Distributed Systems
    Yongle Zhang, Junwen Yang, Zhuqi Jin, Utsav Sethi, Kirk Rodrigues, Shan Lu, Ding Yuan
    ACM Symposium on Operating Systems Principles 2021
    Presented by Shinwoo Kim (cs.pitt.edu/~shk148), CS3551: Advanced Topics in Distributed Information Systems
  2. Background: Upgrading software is hard
    • Full-stop upgrades incur an outage in which users cannot access services during the upgrade process
  3. Background: Upgrading software is hard
    • Full-stop upgrades incur an outage in which users cannot access services during the upgrade process
    • Rolling upgrades allow nodes to take turns receiving the upgrade
      • But they are slow (hours to days)
      • Sometimes fast updates are desired to roll out patches, hot fixes, etc.
    (Figure: canary deployment, Version 1 and Version 2)
  4. Problem: Software Upgrade Failures
    (def.) Failures that occur only during a software upgrade
  5. Problem: Software Upgrade Failures
    (def.) Failures that occur only during a software upgrade
    • Happen due to interaction between two versions of the software
    • Not due to changes in configuration
    • Not bugs in the new version that would also occur on a clean install of the new version
    • Do not occur in 'normal' scenarios
  6. Problem: Software Upgrade Failures
    (def.) Failures that occur only during a software upgrade
    • Happen due to interaction between two versions of the software
    • Not due to changes in configuration
    • Not bugs in the new version that would also occur on a clean install of the new version
    • Do not occur in 'normal' scenarios
    • Problematic because:
      • Impacts can be large-scale and persistent
      • Failures are difficult to mask from users
      • Not tested in traditional or state-of-the-art testing frameworks
  7. Problem: Software Upgrade Failures
    (def.) Failures that occur only during a software upgrade
    • Happen due to interaction between two versions of the software
    • Not due to changes in configuration
    • Not bugs in the new version that would also occur on a clean install of the new version
    • Do not occur in 'normal' scenarios
    • Problematic because:
      • Impacts can be large-scale and persistent
      • Failures are difficult to mask from users
      • Not tested in traditional or state-of-the-art testing frameworks
  8. Contribution: This paper analyzes upgrade failures in real systems
  9. Contribution: This paper analyzes upgrade failures in real systems
    • In-depth analysis of 123 real-world upgrade failures
  10. Contribution: This paper analyzes upgrade failures in real systems
    • In-depth analysis of 123 real-world upgrade failures
    • In 8 widely used distributed systems
  11. Contribution: This paper analyzes upgrade failures in real systems
    • In-depth analysis of 123 real-world upgrade failures
    • In 8 widely used distributed systems
  12. Contribution: This paper analyzes upgrade failures in real systems
    • In-depth analysis of 123 real-world upgrade failures
    • In 8 widely used distributed systems
    • Use key findings from the analysis to develop two new tools
  13. Contribution: This paper analyzes upgrade failures in real systems
    • In-depth analysis of 123 real-world upgrade failures
    • In 8 widely used distributed systems
    • Use key findings from the analysis to develop two new tools
      • DUPTester
  14. Contribution: This paper analyzes upgrade failures in real systems
    • In-depth analysis of 123 real-world upgrade failures
    • In 8 widely used distributed systems
    • Use key findings from the analysis to develop two new tools
      • DUPTester
      • DUPChecker
  15. Key Findings
    Based on analysis of 123 real-world upgrade failures
  16. Results: Findings, Summarized
    • Upgrade failures have significantly higher priority than regular failures
  17. Results: Findings, Summarized
    • Upgrade failures have significantly higher priority than regular failures
      • 67% of upgrade failures are catastrophic (affecting all or a majority of users instead of a small group)
  18. Results: Findings, Summarized
    • Upgrade failures have significantly higher priority than regular failures
      • 67% of upgrade failures are catastrophic (affecting all or a majority of users instead of a small group)
      • 70% of upgrade failures have easy-to-observe symptoms like node crashes or fatal exceptions
  19. Results: Findings, Summarized
    • Upgrade failures have significantly higher priority than regular failures
      • 67% of upgrade failures are catastrophic (affecting all or a majority of users instead of a small group)
      • 70% of upgrade failures have easy-to-observe symptoms like node crashes or fatal exceptions
      • 63% of upgrade bugs were not caught before release
  20. Results: Findings, Summarized
    • Most upgrade failures are caused by two versions that hold incompatible assumptions about data syntax or semantics
  21. Results: Findings, Summarized
    • Most upgrade failures are caused by two versions that hold incompatible assumptions about data syntax or semantics
      • ≈ 20% of data syntax incompatibilities stem from data syntax defined by serialization libraries or Enum data types
  22. Results: Findings, Summarized
    • Most upgrade failures are caused by two versions that hold incompatible assumptions about data syntax or semantics
      • ≈ 20% of data syntax incompatibilities stem from data syntax defined by serialization libraries or Enum data types
      • ≈ 80% of data syntax incompatibilities are caused by missing or incomplete deserialization functions for system-specific data
  23. Results: Findings, Summarized
    • Most upgrade failures are caused by two versions that hold incompatible assumptions about data syntax or semantics
      • ≈ 20% of data syntax incompatibilities stem from data syntax defined by serialization libraries or Enum data types
      • ≈ 80% of data syntax incompatibilities are caused by missing or incomplete deserialization functions for system-specific data
      • ≈ 2/3 of data semantics incompatibilities are caused by incomplete version checking and handling
  24. Results: Findings, Summarized
    • Most upgrade failures are caused by two versions that hold incompatible assumptions about data syntax or semantics
      • ≈ 20% of data syntax incompatibilities stem from data syntax defined by serialization libraries or Enum data types
      • ≈ 80% of data syntax incompatibilities are caused by missing or incomplete deserialization functions for system-specific data
      • ≈ 2/3 of data semantics incompatibilities are caused by incomplete version checking and handling
    Examples (the Enum case is sketched after this slide):
      • Adding a new required field in the new version → the serialization library cannot find the field when reading data written by the old version → fix by making the field optional
      • Adding a member in the middle of an Enum increments the indices of all later members → fix by adding padding between members for future-proofing
      • Serialization and deserialization work, but the data is interpreted differently: Kafka 2.1.0 assumes retentionTime = DEFAULT → expireTimestamp = None; Kafka 0.11 does not follow this assumption [KAFKA-7403]
  25. Results: Findings, Summarized
    • Triggering upgrade failures is relatively easy
  26. Results: Findings, Summarized
    • Triggering upgrade failures is relatively easy
      • All but 14 upgrade failures can be triggered by upgrading between consecutive major or minor versions
  27. Results: Findings, Summarized
    • Triggering upgrade failures is relatively easy
      • All but 14 upgrade failures can be triggered by upgrading between consecutive major or minor versions
      • All upgrade failures require no more than 3 nodes to trigger
  28. Results: Findings, Summarized
    • Triggering upgrade failures is relatively easy
      • All but 14 upgrade failures can be triggered by upgrading between consecutive major or minor versions
      • All upgrade failures require no more than 3 nodes to trigger
      • Close to 90% of upgrade failures are deterministic, requiring no special timing to trigger
  29. Results: Findings, Summarized
    • Triggering upgrade failures is relatively easy
      • All but 14 upgrade failures can be triggered by upgrading between consecutive major or minor versions
      • All upgrade failures require no more than 3 nodes to trigger
      • Close to 90% of upgrade failures are deterministic, requiring no special timing to trigger
      • Many can be triggered using existing tests
  30. Results: Findings, Summarized
    • Most upgrade failures are caused by two versions that hold incompatible assumptions about data syntax or semantics
      • ≈ 20% of data syntax incompatibilities stem from data syntax defined by serialization libraries or Enum data types
      • ≈ 80% of data syntax incompatibilities are caused by missing or incomplete deserialization functions for system-specific data
      • ≈ 2/3 of data semantics incompatibilities are caused by incomplete version checking and handling (see the sketch after this slide)
    • Triggering upgrade failures is relatively easy
      • All but 14 upgrade failures can be triggered by upgrading between consecutive major or minor versions
      • All upgrade failures require no more than 3 nodes to trigger
      • Close to 90% of upgrade failures are deterministic, requiring no special timing to trigger
      • Many can be triggered using existing tests
    • Upgrade failures have significantly higher priority than regular failures
      • 67% of upgrade failures are catastrophic (affecting all or a majority of users instead of a small group)
      • 70% of upgrade failures have easy-to-observe symptoms like node crashes or fatal exceptions
      • 63% of upgrade bugs were not caught before release
  31. Approach & Techniques: Testing and Detecting Upgrade Failures (DUP = Distributed system UPgrade)
    • DUPTester
      • Adapts and reuses each distributed system's existing stress tests and unit test cases to systematically test the upgrade procedure
        • Stress-test reuse is easy, since they are sequences of client-side commands
        • Unit tests require translation into client-side scripts (may not always work)
      • Simulates a 3-node cluster using container orchestration under:
        • Full-stop upgrade (run the old version to completion, gracefully shut it down, then run the new version on the data produced by the old version); a sketch follows this slide
        • Rolling upgrade (the new version runs before the rolling upgrade of the old version finishes)
        • New node joining (nodes running the new version join a cluster of old-version nodes)
      • Upgrades are simulated by replacing containers
      • Can trigger failures regardless of cause, as long as the workload covers them
    • DUPChecker
      • Statically analyzes data syntax defined using standard serialization libraries and detects incompatibilities across versions
      • Focuses on Protocol Buffers and Apache Thrift
      • Covers only incompatibility-based failures, but can predict the exact symptom
  32. Results: DUP{Tester,Checker}
    • DUPTester revealed 20 previously unknown upgrade failures
    • DUPChecker revealed 800+ (300+ verified) previously unknown upgrade failures
  33. Paper Review: Strengths & Weaknesses
    Strengths
      • Evidence-driven approach
      • Practical applicability
      • Problem is of crucial importance in real-world systems
    Weaknesses
      • Methodology of the DUP analysis is unclear beyond: "independently by at least two inspectors. All inspectors used the same detailed written classification methodology, and any disagreement is discussed in the end to reach a consensus."
      • Limited "novelty" in the ideas presented
      • Limitations of conducting a manual study (under-reporting, selection bias, physical limitations of being human)
      • Limited analysis of DUPTester and DUPChecker
  34. Paper Review: Discussion Points
    • Application of other testing methodologies for triggering/discovering upgrade failures?
      • Fuzz testing, regression testing, other SWE tools?
    • Better data serialization
      • How to enforce compatibility?
    • Robust upgrade procedures
      • Netflix uses real-time data for testing