ICSE18 - Deflaker

Slides for talk at ICSE18 on Automatically Detecting Flaky Tests.

Michael Hilton

May 31, 2018

Transcript

  1. DeFlaker: Automatically Detecting Flaky Tests (ICSE, May 31, 2018)

    Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov. George Mason University, University of Illinois at Urbana-Champaign, and Carnegie Mellon University. http://www.deflaker.org Fork DeFlaker on GitHub
  2. Regression Testing

    The developer makes changes to code and runs the tests. Test A: “Test is OK!” Test B: “Test failed!” Developer: “Ah-ha! I found that I introduced a regression!”
  3. Flaky Tests

    The developer makes changes to code and runs the tests. Original run: Test A, “Test is OK!”; Test B, “Test failed!” After a rerun of Test B: “Test is flaky!”
    Flaky test: a test which can pass or fail on the same version of code
  4. Flaky Tests Mean Flaky Builds

    Analysis of 935 failing builds on Travis CI [Labuschagne, Inozemtseva, and Holmes ’17]: Flaky 13%, Not Flaky 87%
  5. Rerunning Tests

    Test A: “Did this pass by chance?” Test B: “Did this fail by chance?” After rerunning, Test A: “I think it truly passed?” Test B: “It is flaky!”
    Should I always rerun all of my tests? How many times? How long should I wait to rerun them?
  6. Example Flaky Test

    Timeline: Start test → Start server → Wait 3 seconds for server to start → Make request to server → Test fails! → Server startup complete (too late!)
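
The timing race on this slide can be modeled in a few lines. This is a hypothetical Python sketch, not code from the talk: the function name and the startup times are illustrative assumptions, and only the fixed 3-second wait comes from the slide.

```python
def fixed_wait_test_passes(server_startup_seconds: float, wait_seconds: float = 3.0) -> bool:
    """Model the test on the slide: start the server, wait a fixed time, send a request.

    The request succeeds only if the server finished starting within the wait.
    """
    return server_startup_seconds <= wait_seconds


# Same test, same code, two runs on differently loaded machines (hypothetical timings).
fast_machine_run = fixed_wait_test_passes(server_startup_seconds=2.0)
slow_machine_run = fixed_wait_test_passes(server_startup_seconds=3.5)

print(fast_machine_run, slow_machine_run)  # prints: True False
```

Nothing in the code under test differs between the two runs; only the environment does, which is what makes the test flaky.
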
  7. Problems with Rerun

    • Why would the test outcome change right away when we rerun it?
    • Maybe the machine running the test is overloaded?
    • Rerunning takes time and resources!
    • Even Google can’t rerun all failing tests: https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
    • What else can we do to identify flaky tests without reruns?
    Figure: the example flaky test timeline (start test, start server, wait 3 seconds for server to start, make request to server, test fails, server startup completes too late).
  8. Detecting Flaky Tests

    • Test B on project version 1, then on project version 2: test outcome changes, code changed: maybe flaky, maybe not
    • Test B on project version 1, then again on project version 1: test outcome changes, code NOT changed: definitely flaky!
  9. Detecting Flaky Tests

    • A test is flaky if it has different outcomes executing the same code
    • Even if project code changes: does the test run any changed code?
    • Test B on project version 1, then on project version 2: code changed, test didn’t run changes, hence test is flaky
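
The decision rule from the two "Detecting Flaky Tests" slides can be sketched as follows. This is a minimal illustration under stated assumptions, not DeFlaker's implementation: the function name, the string labels, and the representation of code as sets of identifiers are all hypothetical.

```python
from typing import Set


def classify_outcome_change(outcome_changed: bool,
                            changed_code: Set[str],
                            executed_code: Set[str]) -> str:
    """Classify a test whose outcome is compared against its previous run.

    changed_code:  identifiers of code that changed between versions (assumed format).
    executed_code: identifiers of code the test actually ran (assumed format).
    """
    if not outcome_changed:
        return "no change in outcome"
    if not changed_code:
        # Same code, different outcome.
        return "definitely flaky"
    if changed_code.isdisjoint(executed_code):
        # Code changed, but the test did not run any of the changes.
        return "flaky"
    return "maybe flaky, maybe not"


print(classify_outcome_change(True, set(), {"Server.start"}))
print(classify_outcome_change(True, {"Parser.parse"}, {"Server.start"}))
print(classify_outcome_change(True, {"Server.start"}, {"Server.start"}))
```

The three calls correspond to the three cases on the slides: code not changed, code changed but not executed by the test, and code changed and executed by the test.
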
  10. DeFlaker

  11. DeFlaker

    • Answers the question “does a test run any changed code?”
    • Tracks coverage (of both SUT and test code)
    • Two key challenges:
      • What coverage granularity should we track?
      • Tracking coverage of all code can be expensive
  12. Hybrid Differential Coverage

    • Prior work: (precise) statement-level coverage of all code, or (coarse) class-level coverage of all code, possibly generating reports per-test
    • Hybrid:
      • If only statement changes to a class: track statements
      • If non-statement changes to class: track class level
    • Differential: Only coverage of changed statements/classes is tracked (no need to track all statements/classes)
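
The hybrid and differential rules above can be sketched as a function that decides what to monitor for each changed class. This is an illustrative sketch only: `ClassDiff`, `plan_monitoring`, and the line-number representation of statements are hypothetical names and assumptions, not DeFlaker's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Union


@dataclass
class ClassDiff:
    """Summary of how one class changed between versions (hypothetical structure)."""
    class_name: str
    changed_statement_lines: Set[int] = field(default_factory=set)
    has_non_statement_change: bool = False


def plan_monitoring(diffs: List[ClassDiff]) -> Dict[str, Union[str, List[int]]]:
    """Return what to track per class: "class" for class-level, or a list of lines."""
    plan: Dict[str, Union[str, List[int]]] = {}
    for diff in diffs:
        if diff.has_non_statement_change:
            # Hybrid rule: non-statement changes fall back to class-level tracking.
            plan[diff.class_name] = "class"
        elif diff.changed_statement_lines:
            # Hybrid rule: only statements changed, so track just those statements.
            plan[diff.class_name] = sorted(diff.changed_statement_lines)
        # Differential rule: unchanged classes are not tracked at all.
    return plan


plan = plan_monitoring([
    ClassDiff("Parser", changed_statement_lines={42, 17}),
    ClassDiff("Server", changed_statement_lines={8}, has_non_statement_change=True),
    ClassDiff("Util"),
])
print(plan)  # prints: {'Parser': [17, 42], 'Server': 'class'}
```

`Util` is absent from the plan because it did not change, which is the source of the savings over tracking coverage of all code.
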
  13. DeFlaker Overview

    DeFlaker tracks hybrid differential coverage: only tracking code that changed since the last execution, blending class-level and statement-level coverage
    • Differential Coverage Analysis (Before Test Execution): old version of codebase and new version of codebase → statements and classes to monitor at runtime
    • Coverage Collection (During Test Execution): statements and classes to monitor at runtime → changes executed by each test
    • Reporting (After Test Execution): changes executed by each test and previous test results → list of likely flaky tests
  14. DeFlaker: Installation

    <extensions>
      <extension>
        <groupId>org.deflaker</groupId>
        <artifactId>deflaker-maven-extension</artifactId>
        <version>1.4</version>
      </extension>
    </extensions>

    Maven extension automates the otherwise tedious task of adding multiple plugins and configuring Surefire
  15. Evaluation

    • How many flaky tests are found by DeFlaker/rerun?
    • What is the performance overhead of DeFlaker?
    • What is the benefit of DeFlaker’s hybrid coverage over just class-level coverage?
    • 2 methodologies
  16. Controlled Evaluation Environment

    • “Traditional” methodology for evaluating testing tools
    • Clone 26 projects from GitHub
    • Build/test 5,966 versions of these projects in our own environment (5 CPU-years)
    • Workflow: clone projects, then run mvn install
    • Problem: “Interesting” projects often have prerequisites to build
  17. Live Evaluation Environment

    • Use developers’ existing continuous integration pipeline to run our tool in the exact same environment that they normally use
    • 614 total versions of 96 projects
    • Allows for evaluation with much more complex target projects, increasing diversity and reducing confounding factors
    • Simplifies bug reporting
  18. Detecting Flaky Tests

    • Found 5,328 new failures
    • Rerunning failed tests can confirm the presence of a flaky test but not prove the absence of flakiness
    • Considered rerunning from 1 to 5 times, with different rerun strategies
    • Varying overhead between strategies
    Figure: number of flaky tests confirmed (0 to 6000) vs. number of reruns (0 to 5), by flaky detection strategy: DeFlaker (NO reruns needed!), Surefire, Surefire + Fork, Surefire + Fork + Reboot. Percentage labels on the chart: 23%, 83%, 100%, 96%
    Suggestion: If rerunning tests, focus more on changing the environment between reruns than on the number of reruns (or just use DeFlaker!)
  19. Coverage Collection Overhead

    Figure: coverage collection overhead (0% to 100%) of DeFlaker, Ekstazi, and JaCoCo on achilles, ambari, assertj-core, checkstyle, commons-exec, dropwizard, hector, httpcore, jackrabbit-oak, killbill, ninja, spring-boot, tachyon, togglz, undertow, wro4j, and zxing
    Excluded projects with high variance in testing time; executed 10 times per revision for 10 revisions per project
    DeFlaker: 5%, Ekstazi: 39%, JaCoCo: 58%
  20. Class vs Hybrid Coverage

    • Compared hybrid vs class coverage on 96 persistent, known flaky tests in 5,966 SHAs
    • Hybrid identifies 11 percentage points more known-flaky tests
    Percentage of known flaky tests reported flaky: Class 77%, Hybrid 88%
  21. DeFlaker: Live Evaluation

    Project         # Builds  New Fails  Confirmed Flakes  False Pos  Reported  Addressed
    Achilles               5          2                 2          0         2          2
    checkstyle            96          1                 1          0         1          1
    geoserver             60         39                39          0         1          0
    jackrabbit-oak        99          5                 5          0         2          1
    jmeter-plugins        19          1                 1          0         0          0
    killbill              31         26                26          0         1          0
    nutz                  87          1                 1          0         1          1
    presto               203         11                11          0         7          2
    quickml                2          2                 2          0         2          0
    togglz                12          3                 3          0         2          2
    Total                614         91                91          0        19          9
  22. DeFlaker: Automatically Detecting Flaky Tests

    Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov. George Mason University, University of Illinois at Urbana-Champaign, and Carnegie Mellon University. http://www.deflaker.org/
    Darko Marinov's group is supported by NSF grants CCF-1409423, CCF-1421503, CNS-1646305, and CNS-1740916; and gifts from Google and Qualcomm.
    Fork DeFlaker on GitHub