DeFlaker: Automatically Detecting Flaky Tests
Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung and Darko Marinov
George Mason University, University of Illinois at Urbana-Champaign and Carnegie Mellon University
http://www.deflaker.org
Fork DeFlaker on GitHub
Flaky test: a test which can pass or fail on the same version of code
[Figure: a developer makes changes to code and runs the tests; Test B fails on the original run (“Test failed!”) but passes on a rerun (“Test is OK!”), so: “Test is flaky!”]
Rerunning only raises more questions:
• Should I always rerun all of my tests?
• How many times?
• How long should I wait to rerun them?
[Figure: even after reruns the developer is left guessing: “Did this pass by chance?” “Did this fail by chance?” “I think it truly passed?” “It is flaky!”]
Why would the test outcome change right away when we rerun it?
• Maybe the machine running the test is overloaded?
• Rerunning takes time and resources!
• Even Google can’t rerun all failing tests
  https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
• What else can we do to identify flaky tests without reruns?
[Figure: timeline of a timing-dependent test: start test, start server, wait 3 seconds for the server to start, make request to server, test fails; “Server startup complete” arrives too late!]
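The timeline above is the classic timing-dependent pattern. Below is a minimal, hypothetical JUnit 4 sketch of it; the class, the randomized startup delay, and all names are ours, purely for illustration:

```java
import static org.junit.Assert.assertTrue;

import java.util.Random;
import java.util.concurrent.atomic.AtomicBoolean;

import org.junit.Test;

// Hypothetical test simulating the server-startup race in the figure above.
public class ServerStartupTest {

    private final AtomicBoolean serverReady = new AtomicBoolean(false);

    // Stand-in for starting a real server: startup finishes after a
    // load-dependent delay (randomized here to simulate machine load).
    private void startServerAsync() {
        new Thread(() -> {
            try {
                Thread.sleep(1000 + new Random().nextInt(4000)); // 1 to 5 seconds
            } catch (InterruptedException ignored) {
            }
            serverReady.set(true); // "Server startup complete"
        }).start();
    }

    @Test
    public void testServerResponds() throws Exception {
        startServerAsync();     // "Start server"
        Thread.sleep(3000);     // "Wait 3 seconds for server to start"
        // "Make request to server": fails whenever startup took more than 3s.
        assertTrue("Server not ready: too late!", serverReady.get());
    }
}
```

On a lightly loaded machine the 3-second wait usually wins the race and the test passes; under load it loses and the test fails, with no code change in between.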
• Test outcome changes and code changed (project version 1 → version 2): maybe flaky, maybe not
• Test outcome changes and code did NOT change (same project version): definitely flaky!
A test is flaky if it has different outcomes executing the same code.
• Even if project code changes: does the test run any changed code?
[Figure: code changed between project version 1 and version 2, but Test B didn’t run the changes, hence the test is flaky]
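Read as code, the rule on this slide might look like the following sketch; the method and type names are ours, not DeFlaker's API:

```java
import java.util.Collections;
import java.util.Set;

// Sketch of the decision rule above (illustrative names only).
public final class FlakinessCheck {

    public enum Outcome { PASS, FAIL }

    /**
     * A test is definitely flaky if its outcome changed between two runs
     * but it executed none of the code that changed between those runs.
     */
    public static boolean isDefinitelyFlaky(Outcome previousOutcome,
                                            Outcome currentOutcome,
                                            Set<String> codeCoveredByTest,
                                            Set<String> codeChangedInDiff) {
        boolean outcomeChanged = previousOutcome != currentOutcome;
        boolean ranChangedCode =
                !Collections.disjoint(codeCoveredByTest, codeChangedInDiff);
        return outcomeChanged && !ranChangedCode;
    }
}
```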
DeFlaker answers: “does a test run any changed code?”
• Tracks coverage (of both SUT and test code)
• Two key challenges:
  • What coverage granularity should we track?
  • Tracking coverage of all code can be expensive
• Prior work: (precise) statement-level coverage of all code, or (coarse) class-level coverage of all code, possibly generating reports per-test
• Hybrid:
  • If only statement changes to a class: track statements
  • If non-statement changes to a class: track at class level
• Differential: only coverage of changed statements/classes is tracked (no need to track all statements/classes); a sketch of this policy follows this list
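A small sketch of how such a hybrid, differential policy could pick a granularity per changed class; the change classification and all names are illustrative, not DeFlaker's internals:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch of the hybrid + differential policy.
public final class CoveragePlanner {

    public enum ChangeKind { STATEMENT, NON_STATEMENT /* e.g. signature, field, annotation */ }

    public record Change(String className, int line, ChangeKind kind) {}

    /**
     * For one changed class: if every edit was a statement edit, monitor only
     * those statements; any non-statement edit forces class-level monitoring.
     * Unchanged classes are not monitored at all (the "differential" part).
     */
    public static String planFor(String className, List<Change> changesToClass) {
        boolean onlyStatementEdits = changesToClass.stream()
                .allMatch(c -> c.kind() == ChangeKind.STATEMENT);
        if (onlyStatementEdits) {
            Set<Integer> lines = changesToClass.stream()
                    .map(Change::line)
                    .collect(Collectors.toSet());
            return "track statements " + lines + " of " + className;
        }
        return "track " + className + " at class level";
    }
}
```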
DeFlaker Overview
DeFlaker tracks hybrid differential coverage: only tracking code that changed since the last execution, blending class-level and statement-level coverage.
• Differential Coverage Analysis (before test execution): compares the old and new versions of the codebase to compute the statements and classes to monitor at runtime
• Reporting (after test execution): combines previous test results with the changes executed by each test to produce a list of likely flaky tests
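A type-level sketch of how the two phases could fit together; this is a hypothetical interface, not DeFlaker's actual API:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical shape of the two-phase pipeline described above.
public interface FlakyTestPipeline {

    /** Before test execution: diff the old and new versions of the codebase
     *  to compute the statements and classes to monitor at runtime. */
    Set<String> differentialCoverageAnalysis(String oldVersion, String newVersion);

    /** During test execution: record which monitored changes each test ran. */
    Map<String, Set<String>> executeTests(Set<String> toMonitor);

    /** After test execution: a test whose outcome differs from its previous
     *  result but which executed none of the changes is likely flaky. */
    List<String> reportLikelyFlaky(Map<String, Boolean> previousResults,
                                   Map<String, Boolean> currentResults,
                                   Map<String, Set<String>> changesExecutedByTest);
}
```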
Evaluation questions:
• How many flaky tests are found by DeFlaker vs. by rerunning?
• What is the performance overhead of DeFlaker?
• What is the benefit of DeFlaker’s hybrid coverage over just class-level coverage?
• Answered using 2 methodologies
Methodology 1: the typical methodology for evaluating testing tools
• Clone 26 projects from GitHub
• Build/test 5,966 versions of these projects in our own environment (5 CPU-years)
• Pipeline: clone projects, then run mvn install
• Problem: “interesting” projects often have pre-requisites to build
Methodology 2: use developers’ existing continuous integration pipeline to run our tool in the exact same environment that they normally use
• 614 total versions of 96 projects
• Allows for evaluation with much more complex target projects, increasing diversity and reducing confounding factors
• Simplifies bug reporting
Results: 5,328 new test failures
• Rerunning failed tests can confirm the presence of a flaky test, but cannot prove the absence of flakiness
• Considered rerunning from 1 to 5 times, with different rerun strategies
• Overhead varies between strategies
[Chart: number of flaky tests confirmed (0-6,000) vs. number of reruns (0-5) for four flaky-detection strategies: DeFlaker (NO reruns needed!), Surefire, Surefire + Fork, Surefire + Fork + Reboot; labeled rates: 23%, 83%, 96%, 100%]
Suggestion: if rerunning tests, focus more on changing the environment between reruns than on the number of reruns. (Or just use DeFlaker!)
• Compared hybrid vs. class coverage on 96 persistent, known flaky tests in 5,966 SHAs
• Hybrid identifies 11 percentage points more known-flaky tests
[Chart: percentage of known flaky tests reported flaky: Class 77%, Hybrid 88%]
DeFlaker: Automatically Detecting Flaky Tests
Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung and Darko Marinov
George Mason University, University of Illinois at Urbana-Champaign and Carnegie Mellon University
http://www.deflaker.org/
Darko Marinov's group is supported by NSF grants CCF-1409423, CCF-1421503, CNS-1646305, and CNS-1740916; and gifts from Google and Qualcomm.
Fork DeFlaker on GitHub