DeFlaker: Automatically Detecting Flaky Tests
Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov
George Mason University, University of Illinois at Urbana-Champaign, and Carnegie Mellon University
http://www.deflaker.org
ICSE, May 31, 2018
Regression Testing
The developer makes changes to the code and runs the tests. Test A passes ("Test is OK!") and Test B fails ("Test failed!"), so the developer concludes: "Ah-ha! I found that I introduced a regression!"
Flaky Tests
The developer makes changes to the code and runs the tests. Test B fails on the original run but passes on a rerun: "Test is flaky!"
Flaky test: a test that can pass or fail on the same version of code.
Flaky Tests Mean Flaky Builds
Analysis of 935 failing builds on Travis CI [Labuschagne, Inozemtseva, and Holmes '17]: 13% of the failing builds were flaky, 87% were not flaky.
Rerunning Tests
"Did this test pass by chance?" "Did this test fail by chance?" Rerunning Test A still passes ("I think it truly passed?"); rerunning Test B now passes ("It is flaky!").
Should I always rerun all of my tests? How many times? How long should I wait to rerun them?
Example Flaky Test
Timeline: the test starts the server, waits 3 seconds for the server to start, and then makes a request to the server. If server startup completes too late, the request fails and so does the test.
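A minimal JUnit 4 sketch of this kind of timing-dependent flaky test; the embedded HTTP server, port, and class names are illustrative assumptions, not code from the talk or any studied project:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class ServerStartupTest {
    @Test
    public void testServerResponds() throws Exception {
        // Start the server on a background thread; startup time varies with machine load.
        new Thread(() -> {
            try {
                HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
                server.createContext("/", exchange -> {
                    exchange.sendResponseHeaders(200, -1); // 200 OK, no body
                    exchange.close();
                });
                server.start();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }).start();

        // Flaky: assumes 3 seconds is always enough for the server to come up.
        Thread.sleep(3000);

        // If the server is not up yet, this request fails and so does the test,
        // even though nothing is wrong with the code under test.
        HttpURLConnection conn =
            (HttpURLConnection) new URL("http://localhost:8080/").openConnection();
        assertEquals(200, conn.getResponseCode());
    }
}
```

On a lightly loaded machine the test passes; on an overloaded one the fixed 3-second wait is too short, so the same code sometimes passes and sometimes fails.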
Problems with Rerun
• Why would the test outcome change right away when we rerun it?
• Maybe the machine running the test is overloaded?
• Rerunning takes time and resources!
• Even Google can't rerun all failing tests: https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
• What else can we do to identify flaky tests without reruns?
Detecting Flaky Tests
• Test outcome changes between project version 1 and project version 2, and the code changed: maybe flaky, maybe not.
• Test outcome changes on the same project version, and the code did NOT change: definitely flaky!
Detecting Flaky Tests
• A test is flaky if it has different outcomes executing the same code.
• Even if project code changes: does the test run any changed code? If the code changed but the test did not run any of the changes, the test is flaky (see the sketch below).
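The core check is simple enough to sketch. The names below (Outcome, isLikelyFlaky) are invented for illustration; this is not DeFlaker's actual code:

```java
import java.util.Set;

public class FlakinessJudge {
    // Possible outcomes recorded for a test run (placeholder type).
    enum Outcome { PASS, FAIL }

    /**
     * A test whose outcome differs from the previous run is reported as
     * likely flaky if it executed none of the code that changed between
     * the two versions; no rerun is needed to reach this verdict.
     */
    static boolean isLikelyFlaky(Outcome previous, Outcome current,
                                 Set<String> changedCodeCoveredByTest) {
        boolean outcomeChanged = previous != current;
        boolean ranChangedCode = !changedCodeCoveredByTest.isEmpty();
        return outcomeChanged && !ranChangedCode;
    }
}
```

For example, a test that newly fails but covered no statement or class in the diff is flagged as flaky immediately.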
DeFlaker
• Answers the question "does a test run any changed code?"
• Tracks coverage (of both SUT and test code)
• Two key challenges:
  • What coverage granularity should we track?
  • Tracking coverage of all code can be expensive
Hybrid Differential Coverage
• Prior work tracks either (precise) statement-level coverage of all code or (coarse) class-level coverage of all code, possibly generating reports per test.
• Hybrid (sketched below):
  • If a class has only statement-level changes: track those statements
  • If a class has non-statement changes: track coverage at the class level
• Differential: only coverage of changed statements/classes is tracked (no need to track all statements/classes)
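A rough sketch of that granularity choice, assuming a hypothetical ChangedClass summary produced by diffing the two versions; the field and method names are invented for illustration, not DeFlaker's API:

```java
import java.util.ArrayList;
import java.util.List;

public class GranularityPlanner {
    // Hypothetical summary of the diff for one class.
    static class ChangedClass {
        String name;
        boolean hasNonStatementChanges;          // e.g., changed field initializer or class header
        List<Integer> changedStatementLines = new ArrayList<>();
    }

    // Decide, per changed class, what to instrument:
    //   only statement-level changes -> track just those statements;
    //   any non-statement change     -> fall back to class-level coverage.
    // Unchanged classes are never instrumented (the "differential" part).
    static List<String> probesToMonitor(List<ChangedClass> diff) {
        List<String> probes = new ArrayList<>();
        for (ChangedClass c : diff) {
            if (c.hasNonStatementChanges) {
                probes.add("class:" + c.name);
            } else {
                for (int line : c.changedStatementLines) {
                    probes.add("stmt:" + c.name + ":" + line);
                }
            }
        }
        return probes;
    }
}
```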
DeFlaker Overview
DeFlaker tracks hybrid differential coverage: it tracks only code that changed since the last execution, blending class-level and statement-level coverage. The overall flow (a rough sketch follows):
• Differential Coverage Analysis (Before Test Execution): compares the old and new versions of the codebase to compute the statements and classes to monitor at runtime.
• Coverage Collection (During Test Execution): records the changes executed by each test.
• Reporting (After Test Execution): combines that coverage with previous test results to produce a list of likely flaky tests.
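A high-level sketch of how the three phases could fit together; the interfaces and method names here are placeholders for illustration, not DeFlaker's real API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FlakyTestPipeline {
    // Placeholder interfaces for the three phases.
    interface DiffAnalyzer { List<String> probesToMonitor(String oldVersion, String newVersion); }
    interface TestRunner   { Map<String, Set<String>> runTestsCollectingCoverage(List<String> probes); }
    interface ResultStore  { boolean passedBefore(String test); boolean passedNow(String test); }

    static List<String> likelyFlakyTests(DiffAnalyzer analyzer, TestRunner runner, ResultStore results,
                                         String oldVersion, String newVersion) {
        // 1. Before test execution: compute which statements/classes changed and must be monitored.
        List<String> probes = analyzer.probesToMonitor(oldVersion, newVersion);
        // 2. During test execution: record which of those changed elements each test covers.
        Map<String, Set<String>> changesCoveredPerTest = runner.runTestsCollectingCoverage(probes);
        // 3. After test execution: a test whose outcome changed but that covered no changed code
        //    is reported as likely flaky.
        List<String> likelyFlaky = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : changesCoveredPerTest.entrySet()) {
            String test = entry.getKey();
            boolean outcomeChanged = results.passedBefore(test) != results.passedNow(test);
            if (outcomeChanged && entry.getValue().isEmpty()) {
                likelyFlaky.add(test);
            }
        }
        return likelyFlaky;
    }
}
```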
Evaluation
• How many flaky tests are found by DeFlaker/rerun?
• What is the performance overhead of DeFlaker?
• What is the benefit of DeFlaker's hybrid coverage over just class-level coverage?
• Two methodologies
Controlled Evaluation Environment
• "Traditional" methodology for evaluating testing tools: clone the projects, then run mvn install.
• Cloned 26 projects from GitHub.
• Built/tested 5,966 versions of these projects in our own environment (5 CPU-years).
• Problem: "interesting" projects often have prerequisites to build.
Live Evaluation Environment
• Use developers' existing continuous integration pipeline to run our tool in the exact same environment that they normally use.
• 614 total versions of 96 projects.
• Allows for evaluation with much more complex target projects, increasing diversity and reducing confounding factors.
• Simplifies bug reporting.
Detecting Flaky Tests
• Found 5,328 new failures.
• Rerunning failed tests can confirm the presence of a flaky test but not prove the absence of flakiness.
• Considered rerunning from 1 to 5 times with different rerun strategies (Surefire, Surefire + Fork, Surefire + Fork + Reboot), each with varying overhead; DeFlaker needs NO reruns.
[Chart: number of flaky tests confirmed (0 to 6,000) versus number of reruns (0 to 5), per flaky-detection strategy; data labels 23%, 83%, 96%, and 100%.]
• Suggestion: if rerunning tests, focus more on changing the environment between reruns than on the number of reruns (or just use DeFlaker!)
Class vs Hybrid Coverage
• Compared hybrid vs class coverage on 96 persistent, known flaky tests across 5,966 SHAs.
• Class-level coverage reports 77% of the known flaky tests as flaky; hybrid coverage reports 88%, i.e., 11 percentage points more.
DeFlaker: Automatically Detecting Flaky Tests
Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov
George Mason University, University of Illinois at Urbana-Champaign, and Carnegie Mellon University
http://www.deflaker.org/
Darko Marinov's group is supported by NSF grants CCF-1409423, CCF-1421503, CNS-1646305, and CNS-1740916, and by gifts from Google and Qualcomm.