ICSE18 - DeFlaker

Slides for talk at ICSE18 on Automatically Detecting Flaky Tests.

Michael Hilton

May 31, 2018

Transcript

  1. ICSE May 31, 2018
    DeFlaker: Automatically
    Detecting Flaky Tests
    Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi,
    Tifany Yung and Darko Marinov
    George Mason University, University of Illinois at Urbana-Champaign and Carnegie Mellon University
    http://www.deflaker.org
    Fork DeFlaker on GitHub


  2. Regression Testing
    [Diagram] Developer makes changes to code, then runs tests. Test A: "Test is OK!" Test B: "Test failed!" Developer: "Ah-ha! I found that I introduced a regression!"


  3. Flaky Tests
    [Diagram] Developer makes changes to code, then runs tests. Test A: "Test is OK!" Original run of Test B: "Test failed!" Rerun of Test B: "Test is OK!" Conclusion: "Test is flaky!"
    Flaky test: a test that can pass or fail on the same version of code.


  4. Flaky Tests Mean Flaky Builds
    [Pie chart] Analysis of 935 failing builds on Travis CI: 13% flaky, 87% not flaky.
    [Labuschagne, Inozemtseva, and Holmes '17]


  5. Rerunning Tests
    [Diagram] Test A passes: "Did this pass by chance?" After reruns also pass: "I think it truly passed?" Test B fails: "Did this fail by chance?" A rerun of Test B passes: "It is flaky!"
    Should I always rerun all of my tests? How many times? How long should I wait to rerun them?
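    The rerun approach these questions poke at can be sketched as a small loop. This is a hypothetical illustration with invented names, not DeFlaker's code: rerun a failing test up to N times and declare it flaky if it ever passes.

    ```java
    import java.util.function.BooleanSupplier;

    public class RerunStrategy {
        // Hypothetical sketch of rerun-based flakiness detection (the approach
        // this slide questions): rerun a failing test up to maxReruns times;
        // if it ever passes, its outcome differs on the same code, so call it flaky.
        static boolean looksFlaky(BooleanSupplier test, int maxReruns) {
            for (int i = 0; i < maxReruns; i++) {
                if (test.getAsBoolean()) {
                    return true;   // passed on a rerun: likely flaky
                }
            }
            return false;          // failed every rerun: looks like a real failure
        }

        public static void main(String[] args) {
            int[] calls = {0};
            // Simulated test that fails on its first rerun, then passes.
            BooleanSupplier sometimesPasses = () -> ++calls[0] > 1;
            System.out.println(looksFlaky(sometimesPasses, 3));   // prints: true
        }
    }
    ```

    Note the cost this loop makes concrete: in the worst case the test runs maxReruns extra times, and a rerun that passes may itself have passed by chance.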


  6. Example Flaky Test
    [Timeline] The test starts the server, waits 3 seconds for it to start, then makes a request to the server. Server startup completes too late: the test fails!
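    A minimal, self-contained sketch of this pattern (hypothetical names, with timings scaled down from the slide's 3 seconds): the test waits a fixed interval for a simulated server, so it passes or fails depending on how fast the server happens to start, not on the code under test.

    ```java
    import java.util.concurrent.CountDownLatch;

    public class FixedWaitExample {
        // Hypothetical illustration of the slide's flaky pattern: start a
        // simulated server, wait a FIXED interval, then check whether startup
        // completed. The outcome depends on timing, not on the code's correctness.
        static boolean testWithFixedWait(long startupMillis, long waitMillis) {
            CountDownLatch serverReady = new CountDownLatch(1);
            Thread server = new Thread(() -> {
                try {
                    Thread.sleep(startupMillis);   // simulated startup work
                } catch (InterruptedException ignored) { }
                serverReady.countDown();           // "server startup complete"
            });
            server.setDaemon(true);
            server.start();
            try {
                Thread.sleep(waitMillis);          // fixed wait, hoping it is enough
            } catch (InterruptedException ignored) { }
            return serverReady.getCount() == 0;    // "make request": is it up yet?
        }

        public static void main(String[] args) {
            System.out.println(testWithFixedWait(100, 300));   // fast startup: true
            System.out.println(testWithFixedWait(1000, 300));  // slow startup: false
        }
    }
    ```

    A common fix is to poll for readiness (for example, await the latch with a timeout) instead of sleeping a fixed interval.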


  7. Problems with Rerun
    • Why would the test outcome change right away when we rerun it?
    • Maybe the machine running the test is overloaded?
    • Rerunning takes time and resources!
    • Even Google can't rerun all failing tests
      https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
    • What else can we do to identify flaky tests without reruns?


  8. Detecting Flaky Tests
    [Diagram] Test B run on project version 1 and again on project version 2: the test outcome changes and the code changed, so it is maybe flaky, maybe not. Test B run twice on project version 1: the test outcome changes but the code did NOT change, so it is definitely flaky!


  9. Detecting Flaky Tests
    • A test is flaky if it has different outcomes executing the same code
    • Even if project code changes: does the test run any changed code?
    [Diagram] Test B run on project version 1 and version 2: the code changed, but the test didn't run the changes, hence the test is flaky.
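    The decision on this slide reduces to a simple predicate. This is a hedged sketch with hypothetical names, not DeFlaker's actual API: a test whose outcome flips without having executed any changed code is flagged as flaky.

    ```java
    import java.util.Collections;
    import java.util.Set;

    public class FlakyCheck {
        // Hypothetical sketch of the slide's rule (not DeFlaker's real API):
        // flag a test as likely flaky if its outcome differs from the previous
        // run AND it executed none of the code that changed between versions.
        static boolean isLikelyFlaky(boolean outcomeChanged,
                                     Set<String> changedCodeCovered) {
            return outcomeChanged && changedCodeCovered.isEmpty();
        }

        public static void main(String[] args) {
            // Outcome flipped, but the test ran no changed code: flaky.
            System.out.println(isLikelyFlaky(true, Collections.emptySet())); // true
            // Outcome flipped and the test ran a changed class: maybe a real regression.
            System.out.println(isLikelyFlaky(true, Set.of("FooService")));   // false
        }
    }
    ```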


  10. DeFlaker


  11. DeFlaker
    • Answers the question “does a test run any changed code?”
    • Tracks coverage (of both SUT and test code)
    • Two key challenges:
    • What coverage granularity should we track?
    • Tracking coverage of all code can be expensive


  12. Hybrid Differential Coverage
    • Prior work: (precise) statement-level coverage of all code, or (coarse) class-level coverage of all code, possibly generating reports per-test
    • Hybrid:
      • If a class has only statement-level changes: track statements
      • If a class has non-statement changes: track at the class level
    • Differential: only coverage of changed statements/classes is tracked (no need to track all statements/classes)
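    The hybrid rule above can be sketched as a per-class granularity choice. The names here are hypothetical; DeFlaker's real analysis works on the diff between two source versions.

    ```java
    public class HybridGranularity {
        enum Granularity { STATEMENT, CLASS }

        // Hedged sketch of the hybrid rule, with hypothetical names: for each
        // changed class, track individual changed statements when only
        // statement-level edits occurred, and fall back to whole-class tracking
        // when the change is structural (e.g. changed fields or signatures).
        static Granularity chooseFor(boolean hasNonStatementChanges) {
            return hasNonStatementChanges ? Granularity.CLASS
                                          : Granularity.STATEMENT;
        }

        public static void main(String[] args) {
            System.out.println(chooseFor(false));  // STATEMENT: only edited statements
            System.out.println(chooseFor(true));   // CLASS: structural change
        }
    }
    ```

    The trade-off the rule captures: statement-level tracking is more precise but only cheap to apply when the diff pinpoints which statements changed.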


  13. DeFlaker Overview
    DeFlaker tracks hybrid differential coverage: only tracking code that changed since the last execution, blending class-level and statement-level coverage.
    [Diagram] Differential Coverage Analysis (before test execution) compares the old and new versions of the codebase to find the statements and classes to monitor at runtime. Coverage Collection (during test execution) records the changes executed by each test. Reporting (after test execution) combines that coverage with previous test results to produce a list of likely flaky tests.


  14. DeFlaker: Installation
    <extension>
      <groupId>org.deflaker</groupId>
      <artifactId>deflaker-maven-extension</artifactId>
      <version>1.4</version>
    </extension>
    The Maven extension automates the otherwise tedious task of adding multiple plugins and configuring Surefire.


  15. Evaluation
    • How many flaky tests are found by DeFlaker/rerun?
    • What is the performance overhead of DeFlaker?
    • What is the benefit of DeFlaker's hybrid coverage over just class-level coverage?
    • Two methodologies


  16. Controlled Evaluation Environment
    • "Traditional" methodology for evaluating testing tools
    • Clone 26 projects from GitHub
    • Build/test 5,966 versions of these projects in our own environment (5 CPU-years)
    [Diagram] Clone projects, then run mvn install.
    Problem: "Interesting" projects often have prerequisites to build


  17. Live Evaluation Environment
    • Use developers' existing continuous integration pipeline to run our tool in the exact same environment that they normally use
    • 614 total versions of 96 projects
    • Allows for evaluation with much more complex target projects, increasing diversity and reducing confounding factors
    • Simplifies bug reporting


  18. Detecting Flaky Tests
    • Found 5,328 new failures
    • Rerunning failed tests can confirm the presence of a flaky test but not prove the absence of flakiness
    • Considered rerunning from 1-5 times, with different rerun strategies
    • Varying overhead between strategies
    [Chart] Number of flaky tests confirmed (0-6,000) vs. number of reruns (0-5), per flaky-detection strategy: DeFlaker confirms 96% with NO reruns needed; with reruns, Surefire reaches 23%, Surefire + Fork 83%, and Surefire + Fork + Reboot 100%.
    Suggestion: If rerunning tests, focus more on changing the environment between reruns than on the number of reruns (or just use DeFlaker!)


  19. Coverage Collection Overhead
    [Chart] Per-project testing-time overhead (0-100%) of DeFlaker, Ekstazi, and JaCoCo on achilles, ambari, assertj-core, checkstyle, commons-exec, dropwizard, hector, httpcore, jackrabbit-oak, killbill, ninja, spring-boot, tachyon, togglz, undertow, wro4j, and zxing.
    Overall: DeFlaker 5%, Ekstazi 39%, JaCoCo 58%.
    Excluded projects with high variance in testing time; executed 10 times per revision for 10 revisions per project.


  20. Class vs Hybrid Coverage
    • Compared hybrid vs class coverage on 96 persistent, known flaky tests in 5,966 SHAs
    • Hybrid identifies 11 percentage points more known-flaky tests
    [Chart] Percentage of known flaky tests reported flaky: Class 77%, Hybrid 88%.


  21. DeFlaker: Live Evaluation
    Project         # Builds  New Fails  Confirmed Flakes  False Pos  Reported  Addressed
    Achilles               5          2                 2          0         2          2
    checkstyle            96          1                 1          0         1          1
    geoserver             60         39                39          0         1          0
    jackrabbit-oak        99          5                 5          0         2          1
    jmeter-plugins        19          1                 1          0         0          0
    killbill              31         26                26          0         1          0
    nutz                  87          1                 1          0         1          1
    presto               203         11                11          0         7          2
    quickml                2          2                 2          0         2          0
    togglz                12          3                 3          0         2          2
    Total                614         91                91          0        19          9


  22. DeFlaker: Automatically Detecting Flaky Tests
    Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung and Darko Marinov
    George Mason University, University of Illinois at Urbana-Champaign and Carnegie Mellon University
    http://www.deflaker.org/
    Darko Marinov's group is supported by NSF grants CCF-1409423, CCF-1421503, CNS-1646305, and CNS-1740916; and gifts from Google and Qualcomm.
    Fork DeFlaker on GitHub
