
Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result Quality


Research paper on software microbenchmark reconfiguration to reduce execution time while maintaining result quality by Christoph Laaber, Stefan Würsten, Harald C. Gall, and Philipp Leitner. Presented at the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '20), November 8–13, 2020, Virtual Event, USA.


Christoph Laaber

November 01, 2020

Transcript

  1. Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result

    Quality Christoph Laaber, Stefan Würsten, Harald C. Gall, Philipp Leitner software evolution & architecture lab Research Papers 34th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ‘20), November 8 – 13, 2020, Virtual Event, USA @ChristophLaaber laaber@ifi.uzh.ch http://t.uzh.ch/13k
  2. Libraries and Frameworks Why Software Performance Matters! Industry Latency Revenue

    use Research Harder to Debug Longer Undiscovered [Jin et al., PLDI’12] [Zaman et al., MSR’12] [Liu et al., ICSE’14] Longer to Fix impact Christoph Laaber, laaber@ifi.uzh.ch 2
  3. Industry Latency Revenue Research Harder to Debug Longer Undiscovered [Jin

    et al., PLDI’12] [Zaman et al., MSR’12] [Liu et al., ICSE’14] Longer to Fix Libraries and Frameworks use impact One Potential Solution Software Microbenchmarks Christoph Laaber, laaber@ifi.uzh.ch 3
  4. What are Software Microbenchmarks? Execution Configuration Implementation Performance Test Unit

    test equivalent Granularity: statement method Christoph Laaber, laaber@ifi.uzh.ch 4
  5. Performance Test Unit test equivalent Granularity: statement method What are

    Software Microbenchmarks? 1s wi1 wi2 wi3 wi4 wi5 Christoph Laaber, laaber@ifi.uzh.ch 5
  6. Performance Test Unit test equivalent Granularity: statement method What are

    Software Microbenchmarks? wi1 wi2 wi3 wi4 wi5 i1 i2 i3 i4 i5 20 ns 18 ns 23 ns 21 ns 20 ns Christoph Laaber, laaber@ifi.uzh.ch 6
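In plain Java, the aggregation of the per-iteration averages shown on this slide (20, 18, 23, 21, 20 ns) can be sketched as follows; the class and helper names are ours for illustration, not part of the JMH API:

```java
import java.util.Arrays;

// Summarizing the per-iteration results of one microbenchmark fork.
// The five values are the iteration averages from the slide (in ns);
// mean(...) and sampleStdDev(...) are our own helpers, not JMH API.
public class IterationStats {

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double ss = Arrays.stream(xs).map(x -> (x - m) * (x - m)).sum();
        return Math.sqrt(ss / (xs.length - 1));
    }

    public static void main(String[] args) {
        double[] iters = {20, 18, 23, 21, 20}; // i1..i5 in ns
        System.out.printf("mean = %.1f ns, cv = %.3f%n",
                mean(iters), sampleStdDev(iters) / mean(iters));
        // mean = 20.4 ns; cv = stddev/mean ≈ 0.089
    }
}
```

The coefficient of variation computed here is the same statistic that later reappears as one of the stoppage criteria.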
  7. Performance Test Unit test equivalent Granularity: statement method What are

    Software Microbenchmarks? wi1 wi2 wi3 wi4 wi5 f1 i1 i2 i3 i4 i5 f2 f3 Christoph Laaber, laaber@ifi.uzh.ch 7
  8. Performance Test Unit test equivalent Granularity: statement method What are

    Software Microbenchmarks? wi1 wi2 wi3 wi4 wi5 f1 i1 i2 i3 i4 i5 f2 f3 density result values Latency Throughput Stability Christoph Laaber, laaber@ifi.uzh.ch 8
  9. Challenges [Huang et al., ICSE’14] Long benchmark suite runtimes Up

    to multiple hours or even days. [Laaber et al., MSR’18] [Stefan et al., ICPE’17] High performance variability, measurement bias, and many unstable microbenchmarks [Laaber et al., MSR’18] [Maricq et al., OSDI’18] [Mytkowicz et al., ASPLOS’09] Pre-Study: 110 (15%) GitHub projects with runtimes > 3h Christoph Laaber, laaber@ifi.uzh.ch 9
  10. Configuration Tradeoff Few Repetitions Many Repetitions Unstable Stable

    Fast Slow Stability Runtime [figure: result distributions for increasing repetition counts, roughly 2, 5, 20, and 50] Christoph Laaber, laaber@ifi.uzh.ch 11
  11. Observations and Idea wi1 wi2 wi3 wi4 wi5 f1 i1

    i2 i3 i4 i5 f2 f3 X X X X X X 1. Different forks might be in steady-state at different points Stable Christoph Laaber, laaber@ifi.uzh.ch 12
  12. Observations and Idea wi1 wi2 wi3 wi4 wi5 f1 i1

    i2 i3 i4 i5 f2 f3 Stable 2. Unnecessary forks 1. Different forks might be in steady-state at different points Christoph Laaber, laaber@ifi.uzh.ch 13
  13. Observations and Idea wi1 wi2 wi3 wi4 wi5 f1 i1

    i2 i3 i4 i5 f2 f3 Stable 2. Unnecessary forks X X X X X 3. Manual configuration is required for every benchmark and execution environment 1. Different forks might be in steady-state at different points Christoph Laaber, laaber@ifi.uzh.ch 14
  14. 1. Different forks might be in steady-state at different points

    3. Manual configuration is required for every benchmark and execution environment Observations and Idea wi1 wi2 wi3 wi4 wi5 f1 i1 i2 i3 i4 i5 f2 f3 Stable 2. Unnecessary forks X X X X X Dynamic, data-driven decision when to stop microbenchmark executions Christoph Laaber, laaber@ifi.uzh.ch 15
  15. Approach -- Static Configuration (JMH) wi 6 wi 7 wi

    8 wi 9 wi 10 f 1 i 1 i 2 i 3 i 4 i 5 f 2 f 3 wi 1 wi 2 wi 3 wi 4 wi 5 i 6 i 7 i 8 i 9 i 10 f 4 f 5 Christoph Laaber, laaber@ifi.uzh.ch 16
  16. Approach -- Dynamic Reconfiguration f1 i1 i2 i3 i4 i5

    Minimum number of warmup iterations Christoph Laaber, laaber@ifi.uzh.ch 17
  17. Approach -- Dynamic Reconfiguration f1 i1 i2 i3 i4 i5

    Stoppage Point Stable Unstable ? Sliding Window Stoppage Criteria: 1. Coefficient of variation 2. Relative confidence interval width 3. Kullback-Leibler divergence [He et al., FSE’19] Christoph Laaber, laaber@ifi.uzh.ch 18
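A minimal sketch of the first stoppage criterion, the coefficient of variation computed over a sliding window of the most recent iteration results; the window size and 5% threshold below are illustrative placeholders, not the parameters used in the paper:

```java
import java.util.Arrays;

// Sketch of the coefficient-of-variation (CV) stoppage criterion:
// look at a sliding window of the most recent iteration results and
// declare the benchmark stable once their CV falls below a threshold.
// WINDOW and THRESHOLD are illustrative values, not the paper's.
public class CvStoppage {
    static final int WINDOW = 5;
    static final double THRESHOLD = 0.05; // 5% relative variability

    static boolean stable(double[] results) {
        if (results.length < WINDOW) return false; // window not filled yet
        double[] w = Arrays.copyOfRange(results, results.length - WINDOW, results.length);
        double mean = Arrays.stream(w).average().orElse(0);
        double var = Arrays.stream(w).map(x -> (x - mean) * (x - mean)).sum() / (WINDOW - 1);
        return Math.sqrt(var) / mean < THRESHOLD;
    }

    public static void main(String[] args) {
        double[] noisy = {20, 18, 23, 21, 20};             // CV ≈ 0.089 -> keep running
        double[] settled = {20.1, 20.0, 20.2, 20.0, 20.1}; // CV well below 5% -> stop
        System.out.println(stable(noisy));   // false
        System.out.println(stable(settled)); // true
    }
}
```

The other two criteria (relative confidence interval width, KL divergence) plug into the same sliding-window decision point with different statistics.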
  18. Approach -- Dynamic Reconfiguration i6 f1 i1 i2 i3 i4

    i5 Sliding Window Stoppage Point Stable Unstable ? Christoph Laaber, laaber@ifi.uzh.ch 19
  19. Approach -- Dynamic Reconfiguration i6 i7 f1 i1 i2 i3

    i4 i5 Sliding Window Stoppage Point Stable Unstable ? Christoph Laaber, laaber@ifi.uzh.ch 20
  20. Approach -- Dynamic Reconfiguration i6 i7 i8 i9 i10 f1

    i11 i12 i13 i14 i15 i1 i2 i3 i4 i5 i16 i17 Fixed number of measurement iterations Christoph Laaber, laaber@ifi.uzh.ch 21
  21. Approach -- Dynamic Reconfiguration i6 i7 i8 i9 i10 f1

    i11 i12 i13 i14 i15 i1 i2 i3 i4 i5 i16 i17 i18 i19 i20 Skipped iterations Christoph Laaber, laaber@ifi.uzh.ch 22
  22. Approach -- Dynamic Reconfiguration i6 i7 i8 i9 i10 f1

    i11 i12 i13 i14 i15 i1 i2 i3 i4 i5 i16 i17 i18 i19 i20 f2 f2 Minimum number of forks Christoph Laaber, laaber@ifi.uzh.ch 23
  23. Approach -- Dynamic Reconfiguration i6 i7 i8 i9 i10 f1

    i11 i12 i13 i14 i15 i1 i2 i3 i4 i5 i16 i17 i18 i19 i20 f2 f2 Stoppage Point Stable Unstable ? Stoppage Criteria: 1. Coefficient of variation 2. Relative confidence interval width 3. Kullback-Leibler divergence [He et al., FSE’19] Christoph Laaber, laaber@ifi.uzh.ch 24
  24. Approach -- Dynamic Reconfiguration i6 i7 i8 i9 i10 f1

    i11 i12 i13 i14 i15 i1 i2 i3 i4 i5 i16 i17 i18 i19 i20 f2 f2 f3 f4 Stoppage Point Stable Unstable ? End of Execution! Christoph Laaber, laaber@ifi.uzh.ch 25
  25. Approach -- Dynamic Reconfiguration i 6 i 7 i 8

    i 9 i 10 f 1 i 11 i 12 i 13 i 14 i 15 i 1 i 2 i 3 i 4 i 5 i 16 i 17 i 18 i 19 i 20 f 2 f 2 f 3 f 4 f 5 Skipped iterations + skipped forks Christoph Laaber, laaber@ifi.uzh.ch 26
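The iteration-level part of the walkthrough above can be sketched as a plain-Java control loop; `runFork`, the toy timings, the toy criterion, and the minimum/maximum bounds are all illustrative stand-ins, not the paper's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the dynamic-reconfiguration control loop for one fork:
// run a minimum number of warmup iterations (discarded), then keep
// measuring until a stoppage criterion declares the results stable
// (the stoppage point) or an upper bound is hit; remaining iterations
// are skipped. The same idea applies one level up across forks.
// `timings` stands in for per-iteration results; bounds are illustrative.
public class DynamicReconfiguration {
    interface Criterion { boolean stable(List<Double> results); }

    static List<Double> runFork(double[] timings, int minWarmup,
                                int maxIters, Criterion c) {
        List<Double> results = new ArrayList<>();
        int i = minWarmup; // the first minWarmup iterations are warmup, discarded
        while (results.size() < maxIters && i < timings.length) {
            results.add(timings[i++]);
            if (c.stable(results)) break; // stoppage point: skip the rest
        }
        return results;
    }

    public static void main(String[] args) {
        // Toy per-iteration results (ns): converge to 20 after warmup.
        double[] t = {25, 22, 21, 20, 20, 20, 20, 20, 20, 20};
        // Toy criterion: stable once the last three results are identical.
        Criterion c = r -> r.size() >= 3
                && r.get(r.size() - 1).equals(r.get(r.size() - 2))
                && r.get(r.size() - 2).equals(r.get(r.size() - 3));
        List<Double> res = runFork(t, 2, 10, c);
        System.out.println(res.size() + " measurement iterations executed");
    }
}
```

In this toy run the criterion fires after four measurement iterations, so the remaining ones are skipped, which is exactly the source of the time savings.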
  26. Evaluation -- Research Questions Static Configuration (JMH) Dynamic Reconfiguration =

    ? How much time can be saved by dynamically reconfiguring software microbenchmarks? How does dynamic reconfiguration of software microbenchmarks affect their execution result? RQ 1 RQ 2 Christoph Laaber, laaber@ifi.uzh.ch 27
  27. Evaluation -- Methodology Execute All Benchmarks Study Objects Sample from

    Execution Data = ? RQ1: Stability RQ2: Runtime Savings 10 open-source Java / JMH projects # benchmarks: 31 – 1,381 Runtimes: 4h – 192h Static Configuration Bare-metal server 3 stoppage criteria: CV, KLD, RCIW JMH default configuration Dynamic Reconfiguration Christoph Laaber, laaber@ifi.uzh.ch 28
  28. How much time can be saved by dynamically reconfiguring software

    microbenchmarks? How does dynamic reconfiguration of software microbenchmarks affect their execution result? RQ 1 RQ 2 Christoph Laaber, laaber@ifi.uzh.ch 29
  29. RQ 1: Stability -- Method A/A Tests Mean Change Rate

    density Execution Time Static Configuration Dynamic Reconfiguration = ? Bootstrap Confidence Interval of the Ratio of Means Equal or Different x % difference Christoph Laaber, laaber@ifi.uzh.ch 30
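The bootstrap confidence interval of the ratio of means can be sketched as follows; the resample count, seed, and 3% equivalence margin are illustrative choices, not the paper's exact setup:

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of a bootstrap confidence interval for the ratio of means,
// as used to decide whether two executions of the same benchmark
// (an A/A comparison) are statistically equal. The 1,000 resamples
// and the 3% equivalence margin in main(...) are illustrative.
public class BootstrapRatio {
    static double mean(double[] xs) { return Arrays.stream(xs).average().orElse(0); }

    static double[] resample(double[] xs, Random rnd) {
        double[] r = new double[xs.length];
        for (int i = 0; i < xs.length; i++) r[i] = xs[rnd.nextInt(xs.length)];
        return r;
    }

    // Returns the [2.5%, 97.5%] bootstrap interval of mean(a) / mean(b).
    static double[] ratioCi(double[] a, double[] b, int iters, long seed) {
        Random rnd = new Random(seed);
        double[] ratios = new double[iters];
        for (int i = 0; i < iters; i++)
            ratios[i] = mean(resample(a, rnd)) / mean(resample(b, rnd));
        Arrays.sort(ratios);
        return new double[]{ratios[(int) (0.025 * iters)], ratios[(int) (0.975 * iters)]};
    }

    public static void main(String[] args) {
        double[] a = {20, 18, 23, 21, 20};
        double[] b = {21, 19, 22, 20, 21};
        double[] ci = ratioCi(a, b, 1000, 42);
        // "Equal" if the whole interval lies within 1 ± margin (margin illustrative).
        boolean equal = ci[0] > 0.97 && ci[1] < 1.03;
        System.out.printf("CI = [%.3f, %.3f], equal = %b%n", ci[0], ci[1], equal);
    }
}
```

With only five samples per side the interval is wide; the evaluation uses full iteration result sets, where the interval narrows accordingly.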
  30. RQ 1: Stability -- Results Stoppage Criteria Coefficient of variation

    Relative confidence interval width Kullback-Leibler divergence 78.8 % 87.6 % 79.6 % Equal A/A Tests Mean Change Rate 3.1 % 1.4 % 2.4 % Dynamic Reconfiguration hardly changes result stability often within measurement noise [Georges et al., OOPSLA ‘07] Christoph Laaber, laaber@ifi.uzh.ch 31
  31. How much time can be saved by dynamically reconfiguring software

    microbenchmarks? How does dynamic reconfiguration of software microbenchmarks affect their execution result? RQ 1 RQ 2 Stable results Christoph Laaber, laaber@ifi.uzh.ch 33
  32. How much time can be saved by dynamically reconfiguring software

    microbenchmarks? How does dynamic reconfiguration of software microbenchmarks affect their execution result? RQ 1 RQ 2 Stable results Christoph Laaber, laaber@ifi.uzh.ch 34
  33. RQ 2: Runtime -- Method Runtime Overhead Execute Approaches 1.

    Static Configuration (JMH) 2. Dynamic Reconfiguration + CV 3. Dynamic Reconfiguration + KLD 4. Dynamic Reconfiguration + RCIW Estimate Time Savings o = Dynamic Reconf. Static Conf. All benchmarks of Log4j 3 stoppage criteria: CV, KLD, RCIW Dynamic Reconfiguration Suite Runtimes Christoph Laaber, laaber@ifi.uzh.ch 35
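The time-savings estimate can be expressed as simple arithmetic; the formula below (the dynamic runtime inflated by the analysis overhead, compared against the static runtime) is our reading of the slide, and the numbers in `main` are made up for illustration:

```java
// Estimating suite-level time savings of dynamic reconfiguration.
// savings = 1 - (dynamic * (1 + overhead)) / static is our reading of
// the slide, not a formula quoted from the paper.
public class TimeSavings {
    static double savings(double staticHours, double dynamicHours, double overheadFraction) {
        return 1.0 - (dynamicHours * (1.0 + overheadFraction)) / staticHours;
    }

    public static void main(String[] args) {
        // Illustrative numbers: a 100 h static suite cut to 18 h of
        // dynamic execution, with ~1% analysis overhead (cf. the CV
        // criterion's 0.88% overhead on the next slide).
        System.out.printf("savings = %.1f%%%n", 100 * TimeSavings.savings(100, 18, 0.01));
        // -> savings = 81.8%
    }
}
```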
  34. RQ 2: Runtime -- Results 0.88 % 10.92 % 4.32

    % Runtime Overhead Time Savings 82.0 % 66.2 % 79.5 % Dynamic Reconfiguration substantially reduces runtime despite the overhead Stoppage Criteria Coefficient of variation Relative confidence interval width Kullback-Leibler divergence Christoph Laaber, laaber@ifi.uzh.ch 36
  35. How much time can be saved by dynamically reconfiguring software

    microbenchmarks? How does dynamic reconfiguration of software microbenchmarks affect their execution result? RQ 1 RQ 2 Stable results Substantial runtime savings despite the analysis overhead Christoph Laaber, laaber@ifi.uzh.ch 37
  36. What have we learned? Static configuration wastes precious benchmarking time

    It is worth the analysis overhead to reduce overall runtime Stoppage criteria choice depends on desired benchmark stability OSS suites have long runtimes and use default configuration Christoph Laaber, laaber@ifi.uzh.ch 38
  37. Research Recommendations Automatically select approach hyperparameters Combine Dynamic Reconfiguration

    with regression testing Christoph Laaber, laaber@ifi.uzh.ch 39
  38. @ChristophLaaber laaber@ifi.uzh.ch Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without

    Sacrificing Result Quality Christoph Laaber, Stefan Würsten, Harald C. Gall, Philipp Leitner software evolution & architecture lab https://doi.org/10.1145/3368089.3409683 http://t.uzh.ch/13k
  39. Paper, Scripts, and Data Replication package: https://doi.org/10.6084/m9.figshare.11944875 Paper: https://doi.org/10.1145/3368089.3409683 Preprint: http://t.uzh.ch/13k Tool: https://github.com/sealuzh/jmh

    Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result Quality Christoph Laaber University of Zurich Zurich, Switzerland laaber@ifi.uzh.ch Stefan Würsten University of Zurich Zurich, Switzerland stefan.wuersten@uzh.ch Harald C. Gall University of Zurich Zurich, Switzerland gall@ifi.uzh.ch Philipp Leitner Chalmers | University of Gothenburg Gothenburg, Sweden philipp.leitner@chalmers.se ABSTRACT Executing software microbenchmarks, a form of small-scale performance tests predominantly used for libraries and frameworks, is a costly endeavor. Full benchmark suites take up to multiple hours or days to execute, rendering frequent checks, e.g., as part of continuous integration (CI), infeasible. However, altering benchmark configurations to reduce execution time without considering the impact on result quality can lead to benchmark results that are not representative of the software’s true performance. We propose the first technique to dynamically stop software microbenchmark executions when their results are sufficiently stable. Our approach implements three statistical stoppage criteria and is capable of reducing Java Microbenchmark Harness (JMH) suite execution times by 48.4% to 86.0%. At the same time it retains the same result quality for 78.8% to 87.6% of the benchmarks, compared to executing the suite for the default duration. The proposed approach does not require developers to manually craft custom benchmark configurations; instead, it provides automated mechanisms for dynamic reconfiguration. This makes dynamic reconfiguration highly effective and efficient, potentially paving the way to the inclusion of JMH microbenchmarks in CI. CCS CONCEPTS • General and reference → Measurement; Performance; • Software and its engineering → Software performance; Software testing and debugging. KEYWORDS performance testing, software benchmarking, JMH, configuration ACM Reference Format: Christoph Laaber, Stefan Würsten, Harald C. Gall, and Philipp Leitner. 2020. Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result Quality. In Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’20), November 8–13, 2020, Virtual Event, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3368089.3409683 1 INTRODUCTION Performance testing enables automated assessment of software performance in the hope of catching degradations, such as slowdowns, in a timely manner. A variety of techniques exist, spanning from system-scale (e.g., load testing) to method or statement level, such as software microbenchmarking. For functional testing, CI has been a revelation, where (unit) tests are regularly executed to detect functional regressions as early as possible [22]. However, performance testing is not yet standard CI practice, although there would be a need for it [6, 36]. A major reason for not running performance tests on every commit is their long runtimes, often consuming multiple hours to days [24, 26, 32]. To lower the time spent in performance testing activities, previous research applied techniques to select which commits to test [24, 45] or which tests to run [3, 14], to prioritize tests that are more likely to expose slowdowns [39], and to stop load tests once they become repetitive [1, 2] or do not improve result accuracy [20]. However, none of these approaches is tailored to the characteristics of software microbenchmarks; none enables running full benchmark suites while reducing the overall runtime and still maintaining the same result quality. In this paper, we present the first approach to dynamically, i.e., during execution, decide when to stop the execution of software microbenchmarks. Our approach, dynamic reconfiguration, determines at different checkpoints whether a benchmark execution is stable and whether more executions are unlikely to improve the result accuracy. It builds on the concepts introduced by He et al. [20], applies them to software microbenchmarks, and generalizes the approach for any kind of stoppage criteria. To evaluate whether dynamic reconfiguration enables reducing execution time without sacrificing quality, we perform an experimental evaluation on ten Java open-source software (OSS) projects with benchmark suite sizes between 16 and 995 individual benchmarks, ranging from 4.31 to 191.81 hours. Our empirical evaluation comprises three different stoppage criteria, including the one from He et al. [20]. It assesses whether benchmarks executed with Christoph Laaber, laaber@ifi.uzh.ch 41