
Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result Quality


Research paper on software microbenchmark reconfiguration to reduce execution time while maintaining result quality by Christoph Laaber, Stefan Würsten, Harald C. Gall, and Philipp Leitner. Presented at the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '20), November 8–13, 2020, Virtual Event, USA.


Christoph Laaber

November 01, 2020

Transcript

  1. Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result

    Quality Christoph Laaber, Stefan Würsten, Harald C. Gall, Philipp Leitner software evolution & architecture lab Research Papers 34th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ‘20), November 8 – 13, 2020, Virtual Event, USA @ChristophLaaber laaber@ifi.uzh.ch http://t.uzh.ch/13k
  2. Libraries and Frameworks Why Software Performance Matters! Industry Latency Revenue

    use Research Harder to Debug Longer Undiscovered [Jin et al., PLDI’12] [Zaman et al., MSR’12] [Liu et al., ICSE’14] Longer to Fix impact Christoph Laaber, laaber@ifi.uzh.ch 2
  3. Industry Latency Revenue Research Harder to Debug Longer Undiscovered [Jin

    et al., PLDI’12] [Zaman et al., MSR’12] [Liu et al., ICSE’14] Longer to Fix Libraries and Frameworks use impact One Potential Solution Software Microbenchmarks Christoph Laaber, laaber@ifi.uzh.ch 3
  4. What are Software Microbenchmarks? Execution Configuration Implementation Performance Test Unit

    test equivalent Granularity: statement method Christoph Laaber, laaber@ifi.uzh.ch 4
  5. Performance Test Unit test equivalent Granularity: statement method What are

    Software Microbenchmarks? 1s wi1 wi2 wi3 wi4 wi5 Christoph Laaber, laaber@ifi.uzh.ch 5
  6. Performance Test Unit test equivalent Granularity: statement method What are

    Software Microbenchmarks? wi1 wi2 wi3 wi4 wi5 i1 i2 i3 i4 i5 20 ns 18 ns 23 ns 21 ns 20 ns Christoph Laaber, laaber@ifi.uzh.ch 6
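In plain Java, the aggregation of the per-iteration averages shown on this slide (20, 18, 23, 21, 20 ns) can be sketched as follows; the class and helper names are ours for illustration, not part of the JMH API:

```java
import java.util.Arrays;

// Summarizing the per-iteration results of one microbenchmark fork.
// The five values are the iteration averages from the slide (in ns);
// mean(...) and sampleStdDev(...) are our own helpers, not JMH API.
public class IterationStats {

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double ss = Arrays.stream(xs).map(x -> (x - m) * (x - m)).sum();
        return Math.sqrt(ss / (xs.length - 1));
    }

    public static void main(String[] args) {
        double[] iters = {20, 18, 23, 21, 20}; // i1..i5 in ns
        System.out.printf("mean = %.1f ns, cv = %.3f%n",
                mean(iters), sampleStdDev(iters) / mean(iters));
        // mean = 20.4 ns; cv = stddev/mean ≈ 0.089
    }
}
```

The coefficient of variation computed here is the same statistic that later reappears as one of the stoppage criteria.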
  7. Performance Test Unit test equivalent Granularity: statement method What are

    Software Microbenchmarks? wi1 wi2 wi3 wi4 wi5 f1 i1 i2 i3 i4 i5 f2 f3 Christoph Laaber, laaber@ifi.uzh.ch 7
  8. Performance Test Unit test equivalent Granularity: statement method What are

    Software Microbenchmarks? wi1 wi2 wi3 wi4 wi5 f1 i1 i2 i3 i4 i5 f2 f3 density result values Latency Throughput Stability Christoph Laaber, laaber@ifi.uzh.ch 8
  9. Challenges [Huang et al., ICSE’14] Long benchmark suite runtimes Up

    to multiple hours or even days. [Laaber et al., MSR’18] [Stefan et al., ICPE’17] High performance variability, measurement bias, and many unstable microbenchmarks [Laaber et al., MSR’18] [Maricq et al., OSDI’18] [Mytkowicz et al., ASPLOS’09] Pre-Study: 110 (15%) GitHub projects with runtimes > 3h Christoph Laaber, laaber@ifi.uzh.ch 9
  10. Configuration Tradeoff Few Repetitions Many Repetitions Unstable Stable

    Fast Slow Stability Runtime [figure: result distributions for increasing repetition counts, roughly 2, 5, 20, and 50] Christoph Laaber, laaber@ifi.uzh.ch 11
  11. Observations and Idea wi1 wi2 wi3 wi4 wi5 f1 i1

    i2 i3 i4 i5 f2 f3 X X X X X X 1. Different forks might be in steady-state at different points Stable Christoph Laaber, laaber@ifi.uzh.ch 12
  12. Observations and Idea wi1 wi2 wi3 wi4 wi5 f1 i1

    i2 i3 i4 i5 f2 f3 Stable 2. Unnecessary forks 1. Different forks might be in steady-state at different points Christoph Laaber, laaber@ifi.uzh.ch 13
  13. Observations and Idea wi1 wi2 wi3 wi4 wi5 f1 i1

    i2 i3 i4 i5 f2 f3 Stable 2. Unnecessary forks X X X X X 3. Manual configuration is required for every benchmark and execution environment 1. Different forks might be in steady-state at different points Christoph Laaber, laaber@ifi.uzh.ch 14
  14. 1. Different forks might be in steady-state at different points

    3. Manual configuration is required for every benchmark and execution environment Observations and Idea wi1 wi2 wi3 wi4 wi5 f1 i1 i2 i3 i4 i5 f2 f3 Stable 2. Unnecessary forks X X X X X Dynamic, data-driven decision when to stop microbenchmark executions Christoph Laaber, laaber@ifi.uzh.ch 15
  15. Approach -- Static Configuration (JMH) wi 6 wi 7 wi

    8 wi 9 wi 10 f 1 i 1 i 2 i 3 i 4 i 5 f 2 f 3 wi 1 wi 2 wi 3 wi 4 wi 5 i 6 i 7 i 8 i 9 i 10 f 4 f 5 Christoph Laaber, laaber@ifi.uzh.ch 16
  16. Approach -- Dynamic Reconfiguration f1 i1 i2 i3 i4 i5

    Minimum number of warmup iterations Christoph Laaber, laaber@ifi.uzh.ch 17
  17. Approach -- Dynamic Reconfiguration f1 i1 i2 i3 i4 i5

    Stoppage Point Stable Unstable ? Sliding Window Stoppage Criteria: 1. Coefficient of variation 2. Relative confidence interval width 3. Kullback-Leibler divergence [He et al., FSE’19] Christoph Laaber, laaber@ifi.uzh.ch 18
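A minimal sketch of the first stoppage criterion, the coefficient of variation computed over a sliding window of the most recent iteration results; the window size and 5% threshold below are illustrative placeholders, not the parameters used in the paper:

```java
import java.util.Arrays;

// Sketch of the coefficient-of-variation (CV) stoppage criterion:
// look at a sliding window of the most recent iteration results and
// declare the benchmark stable once their CV falls below a threshold.
// WINDOW and THRESHOLD are illustrative values, not the paper's.
public class CvStoppage {
    static final int WINDOW = 5;
    static final double THRESHOLD = 0.05; // 5% relative variability

    static boolean stable(double[] results) {
        if (results.length < WINDOW) return false; // window not filled yet
        double[] w = Arrays.copyOfRange(results, results.length - WINDOW, results.length);
        double mean = Arrays.stream(w).average().orElse(0);
        double var = Arrays.stream(w).map(x -> (x - mean) * (x - mean)).sum() / (WINDOW - 1);
        return Math.sqrt(var) / mean < THRESHOLD;
    }

    public static void main(String[] args) {
        double[] noisy = {20, 18, 23, 21, 20};             // CV ≈ 0.089 -> keep running
        double[] settled = {20.1, 20.0, 20.2, 20.0, 20.1}; // CV well below 5% -> stop
        System.out.println(stable(noisy));   // false
        System.out.println(stable(settled)); // true
    }
}
```

The other two criteria (relative confidence interval width, KL divergence) plug into the same sliding-window decision point with different statistics.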
  18. Approach -- Dynamic Reconfiguration i6 f1 i1 i2 i3 i4

    i5 Sliding Window Stoppage Point Stable Unstable ? Christoph Laaber, laaber@ifi.uzh.ch 19
  19. Approach -- Dynamic Reconfiguration i6 i7 f1 i1 i2 i3

    i4 i5 Sliding Window Stoppage Point Stable Unstable ? Christoph Laaber, laaber@ifi.uzh.ch 20
  20. Approach -- Dynamic Reconfiguration i6 i7 i8 i9 i10 f1

    i11 i12 i13 i14 i15 i1 i2 i3 i4 i5 i16 i17 Fixed number of measurement iterations Christoph Laaber, laaber@ifi.uzh.ch 21
  21. Approach -- Dynamic Reconfiguration i6 i7 i8 i9 i10 f1

    i11 i12 i13 i14 i15 i1 i2 i3 i4 i5 i16 i17 i18 i19 i20 Skipped iterations Christoph Laaber, laaber@ifi.uzh.ch 22
  22. Approach -- Dynamic Reconfiguration i6 i7 i8 i9 i10 f1

    i11 i12 i13 i14 i15 i1 i2 i3 i4 i5 i16 i17 i18 i19 i20 f2 f2 Minimum number of forks Christoph Laaber, laaber@ifi.uzh.ch 23
  23. Approach -- Dynamic Reconfiguration i6 i7 i8 i9 i10 f1

    i11 i12 i13 i14 i15 i1 i2 i3 i4 i5 i16 i17 i18 i19 i20 f2 f2 Stoppage Point Stable Unstable ? Stoppage Criteria: 1. Coefficient of variation 2. Relative confidence interval width 3. Kullback-Leibler divergence [He et al., FSE’19] Christoph Laaber, laaber@ifi.uzh.ch 24
  24. Approach -- Dynamic Reconfiguration i6 i7 i8 i9 i10 f1

    i11 i12 i13 i14 i15 i1 i2 i3 i4 i5 i16 i17 i18 i19 i20 f2 f2 f3 f4 Stoppage Point Stable Unstable ? End of Execution! Christoph Laaber, laaber@ifi.uzh.ch 25
  25. Approach -- Dynamic Reconfiguration i 6 i 7 i 8

    i 9 i 10 f 1 i 11 i 12 i 13 i 14 i 15 i 1 i 2 i 3 i 4 i 5 i 16 i 17 i 18 i 19 i 20 f 2 f 2 f 3 f 4 f 5 Skipped iterations + skipped forks Christoph Laaber, laaber@ifi.uzh.ch 26
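The iteration-level part of the walkthrough above can be sketched as a plain-Java control loop; `runFork`, the toy timings, the toy criterion, and the minimum/maximum bounds are all illustrative stand-ins, not the paper's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the dynamic-reconfiguration control loop for one fork:
// run a minimum number of warmup iterations (discarded), then keep
// measuring until a stoppage criterion declares the results stable
// (the stoppage point) or an upper bound is hit; remaining iterations
// are skipped. The same idea applies one level up across forks.
// `timings` stands in for per-iteration results; bounds are illustrative.
public class DynamicReconfiguration {
    interface Criterion { boolean stable(List<Double> results); }

    static List<Double> runFork(double[] timings, int minWarmup,
                                int maxIters, Criterion c) {
        List<Double> results = new ArrayList<>();
        int i = minWarmup; // the first minWarmup iterations are warmup, discarded
        while (results.size() < maxIters && i < timings.length) {
            results.add(timings[i++]);
            if (c.stable(results)) break; // stoppage point: skip the rest
        }
        return results;
    }

    public static void main(String[] args) {
        // Toy per-iteration results (ns): converge to 20 after warmup.
        double[] t = {25, 22, 21, 20, 20, 20, 20, 20, 20, 20};
        // Toy criterion: stable once the last three results are identical.
        Criterion c = r -> r.size() >= 3
                && r.get(r.size() - 1).equals(r.get(r.size() - 2))
                && r.get(r.size() - 2).equals(r.get(r.size() - 3));
        List<Double> res = runFork(t, 2, 10, c);
        System.out.println(res.size() + " measurement iterations executed");
    }
}
```

In this toy run the criterion fires after four measurement iterations, so the remaining ones are skipped, which is exactly the source of the time savings.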
  26. Evaluation -- Research Questions Static Configuration (JMH) Dynamic Reconfiguration =

    ? How much time can be saved by dynamically reconfiguring software microbenchmarks? How does dynamic reconfiguration of software microbenchmarks affect their execution result? RQ 1 RQ 2 Christoph Laaber, laaber@ifi.uzh.ch 27
  27. Evaluation -- Methodology Execute All Benchmarks Study Objects Sample from

    Execution Data = ? RQ1: Stability RQ2: Runtime Savings 10 open-source Java / JMH projects # benchmarks: 31 – 1,381 Runtimes: 4h – 192h Static Configuration Bare-metal server 3 stoppage criteria: CV, KLD, RCIW JMH default configuration Dynamic Reconfiguration Christoph Laaber, laaber@ifi.uzh.ch 28
  28. How much time can be saved by dynamically reconfiguring software

    microbenchmarks? How does dynamic reconfiguration of software microbenchmarks affect their execution result? RQ 1 RQ 2 Christoph Laaber, laaber@ifi.uzh.ch 29
  29. RQ 1: Stability -- Method A/A Tests Mean Change Rate

    density Execution Time Static Configuration Dynamic Reconfiguration = ? Bootstrap Confidence Interval of the Ratio of Means Equal or Different x % difference Christoph Laaber, laaber@ifi.uzh.ch 30
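The bootstrap confidence interval of the ratio of means can be sketched as follows; the resample count, seed, and 3% equivalence margin are illustrative choices, not the paper's exact setup:

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of a bootstrap confidence interval for the ratio of means,
// as used to decide whether two executions of the same benchmark
// (an A/A comparison) are statistically equal. The 1,000 resamples
// and the 3% equivalence margin in main(...) are illustrative.
public class BootstrapRatio {
    static double mean(double[] xs) { return Arrays.stream(xs).average().orElse(0); }

    static double[] resample(double[] xs, Random rnd) {
        double[] r = new double[xs.length];
        for (int i = 0; i < xs.length; i++) r[i] = xs[rnd.nextInt(xs.length)];
        return r;
    }

    // Returns the [2.5%, 97.5%] bootstrap interval of mean(a) / mean(b).
    static double[] ratioCi(double[] a, double[] b, int iters, long seed) {
        Random rnd = new Random(seed);
        double[] ratios = new double[iters];
        for (int i = 0; i < iters; i++)
            ratios[i] = mean(resample(a, rnd)) / mean(resample(b, rnd));
        Arrays.sort(ratios);
        return new double[]{ratios[(int) (0.025 * iters)], ratios[(int) (0.975 * iters)]};
    }

    public static void main(String[] args) {
        double[] a = {20, 18, 23, 21, 20};
        double[] b = {21, 19, 22, 20, 21};
        double[] ci = ratioCi(a, b, 1000, 42);
        // "Equal" if the whole interval lies within 1 ± margin (margin illustrative).
        boolean equal = ci[0] > 0.97 && ci[1] < 1.03;
        System.out.printf("CI = [%.3f, %.3f], equal = %b%n", ci[0], ci[1], equal);
    }
}
```

With only five samples per side the interval is wide; the evaluation uses full iteration result sets, where the interval narrows accordingly.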
  30. RQ 1: Stability -- Results Stoppage Criteria Coefficient of variation

    Relative confidence interval width Kullback-Leibler divergence 78.8 % 87.6 % 79.6 % Equal A/A Tests Mean Change Rate 3.1 % 1.4 % 2.4 % Dynamic Reconfiguration hardly changes result stability often within measurement noise [Georges et al., OOPSLA ‘07] Christoph Laaber, laaber@ifi.uzh.ch 31
  31. How much time can be saved by dynamically reconfiguring software

    microbenchmarks? How does dynamic reconfiguration of software microbenchmarks affect their execution result? RQ 1 RQ 2 Stable results Christoph Laaber, laaber@ifi.uzh.ch 33
  32. How much time can be saved by dynamically reconfiguring software

    microbenchmarks? How does dynamic reconfiguration of software microbenchmarks affect their execution result? RQ 1 RQ 2 Stable results Christoph Laaber, laaber@ifi.uzh.ch 34
  33. RQ 2: Runtime -- Method Runtime Overhead Execute Approaches 1.

    Static Configuration (JMH) 2. Dynamic Reconfiguration + CV 3. Dynamic Reconfiguration + KLD 4. Dynamic Reconfiguration + RCIW Estimate Time Savings o = Dynamic Reconf. Static Conf. All benchmarks of Log4j 3 stoppage criteria: CV, KLD, RCIW Dynamic Reconfiguration Suite Runtimes Christoph Laaber, laaber@ifi.uzh.ch 35
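The time-savings estimate can be expressed as simple arithmetic; the formula below (the dynamic runtime inflated by the analysis overhead, compared against the static runtime) is our reading of the slide, and the numbers in `main` are made up for illustration:

```java
// Estimating suite-level time savings of dynamic reconfiguration.
// savings = 1 - (dynamic * (1 + overhead)) / static is our reading of
// the slide, not a formula quoted from the paper.
public class TimeSavings {
    static double savings(double staticHours, double dynamicHours, double overheadFraction) {
        return 1.0 - (dynamicHours * (1.0 + overheadFraction)) / staticHours;
    }

    public static void main(String[] args) {
        // Illustrative numbers: a 100 h static suite cut to 18 h of
        // dynamic execution, with ~1% analysis overhead (cf. the CV
        // criterion's 0.88% overhead on the next slide).
        System.out.printf("savings = %.1f%%%n", 100 * TimeSavings.savings(100, 18, 0.01));
        // -> savings = 81.8%
    }
}
```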
  34. RQ 2: Runtime -- Results 0.88 % 10.92 % 4.32

    % Runtime Overhead Time Savings 82.0 % 66.2 % 79.5 % Dynamic Reconfiguration substantially reduces runtime despite the overhead Stoppage Criteria Coefficient of variation Relative confidence interval width Kullback-Leibler divergence Christoph Laaber, laaber@ifi.uzh.ch 36
  35. How much time can be saved by dynamically reconfiguring software

    microbenchmarks? How does dynamic reconfiguration of software microbenchmarks affect their execution result? RQ 1 RQ 2 Stable results Substantial runtime savings despite the analysis overhead Christoph Laaber, laaber@ifi.uzh.ch 37
  36. What have we learned? Static configuration wastes precious benchmarking time

    It is worth the analysis overhead to reduce overall runtime Stoppage criteria choice depends on desired benchmark stability OSS suites have long runtimes and use default configuration Christoph Laaber, laaber@ifi.uzh.ch 38
  37. Research Recommendations Automatically select approach hyperparameters Combine Dynamic Reconfiguration

    with regression testing Christoph Laaber, laaber@ifi.uzh.ch 39
  38. @ChristophLaaber laaber@ifi.uzh.ch Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without

    Sacrificing Result Quality Christoph Laaber, Stefan Würsten, Harald C. Gall, Philipp Leitner software evolution & architecture lab https://doi.org/10.1145/3368089.3409683 http://t.uzh.ch/13k
  39. Paper, Scripts, and Data Replication package: https://doi.org/10.6084/m9.figshare.11944875 Paper: https://doi.org/10.1145/3368089.3409683 Preprint: http://t.uzh.ch/13k Tool: https://github.com/sealuzh/jmh

    Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result Quality Christoph Laaber University of Zurich Zurich, Switzerland laaber@ifi.uzh.ch Stefan Würsten University of Zurich Zurich, Switzerland stefan.wuersten@uzh.ch Harald C. Gall University of Zurich Zurich, Switzerland gall@ifi.uzh.ch Philipp Leitner Chalmers | University of Gothenburg Gothenburg, Sweden philipp.leitner@chalmers.se ABSTRACT Executing software microbenchmarks, a form of small-scale performance tests predominantly used for libraries and frameworks, is a costly endeavor. Full benchmark suites take up to multiple hours or days to execute, rendering frequent checks, e.g., as part of continuous integration (CI), infeasible. However, altering benchmark configurations to reduce execution time without considering the impact on result quality can lead to benchmark results that are not representative of the software’s true performance. We propose the first technique to dynamically stop software microbenchmark executions when their results are sufficiently stable. Our approach implements three statistical stoppage criteria and is capable of reducing Java Microbenchmark Harness (JMH) suite execution times by 48.4% to 86.0%. At the same time it retains the same result quality for 78.8% to 87.6% of the benchmarks, compared to executing the suite for the default duration. The proposed approach does not require developers to manually craft custom benchmark configurations; instead, it provides automated mechanisms for dynamic reconfiguration. This makes dynamic reconfiguration highly effective and efficient, potentially paving the way to the inclusion of JMH microbenchmarks in CI. CCS CONCEPTS • General and reference → Measurement; Performance; • Software and its engineering → Software performance; Software testing and debugging. KEYWORDS performance testing, software benchmarking, JMH, configuration ACM Reference Format: Christoph Laaber, Stefan Würsten, Harald C. Gall, and Philipp Leitner. 2020. Dynamically Reconfiguring Software Microbenchmarks: Reducing Execution Time without Sacrificing Result Quality. In Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’20), November 8–13, 2020, Virtual Event, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3368089.3409683 1 INTRODUCTION Performance testing enables automated assessment of software performance in the hope of catching degradations, such as slowdowns, in a timely manner. A variety of techniques exist, spanning from system-scale (e.g., load testing) to method or statement level, such as software microbenchmarking. For functional testing, CI has been a revelation, where (unit) tests are regularly executed to detect functional regressions as early as possible [22]. However, performance testing is not yet standard CI practice, although there would be a need for it [6, 36]. A major reason for not running performance tests on every commit is their long runtimes, often consuming multiple hours to days [24, 26, 32]. To lower the time spent in performance testing activities, previous research applied techniques to select which commits to test [24, 45] or which tests to run [3, 14], to prioritize tests that are more likely to expose slowdowns [39], and to stop load tests once they become repetitive [1, 2] or do not improve result accuracy [20]. However, none of these approaches is tailored to the characteristics of software microbenchmarks; none enables running full benchmark suites while reducing the overall runtime and still maintaining the same result quality. In this paper, we present the first approach to dynamically, i.e., during execution, decide when to stop the execution of software microbenchmarks. Our approach, dynamic reconfiguration, determines at different checkpoints whether a benchmark execution is stable and whether more executions are unlikely to improve the result accuracy. It builds on the concepts introduced by He et al. [20], applies them to software microbenchmarks, and generalizes the approach for any kind of stoppage criteria. To evaluate whether dynamic reconfiguration enables reducing execution time without sacrificing quality, we perform an experimental evaluation on ten Java open-source software (OSS) projects with benchmark suite sizes between 16 and 995 individual benchmarks, ranging from 4.31 to 191.81 hours. Our empirical evaluation comprises three different stoppage criteria, including the one from He et al. [20]. It assesses whether benchmarks executed with Christoph Laaber, laaber@ifi.uzh.ch 41