
Machine Learning meets Software Performance

Pooyan Jamshidi
September 25, 2020


A wide range of modern software-intensive systems (e.g., autonomous systems, big data analytics, robotics, deep neural architectures) is built to be configurable. These highly-configurable systems offer a rich space for adaptation to different domains and tasks. Developers and users often need to reason about the performance of such systems, making tradeoffs between specific quality attributes or detecting performance anomalies. For instance, the developers of image recognition mobile apps are not only interested in learning which deep neural architectures are accurate enough to classify their images correctly, but also in which architectures consume the least power on the mobile devices on which they are deployed. Recent research has focused on models built from performance measurements obtained by instrumenting the system. However, the fundamental problem is that the learning techniques for building a reliable performance model do not scale well, simply because the configuration space of such systems is exponentially large and impossible to explore exhaustively. For example, it would take over 60 years to explore the whole configuration space of a system with 25 binary options.
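For intuition, the 60-year figure follows from simple arithmetic; a minimal sketch in Python (the one-minute-per-measurement cost is an assumption made for illustration, not a number from the abstract):

    # 25 binary options -> 2^25 configurations; assume ~1 minute per measurement
    n_configs = 2 ** 25                     # 33,554,432 configurations
    minutes_total = n_configs * 1.0         # one minute per configuration
    years = minutes_total / (60 * 24 * 365)
    print(f"{n_configs:,} configurations take about {years:.0f} years to measure")
    # -> about 64 years, i.e., "over 60 years"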

In this tutorial, I will start by motivating the configuration space explosion problem based on my previous experience with large-scale big data systems in industry. I will then present transfer learning as well as other machine learning techniques, including multi-objective Bayesian optimization, to tackle the sample efficiency challenge: instead of taking the measurements from the real system, we learn the performance model using samples from cheap sources, such as simulators that approximate the performance of the real system with reasonable fidelity and at low cost. Results show that despite the high cost of measurement on the real system, learning performance models can become surprisingly cheap as long as certain properties are reused across environments. In the second half of the talk, I will present empirical evidence that lays a foundation for a theory explaining why and when transfer learning works, by showing the similarities of performance behavior across environments. I will present observations of the impact of environmental changes (such as changes to hardware, workload, and software version) on a selected set of configurable systems from different domains, to identify the key elements that can be exploited for transfer learning. These observations demonstrate a promising path for building efficient, reliable, and dependable software systems as well as theoretically sound approaches for tackling performance optimization, testing, and debugging. Finally, I will share some promising potential research directions, including our recent progress on a performance-debugging approach based on counterfactual causal inference.

Outline
Background on computer system performance
Case study: A composable highly-configurable system
Performance analysis and optimization
Transfer learning for performance analysis and optimization
Research direction 1: Cost-aware multi-objective Bayesian optimization for MLSys
Research direction 2: Counterfactual causal inference for performance debugging
Target audience
This tutorial is targeted at practitioners as well as researchers who would like to go deeper into understanding new and potentially powerful approaches for modern highly-configurable systems. This tutorial will also be suitable for students (both undergraduate and graduate) who want to learn about potential research directions and how they can find a niche and fruitful research area at the intersection of machine learning, systems, and software engineering.


Transcript

  1. Machine Learning meets Software Performance A journey from optimization to

    transfer learning all the way to counterfactual causal inference ASE 2020 Tutorial Friday 25 Sep Pooyan Jamshidi UofSC Europa Lander NASA Europa Clipper NASA
  2. Artificial Intelligence and Systems Laboratory (AISys Lab) Machine Learning Computer

    Systems Autonomy Learning-enabled Autonomous Systems https://pooyanjamshidi.github.io/AISys/ 2
  3. Research Directions at AISys
    Theory:
    - Transfer Learning
    - Causal Invariances
    - Structure Learning
    - Concept Learning
    - Physics-Informed
    Applications:
    - Systems
    - Autonomy
    - Robotics
    [Diagram: spectrum from Well-known Physics / Big Data to Limited known Physics / Small Data; Causal AI]
  4. Team effort 4 Rahul Krishna Columbia Shahriar Iqbal UofSC M.

    A. Javidian Purdue Baishakhi Ray Columbia Christian Kästner CMU Norbert Siegmund Leipzig Miguel Velez CMU Sven Apel Saarland Lars Kotthoff Wyoming Marco Valtorta UofSC Vivek Nair Facebook Tim Menzies NCSU
  5. 7

  6. Empirical observations confirm that systems are becoming increasingly configurable. [Plots: number of parameters vs. release time for Apache and Hadoop (MapReduce, HDFS), roughly 1999–2014] [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]
  7. Empirical observations confirm that systems are becoming increasingly configurable. [Plots from the paper: number of parameters vs. release time for Storage-A, MySQL, Apache, and Hadoop (MapReduce, HDFS), roughly 1998–2014] [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]
  8. Configurations determine the performance behavior.

    void Parrot_setenv(. . . name, . . . value){
    #ifdef PARROT_HAS_SETENV
        my_setenv(name, value, 1);
    #else
        int name_len = strlen(name);
        int val_len = strlen(value);
        char* envs = glob_env;
        if (envs == NULL) {
            return;
        }
        strcpy(envs, name);
        strcpy(envs + name_len, "=");
        strcpy(envs + name_len + 1, value);
        putenv(envs);
    #endif
    }

    #ifdef LINUX
    extern int Parrot_signbit(double x){ …

    [Annotations: the #ifdef branches on PARROT_HAS_SETENV and LINUX select different code paths, which are linked to Speed and Energy]
  9. 11 How do we understand performance behavior of real-world highly-configurable

    systems that scale well… … and enable developers/users to reason about qualities (performance, energy) and to make tradeoffs?
  10. Scope: Configuration across stack 12 CPU Memory Controller GPU Lib

    API Clients Devices Network Task Scheduler Device Drivers File System Compilers Memory Manager Process Manager Frontend Application Layer OS/Kernel Layer Hardware Layer Deployment SoC Generic hardware Production Servers
  11. Composed Systems (Single-node) • Online processing of sensory data •

    Neural network models • Homogeneous tasks 13
  12. Composed Systems (Multi-node) • Online processing of sensory data •

    Graph-type models • Heterogeneous tasks
  13. Composed Systems (IoT) • They are integrated with cloud services

    and we do not have access to those systems, but we can configure them to some extent.
  14. Distributed (big data) • The components may be assigned to

    different hardware nodes without direct control of the users • These are typically configurable, but need expertise to find the best configuration 16
  15. Cyber-physical systems • We may not have direct access to

    the hardware, so remote debugging is needed.
  16. Outline 19 Case Study Transfer Learning Theory Building Guided Sampling

    Current Research [SEAMS’17] [ASE’17] [FSE’18]
  17. SocialSensor 21 Content Analysis Orchestrator Crawling Search and Integration Tweets:

    [5k-20k/min] Every 10 min: [100k tweets] Tweets: [10M] Fetch Store Push Store Crawled items Fetch Internet
  18. Challenges 22 Content Analysis Orchestrator Crawling Search and Integration Tweets:

    [5k-20k/min] Every 10 min: [100k tweets] Tweets: [10M] Fetch Store Push Store Crawled items Fetch Internet 100X 10X Real time
  19. The default configuration was bad, and so was the expert's. [Scatter plot: throughput (ops/sec, 0–1500) vs. average write latency (µs, 0–5000), marking the default configuration, the configuration recommended by an expert, and the optimal configuration; higher throughput and lower latency are better]
  20. The default configuration was bad, and so was the expert's. [Scatter plot: throughput (ops/sec, up to ~2.5×10^4) vs. latency (ms, 0–300), marking the default, expert-recommended, and optimal configurations; higher throughput and lower latency are better]
  21. The default configuration is typically bad and the optimal configuration is noticeably better than the median. [Scatter plot: throughput (ops/sec) vs. average write latency (µs), marking the default and optimal configurations] • Default is bad • 2X-10X faster than worst • Noticeably faster than median
  22. Identifying the root cause of performance faults is difficult •

    Code was transplanted from TX1 to TX2 • TX2 is more powerful, but software was 2x slower than TX1 • Three misconfigurations: ◦ Wrong compilation flags for compiling CUDA (didn't use 'dynamic' flag) ◦ Wrong CPU/GPU modes (didn't use TX2 optimized cores) ◦ Wrong Fan mode (didn't change to handle thermal throttling) Fig 1. Performance fault on NVIDIA TX2 https://forums.developer.nvidia.com/t/50477 31
  23. Fixing performance faults is difficult • These were not in

    the default settings • Took 1 month to fix in the end... • We need to do this better Fig 1. Performance fault on NVIDIA TX2 https://forums.developer.nvidia.com/t/50477 32
  24. Performance distributions are multi-modal and have long tails
    • Certain configurations can cause performance to take abnormally large values
    • Faulty configurations take the tail values (worse than the 99.99th percentile)
    • Certain configurations can cause faults on multiple performance objectives.
  25. Setting the scene: the configuration space is the Cartesian product of options, ℂ = O1 × O2 × ⋯ × O19 × O20 (options such as dead code removal, constant folding, loop unrolling, function inlining). A configuration c1 = 0 × 0 × ⋯ × 0 × 1, c1 ∈ ℂ, is compiled (compiler, e.g., SaC or LLVM: program, compiled code, instrumented binary), deployed on hardware, and measured on non-functional, measurable/quantifiable aspects: compile time fc(c1) = 11.1 ms, execution time fe(c1) = 110.3 ms, energy fen(c1) = 100 mWh.
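A minimal sketch of this setup in Python, restricted to four of the named options for brevity (the measurement routine is a stub that returns the slide's example values, not a real deployment):

    import itertools

    # Configuration space: Cartesian product of binary compiler options
    options = ["dead_code_removal", "constant_folding", "loop_unrolling", "function_inlining"]
    configuration_space = list(itertools.product([0, 1], repeat=len(options)))

    def measure(config):
        """Compile, deploy, and measure the program under `config`; here stubbed
        with the slide's example compile time, execution time, and energy."""
        return {"compile_ms": 11.1, "exec_ms": 110.3, "energy_mwh": 100.0}

    c1 = configuration_space[1]          # e.g., (0, 0, 0, 1)
    print(dict(zip(options, c1)), measure(c1))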
  26. A typical approach for understanding the performance behavior is sensitivity analysis: draw a training/sample set of configurations c1, c2, c3, …, cn from O1 × O2 × ⋯ × O19 × O20, measure y1 = f(c1), y2 = f(c2), …, yn = f(cn), and learn a model f̂ ∼ f(⋅) from these samples.
  27. The performance model f̂ learned from the training/sample set could be in any appropriate form of black-box model (the tutorial later uses regression trees and Gaussian processes).
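A sketch of the learn step under these assumptions (scikit-learn regression tree as the black-box learner; the measurement stub reuses the example model shown on a later slide instead of a real, expensive measurement):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def measure_system(c):
        # Placeholder for the expensive measurement y_i = f(c_i) on the real system
        return 1.2 + 3*c[0] + 5*c[2] + 0.9*c[6] + 0.8*c[2]*c[6] + 4*c[0]*c[2]*c[6]

    rng = np.random.default_rng(0)
    d = 20                                      # number of binary options O1..O20
    X = rng.integers(0, 2, size=(100, d))       # training/sample set c1..cn
    y = np.array([measure_system(c) for c in X])

    f_hat = DecisionTreeRegressor(min_samples_leaf=3).fit(X, y)   # f_hat ~ f(.)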
  28. Evaluating a performance model: learn f̂ ∼ f(⋅) from the training/sample set, then evaluate its accuracy on held-out configurations using the absolute percentage error, APE(f̂, f) = |f̂(c) − f(c)| / f(c) × 100.
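The accuracy metric itself is a one-liner; a self-contained sketch:

    import numpy as np

    def ape(f_hat_pred, f_true):
        """Absolute percentage error per configuration, in percent:
        APE = |f_hat(c) - f(c)| / f(c) * 100."""
        f_hat_pred, f_true = np.asarray(f_hat_pred), np.asarray(f_true)
        return np.abs(f_hat_pred - f_true) / np.abs(f_true) * 100.0

    print(ape([10.5, 22.0], [10.0, 20.0]))   # -> [ 5. 10.]
    # The mean over a held-out set gives the MAPE reported in later slides.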
  29. A performance model contains useful information about influential options and interactions, e.g., f(⋅) = 1.2 + 3o1 + 5o3 + 0.9o7 + 0.8o3o7 + 4o1o3o7, where f : ℂ → ℝ.
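Written out as code, the learned model makes the influential options explicit; a direct encoding of this example:

    def f_hat(o1, o3, o7):
        # Only o1, o3, o7 and their interactions appear; all other options drop out.
        return 1.2 + 3*o1 + 5*o3 + 0.9*o7 + 0.8*o3*o7 + 4*o1*o3*o7

    print(f_hat(o1=1, o3=1, o7=1))   # 14.9, the value reused in a later sampling example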
  30. The performance model can then be used to reason about qualities. For the Parrot_setenv code shown earlier, f(·) = 5 + 3 × o1 (execution time in s), so f(o1 := 0) = 5 and f(o1 := 1) = 8.
  31. Insight: Performance measurements of the real system are “similar” to the ones from the simulators. [Diagram: measure configurations in a simulator (Gazebo) to obtain performance data] So why not reuse these data, instead of measuring on the real robot?
  32. We developed methods to make learning cheaper via transfer learning. [Diagram: extract transferable knowledge from the source (given data and model) and reuse it when learning in the target; the slide embeds excerpts of the paper's Intuition and Preliminary Concepts sections, which define the configuration space C = Dom(F1) × ⋯ × Dom(Fd), environments e = [w, h, v] ∈ E = W × H × V, the performance model f : F × E → ℝ with observations yi = f(xi) + εi, and the performance distribution pd : E → Δ(ℝ)] Goal: Gain strength by transferring information across environments.
  33. What is the advantage of transfer learning? • During learning you may need thousands of rotten and fresh potatoes and hours of training to learn. • But now, using the same knowledge of rotten features, you can identify rotten tomatoes with fewer samples and less training time. • You may have learned during daytime with enough light and exposure; but your present tomato-identification job is at night. • You may have learned sitting very close, just beside the box of potatoes; but now, for tomato identification, you are on the other side of the glass.
  34. [Diagram: measure configurations on the TurtleBot and in the simulator (Gazebo); reuse the simulator data when learning] Our transfer learning solution: f(o1, o2) = 5 + 3o1 + 15o2 − 7o1 × o2. [P. Jamshidi, et al., “Transfer learning for improving model predictions ….”, SEAMS’17]
  35. Gaussian processes for performance modeling. [Figure: GP posterior over output f(x) vs. input x at step t = n and after a new observation at t = n + 1, showing the observations, the posterior mean, and the uncertainty band]
  36. Gaussian processes enable reasoning about performance. Step 1: Fit a GP to the data seen so far. Step 2: Explore the model for the regions of most variance. Step 3: Sample that region. Step 4: Repeat. [Figure: sequential design over the configuration space, showing the empirical model, the experiments, and the selection criteria]
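A compact sketch of this four-step loop (assuming scikit-learn's GP regressor; the measured objective is a toy stand-in for a real, expensive experiment):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def measure(x):                      # toy stand-in for an expensive measurement
        return np.sin(3 * x) + 0.1 * np.random.randn()

    candidates = np.linspace(0, 2, 200).reshape(-1, 1)
    X = candidates[[10, 120]].copy()                 # a couple of initial experiments
    y = np.array([measure(x[0]) for x in X])

    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(20):
        gp.fit(X, y)                                 # Step 1: fit GP to data seen so far
        mu, sigma = gp.predict(candidates, return_std=True)
        x_next = candidates[np.argmax(sigma)]        # Step 2: region of most variance
        y_next = measure(x_next[0])                  # Step 3: sample that region
        X = np.vstack([X, x_next]); y = np.append(y, y_next)   # Step 4: repeat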
  37. The intuition behind our transfer learning approach. Intuition: Observations on the source(s) can affect predictions on the target. Example: Learning the chess game makes learning the Go game a lot easier! To capture the relationship between the source and target functions g and f, using the observations Ds, Dt to build the predictive model f̂, we define the kernel k(f, g, x, x′) = kt(f, g) × kxx(x, x′), where kt represents the correlation between the source and target function, while kxx is the covariance over inputs; kxx is typically parameterized and its parameters are learnt by maximizing the marginal likelihood of the model given the observations from source and target, Ds, Dt.
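A minimal numpy illustration of that product kernel (an illustrative sketch, not the implementation used in the tutorial): each observation carries a task index (0 = source g, 1 = target f), and the kernel multiplies a 2×2 task-correlation matrix kt by an RBF covariance kxx over the configurations.

    import numpy as np

    def multitask_kernel(X1, t1, X2, t2, Kt, lengthscale=1.0):
        """k(f, g, x, x') = kt(f, g) * kxx(x, x').
        X1, X2: inputs; t1, t2: task indices (0 = source, 1 = target);
        Kt: 2x2 task-correlation matrix, e.g. [[1, rho], [rho, 1]]."""
        sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
        kxx = np.exp(-0.5 * sq / lengthscale**2)      # RBF over configurations
        kt = Kt[np.ix_(t1, t2)]                       # correlation between tasks
        return kt * kxx

    # Kernel matrix over 3 source and 2 target observations, correlation rho = 0.9
    Xs, Xt = np.random.rand(3, 2), np.random.rand(2, 2)
    X = np.vstack([Xs, Xt]); t = np.array([0, 0, 0, 1, 1])
    K = multitask_kernel(X, t, X, t, Kt=np.array([[1.0, 0.9], [0.9, 1.0]]))
    print(K.shape)   # (5, 5)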
  38. CoBot experiment: DARPA BRASS. [Scatter plot: localization error [m] vs. CPU utilization [%], with the energy constraint, the safety constraint, the Pareto front, and the sweet spot marked; configuration options no_of_particles = x, no_of_refinement = y]
  39. CoBot experiment. [Heatmaps of CPU [%] over the configuration space: the source (given), the target (ground truth after 6 months), the prediction with 4 samples, and the prediction with transfer learning]
  40. “Transfer Learning for Improving Model Predictions in Highly Configurable Software”, by Pooyan Jamshidi, Miguel Velez, Christian Kästner (Carnegie Mellon University), Norbert Siegmund (Bauhaus-University Weimar), and Prasad Kawthekar (Stanford University). [Slide shows the paper's abstract and Fig. 1: transfer learning for performance model learning, with the simulator as source and the robot as target] Details: [SEAMS ’17]
  41. Looking further: When transfer learning goes wrong. [Box plots: absolute percentage error [%] for predictions using different sources, plus a non-transfer-learning baseline]

    Source        s      s1     s2     s3     s4     s5     s6
    noise-level   0      5      10     15     20     25     30
    corr. coeff.  0.98   0.95   0.89   0.75   0.54   0.34   0.19
    µ(pe)         15.34  14.14  17.09  18.71  33.06  40.93  46.75

    It worked! … It didn't! Insight: Predictions become more accurate when the source is more related to the target.
  42. [Heatmaps (a)–(f): CPU usage [%] over number of particles × number of refinements for six environment pairs; three are annotated “It worked!” and three “It didn't!”]
  43. Key question: Can we develop a theory to explain when transfer learning works? [Diagram: extract transferable knowledge from the source (given data and model) and reuse it when learning in the target; the slide again embeds the paper's preliminary definitions of the configuration/environment space, the performance model, and the performance distribution] Q1: How are the source and target “related”? Q2: What characteristics are preserved? Q3: What are the actionable insights?
  44. We hypothesized that we can exploit similarities across environments to learn “cheaper” performance models: measure the same configurations c1, …, cn in a source environment (execution time of program X), giving ys1 = fs(c1), …, ysn = fs(cn), and in a target environment (execution time of program Y), giving yt1 = ft(c1), …, ytn = ft(cn), and quantify their similarity. [P. Jamshidi, et al., “Transfer learning for performance modeling of configurable systems….”, ASE’17]
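One direct way to probe this hypothesis is to measure the same configurations in both environments and compute the (rank) correlation of the responses; a sketch in which the two environments are stand-in analytic models reusing the example coefficients shown on the following slides:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    def f_source(c):   # e.g., execution time of program X (placeholder model)
        return 1.2 + 3*c[0] + 5*c[2] + 0.9*c[6] + 0.8*c[2]*c[6] + 4*c[0]*c[2]*c[6]

    def f_target(c):   # e.g., execution time of program Y (placeholder model)
        return 10.4 - 2.1*c[0] + 1.2*c[2] + 2.2*c[6] + 0.1*c[0]*c[2] - 2.1*c[2]*c[6] + 14*c[0]*c[2]*c[6]

    rng = np.random.default_rng(1)
    C = rng.integers(0, 2, size=(200, 10))           # same configurations in both environments
    ys = np.array([f_source(c) for c in C])
    yt = np.array([f_target(c) for c in C])
    print(pearsonr(ys, yt)[0], spearmanr(ys, yt)[0])  # similarity across environments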
  45. Our empirical study: We looked at different highly-configurable systems to gain insights. [P. Jamshidi, et al., “Transfer learning for performance modeling of configurable systems….”, ASE’17]
    SPEAR (SAT solver): analysis time; 14 options, 16,384 configurations; SAT problems; 3 hardware, 2 versions
    X264 (video encoder): encoding time; 16 options, 4,000 configurations; video quality/size; 2 hardware, 3 versions
    SQLite (DB engine): query time; 14 options, 1,000 configurations; DB queries; 2 hardware, 2 versions
    SaC (compiler): execution time; 50 options, 71,267 configurations; 10 demo programs
  46. Linear shift happens only in limited environmental changes. [Scatter plots: target vs. source throughput]

    Software  Environmental change        Severity  Corr.
    SPEAR     NUC/2 -> NUC/4              Small     1.00
              Amazon_nano -> NUC          Large     0.59
              Hardware/workload/version   V Large   -0.10
    x264      Version                     Large     0.06
              Workload                    Medium    0.65
    SQLite    write-seq -> write-batch    Small     0.96
              read-rand -> read-seq       Medium    0.50

    Implication: Simple transfer learning is limited to hardware changes in practice.
  47. Influential options and interactions are preserved across environments.

    Software  Environmental change        Severity  Dim  t-test
    x264      Version                     Large     16   12  10
              Hardware/workload/version   V Large        8   9
    SQLite    write-seq -> write-batch    V Large   14   3   4
              read-rand -> read-seq       Medium         1   1
    SaC       Workload                    V Large   50   16  10

    We only need to explore part of the space: 2^16 / 2^50 ≈ 0.000000000058. Implication: Avoid wasting budget on non-informative parts of the configuration space and focus where it matters.
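The fraction on the slide is simply the reduced space over the full space:

    full_space    = 2 ** 50      # all 50 binary options (SaC)
    reduced_space = 2 ** 16      # e.g., only the influential dimensions identified for SaC
    print(reduced_space / full_space)    # ~5.8e-11, i.e., 0.000000000058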
  48. Transfer learning across environment 63 O1 × O2 × ⋯

    × O19 × O20 0 × 0 × ⋯ × 0 × 1 0 × 0 × ⋯ × 1 × 0 0 × 0 × ⋯ × 1 × 1 1 × 1 × ⋯ × 1 × 0 1 × 1 × ⋯ × 1 × 1 ⋯ c1 c2 c3 cn ys1 = fs (c1 ) ys2 = fs (c2 ) ys3 = fs (c3 ) ysn = fs (cn ) Source (Execution time of Program X) Learn performance model ̂ fs ∼ fs ( ⋅ )
  49. Observation 1: Not all options and interactions are influential, and the degree of interaction between options is not high. f̂s(⋅) = 1.2 + 3o1 + 5o3 + 0.9o7 + 0.8o3o7 + 4o1o3o7, over ℂ = O1 × O2 × O3 × O4 × O5 × O6 × O7 × O8 × O9 × O10.
  50. Observation 2: Influential options and interactions are preserved across environments

    65 ̂ fs ( ⋅ ) = 1.2 + 3o1 + 5o3 + 0.9o7 + 0.8o3 o7 + 4o1 o3 o7 ̂ ft ( ⋅ ) = 10.4 − 2.1o1 + 1.2o3 + 2.2o7 + 0.1o1 o3 − 2.1o3 o7 + 14o1 o3 o7
  51. “Transfer Learning for Performance Modeling of Configurable Systems: An Exploratory Analysis”, by Pooyan Jamshidi (Carnegie Mellon University), Norbert Siegmund (Bauhaus-University Weimar), Miguel Velez, Christian Kästner, Akshay Patel, and Yuvraj Agarwal (Carnegie Mellon University). [Slide shows the paper's abstract and Fig. 1: transfer learning takes advantage of transferable knowledge from the source to learn an accurate, reliable, and less costly model for the target environment] Details: [ASE ’17]
  52. How to sample the configuration space to learn a “better”

    performance behavior? How to select the most informative configurations?
  53. The similarity across environment is a rich source of knowledge

    for exploration of the configuration space
  54. When we treat the system as a black box, we cannot typically distinguish between different configurations c1, c2, c3, …, cn in O1 × O2 × ⋯ × O19 × O20. • We therefore end up blindly exploring the configuration space. • That is essentially the key reason why “most” work in this area considers random sampling.
  55. Without considering this knowledge, many samples may not provide new information. With f̂s(⋅) = 1.2 + 3o1 + 5o3 + 0.9o7 + 0.8o3o7 + 4o1o3o7 over O1 × O2 × ⋯ × O10, the configurations c1 = 1×0×1×0×0×0×1×0×0×0, c2 = 1×0×1×0×0×0×1×0×0×1, c3 = 1×0×1×0×0×0×1×0×1×0, …, c128 = 1×1×1×1×1×1×1×1×1×1 all yield f̂s(ci) = 14.9.
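A sketch of why these samples are redundant: with the influential options fixed, flipping any of the remaining options never changes the predicted response.

    import itertools

    def f_hat_s(c):    # learned source model; only o1, o3, o7 matter
        o1, o3, o7 = c[0], c[2], c[6]
        return 1.2 + 3*o1 + 5*o3 + 0.9*o7 + 0.8*o3*o7 + 4*o1*o3*o7

    # Fix o1 = o3 = o7 = 1 and enumerate the remaining 7 binary options
    values = set()
    for rest in itertools.product([0, 1], repeat=7):
        c = [1, rest[0], 1, rest[1], rest[2], rest[3], 1, rest[4], rest[5], rest[6]]
        values.add(f_hat_s(c))
    print(values)   # {14.9} -- all 128 configurations give the same value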
  56. Without this knowledge, many blind/random samples may not provide any additional information about the performance of the system.
  57. Configurations of deep neural networks affect accuracy and energy consumption. [Scatter plot: energy consumption [J] vs. validation (test) error for a CNN on the CNAE-9 data set; annotations: 72%, 22X. Background: figure from an architecture-search paper showing CIFAR-10 and ImageNet models constructed from the learned cells]
  58. DNN measurements are costly. Each sample costs ~1h; 4,000 × 1h ≈ 6 months. Yes, that's the cost we paid for conducting our measurements!
  59. L2S enables learning a more accurate model with fewer samples by exploiting the knowledge from the source. [Plot: mean absolute percentage error vs. sample size for the convolutional neural network (DNN, hard environmental change); methods: L2S+GP, L2S+DataReuseTL, DataReuseTL, ModelShift, Random+CART]
  60. L2S may also help the data-reuse approach to learn faster. [Plots: mean absolute percentage error vs. sample size for DNN (hard) and XGBoost (hard); methods: L2S+GP, L2S+DataReuseTL, DataReuseTL, ModelShift, Random+CART]
  61. In some environments the similarities across environments may be too low, and this results in “negative transfer”. [Plots: mean absolute percentage error vs. sample size for XGBoost (hard) and Apache Storm (hard); methods: L2S+GP, L2S+DataReuseTL, DataReuseTL, ModelShift, Random+CART]
  62. The samples generated by L2S contain more information… “entropy <-> information gain”. [Plots: entropy [bits] vs. sample size for DNN, XGBoost, and Storm, comparing L2S, random sampling, and the maximum entropy]
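Entropy here can be estimated from the empirical distribution of responses in a sample set; a minimal sketch (the histogram binning is an assumption made for illustration):

    import numpy as np

    def empirical_entropy(y, bins=16):
        """Shannon entropy (bits) of the histogram of responses in a sample set."""
        p, _ = np.histogram(y, bins=bins, density=False)
        p = p[p > 0] / p.sum()
        return -(p * np.log2(p)).sum()

    rng = np.random.default_rng(0)
    informative = rng.normal(size=70)        # responses that keep varying
    redundant   = np.full(70, 14.9)          # repeated, uninformative responses
    print(empirical_entropy(informative), empirical_entropy(redundant))   # high vs ~0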
  63. Limitations • Limited number of systems and environmental changes •

    Synthetic models • https://github.com/pooyanjamshidi/GenPerf • Binary options • Non-binary options -> binary • Negative transfer 83
  64. Software 2.0 vision: increasingly customized and configurable, with increasingly competing objectives (accuracy, training speed, inference speed, model size, energy).
  65. Deep neural network as a highly configurable system. [Diagram: a feed-forward network (input layer, hidden layer, output layer) and the DNN system development stack (network design, model compiler, hybrid deployment, OS/hardware), with neural search, hyper-parameter tuning, hardware optimization, and deployment topology annotated across the stack; the scope of this project spans the stack. The slide background contains excerpts of a research-plan document (metrics M6–M14 and technical aims)]
  66. We found many configurations with the same accuracy while having drastically different energy demand. [Scatter plots: energy consumption [J] vs. validation (test) error for a CNN on the CNAE-9 data set, with the Pareto frontier marked; annotations: 72%, 22X, 10%, 300 J]
  67. Debugging based on statistical correlation could be misleading. [Scatter plots: latency vs. swap memory and vs. GPU growth, for GPU growth = 33% / 66% and swap = 1 Gb / 4 Gb] • The correlation between GPU growth and latency is as strong as that between swap memory and latency, but considerably less noisy. • Therefore, a feature-selection method based on correlation that ignores the causal structure prefers GPU growth as the predictor for latency, which is misleading.
  68. Why knowing about the underlying causal structure matters A transfer

    learning scenario • The relation between X1 and X2 is about equally strong as the relation between X2 and X3, but more noisy. • {X3} and {X1, X3} are preferred over {X1}, because predicting Y from X1 leads to: • A larger variance than predicting Y from X3 • A larger bias than predicting Y from both X1 and X3. 93 Magliacane, Sara, et al. "Domain adaptation by using causal inference to predict invariant conditional distributions." Advances in Neural Information Processing Systems. 2018.
  69. CAUPER iteratively explores potential performance repairs. • We measure the Individual Treatment Effect (ITE) of each repair: the difference between the probability that the performance fault is fixed after a repair and the probability that the fault persists after the repair. • The larger the value, the more likely we are to repair the fault. • We pick the repair with the largest ITE.
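A minimal sketch of that selection rule (repair names and probabilities are purely illustrative, not CAUPER's actual implementation; in CAUPER the probabilities would come from the learned causal model):

    # ITE(repair) = P(fault fixed | do(repair)) - P(fault persists | do(repair))
    candidate_repairs = {
        "increase_swap_mem":  {"p_fixed": 0.71, "p_faulty": 0.29},   # hypothetical
        "set_gpu_growth_66":  {"p_fixed": 0.55, "p_faulty": 0.45},   # hypothetical
        "change_fan_mode":    {"p_fixed": 0.48, "p_faulty": 0.52},   # hypothetical
    }

    ite = {r: p["p_fixed"] - p["p_faulty"] for r, p in candidate_repairs.items()}
    best_repair = max(ite, key=ite.get)
    print(best_repair, ite[best_repair])   # apply the repair with the largest ITE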
  70. CAUPER is able to find comparable and even better repairs for performance faults compared with performance optimization. [Bar plots: latency gain, heat gain, and energy gain vs. workload (X5k, X10k, X20k, X50k) for CAUPER and SMAC]
  71. CAUPER is able to find repairs with lower costs. [Bar plot: time vs. workload (X5k, X10k, X20k, X50k) for CAUPER and SMAC]
  72. CAUPER's repairs are transferable across environments. [Bar plot: latency gain vs. workload change (X5k-X10k, X5k-X20k, X5k-X50k) for CAUPER, SMAC-Rerun, and SMAC-Reuse]
  73. Team effort Rahul Krishna Columbia Shahriar Iqbal UofSC M. A.

    Javidian Purdue Baishakhi Ray Columbia Christian Kästner CMU Norbert Siegmund Leipzig Miguel Velez CMU Sven Apel Saarland Lars Kotthoff Wyoming Marco Valtorta UofSC Vivek Nair Facebook Tim Menzies NCSU