Slide 1

Slide 1 text

Machine Learning meets Software Performance: A journey from optimization to transfer learning all the way to counterfactual causal inference. ASE 2020 Tutorial, Friday 25 Sep. Pooyan Jamshidi, UofSC. (Images: Europa Lander, NASA; Europa Clipper, NASA)

Slide 2

Slide 2 text

Artificial Intelligence and Systems Laboratory (AISys Lab) Machine Learning Computer Systems Autonomy Learning-enabled Autonomous Systems https://pooyanjamshidi.github.io/AISys/ 2

Slide 3

Slide 3 text

Research Directions at AISys 3 Theory:
 - Transfer Learning
 - Causal Invariances
 - Structure Learning
 - Concept Learning
 - Physics-Informed
 
 Applications:
 - Systems
 - Autonomy
 - Robotics

 (Diagram: spectrum from well-known physics / big data to limited known physics / small data; Causal AI.)

Slide 4

Slide 4 text

Team effort 4 Rahul Krishna Columbia Shahriar Iqbal UofSC M. A. Javidian Purdue Baishakhi Ray Columbia Christian Kästner CMU Norbert Siegmund Leipzig Miguel Velez CMU Sven Apel Saarland Lars Kotthoff Wyoming Marco Valtorta UofSC Vivek Nair Facebook Tim Menzies NCSU

Slide 5

Slide 5 text

5 Goal: Enable developers/users to find the right quality tradeoff

Slide 6

Slide 6 text

Today’s most popular systems are configurable 6

Slide 7

Slide 7 text

7

Slide 8

Slide 8 text

Empirical observations confirm that systems are becoming increasingly configurable 8 (plots: number of configuration parameters vs. release time for Apache and Hadoop MapReduce/HDFS, among others, growing steadily across releases) [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

Slide 9

Slide 9 text

Empirical observations confirm that systems are becoming increasingly configurable 9 (plots from the paper: number of configuration parameters vs. release time for Storage-A, MySQL, Apache, and Hadoop MapReduce/HDFS, each growing into the hundreds) [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

Slide 10

Slide 10 text

Configurations determine the performance behavior 10

void Parrot_setenv(... name, ... value) {
#ifdef PARROT_HAS_SETENV
    my_setenv(name, value, 1);
#else
    int name_len = strlen(name);
    int val_len = strlen(value);
    char *envs = glob_env;
    if (envs == NULL) {
        return;
    }
    strcpy(envs, name);
    strcpy(envs + name_len, "=");
    strcpy(envs + name_len + 1, value);
    putenv(envs);
#endif
}

#ifdef LINUX
extern int Parrot_signbit(double x) { ... }
#endif

(Compile-time options such as PARROT_HAS_SETENV and LINUX affect Speed and Energy.)

Slide 11

Slide 11 text

11 How do we understand the performance behavior of real-world, highly configurable systems that scale well… and enable developers/users to reason about qualities (performance, energy) and make tradeoffs?

Slide 12

Slide 12 text

Scope: Configuration across stack 12 CPU Memory Controller GPU Lib API Clients Devices Network Task Scheduler Device Drivers File System Compilers Memory Manager Process Manager Frontend Application Layer OS/Kernel Layer Hardware Layer Deployment SoC Generic hardware Production Servers

Slide 13

Slide 13 text

Composed Systems (Single-node) • Online processing of sensory data • Neural network models • Homogeneous tasks 13

Slide 14

Slide 14 text

Composed Systems (Multi-node) • Online processing of sensory data • Graph-type models • Heterogeneous tasks 14

Slide 15

Slide 15 text

Composed Systems (IoT) • They are integrated with cloud services; we do not have access to those systems, but we can configure them to some extent. 15

Slide 16

Slide 16 text

Distributed (big data) • The components may be assigned to different hardware nodes without direct control by the users • These are typically configurable, but finding the best configuration needs expertise 16

Slide 17

Slide 17 text

Cyber-physical systems • We may not have direct access to the hardware, so remote debugging is needed. 17

Slide 18

Slide 18 text

Cloud, Multi-cloud Systems; Event-driven systems • Code migrates from one hardware platform to another (lots of interactions) 18

Slide 19

Slide 19 text

Outline 19 Case Study Transfer Learning Theory Building Guided Sampling Current Research [SEAMS’17] [ASE’17] [FSE’18]

Slide 20

Slide 20 text

SocialSensor • Identifying trending topics • Identifying user-defined topics • Social media search 20

Slide 21

Slide 21 text

SocialSensor 21 Content Analysis Orchestrator Crawling Search and Integration Tweets: [5k-20k/min] Every 10 min: [100k tweets] Tweets: [10M] Fetch Store Push Store Crawled items Fetch Internet

Slide 22

Slide 22 text

Challenges 22 Content Analysis Orchestrator Crawling Search and Integration Tweets: [5k-20k/min] Every 10 min: [100k tweets] Tweets: [10M] Fetch Store Push Store Crawled items Fetch Internet 100X 10X Real time

Slide 23

Slide 23 text

23 How can we gain better performance without using more resources?

Slide 24

Slide 24 text

24 Let’s try out different system configurations!

Slide 25

Slide 25 text

Opportunity: The data processing engines in the pipeline were all configurable 25 (> 100, > 100, and > 100 options each; about 2^300 configurations in total)

Slide 26

Slide 26 text

26 More combinations than the estimated number of atoms in the universe

Slide 27

Slide 27 text

The default configuration was bad, and so was the expert’s 27 (plot: throughput (ops/sec) vs. average write latency (µs); markers for Default, Recommended by an expert, and Optimal Configuration; higher throughput and lower latency are better)

Slide 28

Slide 28 text

The default configuration was bad, and so was the expert’s 28 (plot: throughput (ops/sec, ×10^4) vs. latency (ms); markers for Default, Recommended by an expert, and Optimal Configuration; higher throughput and lower latency are better)

Slide 29

Slide 29 text

The default configuration is typically bad, and the optimal configuration is noticeably better than the median 29 (plot: throughput (ops/sec) vs. average write latency (µs); markers for Default Configuration and Optimal Configuration; higher throughput and lower latency are better) • Default is bad • 2X-10X faster than worst • Noticeably faster than median

Slide 30

Slide 30 text

100X more users, cloud resources reduced by 20%, outperformed expert recommendation

Slide 31

Slide 31 text

Identifying the root cause of performance faults is difficult
● Code was transplanted from TX1 to TX2
● TX2 is more powerful, but the software was 2x slower than on TX1
● Three misconfigurations:
○ Wrong compilation flags for compiling CUDA (didn't use the 'dynamic' flag)
○ Wrong CPU/GPU modes (didn't use TX2-optimized cores)
○ Wrong fan mode (didn't change it to handle thermal throttling)
Fig 1. Performance fault on NVIDIA TX2 https://forums.developer.nvidia.com/t/50477 31

Slide 32

Slide 32 text

Fixing performance faults is difficult
● These options were not in the default settings
● Took 1 month to fix in the end...
● We need to do this better
Fig 1. Performance fault on NVIDIA TX2 https://forums.developer.nvidia.com/t/50477 32

Slide 33

Slide 33 text

Identifying the root cause of performance faults is difficult 33

Slide 34

Slide 34 text

Identifying the root cause of performance faults is difficult 34

Slide 35

Slide 35 text

Identifying the root cause of performance faults is difficult 35

Slide 36

Slide 36 text

Identifying the root cause of performance faults is difficult 36

Slide 37

Slide 37 text

Performance distributions are multi-modal and have long tails
 • Certain configurations can cause performance to take abnormally large values
 • Faulty configurations take the tail values (worse than the 99.99th percentile)
 • Certain configurations can cause faults on multiple performance objectives
 37

Slide 38

Slide 38 text

Outline 38 Case Study Transfer Learning Theory Building Guided Sampling Current Research

Slide 39

Slide 39 text

Setting the scene 39. Configuration space: ℂ = O1 × O2 × ⋯ × O19 × O20, with options such as dead-code removal, constant folding, loop unrolling, and function inlining. A configuration is a member of this space, e.g., c1 = 0 × 0 × ⋯ × 0 × 1, c1 ∈ ℂ. A compiler (e.g., SaC, LLVM) compiles the program, the compiled code is deployed as an instrumented binary, and the binary is configured on the hardware. Non-functional, measurable/quantifiable aspects: compile time fc(c1) = 11.1 ms, execution time fe(c1) = 110.3 ms, energy fen(c1) = 100 mWh.
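A minimal sketch of this formalization in Python (binary options are assumed, as on the slide; the measured values attached to c1 are the slide's numbers):

    from itertools import product

    d = 20                                 # binary options O1..O20
    space = product([0, 1], repeat=d)      # lazily enumerates all 2^20 configurations
    c1 = next(space)                       # one concrete configuration, c1 in C
    # each configuration maps to measurable non-functional properties:
    measures_c1 = {"compile_time_ms": 11.1, "exec_time_ms": 110.3, "energy_mWh": 100}
    print(sum(c1), measures_c1)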

Slide 40

Slide 40 text

A typical approach for understanding the performance behavior is sensitivity analysis 40: sample configurations c1, c2, c3, …, cn from O1 × O2 × ⋯ × O19 × O20, measure y1 = f(c1), y2 = f(c2), …, yn = f(cn), and learn a model f̂ ∼ f(·) from this training/sample set.
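A hedged sketch of this sample-and-learn loop (the measure function below is a synthetic stand-in for a real benchmark run; it mirrors the interaction model shown later in the deck):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def measure(c):
        # synthetic stand-in for running the system under configuration c
        return 1.2 + 3*c[0] + 5*c[2] + 0.9*c[6] + 0.8*c[2]*c[6] + 4*c[0]*c[2]*c[6]

    rng = np.random.default_rng(0)
    d, n = 20, 100
    C = rng.integers(0, 2, size=(n, d))        # sampled configurations c_1..c_n
    y = np.array([measure(c) for c in C])      # y_i = f(c_i)
    f_hat = DecisionTreeRegressor().fit(C, y)  # learn f_hat ~ f(.)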

Slide 41

Slide 41 text

The performance model can be any appropriate form of black-box model 41 (same setup: configurations c1, …, cn, measurements yi = f(ci), and a training/sample set from which f̂ ∼ f(·) is learned)

Slide 42

Slide 42 text

Evaluating a performance model 42: learn f̂ ∼ f(·) from the training/sample set as before, then evaluate its accuracy: APE(f̂, f) = |f̂(c) − f(c)| / f(c) × 100

Slide 43

Slide 43 text

A performance model contains useful information about influential options and interactions 43: f : ℂ → ℝ, e.g., f(·) = 1.2 + 3o1 + 5o3 + 0.9o7 + 0.8o3o7 + 4o1o3o7
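One way to obtain such an interpretable model from measurements is to expand the binary options into interaction terms and fit a sparse linear model; the sketch below is an assumption, not the tutorial's exact method, and it approximately recovers the influential terms of the f(·) above:

    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(1)
    C = rng.integers(0, 2, size=(200, 8))                    # options o1..o8
    y = (1.2 + 3*C[:, 0] + 5*C[:, 2] + 0.9*C[:, 6]
         + 0.8*C[:, 2]*C[:, 6] + 4*C[:, 0]*C[:, 2]*C[:, 6])  # the f(.) above

    poly = PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)
    X = poly.fit_transform(C)
    model = Lasso(alpha=0.01).fit(X, y)
    for name, w in zip(poly.get_feature_names_out(), model.coef_):
        if abs(w) > 0.1:     # keep only the influential options/interactions
            print(name, round(w, 2))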

Slide 44

Slide 44 text

Performance models can then be used to reason about qualities 44: for the Parrot_setenv code shown earlier, with one compile-time option as o1, execution time (s) is f(·) = 5 + 3 × o1, hence f(o1 := 0) = 5 and f(o1 := 1) = 8

Slide 45

Slide 45 text

Insight: Performance measurements of the real system are “similar” to the ones from the simulators 45 (diagram: measure configurations in the simulator (Gazebo), collect performance data). So why not reuse these data, instead of measuring on the real robot?

Slide 46

Slide 46 text

We developed methods to make learning cheaper via transfer learning 46 (diagram: Source (given) and Target (learn); extract transferable knowledge from the source data/model and reuse it when learning the target model). Goal: Gain strength by transferring information across environments

Slide 47

Slide 47 text

What is the advantage of transfer learning?
• During learning you may need thousands of rotten and fresh potatoes and hours of training to learn.
• But now, using the same knowledge of rotten features, you can identify rotten tomatoes with fewer samples and less training time.
• You may have learned during daytime with enough light and exposure; but your present tomato-identification job is at night.
• You may have learned sitting very close, just beside the box of potatoes; but now, for tomato identification, you are on the other side of the glass.
47

Slide 48

Slide 48 text

Our transfer learning solution 48 (diagram: measure configurations on the TurtleBot and in the simulator (Gazebo), reuse the simulator data, and learn f(o1, o2) = 5 + 3o1 + 15o2 − 7o1 × o2) [P. Jamshidi, et al., “Transfer learning for improving model predictions ….”, SEAMS’17]

Slide 49

Slide 49 text

Gaussian processes for performance modeling 49 (plots at t = n and t = n + 1 over input x and output f(x): observations, posterior mean, uncertainty bands, and a new observation)

Slide 50

Slide 50 text

Gaussian Processes enable reasoning about performance 50. Step 1: Fit a GP to the data seen so far. Step 2: Explore the model for regions of most variance. Step 3: Sample that region. Step 4: Repeat. (Figure: empirical model over the configuration space, experiments, and the selection criteria of the sequential design.)
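A minimal sketch of this sequential design loop with a GP surrogate (the one-dimensional f below is a toy stand-in for a real measurement, and the Matern kernel is an assumption):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def f(x):                                    # toy stand-in for a measurement
        return np.sin(3 * x) + 0.1 * x

    X_pool = np.linspace(0, 5, 200).reshape(-1, 1)     # candidate configurations
    X, y = X_pool[[0, -1]], f(X_pool[[0, -1]]).ravel()

    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(10):
        gp.fit(X, y)                                   # Step 1: fit GP to data so far
        _, std = gp.predict(X_pool, return_std=True)   # Step 2: region of most variance
        x_next = X_pool[np.argmax(std)]                # Step 3: sample that region
        X = np.vstack([X, x_next])                     # Step 4: repeat
        y = np.append(y, f(x_next))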

Slide 51

Slide 51 text

The intuition behind our transfer learning approach 51. Intuition: observations on the source(s) can affect predictions on the target. Example: learning the game of chess makes learning the game of Go a lot easier! To exploit the relationship between the source and target functions, g, f, using observations Ds, Dt to build the predictive model f̂, we define the kernel k(f, g, x, x′) = kt(f, g) × kxx(x, x′), where the kernel kt represents the correlation between the source and target function, while kxx is the covariance over inputs. Typically, kxx is parameterized and its parameters are learnt by maximizing the marginal likelihood of the model given the observations from source and target.
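A hedged sketch of that factorized kernel (the squared-exponential k_xx and the fixed task matrix below are assumptions; in the approach the kernel parameters are learnt from the source and target observations):

    import numpy as np

    def k_xx(x, x2, ls=1.0):
        # covariance over configuration inputs (squared exponential, assumed)
        return np.exp(-0.5 * np.sum((x - x2) ** 2) / ls**2)

    def k(ti, tj, x, x2, Kt):
        # k(f, g, x, x') = k_t(f, g) * k_xx(x, x')
        return Kt[ti, tj] * k_xx(x, x2)

    Kt = np.array([[1.0, 0.8],
                   [0.8, 1.0]])   # illustrative source/target correlation
    print(k(0, 1, np.zeros(3), np.ones(3), Kt))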

Slide 52

Slide 52 text

CoBot experiment: DARPA BRASS 52 (plot: localization error [m] vs. CPU utilization [%], with an energy constraint, a safety constraint, the Pareto front, and a sweet spot; options no_of_particles=x, no_of_refinement=y; lower error and lower utilization are better)

Slide 53

Slide 53 text

CoBot experiment 53 (CPU [%] heatmaps over the configuration space: source (given), target (ground truth, 6 months later), prediction with 4 samples, and prediction with transfer learning)

Slide 54

Slide 54 text

Transfer Learning for Improving Model Predictions in Highly Configurable Software. Pooyan Jamshidi, Miguel Velez, Christian Kästner (Carnegie Mellon University), Norbert Siegmund (Bauhaus-University Weimar), Prasad Kawthekar (Stanford University). Fig. 1: Transfer learning for performance model learning (measure the simulator (source) and the robot (target), learn the model with transfer learning, use it for adaptation). Details: [SEAMS ’17] 54

Slide 55

Slide 55 text

Outline 55 Case Study Transfer Learning Theory Building Guided Sampling Current Research

Slide 56

Slide 56 text

Looking further: When transfer learning goes wrong 56 (boxplots of absolute percentage error [%] for sources s, s1–s6)

Source: s / s1 / s2 / s3 / s4 / s5 / s6
noise-level: 0 / 5 / 10 / 15 / 20 / 25 / 30
corr. coeff.: 0.98 / 0.95 / 0.89 / 0.75 / 0.54 / 0.34 / 0.19
µ(pe): 15.34 / 14.14 / 17.09 / 18.71 / 33.06 / 40.93 / 46.75

It worked! It didn’t! Insight: Predictions become more accurate when the source is more related to the target. (Dashed line: non-transfer-learning baseline.)

Slide 57

Slide 57 text

(Heatmaps (a)–(f): CPU usage [%] over number of particles × number of refinements for six environments. Transfer worked in (a)–(c); it didn’t in (d)–(f).)

Slide 58

Slide 58 text

Key question: Can we develop a theory to explain when transfer learning works? 58 (diagram: Source (given) and Target (learn); extract transferable knowledge from the source data/model and reuse it when learning the target) Q1: How are the source and target “related”? Q2: What characteristics are preserved? Q3: What are the actionable insights?

Slide 59

Slide 59 text

We hypothesized that we can exploit similarities across environments to learn “cheaper” performance models 59 O1 × O2 × ⋯ × O19 × O20 0 × 0 × ⋯ × 0 × 1 0 × 0 × ⋯ × 1 × 0 0 × 0 × ⋯ × 1 × 1 1 × 1 × ⋯ × 1 × 0 1 × 1 × ⋯ × 1 × 1 ⋯ c1 c2 c3 cn ys1 = fs (c1 ) ys2 = fs (c2 ) ys3 = fs (c3 ) ysn = fs (cn ) O1 × O2 × ⋯ × O19 × O20 0 × 0 × ⋯ × 0 × 1 0 × 0 × ⋯ × 1 × 0 0 × 0 × ⋯ × 1 × 1 1 × 1 × ⋯ × 1 × 0 1 × 1 × ⋯ × 1 × 1 ⋯ yt1 = ft (c1 ) yt2 = ft (c2 ) yt3 = ft (c3 ) ytn = ft (cn ) Source Environment (Execution time of Program X) Target Environment (Execution time of Program Y) Similarity [P. Jamshidi, et al., “Transfer learning for performance modeling of configurable systems….”, ASE’17]
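A small sketch of how such similarity can be checked in practice: measure the same configurations in both environments and compute a rank correlation (the numbers below are illustrative, not measurements):

    import numpy as np
    from scipy.stats import spearmanr

    ys = np.array([11.1, 14.9, 9.3, 20.4, 7.7])       # f_s(c_i) in the source
    yt = np.array([110.3, 151.2, 95.0, 210.8, 80.1])  # f_t(c_i) in the target
    rho, _ = spearmanr(ys, yt)
    print(f"source/target rank correlation: {rho:.2f}")  # high => transfer promising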

Slide 60

Slide 60 text

Our empirical study: We looked at different highly configurable systems to gain insights 60 [P. Jamshidi, et al., “Transfer learning for performance modeling of configurable systems….”, ASE’17] SPEAR (SAT solver): analysis time; 14 options, 16,384 configurations; SAT problems; 3 hardware, 2 versions. X264 (video encoder): encoding time; 16 options, 4,000 configurations; video quality/size; 2 hardware, 3 versions. SQLite (DB engine): query time; 14 options, 1,000 configurations; DB queries; 2 hardware, 2 versions. SaC (compiler): execution time; 50 options, 71,267 configurations; 10 demo programs.

Slide 61

Slide 61 text

Linear shift happens only in limited environmental changes 61

Software / Environmental change / Severity / Corr.
SPEAR / NUC/2 -> NUC/4 / Small / 1.00
SPEAR / Amazon_nano -> NUC / Large / 0.59
SPEAR / Hardware/workload/version / Very Large / -0.10
x264 / Version / Large / 0.06
x264 / Workload / Medium / 0.65
SQLite / write-seq -> write-batch / Small / 0.96
SQLite / read-rand -> read-seq / Medium / 0.50

(Scatter plot: source vs. target throughput.) Implication: Simple transfer learning is limited to hardware changes in practice.

Slide 62

Slide 62 text

Influential options and interactions are preserved across environments 62

Software / Environmental change / Severity / Dim / t-test
x264 / Version / Large / 16 / 12, 10
x264 / Hardware/workload/version / Very Large / 16 / 8, 9
SQLite / write-seq -> write-batch / Very Large / 14 / 3, 4
SQLite / read-rand -> read-seq / Medium / 14 / 1, 1
SaC / Workload / Very Large / 50 / 16, 10

Implication: Avoid wasting budget on non-informative parts of the configuration space and focus where it matters. We only need to explore part of the space: 2^16 / 2^50 = 0.000000000058

Slide 63

Slide 63 text

Transfer learning across environment 63 O1 × O2 × ⋯ × O19 × O20 0 × 0 × ⋯ × 0 × 1 0 × 0 × ⋯ × 1 × 0 0 × 0 × ⋯ × 1 × 1 1 × 1 × ⋯ × 1 × 0 1 × 1 × ⋯ × 1 × 1 ⋯ c1 c2 c3 cn ys1 = fs (c1 ) ys2 = fs (c2 ) ys3 = fs (c3 ) ysn = fs (cn ) Source (Execution time of Program X) Learn performance model ̂ fs ∼ fs ( ⋅ )

Slide 64

Slide 64 text

Observation 1: Not all options and interactions are influential, and the degree of interaction between options is not high 64 f̂s(·) = 1.2 + 3o1 + 5o3 + 0.9o7 + 0.8o3o7 + 4o1o3o7, over ℂ = O1 × O2 × O3 × O4 × O5 × O6 × O7 × O8 × O9 × O10

Slide 65

Slide 65 text

Observation 2: Influential options and interactions are preserved across environments 65 f̂s(·) = 1.2 + 3o1 + 5o3 + 0.9o7 + 0.8o3o7 + 4o1o3o7; f̂t(·) = 10.4 − 2.1o1 + 1.2o3 + 2.2o7 + 0.1o1o3 − 2.1o3o7 + 14o1o3o7

Slide 66

Slide 66 text

Transfer Learning for Performance Modeling of Configurable Systems: An Exploratory Analysis. Pooyan Jamshidi (Carnegie Mellon University), Norbert Siegmund (Bauhaus-University Weimar), Miguel Velez, Christian Kästner, Akshay Patel, Yuvraj Agarwal (Carnegie Mellon University). Fig. 1: Transfer learning takes advantage of transferable knowledge from the source to learn an accurate, reliable, and less costly model for the target environment. Details: [ASE ’17] 66

Slide 67

Slide 67 text

67 Details: [AAAI Spring Symposium ’19]

Slide 68

Slide 68 text

Outline 68 Case Study Transfer Learning Theory Building Guided Sampling Current Research

Slide 69

Slide 69 text

How to sample the configuration space to learn a “better” performance behavior? How to select the most informative configurations?

Slide 70

Slide 70 text

The similarity across environments is a rich source of knowledge for exploration of the configuration space

Slide 71

Slide 71 text

When we treat the system as a black box, we cannot typically distinguish between different configurations c1, c2, c3, …, cn from O1 × O2 × ⋯ × O19 × O20 • We therefore end up blindly exploring the configuration space • That is essentially the key reason why “most” work in this area considers random sampling. 71

Slide 72

Slide 72 text

Without considering this knowledge, many samples may not provide new information 72. With f̂s(·) = 1.2 + 3o1 + 5o3 + 0.9o7 + 0.8o3o7 + 4o1o3o7 over O1 × O2 × ⋯ × O10, all 128 configurations that share o1 = o3 = o7 = 1 and differ only in the non-influential options yield the same prediction: f̂s(c1) = f̂s(c2) = f̂s(c3) = ⋯ = f̂s(c128) = 14.9
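A sketch of the guided alternative: once the source model says only o1, o3, and o7 matter, enumerate those and pin the rest to a default, instead of blindly sampling all 2^10 configurations (indices follow the slide's example):

    from itertools import product

    d = 10
    influential = [0, 2, 6]              # o1, o3, o7 from the source model

    def guided_samples():
        for bits in product([0, 1], repeat=len(influential)):
            c = [0] * d                  # non-influential options pinned to a default
            for i, b in zip(influential, bits):
                c[i] = b
            yield tuple(c)

    samples = list(guided_samples())
    print(len(samples))                  # 8 informative samples instead of 1024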

Slide 73

Slide 73 text

Without this knowledge, many blind/random samples may not provide any additional information about the performance of the system

Slide 74

Slide 74 text

Evaluation: Learning performance behavior of Machine Learning Systems ML system: https://pooyanjamshidi.github.io/mls

Slide 75

Slide 75 text

Configurations of deep neural networks affect accuracy and energy consumption 75 (plot: energy consumption [J] vs. validation (test) error for a CNN on the CNAE-9 data set; a 72% error range and a 22X energy range; plus a network-architecture sketch: image, sepconv/conv cells, global pool, linear, softmax)

Slide 76

Slide 76 text

DNN measurements are costly: each sample costs ~1h, and 4000 samples × 1h ≈ 6 months. Yes, that’s the cost we paid for conducting our measurements!

Slide 77

Slide 77 text

L2S enables learning a more accurate model with fewer samples by exploiting the knowledge from the source 77 (plot (a), DNN (hard): mean absolute percentage error vs. sample size for L2S+GP, L2S+DataReuseTL, DataReuseTL, ModelShift, and Random+CART) Convolutional Neural Network

Slide 78

Slide 78 text

L2S may also help the data-reuse approach to learn faster 78 (plots (a)–(c): mean absolute percentage error vs. sample size for DNN (hard), XGBoost (hard), and Storm (hard), comparing L2S+GP, L2S+DataReuseTL, DataReuseTL, ModelShift, and Random+CART) XGBoost

Slide 79

Slide 79 text

Evaluation: Learning performance behavior of Big Data Systems

Slide 80

Slide 80 text

In some environments the similarities across environments may be too low, and this results in “negative transfer” 80 (plot (c), Storm (hard): mean absolute percentage error vs. sample size for L2S+GP, L2S+DataReuseTL, DataReuseTL, ModelShift, and Random+CART) Apache Storm

Slide 81

Slide 81 text

Why are performance models using L2S samples more accurate?

Slide 82

Slide 82 text

The samples generated by L2S contain more information… “entropy <-> information gain” 82 (plots: entropy [bits] vs. sample size for max entropy, L2S, and random sampling, on DNN, XGBoost, and Storm)
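A minimal sketch of this entropy comparison: bin the measured responses over a fixed range and compute Shannon entropy in bits (synthetic data; redundant samples concentrate in one bin, informative samples spread out):

    import numpy as np

    def sample_entropy(y, bins=16, value_range=(0, 30)):
        counts, _ = np.histogram(y, bins=bins, range=value_range)
        p = counts[counts > 0] / counts.sum()
        return -(p * np.log2(p)).sum()

    rng = np.random.default_rng(0)
    y_random = 14.9 + rng.normal(0, 0.01, 64)   # redundant (blind) samples
    y_guided = rng.uniform(5, 25, 64)           # samples covering more behaviors
    print(sample_entropy(y_random), sample_entropy(y_guided))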

Slide 83

Slide 83 text

Limitations • Limited number of systems and environmental changes • Synthetic models • https://github.com/pooyanjamshidi/GenPerf • Binary options • Non-binary options -> binary • Negative transfer 83

Slide 84

Slide 84 text

Details: [FSE ’18] 84

Slide 85

Slide 85 text

Outline 85 Case Study Transfer Learning Empirical Study Guided Sampling Current Research

Slide 86

Slide 86 text

What will the software systems of the future look like?

Slide 87

Slide 87 text

Software 2.0 87 Increasingly customized and configurable VISION Increasingly competing objectives Accuracy Training speed Inference speed Model size Energy

Slide 88

Slide 88 text

Deep neural network as a highly configurable system 88 (DNN system development stack: network design, model compiler, hybrid deployment, OS/hardware; the configuration space spans neural search, hyper-parameters, deployment topology, and hardware optimization; scope of this project highlighted)

Slide 89

Slide 89 text

We found many configurations with the same accuracy while having drastically different energy demands 89 (plots: energy consumption [J] vs. validation (test) error for a CNN on the CNAE-9 data set; a 72% error range and a 22X energy range overall; zoomed view: within 10% error, configurations differ by up to 22X / 300 J in energy along the Pareto frontier)
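A minimal sketch of extracting such a Pareto frontier, minimizing both energy and error (the points below are illustrative, not measurements):

    import numpy as np

    def pareto_front(points):
        # keep rows not dominated by any other row (<= in all objectives, < in some)
        keep = []
        for i, p in enumerate(points):
            dominated = np.any(np.all(points <= p, axis=1) &
                               np.any(points < p, axis=1))
            if not dominated:
                keep.append(i)
        return points[keep]

    pts = np.array([[300, 10], [2500, 9], [150, 18], [400, 8], [120, 30]])
    print(pareto_front(pts))   # the non-dominated (energy, error) trade-offs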

Slide 90

Slide 90 text

Details: [FlexiBO] 90

Slide 91

Slide 91 text

Outline 91 Case Study Transfer Learning Empirical Study Guided Sampling Current Research

Slide 92

Slide 92 text

Debugging based on statistical correlation can be misleading 92 (causal graphs and scatter plots relating GPU Growth, Swap Mem, and Latency for GPU Growth = 33%/66% and Swap = 1Gb/4Gb)
● The correlation between GPU Growth and Latency is as strong as that between Swap Mem and Latency, but considerably less noisy.
● Therefore, a feature-selection method based on correlation that ignores the causal structure prefers GPU Growth as the predictor for Latency, which is misleading.

Slide 93

Slide 93 text

Why knowing the underlying causal structure matters: a transfer learning scenario 93
• The relation between X1 and X2 is about as strong as the relation between X2 and X3, but noisier.
• {X3} and {X1, X3} are preferred over {X1}, because predicting Y from X1 leads to:
• a larger variance than predicting Y from X3
• a larger bias than predicting Y from both X1 and X3.
Magliacane, Sara, et al. "Domain adaptation by using causal inference to predict invariant conditional distributions." Advances in Neural Information Processing Systems. 2018.

Slide 94

Slide 94 text

CAUPER: Causal Performance Debugger localizes and repairs performance faults with approx. 50 samples 94

Slide 95

Slide 95 text

CAUPER is centered around causal structure discovery 95

Slide 96

Slide 96 text

● We measure the Individual Treatment Effect (ITE) of each repair: ● the difference between the probability that the performance fault is fixed after a repair and the probability that the fault persists after the repair. ● The larger the value, the more likely we are to repair the fault. ● We pick the repair with the largest ITE. CAUPER iteratively explores potential performance repairs 96
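A hedged sketch of this selection rule (the repair names and probabilities below are illustrative placeholders, not CAUPER's actual output):

    # ITE(repair) = P(fault fixed | repair) - P(fault persists | repair)
    p_fixed = {
        "swap_mem=4Gb": 0.81,      # hypothetical candidate repairs
        "gpu_growth=66%": 0.55,
        "fan_mode=max": 0.62,
    }
    ite = {r: p - (1 - p) for r, p in p_fixed.items()}
    best = max(ite, key=ite.get)
    print(best, round(ite[best], 2))   # apply the repair with the largest ITE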

Slide 97

Slide 97 text

CAUPER is able to find more accurate causes compared with statistical debugging 97

Slide 98

Slide 98 text

CAUPER is able to find comparable and even better repairs for performance faults compared with performance optimization 98 (bar plots: latency gain, heat gain, and energy gain for CAUPER vs. SMAC across workloads X5k, X10k, X20k, X50k)

Slide 99

Slide 99 text

CAUPER is able to find repairs at lower cost 99 (bar plot: time for CAUPER vs. SMAC across workloads X5k, X10k, X20k, X50k)

Slide 100

Slide 100 text

CAUPER’s repairs are transferable across environments 100 (bar plot: latency gain for CAUPER, SMAC-Rerun, and SMAC-Reuse across workload changes X5k-X10k, X5k-X20k, X5k-X50k)

Slide 101

Slide 101 text

No content

Slide 102

Slide 102 text

Team effort Rahul Krishna Columbia Shahriar Iqbal UofSC M. A. Javidian Purdue Baishakhi Ray Columbia Christian Kästner CMU Norbert Siegmund Leipzig Miguel Velez CMU Sven Apel Saarland Lars Kotthoff Wyoming Marco Valtorta UofSC Vivek Nair Facebook Tim Menzies NCSU

Slide 103

Slide 103 text

No content