
Machine Learning meets Software Performance

Pooyan Jamshidi
September 25, 2020


A wide range of modern software-intensive systems (e.g., autonomous systems, big data analytics, robotics, deep neural architectures) is built to be configurable. These highly-configurable systems offer a rich space for adaptation to different domains and tasks. Developers and users often need to reason about the performance of such systems, making tradeoffs between specific quality attributes or detecting performance anomalies. For instance, the developers of image recognition mobile apps are not only interested in learning which deep neural architectures are accurate enough to classify their images correctly, but also in which architectures consume the least power on the mobile devices on which they are deployed. Recent research has focused on models built from performance measurements obtained by instrumenting the system. However, the fundamental problem is that the learning techniques for building a reliable performance model do not scale well, simply because the configuration space of such systems is exponentially large and impossible to explore exhaustively. For example, it would take over 60 years to explore the whole configuration space of a system with 25 binary options.
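For intuition, the 60-year figure follows from simple arithmetic; a minimal sketch in Python (the one-minute-per-measurement cost is an assumption made for illustration, not a number from the abstract):

    # 25 binary options -> 2^25 configurations; assume ~1 minute per measurement
    n_configs = 2 ** 25                     # 33,554,432 configurations
    minutes_total = n_configs * 1.0         # one minute per configuration
    years = minutes_total / (60 * 24 * 365)
    print(f"{n_configs:,} configurations take about {years:.0f} years to measure")
    # -> about 64 years, i.e., "over 60 years"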

In this tutorial, I will start by motivating the configuration space explosion problem based on my previous experience with large-scale big data systems in industry. I will then present transfer learning as well as other machine learning techniques, including multi-objective Bayesian optimization, to tackle the sample efficiency challenge: instead of taking the measurements from the real system, we learn the performance model using samples from cheap sources, such as simulators that approximate the performance of the real system with reasonable fidelity and at low cost. Results show that despite the high cost of measurement on the real system, learning performance models can become surprisingly cheap as long as certain properties are reused across environments. In the second half of the talk, I will present empirical evidence that lays a foundation for a theory explaining why and when transfer learning works, by showing the similarities of performance behavior across environments. I will present observations of the impact of environmental changes (such as changes to hardware, workload, and software version) on a selected set of configurable systems from different domains, to identify the key elements that can be exploited for transfer learning. These observations demonstrate a promising path for building efficient, reliable, and dependable software systems as well as theoretically sound approaches for tackling performance optimization, testing, and debugging. Finally, I will share some promising potential research directions, including our recent progress on a performance-debugging approach based on counterfactual causal inference.

Outline
Background on computer system performance
Case study: A composable highly-configurable system
Performance analysis and optimization
Transfer learning for performance analysis and optimization
Research direction 1: Cost-aware multi-objective Bayesian optimization for MLSys
Research direction 2: Counterfactual causal inference for performance debugging
Target audience
This tutorial is targeted at practitioners as well as researchers who would like to go deeper into understanding new and potentially powerful approaches for modern highly-configurable systems. This tutorial will also be suitable for students (both undergraduate and graduate) who want to learn about potential research directions and how they can find a niche and fruitful research area at the intersection of machine learning, systems, and software engineering.


Transcript

  1. Machine Learning meets Software Performance A journey from optimization to

    transfer learning all the way to counterfactual causal inference ASE 2020 Tutorial Friday 25 Sep Pooyan Jamshidi UofSC Europa Lander NASA Europa Clipper NASA
  2. Artificial Intelligence and Systems Laboratory (AISys Lab) Machine Learning Computer

    Systems Autonomy Learning-enabled Autonomous Systems https://pooyanjamshidi.github.io/AISys/ 2
  3. Research Directions at AISys
    Theory:
    - Transfer Learning
    - Causal Invariances
    - Structure Learning
    - Concept Learning
    - Physics-Informed
    Applications:
    - Systems
    - Autonomy
    - Robotics
    [Diagram: spectrum from Well-known Physics / Big Data to Limited known Physics / Small Data; Causal AI]
  4. Team effort 4 Rahul Krishna Columbia Shahriar Iqbal UofSC M.

    A. Javidian Purdue Baishakhi Ray Columbia Christian Kästner CMU Norbert Siegmund Leipzig Miguel Velez CMU Sven Apel Saarland Lars Kotthoff Wyoming Marco Valtorta UofSC Vivek Nair Facebook Tim Menzies NCSU
  5. 7

  6. Empirical observations confirm that systems are becoming increasingly configurable. [Plots: number of parameters vs. release time for Apache and Hadoop (MapReduce, HDFS), roughly 1999–2014] [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]
  7. Empirical observations confirm that systems are becoming increasingly configurable. [Plots from the paper: number of parameters vs. release time for Storage-A, MySQL, Apache, and Hadoop (MapReduce, HDFS), roughly 1998–2014] [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]
  8. Configurations determine the performance behavior.

    void Parrot_setenv(. . . name, . . . value){
    #ifdef PARROT_HAS_SETENV
        my_setenv(name, value, 1);
    #else
        int name_len = strlen(name);
        int val_len = strlen(value);
        char* envs = glob_env;
        if (envs == NULL) {
            return;
        }
        strcpy(envs, name);
        strcpy(envs + name_len, "=");
        strcpy(envs + name_len + 1, value);
        putenv(envs);
    #endif
    }

    #ifdef LINUX
    extern int Parrot_signbit(double x){ …

    [Annotations: the #ifdef branches on PARROT_HAS_SETENV and LINUX select different code paths, which are linked to Speed and Energy]
  9. 11 How do we understand performance behavior of real-world highly-configurable

    systems that scale well… … and enable developers/users to reason about qualities (performance, energy) and to make tradeoffs?
  10. Scope: Configuration across stack 12 CPU Memory Controller GPU Lib

    API Clients Devices Network Task Scheduler Device Drivers File System Compilers Memory Manager Process Manager Frontend Application Layer OS/Kernel Layer Hardware Layer Deployment SoC Generic hardware Production Servers
  11. Composed Systems (Single-node) • Online processing of sensory data •

    Neural network models • Homogeneous tasks 13
  12. Composed Systems (Multi-node) • Online processing of sensory data •

    Graph-type models • Heterogeneous tasks
  13. Composed Systems (IoT) • They are integrated with cloud services

    and we do not have access to those systems, but we can configure them to some extent.
  14. Distributed (big data) • The components may be assigned to

    different hardware nodes without direct control of the users • These are typically configurable, but need expertise to find the best configuration 16
  15. Cyber-physical systems • We may not have direct access to

    the hardware, so remote debugging is needed.
  16. Outline 19 Case Study Transfer Learning Theory Building Guided Sampling

    Current Research [SEAMS’17] [ASE’17] [FSE’18]
  17. SocialSensor 21 Content Analysis Orchestrator Crawling Search and Integration Tweets:

    [5k-20k/min] Every 10 min: [100k tweets] Tweets: [10M] Fetch Store Push Store Crawled items Fetch Internet
  18. Challenges 22 Content Analysis Orchestrator Crawling Search and Integration Tweets:

    [5k-20k/min] Every 10 min: [100k tweets] Tweets: [10M] Fetch Store Push Store Crawled items Fetch Internet 100X 10X Real time
  19. The default configuration was bad, and so was the expert's. [Scatter plot: throughput (ops/sec, 0–1500) vs. average write latency (µs, 0–5000), marking the default configuration, the configuration recommended by an expert, and the optimal configuration; higher throughput and lower latency are better]
  20. The default configuration was bad, and so was the expert's. [Scatter plot: throughput (ops/sec, up to ~2.5×10^4) vs. latency (ms, 0–300), marking the default, expert-recommended, and optimal configurations; higher throughput and lower latency are better]
  21. The default configuration is typically bad and the optimal configuration is noticeably better than the median. [Scatter plot: throughput (ops/sec) vs. average write latency (µs), marking the default and optimal configurations] • Default is bad • 2X-10X faster than worst • Noticeably faster than median
  22. Identifying the root cause of performance faults is difficult •

    Code was transplanted from TX1 to TX2 • TX2 is more powerful, but software was 2x slower than TX1 • Three misconfigurations: ◦ Wrong compilation flags for compiling CUDA (didn't use 'dynamic' flag) ◦ Wrong CPU/GPU modes (didn't use TX2 optimized cores) ◦ Wrong Fan mode (didn't change to handle thermal throttling) Fig 1. Performance fault on NVIDIA TX2 https://forums.developer.nvidia.com/t/50477 31
  23. Fixing performance faults is difficult • These were not in

    the default settings • Took 1 month to fix in the end... • We need to do this better Fig 1. Performance fault on NVIDIA TX2 https://forums.developer.nvidia.com/t/50477 32
  24. Performance distributions are multi-modal and have long tails
    • Certain configurations can cause performance to take abnormally large values
    • Faulty configurations take the tail values (worse than the 99.99th percentile)
    • Certain configurations can cause faults on multiple performance objectives.
  25. Setting the scene: the configuration space is the Cartesian product of options, ℂ = O1 × O2 × ⋯ × O19 × O20 (options such as dead code removal, constant folding, loop unrolling, function inlining). A configuration c1 = 0 × 0 × ⋯ × 0 × 1, c1 ∈ ℂ, is compiled (compiler, e.g., SaC or LLVM: program, compiled code, instrumented binary), deployed on hardware, and measured on non-functional, measurable/quantifiable aspects: compile time fc(c1) = 11.1 ms, execution time fe(c1) = 110.3 ms, energy fen(c1) = 100 mWh.
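A minimal sketch of this setup in Python, restricted to four of the named options for brevity (the measurement routine is a stub that returns the slide's example values, not a real deployment):

    import itertools

    # Configuration space: Cartesian product of binary compiler options
    options = ["dead_code_removal", "constant_folding", "loop_unrolling", "function_inlining"]
    configuration_space = list(itertools.product([0, 1], repeat=len(options)))

    def measure(config):
        """Compile, deploy, and measure the program under `config`; here stubbed
        with the slide's example compile time, execution time, and energy."""
        return {"compile_ms": 11.1, "exec_ms": 110.3, "energy_mwh": 100.0}

    c1 = configuration_space[1]          # e.g., (0, 0, 0, 1)
    print(dict(zip(options, c1)), measure(c1))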
  26. A typical approach for understanding the performance behavior is sensitivity analysis: draw a training/sample set of configurations c1, c2, c3, …, cn from O1 × O2 × ⋯ × O19 × O20, measure y1 = f(c1), y2 = f(c2), …, yn = f(cn), and learn a model f̂ ∼ f(⋅) from these samples.
  27. The performance model f̂ learned from the training/sample set could be in any appropriate form of black-box model (the tutorial later uses regression trees and Gaussian processes).
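A sketch of the learn step under these assumptions (scikit-learn regression tree as the black-box learner; the measurement stub reuses the example model shown on a later slide instead of a real, expensive measurement):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def measure_system(c):
        # Placeholder for the expensive measurement y_i = f(c_i) on the real system
        return 1.2 + 3*c[0] + 5*c[2] + 0.9*c[6] + 0.8*c[2]*c[6] + 4*c[0]*c[2]*c[6]

    rng = np.random.default_rng(0)
    d = 20                                      # number of binary options O1..O20
    X = rng.integers(0, 2, size=(100, d))       # training/sample set c1..cn
    y = np.array([measure_system(c) for c in X])

    f_hat = DecisionTreeRegressor(min_samples_leaf=3).fit(X, y)   # f_hat ~ f(.)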
  28. Evaluating a performance model: learn f̂ ∼ f(⋅) from the training/sample set, then evaluate its accuracy on held-out configurations using the absolute percentage error, APE(f̂, f) = |f̂(c) − f(c)| / f(c) × 100.
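The accuracy metric itself is a one-liner; a self-contained sketch:

    import numpy as np

    def ape(f_hat_pred, f_true):
        """Absolute percentage error per configuration, in percent:
        APE = |f_hat(c) - f(c)| / f(c) * 100."""
        f_hat_pred, f_true = np.asarray(f_hat_pred), np.asarray(f_true)
        return np.abs(f_hat_pred - f_true) / np.abs(f_true) * 100.0

    print(ape([10.5, 22.0], [10.0, 20.0]))   # -> [ 5. 10.]
    # The mean over a held-out set gives the MAPE reported in later slides.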
  29. A performance model contains useful information about influential options and interactions, e.g., f(⋅) = 1.2 + 3o1 + 5o3 + 0.9o7 + 0.8o3o7 + 4o1o3o7, where f : ℂ → ℝ.
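Written out as code, the learned model makes the influential options explicit; a direct encoding of this example:

    def f_hat(o1, o3, o7):
        # Only o1, o3, o7 and their interactions appear; all other options drop out.
        return 1.2 + 3*o1 + 5*o3 + 0.9*o7 + 0.8*o3*o7 + 4*o1*o3*o7

    print(f_hat(o1=1, o3=1, o7=1))   # 14.9, the value reused in a later sampling example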
  30. The performance model can then be used to reason about qualities. For the Parrot_setenv code shown earlier, f(·) = 5 + 3 × o1 (execution time in s), so f(o1 := 0) = 5 and f(o1 := 1) = 8.
  31. Insight: Performance measurements of the real system are “similar” to the ones from the simulators. [Diagram: measure configurations in a simulator (Gazebo) to obtain performance data] So why not reuse these data, instead of measuring on the real robot?
  32. We developed methods to make learning cheaper via transfer learning. [Diagram: extract transferable knowledge from the source (given data and model) and reuse it when learning in the target; the slide embeds excerpts of the paper's Intuition and Preliminary Concepts sections, which define the configuration space C = Dom(F1) × ⋯ × Dom(Fd), environments e = [w, h, v] ∈ E = W × H × V, the performance model f : F × E → ℝ with observations yi = f(xi) + εi, and the performance distribution pd : E → Δ(ℝ)] Goal: Gain strength by transferring information across environments.
  33. What is the advantage of transfer learning? • During learning you may need thousands of rotten and fresh potatoes and hours of training to learn. • But now, using the same knowledge of rotten features, you can identify rotten tomatoes with fewer samples and less training time. • You may have learned during daytime with enough light and exposure; but your present tomato-identification job is at night. • You may have learned sitting very close, just beside the box of potatoes; but now, for tomato identification, you are on the other side of the glass.
  34. [Diagram: measure configurations on the TurtleBot and in the simulator (Gazebo); reuse the simulator data when learning] Our transfer learning solution: f(o1, o2) = 5 + 3o1 + 15o2 − 7o1 × o2. [P. Jamshidi, et al., “Transfer learning for improving model predictions ….”, SEAMS’17]
  35. Gaussian processes for performance modeling. [Figure: GP posterior over output f(x) vs. input x at step t = n and after a new observation at t = n + 1, showing the observations, the posterior mean, and the uncertainty band]
  36. Gaussian processes enable reasoning about performance. Step 1: Fit a GP to the data seen so far. Step 2: Explore the model for the regions of most variance. Step 3: Sample that region. Step 4: Repeat. [Figure: sequential design over the configuration space, showing the empirical model, the experiments, and the selection criteria]
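A compact sketch of this four-step loop (assuming scikit-learn's GP regressor; the measured objective is a toy stand-in for a real, expensive experiment):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def measure(x):                      # toy stand-in for an expensive measurement
        return np.sin(3 * x) + 0.1 * np.random.randn()

    candidates = np.linspace(0, 2, 200).reshape(-1, 1)
    X = candidates[[10, 120]].copy()                 # a couple of initial experiments
    y = np.array([measure(x[0]) for x in X])

    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(20):
        gp.fit(X, y)                                 # Step 1: fit GP to data seen so far
        mu, sigma = gp.predict(candidates, return_std=True)
        x_next = candidates[np.argmax(sigma)]        # Step 2: region of most variance
        y_next = measure(x_next[0])                  # Step 3: sample that region
        X = np.vstack([X, x_next]); y = np.append(y, y_next)   # Step 4: repeat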
  37. The intuition behind our transfer learning approach. Intuition: Observations on the source(s) can affect predictions on the target. Example: Learning the chess game makes learning the Go game a lot easier! To capture the relationship between the source and target functions g and f, using the observations Ds, Dt to build the predictive model f̂, we define the kernel k(f, g, x, x′) = kt(f, g) × kxx(x, x′), where kt represents the correlation between the source and target function, while kxx is the covariance over inputs; kxx is typically parameterized and its parameters are learnt by maximizing the marginal likelihood of the model given the observations from source and target, Ds, Dt.
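A minimal numpy illustration of that product kernel (an illustrative sketch, not the implementation used in the tutorial): each observation carries a task index (0 = source g, 1 = target f), and the kernel multiplies a 2×2 task-correlation matrix kt by an RBF covariance kxx over the configurations.

    import numpy as np

    def multitask_kernel(X1, t1, X2, t2, Kt, lengthscale=1.0):
        """k(f, g, x, x') = kt(f, g) * kxx(x, x').
        X1, X2: inputs; t1, t2: task indices (0 = source, 1 = target);
        Kt: 2x2 task-correlation matrix, e.g. [[1, rho], [rho, 1]]."""
        sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
        kxx = np.exp(-0.5 * sq / lengthscale**2)      # RBF over configurations
        kt = Kt[np.ix_(t1, t2)]                       # correlation between tasks
        return kt * kxx

    # Kernel matrix over 3 source and 2 target observations, correlation rho = 0.9
    Xs, Xt = np.random.rand(3, 2), np.random.rand(2, 2)
    X = np.vstack([Xs, Xt]); t = np.array([0, 0, 0, 1, 1])
    K = multitask_kernel(X, t, X, t, Kt=np.array([[1.0, 0.9], [0.9, 1.0]]))
    print(K.shape)   # (5, 5)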
  38. CoBot experiment: DARPA BRASS. [Scatter plot: localization error [m] vs. CPU utilization [%], with the energy constraint, the safety constraint, the Pareto front, and the sweet spot marked; configuration options no_of_particles = x, no_of_refinement = y]
  39. CoBot experiment. [Heatmaps of CPU [%] over the configuration space: the source (given), the target (ground truth after 6 months), the prediction with 4 samples, and the prediction with transfer learning]
  40. “Transfer Learning for Improving Model Predictions in Highly Configurable Software”, by Pooyan Jamshidi, Miguel Velez, Christian Kästner (Carnegie Mellon University), Norbert Siegmund (Bauhaus-University Weimar), and Prasad Kawthekar (Stanford University). [Slide shows the paper's abstract and Fig. 1: transfer learning for performance model learning, with the simulator as source and the robot as target] Details: [SEAMS ’17]
  41. Looking further: When transfer learning goes wrong. [Box plots: absolute percentage error [%] for predictions using different sources, plus a non-transfer-learning baseline]

    Source        s      s1     s2     s3     s4     s5     s6
    noise-level   0      5      10     15     20     25     30
    corr. coeff.  0.98   0.95   0.89   0.75   0.54   0.34   0.19
    µ(pe)         15.34  14.14  17.09  18.71  33.06  40.93  46.75

    It worked! … It didn't! Insight: Predictions become more accurate when the source is more related to the target.
  42. [Heatmaps (a)–(f): CPU usage [%] over number of particles × number of refinements for six environment pairs; three are annotated “It worked!” and three “It didn't!”]
  43. Key question: Can we develop a theory to explain when transfer learning works? [Diagram: extract transferable knowledge from the source (given data and model) and reuse it when learning in the target; the slide again embeds the paper's preliminary definitions of the configuration/environment space, the performance model, and the performance distribution] Q1: How are the source and target “related”? Q2: What characteristics are preserved? Q3: What are the actionable insights?
  44. We hypothesized that we can exploit similarities across environments to learn “cheaper” performance models: measure the same configurations c1, …, cn in a source environment (execution time of program X), giving ys1 = fs(c1), …, ysn = fs(cn), and in a target environment (execution time of program Y), giving yt1 = ft(c1), …, ytn = ft(cn), and quantify their similarity. [P. Jamshidi, et al., “Transfer learning for performance modeling of configurable systems….”, ASE’17]
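One direct way to probe this hypothesis is to measure the same configurations in both environments and compute the (rank) correlation of the responses; a sketch in which the two environments are stand-in analytic models reusing the example coefficients shown on the following slides:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    def f_source(c):   # e.g., execution time of program X (placeholder model)
        return 1.2 + 3*c[0] + 5*c[2] + 0.9*c[6] + 0.8*c[2]*c[6] + 4*c[0]*c[2]*c[6]

    def f_target(c):   # e.g., execution time of program Y (placeholder model)
        return 10.4 - 2.1*c[0] + 1.2*c[2] + 2.2*c[6] + 0.1*c[0]*c[2] - 2.1*c[2]*c[6] + 14*c[0]*c[2]*c[6]

    rng = np.random.default_rng(1)
    C = rng.integers(0, 2, size=(200, 10))           # same configurations in both environments
    ys = np.array([f_source(c) for c in C])
    yt = np.array([f_target(c) for c in C])
    print(pearsonr(ys, yt)[0], spearmanr(ys, yt)[0])  # similarity across environments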
  45. Our empirical study: We looked at different highly-configurable systems to gain insights. [P. Jamshidi, et al., “Transfer learning for performance modeling of configurable systems….”, ASE’17]
    SPEAR (SAT solver): analysis time; 14 options, 16,384 configurations; SAT problems; 3 hardware, 2 versions
    X264 (video encoder): encoding time; 16 options, 4,000 configurations; video quality/size; 2 hardware, 3 versions
    SQLite (DB engine): query time; 14 options, 1,000 configurations; DB queries; 2 hardware, 2 versions
    SaC (compiler): execution time; 50 options, 71,267 configurations; 10 demo programs
  46. Linear shift happens only in limited environmental changes. [Scatter plots: target vs. source throughput]

    Software  Environmental change        Severity  Corr.
    SPEAR     NUC/2 -> NUC/4              Small     1.00
              Amazon_nano -> NUC          Large     0.59
              Hardware/workload/version   V Large   -0.10
    x264      Version                     Large     0.06
              Workload                    Medium    0.65
    SQLite    write-seq -> write-batch    Small     0.96
              read-rand -> read-seq       Medium    0.50

    Implication: Simple transfer learning is limited to hardware changes in practice.
  47. Influential options and interactions are preserved across environments.

    Software  Environmental change        Severity  Dim  t-test
    x264      Version                     Large     16   12  10
              Hardware/workload/version   V Large        8   9
    SQLite    write-seq -> write-batch    V Large   14   3   4
              read-rand -> read-seq       Medium         1   1
    SaC       Workload                    V Large   50   16  10

    We only need to explore part of the space: 2^16 / 2^50 ≈ 0.000000000058. Implication: Avoid wasting budget on non-informative parts of the configuration space and focus where it matters.
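The fraction on the slide is simply the reduced space over the full space:

    full_space    = 2 ** 50      # all 50 binary options (SaC)
    reduced_space = 2 ** 16      # e.g., only the influential dimensions identified for SaC
    print(reduced_space / full_space)    # ~5.8e-11, i.e., 0.000000000058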
  48. Transfer learning across environment 63 O1 × O2 × ⋯

    × O19 × O20 0 × 0 × ⋯ × 0 × 1 0 × 0 × ⋯ × 1 × 0 0 × 0 × ⋯ × 1 × 1 1 × 1 × ⋯ × 1 × 0 1 × 1 × ⋯ × 1 × 1 ⋯ c1 c2 c3 cn ys1 = fs (c1 ) ys2 = fs (c2 ) ys3 = fs (c3 ) ysn = fs (cn ) Source (Execution time of Program X) Learn performance model ̂ fs ∼ fs ( ⋅ )
  49. Observation 1: Not all options and interactions are influential, and the degree of interaction between options is not high. f̂s(⋅) = 1.2 + 3o1 + 5o3 + 0.9o7 + 0.8o3o7 + 4o1o3o7, over ℂ = O1 × O2 × O3 × O4 × O5 × O6 × O7 × O8 × O9 × O10.
  50. Observation 2: Influential options and interactions are preserved across environments

    65 ̂ fs ( ⋅ ) = 1.2 + 3o1 + 5o3 + 0.9o7 + 0.8o3 o7 + 4o1 o3 o7 ̂ ft ( ⋅ ) = 10.4 − 2.1o1 + 1.2o3 + 2.2o7 + 0.1o1 o3 − 2.1o3 o7 + 14o1 o3 o7
  51. “Transfer Learning for Performance Modeling of Configurable Systems: An Exploratory Analysis”, by Pooyan Jamshidi (Carnegie Mellon University), Norbert Siegmund (Bauhaus-University Weimar), Miguel Velez, Christian Kästner, Akshay Patel, and Yuvraj Agarwal (Carnegie Mellon University). [Slide shows the paper's abstract and Fig. 1: transfer learning takes advantage of transferable knowledge from the source to learn an accurate, reliable, and less costly model for the target environment] Details: [ASE ’17]
  52. How to sample the configuration space to learn a “better”

    performance behavior? How to select the most informative configurations?
  53. The similarity across environment is a rich source of knowledge

    for exploration of the configuration space
  54. When we treat the system as a black box, we cannot typically distinguish between different configurations c1, c2, c3, …, cn in O1 × O2 × ⋯ × O19 × O20. • We therefore end up blindly exploring the configuration space. • That is essentially the key reason why “most” work in this area considers random sampling.
  55. Without considering this knowledge, many samples may not provide new information. With f̂s(⋅) = 1.2 + 3o1 + 5o3 + 0.9o7 + 0.8o3o7 + 4o1o3o7 over O1 × O2 × ⋯ × O10, the configurations c1 = 1×0×1×0×0×0×1×0×0×0, c2 = 1×0×1×0×0×0×1×0×0×1, c3 = 1×0×1×0×0×0×1×0×1×0, …, c128 = 1×1×1×1×1×1×1×1×1×1 all yield f̂s(ci) = 14.9.
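A sketch of why these samples are redundant: with the influential options fixed, flipping any of the remaining options never changes the predicted response.

    import itertools

    def f_hat_s(c):    # learned source model; only o1, o3, o7 matter
        o1, o3, o7 = c[0], c[2], c[6]
        return 1.2 + 3*o1 + 5*o3 + 0.9*o7 + 0.8*o3*o7 + 4*o1*o3*o7

    # Fix o1 = o3 = o7 = 1 and enumerate the remaining 7 binary options
    values = set()
    for rest in itertools.product([0, 1], repeat=7):
        c = [1, rest[0], 1, rest[1], rest[2], rest[3], 1, rest[4], rest[5], rest[6]]
        values.add(f_hat_s(c))
    print(values)   # {14.9} -- all 128 configurations give the same value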
  56. Without this knowledge, many blind/random samples may not provide any additional information about the performance of the system.
  57. Configurations of deep neural networks affect accuracy and energy consumption. [Scatter plot: energy consumption [J] vs. validation (test) error for a CNN on the CNAE-9 data set; annotations: 72%, 22X. Background: figure from an architecture-search paper showing CIFAR-10 and ImageNet models constructed from the learned cells]
  58. DNN measurements are costly. Each sample costs ~1h; 4,000 × 1h ≈ 6 months. Yes, that's the cost we paid for conducting our measurements!
  59. L2S enables learning a more accurate model with fewer samples by exploiting the knowledge from the source. [Plot: mean absolute percentage error vs. sample size for the convolutional neural network (DNN, hard environmental change); methods: L2S+GP, L2S+DataReuseTL, DataReuseTL, ModelShift, Random+CART]
  60. L2S may also help the data-reuse approach to learn faster. [Plots: mean absolute percentage error vs. sample size for DNN (hard) and XGBoost (hard); methods: L2S+GP, L2S+DataReuseTL, DataReuseTL, ModelShift, Random+CART]
  61. In some environments the similarities across environments may be too low, and this results in “negative transfer”. [Plots: mean absolute percentage error vs. sample size for XGBoost (hard) and Apache Storm (hard); methods: L2S+GP, L2S+DataReuseTL, DataReuseTL, ModelShift, Random+CART]
  62. The samples generated by L2S contain more information… “entropy <-> information gain”. [Plots: entropy [bits] vs. sample size for DNN, XGBoost, and Storm, comparing L2S, random sampling, and the maximum entropy]
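Entropy here can be estimated from the empirical distribution of responses in a sample set; a minimal sketch (the histogram binning is an assumption made for illustration):

    import numpy as np

    def empirical_entropy(y, bins=16):
        """Shannon entropy (bits) of the histogram of responses in a sample set."""
        p, _ = np.histogram(y, bins=bins, density=False)
        p = p[p > 0] / p.sum()
        return -(p * np.log2(p)).sum()

    rng = np.random.default_rng(0)
    informative = rng.normal(size=70)        # responses that keep varying
    redundant   = np.full(70, 14.9)          # repeated, uninformative responses
    print(empirical_entropy(informative), empirical_entropy(redundant))   # high vs ~0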
  63. Limitations • Limited number of systems and environmental changes •

    Synthetic models • https://github.com/pooyanjamshidi/GenPerf • Binary options • Non-binary options -> binary • Negative transfer 83
  64. Software 2.0 vision: increasingly customized and configurable, with increasingly competing objectives (accuracy, training speed, inference speed, model size, energy).
  65. Deep neural network as a highly configurable system. [Diagram: a feed-forward network (input layer, hidden layer, output layer) and the DNN system development stack (network design, model compiler, hybrid deployment, OS/hardware), with neural search, hyper-parameter tuning, hardware optimization, and deployment topology annotated across the stack; the scope of this project spans the stack. The slide background contains excerpts of a research-plan document (metrics M6–M14 and technical aims)]
  66. We found many configurations with the same accuracy while having drastically different energy demand. [Scatter plots: energy consumption [J] vs. validation (test) error for a CNN on the CNAE-9 data set, with the Pareto frontier marked; annotations: 72%, 22X, 10%, 300 J]
  67. Debugging based on statistical correlation could be misleading. [Scatter plots: latency vs. swap memory and vs. GPU growth, for GPU growth = 33% / 66% and swap = 1 Gb / 4 Gb] • The correlation between GPU growth and latency is as strong as that between swap memory and latency, but considerably less noisy. • Therefore, a feature-selection method based on correlation that ignores the causal structure prefers GPU growth as the predictor for latency, which is misleading.
  68. Why knowing about the underlying causal structure matters A transfer

    learning scenario • The relation between X1 and X2 is about equally strong as the relation between X2 and X3, but more noisy. • {X3} and {X1, X3} are preferred over {X1}, because predicting Y from X1 leads to: • A larger variance than predicting Y from X3 • A larger bias than predicting Y from both X1 and X3. 93 Magliacane, Sara, et al. "Domain adaptation by using causal inference to predict invariant conditional distributions." Advances in Neural Information Processing Systems. 2018.
  69. CAUPER iteratively explores potential performance repairs. • We measure the Individual Treatment Effect (ITE) of each repair: the difference between the probability that the performance fault is fixed after a repair and the probability that the fault persists after the repair. • The larger the value, the more likely we are to repair the fault. • We pick the repair with the largest ITE.
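A minimal sketch of that selection rule (repair names and probabilities are purely illustrative, not CAUPER's actual implementation; in CAUPER the probabilities would come from the learned causal model):

    # ITE(repair) = P(fault fixed | do(repair)) - P(fault persists | do(repair))
    candidate_repairs = {
        "increase_swap_mem":  {"p_fixed": 0.71, "p_faulty": 0.29},   # hypothetical
        "set_gpu_growth_66":  {"p_fixed": 0.55, "p_faulty": 0.45},   # hypothetical
        "change_fan_mode":    {"p_fixed": 0.48, "p_faulty": 0.52},   # hypothetical
    }

    ite = {r: p["p_fixed"] - p["p_faulty"] for r, p in candidate_repairs.items()}
    best_repair = max(ite, key=ite.get)
    print(best_repair, ite[best_repair])   # apply the repair with the largest ITE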
  70. CAUPER is able to find comparable and even better repairs for performance faults compared with performance optimization. [Bar plots: latency gain, heat gain, and energy gain vs. workload (X5k, X10k, X20k, X50k) for CAUPER and SMAC]
  71. CAUPER is able to find repairs with lower costs. [Bar plot: time vs. workload (X5k, X10k, X20k, X50k) for CAUPER and SMAC]
  72. CAUPER's repairs are transferable across environments. [Bar plot: latency gain vs. workload change (X5k-X10k, X5k-X20k, X5k-X50k) for CAUPER, SMAC-Rerun, and SMAC-Reuse]
  73. Team effort Rahul Krishna Columbia Shahriar Iqbal UofSC M. A.

    Javidian Purdue Baishakhi Ray Columbia Christian Kästner CMU Norbert Siegmund Leipzig Miguel Velez CMU Sven Apel Saarland Lars Kotthoff Wyoming Marco Valtorta UofSC Vivek Nair Facebook Tim Menzies NCSU