Causal AI
[Diagram: Well-known Physics, Big Data; Limited known Physics, Small Data]
- Causal Invariances
- Structure Learning
- Concept Learning
- Physics-Informed
Applications:
- Systems
- Autonomy
- Robotics
to understand their performance behavior
[Figure: number of parameters vs. release time for Hadoop (MapReduce, HDFS), version labels 0.1.0 to 2.0.0]
[Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]
Increase in size of configuration space: $2^{180} / 2^{18} = 2^{162}$
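The growth figure on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that reads the slide's numbers as powers of two and assumes every parameter is an independent binary option, which real systems only approximate; `config_space_size` is a hypothetical helper name.

```python
def config_space_size(num_binary_options: int) -> int:
    """Number of distinct configurations for independent on/off options (assumption)."""
    return 2 ** num_binary_options

# Growing from 18 to 180 binary options multiplies the space by 2^162.
growth = config_space_size(180) // config_space_size(18)
print(growth == 2 ** 162)
```

Non-binary or interdependent parameters would change the exact count, but not the exponential trend.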
[Figure: number of parameters vs. release time for Storage-A (7/2006 to 7/2014), MySQL (version labels 3.23.0 to 5.6.2), Apache (version labels 1.3.14 to 2.3.4), and Hadoop (MapReduce and HDFS, version labels 0.1.0 to 2.0.0)]
[Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]
significantly over a short period
[Diagram: components Content Analysis, Orchestrator, Crawling, Search and Integration, Internet; flow labels Fetch, Store, Push, Crawled items; load labels Tweets: [5k-20k/min], Every 10 min: [100k tweets], Tweets: [10M]; scale labels 100X, 10X, Real time]
The default configuration is typically bad and the optimal configuration is noticeably better than median
[Figure: average write latency (µs) across configurations, with the Default Configuration and the Optimal Configuration marked]
• Default is bad
• Optimal is 2X-10X faster than the worst
• Optimal is noticeably faster than the median
Code was transplanted from TX1 to TX2
• TX2 is more powerful, but the software was 2x slower than on TX1
• Three misconfigurations:
◦ Wrong compilation flags for compiling CUDA (didn't use the 'dynamic' flag)
◦ Wrong CPU/GPU modes (didn't use the TX2-optimized cores)
◦ Wrong fan mode (didn't change it to handle thermal throttling)
Fig 1. Performance fault on NVIDIA TX2
https://forums.developer.nvidia.com/t/50477
the default settings
• Took 1 month to fix in the end...
• We need to do this better
configurations can cause performance to take abnormally large values
• Faulty configurations take the tail values (worse than 99.99th percentile)
• Certain configurations can cause faults on multiple performance objectives.
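The tail criterion in the bullets above can be sketched as a simple percentile cut. This is an illustration under assumptions, not the authors' tooling: `tail_configurations` is a hypothetical helper, the latencies are synthetic, and "worse" is taken to mean a larger measured value.

```python
import numpy as np

def tail_configurations(latencies, percentile=99.99):
    """Return indices of measurements above the given percentile, plus the threshold."""
    latencies = np.asarray(latencies, dtype=float)
    threshold = np.percentile(latencies, percentile)
    return np.flatnonzero(latencies > threshold), threshold

# Synthetic, made-up latency measurements, one per configuration.
rng = np.random.default_rng(0)
measured = rng.lognormal(mean=3.0, sigma=0.5, size=100_000)
faulty_idx, cutoff = tail_configurations(measured)
print(len(faulty_idx), cutoff)
```

To look for configurations that are faulty on several objectives at once, the same cut could be applied per objective and the resulting index sets intersected.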
[Figure: Latency plotted against GPU Growth and against Swap Mem; panel labels: … = 33%, GPU Growth = 66%, Swap = 1Gb, Swap = 4Gb]
• The correlation between GPU Growth and Latency is as strong as that between Swap Mem and Latency, but considerably less noisy.
• Therefore, a feature selection method that relies on correlation and ignores the causal structure prefers GPU Growth as the predictor for Latency, which is misleading.
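A small simulation can illustrate why ranking features by correlation may mislead. The structure below is purely an assumption for illustration, not the measured system from the talk: a hidden `load` variable drives both GPU growth and latency, swap memory affects latency directly, and GPU growth has no causal effect on latency. All names and coefficients are hypothetical.

```python
import numpy as np

def simulate(n, rng, do_gpu=None, do_swap=None):
    """Sample from a hypothetical structural model.

    Assumed structure: load -> gpu, load -> latency, swap -> latency.
    Passing do_gpu or do_swap fixes that variable, mimicking an intervention.
    """
    load = rng.normal(0.0, 1.0, n)  # hidden common cause (assumption)
    swap = rng.normal(0.0, 1.0, n) if do_swap is None else np.full(n, float(do_swap))
    gpu = load + rng.normal(0.0, 0.1, n) if do_gpu is None else np.full(n, float(do_gpu))
    latency = load + 0.5 * swap + rng.normal(0.0, 0.5, n)
    return swap, gpu, latency

rng = np.random.default_rng(0)
n = 20_000

# Observational data: correlation ranks GPU growth above swap memory.
swap, gpu, latency = simulate(n, rng)
corr_gpu = np.corrcoef(gpu, latency)[0, 1]
corr_swap = np.corrcoef(swap, latency)[0, 1]

# Interventions: changing GPU growth leaves mean latency unchanged; changing swap does not.
effect_gpu = simulate(n, rng, do_gpu=2.0)[2].mean() - simulate(n, rng, do_gpu=0.0)[2].mean()
effect_swap = simulate(n, rng, do_swap=2.0)[2].mean() - simulate(n, rng, do_swap=0.0)[2].mean()

print(f"corr(gpu, latency)={corr_gpu:.2f}  corr(swap, latency)={corr_swap:.2f}")
print(f"effect of do(gpu)={effect_gpu:.2f}  effect of do(swap)={effect_swap:.2f}")
```

Under these assumptions the cleaner correlate is the wrong knob to turn: only the intervention on swap memory shifts latency.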
learning scenario
• The relation between X1 and X2 is about as strong as the relation between X2 and X3, but more noisy.
• {X3} and {X1, X3} are preferred over {X1}, because predicting Y from X1 leads to:
◦ A larger variance than predicting Y from X3
◦ A larger bias than predicting Y from both X1 and X3.
Magliacane, Sara, et al. "Domain adaptation by using causal inference to predict invariant conditional distributions." Advances in Neural Information Processing Systems. 2018.
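The preference among feature sets can be reproduced on synthetic data. This sketch assumes a linear chain X1 -> Y -> X3 in which Y plays the role of the slide's X2, with equal coefficients and a noisier first link; the noise levels and the hypothetical `fit_mse` helper are illustrative choices, and only in-sample error in a single domain is reported.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Assumed chain X1 -> Y -> X3: equally strong links, the first one noisier.
x1 = rng.normal(0.0, 1.0, n)
y = x1 + rng.normal(0.0, 1.0, n)
x3 = y + rng.normal(0.0, 0.3, n)

def fit_mse(features, target):
    """Least-squares fit with an intercept; returns the in-sample mean squared error."""
    design = np.column_stack(list(features) + [np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.mean((target - design @ coef) ** 2))

mse_x1 = fit_mse([x1], y)
mse_x3 = fit_mse([x3], y)
mse_both = fit_mse([x1, x3], y)
print(mse_x1, mse_x3, mse_both)
```

In this setup {X3} and {X1, X3} both predict Y far better than {X1} alone, which is the ordering the bullets describe. Whether that ordering survives a change of domain depends on which conditional distributions stay invariant, which is the question the cited paper addresses.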
• The difference between the probability that the performance fault is fixed after a repair and the probability that the performance fault is still present after a repair.
• The larger the value, the more likely the repair is to fix the fault.
• We pick the repair with the largest ITE.
CAUPER iteratively explores potential performance repairs
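The selection rule in the bullets above can be sketched as follows, assuming the ITE is the difference described in the first bullet. The repair names and probabilities are made up for illustration, and how CAUPER estimates these probabilities is not shown here; `ite` and `pick_repair` are hypothetical helpers.

```python
def ite(p_fixed, p_still_faulty):
    """Difference between P(fault fixed | repair) and P(fault remains | repair)."""
    return p_fixed - p_still_faulty

def pick_repair(candidates):
    """Return the name of the candidate repair with the largest ITE."""
    return max(candidates, key=lambda name: ite(*candidates[name]))

# Hypothetical candidate repairs: name -> (P(fixed), P(still faulty)).
candidates = {
    "increase_swap": (0.70, 0.30),
    "change_cpu_mode": (0.55, 0.45),
    "enable_fan": (0.40, 0.60),
}
best = pick_repair(candidates)
print(best)
```

An iterative variant would apply the chosen repair, re-measure performance, and repeat the selection until the fault is gone; that loop is omitted from this sketch.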
A. Javidian (Purdue)
Baishakhi Ray (Columbia)
Christian Kästner (CMU)
Norbert Siegmund (Leipzig)
Miguel Velez (CMU)
Sven Apel (Saarland)
Lars Kotthoff (Wyoming)
Marco Valtorta (UofSC)