Computing Active Subspaces

My talk at the Joint Mathematics Meeting, San Antonio, January 2015.

Paul Constantine

January 11, 2015
Transcript

  1. Paul Constantine, Colorado School of Mines, paulconstantine.net, @DrPaulynomial, activesubspaces.org. With Qiqi Wang (MIT, AeroAstro) and David Gleich (Purdue, CS). COMPUTING ACTIVE SUBSPACES. WARNING: These slides are meant to complement the oral presentation. Use out of context at your own risk. This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Number DE-SC-0011077.
  2. ACTIVE COLLABORATIONS
     •  Scramjet safety (Stanford PSAAP Center)
     •  Photovoltaic solar cell modeling (Mark Campanelli, NREL)
     •  Aerospace shape optimization (Juan Alonso, Stanford)
     •  Combustion stability analysis (Franck Nicoud, University Montpellier 2)
     •  Evapotranspiration in hydrology (Reed Maxwell, School of Mines)
     •  In-situ combustion reservoir simulation (Margot Gerritsen, Stanford)
     •  Combustion LES calibration (Guilhem Lacaze, Mohammad Khalil, SNL)
     •  Atmospheric re-entry (Nagi Mansour, NASA Ames; Thierry Magin, VKI)
     •  Subsurface sensitivity and inversion (Reed Maxwell, School of Mines)
     •  Wind turbine modeling (Peter Graf, Katherine Dykes, NREL)
     (And I'm open to others that might make me rich, famous, and/or tenured.) There are more applications than I have time for.
  3. AUDIENCE POLL How many dimensions is “high” dimensions?

  4. $f(x)$

  5. Doing things in high dimensions is hard, in general. APPROXIMATION: $\tilde{f}(x) \approx f(x)$. OPTIMIZATION: $\min_x f(x)$. INTEGRATION: $\int f(x)\,\rho\,dx$. INVERSION: $x = f^{-1}(y)$. You must first discover and then exploit some low-dimensional structure to make these problems tractable.
  6. DISCOVER / EXPLOIT.
     Pray for main effects dominated: $f(x) \approx f_1(x_1) + \cdots + f_m(x_m)$. Use sparse grids.
     Pray for sparsity: $f(x) \approx \sum_{i=1}^{P} a_i \phi_i(x)$, many $a_i = 0$. Use compressed sensing.
     Pray for low rank: $f(x) \approx \sum_{i=1}^{r} f_1^i(x_1) \cdots f_m^i(x_m)$. Use separation of variables (PGD, TT, ACA, ...).
  7. DISCOVER / EXPLOIT. Sensitivity analysis to eliminate variables: work with only the important variables, e.g., $[x_1, \dots, x_{100}]^T \to [x_1, x_4]^T$.
  8. [ Show them that cool animation! ] https://www.youtube.com/watch?v=70BTwlV3Kwg

  9. Active subspaces are defined by the eigendecomposition of a matrix derived from the function. Consider a function and its gradient vector: $f = f(x)$, $x \in \mathcal{X} \subseteq \mathbb{R}^m$, $\nabla_x f = [\partial f/\partial x_1, \dots, \partial f/\partial x_m]^T$, with a probability density $\rho : \mathbb{R}^m \to \mathbb{R}_+$. The uncentered covariance of the gradient and its eigendecomposition: $C = \int (\nabla_x f)(\nabla_x f)^T \rho\,dx = W \Lambda W^T$. Partition the eigendecomposition: $\Lambda = \mathrm{diag}(\Lambda_1, \Lambda_2)$, $W = [W_1\ W_2]$, $W_1 \in \mathbb{R}^{m \times n}$. Rotate and separate the coordinates: $x = W W^T x = W_1 W_1^T x + W_2 W_2^T x = W_1 y + W_2 z$, where $y$ are the active variables and $z$ are the inactive variables.
  10. The eigenvectors indicate the directions along which the function varies the most, on average. Lemma 1: $\lambda_i = \int \left( (\nabla_x f)^T w_i \right)^2 \rho\,dx$, $i = 1, \dots, m$. Proof sketch: definition of $C$. Lemma 2: $\int (\nabla_y f)^T (\nabla_y f)\,\rho\,dx = \lambda_1 + \cdots + \lambda_n$ and $\int (\nabla_z f)^T (\nabla_z f)\,\rho\,dx = \lambda_{n+1} + \cdots + \lambda_m$. Proof sketch: chain rule, linearity of trace, definition of $C$.
  11. DISCOVER the active subspace with random sampling. Draw samples $x_j \sim \rho$. Compute $f_j = f(x_j)$ and $\nabla_x f_j = \nabla_x f(x_j)$. Approximate $C$ with Monte Carlo: $C \approx \frac{1}{N} \sum_{j=1}^{N} \nabla_x f_j \nabla_x f_j^T = \hat{W} \hat{\Lambda} \hat{W}^T$. This is equivalent to the SVD of the samples of the gradient: $\frac{1}{\sqrt{N}} [\nabla_x f_1 \cdots \nabla_x f_N] = \hat{W} \sqrt{\hat{\Lambda}}\, \hat{V}^T$. Called an active subspace method in T. Russi's 2010 Ph.D. thesis, Uncertainty Quantification with Experimental Data in Complex System Models.
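The Monte Carlo procedure on slide 11 is a few lines of linear algebra. A minimal sketch, assuming a hypothetical quadratic test function $f(x) = \frac{1}{2} x^T A x$ (so $\nabla f(x) = A x$); the particular $A$ and sample sizes are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 10, 1000                           # dimension, number of gradient samples
A = np.diag(10.0 ** -np.arange(m))        # hypothetical quadratic with decaying spectrum

X = rng.uniform(-1.0, 1.0, size=(N, m))   # x_j ~ rho (uniform on [-1,1]^m)
G = X @ A.T                               # gradient samples: grad f(x_j) = A x_j

# SVD of the scaled gradient matrix; squared singular values estimate eigenvalues of C
W_hat, s, _ = np.linalg.svd(G.T / np.sqrt(N), full_matrices=False)
lam_hat = s ** 2                          # estimated eigenvalues of C (here C = A^2 / 3)

print(lam_hat[:3])
```

For this quadratic, the leading estimate converges to $\lambda_1 = 1/3$ as $N$ grows, since $C = \frac{1}{3} A^2$ for $x$ uniform on $[-1,1]^m$.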
  12. Approximating the eigenpairs is not as tough as you might think. Using Gittens and Tropp (2011): $P\left\{ \hat{\lambda}_k \le (1 - \varepsilon)\, \lambda_k \right\} \le k \exp\left( -\frac{N \lambda_k^2 \varepsilon^2}{4 \lambda_1 L^2} \right)$, the probability that the eigenvalue estimate falls below the true eigenvalue, where $N$ is the number of samples and $L^2$ is a bound on the squared gradient norm.
  13. Approximating the eigenpairs is not as tough as you might think. Using Gittens and Tropp (2011): $P\left\{ \hat{\lambda}_k \ge (1 + \varepsilon)\, \lambda_k \right\} \le (m - k + 1) \exp\left( -\frac{N \lambda_k \varepsilon^2}{4 L^2} \right)$, the probability that the eigenvalue estimate exceeds the true eigenvalue, where $N$ is the number of samples and $L^2$ is a bound on the squared gradient norm.
  14. Approximating the eigenpairs is not as tough as you might think. Using Gittens and Tropp (2011): with high probability, $N = \Omega\left( \frac{L^2 \lambda_1}{\lambda_k^2 \varepsilon^2} \log(m) \right) \implies |\lambda_k - \hat{\lambda}_k| \le \varepsilon \lambda_k$, where $L^2$ bounds the squared gradient norm, $\varepsilon$ is the relative accuracy, and $m$ is the dimension.
  15. Approximating the eigenpairs is not as tough as you might think. Combining Gittens and Tropp (2011) with Golub and Van Loan (1996) and Stewart (1973): with high probability, $N = \Omega\left( \frac{L^2}{\lambda_1 \varepsilon^2} \log(m) \right) \implies \mathrm{dist}(W_1, \hat{W}_1) \le \frac{4 \lambda_1 \varepsilon}{\lambda_n - \lambda_{n+1}}$, where $L^2$ bounds the squared gradient norm, $\varepsilon$ is the relative accuracy, $m$ is the dimension, and $\lambda_n - \lambda_{n+1}$ is the spectral gap.
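The subspace distance appearing in this bound can be computed as the spectral norm of the difference of the two orthogonal projectors. A small sketch, with hypothetical bases chosen only to illustrate the definition:

```python
import numpy as np

def subspace_distance(W1, W1_hat):
    """dist(W1, W1_hat) = || W1 W1^T - W1_hat W1_hat^T ||_2 (spectral norm)."""
    return np.linalg.norm(W1 @ W1.T - W1_hat @ W1_hat.T, 2)

m, n = 10, 2
I = np.eye(m)
W1 = I[:, :n]                      # "true" subspace: span of e1, e2 (hypothetical)
W1_hat = I[:, 1:n + 1]             # "estimate": span of e2, e3, sharing one direction

print(subspace_distance(W1, W1))      # identical subspaces have distance 0
print(subspace_distance(W1, W1_hat))  # one direction is orthogonal: distance 1
```

This quantity equals the sine of the largest principal angle between the two subspaces, which is why it is a natural error measure for $\hat{W}_1$.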
  16. Q: How many samples of the gradient do I need? A: Scales like the log of the dimension times problem-dependent constants. Q: How do I choose the dimension of the active subspace? A: Look for spectral gaps. Q: What happens if I have inaccurate gradients? A: Bias (that goes to zero) appears in the error bounds. Q: How do I quantify uncertainty in the estimates? A: Use a bootstrap!
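The bootstrap answer above can be sketched in a few lines: resample the gradient samples with replacement and recompute the eigenvalue estimates. The gradients here come from a hypothetical quadratic $f(x) = \frac{1}{2} x^T A x$ purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
m, N, n_boot = 10, 100, 500
A = np.diag(10.0 ** -np.arange(m))       # hypothetical quadratic: grad f(x) = A x
X = rng.uniform(-1.0, 1.0, size=(N, m))
G = X @ A.T                              # gradient samples, one per row

def eig_estimate(G):
    """Eigenvalues of the Monte Carlo estimate of C from gradient samples."""
    C_hat = G.T @ G / len(G)
    return np.sort(np.linalg.eigvalsh(C_hat))[::-1]

# Bootstrap: recompute the estimates on resampled (with replacement) gradients
reps = np.array([eig_estimate(G[rng.integers(0, N, size=N)]) for _ in range(n_boot)])
lo, hi = np.percentile(reps, [2.5, 97.5], axis=0)  # 95% bootstrap intervals

print(lo[0], hi[0])                      # interval around the leading eigenvalue
```

No extra gradient evaluations are needed; the bootstrap reuses the $N$ samples already in hand, which is why it pairs well with expensive simulations.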
  17. $f(x) = \frac{1}{2} x^T A x$, $x \in [-1, 1]^{10}$, $C = \frac{1}{3} A^2$. [Plots: eigenvalues of $A$ vs. index, 1 through 10, for two spectra.]
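For this model problem, $\nabla f(x) = A x$, so $C = A\, \mathbb{E}[x x^T]\, A = \frac{1}{3} A^2$, because $\mathbb{E}[x x^T] = \frac{1}{3} I$ when $x$ is uniform on $[-1,1]^{10}$. A quick Monte Carlo check of that identity, with a hypothetical symmetric $A$:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 10
B = rng.standard_normal((m, m))
A = (B + B.T) / 2                        # any symmetric A works for this identity

X = rng.uniform(-1.0, 1.0, size=(200000, m))  # x ~ uniform on [-1,1]^10
G = X @ A                                # gradients A x_j (A is symmetric)
C_mc = G.T @ G / len(X)                  # Monte Carlo estimate of C
C_exact = A @ A / 3.0                    # the closed form C = (1/3) A^2

print(np.max(np.abs(C_mc - C_exact)))    # entrywise error shrinks as samples grow
```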
  18. $f(x) = \frac{1}{2} x^T A x$, $x \in [-1, 1]^{10}$, $C = \frac{1}{3} A^2$. [Plots: $\Lambda(C)$, the first $k = 6$ eigenvalues from $N = 28$ samples, with bootstrap CI, estimates, and true values.]
  19. $f(x) = \frac{1}{2} x^T A x$, $x \in [-1, 1]^{10}$, $C = \frac{1}{3} A^2$. [Plots: $\mathrm{dist}(W_1, \hat{W}_1)$ vs. subspace dimension, 1 through 6, from $N = 28$ samples, with bootstrap CI, estimates, and true values.]
  20. $f(x) = \frac{1}{2} x^T A x$, $x \in [-1, 1]^{10}$, $C = \frac{1}{3} A^2$. First-order finite differences, $h = 0.1$. [Plots: $\Lambda(C)$, $k = 6$ eigenvalues from $N = 28$ samples, with bootstrap CI, estimates, and true values.]
  21. $f(x) = \frac{1}{2} x^T A x$, $x \in [-1, 1]^{10}$, $C = \frac{1}{3} A^2$. First-order finite differences, $h = 0.001$. [Plots: $\Lambda(C)$, $k = 6$ eigenvalues from $N = 28$ samples, with bootstrap CI, estimates, and true values.]
  22. $f(x) = \frac{1}{2} x^T A x$, $x \in [-1, 1]^{10}$, $C = \frac{1}{3} A^2$. First-order finite differences, $h = 0.00001$. [Plots: $\Lambda(C)$, $k = 6$ eigenvalues from $N = 28$ samples, with bootstrap CI, estimates, and true values.]
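The experiments on slides 20 through 22 replace the exact gradient with a forward difference. A sketch of that substitution on a hypothetical quadratic, showing the first-order bias shrinking with the step size $h$:

```python
import numpy as np

def fd_gradient(f, x, h):
    """First-order (forward) finite-difference approximation of grad f at x."""
    fx = f(x)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - fx) / h
    return g

rng = np.random.default_rng(4)
m = 10
A = np.diag(10.0 ** -np.arange(m))       # hypothetical quadratic f(x) = 1/2 x^T A x
f = lambda x: 0.5 * x @ A @ x

x = rng.uniform(-1.0, 1.0, size=m)
errs = [np.linalg.norm(fd_gradient(f, x, h) - A @ x) for h in (0.1, 0.001, 0.00001)]
print(errs)                              # the O(h) bias shrinks as h decreases
```

This is the bias the Q&A on slide 16 refers to: it enters the eigenvalue and subspace error bounds and vanishes as $h \to 0$.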
  23. $f(x) = \frac{1}{2} x^T A x$, $x \in [-1, 1]^{10}$, $C = \frac{1}{3} A^2$. First-order finite differences, $h = 0.1$. [Plots: $\mathrm{dist}(W_1, \hat{W}_1)$ vs. subspace dimension, 1 through 6, from $N = 28$ samples, with bootstrap CI, estimates, and true values.]
  24. $f(x) = \frac{1}{2} x^T A x$, $x \in [-1, 1]^{10}$, $C = \frac{1}{3} A^2$. First-order finite differences, $h = 0.001$. [Plots: $\mathrm{dist}(W_1, \hat{W}_1)$ vs. subspace dimension, 1 through 6, from $N = 28$ samples, with bootstrap CI, estimates, and true values.]
  25. $f(x) = \frac{1}{2} x^T A x$, $x \in [-1, 1]^{10}$, $C = \frac{1}{3} A^2$. First-order finite differences, $h = 0.00001$. [Plots: $\mathrm{dist}(W_1, \hat{W}_1)$ vs. subspace dimension, 1 through 6, from $N = 28$ samples, with bootstrap CI, estimates, and true values.]
  26. Let's be abundantly clear about the problem we are trying to solve. Low-rank approximation of the collection of gradients: $\frac{1}{\sqrt{N}} [\nabla_x f_1 \cdots \nabla_x f_N] \approx \hat{W}_1 \sqrt{\hat{\Lambda}_1}\, \hat{V}_1^T$. ✔ Low-dimensional linear approximation of the gradient: $\nabla f(x) \approx \hat{W}_1 a(x)$. ✖ Approximate a function of many variables by a function of a few linear combinations of the variables: $f(x) \approx g\left( \hat{W}_1^T x \right)$. ✖
  27. EXPLOIT active subspaces for response surfaces with conditional averaging. Define the conditional expectation: $g(y) = \int f(W_1 y + W_2 z)\, \rho(z \mid y)\, dz$, so that $f(x) \approx g(W_1^T x)$. Theorem: $\left( \int \left( f(x) - g(\hat{W}_1^T x) \right)^2 \rho\, dx \right)^{1/2} \le C_P \left( \varepsilon\, (\lambda_1 + \cdots + \lambda_n)^{1/2} + (\lambda_{n+1} + \cdots + \lambda_m)^{1/2} \right)$, where $g$ is the conditional expectation evaluated with the estimated eigenvectors, $C_P$ is a Poincaré constant, $\varepsilon$ is the subspace error, and the two eigenvalue sums correspond to the active and inactive variables. Proof sketch: tower property, Poincaré inequality, triangle inequality.
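The conditional-averaging construction on slide 27 can be sketched in a few lines. Everything here is a hypothetical setup for illustration: the test function, the assumed active direction $w_1$, and the crude approximation of $\rho(z \mid y)$ by uniform draws that ignores the rotated domain's boundary:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 10, 1
w1 = np.ones(m) / np.sqrt(m)             # assumed one-dimensional active direction

def f(x):
    # Hypothetical f that varies mostly along w1, plus a small inactive term
    return np.exp(0.7 * (x @ w1)) + 0.01 * x[..., -1]

# Complete w1 to an orthonormal basis; remaining columns span the inactive subspace
W = np.linalg.qr(np.column_stack([w1, rng.standard_normal((m, m - 1))]))[0]
W1, W2 = W[:, :n], W[:, n:]

def g(y, n_z=50):
    """Monte Carlo estimate of the conditional expectation g(y) = E[f | y]."""
    Z = rng.uniform(-1.0, 1.0, size=(n_z, m - n))        # rough stand-in for rho(z|y)
    points = np.outer(np.ones(n_z), W1[:, 0] * y) + Z @ W2.T
    return np.mean(f(points))

x = rng.uniform(-1.0, 1.0, size=m)
y = float(W1[:, 0] @ x)
print(f(x), g(y))                        # g(W1^T x) approximates f(x)
```

Because $f$ barely varies along the inactive directions here, even a handful of $z$ samples per $y$ gives a usable response surface; the slide 30 details use exactly one.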
  28. There's an active subspace in the canonical stochastic PDE. •  2d Poisson with 100-term KL coefficients. •  Output: spatial average over the Neumann boundary. Model: $-\nabla \cdot (a \nabla u) = 1$ for $x \in D$; $u = 0$ on $\Gamma_1$; $n \cdot a \nabla u = 0$ on $\Gamma_2$. Log of the coefficients: $\log(a(x, \theta)) = \sum_{i=1}^{m} \theta_i \sigma_i \phi_i(x)$, with correlation $C(x, y) = \exp\left( -\beta^{-1} \| x - y \|_1 \right)$. Scalar of interest: $f(\theta) = \frac{1}{|\Gamma_2|} \int_{\Gamma_2} u(x, \theta)\, dx$. [Plots: KL and active-subspace eigenvalues vs. variable index for $\beta = 1$ and $\beta = 0.01$; components of the first eigenvector for both $\beta$.]
  29. There's an active subspace in the canonical stochastic PDE (same Poisson setup, KL coefficients, and scalar of interest as the previous slide). [Plots: the scalar of interest $f$ vs. the active variable $y$, with testing data, GP mean, and training data, for the long correlation length $\beta = 1.00$ and the short correlation length $\beta = 0.01$.]
  30. DETAILS: Ø  "Full" is a Gaussian process with a linear mean on all 100 parameters, trained with max likelihood on 500 runs. Ø  "ASM" is a Gaussian process on 1 active parameter, using 300 gradients to get the active subspace and trained with max likelihood on 5 runs (one sample for the conditional averages). Ø  Tested on 100 random design sites. [Plots: true vs. Kriging predictions for Full and ASM, for the long correlation length $\beta = 1$ and the short correlation length $\beta = 0.01$.]
  31. THE TAKE-HOMES. You must first discover and then exploit some low-dimensional structure to make high-dimensional problems tractable. Active subspaces offer a new set of tools for working with functions that depend on several (tens to thousands of) variables. The right way to exploit active subspaces depends on what you're trying to do (e.g., approximation, optimization, inversion, integration). COMING IN MARCH! Active Subspaces: Emerging Ideas for Dimension Reduction in Parameter Studies, SIAM, 2015.
  32. Questions? •  What if I don't have gradients? •  Can we talk about the trade-off between discovering and exploiting? •  Can you tell me more about [insert application]? •  How does this relate to other dimension reduction techniques? •  How new is all this? •  Can you show us the cool videos of the active subspace in the aerodynamic shape optimization? Paul Constantine, Colorado School of Mines, paulconstantine.net, @DrPaulynomial. Website: activesubspaces.org