Slide 1

Slide 1 text

SKETCHING ACTIVE SUBSPACES
Or, what if I don't have gradients?

DISCLAIMER: These slides are meant to complement the oral presentation. Use out of context at your own risk.

This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Number DE-SC-0011077.

PAUL CONSTANTINE
Ben L. Fryrear Assistant Professor
Applied Mathematics & Statistics, Colorado School of Mines
activesubspaces.org
@DrPaulynomial

Joint with:
Prof. David Gleich (Purdue, CS)
Prof. Michael Wakin (School of Mines, EE)
Dr. Armin Eftekhari (UT Austin, Math)

Slide 2

Slide 2 text

IMPACT: exponential decrease in the number of function evaluations for parameter studies.

Approximation: $\tilde{f}(x) \approx f(x)$
Optimization: $\underset{x}{\text{minimize}}\ f(x)$
Integration: $\int f(x)\, \rho\, dx$

Slide 3

Slide 3 text

DEFINE the active subspace.

Consider a function and its gradient vector,
$f = f(x), \quad x \in \mathbb{R}^m, \quad \nabla f(x) \in \mathbb{R}^m, \quad \rho : \mathbb{R}^m \to \mathbb{R}_+$

The average outer product of the gradient and its eigendecomposition,
$C = \int \nabla f\, \nabla f^T\, \rho\, dx = W \Lambda W^T$

LEMMA: $\lambda_i = \int \left( \nabla f^T w_i \right)^2 \rho\, dx, \quad i = 1, \ldots, m$
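
As a quick worked example (not on the original slide, but consistent with the quadratic test function used later in the deck): for $f(x) = \tfrac{1}{2} x^T A x$ with symmetric $A$ and $\rho$ uniform on $[-1,1]^m$,

$\nabla f(x) = A x, \qquad C = A \left( \int x\, x^T \rho\, dx \right) A = A \left( \tfrac{1}{3} I \right) A = \tfrac{1}{3} A^2,$

since each coordinate of $x$ has mean zero and variance $1/3$. This is the $C = \tfrac{1}{3} A^2$ that appears in the numerical experiments below.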

Slide 4

Slide 4 text

DISCOVER the active subspace with random sampling.

Draw samples: $x_j \sim \rho$
Compute: $\nabla f_j = \nabla f(x_j)$ and $f_j = f(x_j)$

Approximate with Monte Carlo:
$C \approx \frac{1}{N} \sum_{j=1}^{N} \nabla f_j\, \nabla f_j^T = \hat{W} \hat{\Lambda} \hat{W}^T$

Equivalent to the SVD of the samples of the gradient:
$\frac{1}{\sqrt{N}} \left[ \nabla f_1 \ \cdots \ \nabla f_N \right] = \hat{W} \sqrt{\hat{\Lambda}}\, \hat{V}^T$
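
A minimal numpy sketch of this sampling procedure (not from the slides; the quadratic test function and the names used here are illustrative assumptions):

```python
import numpy as np

def active_subspace(grad_f, m, N, rng=np.random.default_rng(0)):
    """Estimate eigenpairs of C = E[grad f grad f^T] by Monte Carlo + SVD."""
    X = rng.uniform(-1.0, 1.0, size=(N, m))        # x_j ~ rho (uniform on [-1,1]^m)
    G = np.column_stack([grad_f(x) for x in X])    # m-by-N matrix of gradient samples
    W_hat, s, _ = np.linalg.svd(G / np.sqrt(N), full_matrices=False)
    return W_hat, s**2                             # eigenvectors and eigenvalues of C-hat

# Example: quadratic test function f(x) = 0.5 * x^T A x, so grad f(x) = A x
m = 10
A = np.diag(10.0 ** -np.arange(m))                 # illustrative spectrum
W_hat, lam_hat = active_subspace(lambda x: A @ x, m, N=500)
```

The squared singular values of the scaled gradient matrix are exactly the eigenvalues of the Monte Carlo estimate $\hat{C}$, which is why the SVD form on the slide is equivalent.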

Slide 5

Slide 5 text

What if I don't have gradients?

Consider a gradient approximation: $\nabla f(x) \approx g(x)$

Assume $\| \nabla f(x) - g(x) \| \le m^{1/2}\, \gamma_h$, where $\lim_{h \to 0} \gamma_h = 0$

Approximate:
$\int \nabla f\, \nabla f^T\, \rho\, dx \approx \frac{1}{N} \sum_{i=1}^{N} g(x_i)\, g(x_i)^T$

Cost for first-order finite differences: $N(m+1)$ function evaluations.
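
A minimal sketch of a first-order finite-difference approximation $g(x)$ (the stencil and function names below are assumptions; the slide only specifies "first-order finite differences"):

```python
import numpy as np

def fd_gradient(f, x, h=1e-5):
    """First-order finite-difference approximation g(x) of grad f(x).

    Uses f(x) plus one perturbed evaluation per coordinate,
    i.e., m + 1 function evaluations per sample point."""
    m = x.size
    f0 = f(x)
    g = np.empty(m)
    for j in range(m):
        e = np.zeros(m)
        e[j] = 1.0
        g[j] = (f(x + h * e) - f0) / h
    return g
```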

Slide 6

Slide 6 text

How many approximate gradients?

$N = \Omega\!\left( \dfrac{L^2 \lambda_1}{\lambda_k^2\, \varepsilon^2}\, \log(m) \right) \;\Longrightarrow\; | \lambda_k - \hat{\lambda}_k | \le \varepsilon\, \lambda_k + O(L\, \gamma_h)$ (with high probability)

Here $L^2$ is the bound on the gradient norm squared, $\varepsilon$ is the relative accuracy, $m$ is the dimension, and $O(L\, \gamma_h)$ is the bias from the approximate gradients.

Using Gittens and Tropp (2011).

Slide 7

Slide 7 text

How many approximate gradients?

For $\varepsilon$ and $\gamma_h$ small enough,

$N = \Omega\!\left( \dfrac{L^2}{\lambda_1\, \varepsilon^2}\, \log(m) \right) \;\Longrightarrow\; \mathrm{dist}\left( W_1, \hat{W}_1 \right) \le \dfrac{O(L\, \gamma_h)}{(1 - \varepsilon)\lambda_n - (1 + \varepsilon)\lambda_{n+1}} + \dfrac{4\, \varepsilon\, \lambda_1}{\lambda_n - \lambda_{n+1}}$ (with high probability)

Here $\lambda_n - \lambda_{n+1}$ is the spectral gap, $(1 - \varepsilon)\lambda_n - (1 + \varepsilon)\lambda_{n+1}$ is something smaller than the spectral gap, and $O(L\, \gamma_h)$ is the bias from the approximate gradients.

Gittens and Tropp (2011), Golub and Van Loan (1996), Stewart (1973).

Slide 8

Slide 8 text

Quadratic test function: $f(x) = \tfrac{1}{2} x^T A x$, $x \in [-1,1]^{10}$, $C = \tfrac{1}{3} A^2$
First-order finite differences: $h = 0.1$. Number of samples: N = 28.

[Figure: two panels of estimated eigenvalues of C (indices 1-6) with confidence intervals ("CI", "Est") against the true eigenvalues ("True").]

Slide 9

Slide 9 text

Quadratic test function: $f(x) = \tfrac{1}{2} x^T A x$, $x \in [-1,1]^{10}$, $C = \tfrac{1}{3} A^2$
First-order finite differences: $h = 0.001$. Number of samples: N = 28.

[Figure: two panels of estimated eigenvalues of C (indices 1-6) with confidence intervals ("CI", "Est") against the true eigenvalues ("True").]

Slide 10

Slide 10 text

Quadratic test function: $f(x) = \tfrac{1}{2} x^T A x$, $x \in [-1,1]^{10}$, $C = \tfrac{1}{3} A^2$
First-order finite differences: $h = 0.00001$. Number of samples: N = 28.

[Figure: two panels of estimated eigenvalues of C (indices 1-6) with confidence intervals ("CI", "Est") against the true eigenvalues ("True").]

Slide 11

Slide 11 text

Quadratic test function: $f(x) = \tfrac{1}{2} x^T A x$, $x \in [-1,1]^{10}$, $C = \tfrac{1}{3} A^2$
First-order finite differences: $h = 0.1$. Number of samples: N = 28.

[Figure: two panels of the subspace distance $\mathrm{dist}(W_1, \hat{W}_1)$ versus subspace dimension (1-6), with confidence intervals ("CI", "Est") against the true distance ("True").]

Slide 12

Slide 12 text

Quadratic test function: $f(x) = \tfrac{1}{2} x^T A x$, $x \in [-1,1]^{10}$, $C = \tfrac{1}{3} A^2$
First-order finite differences: $h = 0.001$. Number of samples: N = 28.

[Figure: two panels of the subspace distance $\mathrm{dist}(W_1, \hat{W}_1)$ versus subspace dimension (1-6), with confidence intervals ("CI", "Est") against the true distance ("True").]

Slide 13

Slide 13 text

Quadratic test function: $f(x) = \tfrac{1}{2} x^T A x$, $x \in [-1,1]^{10}$, $C = \tfrac{1}{3} A^2$
First-order finite differences: $h = 0.00001$. Number of samples: N = 28.

[Figure: two panels of the subspace distance $\mathrm{dist}(W_1, \hat{W}_1)$ versus subspace dimension (1-6), with confidence intervals ("CI", "Est") against the true distance ("True").]

Slide 14

Slide 14 text

Sufficient Dimension Reduction solves a related problem.

PROBLEM: Given data $\{(y_i, x_i)\}$, find $A$.

Consider the regression model: $y = f(x) + \varepsilon$ ($x$: predictors, $y$: response, $\varepsilon$: random noise).
Assume $P(y \mid x) = P(y \mid A^T x)$, where $A \in \mathbb{R}^{m \times n}$, $n < m$.

See: R.D. Cook, Regression Graphics, Wiley (1998).

Slide 15

Slide 15 text

Sufficient Dimension Reduction solves a related problem.

PROBLEM: Given data $\{(y_i, x_i)\}$, find $A$.

PAPER / MAIN IDEA:
- Samarov. Exploring regression structure using nonparametric functional estimation. JASA (1993). Define average gradient-based metrics for regression models.
- Hristache, et al. Structure adaptive approach for dimension reduction. Annals of Statistics (2001). Local linear regressions with subsets of the given data to estimate gradients; take averages. "Outer product of gradients."
- Fukumizu & Leng. Gradient-based kernel dimension reduction for regression. JASA (2014). Radial basis function approximation for gradients.

Slide 16

Slide 16 text

OBSERVATION: A compressed measurement of the gradient is a directional derivative.

$a^T \nabla f(x) \approx \dfrac{f(x + h\, a) - f(x)}{h}$

The left-hand side is a compressed measurement of the gradient; the right-hand side costs TWO function evaluations.

Can we develop methods that exploit this relationship to estimate the active subspace with $N(k+1) < N(m+1)$ function evaluations?
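
A minimal sketch of this observation (the helper name and the quadratic test function are illustrative assumptions, not from the slide):

```python
import numpy as np

def directional_derivative(f, x, a, h=1e-5):
    """Compressed measurement a^T grad f(x) from TWO function evaluations."""
    return (f(x + h * a) - f(x)) / h

# Check against the exact value for f(x) = 0.5 * x^T A x, grad f(x) = A x
rng = np.random.default_rng(0)
m = 10
A = rng.standard_normal((m, m)); A = A + A.T          # symmetric A
x, a = rng.standard_normal(m), rng.standard_normal(m)
approx = directional_derivative(lambda v: 0.5 * v @ A @ v, x, a)
exact = a @ (A @ x)
print(abs(approx - exact))                             # small, O(h)
```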

Slide 17

Slide 17 text

Covariance estimation from compressed measurements.

Consider independent draws $z_i \sim N(0, \Sigma)$, $z_i \in \mathbb{R}^m$:
$\Sigma = \int z\, z^T\, p(z)\, dz \approx \frac{1}{N} \sum_{i=1}^{N} z_i z_i^T \approx \frac{1}{N} \sum_{i=1}^{N} (P_i z_i)(P_i z_i)^T$

Define the projection $P_i z_i = E_i (E_i^T E_i)^{-1} E_i^T z_i$, where the $E_i \in \mathbb{R}^{m \times k}$ are independent Gaussians and the $E_i^T z_i$ are the compressed measurements.

Qi & Hughes. Invariance of principal components under low-dimensional random projection of the data. IEEE ICIP (2012).
Azizyan, Krishnamurthy, & Singh. Extreme compressive sampling for covariance estimation. arXiv:1506.00898 (2015).
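
A minimal numpy sketch of this projection-based estimate (function names are mine; per the references cited on the slide, the leading eigenvectors of such an estimate track those of $\Sigma$, while the compression shrinks the eigenvalues):

```python
import numpy as np

def compressed_covariance(Z, k, rng=np.random.default_rng(0)):
    """Estimate Sigma from k compressed measurements per sample.

    Z: N-by-m array of samples z_i ~ N(0, Sigma). Each sample is seen only
    through y_i = E_i^T z_i; we re-embed it as the projection P_i z_i and
    average the outer products."""
    N, m = Z.shape
    S = np.zeros((m, m))
    for z in Z:
        E = rng.standard_normal((m, k))          # independent Gaussian E_i
        y = E.T @ z                              # compressed measurements
        Pz = E @ np.linalg.solve(E.T @ E, y)     # P_i z_i = E (E^T E)^{-1} E^T z
        S += np.outer(Pz, Pz)
    return S / N
```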

Slide 18

Slide 18 text

Using compressed gradients to estimate the active subspace.

Projection of gradient samples:
$P_i \nabla f(x_i) = E_i (E_i^T E_i)^{-1} E_i^T \nabla f(x_i)$, where $E_i = \left[ e^{(i)}_1 \ \cdots \ e^{(i)}_k \right]$

Exploiting the observation (the compressed gradients):
$\nabla f(x_i)^T e^{(i)}_j \approx \dfrac{f(x_i + h\, e^{(i)}_j) - f(x_i)}{h}$

$\int \nabla f\, \nabla f^T\, \rho\, dx \approx \frac{1}{N} \sum_{i=1}^{N} \left( P_i \nabla f(x_i) \right) \left( P_i \nabla f(x_i) \right)^T$

Can we develop methods that exploit compressed gradients to estimate the active subspace with $N(k+1) < N(m+1)$ function evaluations? MAYBE!
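
Putting the last two slides together, a hedged sketch of the projection-based ("Proj") estimate using only finite-difference directional derivatives; the uniform sampling of $\rho$ and the function names are assumptions for illustration:

```python
import numpy as np

def proj_active_subspace(f, m, N, k, h=1e-5, rng=np.random.default_rng(0)):
    """Estimate C = E[grad f grad f^T] from N(k+1) function evaluations."""
    C_hat = np.zeros((m, m))
    for _ in range(N):
        x = rng.uniform(-1.0, 1.0, m)                    # x_i ~ rho
        E = rng.standard_normal((m, k))                  # E_i = [e_1 ... e_k]
        f0 = f(x)
        d = np.array([(f(x + h * E[:, j]) - f0) / h      # e_j^T grad f(x_i)
                      for j in range(k)])
        Pg = E @ np.linalg.solve(E.T @ E, d)             # P_i grad f(x_i)
        C_hat += np.outer(Pg, Pg)
    C_hat /= N
    lam, W = np.linalg.eigh(C_hat)
    return lam[::-1], W[:, ::-1]                         # descending order
```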

Slide 19

Slide 19 text

Low-rank approximation that uses compressed gradients.

Recall: $\int \nabla f\, \nabla f^T\, \rho\, dx \approx G\, G^T$, where $G = \frac{1}{\sqrt{N}} \left[ \nabla f(x_1) \ \cdots \ \nabla f(x_N) \right]$

Define the measurement operator:
$\mathcal{M}\left( \left[ v_1 \ \cdots \ v_N \right] \right) = \left[ E_1^T v_1 \ \cdots \ E_N^T v_N \right] \in \mathbb{R}^{k \times N}$

Choose a rank $r$ and solve* the optimization problem:
$\underset{A \in \mathbb{R}^{m \times r},\ B \in \mathbb{R}^{N \times r}}{\text{minimize}}\ \| \mathcal{M}(G) - \mathcal{M}(A B^T) \|_F^2$

*Solve with alternating minimization, initialized with the compressed covariance eigenvectors.

Constantine, Eftekhari, & Wakin. Computing active subspaces efficiently with gradient sketching. IEEE CAMSAP (2015).
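
A compact numpy sketch of the alternating-minimization solve described on this slide (the loop structure, helper names, and least-squares details are my assumptions; the CAMSAP paper cited above is the authoritative description):

```python
import numpy as np

def altmin_lowrank(Es, ys, m, r, iters=20):
    """Recover a rank-r factorization G ~ A @ B.T of the m-by-N gradient
    matrix from compressed measurements y_i = E_i^T G[:, i]."""
    N = len(Es)
    # Initialize A with the leading eigenvectors of the projection-based
    # covariance estimate built from the same measurements.
    C = np.zeros((m, m))
    for E, y in zip(Es, ys):
        Pg = E @ np.linalg.solve(E.T @ E, y)
        C += np.outer(Pg, Pg)
    A = np.linalg.eigh(C / N)[1][:, -r:]                 # m-by-r
    B = np.zeros((N, r))
    for _ in range(iters):
        # B-step: columns decouple; for each i solve min_b ||E_i^T A b - y_i||
        for i, (E, y) in enumerate(zip(Es, ys)):
            B[i], *_ = np.linalg.lstsq(E.T @ A, y, rcond=None)
        # A-step: min_A sum_i ||E_i^T A b_i - y_i||^2 is linear in vec(A)
        M = np.vstack([np.kron(B[i], E.T) for i, E in enumerate(Es)])
        a, *_ = np.linalg.lstsq(M, np.concatenate(ys), rcond=None)
        A = a.reshape((m, r), order="F")                 # column-major vec(A)
    return A, B
```

Each iteration alternates two linear least-squares problems: with $A$ fixed the columns of $B$ decouple into small solves, and with $B$ fixed $\mathrm{vec}(A)$ solves one stacked least-squares problem; the span of the recovered $A$ estimates the active subspace.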

Slide 20

Slide 20 text

Low-rank approximation that uses compressed gradients

1. How many measurements should we take?
2. How many samples should we draw?
3. What should the rank be in the low-rank approximation?
4. What is the effect of finite difference approximations?

Slide 21

Slide 21 text

Quadratic test function: $f(x) = \tfrac{1}{2} x^T A x$, $x \in [-1,1]^{10}$, $C = \tfrac{1}{3} A^2$
Dimension: 10. Number of samples: 200. No finite difference approximations.
"Proj" is projection-based covariance estimation. "Altmin" is the low-rank approximation with rank 4.

[Figure: eigenvalue relative error and three-dimensional subspace error versus number of measurements (4-9) for Proj and Altmin.]

Slide 22

Slide 22 text

Quadratic test function: $f(x) = \tfrac{1}{2} x^T A x$, $x \in [-1,1]^{10}$, $C = \tfrac{1}{3} A^2$
Dimension: 10. Number of samples: 200. No finite difference approximations.
"Proj" is projection-based covariance estimation. "Altmin" is the low-rank approximation with rank 4.

[Figure: estimated eigenvalues (indices 1-6) against the true eigenvalues, and subspace error versus subspace dimension, for Proj and Altmin.]

Slide 23

Slide 23 text

PDE test function: $\nabla \cdot (a \nabla u) = 1,\ s \in D$; $\quad u = 0,\ s \in \Gamma_1$; $\quad n \cdot a \nabla u = 0,\ s \in \Gamma_2$; $\quad f(x) = \int_{\Gamma_2} u(s, x)\, ds$
Dimension: 100. Number of samples: 300. No finite difference approximations.
"Proj" is projection-based covariance estimation. "Altmin" is the low-rank approximation with rank 8.

[Figure: eigenvalue relative error and one-dimensional subspace error versus number of measurements (10-90) for Proj and Altmin.]

Slide 24

Slide 24 text

PDE test function: $\nabla \cdot (a \nabla u) = 1,\ s \in D$; $\quad u = 0,\ s \in \Gamma_1$; $\quad n \cdot a \nabla u = 0,\ s \in \Gamma_2$; $\quad f(x) = \int_{\Gamma_2} u(s, x)\, ds$
Dimension: 100. Number of samples: 300. No finite difference approximations.
"Proj" is projection-based covariance estimation. "Altmin" is the low-rank approximation with rank 8.

[Figure: estimated eigenvalues (indices 1-6) against the true eigenvalues, and subspace error versus subspace dimension, for Proj and Altmin.]

Slide 25

Slide 25 text

SUMMARY

How can I estimate the active subspace without gradients?
1. Finite differences
2. Methods from Sufficient Dimension Reduction
3. Compressed covariance estimation
4. Compressed low-rank approximation of gradients

How should I estimate the active subspace without gradients? REMEMBER THE ULTIMATE GOAL!

Slide 26

Slide 26 text

QUESTIONS?

• Why does the low-rank method work so much better?
• Is it really worth the effort to get the active subspace?

PAUL CONSTANTINE
Ben L. Fryrear Assistant Professor
Colorado School of Mines
activesubspaces.org
@DrPaulynomial