Reproducing Kernel Tutorial

Reproducing Kernels
Fred J. Hickernell
Department of Applied Mathematics, Center for Interdisciplinary Scientific Computation, Office of Research
Illinois Institute of Technology
[email protected] mypages.iit.edu/~hickernell
Thanks to Mac Hyman for the invitation
Thanks to many students and collaborators
Slides available at speakerdeck.com/fjhickernell/reproducing-kernel-tutorial
Please interrupt and ask questions
November 15, 2021

Background · Rep Ker & Riesz Rep Thm · Kernel Ex · Assoc Measures · Error Bds · References

What Can We Do with Reproducing Kernel Hilbert Spaces?
- Use the Riesz Representation Theorem to derive error bounds for algorithms for linear problems, such as integration, function approximation, and solving linear differential equations, BUT
  - You must be able to solve the problem for your reproducing kernel
  - You must pick a kernel that matches your input function
  - You may need to tune the kernel parameters
- Derive optimal algorithms, BUT it takes $O(n^3)$ operations to compute the weights
- Determine how fast the error bounds decay to zero as the computational effort increases, and even whether convergence depends significantly on the number of variables
- Include trends, BUT I have not prepared that for today
- Derive a parallel analysis using Gaussian processes, where the reproducing kernel is interpreted as a covariance kernel, BUT I have not prepared that for today

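Reproducing Kernels for Functions on $\{1, \ldots, d\}$, aka Vectors
- Let $\mathcal{F}$ := all functions on $\{1, \ldots, d\}$ "=" $\mathbb{R}^d$
- Pick a symmetric, positive definite (positive eigenvalues) matrix $\mathsf{W} \in \mathbb{R}^{d \times d}$ to define an inner product
  $\langle f, h \rangle := \boldsymbol{f}^T \mathsf{W} \boldsymbol{h}$ for all $f, h \in \mathcal{F}$, where $\boldsymbol{f} = \bigl(f(t)\bigr)_{t=1}^d$
- The reproducing kernel, $K$, is defined by $\bigl(K(t, x)\bigr)_{t,x=1}^d = \mathsf{K} := \mathsf{W}^{-1}$, and has the properties
  - Symmetry: $K(t, x) = K(x, t)$ because $\mathsf{W}$ is symmetric and thus so is $\mathsf{K}$
  - Positive definiteness: $\bigl(K(x_i, x_j)\bigr)_{i,j=1}^n$ is positive definite for any distinct $x_1, \ldots, x_n \in \{1, \ldots, d\}$
  - Belonging: $K(\cdot, x)$ = $x$th column of $\mathsf{K}$ =: $\boldsymbol{K}_x \in \mathcal{F}$
  - Reproduction: $\langle K(\cdot, x), f \rangle = \boldsymbol{K}_x^T \mathsf{W} \boldsymbol{f} = \boldsymbol{e}_x^T \boldsymbol{f} = f(x)$ since $\mathsf{K} := \mathsf{W}^{-1}$; here $\boldsymbol{e}_x := (0, \ldots, 0, 1, 0, \ldots)^T$ with the 1 in the $x$th position
- The Riesz Representation Theorem says that for any linear functional, LINEAR, there is a representer $g$ such that $\mathrm{LINEAR}(f) = \langle g, f \rangle = \boldsymbol{g}^T \mathsf{W} \boldsymbol{f}$. Note that
  $\begin{pmatrix} g(1) \\ \vdots \\ g(d) \end{pmatrix} = \boldsymbol{g} = \mathsf{K} \mathsf{W} \boldsymbol{g} = \begin{pmatrix} \boldsymbol{K}_1^T \mathsf{W} \boldsymbol{g} \\ \vdots \\ \boldsymbol{K}_d^T \mathsf{W} \boldsymbol{g} \end{pmatrix} = \begin{pmatrix} \langle K(\cdot, 1), g \rangle \\ \vdots \\ \langle K(\cdot, d), g \rangle \end{pmatrix} = \begin{pmatrix} \mathrm{LINEAR}(K(\cdot, 1)) \\ \vdots \\ \mathrm{LINEAR}(K(\cdot, d)) \end{pmatrix}$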
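The vector case above can be checked numerically. The following is a minimal sketch, assuming NumPy; the random matrix `W`, the test vector `f`, and the functional `linear` (with coefficient vector `c`) are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Hypothetical symmetric positive definite W (any SPD matrix would do).
B = rng.standard_normal((d, d))
W = B @ B.T + d * np.eye(d)

K = np.linalg.inv(W)  # reproducing kernel as a matrix, K = W^{-1}


def inner(f, h):
    """Inner product <f, h> = f^T W h."""
    return f @ W @ h


f = rng.standard_normal(d)

# Reproduction: <K(., x), f> should equal f(x) for every x.
reproduced = np.array([inner(K[:, x], f) for x in range(d)])

# An example linear functional, LINEAR(f) = c^T f (c is an assumption).
c = rng.standard_normal(d)


def linear(f):
    return c @ f


# Representer: g(x) = LINEAR(K(., x)), computed without using W.
g = np.array([linear(K[:, x]) for x in range(d)])
```

Here `inner(g, f)` agrees with `linear(f)`, and the representer `g` was obtained from the kernel columns alone, which is the point of the last display above.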

Reproducing Kernels for Functions on $\{1, \ldots, d\}$, aka Vectors (W is gone)
- Let $\mathcal{F}$ be a vector space of functions, and define an inner product $\langle \cdot, \cdot \rangle$; the statements below no longer mention $\mathsf{W}$
- The reproducing kernel, $K$, has the properties
  - Symmetry: $K(t, x) = K(x, t)$
  - Positive definiteness: $\bigl(K(x_i, x_j)\bigr)_{i,j=1}^n$ is positive definite for any distinct $x_1, \ldots, x_n \in \{1, \ldots, d\}$
  - Belonging: $K(\cdot, x) \in \mathcal{F}$
  - Reproduction: $\langle K(\cdot, x), f \rangle = f(x)$
- The Riesz Representation Theorem says that $\mathrm{LINEAR}(f) = \langle g, f \rangle$ with
  $\begin{pmatrix} g(1) \\ \vdots \\ g(d) \end{pmatrix} = \begin{pmatrix} \mathrm{LINEAR}(K(\cdot, 1)) \\ \vdots \\ \mathrm{LINEAR}(K(\cdot, d)) \end{pmatrix}$

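Reproducing Kernels for Functions on General Domains [1]
Suppose that $(\mathcal{F}, \langle \cdot, \cdot \rangle)$ is a Hilbert space of functions on $\Omega$ for which function evaluation is bounded. Then there exists a unique reproducing kernel $K : \Omega \times \Omega \to \mathbb{R}$ for which
- $K(t, x) = K(x, t)$ (symmetry), $K(\cdot, x) \in \mathcal{F}$ (belonging), and $f(x) = \langle K(\cdot, x), f \rangle$ (reproduction) for all $t, x \in \Omega$, $f \in \mathcal{F}$
- $K(\mathsf{X}, \mathsf{X}) = \bigl(K(x_i, x_j)\bigr)_{i,j=1}^n$ is positive definite for any $n \times d$ array $\mathsf{X}$ with distinct rows lying in $\Omega$
- $\mathcal{F}$ is the completion of $\{c_1 K(\cdot, x_1) + \cdots + c_n K(\cdot, x_n) : n \in \mathbb{N}, \ c \in \mathbb{R}^n\}$; conversely, any $K$ satisfying the above determines $\mathcal{F}$

The Riesz Representation Theorem says that for any bounded $\mathrm{LINEAR} : \mathcal{F} \to \mathbb{R}$ there exists a representer $g \in \mathcal{F}$ such that $\mathrm{LINEAR}(f) = \langle g, f \rangle$ for all $f \in \mathcal{F}$. What is $g$?
- $g(x) = \langle K(\cdot, x), g \rangle$ (reproduction) $= \langle g, K(\cdot, x) \rangle$ (symmetry) $= \mathrm{LINEAR}\bigl(K(\cdot, x)\bigr)$ (representer) for all $x \in \Omega$
- $\|g\|^2 = \langle g, g \rangle = \mathrm{LINEAR}(g)$ (representer) $= \mathrm{LINEAR}^{\cdot\cdot}\bigl(\mathrm{LINEAR}^{\cdot} K(\cdot, \cdot\cdot)\bigr)$
- We do not need the definition of $\langle \cdot, \cdot \rangle$ to compute $g$ and $\|g\|$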

Squared Exponential Kernel on $\mathbb{R}$
The squared exponential (aka Gaussian) kernel for univariate functions takes the form
$K(t, x) = A \exp\bigl(-\gamma^2 |t - x|^2\bigr), \quad t, x \in \mathbb{R}$.
It corresponds to the Hilbert space of functions with norm [2, (6.18)]
$\|f\|^2 = \frac{1}{A \sqrt{\pi}} \sum_{m=0}^{\infty} \frac{\int_{\mathbb{R}} \bigl[f^{(m)}(x)\bigr]^2 \, dx}{m! \, 4^m \, \gamma^{2m-1}}$,
which means that functions in this space have square integrable derivatives of all orders.

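Squared Exponential Kernel on $\mathbb{R}^d$
The squared exponential kernel for $d$-variate functions takes the form
$K(t, x) = A \exp\bigl(-\gamma_1^2 |t_1 - x_1|^2 - \cdots - \gamma_d^2 |t_d - x_d|^2\bigr), \quad t, x \in \mathbb{R}^d$.
It corresponds to the Hilbert space of functions with norm
$\|D^m f\|_2^2 := \int_{\mathbb{R}^d} \left[\frac{\partial^{\|m\|_1} f(x)}{\partial x_1^{m_1} \cdots \partial x_d^{m_d}}\right]^2 dx$,
$\|f\|^2 = \frac{1}{A \, \pi^{d/2}} \sum_{m \in \mathbb{N}_0^d} \frac{\|D^m f\|_2^2}{m_1! \cdots m_d! \; 4^{\|m\|_1} \prod_{k=1}^d \gamma_k^{2 m_k - 1}}$,
which means that functions in this space have square integrable derivatives of all orders. This kernel is stationary. It is isotropic if $\gamma_1 = \cdots = \gamma_d$.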

Matérn Kernels
A popular family of kernels with a range of smoothness depending on $r$, with an associated norm that is not simple to write down:
$K_r(t, x) = A \, \|t - x\|_2^r \, \mathrm{ModBesselSec}_r\bigl(\gamma \|t - x\|_2\bigr)$, where $\mathrm{ModBesselSec}_r$ is the modified Bessel function of the second kind of order $r$
$K_{1/2}(t, x) = A_{1/2} \exp\bigl(-\gamma \|t - x\|_2\bigr)$ (not very smooth)
$K_{3/2}(t, x) = A_{3/2} \bigl(1 + \gamma \|t - x\|_2\bigr) \exp\bigl(-\gamma \|t - x\|_2\bigr)$ (somewhat smoother)
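The two closed-form Matérn kernels can be evaluated directly. This is a minimal sketch, assuming NumPy; the function names, the default values of `gamma` and `A`, and the random data sites are illustrative assumptions, not part of the slides.

```python
import numpy as np


def pairwise_distance(T, X):
    """Euclidean distances ||t_i - x_j||_2 between rows of T and rows of X."""
    return np.linalg.norm(T[:, None, :] - X[None, :, :], axis=-1)


def matern_half(T, X, gamma=3.0, A=1.0):
    """K_{1/2}(t, x) = A exp(-gamma ||t - x||_2)."""
    return A * np.exp(-gamma * pairwise_distance(T, X))


def matern_three_halves(T, X, gamma=3.0, A=1.0):
    """K_{3/2}(t, x) = A (1 + gamma ||t - x||_2) exp(-gamma ||t - x||_2)."""
    r = gamma * pairwise_distance(T, X)
    return A * (1.0 + r) * np.exp(-r)


# Hypothetical data sites: 20 random points in [0, 1]^2.
rng = np.random.default_rng(1)
X = rng.random((20, 2))

gram_half = matern_half(X, X)
gram_three_halves = matern_three_halves(X, X)

# Positive definiteness of K(X, X) shows up as positive eigenvalues.
min_eig_half = np.linalg.eigvalsh(gram_half).min()
min_eig_three_halves = np.linalg.eigvalsh(gram_three_halves).min()
```

The smallest eigenvalue of the smoother kernel's Gram matrix is typically much smaller, which hints at the conditioning issue that smooth kernels bring.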

The Centered Discrepancy Kernel [3]
A reproducing kernel used to analyze cubatures, which gives the weighted centered discrepancy, takes the form
$K(t, x) := \prod_{k=1}^d \left[1 + \frac{\gamma_k}{2} \bigl(|t_k - 1/2| + |x_k - 1/2| - |t_k - x_k|\bigr)\right], \quad t, x \in [0, 1]^d$.
It corresponds to the Hilbert space of functions defined on $[0, 1]^d$ with the following norm:
$\|D^m f\|_2^2 := \int_{[0,1]^{\|m\|_1}} \left[\left.\frac{\partial^{\|m\|_1} f(x)}{\partial x_1^{m_1} \cdots \partial x_d^{m_d}}\right|_{x_j = 1/2 \text{ for } m_j = 0}\right]^2 \prod_{j \,:\, m_j > 0} dx_j$,
$\|f\|^2 := \sum_{\|m\|_\infty \le 1} \frac{\|D^m f\|_2^2}{\prod_{k \,:\, m_k = 1} \gamma_k}$.
Mixed partial derivatives of up to order one in each coordinate must be square integrable.

The Delta Kernel
A reproducing kernel with an uncountable basis is
$K(t, x) := \begin{cases} 1 + \gamma, & t = x, \\ 1, & \text{otherwise}, \end{cases} \quad t, x \in [0, 1]^d$,
which corresponds to the Hilbert space of functions that are constant everywhere except possibly at a countable number of points, with
$I(f) = \int_{[0,1]^d} f(x) \, dx, \qquad \|f\|^2 := |I(f)|^2 + \sum_{x \in [0,1]^d} \frac{|f(x) - I(f)|^2}{\gamma}$.
This Hilbert space has an uncountable basis.

Hilbert Spaces of Signed Measures [4]
Let $\mathcal{M}$ be the Hilbert space of measures on $\Omega$ that is the completion of $\{c_1 \delta_{x_1} + \cdots + c_n \delta_{x_n} : n \in \mathbb{N}, \ c \in \mathbb{R}^n\}$ under the norm induced by
$\langle \mu, \nu \rangle_{\mathcal{M}} := \int_{\Omega \times \Omega} K(t, x) \, (\mu \times \nu)(dt \times dx)$,
where $\delta_x$ is the Dirac measure, i.e., $\int_\Omega f(t) \, \delta_x(dt) = f(x)$.
There exists a one-to-one and onto, isometric (I think) mapping $T : \mathcal{M} \to \mathcal{F}$ defined as
$T(\mu)(x) := \int_\Omega K(t, x) \, \mu(dt) \quad \forall x \in \Omega, \ \mu \in \mathcal{M}$,
such that $\langle T(\nu), f \rangle = \int_\Omega f(x) \, \nu(dx)$.
If $T(\nu_x)$ is the representer for the solution of a differential equation at $x$, is $\nu_x$ the Green's function?

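Can We Find Something Like the $\mathsf{W}$ for Functions on $\{1, \ldots, d\}$?
Let $\mathcal{M}$ be the Hilbert space of measures on $\Omega$ that is the completion of $\{c_1 \delta_{x_1} + \cdots + c_n \delta_{x_n} : n \in \mathbb{N}, \ c \in \mathbb{R}^n\}$ under the norm induced by
$\langle \mu, \nu \rangle_{\mathcal{M}} := \int_{\Omega \times \Omega} K(t, x) \, (\mu \times \nu)(dt \times dx)$,
where $\delta_x$ is the Dirac measure, i.e., $\int_\Omega f(t) \, \delta_x(dt) = f(x)$.
Is there a measure $\omega$ on $\Omega \times \Omega$ such that
$\int_{\Omega \times \Omega} K(t, s) K(u, x) \, \omega(ds \times du) = K(t, x) \quad \forall t, x \in \Omega$?
This would be like the $\mathsf{W}$ for functions on $\{1, \ldots, d\}$, where $\mathsf{K} \mathsf{W} \mathsf{K} = \mathsf{K}$.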

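Separable Hilbert Spaces, i.e., Those with Countable Bases
Hilbert spaces of functions on $\Omega$ with countable bases can be written in terms of an $L^2(\Omega)$ basis,
$f(x) = \sum_k \hat{f}(k) \varphi_k(x), \qquad \int_\Omega \varphi_k(x) \varphi_l(x) \, dx = \delta_{k,l}$,
and the reproducing kernel is
$K(t, x) = \sum_k \lambda_k \varphi_k(t) \varphi_k(x)$; note that $\int_\Omega K(x, x) \, dx = \sum_k \lambda_k < \infty$, so $\lambda_k \to 0$.
The inner product is
$\langle f, g \rangle := \sum_k \frac{\hat{f}(k) \hat{g}(k)}{\lambda_k}$,
since this implies
$\langle K(\cdot, x), f \rangle = \sum_k \frac{\lambda_k \varphi_k(x) \hat{f}(k)}{\lambda_k} = \sum_k \varphi_k(x) \hat{f}(k) = f(x)$.
If we formally define the distribution $m(x) = \sum_k \frac{\hat{f}(k) \varphi_k(x)}{\lambda_k}$, then
$\int_\Omega K(t, x) m(t) \, dt = \int_\Omega \sum_{k,l} \lambda_k \varphi_k(t) \varphi_k(x) \frac{\hat{f}(l) \varphi_l(t)}{\lambda_l} \, dt = \sum_k \varphi_k(x) \hat{f}(k) = f(x)$.
- This $m(x) \, dx$ seems to be the $\mu(dx)$ that gets mapped into $f$
- It would seem that $W(t, x) = \sum_k \frac{\varphi_k(t) \varphi_k(x)}{\lambda_k}$, which is not a convergent series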

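Error for Approximating Linear Functionals
Suppose that
- the linear functional $\mathrm{SOL} : \mathcal{F} \to \mathbb{R}$ is the desired solution (integral, derivative at a point, etc.)
- $\mathsf{X} = (x_1, \ldots, x_n)^T$ is the array of data sites
- $\mathrm{APP}_{\mathsf{X},\alpha}(f) = \alpha_1 f(x_1) + \cdots + \alpha_n f(x_n) = \alpha^T f(\mathsf{X})$ is the approximation

Then the approximation error, which is linear and bounded, has the tight upper bound
$|\mathrm{SOL}(f) - \mathrm{APP}_{\mathsf{X},\alpha}(f)| = |\langle g, f \rangle| \le \|g\| \, \|f\|$, where $\|g\|$ is the badness of APP and $\|f\|$ is the badness of $f$, and
$g(x) = (\mathrm{SOL} - \mathrm{APP}_{\mathsf{X},\alpha})\bigl(K(\cdot, x)\bigr)$,
$\|g\|^2 = (\mathrm{SOL} - \mathrm{APP}_{\mathsf{X},\alpha})^{\cdot\cdot}\bigl((\mathrm{SOL} - \mathrm{APP}_{\mathsf{X},\alpha})^{\cdot} K(\cdot, \cdot\cdot)\bigr) = \mathrm{SOL}^{\cdot\cdot}\bigl(\mathrm{SOL}^{\cdot} K(\cdot, \cdot\cdot)\bigr) - 2 \alpha^T \mathrm{SOL}\bigl(K(\mathsf{X}, \cdot)\bigr) + \alpha^T K(\mathsf{X}, \mathsf{X}) \alpha$.
- The badness of APP does not require $\langle \cdot, \cdot \rangle$, but only $K$
- $K(\mathsf{X}, \cdot) = \bigl(K(x_i, \cdot)\bigr)_{i=1}^n$, $K(\mathsf{X}, \mathsf{X}) = \bigl(K(x_i, x_j)\bigr)_{i,j=1}^n$
- Optimal weights are $\alpha = K(\mathsf{X}, \mathsf{X})^{-1} \mathrm{SOL}\bigl(K(\mathsf{X}, \cdot)\bigr)$; choosing optimal data sites, $\mathsf{X}$, is a hard nonlinear optimization problem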
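The squared badness of APP and the optimal weights can be computed from the kernel alone. Below is a minimal sketch for integration over $[0, 1]$, assuming NumPy and, as an illustrative choice, the $d = 1$ centered discrepancy kernel, for which $\mathrm{SOL}(K(\cdot, x))$ and $\mathrm{SOL}^{\cdot\cdot}(\mathrm{SOL}^{\cdot} K(\cdot, \cdot\cdot))$ have closed forms. The function names and the midpoint data sites are assumptions.

```python
import numpy as np


def kernel(t, x, gamma=1.0):
    """Centered discrepancy kernel for d = 1, evaluated on all pairs (t_i, x_j)."""
    t = np.asarray(t, float)[:, None]
    x = np.asarray(x, float)[None, :]
    return 1 + 0.5 * gamma * (np.abs(t - 0.5) + np.abs(x - 0.5) - np.abs(t - x))


def sol_kernel(x, gamma=1.0):
    """SOL(K(., x)): the integral of K(t, x) over t in [0, 1]."""
    x = np.asarray(x, float)
    return 1 + 0.5 * gamma * (np.abs(x - 0.5) - (x - 0.5) ** 2)


def sol_sol_kernel(gamma=1.0):
    """SOL applied to both arguments of K: the double integral over [0, 1]^2."""
    return 1 + gamma / 12


def squared_badness(alpha, X, gamma=1.0):
    """||g||^2 = SOL SOL K - 2 alpha^T SOL(K(X, .)) + alpha^T K(X, X) alpha."""
    k = sol_kernel(X, gamma)
    return sol_sol_kernel(gamma) - 2 * alpha @ k + alpha @ kernel(X, X, gamma) @ alpha


n = 8
X = (np.arange(n) + 0.5) / n            # hypothetical data sites (midpoints)

alpha_equal = np.full(n, 1 / n)         # equal weights
alpha_opt = np.linalg.solve(kernel(X, X), sol_kernel(X))  # optimal weights

bad2_equal = squared_badness(alpha_equal, X)
bad2_opt = squared_badness(alpha_opt, X)
```

The solve for `alpha_opt` is the $O(n^3)$ step; `bad2_opt` can never exceed `bad2_equal` because the optimal weights minimize the quadratic form in $\alpha$.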

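An Example of the Cubature Error Bound [3]
For the problem and non-optimal approximation
$\mathrm{SOL}(f) = \int_{[0,1]^d} f(x) \, dx, \qquad \mathrm{APP}(f) = \frac{1}{n} \sum_{i=1}^n f(x_i)$,
and the reproducing kernel
$K(t, x) := \prod_{k=1}^d \left[1 + \frac{\gamma_k}{2} \bigl(|t_k - 1/2| + |x_k - 1/2| - |t_k - x_k|\bigr)\right], \quad t, x \in [0, 1]^d$,
the error bound is
$|\mathrm{SOL}(f) - \mathrm{APP}(f)| \le \mathrm{BAD}(x_1, \ldots, x_n) \, \mathrm{BAD}(f)$, where
$\mathrm{BAD}^2(x_1, \ldots, x_n) = \|\text{representer of the error}\|^2 = \prod_{k=1}^d \left(1 + \frac{\gamma_k}{12}\right) - \frac{2}{n} \sum_{i=1}^n \prod_{k=1}^d \left[1 + \frac{\gamma_k}{2} \bigl(|x_{ik} - 1/2| - |x_{ik} - 1/2|^2\bigr)\right] + \frac{1}{n^2} \sum_{i,j=1}^n \prod_{k=1}^d \left[1 + \frac{\gamma_k}{2} \bigl(|x_{ik} - 1/2| + |x_{jk} - 1/2| - |x_{ik} - x_{jk}|\bigr)\right]$.
For $\gamma_1 = \cdots = \gamma_d = 1$ the first term is $(13/12)^d$.
- $\mathrm{BAD}(x_1, \ldots, x_n)$ requires $O(d n^2)$ operations to compute
- $\mathrm{BAD}(f) = \|f - f(1/2, \ldots, 1/2)\|$, which is impractical to compute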
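The closed form for $\mathrm{BAD}^2$ translates into a few array operations. This is a minimal sketch, assuming NumPy and general weights $\gamma_k$, so that the leading term is $\prod_k (1 + \gamma_k/12)$, which equals $(13/12)^d$ when every $\gamma_k = 1$. The function name and the random point set are illustrative assumptions.

```python
import numpy as np


def centered_discrepancy_sq(X, gamma):
    """Squared weighted centered discrepancy of the n points in the rows of X.

    X has shape (n, d) with entries in [0, 1]; gamma has shape (d,).
    The double sum costs O(d n^2) operations.
    """
    X = np.asarray(X, float)
    gamma = np.asarray(gamma, float)
    n, d = X.shape
    c = np.abs(X - 0.5)                                   # |x_ik - 1/2|
    term1 = np.prod(1 + gamma / 12)
    term2 = np.prod(1 + 0.5 * gamma * (c - c**2), axis=1).sum()
    diff = np.abs(X[:, None, :] - X[None, :, :])          # |x_ik - x_jk|
    pair = 1 + 0.5 * gamma * (c[:, None, :] + c[None, :, :] - diff)
    term3 = np.prod(pair, axis=2).sum()
    return term1 - 2 * term2 / n + term3 / n**2


# Hypothetical point set: 64 uniform random points in [0, 1]^3.
rng = np.random.default_rng(2)
X_random = rng.random((64, 3))
bad2_random = centered_discrepancy_sq(X_random, np.ones(3))

# Single point at the center of [0, 1]: BAD^2 = 13/12 - 2 + 1 = 1/12.
bad2_center = centered_discrepancy_sq([[0.5]], [1.0])
```

Low discrepancy point sets would be expected to give a smaller value than `bad2_random` for the same $n$; the sketch only evaluates the formula.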

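Optimal Function Approximation
Consider function evaluation at $x$, i.e., $\mathrm{SOL}_x(f) = f(x)$. In this case,
$\mathrm{SOL}_x^{\cdot\cdot}\bigl(\mathrm{SOL}_x^{\cdot} K(\cdot, \cdot\cdot)\bigr) = K(x, x), \qquad \mathrm{SOL}_x\bigl(K(\mathsf{X}, \cdot)\bigr) = K(\mathsf{X}, x)$,
and the optimal algorithm is
$\mathrm{APP}_{x,\mathsf{X},\mathrm{opt}}(f) = K(x, \mathsf{X}) K(\mathsf{X}, \mathsf{X})^{-1} f(\mathsf{X})$.
The squared error satisfies two bounds, the second sharper than the first:
$\bigl|f(x) - K(x, \mathsf{X}) K(\mathsf{X}, \mathsf{X})^{-1} f(\mathsf{X})\bigr|^2 \le \bigl[K(x, x) - K(x, \mathsf{X}) K(\mathsf{X}, \mathsf{X})^{-1} K(\mathsf{X}, x)\bigr] \, \|f\|^2$,
$\bigl|f(x) - K(x, \mathsf{X}) K(\mathsf{X}, \mathsf{X})^{-1} f(\mathsf{X})\bigr|^2 \le \bigl[K(x, x) - K(x, \mathsf{X}) K(\mathsf{X}, \mathsf{X})^{-1} K(\mathsf{X}, x)\bigr] \, \bigl\|f - K(\cdot, \mathsf{X}) K(\mathsf{X}, \mathsf{X})^{-1} f(\mathsf{X})\bigr\|^2$.
The bracketed factor depends only on $\mathsf{X}$, and $K(\cdot, \mathsf{X}) K(\mathsf{X}, \mathsf{X})^{-1} f(\mathsf{X})$ is the best approximation to $f$.
- $\mathrm{APP}_{\cdot,\mathsf{X},\mathrm{opt}}(f)$ is in the Hilbert space, and even in the span of $K(\cdot, x_1), \ldots, K(\cdot, x_n)$, so the best approximation $K(\cdot, \mathsf{X}) K(\mathsf{X}, \mathsf{X})^{-1} f(\mathsf{X})$ is orthogonal to the error of approximation $f - K(\cdot, \mathsf{X}) K(\mathsf{X}, \mathsf{X})^{-1} f(\mathsf{X})$
- The optimal linear approximation for an arbitrary linear functional is just the linear functional applied to the optimal function approximation
- $K(\mathsf{X}, \mathsf{X})$ can be ill-conditioned for smooth kernels and lots of data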
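The optimal approximation and the factor that depends only on the data sites can be sketched as follows, assuming NumPy, a univariate squared exponential kernel with a hypothetical $\gamma = 4$, and a hypothetical test function $f(x) = \sin(2 \pi x)$. The helper names are not from the slides.

```python
import numpy as np


def se_kernel(T, X, gamma=4.0, A=1.0):
    """Squared exponential kernel A exp(-gamma^2 |t - x|^2) for 1-D inputs."""
    T = np.asarray(T, float)[:, None]
    X = np.asarray(X, float)[None, :]
    return A * np.exp(-(gamma * (T - X)) ** 2)


def optimal_approximation(x, X, fX, kernel):
    """K(x, X) K(X, X)^{-1} f(X), evaluated at every point in x."""
    coef = np.linalg.solve(kernel(X, X), fX)
    return kernel(x, X) @ coef


def power_function_sq(x, X, kernel):
    """K(x, x) - K(x, X) K(X, X)^{-1} K(X, x), evaluated at every point in x."""
    KXx = kernel(X, x)                                  # shape (n, len(x))
    correction = np.sum(KXx * np.linalg.solve(kernel(X, X), KXx), axis=0)
    return np.diag(kernel(x, x)) - correction


X = np.linspace(0, 1, 6)                # hypothetical data sites
fX = np.sin(2 * np.pi * X)              # hypothetical function data
x = np.linspace(0, 1, 201)              # evaluation points

approx = optimal_approximation(x, X, fX, se_kernel)
approx_at_data = optimal_approximation(X, X, fX, se_kernel)
power_sq = power_function_sq(x, X, se_kernel)
power_sq_at_data = power_function_sq(X, X, se_kernel)
```

The approximation interpolates the data, and `power_sq` vanishes at the data sites, so the error bound is zero there regardless of $\|f\|$.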

Cardinal Functions
To visualize the effect of an additional data point on the function approximation, we plot
$\mathrm{APP}_{\cdot,\mathsf{X},\mathrm{opt}}(f) = K(\cdot, \mathsf{X}) K(\mathsf{X}, \mathsf{X})^{-1} f(\mathsf{X})$ for $f(\mathsf{X}) = \boldsymbol{e}_i$ (all data are zero but one).
The cardinal functions of the smoother kernel are more oscillatory.
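Cardinal functions can be computed for all $i$ at once, since they are the columns of $K(\cdot, \mathsf{X}) K(\mathsf{X}, \mathsf{X})^{-1}$. This is a minimal sketch, assuming NumPy; the two univariate kernels, the value $\gamma = 4$, and the six equally spaced data sites are illustrative assumptions.

```python
import numpy as np


def se_kernel(T, X, gamma=4.0):
    """Squared exponential kernel exp(-gamma^2 |t - x|^2), a smooth kernel."""
    T = np.asarray(T, float)[:, None]
    X = np.asarray(X, float)[None, :]
    return np.exp(-(gamma * (T - X)) ** 2)


def exp_kernel(T, X, gamma=4.0):
    """Matern 1/2 kernel exp(-gamma |t - x|), a much less smooth kernel."""
    T = np.asarray(T, float)[:, None]
    X = np.asarray(X, float)[None, :]
    return np.exp(-gamma * np.abs(T - X))


def cardinal_functions(x, X, kernel):
    """Column i is the optimal approximation for data f(X) = e_i, evaluated at x."""
    return np.linalg.solve(kernel(X, X), kernel(X, x)).T


X = np.linspace(0, 1, 6)        # hypothetical data sites
x = np.linspace(0, 1, 201)      # plotting grid

card_se = cardinal_functions(x, X, se_kernel)
card_exp = cardinal_functions(x, X, exp_kernel)

# Negative lobes are one simple indicator of oscillation.
most_negative_se = card_se.min()
most_negative_exp = card_exp.min()
```

Each column of `card_se` or `card_exp` can be passed to a plotting routine against `x`; in this setup the smooth kernel's cardinal functions dip below zero while the Matérn 1/2 ones do not.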

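Why Is the Optimal Approximation Linear?
Fix $x \in \Omega$. Let
$\mathcal{B}_{\mathsf{X}, f(\mathsf{X}), R} = \{g \in \mathcal{F} : \|g\|^2 \le R^2 + \|\mathrm{APP}_{\cdot,\mathsf{X},\mathrm{opt}}(f)\|^2, \ g(\mathsf{X}) = f(\mathsf{X})\}$ (functions that look like $f$),
$\mathcal{B}_{\mathsf{X}, \perp, R} = \{h \in \mathcal{F} : \|h\| \le R, \ h(\mathsf{X}) = 0\}$ (functions that vanish at the data sites)
$\phantom{\mathcal{B}_{\mathsf{X}, \perp, R}} = \{h \in \mathcal{F} : \|h\| \le R, \ \langle h, K(\cdot, x_1) \rangle = \cdots = \langle h, K(\cdot, x_n) \rangle = 0\}$.
Any $g \in \mathcal{B}_{\mathsf{X}, f(\mathsf{X}), R}$ may be written as $g = \mathrm{APP}_{\cdot,\mathsf{X},\mathrm{opt}}(f) + g_\perp$ with $g_\perp \in \mathcal{B}_{\mathsf{X}, \perp, R}$. Define
$y_{\mathrm{opt}} := \operatorname{argmin}_{y \in \mathbb{R}} \mathrm{ERR}(y)$,
$\mathrm{ERR}(y) := \max_{g \in \mathcal{B}_{\mathsf{X}, f(\mathsf{X}), R}} |g(x) - y| = \max_{g_\perp \in \mathcal{B}_{\mathsf{X}, \perp, R}} |\mathrm{APP}_{x,\mathsf{X},\mathrm{opt}}(f) + g_\perp(x) - y|$.
Since for every $g_\perp \in \mathcal{B}_{\mathsf{X}, \perp, R}$ it is also true that $-g_\perp \in \mathcal{B}_{\mathsf{X}, \perp, R}$, the optimal choice of $y$ is $\mathrm{APP}_{x,\mathsf{X},\mathrm{opt}}(f)$.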

Tuning the Kernel Parameters
Virtually all reproducing kernels have parameters, $\theta$, that govern smoothness and shape. To ensure that your function is typical for the reproducing kernel Hilbert space, $\mathcal{F}$, one should likely tune these parameters from the function data. Here is a proposal:
$\theta_{\mathrm{opt}} = \operatorname{argmin}_\theta \left[\log\bigl(f(\mathsf{X})^T K_\theta(\mathsf{X}, \mathsf{X})^{-1} f(\mathsf{X})\bigr) + \frac{1}{n} \log\bigl(\det(K_\theta(\mathsf{X}, \mathsf{X}))\bigr)\right]$,
where $f(\mathsf{X})^T K_\theta(\mathsf{X}, \mathsf{X})^{-1} f(\mathsf{X})$ is the squared norm of the minimum norm interpolant.
- This corresponds to choosing $\theta$ to minimize the volume of the ellipsoidal solid in $\mathbb{R}^n$ consisting of all possible function data whose minimum-norm interpolants have an $\mathcal{F}_\theta$-norm no greater than that of the observed interpolant
- It also corresponds to using empirical Bayes when working in the Gaussian process setting with covariance kernels, $K_\theta$
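The proposed objective is cheap to evaluate for a single scalar parameter. The following is a minimal sketch, assuming NumPy, a univariate squared exponential kernel with $\theta = \gamma$, hypothetical data from $f(x) = \sin(2 \pi x)$, and a simple grid search standing in for whatever optimizer one prefers.

```python
import numpy as np


def se_kernel(T, X, gamma):
    """Squared exponential kernel exp(-gamma^2 |t - x|^2) for 1-D inputs."""
    T = np.asarray(T, float)[:, None]
    X = np.asarray(X, float)[None, :]
    return np.exp(-(gamma * (T - X)) ** 2)


def tuning_objective(gamma, X, fX):
    """log(f(X)^T K^{-1} f(X)) + (1/n) log(det(K)) for K = K_gamma(X, X)."""
    K = se_kernel(X, X, gamma)
    squared_norm = fX @ np.linalg.solve(K, fX)   # squared norm of the interpolant
    _, logdet = np.linalg.slogdet(K)             # numerically stable log(det(K))
    return np.log(squared_norm) + logdet / len(X)


X = np.linspace(0, 1, 7)                 # hypothetical data sites
fX = np.sin(2 * np.pi * X)               # hypothetical function data

# Candidate values of gamma; the range is an assumption chosen to keep
# K(X, X) reasonably well conditioned for this small example.
gammas = np.linspace(3, 15, 25)
values = np.array([tuning_objective(g, X, fX) for g in gammas])
gamma_opt = gammas[np.argmin(values)]
```

Using `slogdet` rather than `log(det(...))` avoids underflow of the determinant, which matters because $K_\theta(\mathsf{X}, \mathsf{X})$ is often nearly singular for smooth kernels.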

Thank you
These slides are available at speakerdeck.com/fjhickernell/reproducing-kernel-tutorial

References
1. Aronszajn, N. Theory of Reproducing Kernels. Trans. Amer. Math. Soc. 68, 337–404 (1950).
2. Rasmussen, C. E. & Williams, C. Gaussian Processes for Machine Learning (MIT Press, Cambridge, Massachusetts, 2006). Online version at http://www.gaussianprocess.org/gpml/
3. Hickernell, F. J. A Generalized Discrepancy and Quadrature Error Bound. Math. Comp. 67, 299–322 (1998).
4. Hickernell, F. J. Goodness-of-Fit Statistics, Discrepancies and Robust Designs. Statist. Probab. Lett. 44, 73–78 (1999).
