Slide 1

Slide 1 text

Transfer Operator Learning
OT-Berlin Workshop, March 11-15, 2024
Massimiliano Pontil (Italian Institute of Technology and University College London)
Vladimir Kostic, Karim Lounici, Pietro Novelli
Also thanks to: Carlo Ciliberto, Antoine Chatalic, Riccardo Grazzi, Vladimir Kostic, Karim Lounici, Prune Inzerilli, Andreas Maurer, Giacomo Meanti, Pietro Novelli, Lorenzo Rosasco, Giacomo Turri

Slide 2

Slide 2 text

Dynamical Systems and ML
• Data-driven approaches are becoming a key part of science and engineering
• Dynamical systems are mathematical models of temporally evolving phenomena, described by a state variable x_t ∈ 𝒳 evolving over time t

Slide 3

Slide 3 text

Dynamical Systems
• We focus on discrete, time-homogeneous Markov processes:
ℙ[X_{t+1} | X_1, …, X_t] = ℙ[X_{t+1} | X_t], independent of t
Example: X_{t+1} = F(X_t) + noise_t

Slide 4

Slide 4 text

Langevin Equation
• The (overdamped) Langevin equation driven by a potential U: ℝ^d → ℝ:
dX_t = −∇U(X_t) dt + β^{−1/2} dW_t
• Euler-Maruyama discretization:
X_{t+1} = X_t − ∇U(X_t) + β^{−1/2}(W_{t+1} − W_t),
where −∇U(X_t) plays the role of F(X_t) and the Wiener increment that of noise_t
[Figure: folding of CLN025 (Chignolin)]
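To make the discretization concrete, here is a minimal Euler-Maruyama sketch in Python (NumPy). The 1D double-well potential, the inverse temperature β, and the step size are illustrative stand-ins chosen for this sketch, not the molecular-dynamics setup behind the Chignolin example.

```python
import numpy as np

def euler_maruyama(grad_U, x0, beta=1.0, dt=1e-2, n_steps=10_000, rng=None):
    """Simulate the overdamped Langevin SDE  dX_t = -grad U(X_t) dt + beta^{-1/2} dW_t
    with the Euler-Maruyama scheme:
        X_{t+1} = X_t - grad U(X_t) * dt + sqrt(dt / beta) * N(0, I).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    traj = np.empty((n_steps + 1,) + x.shape)
    traj[0] = x
    for t in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - grad_U(x) * dt + np.sqrt(dt / beta) * noise
        traj[t + 1] = x
    return traj

# Toy stand-in for a molecular potential: a 1D double well U(x) = (x^2 - 1)^2.
grad_U = lambda x: 4.0 * x * (x**2 - 1.0)
traj = euler_maruyama(grad_U, x0=np.array([-1.0]), beta=3.0, dt=1e-2, n_steps=50_000)
```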

Slide 5

Slide 5 text

Roadmap
● Transfer / Koopman operators
● Subspace (kernel-based) approach
● Risk and spectral bounds
● (Neural) representation learning
Code: https://github.com/Machine-Learning-Dynamical-Systems/kooplearn

Slide 6

Slide 6 text

The Koopman/Transfer Operator
• The forward transfer operator returns the expected value of an observable (i.e. any function of the state) one step ahead in the future:
(A f)(x) = 𝔼[f(X_{t+1}) | X_t = x],   A: ℱ → ℱ

Slide 7

Slide 7 text

The Koopman/Transfer Operator
• The forward transfer operator returns the expected value of an observable (i.e. any function of the state) one step ahead in the future:
(A f)(x) = 𝔼[f(X_{t+1}) | X_t = x],   A: ℱ → ℱ
• A globally linearizes the DS within a suitable invariant subspace: A[ℱ] ⊆ ℱ

Slide 8

Slide 8 text

The Koopman/Transfer Operator
• The forward transfer operator returns the expected value of an observable (i.e. any function of the state) one step ahead in the future:
(A f)(x) = 𝔼[f(X_{t+1}) | X_t = x],   A: ℱ → ℱ
• A globally linearizes the DS within a suitable invariant subspace: A[ℱ] ⊆ ℱ
• If ℱ = span{ϕ_1, …, ϕ_r} is invariant, the operator A is identified by an r×r matrix G evolving function coefficients:
⟨w, ϕ(⋅)⟩ ↦ ⟨Gw, ϕ(⋅)⟩

Slide 9

Slide 9 text

Power of Linearity: Mode Decomposition
• Spectral decomposition (self-adjoint compact case): A = ∑_{i=1}^∞ λ_i f_i ⊗ f_i, where the f_i are eigenfunctions of A with eigenvalue λ_i
• 𝔼[f(X_t) | X_0 = x] = (A^t f)(x) = ∑_i λ_i^t f_i(x) ⟨f, f_i⟩
• It disentangles an observable forecast into temporal and static components
• A good ℱ is given by the span of the leading eigenfunctions
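As a sanity check of the mode decomposition, here is a self-contained toy example with a finite-state reversible Markov chain, where the transfer operator is simply the transition matrix P acting on functions and is self-adjoint in L²_π. The chain and the observable below are made up for illustration.

```python
import numpy as np

# Toy reversible Markov chain: (A f)(x) = sum_y P[x, y] f(y).
P = np.array([[0.90, 0.10, 0.00],
              [0.05, 0.90, 0.05],
              [0.00, 0.10, 0.90]])

# Stationary distribution pi (dominant left eigenvector of P).
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

# For a reversible chain, S = D P D^{-1} with D = diag(sqrt(pi)) is symmetric,
# so its eigenvectors give pi-orthonormal eigenfunctions f_i of A.
D = np.diag(np.sqrt(pi))
S = D @ P @ np.linalg.inv(D)
lam, U = np.linalg.eigh(S)
F = np.linalg.inv(D) @ U          # columns are eigenfunctions f_i, <f_i, f_j>_pi = delta_ij

f = np.array([1.0, 0.0, -1.0])    # an observable
t = 5
coeffs = F.T @ (pi * f)           # <f, f_i>_pi
forecast_modes = F @ (lam**t * coeffs)            # sum_i lambda_i^t f_i(x) <f, f_i>
forecast_exact = np.linalg.matrix_power(P, t) @ f # E[f(X_t) | X_0 = x] directly
print(np.allclose(forecast_modes, forecast_exact))  # True
```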

Slide 10

Slide 10 text

Related Work
• Data-driven algorithms to reconstruct (deterministic) dynamical systems: Kutz, Brunton, Budisic, Proctor, Mezic, Giannakis, Rowley, …
• Transfer operators in molecular dynamics: Noé, Klus, Nüske, Parrinello, Schuster, …
• Conditional mean embeddings (CME) and stochastic processes: Fukumizu, Gretton, Grunewalder, Mollenhauer, Muandet, Pontil, Scholkopf, Sriperumbudur, …

Slide 11

Slide 11 text

Statistical Learning Setting [Kostic et al. NeurIPS 2022]
• Ergodicity: there is a unique distribution π s.t. X_t ∼ π ⇒ X_{t+1} ∼ π
• A_ℱ is well-defined for ℱ = L²_π(𝒳):
(A_ℱ f)(x) = 𝔼[f(X_{t+1}) | X_t = x] = ∫_𝒳 f(y) p(dy, x),   f ∈ ℱ   (p: transition kernel)
A_π ≡ A_{L²_π(𝒳)}: L²_π(𝒳) → L²_π(𝒳)
• Challenge: the operator and its domain are unknown!

Slide 12

Slide 12 text

Subspace Approach
• We restrict A_π to a RKHS ℋ and look for an operator G: ℋ → ℋ such that
A_π ⟨w, ϕ(⋅)⟩ ≈ ⟨Gw, ϕ(⋅)⟩,   that is:   𝔼[ϕ(X_{t+1}) | X_t = x] ≈ G*ϕ(x)
[Diagram: G: ℋ → ℋ approximating A_π: L²_π(𝒳) → L²_π(𝒳)]

Slide 13

Slide 13 text

Subspace Approach
• We restrict A_π to a RKHS ℋ and look for an operator G: ℋ → ℋ such that
A_π ⟨w, ϕ(⋅)⟩ ≈ ⟨Gw, ϕ(⋅)⟩,   that is:   𝔼[ϕ(X_{t+1}) | X_t = x] ≈ G*ϕ(x)
state sequence:   … → x_t → x_{t+1} → …
representation:   … → ϕ(x_t) → ϕ(x_{t+1}) → …
loss:   ∥ϕ(x_{t+1}) − G*ϕ(x_t)∥²
• We aim to minimize the risk: R(G) = 𝔼_{(X,Y)∼ρ} ∥ϕ(Y) − G*ϕ(X)∥²

Slide 14

Slide 14 text

Empirical Risk Minimization
• Given an iid sample (x_i, y_i = x_{i+1})_{i=1}^n ∼ ρ, learn Ĝ: ℋ → ℋ by minimizing
∑_{i=1}^n ∥ϕ(y_i) − G*ϕ(x_i)∥² + γ∥G∥²_HS
• Several estimators are included in this setting

Slide 15

Slide 15 text

Empirical Risk Minimization
• Given a sample (x_i, y_i = x_{i+1})_{i=1}^n ∼ ρ, learn Ĝ: ℋ → ℋ by minimizing
∑_{i=1}^n ∥ϕ(y_i) − G*ϕ(x_i)∥² + γ∥G∥²_HS
• The solution has the form: (Ĝ f)(x) = ∑_{i,j=1}^n W_{i,j} f(y_i) k(x_j, x)
• Kernel Ridge Regression (full-rank solution): W_KRR = (K + γI)^{−1}, where K = (k(x_i, x_j))_{i,j=1}^n (see the sketch below)
• Principal Component Regression (low-rank): minimizes the risk on a feature subspace spanned by the principal components
• Reduced Rank Regression (low-rank): adds a rank constraint, leading to a generalized eigenvalue problem
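A minimal NumPy sketch of the KRR variant, transcribing the closed form on this slide. The Gaussian kernel, its bandwidth, and the regularization value are illustrative choices for this sketch; this is not the kooplearn API.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian kernel matrix k(a_i, b_j) between two sets of points (rows).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def fit_krr(X, Y, gamma=1e-3, sigma=1.0):
    """Full-rank estimator: W_KRR = (K + gamma I)^{-1}, K = (k(x_i, x_j))_{i,j}."""
    n = X.shape[0]
    K = rbf(X, X, sigma)
    return np.linalg.inv(K + gamma * np.eye(n))

def predict_observable(W, X, Y, f, x_query, sigma=1.0):
    """One-step forecast (G_hat f)(x) = sum_{i,j} W_{ij} f(y_i) k(x_j, x),
    i.e. an estimate of E[f(X_{t+1}) | X_t = x] at each query point."""
    Kq = rbf(X, x_query, sigma)       # (n, m): k(x_j, x) for each query point x
    return f(Y) @ W @ Kq              # (m,): forecast of the observable

# Usage on consecutive snapshots of a trajectory (e.g. the Langevin sketch above):
#   X, Y = traj[:-1], traj[1:]
#   W = fit_krr(X, Y)
#   pred = predict_observable(W, X, Y, lambda s: s[:, 0], x_query)
```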

Slide 16

Slide 16 text

Learning Guarantees
• Estimation error: E(Ĝ) = sup_{∥h∥_ℋ ≤ 1} ∥(A_π − Ĝ)h∥_{L²_π}
• Metric distortion: η(h) = ∥h∥_ℋ / ∥h∥_{L²_π}
• We wish to bound E(Ĝ) and link the spectra of Ĝ to that of A_π
[Diagram: Ĝ: ℋ → ℋ vs. A_π|_ℋ: ℋ → L²_π(𝒳)]

Slide 17

Slide 17 text

Learning Guarantees
• Estimation error: E(Ĝ) = ∥A_π|_ℋ − Ĝ∥_{ℋ → L²_π}
• Metric distortion: η(h) = ∥h∥_ℋ / ∥h∥_{L²_π}
• Both quantities appear in spectral error bounds:
Ĝ ĥ = λ̂ ĥ   ⇒   ∥A_π ĥ − λ̂ ĥ∥_{L²_π} ≤ E(Ĝ) η(ĥ) ∥ĥ∥_{L²_π}
• Even if E(Ĝ) is small, spectral estimation may fail!

Slide 18

Slide 18 text

Learning Guarantees
• Estimation error: E(Ĝ) = ∥A_π|_ℋ − Ĝ∥_{ℋ → L²_π}
• Metric distortion: η(h) = ∥h∥_ℋ / ∥h∥_{L²_π}
• Both quantities appear in spectral error bounds:
Ĝ ĥ = λ̂ ĥ   ⇒   ∥A_π ĥ − λ̂ ĥ∥_{L²_π} ≤ E(Ĝ) η(ĥ) ∥ĥ∥_{L²_π}
• KRR is problematic because η(ĥ) may deteriorate with the estimator rank

Slide 19

Slide 19 text

Estimating the Spectra of A_π [Kostic et al. NeurIPS 2023]
• Projection operator: P_ℋ f = argmin_{h∈ℋ} ∥f − h∥_{L²_π},   f ∈ L²_π
• Covariance operator: C = 𝔼_{X∼π} ϕ(X) ⊗ ϕ(X)
• Metric distortion: η(h) = ∥h∥_ℋ / ∥C^{1/2}h∥_ℋ
• Estimation error decomposition (G: true version of Ĝ; all operator norms are from ℋ to L²_π):
E(Ĝ) ≤ ∥(I − P_ℋ)A_π|_ℋ∥ + ∥P_ℋ A_π|_ℋ − G∥ + ∥G − Ĝ∥
(representation error + estimator bias + estimator variance)

Slide 20

Slide 20 text

Spectral Rates for RRR [Kostic et al. NeurIPS 2023]
Theorem 1: Let ε_n = n^{−α/(2(α+β))}. With probability at least 1−δ, the RRR estimator satisfies
E(Ĝ) ≤ σ_{r+1}(A_π|_ℋ) + ε_n ln(1/δ)
|λ_i(A_π) − λ̂_i| ≲ σ_{r+1}(A_π|_ℋ) / σ_r(A_π|_ℋ) + ε_n ln(1/δ)   (first term: spectral bias)
• β ∈ [0,1]: effective dimension of ℋ in L²_π (spectral decay of C)
• α ∈ (0,2]: "size" of the image of A_π|_ℋ in L²_π
• The rate ε_n is minimax optimal

Slide 21

Slide 21 text

Comparison to PCR [Kostic et al. NeurIPS 2023]
Theorem 1: Let ε_n = n^{−α/(2(α+β))}. With probability at least 1−δ, the PCR estimator satisfies
E(Ĝ) ≤ σ^{1/2}_{r+1}(C) + ε_n ln(1/δ)
|λ_i(A_π) − λ̂_i| ≲ 2σ^{1/2}_{r+1}(C) / |σ_r(A_π|_ℋ) − σ^{α/2}_{r+1}(C)| + ε_n ln(1/δ)   (first term: spectral bias)
• β ∈ [0,1]: effective dimension of ℋ in L²_π (spectral decay of C)
• α ∈ (0,2]: "size" of the image of A_π|_ℋ in L²_π
• Irreducible bias even if A_π|_ℋ has rank r

Slide 22

Slide 22 text

Estimating the Eigenfunctions [Kostic et al. NeurIPS 2023]
• Estimation of the eigenfunctions is linked to the estimation error of the eigenvalues and to the spectral gap gap_i(A_π) = min_{j≠i} |λ_i(A_π) − λ_j(A_π)|:
∥f_i − ĥ_i∥_{L²_π} ≤ 2|λ_i − λ̂_i| / [gap_i(A_π) − |λ_i − λ̂_i|]_+

Slide 23

Slide 23 text

Estimating the Eigenfunctions [Kostic et al. NeurIPS 2023]
• The eigenfunctions' estimation error is linked to the eigenvalues' error and to the spectral gap gap_i(A_π) = min_{j≠i} |λ_i(A_π) − λ_j(A_π)|:
∥f_i − ĥ_i∥_{L²_π} ≤ 2|λ_i − λ̂_i| / [gap_i(A_π) − |λ_i − λ̂_i|]_+
• RRR estimates the dominant eigenfunctions better than PCR
[Figure: 1D quadruple-well potential]

Slide 24

Slide 24 text

Distribution Forecasting [Inzerilli et al. Submitted 2024]
● A*_π evolves distributions: if X_{t−1} ∼ μ_{t−1} with density q_{t−1} ∈ L²_π(𝒳) w.r.t. π, then q_t = A*_π q_{t−1}
● Estimators can be used to forecast distributions. Further, if ℋ is invariant and K_{μ_t} = 𝔼_{X∼μ_t} ϕ(X) is the kernel mean embedding, then K_{μ_t} = G* K_{μ_{t−1}} (see the sketch below)
● Estimators based on a deflate-learn-inflate procedure obtain uniform-in-time forecasting bounds
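A minimal NumPy sketch of the mean-embedding recursion K_{μ_t} = G* K_{μ_{t−1}} using the kernel estimator of Slide 15. It uses the plain estimator, not the deflate-learn-inflate procedure, and assumes ℋ is (approximately) invariant; the kernel, its bandwidth, and the weights W are as in the earlier KRR sketch.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian kernel matrix k(a_i, b_j).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def evolve_mean_embedding(W, X, Y, Z, steps=3, sigma=1.0):
    """Propagate the empirical mean embedding of mu_0 (samples Z, uniform weights)
    through the fitted operator: K_{mu_t} ~= G_hat^* K_{mu_{t-1}}.
    Each returned vector c_t represents K_{mu_t} ~= sum_i c_t[i] phi(y_i)."""
    # First step: <phi(x_j), K_{mu_0}> = mean_l k(x_j, z_l), then push through W.
    c = W @ rbf(X, Z, sigma).mean(axis=1)
    out = [c]
    for _ in range(steps - 1):
        # Now <phi(x_j), K_{mu_{t-1}}> = sum_i k(x_j, y_i) c[i].
        c = W @ (rbf(X, Y, sigma) @ c)
        out.append(c)
    return out

# For an observable f in the RKHS, E_{mu_t}[f] ~= <f, K_{mu_t}> = sum_i c_t[i] f(y_i).
```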

Slide 25

Slide 25 text

Learning the Representation (ℋ) [Kostic et al. ICLR 2024]
• We assume A_π is compact, so it admits an SVD: A_π = ∑_{i=1}^∞ σ_i u_i ⊗ v*_i
• A good ℋ should:
(1) approximate the operator well: P_ℋ A_π ≈ A_π
(2) have low representation error: ∥(I − P_ℋ)A_π|_ℋ∥ ≈ 0
(3) have low metric distortion: η(h) ≈ 1, ∀h ∈ ℋ (i.e. C_ℋ ≈ I)

Slide 26

Slide 26 text

Learning the Representation (ℋ) [Kostic et al. ICLR 2024]
• We assume A_π is compact, so it admits an SVD: A_π = ∑_{i=1}^∞ σ_i u_i ⊗ v*_i
• A good ℋ should:
(1) approximate the operator well: P_ℋ A_π ≈ A_π
(2) have low representation error: ∥(I − P_ℋ)A_π|_ℋ∥ ≈ 0
(3) have low metric distortion: η(h) ≈ 1, ∀h ∈ ℋ (i.e. C_ℋ ≈ I)
• If ℋ = span(u_1, …, u_r), the desiderata are met:
∥(I − P_ℋ)A_π|_ℋ∥ ≤ σ_{r+1} ∥C^{1/2}_ℋ∥ ≈ σ_{r+1}
• Moreover, if A_π is normal, the representation error is zero

Slide 27

Slide 27 text

Learning the Representation (ℋ) [Kostic et al. ICLR 2024]
• To learn the SVD we learn functions ψ_w, ψ′_w: 𝒳 → ℝ^r maximizing the CCA score:
∥P_{ℋ_w} A_π P_{ℋ′_w}∥²_HS = ∥(C^w_X)^{−1/2} C^w_{XY} (C^w_Y)^{−1/2}∥²_HS

Slide 28

Slide 28 text

Learning the Representation (ℋ) [Kostic et al. ICLR 2024]
• To learn the SVD we learn functions ψ_w, ψ′_w: 𝒳 → ℝ^r maximizing the CCA score:
∥P_{ℋ_w} A_π P_{ℋ′_w}∥²_HS = ∥(C^w_X)^{−1/2} C^w_{XY} (C^w_Y)^{−1/2}∥²_HS
• We introduce the relaxed differentiable score (see the sketch below):
∥C^w_{XY}∥²_HS / (∥C^w_X∥ ∥C^w_Y∥) − γ∥I − C^w_X∥²_HS − γ∥I − C^w_Y∥²_HS
• If the representation is rich, both scores have the same set of maximizers
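A minimal PyTorch sketch of the relaxed score computed from mini-batch features. The feature shapes, the γ value, and the absence of centering are assumptions made for illustration; the actual DPNets training setup may differ.

```python
import torch

def relaxed_dp_score(phi_x, phi_y, gamma=0.1):
    """Relaxed, differentiable representation score from mini-batch features
    phi_x = psi_w(x_t) and phi_y = psi'_w(x_{t+1}), both of shape (batch, r):
        ||C_XY||_HS^2 / (||C_X|| ||C_Y||)
          - gamma ||I - C_X||_HS^2 - gamma ||I - C_Y||_HS^2
    """
    n, r = phi_x.shape
    cx = phi_x.T @ phi_x / n      # empirical covariance of the input features
    cy = phi_y.T @ phi_y / n      # empirical covariance of the output features
    cxy = phi_x.T @ phi_y / n     # empirical cross-covariance
    eye = torch.eye(r, dtype=phi_x.dtype, device=phi_x.device)
    cca_part = cxy.norm('fro') ** 2 / (
        torch.linalg.matrix_norm(cx, ord=2) * torch.linalg.matrix_norm(cy, ord=2)
    )
    penalty = (eye - cx).norm('fro') ** 2 + (eye - cy).norm('fro') ** 2
    return cca_part - gamma * penalty

# Training sketch: maximize the score, e.g.
#   loss = -relaxed_dp_score(psi_w(x_batch), psi_w_prime(y_batch))
#   loss.backward(); optimizer.step()
```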

Slide 29

Slide 29 text

Two Examples - Advantage of DPNets [Kostic et al. ICLR 2024]
• We can use the learned ψ_w for forecasting or interpretability
• DPNets can be extended to continuous DS (see paper)
[Figure: images predicted by our method (rows 1 and 2) vs. other methods]

Slide 30

Slide 30 text

Off Topic
● Transfer learning approach to learn the potential energy (then used for atomistic simulations) [Falk et al. NeurIPS 2023]
● A theory explaining why pre-training works well for meta-learning [Wang et al. T-PAMI 2023]
Theorem: 𝔼 ℰ(W^pre_N, θ^pre_N, μ) ≤ min_{W,θ} ℰ(W, θ, μ) + O(1/√(nT))

Slide 31

Slide 31 text

Conclusions
● We presented a framework for learning transfer operators of discrete stochastic dynamical systems, leading to efficient and reliable algorithms
● We derived spectral learning bounds and addressed representation learning
● Future work: study continuous, non-stationary, interactive DS, and incorporate invariances into the estimators
Thanks!

Slide 32

Slide 32 text

References and Code
● V. Kostic, P. Novelli, A. Maurer, C. Ciliberto, L. Rosasco, M. Pontil. Learning Dynamical Systems via Koopman Operator Regression in Reproducing Kernel Hilbert Spaces. NeurIPS 2022.
● V. Kostic, K. Lounici, P. Novelli, M. Pontil. Koopman Operator Learning: Sharp Spectral Rates and Spurious Eigenvalues. NeurIPS 2023.
● G. Meanti, A. Chatalic, V. Kostic, P. Novelli, M. Pontil, L. Rosasco. Estimating Koopman Operators with Sketching to Provably Learn Large Scale Dynamical Systems. NeurIPS 2023.
● V. Kostic, P. Novelli, R. Grazzi, K. Lounici, M. Pontil. Learning Invariant Representations of Time-Homogeneous Stochastic Dynamical Systems. ICLR 2024.
● P. Inzerilli, V. Kostic, K. Lounici, P. Novelli, M. Pontil. Consistent Long-Term Forecasting of Ergodic Dynamical Systems. Submitted 2024.
● G. Turri, V. Kostic, P. Novelli, M. Pontil. A Randomized Algorithm to Solve Reduced Rank Operator Regression. Submitted 2024.
Code: https://github.com/Machine-Learning-Dynamical-Systems/kooplearn