Massimiliano Pontil (IIT Genoa & UC London, Italy & UK) Transfer Operator Learning

Jia-Jie Zhu
March 24, 2024

WORKSHOP ON OPTIMAL TRANSPORT
FROM THEORY TO APPLICATIONS
INTERFACING DYNAMICAL SYSTEMS, OPTIMIZATION, AND MACHINE LEARNING
Venue: Humboldt University of Berlin, Dorotheenstraße 24

Berlin, Germany. March 11th - 15th, 2024


Transcript

  1. Transfer Operator Learning. OT-Berlin Workshop, March 11-15, 2024. Massimiliano Pontil,
    Italian Institute of Technology and University College London. Vladimir Kostic, Karim Lounici, Pietro Novelli. Also thanks to: Carlo Ciliberto, Antoine Chatalic, Riccardo Grazzi, Vladimir Kostic, Karim Lounici, Prune Inzerilli, Andreas Maurer, Giacomo Meanti, Pietro Novelli, Lorenzo Rosasco, Giacomo Turri
  2. Dynamical Systems and ML
    • Data-driven approaches are becoming a key part of science and engineering
    • Dynamical systems are mathematical models of temporally evolving phenomena, described by a state variable $x_t \in \mathcal{X}$ evolving over time $t$
  3. Dynamical Systems
    • We focus on discrete-time, time-homogeneous Markov processes: $\mathbb{P}[X_{t+1} \mid X_1, \ldots, X_t] = \mathbb{P}[X_{t+1} \mid X_t]$, independent of $t$
    • Example: $X_{t+1} = F(X_t) + \mathrm{noise}_t$
  4. Langevin Equation
    • The (overdamped) Langevin equation driven by a potential $U : \mathbb{R}^d \to \mathbb{R}$: $dX_t = -\nabla U(X_t)\,dt + \beta^{-1/2}\,dW_t$
    • Euler-Maruyama discretization: $X_{t+1} = X_t - \nabla U(X_t) + \beta^{-1/2}(W_{t+1} - W_t)$, where $F(X_t) = X_t - \nabla U(X_t)$ and $\mathrm{noise}_t = \beta^{-1/2}(W_{t+1} - W_t)$
    • Example: folding of CLN025 (Chignolin)
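Below is a minimal NumPy sketch of the Euler-Maruyama scheme above. The one-dimensional double-well potential, step size, and inverse temperature are illustrative choices standing in for a real molecular system such as Chignolin; they are not taken from the slides.

```python
import numpy as np

def simulate_langevin(grad_U, x0, n_steps, dt=1e-2, beta=1.0, rng=None):
    """Euler-Maruyama discretization of dX_t = -grad U(X_t) dt + beta^{-1/2} dW_t."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    traj = np.empty((n_steps + 1,) + x.shape)
    traj[0] = x
    for t in range(n_steps):
        noise = np.sqrt(dt / beta) * rng.standard_normal(x.shape)  # beta^{-1/2} (W_{t+1} - W_t)
        x = x - grad_U(x) * dt + noise                             # X_{t+1} = F(X_t) + noise_t (step size made explicit)
        traj[t + 1] = x
    return traj

# Toy 1D double-well potential U(x) = (x^2 - 1)^2, a stand-in for a molecular system
grad_U = lambda x: 4.0 * x * (x**2 - 1.0)
traj = simulate_langevin(grad_U, x0=np.array([1.0]), n_steps=10_000, dt=1e-2, beta=3.0)
```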
  5. Roadmap
    • Transfer / Koopman operators
    • Subspace (kernel-based) approach
    • Risk and spectral bounds
    • (Neural) representation learning
    Code: https://github.com/Machine-Learning-Dynamical-Systems/kooplearn
  6. The Koopman/Transfer Operator
    • The forward transfer operator $A : \mathcal{F} \to \mathcal{F}$ returns the expected value of an observable (i.e. any function of the state) one step ahead in the future: $(Af)(x) = \mathbb{E}[\,f(X_{t+1}) \mid X_t = x\,]$
  7. The Koopman/Transfer Operator
    • The forward transfer operator $A : \mathcal{F} \to \mathcal{F}$ returns the expected value of an observable (i.e. any function of the state) one step ahead in the future: $(Af)(x) = \mathbb{E}[\,f(X_{t+1}) \mid X_t = x\,]$
    • $A$ globally linearizes the dynamical system within a suitable invariant subspace: $A[\mathcal{F}] \subseteq \mathcal{F}$
  8. The Koopman/Transfer Operator
    • The forward transfer operator $A : \mathcal{F} \to \mathcal{F}$ returns the expected value of an observable (i.e. any function of the state) one step ahead in the future: $(Af)(x) = \mathbb{E}[\,f(X_{t+1}) \mid X_t = x\,]$
    • $A$ globally linearizes the dynamical system within a suitable invariant subspace: $A[\mathcal{F}] \subseteq \mathcal{F}$
    • If $\mathcal{F} = \mathrm{span}\{\phi_1, \ldots, \phi_r\}$ is invariant, the operator $A$ is identified by an $r \times r$ matrix $G$ evolving function coefficients: $\langle w, \phi(\cdot)\rangle \mapsto \langle Gw, \phi(\cdot)\rangle$
  9. Power of Linearity: Mode Decomposition
    • Spectral decomposition (self-adjoint compact case): $A = \sum_{i=1}^{\infty} \lambda_i\, f_i \otimes f_i$, where the $f_i$ are eigenfunctions of $A$ with eigenvalue $\lambda_i$, so that $\mathbb{E}[\,f(X_t) \mid X_0 = x\,] = (A^t f)(x) = \sum_i \lambda_i^t\, f_i(x)\, \langle f, f_i \rangle$
    • It disentangles an observable forecast into temporal and static components
    • A good $\mathcal{F}$ is given by the span of leading eigenfunctions
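On an invariant subspace where $A$ is represented by the $r \times r$ matrix $G$ of slide 8, this decomposition reduces to an eigendecomposition of $G$: the $t$-step forecast of $f = \langle w, \phi(\cdot)\rangle$ is $\langle G^t w, \phi(x)\rangle$, which splits into static modes weighted by the temporal factors $\lambda_i^t$. A minimal sketch of this finite-dimensional analogue; the matrix, observable, and feature values below are illustrative placeholders.

```python
import numpy as np

def modal_forecast(G, w, phi_x, t):
    """t-step forecast of the observable f = <w, phi(.)> at a state x.

    With G = V diag(lam) V^{-1} on the invariant subspace,
    (A^t f)(x) = <G^t w, phi(x)> = sum_i lam_i^t * c_i * <v_i, phi(x)>,
    separating the forecast into static modes and temporal factors lam_i^t.
    """
    lam, V = np.linalg.eig(G)              # spectrum of the matrix representation
    c = np.linalg.solve(V, w)              # w expressed in the eigenbasis of G
    static_modes = c * (V.T @ phi_x)       # i-th entry: c_i * <v_i, phi(x)>
    return np.sum(lam**t * static_modes)   # temporal part lam_i^t times static part

# Illustrative usage with a random 3-dimensional invariant feature space
rng = np.random.default_rng(0)
G = 0.3 * rng.standard_normal((3, 3))      # toy matrix representation of A on span{phi_1, .., phi_3}
w, phi_x = rng.standard_normal(3), rng.standard_normal(3)
print(modal_forecast(G, w, phi_x, t=5).real)
```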
  10. Related Work
    • Data-driven algorithms to reconstruct (deterministic) dynamical systems: Kutz, Brunton, Budisic, Proctor, Mezic, Giannakis, Rowley, …
    • Transfer operators in molecular dynamics: Noé, Klus, Nüske, Parrinello, Schuster, …
    • Conditional mean embeddings (CME) and stochastic processes: Fukumizu, Gretton, Grunewalder, Mollenhauer, Muandet, Pontil, Scholkopf, Sriperumbudur, …
  11. Statistical Learning Setting [Kostic et al. NeurIPS 2022]
    • Ergodicity: there is a unique distribution $\pi$ s.t. $X_t \sim \pi \Rightarrow X_{t+1} \sim \pi$
    • $A_{\mathcal{F}}$ is well-defined for $\mathcal{F} = L^2_\pi(\mathcal{X})$: $(A_{\mathcal{F}} f)(x) = \mathbb{E}[\,f(X_{t+1}) \mid X_t = x\,] = \int_{\mathcal{X}} f(y)\, p(dy, x)$ for $f \in \mathcal{F}$, where $p$ is the transition kernel; we write $A_\pi \equiv A_{L^2_\pi(\mathcal{X})} : L^2_\pi(\mathcal{X}) \to L^2_\pi(\mathcal{X})$
    • Challenge: the operator and its domain are unknown!
  12. Subspace Approach
    • We restrict $A_\pi$ to an RKHS $\mathcal{H}$ and look for an operator $G : \mathcal{H} \to \mathcal{H}$ such that $A_\pi \langle w, \phi(\cdot)\rangle \approx \langle Gw, \phi(\cdot)\rangle$, that is: $\mathbb{E}[\,\phi(X_{t+1}) \mid X_t = x\,] \approx G^* \phi(x)$
    [Diagram: $A_\pi$ acting on $L^2_\pi(\mathcal{X})$, approximated by $G$ acting on $\mathcal{H}$]
  13. Subspace Approach
    • We restrict $A_\pi$ to an RKHS $\mathcal{H}$ and look for an operator $G : \mathcal{H} \to \mathcal{H}$ such that $A_\pi \langle w, \phi(\cdot)\rangle \approx \langle Gw, \phi(\cdot)\rangle$, that is: $\mathbb{E}[\,\phi(X_{t+1}) \mid X_t = x\,] \approx G^* \phi(x)$
    • State sequence: $\cdots \to x_t \to x_{t+1} \to \cdots$; representation: $\cdots \to \phi(x_t) \to \phi(x_{t+1}) \to \cdots$; loss: $\|\phi(x_{t+1}) - G^* \phi(x_t)\|^2$
    • We aim to minimize the risk: $R(G) = \mathbb{E}_{(X,Y)\sim\rho}\, \|\phi(Y) - G^* \phi(X)\|^2$
  14. Empirical Risk Minimization
    • Given an i.i.d. sample $(x_i, y_i = x_{i+1})_{i=1}^n \sim \rho$, learn $\hat G : \mathcal{H} \to \mathcal{H}$ minimizing: $\sum_{i=1}^n \|\phi(y_i) - G^* \phi(x_i)\|^2 + \gamma \|G\|_{HS}^2$
    • Several estimators are included in this setting
  15. Empirical Risk Minimization
    • Given a sample $(x_i, y_i = x_{i+1})_{i=1}^n \sim \rho$, learn $\hat G : \mathcal{H} \to \mathcal{H}$ minimizing: $\sum_{i=1}^n \|\phi(y_i) - G^* \phi(x_i)\|^2 + \gamma \|G\|_{HS}^2$
    • The solution has the form: $(\hat G f)(x) = \sum_{i,j=1}^n W_{i,j}\, f(y_i)\, k(x_j, x)$
    • Ridge Regression (KRR): full-rank solution $W_{\mathrm{KRR}} = (K + \gamma I)^{-1}$, where $K = (k(x_i, x_j))_{i,j=1}^n$
    • Principal Component Regression (PCR): low-rank; minimizes the risk on a feature subspace spanned by the principal components
    • Reduced Rank Regression (RRR): low-rank; adds a rank constraint leading to a generalized eigenvalue problem
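A minimal NumPy sketch of the full-rank (KRR) estimator from this slide: form $K = (k(x_i, x_j))_{i,j}$, set $W_{\mathrm{KRR}} = (K + \gamma I)^{-1}$, and evaluate $(\hat G f)(x) = \sum_{i,j} W_{i,j} f(y_i) k(x_j, x)$. The Gaussian kernel, its bandwidth, the regularization value, and the toy trajectory are illustrative choices; the kooplearn repository linked on slide 5 contains the authors' full implementations, including the low-rank estimators.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def fit_krr(X, gamma=1e-3, sigma=1.0):
    """Full-rank (kernel ridge) estimator: W_KRR = (K + gamma I)^{-1}, K_ij = k(x_i, x_j)."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.inv(K + gamma * np.eye(len(X)))

def forecast_observable(W, X, Y, f, x_new, sigma=1.0):
    """(G_hat f)(x) = sum_{i,j} W_ij f(y_i) k(x_j, x): one-step-ahead forecast of f."""
    k_x = gaussian_kernel(X, np.atleast_2d(x_new), sigma)[:, 0]   # vector (k(x_j, x))_j
    return f(Y) @ W @ k_x

# Toy usage: pairs (x_i, y_i = x_{i+1}) from a trajectory of an ergodic linear system
rng = np.random.default_rng(0)
traj = np.zeros((501, 2))
for t in range(500):
    traj[t + 1] = 0.9 * traj[t] + 0.1 * rng.standard_normal(2)
X, Y = traj[:-1], traj[1:]
W = fit_krr(X, gamma=1e-3, sigma=0.5)
pred = forecast_observable(W, X, Y, f=lambda pts: pts[:, 0], x_new=traj[-1], sigma=0.5)
```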
  16. Learning Guarantees
    • Estimation error: $E(\hat G) = \sup_{\|h\|_{\mathcal{H}} \le 1} \|(A_\pi - \hat G)h\|_{L^2_\pi}$
    • Metric distortion: $\eta(h) = \|h\|_{\mathcal{H}} / \|h\|_{L^2_\pi}$
    • We wish to bound $E(\hat G)$ and link the spectrum of $\hat G$ to that of $A_\pi$
  17. Learning Guarantees
    • Estimation error: $E(\hat G) = \|A_\pi|_{\mathcal{H}} - \hat G\|_{\mathcal{H} \to L^2_\pi}$; metric distortion: $\eta(h) = \|h\|_{\mathcal{H}} / \|h\|_{L^2_\pi}$
    • Both quantities appear in spectral error bounds: $\hat G \hat h = \hat\lambda \hat h \;\Rightarrow\; \|A_\pi \hat h - \hat\lambda \hat h\|_{L^2_\pi} \le E(\hat G)\, \eta(\hat h)\, \|\hat h\|_{L^2_\pi}$
    • Even if $E(\hat G)$ is small, spectral estimation may fail!
  18. Learning Guarantees
    • Estimation error: $E(\hat G) = \|A_\pi|_{\mathcal{H}} - \hat G\|_{\mathcal{H} \to L^2_\pi}$; metric distortion: $\eta(h) = \|h\|_{\mathcal{H}} / \|h\|_{L^2_\pi}$
    • Both quantities appear in spectral error bounds: $\hat G \hat h = \hat\lambda \hat h \;\Rightarrow\; \|A_\pi \hat h - \hat\lambda \hat h\|_{L^2_\pi} \le E(\hat G)\, \eta(\hat h)\, \|\hat h\|_{L^2_\pi}$
    • KRR is problematic because $\eta(\hat h)$ may deteriorate with the estimator rank
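Since a large metric distortion flags a potentially spurious eigenpair, it is useful to estimate it from data. For $\hat h = \sum_i \alpha_i k(x_i, \cdot)$ one has $\|\hat h\|^2_{\mathcal{H}} = \alpha^\top K \alpha$, and $\|\hat h\|^2_{L^2_\pi}$ can be approximated by the empirical second moment $\tfrac{1}{n}\alpha^\top K^2 \alpha$ if the kernel points are treated as (approximately) $\pi$-distributed samples. A minimal sketch under that assumption:

```python
import numpy as np

def metric_distortion(alpha, K):
    """eta(h) = ||h||_H / ||h||_{L2_pi} for h = sum_i alpha_i k(x_i, .).

    ||h||_H^2 = alpha^T K alpha, while ||h||_{L2_pi}^2 is approximated by the
    empirical second moment (1/n) sum_j h(x_j)^2 = (1/n) alpha^T K^2 alpha,
    treating the kernel points as (approximately) pi-distributed samples.
    """
    n = K.shape[0]
    rkhs_norm2 = alpha @ K @ alpha
    l2_norm2 = alpha @ K @ (K @ alpha) / n
    return np.sqrt(rkhs_norm2 / l2_norm2)

# A large value of eta flags a potentially spurious eigenpair, even when E(G_hat) is small.
```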
  19. Estimating the Spectra of $A_\pi$ [Kostic et al. NeurIPS 2023]
    • Estimation error decomposition: $E(\hat G) \le \|(I - P_{\mathcal{H}}) A_\pi|_{\mathcal{H}}\| + \|P_{\mathcal{H}} A_\pi|_{\mathcal{H}} - G\| + \|G - \hat G\|$ (representation error + estimator bias + estimator variance), where all operator norms are from $\mathcal{H}$ to $L^2_\pi$ and $G$ is the "true" (population) version of $\hat G$
    • Projection operator: $P_{\mathcal{H}} f = \mathrm{argmin}_{h \in \mathcal{H}} \|f - h\|_{L^2_\pi}$, $f \in L^2_\pi$
    • Covariance operator: $C = \mathbb{E}_{X \sim \pi}\, \phi(X) \otimes \phi(X)$
    • Metric distortion: $\eta(h) = \|h\|_{\mathcal{H}} / \|C^{1/2} h\|_{\mathcal{H}}$
  20. Spectral Rates for RRR [Kostic et al. NeurIPS 2023]
    Theorem 1: Let $\epsilon_n = n^{-\frac{\alpha}{2(\alpha+\beta)}}$. With probability at least $1 - \delta$, the RRR estimator satisfies
      $E(\hat G) \le \sigma_{r+1}(A_\pi|_{\mathcal{H}}) + \epsilon_n \ln\tfrac{1}{\delta}$
      $|\lambda_i(A_\pi) - \hat\lambda_i| \lesssim \tfrac{\sigma_{r+1}(A_\pi|_{\mathcal{H}})}{\sigma_r(A_\pi|_{\mathcal{H}})} + \epsilon_n \ln\tfrac{1}{\delta}$, where the first term is the spectral bias
    • $\beta \in [0,1]$: effective dimension of $\mathcal{H}$ in $L^2_\pi$ (spectral decay of $C$)
    • $\alpha \in (0,2]$: "size" of the image of $A_\pi|_{\mathcal{H}}$ in $L^2_\pi$
    • The rate is minimax optimal
  21. Comparison to PCR [Kostic et al. NeurIPS 2023]
    Theorem: Let $\epsilon_n = n^{-\frac{\alpha}{2(\alpha+\beta)}}$. With probability at least $1 - \delta$, the PCR estimator satisfies
      $E(\hat G) \le \sigma^{1/2}_{r+1}(C) + \epsilon_n \ln\tfrac{1}{\delta}$
      $|\lambda_i(A_\pi) - \hat\lambda_i| \lesssim \tfrac{2\,\sigma^{1/2}_{r+1}(C)}{|\sigma_r(A_\pi|_{\mathcal{H}}) - \sigma^{\alpha/2}_{r+1}(C)|} + \epsilon_n \ln\tfrac{1}{\delta}$, where the first term is the spectral bias
    • $\beta \in [0,1]$: effective dimension of $\mathcal{H}$ in $L^2_\pi$ (spectral decay of $C$)
    • $\alpha \in (0,2]$: "size" of the image of $A_\pi|_{\mathcal{H}}$ in $L^2_\pi$
    • Irreducible bias even if $A_\pi|_{\mathcal{H}}$ has rank $r$
  22. Estimating the Eigenfunctions [Kostic et al. NeurIPS 2023]
    • Estimation of the eigenfunction is linked to the estimation error for the eigenvalue and to the spectral gap:
      $\|f_i - \hat h_i\|_{L^2_\pi} \le \dfrac{2\,|\lambda_i - \hat\lambda_i|}{|\mathrm{gap}_i(A_\pi) - |\lambda_i - \hat\lambda_i||} + \ldots$, where $\mathrm{gap}_i(A_\pi) = \min_{j \ne i} |\lambda_i(A_\pi) - \lambda_j(A_\pi)|$
  23. Estimating the Eigenfunctions [Kostic et al. NeurIPS 2023]
    • Eigenfunctions' estimation error is linked to the eigenvalues' error and to the spectral gap:
      $\|f_i - \hat h_i\|_{L^2_\pi} \le \dfrac{2\,|\lambda_i - \hat\lambda_i|}{|\mathrm{gap}_i(A_\pi) - |\lambda_i - \hat\lambda_i||} + \ldots$, where $\mathrm{gap}_i(A_\pi) = \min_{j \ne i} |\lambda_i(A_\pi) - \lambda_j(A_\pi)|$
    • [Figure: 1D quadruple-well potential] RRR estimates dominant eigenfunctions better than PCR
  24. Distributions Forecasting [Inzerilli et al. Submitted 2024]
    • $A^*_\pi$ evolves distributions: if $X_{t-1} \sim \mu_{t-1}$ with density $q_{t-1} \in L^2_\pi(\mathcal{X})$ w.r.t. $\pi$, then $q_t = A^*_\pi q_{t-1}$
    • The estimators can be used to forecast distributions. Further, if $\mathcal{H}$ is invariant and $K\mu_t = \mathbb{E}_{X \sim \mu_t}\, \phi(X)$ is the kernel mean embedding, then $K\mu_t = G^* K\mu_{t-1}$
    • Estimators based on a deflate-learn-inflate procedure obtain uniform-in-time forecasting bounds
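Combining this with the sample-based estimator of slide 15 gives one way to implement the recursion $K\mu_t = \hat G^* K\mu_{t-1}$: since $\hat G^*\phi(x) = \sum_{i,j} W_{i,j}\, k(x_j, x)\, \phi(y_i)$, an embedding expressed as $\sum_i \beta_i \phi(y_i)$ is pushed forward by $\beta \leftarrow W K_{XY} \beta$ with $K_{XY}[j,i] = k(x_j, y_i)$. The sketch below follows this plain recursion only; it does not include the deflate-learn-inflate procedure of Inzerilli et al., and the matrix names are illustrative.

```python
import numpy as np

def propagate_embedding(W, kappa0, K_XY, n_steps):
    """Coefficients beta of K_mu_t = sum_i beta_i phi(y_i) after n_steps pushes.

    The sample-based estimator acts on features as
        G_hat^* phi(x) = sum_{i,j} W_ij k(x_j, x) phi(y_i),
    so with kappa0[j] = mean_l k(x_j, z_l) (the initial empirical distribution
    over points z_1, ..., z_m seen through the x-points) the recursion is
        beta^{(1)} = W @ kappa0,   beta^{(t+1)} = W @ K_XY @ beta^{(t)},
    with K_XY[j, i] = k(x_j, y_i).
    """
    beta = W @ kappa0
    for _ in range(n_steps - 1):
        beta = W @ (K_XY @ beta)     # K_mu_{t+1} = G_hat^* K_mu_t
    return beta

def expected_observable(beta, f_Y):
    """For f in the RKHS, E_{mu_t}[f] ~= <K_mu_t, f> = sum_i beta_i f(y_i)."""
    return beta @ f_Y
```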
  25. Learning the Representation ($\mathcal{H}$) [Kostic et al. ICLR 2024]
    • We assume $A_\pi$ is compact, so it admits an SVD: $A_\pi = \sum_{i=1}^{\infty} \sigma_i\, u_i \otimes v^*_i$
    • A good $\mathcal{H}$ should: (1) approximate the operator well: $P_{\mathcal{H}} A_\pi \approx A_\pi$; (2) have low representation error: $\|(I - P_{\mathcal{H}}) A_\pi|_{\mathcal{H}}\| \approx 0$; (3) have low metric distortion: $\eta(h) \approx 1$ for all $h \in \mathcal{H}$ (i.e. $C_{\mathcal{H}} \approx I$)
  26. Learning the Representation ($\mathcal{H}$) [Kostic et al. ICLR 2024]
    • We assume $A_\pi$ is compact, so it admits an SVD: $A_\pi = \sum_{i=1}^{\infty} \sigma_i\, u_i \otimes v^*_i$
    • A good $\mathcal{H}$ should: (1) approximate the operator well: $P_{\mathcal{H}} A_\pi \approx A_\pi$; (2) have low representation error: $\|(I - P_{\mathcal{H}}) A_\pi|_{\mathcal{H}}\| \approx 0$; (3) have low metric distortion: $\eta(h) \approx 1$ for all $h \in \mathcal{H}$ (i.e. $C_{\mathcal{H}} \approx I$)
    • If the desiderata are met: $\|(I - P_{\mathcal{H}}) A_\pi|_{\mathcal{H}}\| \le \sigma_{r+1} \|C^{1/2}_{\mathcal{H}}\| \approx \sigma_{r+1}$
    • Moreover, if $A_\pi$ is normal and $\mathcal{H} = \mathrm{span}(u_1, \ldots, u_r)$, the representation error is zero
  27. Learning the Representation ($\mathcal{H}$) [Kostic et al. ICLR 2024]
    • To learn the SVD we learn functions $\psi_w, \psi'_w : \mathcal{X} \to \mathbb{R}^r$ maximizing the CCA score: $\|P_{\mathcal{H}_w} A_\pi P_{\mathcal{H}'_w}\|^2_{HS} = \|(C^w_X)^{-1/2} C^w_{XY} (C^w_Y)^{-1/2}\|^2_{HS}$
  28. Learning the Representation ($\mathcal{H}$) [Kostic et al. ICLR 2024]
    • To learn the SVD we learn functions $\psi_w, \psi'_w : \mathcal{X} \to \mathbb{R}^r$ maximizing the CCA score: $\|P_{\mathcal{H}_w} A_\pi P_{\mathcal{H}'_w}\|^2_{HS} = \|(C^w_X)^{-1/2} C^w_{XY} (C^w_Y)^{-1/2}\|^2_{HS}$
    • We introduce the relaxed differentiable score: $\dfrac{\|C^w_{XY}\|^2_{HS}}{\|C^w_X\|\, \|C^w_Y\|} - \gamma \|I - C^w_X\|^2_{HS} - \gamma \|I - C^w_Y\|^2_{HS}$
    • If the representation is rich, both scores have the same set of maximizers
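A minimal NumPy sketch of the relaxed score evaluated on a batch of paired features $\psi_w(x_t)$, $\psi'_w(x_{t+1})$. Uncentered empirical covariances, the Frobenius norm for $\|\cdot\|_{HS}$, and the spectral norm for the unsquared norms in the denominator are one reading of the slide; in practice the score is maximized with an automatic-differentiation framework rather than evaluated in NumPy.

```python
import numpy as np

def relaxed_dpnet_score(Psi_X, Psi_Y, gamma=1.0):
    """Relaxed, differentiable representation score on a batch of paired features.

    Psi_X[t] = psi_w(x_t) and Psi_Y[t] = psi'_w(x_{t+1}), both of shape (n, r).
    Score = ||C_XY||_HS^2 / (||C_X|| ||C_Y||)
            - gamma ||I - C_X||_HS^2 - gamma ||I - C_Y||_HS^2,
    with uncentered empirical covariances C_X = Psi_X^T Psi_X / n, etc.
    Frobenius norm plays the role of ||.||_HS, spectral norm that of ||.||.
    """
    n, r = Psi_X.shape
    C_X = Psi_X.T @ Psi_X / n
    C_Y = Psi_Y.T @ Psi_Y / n
    C_XY = Psi_X.T @ Psi_Y / n
    cca_part = np.linalg.norm(C_XY, "fro") ** 2 / (
        np.linalg.norm(C_X, 2) * np.linalg.norm(C_Y, 2)
    )
    whitening_penalty = (
        np.linalg.norm(np.eye(r) - C_X, "fro") ** 2
        + np.linalg.norm(np.eye(r) - C_Y, "fro") ** 2
    )
    return cca_part - gamma * whitening_penalty
```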
  29. Two Examples: Advantage of DPNets [Kostic et al. ICLR 2024]
    • We can use the learned $\psi_w$ for forecasting or interpretability
    • DPNets can be extended to continuous dynamical systems (see paper)
    • [Figure: images predicted by our method (rows 1 and 2) vs. other methods]
  30. Off Topic
    • A transfer learning approach to learn the potential energy (then used for atomistic simulations) [Falk et al. NeurIPS 2023]
    • A theory explaining why pre-training works well for meta-learning [Wang et al. T-PAMI 2023]. Theorem: $\mathbb{E}\, \mathcal{E}(W^{\mathrm{pre}}_N, \theta^{\mathrm{pre}}_N, \mu) \le \min_{W,\theta} \mathcal{E}(W, \theta, \mu) + O(1/\sqrt{nT})$
  31. Conclusions
    • We presented a framework for learning transfer operators of discrete stochastic dynamical systems, leading to efficient and reliable algorithms
    • We derived spectral learning bounds and addressed representation learning
    • Future work: study continuous, non-stationary, and interactive dynamical systems, and incorporate invariances into the estimators
    Thanks!
  32. References and Code
    • V. Kostic, P. Novelli, A. Maurer, C. Ciliberto, L. Rosasco, M. Pontil. Learning Dynamical Systems via Koopman Operator Regression in Reproducing Kernel Hilbert Spaces. NeurIPS 2022.
    • V. Kostic, K. Lounici, P. Novelli, M. Pontil. Koopman Operator Learning: Sharp Spectral Rates and Spurious Eigenvalues. NeurIPS 2023.
    • G. Meanti, A. Chatalic, V. Kostic, P. Novelli, M. Pontil, L. Rosasco. Estimating Koopman Operators with Sketching to Provably Learn Large Scale Dynamical Systems. NeurIPS 2023.
    • V. Kostic, P. Novelli, R. Grazzi, K. Lounici, M. Pontil. Learning Invariant Representations of Time-Homogeneous Stochastic Dynamical Systems. ICLR 2024.
    • P. Inzerilli, V. Kostic, K. Lounici, P. Novelli, M. Pontil. Consistent Long-Term Forecasting of Ergodic Dynamical Systems. Submitted, 2024.
    • G. Turri, V. Kostic, P. Novelli, M. Pontil. A Randomized Algorithm to Solve Reduced Rank Operator Regression. Submitted, 2024.
    Code: https://github.com/Machine-Learning-Dynamical-Systems/kooplearn