Score Learning under the Manifold Hypothesis: Theory and Implications for Data Science

Ya-Ping Hsieh

ICSP 2025 invited session

Transcript

  1. 1/19 Score Learning under the Manifold Hypothesis: Theory and Implications for Data Science
     Ya-Ping Hsieh, joint work with Xiang Li, Zebang Shen, and Niao He. July 2025.
  2. 3/19 Score Learning and Diffusion Models
     ◦ Score learning: estimate the smoothed score function [Vin11; HD05]
       ▶ Core component in modern diffusion models [SSK+21; SME20; HJA20]
       ▶ VE or SMLD SDE: $f(x, t) = 0$ and $g(t) = \sqrt{\frac{\mathrm{d}\sigma^2(t)}{\mathrm{d}t}}$ [SSK+21; SME20], with target
         $s^*_\sigma(x) = \nabla \log\big(p_{\mathrm{data}} * \mathcal{N}(0, \sigma^2 I)\big)(x)$
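     In practice the smoothed score is typically estimated by denoising score matching [Vin11]: perturb data with $\mathcal{N}(0, \sigma^2 I)$ noise and regress a model onto the conditional score $-\varepsilon/\sigma$. A minimal Monte Carlo sketch of that objective, assuming a candidate score function s that maps a batch of points to a batch of score vectors (the function and variable names are placeholders of mine, not the authors' code):

       import numpy as np

       def dsm_loss(s, data, sigma, rng=None):
           """Monte Carlo estimate of E || s(x + sigma*eps) + eps/sigma ||^2.

           Its minimizer over s is the smoothed score
           grad_x log (p_data * N(0, sigma^2 I))(x).
           """
           rng = np.random.default_rng() if rng is None else rng
           eps = rng.standard_normal(data.shape)
           residual = s(data + sigma * eps) + eps / sigma   # s is assumed vectorized over rows
           return np.mean(np.sum(residual**2, axis=1))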
  3. 4/19 Key Challenges in Score Learning
     ◦ A key empirical challenge: as $\sigma \to 0$,
       $\big\| s_\sigma(x) - \nabla \log\big(p_{\mathrm{data}} * \mathcal{N}(0, \sigma^2 I)\big)(x) \big\| \to \infty$
       ▶ Leads to instability in score estimation at low noise levels [SSK+21; KAA+22] (see the check below)
       ▶ Causes significant downstream effects [AGH+23; RŠP+25]
       ▶ Only partially understood theoretically [SBD+24]
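     The instability is already visible in the regression target of denoising score matching: its typical size is $\|\varepsilon\|/\sigma \approx \sqrt{d}/\sigma$, which diverges as $\sigma \to 0$. A quick numerical check (the dimension and the grid of sigma values are arbitrary choices of mine):

       import numpy as np

       rng = np.random.default_rng(0)
       d = 100                                    # ambient dimension (arbitrary)
       for sigma in [1.0, 0.1, 0.01, 0.001]:
           eps = rng.standard_normal((10_000, d))
           target_norm = np.linalg.norm(eps / sigma, axis=1).mean()
           print(f"sigma = {sigma:7.3f}   mean ||eps/sigma|| = {target_norm:12.1f}")

     The printed norms grow like $\sqrt{d}/\sigma$, which is one concrete way the low-noise regime becomes hard.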
  4. 5/19 Score Learning under the Manifold Hypothesis
     ◦ Manifold hypothesis: $p_{\mathrm{data}}$ is supported on a manifold $M := \mathrm{supp}(p_{\mathrm{data}})$ [SE19; De 22]
     Goal: a precise characterization of the error rate of $s_\sigma$ under the manifold hypothesis.
     ◦ Key insights:
       ▶ Recovering $p_{\mathrm{data}}$ is hard in general
       ▶ Learning geometric structure (e.g., the Laplacian of $M$) is significantly easier
       ▶ Implications for ML:
         ▶ Scores can serve as manifold-aware projections (see the sketch after this item)
         ▶ But they may fail to capture domain-relevant quantities (e.g., $\nabla \log p_{\mathrm{data}}$ in scientific applications)
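     One way to read the bullet on manifold-aware projections: by Tweedie's formula, a single denoising step $x \mapsto x + \sigma^2 s_\sigma(x)$ moves a point toward $\mathbb{E}[x_0 \mid x_\sigma = x]$, which for small $\sigma$ concentrates near $M$. A minimal sketch, assuming some score estimate s_sigma at noise level sigma is available (the name is a placeholder, not an API from the talk):

       import numpy as np

       def denoise_step(x, s_sigma, sigma):
           """Tweedie / single denoising step: an approximate projection toward M.

           For x = x0 + sigma*eps with x0 ~ p_data, Tweedie's formula gives
           E[x0 | x] = x + sigma**2 * grad_x log (p_data * N(0, sigma^2 I))(x).
           """
           x = np.asarray(x, dtype=float)
           return x + sigma**2 * s_sigma(x)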
  5. 7/19 Main Results
     Theorem (informal, diffusion models). As $\sigma \to 0$:
       ▶ When the score is learned very well, $p_\sigma$ converges weakly to $p_{\mathrm{data}}$.
       ▶ When the score is learned coarsely, $p_\sigma$ can converge to any distribution on the manifold.
     Theorem (informal, modified corrector of Song et al. [SSK+21]). Even with a coarsely learned score, we can learn the uniform distribution on $M$ w.r.t. the intrinsic (volume) measure on $M$, i.e., $\frac{\mathrm{d}\pi_0}{\mathrm{d}M} \propto 1$.
  6. 8/19 Intuition: How Well Should We Learn $p_\sigma$?
     Manifold hypothesis: assume $p_{\mathrm{data}}(u)$ is supported on a compact manifold $M$.
       ▶ Locally, in a tubular neighborhood $x = \Phi(u, r)$, write $f^*_\sigma = \log\big(p_{\mathrm{data}} * \mathcal{N}(0, \sigma^2 I)\big)$ as
         $f^*_\sigma(x) = \log \int_M \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\Big(-\frac{\|x - \Phi(u, 0)\|^2}{2\sigma^2}\Big)\, p_{\mathrm{data}}(u)\, \mathrm{d}u$
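     When $\Phi$ and $p_{\mathrm{data}}$ are known, $f^*_\sigma$ can be evaluated directly by quadrature over the manifold parameter. A sketch for the unit-circle / von Mises example that appears later in the deck (grid size and the parameters kappa, theta0 are my choices):

       import numpy as np

       def f_sigma(x, sigma, kappa=4.0, theta0=0.0, n_grid=2000):
           """Quadrature evaluation of
           f*_sigma(x) = log integral_M (2 pi sigma^2)^(-d/2)
                         * exp(-||x - Phi(u,0)||^2 / (2 sigma^2)) p_data(u) du
           for M the unit circle (d = 2) and p_data a von Mises density."""
           theta = np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False)
           pts = np.stack([np.cos(theta), np.sin(theta)], axis=1)     # Phi(theta, 0) on M
           p_data = np.exp(kappa * np.cos(theta - theta0)) / (2 * np.pi * np.i0(kappa))
           sqdist = np.sum((np.asarray(x)[None, :] - pts) ** 2, axis=1)
           integrand = p_data * np.exp(-sqdist / (2 * sigma**2)) / (2 * np.pi * sigma**2)
           return np.log(integrand.mean() * 2 * np.pi)                # Riemann sum over theta

     For points far from the circle at small sigma the integrand underflows, so a log-sum-exp variant would be needed there; this plain version is enough to visualize the leading $-\|x - \Phi(u_0, 0)\|^2 / (2\sigma^2)$ behaviour derived next.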
  7. 8/19 Intuition: How Well Should We Learn $p_\sigma$?
     Approximately, the integral above gives
     $f^*_\sigma(x) = -\frac{1}{2\sigma^2}\|x - \Phi(u_0, 0)\|^2 + \log p_{\mathrm{data}}(u_0) - \log \hat H(u_0, x)^{1/2} - \frac{d - n}{2}\log 2\pi\sigma^2 + o(1)$
     The $\omega(1)$ term is the (squared) distance to the manifold $M$ at $x$, where $u_0(x) := \arg\min_u \|x - \Phi(u, 0)\|^2$.
       ▶ It determines the support of the limiting density.
       ▶ It behaves locally the same for every $M$.
     [Figure: three panels for $\sigma = 1$, $\sigma = 0.7$, $\sigma = 0.4$.]
  8. 8/19 Intuition: How Well Should We Learn $p_\sigma$?
     In the expansion of $f^*_\sigma$ above, the $\Theta(1)$ terms are:
       1. the original data distribution $p_{\mathrm{data}}$, and
       2. the curvature term $\hat H(u, x)_{i,j} = \Big\langle \frac{\partial^2 \Phi(u, 0)}{\partial u_i \partial u_j},\, \Phi(u, 0) - x \Big\rangle + \Big\langle \frac{\partial \Phi(u, 0)}{\partial u_i},\, \frac{\partial \Phi(u, 0)}{\partial u_j} \Big\rangle$ (see the symbolic check after this item).
     To recover $p_{\mathrm{data}}$, an error of $O(1)$ is therefore unacceptable.
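     As a sanity check on the curvature term, $\hat H$ can be computed symbolically for the unit-circle parameterization $\Phi(\theta, r) = (1 + r)(\cos\theta, \sin\theta)^\top$ used later in the deck; the script below simply follows the definition on this slide (the verification itself is mine, not part of the talk):

       import sympy as sp

       theta, r = sp.symbols("theta r", real=True)
       Phi0 = sp.Matrix([sp.cos(theta), sp.sin(theta)])   # Phi(theta, 0): a point on the unit circle
       x = (1 + r) * Phi0                                 # point at signed normal distance r from M
       dPhi = sp.diff(Phi0, theta)                        # tangent vector
       ddPhi = sp.diff(Phi0, theta, 2)                    # second derivative
       H = ddPhi.dot(Phi0 - x) + dPhi.dot(dPhi)           # hat-H(theta, x); a scalar since dim(M) = 1
       print(sp.simplify(H))                              # prints: r + 1

     At $r = 0$ this reduces to the metric tensor $g(\theta) = 1$, consistent with the identity $\hat H(u, \Phi(u, 0)) = g(u)$ used in the appendix slides.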
  9. 9/19 Intuition: How Well Should We Learn $p_\sigma$?
     How badly we learn the score determines the limiting distribution:
       ▶ error $o(1)$: recover $p_{\mathrm{data}}$
       ▶ error $o(\sigma^{-2})$ (but not $o(1)$): an arbitrary distribution on the manifold
       ▶ larger error: the support is no longer on the manifold
  10. 10/19 Intuition: How Well Should We Learn $p_\sigma$?
      Suppose we learn $\hat f_\sigma = f^*_\sigma + O(\sigma^\beta)$. Observe that for $\alpha > -\beta$,
      $\sigma^\alpha \hat f_\sigma(x) = -\frac{1}{2\sigma^{2-\alpha}}\|x - \Phi(u_0, 0)\|^2 + O(\sigma^{\alpha+\beta}) + \sigma^\alpha\Big(\log p_{\mathrm{data}}(u_0) - \log \hat H(u_0, x)^{1/2}\Big) - \sigma^\alpha \frac{d - n}{2}\log 2\pi\sigma^2$,
      where the first term is $\omega(1)$ and the remaining terms are $o(1)$, so that
      $e^{\sigma^\alpha \hat f_\sigma} \to \frac{\mathrm{d}\pi_0}{\mathrm{d}M} \propto 1$.
  11. 10/19 Intuition: How Well Should We Learn $p_\sigma$?
      For the rescaled potential $\sigma^\alpha \hat f_\sigma$ above, how badly we learn the score determines the limit:
        ▶ error $o(\sigma^{-2})$: the uniform distribution on $M$
        ▶ larger error: the support is no longer on the manifold
      Existing algorithms, by contrast:
        ▶ error $o(1)$: recover $p_{\mathrm{data}}$
        ▶ error $o(\sigma^{-2})$ (but not $o(1)$): an arbitrary distribution on the manifold
        ▶ larger error: the support is no longer on the manifold
  12. 11/19 The Modified Corrector
      Yang Song et al. "Score-Based Generative Modeling through Stochastic Differential Equations". In: ICLR. 2021.
      The (original) corrector simulates the SDE
      $\mathrm{d}X_t = s_\sigma(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t$, where $s_\sigma = \nabla f_\sigma$.
        ▶ The stationary distribution of this SDE is $p_\sigma(x) \propto \exp(f_\sigma(x))$.
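      The stationarity claim is the standard overdamped Langevin fact: $\mathrm{d}X_t = \nabla f(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t$ has invariant density $\propto \exp(f)$. A quick one-dimensional sanity check with $f(x) = -x^2/2$, so that the stationary law should be $\mathcal{N}(0, 1)$ (the step size, horizon, and choice of f are mine):

        import numpy as np

        rng = np.random.default_rng(0)
        dt, n_steps, n_chains = 1e-2, 5_000, 2_000
        x = np.zeros(n_chains)
        for _ in range(n_steps):
            # Euler-Maruyama for dX = -X dt + sqrt(2) dW  (drift = grad f with f(x) = -x^2/2)
            x += -x * dt + np.sqrt(2 * dt) * rng.standard_normal(n_chains)
        print(x.mean(), x.var())   # close to 0 and 1, i.e. the density exp(f) up to normalization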
  13. 11/19 The Modified Corrector
      The modified corrector instead simulates
      $\mathrm{d}X_t = \sigma^\alpha s_\sigma(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t$, where $s_\sigma = \nabla f_\sigma + O(\sigma^\beta)$.
        ▶ The stationary distribution of this SDE is $p_\sigma(x) \propto \exp\big(\sigma^\alpha (f_\sigma + O(\sigma^\beta))\big)$.
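      A minimal Euler-Maruyama sketch of the modified corrector, assuming a (possibly imperfect) score estimate s_sigma at noise level sigma is supplied by the caller; the discretization details are my choices, the talk does not specify them:

        import numpy as np

        def modified_corrector(s_sigma, sigma, alpha, x0, dt=1e-3, n_steps=20_000, rng=None):
            """Simulate dX_t = sigma**alpha * s_sigma(X_t) dt + sqrt(2) dW_t, return the final state.

            alpha = 0 recovers the original corrector; 0 < alpha < 2 gives the modified one.
            """
            rng = np.random.default_rng() if rng is None else rng
            x = np.array(x0, dtype=float)
            for _ in range(n_steps):
                x += sigma**alpha * s_sigma(x) * dt + np.sqrt(2 * dt) * rng.standard_normal(x.shape)
            return x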
  14. 13/19 Numerical Illustration: Perfect Score
        ▶ $M := \{x \in \mathbb{R}^2 : x_1^2 + x_2^2 = 1\}$, with $\Phi(\theta, r) = \begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix} + r \begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix}$.
        ▶ Von Mises distribution: $p_{\mathrm{data}}(\theta) = \frac{1}{2\pi I_0(\kappa)} \exp(\kappa \cos(\theta - \theta_0))$.
      With the original corrector: [figure, samples at $t = 0$, $3555$, $7111$, $32000$].
      When setting $\alpha = 1$ for the modified corrector: [figure, samples at $t = 0$, $3555$, $7111$, $32000$].
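      For this example the exact smoothed score $s^*_\sigma(x) = \nabla \log\big(p_{\mathrm{data}} * \mathcal{N}(0, \sigma^2 I)\big)(x)$ can be computed by quadrature over $\theta$, so a "perfect score" experiment of this kind can be reproduced without any learning; the sketch below is mine, not the authors' code:

        import numpy as np

        def smoothed_score(x, sigma, kappa=4.0, theta0=0.0, n_grid=2000):
            """Exact s*_sigma(x) for p_data = von Mises(kappa, theta0) on the unit circle,
            computed as a ratio of quadratures over the angle theta."""
            theta = np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False)
            pts = np.stack([np.cos(theta), np.sin(theta)], axis=1)      # Phi(theta, 0)
            diff = np.asarray(x)[None, :] - pts
            w = np.exp(kappa * np.cos(theta - theta0)
                       - np.sum(diff**2, axis=1) / (2 * sigma**2))      # unnormalized weights
            return -(w[:, None] * diff).sum(axis=0) / (sigma**2 * w.sum())

      Feeding this into the sampler above, e.g. modified_corrector(lambda y: smoothed_score(y, 0.1), sigma=0.1, alpha=1.0, x0=np.array([2.0, 0.0])), should qualitatively reproduce the two panels: with alpha = 0 the chain concentrates around the von Mises mode on the circle, while with alpha = 1 the angular distribution flattens toward the uniform one.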
  15. 14/19 Numerical Illustration: Imperfect Score
      We artificially add a perturbation to the score and consider the SDE
      $\mathrm{d}X_t = \sigma^\alpha \big(\nabla f_\sigma(X_t) + w(X_t)\big)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t$, with $w(X_t) = w(x) = -\nabla\Big(\tfrac{1}{2}\big\|x - (1, 0)^\top\big\|^4\Big)$.
      [Figure: original corrector vs. modified corrector with $\alpha = 1$.]
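      The perturbation on this slide has the closed form $w(x) = -2\|x - (1, 0)^\top\|^2 \big(x - (1, 0)^\top\big)$, so the imperfect-score experiment can be sketched by adding it to the quadrature score above (again a sketch of mine, not the authors' code):

        import numpy as np

        def perturbation(x):
            """w(x) = -grad( 0.5 * ||x - (1,0)||^4 ) = -2 ||x - (1,0)||^2 (x - (1,0))."""
            v = np.asarray(x, dtype=float) - np.array([1.0, 0.0])
            return -2.0 * np.dot(v, v) * v

        # Imperfect score: s(x) = smoothed_score(x, sigma) + perturbation(x); pass it to
        # modified_corrector with alpha = 0 (original) and alpha = 1 (modified) to compare.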
  16. 16/19 Conclusion
        ▶ Error analysis of the score function reveals interesting properties.
        ▶ The leading term is the distance to the manifold.
        ▶ The original distribution sits in higher-order terms, making it hard to recover.
        ▶ Learning the manifold is significantly easier than learning the data distribution.
  17. 18/19 [AGH+23] Marloes Arts, Victor Garcia Satorras, Chin-Wei Huang, Daniel Zügner, Marco Federici, Cecilia Clementi, Frank Noé, Robert Pinsler, and Rianne van den Berg. "Two for one: Diffusion models and force fields for coarse-grained molecular dynamics". In: Journal of Chemical Theory and Computation 19.18 (2023), pp. 6151-6159.
      [De 22] Valentin De Bortoli. "Convergence of denoising diffusion models under the manifold hypothesis". In: arXiv preprint arXiv:2208.05314 (2022).
      [HD05] Aapo Hyvärinen and Peter Dayan. "Estimation of non-normalized statistical models by score matching". In: Journal of Machine Learning Research 6.4 (2005).
      [HJA20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models". In: Advances in Neural Information Processing Systems 33 (2020), pp. 6840-6851.
      [Hwa80] Chii-Ruey Hwang. "Laplace's method revisited: weak convergence of probability measures". In: The Annals of Probability (1980), pp. 1177-1182.
      [KAA+22] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. "Elucidating the design space of diffusion-based generative models". In: Advances in Neural Information Processing Systems 35 (2022), pp. 26565-26577.
  18. 19/19 [RŠP+25] Sanjeev Raja, Martin Šípka, Michael Psenka, Tobias Kreiman, Michal Pavelka, and Aditi S. Krishnapriyan. "Action-Minimization Meets Generative Modeling: Efficient Transition Path Sampling with the Onsager-Machlup Functional". In: arXiv preprint arXiv:2504.18506 (2025).
      [SBD+24] Jan Pawel Stanczuk, Georgios Batzolis, Teo Deveney, and Carola-Bibiane Schönlieb. "Diffusion models encode the intrinsic dimension of data manifolds". In: Forty-first International Conference on Machine Learning. 2024.
      [SE19] Yang Song and Stefano Ermon. "Generative modeling by estimating gradients of the data distribution". In: Advances in Neural Information Processing Systems 32 (2019).
      [SME20] Jiaming Song, Chenlin Meng, and Stefano Ermon. "Denoising diffusion implicit models". In: arXiv preprint arXiv:2010.02502 (2020).
      [SSK+21] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. "Score-Based Generative Modeling through Stochastic Differential Equations". In: ICLR. 2021.
      [Vin11] Pascal Vincent. "A connection between score matching and denoising autoencoders". In: Neural Computation 23.7 (2011), pp. 1661-1674.
  19. 19/19 When the Score Is Learned Perfectly: Recovery of $p_{\mathrm{data}}$
      $f^*_\sigma(x) = -\frac{1}{2\sigma^2}\|x - \Phi(u_0, 0)\|^2 + \log p_{\mathrm{data}}(u_0) - \log \hat H(u_0, x)^{1/2} - \frac{d - n}{2}\log 2\pi\sigma^2 + o(1)$
      Plugging this approximation into the stationary distribution:
      $p_\sigma(x) \propto \exp(f^*_\sigma(x)) = p_{\mathrm{data}}(u_0)\, \hat H(u_0, x)^{-1/2} \exp\!\Big(-\frac{1}{2\sigma^2}\|x - \Phi(u_0, 0)\|^2 + o(1)\Big)$
      [Figure: three panels for $\sigma = 1$, $\sigma = 0.7$, $\sigma = 0.4$.]
  20. 19/19 When the Score Is Learned Perfectly: Recovery of $p_{\mathrm{data}}$
      Marginalizing over the normal coordinate $r$:
      $p_\sigma(u) = \int_r p_\sigma(u, r)\,\mathrm{d}r = \int_r p_\sigma(\Phi(u, r))\, \Big|\frac{\partial \Phi}{\partial(u, r)}\Big|\,\mathrm{d}r = \int_r \Big|\frac{\partial \Phi}{\partial(u, r)}\Big|\, p_{\mathrm{data}}(u)\, \hat H(u, \Phi(u, r))^{-1/2} \exp\!\Big(-\frac{1}{2\sigma^2}\|\Phi(u, r) - \Phi(u, 0)\|^2 + o(1)\Big)\,\mathrm{d}r$
      By Laplace's approximation as $\sigma \to 0$ [Hwa80]:
        ▶ the maximum of the exponent is achieved at $r = 0$, and its Hessian there is $I$, so the integral contributes $1 \cdot \exp(0)$ up to constants;
        ▶ the Jacobian cancels the curvature factor: $\big|\frac{\partial \Phi}{\partial(u, r)}\big|_{r=0} = \hat H(u, \Phi(u, 0))^{1/2} = |g(u)|^{1/2} = \frac{\mathrm{d}M}{\mathrm{d}u}$, where $g(u)$ is the metric tensor of $M$ at $u$.
      Hence $p_0(u) \propto 1 \cdot \exp(0) \cdot p_{\mathrm{data}}(u)$: the density converges weakly to $p_{\mathrm{data}}$ as $\sigma \to 0$ (a numerical check of the Laplace step follows this item).
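      The Laplace step can be checked numerically in the single normal coordinate: for a smooth positive h (standing in for the Jacobian and curvature factors), $\int h(r)\, e^{-r^2/(2\sigma^2)}\,\mathrm{d}r \big/ \sqrt{2\pi\sigma^2} \to h(0)$ as $\sigma \to 0$. A short check (the test function h and the sigma grid are my choices):

        import numpy as np

        def laplace_check(h, sigma, r_max=1.0, n_grid=200_001):
            """Riemann-sum estimate of (2 pi sigma^2)^(-1/2) * integral h(r) exp(-r^2/(2 sigma^2)) dr."""
            r = np.linspace(-r_max, r_max, n_grid)
            vals = h(r) * np.exp(-r**2 / (2 * sigma**2))
            return vals.sum() * (r[1] - r[0]) / np.sqrt(2 * np.pi * sigma**2)

        h = lambda r: (1 + r) * np.exp(-r)            # smooth stand-in for the Jacobian-times-curvature factor
        for sigma in [0.5, 0.1, 0.02]:
            print(sigma, laplace_check(h, sigma))     # approaches h(0) = 1 as sigma decreases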
  21. 19/19 Recap:
      $f^*_\sigma(x) = -\frac{1}{2\sigma^2}\|x - \Phi(u_0, 0)\|^2 + \log p_{\mathrm{data}}(u_0) - \log \hat H(u_0, x)^{1/2} - \frac{d - n}{2}\log 2\pi\sigma^2 + o(1)$,
      where the first term is $\omega(1)$ and $\log p_{\mathrm{data}}(u_0) - \log \hat H(u_0, x)^{1/2}$ are the $\Theta(1)$ terms. In the limit the Jacobian $\big|\frac{\partial \Phi}{\partial(u, r)}\big|_{r=0}$ cancels $\hat H(u, \Phi(u, 0))^{-1/2}$, leaving
      $p_0(u) \propto 1 \cdot \exp(0) \cdot p_{\mathrm{data}}(u)$.
      Unless the score is learned with error at most $o(1)$, the limiting distribution can be arbitrary; score estimation at low noise levels is hard [SSK+21; KAA+22].
  22. 19/19 Consider the following SDE for $0 < \alpha < 2$:
      $\mathrm{d}X_t = \sigma^\alpha \nabla f_\sigma(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t$,
      whose stationary distribution becomes $\propto \exp(\sigma^\alpha f_\sigma)$. Suppose we learn the score up to an error of $O(\sigma^\beta)$ for some $\beta > 0$. If $\alpha > -\beta$, then
      $\sigma^\alpha f_\sigma(x) = -\frac{1}{2\sigma^{2-\alpha}}\|x - \Phi(u_0, 0)\|^2 + \sigma^\alpha\Big(\log p_{\mathrm{data}}(u_0) - \log \hat H(u_0, x)^{1/2}\Big) - \sigma^\alpha \frac{d - n}{2}\log 2\pi\sigma^2 + O(\sigma^{\alpha+\beta})$,
      where the first term is $\omega(1)$ and the rest is $o(1)$, hence
      $p_0(u) \propto \Big|\frac{\partial \Phi}{\partial(u, r)}\Big|_{r=0} = |g(u)|^{1/2} = \frac{\mathrm{d}M}{\mathrm{d}u}$.