Masanari Kimura
January 06, 2024

# Equivalence of Geodesics and Importance Weighting from the Perspective of Information Geometry

Presented at IMS-APRM 2024
https://ims-aprm2024.com/

## Transcript

1. ### Equivalence of Geodesics and Importance Weighting from the Perspective of Information Geometry

- Masanari Kimura
- Graduate University for Advanced Studies, SOKENDAI
- Department of Statistical Science, School of Multidisciplinary Sciences
- [email protected]
- January 6, 2024

Masanari Kimura (SOKENDAI) IGIWERM January 6, 2024 1 / 26

- Geometric Objects
  - Brief Introduction of Information Geometry
  - Identification of Statistical Procedures and Geometric Objects
- 3 Equivalence of Geodesics and Importance Weighting
  - Importance Weighted Empirical Risk Minimization (IWERM)
  - Information Geometrically Generalized IWERM
- 4 Conclusion
- 5 References

4. ### Abstract

Preliminary:

- The set of probability distributions forms a Riemannian manifold, called a statistical manifold.
- In information geometry [1, 3], statistical procedures are discussed on statistical manifolds.
- Some statistical procedures can be identified with geometric objects.

Contributions:

- Importance weighting, commonly used in statistics and machine learning, can be identified with the selection of a curve connecting two probability distributions.
- An information-geometric generalization of covariate shift adaptation.
5. ### Overview

Information Geometry

| Statistics | Geometry |
|---|---|
| Set of distributions | Riemannian manifolds |
| Statistical procedures | Geometric objects |
| Importance weighting | Selection of curves |

7. ### A set of probability distributions constructs a Riemannian manifold (M, g)

- $g = (g_{ij})$ is a Riemannian metric.
- A parameter $\theta \in \Theta$ uniquely determines a distribution $p_\theta$ ⇒ $\Theta$ is the coordinate system.
- $\sum_{ij} g_{ij}\,\Delta\theta_i\,\Delta\theta_j$ is the squared infinitesimal distance on $M$.
- To second order in $\Delta\theta$, the KL-divergence is written as

$$
D_{KL}[p_\theta \,\|\, p_{\theta+\Delta\theta}] = \int p_\theta \ln \frac{p_\theta}{p_{\theta+\Delta\theta}}\, d\mu \approx \frac{1}{2} \sum_{ij} g_{ij}\,\Delta\theta_i\,\Delta\theta_j, \tag{1}
$$

where $g_{ij}$ is the Fisher information matrix.

⇒ The Fisher information matrix is a Riemannian metric.
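As a rough numerical illustration of (1), the sketch below compares the KL divergence with the quadratic form for a one-parameter Bernoulli family, whose Fisher information is 1/(θ(1 − θ)). The choice of the Bernoulli family, the values of `theta` and `delta`, and all function names are assumptions made for this example and do not come from the talk.

```python
import math

def kl_bernoulli(theta_a, theta_b):
    """KL divergence D_KL[p_a || p_b] between two Bernoulli distributions."""
    return (theta_a * math.log(theta_a / theta_b)
            + (1.0 - theta_a) * math.log((1.0 - theta_a) / (1.0 - theta_b)))

def fisher_bernoulli(theta):
    """Fisher information of the Bernoulli family at parameter theta."""
    return 1.0 / (theta * (1.0 - theta))

theta, delta = 0.3, 1e-3  # illustrative values (assumption)

kl = kl_bernoulli(theta, theta + delta)
quad = 0.5 * fisher_bernoulli(theta) * delta ** 2

# Relative gap between the divergence and its second-order approximation.
rel_err = abs(kl - quad) / quad
print(kl, quad, rel_err)
```

The gap shrinks as `delta` decreases, which is the sense in which the Fisher information acts as a local metric.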

9. ### Why Information Geometry?

- Some statistical procedures can be identified with geometric objects.
- Geometric identification allows a deeper understanding of statistical procedures,
  - ⇒ improvement of algorithms, and
  - ⇒ geometrically natural generalization of existing methods.
10. ### Examples of Geometric Identification

12. ### Covariate Shift Assumption

- $p_{tr}$: training distribution
- $p_{te}$: test distribution

Goal: obtain a good hypothesis $h \in \mathcal{H}$ for $p_{te}$ using a sample $S_n \sim p_{tr}$.

Definition (Covariate shift [5]). The two distributions $p_{tr}(x, y)$ and $p_{te}(x, y)$ satisfy the covariate shift assumption if the following conditions hold:

$$
p_{tr}(x) \neq p_{te}(x), \qquad p(y|x) = p_{tr}(y|x) = p_{te}(y|x).
$$
13. ### Importance Weighted Empirical Risk Minimization (IWERM)

- ERM is unbiased and consistent under the i.i.d. assumption.
- However, ERM loses these properties under the covariate shift assumption.
- They can be recovered by importance weighting [5]:

$$
\begin{aligned}
\underbrace{E_{p_{tr}(x,y)}\left[\frac{p_{te}(x)}{p_{tr}(x)}\,\ell(h(x), y)\right]}_{\text{expectation of weighted loss under } p_{tr}}
&= \int_{X \times Y} \frac{p_{te}(x)}{p_{tr}(x)}\,\ell(h(x), y) \cdot p_{tr}(x, y)\,dx\,dy \\
&= \int_{X \times Y} \frac{p_{te}(x)}{p_{tr}(x)}\,\ell(h(x), y) \cdot p_{tr}(x)\,p(y|x)\,dx\,dy \\
&= \int_{X \times Y} p_{te}(x)\,\ell(h(x), y) \cdot p(y|x)\,dx\,dy \\
&= \underbrace{E_{p_{te}(x,y)}\left[\ell(h(x), y)\right]}_{\text{expectation of original loss under } p_{te}}.
\end{aligned} \tag{2}
$$
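Identity (2) can be checked exactly on a small discrete example. The sketch below is only an illustration: the three-point input space, the probability tables, the threshold hypothesis `h`, and the 0-1 loss are all assumptions chosen so that the sums can be evaluated by hand.

```python
# Discrete toy problem under covariate shift (all numbers are assumptions).
X = [0, 1, 2]
Y = [0, 1]
p_tr = {0: 0.6, 1: 0.3, 2: 0.1}          # training marginal p_tr(x)
p_te = {0: 0.2, 1: 0.3, 2: 0.5}          # test marginal p_te(x)
p_y1_given_x = {0: 0.1, 1: 0.4, 2: 0.7}  # shared conditional p(y = 1 | x)

def cond(y, x):
    """Shared conditional p(y | x)."""
    return p_y1_given_x[x] if y == 1 else 1.0 - p_y1_given_x[x]

def h(x):
    """A fixed hypothesis: predict 1 for x >= 1."""
    return 1 if x >= 1 else 0

def loss(pred, y):
    """0-1 loss."""
    return float(pred != y)

# Expectation of the original loss under the test distribution.
test_risk = sum(loss(h(x), y) * p_te[x] * cond(y, x) for x in X for y in Y)

# Expectation of the unweighted loss under the training distribution.
plain_train_risk = sum(loss(h(x), y) * p_tr[x] * cond(y, x) for x in X for y in Y)

# Expectation of the importance-weighted loss under the training distribution.
weighted_train_risk = sum(
    (p_te[x] / p_tr[x]) * loss(h(x), y) * p_tr[x] * cond(y, x)
    for x in X for y in Y
)

print(test_risk, plain_train_risk, weighted_train_risk)
```

The weighted training risk matches the test risk, while the unweighted one does not.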
14. ### Variants of IWERM

- IWERM utilizes the weighted loss $w(x)\,\ell(h(x), y)$, with $w(x) = p_{te}(x)/p_{tr}(x)$.
- However, IWERM is numerically unstable, so several variants have been proposed.
  - Adaptive IWERM (AIWERM) [5]: $w(x) = \left(p_{te}(x)/p_{tr}(x)\right)^{\lambda}$, with $\lambda \in [0, 1]$.
  - Relative IWERM (RIWERM) [6]: $w(x) = p_{te}(x) / \{\lambda\,p_{tr}(x) + (1 - \lambda)\,p_{te}(x)\}$, with $\lambda \in [0, 1]$.
- AIWERM and RIWERM both reduce to IWERM when $\lambda = 1$.
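The three weighting schemes can be compared pointwise with a few lines of code. This is a minimal sketch with hypothetical function names. The relative weight is written with the mixture λ·ptr + (1 − λ)·pte, the form used in the later derivation on the α = 3 geodesic, so that λ = 1 recovers IWERM; the density values are arbitrary assumptions.

```python
def iw(p_te, p_tr):
    """Plain importance weight p_te(x) / p_tr(x)."""
    return p_te / p_tr

def aiw(p_te, p_tr, lam):
    """Adaptive weight: the density ratio flattened by the exponent lam."""
    return (p_te / p_tr) ** lam

def riw(p_te, p_tr, lam):
    """Relative weight with the mixture lam * p_tr + (1 - lam) * p_te in the denominator."""
    return p_te / (lam * p_tr + (1.0 - lam) * p_te)

# A point that is rare in training but common at test time (assumed values).
p_tr_x, p_te_x = 0.05, 0.5

full = iw(p_te_x, p_tr_x)
for lam in (0.0, 0.5, 1.0):
    print(lam, aiw(p_te_x, p_tr_x, lam), riw(p_te_x, p_tr_x, lam), full)
```

At λ = 0 both variants give the uniform weight 1 (plain ERM), and intermediate values of λ shrink the large weight that makes IWERM unstable.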
15. ### Geodesics on the Statistical Manifold

Definition (α-geodesic [2]). The α-geodesic connecting two probability distributions $p(x)$ and $q(x)$ is defined as

$$
r_f^{(\alpha,\lambda)}(p(x), q(x)) = C \cdot f_\alpha^{-1}\left\{(1 - \lambda)\,f_\alpha(p(x)) + \lambda\,f_\alpha(q(x))\right\},
$$

where $C$ is a normalization factor, and

$$
f_\alpha(a) =
\begin{cases}
a^{\frac{1-\alpha}{2}} & (\alpha \neq 1), \\
\log a & (\alpha = 1),
\end{cases}
$$

for $\alpha \in \mathbb{R}$ and $\lambda \in [0, 1]$.
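A pointwise sketch of the α-geodesic may help make the definition concrete. It evaluates the curve at a single pair of density values and takes the normalization factor C = 1, which is an assumption of this example; the function names and the numbers are also hypothetical.

```python
import math

def f_alpha(a, alpha):
    """The alpha-representation f_alpha(a)."""
    if alpha == 1:
        return math.log(a)
    return a ** ((1.0 - alpha) / 2.0)

def f_alpha_inv(u, alpha):
    """Inverse of f_alpha."""
    if alpha == 1:
        return math.exp(u)
    return u ** (2.0 / (1.0 - alpha))

def alpha_geodesic(p, q, alpha, lam):
    """Unnormalized (C = 1) alpha-geodesic between density values p and q."""
    mix = (1.0 - lam) * f_alpha(p, alpha) + lam * f_alpha(q, alpha)
    return f_alpha_inv(mix, alpha)

p, q = 0.2, 0.8  # density values at one point x (assumed)

arithmetic = alpha_geodesic(p, q, -1, 0.5)  # alpha = -1: ordinary mixture
geometric = alpha_geodesic(p, q, 1, 0.5)    # alpha = 1: log-linear mixture
harmonic = alpha_geodesic(p, q, 3, 0.5)     # alpha = 3: harmonic mixture
print(arithmetic, geometric, harmonic)
```

For λ = 0 the curve returns `p` and for λ = 1 it returns `q`; α selects which kind of mean interpolates between them.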

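17. ### Information Geometrically Generalized IWERM

For $w(x)$ of AIWERM,

$$
\begin{aligned}
p_{tr}(x)\,w(x) &= p_{tr}(x)\left(\frac{p_{te}(x)}{p_{tr}(x)}\right)^{\lambda}, \\
\log\{p_{tr}(x)\,w(x)\} &= \lambda\left(\log p_{te}(x) - \log p_{tr}(x)\right) + \log p_{tr}(x) \\
&= (1 - \lambda)\log p_{tr}(x) + \lambda \log p_{te}(x), \\
p_{tr}(x)\,w(x) &= \exp\{(1 - \lambda)\log p_{tr}(x) + \lambda \log p_{te}(x)\} = r_f^{(1,\lambda)}(p_{tr}(x), p_{te}(x)).
\end{aligned}
$$

For $w(x)$ of RIWERM,

$$
p_{tr}(x)\,w(x) = \frac{p_{te}(x)\,p_{tr}(x)}{\lambda\,p_{tr}(x) + (1 - \lambda)\,p_{te}(x)} = \frac{1}{\lambda\,\dfrac{1}{p_{te}(x)} + (1 - \lambda)\,\dfrac{1}{p_{tr}(x)}} = r_f^{(3,\lambda)}(p_{tr}(x), p_{te}(x)).
$$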
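The two identities above can be sanity-checked numerically at a single point. In this sketch the density values and λ are arbitrary assumptions, the normalization factor is taken as C = 1, and the α = 1 and α = 3 geodesics are written out directly as the log-linear and harmonic mixtures.

```python
import math

p_tr, p_te, lam = 0.4, 0.1, 0.3  # assumed density values at one x, and lambda

# Weights as they appear in the derivation.
w_aiwerm = (p_te / p_tr) ** lam
w_riwerm = p_te / (lam * p_tr + (1.0 - lam) * p_te)

# alpha = 1 geodesic: log-linear (geometric) mixture of p_tr and p_te.
geo_alpha1 = math.exp((1.0 - lam) * math.log(p_tr) + lam * math.log(p_te))

# alpha = 3 geodesic: harmonic mixture of p_tr and p_te.
geo_alpha3 = 1.0 / ((1.0 - lam) / p_tr + lam / p_te)

gap_aiwerm = abs(p_tr * w_aiwerm - geo_alpha1)
gap_riwerm = abs(p_tr * w_riwerm - geo_alpha3)
print(gap_aiwerm, gap_riwerm)
```

Both gaps are zero up to floating-point rounding, matching the claim that each weighting scheme moves along a particular α-geodesic from ptr toward pte.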
18. ### Information Geometrically Generalized IWERM [4]

For $\lambda \in [0, 1]$ and $\alpha \in \mathbb{R}$, AIWERM and RIWERM are generalized as

$$
\hat{h} = \operatorname*{argmin}_{h \in \mathcal{H}} \int_{X \times Y} w^{(\alpha,\lambda)}(x)\,\ell(h(x), y)\,p_{tr}(x, y)\,dx\,dy, \tag{3}
$$

where

$$
w^{(\alpha,\lambda)}(x) = \frac{r_f^{(\alpha,\lambda)}(p_{tr}(x), p_{te}(x))}{p_{tr}(x)}. \tag{4}
$$

This recovers IWERM when $\lambda = 1$, AIWERM when $\alpha = 1$, and RIWERM when $\alpha = 3$.
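In practice the integral in (3) is replaced by a weighted average over the training sample. The sketch below shows that empirical step for one concrete setting: a linear hypothesis class with squared loss, solved in closed form. The hypothesis class, the loss, the toy data, and the weight values are all assumptions made for illustration; in an actual application the weights would be computed from (4).

```python
def weighted_least_squares(xs, ys, ws):
    """Minimize sum_i w_i * (a * x_i + b - y_i)^2 over (a, b) in closed form."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw   # weighted mean of x
    my = sum(w * y for w, y in zip(ws, ys)) / sw   # weighted mean of y
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return a, b

# Toy training sample: the true relation is quadratic, so the linear model
# is misspecified and the fit depends on where the weights put the mass.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x ** 2 for x in xs]

uniform = [1.0] * len(xs)               # lambda = 0: plain ERM
shifted = [0.1, 0.1, 0.1, 1.0, 1.0]     # hypothetical weights favoring large x

a_uniform, b_uniform = weighted_least_squares(xs, ys, uniform)
a_shifted, b_shifted = weighted_least_squares(xs, ys, shifted)
print((a_uniform, b_uniform), (a_shifted, b_shifted))
```

With uniform weights the fit is ordinary least squares; with weights concentrated on large inputs the fitted slope increases, tracking the region that the test distribution is assumed to emphasize.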
19. ### Density Ratio Representation of IGIWERM

The generalized importance weight can also be written in terms of the density ratio:

$$
w^{(\alpha,\lambda)}(x) = \frac{\left\{(1 - \lambda)\,p_{tr}(x)^{\frac{1-\alpha}{2}} + \lambda\,p_{te}(x)^{\frac{1-\alpha}{2}}\right\}^{\frac{2}{1-\alpha}}}{p_{tr}(x)} = \left\{1 - \lambda + \lambda\left(\frac{p_{te}(x)}{p_{tr}(x)}\right)^{\frac{1-\alpha}{2}}\right\}^{\frac{2}{1-\alpha}}, \quad (\alpha \neq 1).
$$

Techniques for direct density ratio estimation can therefore be used.
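Because the weight depends on the two densities only through their ratio, it can be computed from an estimated ratio alone. The following sketch implements the formula above with C = 1; the function name is hypothetical, and the α = 1 branch uses the AIWERM form (ratio to the power λ), which is the limiting case rather than part of the displayed formula.

```python
def generalized_weight(ratio, alpha, lam):
    """Generalized weight computed from the density ratio p_te(x) / p_tr(x)."""
    if alpha == 1:
        # The displayed formula excludes alpha = 1; use the AIWERM form there.
        return ratio ** lam
    e = (1.0 - alpha) / 2.0
    return (1.0 - lam + lam * ratio ** e) ** (1.0 / e)

ratio, lam = 4.0, 0.5  # assumed density ratio at one point, and lambda

w_alpha1 = generalized_weight(ratio, 1, lam)        # AIWERM weight
w_alpha3 = generalized_weight(ratio, 3, lam)        # RIWERM weight
w_near1 = generalized_weight(ratio, 1.0001, lam)    # close to the alpha = 1 case
w_lam1 = generalized_weight(ratio, -2.5, 1.0)       # lambda = 1: plain IWERM
print(w_alpha1, w_alpha3, w_near1, w_lam1)
```

Here `ratio` would come from a density ratio estimator in practice; the pair (α, λ) then selects the weighting scheme without any further access to the individual densities.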
20. ### Numerical Experiments

- Covariate shift is induced for each dataset.
- The parameter pair (α, λ) is optimized by Bayesian optimization.

Table: Mean misclassification rates averaged over 10 trials on LIBSVM benchmark datasets.

| Dataset | #features | #data | unweighted | IWERM | AIWERM (optimal) | RIWERM (optimal) | ours |
|---|---|---|---|---|---|---|---|
| australian | 14 | 690 | 33.46 (±23.65) | 22.13 (±3.37) | 21.98 (±3.36) | 21.73 (±3.82) | 18.85 (±3.99) |
| breast-cancer | 10 | 683 | 38.28 (±10.98) | 41.23 (±15.39) | 36.41 (±9.68) | 36.13 (±10.81) | 31.65 (±8.49) |
| heart | 13 | 270 | 45.17 (±6.98) | 39.94 (±8.55) | 39.76 (±8.49) | 39.76 (±8.92) | 35.37 (±6.84) |
| diabetes | 8 | 768 | 33.19 (±5.69) | 37.22 (±6.63) | 33.11 (±6.45) | 33.38 (±5.74) | 32.83 (±5.62) |
| madelon | 500 | 2,000 | 47.78 (±1.53) | 47.28 (±2.20) | 47.10 (±2.13) | 47.12 (±1.65) | 46.56 (±2.12) |
21. ### Figure: The history of the Bayesian optimization.
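22. ### Learning Guarantee and Role of Parameters

Generalization Bound for IGIWERM. The gap between the expected loss $R(h)$ (with respect to the test distribution) and the empirical risk $L(h; \lambda, \alpha)$ is bounded as

$$
\begin{aligned}
|R(h) - L(h; \lambda, \alpha)| \le{}& E_{p_{tr}}\left[\left|\frac{p_{te}(x)}{p_{tr}(x)} - w^{(\alpha,\lambda)}(x)\right| \ell(h(x), y(x))\right] \\
&+ 2^{5/4} \max\left\{\sqrt{E_{p_{tr}}\left[(w^{(\alpha,\lambda)}(x))^2\,\ell^2(h(x), y(x))\right]},\ \sqrt{\hat{E}_{p_{tr}}\left[(w^{(\alpha,\lambda)}(x))^2\,\ell^2(h(x), y(x))\right]}\right\} \\
&\times \left(\frac{p \log \frac{2 n_{tr}}{p} + \log \frac{4}{\delta}}{n_{tr}}\right)^{\frac{3}{8}}.
\end{aligned} \tag{5}
$$

In the above inequality, $p$ is the pseudo-dimension of the function class.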
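23. ### The generalization gap for the IGIWERM under the covariate shift assumption

The generalization gap for the IGIWERM under the covariate shift assumption is written as

$$
C_1\left(\left|\frac{p_{te}(x)}{p_{tr}(x)} - w^{(\alpha,\lambda)}(x)\right|\right) + C_2\left(\frac{1}{n_{tr}}\right). \tag{6}
$$

- $C_1(\cdot)$: depends on $\frac{p_{te}(x)}{p_{tr}(x)} - w^{(\alpha,\lambda)}(x)$, and
- $C_2(\cdot)$: depends on $\frac{1}{n_{tr}}$.

⇒ When the training sample size is large, the effect of the second term vanishes asymptotically, so the first term dominates and $\lambda$ should be close to 1, where $w^{(\alpha,\lambda)}(x)$ reduces to $p_{te}(x)/p_{tr}(x)$.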
24. ### Conclusion

- Some statistical procedures can be identified with geometric objects.
- Geometric identification leads to geometrically natural generalizations of existing methods.
- Importance weighting can be identified with the selection of geodesics.
25. ### References I

- [1] S. Amari and H. Nagaoka. Differential geometry of smooth families of probability distributions. METR, 82:49–98, 1982.
- [2] Shun-ichi Amari. Information Geometry and Its Applications, volume 194. Springer, 2016.
- [3] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry, volume 191. American Mathematical Soc., 2000.
- [4] Masanari Kimura and Hideitsu Hino. Information geometrically generalized covariate shift adaptation. Neural Computation, 34(9):1944–1977, 2022.
26. ### References II

- [5] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
- [6] Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Masashi Sugiyama. Relative density-ratio estimation for robust distribution comparison. Neural Computation, 25(5):1324–1370, 2013.