Slide 1

Slide 1 text

Equivalence of Geodesics and Importance Weighting from the Perspective of Information Geometry

Masanari Kimura
Graduate University for Advanced Studies, SOKENDAI
Department of Statistical Science, School of Multidisciplinary Sciences
[email protected]
January 6, 2024

Masanari Kimura (SOKENDAI) IGIWERM January 6, 2024 1 / 26

Slide 2

Slide 2 text

Table of Contents

1 Introduction
2 Identify Statistical Concepts with Geometric Objects
    Brief Introduction of Information Geometry
    Identification of Statistical Procedures and Geometric Objects
3 Equivalence of Geodesics and Importance Weighting
    Importance Weighted Empirical Risk Minimization (IWERM)
    Information Geometrically Generalized IWERM
4 Conclusion
5 References

Slide 4

Slide 4 text

Abstract

Preliminary:
The set of probability distributions constructs a Riemannian manifold, called a statistical manifold.
In information geometry [1, 3], statistical procedures are discussed on statistical manifolds.
Some statistical procedures can be identified with geometric objects.

Contributions:
Importance weighting, commonly used in statistics and machine learning, can be identified with the selection of a curve connecting two probability distributions.
An information-geometric generalization of covariate shift adaptation.

Slide 5

Slide 5 text

Overview

Information Geometry

| Statistics | Geometry |
|---|---|
| Set of distributions | Riemannian manifolds |
| Statistical procedures | Geometric objects |
| Importance weighting | Selection of curves |

Slide 7

Slide 7 text

A set of probability distributions constructs a Riemannian manifold $(M, g)$, where $g = (g_{ij})$ is a Riemannian metric.
A parameter $\theta \in \Theta$ uniquely determines a distribution $p_\theta$ ⇒ $\Theta$ is the coordinate system.
$\sum_{ij} g_{ij} \Delta\theta_i \Delta\theta_j$ is the squared infinitesimal distance on $M$.
The KL-divergence is written, to second order in $\Delta\theta$, as

$$
D_{\mathrm{KL}}[p_\theta \,\|\, p_{\theta+\Delta\theta}] = \int p_\theta \ln \frac{p_\theta}{p_{\theta+\Delta\theta}} \, d\mu \approx \frac{1}{2} \sum_{ij} g_{ij} \Delta\theta_i \Delta\theta_j, \tag{1}
$$

where $g_{ij}$ is the Fisher information matrix.
⇒ The Fisher information matrix is a Riemannian metric.
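As a sanity check of the relation in Eq. (1), the following sketch compares the exact KL divergence with the quadratic form for a one-parameter family. The Bernoulli family and the helper names `kl_bernoulli`, `fisher_bernoulli`, and `quadratic_ratio` are illustrative assumptions, not part of the slides. For this family the Fisher information in the mean parameter is 1/(θ(1 − θ)), and the ratio of the two quantities should approach 1 as Δθ shrinks.

```python
import math

def kl_bernoulli(theta, theta2):
    """KL divergence D_KL[p_theta || p_theta2] between two Bernoulli distributions."""
    return (theta * math.log(theta / theta2)
            + (1.0 - theta) * math.log((1.0 - theta) / (1.0 - theta2)))

def fisher_bernoulli(theta):
    """Fisher information of the Bernoulli family in the mean parameter theta."""
    return 1.0 / (theta * (1.0 - theta))

def quadratic_ratio(theta, delta):
    """Ratio of the exact KL divergence to (1/2) * g(theta) * delta^2."""
    exact = kl_bernoulli(theta, theta + delta)
    quadratic = 0.5 * fisher_bernoulli(theta) * delta ** 2
    return exact / quadratic

# The ratio tends to 1 as the parameter perturbation becomes small.
for delta in (1e-1, 1e-2, 1e-3):
    print(f"delta={delta:g}  ratio={quadratic_ratio(0.3, delta):.4f}")
```

The agreement is only local: for larger perturbations the higher-order terms of the KL divergence become visible in the ratio.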

Slide 8

Slide 8 text

Statistical Manifolds

Slide 9

Slide 9 text

Why Information Geometry?

Some statistical procedures can be identified with geometric objects.
Geometric identification allows a deep understanding of statistical procedures,
⇒ improvement of algorithms, and
⇒ geometrically natural generalization of existing methods.

Slide 10

Slide 10 text

Examples of Geometric Identification

Slide 12

Slide 12 text

Covariate Shift Assumption

$p_{tr}$: training distribution
$p_{te}$: test distribution
Goal: obtain a good hypothesis $h \in \mathcal{H}$ for $p_{te}$ using a sample $S_n \sim p_{tr}$.

Definition (Covariate shift [5]). The two distributions $p_{tr}(x, y)$ and $p_{te}(x, y)$ satisfy the covariate shift assumption if the following conditions hold:

$$
p_{tr}(x) \neq p_{te}(x), \qquad p(y \mid x) = p_{tr}(y \mid x) = p_{te}(y \mid x).
$$

Slide 13

Slide 13 text

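Importance Weighted Empirical Risk Minimization (IWERM)

ERM is unbiased and consistent under the i.i.d. assumption.
However, ERM fails under the covariate shift assumption.
These properties can be recovered by importance weighting [5]:

$$
\begin{aligned}
\underbrace{\mathbb{E}_{p_{tr}(x,y)}\left[\frac{p_{te}(x)}{p_{tr}(x)} \ell(h(x), y)\right]}_{\text{expectation of weighted loss under } p_{tr}}
&= \int_{\mathcal{X} \times \mathcal{Y}} \frac{p_{te}(x)}{p_{tr}(x)} \ell(h(x), y) \cdot p_{tr}(x, y) \, dx \, dy \\
&= \int_{\mathcal{X} \times \mathcal{Y}} \frac{p_{te}(x)}{p_{tr}(x)} \ell(h(x), y) \cdot p_{tr}(x) p(y \mid x) \, dx \, dy \\
&= \int_{\mathcal{X} \times \mathcal{Y}} p_{te}(x) \ell(h(x), y) \cdot p(y \mid x) \, dx \, dy \\
&= \underbrace{\mathbb{E}_{p_{te}(x,y)}\left[\ell(h(x), y)\right]}_{\text{expectation of original loss under } p_{te}}.
\end{aligned}
\tag{2}
$$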
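The identity in Eq. (2) can be checked exactly on a small discrete problem, where the integrals become finite sums. The sketch below is only an illustration: the three-point input space, the probability tables `P_TR`, `P_TE`, `P_Y1`, the fixed classifier `hypothesis`, and the 0-1 loss are all assumed for the example and do not come from the slides.

```python
# Toy covariate-shift problem with x in {0, 1, 2} and y in {0, 1}.
# All numbers below are illustrative assumptions.
P_TR = [0.6, 0.3, 0.1]    # training marginal p_tr(x)
P_TE = [0.2, 0.3, 0.5]    # test marginal p_te(x)
P_Y1 = [0.1, 0.4, 0.8]    # shared conditional p(y = 1 | x)

def hypothesis(x):
    """A fixed, hypothetical classifier h(x)."""
    return 1 if x >= 2 else 0

def zero_one_loss(prediction, y):
    return float(prediction != y)

def expected_loss(p_x, weight):
    """Sum of weight(x) * loss(h(x), y) * p_x(x) * p(y | x) over all (x, y)."""
    total = 0.0
    for x in range(3):
        for y in (0, 1):
            p_y = P_Y1[x] if y == 1 else 1.0 - P_Y1[x]
            total += weight(x) * zero_one_loss(hypothesis(x), y) * p_x[x] * p_y
    return total

def importance(x):
    """Importance weight p_te(x) / p_tr(x)."""
    return P_TE[x] / P_TR[x]

def unit(x):
    """Constant weight 1, i.e. no reweighting."""
    return 1.0

weighted_train = expected_loss(P_TR, importance)  # left-hand side of Eq. (2)
plain_train = expected_loss(P_TR, unit)           # unweighted risk under p_tr
test_risk = expected_loss(P_TE, unit)             # right-hand side of Eq. (2)
print(weighted_train, plain_train, test_risk)
```

In this toy setting the weighted training risk coincides with the test risk, while the unweighted training risk does not.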

Slide 14

Slide 14 text

Variants of IWERM

IWERM utilizes the weighted loss $w(x)\ell(h(x), y)$, with $w(x) = p_{te}(x)/p_{tr}(x)$.
However, IWERM is numerically unstable, and several variants have been proposed.
Adaptive IWERM (AIWERM) [5]: $w(x) = \left(p_{te}(x)/p_{tr}(x)\right)^{\lambda}$, with $\lambda \in [0, 1]$.
Relative IWERM (RIWERM) [6]: $w(x) = p_{te}(x) / \{\lambda p_{tr}(x) + (1 - \lambda) p_{te}(x)\}$, with $\lambda \in [0, 1]$ (parameterized here so that $\lambda = 1$ recovers IWERM).
AIWERM and RIWERM are both equal to IWERM with $\lambda = 1$.

Slide 15

Slide 15 text

Geodesics on the Statistical Manifold

Definition (α-geodesic [2]). The α-geodesic connecting two probability distributions $p(x)$ and $q(x)$ is defined as

$$
r_f^{(\alpha,\lambda)}(p(x), q(x)) = C \cdot f_\alpha^{-1}\left\{(1 - \lambda) f_\alpha(p(x)) + \lambda f_\alpha(q(x))\right\},
$$

where $C$ is a normalization factor, and

$$
f_\alpha(a) =
\begin{cases}
a^{\frac{1-\alpha}{2}} & (\alpha \neq 1), \\
\log a & (\alpha = 1),
\end{cases}
$$

for $\alpha \in \mathbb{R}$ and $\lambda \in [0, 1]$.
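The definition above translates directly into code for distributions on a finite set. The following is a minimal sketch, assuming strictly positive probability vectors given as Python lists; the function names `f_alpha`, `f_alpha_inv`, and `alpha_geodesic` are hypothetical, and the normalization factor C is obtained by dividing by the sum.

```python
import math

def f_alpha(a, alpha):
    """The alpha-representation f_alpha(a) of a positive number a."""
    return math.log(a) if alpha == 1 else a ** ((1 - alpha) / 2)

def f_alpha_inv(b, alpha):
    """Inverse of f_alpha."""
    return math.exp(b) if alpha == 1 else b ** (2 / (1 - alpha))

def alpha_geodesic(p, q, alpha, lam):
    """Point at position lam on the alpha-geodesic from p (lam = 0) to q (lam = 1)."""
    unnormalized = [
        f_alpha_inv((1 - lam) * f_alpha(pi, alpha) + lam * f_alpha(qi, alpha), alpha)
        for pi, qi in zip(p, q)
    ]
    total = sum(unnormalized)            # 1 / C in the definition
    return [u / total for u in unnormalized]

p = [0.5, 0.3, 0.2]
q = [0.1, 0.2, 0.7]
for alpha in (-1, 1, 3):
    print(alpha, alpha_geodesic(p, q, alpha, 0.5))
```

With α = −1 the curve is the ordinary mixture (1 − λ)p + λq, with α = 1 it is a normalized geometric mean, and with α = 3 a normalized harmonic mean.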

Slide 17

Slide 17 text

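Information Geometrically Generalized IWERM

For $w(x)$ of AIWERM,

$$
\begin{aligned}
p_{tr}(x) w(x) &= p_{tr}(x) \left(\frac{p_{te}(x)}{p_{tr}(x)}\right)^{\lambda}, \\
\log \{p_{tr}(x) w(x)\} &= \lambda \left(\log p_{te}(x) - \log p_{tr}(x)\right) + \log p_{tr}(x) \\
&= (1 - \lambda) \log p_{tr}(x) + \lambda \log p_{te}(x), \\
p_{tr}(x) w(x) &= \exp\{(1 - \lambda) \log p_{tr}(x) + \lambda \log p_{te}(x)\} = r_f^{(1,\lambda)}(p_{tr}(x), p_{te}(x)).
\end{aligned}
$$

For $w(x)$ of RIWERM,

$$
p_{tr}(x) w(x) = \frac{p_{te}(x) p_{tr}(x)}{\lambda p_{tr}(x) + (1 - \lambda) p_{te}(x)} = \frac{1}{\lambda \frac{1}{p_{te}(x)} + (1 - \lambda) \frac{1}{p_{tr}(x)}} = r_f^{(3,\lambda)}(p_{tr}(x), p_{te}(x)).
$$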

Slide 18

Slide 18 text

Information Geometrically Generalized IWERM [4]

For $\lambda \in [0, 1]$ and $\alpha \in \mathbb{R}$, AIWERM and RIWERM are generalized as

$$
\hat{h} = \operatorname*{argmin}_{h \in \mathcal{H}} \int_{\mathcal{X} \times \mathcal{Y}} w^{(\alpha,\lambda)}(x) \ell(h(x), y) p_{tr}(x, y) \, dx \, dy, \tag{3}
$$

where

$$
w^{(\alpha,\lambda)}(x) = \frac{r_f^{(\alpha,\lambda)}(p_{tr}(x), p_{te}(x))}{p_{tr}(x)}. \tag{4}
$$

Special cases: IWERM with $\lambda = 1$, AIWERM with $\alpha = 1$, and RIWERM with $\alpha = 3$.
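To make Eqs. (3) and (4) concrete, the sketch below minimizes the weighted risk over a tiny hypothesis class on a discrete toy problem. Everything specific here is an assumption made for illustration: the probability tables, the threshold classifiers h_t(x) = 1[x ≥ t], the 0-1 loss, and the choice C = 1 for the normalization factor (a constant rescaling of the weights would not change the minimizer). The population risk is computed exactly rather than from a sample.

```python
import math

# Toy problem (all numbers are illustrative assumptions): x in {0, 1, 2}, y in {0, 1}.
P_TR = [0.1, 0.7, 0.2]        # training marginal p_tr(x)
P_TE = [0.6, 0.1, 0.3]        # test marginal p_te(x)
P_Y1 = [0.9, 0.2, 0.9]        # shared conditional p(y = 1 | x)
THRESHOLDS = [0, 1, 2, 3]     # hypothesis class: h_t(x) = 1 if x >= t else 0

def f_alpha(a, alpha):
    return math.log(a) if alpha == 1 else a ** ((1 - alpha) / 2)

def f_alpha_inv(b, alpha):
    return math.exp(b) if alpha == 1 else b ** (2 / (1 - alpha))

def weight(x, alpha, lam):
    """w^(alpha, lambda)(x) from Eq. (4), taking the normalization factor C = 1."""
    mixed = (1 - lam) * f_alpha(P_TR[x], alpha) + lam * f_alpha(P_TE[x], alpha)
    return f_alpha_inv(mixed, alpha) / P_TR[x]

def weighted_risk(t, alpha, lam):
    """Weighted expected 0-1 loss of h_t under p_tr, i.e. the objective in Eq. (3)."""
    total = 0.0
    for x in range(3):
        prediction = 1 if x >= t else 0
        p_error = 1 - P_Y1[x] if prediction == 1 else P_Y1[x]  # P(y != h_t(x) | x)
        total += weight(x, alpha, lam) * p_error * P_TR[x]
    return total

def igiwerm_threshold(alpha, lam):
    """Threshold t whose classifier minimizes the weighted risk."""
    return min(THRESHOLDS, key=lambda t: weighted_risk(t, alpha, lam))

for lam in (0.0, 0.5, 1.0):
    print(lam, [igiwerm_threshold(alpha, lam) for alpha in (-1, 1, 3)])
```

Because the threshold class cannot represent the non-monotone conditional used here, the minimizer depends on the weighting: at λ = 0 the weights are 1 and the training distribution decides, while at λ = 1 the weights equal p_te/p_tr and the selected threshold is the one with the smallest test risk.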

Slide 19

Slide 19 text

Density Ratio Representation of IGIWERM

The generalized importance weighting is also represented in terms of the density ratio:

$$
w^{(\alpha,\lambda)}(x) = \frac{\left\{(1 - \lambda) p_{tr}(x)^{\frac{1-\alpha}{2}} + \lambda p_{te}(x)^{\frac{1-\alpha}{2}}\right\}^{\frac{2}{1-\alpha}}}{p_{tr}(x)} = \left\{1 - \lambda + \lambda \left(\frac{p_{te}(x)}{p_{tr}(x)}\right)^{\frac{1-\alpha}{2}}\right\}^{\frac{2}{1-\alpha}}, \quad (\alpha \neq 1).
$$

We can utilize the techniques of direct density ratio estimation.
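Since the weight depends on x only through the density ratio, it can be written as a function of that ratio alone. The sketch below is an illustration under stated assumptions: the function name `generalized_weight` is hypothetical, the density ratio is taken as already known or estimated, and the case α = 1 is handled by its limit, the AIWERM weight. The RIWERM weight is written in the form p_te/(λ p_tr + (1 − λ) p_te) obtained in the α = 3 derivation.

```python
def generalized_weight(ratio, alpha, lam):
    """w^(alpha, lambda) as a function of the density ratio p_te(x) / p_tr(x)."""
    if alpha == 1:
        return ratio ** lam              # limiting case alpha -> 1 (AIWERM)
    s = (1 - alpha) / 2
    return (1 - lam + lam * ratio ** s) ** (1 / s)

ratio, lam = 2.0, 0.3                    # arbitrary illustrative values
aiwerm = ratio ** lam                    # (p_te / p_tr)^lambda
riwerm = ratio / (lam + (1 - lam) * ratio)   # p_te / (lam * p_tr + (1 - lam) * p_te)

print(generalized_weight(ratio, 1, lam), aiwerm)        # alpha = 1
print(generalized_weight(ratio, 3, lam), riwerm)        # alpha = 3
print(generalized_weight(ratio, 1 + 1e-6, lam))         # close to the alpha = 1 value
print(generalized_weight(ratio, -1, 1.0), ratio)        # lambda = 1 gives IWERM
```

In practice `ratio` would come from a direct density ratio estimator rather than from separately estimated densities.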

Slide 20

Slide 20 text

Numerical Experiments

The covariate shift is induced for each dataset.
The parameter pair $(\alpha, \lambda)$ is optimized by Bayesian optimization.

Table: Mean misclassification rates (%) averaged over 10 trials on LIBSVM benchmark datasets.

| Dataset | #features | #data | unweighted | IWERM | AIWERM (optimal) | RIWERM (optimal) | ours |
|---|---|---|---|---|---|---|---|
| australian | 14 | 690 | 33.46 (±23.65) | 22.13 (±3.37) | 21.98 (±3.36) | 21.73 (±3.82) | 18.85 (±3.99) |
| breast-cancer | 10 | 683 | 38.28 (±10.98) | 41.23 (±15.39) | 36.41 (±9.68) | 36.13 (±10.81) | 31.65 (±8.49) |
| heart | 13 | 270 | 45.17 (±6.98) | 39.94 (±8.55) | 39.76 (±8.49) | 39.76 (±8.92) | 35.37 (±6.84) |
| diabetes | 8 | 768 | 33.19 (±5.69) | 37.22 (±6.63) | 33.11 (±6.45) | 33.38 (±5.74) | 32.83 (±5.62) |
| madelon | 500 | 2,000 | 47.78 (±1.53) | 47.28 (±2.20) | 47.10 (±2.13) | 47.12 (±1.65) | 46.56 (±2.12) |

Slide 21

Slide 21 text

Figure: The history of the Bayesian optimization.

Slide 22

Slide 22 text

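Learning Guarantee and Role of Parameters

Generalization Bound for IGIWERM. The gap between the expected loss $R(h)$ (with respect to the test distribution) and the empirical risk $L(h; \lambda, \alpha)$ is bounded, with probability at least $1 - \delta$, as

$$
\begin{aligned}
|R(h) - L(h; \lambda, \alpha)| \le{}& \mathbb{E}_{p_{tr}}\left[\left|\frac{p_{te}(x)}{p_{tr}(x)} - w^{(\alpha,\lambda)}(x)\right| \ell(h(x), y(x))\right] \\
&+ 2^{5/4} \max\left\{\sqrt{\mathbb{E}_{p_{tr}}\left[(w^{(\alpha,\lambda)}(x))^2 \ell^2(h(x), y(x))\right]}, \sqrt{\mathbb{E}_{\hat{p}_{tr}}\left[(w^{(\alpha,\lambda)}(x))^2 \ell^2(h(x), y(x))\right]}\right\} \\
&\times \left(\frac{p \log \frac{2 n_{tr}}{p} + \log \frac{4}{\delta}}{n_{tr}}\right)^{\frac{3}{8}}.
\end{aligned}
\tag{5}
$$

In the above inequality, $p$ is the pseudo-dimension of the function class and $n_{tr}$ is the training sample size.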

Slide 23

Slide 23 text

The generalization gap for IGIWERM under the covariate shift assumption is written as

$$
C_1\left(\frac{p_{te}(x)}{p_{tr}(x)} - w^{(\alpha,\lambda)}(x)\right) + C_2\left(\frac{1}{n_{tr}}\right). \tag{6}
$$

$C_1(\cdot)$: depends on $\frac{p_{te}(x)}{p_{tr}(x)} - w^{(\alpha,\lambda)}(x)$, and
$C_2(\cdot)$: depends on $\frac{1}{n_{tr}}$.
⇒ When the training sample size is large, the effect of the second term vanishes asymptotically, and $\lambda$ should be close to 1, so that $w^{(\alpha,\lambda)}(x)$ approaches $p_{te}(x)/p_{tr}(x)$ and the first term becomes small.

Slide 24

Slide 24 text

Conclusion

Some statistical procedures can be identified with geometric objects.
Geometric identification leads to geometrically natural generalizations of existing methods.
Importance weighting can be identified with the selection of geodesics.

Slide 25

Slide 25 text

References I

[1] Shun-ichi Amari and Hiroshi Nagaoka. Differential geometry of smooth families of probability distributions. METR, 82:49–98, 1982.
[2] Shun-ichi Amari. Information Geometry and Its Applications, volume 194. Springer, 2016.
[3] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry, volume 191. American Mathematical Society, 2000.
[4] Masanari Kimura and Hideitsu Hino. Information geometrically generalized covariate shift adaptation. Neural Computation, 34(9):1944–1977, 2022.

Slide 26

Slide 26 text

References II

[5] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
[6] Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Masashi Sugiyama. Relative density-ratio estimation for robust distribution comparison. Neural Computation, 25(5):1324–1370, 2013.