Information Geometry Masanari Kimura Graduate University for Advanced Studies, SOKENDAI Department of Statistical Science, School of Multidisciplinary Sciences [email protected] January 6, 2024 Masanari Kimura (SOKENDAI) IGIWERM January 6, 2024 1 / 26
manifold. It is called a statistical manifold.
In information geometry [1, 3], statistical procedures are discussed on statistical manifolds, and some statistical procedures can be identified with geometric objects.
Contributions: (i) importance weighting, commonly used in statistics and machine learning, can be identified with the selection of a curve connecting two probability distributions; (ii) an information-geometric generalization of covariate shift adaptation.
g). $g = (g_{ij})$ is a Riemannian metric.
A parameter $\theta \in \Theta$ uniquely determines a distribution $p_\theta$ ⇒ $\Theta$ is the coordinate system.
$\sum_{ij} g_{ij}\Delta\theta^i\Delta\theta^j$ is the squared infinitesimal distance on $M$.
The KL-divergence is written, to second order in $\Delta\theta$, as
\[
D_{KL}[p_\theta \| p_{\theta+\Delta\theta}] = \int p_\theta \ln\frac{p_\theta}{p_{\theta+\Delta\theta}}\, d\mu = \frac{1}{2}\sum_{ij} g_{ij}\Delta\theta^i\Delta\theta^j, \tag{1}
\]
where $g_{ij}$ is the Fisher information matrix.
⇒ The Fisher information matrix is a Riemannian metric.
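The second-order relation between the KL-divergence and the Fisher information can be checked numerically. The sketch below uses a one-parameter Bernoulli family as an illustrative assumption (it is not taken from the slides); the function names are hypothetical, and the Fisher information $1/(\theta(1-\theta))$ is the standard closed form for that family.

```python
import math

def kl_bernoulli(theta, delta):
    """KL divergence D_KL[p_theta || p_{theta+delta}] between two Bernoulli distributions."""
    q = theta + delta
    return theta * math.log(theta / q) + (1 - theta) * math.log((1 - theta) / (1 - q))

def fisher_bernoulli(theta):
    """Fisher information of the Bernoulli family in the mean parameter theta."""
    return 1.0 / (theta * (1.0 - theta))

def quadratic_form(theta, delta):
    """Right-hand side of Eq. (1) for a single parameter: (1/2) * g * delta^2."""
    return 0.5 * fisher_bernoulli(theta) * delta ** 2

theta = 0.3  # illustrative parameter value
for delta in (1e-1, 1e-2, 1e-3):
    # The two columns should agree more closely as delta shrinks.
    print(delta, kl_bernoulli(theta, delta), quadratic_form(theta, delta))
```

The agreement improves as the step shrinks, which is what one would expect if the discrepancy is of third order in $\Delta\theta$.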
distribution
Goal: obtain a good hypothesis $h \in \mathcal{H}$ for $p_{te}$ using a sample $S_n \sim p_{tr}$.
Definition (Covariate shift [5]). Two distributions $p_{tr}(x, y)$ and $p_{te}(x, y)$ satisfy the covariate shift assumption if the following conditions hold:
\[
p_{tr}(x) \neq p_{te}(x), \qquad p(y|x) = p_{tr}(y|x) = p_{te}(y|x).
\]
consistency under the i.i.d. assumption.
However, ERM fails under the covariate shift assumption.
Good properties can be recovered by importance weighting [5]:
\[
\begin{aligned}
\underbrace{E_{p_{tr}(x,y)}\left[\frac{p_{te}(x)}{p_{tr}(x)}\,\ell(h(x), y)\right]}_{\text{expectation of weighted loss under } p_{tr}}
&= \int_{X \times Y} \frac{p_{te}(x)}{p_{tr}(x)}\,\ell(h(x), y)\, p_{tr}(x, y)\, dx\, dy \\
&= \int_{X \times Y} \frac{p_{te}(x)}{p_{tr}(x)}\,\ell(h(x), y)\, p_{tr}(x)\, p(y|x)\, dx\, dy \\
&= \int_{X \times Y} p_{te}(x)\,\ell(h(x), y)\, p(y|x)\, dx\, dy \\
&= \underbrace{E_{p_{te}(x,y)}\left[\ell(h(x), y)\right]}_{\text{expectation of original loss under } p_{te}}.
\end{aligned} \tag{2}
\]
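The identity in Eq. (2) can be verified exactly on a small discrete problem. The following is a minimal sketch under assumptions that are not in the slides: a hypothetical input space {0, 1, 2}, binary labels, a fixed classifier `h`, the 0-1 loss, and made-up probability tables. Integrals become finite sums.

```python
# Hypothetical toy problem; every number below is illustrative.
p_tr = [0.6, 0.3, 0.1]          # training marginal p_tr(x) for x = 0, 1, 2
p_te = [0.1, 0.3, 0.6]          # test marginal p_te(x)
p_y1_given_x = [0.2, 0.5, 0.9]  # shared conditional p(y = 1 | x)

def h(x):
    """A fixed hypothesis: predict 1 for x >= 1."""
    return 1 if x >= 1 else 0

def loss(pred, y):
    """0-1 loss."""
    return float(pred != y)

def cond(y, x):
    """p(y | x), identical under training and test (covariate shift)."""
    return p_y1_given_x[x] if y == 1 else 1.0 - p_y1_given_x[x]

def weighted_risk_under_train():
    """Left-hand side of Eq. (2): importance-weighted loss under p_tr."""
    return sum((p_te[x] / p_tr[x]) * loss(h(x), y) * p_tr[x] * cond(y, x)
               for x in range(3) for y in (0, 1))

def unweighted_risk_under_train():
    """Plain expected loss under p_tr (what unweighted ERM targets)."""
    return sum(loss(h(x), y) * p_tr[x] * cond(y, x)
               for x in range(3) for y in (0, 1))

def risk_under_test():
    """Right-hand side of Eq. (2): expected loss under p_te."""
    return sum(loss(h(x), y) * p_te[x] * cond(y, x)
               for x in range(3) for y in (0, 1))

print(weighted_risk_under_train(), unweighted_risk_under_train(), risk_under_test())
```

In this toy setting the weighted training risk matches the test risk, while the unweighted training risk does not.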
with $w(x) = p_{te}(x)/p_{tr}(x)$.
However, IWERM is numerically unstable, and several variants have been proposed.
Adaptive IWERM (AIWERM) [5]: $w(x) = (p_{te}(x)/p_{tr}(x))^{\lambda}$, with $\lambda \in [0, 1]$.
Relative IWERM (RIWERM) [6]: $w(x) = p_{te}(x) / \{(1 - \lambda)\,p_{te}(x) + \lambda\,p_{tr}(x)\}$, with $\lambda \in [0, 1]$.
AIWERM and RIWERM are both equal to IWERM with $\lambda = 1$.
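The three weighting schemes can be compared pointwise. This is a minimal sketch with hypothetical function names and made-up density values; the RIWERM weight is written here with the denominator $(1-\lambda)\,p_{te}(x) + \lambda\,p_{tr}(x)$, the form for which $\lambda = 1$ recovers IWERM as stated above.

```python
def iw_weight(p_te, p_tr):
    """IWERM weight: the plain density ratio."""
    return p_te / p_tr

def aiw_weight(p_te, p_tr, lam):
    """AIWERM weight: density ratio flattened by the exponent lam in [0, 1]."""
    return (p_te / p_tr) ** lam

def riw_weight(p_te, p_tr, lam):
    """RIWERM weight, assuming the denominator (1 - lam) * p_te + lam * p_tr."""
    return p_te / ((1.0 - lam) * p_te + lam * p_tr)

# Illustrative point where the training density is tiny, so the plain ratio is large.
p_te_x, p_tr_x = 0.5, 0.001
print("IWERM:", iw_weight(p_te_x, p_tr_x))
for lam in (0.0, 0.5, 1.0):
    print(lam, aiw_weight(p_te_x, p_tr_x, lam), riw_weight(p_te_x, p_tr_x, lam))
```

For $\lambda < 1$ the RIWERM weight in this form cannot exceed $1/(1-\lambda)$, which is one way to see why it is better behaved than the raw ratio where $p_{tr}(x)$ is small.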
connecting two probability distributions $p(x)$ and $q(x)$ is defined as
\[
r_f^{(\alpha,\lambda)}(p(x), q(x)) = C \cdot f_\alpha^{-1}\left\{(1 - \lambda)\, f_\alpha(p(x)) + \lambda\, f_\alpha(q(x))\right\},
\]
where $C$ is a normalization factor, and
\[
f_\alpha(a) =
\begin{cases}
a^{\frac{1-\alpha}{2}} & (\alpha \neq 1), \\
\log a & (\alpha = 1),
\end{cases}
\]
for $\alpha \in \mathbb{R}$ and $\lambda \in [0, 1]$.
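The curve defined above can be evaluated directly for discrete distributions. The sketch below assumes strictly positive probability vectors and takes the normalization factor $C$ to be whatever makes the result sum to one; the function names and example vectors are hypothetical.

```python
import math

def f_alpha(a, alpha):
    """The map f_alpha: a^((1 - alpha) / 2) for alpha != 1, log a for alpha == 1."""
    return math.log(a) if alpha == 1 else a ** ((1 - alpha) / 2)

def f_alpha_inv(b, alpha):
    """Inverse of f_alpha on positive arguments."""
    return math.exp(b) if alpha == 1 else b ** (2 / (1 - alpha))

def alpha_curve(p, q, alpha, lam):
    """Point at position lam on the curve between discrete distributions p and q."""
    unnormalized = [
        f_alpha_inv((1 - lam) * f_alpha(pi, alpha) + lam * f_alpha(qi, alpha), alpha)
        for pi, qi in zip(p, q)
    ]
    total = sum(unnormalized)  # 1 / C
    return [u / total for u in unnormalized]

p = [0.7, 0.2, 0.1]  # illustrative distributions
q = [0.1, 0.3, 0.6]
print(alpha_curve(p, q, alpha=-1, lam=0.25))  # arithmetic mixture of p and q
print(alpha_curve(p, q, alpha=1, lam=0.25))   # normalized geometric mixture
print(alpha_curve(p, q, alpha=3, lam=0.25))   # normalized harmonic mixture
```

With $\alpha = -1$ the map $f_\alpha$ is the identity, so the curve reduces to the ordinary mixture $(1-\lambda)p + \lambda q$; with $\alpha = 1$ it is the normalized geometric interpolation.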
and $\alpha \in \mathbb{R}$, AIWERM and RIWERM are generalized as
\[
\hat{h} = \operatorname*{argmin}_{h \in \mathcal{H}} \int_{X \times Y} w^{(\alpha,\lambda)}(x)\,\ell(h(x), y)\, p_{tr}(x, y)\, dx\, dy, \tag{3}
\]
where
\[
w^{(\alpha,\lambda)}(x) = \frac{r_f^{(\alpha,\lambda)}(p_{tr}(x), p_{te}(x))}{p_{tr}(x)}. \tag{4}
\]
This recovers IWERM with $\lambda = 1$, AIWERM with $\alpha = 1$, and RIWERM with $\alpha = 3$.
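also represented in terms of the density ratio:
\[
w^{(\alpha,\lambda)}(x)
= \frac{\left\{(1 - \lambda)\, p_{tr}(x)^{\frac{1-\alpha}{2}} + \lambda\, p_{te}(x)^{\frac{1-\alpha}{2}}\right\}^{\frac{2}{1-\alpha}}}{p_{tr}(x)}
= \left\{1 - \lambda + \lambda \left(\frac{p_{te}(x)}{p_{tr}(x)}\right)^{\frac{1-\alpha}{2}}\right\}^{\frac{2}{1-\alpha}}, \quad (\alpha \neq 1).
\]
We can utilize the techniques of direct density ratio estimation.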
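Because the weight depends on the two densities only through their ratio, it can be computed from an estimated ratio alone. The sketch below is an illustration under assumptions: the normalization factor is taken as 1 (as in the density-ratio expression), the $\alpha = 1$ case is handled by the logarithmic branch of $f_\alpha$, which gives the ratio raised to the power $\lambda$, and the function name and the ratio value 4 are hypothetical.

```python
def generalized_weight(ratio, alpha, lam):
    """Weight w^(alpha, lam) as a function of the density ratio p_te(x) / p_tr(x)."""
    if alpha == 1:
        # Logarithmic branch of f_alpha: geometric interpolation, i.e. ratio ** lam.
        return ratio ** lam
    e = (1 - alpha) / 2
    return (1 - lam + lam * ratio ** e) ** (1 / e)

ratio, lam = 4.0, 0.5  # illustrative values
print(generalized_weight(ratio, alpha=1, lam=lam))   # AIWERM-type weight: ratio ** lam
print(generalized_weight(ratio, alpha=3, lam=lam))   # RIWERM-type weight
print(generalized_weight(ratio, alpha=-1, lam=lam))  # linear interpolation between 1 and ratio
print(generalized_weight(ratio, alpha=3, lam=1.0))   # lam = 1 gives back the plain ratio
```

In practice `ratio` would come from a density ratio estimator rather than from known densities; that estimation step is outside this sketch.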
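The gap between the expected loss $R(h)$ (with respect to the test distribution) and the empirical risk $L(h; \lambda, \alpha)$ is bounded as
\[
\begin{aligned}
|R(h) - L(h; \lambda, \alpha)|
\leq{}& E_{p_{tr}}\left[\left|\frac{p_{te}(x)}{p_{tr}(x)} - w^{(\alpha,\lambda)}(x)\right| \ell(h(x), y(x))\right] \\
&+ 2^{5/4} \max\left\{\sqrt{E_{p_{tr}}\left[(w^{(\alpha,\lambda)}(x))^2\, \ell^2(h(x), y(x))\right]},\ \sqrt{\hat{E}_{p_{tr}}\left[(w^{(\alpha,\lambda)}(x))^2\, \ell^2(h(x), y(x))\right]}\right\} \\
&\times \left(\frac{p \log\frac{2 n_{tr}}{p} + \log\frac{4}{\delta}}{n_{tr}}\right)^{\frac{3}{8}}.
\end{aligned} \tag{5}
\]
In the above inequality, $p$ is the pseudo-dimension of the function class.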
assumption is written as
\[
C_1\left(\frac{p_{te}(x)}{p_{tr}(x)} - w^{(\alpha,\lambda)}(x)\right) + C_2\left(\frac{1}{n_{tr}}\right). \tag{6}
\]
$C_1(\cdot)$ depends on $p_{te}(x)/p_{tr}(x) - w^{(\alpha,\lambda)}(x)$, and $C_2(\cdot)$ depends on $1/n_{tr}$.
⇒ When the training sample size is large, the effect of the second term vanishes asymptotically, and $\lambda$ should be close to 1, because the first term vanishes when $w^{(\alpha,\lambda)}(x)$ equals $p_{te}(x)/p_{tr}(x)$.
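The behaviour of the first term can be illustrated on a discrete example. The sketch below evaluates the expected absolute gap between the density ratio and the generalized weight under the training distribution. It rests on several assumptions that are not in the slides: the loss is set to 1, the normalization factor is 1, and the distributions and function names are hypothetical.

```python
def generalized_weight(ratio, alpha, lam):
    """Weight w^(alpha, lam) as a function of the density ratio (normalization taken as 1)."""
    if alpha == 1:
        return ratio ** lam
    e = (1 - alpha) / 2
    return (1 - lam + lam * ratio ** e) ** (1 / e)

def bias_term(p_tr, p_te, alpha, lam):
    """E_{p_tr}[ |p_te/p_tr - w| ] with the loss fixed to 1 (an illustrative simplification)."""
    return sum(
        ptr * abs(pte / ptr - generalized_weight(pte / ptr, alpha, lam))
        for ptr, pte in zip(p_tr, p_te)
    )

p_tr = [0.6, 0.3, 0.1]  # illustrative training marginal
p_te = [0.1, 0.3, 0.6]  # illustrative test marginal
for lam in (0.0, 0.5, 0.9, 1.0):
    print(lam, bias_term(p_tr, p_te, alpha=1, lam=lam), bias_term(p_tr, p_te, alpha=3, lam=lam))
```

In this example the term is zero at $\lambda = 1$ and positive for smaller $\lambda$, consistent with choosing $\lambda$ near 1 once the sample-size term is negligible.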
of smooth families of probability distributions. METR, 82:49–98, 1982.
[2] Shun-ichi Amari. Information geometry and its applications, volume 194. Springer, 2016.
[3] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2000.
[4] Masanari Kimura and Hideitsu Hino. Information geometrically generalized covariate shift adaptation. Neural Computation, 34(9):1944–1977, 2022.