Slide 1

Riemannian geometry for data analysis: application to blind source separation and low-rank structured covariance matrices
Florent Bouchard (L2S – CNRS, Univ. Paris-Saclay, CentraleSupélec)
S3 seminar – November 25, 2020

Slide 2

Introduction

Slide 3

Introduction – Statistical data analysis
• Characterize data x with some parameters θ
• To estimate θ, solve an optimization problem: argmin_{θ∈M} f(x, θ)
• f : X × M → R, cost function corresponding to the model
• x and θ might possess a structure ⇒ X and M are manifolds

Slide 4

Introduction – Riemannian geometry
Geometry: metric, curves, distance or divergence – to treat difficult problems that are complicated to handle with other approaches
• modeling: design appropriate cost functions for the considered model
  – exploits the structure of data x, e.g. centers of mass
• optimization: generic convex and non-convex methods, modularity
  – naturally takes into account the structure of parameters θ
• performance analysis: error measures and associated performance bounds
  – capture the geometrical structure of the problem

Slide 5

• Introduction
• Riemannian geometry and optimization
• Approximate joint diagonalization for blind source separation
• Intrinsic Cramér-Rao bound for low-rank structured elliptical models
• Conclusions and perspectives

Slide 6

Riemannian geometry and optimization

Slide 7

Riemannian geometry [Absil et al., 2008]
• Manifold M: locally diffeomorphic to R^d, with dim(M) = d, i.e.
  ∀θ ∈ M, ∃ U_θ ⊂ M and φ_θ : U_θ → R^d, diffeomorphism
• Curve γ : R → M, γ(0) = θ, with derivative γ̇(0) = lim_{t→0} (γ(t) − γ(0)) / t
• Tangent space T_θM = {γ̇(0) : γ : R → M, γ(0) = θ}

Slide 8

Riemannian geometry [Absil et al., 2008]
• Riemannian metric ⟨·,·⟩_θ : T_θM × T_θM → R
  – inner product on T_θM: bilinear, symmetric, positive definite
  – defines lengths and relative positions of tangent vectors:
    ‖ξ‖²_θ = ⟨ξ, ξ⟩_θ,   α(ξ, η) = ⟨ξ, η⟩_θ / (‖ξ‖_θ ‖η‖_θ)
• Geodesics γ : [0, 1] → M
  – generalize straight lines to manifolds – defined by (γ(0), γ̇(0)) or (γ(0), γ(1))
  – curves on M with zero acceleration: D²γ/dt² = 0, where the operator D²/dt² depends on M and ⟨·,·⟩
• Riemannian distance: δ(θ, θ̂) = ∫₀¹ ‖γ̇(t)‖_{γ(t)} dt
  – γ: geodesic connecting θ and θ̂ ⇒ distance = length of γ

Slide 9

Riemannian optimization [Absil et al., 2008]
argmin_{θ∈M} f(θ)
• framework for optimization on a manifold M equipped with a metric ⟨·,·⟩
• descent direction of f at θ: ξ ∈ T_θM such that D f(θ)[ξ] < 0
• gradient of f at θ: grad f(θ), defined by ⟨grad f(θ), ξ⟩_θ = D f(θ)[ξ]

Slide 10

Riemannian optimization [Absil et al., 2008]
• minimize f on M starting from θ:
  – take a descent direction ξ ∈ T_θM, i.e. ⟨grad f(θ), ξ⟩_θ < 0
  – apply a retraction of ξ onto M: R_θ(ξ)
  – reiterate until a critical point is reached: grad f(θ) = 0
• example of optimization algorithm: gradient descent
  θ_{i+1} = R_{θ_i}(−t_i grad f(θ_i))
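
As a toy illustration of this update rule (not part of the original slides), here is a minimal sketch of Riemannian gradient descent on the unit sphere: the Riemannian gradient is the projection of the Euclidean gradient onto the tangent space and the retraction is a renormalization. The cost function, step size and vector a are made up for the example.

```python
import numpy as np

# Minimal sketch: Riemannian gradient descent on the unit sphere S^{n-1},
# minimizing f(theta) = -<a, theta>, whose minimizer is a / ||a||.
rng = np.random.default_rng(0)
n = 5
a = rng.standard_normal(n)

def rgrad(theta):
    # Riemannian gradient: project the Euclidean gradient (-a) onto the
    # tangent space T_theta S^{n-1} = {xi : theta^T xi = 0}
    g = -a
    return g - (theta @ g) * theta

def retract(theta, xi):
    # Retraction R_theta(xi): step in the tangent direction, then renormalize
    v = theta + xi
    return v / np.linalg.norm(v)

theta = rng.standard_normal(n)
theta /= np.linalg.norm(theta)
step = 0.1
for _ in range(200):
    # theta_{i+1} = R_{theta_i}(-t_i grad f(theta_i))
    theta = retract(theta, -step * rgrad(theta))

print(theta @ a / np.linalg.norm(a))  # ≈ 1: theta converged to a / ||a||
```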

Slide 11

Approximate joint diagonalization for blind source separation

Slide 12

Problem: approximate joint diagonalization
• data: a set {C_k} of n × n symmetric positive definite matrices
• goal: find a joint diagonalizer B of the matrices C_k, i.e. make the transformed matrices BC_kBᵀ as diagonal as possible
• formulation: search for a non-singular matrix B ∈ R^{n×n} solving
  argmin_B f(B, {C_k})
  with f a diagonality criterion
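
For concreteness (not in the original slides), here is a minimal sketch of one classical diagonality criterion: the sum of squared off-diagonal entries of the transformed matrices B C_k Bᵀ. The matrices and the exact diagonalizer below are synthetic placeholders.

```python
import numpy as np

def ajd_offdiag_cost(B, Cs):
    """Sum over k of the squared off-diagonal entries of B C_k B^T
    (one classical diagonality criterion for approximate joint diagonalization)."""
    cost = 0.0
    for C in Cs:
        M = B @ C @ B.T
        cost += np.sum(M**2) - np.sum(np.diag(M)**2)  # ||off(M)||_F^2
    return cost

# Synthetic example: matrices sharing an exact joint diagonalizer A^{-1}
rng = np.random.default_rng(1)
n, K = 4, 10
A = rng.standard_normal((n, n))
Cs = [A @ np.diag(rng.uniform(0.5, 2.0, n)) @ A.T for _ in range(K)]
print(ajd_offdiag_cost(np.eye(n), Cs) > 0)                        # identity does not diagonalize
print(np.isclose(ajd_offdiag_cost(np.linalg.inv(A), Cs), 0.0))    # exact diagonalizer: zero cost
```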

Slide 13

Application: blind source separation
• instantaneous linear mixing: x(t) = A s(t) [Comon and Jutten, 2010]
• goal: retrieve A and s(t) knowing only x(t)
• estimated sources: B x(t), with B the separating matrix
• used in many engineering fields such as electroencephalography
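
A small, self-contained illustration (not from the slides): mix two synthetic sources with a random matrix A and check that applying B = A⁻¹ to x(t) recovers them. In practice A is unknown and B would be estimated from x alone, e.g. by approximate joint diagonalization.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
t = np.arange(T) / 100.0

# Two synthetic sources and an instantaneous mixing x(t) = A s(t)
s = np.vstack([np.sin(2 * np.pi * 3 * t), np.sign(np.sin(2 * np.pi * 5 * t))])
A = rng.standard_normal((2, 2))
x = A @ s

# Oracle separation for illustration only: B = A^{-1} recovers the sources.
B = np.linalg.inv(A)
s_hat = B @ x
print(np.allclose(s_hat, s))  # True
```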

Slide 14

Geometrical modeling of the problem
• goal: transform the C_k so that they become as close as possible to D_n^{++} within S_n^{++}
• diagonality measure of BC_kBᵀ: its relative position to D_n^{++}
  – D_n^{++}: diagonal positive definite matrices
  – S_n^{++}: symmetric positive definite matrices

Slide 15

Proposed geometrical model
• appropriate criterion: f(B, {C_k}) = Σ_k d(BC_kBᵀ, Λ_k(B))
  – Λ_k(B): target diagonal matrices
  – d(·,·): divergence on S_n^{++}
• natural choice for Λ_k(B): Λ_k(B) = argmin_{Λ∈D_n^{++}} d(BC_kBᵀ, Λ)

Slide 16

Considered divergences
• least squares criterion: Euclidean distance on S_n^{++} [Cardoso and Souloumiac, 1993]
  δ²_F(C, Λ) = ‖C − Λ‖²_F        Λ = ddiag(C)
• log-likelihood criterion: left Kullback-Leibler divergence [Pham, 2000]
  d_KL(C, Λ)                      Λ = ddiag(C)
  where d_KL(P, S) = trace(PS⁻¹ − I_n) − log det(PS⁻¹)
• other measures obtained from d_KL(·,·):
  – right measure: d_rKL(C, Λ) = d_KL(Λ, C)        Λ = ddiag(C⁻¹)⁻¹
  – symmetrized measure: d_sKL(C, Λ) = ½ (d_KL(C, Λ) + d_KL(Λ, C))        Λ = ddiag(C)^{1/2} ddiag(C⁻¹)^{−1/2}
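
A numerical sketch (not from the slides) of the first two criteria and their closed-form diagonal targets; ddiag(·) denotes the operator keeping only the diagonal, and the test matrix C is arbitrary.

```python
import numpy as np

def ddiag(C):
    """Keep only the diagonal of C (as a diagonal matrix)."""
    return np.diag(np.diag(C))

def delta2_F(C, Lam):
    """Squared Euclidean (Frobenius) criterion; optimal diagonal target is ddiag(C)."""
    return np.linalg.norm(C - Lam, 'fro')**2

def d_KL(P, S):
    """Left Kullback-Leibler divergence: trace(P S^{-1} - I_n) - log det(P S^{-1})."""
    n = P.shape[0]
    M = P @ np.linalg.inv(S)
    return np.trace(M) - n - np.log(np.linalg.det(M))

# Quick check on a random SPD matrix
rng = np.random.default_rng(3)
X = rng.standard_normal((4, 4))
C = X @ X.T + 4 * np.eye(4)

print(delta2_F(C, ddiag(C)))   # least-squares criterion at its target
print(d_KL(C, ddiag(C)))       # KL criterion at its target Λ = ddiag(C)
print(d_KL(C, C))              # ~0: the divergence vanishes when both arguments match
```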

Slide 17

Considered divergences
(for each divergence, the right-hand relation characterizes the associated optimal diagonal target Λ)
• natural Riemannian distance [Skovgaard, 1984], [Bhatia, 2009]
  δ²_R(C, Λ) = ‖log(Λ^{−1/2} C Λ^{−1/2})‖²_F        ddiag(log(C⁻¹Λ)) = 0_n
• log-Euclidean distance [Arsigny et al., 2007]
  δ²_LE(C, Λ) = ‖log(C) − log(Λ)‖²_F                Λ = exp(ddiag(log(C)))
• Bhattacharyya distance [Chebbi and Moakher, 2012], [Sra, 2013]
  δ²_B(C, Λ) = 4 log[ det((C + Λ)/2) / (det(C)^{1/2} det(Λ)^{1/2}) ]        2 ddiag((C + Λ)⁻¹) = Λ⁻¹
• Wasserstein distance [Villani, 2008], [Bhatia et al., 2017]
  δ²_W(C, Λ) = trace( ½ (C + Λ) − (Λ^{1/2} C Λ^{1/2})^{1/2} )        ddiag((Λ^{1/2} C Λ^{1/2})^{1/2}) = Λ
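
A numerical sketch (not from the slides) of how three of these distances between SPD matrices could be evaluated with SciPy; the Wasserstein expression follows the slides' convention with the ½(C + Λ) term, and the diagonal matrix used below is an arbitrary test point, not the optimal target of each divergence.

```python
import numpy as np
from scipy.linalg import logm, sqrtm, fractional_matrix_power

def delta2_R(C, Lam):
    """Squared natural (affine-invariant) Riemannian distance on SPD matrices."""
    iLs = fractional_matrix_power(Lam, -0.5)
    return np.linalg.norm(logm(iLs @ C @ iLs), 'fro')**2

def delta2_LE(C, Lam):
    """Squared log-Euclidean distance."""
    return np.linalg.norm(logm(C) - logm(Lam), 'fro')**2

def delta2_W(C, Lam):
    """Squared Wasserstein-type distance, with the slides' 1/2 (C + Λ) convention."""
    Ls = sqrtm(Lam)
    return np.real(np.trace(0.5 * (C + Lam) - sqrtm(Ls @ C @ Ls)))

rng = np.random.default_rng(4)
X = rng.standard_normal((4, 4))
C = X @ X.T + 4 * np.eye(4)
Lam = np.diag(np.diag(C))   # arbitrary diagonal SPD matrix for the demo
print(delta2_R(C, Lam), delta2_LE(C, Lam), delta2_W(C, Lam))
print(delta2_R(C, C), delta2_LE(C, C), delta2_W(C, C))   # all ~0
```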

Slide 18

First experiment: simulated data
C_k = A Λ_k Aᵀ + (1/σ) E_k Δ_k E_kᵀ
• A, E_k: random elements, standard normal distribution
  – conditioning of A with respect to inversion in [1, 10]
• Λ_k, Δ_k: diagonal – pth element drawn from a χ² distribution with expectation 1 and divided by p
• σ: signal-to-noise ratio
• performance measure: Moreau-Amari criterion [Moreau and Macchi, 1994]
  – measures the degree of similarity of BA with a matrix of the form PΣ
  – P: permutation matrix – Σ: non-singular diagonal matrix
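
A sketch (not from the slides) of how such data could be generated and how a Moreau-Amari-type index might be computed. The index below is one common form of it (zero iff the argument is a scaled permutation); the exact normalization used in the talk, and the value of σ, are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, K, sigma = 4, 20, 100.0

# C_k = A Λ_k A^T + (1/σ) E_k Δ_k E_k^T, with normal A, E_k and chi-squared diagonal entries
A = rng.standard_normal((n, n))
p_idx = np.arange(1, n + 1)
Cs = []
for _ in range(K):
    E = rng.standard_normal((n, n))
    Lam = np.diag(rng.chisquare(1, n) / p_idx)     # pth element: chi^2 with expectation 1, divided by p
    Delta = np.diag(rng.chisquare(1, n) / p_idx)
    Cs.append(A @ Lam @ A.T + (1.0 / sigma) * E @ Delta @ E.T)

def amari_index(M):
    """One common form of the Moreau-Amari index: 0 when M = P Σ (permutation times diagonal)."""
    Mabs = np.abs(M)
    row = (Mabs / Mabs.max(axis=1, keepdims=True)).sum(axis=1) - 1
    col = (Mabs / Mabs.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return (row.sum() + col.sum()) / (2 * M.shape[0] * (M.shape[0] - 1))

B = np.linalg.inv(A)                              # oracle separator, for illustration only
print(amari_index(B @ A))                         # ~0: B A is (close to) a scaled permutation
print(amari_index(rng.standard_normal((n, n))))   # clearly larger for a random matrix
```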

Slide 19

First experiment: comparisons of divergences
[Figure: Moreau-Amari index IM-A (dB) as a function of σ (from 10 to 1000) for the divergences F, W, rKL, B, R, sKL, KL and LE; medians, first and ninth deciles (error bars) estimated on 50 trials.]

Slide 20

Second experiment: electroencephalographic data
[Figure: electrode montage (Nasion–Inion) with 8 occipital/parietal channels – Oz, O1, O2, POz, PO3, PO4, PO7, PO8 – and power spectra over f (Hz) for the stimulation classes 13 Hz, 17 Hz, 21 Hz and rest.]

Slide 21

Obtained sources with proposed methods
[Figure: power spectra over f (Hz) of the estimated sources s1 and s2 for the classes 13 Hz, 17 Hz, 21 Hz and rest.]

Slide 22

Comparisons of divergences
• performance measure: ISSVEP(f) ∈ [0, 1] for class f – ratio between the power at f and the total power
• results for source s1:

  ISSVEP(f)   F      W      rKL    B      R      sKL    KL     LE
  13 Hz       0.95   0.95   0.96   0.96   0.96   0.96   0.96   0.96
  17 Hz       0.87   0.89   0.91   0.91   0.91   0.91   0.91   0.91
  21 Hz       0.50   0.54   0.60   0.60   0.60   0.60   0.60   0.60

Slide 23

Intrinsic Cramér-Rao bound for low-rank structured elliptical models

Slide 24

Intrinsic Cramér-Rao bound
• x ∼ L(θ), θ ∈ M, with log-likelihood L_x : M → R
• given x, estimation problem: θ̂ = argmin_{θ∈M} −L_x(θ)
• Cramér-Rao bound of an unbiased estimator θ̂ of θ:
  E_θ̂[Err(θ, θ̂)] ⪰ F⁻¹  ⇒  MSE = E_θ̂[err(θ, θ̂)] ≥ trace(F⁻¹)
  F: Fisher information matrix
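
A simple classical-case check (not from the slides, purely illustrative): for n i.i.d. samples from N(θ, σ²I_d) with known σ, the Fisher information is F = (n/σ²) I_d, so trace(F⁻¹) = dσ²/n, which the sample mean attains.

```python
import numpy as np

# Classical Cramér-Rao bound check: estimating the mean of N(theta, sigma^2 I_d)
rng = np.random.default_rng(6)
d, n, sigma = 5, 50, 2.0
theta = rng.standard_normal(d)

n_trials = 5000
errs = np.empty(n_trials)
for i in range(n_trials):
    x = theta + sigma * rng.standard_normal((n, d))
    theta_hat = x.mean(axis=0)                    # unbiased estimator (sample mean)
    errs[i] = np.sum((theta_hat - theta)**2)      # err(theta, theta_hat) = ||theta - theta_hat||_2^2

crb = d * sigma**2 / n                            # trace(F^{-1}) with F = (n / sigma^2) I_d
print(errs.mean(), crb)                           # empirical MSE matches the bound (attained here)
```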

Slide 25

Intrinsic Cramér-Rao bound
E_θ̂[err(θ, θ̂)] ≥ trace(F⁻¹)
• classical case: M = R^d
  – err(θ, θ̂) = ‖θ − θ̂‖²₂
  – F_ij = −E_x[∂²L_x(θ) / ∂θ_i ∂θ_j]
  ⇒ constraints and invariances of the model are not taken into account
• generalization to a Riemannian manifold M [Smith, 2005], [Boumal, 2013]
  – err(θ, θ̂) = δ²_M(θ, θ̂), with δ_M the distance on M associated with the metric ⟨·,·⟩^M
  – F_ij = ⟨e_i, e_j⟩^F_θ, where ⟨ξ, η⟩^F_θ = −E_x[D² L_x(θ)[ξ, η]] is the Fisher metric
  – {e_i}: orthonormal basis of T_θM associated with ⟨·,·⟩^M_θ
⇒ makes it possible to exhibit properties that differ from the classical case

Slide 26

Elliptical distributions / low-rank structure
• x ∼ CES(0, R), R ∈ H_p^{++}, with log-likelihood
  L_x^{++}(R) = log g(xᴴ R⁻¹ x) + log det(R⁻¹)
  g : R₊ → R₊, e.g. Gaussian (g(t) = exp(−t)), generalized Gaussian, Student t
• Fisher metric of elliptical distributions on H_p^{++} [Breloy et al., 2018]:
  ⟨ξ, η⟩^{F,++}_R = α_F trace(R⁻¹ξR⁻¹η) + β_F trace(R⁻¹ξ) trace(R⁻¹η)
• low-rank structure – few data [Sun et al., 2016], [Bouchard et al., 2020]:
  R = I_p + H,  H ∈ H_{p,k}^{+}
⇒ geometry of H_{p,k}^{+}: several possibilities, none ideal
  e.g. [Bonnabel and Sepulchre, 2009], [Vandereycken et al., 2012], [Massart and Absil, 2018]
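
A hedged sketch (not from the slides) of this log-likelihood up to an additive constant, for the Gaussian generator g(t) = exp(−t) and a Student-t generator; the exact Student-t form of g below is an assumption taken from the standard CES literature, and the sketch is real-valued for simplicity.

```python
import numpy as np

def ces_loglik(x, R, log_g):
    """Log-likelihood (up to an additive constant) of x ~ CES(0, R):
    log g(x^H R^{-1} x) + log det(R^{-1})."""
    t = np.real(np.conj(x) @ np.linalg.solve(R, x))
    _, logdet = np.linalg.slogdet(R)
    return log_g(t) - logdet

# Two examples of density generators g (constants omitted)
log_g_gauss = lambda t: -t                       # Gaussian: g(t) = exp(-t)
def make_log_g_student(d, p):
    # Student-t generator, assumed of the usual form (1 + t/d)^{-(d+p)/2} up to constants
    return lambda t: -0.5 * (d + p) * np.log1p(t / d)

rng = np.random.default_rng(7)
p = 4
X = rng.standard_normal((p, p))
R = X @ X.T + p * np.eye(p)
x = rng.standard_normal(p)
print(ces_loglik(x, R, log_g_gauss))
print(ces_loglik(x, R, make_log_g_student(d=3, p=p)))
```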

Slide 27

Geometry of low-rank structured covariance matrices
• parametrization: H = φ(U, Σ) = UΣUᴴ with (U, Σ) ∈ M̄_{p,k} = St_{p,k} × H_k^{++}
• invariance: ∀O ∈ U_k, φ(UO, OᴴΣO) = φ(U, Σ)
⇒ H_{p,k}^{+} isomorphic to the quotient [Bonnabel and Sepulchre, 2009], [Bouchard et al., 2020]
  M_{p,k} = M̄_{p,k} / U_k = {π(U, Σ) : (U, Σ) ∈ M̄_{p,k}}
  where π(U, Σ) = {(UO, OᴴΣO) : O ∈ U_k}
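
A quick numerical check (not from the slides) of the invariance behind the quotient: φ(U, Σ) is unchanged when the representative (U, Σ) is replaced by (UO, OᴴΣO) for a unitary O.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(8)
p, k = 6, 3

# A point (U, Sigma) in St_{p,k} x H_k^{++} (complex case)
Z = rng.standard_normal((p, k)) + 1j * rng.standard_normal((p, k))
U, _ = qr(Z, mode='economic')                      # U^H U = I_k
W = rng.standard_normal((k, k)) + 1j * rng.standard_normal((k, k))
Sigma = W @ W.conj().T + k * np.eye(k)             # Hermitian positive definite

def phi(U, Sigma):
    return U @ Sigma @ U.conj().T                  # H = U Sigma U^H, rank-k PSD

# A unitary O in U_k and the transformed representative (U O, O^H Sigma O)
O, _ = qr(rng.standard_normal((k, k)) + 1j * rng.standard_normal((k, k)))
print(np.allclose(phi(U, Sigma), phi(U @ O, O.conj().T @ Sigma @ O)))  # True: same H
```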

Slide 28

Quotient M_{p,k}: Riemannian geometry
• θ = (U, Σ) ∈ M̄_{p,k}
• tangent space: T_θM̄_{p,k} = {(UΩ_ξ + U_⊥K_ξ, ξ_Σ) : Ω_ξ ∈ A_k, K_ξ ∈ R^{(p−k)×k}, ξ_Σ ∈ S_k}
  with U_⊥ ∈ St_{p,(p−k)} such that UᴴU_⊥ = 0
• metric: ⟨ξ, η⟩_θ = ½ trace(Ω_ξᴴ Ω_η) + trace(K_ξᴴ K_η) + α trace(Σ⁻¹ξ_Σ Σ⁻¹η_Σ) + β trace(Σ⁻¹ξ_Σ) trace(Σ⁻¹η_Σ)
• vertical space – tangent space to the equivalence class:
  V_θ = {(UΩ, ΣΩ − ΩΣ) : Ω ∈ A_k}
• horizontal space – orthogonal complement of V_θ according to ⟨·,·⟩_θ:
  H_θ = {ξ ∈ T_θM̄_{p,k} : Ω_ξ = 2α(Σ⁻¹ξ_Σ − ξ_ΣΣ⁻¹)}

Slide 29

Quotient M_{p,k}: error measure
• Riemannian distance not known ⇒ alternative error measure
• alternative horizontal space – complement to V_θ, not orthogonal [Bonnabel and Sepulchre, 2009]:
  {ξ ∈ T_θM̄_{p,k} : Ω_ξ = 0} = {(U_⊥K_ξ, ξ_Σ) : K_ξ ∈ R^{(p−k)×k}, ξ_Σ ∈ S_k}
• induces a divergence on M_{p,k}:
  d(π(θ), π(θ̂)) = ‖Θ‖²₂ + α ‖log(Σ^{−1/2} O Ôᴴ Σ̂ Ô Oᴴ Σ^{−1/2})‖²₂ + β (log det(Σ⁻¹Σ̂))²
  with Uᵀ Û = O cos(Θ) Ôᵀ
⇒ err(θ, θ̂) = d(θ, θ̂)
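
A sketch (not from the slides) of how this divergence might be evaluated numerically. It assumes the matrix norms are Frobenius norms, uses the conjugate transpose, and sets α = β = 1; all of these are placeholder choices for illustration.

```python
import numpy as np
from scipy.linalg import logm, fractional_matrix_power, svd

def lowrank_divergence(U, Sigma, U_hat, Sigma_hat, alpha=1.0, beta=1.0):
    """Sketch of the divergence between pi(U, Sigma) and pi(U_hat, Sigma_hat):
    ||Theta||^2 + alpha ||log(S^{-1/2} O O_hat^H S_hat O_hat O^H S^{-1/2})||^2
               + beta (log det(Sigma^{-1} Sigma_hat))^2,
    where U^H U_hat = O cos(Theta) O_hat^H (principal angles between the subspaces)."""
    O, c, Oh_H = svd(U.conj().T @ U_hat)           # U^H U_hat = O diag(c) O_hat^H, c = cos(Theta)
    O_hat = Oh_H.conj().T
    Theta = np.arccos(np.clip(c, -1.0, 1.0))
    term1 = np.sum(Theta**2)

    S_m12 = fractional_matrix_power(Sigma, -0.5)
    M = S_m12 @ O @ O_hat.conj().T @ Sigma_hat @ O_hat @ O.conj().T @ S_m12
    term2 = alpha * np.linalg.norm(logm(M), 'fro')**2

    _, ld = np.linalg.slogdet(np.linalg.solve(Sigma, Sigma_hat))
    term3 = beta * ld**2
    return term1 + term2 + term3

# Sanity check: the divergence vanishes between two representatives of the same class
rng = np.random.default_rng(9)
p, k = 6, 3
U, _ = np.linalg.qr(rng.standard_normal((p, k)))
W = rng.standard_normal((k, k))
Sigma = W @ W.T + k * np.eye(k)
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
print(np.isclose(lowrank_divergence(U, Sigma, U @ Q, Q.T @ Sigma @ Q), 0.0, atol=1e-8))
```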

Slide 30

Quotient M_{p,k}: performance bound
E_θ̂[err(θ, θ̂)] = E_θ̂[d(θ, θ̂)] ≥ trace(F⁻¹),  with F_ij = ⟨e_i, e_j⟩^{F,M_{p,k}}_θ
• {e_i}: orthonormal basis of H_θ according to ⟨·,·⟩_θ:
  – {(U_⊥K_ij, 0), (iU_⊥K_ij, 0)} for 1 ≤ i ≤ p−k, 1 ≤ j ≤ k
  – {(0, (1/√α) Σ^{1/2}H_ij Σ^{1/2} + [(√α − √(α+kβ)) / (k√α √(α+kβ))] trace(H_ij) Σ)} for 1 ≤ j ≤ i ≤ k
  – {(0, (1/√α) Σ^{1/2}H̃_ij Σ^{1/2})} for 1 ≤ j < i ≤ k

Slide 31

Quotient M_{p,k}: Fisher metric of elliptical distributions
• φ(θ) = UΣUᴴ, with differential D φ(θ)[ξ] = Uξ_Σ Uᴴ + ξ_U ΣUᴴ + UΣξ_Uᴴ
• log-likelihood: L_x^{M_{p,k}}(θ) = L_x^{++}(I_p + φ(θ))
• Fisher metric: ⟨ξ, η⟩^{F,M_{p,k}}_θ = ⟨D φ(θ)[ξ], D φ(θ)[η]⟩^{F,++}_{I_p+φ(θ)}
  where ⟨ξ_R, η_R⟩^{F,++}_R = α_F trace(R⁻¹ξ_R R⁻¹η_R) + β_F trace(R⁻¹ξ_R) trace(R⁻¹η_R)
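
A sketch (not from the slides) of how this pull-back of the Fisher metric could be computed numerically for a tangent vector ξ = (ξ_U, ξ_Σ); the values of α_F and β_F depend on the density generator g and are left here as placeholder inputs, and the test direction is an arbitrary ambient direction for illustration.

```python
import numpy as np

def dphi(U, Sigma, xi_U, xi_Sigma):
    """Differential of phi(U, Sigma) = U Sigma U^H applied to xi = (xi_U, xi_Sigma)."""
    return (U @ xi_Sigma @ U.conj().T
            + xi_U @ Sigma @ U.conj().T
            + U @ Sigma @ xi_U.conj().T)

def fisher_pp(R, xi_R, eta_R, alpha_F, beta_F):
    """Fisher metric of elliptical distributions on H_p^{++} (alpha_F, beta_F model-dependent)."""
    Ri_xi = np.linalg.solve(R, xi_R)
    Ri_eta = np.linalg.solve(R, eta_R)
    return (alpha_F * np.trace(Ri_xi @ Ri_eta)
            + beta_F * np.trace(Ri_xi) * np.trace(Ri_eta)).real

def fisher_quotient(U, Sigma, xi, eta, alpha_F=1.0, beta_F=0.0):
    """Pull-back metric: <xi, eta> = <Dphi[xi], Dphi[eta]>^{F,++} at R = I_p + phi(U, Sigma)."""
    p = U.shape[0]
    R = np.eye(p) + U @ Sigma @ U.conj().T
    return fisher_pp(R, dphi(U, Sigma, *xi), dphi(U, Sigma, *eta), alpha_F, beta_F)

# Tiny real-valued example with placeholder alpha_F, beta_F
rng = np.random.default_rng(10)
p, k = 6, 3
U, _ = np.linalg.qr(rng.standard_normal((p, k)))
W = rng.standard_normal((k, k)); Sigma = W @ W.T + k * np.eye(k)
S = rng.standard_normal((k, k))
xi = (rng.standard_normal((p, k)), (S + S.T) / 2)   # placeholder direction (xi_U, xi_Sigma)
print(fisher_quotient(U, Sigma, xi, xi))            # squared "Fisher length" of xi, > 0
```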

Slide 32

Numerical illustrations
• covariance matrix: R = I_p + σUΣUᴴ
  – p = 16, k = 8
  – U ∈ St_{p,k}: random orthogonal matrix
  – Σ ∈ H_k^{++}: diagonal matrix; minimal / maximal elements 1/√c and √c (c = 20: conditioning),
    other elements random, uniform distribution between 1/√c and √c,
    normalized trace: trace(Σ) = trace(I_k) = k
  – σ = 50: free parameter
• data:
  – 500 sets of n ∈ [12, 300] random samples
  – t-distribution, d = 3 (non-Gaussian) and d = 100 (almost Gaussian) degrees of freedom [Ollila et al., 2012]
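
A sketch (not from the slides) of how this simulation setup could be reproduced. The multivariate t sampler uses the standard Gaussian / chi-squared mixture representation, here in the real-valued case for simplicity; that choice, and the handling of the trace normalization, are assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)
p, k, c, sigma = 16, 8, 20.0, 50.0

# Low-rank structured covariance R = I_p + sigma U Sigma U^T (real-valued sketch)
U, _ = np.linalg.qr(rng.standard_normal((p, k)))            # random orthogonal U in St_{p,k}
diag = np.concatenate(([1 / np.sqrt(c), np.sqrt(c)],
                       rng.uniform(1 / np.sqrt(c), np.sqrt(c), k - 2)))
diag *= k / diag.sum()       # normalized trace: trace(Sigma) = k (slightly rescales the extremes)
Sigma = np.diag(diag)
R = np.eye(p) + sigma * U @ Sigma @ U.T

def sample_t(n, R, d):
    """n samples from a multivariate t distribution with scatter R and d degrees of freedom,
    via the Gaussian / chi-squared mixture representation."""
    L = np.linalg.cholesky(R)
    z = rng.standard_normal((n, R.shape[0])) @ L.T
    w = np.sqrt(d / rng.chisquare(d, size=n))                # heavy-tailed scaling factors
    return z * w[:, None]

X3 = sample_t(300, R, d=3)       # strongly non-Gaussian samples
X100 = sample_t(300, R, d=100)   # almost Gaussian samples
print(X3.shape, X100.shape)
```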

Slide 33

Numerical illustrations
[Figure: estimation error err(θ, θ̂) (dB) versus the number of samples n (log scale, n = p marked), for d = 3 and d = 100; compared estimators T-RGD, pSCM and T-RTR against the bound trace(F_θ⁻¹).]

Slide 34

Conclusions and perspectives

Slide 35

Conclusions and perspectives
• Riemannian geometry:
  – powerful tool for signal processing and machine learning
  – exploits the intrinsic structure of data and parameters
  – quite complete and allows modularity – modeling, optimization, performance analysis
• when to use Riemannian geometry and optimization?
  – data and/or parameters possess a structure: constraints, invariances (geometrical properties of the model)
  – a metric of particular interest with respect to the model (e.g., information theory)
• issue: for some manifolds, the geometry is complicated to fully characterize
  e.g., low-rank symmetric positive semidefinite matrices
  ⇒ go beyond Riemannian geometry to provide simple and efficient geometrical objects

Slide 36

Absil, P.-A., Mahony, R., and Sepulchre, R. (2008). Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, USA.
Arsigny, V., Fillard, P., Pennec, X., and Ayache, N. (2007). Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications, 29(1):328–347.
Bhatia, R. (2009). Positive Definite Matrices. Princeton University Press.
Bhatia, R., Jain, T., and Lim, Y. (2017). On the Bures-Wasserstein distance between positive definite matrices. Preprint.
Bonnabel, S. and Sepulchre, R. (2009). Riemannian metric and geometric mean for positive semidefinite matrices of fixed rank. SIAM Journal on Matrix Analysis and Applications, 31(3):1055–1070.
Bouchard, F., Breloy, A., Ginolhac, G., Renaux, A., and Pascal, F. (2020). A Riemannian framework for low-rank structured elliptical models. IEEE Transactions on Signal Processing, submitted. Available on arXiv.
Boumal, N. (2013). On intrinsic Cramér-Rao bounds for Riemannian submanifolds and quotient manifolds. IEEE Transactions on Signal Processing, 61(7):1809–1821.
Breloy, A., Ginolhac, G., Renaux, A., and Bouchard, F. (2018). Intrinsic Cramér-Rao bounds for scatter and shape matrices estimation in CES distributions. IEEE Signal Processing Letters, 26(2):262–266.
Cardoso, J.-F. and Souloumiac, A. (1993). Blind beamforming for non Gaussian signals. IEE Proceedings-F, 140(6):362–370.

Slide 37

Chebbi, Z. and Moakher, M. (2012). Means of Hermitian positive-definite matrices based on the log-determinant α-divergence function. Linear Algebra and its Applications, 436(7):1872–1889.
Comon, P. and Jutten, C. (2010). Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, 1st edition.
Massart, E. and Absil, P.-A. (2018). Quotient geometry with simple geodesics for the manifold of fixed-rank positive-semidefinite matrices. Technical Report UCL-INMA-2018.06.
Moreau, E. and Macchi, O. (1994). A one stage self-adaptive algorithm for source separation. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-94), volume 3, pages 49–52.
Ollila, E., Tyler, D. E., Koivunen, V., and Poor, H. V. (2012). Complex elliptically symmetric distributions: Survey, new results and applications. IEEE Transactions on Signal Processing, 60(11):5597–5625.
Pham, D.-T. (2000). Joint approximate diagonalization of positive definite Hermitian matrices. SIAM Journal on Matrix Analysis and Applications, 22(4):1136–1152.
Skovgaard, L. T. (1984). A Riemannian geometry of the multivariate normal model. Scandinavian Journal of Statistics, pages 211–223.
Smith, S. T. (2005). Covariance, subspace, and intrinsic Cramér-Rao bounds. IEEE Transactions on Signal Processing, 53(5):1610–1630.
Sra, S. (2013). Positive definite matrices and the S-divergence. arXiv preprint arXiv:1110.1773.

Slide 38

Sun, Y., Babu, P., and Palomar, D. P. (2016). Robust estimation of structured covariance matrix for heavy-tailed elliptical distributions. IEEE Transactions on Signal Processing, 64(14):3576–3590.
Vandereycken, B., Absil, P.-A., and Vandewalle, S. (2012). A Riemannian geometry with complete geodesics for the set of positive semidefinite matrices of fixed rank. IMA Journal of Numerical Analysis, 33(2):481–514.
Villani, C. (2008). Optimal Transport: Old and New, volume 338. Springer Science & Business Media.