150

# Florent Bouchard

(L2S — CNRS, Université Paris-Saclay, CentraleSupélec)

https://s3-seminar.github.io/seminars/florent-bouchard/

Title — Riemannian geometry for data analysis: illustration on blind source separation and low-rank structured covariance matrices

Abstract — In this presentation, Riemannian geometry for data analysis is introduced. In particular, it is applied on two specific statistical signal processing problems: blind source separation and low-rank structured covariance matrices. Blind source separation can be solved by jointly diagonalizing some covariance matrices. We show here how geometry can be exploited in order to provide original joint diagonalization criteria and ways to compare them theoretically. These results are illustrated with numerical experiments both on simulated and real data. Concerning low-rank structured covariance matrices, an intrinsic Cramér-Rao bound of the corresponding estimation problem is presented, illustrating the interest of geometry for performance analysis. These results are validated with simulations.

## S³ Seminar

November 27, 2020

## Transcript

1. ### Riemannian geometry for data analysis: application to blind source separation

and low-rank structured covariance matrices Florent Bouchard (L2S – CNRS, Univ. Paris-Saclay, CentraleSupélec) S3 seminar – November, 25 2020

3. ### Introduction – Statistical data analysis • Characterize data x with

some parameters θ x θ • To estimate θ, optimization problem: argmin θ∈M f (x, θ) • f : X × M → R, cost function corresponding to the model • x and θ might possess a structure ⇒ X and M are manifolds 1
4. ### Introduction – Riemannian geometry Geometry: metric, curves, distance or divergence

treat diﬃcult problems, complicated to handle with other approaches • modeling: design appropriate cost functions for the considered model exploit the structure of data x, e.g. centers of mass • optimization: generic convex and non-convex methods, modularity naturally takes into account structure of parameters θ • performance analysis: error measures and associated performance bounds captures the geometrical structure of the problem M TθM •θ 2
5. ### Introduction Riemannian geometry and optimization Approximate joint diagonalization for blind

source separation Intrinsic Cramér-Rao bound for low-rank structured elliptical models Conclusions and perspectives

7. ### Riemannian geometry [Absil et al., 2008] • Manifold M: locally

diﬀeomorphic to Rd , with dim(M) = d, i.e. ∀θ ∈ M, ∃ Uθ ⊂ M and ϕθ : Uθ → Rd , diﬀeomorphism M • θ Uθ • Curve γ : R → M, γ(0) = θ, derivative: ˙ γ(0) = lim t→0 γ(t) − γ(0) t • Tangent space Tθ M = { ˙ γ(0) : γ : R → M, γ(0) = θ} M TθM θ • 3
8. ### Riemannian geometry [Absil et al., 2008] • Riemannian metric ·,

· θ : Tθ M × Tθ M → R • inner product on TθM – bilinear, symmetric, positive deﬁnite • deﬁnes length and relative positions of tangent vectors ξ 2 θ = ξ, ξ θ α(ξ, η) = ξ, η θ ξ θ η θ • Geodesics γ : [0, 1] → M • generalizes straight lines to manifolds – deﬁned by (γ(0), ˙ γ(0)) or (γ(0), γ(1)) • curves on M with zero acceleration: D2γ dt2 = 0 operator D2 dt2 depends on M and ·, · · • Riemannian distance: δ(θ, ˆ θ) = 1 0 ˙ γ(t) γ(t) dt γ: geodesic connecting θ and ˆ θ ⇒ distance = length of γ M TθM θ ξ η α • M θ ˆ θ ˙ γ(t) • • • γ(t) 4
9. ### Riemannian optimization [Absil et al., 2008] argmin θ∈M f (θ)

• framework for optimization on manifold M equipped with metric ·, · · M TθM ξ • θ f (θ) D f (θ)[ξ] • descent direction of f at θ: ξ ∈ Tθ M, D f (θ)[ξ] < 0 • gradient of f at θ: grad f (θ), ξ θ = D f (θ)[ξ] 5
10. ### Riemannian optimization [Absil et al., 2008] • minimize f on

M from θ: descent direction ξ ∈ TθM grad f (θ), ξ θ < 0 retraction of ξ on M M TθM ξ • θ • Rθ(ξ) reiterate until critical point: grad f (θ) = 0 • example of optimization algorithm: gradient descent θi+1 = Rθi (−ti grad f (θi )) 6

12. ### Problem: approximate joint diagonalization • data: set {Ck } of

n × n symmetric positive deﬁnite matrices • goal: ﬁnd joint diagonalizer B of matrices Ck {Ck } ··· k B BCk BT ··· k • formulation: non-singular matrix B ∈ Rn×n such that: argmin B f (B, {Ck }) f – diagonality criterion 7
13. ### Application: blind source separation • instantaneous linear mixing: x(t) =

A s(t) [Comon and Jutten, 2010] • goal: retrieve A and s(t) knowing x(t) s(t) x(t) A Bx(t) B • used in many engineering ﬁelds such as electroencephalography 8
14. ### Geometrical modeling of the problem • goal: get Ck as

close as possible to D++ n in S++ n • diagonality measure of BCk BT : relative position to D++ n D++ n diagonal positive deﬁnite matrices S++ n symmetric positive deﬁnite matrices • • • • • • Ck 9
15. ### Proposed geometrical model • appropriate criterion: f (B, {Ck }))

= k d(BCk BT , Λk (B)) Λk (B): target diagonal matrices d(·, ·) : divergence on S++ n D++ n diagonal positive deﬁnite matrices S++ n symmetric positive deﬁnite matrices • BCk BT • Λk (B) • Λk (B) • natural choice for Λk (B): Λk (B) = argmin Λ∈D++ n d(BCk BT , Λ) 10
16. ### Considered divergences • least squares criterion: Euclidean distance on S++

n [Cardoso and Souloumiac, 1993] δ2 F (C, Λ) = C − Λ 2 F Λ = ddiag(C) • log-likelihood cirterion: left Kullback-Leibler divergence [Pham, 2000] d KL(C, Λ) = dKL(C, Λ) Λ = ddiag(C) where dKL(P, S) = trace(PS−1 − In) − log det(PS−1) • other measures obtained from dKL (·, ·): right measure: drKL(C, Λ) = dKL(Λ, C) Λ = ddiag(C−1)−1 symmetrized measure: dsKL(C, Λ) = 1 2 (dKL(C, Λ) + dKL(Λ, C)) Λ = ddiag(C)1/2 ddiag(C−1)−1/2 11
17. ### Considered divergences • natural Riemannian distance: [Skovgaard, 1984], [Bhatia, 2009]

δ2 R (C, Λ) = log(Λ−1/2CΛ−1/2) 2 F ddiag(log(C−1Λ)) = 0n • log-Euclidean distance: [Arsigny et al., 2007] δ2 LE (C, Λ) = log(C) − log(Λ) 2 F Λ = exp(ddiag(log(C))) • Bhattacharyya distance: [Chebbi and Moakher, 2012], [Sra, 2013] δ2 B (C, Λ) = 4 log det((C + Λ)/2) det(C)1/2 det(Λ)1/2 2 ddiag((C + Λ)−1) = Λ−1 • Wasserstein distance: [Villani, 2008], [Bhatia et al., 2017] δ2 W (C, Λ) = trace 1 2 (C + Λ) − (Λ1/2CΛ1/2)1/2 ddiag((Λ1/2CΛ1/2)1/2) = Λ 12
18. ### First experiment: simulated data Ck = AΛk AT + 1

σ Ek ∆k ET k A, Ek : random elements, standard normal distribution conditioning of A with respect to inversion in [1, 10] Λk , ∆k : diagonal – pth element, χ2 distribution with expectancy 1 and divided by p σ : signal to noise ratio • performance measure: Moreau-Amari criterion [Moreau and Macchi, 1994] measure degree of similarity of BA with matrix of the form PΣ P permutation matrix – Σ non-singular diagonal matrix 13
19. ### First experiment: comparisons of divergences 10 50 100 500 1000

−30 −20 −10 σ IM-A (dB) F W rKL B R sKL KL LE medians, ﬁrst and ninth deciles (error bars) estimated on 50 trials 14
20. ### Second experiment: electroencephalographic data Nasion Inion 1 2 3 4

5 6 7 8 1. Oz 2. O1 13 Hz 17 Hz 3. O2 21 Hz repos 4. POz 10 20 f (Hz) 5. PO3 10 20 6. PO4 10 20 7. PO7 10 20 8. PO8 15
21. ### Obtained sources with proposed methods 10 20 f (Hz) s1

s2 13 Hz 17 Hz 21 Hz repos 16
22. ### Comparisons of divergences performance measure ISSVEP (f ) ∈ [0,

1] for class f ratio between power at f and total power ISSVEP(f ) F W rKL B R sKL KL LE 13 Hz 0, 95 0, 95 0, 96 0, 96 0, 96 0, 96 0, 96 0, 96 17 Hz 0, 87 0, 89 0, 91 0, 91 0, 91 0, 91 0, 91 0, 91 21 Hz 0, 50 0, 54 0, 60 0, 60 0, 60 0, 60 0, 60 0, 60 s1 17

24. ### Intrinsic Cramér-Rao bound • x ∼ L(θ), θ ∈ M,

with log-likelihood Lx : M → R • given x, estimation problem: ˆ θ = argmin θ∈M −Lx (θ) • Cramèr-Rao bound of an unbiased estimator ˆ θ of θ : Eˆ θ [Err(θ, ˆ θ)] F−1 ⇒ MSE = Eˆ θ [err(θ, ˆ θ)] ≥ trace(F−1) F: Fisher information matrix 18
25. ### Intrinsic Cramér-Rao bound Eˆ θ [err(θ, ˆ θ)] ≥ trace(F−1)

• classical case: M = Rd err(θ, ˆ θ) = θ − ˆ θ 2 2 Fij = −Ex ∂2Lx (θ) ∂θi ∂θj ⇒ constraints, invariances of the model not taken into account • generalization to Riemannian manifold M [Smith, 2005, Boumal, 2013] • err(θ, ˆ θ) = δ2 M (θ, ˆ θ) δM distance on M associated to ·, · M · • Fij = ei , ej F θ • ξ, η F θ = −Ex [D2 Lx (θ)[ξ, η]], Fisher metric • {ei } orthonormal basis of TθM associated to ·, · M θ M TθM •θ ⇒ allows to exhibit diﬀerent properties from the classical case 19
26. ### Elliptical distributions / low-rank structure • x ∼ CES(0, R),

R ∈ H++ p , with log-likelihood L++ x (R) = log(g(xH R−1x)) − log det(R)−1 g : R+ → R+ e.g. gaussian (g(t) = exp(−t)), generalized gaussian, Student t • Fisher metric of elliptical distributions on H++ p : [Breloy et al., 2018] ξ, η F,++ R = αF trace(R−1ξR−1η) + βF trace(R−1ξ) trace(R−1η) • low-rank structure – few data [Sun et al., 2016, Bouchard et al., 2020] R = Ip + H H ∈ H+ p,k ⇒ geometry of H+ p,k : several possibilities, none ideal e.g. [Bonnabel and Sepulchre, 2009, Vandereycken et al., 2012, Massart and Absil, 2018] 20
27. ### Geometry of low-rank structured covariance matrices • H = ϕ(U,

Σ) = UΣUH with (U, Σ) ∈ Mp,k = Stp,k × H++ k • ∀O ∈ Uk , ϕ(UO, OH ΣO) = ϕ(U, Σ) ⇒ H+ p,k isomorphic to quotient [Bonnabel and Sepulchre, 2009, Bouchard et al., 2020] Mp,k = Mp,k /Uk = {π(U, Σ) : (U, Σ) ∈ Mp,k } où π(U, Σ) = {(UO, OH ΣO) : O ∈ Uk } Mp,k O → (UO, O H ΣO) • θ = (U, Σ) 21
28. ### quotient Mp,k : Riemannian geometry • θ = (U, Σ)

∈ Mp,k • Tθ Mp,k = {(UΩξ + U⊥ Kξ , ξΣ ) : Ωξ ∈ Ak , Kξ ∈ R(p−k)×k , ξΣ ∈ Sk } U⊥ ∈ Stp,(p−k) such that UH U⊥ = 0 ξ, η θ = 1 2 trace(ΩH ξ Ωη ) + trace(KH ξ Kη ) + α trace(Σ−1ξΣ Σ−1ηΣ ) + β trace(Σ−1ξΣ ) trace(Σ−1ηΣ ) • vertical space – tangent space to the equivalence class Vθ = {(UΩ, ΣΩ − ΩΣ) : Ω ∈ Ak } • horizontal space – orthogonal complement to Vθ according to ·, · θ Hθ = {ξ ∈ Tθ Mp,k : Ωξ = 2α(Σ−1ξΣ − ξΣ Σ−1)} Mp,k O → (UO, O T SO) TθMp,k Vθ Hθ • θ 22
29. ### quotient Mp,k : error measure • Riemannian distance not known

⇒ alternative error measure • alternative horizontal space – complement to Vθ, non orthogonal [Bonnabel and Sepulchre, 2009] Hθ = {ξ ∈ Tθ Mp,k : Ωξ = 0} = {(U⊥ Kξ , ξΣ ) : Kξ ∈ R(p−k)×k , ξΣ ∈ Sk } • induces divergence onto Mp,k d(π(θ), π(ˆ θ)) = Θ 2 2 + α log(Σ−1/2O ˆ OH ˆ Σ ˆ OOH Σ−1/2) 2 2 + β log det(Σ−1 ˆ Σ) 2 UT ˆ U = O cos(Θ) ˆ OT ⇒ err(θ, ˆ θ) = d(θ, ˆ θ) Mp,k O → (UO, O H ΣO) TθMp,k Vθ Hθ Hθ • θ • θ 23
30. ### quotient Mp,k : performance bound Eˆ θ [err(θ, ˆ θ)]

= Eˆ θ [d(θ, ˆ θ)] ≥ trace(F−1 ) Fij = ei , ej F,Mp,k θ • {ei }, orthonormal basis of Hθ according to ·, · θ : {(U⊥Kij , 0), (iU⊥Kij , 0)}1≤i≤p−k 1≤j≤k , {(0, 1 √ α Σ1/2Hij Σ1/2 + √ α− √ α+kβ k √ α √ α+kβ trace(Hij )Σ)}1≤j≤i≤k , {(0, 1 √ α Σ1/2H ij Σ1/2)}1≤j<i≤k • Kij ∈ R(p−k)×k : element ij equals 1, 0 elsewhere • Hii ∈ Rk×k : element ii equals 1, 0 elsewhere • Hij ∈ Rk×k , i = j: elements ij and ji equal 1/√ 2, 0 elsewhere • H ij ∈ iRk×k , i = j: elements ij and ji equal i/√ 2 et −i/√ 2, 0 elsewhere • ·, · F,Mp,k · : Fisher metric onto Mp,k 24
31. ### quotient Mp,k : Fisher metric of elliptical distributions • ϕ(θ)

= UΣUH D ϕ(θ)[ξ] = UξΣ UH + ξU ΣUH + UΣξU • log-likelihood: LMp,k x (θ) = L++ x (Ip + ϕ(θ)) • Fisher metric: ξ, η F,Mp,k θ = D ϕ(θ)[ξ], D ϕ(θ)[η] F,++ Ip+ϕ(θ) where ξR , ηR F,++ R = αF trace(R−1ξR R−1ηR ) + βF trace(R−1ξR ) trace(R−1ηR ) 25
32. ### Numerical illustrations • covariance matrix: R = Ip + σUΣUH

• p = 16, k = 8 • U ∈ Stp,k : random orthgonal matrix • Σ ∈ H++ k : diagonal matrix, minimal / maximal element: 1/ √ c, √ c (c=20: conditionning) other elements: random, uniform distribution between 1/ √ c and √ c trace normalisée : trace(Σ) = trace(Ik ) = k • σ = 50 : free parameter • data : • 500 sets of n ∈ [12, 300] random samples • t-distribution, d = 3 (non gaussian) and d = 100 (almost gaussian) degrees of freedom [Ollila et al., 2012] 26
33. ### Numerical illustrations 101n = p 102 −10 0 10 n

err(θ, ˆ θ) (dB) d = 3 trace(F−1 θ ) T-RGD pSCM T-RTR 101n = p 102 n d = 100 27

35. ### Conclusions and perspectives • Riemannian geometry: • powerful tool for

signal processing and machine learning • exploits intrinsic structure of data and parameters • quite complete and allows modularity – modeling, optimization, performance analysis • when to use Riemannian geometry and optimization ? • data and/or parameters possess a structure: constraints, invariances (geometrical properties of model) • metric of particular interest with respect to model (e.g., information theory) • issue: for some manifolds, geometry complicated to fully characterize e.g., low rank symmetric positive semideﬁnite matrices ⇒ go beyond Riemannian geometry to provide simple and eﬃcient geometrical objects 28
36. ### Absil, P.-A., Mahony, R., and Sepulchre, R. (2008). Optimization Algorithms

on Matrix Manifolds. Princeton University Press, Princeton, NJ, USA. Arsigny, V., Fillard, P., Pennec, X., and Ayache, N. (2007). Geometric means in a novel vector space structure on symmetric positive-deﬁnite matrices. SIAM journal on matrix analysis and applications, 29(1):328–347. Bhatia, R. (2009). Positive deﬁnite matrices. Princeton University Press. Bhatia, R., Jain, T., and Lim, Y. (2017). On the Bures-Wasserstein distance between positive deﬁnite matrices. preprint. Bonnabel, S. and Sepulchre, R. (2009). Riemannian metric and geometric mean for positive semideﬁnite matrices of ﬁxed rank. SIAM Journal on Matrix Analysis and Applications, 31(3):1055–1070. Bouchard, F., Breloy, A., Ginolhac, G., Renaux, A., and Pascal, F. (2020). A Riemannian framework for low-rank structured elliptical models. IEEE Transactions on Signal Processing, submitted. Available on arxiv. Boumal, N. (2013). On intrinsic Cramér-Rao bounds for Riemannian submanifolds and quotient manifolds. IEEE Transactions on Signal Processing, 61(7):1809–1821. Breloy, A., Ginolhac, G., Renaux, A., and Bouchard, F. (2018). Intrinsic Cramér–Rao bounds for scatter and shape matrices estimation in CES distributions. IEEE Signal Processing Letters, 26(2):262–266. Cardoso, J.-F. and Souloumiac, A. (1993). Blind beamforming for non Gaussian signals. IEEE Proceedings-F, 140(6):362–370. 29
37. ### Chebbi, Z. and Moakher, M. (2012). Means of Hermitian positive-deﬁnite

matrices based on the log-determinant α-divergence function. Linear Algebra and its Applications, 436(7):1872–1889. Comon, P. and Jutten, C. (2010). Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, 1st edition. Massart, E. and Absil, P.-A. (2018). Quotient geometry with simple geodesics for the manifold of ﬁxed-rank positive-semideﬁnite matrices. Technical Report UCL-INMA-2018.06. Moreau, E. and Macchi, O. (1994). A one stage self-adaptive algorithm for source separation. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., volume 3, pages 49–52. Ollila, E., Tyler, D. E., Koivunen, V., and Poor, H. V. (2012). Complex elliptically symmetric distributions: Survey, new results and applications. IEEE Transactions on Signal Processing, 60(11):5597–5625. Pham, D.-T. (2000). Joint approximate diagonalization of positive deﬁnite Hermitian matrices. SIAM J. Matrix Anal. Appl., 22(4):1136–1152. Skovgaard, L. T. (1984). A Riemannian geometry of the multivariate normal model. Scandinavian Journal of Statistics, pages 211–223. Smith, S. T. (2005). Covariance, subspace, and intrinsic Cramér-Rao bounds. IEEE Transactions on Signal Processing, 53(5):1610–1630. Sra, S. (2013). Positive deﬁnite matrices and the s-divergence. arXiv preprint arXiv:1110.1773. 30
38. ### Sun, Y., Babu, P., and Palomar, D. P. (2016). Robust

estimation of structured covariance matrix for heavy-tailed elliptical distributions. IEEE Transactions on Signal Processing, 64(14):3576–3590. Vandereycken, B., Absil, P.-A., and Vandewalle, S. (2012). A Riemannian geometry with complete geodesics for the set of positive semideﬁnite matrices of ﬁxed rank. IMA Journal of Numerical Analysis, 33(2):481–514. Villani, C. (2008). Optimal transport: old and new, volume 338. Springer Science & Business Media. 31