S³ Seminar
November 27, 2020
510

# Florent Bouchard

(L2S — CNRS, Université Paris-Saclay, CentraleSupélec)

https://s3-seminar.github.io/seminars/florent-bouchard/

Title — Riemannian geometry for data analysis: illustration on blind source separation and low-rank structured covariance matrices

Abstract — In this presentation, Riemannian geometry for data analysis is introduced. In particular, it is applied on two specific statistical signal processing problems: blind source separation and low-rank structured covariance matrices. Blind source separation can be solved by jointly diagonalizing some covariance matrices. We show here how geometry can be exploited in order to provide original joint diagonalization criteria and ways to compare them theoretically. These results are illustrated with numerical experiments both on simulated and real data. Concerning low-rank structured covariance matrices, an intrinsic Cramér-Rao bound of the corresponding estimation problem is presented, illustrating the interest of geometry for performance analysis. These results are validated with simulations.

## S³ Seminar

November 27, 2020

## Transcript

1. Riemannian geometry for data analysis:
application to blind source separation and low-rank structured
covariance matrices
Florent Bouchard (L2S – CNRS, Univ. Paris-Saclay, CentraleSupélec)
S3 seminar – November, 25 2020

2. Introduction

3. Introduction – Statistical data analysis
• Characterize data x with some parameters θ
x θ
• To estimate θ, optimization problem:
argmin
θ∈M
f (x, θ)
• f : X × M → R, cost function corresponding to the model
• x and θ might possess a structure ⇒ X and M are manifolds
1

4. Introduction – Riemannian geometry
Geometry: metric, curves, distance or divergence
treat diﬃcult problems, complicated to handle with other approaches
• modeling: design appropriate cost functions for the considered model
exploit the structure of data x, e.g. centers of mass
• optimization: generic convex and non-convex methods, modularity
naturally takes into account structure of parameters θ
• performance analysis: error measures and associated performance bounds
captures the geometrical structure of the problem
M
TθM
•θ
2

5. Introduction
Riemannian geometry and optimization
Approximate joint diagonalization for blind source separation
Intrinsic Cramér-Rao bound for low-rank structured elliptical models
Conclusions and perspectives

6. Riemannian geometry and optimization

7. Riemannian geometry [Absil et al., 2008]
• Manifold M: locally diﬀeomorphic to Rd , with dim(M) = d, i.e.
∀θ ∈ M, ∃ Uθ
⊂ M and ϕθ
: Uθ
→ Rd , diﬀeomorphism
M

θ

• Curve γ : R → M, γ(0) = θ, derivative: ˙
γ(0) = lim
t→0
γ(t) − γ(0)
t
• Tangent space Tθ
M = { ˙
γ(0) : γ : R → M, γ(0) = θ}
M
TθM
θ •
3

8. Riemannian geometry [Absil et al., 2008]
• Riemannian metric ·, · θ
: Tθ
M × Tθ
M → R
• inner product on TθM – bilinear, symmetric, positive deﬁnite
• deﬁnes length and relative positions of tangent vectors
ξ 2
θ
= ξ, ξ θ α(ξ, η) =
ξ, η θ
ξ
θ
η
θ
• Geodesics γ : [0, 1] → M
• generalizes straight lines to manifolds – deﬁned by (γ(0), ˙
γ(0)) or (γ(0), γ(1))
• curves on M with zero acceleration:
D2γ
dt2
= 0
operator D2
dt2
depends on M and ·, · ·
• Riemannian distance: δ(θ, ˆ
θ) =
1
0
˙
γ(t)
γ(t)
dt
γ: geodesic connecting θ and ˆ
θ ⇒ distance = length of γ
M
TθM
θ
ξ
η
α

M
θ ˆ
θ
˙
γ(t)
• •

γ(t)
4

9. Riemannian optimization [Absil et al., 2008]
argmin
θ∈M
f (θ)
• framework for optimization on manifold M equipped with metric ·, · ·
M
TθM
ξ

θ
f (θ)
D f (θ)[ξ]
• descent direction of f at θ:
ξ ∈ Tθ
M, D f (θ)[ξ] < 0
• gradient of f at θ:
= D f (θ)[ξ]
5

10. Riemannian optimization [Absil et al., 2008]
• minimize f on M from θ:
descent direction ξ ∈ TθM
grad f (θ), ξ θ < 0
retraction of ξ on M
M
TθM ξ

θ

Rθ(ξ)
reiterate until critical point: grad f (θ) = 0
• example of optimization algorithm: gradient descent
θi+1
= Rθi
(−ti
))
6

11. Approximate joint diagonalization for
blind source separation

12. Problem: approximate joint diagonalization
• data: set {Ck
} of n × n symmetric positive deﬁnite matrices
• goal: ﬁnd joint diagonalizer B of matrices Ck
{Ck }
···
k
B
BCk BT
···
k
• formulation: non-singular matrix B ∈ Rn×n such that:
argmin
B
f (B, {Ck
})
f – diagonality criterion
7

13. Application: blind source separation
• instantaneous linear mixing: x(t) = A s(t) [Comon and Jutten, 2010]
• goal: retrieve A and s(t) knowing x(t)
s(t) x(t)
A Bx(t)
B
• used in many engineering ﬁelds such as electroencephalography
8

14. Geometrical modeling of the problem
• goal: get Ck
as close as possible to D++
n
in S++
n
• diagonality measure of BCk
BT : relative position to D++
n
D++
n
diagonal positive
deﬁnite matrices
S++
n
symmetric positive
deﬁnite matrices

• •
Ck
9

15. Proposed geometrical model
• appropriate criterion: f (B, {Ck
})) =
k
d(BCk
BT , Λk
(B))
Λk (B): target diagonal matrices
d(·, ·) : divergence on S++
n
D++
n
diagonal positive
deﬁnite matrices
S++
n
symmetric positive
deﬁnite matrices

BCk BT

Λk (B)

Λk (B)
• natural choice for Λk
(B):
Λk
(B) = argmin
Λ∈D++
n
d(BCk
BT , Λ)
10

16. Considered divergences
• least squares criterion: Euclidean distance on S++
n
[Cardoso and Souloumiac, 1993]
δ2
F
(C, Λ) = C − Λ 2
F
Λ = ddiag(C)
• log-likelihood cirterion: left Kullback-Leibler divergence [Pham, 2000]
d KL(C, Λ) = dKL(C, Λ) Λ = ddiag(C)
where dKL(P, S) = trace(PS−1 − In) − log det(PS−1)
• other measures obtained from dKL
(·, ·):
right measure:
drKL(C, Λ) = dKL(Λ, C) Λ = ddiag(C−1)−1
symmetrized measure:
dsKL(C, Λ) =
1
2
(dKL(C, Λ) + dKL(Λ, C)) Λ = ddiag(C)1/2 ddiag(C−1)−1/2
11

17. Considered divergences
• natural Riemannian distance: [Skovgaard, 1984], [Bhatia, 2009]
δ2
R
(C, Λ) = log(Λ−1/2CΛ−1/2)
2
F
ddiag(log(C−1Λ)) = 0n
• log-Euclidean distance: [Arsigny et al., 2007]
δ2
LE
(C, Λ) = log(C) − log(Λ) 2
F
Λ = exp(ddiag(log(C)))
• Bhattacharyya distance: [Chebbi and Moakher, 2012], [Sra, 2013]
δ2
B
(C, Λ) = 4 log
det((C + Λ)/2)
det(C)1/2 det(Λ)1/2
2 ddiag((C + Λ)−1) = Λ−1
• Wasserstein distance: [Villani, 2008], [Bhatia et al., 2017]
δ2
W
(C, Λ) = trace
1
2
(C + Λ) − (Λ1/2CΛ1/2)1/2 ddiag((Λ1/2CΛ1/2)1/2) = Λ
12

18. First experiment: simulated data
Ck
= AΛk
AT +
1
σ
Ek
∆k
ET
k
A, Ek : random elements, standard normal distribution
conditioning of A with respect to inversion in [1, 10]
Λk , ∆k : diagonal – pth element, χ2 distribution with expectancy 1 and divided
by p
σ : signal to noise ratio
• performance measure: Moreau-Amari criterion [Moreau and Macchi, 1994]
measure degree of similarity of BA with matrix of the form PΣ
P permutation matrix – Σ non-singular diagonal matrix
13

19. First experiment: comparisons of divergences
10 50 100 500 1000
−30
−20
−10
σ
IM-A (dB)
F W
rKL B
R sKL
KL LE
medians, ﬁrst and ninth deciles (error bars) estimated on 50 trials
14

20. Second experiment: electroencephalographic data
Nasion
Inion
1
2 3
4
5 6
7 8
1. Oz 2. O1
13 Hz
17 Hz
3. O2
21 Hz
repos
4. POz
10 20
f (Hz)
5. PO3
10 20
6. PO4
10 20
7. PO7
10 20
8. PO8
15

21. Obtained sources with proposed methods
10 20
f (Hz)
s1 s2 13 Hz
17 Hz
21 Hz
repos
16

22. Comparisons of divergences
performance measure ISSVEP
(f ) ∈ [0, 1] for class f
ratio between power at f and total power
ISSVEP(f ) F W rKL B R sKL KL LE
13 Hz 0, 95 0, 95 0, 96 0, 96 0, 96 0, 96 0, 96 0, 96
17 Hz 0, 87 0, 89 0, 91 0, 91 0, 91 0, 91 0, 91 0, 91
21 Hz 0, 50 0, 54 0, 60 0, 60 0, 60 0, 60 0, 60 0, 60
s1
17

23. Intrinsic Cramér-Rao bound for low-rank
structured elliptical models

24. Intrinsic Cramér-Rao bound
• x ∼ L(θ), θ ∈ M, with log-likelihood Lx
: M → R
• given x, estimation problem:
ˆ
θ = argmin
θ∈M
−Lx
(θ)
• Cramèr-Rao bound of an unbiased estimator ˆ
θ of θ :

θ
[Err(θ, ˆ
θ)] F−1 ⇒ MSE = Eˆ
θ
[err(θ, ˆ
θ)] ≥ trace(F−1)
F: Fisher information matrix
18

25. Intrinsic Cramér-Rao bound

θ
[err(θ, ˆ
θ)] ≥ trace(F−1)
• classical case: M = Rd
err(θ, ˆ
θ) = θ − ˆ
θ 2
2
Fij
= −Ex
∂2Lx
(θ)
∂θi ∂θj
⇒ constraints, invariances of the model not taken into account
• generalization to Riemannian manifold M [Smith, 2005, Boumal, 2013]
• err(θ, ˆ
θ) = δ2
M
(θ, ˆ
θ)
δM distance on M associated to ·, · M
·
• Fij
= ei , ej
F
θ
• ξ, η F
θ
= −Ex [D2 Lx (θ)[ξ, η]],
Fisher metric
• {ei } orthonormal basis of TθM
associated to ·, · M
θ
M
TθM
•θ
⇒ allows to exhibit diﬀerent properties from the classical case
19

26. Elliptical distributions / low-rank structure
• x ∼ CES(0, R), R ∈ H++
p
, with log-likelihood
L++
x
(R) = log(g(xH R−1x)) − log det(R)−1 g : R+ → R+
e.g. gaussian (g(t) = exp(−t)), generalized gaussian, Student t
• Fisher metric of elliptical distributions on H++
p
: [Breloy et al., 2018]
ξ, η F,++
R
= αF trace(R−1ξR−1η) + βF trace(R−1ξ) trace(R−1η)
• low-rank structure – few data [Sun et al., 2016, Bouchard et al., 2020]
R = Ip
+ H H ∈ H+
p,k
⇒ geometry of H+
p,k
: several possibilities, none ideal
e.g. [Bonnabel and Sepulchre, 2009, Vandereycken et al., 2012, Massart and Absil, 2018]
20

27. Geometry of low-rank structured covariance matrices
• H = ϕ(U, Σ) = UΣUH with (U, Σ) ∈ Mp,k
= Stp,k
× H++
k
• ∀O ∈ Uk
, ϕ(UO, OH ΣO) = ϕ(U, Σ)
⇒ H+
p,k
isomorphic to quotient [Bonnabel and Sepulchre, 2009, Bouchard et al., 2020]
Mp,k
= Mp,k
/Uk
= {π(U, Σ) : (U, Σ) ∈ Mp,k
}
où π(U, Σ) = {(UO, OH ΣO) : O ∈ Uk
}
Mp,k
O → (UO, O H
ΣO)

θ = (U, Σ)
21

28. quotient Mp,k
: Riemannian geometry
• θ = (U, Σ) ∈ Mp,k
• Tθ
Mp,k
= {(UΩξ
+ U⊥

, ξΣ
) : Ωξ
∈ Ak
, Kξ
∈ R(p−k)×k , ξΣ
∈ Sk
}
U⊥ ∈ Stp,(p−k)
such that UH U⊥
= 0
ξ, η θ
=
1
2
trace(ΩH
ξ
Ωη
) + trace(KH
ξ

)
+ α trace(Σ−1ξΣ
Σ−1ηΣ
) + β trace(Σ−1ξΣ
) trace(Σ−1ηΣ
)
• vertical space – tangent space to the equivalence class

= {(UΩ, ΣΩ − ΩΣ) : Ω ∈ Ak
}
• horizontal space – orthogonal complement to Vθ according to ·, · θ

= {ξ ∈ Tθ
Mp,k
: Ωξ
= 2α(Σ−1ξΣ
− ξΣ
Σ−1)}
Mp,k
O → (UO, O T
SO)
TθMp,k Vθ

θ
22

29. quotient Mp,k
: error measure
• Riemannian distance not known ⇒ alternative error measure
• alternative horizontal space – complement to Vθ, non orthogonal
[Bonnabel and Sepulchre, 2009]

= {ξ ∈ Tθ
Mp,k
: Ωξ
= 0}
= {(U⊥

, ξΣ
) : Kξ
∈ R(p−k)×k , ξΣ
∈ Sk
}
• induces divergence onto Mp,k
d(π(θ), π(ˆ
θ)) = Θ 2
2
+ α log(Σ−1/2O ˆ
OH ˆ
Σ ˆ
OOH Σ−1/2)
2
2
+ β log det(Σ−1 ˆ
Σ)
2
UT ˆ
U = O cos(Θ) ˆ
OT
⇒ err(θ, ˆ
θ) = d(θ, ˆ
θ)
Mp,k
O → (UO, O H
ΣO)
TθMp,k Vθ

• θ

θ
23

30. quotient Mp,k
: performance bound

θ
[err(θ, ˆ
θ)] = Eˆ
θ
[d(θ, ˆ
θ)] ≥ trace(F−1
)
Fij
= ei , ej
F,Mp,k
θ
• {ei }, orthonormal basis of Hθ
according to ·, · θ
:
{(U⊥Kij , 0), (iU⊥Kij , 0)}1≤i≤p−k
1≤j≤k
,
{(0, 1

α
Σ1/2Hij Σ1/2 +

α−

α+kβ
k

α

α+kβ
trace(Hij )Σ)}1≤j≤i≤k ,
{(0, 1

α
Σ1/2H
ij
Σ1/2)}1≤j• Kij ∈ R(p−k)×k : element ij equals 1, 0 elsewhere
• Hii ∈ Rk×k : element ii equals 1, 0 elsewhere
• Hij ∈ Rk×k , i = j: elements ij and ji equal 1/√
2, 0 elsewhere
• H
ij
∈ iRk×k , i = j: elements ij and ji equal i/√
2 et −i/√
2, 0 elsewhere
• ·, · F,Mp,k
·
: Fisher metric onto Mp,k
24

31. quotient Mp,k
: Fisher metric of elliptical distributions
• ϕ(θ) = UΣUH D ϕ(θ)[ξ] = UξΣ
UH + ξU
ΣUH + UΣξU
• log-likelihood: LMp,k
x
(θ) = L++
x
(Ip
+ ϕ(θ))
• Fisher metric:
ξ, η F,Mp,k
θ
= D ϕ(θ)[ξ], D ϕ(θ)[η] F,++
Ip+ϕ(θ)
where
ξR
, ηR
F,++
R
= αF trace(R−1ξR
R−1ηR
) + βF trace(R−1ξR
) trace(R−1ηR
)
25

32. Numerical illustrations
• covariance matrix: R = Ip
+ σUΣUH
• p = 16, k = 8
• U ∈ Stp,k : random orthgonal matrix
• Σ ∈ H++
k
: diagonal matrix,
minimal / maximal element: 1/

c,

c (c=20: conditionning)
other elements: random, uniform distribution between 1/

c
and

c
trace normalisée : trace(Σ) = trace(Ik ) = k
• σ = 50 : free parameter
• data :
• 500 sets of n ∈ [12, 300] random samples
• t-distribution, d = 3 (non gaussian) and d = 100 (almost gaussian) degrees of
freedom
[Ollila et al., 2012]
26

33. Numerical illustrations
101n = p 102
−10
0
10
n
err(θ, ˆ
θ) (dB)
d = 3
trace(F−1
θ
) T-RGD
pSCM T-RTR
101n = p 102
n
d = 100
27

34. Conclusions and perspectives

35. Conclusions and perspectives
• Riemannian geometry:
• powerful tool for signal processing and machine learning
• exploits intrinsic structure of data and parameters
• quite complete and allows modularity – modeling, optimization,
performance analysis
• when to use Riemannian geometry and optimization ?
• data and/or parameters possess a structure: constraints, invariances
(geometrical properties of model)
• metric of particular interest with respect to model (e.g., information theory)
• issue: for some manifolds, geometry complicated to fully characterize
e.g., low rank symmetric positive semideﬁnite matrices
⇒ go beyond Riemannian geometry to provide simple and eﬃcient
geometrical objects
28

36. Absil, P.-A., Mahony, R., and Sepulchre, R. (2008).
Optimization Algorithms on Matrix Manifolds.
Princeton University Press, Princeton, NJ, USA.
Arsigny, V., Fillard, P., Pennec, X., and Ayache, N. (2007).
Geometric means in a novel vector space structure on symmetric positive-deﬁnite matrices.
SIAM journal on matrix analysis and applications, 29(1):328–347.
Bhatia, R. (2009).
Positive deﬁnite matrices.
Princeton University Press.
Bhatia, R., Jain, T., and Lim, Y. (2017).
On the Bures-Wasserstein distance between positive deﬁnite matrices.
preprint.
Bonnabel, S. and Sepulchre, R. (2009).
Riemannian metric and geometric mean for positive semideﬁnite matrices of ﬁxed rank.
SIAM Journal on Matrix Analysis and Applications, 31(3):1055–1070.
Bouchard, F., Breloy, A., Ginolhac, G., Renaux, A., and Pascal, F. (2020).
A Riemannian framework for low-rank structured elliptical models.
IEEE Transactions on Signal Processing, submitted. Available on arxiv.
Boumal, N. (2013).
On intrinsic Cramér-Rao bounds for Riemannian submanifolds and quotient manifolds.
IEEE Transactions on Signal Processing, 61(7):1809–1821.
Breloy, A., Ginolhac, G., Renaux, A., and Bouchard, F. (2018).
Intrinsic Cramér–Rao bounds for scatter and shape matrices estimation in CES distributions.
IEEE Signal Processing Letters, 26(2):262–266.
Cardoso, J.-F. and Souloumiac, A. (1993).
Blind beamforming for non Gaussian signals.
IEEE Proceedings-F, 140(6):362–370.
29

37. Chebbi, Z. and Moakher, M. (2012).
Means of Hermitian positive-deﬁnite matrices based on the log-determinant α-divergence function.
Linear Algebra and its Applications, 436(7):1872–1889.
Comon, P. and Jutten, C. (2010).
Handbook of Blind Source Separation: Independent Component Analysis and Applications.
Massart, E. and Absil, P.-A. (2018).
Quotient geometry with simple geodesics for the manifold of ﬁxed-rank positive-semideﬁnite matrices.
Technical Report UCL-INMA-2018.06.
Moreau, E. and Macchi, O. (1994).
A one stage self-adaptive algorithm for source separation.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., volume 3,
pages 49–52.
Ollila, E., Tyler, D. E., Koivunen, V., and Poor, H. V. (2012).
Complex elliptically symmetric distributions: Survey, new results and applications.
IEEE Transactions on Signal Processing, 60(11):5597–5625.
Pham, D.-T. (2000).
Joint approximate diagonalization of positive deﬁnite Hermitian matrices.
SIAM J. Matrix Anal. Appl., 22(4):1136–1152.
Skovgaard, L. T. (1984).
A Riemannian geometry of the multivariate normal model.
Scandinavian Journal of Statistics, pages 211–223.
Smith, S. T. (2005).
Covariance, subspace, and intrinsic Cramér-Rao bounds.
IEEE Transactions on Signal Processing, 53(5):1610–1630.
Sra, S. (2013).
Positive deﬁnite matrices and the s-divergence.
arXiv preprint arXiv:1110.1773.
30

38. Sun, Y., Babu, P., and Palomar, D. P. (2016).
Robust estimation of structured covariance matrix for heavy-tailed elliptical distributions.
IEEE Transactions on Signal Processing, 64(14):3576–3590.
Vandereycken, B., Absil, P.-A., and Vandewalle, S. (2012).
A Riemannian geometry with complete geodesics for the set of positive semideﬁnite matrices of ﬁxed rank.
IMA Journal of Numerical Analysis, 33(2):481–514.
Villani, C. (2008).
Optimal transport: old and new, volume 338.