Y-h. Taguchi
February 15, 2021

# MATHEMATICAL FORMULATION AND APPLICATION OF KERNEL TENSOR DECOMPOSITION BASED UNSUPERVISED FEATURE EXTRACTION

Seminar at IEEE Bangalore, 15th Feb 2021
https://events.vtools.ieee.org/m/260534
http://doi.org/10.1016/j.knosys.2021.106834

## Transcript

1. 1
MATHEMATICAL FORMULATION AND APPLICATION
OF KERNEL TENSOR DECOMPOSITION BASED
UNSUPERVISED FEATURE EXTRACTION
Y-h. Taguchi
Department of Physics, Chuo University
Tokyo, Japan
Published in Knowledge-based systems (IF=5.9)
https://doi.org/10.1016/j.knosys.2021.106834

2. 2
Purpose:
Identify a small number of critical variables among a large number of variables (= p), based on a small number of samples (= n)
(the so-called “large p, small n” problem).
→ Difficult because:
Statistical tests:
Small n → P-values are not small enough (not significant enough).
Large p → strong multiple-comparison correction (corrected P-values take larger values).
→ No significant P-values at all.

3. 3
Feature selection methods (e.g., lasso, random forest):
“large p, small n” → overfitting.
Selection that is over-optimized toward one specific small set of n samples results in “sample-specific variable selection”.
→ A different set of variables will be selected if another small set of n samples is used.

4. 4
Try a synthetic example with $p \gg n$, i.e., $p/n \sim 10^2$.

5. 5
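[Figure: layout of the synthetic data set $x_{ijk}$: $N$ variables, each measured on an $M \times M$ grid of measurements, i.e., $M^2$ samples per variable; entries are Gaussian with zero mean or Gaussian with non-zero mean.]
$i \le N_1$: distinct between $j, k \le M/2$ and the others.
$i > N_1$: no distinction.
Can the $N_1$ variables be identified correctly?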
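
A minimal sketch of how such a data set could be generated with NumPy. The parameter values ($N = 10^3$, $N_1 = 10$, $M = 6$, $\mu = 2$, $\sigma = 1$) are the ones quoted with the t test results; the function name `make_synthetic`, the random seed, and the choice to place the non-zero mean in the block $j, k \le M/2$ are assumptions made for illustration, not details confirmed by the slides.

```python
import numpy as np

def make_synthetic(N=1000, N1=10, M=6, mu=2.0, sigma=1.0, seed=0):
    """Return x[i, j, k]: N variables, each with M x M measurements."""
    rng = np.random.default_rng(seed)
    # Zero-mean Gaussian noise for every variable and every measurement
    x = rng.normal(0.0, sigma, size=(N, M, M))
    half = M // 2
    # Assumption: only the first N1 variables get a non-zero mean,
    # and only inside the block j, k <= M/2
    x[:N1, :half, :half] += mu
    return x

x = make_synthetic()
print(x.shape)
```

With these defaults there are 1000 variables and 36 samples per variable, so the number of variables exceeds the number of samples by more than an order of magnitude.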

6. 6
Strategy 1
● Apply a t test to each variable to test whether it is distinct between the two classes (i.e., $j, k \le M/2$ vs. the others).
● Correct the computed P-values for multiple comparisons with the Benjamini-Hochberg method.
● Select the variables with corrected P-values < 0.05.
[Figure: the $M \times M$ grid of $(j, k)$ with the $M/2 \times M/2$ block marked.]
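
The multiple-comparison step can be sketched as follows. This is a plain NumPy version of the Benjamini-Hochberg adjustment; the function name `benjamini_hochberg` and the example P-values are hypothetical, and the per-variable t test that would produce the raw P-values is not shown here.

```python
import numpy as np

def benjamini_hochberg(p):
    """Return Benjamini-Hochberg adjusted P-values for a 1-D array p."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)                       # indices of ascending raw P-values
    scaled = p[order] * n / np.arange(1, n + 1) # p_(r) * n / r for rank r
    # Enforce monotonicity, starting from the largest P-value
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.minimum(adjusted, 1.0)
    out = np.empty(n)
    out[order] = adjusted                       # back to the original order
    return out

p_raw = np.array([0.001, 0.008, 0.039, 0.041, 0.6])  # hypothetical raw P-values
p_adj = benjamini_hochberg(p_raw)
selected = p_adj < 0.05
```

In this toy input the third and fourth raw P-values are below 0.05 but are no longer selected after correction, which is the effect that grows stronger as the number of variables increases.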

7. 7
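Results of the t test ($N = 10^3$, $N_1 = 10$, $M = 6$; Gaussian distribution with mean $\mu = 2$ and SD $\sigma = 1$), averaged over 100 independent trials:

| | $i > N_1$ | $i \le N_1$ |
|---|---|---|
| P > 0.05 | 989.3 | 3.4 |
| P ≤ 0.05 | 0.7 | 6.6 |

Layout of the confusion matrix:

| Prediction / Fact | N | P |
|---|---|---|
| N | TN | FN |
| P | FP | TP |

Ideal result:

| Prediction / Fact | N | P |
|---|---|---|
| N | 990 | 0 |
| P | 0 | 10 |

Matthews correlation coefficient (MCC):

$$\mathrm{MCC} = \frac{TP \times TN - FN \times FP}{\sqrt{(TN+FP)(FN+TP)(TN+FN)(FP+TP)}} \sim 0.77$$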
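
As a quick check of the formula, the sketch below evaluates the MCC on the averaged t test counts from this slide. The function name `mcc` is hypothetical, and applying the formula to averaged counts is an assumption: the reported value may instead be an average of per-trial MCCs, which need not be identical.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    num = tp * tn - fn * fp
    den = math.sqrt((tn + fp) * (fn + tp) * (tn + fn) * (fp + tp))
    return num / den if den > 0 else 0.0

# Averaged counts for the t test (true members are the variables with i <= N_1)
t_test = mcc(tp=6.6, tn=989.3, fp=0.7, fn=3.4)
ideal = mcc(tp=10, tn=990, fp=0, fn=0)
print(round(t_test, 2), ideal)   # 0.77 1.0
```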

8. 8
Lasso ($N_1 = 10$ given, since no P-values are computed)

| | $i > N_1$ | $i \le N_1$ |
|---|---|---|
| Not selected | 989.4 | 2.4 |
| Selected | 0.6 | 7.6 |

MCC ~ 0.84

Random forest ($N_1 = 10$ given, since no P-values are computed)

| | $i > N_1$ | $i \le N_1$ |
|---|---|---|
| Not selected | 988.2 | 1.8 |
| Selected | 1.8 | 8.2 |

MCC ~ 0.81

9. 9
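Singular value decomposition (SVD)

$$x_{ij} \simeq \sum_{l=1}^{L} u_{li} \lambda_l v_{lj}$$

[Figure: the $N \times M$ matrix $x_{ij}$ factorized as $(u_{li})^T$ ($N \times L$) ⨉ $\lambda_l$ ($L \times L$) ⨉ $v_{lj}$ ($L \times M$).]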
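
The rank-$L$ approximation above can be sketched with `numpy.linalg.svd`. The matrix sizes and the random data are hypothetical; the arrays `u`, `lam`, and `v` are stored so that their indices match $u_{li}$, $\lambda_l$, and $v_{lj}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 100, 6, 3
x = rng.normal(size=(N, M))          # hypothetical data matrix x[i, j]

# x = U @ diag(s) @ Vt, with singular values in descending order
U, s, Vt = np.linalg.svd(x, full_matrices=False)

u = U[:, :L].T                       # u[l, i]
lam = s[:L]                          # lambda_l
v = Vt[:L, :]                        # v[l, j]

# x_ij ~= sum_l u_li * lambda_l * v_lj
x_approx = np.einsum("li,l,lj->ij", u, lam, v)
rel_err = np.linalg.norm(x - x_approx) / np.linalg.norm(x)
```

Keeping all $M$ components instead of $L$ reproduces `x` up to rounding error, so `rel_err` measures only what the truncation discards.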

10. 10
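HOSVD (Higher Order Singular Value Decomposition)
Extension to tensors:

$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$$

[Figure: the $N \times M \times K$ tensor $x_{ijk}$ decomposed into the core tensor $G$ ($L_1 \times L_2 \times L_3$) and the singular value vectors $u_{l_1 i}$, $u_{l_2 j}$, $u_{l_3 k}$.]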
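
A compact sketch of HOSVD in NumPy, assuming the standard construction: each factor matrix comes from the SVD of the corresponding mode unfolding, and the core tensor is obtained by projecting the data onto those factors. The helper names `unfold` and `hosvd` and the tensor sizes are hypothetical.

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor t so that axis `mode` indexes the rows."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(t):
    """Return the core tensor G and one factor matrix per mode.

    Column l of each factor matrix is the l-th singular value vector,
    i.e. factor[i, l] corresponds to u_{l i} in the slide notation.
    """
    factors = []
    for mode in range(t.ndim):
        U, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(U)
    core = t
    for mode, U in enumerate(factors):
        # Multiply mode `mode` of the core by U^T
        core = np.moveaxis(np.tensordot(U.T, core, axes=(1, mode)), 0, mode)
    return core, factors

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 4, 5))               # hypothetical x[i, j, k]
G, (u_i, u_j, u_k) = hosvd(x)

# x_ijk = sum over l1, l2, l3 of G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k}
x_rebuilt = np.einsum("abc,ia,jb,kc->ijk", G, u_i, u_j, u_k)
```

Without truncation the reconstruction is exact up to rounding; restricting the sums to the leading $L_1$, $L_2$, $L_3$ components gives the approximation written above.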

11. 11
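[Figure: the synthetic data set ($N$ variables, of which the first $N_1$ are distinct; $M \times M$ measurements, i.e., $M^2$ samples per variable; Gaussian with zero mean or non-zero mean), to which HOSVD is applied.]

$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$$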

12. 12
[Figure: $u_{1j}$, $u_{1k}$, and $u_{1i}$ plotted against $j$, $k$, and $i$; the variables with $i \le N_1$ are marked.]

13. 13
[Figure: two panels showing $u_{1i}$; the variables with $i \le N_1$ are marked.]

14. 14
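Assuming that $u_{1i}$ obeys a Gaussian distribution (null hypothesis), P-values are attributed to individual variables ($i$) using the $\chi^2$ distribution:

$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{1i}}{\sigma_1} \right)^2 \right]$$

[Figure: two panels showing $-\log_{10} P_i$; the variables with $i \le N_1$ are marked.]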
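
The P-value assignment can be sketched with the standard library only. Two assumptions are made here that the slide does not state: the $\chi^2$ distribution has one degree of freedom (in which case its upper tail at $z^2$ equals $\mathrm{erfc}(|z|/\sqrt{2})$), and $\sigma_1$ is estimated as the standard deviation of $u_{1i}$ over all $i$. The function name `chi2_pvalues` and the toy vector are hypothetical.

```python
import math
import numpy as np

def chi2_pvalues(u):
    """P_i = P[chi2 with 1 dof > (u_i / sigma)^2] for a 1-D array u."""
    u = np.asarray(u, dtype=float)
    sigma = u.std()                  # assumption: sigma estimated from u itself
    z = np.abs(u) / sigma
    # For one degree of freedom, P[chi2 > z^2] = erfc(z / sqrt(2))
    return np.array([math.erfc(zi / math.sqrt(2.0)) for zi in z])

rng = np.random.default_rng(0)
u = rng.normal(0.0, 0.01, size=1000) # hypothetical singular value vector
u[:10] += 0.1                        # ten planted outliers
p = chi2_pvalues(u)
```

Components that sit far from the bulk of the distribution receive very small P-values, which is what allows a threshold on the (corrected) P-values to replace a preset number of selected variables.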

15. 15
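Variables with $P_i$ < 0.05 are selected.

| | $i > N_1$ | $i \le N_1$ |
|---|---|---|
| P > 0.05 | 989.9 | 2.2 |
| P ≤ 0.05 | 0.1 | 7.8 |

MCC ~ 0.88

| Method | MCC |
|---|---|
| t test | ~ 0.77 |
| Lasso | ~ 0.84 |
| Random forest | ~ 0.81 |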

16. 16
We named this strategy “TD (tensor decomposition) based unsupervised FE (feature extraction)”. It is described in detail in my recently published book:
Unsupervised Feature Extraction Applied to Bioinformatics, Springer International, 2020.

17. 17
Advantages of TD based unsupervised FE
1) It is well suited to feature selection in “large p, small n” problems.
2) In contrast to conventional feature selection methods (e.g., lasso and random forest), no knowledge of the number of variables to select is required. Variables can be selected using P-values, as in conventional statistical tests.

18. 18
3) In contrast to conventional statistical tests (e.g., t test), it works in “large p, small n” problems, performing at least comparably to conventional feature selection methods that require the number of selected variables to be given.
MCC ~ 0.88 (TD based unsupervised FE) vs. MCC ~ 0.77 (t test).
4) TD based unsupervised FE is an unsupervised method, since it does not require knowledge about classes or labeling when the singular value vectors ($u_{l_1 i}$, $u_{l_2 j}$, $u_{l_3 k}$) are generated:

$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$$

19. 19
Application to a real example

20. 20

21. 21
Data set: GSE147507
Gene expression of human lung cell lines with/without SARS-CoV-2 infection.
i: genes (21797)
j: j=1: Calu3, j=2: NHBE, j=3: A549 (MOI 0.2), j=4: A549 (MOI 2.0), j=5: A549 (ACE2 expressed)
(MOI: multiplicity of infection)
k: k=1: mock, k=2: SARS-CoV-2 infected
m: three biological replicates

22. 22
$$x_{ijkm} \in \mathbb{R}^{21797 \times 5 \times 2 \times 3}$$

$$x_{ijkm} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4) u_{l_1 j} u_{l_2 k} u_{l_3 m} u_{l_4 i}$$

$u_{l_1 j}$: $l_1$th dependence upon cell lines
$u_{l_2 k}$: $l_2$th dependence upon the presence or absence of SARS-CoV-2 infection
$u_{l_3 m}$: $l_3$th dependence upon biological replicates
$u_{l_4 i}$: $l_4$th dependence upon genes
$G$: weight of individual terms

23. 23
Purpose: identification of $l_1$, $l_2$, $l_3$ that are independent of cell lines and biological replicates ($u_{l_1 j}$ and $u_{l_3 m}$ take constant values regardless of $j$, $m$) and dependent upon the presence or absence of SARS-CoV-2 infection ($u_{l_2 1} = -u_{l_2 2}$).

Heavy “large p, small n” problem:
Number of variables (= p): 21797 ~ $10^4$
Number of samples (= n): 5 ⨉ 2 ⨉ 3 = 30 ~ 10
p/n ~ $10^3$

24. 24
[Figure: three panels. $l_1 = 1$: cell lines; $l_2 = 2$: with and without SARS-CoV-2 infection; $l_3 = 1$: biological replicate.]
Independent of cell lines and biological replicate, but dependent upon SARS-CoV-2 infection.

25. 25
$l_1 = 1$, $l_2 = 2$, $l_3 = 1$
For which $l_4$ is $|G|$ the largest?
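
The lookup on this slide amounts to scanning one fiber of the core tensor. The sketch below uses a hypothetical random core tensor with one entry planted so that the example has a clear answer; the axis order $(l_1, l_2, l_3, l_4)$, the tensor size, and the planted value are all assumptions for illustration, not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 2, 3, 10))    # hypothetical core tensor, axes (l1, l2, l3, l4)
G[0, 1, 0, 4] = 25.0                  # planted entry, for illustration only

l1, l2, l3 = 1, 2, 1                  # 1-based indices, as written on the slide
weights = np.abs(G[l1 - 1, l2 - 1, l3 - 1, :])
l4 = int(np.argmax(weights)) + 1      # convert back to a 1-based index
```

The gene singular value vector $u_{l_4 i}$ found this way is the one that shares the dependence expressed by the chosen $l_1$, $l_2$, $l_3$.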

26. 26
Gene expression that is independent of cell lines and biological replicate, but dependent upon SARS-CoV-2 infection, is associated with $u_{5i}$ ($l_4 = 5$).

$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{5i}}{\sigma_5} \right)^2 \right]$$

The computed P-values are corrected for multiple comparisons with the Benjamini-Hochberg method.
163 genes with corrected P-values < 0.01 are selected among 21,797 genes.

27. 27
Multiple hits with known SARS-CoV-2 interacting human genes

28. 28
Comparisons with conventional methods:
Since we do not know how many genes should be selected, lasso and random forest are of no use here. Instead, we employed SAM and limma, which are algorithms specific to gene selection (adjusted P-values are used).

| Cell line | t test, P > 0.01 | t test, P ≤ 0.01 | SAM, P > 0.01 | SAM, P ≤ 0.01 | limma, P > 0.01 | limma, P ≤ 0.01 |
|---|---|---|---|---|---|---|
| Calu3 | 21754 | 43 | 21797 | 0 | 335 | 3789 |
| NHBE | 21797 | 0 | 21797 | 0 | 342 | 3906 |
| A549, MOI 0.2 | 21797 | 0 | 21797 | 0 | 319 | 4391 |
| A549, MOI 2.0 | 21472 | 325 | 21797 | 0 | 208 | 4169 |
| A549, ACE2 expressed | 21796 | 1 | 21797 | 0 | 182 | 4245 |

29. 29
Kernelization of TD based unsupervised FE

30. 30
Published in Knowledge-based systems (IF=5.9)
https://doi.org/10.1016/j.knosys.2021.106834

31. 31
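Kernel tensor decomposition

[Figure: the $N \times M \times K$ tensor $x_{ijk}$ and its HOSVD (core tensor $G$ of size $L_1 \times L_2 \times L_3$ with $u_{l_1 i}$, $u_{l_2 j}$, $u_{l_3 k}$), multiplied by a second copy $x_{ij'k'}$ of size $N \times M \times K$.]

$$x_{jkj'k'} = \sum_i x_{ijk} x_{ij'k'}$$

(Linear kernel)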
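
The linear kernel is a single contraction over the variable index $i$. A minimal sketch with hypothetical sizes and random data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 50, 4, 3
x = rng.normal(size=(N, M, K))              # hypothetical x[i, j, k]

# Linear kernel: x_{jkj'k'} = sum_i x_{ijk} x_{ij'k'}
kern = np.einsum("ijk,ipq->jkpq", x, x)     # shape (M, K, M, K)

# Equivalent view: Gram matrix of the M*K samples, each a vector over i
flat = x.reshape(N, M * K)
gram = flat.T @ flat
```

The kernel tensor no longer has an axis of length $N$, so its size depends only on the number of samples, not on the number of variables.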

32. 32
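$$x_{jkj'k'} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4) u_{l_1 j} u_{l_2 k} u_{l_3 j'} u_{l_4 k'}$$

[Figure: the tensor $x_{jkj'k'}$ decomposed into the core tensor $G$ ($L_1 \times L_2 \times L_3 \times L_4$) and the singular value vectors $u_{l_1 j}$, $u_{l_2 k}$, $u_{l_3 j'}$, $u_{l_4 k'}$.]

Kernel trick:
$x_{jkj'k'} \to k(x_{ijk}, x_{ij'k'})$: non-negative definite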

33. 33
RBF kernel:

$$k(x_{ijk}, x_{ij'k'}) = \exp\left( -\alpha \sum_i (x_{ijk} - x_{ij'k'})^2 \right)$$

Polynomial kernel:

$$k(x_{ijk}, x_{ij'k'}) = \left( 1 + \sum_i x_{ijk} x_{ij'k'} \right)^d$$

$k(x_{ijk}, x_{ij'k'})$ → tensor decomposition
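
Both kernels can be sketched by treating each pair $(j, k)$ as one sample vector over $i$ and reshaping the resulting Gram matrix into a four-mode tensor. The function names `rbf_kernel_tensor` and `poly_kernel_tensor`, the sizes, and the values of $\alpha$ and $d$ are hypothetical.

```python
import numpy as np

def rbf_kernel_tensor(x, alpha):
    """k = exp(-alpha * sum_i (x_ijk - x_ij'k')^2), shape (M, K, M, K)."""
    N, M, K = x.shape
    flat = x.reshape(N, M * K)
    sq = (flat ** 2).sum(axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * flat.T @ flat  # squared distances
    d2 = np.maximum(d2, 0.0)                              # clip rounding noise
    return np.exp(-alpha * d2).reshape(M, K, M, K)

def poly_kernel_tensor(x, d):
    """k = (1 + sum_i x_ijk x_ij'k')^d, shape (M, K, M, K)."""
    N, M, K = x.shape
    flat = x.reshape(N, M * K)
    return ((1.0 + flat.T @ flat) ** d).reshape(M, K, M, K)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4, 3))             # hypothetical x[i, j, k]
k_rbf = rbf_kernel_tensor(x, alpha=0.01)
k_poly = poly_kernel_tensor(x, d=2)
```

Either tensor can then be passed to the same tensor decomposition that is used for the linear kernel.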

34. 34
Synthetic example: Swiss Roll

$$x_{ijk} \in \mathbb{R}^{1000 \times 3 \times 10}$$

1000: number of points (= n); 3: spatial dimension (= p); ⨉ 10: number of Swiss Rolls.

35. 35
SVD applied to single Swiss Roll

36. 36
TD applied to a bundle of 10 Swiss Rolls

37. 37
Kernel TD (with RBF) applied to a bundle of 10 Swiss Rolls

38. 38
Feature selection
Linear kernel:
$x_{jkj'k'}$ → TD → $u_{l_1 j}$, $u_{l_2 k}$

$$u_{l_1 i} \propto \sum_{jk} x_{ijk} u_{l_1 j} u_{l_2 k}$$

$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{l_1 i}}{\sigma_{l_1}} \right)^2 \right]$$

The computed P-values are corrected for multiple comparisons with the Benjamini-Hochberg method.
Features with corrected P-values < 0.01 are selected.
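
The projection back to the variable index $i$ can be sketched as below. It assumes that the sample-side singular value vectors are taken from the SVD of the mode unfoldings of the linear kernel tensor and, for brevity, that the first component is the one of interest; the helper name `leading_vector`, the sizes, and the random data are hypothetical.

```python
import numpy as np

def leading_vector(kern, mode):
    """First left singular vector of the mode-`mode` unfolding of kern."""
    mat = np.moveaxis(kern, mode, 0).reshape(kern.shape[mode], -1)
    U, _, _ = np.linalg.svd(mat, full_matrices=False)
    return U[:, 0]

rng = np.random.default_rng(0)
N, M, K = 200, 6, 6
x = rng.normal(size=(N, M, K))              # hypothetical x[i, j, k]

kern = np.einsum("ijk,ipq->jkpq", x, x)     # linear kernel x_{jkj'k'}
u_j = leading_vector(kern, 0)               # u_{l1 j}, here with l1 = 1
u_k = leading_vector(kern, 1)               # u_{l2 k}, here with l2 = 1

# u_{l1 i} is proportional to sum_{jk} x_ijk u_{l1 j} u_{l2 k}
u_i = np.einsum("ijk,j,k->i", x, u_j, u_k)
u_i /= np.linalg.norm(u_i)                  # fix the proportionality constant
```

The resulting `u_i` plays the role of $u_{l_1 i}$ in the P-value formula, after which the correction and thresholding proceed as stated on the slide.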

39. 39
RBF, polynomial kernels
→ Exclude a specific $i$.
→ Recompute $x_{jkj'k'}$.
→ TD: $x_{jkj'k'}$ → $u_{l_1 j}$ ⨉ $u_{l_2 k}$.
→ Estimate the coincidence between $u_{l_1 j}$, $u_{l_2 k}$ and the classification of $(k, j)$.
→ Rank $i$ based upon the amount of decreased coincidence.
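
A rough sketch of this leave-one-variable-out ranking with the RBF kernel. Several choices are assumptions, since the slide does not specify them: the coincidence is measured as the absolute Pearson correlation between the outer product $u_{l_1 j} u_{l_2 k}$ and a binary class label per $(j, k)$; the components `l1` and `l2` default to the first one and would in practice be chosen by inspecting the singular value vectors; all function names, sizes, and the toy data are hypothetical.

```python
import numpy as np

def rbf_kernel_tensor(x, alpha):
    """RBF kernel between samples (j, k) and (j', k'), shape (M, K, M, K)."""
    N, M, K = x.shape
    flat = x.reshape(N, M * K)
    sq = (flat ** 2).sum(axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * flat.T @ flat, 0.0)
    return np.exp(-alpha * d2).reshape(M, K, M, K)

def singular_vector(kern, mode, l):
    """l-th (0-based) left singular vector of the mode unfolding of kern."""
    mat = np.moveaxis(kern, mode, 0).reshape(kern.shape[mode], -1)
    U, _, _ = np.linalg.svd(mat, full_matrices=False)
    return U[:, l]

def coincidence(x, labels, alpha, l1=0, l2=0):
    """Assumed measure: |correlation| between u_j u_k and the class labels."""
    kern = rbf_kernel_tensor(x, alpha)
    pattern = np.outer(singular_vector(kern, 0, l1), singular_vector(kern, 1, l2))
    return abs(np.corrcoef(pattern.ravel(), labels.ravel())[0, 1])

def rank_features(x, labels, alpha):
    """Rank variables by how much the coincidence drops when each is removed."""
    base = coincidence(x, labels, alpha)
    drop = np.array([base - coincidence(np.delete(x, i, axis=0), labels, alpha)
                     for i in range(x.shape[0])])
    return np.argsort(-drop), drop          # largest decrease first

rng = np.random.default_rng(0)
N, M, K = 30, 4, 4
x = rng.normal(size=(N, M, K))              # hypothetical x[i, j, k]
x[:3, :2, :2] += 2.0                        # three planted variables
labels = np.zeros((M, K))
labels[:2, :2] = 1.0                        # class of each (j, k)
order, drop = rank_features(x, labels, alpha=0.01)
```

Because every variable requires a fresh kernel and decomposition, the cost grows linearly with the number of variables, which is the price paid for kernels in which no direct projection to $i$ is available.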

40. 40
Application to SARS-CoV-2 data set
Apply the RBF kernel and select the 163 top-ranked genes.
[Figure: comparison of TD and KTD.]

41. 41
Conclusions
TD based unsupervised FE is specialized for feature selection in “large p, small n” problems.
It performs comparably to conventional feature selection methods (lasso, random forest) and provides P-values, which lasso and random forest cannot.
TD based unsupervised FE could select human genes related to SARS-CoV-2 infection even where other conventional gene selection methods (t test, SAM, limma) do not work well.

42. 42
TD based unsupervised FE was successfully “kernelized”.
Kernel TD (KTD) based unsupervised FE could even outperform TD based unsupervised FE when applied to the identification of human genes related to SARS-CoV-2 infection.
More advanced KTD based unsupervised FE is expected to be developed to tackle a wider range of problems, including genomic science/bioinformatics.