Slide 1

Slide 1 text

ISAIC2020
Principal Component Analysis, Tensor Decomposition, and Kernel Tensor Decomposition Based Unsupervised Feature Extraction Applied to Bioinformatics
Y-h. Taguchi, Department of Physics, Chuo University, Tokyo 112-8551, Japan

Slide 2

Slide 2 text

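Singular value decomposition (SVD)

$$x_{ij} \simeq \sum_{l=1}^{L} u_{li} \lambda_l v_{lj}$$

In matrix form, the N × M matrix x_{ij} is approximated by the product of (u_{li})^T (N × L), the L × L diagonal matrix of λ_l, and v_{lj} (L × M).

Example
N: number of genes (i)
M: number of samples (j)
x_{ij}: gene expression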
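As a minimal sketch of the rank-L approximation on this slide, the following uses NumPy's SVD on a random matrix. The sizes, the random data, and all variable names are assumptions chosen for illustration, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 100, 6, 2            # genes, samples, retained components (illustrative sizes)
x = rng.normal(size=(N, M))    # random stand-in for a gene expression matrix x_ij

# Thin SVD: x = U diag(lam) Vt, singular values in descending order.
U, lam, Vt = np.linalg.svd(x, full_matrices=False)

u = U[:, :L].T                 # u[l, i]: l-th singular vector attributed to genes
v = Vt[:L, :]                  # v[l, j]: l-th singular vector attributed to samples

# Rank-L approximation: x_ij ~= sum_l u_li * lam_l * v_lj
x_approx = np.einsum("li,l,lj->ij", u, lam[:L], v)
rel_err = np.linalg.norm(x - x_approx) / np.linalg.norm(x)
```

Keeping all M components instead of L reproduces x up to floating-point error.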

Slide 3

Slide 3 text

Interpretation

For some specific l:
v_{lj} (j: samples): distinguishes healthy controls from patients
u_{li} (i: genes): identifies DEGs (Differentially Expressed Genes), both those with Healthy controls < Patients and those with Healthy controls > Patients

Slide 4

Slide 4 text

Extension to tensor: HOSVD (Higher Order Singular Value Decomposition)

$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$$

The N × M × K tensor x_{ijk} is approximated by the L_1 × L_2 × L_3 core tensor G multiplied by u_{l_1 i}, u_{l_2 j}, and u_{l_3 k}.

Example
N: number of genes (i)
M: number of samples (j)
K: number of tissues (k)
x_{ijk}: gene expression
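The HOSVD above can be sketched with one SVD per mode unfolding followed by a projection onto the resulting singular vectors. This is a generic textbook construction, not necessarily the implementation used for the talk; the helper names `unfold` and `hosvd` and the tensor sizes are assumptions.

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor t so that `mode` indexes the rows."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(t):
    """Return the core tensor G and one factor matrix per mode.

    Factor n has shape (dim_n, L_n); its columns are the left singular
    vectors of the mode-n unfolding, so u_{l i} on the slide is U[i, l].
    """
    factors = [np.linalg.svd(unfold(t, m), full_matrices=False)[0]
               for m in range(t.ndim)]
    G = t
    for U in factors:
        # Contract the leading axis with U; the new axis is appended last,
        # so the original axis order is restored after all modes.
        G = np.tensordot(G, U, axes=(0, 0))
    return G, factors

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4, 3))          # illustrative N x M x K tensor
G, (U1, U2, U3) = hosvd(x)

# x_ijk = sum G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k} (exact without truncation)
x_rec = np.einsum("abc,ia,jb,kc->ijk", G, U1, U2, U3)
```

Truncating each factor to its first L_n columns (and G accordingly) gives the approximate form written on the slide.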

Slide 5

Slide 5 text

Interpretation

For some specific l_2: u_{l_2 j} (j: samples) distinguishes healthy controls from patients
For some specific l_3: u_{l_3 k} (k: tissues) represents tissue specific expression

Slide 6

Slide 6 text

For some specific l_1 with max |G(l_1 l_2 l_3)| (l_2 and l_3 fixed): u_{l_1 i} (i: genes) identifies tDEGs (tissue specific Differentially Expressed Genes)

If G(l_1 l_2 l_3) > 0:
tDEG: Healthy controls < Patients
tDEG: Healthy controls > Patients
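The selection rule on this slide (fix l_2 and l_3, then take the l_1 with the largest |G|) reduces to an argmax over one fibre of the core tensor. In the sketch below the core tensor is random and the fixed indices are arbitrary; both are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 4, 3))       # hypothetical core tensor G(l1, l2, l3)

l2, l3 = 1, 0                        # zero-based indices of the fixed l2 and l3 (arbitrary here)
fibre = G[:, l2, l3]                 # G(l1, l2, l3) as a function of l1 only
l1 = int(np.argmax(np.abs(fibre)))   # l1 with the largest absolute weight
sign = float(np.sign(G[l1, l2, l3])) # the slide reads directions for the case G > 0
```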

Slide 7

Slide 7 text

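Extension to Kernel Trick

SVD → Principal Component Analysis (PCA)

SVD: $$x_{ij} \simeq \sum_{l=1}^{L} u_{li} \lambda_l v_{lj}$$

PCA: $$x_{jj'} = \sum_{i} x_{ij} x_{ij'} \simeq \sum_{l=1}^{L} v_{lj} \lambda_l^2 v_{lj'}$$

The M × M matrix x_{jj'} is the product of the transposed N × M matrix x_{ij} with x_{ij'}; it is approximated by (v_{lj})^T (M × L), the L × L diagonal matrix of λ_l^2, and v_{lj'} (L × M).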
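A quick numerical check of this relation, assuming random data: the eigenvalues of the M × M matrix x_{jj'} should equal the squared singular values of x_{ij}, and its eigenvectors should match v_{lj} up to sign. Variable names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 6
x = rng.normal(size=(N, M))            # random stand-in for x_ij

gram = x.T @ x                         # x_jj' = sum_i x_ij x_ij'

# eigh returns ascending eigenvalues; reverse to match descending singular values.
evals, evecs = np.linalg.eigh(gram)
evals, evecs = evals[::-1], evecs[:, ::-1]

_, lam, Vt = np.linalg.svd(x, full_matrices=False)

# |<eigenvector_l, v_l'>| should be the identity matrix (sign is arbitrary).
overlap = np.abs(evecs.T @ Vt.T)
```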

Slide 8

Slide 8 text

Kernel Trick

x_{jj'} → k(x_{ij}, x_{ij'}): non-negative definite

Radial basis function (RBF) kernel:
$$k(x_{ij}, x_{ij'}) = \exp\left(-\alpha \sum_i (x_{ij} - x_{ij'})^2\right)$$

Polynomial kernel:
$$k(x_{ij}, x_{ij'}) = \left(1 + \sum_i x_{ij} x_{ij'}\right)^d$$

k(x_{ij}, x_{ij'}) → diagonalization
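The two kernels and the diagonalization step can be sketched as follows. The function names, the random data, and the value of α are assumptions for illustration; the talk does not state the parameter values used.

```python
import numpy as np

def rbf_kernel(x, alpha):
    """k[j, j'] = exp(-alpha * sum_i (x[i, j] - x[i, j'])**2)."""
    sq = np.sum(x ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x.T @ x)
    # Clamp tiny negative distances caused by rounding.
    return np.exp(-alpha * np.maximum(d2, 0.0))

def poly_kernel(x, d):
    """k[j, j'] = (1 + sum_i x[i, j] * x[i, j'])**d."""
    return (1.0 + x.T @ x) ** d

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 6))          # random stand-in for x_ij
k = rbf_kernel(x, alpha=0.01)          # alpha chosen arbitrarily

# Diagonalization: columns of evecs play the role of v_lj, sorted by eigenvalue.
evals, evecs = np.linalg.eigh(k)
evals, evecs = evals[::-1], evecs[:, ::-1]
```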

Slide 9

Slide 9 text

Kernel Tensor decomposition

$$x_{jkj'k'} = \sum_i x_{ijk} x_{ij'k'}$$

$$x_{jkj'k'} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4) u_{l_1 j} u_{l_2 k} u_{l_3 j'} u_{l_4 k'}$$

The product of the N × M × K tensors x_{ijk} and x_{ij'k'}, summed over i, gives x_{jkj'k'}, which is decomposed into the L_1 × L_2 × L_3 × L_4 core tensor G and u_{l_1 j}, u_{l_2 k}, u_{l_3 j'}, u_{l_4 k'}.

https://doi.org/10.1101/2020.10.09.333195
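The contraction over i that defines x_{jkj'k'} is a single `einsum` call. The sketch below, on random data with illustrative sizes, also checks that the result holds the same numbers as the Gram matrix of the N × (M·K) unfolding, which is the link back to the matrix case.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 100, 4, 3
x = rng.normal(size=(N, M, K))               # random stand-in for x_ijk

# x_{jkj'k'} = sum_i x_{ijk} x_{ij'k'}  (a, b stand for j', k')
xk = np.einsum("ijk,iab->jkab", x, x)

# Same numbers as the Gram matrix of the unfolded tensor, reshaped to 4 modes.
flat = x.reshape(N, M * K)
gram = (flat.T @ flat).reshape(M, K, M, K)
```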

Slide 10

Slide 10 text

Kernel Trick

x_{jkj'k'} → k(x_{ijk}, x_{ij'k'}): non-negative definite

Radial basis function (RBF) kernel:
$$k(x_{ijk}, x_{ij'k'}) = \exp\left(-\alpha \sum_i (x_{ijk} - x_{ij'k'})^2\right)$$

Polynomial kernel:
$$k(x_{ijk}, x_{ij'k'}) = \left(1 + \sum_i x_{ijk} x_{ij'k'}\right)^d$$

k(x_{ijk}, x_{ij'k'}) → tensor decomposition
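A minimal sketch of kernel tensor decomposition with the RBF kernel, assuming random data: build the M × K × M × K kernel tensor, then apply HOSVD to it. The helpers `rbf_kernel_tensor` and `hosvd`, the tensor sizes, and α are assumptions; the HOSVD here is the generic unfolding-based construction.

```python
import numpy as np

def rbf_kernel_tensor(x, alpha):
    """k[j, k, j', k'] = exp(-alpha * sum_i (x[i, j, k] - x[i, j', k'])**2)."""
    n, m, kk = x.shape
    flat = x.reshape(n, m * kk)              # column index = j * K + k
    sq = np.sum(flat ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (flat.T @ flat)
    return np.exp(-alpha * np.maximum(d2, 0.0)).reshape(m, kk, m, kk)

def hosvd(t):
    """Core tensor and per-mode factor matrices via SVD of each unfolding."""
    factors = []
    for mode in range(t.ndim):
        mat = np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)
        factors.append(np.linalg.svd(mat, full_matrices=False)[0])
    core = t
    for U in factors:
        core = np.tensordot(core, U, axes=(0, 0))
    return core, factors

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 4, 3))             # random stand-in for x_ijk
alpha = 0.01                                 # arbitrary kernel width
kt = rbf_kernel_tensor(x, alpha)

G, (Uj, Uk, Ujp, Ukp) = hosvd(kt)            # u_{l1 j}, u_{l2 k}, u_{l3 j'}, u_{l4 k'}
kt_rec = np.einsum("abcd,ja,kb,pc,qd->jkpq", G, Uj, Uk, Ujp, Ukp)
```

Note that the gene index i no longer appears in the decomposition; only sample-side vectors are obtained directly from the kernel tensor.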

Slide 11

Slide 11 text

Synthetic example: Swiss Roll

$$x_{ijk} \in \mathbb{R}^{1000 \times 3 \times 10}$$

1000: number of points
3: spatial dimension
10: number of Swiss Rolls (⨉ 10)

Slide 12

Slide 12 text

SVD applied to single Swiss Roll

Slide 13

Slide 13 text

TD applied to a bundle of 10 Swiss Rolls

Slide 14

Slide 14 text

Kernel TD (with RBF) applied to a bundle of 10 Swiss Rolls

Slide 15

Slide 15 text

Large p small n problem

N(μ,σ): normal distribution, μ: mean, σ: standard deviation

$$x_{ijk} \in \mathbb{R}^{N \times M \times M} \sim \begin{cases} N(\mu, 3) & j, k \le \frac{M}{2},\ i \le N_1 \ll N \\ N(0, 3) & \text{otherwise} \end{cases}$$

N genes, 1st M experiments, 2nd M experiments: M^2 samples
M << N

Slide 16

Slide 16 text

N genes × M experiments × M experiments
Non-zero mean: i ≦ N_1 and j, k ≦ M/2
Zero mean: otherwise

i ≦ N_1: distinct between j, k ≦ M/2 and others
i > N_1: no distinction

Task: Can we get latent vectors coincident with distinction?
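The synthetic data set described on these two slides can be generated in a few lines. This is a sketch under assumptions: the function name and seed are hypothetical, the default sizes follow the parameter list given on the next slide (reading its "N=103" as 10^3), and the spread is exposed as `sigma` because the formula on slide 15 shows the value 3 while the parameter list gives σ = 1.

```python
import numpy as np

def make_synthetic(n=1000, n1=10, m=6, mu=2.0, sigma=1.0, seed=0):
    """N x M x M tensor: mean mu for i < n1 and j, k < m/2, mean 0 otherwise."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma, size=(n, m, m))
    x[:n1, : m // 2, : m // 2] += mu      # shift only the distinct block
    return x

x = make_synthetic()

# Indicator of the distinction along j (and equally along k):
# 1 for the first M/2 experiments, 0 for the rest.
distinction = (np.arange(6) < 3).astype(float)
```

A latent vector over j or k that recovers the task would correlate with `distinction`.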

Slide 17

Slide 17 text

N = 10^3, N_1 = 10, μ = 2, σ = 1, M = 6

P-values attributed to correlation coefficients between distinction and latent variables (smaller is better): 0.043, 0.043, 0.039, 0.039

KTD with a linear kernel is equivalent to that with an RBF kernel.

Slide 18

Slide 18 text

Application to real data: SARS-CoV-2 infection

Slide 19

Slide 19 text


Slide 20

Slide 20 text

Data sets: GSE147507
Three kinds of human lung cell lines infected by SARS-CoV-2

$$x_{ijkm} \in \mathbb{R}^{21797 \times 5 \times 2 \times 3}$$

i: genes (21797)
j: j=1: Calu3, j=2: NHBE, j=3: A549 MOI 0.2, j=4: A549 MOI 2.0, j=5: A549 ACE2 expressed (MOI: Multiplicity of infection)
k: k=1: Mock, k=2: SARS-CoV-2 infected
m: three biological replicates

Slide 21

Slide 21 text

$$x_{ijkm} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4) u_{l_1 j} u_{l_2 k} u_{l_3 m} u_{l_4 i}$$

u_{l_1 j}: l_1 th cell line dependence
u_{l_2 k}: l_2 th SARS-CoV-2 infection YES/NO
u_{l_3 m}: l_3 th biological replicate dependence
u_{l_4 i}: l_4 th gene dependence
G: weights

Purpose: identification of l_1, l_2, l_3 that are independent of cell lines and replicates (u_{l_1 j} and u_{l_3 m} are constant, independent of j and m) but dependent upon SARS-CoV-2 infection (u_{l_2 1} = -u_{l_2 2})

Slide 22

Slide 22 text

l_1 = 1: cell lines
l_2 = 2: SARS-CoV-2 Yes or Not
l_3 = 1: biological replicate

Independent of cell lines or biological replicate but dependent upon SARS-CoV-2 infection

Slide 23

Slide 23 text

Kernel TD with RBF distinguishes normal and infected cell lines more clearly than TD.
Figure labels: infection, not infection

Slide 24

Slide 24 text

l_1 = 1, l_2 = 2, l_3 = 1

Which l_4 has largest |G|?

u_{5i} (l_4 = 5) represents gene expression profiles independent of cell lines or biological replicate but altered by SARS-CoV-2 infection
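The selection workflow of slides 21 to 24 can be sketched on toy data: decompose a 4-mode tensor, fix l_1, l_2, l_3, and pick the l_4 with the largest |G|. Everything below is hypothetical: the tensor is simulated (a per-gene baseline plus an infection effect on ten genes), the sizes are reduced, and the selected l_4 in this toy case is not the l_4 = 5 reported for the real data.

```python
import numpy as np

def hosvd(t):
    """Core tensor and per-mode factor matrices via SVD of each unfolding."""
    factors = []
    for mode in range(t.ndim):
        mat = np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)
        factors.append(np.linalg.svd(mat, full_matrices=False)[0])
    core = t
    for U in factors:
        core = np.tensordot(core, U, axes=(0, 0))
    return core, factors

rng = np.random.default_rng(0)
n_genes, n_lines, n_cond, n_rep = 200, 5, 2, 3    # toy sizes, not 21797 x 5 x 2 x 3
base = rng.normal(5.0, 1.0, size=n_genes)         # baseline expression per gene
effect = np.zeros(n_genes)
effect[:10] = 2.0                                 # hypothetical infection-responsive genes

x = base[:, None, None, None] + np.zeros((n_genes, n_lines, n_cond, n_rep))
x[:, :, 1, :] += effect[:, None, None]            # k = 2 (index 1): infected
x += rng.normal(0.0, 0.01, size=x.shape)          # small measurement noise

# Reorder axes to (j, k, m, i) so the core follows G(l1, l2, l3, l4) on the slide.
G, (u_j, u_k, u_m, u_i) = hosvd(np.transpose(x, (1, 2, 3, 0)))

# In this toy setup the first vectors over j and m are nearly constant and the
# second vector over k changes sign between mock and infected.
l1, l2, l3 = 0, 1, 0                              # zero-based l1 = 1, l2 = 2, l3 = 1
l4 = int(np.argmax(np.abs(G[l1, l2, l3, :])))
gene_scores = u_i[:, l4]                          # u_{l4 i}
top_genes = np.argsort(-np.abs(gene_scores))[:10]
```

In this simulation the ten genes with the largest |u_{l4 i}| are the ones given an infection effect, which mirrors the role u_{5i} plays on the slide.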

Slide 25

Slide 25 text

P-values are attributed to gene i using the χ² distribution under the null hypothesis that u_{5i} obeys a Gaussian distribution. 163 genes have P-values, corrected by multiple comparison correction (BH), of less than 0.01.

ABCC3 ACE2 ACTB ACTG1 ACTN4 AHNAK AKAP12 AKR1B1 AKR1B10 AKR1C2 ALDH1A1 ALDH3A1 ALDOA AMIGO2 ANTXR1 ANXA2 ASNS ASPH ATF4 ATP1B1 C3 CALM2 CALR CD24 CFL1 CPLX2 CRIM1 CTGF CXCL5 CYP24A1 DCBLD2 DDIT4 DHCR24 EEF1A1 EEF2 EIF1 EIF4B EIF5A ENO1 ERBB2 EREG FADS2 FASN FDCSP FDPS FLNB FTH1 FTL G6PD GAPDH GAS5 GPX2 GSTP1 H1F0 HMGA1 HNRNPA2B1 HSP90AA1 HSP90AB1 HSPA8 ICAM1 IER3 IFIT2 IGFBP3 IGFBP4 ITGA2 ITGA3 ITGAV ITGB1 JUN KRT18 KRT19 KRT23 KRT5 KRT6A KRT7 KRT8 KRT81 LAMB3 LAMC2 LCN2 LDHA LIF LOXL2 MIEN1 MTHFD2 MYL6 NAMPT NAP1L1 NEAT1 NFKBIA NPM1 NQO1 OAS2 P4HB PABPC1 PFN1 PGK1 PKM PLAU PLOD2 PMEPA1 PPIA PPP1R15A PSAT1 PSMD3 PTMA RAI14 RNF213 RPL10 RPL12 RPL23 RPL26 RPL28 RPL3 RPL37 RPL4 RPL5 RPL7 RPL7A RPL9 RPS19 RPS20 RPS24 RPS27 RPS27A RPS3A RPS4X RPS6 S100A2 S100A6 SAT1 SCD SERPINA3 SERPINE1 SLC38A2 SLC7A11 SLC7A5 SPP1 SPTBN1 SQSTM1 STARD3 STAT1 STC2 TGFBI TGM2 TIPARP TMSB4X TNFAIP2 TOP2A TPI1 TPM1 TPT1 TRAM1 TUBA1B TUBB TUBB4B TXNIP TXNRD1 UBC VEGFA VIM YBX1 YWHAZ
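The gene selection step can be sketched as a χ² tail probability followed by Benjamini-Hochberg (BH) correction. Several details are assumptions because the slide does not state them: one degree of freedom, the standard deviation estimated from all genes, and simulated scores standing in for u_{5i}. The function names are hypothetical.

```python
import math
import numpy as np

def chi2_sf_1dof(q):
    """Upper tail of the chi-squared distribution with 1 degree of freedom."""
    return np.array([math.erfc(math.sqrt(v / 2.0)) for v in q])

def benjamini_hochberg(p):
    """BH-adjusted P-values (step-up procedure)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.minimum(adjusted, 1.0)
    return out

rng = np.random.default_rng(0)
u = rng.normal(0.0, 1.0, size=2000)   # simulated stand-in for u_{5i}
u[:20] += 10.0                        # hypothetical outlier genes

sigma = np.std(u)                     # assumption: sd estimated from all genes
p = chi2_sf_1dof((u / sigma) ** 2)    # P-value under the Gaussian null
selected = np.flatnonzero(benjamini_hochberg(p) < 0.01)
```

Genes whose scores sit far in the tails of the Gaussian null receive small adjusted P-values and are the ones retained.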

Slide 26

Slide 26 text

Many human genes known to interact with SARS-CoV-2 proteins are included.

Slide 27

Slide 27 text

Many human genes that are important when SARS-CoV-2 infects seem to be detected.
↓
Drug repositioning will be possible by identifying compounds that affect the selected 163 genes.
↓
Fortunately, we have databases that list genes whose expression is altered by drug treatments.

Slide 28

Slide 28 text

Many reported candidate drugs targeting SARS-CoV-2 are detected.

Slide 29

Slide 29 text

DrugMatrix in Enrichr

Term                                          Overlap  P-value   Adjusted P-value
Ivermectin-7.5 mg/kg in CMC-Rat-Liver-1d-dn   12/277   2.98E-06  9.93E-06
Ivermectin-7.5 mg/kg in CMC-Rat-Liver-5d-dn   12/289   4.60E-06  1.44E-05
Ivermectin-7.5 mg/kg in CMC-Rat-Liver-3d-dn   11/285   2.29E-05  5.56E-05
Ivermectin-7.5 mg/kg in CMC-Rat-Liver-1d-up   10/323   3.28E-04  5.39E-04
Ivermectin-7.5 mg/kg in CMC-Rat-Liver-5d-up   8/311    4.06E-03  5.10E-03
Ivermectin-7.5 mg/kg in CMC-Rat-Liver-3d-up   8/315    4.38E-03  5.46E-03

Ivermectin was hit!

Slide 30

Slide 30 text

Summary
Our methods can identify effective candidate compounds for COVID-19.
I published a monograph with Springer International in Sep 2019.