
Principal Component Analysis, Tensor Decomposition, and Kernel Tensor Decomposition Based Unsupervised Feature Extraction Applied to Bioinformatics

Y-h. Taguchi
December 03, 2020


Invited talk at ISAIC2020
http://www.confisaic.com/#/
Online, 2nd - 4th December, 2020.



Transcript

  1. ISAIC2020. Principal Component Analysis, Tensor Decomposition, and Kernel Tensor Decomposition Based Unsupervised Feature Extraction Applied to Bioinformatics

    Y-h. Taguchi, Department of Physics, Chuo University, Tokyo 112-8551, Japan
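  2. Singular value decomposition (SVD)

    $x_{ij} \simeq \sum_{l=1}^{L} u_{li} \lambda_l v_{lj}$

    In matrix form, $x_{ij}$ ($N \times M$) is approximated by the product of $(u_{li})^T$ ($N \times L$), the diagonal matrix of $\lambda_l$ ($L \times L$), and $v_{lj}$ ($L \times M$).

    Example: $N$: number of genes ($i$); $M$: number of samples ($j$); $x_{ij}$: gene expression.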
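The SVD formula above can be tried on a toy matrix. This is only a minimal sketch with NumPy: the matrix sizes, the random data, and the variable names are assumptions for illustration, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 100, 6, 3            # genes, samples, retained components (toy sizes)
x = rng.normal(size=(N, M))    # stand-in for a gene expression matrix x_ij

# Thin SVD: x = U diag(lam) Vt, with U (N x M), lam (M,), Vt (M x M).
# Column l of U plays the role of u_li, row l of Vt the role of v_lj.
U, lam, Vt = np.linalg.svd(x, full_matrices=False)

# Rank-L approximation: x_ij ~ sum_l u_li * lam_l * v_lj
x_approx = (U[:, :L] * lam[:L]) @ Vt[:L, :]

# With all components kept, the reconstruction is exact up to rounding.
x_full = (U * lam) @ Vt
```

Keeping only the leading `L` components gives the approximation; the singular values are returned in descending order, so the first columns carry most of the variance.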
  3. Interpretation

    For some specific $l$, $v_{lj}$ ($j$: samples) separates healthy controls from patients, and $u_{li}$ ($i$: genes) marks the DEGs (differentially expressed genes): DEGs with healthy controls < patients, and DEGs with healthy controls > patients.
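  4. Extension to tensor: HOSVD (Higher Order Singular Value Decomposition)

    $x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$

    The tensor $x_{ijk}$ ($N \times M \times K$) is approximated by the core tensor $G$ ($L_1 \times L_2 \times L_3$) multiplied by $u_{l_1 i}$, $u_{l_2 j}$, and $u_{l_3 k}$.

    Example: $N$: number of genes ($i$); $M$: number of samples ($j$); $K$: number of tissues ($k$); $x_{ijk}$: gene expression.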
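A minimal HOSVD sketch for the decomposition above, using only NumPy. The helper names (`unfold`, `hosvd`, `reconstruct`) and the toy tensor are hypothetical, and no truncation is applied, so the reconstruction is exact rather than approximate.

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor t so that `mode` indexes the rows."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(t):
    """Return the core tensor G and one orthonormal factor matrix per mode."""
    factors = [np.linalg.svd(unfold(t, m), full_matrices=False)[0]
               for m in range(t.ndim)]
    G = t
    for m, U in enumerate(factors):
        # Contract mode m of G with U^T, then move the new axis back to position m.
        G = np.moveaxis(np.tensordot(U.T, G, axes=(1, m)), 0, m)
    return G, factors

def reconstruct(G, factors):
    """Multiply the core tensor by every factor matrix (inverse of hosvd)."""
    t = G
    for m, U in enumerate(factors):
        t = np.moveaxis(np.tensordot(U, t, axes=(1, m)), 0, m)
    return t

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 6, 4))     # toy x_ijk: genes x samples x tissues
G, (u1, u2, u3) = hosvd(x)          # u1[:, l] corresponds to u_{l i}, etc.
x_back = reconstruct(G, [u1, u2, u3])
```

Truncating each factor matrix to its first $L_1$, $L_2$, $L_3$ columns before computing the core would give the approximate form shown in the formula.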
  5. Interpretation

    For some specific $l_2$, $u_{l_2 j}$ ($j$: samples) separates healthy controls from patients. For some specific $l_3$, $u_{l_3 k}$ ($k$: tissues) represents tissue-specific expression.
  6. tDEG: tissue-specific differentially expressed genes

    With $l_2$ and $l_3$ fixed, take the specific $l_1$ with max $|G(l_1 l_2 l_3)|$; $u_{l_1 i}$ ($i$: genes) then marks the tDEGs. If $G(l_1 l_2 l_3) > 0$: tDEGs with healthy controls < patients, and tDEGs with healthy controls > patients.
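  7. Extension to kernel trick: SVD → Principal Component Analysis (PCA)

    $x_{ij} \simeq \sum_{l=1}^{L} u_{li} \lambda_l v_{lj}$, that is, $N \times M \approx (N \times L)(L \times L)(L \times M)$.

    $x_{jj'} = \sum_i x_{ij} x_{ij'} \simeq \sum_{l=1}^{L} v_{lj} \lambda_l^2 v_{lj'}$, that is, $M \times M \approx (M \times L)(L \times L)(L \times M)$.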
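The relation between the SVD of $x_{ij}$ and the sample-by-sample product $x_{jj'}$ can be checked numerically. The sketch below assumes toy random data; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 6))            # toy x_ij: genes x samples

# SVD of the data matrix: singular values lam_l, sample vectors v_lj (rows of Vt)
U, lam, Vt = np.linalg.svd(x, full_matrices=False)

# x_jj' = sum_i x_ij x_ij'  (M x M matrix)
gram = x.T @ x

# Diagonalize the M x M matrix; eigh returns ascending eigenvalues, so reverse.
evals, evecs = np.linalg.eigh(gram)
evals, evecs = evals[::-1], evecs[:, ::-1]

# x_jj' = sum_l v_lj * lam_l^2 * v_lj'
gram_from_svd = Vt.T @ np.diag(lam**2) @ Vt
```

The eigenvalues of `gram` equal `lam**2`, and its eigenvectors match the rows of `Vt` up to sign, which is why the problem can be restated purely in terms of $x_{jj'}$.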
  8. Kernel trick

    $x_{jj'} \rightarrow k(x_{ij}, x_{ij'})$: non-negative definite

    Radial basis function (RBF) kernel: $k(x_{ij}, x_{ij'}) = \exp\left(-\alpha \sum_i (x_{ij} - x_{ij'})^2\right)$

    Polynomial kernel: $k(x_{ij}, x_{ij'}) = \left(1 + \sum_i x_{ij} x_{ij'}\right)^d$

    $k(x_{ij}, x_{ij'})$ → diagonalization
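As a sketch of the two kernels and the diagonalization step above, assuming toy data and arbitrarily chosen values of `alpha` and `d` (the talk does not state which values were used):

```python
import numpy as np

def rbf_kernel(x, alpha):
    """k(j, j') = exp(-alpha * sum_i (x_ij - x_ij')^2) over the columns of x."""
    sq = np.sum(x**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x.T @ x)   # squared distances
    return np.exp(-alpha * np.maximum(d2, 0.0))        # clip rounding noise

def poly_kernel(x, d):
    """k(j, j') = (1 + sum_i x_ij x_ij')^d over the columns of x."""
    return (1.0 + x.T @ x) ** d

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 6))          # toy x_ij: genes x samples

K = rbf_kernel(x, alpha=0.01)          # alpha is an assumed value
P = poly_kernel(x, d=2)                # d is an assumed value

# Diagonalization: eigenvectors of the kernel matrix replace v_lj.
evals, evecs = np.linalg.eigh(K)
```

Both matrices are symmetric and non-negative definite, so `np.linalg.eigh` is the appropriate routine.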
  9. Kernel tensor decomposition

    The tensor $x_{ijk}$ ($N \times M \times K$) is multiplied by $x_{ij'k'}$ ($N \times M \times K$) and summed over $i$:

    $x_{jkj'k'} = \sum_i x_{ijk} x_{ij'k'}$

    $x_{jkj'k'} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4) u_{l_1 j} u_{l_2 k} u_{l_3 j'} u_{l_4 k'}$

    https://doi.org/10.1101/2020.10.09.333195
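The linear kernel tensor $x_{jkj'k'}$ can be built with a single contraction over the gene index. The sketch below uses toy data, and the helper `mode_factor` is a hypothetical name; it only computes the per-mode singular vectors, not the core tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 100, 6, 4
x = rng.normal(size=(N, M, K))                  # toy x_ijk

# Linear kernel tensor: x_jkj'k' = sum_i x_ijk x_ij'k'
xk = np.einsum('ijk,ipq->jkpq', x, x)           # shape (M, K, M, K)

def mode_factor(t, mode):
    """Left singular vectors and singular values of the mode-`mode` unfolding."""
    mat = np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)
    U, s, _ = np.linalg.svd(mat, full_matrices=False)
    return U, s

u_j, s_j = mode_factor(xk, 0)      # columns correspond to u_{l1 j}
u_k, s_k = mode_factor(xk, 1)      # columns correspond to u_{l2 k}
u_jp, s_jp = mode_factor(xk, 2)    # columns correspond to u_{l3 j'}
```

Because the kernel tensor is symmetric under swapping $(j,k)$ with $(j',k')$, the $j$ and $j'$ unfoldings share the same singular values. The tensor no longer has a gene axis, so its size depends only on $M$ and $K$.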
  10. Kernel trick

    $x_{jkj'k'} \rightarrow k(x_{ijk}, x_{ij'k'})$: non-negative definite

    Radial basis function (RBF) kernel: $k(x_{ijk}, x_{ij'k'}) = \exp\left(-\alpha \sum_i (x_{ijk} - x_{ij'k'})^2\right)$

    Polynomial kernel: $k(x_{ijk}, x_{ij'k'}) = \left(1 + \sum_i x_{ijk} x_{ij'k'}\right)^d$

    $k(x_{ijk}, x_{ij'k'})$ → tensor decomposition
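A possible way to form the RBF kernel tensor is to flatten the $(j,k)$ pair into one column index, compute pairwise squared distances, and reshape back. The function name and the value of `alpha` below are assumptions for illustration.

```python
import numpy as np

def rbf_kernel_tensor(x, alpha):
    """k[j, k, j', k'] = exp(-alpha * sum_i (x_ijk - x_ij'k')^2)."""
    N, M, K = x.shape
    flat = x.reshape(N, M * K)                 # column index = j * K + k
    sq = np.sum(flat**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (flat.T @ flat)
    kern = np.exp(-alpha * np.maximum(d2, 0.0))
    return kern.reshape(M, K, M, K)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 6, 4))               # toy x_ijk
kt = rbf_kernel_tensor(x, alpha=0.01)          # alpha is an assumed value
```

The resulting four-mode tensor can then be passed to any HOSVD routine in place of the linear product $\sum_i x_{ijk} x_{ij'k'}$.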
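  11. Large p small n problem

    $N(\mu, \sigma)$: normal distribution; $\mu$: mean; $\sigma$: standard deviation.

    $N$ genes are measured in a first set of $M$ experiments and a second set of $M$ experiments, giving $M^2$ samples, with $M \ll N$:

    $x_{ijk} \in \mathbb{R}^{N \times M \times M}$, with $x_{ijk} \sim N(\mu, \sigma)$ for $j, k \le M/2$ and $i \le N_1 \ll N$, and $x_{ijk} \sim N(0, \sigma)$ otherwise.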
  12. Of the $N$ genes, the first $N_1$ have a non-zero mean in the block $j, k \le M/2$ of the $M \times M$ experiments; everything else has zero mean.

    $i \le N_1$: distinct between $j, k \le M/2$ and the others. $i > N_1$: no distinction.

    Task: Can we get latent vectors coincident with the distinction?
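The synthetic data set described above can be sketched as follows. The function name `make_synthetic` is hypothetical, and the default values (N=1000, N1=10, M=6, mu=2, sigma=1) are taken from the parameters listed with the results in this talk; the random seed is an assumption.

```python
import numpy as np

def make_synthetic(N=1000, N1=10, M=6, mu=2.0, sigma=1.0, seed=0):
    """Toy tensor x_ijk (N x M x M): zero-mean noise everywhere, plus a
    mean shift mu for genes i < N1 in the block j, k < M/2."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma, size=(N, M, M))
    half = M // 2
    x[:N1, :half, :half] += mu
    return x

x = make_synthetic()

signal_mean = x[:10, :3, :3].mean()    # block with non-zero mean
noise_mean = x[10:].mean()             # genes with no distinction
```

Only 10 of the 1000 genes carry the signal, and only in a quarter of the 36 samples, which is what makes the setting a "large p, small n" problem.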
  13. $N = 10^3$, $N_1 = 10$, $\mu = 2$, $\sigma = 1$, $M = 6$

    P-values attributed to correlation coefficients between the distinction and the latent variables (smaller is better). KTD with the linear kernel is equivalent to that with the RBF kernel.

    [Table values: 0.043, 0.043, 0.039, 0.039]
  14. Data set: GSE147507. Three kinds of human lung cell lines infected by SARS-CoV-2.

    $x_{ijkm} \in \mathbb{R}^{21797 \times 5 \times 2 \times 3}$

    $i$: genes (21797)
    $j$: $j=1$: Calu3; $j=2$: NHBE; $j=3$: A549, MOI 0.2; $j=4$: A549, MOI 2.0; $j=5$: A549, ACE2 expressed (MOI: multiplicity of infection)
    $k$: $k=1$: mock; $k=2$: SARS-CoV-2 infected
    $m$: three biological replicates
  15. $x_{ijkm} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4) u_{l_1 j} u_{l_2 k} u_{l_3 m} u_{l_4 i}$

    $u_{l_1 j}$: $l_1$th cell line dependence
    $u_{l_2 k}$: $l_2$th SARS-CoV-2 infection dependence (yes/no)
    $u_{l_3 m}$: $l_3$th biological replicate dependence
    $u_{l_4 i}$: $l_4$th gene dependence
    $G$: weights

    Purpose: identification of $l_1$, $l_2$, $l_3$ that are independent of cell lines and replicates ($u_{l_1 j}$ and $u_{l_3 m}$ are constant, independent of $j$ and $m$) but dependent upon SARS-CoV-2 infection ($u_{l_2 1} = -u_{l_2 2}$).
  16. $l_1 = 1$ (cell lines), $l_2 = 2$ (SARS-CoV-2 infection, yes or no), $l_3 = 1$ (biological replicates)

    These components are independent of cell lines and biological replicates but dependent upon SARS-CoV-2 infection.
  17. Kernel TD with RBF is more distinct between normal and infected cell lines than TD.

    [Figure labels: infection, no infection]
  18. $l_1 = 1$, $l_2 = 2$, $l_3 = 1$: which $l_4$ has the largest $|G|$?

    $u_{5i}$ ($l_4 = 5$) represents gene expression profiles that are independent of cell lines and biological replicates but altered by SARS-CoV-2 infection.
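Selecting the gene component amounts to an argmax over one slice of the core tensor. The sketch below uses a randomly generated stand-in for $G$ with one large entry planted purely for illustration; the tensor shape is an assumption and the values are not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 2, 3, 10))   # toy core tensor G(l1, l2, l3, l4)
G[0, 1, 0, 4] = 25.0                 # planted large weight (illustration only)

# Fix l1 = 1, l2 = 2, l3 = 1 (1-based, as in the slides) and scan l4.
weights = np.abs(G[0, 1, 0, :])
l4 = int(np.argmax(weights)) + 1     # convert back to 1-based indexing
```

With the planted entry, `l4` is 5, mirroring the choice of $u_{5i}$ described above.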
  19. P-values are attributed to each gene $i$ using the $\chi^2$ distribution, under the null hypothesis that $u_{5i}$ obeys a Gaussian distribution. 163 genes have P-values, corrected for multiple comparisons (BH), of less than 0.01:

    ABCC3 ACE2 ACTB ACTG1 ACTN4 AHNAK AKAP12 AKR1B1 AKR1B10 AKR1C2 ALDH1A1 ALDH3A1 ALDOA AMIGO2 ANTXR1 ANXA2 ASNS ASPH ATF4 ATP1B1 C3 CALM2 CALR CD24 CFL1 CPLX2 CRIM1 CTGF CXCL5 CYP24A1 DCBLD2 DDIT4 DHCR24 EEF1A1 EEF2 EIF1 EIF4B EIF5A ENO1 ERBB2 EREG FADS2 FASN FDCSP FDPS FLNB FTH1 FTL G6PD GAPDH GAS5 GPX2 GSTP1 H1F0 HMGA1 HNRNPA2B1 HSP90AA1 HSP90AB1 HSPA8 ICAM1 IER3 IFIT2 IGFBP3 IGFBP4 ITGA2 ITGA3 ITGAV ITGB1 JUN KRT18 KRT19 KRT23 KRT5 KRT6A KRT7 KRT8 KRT81 LAMB3 LAMC2 LCN2 LDHA LIF LOXL2 MIEN1 MTHFD2 MYL6 NAMPT NAP1L1 NEAT1 NFKBIA NPM1 NQO1 OAS2 P4HB PABPC1 PFN1 PGK1 PKM PLAU PLOD2 PMEPA1 PPIA PPP1R15A PSAT1 PSMD3 PTMA RAI14 RNF213 RPL10 RPL12 RPL23 RPL26 RPL28 RPL3 RPL37 RPL4 RPL5 RPL7 RPL7A RPL9 RPS19 RPS20 RPS24 RPS27 RPS27A RPS3A RPS4X RPS6 S100A2 S100A6 SAT1 SCD SERPINA3 SERPINE1 SLC38A2 SLC7A11 SLC7A5 SPP1 SPTBN1 SQSTM1 STARD3 STAT1 STC2 TGFBI TGM2 TIPARP TMSB4X TNFAIP2 TOP2A TPI1 TPM1 TPT1 TRAM1 TUBA1B TUBB TUBB4B TXNIP TXNRD1 UBC VEGFA VIM YBX1 YWHAZ
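One way to read the gene selection step is sketched below: each gene gets the upper-tail probability of a $\chi^2$ distribution with one degree of freedom evaluated at $(u_i/\sigma)^2$, followed by Benjamini-Hochberg adjustment. The one-degree-of-freedom form, the use of the sample standard deviation for $\sigma$, and the toy vector with ten planted genes are all assumptions, not details confirmed by the talk.

```python
import math
import numpy as np

def chi2_pvalues(u):
    """P[chi2_1 > (u_i / sigma)^2]; for 1 d.o.f. this equals erfc(|z| / sqrt(2))."""
    z = np.abs(u) / np.std(u)          # sigma estimated from u (assumption)
    return np.array([math.erfc(v / math.sqrt(2.0)) for v in z])

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, scanning from the largest P-value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.empty(n)
    adj[order] = np.minimum(ranked, 1.0)
    return adj

rng = np.random.default_rng(0)
u = rng.normal(size=1000)              # toy stand-in for u_5i
u[:10] += 10.0                         # ten planted outlier genes (illustration)

adj = bh_adjust(chi2_pvalues(u))
selected = np.flatnonzero(adj < 0.01)  # indices of selected genes
```

In this toy setting only the planted genes pass the 0.01 threshold; on real data the selected set depends on how well the remaining genes follow the Gaussian null.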
  20. Many human genes that are important when SARS-CoV-2 infects seem to be detected.

    ↓ Drug repositioning will be possible by identifying compounds that affect the selected 163 genes.

    ↓ Fortunately, we have databases that list genes whose expression is altered by drug treatments.
  21. DrugMatrix in Enrichr: Ivermectin was hit!

    | Term | Overlap | P-value | Adjusted P-value |
    |---|---|---|---|
    | Ivermectin-7.5 mg/kg in CMC-Rat-Liver-1d-dn | 12/277 | 2.98E-06 | 9.93E-06 |
    | Ivermectin-7.5 mg/kg in CMC-Rat-Liver-5d-dn | 12/289 | 4.60E-06 | 1.44E-05 |
    | Ivermectin-7.5 mg/kg in CMC-Rat-Liver-3d-dn | 11/285 | 2.29E-05 | 5.56E-05 |
    | Ivermectin-7.5 mg/kg in CMC-Rat-Liver-1d-up | 10/323 | 3.28E-04 | 5.39E-04 |
    | Ivermectin-7.5 mg/kg in CMC-Rat-Liver-5d-up | 8/311 | 4.06E-03 | 5.10E-03 |
    | Ivermectin-7.5 mg/kg in CMC-Rat-Liver-3d-up | 8/315 | 4.38E-03 | 5.46E-03 |
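An overlap such as 12/277 is typically scored with a hypergeometric (one-sided Fisher) tail probability. The sketch below shows that calculation under stated assumptions: the background size of 20000 genes is hypothetical, and the function is not Enrichr's own code, so the result is not expected to reproduce the P-values in the table.

```python
from math import comb

def hypergeom_sf(k, n_query, n_term, n_background):
    """P(overlap >= k) when n_query genes are drawn without replacement from
    n_background genes, of which n_term belong to the term's gene set."""
    total = comb(n_background, n_query)
    upper = min(n_query, n_term)
    hits = sum(comb(n_term, x) * comb(n_background - n_term, n_query - x)
               for x in range(k, upper + 1))
    return hits / total

# 163 selected genes, a term with 277 genes, 12 in common;
# the background size of 20000 is an assumption for illustration.
p = hypergeom_sf(12, 163, 277, 20000)
```

Exact integer arithmetic via `math.comb` avoids the underflow that floating-point factorials would cause at these sizes.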
  22. Summary

    Our methods can identify effective candidate compounds for COVID-19. I published a monograph with Springer International in September 2019.