Tensor decomposition based unsupervised feature extraction applied to Bioinformatics

Slide 1

Slide 1 text

Tensor decomposition based unsupervised feature extraction applied to Bioinformatics
Y-h. Taguchi
Department of Physics, Chuo University, Tokyo 112-8551, Japan

Slide 2

Slide 2 text

Introduction
Bioinformatics is a research field that analyzes massive genomics data sets using cutting-edge computational, statistical, and machine learning techniques. Typical data sets analyzed, with their approximate numbers of features, are
● Gene expression profiles: ~$10^4$
● DNA methylation: ~$10^7$
● DNA accessibility: ~$10^7$
● microRNA expression: ~$10^3$
whereas the number of samples is small (10 to $10^2$).

Slide 3

Slide 3 text

Data analysis in bioinformatics is a typical large p small n problem. What is a large p small n problem? When there are only a few examples (small n), each described by many features (large p), it is often difficult to distinguish between classes of examples.
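A minimal sketch of why this setting is hard, using purely synthetic data (the sizes, the random labels, and every variable name are illustrative assumptions, not taken from the talk): with far more features than samples, a linear model can reproduce even random labels exactly, so a perfect fit on the available samples says nothing about real structure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 30, 10_000           # small n, large p (illustrative sizes)

# Features and labels are independent noise: there is nothing to learn.
features = rng.standard_normal((n_samples, n_features))
labels = rng.choice([-1.0, 1.0], size=n_samples)

# Minimum-norm least-squares solution of an underdetermined linear system.
weights, *_ = np.linalg.lstsq(features, labels, rcond=None)

# The fit on the 30 available samples is exact (up to rounding) ...
train_error = float(np.max(np.abs(features @ weights - labels)))

# ... so predictions on fresh samples have to be checked separately.
new_features = rng.standard_normal((n_samples, n_features))
new_predictions = new_features @ weights
```

Because `train_error` is essentially zero even for meaningless labels, feature selection in this regime needs a criterion other than how well the labeled samples are fitted.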

Slide 4

Slide 4 text

For example, suppose you are given some novels to classify as either fantasy or science fiction. In this case, the features are the set of words included in each novel.
Fantasy: Snow White, The Lord of the Rings, Knights of the Round Table
Science Fiction: Star Trek, X-Men, Superman
Is Star Wars science fiction or fantasy? Based only upon a small number of examples, labeling a new title is very difficult.

Slide 5

Slide 5 text

In bioinformatics, the number of samples is small (~$10^2$) whereas the number of features is huge (>>$10^4$). As a method applicable to this large p small n problem, we developed an approach based on tensor decomposition. In this talk, I would like to introduce some examples of the application of the method we proposed, “Tensor Decomposition (TD) based unsupervised feature extraction (FE)”, to bioinformatics problems.

Slide 6

Slide 6 text

I have published a book on this topic with Springer International. I would be glad if the audience would buy it and learn my method.
Y-h. Taguchi, Unsupervised Feature Extraction Applied to Bioinformatics --- A PCA and TD Based Approach --- Springer International (2020)

Slide 7

Slide 7 text

What is a tensor?
Scalar $x$: a number, e.g. 1
Vector $x_i$: a set of scalars arranged in a line, e.g. (1, 2, 3, 4, ...)
Matrix $x_{ij}$: a set of scalars arranged in a table (i.e. rows and columns), e.g. the numbers 1 to 9 laid out in rows and columns
Tensor $x_{ijk}$: a set of scalars arranged in an array with more than two indices

Slide 8

Slide 8 text

A tensor is suitable for storing genomics data:
Gene expression: $x_{ijk} \in \mathbb{R}^{N \times M \times K}$
N genes ⨉ M persons ⨉ K tissues
i: genes, j: persons, k: tissues

Slide 9

Slide 9 text

What is tensor decomposition (TD)?
Expand a tensor as a sum of products of vectors:
$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) \, u_{l_1 i} \, u_{l_2 j} \, u_{l_3 k}$$
i: genes, j: persons, k: tissues
G: weight of individual terms
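The expansion on this slide can be sketched with a higher-order singular value decomposition (HOSVD), the algorithm named later in the talk. This is a minimal illustration on a random toy tensor, not the author's implementation; the helper names `unfold` and `hosvd` and the tensor sizes are assumptions made for the example.

```python
import numpy as np

def unfold(x, mode):
    """Matricize x so that rows run over `mode` and columns over all other modes."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def hosvd(x):
    """Return (G, [U_1, U_2, ...]); column l of U_n is the l-th singular vector of mode n."""
    factors = [np.linalg.svd(unfold(x, n), full_matrices=False)[0]
               for n in range(x.ndim)]
    core = x
    for n, u in enumerate(factors):
        # Project mode n of the running core onto that mode's singular vectors.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, n)), 0, n)
    return core, factors

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 4, 3))          # toy tensor: genes x persons x tissues
G, (u_gene, u_person, u_tissue) = hosvd(x)

# Rebuild x as the sum over l1, l2, l3 of G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k}.
x_rebuilt = np.einsum("abc,ia,jb,kc->ijk", G, u_gene, u_person, u_tissue)
reconstruction_error = float(np.max(np.abs(x - x_rebuilt)))
```

With all components kept, as here, the reconstruction is exact up to rounding. In this sketch `u_gene[i, l1]` plays the role of $u_{l_1 i}$, with 0-based indices in code where the slides count from 1.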

Slide 10

Slide 10 text

Advantages of tensor decomposition (TD):
We can know
“Dependence of $x_{ijk}$ upon i” → $u_{l_1 i}$ (gene selection)
“Dependence of $x_{ijk}$ upon j” → $u_{l_2 j}$ (healthy control vs patient)
“Dependence of $x_{ijk}$ upon k” → $u_{l_3 k}$ (tissue specificity)
We can answer the question: Which genes are expressed differently between healthy controls and patients in a tissue-specific manner?

Slide 11

Slide 11 text

Application to a real example
Drug repositioning for COVID-19

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Data set: GSE147507
Gene expression of human lung cell lines with/without SARS-CoV-2 infection.
i: genes (21797)
j: cell lines; j=1: Calu3, j=2: NHBE, j=3: A549 (MOI 0.2), j=4: A549 (MOI 2.0), j=5: A549 (ACE2 expressed)
k: k=1: Mock, k=2: SARS-CoV-2 infected
m: three biological replicates
(MOI: multiplicity of infection)

Slide 14

Slide 14 text

$$x_{ijkm} \in \mathbb{R}^{21797 \times 5 \times 2 \times 3}$$
$$x_{ijkm} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4) \, u_{l_1 j} \, u_{l_2 k} \, u_{l_3 m} \, u_{l_4 i}$$
$u_{l_1 j}$: $l_1$th dependence upon cell lines
$u_{l_2 k}$: $l_2$th dependence upon with and without SARS-CoV-2 infection
$u_{l_3 m}$: $l_3$th dependence upon biological replicates
$u_{l_4 i}$: $l_4$th dependence upon genes
G: weight of individual terms

Slide 15

Slide 15 text

Purpose: identify $l_1, l_2, l_3$ whose singular vectors are independent of cell lines and biological replicates ($u_{l_1 j}$ and $u_{l_3 m}$ take constant values regardless of j and m) and dependent upon the presence or absence of SARS-CoV-2 infection ($u_{l_2 1} = -u_{l_2 2}$).
This is a severe “large p small n” problem:
Number of variables (= p): 21797 ~ $10^4$
Number of samples (= n): 5 ⨉ 2 ⨉ 3 = 30 ~ 10
p/n ~ $10^3$

Slide 16

Slide 16 text

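[Figure: singular vectors $u_{l_1 j}$ for cell lines ($l_1 = 1$), $u_{l_2 k}$ for with and without SARS-CoV-2 infection ($l_2 = 2$), and $u_{l_3 m}$ for biological replicates ($l_3 = 1$).]
These are independent of cell lines and biological replicates, but dependent upon SARS-CoV-2 infection.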

Slide 17

Slide 17 text

For $l_1 = 1$, $l_2 = 2$, $l_3 = 1$: which $l_4$ gives the largest $|G(l_1 l_2 l_3 l_4)|$?
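The selection steps on the last few slides can be sketched end to end on synthetic data. Everything below is an illustrative assumption rather than the GSE147507 analysis itself: the toy tensor, its sizes, the ten planted genes, the compact `hosvd` helper, and the rule used to pick the sign-flipped infection vector. The tensor axes are ordered (cell line, infection, replicate, gene) so that the core indices follow the order $l_1, l_2, l_3, l_4$ used on the slides.

```python
import numpy as np

def hosvd(x):
    """Compact HOSVD: returns the core tensor and one factor matrix per mode."""
    factors = [np.linalg.svd(np.moveaxis(x, n, 0).reshape(x.shape[n], -1),
                             full_matrices=False)[0] for n in range(x.ndim)]
    core = x
    for n, u in enumerate(factors):
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, n)), 0, n)
    return core, factors

rng = np.random.default_rng(0)
n_lines, n_cond, n_rep, n_genes = 5, 2, 3, 200

# Ten planted genes respond to infection (five up, five down); the rest do not.
effect = np.zeros(n_genes)
effect[:5], effect[5:10] = 3.0, -3.0
sign = np.array([-1.0, 1.0])                 # mock, infected

x = (5.0                                                   # common baseline
     + effect[None, None, None, :] * sign[None, :, None, None]
     + 0.1 * rng.standard_normal((n_lines, n_cond, n_rep, n_genes)))

core, (u_line, u_cond, u_rep, u_gene) = hosvd(x)

# Infection vector closest to u_{l2,1} = -u_{l2,2}: the column whose entries sum to ~0.
l2 = int(np.argmin(np.abs(u_cond.sum(axis=0))))

# With the constant cell-line and replicate vectors (index 0), find the gene
# component l4 carrying the largest |G|.
l4 = int(np.argmax(np.abs(core[0, l2, 0, :])))

# Genes with the largest absolute entries in the selected gene singular vector.
top_genes = sorted(np.argsort(-np.abs(u_gene[:, l4]))[:10].tolist())
```

Indices are 0-based here, so `l2 == 1` corresponds to $l_2 = 2$ on the slide. In this toy example the selected gene vector recovers the ten planted genes.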

Slide 18

Slide 18 text

Gene expression that is independent of cell lines and biological replicates, but dependent upon SARS-CoV-2 infection, is associated with $u_{5i}$ ($l_4 = 5$).
$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{5i}}{\sigma_5} \right)^2 \right]$$
The computed P-values are corrected for multiple comparisons by the Benjamini-Hochberg method. 163 genes with corrected P-values < 0.01 are selected among 21,797 genes.
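A minimal sketch of this selection rule on a synthetic singular vector. The assumptions are mine: $\sigma$ is taken to be the standard deviation of the vector's entries, the $\chi^2$ distribution has one degree of freedom, and the planted outliers and function names are purely illustrative.

```python
import math
import numpy as np

def chi2_tail_1dof(x):
    """P[chi-squared with 1 degree of freedom > x]."""
    return math.erfc(math.sqrt(x / 2.0))

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

def select_features(u, threshold=0.01):
    """Indices i whose adjusted P-value for (u_i / sigma)^2 is below threshold."""
    z2 = (u / u.std()) ** 2
    p = np.array([chi2_tail_1dof(v) for v in z2])
    return np.flatnonzero(benjamini_hochberg(p) < threshold)

rng = np.random.default_rng(0)
u = 0.01 * rng.standard_normal(1000)         # hypothetical gene singular vector
u[:10] = 0.2                                 # ten planted outlier genes
selected = select_features(u)
```

Genes whose entries in the singular vector are too large to be explained by the Gaussian null hypothesis receive small P-values. In this toy example those are exactly the planted outliers.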

Slide 19

Slide 19 text

Multiple hits with known SARS-CoV-2 interacting human genes

Slide 20

Slide 20 text

Comparisons with conventional methods:
Since we do not know how many genes should be selected, lasso and random forest cannot be used. Instead we employed the t test, SAM, and limma; SAM and limma are algorithms designed specifically for gene selection (adjusted P-values are used).

                       t test              SAM                 limma
                       P>0.01   P≦0.01    P>0.01   P≦0.01    P>0.01   P≦0.01
Calu3                  21754    43         21797    0          335      3789
NHBE                   21797    0          21797    0          342      3906
A549 MOI 0.2           21797    0          21797    0          319      4391
A549 MOI 2.0           21472    325        21797    0          208      4169
A549 ACE2 expressed    21796    1          21797    0          182      4245

Slide 21

Slide 21 text

Comparisons with DESeq2:

                       DESeq2
                       P>0.01   P≦0.01
Calu3                  7278     16432
NHBE                   23383    327
A549 MOI 0.2           7858     15852
A549 MOI 2.0           16279    7431
A549 ACE2 expressed    16201    7509

After the publication of our paper, we found that the paper [*] that originally studied this GEO data set had been published (when we did this study, only the GEO data set was provided and no papers had been published). That paper includes DESeq2 results. DESeq2 behaves similarly to limma: it detected most genes as DEGs, whereas it identified only a limited number of DEGs for the NHBE cell line.
[*] Daniel Blanco-Melo et al., Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19, Cell, 2020; 181(5): 1036-1045.e9. doi: 10.1016/j.cell.2020.04.026.

Slide 22

Slide 22 text

Application to a real example
Integrated analysis of gene expression profiles without sample matching

Slide 23

Slide 23 text

Slide 24

Slide 24 text

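Integrated analysis of gene expression without sample matching or common labeling
[Figure: two gene expression matrices sharing the same N genes, one with $M_1$ samples (healthy controls and patients) and one with $M_2$ samples (WT and KO).]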

Slide 25

Slide 25 text

Usage:
・How similar is an alteration of gene expression to that caused by KO (OE) of some gene?
・Integration of two single-cell gene expression profiles (single cells cannot be labeled)
・Comparison of the development of two distinct species (e.g., human and mouse develop at distinct speeds)

Slide 26

Slide 26 text

Method: TD is applied to a tensor formed by bundling the low-dimensional embeddings obtained by applying SVD to the individual data sets. The computed singular vectors are then mapped back to the individual data sets.
[Figure: each data set (N genes ⨉ $M_1$ samples, N genes ⨉ $M_2$ samples) is reduced by SVD to an N ⨉ L embedding; the K embeddings are bundled into a tensor $x_{ilk}$ of size N ⨉ L ⨉ K.]
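A minimal sketch of this bundling step on random stand-in data. The specifics are assumptions made for illustration: the sample counts 9, 23, and 8 and L = 8 echo the data sets described later in the talk, the gene count is shrunk to 300, the embeddings use unscaled left singular vectors, and the mapping back to samples multiplies each data set's right singular vectors by the mode-l singular vectors of the bundled tensor, which is my reading of the diagrams.

```python
import numpy as np

def embed(x, n_comp):
    """First n_comp left (gene) and right (sample) singular vectors of a genes x samples matrix."""
    u, _, vt = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_comp], vt[:n_comp].T

rng = np.random.default_rng(0)
n_genes, n_comp = 300, 8
datasets = [rng.standard_normal((n_genes, m)) for m in (9, 23, 8)]

# Standardize each sample (column) to zero mean and unit standard deviation.
datasets = [(x - x.mean(axis=0)) / x.std(axis=0) for x in datasets]

embeddings = [embed(x, n_comp) for x in datasets]

# Bundle the N x L gene embeddings of the K data sets into an N x L x K tensor.
tensor = np.stack([u for u, _ in embeddings], axis=2)

# Singular vectors of the embedding mode (index l), taken from the mode-l unfolding.
unfolded_l = np.moveaxis(tensor, 1, 0).reshape(n_comp, -1)
u_l = np.linalg.svd(unfolded_l, full_matrices=False)[0]       # L x L

# Map back to samples: (M_k x L) right singular vectors times (L x L) mode-l vectors.
sample_vectors = [v @ u_l for _, v in embeddings]
```

The data sets need only share the gene axis; their sample counts differ, yet every data set ends up with sample vectors in the same L-dimensional coordinates.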

Slide 27

Slide 27 text

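[Figure: HOSVD decomposes the N ⨉ L ⨉ K tensor $x_{ilk}$ into G and the singular vectors $u_{l_1 i}$, $u_{l_2 l}$, $u_{l_3 k}$ (with $L_1$, $L_2$, $L_3$ components). Multiplying the $M_1$ ⨉ L and $M_2$ ⨉ L sample embeddings by the L ⨉ $L_2$ matrix $u_{l_2 l}$ gives $M_1$ ⨉ $L_2$ and $M_2$ ⨉ $L_2$ sample vectors, in which healthy controls vs patients and WT vs KO can be compared.]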

Slide 28

Slide 28 text

Real data sets: Alzheimer's disease
Data Set 1 (GSE160224): 58303 genes vs 9 samples
iPSC-derived neurons: 3 Control, 3 APP duplication, 3 gene corr.
Classification: 3 Control vs 6 AD (2 classes)
Data Set 2 (GSE155567): 60617 genes vs 23 samples
CD33 KO/WT vs PTPN6 KD/WT: 4 classes
6 WT/WT, 6 WT/KD, 5 KO/WT, 6 KO/KD
Data Set 3 (GSE162873): 47749 genes vs 8 samples
Cell lines: 2 AD1, 2 AD2, 4 Controls (3 classes)
The 60617 genes included in Data Set 2 are considered. Missing values are filled with zero. Each sample is standardized to zero mean and a standard deviation of 1, and the data sets are analyzed in an integrated manner. L=8.

Slide 29

Slide 29 text

[Figure: sample vectors for Data sets 1, 2 and 3. Data set 1: CNTL and AD. Data set 2: CD33/PTPN6 = WT/WT, WT/KD, KO/WT, KO/KD. Data set 3: AD1, AD2 and CNTL.]

Slide 30

Slide 30 text

Gene selection
[Figure: $\sum_{l_2=1}^{3} G(l_1 l_2 l_3)^2$, with axes gene ($l_1$) and data set ($l_3$).]

Slide 31

Slide 31 text

$$P_i = P_{\chi^2}\left[ > \sum_{l_1=1}^{5} \left( \frac{u_{l_1 i}}{\sigma_{l_1}} \right)^2 \right]$$
$u_{l_1 i}$ is assumed to obey a multivariate Gaussian distribution (null hypothesis). P-values are attributed to genes using the $\chi^2$ distribution.
BH multiple comparison correction: adjusted $P_i$ < 0.01 → 565 genes
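A minimal sketch of the summed statistic, under assumptions that are mine rather than the talk's: five components give a $\chi^2$ distribution with five degrees of freedom, each $\sigma_{l_1}$ is the standard deviation of the corresponding column, and the data and function name are illustrative. The tail probability uses the standard recurrence for the regularized upper incomplete gamma function, $Q(a+1, z) = Q(a, z) + z^a e^{-z} / \Gamma(a+1)$, started from $Q(1/2, z) = \mathrm{erfc}(\sqrt{z})$.

```python
import math
import numpy as np

def chi2_tail_odd_dof(x, dof):
    """P[chi-squared with an odd number `dof` of degrees of freedom > x]."""
    z = x / 2.0
    q = math.erfc(math.sqrt(z))              # dof = 1
    a = 0.5
    while 2 * a < dof:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

rng = np.random.default_rng(0)
n_genes, n_comp = 1000, 5

# Columns stand in for u_{l1 i}, l1 = 1..5; ten planted genes deviate in every column.
u = 0.01 * rng.standard_normal((n_genes, n_comp))
u[:10] += 0.1

# Sum of squared, per-component standardized entries for each gene.
statistic = ((u / u.std(axis=0)) ** 2).sum(axis=1)
p_values = np.array([chi2_tail_odd_dof(s, n_comp) for s in statistic])
```

The resulting `p_values` would then go through the Benjamini-Hochberg correction before thresholding at 0.01.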

Slide 32

Slide 32 text

Enrichment analysis

Slide 33

Slide 33 text

Summary of this part: In the case where the genes match but there is no correspondence between the samples, I found that it works well to project the sample dimensions of each data set onto a common lower dimension using SVD or HOSVD and then bundle the results together to make a tensor. By re-projecting the singular vectors obtained by decomposing the bundled tensor onto the dimensions of the original samples, I found that I could visualize a correspondence between the samples (which was not available originally). When used with scRNA-seq, a problem with ~$10^4$ single cells can be treated as a problem with only 10 dimensions, thus reducing memory use to about a thousandth.

Slide 34

Slide 34 text

My contact information: E-mail: [email protected] URL: https://researchmap.jp/Yh_Taguchi/ Linkedin: https://www.linkedin.com/in/y-h-taguchi-164900b4/