Integrated Analysis of single cell multi omics data sets using tensor decomposition based unsupervised feature extraction

Slide 1

Slide 1 text

MBSJ 44
Integrated Analysis of single cell multi omics data sets using tensor decomposition based unsupervised feature extraction
Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan
The contents of this poster were published in: Taguchi, Y.-h.; Turki, T. Tensor-Decomposition-Based Unsupervised Feature Extraction in Single-Cell Multiomics Data Analysis. Genes 2021, 12, 1442. doi:10.3390/genes12091442

Slide 2

Slide 2 text

Introduction
Integrated analysis of single-cell multi-omics data is difficult because:
1. The number of features differs between omics layers (genes: ~10^4; DNA methylation/accessibility: ~10^8).
2. The data are full of missing values: about 70% for gene expression and more than 90% for DNA methylation and accessibility.
Careful pre-processing is therefore usually required.

Slide 3

Slide 3 text

In this poster, we propose using tensor decomposition to integrate gene expression, DNA methylation, and DNA accessibility without any special pre-processing.

Slide 4

Slide 4 text

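Singular value decomposition (SVD)
An N × M matrix $x_{ij}$ is approximated by the product of an N × L matrix $(u_{li})^T$, an L × L diagonal matrix of singular values $\lambda_l$, and an L × M matrix $v_{lj}$:
$$x_{ij} \simeq \sum_{l=1}^{L} u_{li} \lambda_l v_{lj}$$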
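
The rank-L approximation above can be illustrated with a minimal NumPy sketch. The matrix is random toy data and the sizes `N`, `M`, `L` are arbitrary choices made for illustration; none of them come from the poster.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 50, 20, 5          # features, samples, retained components (toy sizes)
x = rng.normal(size=(N, M))  # toy stand-in for the matrix x_ij

# Thin SVD: x = U diag(lam) Vt, singular values sorted in descending order
U, lam, Vt = np.linalg.svd(x, full_matrices=False)

# Keep the leading L components:
# u_li -> U[:, :L], lambda_l -> lam[:L], v_lj -> Vt[:L, :]
x_approx = U[:, :L] @ np.diag(lam[:L]) @ Vt[:L, :]

# Relative reconstruction error of the rank-L approximation
rel_err = np.linalg.norm(x - x_approx) / np.linalg.norm(x)
```

The error depends only on the discarded singular values, which is why a small L can suffice when the leading components dominate.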

Slide 5

Slide 5 text

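HOSVD (Higher Order Singular Value Decomposition)
Extension to tensors: an N × M × K tensor $x_{ijk}$ is approximated by a core tensor $G$ of size $L_1 \times L_2 \times L_3$ and singular value vectors $u_{l_1 i}$, $u_{l_2 j}$, $u_{l_3 k}$:
$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$$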
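
A minimal sketch of HOSVD in NumPy may make the decomposition concrete. It uses the textbook construction (SVD of each mode unfolding, then projection onto the resulting factors) on a random toy tensor; the helper names `unfold` and `hosvd` are hypothetical, and this is not necessarily the implementation used for the poster.

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor t so that `mode` indexes the rows."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(t):
    """Return the core tensor G and one orthogonal factor matrix per mode."""
    factors = []
    for mode in range(t.ndim):
        # Left singular vectors of the mode-n unfolding
        u, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(u)
    core = t
    for mode, u in enumerate(factors):
        # Mode-n product of the running core with u^T
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 5, 3))   # toy N x M x K tensor
G, (u1, u2, u3) = hosvd(x)

# Rebuild x_ijk = sum over l1, l2, l3 of G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k}
x_rebuilt = np.einsum("abc,ia,jb,kc->ijk", G, u1, u2, u3)
```

With no truncation the reconstruction is exact up to rounding; truncating the factor columns gives the approximate form written above.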

Slide 6

Slide 6 text

Application to integrated analysis of multiomics data sets
Each omics data set is associated with its own distinct set of features, so the number of features depends on the omics layer k:
$$x_{ijk} \in \mathbb{R}^{N_k \times M \times K}$$
$N_k$: number of features of the k-th omics data set
$M$: number of single cells
$K$: number of omics layers (K = 3 in the present study)

Slide 7

Slide 7 text

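Apply SVD to $x_{ijk}$ for each omics layer k:
$$x_{ijk} \simeq \sum_{l=1}^{L} u_{lik} \lambda_l v_{ljk}$$
Project $x_{ijk}$ onto $u_{lik}$ to obtain a tensor whose size no longer depends on $N_k$:
$$x_{ljk} = \sum_{i=1}^{N_k} u_{lik} x_{ijk} \in \mathbb{R}^{L \times M \times K}$$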
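
The projection step can be sketched as follows. The feature counts, the 70% missing rate applied to every layer, and in particular the zero-filling of missing entries are assumptions made for this toy example; the poster does not state how missing values are encoded.

```python
import numpy as np

rng = np.random.default_rng(0)
M, L = 40, 10                       # cells, retained components per omics layer
n_features = [200, 500, 300]        # toy N_k for K = 3 omics layers

projected = []
for n_k in n_features:
    x_k = rng.normal(size=(n_k, M))
    # Mask ~70% of entries as missing, then zero-fill (assumption of this sketch)
    x_k[rng.random(size=x_k.shape) < 0.7] = np.nan
    x_k = np.nan_to_num(x_k, nan=0.0)

    # u_lik: leading L left singular vectors of the k-th omics matrix
    u_k, _, _ = np.linalg.svd(x_k, full_matrices=False)
    u_k = u_k[:, :L]                # shape (N_k, L)

    # x_ljk = sum_i u_lik x_ijk, shape (L, M)
    projected.append(u_k.T @ x_k)

# Stack along the omics axis: a regular L x M x K tensor
x_ljk = np.stack(projected, axis=2)
```

Because every layer is reduced to the same L rows, layers with very different feature counts can be stacked into one tensor.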

Slide 8

Slide 8 text

Apply HOSVD to $x_{ljk}$:
$$x_{ljk} \simeq \sum_{l_1=1}^{L} \sum_{l_2=1}^{M} \sum_{l_3=1}^{K} G(l_1 l_2 l_3) u_{l_1 l} u_{l_2 j} u_{l_3 k}$$
Apply UMAP to $u_{l_2 j}$ ($1 \le l_2 \le 3L$, $L = 10$).
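
A sketch of how the cell-side singular vectors $u_{l_2 j}$ could be extracted is given below. It relies on one reading of the bound $l_2 \le 3L$: the unfolding along the cell mode is an M × (L·K) matrix, so at most L·K = 3L singular vectors carry signal. The tensor here is random toy data, and the UMAP call in the final comment assumes the third-party umap-learn package.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, K = 10, 100, 3
x_ljk = rng.normal(size=(L, M, K))   # toy stand-in for the projected tensor

# Unfold along the cell mode: rows are cells j, columns are (l, k) pairs
cell_unfolding = np.moveaxis(x_ljk, 1, 0).reshape(M, L * K)

# Left singular vectors are u_{l2 j}; only L * K = 3L of them are nontrivial
u_cells, s, _ = np.linalg.svd(cell_unfolding, full_matrices=False)
embedding_input = u_cells[:, : K * L]    # shape (M, 3L)

# With umap-learn installed (assumption), a 2-D embedding could be obtained by:
#   import umap
#   coords = umap.UMAP(n_components=2).fit_transform(embedding_input)
```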

Slide 9

Slide 9 text

Dataset 1
The multiomics dataset retrieved from GEO ID GSE154762, denoted as Dataset 1 in this study, is composed of 899 single cells for which gene expression, DNA methylation, and DNA accessibility were measured. These single cells represent human oocyte maturation.
Dataset 2
The multiomics dataset retrieved from GEO ID GSE121708, denoted as Dataset 2 in this study, is composed of 852 single cells for which DNA methylation and DNA accessibility were measured, as well as 758 single cells for which gene expression was measured. These single cells represent the four time points of the mouse embryo.

Slide 10

Slide 10 text

[Figure: panels for Dataset 1 and Dataset 2]

Slide 11

Slide 11 text

Identify which $u_{l_2 j}$ coincide with the classification of single cells, using categorical regression:
$$u_{l_2 j} = a_{l_2} + \sum_{s=1}^{S} b_{l_2 s} \delta_{js}$$
$\delta_{js} = 1$ when cell j belongs to the s-th category, and 0 otherwise; $a_{l_2}$ and $b_{l_2 s}$ are regression coefficients.
Datasets 1 and 2: 18 $l_2$s are associated with corrected P-values less than 0.05.
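
The categorical regression can be sketched with ordinary least squares on an indicator design matrix. The function name, the toy labels, and the use of an F statistic as the test are assumptions of this example; the poster states only that corrected P-values were obtained.

```python
import numpy as np

def categorical_f_statistic(u, labels):
    """F statistic for u_j = a + sum_s b_s * delta_js (one-way layout)."""
    u = np.asarray(u, dtype=float)
    categories, codes = np.unique(labels, return_inverse=True)
    S, M = len(categories), len(u)
    # Indicator matrix delta_js plus an intercept column
    delta = np.eye(S)[codes]
    design = np.column_stack([np.ones(M), delta])
    # The design is rank-deficient (indicators sum to the intercept);
    # lstsq returns the minimum-norm solution, fitted values are unaffected
    coef, *_ = np.linalg.lstsq(design, u, rcond=None)
    fitted = design @ coef
    ss_model = np.sum((fitted - u.mean()) ** 2)
    ss_resid = np.sum((u - fitted) ** 2)
    return (ss_model / (S - 1)) / (ss_resid / (M - S))

rng = np.random.default_rng(0)
labels = np.repeat(["A", "B", "C", "D"], 25)          # toy cell categories
u_signal = np.repeat([0.0, 1.0, 2.0, 3.0], 25) + rng.normal(scale=0.5, size=100)
u_noise = rng.normal(size=100)

f_signal = categorical_f_statistic(u_signal, labels)
f_noise = categorical_f_statistic(u_noise, labels)
# A P-value follows from the F distribution with (S - 1, M - S) degrees of
# freedom, e.g. scipy.stats.f.sf(f, S - 1, M - S); multiple-testing
# correction across all l_2 would be applied afterwards.
```

A singular vector that tracks the categories yields a much larger F statistic than one that does not, which is what the corrected P-value threshold captures.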

Slide 12

Slide 12 text

Select the $l_1$ with the largest
$$\sum_{l_2} \sum_{l_3=1}^{K} |G(l_1 l_2 l_3)|$$
where the sum over $l_2$ is restricted to the 18 selected $l_2$s, since such an $l_1$ should be associated with the classification.
For both datasets 1 and 2, $l_1 = 1$ has the largest value.
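
The selection rule amounts to a restricted sum of absolute core-tensor entries followed by an argmax. In the sketch below the core tensor is random, the first slice is inflated artificially so that the outcome is visible, and `selected_l2` is a hypothetical placeholder for the 18 retained indices.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, K = 10, 100, 3
G = rng.normal(size=(L, M, K))       # toy core tensor G(l1 l2 l3)
G[0, :18, :] *= 20.0                 # make l1 = 1 dominant, for illustration only

selected_l2 = np.arange(18)          # hypothetical indices of the 18 retained l2

# Score each l1 by the sum over the selected l2 and all l3 of |G(l1 l2 l3)|
scores = np.abs(G[:, selected_l2, :]).sum(axis=(1, 2))
best_l1 = int(np.argmax(scores)) + 1  # reported with 1-based indexing
```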

Slide 13

Slide 13 text

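Gene selection
Generate $u_{l_1 i}$ from $u_{l_1 l}$:
$$u_{l_1 i} = \sum_{l=1}^{L} u_{l_1 l} u_{li1}$$
Attribute a P-value to gene i, assuming that $u_{l_1 i}$ obeys a Gaussian distribution:
$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{1i}}{\sigma_1} \right)^2 \right]$$
where $P_{\chi^2}[> x]$ is the cumulative $\chi^2$ probability that the argument is larger than $x$, and $\sigma_1$ is the standard deviation of $u_{1i}$.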
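
The P-value assignment can be sketched as below. Two points are assumptions of this example: the χ² distribution is taken to have one degree of freedom (a single selected $l_1$), and the adjustment is taken to be Benjamini-Hochberg, since the poster says only that P-values were adjusted. The function name and the toy vector with 20 planted outliers are hypothetical.

```python
import math
import numpy as np

def select_features(u, alpha=0.01):
    """Chi-squared P-values for one singular vector, then BH adjustment."""
    u = np.asarray(u, dtype=float)
    z2 = (u / u.std()) ** 2
    # For 1 degree of freedom, P[chi2 > x] = erfc(sqrt(x / 2))
    p = np.array([math.erfc(math.sqrt(v / 2.0)) for v in z2])
    # Benjamini-Hochberg adjusted P-values
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return np.flatnonzero(adjusted < alpha), adjusted

rng = np.random.default_rng(0)
u = rng.normal(size=2020)                                  # toy u_{1i}
u[:20] = np.where(np.arange(20) % 2 == 0, 15.0, -15.0)     # 20 planted outliers

selected, adjusted = select_features(u)
```

Only components that deviate strongly from the Gaussian bulk survive the adjusted threshold, which is the sense in which the selection is unsupervised.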

Slide 14

Slide 14 text

45 genes and 175 genes were selected for datasets 1 and 2, respectively, as those associated with adjusted P-values less than 0.01.
Enrichment analysis was performed on these genes to validate them biologically.

Slide 15

Slide 15 text

Dataset 1
Forty-seven genes were enriched by H3K36me3 based on “ENCODE Histone Modifications 2015”; H3K36me3 is known to play critical roles during oocyte maturation [12].
Forty-seven genes were also targeted by MYC based on “ENCODE and ChEA Consensus TFs from ChIP-X”; MYC is known to play critical roles in oogenesis [13].
Forty-seven genes were also targeted by TAF7 based on “ENCODE and ChEA Consensus TFs from ChIP-X” and “ENCODE TF ChIP-seq 2015”; TAF7 is known to play critical roles during oocyte growth [14].
Forty-seven genes were also targeted by ATF2 based on “ENCODE and ChEA Consensus TFs from ChIP-X”; the expression of ATF2 is known to be altered during oocyte development [15].

Slide 16

Slide 16 text

Dataset 2
One hundred and seventy-five genes were enriched by H3K36me3 based on “ENCODE Histone Modifications 2015”; H3K36me3 is known to play critical roles during gastrulation [17].
One hundred and seventy-five genes were also targeted by MYC based on “ENCODE and ChEA Consensus TFs from ChIP-X”; MYC is also known to play critical roles in gastrulation [18].
One hundred and seventy-five genes were also targeted by TAF7 based on “ENCODE and ChEA Consensus TFs from ChIP-X” and “ENCODE TF ChIP-seq 2015”; TAF7 is known to play critical roles during gastrulation [19].
One hundred and seventy-five genes were also targeted by ATF2 based on “ENCODE and ChEA Consensus TFs from ChIP-X”; the expression of ATF2 is known to be maintained during gastrulation [20].

Slide 17

Slide 17 text

This suggests that tensor decomposition (TD) based unsupervised feature extraction (FE) correctly selected biologically reasonable genes.

Slide 18

Slide 18 text

TD can deal with data sets full of missing values.

Slide 19

Slide 19 text

Conclusions
TD is an effective method that can integrate single-cell multiomics data sets as they are, without pre-processing each individual omics data set specifically.