Integrated Analysis of single cell multi omics data sets using tensor decomposition based unsupervised feature extraction

MBSJ 44

Y-h. Taguchi
Department of Physics, Chuo University, Tokyo, Japan

The contents of this poster were published in: Taguchi, Y.-h.; Turki, T. Tensor-Decomposition-Based Unsupervised Feature Extraction in Single-Cell Multiomics Data Analysis. Genes 2021, 12, 1442. doi:10.3390/genes12091442

Introduction

Integrated analysis of single-cell multi-omics data is difficult because:

1. The number of features differs between omics layers (genes: ~10^4; DNA methylation/accessibility: ~10^8).
2. The data are full of missing values. The percentages of missing values are 70% for gene expression and more than 90% for DNA methylation and accessibility.

Careful pre-processing is usually required.

In this poster, we propose using tensor decomposition to integrate gene expression, DNA methylation, and DNA accessibility without any special pre-processing.

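Singular value decomposition (SVD)

An N × M matrix x_{ij} is approximated by the product of an N × L matrix (u_{li})^T, an L × L diagonal matrix of singular values λ_l, and an L × M matrix v_{lj}:

$$x_{ij} \simeq \sum_{l=1}^{L} u_{li} \lambda_l v_{lj}$$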
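The truncated SVD above can be sketched with NumPy as follows. This is a minimal illustration on made-up data; the function name `truncated_svd` and the toy matrix sizes are assumptions introduced here, not part of the poster.

```python
import numpy as np

def truncated_svd(x, L):
    """Return rank-L factors of an N x M matrix x.

    u[l, i] plays the role of u_{li}, lam[l] of lambda_l, v[l, j] of v_{lj}.
    """
    U, s, Vt = np.linalg.svd(x, full_matrices=False)
    u = U[:, :L].T   # L x N
    lam = s[:L]      # L singular values, largest first
    v = Vt[:L, :]    # L x M
    return u, lam, v

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))          # toy data: N = 50, M = 8
u, lam, v = truncated_svd(x, L=8)

# x_ij ~= sum_l u_li * lambda_l * v_lj
x_hat = np.einsum("li,l,lj->ij", u, lam, v)
```

With L equal to min(N, M), as in this toy case, the reconstruction is exact up to rounding; a smaller L gives a low-rank approximation.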

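HOSVD (Higher Order Singular Value Decomposition)

Extension to a tensor: an N × M × K tensor x_{ijk} is approximated by a core tensor G of size L_1 × L_2 × L_3 and singular value vectors u_{l_1 i}, u_{l_2 j}, and u_{l_3 k}:

$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$$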
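One common way to compute an HOSVD is to take the SVD of each mode unfolding and then contract the tensor with the transposed factor matrices to obtain the core tensor. The sketch below follows that recipe with NumPy; the helper names `unfold` and `hosvd` and the toy tensor are assumptions for illustration only, and the poster does not say which implementation was used.

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor t so that axis `mode` indexes the rows."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(t):
    """Return the core tensor G and one factor matrix per mode.

    factors[0][i, l1] plays the role of u_{l1 i}, and so on.
    """
    factors = []
    for mode in range(t.ndim):
        U, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(U)
    G = t
    for mode, U in enumerate(factors):
        # Contract axis `mode` of G with U^T, then put the new axis back in place.
        G = np.moveaxis(np.tensordot(U.T, G, axes=(1, mode)), 0, mode)
    return G, factors

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 5, 3))        # toy N x M x K tensor
G, (u1, u2, u3) = hosvd(x)

# x_ijk ~= sum G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k}
x_hat = np.einsum("abc,ia,jb,kc->ijk", G, u1, u2, u3)
```

Without truncation the reconstruction is exact; keeping only the leading columns of each factor matrix gives the approximate form written above.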

Application to integrated analysis of multiomics data sets

Individual omics data are associated with distinct features, so the number of features depends on the omics layer:

N_k: number of features of the kth omics data
M: number of single cells
K: number of omics (K = 3 in the present study)

$$x_{ijk} \in \mathbb{R}^{N_k \times M \times K}$$

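Apply SVD to x_{ijk}:

$$x_{ijk} \simeq \sum_{l=1}^{L} u_{lik} \lambda_l v_{ljk}$$

Project x_{ijk} onto u_{lik}, so that every omics layer is described by the same number L of features:

$$x_{ljk} = \sum_{i=1}^{N_k} u_{lik} x_{ijk} \in \mathbb{R}^{L \times M \times K}$$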
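The projection step can be sketched as below: each omics matrix, with its own number of features N_k, is reduced to L rows by its own SVD, and the results are stacked into an L × M × K tensor. The function name `project_omics`, the toy sizes, and in particular the choice to replace missing values with zero before the SVD are assumptions made for this illustration; real data of this size would also call for a sparse or randomized SVD rather than `np.linalg.svd`.

```python
import numpy as np

def project_omics(omics, L):
    """omics: list of K arrays, each N_k x M (same cells, different features).

    Returns x_proj with shape (L, M, K) and the list of u_k (each L x N_k).
    """
    M = omics[0].shape[1]
    x_proj = np.empty((L, M, len(omics)))
    us = []
    for k, x in enumerate(omics):
        x = np.nan_to_num(x, nan=0.0)      # assumption: missing values -> 0
        U, _, _ = np.linalg.svd(x, full_matrices=False)
        u = U[:, :L].T                     # u[l, i] = u_{lik}
        x_proj[:, :, k] = u @ x            # x_{ljk} = sum_i u_{lik} x_{ijk}
        us.append(u)
    return x_proj, us

rng = np.random.default_rng(0)
M, L = 12, 5
omics = [rng.normal(size=(n, M)) for n in (40, 200, 300)]   # toy N_k values
omics[1][rng.random(omics[1].shape) < 0.9] = np.nan         # ~90% missing
x_proj, us = project_omics(omics, L)
```

After this step the three layers share the index l, so they can be handled as a single tensor even though their original feature counts differ.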

Apply HOSVD to x_{ljk} as

$$x_{ljk} \simeq \sum_{l_1=1}^{L} \sum_{l_2=1}^{M} \sum_{l_3=1}^{K} G(l_1 l_2 l_3) u_{l_1 l} u_{l_2 j} u_{l_3 k}$$

Apply UMAP to u_{l_2 j} (1 ≤ l_2 ≤ 3L, L = 10).

Dataset 1

The multiomics dataset retrieved from GEO ID GSE154762, which is denoted as Dataset 1 in this study, is composed of 899 single cells for which gene expression, DNA methylation, and DNA accessibility were measured. These single cells represent human oocyte maturation.

Dataset 2

The multiomics dataset retrieved from GEO ID GSE154762, which is denoted as Dataset 2 in this study, is composed of 852 single cells for which DNA methylation and DNA accessibility were measured, as well as 758 single cells for which gene expression was measured. These single cells represent the four time points of the mouse embryo.

[Figure: panels for Dataset 1 and Dataset 2]

Identify which u_{l_2 j} are coincident with the classification.

Categorical regression:

$$u_{l_2 j} = a_{l_2} + \sum_{s=1}^{S} b_{l_2 s} \delta_{js}$$

where δ_{js} = 1 when cell j belongs to the sth category and 0 otherwise, and a_{l_2} and b_{l_2 s} are regression coefficients.

Datasets 1 and 2: 18 values of l_2 are associated with corrected P-values less than 0.05.
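A regression on category indicators like the one above is equivalent to a one-way analysis of variance, so its overall significance can be summarized by an F statistic. The sketch below computes that statistic with NumPy for one singular value vector; the function name `categorical_f_stat` and the toy labels are assumptions. The P-value would then follow from the F distribution with (S - 1, M - S) degrees of freedom (for example via `scipy.stats.f.sf`) and be corrected for multiple testing across all l_2; that part is left out here.

```python
import numpy as np

def categorical_f_stat(u, labels):
    """F statistic for u_j = a + sum_s b_s * delta_js (one-way ANOVA form)."""
    u = np.asarray(u, dtype=float)
    labels = np.asarray(labels)
    cats = np.unique(labels)
    grand_mean = u.mean()
    ss_between = 0.0
    ss_within = 0.0
    for s in cats:
        group = u[labels == s]
        ss_between += len(group) * (group.mean() - grand_mean) ** 2
        ss_within += np.sum((group - group.mean()) ** 2)
    df_between = len(cats) - 1
    df_within = len(u) - len(cats)
    return (ss_between / df_between) / (ss_within / df_within)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)                 # toy: 3 categories, 60 cells
u_related = labels + 0.1 * rng.normal(size=60)    # tracks the categories
u_unrelated = rng.normal(size=60)                 # pure noise

f_related = categorical_f_stat(u_related, labels)
f_unrelated = categorical_f_stat(u_unrelated, labels)
```

A vector that separates the categories yields a much larger F statistic than an unrelated one, which is what the selection of l_2 relies on.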

Select the l_1 that has the largest

$$\sum_{l_2} \sum_{l_3=1}^{K} |G(l_1 l_2 l_3)|$$

where l_2 is restricted to the 18 selected values, since such an l_1 should be associated with the classification. For Datasets 1 and 2, l_1 = 1 has the largest value.
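The selection of l_1 amounts to summing the absolute core-tensor entries over the selected l_2 and all l_3, then taking the arg-max over l_1. A minimal NumPy sketch follows; the function name `select_l1` and the toy core tensor are assumptions, and indices are 0-based, so the poster's l_1 = 1 corresponds to index 0.

```python
import numpy as np

def select_l1(G, selected_l2):
    """Return the 0-based l1 maximizing sum over selected l2 and all l3 of |G|."""
    score = np.abs(G[:, selected_l2, :]).sum(axis=(1, 2))
    return int(np.argmax(score)), score

# Toy core tensor with L1 = 4, L2 = 6, L3 = 3 (made-up values).
G = np.zeros((4, 6, 3))
G[2, [1, 3], :] = 5.0      # large entries at selected l2 values
G[0, 0, :] = 100.0         # large entries at an l2 that is NOT selected

best_l1, score = select_l1(G, selected_l2=[1, 3])
```

Restricting the sum to the selected l_2 matters: in the toy tensor the globally largest entries sit at an unselected l_2 and are correctly ignored.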

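Gene selection

Generate u_{l_1 i} from u_{l_1 l}:

$$u_{l_1 i} = \sum_{l=1}^{L} u_{l_1 l} u_{li1}$$

where u_{li1} is the singular value vector u_{lik} obtained for k = 1.

Attribute a P-value to gene i, assuming that u_{l_1 i} obeys a Gaussian distribution:

$$P_i = P_{\chi^2}\left[ > \sum_{l_1} \left( \frac{u_{l_1 i}}{\sigma_{l_1}} \right)^2 \right]$$

where the sum runs over the selected l_1 (here only l_1 = 1) and σ_{l_1} is the standard deviation of u_{l_1 i}.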
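The gene-selection step can be sketched as follows for a single selected l_1. The upper tail of a chi-squared distribution with one degree of freedom equals erfc(sqrt(x / 2)), so only the standard library and NumPy are needed. The function name `select_genes`, the use of the plain sample standard deviation for σ, and the Benjamini-Hochberg adjustment are assumptions made for this illustration; the poster only states that adjusted P-values were used.

```python
import math
import numpy as np

def select_genes(u_l1_l, u_gene, threshold=0.01):
    """u_l1_l: length-L vector u_{l1 l}; u_gene: L x N matrix u_{li1}."""
    u_i = u_l1_l @ u_gene                   # u_{l1 i} = sum_l u_{l1 l} u_{li1}
    z2 = (u_i / u_i.std()) ** 2             # assumption: sigma = sample std
    # Upper tail of chi-squared with 1 degree of freedom.
    p = np.array([math.erfc(math.sqrt(z / 2.0)) for z in z2])
    # Benjamini-Hochberg adjustment (assumed correction method).
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    adj_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.empty(n)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return np.flatnonzero(adj < threshold), adj

rng = np.random.default_rng(0)
L, N = 3, 1000
u_gene = rng.normal(size=(L, N))            # toy singular value vectors
u_gene[0, :5] = 20.0                        # five genes made into clear outliers
u_l1_l = np.array([1.0, 0.0, 0.0])          # toy u_{l1 l}

selected, adj_p = select_genes(u_l1_l, u_gene)
```

Genes are selected as outliers relative to the Gaussian null, so only those with unusually large |u_{l_1 i}| survive the adjusted threshold.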

45 genes and 175 genes were selected for Datasets 1 and 2, respectively, as those associated with adjusted P-values less than 0.01. Enrichment analysis was performed on these genes to validate the selection biologically.

Dataset 1

Forty-seven genes were enriched by H3K36me3 based on “ENCODE Histone Modifications 2015”; H3K36me3 is known to play critical roles during oocyte maturation [12]. Forty-seven genes were also targeted by MYC based on “ENCODE and ChEA Consensus TFs from ChIP-X”; Myc is known to play critical roles in oogenesis [13]. Forty-seven genes were also targeted by TAF7 based on “ENCODE and ChEA Consensus TFs from ChIP-X” and “ENCODE TF ChIP-seq 2015”; TAF7 is known to play critical roles during oocyte growth [14]. Forty-seven genes were also targeted by ATF2 based on “ENCODE and ChEA Consensus TFs from ChIP-X”; the expression of ATF2 is known to be altered during oocyte development [15].

Dataset 2

One hundred and seventy-five genes were enriched by H3K36me3 based on “ENCODE Histone Modifications 2015”; H3K36me3 is known to play critical roles during gastrulation [17]. One hundred and seventy-five genes were also targeted by MYC based on “ENCODE and ChEA Consensus TFs from ChIP-X”; Myc is also known to play critical roles in gastrulation [18]. One hundred and seventy-five genes were also targeted by TAF7 based on “ENCODE and ChEA Consensus TFs from ChIP-X” and “ENCODE TF ChIP-seq 2015”; TAF7 is known to play critical roles during gastrulation [19]. One hundred and seventy-five genes were also targeted by ATF2 based on “ENCODE and ChEA Consensus TFs from ChIP-X”; the expression of ATF2 is known to be maintained during gastrulation [20].

This suggests that tensor decomposition (TD) based unsupervised feature extraction (FE) correctly selected biologically reasonable genes.

TD can deal with data sets full of missing values.

Conclusions

TD is an effective method that can integrate single-cell multiomics data sets as they are, without preprocessing each individual omics data set specifically.
