Tensor-Decomposition-based Unsupervised Feature Extraction in Single-cell Multiomics Data Analysis

Slide 1

Slide 1 text

Tensor-Decomposition-based Unsupervised Feature Extraction in Single-cell Multiomics Data Analysis
Y-h. Taguchi, Chuo University, Tokyo, Japan, and Turki Turki, King Abdulaziz University, Jeddah, Saudi Arabia
Taguchi, Y.-h.; Turki, T. Tensor-Decomposition-Based Unsupervised Feature Extraction in Single-Cell Multiomics Data Analysis. Genes 2021, 12, 1442. https://doi.org/10.3390/genes12091442

Slide 2

Slide 2 text

Introduction
It is difficult to integrate single-cell multiomics data sets because:
1) The number of features is huge (~10^8 for site-wise measurements).
2) The data are full of missing values (only a few percent of the values are non-missing for site-wise measurements).
3) The omics layers have distinct numbers of features (e.g., ~10^4 for mRNA), which makes them difficult to integrate.

Slide 3

Slide 3 text

Conventional approaches:
1) Give up integrating multiomics data (analyze individual omics data separately).
2) Screen for filled features (i.e., exclude features with missing values).
3) Fill in missing values artificially (e.g., using Bayes predictors).
The proposed approach:
Integrate multiomics data sets that are full of missing values and have distinct numbers of features as they are, without any preprocessing, using tensor decomposition (TD).

Slide 4

Slide 4 text

GSE154762: Dataset 1
Number of cells: 899
Gene expression + DNA methylation + DNA accessibility
GSE121708: Dataset 2
Number of cells: 852 (758 for gene expression)
Gene expression + DNA methylation + DNA accessibility

Slide 5

Slide 5 text

Preprocessing
Gene expression: nothing
DNA methylation: -1: unmethylated, 0: missing value, 1: methylated
DNA accessibility: averaged over every 200-nucleotide region (this corresponds to four histone proteins + a linker protein)
Standardization
Gene expression: zero mean, variance of one
DNA methylation and accessibility for data set 1: scaled so that the mean absolute value is one
DNA methylation and accessibility for data set 2: nothing (because of heterogeneity)
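The encoding and scaling rules on this slide can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the input convention for methylation calls (1 = methylated, 0 = unmethylated, NaN = missing) is assumed, and the slide does not say along which axis the statistics are computed, so the choices below (per-cell standardization, whole-matrix mean absolute value) are guesses rather than the authors' procedure.

```python
import numpy as np

def standardize(x, axis=0):
    """Shift and scale to zero mean and unit variance along `axis`.

    Assumption: axis=0 standardizes over features within each cell
    (rows = features, columns = cells); the slide does not specify this.
    """
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=axis, keepdims=True)
    std = x.std(axis=axis, keepdims=True)
    std[std == 0] = 1.0  # leave constant slices at zero instead of dividing by 0
    return (x - mean) / std

def encode_methylation(calls):
    """Map calls to -1 (unmethylated), 0 (missing), 1 (methylated).

    Assumption: input uses 1 = methylated, 0 = unmethylated, NaN = missing.
    """
    calls = np.asarray(calls, dtype=float)
    missing = np.isnan(calls)
    encoded = np.where(np.nan_to_num(calls) > 0.5, 1.0, -1.0)
    encoded[missing] = 0.0
    return encoded

def scale_mean_abs_to_one(x):
    """Rescale so that the mean absolute value over the whole array is one."""
    x = np.asarray(x, dtype=float)
    scale = np.abs(x).mean()
    return x if scale == 0 else x / scale
```

Because missing values are encoded as 0, they contribute nothing to sums and products in the later decomposition, which is one way to read the "as it is" treatment of missing data.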

Slide 6

Slide 6 text

For data set 1 or 2: $x_{ijk} \in \mathbb{R}^{N_k \times M \times 3}$
$N_k$: number of features of the $k$th omics data ($k=1$: gene expression, $k=2$: DNA methylation, $k=3$: DNA accessibility)
$M$: number of cells
Since the $N_k$ are not common to all omics data, we need to adjust them to one value.

Slide 7

Slide 7 text

Full of missing values

Slide 8

Slide 8 text

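Each omics data set is expanded as
$x_{ijk} = \sum_{\ell=1}^{L} \lambda_\ell u_{\ell i k} u_{\ell j k}$
and projected onto $L$ common dimensions:
$x_{\ell j k} = \sum_{i=1}^{N_k} x_{ijk} u_{\ell i k} \in \mathbb{R}^{L \times M \times K}$
Apply TD to $x_{\ell j k}$ to get
$x_{\ell j k} = \sum_{\ell_1=1}^{L} \sum_{\ell_2=1}^{M} \sum_{\ell_3=1}^{3} G(\ell_1 \ell_2 \ell_3) u_{\ell_1 \ell} u_{\ell_2 j} u_{\ell_3 k}$
where we employ $L=10$.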
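The two steps on this slide (reduce every omics layer to L common dimensions, then decompose the stacked tensor) can be sketched with NumPy. The sketch assumes that the per-layer expansion is an ordinary singular value decomposition and that the TD is a higher-order SVD (HOSVD); the slides name neither algorithm. The function names are hypothetical, and dense `np.linalg.svd` is only practical for small toy matrices, not for ~10^8 features.

```python
import numpy as np

def project_to_common_dim(x_k, L=10):
    """Reduce one omics layer (N_k features x M cells) to L x M.

    Missing entries are assumed to be stored as 0. Returns
    x_{l j} = sum_i x_{i j} u_{l i}, with u the leading left singular vectors.
    """
    u, _, _ = np.linalg.svd(x_k, full_matrices=False)
    return u[:, :L].T @ x_k

def unfold(t, mode):
    """Matricize tensor t so that axis `mode` indexes the rows."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(t):
    """HOSVD of a 3-mode tensor: returns (core tensor G, [u1, u2, u3])."""
    factors = []
    for mode in range(3):
        u, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(u)  # columns are the singular value vectors of this mode
    core = np.einsum('ljk,la,jb,kc->abc', t, *factors)
    return core, factors

# Toy data: three layers with different feature counts but the same 40 cells.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(n_k, 40)) for n_k in (200, 500, 300)]
tensor = np.stack([project_to_common_dim(x, L=10) for x in layers], axis=-1)
core, (u_dim, u_cell, u_omics) = hosvd(tensor)
print(tensor.shape, u_cell.shape)
```

With `full_matrices=False`, the cell-mode factor has at most L x K = 30 columns, because the unfolded matrix for the cell mode has only 30 columns; any further cell vectors would carry zero weight in the core tensor.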

Slide 9

Slide 9 text

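What is tensor decomposition (TD)?
Expand a tensor as a series of products of vectors.
[Figure: the tensor $x_{\ell j k}$ ($\ell$: reduced dimension, $j$: cells, $k$: multiomics) drawn as the core tensor $G$ multiplied by the vectors $u_{\ell_1 \ell}$, $u_{\ell_2 j}$, $u_{\ell_3 k}$]
$x_{\ell j k} \simeq \sum_{\ell_1=1}^{L_1} \sum_{\ell_2=1}^{L_2} \sum_{\ell_3=1}^{L_3} G(\ell_1 \ell_2 \ell_3) u_{\ell_1 \ell} u_{\ell_2 j} u_{\ell_3 k}$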

Slide 10

Slide 10 text

Select the $u_{\ell_2 j}$ associated with the classification $s$.
Data set 1: human oocyte maturation. Classification: cell types.
Data set 2: four time points of the mouse embryo. Classification: time points.
Check which $u_{\ell_2 j}$ coincide with the classes $s$:
$u_{\ell_2 j} = a_{\ell_2 s} \delta_{js} + b_{\ell_2}$
$a_{\ell_2 s}, b_{\ell_2}$: regression coefficients
$\delta_{js} = 1$ when $j \in s$, otherwise $0$
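One way to check which cell vectors coincide with the classes is sketched below. It relies on the fact that a least-squares fit of a vector on class indicators has the class means as fitted values, so the regression reduces to a one-way ANOVA F statistic. The function name is hypothetical, and the slides do not state which test or significance threshold was actually used.

```python
import numpy as np

def class_regression_f(u, labels):
    """F statistic of regressing each column of u on class indicators.

    u: (M cells, n_vectors) array, one cell vector per column.
    labels: length-M array of class ids (cell types or time points).
    Larger F means the vector separates the classes more clearly.
    """
    u = np.asarray(u, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n_cells, n_classes = u.shape[0], classes.size
    grand_mean = u.mean(axis=0)
    ss_between = np.zeros(u.shape[1])
    ss_within = np.zeros(u.shape[1])
    for s in classes:
        members = u[labels == s]          # cells j with delta_js = 1
        class_mean = members.mean(axis=0)  # fitted value for class s
        ss_between += members.shape[0] * (class_mean - grand_mean) ** 2
        ss_within += ((members - class_mean) ** 2).sum(axis=0)
    return (ss_between / (n_classes - 1)) / (ss_within / (n_cells - n_classes))

# Toy check: column 0 follows the classes, column 1 is pure noise.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2, 3], 25)
u = np.column_stack([labels + 0.1 * rng.normal(size=100),
                     rng.normal(size=100)])
f_stats = class_regression_f(u, labels)
```

Turning the F statistics into P-values (for example with `scipy.stats.f.sf`) and correcting for multiple comparisons would be the natural next step before calling a vector significant.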

Slide 11

Slide 11 text

18 (for data set 1) and 12 (for data set 2) $u_{\ell_2 j}$ are significantly correlated with the classifications. UMAP was applied to the top 30 $u_{\ell_2 j}$, giving the two-dimensional embeddings shown in the following slides.

Slide 12

Slide 12 text

data set 1

Slide 13

Slide 13 text

data set 2

Slide 14

Slide 14 text

We also performed gene selection and validated the selected genes biologically using enrichment analysis, but there is no time to present those results.
Conclusions:
We have applied TD to the integration of single-cell multiomics data sets. Without specific preprocessing, TD successfully obtained a low-dimensional embedding from which UMAP can generate an embedding coincident with the classification.