
Tensor-Decomposition-based Unsupervised Feature Extraction in Single-cell Multiomics Data Analysis

Y-h. Taguchi
October 31, 2021

Presentation at ICBBS2021 (http://www.icbbs.org/), given online on 31 October 2021.

Published paper:
Taguchi, Y.-h.; Turki, T. Tensor-Decomposition-Based Unsupervised Feature Extraction in Single-Cell Multiomics Data Analysis. Genes 2021, 12, 1442. https://doi.org/10.3390/genes12091442

Transcript

  1. Tensor-Decomposition-based Unsupervised Feature Extraction in Single-cell Multiomics Data Analysis. Y-h. Taguchi (Chuo University, Tokyo, Japan) and Turki Turki (King Abdulaziz University, Jeddah, Saudi Arabia). Taguchi, Y.-h.; Turki, T. Tensor-Decomposition-Based Unsupervised Feature Extraction in Single-Cell Multiomics Data Analysis. Genes 2021, 12, 1442. https://doi.org/10.3390/genes12091442
  2. Introduction: It is difficult to integrate multiomics single-cell data sets because 1) the number of features is huge ($\sim 10^8$ for site-wise measurements); 2) the data are full of missing values (only a few percent of the site-wise measurements are non-missing); 3) the omics layers have distinct numbers of features (e.g. mRNA $\sim 10^4$), which makes them difficult to integrate.
  3. Conventional approaches: give up integrating multiomics data (analyze individual omics data separately); screen for filled features (i.e. exclude features with missing values); or fill in missing values artificially (e.g. using Bayes predictors). The proposed approach: integrate multiomics data sets that are full of missing values and have distinct numbers of features as they are, without any such pre-processing, using tensor decomposition (TD).
  4. Dataset 1 (GSE154762): number of cells: 899; gene expression + DNA methylation + DNA accessibility. Dataset 2 (GSE121708): number of cells: 852 (758 for gene expression); gene expression + DNA methylation + DNA accessibility.
  5. Preprocessing. Gene expression: nothing. DNA methylation: -1: unmethylated, 0: missing value, 1: methylated. DNA accessibility: averaged over every 200-nucleotide region (corresponding to four histone proteins plus a linker protein). Standardization. Gene expression: zero mean and variance of 1. DNA methylation and accessibility for data set 1: scaled so that the mean absolute value is one. Those for data set 2: nothing (because of heterogeneity).
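
The encoding and scaling rules on this slide can be sketched in NumPy as below. This is only an illustration under stated assumptions: the input convention (1.0, 0.0, NaN), the function names, and the choice to standardize along the feature axis within each cell are hypothetical and not taken from the slides.

```python
import numpy as np

def encode_methylation(calls):
    """Encode calls as -1 (unmethylated), 0 (missing), +1 (methylated).

    Assumed input convention (not from the slides): 1.0 = methylated,
    0.0 = unmethylated, NaN = not measured.
    """
    calls = np.asarray(calls, dtype=float)
    encoded = np.zeros_like(calls)      # missing values stay at 0
    encoded[calls == 1.0] = 1.0
    encoded[calls == 0.0] = -1.0
    return encoded

def standardize(x, axis=0):
    """Shift and scale to zero mean and unit variance along `axis`."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=axis, keepdims=True)
    return centered / centered.std(axis=axis, keepdims=True)

def scale_mean_abs(x, axis=0):
    """Scale so that the mean absolute value along `axis` is one.

    Zeros (missing values) stay zero; an all-zero slice would divide by zero.
    """
    x = np.asarray(x, dtype=float)
    return x / np.abs(x).mean(axis=axis, keepdims=True)

# Toy data: rows are features, columns are cells.
calls = np.array([[1.0, np.nan], [0.0, 1.0], [np.nan, 1.0]])
meth = scale_mean_abs(encode_methylation(calls))
expr = standardize(np.array([[1.0, 2.0], [3.0, 5.0], [8.0, 2.0]]))
print(meth)
print(expr.mean(axis=0), expr.var(axis=0))
```

Coding missing values as 0 means they contribute nothing to the inner products used in the later decomposition, so no imputation step is needed.
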
  6. For data set 1 or 2: $x_{ijk} \in \mathbb{R}^{N_k \times M \times 3}$, where $N_k$ is the number of features of the $k$th omics data ($k=1$: gene expression, $k=2$: DNA methylation, $k=3$: DNA accessibility) and $M$ is the number of cells. Since the $N_k$ are not common, we need to adjust them to a single value.
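  7. Each omics layer is decomposed as $x_{ijk} = \sum_{l=1}^{L} \lambda_l u_{lik} u_{ljk}$ and projected onto the $u_{lik}$: $x_{ljk} = \sum_{i=1}^{N_k} x_{ijk} u_{lik} \in \mathbb{R}^{L \times M \times K}$, where we employ $L=10$. Apply TD to $x_{ljk}$ to get $x_{ljk} = \sum_{l_1=1}^{L} \sum_{l_2=1}^{M} \sum_{l_3=1}^{3} G(l_1 l_2 l_3) u_{l_1 l} u_{l_2 j} u_{l_3 k}$.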
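
The projection step on this slide, which brings omics layers with different $N_k$ to a common size $L$, might look roughly like the sketch below. It assumes the per-layer decomposition is an ordinary singular value decomposition of dense matrices; the function name and the random toy data are hypothetical.

```python
import numpy as np

def project_to_common_rank(omics, L=10):
    """Project each (N_k x M) matrix onto its top-L left singular vectors.

    Returns an array of shape (L, M, K) holding
    x_{ljk} = sum_i x_{ijk} u_{lik}.
    """
    projected = []
    for x in omics:
        # Economy SVD: x = U @ diag(s) @ Vt, with U of shape (N_k, min(N_k, M)).
        U, s, Vt = np.linalg.svd(x, full_matrices=False)
        projected.append(U[:, :L].T @ x)       # shape (L, M)
    return np.stack(projected, axis=-1)        # shape (L, M, K)

# Toy stand-in: three omics layers with different feature counts, 40 cells.
rng = np.random.default_rng(0)
M = 40
omics = [rng.standard_normal((n_k, M)) for n_k in (500, 2000, 1200)]
x_ljk = project_to_common_rank(omics, L=10)
print(x_ljk.shape)  # (10, 40, 3)
```

For site-wise measurements with a very large number of features, a dense SVD like this would not be practical; a sparse truncated solver such as `scipy.sparse.linalg.svds` would be the more likely choice, though the slides do not say which solver was used.
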
  8. What is tensor decomposition (TD)? It expands a tensor as a sum of products of vectors. For $x_{ljk}$ ($l$: reduced dimension, $j$: cells, $k$: multiomics): $x_{ljk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 l} u_{l_2 j} u_{l_3 k}$. [Figure: the tensor drawn as a core tensor $G(l_1 l_2 l_3)$ multiplied by the vectors $u_{l_1 l}$, $u_{l_2 j}$ and $u_{l_3 k}$.]
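
An expansion of this form can be computed with a higher-order SVD (HOSVD). The sketch below is a minimal NumPy version on a random stand-in tensor; treating the TD as an untruncated HOSVD, and the helper names `unfold` and `hosvd`, are assumptions made for illustration.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize: move `mode` to the front and flatten the remaining axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor):
    """Return the core tensor G and one factor matrix per mode."""
    factors = []
    for mode in range(tensor.ndim):
        U, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(U)   # column l1 of factors[0] plays the role of u_{l1 l}
    # G(l1, l2, l3) = sum_{l,j,k} x_{ljk} U1[l, l1] U2[j, l2] U3[k, l3]
    core = np.einsum("ljk,la,jb,kc->abc", tensor, *factors)
    return core, factors

rng = np.random.default_rng(0)
x_ljk = rng.standard_normal((10, 40, 3))    # stand-in for the projected tensor
G, (U1, U2, U3) = hosvd(x_ljk)

# Rebuild x_{ljk} = sum G(l1 l2 l3) u_{l1 l} u_{l2 j} u_{l3 k}
x_rebuilt = np.einsum("abc,la,jb,kc->ljk", G, U1, U2, U3)
print(G.shape, np.allclose(x_rebuilt, x_ljk))
```

In this toy example the cell mode yields only 30 columns in `U2`, because the matrix unfolded along the cell axis has rank at most $L \times 3 = 30$.
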
  9. Select the $u_{l_2 j}$ associated with the classification, $s$. Data set 1: human oocyte maturation; classification: cell types. Data set 2: four time points of the mouse embryo; classification: time points. Check which $u_{l_2 j}$ coincide with the classes $s$ using the regression $u_{l_2 j} = \sum_s a_{l_2 s} \delta_{js} + b_{l_2}$, where $a_{l_2 s}$ and $b_{l_2}$ are regression coefficients and $\delta_{js} = 1$ when $j \in s$ and $0$ otherwise.
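
The class regression on this slide amounts to fitting each cell-mode vector with a one-hot class indicator plus an intercept. The sketch below scores each vector by the $R^2$ of that fit; the function name, the toy labels, and the use of $R^2$ as the score are assumptions for illustration, and the slides do not state how significance was assessed.

```python
import numpy as np

def class_fit_r2(u, labels):
    """R^2 of u_j = sum_s a_s * delta_js + b, computed per column of u.

    u: (M, n_vectors) array, one column per u_{l2 j};
    labels: length-M array of class labels.
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # delta_js: 1 when cell j belongs to class s, else 0
    delta = (labels[:, None] == classes[None, :]).astype(float)
    design = np.hstack([delta, np.ones((len(labels), 1))])  # last column: b
    # One-hot plus intercept is rank-deficient; lstsq returns the
    # minimum-norm solution, which leaves the fitted values unchanged.
    coef, *_ = np.linalg.lstsq(design, u, rcond=None)
    residual = u - design @ coef
    total = u - u.mean(axis=0)
    return 1.0 - (residual ** 2).sum(axis=0) / (total ** 2).sum(axis=0)

# Toy data: 40 cells in 4 classes; column 0 tracks the class, column 1 is noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2, 3], 10)
u = np.column_stack([
    labels + 0.1 * rng.standard_normal(40),
    rng.standard_normal(40),
])
r2 = class_fit_r2(u, labels)
print(r2)
```

A P-value for each vector could then be obtained from the corresponding one-way ANOVA F-test (for example with `scipy.stats.f_oneway`) and corrected for multiple testing before counting the significant vectors.
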
  10. 18 (for data set 1) and 12 (for data set 2) $u_{l_2 j}$ are significantly correlated with the classifications. UMAP was applied to the top 30 $u_{l_2 j}$, giving the two-dimensional embeddings shown in the following slides.
  11. We also performed gene selection and validated the selected genes biologically using enrichment analysis, but there is no time to present these results. Conclusions: We have applied TD to the integration of single-cell multiomics data sets. Without specific preprocessing, TD successfully obtained a low-dimensional embedding from which UMAP can generate an embedding coincident with the classification.