Tensor-Decomposition-based Unsupervised Feature Extraction in Single-cell Multiomics Data Analysis

Y-h. Taguchi
October 31, 2021

Presentation at ICBBS2021
http://www.icbbs.org/
on 31 Oct. 2021
(Online presentation)

The published paper is:
Taguchi, Y.-h.; Turki, T. Tensor-Decomposition-Based Unsupervised Feature Extraction in Single-Cell Multiomics Data Analysis. Genes 2021, 12, 1442. https://doi.org/10.3390/genes12091442

Transcript

  1. Tensor-Decomposition-based Unsupervised Feature Extraction in Single-cell Multiomics Data Analysis

    Y-h. Taguchi (Chuo University, Tokyo, Japan) and Turki Turki (King Abdulaziz University, Jeddah, Saudi Arabia). Taguchi, Y.-h.; Turki, T. Tensor-Decomposition-Based Unsupervised Feature Extraction in Single-Cell Multiomics Data Analysis. Genes 2021, 12, 1442. https://doi.org/10.3390/genes12091442
  2. Introduction

    It is difficult to integrate multiomics single-cell data sets because 1) the number of features is huge ($\sim 10^8$ for site-wise measurements); 2) the data are full of missing values (only a few percent of site-wise measurements are non-missing); 3) the omics layers have distinct numbers of features, which makes them hard to integrate (e.g., mRNA $\sim 10^4$).
  3. Conventional approaches

    Give up integrating multiomics data (analyze individual omics data separately). Screen for filled features (i.e., exclude features with missing values). Fill in missing values artificially (e.g., using Bayes predictors).

    The proposed approach: integrate multiomics data sets that are full of missing values and have distinct numbers of features, as they are and without any preprocessing, using tensor decomposition (TD).
  4. GSE154762: Dataset 1

    Number of cells: 899. Gene expression + DNA methylation + DNA accessibility.

    GSE121708: Dataset 2. Number of cells: 852 (758 for gene expression). Gene expression + DNA methylation + DNA accessibility.
  5. Preprocessing

    Gene expression: nothing. DNA methylation: -1: unmethylated, 0: missing value, 1: methylated. DNA accessibility: averaged over every 200-nucleotide region (this length corresponds to four histone proteins plus a linker protein).

    Standardization: gene expression: zero mean and variance of one. DNA methylation and accessibility for data set 1: mean absolute value of one. Those for data set 2: nothing (because of heterogeneity).
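The preprocessing on this slide can be illustrated with a minimal sketch. It is not the authors' code: the input labels `"M"`, `"U"` and `None`, the function names, and the choice to standardize one vector at a time are all assumptions made for illustration (the slide does not say whether scaling is done per feature or per cell).

```python
from statistics import mean, pstdev

def encode_methylation(calls):
    """Map per-site calls to -1 (unmethylated), 0 (missing), 1 (methylated)."""
    mapping = {"U": -1, None: 0, "M": 1}  # hypothetical input labels
    return [mapping[c] for c in calls]

def bin_average(values, width=200):
    """Average a per-nucleotide signal over consecutive windows of `width`."""
    return [mean(values[i:i + width]) for i in range(0, len(values), width)]

def zscore(values):
    """Shift and scale to zero mean and unit variance (gene expression)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def scale_mean_abs(values):
    """Rescale so that the mean absolute value is one (data set 1)."""
    scale = mean(abs(v) for v in values)
    return [v / scale for v in values]
```

Coding a missing methylation call as 0 means it contributes nothing to sums over sites, which is one way to read "without filling missing values" on the earlier slide.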
  6. For data set 1 or 2: $x_{ijk} \in \mathbb{R}^{N_k \times M \times 3}$

    $N_k$: number of features of the $k$th omics data ($k=1$: gene expression, $k=2$: DNA methylation, $k=3$: DNA accessibility). $M$: number of cells. Since the $N_k$ are not common, we need to adjust them to one value.
  7. Full of missing values

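  8. $x_{ijk} = \sum_{l=1}^{L} \lambda_l u_{lik} u_{ljk}$

    $x_{ljk} = \sum_{i=1}^{N_k} x_{ijk} u_{lik} \in \mathbb{R}^{L \times M \times K}$

    Apply TD to $x_{ljk}$, where we employ $L = 10$, to get

    $x_{ljk} = \sum_{l_1=1}^{L} \sum_{l_2=1}^{M} \sum_{l_3=1}^{3} G(l_1 l_2 l_3) u_{l_1 l} u_{l_2 j} u_{l_3 k}$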
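The projection step on slide 8 can be sketched with NumPy. This is an illustration under assumptions, not the authors' implementation: the data are small random matrices, the feature counts are invented, and a dense `np.linalg.svd` is used where real site-wise data would need a sparse or truncated solver. Only `L = 10` comes from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
M, L = 20, 10                  # M cells (toy value); L = 10 as on the slide
n_features = [50, 300, 120]    # toy N_k, deliberately unequal

# Toy omics matrices x_{ijk}: one (N_k x M) array per omics layer k.
omics = [rng.standard_normal((n, M)) for n in n_features]

projected = []
for x in omics:
    # SVD: x = U diag(s) Vt; the columns of U play the role of u_{lik}.
    U, s, Vt = np.linalg.svd(x, full_matrices=False)
    U_L = U[:, :L]                 # top-L left singular vectors, (N_k x L)
    projected.append(U_L.T @ x)    # x_{ljk} = sum_i x_{ijk} u_{lik}, (L x M)

# Stack the layers along a third axis: tensor x_{ljk} of shape (L, M, K).
tensor = np.stack(projected, axis=2)
```

After projection every layer has the same first dimension `L`, so layers with very different `N_k` can be stacked into a single tensor.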
  9. What is tensor decomposition (TD)?

    Expand a tensor as a series of products of vectors. Indices: $l$: reduced dimension, $j$: cells, $k$: multiomics. (Diagram: the tensor $x_{ljk}$ expressed through the core tensor $G$ and the vectors $u_{l_1 l}$, $u_{l_2 j}$, $u_{l_3 k}$.)

    $x_{ljk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 l} u_{l_2 j} u_{l_3 k}$
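One common way to obtain an expansion of this form is higher-order SVD (HOSVD), sketched below with NumPy. The slide does not name the algorithm, so treating TD as HOSVD here is an assumption, and the function names `unfold` and `hosvd` and the random test tensor are illustrative only.

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor `t` so that axis `mode` indexes the rows."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(t):
    """Return the core tensor G and one factor matrix per mode."""
    factors = []
    for mode in range(t.ndim):
        # Left singular vectors of each unfolding give that mode's vectors.
        U, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(U)
    # G(l1 l2 l3) = sum_{ljk} x_{ljk} u_{l1 l} u_{l2 j} u_{l3 k}
    core = np.einsum("ljk,la,jb,kc->abc", t, *factors)
    return core, factors

rng = np.random.default_rng(1)
x = rng.standard_normal((10, 20, 3))   # toy tensor with axes (l, j, k)
G, (U1, U2, U3) = hosvd(x)

# Rebuild x from the core tensor and the three sets of vectors.
x_rebuilt = np.einsum("abc,la,jb,kc->ljk", G, U1, U2, U3)
```

With all components kept, as here, the rebuilt tensor matches the input; truncating the sums to fewer components gives the approximate form written on the slide.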
  10. Select $u_{l_2 j}$ associated with the classification, $s$

    Data set 1: human oocyte maturation. Classification: cell types. Data set 2: four time points of the mouse embryo. Classification: time points.

    Check which $u_{l_2 j}$ is coincident with the classes, $s$:

    $u_{l_2 j} = a_{l_2 s} \delta_{js} + b_{l_2}$

    $a_{l_2 s}$, $b_{l_2}$: regression coefficients. $\delta_{js} = 1$ when $j \in s$, otherwise $0$.
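Regressing a vector on class-indicator variables is equivalent to comparing class means, so the check on slide 10 can be sketched as a one-way F statistic in plain Python. How significance was actually assessed is not stated on the slide; the function name, the toy labels and the toy values below are assumptions, and converting F to a P-value (which needs the F distribution) is left out.

```python
from statistics import mean

def class_f_statistic(u, labels):
    """F statistic for regressing u_j on class indicators delta_js."""
    classes = sorted(set(labels))
    grand = mean(u)
    groups = {s: [v for v, lab in zip(u, labels) if lab == s] for s in classes}
    # The least-squares fitted value for a cell in class s is the class mean.
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(classes) - 1
    df_within = len(u) - len(classes)
    return (between / df_between) / (within / df_within)

labels = ["a", "a", "a", "b", "b", "b"]          # toy class of each cell j
u_related = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]       # tracks the classes
u_unrelated = [1.0, 3.0, 2.0, 2.0, 1.0, 3.0]     # same mean in both classes

f_related = class_f_statistic(u_related, labels)
f_unrelated = class_f_statistic(u_unrelated, labels)
```

A large F, as for `u_related`, indicates a vector that separates the classes; `u_unrelated` has identical class means and therefore an F of zero.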
  11. 18 (for data set 1) and 12 (for data set 2) of the $u_{l_2 j}$ are significantly correlated with the classification.

    UMAP was applied to the top 30 $u_{l_2 j}$, and we obtained the two-dimensional embeddings shown in the following slides.
  12. data set 1

  13. data set 2

  14. We also performed gene selection and biological validation of the selected genes using enrichment analysis, but there is no time to present them.

    Conclusions: We have applied TD to the integration of single-cell multiomics data sets. Without specific preprocessing, TD successfully obtained a low-dimensional embedding from which UMAP can generate an embedding coincident with the classification.