Tensor-Decomposition-based Unsupervised Feature Extraction in Single-cell Multiomics Data Analysis

Y-h. Taguchi
October 31, 2021

Presentation at ICBBS2021
http://www.icbbs.org/
on 31 Oct. 2021
(Online presentation)

The published paper is:
Taguchi, Y.-h.; Turki, T. Tensor-Decomposition-Based Unsupervised Feature Extraction in Single-Cell Multiomics Data Analysis. Genes 2021, 12, 1442. https://doi.org/10.3390/genes12091442

Transcript

  1. Tensor-Decomposition-based Unsupervised Feature Extraction
    in Single-cell Multiomics Data Analysis
    Y-h. Taguchi
    Chuo University, Tokyo, Japan.
    and
    Turki Turki
    King Abdulaziz University, Jeddah, Saudi Arabia
    Taguchi, Y.-h.; Turki, T. Tensor-Decomposition-Based
    Unsupervised Feature Extraction in Single-Cell Multiomics Data
    Analysis. Genes 2021, 12, 1442.
    https://doi.org/10.3390/genes12091442

  2. Introduction
    Integrating multiomics single-cell data sets is difficult because:
    1) The number of features is huge (~10^8 for site-wise measurements).
    2) The data are full of missing values (only a few percent of the values
    are non-missing for site-wise measurements).
    3) The numbers of features differ greatly between omics data (e.g., ~10^4 for mRNA).

  3. Conventional approaches:
    Give up integrating multiomics data
    (analyze each omics data set separately).
    Screen for filled features (i.e., exclude features with missing values).
    Fill in missing values artificially (e.g., using Bayes predictors).
    The proposed approach:
    Integrate multiomics data sets as they are, even though they are full of
    missing values and have different numbers of features, without any
    preprocessing, using tensor decomposition (TD).

  4. GSE154762: Dataset 1
    Number of cells: 899
    Gene expression + DNA methylation + DNA accessibility
    GSE121708: Dataset 2
    Number of cells: 852 (758 for gene expression)
    Gene expression + DNA methylation + DNA accessibility

  5. Preprocessing
    Gene expression: nothing
    DNA methylation: -1: unmethylated, 0: missing value, 1: methylated
    DNA accessibility: averaged over every 200-nucleotide region
    (the length corresponding to four histone proteins + a linker protein)
    Standardization:
    Gene expression: zero mean, variance of 1
    DNA methylation and accessibility for data set 1:
    mean absolute value of one
    Those for data set 2: nothing (because of heterogeneity)
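
The encoding and scaling steps listed on this slide could be sketched in NumPy as below. This is only an illustration, not the authors' code: the function names, the input conventions (0.0 / 1.0 / NaN methylation calls, a one-dimensional per-nucleotide accessibility signal) and the choice of standardizing along `axis=0` are all assumptions, since the slide does not state them.

```python
import numpy as np

def encode_methylation(calls):
    """Map calls to -1 (unmethylated), 0 (missing), +1 (methylated).

    Assumption: `calls` holds 0.0 (unmethylated), 1.0 (methylated) or NaN (missing).
    """
    calls = np.asarray(calls, dtype=float)
    encoded = np.zeros_like(calls)      # missing values stay 0
    encoded[calls == 1.0] = 1.0
    encoded[calls == 0.0] = -1.0
    return encoded

def bin_accessibility(signal, width=200):
    """Average a 1-D per-nucleotide signal over consecutive `width`-nucleotide bins."""
    signal = np.asarray(signal, dtype=float)
    n_bins = signal.shape[0] // width
    trimmed = signal[: n_bins * width]  # drop an incomplete trailing bin
    return trimmed.reshape(n_bins, width).mean(axis=1)

def standardize_expression(x, axis=0):
    """Shift and scale to zero mean and unit variance along `axis` (axis is an assumption)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=axis, keepdims=True)) / x.std(axis=axis, keepdims=True)

def scale_mean_abs(x, axis=0):
    """Scale so that the mean absolute value along `axis` is one (axis is an assumption)."""
    x = np.asarray(x, dtype=float)
    return x / np.abs(x).mean(axis=axis, keepdims=True)
```

Coding missing methylation values as 0 means they contribute nothing to the sums used in the later decomposition, which is one way to read "as it is" integration without imputation.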

  6. For data set 1 or 2:
    $x_{ijk} \in \mathbb{R}^{N_k \times M \times 3}$
    $N_k$: number of features of the $k$th omics data
    ($k=1$: gene expression, $k=2$: DNA methylation, $k=3$: DNA accessibility)
    $M$: number of cells
    Since the $N_k$ differ between omics data, they must be adjusted to a common value.

  7. Full of missing values

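  8. $x_{ijk} = \sum_{l=1}^{L} \lambda_l u_{lik} u_{ljk}$
    $x_{ljk} = \sum_{i=1}^{N_k} x_{ijk} u_{lik} \in \mathbb{R}^{L \times M \times K}$
    where we employ $L = 10$.
    Apply TD to $x_{ljk}$ to get
    $x_{ljk} = \sum_{l_1=1}^{L} \sum_{l_2=1}^{M} \sum_{l_3=1}^{3} G(l_1 l_2 l_3) u_{l_1 l} u_{l_2 j} u_{l_3 k}$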
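
The two steps on this slide (reduce every omics matrix to a common dimension L, then decompose the stacked tensor) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the data are random, the feature counts are tiny, the first equation is read as a singular value decomposition of each omics matrix, and the TD is implemented as a higher-order SVD via mode-wise unfoldings. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M, L = 40, 10                     # cells; reduced dimension (L = 10 as on the slide)
feature_counts = [50, 300, 120]   # toy N_k; the real N_k differ by orders of magnitude

# Toy omics matrices x_k of shape (N_k, M), one per omics layer k.
omics = [rng.standard_normal((n_k, M)) for n_k in feature_counts]

def project_to_L(x_k, L):
    """Return the L x M matrix sum_i x_ijk u_lik using the top-L left singular vectors."""
    u, _, _ = np.linalg.svd(x_k, full_matrices=False)
    return u[:, :L].T @ x_k

# Stack the K = 3 projections into a tensor x_ljk of shape (L, M, K).
x = np.stack([project_to_L(x_k, L) for x_k in omics], axis=2)

def unfold(t, mode):
    """Matricize tensor `t` so that `mode` becomes the row index."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(t):
    """Higher-order SVD: singular vectors of each unfolding, then the core tensor."""
    factors = [np.linalg.svd(unfold(t, m), full_matrices=False)[0]
               for m in range(t.ndim)]
    core = t
    for m, u in enumerate(factors):
        # Contract mode m of the core with u, i.e. multiply that mode by u^T.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, m)), 0, m)
    return core, factors

G, (U1, U2, U3) = hosvd(x)

# Rebuild x_ljk = sum G(l1 l2 l3) u_{l1 l} u_{l2 j} u_{l3 k} to check the decomposition.
recon = np.einsum("abc,la,jb,kc->ljk", G, U1, U2, U3)
print(x.shape, U2.shape, np.allclose(recon, x))
```

Here the columns of `U2` play the role of the cell-side vectors $u_{l_2 j}$. For matrices with ~10^8 rows a dense `np.linalg.svd` would not be practical; a truncated sparse solver would be the natural substitute, but that choice is outside what the slide specifies.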

  9. What is tensor decomposition (TD)?
    Expand a tensor as a sum of products of vectors:
    $x_{ljk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 l} u_{l_2 j} u_{l_3 k}$
    [Figure: the tensor $x_{ljk}$ (axes $l$: reduced dimension, $j$: cells, $k$: multiomics) drawn as the core tensor $G$ multiplied by the vectors $u_{l_1 l}$, $u_{l_2 j}$, $u_{l_3 k}$]

  10. Select $u_{l_2 j}$ associated with the classification $s$.
    Data set 1: human oocyte maturation
    Classification: cell types
    Data set 2: four time points of the mouse embryo
    Classification: time points
    Check which $u_{l_2 j}$ coincide with the classes $s$ by regression:
    $u_{l_2 j} = \sum_s a_{l_2 s} \delta_{js} + b_{l_2}$
    $a_{l_2 s}$, $b_{l_2}$: regression coefficients
    $\delta_{js} = 1$ when $j \in s$, otherwise $0$
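
A minimal sketch of the regression on this slide, for a single cell-side vector: each cell's value is modeled by a class indicator plus a constant, for which the least-squares fit is simply the class mean. The function name and the use of an F statistic to score the fit are assumptions made for illustration; the slide does not say which test or multiple-testing correction was used.

```python
import numpy as np

def class_regression_f(u, labels):
    """Fit u_j = sum_s a_s * delta_js + b by least squares; return (fitted, F statistic)."""
    u = np.asarray(u, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)

    # With class indicators plus an intercept, the fitted value of a cell
    # is the mean of u over the cells in its class.
    fitted = np.empty_like(u)
    for s in classes:
        fitted[labels == s] = u[labels == s].mean()

    ss_between = np.sum((fitted - u.mean()) ** 2)   # variation explained by classes
    ss_within = np.sum((u - fitted) ** 2)           # residual variation
    df_between = len(classes) - 1
    df_within = len(u) - len(classes)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return fitted, f_stat

# Toy example: six cells in two classes.
fitted, f_stat = class_regression_f([1, 2, 3, 7, 8, 9], [0, 0, 0, 1, 1, 1])
print(fitted, f_stat)   # class means 2 and 8; F = 54.0
```

Comparing the F statistic with an F distribution with `df_between` and `df_within` degrees of freedom would turn it into a P-value, and repeating this over all vectors gives a ranking of those most coincident with the classes.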

  11. 18 (for data set 1) and 12 (for data set 2) of the $u_{l_2 j}$ are significantly
    correlated with the classifications.
    UMAP was applied to the top 30 $u_{l_2 j}$ to obtain the two-dimensional
    embeddings shown in the following slides.

  12. data set 1

  13. data set 2

  14. We also performed gene selection and validated the selected genes
    biologically using enrichment analysis, but there is no time to present the results.
    Conclusions:
    We have applied TD to the integration of single-cell multiomics data sets.
    Without specific preprocessing, TD successfully obtained a low-dimensional
    embedding from which UMAP can generate an embedding coincident
    with the classification.