Integrated Analysis of single cell multi omics data sets using tensor decomposition based unsupervised feature extraction

Y-h. Taguchi
November 27, 2021

Poster presentation at MBSJ44
https://www2.aeplan.co.jp/mbsj2021/

Presentation video is here
https://youtu.be/n2QPBDflkZc

Paper is here
https://doi.org/10.3390/genes12091442

Transcript

  1. MBSJ 44
    Integrated Analysis of single cell multi omics data sets using tensor
    decomposition based unsupervised feature extraction
    Y-h. Taguchi
    Department of Physics, Chuo University,
    Tokyo Japan
    The contents of this poster were published in:
    Taguchi, Y.-h.; Turki, T. Tensor-Decomposition-Based Unsupervised
    Feature Extraction in Single-Cell Multiomics Data Analysis. Genes
    2021, 12, 1442. doi:10.3390/genes12091442

  2. MBSJ 44
    Introduction
    Integrated analysis of single cell multi-omics data is difficult
    because:
    1. The number of features differs between individual omics
    (genes ~10^4, DNA methylation/accessibility ~10^8)
    2. The data are full of missing values
    The percentages of missing values:
    70% for gene expression
    >90% for DNA methylation and accessibility
    Careful pre-processing is usually required.

  3. MBSJ 44
    In this poster, we propose using tensor decomposition to
    integrate gene expression, DNA methylation, and DNA accessibility
    without any particular pre-processing.

  4. MBSJ 44
    Singular value decomposition (SVD)
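    $$x_{ij} \simeq \sum_{l=1}^{L} u_{li} \lambda_l v_{lj}$$
    [Diagram: the N × M matrix $x_{ij}$ is factorized as $(u_{li})^T$ (N × L) × $\lambda_l$ (L × L) × $v_{lj}$ (L × M)]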
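
The SVD formula on this slide can be illustrated with a minimal NumPy sketch. The toy matrix, its size, and the helper name `truncated_svd` are assumptions made for illustration only and are not taken from the poster.

```python
import numpy as np

def truncated_svd(x, L):
    """Return (u, lam, v) keeping L components.

    Shapes follow the slide: u is L x N, lam has length L, v is L x M.
    """
    U, s, Vt = np.linalg.svd(x, full_matrices=False)
    return U[:, :L].T, s[:L], Vt[:L, :]

rng = np.random.default_rng(0)
N, M, L = 50, 8, 3
# Hypothetical toy matrix of exact rank L, so L components reconstruct it.
x = rng.normal(size=(N, L)) @ rng.normal(size=(L, M))

u, lam, v = truncated_svd(x, L)
# x_ij ~ sum_l u_li * lam_l * v_lj
x_hat = np.einsum("li,l,lj->ij", u, lam, v)
max_err = np.abs(x - x_hat).max()
```

Because the toy matrix has rank L by construction, the reconstruction error is at the level of floating-point round-off; for real data the sum over L components is only an approximation.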

  5. MBSJ 44
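    HOSVD (Higher Order Singular Value Decomposition)
    Extension to tensors:
    $$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$$
    [Diagram: the N × M × K tensor $x_{ijk}$ is decomposed into the $L_1 \times L_2 \times L_3$ core tensor $G$ and the singular value vectors $u_{l_1 i}$, $u_{l_2 j}$, $u_{l_3 k}$]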
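
As a rough illustration of HOSVD, the sketch below computes the factor matrices from the SVD of each mode unfolding and obtains the core tensor by projection. This is one common way to compute HOSVD, not necessarily the implementation used for the poster; the function names and the toy tensor are hypothetical.

```python
import numpy as np

def unfold(x, mode):
    """Matricize tensor x so that `mode` becomes the row index."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def hosvd(x, ranks):
    """Truncated HOSVD of a 3-mode tensor.

    Returns the core tensor G (L1 x L2 x L3) and factor matrices
    u1 (L1 x N), u2 (L2 x M), u3 (L3 x K), stored row-wise as on the slide.
    """
    factors = []
    for mode, L in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(x, mode), full_matrices=False)
        factors.append(U[:, :L].T)
    u1, u2, u3 = factors
    # Core tensor: project x onto the singular vectors of every mode.
    G = np.einsum("ijk,ai,bj,ck->abc", x, u1, u2, u3)
    return G, u1, u2, u3

def reconstruct(G, u1, u2, u3):
    """Evaluate sum over l1, l2, l3 of G(l1 l2 l3) u_l1i u_l2j u_l3k."""
    return np.einsum("abc,ai,bj,ck->ijk", G, u1, u2, u3)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 5, 3))          # hypothetical N x M x K tensor
G, u1, u2, u3 = hosvd(x, ranks=(6, 5, 3))
full_err = np.abs(x - reconstruct(G, u1, u2, u3)).max()
G_small, *_ = hosvd(x, ranks=(2, 2, 2))  # truncated core tensor
```

With full ranks the reconstruction is exact up to round-off; smaller ranks give the approximate form shown on the slide.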

  6. MBSJ 44
    Application to integrated analysis of multiomics data sets
    Individual omics data are associated with distinct features, so
    the first dimension of the tensor depends on the omics layer $k$:
    $$x_{ijk} \in \mathbb{R}^{N_k \times M \times K}$$
    $N_k$: number of features of the $k$th omics data
    $M$: number of single cells
    $K$: number of omics ($K = 3$ in the present study)

  7. MBSJ 44
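    Apply SVD to $x_{ijk}$ for each omics layer $k$:
    $$x_{ijk} \simeq \sum_{l=1}^{L} u_{lik} \lambda_l v_{ljk}$$
    Project $x_{ijk}$ onto $u_{lik}$:
    $$x_{ljk} = \sum_{i=1}^{N_k} u_{lik} x_{ijk} \in \mathbb{R}^{L \times M \times K}$$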
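
The projection step can be sketched as follows: each omics matrix, whatever its number of features, is reduced to L rows, so the results can be stacked into a single L × M × K tensor. The feature counts, the random data, and the zero-filling of missing values are assumptions made for this toy example.

```python
import numpy as np

def project_omics(layers, L):
    """Project each N_k x M matrix onto its top-L left singular vectors.

    Returns the L x M x K tensor and the list of per-layer u_k (L x N_k).
    """
    projected, us = [], []
    for x_k in layers:
        U, _, _ = np.linalg.svd(x_k, full_matrices=False)
        u_k = U[:, :L].T               # L x N_k
        us.append(u_k)
        projected.append(u_k @ x_k)    # x_{ljk} = sum_i u_{lik} x_{ijk}
    return np.stack(projected, axis=-1), us

rng = np.random.default_rng(1)
M, L = 20, 10
feature_counts = [300, 1000, 800]      # hypothetical N_k, one per omics layer
layers = []
for n_k in feature_counts:
    x_k = rng.normal(size=(n_k, M))
    # Assumption for this sketch: missing entries are stored as zeros.
    x_k[rng.random(size=x_k.shape) < 0.7] = 0.0
    layers.append(x_k)

x_proj, us = project_omics(layers, L)
```

For feature counts on the order of 10^8, a dense SVD as used here would not be practical; a sparse or truncated solver would be needed, which this sketch does not attempt.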

  8. MBSJ 44
    Apply HOSVD to $x_{ljk}$ as
    $$x_{ljk} \simeq \sum_{l_1=1}^{L} \sum_{l_2=1}^{M} \sum_{l_3=1}^{K} G(l_1 l_2 l_3) u_{l_1 l} u_{l_2 j} u_{l_3 k}$$
    Apply UMAP to $u_{l_2 j}$ ($1 \le l_2 \le 3L$, $L = 10$).
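
One way to read the bound on $l_2$: when the L × M × K tensor is unfolded along the cell mode, it becomes an M × (L·K) matrix, so at most L·K = 3L singular vectors carry nonzero singular values. The sketch below checks this on a random tensor; the sizes other than L = 10 and K = 3 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
L, M, K = 10, 200, 3
x_proj = rng.normal(size=(L, M, K))   # hypothetical projected tensor x_{ljk}

# Cell-mode unfolding: cells as rows, (l, k) pairs as columns -> M x (L*K)
unfolded = np.moveaxis(x_proj, 1, 0).reshape(M, L * K)
U, s, _ = np.linalg.svd(unfolded, full_matrices=False)

cell_vectors = U.T                    # row l2 holds u_{l2 j}; shape (L*K, M)
n_nonzero = int(np.sum(s > 1e-10 * s[0]))
```

The transposed array `cell_vectors.T` (one row per cell, 3L columns) is the kind of input a UMAP implementation expects, for example `umap.UMAP().fit_transform` from the umap-learn package; that step is not run here.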

  9. MBSJ 44
    Dataset 1
    The multiomics dataset retrieved from GEO ID GSE154762, which is
    denoted as Dataset 1 in this study, is composed of 899 single cells for
    which gene expression, DNA methylation, and DNA accessibility were
    measured.
    These single cells represent human oocyte maturation.
    Dataset 2
    The multiomics dataset retrieved from GEO ID GSE154762, which is
    denoted as Dataset 2 in this study, is composed of 852 single cells for
    which DNA methylation and DNA accessibility were measured, as
    well as 758 single cells for which gene expression was measured.
    These single cells represent the four time points of the mouse embryo.

  10. MBSJ 44
    [Figure: panels for Dataset 1 and Dataset 2]

  11. MBSJ 44
    Identify which $u_{l_2 j}$ are coincident with the classification.
    Categorical regression:
    $$u_{l_2 j} = a_{l_2} + \sum_{s=1}^{S} b_{l_2 s} \delta_{js}$$
    $\delta_{js} = 1$ when cell $j$ belongs to the $s$th category, otherwise 0.
    $a_{l_2}, b_{l_2 s}$: regression coefficients.
    Dataset 1 & 2:
    18 $l_2$s are associated with corrected P-values less than 0.05.
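
The categorical regression can be sketched with ordinary least squares on indicator variables. In this sketch the first category is dropped from the indicators so that the intercept is identifiable, which is a parameterization choice made here and not something stated on the slide; the labels and vectors are synthetic.

```python
import numpy as np

def categorical_fit(u, labels):
    """Least-squares fit of u_j = a + sum_s b_s * delta_js.

    Returns the fitted values and the F statistic comparing the fit
    with an intercept-only model.
    """
    categories = np.unique(labels)
    # delta_js, dropping the first category so the intercept is identifiable
    delta = (labels[:, None] == categories[None, 1:]).astype(float)
    design = np.column_stack([np.ones(len(u)), delta])
    coef, *_ = np.linalg.lstsq(design, u, rcond=None)
    fitted = design @ coef
    rss = np.sum((u - fitted) ** 2)
    tss = np.sum((u - u.mean()) ** 2)
    df_model = len(categories) - 1
    df_resid = len(u) - len(categories)
    f_stat = ((tss - rss) / df_model) / (rss / df_resid)
    return fitted, f_stat

rng = np.random.default_rng(3)
labels = np.repeat(np.arange(4), 25)             # hypothetical 4 categories, 100 cells
u_signal = labels + 0.1 * rng.normal(size=100)   # depends on the category
u_noise = rng.normal(size=100)                   # independent of the category
fit_signal, f_signal = categorical_fit(u_signal, labels)
fit_noise, f_noise = categorical_fit(u_noise, labels)
```

A P-value would follow from the F distribution with (S - 1, M - S) degrees of freedom, for instance through `scipy.stats.f.sf`, and the P-values across all $l_2$ would then be corrected for multiple testing; the correction method is not specified on the slide.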

  12. MBSJ 44
    Select the $l_1$ that has the largest
    $$\sum_{l_2} \sum_{l_3=1}^{K} |G(l_1 l_2 l_3)|$$
    where the $l_2$s are restricted to the selected 18 $l_2$s, since such an $l_1$
    should be associated with the classification.
    For datasets 1 and 2, $l_1 = 1$ has the largest value.
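
The selection of $l_1$ amounts to summing absolute core-tensor entries over the selected $l_2$ and all $l_3$, then taking the maximum. A minimal sketch, with a hypothetical core tensor in which one $l_1$ is deliberately made dominant (indices are 0-based in the code):

```python
import numpy as np

def select_l1(G, selected_l2):
    """Return the 0-based l1 maximizing the sum of |G| over selected l2 and all l3."""
    score = np.abs(G[:, selected_l2, :]).sum(axis=(1, 2))
    return int(np.argmax(score)), score

rng = np.random.default_rng(4)
L, M, K = 10, 30, 3
G = rng.normal(size=(L, M, K)) * 0.1   # hypothetical core tensor
selected_l2 = [0, 1, 4, 7]             # hypothetical selected l2 (0-based)
G[2, selected_l2, :] += 5.0            # plant a dominant l1 (0-based index 2)
best, score = select_l1(G, selected_l2)
```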

  13. MBSJ 44
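    Gene selection:
    Generate $u_{l_1 i}$ from $u_{l_1 l}$, where $u_{li1}$ is $u_{lik}$ with $k = 1$:
    $$u_{l_1 i} = \sum_{l=1}^{L} u_{l_1 l} u_{li1} \in \mathbb{R}^{L \times N_1}$$
    Attribute a P-value to gene $i$, assuming that $u_{l_1 i}$ obeys a Gaussian distribution:
    $$P_i = P_{\chi^2}\left[ > \sum_{l_1} \left( \frac{u_{l_1 i}}{\sigma_{l_1}} \right)^2 \right]$$
    where the sum runs over the selected $l_1$ (here only $l_1 = 1$) and $\sigma_{l_1}$ is the standard deviation of $u_{l_1 i}$.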
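
The gene selection step can be sketched as below for a single selected $l_1$. Two assumptions are made that the slide does not state: the standard deviation is estimated directly from all entries of the vector, and the P-values are adjusted with the Benjamini-Hochberg procedure. The input vector is synthetic, with three planted outlier genes.

```python
import math
import numpy as np

def gene_vector(u1_row, u_expr):
    """u_{l1 i} = sum_l u_{l1 l} u_{l i 1}; shapes (L,) and (L, N) -> (N,)."""
    return u1_row @ u_expr

def chi2_sf_1dof(x):
    """P[chi-squared with 1 degree of freedom > x], via erfc."""
    return math.erfc(math.sqrt(x / 2.0))

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

def select_genes(u_gene, threshold=0.01):
    """Indices of genes whose adjusted P-value is below the threshold."""
    sigma = u_gene.std()   # assumption: plain standard deviation
    p = np.array([chi2_sf_1dof((v / sigma) ** 2) for v in u_gene])
    return np.flatnonzero(benjamini_hochberg(p) < threshold)

rng = np.random.default_rng(5)
u_gene = rng.normal(scale=0.01, size=2000)   # hypothetical u_{1i}
u_gene[[3, 42, 777]] = [0.2, -0.25, 0.3]     # planted outlier genes
selected = select_genes(u_gene)
```

Genes whose entries lie far in the tails of the assumed Gaussian receive small P-values, so only the planted outliers pass the 0.01 threshold in this toy case.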

  14. MBSJ 44
    45 genes and 175 genes were selected for datasets 1 and 2, respectively,
    as those associated with adjusted P-values less than 0.01.
    Enrichment analysis was performed on these genes in order to
    validate the selected genes biologically.

  15. MBSJ 44
    Dataset 1
    Forty-seven genes were enriched by H3K36me3 based on “ENCODE
    Histone Modifications 2015”; H3K36me3 is known to play critical
    roles during oocyte maturation [12]. Forty-seven genes were also
    targeted by MYC based on “ENCODE and ChEA Consensus TFs
    from ChIP-X”; Myc is known to play critical roles in oogenesis [13].
    Forty-seven genes were also targeted by TAF7 based on “ENCODE
    and ChEA Consensus TFs from ChIP-X” and “ENCODE TF ChIP-
    seq 2015”; TAF7 is known to play critical roles during oocyte growth
    [14]. Forty-seven genes were also targeted by ATF2 based on
    “ENCODE and ChEA Consensus TFs from ChIP-X”; the expression
    of ATF2 is known to be altered during oocyte development [15].

  16. MBSJ 44
    Dataset 2
    One-hundred and seventy-five genes were enriched by H3K36me3
    based on “ENCODE Histone Modifications 2015”; H3K36me3 is
    known to play critical roles during gastrulation [17]. One-hundred and
    seventy-five genes were also targeted by MYC based on “ENCODE
    and ChEA Consensus TFs from ChIP-X”; Myc is also known to play
    critical roles in gastrulation [18]. One-hundred and seventy-five genes
    were also targeted by TAF7 based on “ENCODE and ChEA
    Consensus TFs from ChIP-X” and “ENCODE TF ChIP-seq 2015”;
    TAF7 is known to play critical roles during gastrulation [19]. One-
    hundred and seventy-five genes were also targeted by ATF2 based on
    “ENCODE and ChEA Consensus TFs from ChIP-X”; the expression
    of ATF2 is known to be maintained during gastrulation [20].

  17. MBSJ 44
    This suggests that tensor decomposition (TD) based unsupervised
    feature extraction (FE) correctly selected biologically reasonable genes.

  18. MBSJ 44
    TD can deal with data sets full of missing values.

  19. MBSJ 44
    Conclusions
    TD is an effective method that can integrate single cell multiomics
    data sets as they are, without preprocessing the individual omics
    data sets specifically.