Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

Y-h. Taguchi
November 04, 2021

Presentation at InCob2021
http://www.incob2021.cn/

The content is published in PLoS ONE
https://doi.org/10.1371/journal.pone.0275472

Transcript

  1. Projection in genomic analysis: A theoretical basis to rationalize tensor
    decomposition and principal component analysis as feature selection
    tools
    Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan.
    Turki Turki, King Abdulaziz University, Jeddah, Saudi Arabia.
    This was rejected by the conference journal track, but can be read
    as a preprint.
    BioRxiv
    doi: https://doi.org/10.1101/2020.10.02.324616

  2. Table of Contents
    Purpose of this study
    Projection pursuit
    PCA based unsupervised FE
    TD based unsupervised FE
    Comparison with PP
    Rationalization of null hypothesis (Gaussian distribution)

  3. I have published a book on this topic with
    Springer International.
    I would be glad if the audience could buy it and learn
    my method.
    Y-h. Taguchi,
    Unsupervised Feature Extraction Applied to
    Bioinformatics
    --- A PCA and TD Based Approach ---
    Springer International (2020)

  4. The purpose of this preprint is to rationalize the proposed method,
    principal component analysis (PCA) and tensor decomposition
    (TD) based unsupervised feature selection, which is described in detail in
    the book mentioned on the previous slide.

  5. $y \in \mathbb{R}^{M}$: teacher data (e.g. labeling)
    $M$: the number of samples
    vs
    $X \in \mathbb{R}^{N \times M}$: given data
    $N$: the number of features attributed to the $M$ samples.
    How can we make use of $X$ to explain $y$?

  6. One strategy: Projection pursuit (PP)
    $b = y X^{T} \in \mathbb{R}^{N}$
    $b$ can be used to weight which features are important,
    e.g., the $i$th feature with a larger absolute value of $b_i$ is regarded as
    important.
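
As an illustration of the projection above, here is a minimal Python sketch that computes b_i = sum_j x_ij y_j for a small toy matrix and ranks the features by |b_i|. The data values and the function names `projection_weights` and `rank_features` are made up for this example and do not come from the slides.

```python
# Sketch of the projection b = y X^T (toy data; all numbers are made up).

def projection_weights(X, y):
    """Return b_i = sum_j X[i][j] * y[j] for every feature i (row of X)."""
    return [sum(x_ij * y_j for x_ij, y_j in zip(row, y)) for row in X]

def rank_features(b):
    """Order feature indices by decreasing absolute value of b_i."""
    return sorted(range(len(b)), key=lambda i: abs(b[i]), reverse=True)

# N = 3 features (rows), M = 4 samples (columns).
X = [
    [1.0, 1.2, 5.0, 5.1],   # differs strongly between the two groups
    [2.0, 2.1, 2.0, 1.9],   # almost constant across samples
    [4.0, 3.8, 1.0, 1.1],   # differs in the opposite direction
]
y = [-1.0, -1.0, 1.0, 1.0]  # teacher data: two groups of two samples

b = projection_weights(X, y)
ranking = rank_features(b)
print(b, ranking)
```

Features whose values follow the labeling, in either direction, receive the largest |b_i|, which is the sense in which b weights feature importance.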

  7. PCA based unsupervised FE
    Generate the N × N matrix:
    $X X^{T} \in \mathbb{R}^{N \times N}$
    Obtain the eigenvector $u_l$ attributed to feature $i$:
    $X X^{T} u_l = \lambda_l u_l, \quad u_l \in \mathbb{R}^{N}$
    Compute the eigenvector $v_l$ attributed to sample $j$:
    $v_l = X^{T} u_l \in \mathbb{R}^{M}$
    Identify which $v_l$ is biologically interesting.
    Attribute P-values to feature $i$, assuming that $u_l$ obeys a Gaussian distribution:
    $P_i = P_{\chi^2}\left[ > \left( \frac{u_{li}}{\sigma_l} \right)^2 \right]$
    $P_i$ is corrected by the Benjamini-Hochberg criterion, and features $i$ associated with
    corrected $P_i < 0.01$ are selected.
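
The linear-algebra part of this slide can be sketched with NumPy as follows. This is only an illustration on random toy data: the sizes, the seed, and the choice `l = 0` (taking the first component as the "interesting" one) are assumptions, since in practice that choice is made by inspecting v_l.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 8                      # toy sizes: N features, M samples
X = rng.standard_normal((N, M))

# Eigen-decomposition of the symmetric N x N matrix X X^T.
gram = X @ X.T
eigvals, eigvecs = np.linalg.eigh(gram)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]         # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

l = 0                   # assumption: the first component is the one of interest
u_l = eigvecs[:, l]     # attributed to features i, shape (N,)
v_l = X.T @ u_l         # attributed to samples j, shape (M,)

# v_l is an (unnormalized) eigenvector of X^T X with the same eigenvalue,
# because X^T X (X^T u_l) = X^T (X X^T u_l) = lambda_l X^T u_l.
residual = (X.T @ X) @ v_l - eigvals[l] * v_l
print(np.abs(residual).max())
```

Working with X X^T gives the feature-side vectors directly; the sample-side vectors then follow from a single matrix product.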

  8. TD based unsupervised FE
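    $X \in \mathbb{R}^{N \times M \times K}$
    $x_{ijk} = \sum_{l_1=1}^{N} \sum_{l_2=1}^{M} \sum_{l_3=1}^{K} G(l_1 l_2 l_3) \, u_{l_1 i} u_{l_2 j} u_{l_3 k}$
    $G \in \mathbb{R}^{N \times M \times K}, \; u_{l_1 i} \in \mathbb{R}^{N \times N}, \; u_{l_2 j} \in \mathbb{R}^{M \times M}, \; u_{l_3 k} \in \mathbb{R}^{K \times K}$
    $x_{ijk}$: the $i$th feature attributed to samples with the $j$th and $k$th experimental
    conditions
    $N$: number of features; $M$, $K$: number of conditions (samples)
    Identify the biologically interesting $l_2, l_3$ and find the $l_1$ for which
    $G(l_1, l_2, l_3)$ has a large absolute value with the identified $l_2, l_3$.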
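
The slide does not name the decomposition algorithm. The sketch below assumes a higher-order singular value decomposition (HOSVD), which produces a core tensor and square orthogonal matrices of the sizes shown above. The helper names `unfold` and `hosvd`, the toy tensor, and the choice `l2 = l3 = 0` are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize `tensor` so that axis `mode` indexes the rows."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor):
    """Return the core tensor G and one orthogonal matrix per mode.

    Rows of each returned matrix are singular vectors, so U[l, i]
    plays the role of u_{l i} in the slide's notation.
    """
    factors = []
    for mode in range(tensor.ndim):
        # Left singular vectors of the mode-n unfolding (full square matrix).
        left, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=True)
        factors.append(left.T)
    u1, u2, u3 = factors
    # G(l1, l2, l3) = sum_ijk x_ijk u_{l1 i} u_{l2 j} u_{l3 k}
    core = np.einsum("ijk,ai,bj,ck->abc", tensor, u1, u2, u3)
    return core, factors

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4, 3))     # toy N x M x K tensor
G, (U1, U2, U3) = hosvd(X)

# Rebuild x_ijk = sum_{l1 l2 l3} G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k}.
X_rebuilt = np.einsum("abc,ai,bj,ck->ijk", G, U1, U2, U3)

# Pretend l2 = 0 and l3 = 0 were judged interesting (hypothetical choice),
# then pick the l1 with the largest |G(l1, l2, l3)|.
l2, l3 = 0, 0
l1 = int(np.argmax(np.abs(G[:, l2, l3])))
print(np.allclose(X_rebuilt, X), l1)
```

The row `U1[l1]` is then the feature-side vector to which P-values are attributed.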

  9. $P_i = P_{\chi^2}\left[ > \left( \frac{u_{l_1 i}}{\sigma_{l_1}} \right)^2 \right]$
    Attribute P-values to feature $i$, assuming that $u_{l_1}$ obeys a Gaussian distribution.
    $P_i$ is corrected by the Benjamini-Hochberg criterion, and features $i$ associated
    with corrected $P_i < 0.01$ are selected.
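
A minimal sketch of this selection step, using only the Python standard library. Two points are assumptions not stated on the slide: the chi-squared distribution is taken to have one degree of freedom, and sigma is estimated as the standard deviation of the vector's components. The function names and the toy vector are hypothetical.

```python
import math

def chi2_upper_tail_1df(z_squared):
    """P(chi^2 > z_squared) for one degree of freedom (assumed here)."""
    return math.erfc(math.sqrt(z_squared / 2.0))

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def select_features(u, threshold=0.01):
    """Select indices i whose adjusted P_i is below `threshold`."""
    mean = sum(u) / len(u)
    # Assumption: sigma is the standard deviation of the components of u.
    sigma = math.sqrt(sum((x - mean) ** 2 for x in u) / len(u))
    p = [chi2_upper_tail_1df((x / sigma) ** 2) for x in u]
    adjusted = benjamini_hochberg(p)
    return [i for i, q in enumerate(adjusted) if q < threshold]

# Toy vector: 99 small components and one clear outlier at index 99.
u = [0.01 * ((i % 7) - 3) for i in range(99)] + [1.0]
selected = select_features(u)
print(selected)
```

Only components that are outliers relative to the assumed Gaussian survive the correction, which is what makes the method a feature selection tool.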

  10. Applying TD based unsupervised FE to cancer data sets
    Integration of two cancer data sets
    TCGA: M = 324 (253 tumor, 71 normal), mRNA and miRNA
    GEO: M = 34 (17 tumor, 17 normal), mRNA and miRNA

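  11. $x_{ij} \in \mathbb{R}^{N \times M}$ ($N$ mRNAs)
    $x_{kj} \in \mathbb{R}^{K \times M}$ ($K$ miRNAs)
    $x_{ik} = \sum_{j} x_{ij} x_{kj} \in \mathbb{R}^{N \times K}$
    $x_{ik} = \sum_{l=1}^{\min(N, K)} \lambda_l u_{li} u_{lk}$
    $v_{lj}^{\mathrm{mRNA}} = \sum_{i} x_{ij} u_{li}, \quad v_{lj}^{\mathrm{miRNA}} = \sum_{k} x_{kj} u_{lk}$
    $P_i = P_{\chi^2}\left[ > \left( \frac{u_{li}}{\sigma_l} \right)^2 \right], \quad P_k = P_{\chi^2}\left[ > \left( \frac{u_{lk}}{\sigma_l} \right)^2 \right]$
    72 mRNAs and 11 miRNAs are selected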
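
The integration step can be sketched with NumPy on random toy matrices. The sizes, the seed, and the variable names are assumptions; the real analysis uses the TCGA and GEO profiles, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, M = 30, 10, 12                    # toy sizes: N mRNAs, K miRNAs, M shared samples
x_mrna = rng.standard_normal((N, M))    # x_ij
x_mirna = rng.standard_normal((K, M))   # x_kj

# x_ik = sum_j x_ij x_kj
x_ik = x_mrna @ x_mirna.T               # shape (N, K)

# x_ik = sum_l lambda_l u_li u_lk via the singular value decomposition.
u_i, lam, u_k_t = np.linalg.svd(x_ik, full_matrices=False)
# u_i[:, l] holds u_li (mRNA side); u_k_t[l, :] holds u_lk (miRNA side).

# Project each data set onto its singular vectors to get sample-side vectors.
v_mrna = u_i.T @ x_mrna                 # v_lj^mRNA  = sum_i x_ij u_li
v_mirna = u_k_t @ x_mirna               # v_lj^miRNA = sum_k x_kj u_lk
print(v_mrna.shape, v_mirna.shape)
```

Summing over the shared sample index j yields one mRNA-by-miRNA matrix, so a single decomposition provides vectors for both kinds of feature.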

  12. Comparison with PP
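    $y_j = -\frac{M}{M_T}, \quad 1 \le j \le M_T$
    $y_j = \frac{M}{M_N}, \quad M_T < j \le M$
    $b_i = \sum_{j} x_{ij} y_j, \quad b_k = \sum_{j} x_{kj} y_j$
    $P_i = P_{\chi^2}\left[ > \left( \frac{b_i}{\sigma_b} \right)^2 \right], \quad P_k = P_{\chi^2}\left[ > \left( \frac{b_k}{\sigma_b} \right)^2 \right]$
    $M_T$: number of tumors
    $M_N$: number of normal kidneys
    73 mRNAs and 18 miRNAs are selected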
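
Substituting this labeling into the projection shows what the weight measures. The short derivation below assumes that M = M_T + M_N, with the tumor samples listed first. The bar notation for group means is introduced here for illustration and does not appear on the slide.

```latex
\begin{align*}
b_i &= \sum_{j=1}^{M} x_{ij} y_j
     = -\frac{M}{M_T} \sum_{j=1}^{M_T} x_{ij}
       + \frac{M}{M_N} \sum_{j=M_T+1}^{M} x_{ij} \\
    &= M \left( \bar{x}_i^{\mathrm{normal}} - \bar{x}_i^{\mathrm{tumor}} \right),
\qquad
\sum_{j=1}^{M} y_j = -M + M = 0 .
\end{align*}
```

Under these assumptions b_i is proportional to the difference between the mean of feature i in normal and in tumor samples, and the labels sum to zero.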

  13. Q-Q Plot
    [Figure: Q-Q plots; panels: mRNA, miRNA]
    $P_i$ and $P_k$ obey the same distribution between PP and TD
    based unsupervised FE
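
For readers unfamiliar with Q-Q plots, the comparison amounts to pairing the sorted P-values from the two methods and checking how close the pairs lie to the diagonal. A minimal sketch with made-up P-values follows; the function names and all numbers are hypothetical.

```python
def qq_points(p_a, p_b):
    """Pair the sorted values of two equally long lists of P-values."""
    if len(p_a) != len(p_b):
        raise ValueError("this simple sketch expects equally long inputs")
    return list(zip(sorted(p_a), sorted(p_b)))

def max_qq_deviation(points):
    """Largest distance of any Q-Q point from the diagonal."""
    return max(abs(a - b) for a, b in points)

p_pp = [0.91, 0.02, 0.40, 0.65, 0.13]   # hypothetical P-values from PP
p_td = [0.38, 0.93, 0.01, 0.15, 0.62]   # hypothetical P-values from TD based FE

points = qq_points(p_pp, p_td)
deviation = max_qq_deviation(points)
print(points, deviation)
```

A Q-Q plot compares distributions only: points on the diagonal mean the two sets of P-values are distributed alike, not that each feature receives the same P-value.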

  14. Confusion matrices
    [Tables: confusion matrices for mRNA and miRNA]
    PP and TD based unsupervised FE select almost the same mRNAs
    and miRNAs
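
A confusion matrix between two selections can be tabulated as in the following sketch. The feature names and the two selections are invented for illustration and are unrelated to the genes actually selected.

```python
def confusion_matrix(selected_a, selected_b, all_features):
    """2 x 2 table: rows = method A (not selected, selected), columns = method B."""
    a, b = set(selected_a), set(selected_b)
    table = [[0, 0], [0, 0]]
    for feature in all_features:
        table[int(feature in a)][int(feature in b)] += 1
    return table

features = [f"gene{i}" for i in range(10)]       # hypothetical feature names
by_pp = ["gene0", "gene1", "gene2", "gene3"]     # hypothetical PP selection
by_td = ["gene0", "gene1", "gene2", "gene7"]     # hypothetical TD selection

table = confusion_matrix(by_pp, by_td, features)
print(table)
```

Large diagonal entries and small off-diagonal entries indicate that the two methods select nearly the same features.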

  15. Rationalization of null hypothesis (Gaussian distribution)

  16. A true null distribution was generated with shuffled miRNA, and
    P-values attributed to $u_{li}^{\mathrm{miRNA}}$ were computed.
    [Figure: two panels, all miRNAs and top 500 miRNAs; axis label: 1-P]

  17. A true null distribution was generated with shuffled mRNA, and
    P-values attributed to $u_{li}^{\mathrm{mRNA}}$ were computed.
    [Figure: two panels, all mRNAs and top 3000 mRNAs; axis label: 1-P]
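
The slides do not specify how the shuffling was performed. The sketch below shows one possible scheme, stated as an assumption: the feature order is permuted independently within each sample, the leading singular vector is recomputed, and its absolute components are pooled into an empirical null. All function names, sizes, and the number of shuffles are hypothetical.

```python
import numpy as np

def leading_feature_vector(x):
    """First left singular vector of x (features x samples)."""
    u, _, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, 0]

def shuffled_null(x, n_shuffles, rng):
    """Pool |u_i| values obtained after shuffling features within each sample."""
    pooled = []
    for _ in range(n_shuffles):
        # Assumed scheme: permute every column (sample) independently.
        shuffled = np.column_stack([rng.permutation(col) for col in x.T])
        pooled.append(np.abs(leading_feature_vector(shuffled)))
    return np.sort(np.concatenate(pooled))

def empirical_p_values(observed, null_sorted):
    """Fraction of null values >= |observed|, with an add-one correction."""
    n_below = np.searchsorted(null_sorted, np.abs(observed), side="left")
    return (null_sorted.size - n_below + 1) / (null_sorted.size + 1)

rng = np.random.default_rng(3)
x = rng.standard_normal((40, 10))       # toy data: 40 features, 10 samples
u_observed = leading_feature_vector(x)
null = shuffled_null(x, n_shuffles=20, rng=rng)
p_empirical = empirical_p_values(u_observed, null)
print(p_empirical.min(), p_empirical.max())
```

Unlike the chi-squared P-values, these empirical P-values make no Gaussian assumption, which is why the two can be compared to check the null hypothesis.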

  18. Confusion matrices
    [Tables: confusion matrices for mRNA and miRNA]
    The null distribution and TD based unsupervised FE select almost
    the same mRNAs and miRNAs, although the threshold values differ

  19. Conclusion
    TD based unsupervised FE is equivalent to PP.
    Although the null hypothesis of a Gaussian distribution is
    not fulfilled, the selection empirically coincides with that from the null
    distribution generated by shuffling, although the threshold
    P-values differ (0.01 for TD based unsupervised FE
    and 0.1 for the null distribution).
