Drug repositioning for SARS-CoV-2 with tensor decomposition based unsupervised feature extraction

Presentation at
"Scholars International Webinar on Advances in Drug Discovery and Development"
https://scholarsconferences.com/drugdiscovery-webinar/

28th March 2022
Online

Video presentation
https://youtu.be/pq9doYLLMn0

Y-h. Taguchi

March 28, 2022

Transcript

  1. Drug repositioning for SARS-CoV-2 with tensor
    decomposition based unsupervised feature extraction
    Y-h. Taguchi
    Department of Physics, Chuo University
    Tokyo 112-8551, Japan
    Tensor decomposition- and principal component analysis-based
    unsupervised feature extraction to select more reasonable
    differentially expressed genes: Optimization of standard
    deviation versus state-of-art methods,
    Y-h. Taguchi, Turki Turki bioRxiv 2022.02.18.481115; doi:
    https://doi.org/10.1101/2022.02.18.481115

  2. Introduction
    Bioinformatics is a research field that analyzes massive genomics data
    sets using cutting-edge computational, statistical, and machine learning
    techniques.
    Typical data sets analyzed, with their approximate numbers of features:

    Gene expression profiles: ~$10^4$

    DNA methylation: ~$10^7$

    DNA accessibility: ~$10^7$

    microRNA expression: ~$10^3$
    whereas the number of samples is small ($10$ to $10^2$).

  3. Data analysis in bioinformatics is a typical "large p small n" problem.
    What is a large p small n problem?
    When there are only a few examples (n) but many features (p), it is often
    difficult to distinguish between them.
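
A small numerical experiment can make this difficulty concrete. The sketch below is an illustration and not part of the talk: it assumes purely random features and random labels, and the sizes, the seed, and the function name `fit_noise` are arbitrary choices. It fits a minimum-norm least-squares classifier. With far more features than examples, the fit reproduces the training labels exactly even though there is nothing to learn, while accuracy on fresh random data stays near chance.

```python
import numpy as np

def fit_noise(n_samples=10, n_features=1000, n_new=1000, seed=0):
    """Fit random labels on random features; return (train accuracy, new-data accuracy)."""
    rng = np.random.default_rng(seed)
    # n examples, p >> n features, labels unrelated to the features
    X = rng.standard_normal((n_samples, n_features))
    y = rng.choice([-1.0, 1.0], size=n_samples)
    # Minimum-norm least-squares solution; with p >> n it interpolates y
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    train_acc = float(np.mean(np.sign(X @ w) == y))
    # Fresh random data with fresh random labels: nothing learned can transfer
    X_new = rng.standard_normal((n_new, n_features))
    y_new = rng.choice([-1.0, 1.0], size=n_new)
    new_acc = float(np.mean(np.sign(X_new @ w) == y_new))
    return train_acc, new_acc

train_acc, new_acc = fit_noise()
print(train_acc, new_acc)
```

A perfect training score here says nothing about the data, which is why feature selection with n of order 10 needs methods that do not rely on fitting labels.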

  4. For example, suppose you are given some novels to classify as either
    fantasy or science fiction.
    In this case, the features are the set of words included in each novel.
    Fantasy:
    Snow White, The Lord of the Rings, Knights of the Round Table
    Science Fiction:
    Star Trek, X-Men, Superman
    Is Star Wars science fiction or fantasy?
    Based only on a small number of examples, labeling a new title is
    very difficult.

  5. In bioinformatics, the number of samples is small (~$10^2$), whereas the
    number of features is huge ($\gg 10^4$).
    As a method suited to this setting, we developed a tensor decomposition
    approach applicable to the large p small n problem.
    In this talk, I would like to introduce some examples of applying
    the method we proposed, “Tensor Decomposition (TD) based
    unsupervised feature extraction (FE)”, to bioinformatics problems.

  6. I have published a book on this topic with
    Springer International.
    I would be glad if the audience could buy it and learn
    my method.
    Y-h. Taguchi,
    Unsupervised Feature Extraction Applied to
    Bioinformatics
    --- A PCA and TD Based Approach ---
    Springer International (2020)

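  7. What is a tensor?
    Scalar $x$: a single number, e.g. $1$
    Vector $x_i$: a set of scalars arranged in a line, e.g. $(1, 2, 3, 4, \ldots)$
    Matrix $x_{ij}$: a set of scalars arranged in a table (i.e. rows and columns), e.g.
    $\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$
    Tensor $x_{ijk}$: a set of scalars arranged in an array with more than two indices
    [Figure: a cube representing $x_{ijk}$, with axes $i$, $j$, $k$]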

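  8. A tensor is suitable for storing genomics data:
    Gene expression: $x_{ijk} \in \mathbb{R}^{N \times M \times K}$
    $N$ genes $\times$ $M$ persons $\times$ $K$ tissues
    [Figure: a cube representing $x_{ijk}$, with axes $i$: genes, $j$: persons, $k$: tissues]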
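
As a rough illustration of this layout, the sketch below stores a toy gene x person x tissue data set as a three-index NumPy array. The sizes, the random values, and all variable names are assumptions made for the example, not data from the talk.

```python
import numpy as np

# Toy sizes standing in for N genes, M persons, K tissues
n_genes, n_persons, n_tissues = 4, 3, 2
rng = np.random.default_rng(0)

# One genes-by-persons expression matrix per tissue (random placeholder values)
per_tissue = [rng.random((n_genes, n_persons)) for _ in range(n_tissues)]

# Stack the matrices along a new last axis: x[i, j, k] is gene i, person j, tissue k
x = np.stack(per_tissue, axis=-1)

gene_0_profile = x[0]          # M x K slice: gene 0 across persons and tissues
tissue_1_matrix = x[:, :, 1]   # N x M slice: all genes and persons in tissue 1
print(x.shape, gene_0_profile.shape, tissue_1_matrix.shape)
```

Keeping persons and tissues as separate axes, instead of flattening them into one long sample axis, is what lets a decomposition later assign a separate vector to each of them.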

  9. What is tensor decomposition (TD)?
    Expand the tensor as a sum of products of vectors:
    $$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3)\, u_{l_1 i}\, u_{l_2 j}\, u_{l_3 k}$$
    [Figure: the tensor $x_{ijk}$ ($i$: genes, $j$: persons, $k$: tissues) drawn as a core tensor $G$ with axes $l_1$, $l_2$, $l_3$ multiplied by the vectors $u_{l_1 i}$, $u_{l_2 j}$, $u_{l_3 k}$]
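
One common way to obtain an expansion of this form is the higher-order singular value decomposition (HOSVD). The slide does not say which algorithm is used, so the sketch below is an assumption-laden illustration: the function names `unfold`, `hosvd`, and `reconstruct` are made up for this example, and the input is a small random tensor. Here `factors[n][:, l]` plays the role of the vector $u_{l\,\cdot}$ for index $n$, and `core` plays the role of $G$.

```python
import numpy as np

def unfold(x, mode):
    """Flatten x into a matrix whose rows run over the given index (mode)."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def hosvd(x):
    """Return (core, factors) such that x is the core expanded by the factor matrices."""
    # Left singular vectors of each unfolding give the vectors for that index
    factors = [np.linalg.svd(unfold(x, n), full_matrices=False)[0]
               for n in range(x.ndim)]
    core = x
    for n, u in enumerate(factors):
        # Project index n of the tensor onto the singular vectors of that index
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, n)), 0, n)
    return core, factors

def reconstruct(core, factors):
    """Sum over l1, l2, ... of G(l1 l2 ...) times the corresponding vectors."""
    x = core
    for n, u in enumerate(factors):
        x = np.moveaxis(np.tensordot(u, x, axes=(1, n)), 0, n)
    return x

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 4, 3))     # toy genes x persons x tissues tensor
core, factors = hosvd(x)
x_hat = reconstruct(core, factors)
print(np.allclose(x, x_hat))
```

With all components kept, the expansion reproduces the tensor; in practice only the few components with an interpretable dependence on each index are inspected.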

  10. Advantages of tensor decomposition (TD):
    We can know
    “Dependence of $x_{ijk}$ upon $i$” → $u_{l_1 i}$ ← gene selection
    “Dependence of $x_{ijk}$ upon $j$” → $u_{l_2 j}$ ← healthy control vs patient
    “Dependence of $x_{ijk}$ upon $k$” → $u_{l_3 k}$ ← tissue specificity

    We can answer the question: which genes are differentially expressed
    between healthy controls and patients in a tissue-specific manner?

  11. Application to a real example
    Drug repositioning for COVID-19

  12. Data set GSE147507
    Gene expression of human lung cell lines with/without SARS-CoV-2
    infection.
    i: genes (21797)
    j: cell lines; j=1: Calu3, j=2: NHBE, j=3: A549 (MOI 0.2), j=4:
    A549 (MOI 2.0), j=5: A549 (ACE2 expressed)
    (MOI: multiplicity of infection)
    k: k=1: mock, k=2: SARS-CoV-2 infected
    m: three biological replicates

  13. $x_{ijkm} \in \mathbb{R}^{21797 \times 5 \times 2 \times 3}$
    $$x_{ijkm} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4)\, u_{l_1 j}\, u_{l_2 k}\, u_{l_3 m}\, u_{l_4 i}$$
    $u_{l_1 j}$: $l_1$th dependence upon cell lines
    $u_{l_2 k}$: $l_2$th dependence upon the presence or absence of SARS-CoV-2 infection
    $u_{l_3 m}$: $l_3$th dependence upon biological replicates
    $u_{l_4 i}$: $l_4$th dependence upon genes
    $G$: weight of individual terms

  14. Purpose: identification of $l_1$, $l_2$, $l_3$ that are independent of cell
    lines and biological replicates ($u_{l_1 j}$ and $u_{l_3 m}$ take constant
    values regardless of $j$ and $m$) and dependent upon the presence or absence
    of SARS-CoV-2 infection ($u_{l_2 1} = -u_{l_2 2}$).

    Heavy “large p small n” problem
    Number of variables (= p): 21797 ~ $10^4$
    Number of samples (= n): $5 \times 2 \times 3 = 30$ ~ $10$
    p/n ~ $10^3$
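
The two criteria stated for this slide (constant over cell lines and replicates, opposite sign between the two infection states) can be checked numerically. The sketch below is only one possible way to score them: the helper names `most_constant` and `most_sign_flipped`, the scoring formulas, and the example matrices are all hypothetical and not taken from the talk. Columns are components, and indices are 0-based, so column 0 corresponds to $l=1$.

```python
import numpy as np

def most_constant(u):
    """Index of the column of u whose entries vary least relative to their mean."""
    spread = u.std(axis=0) / (np.abs(u.mean(axis=0)) + 1e-12)
    return int(np.argmin(spread))

def most_sign_flipped(u):
    """Index of the column of a two-row u that is closest to u[0] = -u[1]."""
    mismatch = np.abs(u[0] + u[1]) / (np.abs(u[0]) + np.abs(u[1]) + 1e-12)
    return int(np.argmin(mismatch))

# Hypothetical singular-vector matrices (rows: cell lines or conditions)
u_cell = np.array([[0.450,  0.8],
                   [0.440, -0.1],
                   [0.450, -0.3],
                   [0.450, -0.2],
                   [0.446, -0.2]])
u_infect = np.array([[0.71,  0.70],
                     [0.70, -0.71]])

l1 = most_constant(u_cell)          # component that ignores the cell line
l2 = most_sign_flipped(u_infect)    # component that separates mock from infected
print(l1, l2)
```

In the talk the corresponding vectors are judged from their plots; a score like this is merely a convenience when many components have to be screened.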

  15. [Figure: three panels showing the selected vectors: $l_1=1$ (cell lines), $l_2=2$ (with and without SARS-CoV-2 infection), $l_3=1$ (biological replicate)]
    Independent of cell lines
    and biological replicate,
    but dependent upon
    SARS-CoV-2 infection.

  16. $l_1=1$, $l_2=2$, $l_3=1$
    For which $l_4$ is $|G(l_1 l_2 l_3 l_4)|$ the largest?
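
The selection of $l_4$ can be sketched end to end on synthetic data. Everything in the block below is an assumption made for illustration: the tensor is simulated (200 genes, of which the first 20 carry a planted infection effect), HOSVD is assumed as the decomposition, and the axes are ordered cell lines x infection x replicates x genes so that the core indices follow $(l_1, l_2, l_3, l_4)$. Indices are 0-based, so the slide's $l_1=1$, $l_2=2$, $l_3=1$ become `0, 1, 0`.

```python
import numpy as np

def unfold(x, mode):
    """Flatten x into a matrix whose rows run over the given index (mode)."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def hosvd(x):
    """Return (core, factors); factors[n][:, l] is the l-th vector for index n."""
    factors = [np.linalg.svd(unfold(x, n), full_matrices=False)[0]
               for n in range(x.ndim)]
    core = x
    for n, u in enumerate(factors):
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, n)), 0, n)
    return core, factors

# Simulated data: 5 cell lines x 2 conditions x 3 replicates x 200 genes
rng = np.random.default_rng(0)
n_genes, n_planted = 200, 20
effect = np.zeros(n_genes)
effect[:n_planted] = 3.0                  # planted infection-responsive genes
condition = np.array([-1.0, 1.0])         # mock vs infected
x = 5.0 + condition[None, :, None, None] * effect[None, None, None, :]
x = x + 0.1 * rng.standard_normal((5, 2, 3, n_genes))

core, factors = hosvd(x)
l1, l2, l3 = 0, 1, 0                      # constant, sign-flipped, constant
# The gene component most strongly tied to (l1, l2, l3) has the largest |G|
l4 = int(np.argmax(np.abs(core[l1, l2, l3, :])))
# Genes with the largest absolute weight in that component
top_genes = np.argsort(-np.abs(factors[3][:, l4]))[:n_planted]
print(l4, sorted(top_genes.tolist()))
```

On this simulated tensor the genes with the largest weights in the selected component are the planted ones, which is the behavior the selection rule relies on.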

  17. Gene expression independent of cell lines and
    biological replicate, but dependent upon SARS-CoV-2
    infection, is associated with $u_{5i}$ ($l_4=5$).
    $$P_i = P_{\chi^2}\left[ > \left( \frac{u_{5i}}{\sigma_5} \right)^2 \right]$$
    The computed P-values are corrected for multiple comparisons
    by the Benjamini-Hochberg method.
    3627 genes with corrected P-values < 0.1 are selected among 21,797
    genes.
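
The P-value formula and the correction step can be sketched as follows. The slide does not state the degrees of freedom of the $\chi^2$ distribution; the code assumes one degree of freedom, for which $P[\chi^2 > z^2] = \mathrm{erfc}(|z|/\sqrt{2})$. The function names, the simulated vector `u`, and the value of `sigma` are hypothetical and chosen only for the example.

```python
import math
import numpy as np

def chi2_pvalues(u, sigma):
    """P[chi^2 > (u/sigma)^2], assuming one degree of freedom."""
    z = np.abs(np.asarray(u, dtype=float)) / sigma
    return np.array([math.erfc(v / math.sqrt(2.0)) for v in z])

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest P-value downwards
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

def select_genes(u, sigma, threshold=0.1):
    """Indices of genes whose adjusted P-value is below the threshold."""
    return np.flatnonzero(benjamini_hochberg(chi2_pvalues(u, sigma)) < threshold)

# Simulated gene vector: 10 outlier genes followed by 1000 null genes
rng = np.random.default_rng(0)
sigma = 1e-3
u = np.concatenate([np.full(10, 0.02), sigma * rng.standard_normal(1000)])
selected = select_genes(u, sigma)
print(len(selected))
```

The result depends strongly on `sigma`: a standard deviation inflated by the outlier genes themselves makes every P-value larger, so fewer genes pass the threshold.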

  18. $\sigma_5 = 6.55 \times 10^{-3}$
    $\sigma_5 = 7.00 \times 10^{-4}$
    [Figure panels: "Computation from $u_{5i}$" and "Optimization"]

  19. Comparisons with conventional methods:
    Since we do not know how many genes should be selected, lasso and
    random forest are not useful here. Instead, we employed SAM and limma,
    which are algorithms specific to gene selection (adjusted P-values are used).

    Cell line            t test             SAM                limma
                         P>0.01   P≦0.01    P>0.01   P≦0.01    P>0.01   P≦0.01
    Calu3                21754    43        21797    0         335      3789
    NHBE                 21797    0         21797    0         342      3906
    A549 MOI 0.2         21797    0         21797    0         319      4391
    A549 MOI 2.0         21472    325       21797    0         208      4169
    A549 ACE2 expressed  21796    1         21797    0         182      4245

  20. Comparisons with DESeq2:

    Cell line            DESeq2
                         P>0.01   P≦0.01
    Calu3                7278     16432
    NHBE                 23383    327
    A549 MOI 0.2         7858     15852
    A549 MOI 2.0         16279    7431
    A549 ACE2 expressed  16201    7509

    After the publication of our paper, we found that the paper [*] that
    originally studied this GEO data set had been published (when we did
    this study, only the GEO data set was provided and no papers had been
    published). The paper includes DESeq2 results. They are similar to
    those of limma: DESeq2 detected a large number of genes as DEGs,
    whereas it identified only a limited number of DEGs for the NHBE
    cell line.
    [*] Daniel Blanco-Melo et al., Imbalanced Host Response to SARS-CoV-2
    Drives Development of COVID-19, Cell, 2020; 181(5): 1036-1045.e9.
    doi: 10.1016/j.cell.2020.04.026.

  21. Multiple hits with known SARS-CoV-2 interacting human genes

  22. Drug perturbations from GEO down
    Drug perturbations from GEO up

  23. Drug perturbations from GEO down
    The first one, imatinib, was once identified as a promising drug
    for COVID-19, although it was rejected later [16]. The second
    one, apratoxin A, was reported to be a promising compound based on
    its protein binding affinity [17]. The third and fourth ones, both
    doxycycline, were considered a promising drug for COVID-19 [18]. The
    seventh one, trovafloxacin, was reported to be a promising compound
    based on its protein binding affinity [19]. The eighth one,
    doxorubicin, was also reported to be a promising compound based on
    its protein binding affinity [20]. The ninth one, cisplatin, and the tenth
    one, carboplatin, were proposed as a result of drug repositioning [21].
    Seven of the nine distinct compounds among the top 10 have
    been previously reported as drugs against SARS-CoV-2.

  24. Drug perturbations from GEO up
    The first, fourth, and tenth ones, all estradiol, were reported as a
    promising compound [22]. The second one, tamoxifen, was reported to inhibit
    SARS-CoV-2 infection by suppressing viral entry [23]. The third one,
    apratoxin A, was also listed on the previous slide. The fifth one,
    MK-886, was reported to be an inhibitor of 3CL protease [24],
    although its efficiency was limited to 40%. The sixth one,
    IFN-alphacon1, was reported to be an inhibitor of SARS-CoV [25] but not
    of SARS-CoV-2. The seventh one, arachidonic acid, was generally
    expected to inhibit SARS-CoV-2 infection [26]. The eighth one,
    arsenic, was also generally expected to act against the RdRp of
    coronavirus [27]. The ninth one, metoprolol, was reported to be a
    promising drug for COVID-19 [28].
    Thus, all of the top 10 compounds have been reported to be promising.

  25. Conclusion
    We have applied “TD based unsupervised FE with optimized SD”
    to drug repositioning that targets SARS-CoV-2. The 3627 genes
    identified with this method overlap substantially with human
    genes known to interact with SARS-CoV-2 proteins.
    Almost all drugs listed as known to target these 3627 genes
    have already been tested as drugs against SARS-CoV-2.
    Thus the proposed strategy is very promising.
