Drug repositioning for SARS-CoV-2 with tensor decomposition based unsupervised feature extraction

Y-h. Taguchi
Department of Physics, Chuo University, Tokyo 112-8551, Japan

Reference: Y-h. Taguchi and Turki Turki, "Tensor decomposition- and principal component analysis-based unsupervised feature extraction to select more reasonable differentially expressed genes: Optimization of standard deviation versus state-of-art methods", bioRxiv 2022.02.18.481115; doi: https://doi.org/10.1101/2022.02.18.481115

Introduction
Bioinformatics is a research field that analyzes massive genomics data sets using cutting-edge computational, statistical, and machine learning techniques. Typical data sets, with their approximate numbers of features, are:
• Gene expression profiles: ~10^4
• DNA methylation: ~10^7
• DNA accessibility: ~10^7
• microRNA expression: ~10^3
whereas the number of samples is small (10 to 10^2).

Data analysis in bioinformatics is a typical "large p small n" problem.
What is a large p small n problem? When there are only a few examples (n) but each has many features (p), it is often difficult to distinguish between the examples.

For example, suppose you are given some novels to classify as either fantasy or science fiction. In this case, the features are the set of words contained in each novel.
Fantasy: Snow White, The Lord of the Rings, Knights of the Round Table
Science fiction: Star Trek, X-Men, Superman
Is Star Wars science fiction or fantasy? Based on only a small number of examples, labeling a new title is very difficult.
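The difficulty can be illustrated numerically. The following is a minimal sketch on synthetic data (the sizes, the seed, and the helper name `max_abs_correlation` are assumptions for illustration, not part of the talk): every feature is pure noise, yet with enough features some of them correlate with the class labels by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 30, 10_000        # hypothetical sizes: small n, large p

# Two balanced classes and features that are pure noise (no real signal).
labels = np.repeat([0.0, 1.0], n_samples // 2)
features = rng.normal(size=(n_samples, n_features))


def max_abs_correlation(feats, y):
    """Largest absolute Pearson correlation between any feature column and y."""
    f = feats - feats.mean(axis=0)
    yc = y - y.mean()
    r = (f.T @ yc) / (np.linalg.norm(f, axis=0) * np.linalg.norm(yc))
    return np.abs(r).max()


few = max_abs_correlation(features[:, :10], labels)   # only the first 10 features
many = max_abs_correlation(features, labels)          # all 10,000 features

print(f"best |r| among 10 noise features:     {few:.2f}")
print(f"best |r| among 10,000 noise features: {many:.2f}")
```

The more features there are relative to samples, the easier it is to find a feature that appears to separate the classes without carrying any information about them.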

In bioinformatics, the number of samples is small (~10^2) whereas the number of features is huge (>>10^4). As a method suited to this situation, we developed a tensor decomposition approach applicable to the large p small n problem. In this talk, I would like to introduce some examples of applying the method we proposed, “Tensor Decomposition (TD) based unsupervised feature extraction (FE)”, to bioinformatics problems.

I have published a book on this topic with Springer International. I would be glad if the audience would buy it and learn about my method.
Y-h. Taguchi, Unsupervised Feature Extraction Applied to Bioinformatics: A PCA and TD Based Approach, Springer International (2020)

What is a tensor?
Scalar x: a single number, e.g. 1
Vector x_i: a set of scalars arranged in a line, e.g. (1, 2, 3, 4, ...)
Matrix x_ij: a set of scalars arranged in a table (i.e. rows and columns), e.g. the numbers 1 to 9 laid out in rows and columns
Tensor x_ijk: a set of scalars arranged in an array with more than two indices (here i, j, and k)

A tensor is suitable for storing genomics data.
Gene expression: $x_{ijk} \in \mathbb{R}^{N \times M \times K}$ (N genes × M persons × K tissues), where i indexes genes, j persons, and k tissues.
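As an illustration of these definitions, the sketch below builds a scalar, a vector, a matrix, and a three-index tensor as NumPy arrays. The toy sizes and the value stored are assumptions chosen only for the example; note that NumPy indices start at 0 whereas the slides count from 1.

```python
import numpy as np

scalar = np.float64(1.0)                   # x: a single number
vector = np.array([1, 2, 3, 4])            # x_i: one index
matrix = np.arange(1, 10).reshape(3, 3)    # x_ij: two indices (rows, columns)

# x_ijk: genes x persons x tissues (hypothetical toy sizes)
n_genes, n_persons, n_tissues = 4, 3, 2
expression = np.zeros((n_genes, n_persons, n_tissues))

# Expression of gene i=1 in person j=3, tissue k=2 (1-based, as on the slides)
expression[0, 2, 1] = 7.5

for name, obj in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", expression)]:
    print(f"{name}: number of indices = {np.ndim(obj)}, shape = {np.shape(obj)}")
```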

What is tensor decomposition (TD)?
Expand the tensor as a sum of products of vectors:
$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3)\, u_{l_1 i}\, u_{l_2 j}\, u_{l_3 k}$$
where i indexes genes, j persons, and k tissues, and $G(l_1 l_2 l_3)$ is the weight of the term $u_{l_1 i} u_{l_2 j} u_{l_3 k}$.
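One common way to obtain an expansion of this form is the higher-order singular value decomposition (HOSVD). The slides do not name the algorithm, so treating it as HOSVD is an assumption here, and the helper names `unfold`, `hosvd`, and `reconstruct` are hypothetical. A minimal NumPy sketch on a random toy tensor:

```python
import numpy as np


def unfold(x, mode):
    """Matricize x so that axis `mode` indexes the rows."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)


def hosvd(x):
    """Return the core tensor G and one matrix of singular vectors per mode.

    Column l of factors[0] plays the role of u_{l i}, column l of factors[1]
    the role of u_{l j}, and so on.
    """
    factors = [np.linalg.svd(unfold(x, m), full_matrices=False)[0]
               for m in range(x.ndim)]
    core = x
    for m, u in enumerate(factors):
        # Project mode m of the tensor onto the singular vectors of that mode.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, m)), 0, m)
    return core, factors


def reconstruct(core, factors):
    """Sum over l1, l2, l3 of G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k}."""
    x = core
    for m, u in enumerate(factors):
        x = np.moveaxis(np.tensordot(u, x, axes=(1, m)), 0, m)
    return x


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4, 3))        # toy tensor: 6 genes x 4 persons x 3 tissues
G, (u_gene, u_person, u_tissue) = hosvd(x)

error = np.abs(reconstruct(G, [u_gene, u_person, u_tissue]) - x).max()
print("core shape:", G.shape)
print("maximum reconstruction error:", error)
```

Keeping all terms reproduces the tensor up to rounding; the approximation sign in the formula applies when the sums are truncated to fewer terms.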

Advantages of tensor decomposition (TD)
We can know:
“Dependence of x_ijk upon i” → u_{l1 i} (gene selection)
“Dependence of x_ijk upon j” → u_{l2 j} (healthy control vs patient)
“Dependence of x_ijk upon k” → u_{l3 k} (tissue specificity)
We can answer the question: which genes are expressed differently between healthy controls and patients in a tissue-specific manner?

Application to a real example
Drug repositioning for COVID-19

Data set: GSE147507
Gene expression of human lung cell lines with/without SARS-CoV-2 infection.
i: genes (21797)
j: cell lines; j=1: Calu3, j=2: NHBE, j=3: A549 (MOI 0.2), j=4: A549 (MOI 2.0), j=5: A549 (ACE2 expressed)
k: k=1: Mock, k=2: SARS-CoV-2 infected
m: three biological replicates
(MOI: multiplicity of infection)

$$x_{ijkm} \in \mathbb{R}^{21797 \times 5 \times 2 \times 3}$$
$$x_{ijkm} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4)\, u_{l_1 j}\, u_{l_2 k}\, u_{l_3 m}\, u_{l_4 i}$$
$u_{l_1 j}$: $l_1$th dependence upon cell lines
$u_{l_2 k}$: $l_2$th dependence upon with and without SARS-CoV-2 infection
$u_{l_3 m}$: $l_3$th dependence upon biological replicate
$u_{l_4 i}$: $l_4$th dependence upon genes
G: weight of individual terms

Purpose: identification of $l_1, l_2, l_3$ whose vectors are independent of cell lines and biological replicates ($u_{l_1 j}$ and $u_{l_3 m}$ take constant values regardless of j and m) and dependent upon the presence or absence of SARS-CoV-2 infection ($u_{l_2 1} = -u_{l_2 2}$).
This is a heavy “large p small n” problem:
Number of variables (= p): 21797 ~ 10^4
Number of samples (= n): 5 × 2 × 3 = 30 ~ 10
p/n ~ 10^3

$l_1 = 1$, $l_2 = 2$, $l_3 = 1$
[Figure panels: cell lines; with and without SARS-CoV-2 infection; biological replicate]
These vectors are independent of cell lines and biological replicates, but dependent upon SARS-CoV-2 infection.

With $l_1 = 1$, $l_2 = 2$, $l_3 = 1$: for which $l_4$ is $|G(1, 2, 1, l_4)|$ the largest?
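The selection of l_1, l_2, l_3 and then l_4 can be sketched on synthetic data. Everything below is an assumption made for illustration: the decomposition is taken to be HOSVD (the slides do not name it), the data are simulated rather than GSE147507, the tensor is stored with axes ordered (j, k, m, i) to match G(l_1 l_2 l_3 l_4), and the automatic criteria (smallest standard deviation, smallest |u_{l1} + u_{l2}|) stand in for the visual inspection described in the talk.

```python
import numpy as np


def unfold(x, mode):
    """Matricize x so that axis `mode` indexes the rows."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)


def hosvd(x):
    """Return the core tensor G and one matrix of singular vectors per mode."""
    factors = [np.linalg.svd(unfold(x, m), full_matrices=False)[0]
               for m in range(x.ndim)]
    core = x
    for m, u in enumerate(factors):
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, m)), 0, m)
    return core, factors


rng = np.random.default_rng(0)
n_lines, n_cond, n_rep, n_genes, n_effect = 5, 2, 3, 500, 20   # toy sizes

# Synthetic x_{ijkm}, axes ordered (j: cell line, k: mock/infected, m: replicate, i: gene)
x = rng.normal(size=(n_lines, n_cond, n_rep, n_genes))
x += rng.normal(scale=3.0, size=n_genes)      # gene-specific baseline expression
x[:, 0, :, :n_effect] -= 5.0                  # first 20 genes: lower in mock ...
x[:, 1, :, :n_effect] += 5.0                  # ... and higher in infected samples

G, (u_line, u_cond, u_rep, u_gene) = hosvd(x)

l1 = int(np.argmin(u_line.std(axis=0)))              # most constant over cell lines
l2 = int(np.argmin(np.abs(u_cond[0] + u_cond[1])))   # closest to u_{l2,1} = -u_{l2,2}
l3 = int(np.argmin(u_rep.std(axis=0)))               # most constant over replicates
l4 = int(np.argmax(np.abs(G[l1, l2, l3, :])))        # largest |G(l1 l2 l3 l4)|

# Genes with the largest |u_{l4 i}| are the candidates for selection.
top_genes = np.argsort(-np.abs(u_gene[:, l4]))[:n_effect]

print("selected (l1, l2, l3, l4), 1-based:", (l1 + 1, l2 + 1, l3 + 1, l4 + 1))
print("top genes:", sorted(top_genes.tolist()))
```

In this toy setting the gene vector picked by the largest |G| concentrates on the genes that were given an infection effect, which is the behavior the slide relies on when it asks which l_4 maximizes |G|.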

Gene expression that is independent of cell lines and biological replicates but dependent upon SARS-CoV-2 infection is associated with $u_{5i}$ ($l_4 = 5$).
$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{5i}}{\sigma_5} \right)^2 \right]$$
The computed P-values are corrected for multiple comparisons by the Benjamini-Hochberg method. The 3627 genes with corrected P-values < 0.1 are selected among the 21,797 genes.
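The P-value computation and the correction can be sketched as follows. This is a minimal illustration under stated assumptions: the chi-squared distribution is taken to have one degree of freedom (for which the upper tail equals erfc(|z|/sqrt(2))), the vector `u` and the value `sigma` are synthetic stand-ins for u_{5i} and σ_5, and the function names are hypothetical.

```python
import math

import numpy as np


def chi2_upper_tail_1dof(z):
    """P[chi-squared with 1 degree of freedom > z**2], via erfc(|z| / sqrt(2))."""
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])


def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values, returned in the original order."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


rng = np.random.default_rng(0)
sigma = 1.0e-3                                  # hypothetical stand-in for sigma_5
u = rng.normal(scale=sigma, size=1000)          # synthetic stand-in for u_{5i}
u[:20] = 10.0 * sigma                           # 20 genes with outlying values

p_values = chi2_upper_tail_1dof(u / sigma)      # P_i = P_chi2[> (u_i / sigma)^2]
adjusted = benjamini_hochberg(p_values)
selected = adjusted < 0.1                       # threshold used in the talk

print("number of selected genes:", int(selected.sum()))
```

Genes whose u values are too large to be explained by a Gaussian with standard deviation sigma receive small adjusted P-values and are selected.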

[Figure: values shown are $\sigma_5 = 6.55 \times 10^{-3}$ and $\sigma_5 = 7.00 \times 10^{-4}$; panel labels are "Computation from $u_{5i}$" and "Optimization"]

Comparisons with conventional methods
Since we do not know how many genes should be selected, lasso and random forest are not usable. Instead we employed SAM and limma, which are algorithms designed specifically for gene selection (adjusted P-values are used).

                            t test              SAM               limma
                       P>0.01  P≦0.01    P>0.01  P≦0.01    P>0.01  P≦0.01
Calu3                   21754      43     21797       0       335    3789
NHBE                    21797       0     21797       0       342    3906
A549 MOI 0.2            21797       0     21797       0       319    4391
A549 MOI 2.0            21472     325     21797       0       208    4169
A549 ACE2 expressed     21796       1     21797       0       182    4245

Comparisons with DESeq2

DESeq2                 P>0.01  P≦0.01
Calu3                    7278   16432
NHBE                    23383     327
A549 MOI 0.2             7858   15852
A549 MOI 2.0            16279    7431
A549 ACE2 expressed     16201    7509

After the publication of our paper, we found that the paper [*] that originally studied this GEO data set had been published (when we did this study, only the GEO data set was provided and no paper had been published). That paper includes DESeq2 results. DESeq2 is similar to limma: it detected most genes as DEGs, whereas it identified only a limited number of DEGs for the NHBE cell line.
[*] Daniel Blanco-Melo et al., Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19, Cell, 2020; 181(5): 1036-1045.e9. doi: 10.1016/j.cell.2020.04.026.

Multiple hits with known SARS-CoV-2 interacting human genes

Drug perturbations from GEO down
Drug perturbations from GEO up

Drug perturbations from GEO down
The first one, imatinib, was once identified as a promising drug against COVID-19, although it was rejected later [16]. The second one, apratoxin A, was reported to be a promising compound based on its protein binding affinity [17]. The third and fourth, both doxycycline, were supposed to be a promising drug against COVID-19 [18]. The seventh one, trovafloxacin, was reported to be a promising compound based on its protein binding affinity [19]. The eighth one, doxorubicin, was also reported to be a promising compound based on its protein binding affinity [20]. The ninth one, cisplatin, and the tenth one, carboplatin, were proposed as a result of drug repositioning [21]. Seven of the nine distinct compounds among the top 10 have previously been reported as drugs against SARS-CoV-2.

Drug perturbations from GEO up
The first, fourth, and tenth, all estradiol, were reported as a promising compound [22]. The second one, tamoxifen, was reported to inhibit SARS-CoV-2 infection by suppressing viral entry [23]. The third one, apratoxin A, was also listed on the previous page. The fifth one, MK-886, was reported to be an inhibitor of 3CL protease [24], although its efficiency was limited to 40 %. The sixth one, IFN-alphacon1, was reported to be an inhibitor of SARS-CoV [25] but not of SARS-CoV-2. The seventh one, arachidonic acid, was generally expected to inhibit SARS-CoV-2 infection [26]. The eighth one, arsenic, was also generally expected to act against the RdRp of coronavirus [27]. The ninth one, metoprolol, was reported to be a promising drug against COVID-19 [28]. Thus, all the top 10 compounds were reported to be promising.

Conclusion
We have applied “TD based unsupervised FE with optimized SD” to drug repositioning that targets SARS-CoV-2. The 3627 genes identified with this method overlap strongly with human genes known to interact with SARS-CoV-2 proteins. Almost all drugs listed as being known to target these 3627 genes have already been tested as drugs that target SARS-CoV-2. Thus the proposed strategy is very promising.
