Advancing Drug-Target Interactions Prediction: Leveraging a Large-Scale Dataset with a Rapid and Robust Chemogenomic Algorithm

Gwenn Guichaoua, 2nd-year PhD student
Advancing Drug-Target Interactions Prediction: Leveraging a Large-Scale Dataset with Komet, a Rapid and Robust Chemogenomic Algorithm
Supervisors: Véronique Stoven, Chloé Azencott, Olivier Collier (Modal’X Nanterre), Clara Nahmias (IGR)
CBIO, 26/02/2024

ATIP3 protein: a new marker for a category of TNBC
Breast cancer: one of the 3 most common cancers worldwide
Biological sub-typing of breast cancers (prognosis axis on the slide: from Good to Bad):
• Luminal A: ER+ or PR+, HER2-
• Luminal B: ER+ or PR+
• HER2-enriched: ER-, PR-, HER2+
• Triple-negative (TNBC): ER-, PR-, HER2-
Treatments shown: hormonotherapy, monoclonal antibodies, chemotherapy
A candidate biomarker to define a new breast cancer subtype, identified by Clara Nahmias’s team:
• Low expression of ATIP3 in TNBC [Rodriguez&al, 2009]
• Poorer prognosis for tumors that do not express ATIP3 (called ATIP3- tumors) [Rodriguez&al, 2019]
• 70% of ATIP3- tumors are resistant to chemotherapy
• Resistant ATIP3- tumors are more aggressive than resistant ATIP3+ tumors
Important unmet need for new therapies and therapeutic targets
Lack of knowledge about the mechanism of ATIP3

Roadmap of the thesis
New sub-type of patients: ATIP3-deficient TNBC
Part 1: Find a genetic signature to predict the chemotherapy response
Part 2: Chemogenomics: find a new treatment to increase the survival rate
Patient split shown on the slide: 70%, avoid chemotherapy; 30%, chemotherapy

Find a new treatment for TNBC tumors deficient in ATIP3
Blocking point: the proteins involved in the biological mechanisms of ATIP3- TNBC tumors are unknown
Goal: search for proteins specific to these tumors and their corresponding molecules (ligands)
Data: phenotypic survival screen of molecules by Clara Nahmias’s team at IGR on TNBC cell lines, ATIP3- and ATIP3+, in order to find 20 molecules differentially active on ATIP3- TNBC cells
• Cell line Sum52: Sum52 ATIP3- cells vs. Sum52 Ctrl ATIP3+ cells
• Exposure to each of the 100 molecules (drugs) of the TOCRIS database; readout: survival / apoptosis
• Result: 20 differentially active molecules, ATIP3- vs. ATIP3+
Problem statement: phenotypic screens provide hit molecules but not their targeted proteins or mechanism of action
My goal: predict the proteins targeted by the 20 hits

Prediction of protein-ligand interactions
Goal: find unknown proteins that are targeted by the 20 hits and may be responsible for the phenotype
Supervised learning, binary classification problem (labels 1 / -1)
• Input: database of interactions
• Output: predicted interactions between the 20 molecules and proteins
Challenges:
• The most complete training database: the largest, the most consensual, with direct interactions and negative interactions
• The most efficient algorithm: in all prediction scenarios, in a timely manner, with reasonable computing resources

Plan
LCIdb: a large new training database
• Motivation and construction
• Coverage of the protein and molecule spaces
Komet: a large-scale DTI prediction method
• Embeddings of proteins and molecules
• Interaction module
• DTI classification
Results
• Parameters set-up of the model
• Impact of molecule and protein features
• Comparison of ML algorithms
Case study: a scaffold hopping problem

Why a new training database?
Binary interactions database: Drugbank v1.5.1 [Wishart&al, 2018]
• 2,513 proteins, 4,813 molecules, 13,716 interactions +
• + well curated
• + FDA-approved drugs
• - indirect interactions
Back to bioactivity databases: A Consensus Compound/Bioactivity Dataset for Data-Driven Design and Chemogenomics [Isigkeit&al, 2022]
• Extracted from 5 bioactivity databases: ChEMBL, PubChem, IUPHAR/BPS, BindingDB, and Probes & Drugs
• + true interactions
• + checked data
• + more data
• - fewer proteins
• Direct binding: Kd, Ki, IC50 < 100 nM. No binding: Kd, Ki, IC50 > 10 µM
• 2,069 proteins, 274,515 molecules, 402,000 interactions +, 8,000 interactions -

Construction of a Large Consensus Interactions dataset
Preprocessing, for a (molecule, protein) pair:
1. Activity check annotation: keep the pair only if its multiple annotated bioactivities are within one log unit of each other
2. Structure check: keep molecules that have the same SMILES across the different sources
3. Keep pairs with a known IC50, Ki or Kd
4. Make binary interactions: measure = Kd first, then Ki, then IC50
   measure < 100 nM ($10^{-7}$ M): interaction +
   measure > 100 µM ($10^{-4}$ M): interaction -
Result: 2,069 proteins, 274,515 molecules, 402,000 interactions +, 8,000 interactions -
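The binarization rule of this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual LCIdb construction code: the function names `consistent` and `binarize` are hypothetical, measurements are assumed to be given in mol/L, and the thresholds are taken as $10^{-7}$ M and $10^{-4}$ M.

```python
import math
from typing import Dict, List, Optional

# Thresholds in mol/L (assumed reading of the slide: 10^-7 M and 10^-4 M).
POS_THRESHOLD_M = 1e-7
NEG_THRESHOLD_M = 1e-4
# Priority order for choosing the measure: Kd first, then Ki, then IC50.
PRIORITY = ("Kd", "Ki", "IC50")


def consistent(values_m: List[float]) -> bool:
    """Activity check: all annotated values lie within one log10 unit."""
    logs = [math.log10(v) for v in values_m]
    return max(logs) - min(logs) <= 1.0


def binarize(measures_m: Dict[str, float]) -> Optional[int]:
    """Return +1 (interaction), -1 (no interaction) or None (pair dropped).

    `measures_m` maps a measure type ("Kd", "Ki", "IC50") to a value in mol/L.
    Only the first available measure in PRIORITY order is used.
    """
    for kind in PRIORITY:
        value = measures_m.get(kind)
        if value is None:
            continue
        if value < POS_THRESHOLD_M:
            return 1
        if value > NEG_THRESHOLD_M:
            return -1
        return None  # between the two thresholds: neither + nor -
    return None  # no Kd, Ki or IC50 known
```

For example, `binarize({"Kd": 5e-9, "IC50": 1e-3})` uses the Kd value because it has priority, and labels the pair as a positive interaction.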

Analysis of our dataset LCIdb
Coverage of the molecular space
Representation of the molecular space with the t-SNE algorithm on Tanimoto molecule features
Comparisons with literature medium-sized datasets
Notation: proteins $(\mathfrak{p}_j)_j$, molecules $(\mathfrak{m}_i)_i$

Coverage of the protein space of our dataset LCIdb
t-SNE algorithm on protein features derived from the LAkernel
[Figure: t-SNE plots of the protein space for LCIdb, Drugbank, BIOSNAP and BindingDB]
Protein families in the legend: Protein kinase; G-protein coupled receptor 1; G-protein coupled receptor 2; G-protein coupled receptor 3; Cytochrome P450; Tubulin; Ligand-gated ion channel; PI3/PI4-kinase; SDRs; Major facilitator; Sodium channel; ARTD/PARP; Aldo/keto reductase; Cyclic nucleotide phosphodiesterase; Transient receptor; Calcium channel alpha-1 subunit; Calycin; Integrin alpha chain; adenylyl/guanylyl cyclase; Nuclear hormone receptor; Cyclins; Bcl-2; Alpha-carbonic anhydrase; Phospholipase A2; Histone deacetylase; Small GTPase; ABC transporter

DTI prediction pipeline
Training dataset: pairs $(\mathfrak{m}_{i_k}, \mathfrak{p}_{j_k})$

Step 1: Embeddings for protein and molecule, $\psi_P(\mathfrak{p})$ and $\psi_M(\mathfrak{m})$
Fixed embeddings:
• Encode various characteristics (Morgan fingerprints ECFP4 [Rogers&al, 2010])
• Derived from kernel theory (Local Alignment kernel to compute similarity between 2 proteins [Saigo&al, 2004])
Learned embeddings:
• Computed from networks pre-trained on another task (for example ProtBert [Elnaggar&al, 2021])
• Learned by neural networks in the DTI prediction pipeline (DeepPurpose [Huang&al, 2020])

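Step 1: Embeddings for protein and molecule, $\psi_P$ and $\psi_M$, derived from kernel theory
Choice of the Tanimoto kernel $k_M$ (similarity between molecules $\mathfrak{m}$ and $\mathfrak{m}'$ in the training set): $k_M(\mathfrak{m}, \mathfrak{m}')$
Nystrom approximation [Scholkopf et al, 1999]:
• Choice of $m_M$ landmark molecules $\hat{\mathfrak{m}}_1, \dots, \hat{\mathfrak{m}}_{m_M}$ in the training set
• Compute the small kernel $\hat{K}_M \in \mathbb{R}^{m_M \times m_M}$ where $(\hat{K}_M)_{\ell,t} := k_M(\hat{\mathfrak{m}}_\ell, \hat{\mathfrak{m}}_t)$
• From the SVD of $\hat{K}_M$, $\hat{K}_M = U \operatorname{diag}(\sigma) U^\top$, compute the extrapolation matrix $E \in \mathbb{R}^{m_M \times d_M}$ (extrapolation and dimension reduction), where $E := U[:, :d_M] \operatorname{diag}(\sigma_s^{-1/2})_{s=1}^{d_M}$
Molecule embedding of a molecule $\mathfrak{m}$:
$$\psi_M(\mathfrak{m}) := \Big( \sum_{\ell=1}^{m_M} E_{\ell,s}\, k_M(\hat{\mathfrak{m}}_\ell, \mathfrak{m}) \Big)_{s=1}^{d_M} \in \mathbb{R}^{d_M}$$
If $d_M = m_M$:
$$k_M(\hat{\mathfrak{m}}_\ell, \hat{\mathfrak{m}}_t) = \langle \psi_M(\hat{\mathfrak{m}}_\ell), \psi_M(\hat{\mathfrak{m}}_t) \rangle$$
$$k_M(\mathfrak{m}, \hat{\mathfrak{m}}_t) \approx \langle \psi_M(\mathfrak{m}), \psi_M(\hat{\mathfrak{m}}_t) \rangle$$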
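The Nystrom construction on this slide can be illustrated with a small NumPy sketch. It is not the Komet implementation: the function names (`tanimoto_kernel`, `nystrom_extrapolation`, `embed`) are hypothetical, the fingerprints are random binary vectors standing in for ECFP4, and the landmarks are simply the first rows of the toy dataset.

```python
import numpy as np


def tanimoto_kernel(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Tanimoto similarity between rows of binary matrices A (n x b) and B (m x b)."""
    A = A.astype(float)
    B = B.astype(float)
    inter = A @ B.T                                   # |a AND b|
    union = A.sum(axis=1)[:, None] + B.sum(axis=1)[None, :] - inter
    # Convention (assumption): two all-zero fingerprints have similarity 1.
    return np.where(union > 0, inter / np.maximum(union, 1.0), 1.0)


def nystrom_extrapolation(K_hat: np.ndarray, d_M: int) -> np.ndarray:
    """E = U[:, :d_M] diag(sigma_s^{-1/2}), from the SVD K_hat = U diag(sigma) U^T."""
    U, sigma, _ = np.linalg.svd(K_hat)
    return U[:, :d_M] / np.sqrt(sigma[:d_M])          # scales column s by sigma_s^{-1/2}


def embed(fps: np.ndarray, landmarks: np.ndarray, E: np.ndarray) -> np.ndarray:
    """psi_M(m)_s = sum_l E[l, s] * k_M(landmark_l, m), for every row m of fps."""
    return tanimoto_kernel(fps, landmarks) @ E


# Toy data: 50 random 64-bit fingerprints, the first 20 used as landmarks.
rng = np.random.default_rng(0)
fps = (rng.random((50, 64)) < 0.3).astype(int)
landmarks = fps[:20]
K_hat = tanimoto_kernel(landmarks, landmarks)

E_full = nystrom_extrapolation(K_hat, d_M=20)         # d_M = m_M
psi_landmarks = embed(landmarks, landmarks, E_full)

E_small = nystrom_extrapolation(K_hat, d_M=5)         # dimension reduction
psi_all = embed(fps, landmarks, E_small)              # shape (50, 5)
```

With $d_M = m_M$, the inner products of the landmark embeddings reproduce the small kernel, which is the identity stated on the slide; with $d_M < m_M$ the embedding is a lower-dimensional approximation.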

Step 2: Features for (molecule, protein) pairs
Notation: $\psi_M(\mathfrak{m}) := m$, $\psi_P(\mathfrak{p}) := p$; the pair feature is $z$
Mixing of molecule and protein embeddings:
• Linear mixing: concatenation of the embeddings $m$ and $p$ into $z$
• Using a tensor product: $z = m p^\top$, with $d_Z = d_M \times d_P$
• Non-linear mixing: using a neural network

Step 3: DTI prediction model
Training dataset: pairs $(\mathfrak{m}_{i_k}, \mathfrak{p}_{j_k})$ with features $z_k$ and labels $y_k$
• Tree-based methods [Shi et al, 2019]
• Network-based inference approaches [Cheng et al, 2012]
• [MolTrans, Huang et al, 2021]
• Linear model with weights $w$:
$$\min_{w \in \mathbb{R}^{(d_P \times d_M)}} \sum_{k=1}^{n_Z} \ell(\langle w, z_k \rangle, y_k) + \frac{\lambda}{2} \|w\|^2$$
  SVM with hinge loss: $\ell(y', y) = \max(0, 1 - y y')$
  Logistic loss
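As a small worked example of the linear model objective, the sketch below evaluates the regularized loss on toy data. The hinge loss is the one written on the slide; the logistic loss is written in its standard form $\log(1 + e^{-y y'})$, which is an assumption since the slide only names it. The function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def hinge_loss(y_pred: float, y: int) -> float:
    """Hinge loss from the slide: max(0, 1 - y * y')."""
    return max(0.0, 1.0 - y * y_pred)


def logistic_loss(y_pred: float, y: int) -> float:
    """Standard logistic loss log(1 + exp(-y * y')) (assumed form)."""
    return math.log1p(math.exp(-y * y_pred))


def objective(
    w: Sequence[float],
    Z: List[Sequence[float]],
    y: Sequence[int],
    lam: float,
    loss: Callable[[float, int], float],
) -> float:
    """sum_k loss(<w, z_k>, y_k) + (lam / 2) * ||w||^2."""
    data_term = sum(
        loss(sum(wi * zi for wi, zi in zip(w, z_k)), y_k)
        for z_k, y_k in zip(Z, y)
    )
    return data_term + 0.5 * lam * sum(wi * wi for wi in w)


# Toy problem: two pairs with 2-dimensional features and labels +1 / -1.
w = [1.0, 0.0]
Z = [[2.0, 0.0], [0.5, 1.0]]
y = [1, -1]
value = objective(w, Z, y, lam=2.0, loss=hinge_loss)
```

Here the first pair is classified with margin 2 (loss 0), the second pair is misclassified (loss 1.5), and the regularization term contributes 1, so `value` is 2.5.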

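Step 3: Komet model
Optimization problem:
$$\min_{w \in \mathbb{R}^{d_Z}} \sum_{k=1}^{n_Z} \ell(\langle w, z_k \rangle, y_k) + \frac{\lambda}{2} \|w\|^2$$
where $Z \in \mathbb{R}^{n_Z \times d_Z}$ has rows $z_k$, $n_Z = 460\text{k}$ and $d_Z = d_M \times d_P$
Problem is too big for both storage and computation of $Z$
Efficient computation [Airola&Pahikkala, 2017]:
$$(Zw)_k = \langle w, z_k \rangle_{\mathbb{R}^{d_Z}} = \langle w, m_{i_k} p_{j_k}^\top \rangle_{\mathbb{R}^{d_P \times d_M}} = \langle \underbrace{W p_{j_k}}_{q_{j_k}}, m_{i_k} \rangle_{\mathbb{R}^{d_M}}$$
The vectors $(q_j)_{j=1}^{n_P}$ are computed once, in $n_P \times d_Z$ operations, so $Zw$ can be computed in only $n_P \times d_Z + n_Z \times d_M$ operations
Complexity:
• Explicit $Zw$: $n_Z \times d_Z$
• Implicit computation: $n_P \times d_Z + n_Z \times d_M$
Full-batch BFGS method to solve the optimization problem
Code in PyTorch running on GPU: https://komet.readthedocs.io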
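The identity behind the implicit computation can be checked numerically. The sketch below uses NumPy as a stand-in for the PyTorch code mentioned on the slide; it is not taken from the Komet code base, the function names are hypothetical, and $W$ is assumed to be stored as a $d_M \times d_P$ matrix so that $W p$ lies in $\mathbb{R}^{d_M}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_M, n_P, d_M, d_P, n_Z = 6, 4, 3, 2, 10   # toy sizes

M = rng.normal(size=(n_M, d_M))            # molecule embeddings m_i
P = rng.normal(size=(n_P, d_P))            # protein embeddings p_j
i_k = rng.integers(0, n_M, size=n_Z)       # molecule index of each training pair
j_k = rng.integers(0, n_P, size=n_Z)       # protein index of each training pair
W = rng.normal(size=(d_M, d_P))            # weights w reshaped as a matrix (assumed layout)


def scores_explicit(W, M, P, i_k, j_k):
    """Build Z (n_Z x d_Z) from the tensor products m p^T, then compute Z w."""
    Z = np.einsum("ka,kb->kab", M[i_k], P[j_k]).reshape(len(i_k), -1)
    return Z @ W.reshape(-1)


def scores_implicit(W, M, P, i_k, j_k):
    """Compute q_j = W p_j once per protein, then (Zw)_k = <q_{j_k}, m_{i_k}>."""
    Q = P @ W.T                             # (n_P, d_M), cost n_P * d_Z
    return np.einsum("ka,ka->k", Q[j_k], M[i_k])   # cost n_Z * d_M


explicit = scores_explicit(W, M, P, i_k, j_k)
implicit = scores_implicit(W, M, P, i_k, j_k)
```

Both functions return the same vector of $n_Z$ scores, but the implicit version never materializes $Z$, which is what makes the large-scale setting tractable.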

Parameters set-up of the Komet model
Large-sized dataset, number of molecules $n_M = 143\text{k}$; more difficult prediction scenario (Orphan case)
Impact of landmarks: $m_M = 3000$ gives the same performance and saves computational time and resources
Impact of dimension: $d_M = 1000$ gives the same performance and saves time and RAM by a factor of 2

Impact of molecule and protein embeddings
Comparison of fixed and learned embeddings $\psi_P$ and $\psi_M$: Komet on the LCIdb_Orphan dataset, with $d_M = 1000$ and $m_M = 3000$, measured by AUPR
• Fixed embeddings are better than DL embeddings for this specific problem (drug-like molecules and human druggable proteins)
• Stability of the Komet performance (except for Prot5XLUniref50)

Comparison to Deep Learning algorithms
ConPlex model [Singh, 2023] and MolTrans model [Huang, 2021]
• Almost the same pipeline structure, in 3 steps
• Optimization algorithm: SGD is less precise, less stable and less fast than BFGS
• Everything trained vs. fixed features

Performance comparison (AUPR) with DL algorithms
• On literature medium-scale datasets
• On large-scale datasets
• On an external dataset
Evaluation: Train/Val/Test split (70%, 10%, 20%)

Case study: a scaffold hopping problem [Pinel, 2023]
• Komet recovers more out-of-scaffold ligands
• LCIdb is a better training dataset
• Komet outperforms in all criteria

Conclusion
Initial problem: understanding the biological mechanisms associated with a set of 20 differentially active molecules found by a phenotypic survival screen
Contributions:
• A large new molecule/protein interactions dataset
• Komet: fast and state of the art, https://komet.readthedocs.io
Perspectives:
• Analysis of the target proteins predicted for the 20 differentially active molecules
• Chemogenomics: enlarge and consolidate the set of targeted proteins

Acknowledgments
Project supported by the Île-de-France Region as part of the “DIM AI4IDF”
Sylvie RODRIGUES-FERREIRA, Clara NAHMIAS, Véronique STOVEN, Chloé AZENCOTT, Olivier COLLIER
Thanks for your attention!

Cumulative Histogram Curves (CHC)
[Figure: cumulative proportion (0.0 to 1) against rank of unknown active (100 to 400). Curves: Komet on LCIdb; Kernel SVM on Drugbank; ConPlex on BindingDB and contrastive on DUD-E; ConPlex on LCIdb and contrastive on DUD-E]
