Slide 1

Slide 1 text

Gwenn Guichaoua, 2nd year PhD
Advancing Drug-Target Interactions Prediction: Leveraging a Large-Scale Dataset with Komet, a Rapid and Robust Chemogenomic Algorithm
Supervisors: Véronique Stoven, Chloé Azencott, Olivier Collier (Modal’X Nanterre), Clara Nahmias (IGR)
CBIO, 26/02/2024

Slide 2

Slide 2 text

ATIP3 protein: a new marker for a category of TNBC

Breast cancer: 1 of the 3 most common cancers worldwide

Biological sub-typing of breast cancers (subtype: phenotype)
•Luminal A: ER+ or PR+, HER2-
•Luminal B: ER+ or PR+
•HER2-enriched: ER- PR- HER2+
•Triple Negative (TNBC): ER- PR- HER2-
Prognosis axis: from good to bad. Treatments: chemotherapy, hormone therapy, monoclonal antibodies.

A candidate biomarker to define a new breast cancer subtype, identified by Clara Nahmias’s team
•Low expression of ATIP3 in TNBC [Rodriguez&al, 2009]
•Poorer prognosis for tumors that do not express ATIP3 (called ATIP3- tumors) [Rodriguez&al, 2019]
•70% of ATIP3- tumors are resistant to chemotherapy
•Resistant ATIP3- tumors are more aggressive than resistant ATIP3+ tumors

Important unmet need for new therapies and therapeutic targets
Lack of knowledge about the mechanism of ATIP3

Slide 3

Slide 3 text

Roadmap of the thesis
New sub-type of patients: ATIP3-deficient TNBC
•Part 1: Find a genetic signature to predict the chemotherapy response
•Part 2: Chemogenomics. Find a new treatment to increase the survival rate
70%: avoid chemotherapy; 30%: chemotherapy

Slide 4

Slide 4 text

Find a new treatment for TNBC tumors deficient in ATIP3

Blocking points: unknown proteins involved in the biological mechanisms of ATIP3- TNBC tumors
Goal: search for proteins specific to these tumors and for their corresponding molecules (ligands)

Data: phenotypic survival screen of molecules by Clara Nahmias’s team at IGR on TNBC cell lines, ATIP3- and ATIP3+, in order to find 20 molecules differentially active on ATIP3- TNBC cells
[Figure: Sum52 cell line, ATIP3- cells vs. control ATIP3+ cells, each exposed to one of the 100 molecules (drugs) of the TOCRIS base; readout: survival or apoptosis; result: 20 molecules differentially active on ATIP3- vs ATIP3+]

Problem statement: phenotypic screens provide hit molecules but not their targeted proteins or mechanism of action.
My goal: predict the proteins targeted by the 20 hits

Slide 5

Slide 5 text

Prediction of protein-ligand interactions

Goal: find unknown proteins that are targeted by the 20 hits and that may be responsible for the phenotype

Supervised learning: binary classification problem (labels 1 and -1)
•Input: database of interactions
•Output: predicted interactions

Challenges
•The most complete training base: the largest, the most consensual, with direct interactions and negative interactions
•The most efficient algorithm: in all prediction scenarios, in a timely manner, with reasonable computing resources

Slide 6

Slide 6 text

Plan
•LCIdb: a large new training database
  Motivation and Construction
  Coverage of the protein and molecule spaces
•Komet: a Large-scale DTI prediction method
  Embeddings of proteins and molecules
  Interaction module
  DTI classification
•Results
  Parameters set-up of the model
  Impact of molecule and protein features
  Comparison of ML algorithms
•Case Study: A scaffold hopping problem

Slide 7

Slide 7 text

Why a new training database?

Binary interactions database: Drugbank v1.5.1 [Wishart&al,2018]
2,513 proteins, 4,813 molecules, 13,716 positive interactions
+ well curated
+ FDA-approved drugs
- indirect interactions

Back to bioactivity databases: "A Consensus Compound/Bioactivity Dataset for Data-Driven Design and Chemogenomics" [Isigkeit&al,2022]
Extracted from 5 bioactivity databases: ChEMBL, PubChem, IUPHAR/BPS, BindingDB, and Probes & Drugs
+ true interactions
+ checked data
+ more data
- fewer proteins

Direct binding: Kd, Ki, IC50 < 100 nM. No binding: Kd, Ki, IC50 > 10 µM.
Result: 2,069 proteins, 274,515 molecules, 402,000 positive interactions, 8,000 negative interactions

Slide 8

Slide 8 text

Construction of a Large Consensus Interactions dataset

Preprocessing, for a (molecule, protein) pair:
1. Activity check annotation: keep the pair only if its multiple annotated bioactivities are within one log unit of each other
2. Structure check: keep molecules that have the same SMILES across the different sources
3. Keep pairs with a known IC50, Ki, or Kd
4. Make binary interactions: measure = first Kd, then Ki, then IC50
   measure < 100 nM ($10^{-7}$ M): positive interactions
   measure > 100 µM ($10^{-4}$ M): negative interactions

Result: 2,069 proteins, 274,515 molecules, 402,000 positive interactions, 8,000 negative interactions
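The binarization in step 4 can be sketched as follows. This is a minimal illustration, not the code used to build LCIdb: the function and constant names are hypothetical, measurements are assumed to be given in mol/L, and the thresholds are assumed to be the molar values shown on this slide ($10^{-7}$ M and $10^{-4}$ M).

```python
from typing import Optional

# Assumed thresholds in mol/L (hypothetical constant names).
POSITIVE_BELOW_M = 1e-7   # strong binding: positive interaction
NEGATIVE_ABOVE_M = 1e-4   # no binding: negative interaction


def pick_measure(kd: Optional[float] = None,
                 ki: Optional[float] = None,
                 ic50: Optional[float] = None) -> Optional[float]:
    """Return the first available measurement, in the order Kd, Ki, IC50."""
    for value in (kd, ki, ic50):
        if value is not None:
            return value
    return None


def binarize(kd: Optional[float] = None,
             ki: Optional[float] = None,
             ic50: Optional[float] = None) -> Optional[int]:
    """Map a (molecule, protein) pair to +1, -1, or None (pair discarded)."""
    measure = pick_measure(kd, ki, ic50)
    if measure is None:
        return None          # no known bioactivity: pair is not kept
    if measure < POSITIVE_BELOW_M:
        return 1
    if measure > NEGATIVE_ABOVE_M:
        return -1
    return None              # intermediate activity: neither label
```

With these assumptions, a pair with Kd = 5 nM is labeled positive even if a weaker IC50 is also reported, because Kd takes priority.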

Slide 9

Slide 9 text

Analysis of our dataset LCIdb: coverage of the molecular space
Notation: molecules $(\mathfrak{m}_i)_i$, proteins $(\mathfrak{p}_j)_j$
Representation of the molecular space with the t-SNE algorithm on Tanimoto molecule features
Comparisons with literature medium-sized datasets

Slide 10

Slide 10 text

Coverage of the protein space of our dataset LCIdb
t-SNE algorithm on protein features derived from the LAkernel
[Figure: t-SNE maps of the protein space comparing LCIdb with Drugbank, BIOSNAP, and BindingDB. Legend (protein families): Protein kinase; G-protein coupled receptor 1; G-protein coupled receptor 2; G-protein coupled receptor 3; Cytochrome P450; Tubulin; Ligand-gated ion channel; PI3/PI4-kinase; SDRs; Major facilitator; Sodium channel; ARTD/PARP; Aldo/keto reductase; Cyclic nucleotide phosphodiesterase; Transient receptor; Calcium channel alpha-1 subunit; Calycin; Integrin alpha chain; Adenylyl/guanylyl cyclase; Nuclear hormone receptor; Cyclins; Bcl-2; Alpha-carbonic anhydrase; Phospholipase A2; Histone deacetylase; Small GTPase; ABC transporter]

Slide 12

Slide 12 text

DTI prediction pipeline
Training dataset: pairs $(\mathfrak{m}_{i_k}, \mathfrak{p}_{j_k})$

Slide 13

Slide 13 text

Step 1: Embeddings for protein and molecule, $\psi_P(\mathfrak{p})$ and $\psi_M(\mathfrak{m})$

Fixed embeddings
•Encode various characteristics (Morgan fingerprints ECFP4 [Rogers&al,2010])
•Derived from kernel theory (Local Alignment kernel to compute the similarity between 2 proteins [Saigo&al,2004])

Learned embeddings
•Computed from networks pre-trained on another task (for example ProtBert [Elnaggar&al,2021])
•Learned by neural networks in the DTI prediction pipeline (DeepPurpose [Huang&al,2020])

Slide 14

Slide 14 text

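Step 1: Embeddings for protein and molecule ($\psi_P$, $\psi_M$), derived from kernel theory

Choice of the Tanimoto kernel $k_M$ (similarity between molecules $\mathfrak{m}$ and $\mathfrak{m}'$: $k_M(\mathfrak{m}, \mathfrak{m}')$), where $\mathfrak{m}$ and $\mathfrak{m}'$ are molecules in the training set

Nyström approximation [Scholkopf et al, 1999] and dimension reduction
•Choice of $m_M$ landmark molecules $\hat{\mathfrak{m}}$ in the training set
•Compute the small kernel $\hat{K}_M \in \mathbb{R}^{m_M \times m_M}$ where $(\hat{K}_M)_{\ell,t} := k_M(\hat{\mathfrak{m}}_\ell, \hat{\mathfrak{m}}_t)$
•From the SVD $\hat{K}_M = U \,\mathrm{diag}(\sigma)\, U^\top$, compute the extrapolation matrix $E \in \mathbb{R}^{m_M \times d_M}$ where $E := U[:, :d_M] \,\mathrm{diag}(\sigma_s^{-1/2})_{s=1}^{d_M}$

Molecule embedding
$\psi_M(\mathfrak{m}) := \left( \sum_{\ell=1}^{m_M} E_{\ell,s}\, k_M(\hat{\mathfrak{m}}_\ell, \mathfrak{m}) \right)_{s=1}^{d_M} \in \mathbb{R}^{d_M}$

If $d_M = m_M$:
$k_M(\hat{\mathfrak{m}}_\ell, \hat{\mathfrak{m}}_t) = \langle \psi_M(\hat{\mathfrak{m}}_\ell), \psi_M(\hat{\mathfrak{m}}_t) \rangle$
$k_M(\mathfrak{m}, \hat{\mathfrak{m}}_t) \approx \langle \psi_M(\mathfrak{m}), \psi_M(\hat{\mathfrak{m}}_t) \rangle$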
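The Nyström construction above can be sketched in NumPy as follows. This is an illustrative sketch, not the Komet implementation: the function names are hypothetical, the fingerprints are random binary vectors standing in for ECFP4 fingerprints, and the sizes are tiny compared with the values used in the talk.

```python
import numpy as np


def tanimoto_kernel(A, B):
    """Tanimoto similarity between rows of two binary fingerprint matrices."""
    A = A.astype(float)
    B = B.astype(float)
    inter = A @ B.T                                   # |a AND b|
    union = A.sum(1)[:, None] + B.sum(1)[None, :] - inter  # |a OR b|
    # Pairs of all-zero fingerprints get similarity 0 here (a convention).
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


def nystrom_extrapolation(landmarks, d):
    """Extrapolation matrix E = U[:, :d] diag(sigma^{-1/2}) from the small kernel."""
    K_hat = tanimoto_kernel(landmarks, landmarks)
    U, sigma, _ = np.linalg.svd(K_hat)                # singular values sorted descending
    return U[:, :d] / np.sqrt(sigma[:d])


def embed(molecules, landmarks, E):
    """psi_M(m)_s = sum_l E[l, s] * k_M(landmark_l, m), for every molecule m."""
    return tanimoto_kernel(molecules, landmarks) @ E


rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2, size=(200, 64))     # toy stand-in for ECFP4
landmarks = fingerprints[rng.choice(200, size=20, replace=False)]

# With d = number of landmarks, the embedding reproduces the small kernel.
E_full = nystrom_extrapolation(landmarks, d=20)
psi_landmarks = embed(landmarks, landmarks, E_full)
K_hat = tanimoto_kernel(landmarks, landmarks)
gram_error = np.abs(psi_landmarks @ psi_landmarks.T - K_hat).max()

# With d < number of landmarks, the embedding is also reduced in dimension.
E = nystrom_extrapolation(landmarks, d=10)
psi = embed(fingerprints, landmarks, E)               # shape (200, 10)
```

The check on `gram_error` corresponds to the exact equality stated on the slide for landmark pairs when $d_M = m_M$.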

Slide 15

Slide 15 text

Step 2: Features for (molecule, protein) pairs
Notation: $m := \psi_M(\mathfrak{m})$, $p := \psi_P(\mathfrak{p})$; $z$ is a mixing of the molecule and protein embeddings
•Using a tensor product: $z = m p^\top$, with $d_Z = d_M \times d_P$
•Linear mixing: concatenation of the embeddings
•Non-linear mixing: using a neural network
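The difference between the tensor product and concatenation can be illustrated with a short NumPy sketch. The function names and dimensions are hypothetical; the check at the end uses the standard identity that the inner product of two tensor-product features factorizes into the product of the molecule and protein inner products.

```python
import numpy as np

rng = np.random.default_rng(0)
d_M, d_P = 5, 3
m, m2 = rng.normal(size=d_M), rng.normal(size=d_M)    # two molecule embeddings
p, p2 = rng.normal(size=d_P), rng.normal(size=d_P)    # two protein embeddings


def tensor_features(m, p):
    """z = m p^T flattened to a vector of size d_M * d_P."""
    return np.outer(m, p).ravel()


def concat_features(m, p):
    """z = [m, p], a vector of size d_M + d_P."""
    return np.concatenate([m, p])


z, z2 = tensor_features(m, p), tensor_features(m2, p2)
z_concat = concat_features(m, p)

# <z, z'> = <m, m'> * <p, p'> for tensor-product features
lhs = z @ z2
rhs = (m @ m2) * (p @ p2)
```

A linear model on the tensor-product features can therefore capture molecule-protein cross terms, which a linear model on concatenated features cannot.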

Slide 16

Slide 16 text

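Step 3: DTI prediction model
Training dataset: pairs $(\mathfrak{m}_{i_k}, \mathfrak{p}_{j_k})$ with features $z_k$ and labels $y_k$
•Tree-based methods [Shi et al, 2019]
•Network-based inference approaches [Cheng et al, 2012]
•Deep learning [MolTrans, Huang et al, 2021]
•Linear model with weights $w$:
  $\min_{w \in \mathbb{R}^{d_P \times d_M}} \sum_{k=1}^{n_Z} \ell(\langle w, z_k \rangle, y_k) + \frac{\lambda}{2} \|w\|^2$
  SVM with hinge loss: $\ell(y', y) = \max(0, 1 - y y')$
  Logistic loss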
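The regularized objective of the linear model can be written down directly. The sketch below is illustrative only: the function names are hypothetical, the feature matrix is random toy data, and the logistic loss is assumed to take its usual form $\log(1 + e^{-y y'})$, which the slide does not spell out.

```python
import numpy as np


def hinge_loss(scores, labels):
    """SVM hinge loss: max(0, 1 - y * y')."""
    return np.maximum(0.0, 1.0 - labels * scores)


def logistic_loss(scores, labels):
    """Assumed standard logistic loss: log(1 + exp(-y * y')), computed stably."""
    return np.logaddexp(0.0, -labels * scores)


def objective(w, Z, y, lam, loss=hinge_loss):
    """sum_k loss(<w, z_k>, y_k) + (lam / 2) * ||w||^2"""
    scores = Z @ w
    return loss(scores, y).sum() + 0.5 * lam * np.dot(w, w)


rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))                  # toy pair features z_k as rows
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])    # labels in {+1, -1}
w0 = np.zeros(3)

hinge_at_zero = objective(w0, Z, y, lam=1.0)
logistic_at_zero = objective(w0, Z, y, lam=1.0, loss=logistic_loss)
```

At $w = 0$ every score is 0, so the hinge objective equals the number of pairs and the logistic objective equals that number times $\log 2$.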

Slide 17

Slide 17 text

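Step 3: Komet model

Optimization problem
$\min_{w \in \mathbb{R}^{d_Z}} \sum_{k=1}^{n_Z} \ell(\langle w, z_k \rangle, y_k) + \frac{\lambda}{2} \|w\|^2$

The problem is too big for both storage and computation of $Z \in \mathbb{R}^{n_Z \times d_Z}$ (rows $z_k$), with $n_Z = 460$k and $d_Z = d_M \times d_P$

Efficient computation [Airola&Pahikkala,2017]
$(Zw)_k = \langle w, z_k \rangle_{\mathbb{R}^{d_Z}} = \langle w, m_{i_k} p_{j_k}^\top \rangle_{\mathbb{R}^{d_P \times d_M}} = \langle W p_{j_k}, m_{i_k} \rangle_{\mathbb{R}^{d_M}}$, with $q_{j_k} := W p_{j_k}$, where $W$ is $w$ arranged as a matrix
The vectors $(q_j)_{j=1}^{n_P}$ can be computed in only $n_P \times d_Z$ operations

Complexity
•Explicit $Zw$: $n_Z \times d_Z$
•Implicit computation: $n_P \times d_Z + n_Z \times d_M$

Full batch BFGS method to solve the optimization problem
Code in PyTorch running on GPU: https://komet.readthedocs.io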
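The implicit computation of $Zw$ can be checked against the explicit one on a toy problem. This is a NumPy sketch with hypothetical names and small random data, not the PyTorch code of the Komet package; it assumes $w$ is the row-major flattening of a matrix $W$ of shape $d_M \times d_P$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_M, n_P, d_M, d_P, n_Z = 50, 8, 6, 4, 200

M = rng.normal(size=(n_M, d_M))            # molecule embeddings, one row each
P = rng.normal(size=(n_P, d_P))            # protein embeddings, one row each
i_idx = rng.integers(0, n_M, size=n_Z)     # molecule index of each training pair
j_idx = rng.integers(0, n_P, size=n_Z)     # protein index of each training pair
W = rng.normal(size=(d_M, d_P))            # weights arranged as a matrix


def zw_explicit(M, P, i_idx, j_idx, W):
    """Build Z (n_Z x d_Z) row by row as flattened m p^T, then multiply by w."""
    Z = np.einsum("ka,kb->kab", M[i_idx], P[j_idx]).reshape(len(i_idx), -1)
    return Z @ W.ravel()


def zw_implicit(M, P, i_idx, j_idx, W):
    """Compute q_j = W p_j once per protein, then <q_{j_k}, m_{i_k}> per pair."""
    Q = P @ W.T                            # shape (n_P, d_M), cost n_P * d_Z
    return np.einsum("kd,kd->k", Q[j_idx], M[i_idx])   # cost n_Z * d_M


explicit = zw_explicit(M, P, i_idx, j_idx, W)
implicit = zw_implicit(M, P, i_idx, j_idx, W)
```

Both functions return the same vector of $n_Z$ scores, but the implicit version never materializes $Z$, which is what makes the full-batch optimization feasible when $n_Z$ and $d_Z$ are large.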

Slide 19

Slide 19 text

Parameters set-up of the Komet model
Large-sized dataset: number of molecules $n_M = 143$k; more difficult prediction scenario (Orphan case)
•Impact of the number of landmarks, $m_M = 3000$: same performance, saves computational time and resources
•Impact of the dimension, $d_M = 1000$: same performance, saves time and RAM by a factor of 2

Slide 20

Slide 20 text

Impact of molecule and protein embeddings
Comparison of fixed and learned embeddings ($\psi_P$, $\psi_M$): Komet on the LCIdb_Orphan dataset, with $d_M = 1000$ and $m_M = 3000$; metric: AUPR
•Stability of the Komet performance (except for Prot5XLUniref50)
•Fixed embeddings better than DL embeddings for this specific problem (drug-like molecules and human druggable proteins)

Slide 21

Slide 21 text

Comparison to Deep Learning algorithms
ConPlex model [Singh, 2023]; MolTrans model [Huang, 2021]
•Almost the same pipeline structure, in 3 steps
•Optimization algorithm: SGD is less precise, less stable, and slower than BFGS
•Everything trained vs fixed features

Slide 22

Slide 22 text

Performance comparison (AUPR) with DL algorithms
•On literature medium-scale datasets. Evaluation: Train/Val/Test (70%, 10%, 20%)
•On an external dataset
•On large-scale datasets
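The evaluation protocol can be sketched as a random 70/10/20 split followed by an AUPR estimate on the test scores. This is only an illustration under stated assumptions: the function names are hypothetical, the split is a plain random split over pairs (the talk does not detail how the splits were drawn), and AUPR is estimated by average precision, ignoring tied scores.

```python
import numpy as np


def split_indices(n, fractions=(0.7, 0.1, 0.2), seed=0):
    """Randomly split range(n) into train / validation / test index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]


def average_precision(scores, labels):
    """Average precision: mean of precision@k over the ranks k of the positives.

    `labels` are in {+1, -1}; higher scores mean "more likely to interact".
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    is_pos = np.asarray(labels)[order] == 1
    hits = np.cumsum(is_pos)                       # positives seen up to each rank
    ranks = np.arange(1, len(is_pos) + 1)
    return float((hits[is_pos] / ranks[is_pos]).mean())


train, val, test = split_indices(1000)
ap_example = average_precision([0.9, 0.8, 0.7, 0.6], [1, -1, 1, -1])
ap_perfect = average_precision([0.9, 0.8, 0.2, 0.1], [1, 1, -1, -1])
```

In the first toy example the positives sit at ranks 1 and 3, so the average precision is (1 + 2/3) / 2.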

Slide 23

Slide 23 text

Case study: A scaffold hopping problem
•Komet recovers more out-of-scaffold ligands [Pinel, 2023]
•LCIdb is a better training dataset
•Komet outperforms in all criteria

Slide 24

Slide 24 text

Conclusion
Initial problem: understanding the biological mechanisms associated with a set of 20 differentially active molecules found by a phenotypic survival screen
Chemogenomics: enlarge and consolidate the set of targeted proteins
Contributions:
•A large new molecule/protein interactions dataset
•Komet: fast and state of the art, https://komet.readthedocs.io
Perspectives: analysis of the target proteins predicted for the 20 differentially active molecules

Slide 25

Slide 25 text

Acknowledgments
Project supported by the Île-de-France Region as part of the “DIM AI4IDF”
Sylvie RODRIGUES-FERREIRA, Clara NAHMIAS, Véronique STOVEN, Chloé AZENCOTT, Olivier COLLIER
Thanks for your attention!

Slide 26

Slide 26 text

Cumulative Histogram Curves (CHC)
[Figure: cumulative proportion (0 to 1) as a function of the rank of the unknown active (100 to 400). Curves: Komet on LCIdb; Kernel SVM on Drugbank; ConPlex on BindingDB and contrastive on DUD-E; ConPlex on LCIdb and contrastive on DUD-E]