Advancing Drug-Target Interactions Prediction: Leveraging a Large-Scale Dataset with a Rapid and Robust Chemogenomic Algorithm

Guichaoua
February 26, 2024

Presentation at CBIO meeting, 26/02/2024

Transcript

  1. Gwenn Guichaoua, 2nd-year PhD student. Advancing Drug-Target Interactions Prediction: Leveraging a Large-Scale Dataset with Komet, a Rapid and Robust Chemogenomic Algorithm

    Supervisors: Véronique Stoven, Chloé Azencott, Olivier Collier (Modal’X Nanterre), Clara Nahmias (IGR). CBIO, 26/02/2024
  2. Biological sub-typing of breast cancers. Breast cancer: one of the 3 most common cancers worldwide.

    Subtypes and phenotypes: Luminal A (ER+ or PR+, HER2-); Luminal B (ER+ or PR+); HER2-enriched (ER-, PR-, HER2+); Triple Negative, TNBC (ER-, PR-, HER2-). Prognosis: from good to bad. Treatments: hormonotherapy, monoclonal antibodies, chemotherapy.
    ATIP3 protein: a new marker for a category of TNBC. A candidate biomarker to define a new breast cancer subtype, identified by Clara Nahmias’s team:
    • Low expression of ATIP3 in TNBC [Rodriguez&al, 2009]
    • Poorer prognosis for tumors that do not express ATIP3 (called ATIP3- tumors) [Rodriguez&al, 2019]
    • 70% of ATIP3- tumors are resistant to chemotherapy
    • Resistant ATIP3- tumors are more aggressive than resistant ATIP3+ tumors
    Important unmet need for new therapies and therapeutic targets. Lack of knowledge for understanding the mechanism of ATIP3.
  3. Roadmap of the thesis. New sub-type of patients: ATIP3-deficient TNBC.

    Part 1: Find a genetic signature to predict the chemotherapy response. 70%: avoid chemotherapy. 30%: chemotherapy.
    Part 2: Chemogenomics. Find a new treatment to increase the survival rate.
  4. Find a new treatment for TNBC tumors deficient in ATIP3

    Blocking points: unknown proteins involved in biological mechanisms for ATIP3- TNBC tumors.
    Goal: search for proteins specific to these tumors and their corresponding molecules (ligands).
    Data: phenotypic survival screen of molecules by Clara Nahmias’s team at IGR on TNBC cell lines, ATIP3- and ATIP3+, in order to find 20 molecules differentially active on ATIP3- TNBC cells.
    [Figure: Sum52 cell line, Sum52 ATIP3- cells and Sum52 Ctrl ATIP3+ cells; exposure to one of the 100 molecules (drugs) of the TOCRIS base; survival or apoptosis; 20 differentially active molecules, ATIP3- vs ATIP3+.]
    Problem statement: phenotypic screens provide hit molecules but not their targeted proteins or mechanism of action.
    My goal: predict the proteins targeted by the 20 hits.
  5. Prediction of protein-ligand interactions

    Goal: find unknown proteins that are targeted by the 20 hit molecules and that may be responsible for the phenotype.
    Supervised learning, binary classification problem (labels 1 and -1). Input: database of interactions. Output: predicted interactions.
    Challenges:
    • The most complete training base: the largest, the most consensual, with direct interactions and negative interactions.
    • The most efficient algorithm: in all prediction scenarios, in a timely manner, with reasonable computing resources.
  6. Plan

    LCIdb: a large new training database. Motivation and construction. Coverage of the protein and molecule spaces.
    Komet: a large-scale DTI prediction method. Embeddings of proteins and molecules. Interaction module. DTI classification.
    Results. Parameters set-up of the model. Impact of molecule and protein features. Comparison of ML algorithms.
    Case study: a scaffold hopping problem.
  7. Why a new training database?

    Binary interactions database: Drugbank v1.5.1 [Wishart&al,2018]. 2,513 proteins, 4,813 molecules, 13,716 interactions +. Pros: well curated, FDA-approved drugs. Cons: indirect interactions.
    Back to bioactivity databases: A Consensus Compound/Bioactivity Dataset for Data-Driven Design and Chemogenomics [Isigkeit&al,2022], extracted from 5 bioactivity databases: ChEMBL, PubChem, IUPHAR/BPS, BindingDB, and Probes & Drugs. Pros: true interactions, checked data, more data. Cons: fewer proteins.
    Direct binding: Kd, Ki, IC50 < 100 nM. No binding: Kd, Ki, IC50 > 10 microM.
    Resulting dataset: 2,069 proteins, 274,515 molecules, 402,000 interactions +, 8,000 interactions -.
  8. Construction of a Large Consensus Interactions dataset

    Preprocessing, for a (molecule, protein) pair: 1. Activity annotation check: keep the pair when its multiple annotated bioactivities are within one log unit of each other. 2. Structure check: keep molecules that have the same SMILES in the different sources. 3. Keep pairs with a known IC50, Ki or Kd. 4. Make binary interactions: measure = first Kd, then Ki, then IC50.
    Measure < 100 nM ($10^{-7}$ M): interactions +. Measure > 100 microM ($10^{-4}$ M): interactions -.
    Resulting dataset: 2,069 proteins, 274,515 molecules, 402,000 interactions +, 8,000 interactions -.
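
The preprocessing rules of this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used to build LCIdb: the function names are hypothetical, measurements are assumed to be in molar units, the thresholds are the molar values $10^{-7}$ M and $10^{-4}$ M, and leaving pairs between the two thresholds unlabelled is a choice made for this sketch.

```python
import math
from typing import Optional, Sequence

POS_THRESHOLD_M = 1e-7  # below this value: interaction + (assumed molar units)
NEG_THRESHOLD_M = 1e-4  # above this value: interaction -


def within_one_log_unit(values_m: Sequence[float]) -> bool:
    """Activity check: all annotated values lie within one log10 unit."""
    logs = [math.log10(v) for v in values_m]
    return max(logs) - min(logs) <= 1.0


def pick_measure(kd: Optional[float] = None,
                 ki: Optional[float] = None,
                 ic50: Optional[float] = None) -> Optional[float]:
    """Return the first available measure, in the order Kd, Ki, IC50."""
    for value in (kd, ki, ic50):
        if value is not None:
            return value
    return None


def binarize(kd: Optional[float] = None,
             ki: Optional[float] = None,
             ic50: Optional[float] = None) -> Optional[int]:
    """Map a (molecule, protein) pair to +1, -1, or None (no label)."""
    measure = pick_measure(kd, ki, ic50)
    if measure is None:
        return None          # no known IC50, Ki or Kd: pair is dropped
    if measure < POS_THRESHOLD_M:
        return 1
    if measure > NEG_THRESHOLD_M:
        return -1
    return None              # intermediate activity: unlabelled in this sketch
```

For example, `binarize(kd=5e-9)` gives a positive label, while `binarize(kd=1e-3, ic50=1e-9)` gives a negative one because Kd takes priority over IC50.
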
  9. Analysis of our dataset LCIdb: coverage of the molecular space

    Representation of the molecular space with the t-SNE algorithm on Tanimoto molecule features. Comparisons with literature medium-sized datasets.
    [Figure: t-SNE plots; proteins $(\mathfrak{p}_j)_j$, molecules $(\mathfrak{m}_i)_i$.]
  10. Coverage of the protein space of our dataset LCIdb

    t-SNE algorithm on protein features derived from the LAkernel.
    [Figure: t-SNE panels comparing LCIdb with Drugbank, BIOSNAP and BindingDB.]
    Protein families shown: Protein kinase, G-protein coupled receptor 1, G-protein coupled receptor 2, G-protein coupled receptor 3, Cytochrome P450, Tubulin, Ligand-gated ion channel, PI3/PI4-kinase, SDRs, Major facilitator, Sodium channel, ARTD/PARP, Aldo/keto reductase, Cyclic nucleotide phosphodiesterase, Transient receptor, Calcium channel alpha-1 subunit, Calycin, Integrin alpha chain, adenylyl/guanylyl cyclase, Nuclear hormone receptor, Cyclins, Bcl-2, Alpha-carbonic anhydrase, Phospholipase A2, Histone deacetylase, Small GTPase, ABC transporter.
  11. Plan

    LCIdb: a large new training database. Motivation and construction. Coverage of the protein and molecule spaces.
    Komet: a large-scale DTI prediction method. Embeddings of proteins and molecules. Interaction module. DTI classification.
    Results. Parameters set-up of the model. Impact of molecule and protein features. Comparison of ML algorithms.
    Case study: a scaffold hopping problem.
  12. Step 1: Embeddings for proteins and molecules ($\psi_P$, $\psi_M$)

    Fixed embeddings:
    • Encode various characteristics (Morgan fingerprints ECFP4 [Rogers&al,2010])
    • Derived from kernel theory (Local Alignment kernel to compute similarity between 2 proteins [Saigo&al,2004])
    Learned embeddings:
    • Computed from pre-trained networks on another task (for example ProtBert [Elnaggar&al,2021])
    • Learned by neural networks in the DTI prediction pipeline (DeepPurpose [Huang&al,2020])
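  13. Step 1: Embeddings for proteins and molecules ($\psi_P$, $\psi_M$), derived from kernel theory

    Choice of the Tanimoto kernel $k_M(\mathfrak{m}, \mathfrak{m}')$: similarity between molecules $\mathfrak{m}$ and $\mathfrak{m}'$ in the training set.
    Nyström approximation [Scholkopf et al, 1999]: choice of $m_M$ landmark molecules $\hat{\mathfrak{m}}$ in the training set.
    Compute the small kernel $\hat{K}_M \in \mathbb{R}^{m_M \times m_M}$, where $(\hat{K}_M)_{\ell,t} := k_M(\hat{\mathfrak{m}}_\ell, \hat{\mathfrak{m}}_t)$.
    From the SVD $\hat{K}_M = U \operatorname{diag}(\sigma) U^\top$, compute the extrapolation matrix (with dimension reduction) $E \in \mathbb{R}^{m_M \times d_M}$, where $E := U[:, :d_M] \operatorname{diag}(\sigma_s^{-1/2})_{s=1}^{d_M}$.
    Molecule embedding: $\psi_M(\mathfrak{m}) := \left( \sum_{\ell=1}^{m_M} E_{\ell,s} \, k_M(\hat{\mathfrak{m}}_\ell, \mathfrak{m}) \right)_{s=1}^{d_M} \in \mathbb{R}^{d_M}$.
    If $d_M = m_M$: $k_M(\hat{\mathfrak{m}}_\ell, \hat{\mathfrak{m}}_t) = \langle \psi_M(\hat{\mathfrak{m}}_\ell), \psi_M(\hat{\mathfrak{m}}_t) \rangle$ and $k_M(\mathfrak{m}, \hat{\mathfrak{m}}_t) \approx \langle \psi_M(\mathfrak{m}), \psi_M(\hat{\mathfrak{m}}_t) \rangle$.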
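
The Nyström construction of this slide can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the Komet implementation (which the deck describes as PyTorch code): the function names are hypothetical, random binary vectors stand in for real molecular fingerprints, and the landmark kernel is assumed to have strictly positive singular values.

```python
import numpy as np


def tanimoto_kernel(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Tanimoto similarity between rows of binary fingerprint matrices A and B."""
    A = A.astype(float)
    B = B.astype(float)
    inter = A @ B.T                                   # |a AND b|
    union = A.sum(axis=1)[:, None] + B.sum(axis=1)[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


def fit_extrapolation_matrix(landmarks: np.ndarray, d_M: int) -> np.ndarray:
    """E = U[:, :d_M] diag(sigma_s^{-1/2}), from the SVD of the landmark kernel."""
    K_hat = tanimoto_kernel(landmarks, landmarks)     # (m_M, m_M)
    U, sigma, _ = np.linalg.svd(K_hat)
    return U[:, :d_M] / np.sqrt(sigma[:d_M])          # scales column s by sigma_s^{-1/2}


def embed(fingerprints: np.ndarray, landmarks: np.ndarray, E: np.ndarray) -> np.ndarray:
    """psi_M(m)_s = sum_l E[l, s] * k_M(landmark_l, m), one row per molecule."""
    return tanimoto_kernel(fingerprints, landmarks) @ E


# Toy data: 8 landmark molecules and 3 query molecules with 64-bit fingerprints.
rng = np.random.default_rng(0)
landmarks = rng.integers(0, 2, size=(8, 64))
queries = rng.integers(0, 2, size=(3, 64))

K_hat = tanimoto_kernel(landmarks, landmarks)
E_full = fit_extrapolation_matrix(landmarks, d_M=8)   # d_M = m_M
Z_landmarks = embed(landmarks, landmarks, E_full)
Z_queries = embed(queries, landmarks, E_full)
E_small = fit_extrapolation_matrix(landmarks, d_M=4)  # dimension reduction
```

With $d_M = m_M$, the inner products of the landmark embeddings reproduce the landmark kernel, which is the identity stated at the end of the slide. Choosing $d_M < m_M$ keeps only the leading singular directions.
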
  14. Step 2: Features for (molecule, protein) pairs

    Notation: $m := \psi_M(\mathfrak{m})$, $p := \psi_P(\mathfrak{p})$, and $z$ is the feature vector of the pair.
    Mixing of molecule and protein embeddings:
    • Using a tensor product: $z = m p^\top$, with $d_Z = d_M \times d_P$
    • Linear mixing: concatenation of the embeddings
    • Non-linear mixing: using a neural network
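
The tensor-product and concatenation options can be compared on a toy example. This is a small sketch with hypothetical function names and made-up vectors; the only facts taken from the slide are $z = m p^\top$ and $d_Z = d_M \times d_P$.

```python
import numpy as np


def tensor_product_features(m: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Flatten the outer product m p^T into a vector of size d_M * d_P."""
    return np.outer(m, p).ravel()


def concatenation_features(m: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Linear mixing: stack the two embeddings, size d_M + d_P."""
    return np.concatenate([m, p])


m1, p1 = np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0])
m2, p2 = np.array([0.5, -1.0]), np.array([1.0, 0.0, 2.0])

z1 = tensor_product_features(m1, p1)   # size 2 * 3 = 6
z2 = tensor_product_features(m2, p2)
c1 = concatenation_features(m1, p1)    # size 2 + 3 = 5

# The tensor product turns the pair inner product into a product of inner products.
lhs = z1 @ z2
rhs = (m1 @ m2) * (p1 @ p2)
```

The last two lines illustrate a standard property of the tensor product: $\langle z_1, z_2 \rangle = \langle m_1, m_2 \rangle \langle p_1, p_2 \rangle$, so the pair features correspond to the product of the molecule and protein kernels.
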
  15. Step 3: DTI prediction model

    Tree-based methods [Shi et al, 2019]. Network-based inference approaches [Cheng et al, 2012].
    Linear model: $\min_{w \in \mathbb{R}^{d_P \times d_M}} \sum_{k=1}^{n_Z} \ell(\langle w, z_k \rangle, y_k) + \frac{\lambda}{2} \|w\|^2$
    SVM with hinge loss: $\ell(y', y) = \max(0, 1 - y y')$. Logistic loss.
    [Figure: training dataset of pairs $(\mathfrak{m}_{i_k}, \mathfrak{p}_{j_k})$ with features $z_k$, labels $y_k$ and weights $w$; MolTrans, Huang et al, 2021.]
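
The regularized objective of the linear model can be written out on a toy example. This is a sketch with hypothetical names and made-up data; the hinge loss follows the formula on the slide, while the logistic loss is written in its usual form $\log(1 + e^{-y y'})$, which the slide names but does not spell out.

```python
import numpy as np


def hinge_loss(scores: np.ndarray, y: np.ndarray) -> np.ndarray:
    """l(y', y) = max(0, 1 - y * y'), applied elementwise."""
    return np.maximum(0.0, 1.0 - y * scores)


def logistic_loss(scores: np.ndarray, y: np.ndarray) -> np.ndarray:
    """log(1 + exp(-y * y')), computed stably with logaddexp."""
    return np.logaddexp(0.0, -y * scores)


def objective(w, Z, y, lam, loss=hinge_loss) -> float:
    """sum_k loss(<w, z_k>, y_k) + (lam / 2) * ||w||^2."""
    scores = Z @ w
    return float(loss(scores, y).sum() + 0.5 * lam * (w @ w))


# Two toy pairs with labels +1 and -1.
Z = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([2.0, -2.0])

value_hinge = objective(w, Z, y, lam=0.5)
value_logistic_at_zero = objective(np.zeros(2), Z, y, lam=0.5, loss=logistic_loss)
```

Here both pairs are classified with margin 2, so the hinge term vanishes and only the regularization term $\frac{\lambda}{2}\|w\|^2$ contributes to `value_hinge`.
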
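  16. Step 3: Komet model

    Optimization problem: $\min_{w \in \mathbb{R}^{d_Z}} \sum_{k=1}^{n_Z} \ell(\langle w, z_k \rangle, y_k) + \frac{\lambda}{2} \|w\|^2$
    The problem is too big for both storage and computation of $Z \in \mathbb{R}^{n_Z \times d_Z}$ (rows $z_k$), with $n_Z = 460\text{k}$ and $d_Z = d_M \times d_P$.
    Efficient computation [Airola&Pahikkala,2017]: $(Zw)_k = \langle w, z_k \rangle_{\mathbb{R}^{d_Z}} = \langle w, m_{i_k} p_{j_k}^\top \rangle_{\mathbb{R}^{d_P \times d_M}} = \langle W p_{j_k}, m_{i_k} \rangle_{\mathbb{R}^{d_M}}$, with $q_{j_k} := W p_{j_k}$.
    $(q_j)_{j=1}^{n_P}$ can be computed in only $n_P \times d_Z$ operations.
    Complexity of $Zw$: explicit computation $n_Z \times d_Z$; implicit computation $n_P \times d_Z + n_Z \times d_M$.
    Full batch BFGS method to solve the optimization problem. Code in PyTorch running on GPU: https://komet.readthedocs.io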
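
The implicit computation of $Zw$ can be checked against the explicit one on small random data. This is a NumPy sketch under stated assumptions, not the Komet PyTorch code: the function names, the index arrays `mol_idx` and `prot_idx`, and the row-major reshape of $w$ into $W \in \mathbb{R}^{d_M \times d_P}$ are choices made for this illustration.

```python
import numpy as np


def zw_explicit(w, M, P, mol_idx, prot_idx):
    """Build every z_k = vec(m_ik p_jk^T), then multiply: cost n_Z * d_Z."""
    Z = np.einsum("ka,kb->kab", M[mol_idx], P[prot_idx])
    return Z.reshape(len(mol_idx), -1) @ w


def zw_implicit(w, M, P, mol_idx, prot_idx):
    """Never build Z: cost n_P * d_Z for Q, then n_Z * d_M for the dot products."""
    d_M, d_P = M.shape[1], P.shape[1]
    W = w.reshape(d_M, d_P)
    Q = P @ W.T                                  # row j is q_j = W p_j
    return np.einsum("ka,ka->k", Q[prot_idx], M[mol_idx])


rng = np.random.default_rng(0)
n_M, n_P, d_M, d_P, n_Z = 5, 4, 3, 2, 10
M = rng.normal(size=(n_M, d_M))                  # molecule embeddings
P = rng.normal(size=(n_P, d_P))                  # protein embeddings
mol_idx = rng.integers(0, n_M, size=n_Z)         # i_k for each training pair
prot_idx = rng.integers(0, n_P, size=n_Z)        # j_k for each training pair
w = rng.normal(size=d_M * d_P)

explicit = zw_explicit(w, M, P, mol_idx, prot_idx)
implicit = zw_implicit(w, M, P, mol_idx, prot_idx)
```

Both functions return the same vector of $n_Z$ scores. The saving comes from computing $q_j$ once per protein rather than once per pair, which matters when $n_Z$ is much larger than $n_P$.
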
  17. Plan

    LCIdb: a large new training database. Motivation and construction. Coverage of the protein and molecule spaces.
    Komet: a large-scale DTI prediction method. Embeddings of proteins and molecules. Interaction module. DTI classification.
    Results. Parameters set-up of the model. Impact of molecule and protein features. Comparison of ML algorithms.
    Case study: a scaffold hopping problem.
  18. Parameters set-up of the Komet model

    Large-sized dataset, more difficult prediction scenario (Orphan case). Number of molecules: $n_M = 143\text{k}$.
    Impact of the number of landmarks: same performance with $m_M = 3000$, saving computational time and resources.
    Impact of the dimension: same performance with $d_M = 1000$, saving time and RAM by a factor of 2.
  19. Impact of molecule and protein embeddings

    Comparison of fixed and learned embeddings ($\psi_P$, $\psi_M$): Komet on the LCIdb_Orphan dataset, with $d_M = 1000$ and $m_M = 3000$, evaluated by AUPR.
    Fixed embeddings perform better than DL embeddings for this specific problem (drug-like molecules and human druggable proteins).
    Stability of the Komet performance (except for Prot5XLUniref50).
  20. Comparison to Deep Learning algorithms

    ConPlex model [Singh, 2023]. MolTrans model [Huang, 2021].
    Almost the same pipeline structure, in 3 steps.
    Everything trained vs fixed features.
    Optimization algorithm: SGD is less precise, less stable and less quick than BFGS.
  21. Performance comparison (AUPR) with DL algorithms

    On literature medium-scale datasets. On large-scale datasets. On an external dataset.
    Evaluation: Train/Val/Test (70%, 10%, 20%).
  22. Case study: a scaffold hopping problem

    Komet recovers more out-of-scaffold ligands [Pinel, 2023]. LCIdb is a better training dataset. Komet outperforms in all criteria.
  23. Conclusion

    Initial problem: understanding the biological mechanisms associated with a set of 20 differentially active molecules found by a phenotypic survival screen.
    Chemogenomics: enlarge and consolidate the set of targeted proteins.
    Contributions: a large new molecule/protein interactions dataset; Komet, fast and state of the art (https://komet.readthedocs.io).
    Perspectives: analysis of the target proteins predicted for the 20 differentially active molecules.
  24. Acknowledgments

    Project supported by the Île-de-France Region as part of the “DIM AI4IDF”.
    Sylvie RODRIGUES-FERREIRA, Clara NAHMIAS, Véronique STOVEN, Chloé AZENCOTT, Olivier COLLIER.
    Thanks for your attention!
  25. Cumulative Histogram Curves (CHC)

    [Figure: cumulative proportion (0.0 to 1) versus rank of unknown active (100 to 400). Curves: Komet on LCIdb; Kernel SVM on Drugbank; ConPlex on BindingDB and contrastive on DUD-E; ConPlex on LCIdb and contrastive on DUD-E.]