Protein - Ligand Affinity Prediction_Strategizing Data Usage for Virtual Screening, Elix, CBI 2023

Elix
October 25, 2023

Transcript

  1. Protein-ligand affinity prediction: strategizing data usage for virtual screening

    Thomas Auzard (presenter) - Elix, Inc.
    David Jimenez Barrero, Nazim Medzhidov, PhD - Elix, Inc.
    Naoki Tarui, PhD - SEEDSUPPLY
    Chem-Bio Informatics Society (CBI) Annual Meeting 2023, Tokyo, Japan | October 25th, 2023
  2. © 2023 Elix Inc. Protein-ligand binding prediction: strategizing data usage

    Proposed solution for virtual hit screening under various data availability scenarios
  3. Proprietary protein-ligand binding dataset

    • Proprietary training dataset
      ◦ 689 proteins (GPCRs and SLC transporters)
      ◦ Binder or non-binder molecules (SMILES)
    • Testing dataset
      ◦ 446,559 molecules
      ◦ Activity for 4 proteins
        ▪ GPR87, MC4R, GLP2R, SLC40A1
        ▪ Respectively 26, 4, 7, and 9 positive hits
    [Figure legend: positive samples, negative samples]
  4. Model training and inference pipeline

    [1] Esposito C, Landrum GA, Schneider N, Stiefl N, Riniker S. GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning. J Chem Inf Model. 2021;61(6):2623-2640. doi:10.1021/acs.jcim.1c00160
  5. Does it work for any unseen target?

    The 4 test proteins are unseen (not in the training dataset).

    Protein                   | GPR87         | MC4R         | GLP2R         | SLC40A1
    Prediction / ground truth | 0 / 26        | 3 / 7        | 0 / 4         | 2 / 9
    Shortlist size            | 49521 (11.0%) | 37885 (8.4%) | 45496 (10.2%) | 43794 (9.8%)
    Confidence                | 40%           | 64%          | 40%           | 42%

    Here, 3 different architectures were combined. Confidence levels of 64% and 40% are equivalent to 20 and 12 models agreeing on the prediction, respectively.

    [Figure panels: GPR87, MC4R. Neighborhood of GPR87 and MC4R. Each point represents a protein. Neighbors of MC4R have a better balance of positive and negative samples than those of GPR87.]

    Coverage of the target protein in the training dataset impacts performance. How so?
    • Defining protein coverage
      ◦ Presence of similar proteins in the training data
      ◦ Quality of the coverage
        ▪ Positive / negative balance
        ▪ Else?
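The slide describes confidence as the share of ensemble members that agree on a prediction. A minimal sketch of that idea follows, assuming each model emits a binary vote (1 for binder, 0 for non-binder). The function names, the four-model ensemble, and the molecule identifiers are hypothetical and are not taken from the deck.

```python
from typing import Dict, List, Sequence


def agreement_confidence(votes: Sequence[int]) -> float:
    """Fraction of ensemble members voting 'binder' (1) for one molecule."""
    if not votes:
        raise ValueError("at least one model vote is required")
    return sum(votes) / len(votes)


def shortlist(votes_per_molecule: Dict[str, Sequence[int]],
              min_confidence: float) -> List[str]:
    """Keep the molecules whose agreement reaches the confidence threshold."""
    return [
        molecule
        for molecule, votes in votes_per_molecule.items()
        if agreement_confidence(votes) >= min_confidence
    ]


# Toy votes from a hypothetical 4-model ensemble (not the deck's ensemble).
toy_votes = {
    "mol_a": [1, 1, 0, 0],  # half of the models agree
    "mol_b": [1, 0, 0, 0],  # a quarter of the models agree
    "mol_c": [1, 1, 1, 1],  # all models agree
}
selected = shortlist(toy_votes, min_confidence=0.5)
```

Raising `min_confidence` shrinks the shortlist at the risk of dropping true hits, which is the trade-off the shortlist sizes in the table reflect.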
  6. Training dataset augmentation scenario

    • Adding samples (including positives) for the target to the training dataset
    • 3 training modalities
      1. Target unseen
      2. Target seen
      3. Fine-tuning on target
    Can we improve performance with additional data collection?
  7. Training dataset augmentation scenario results

    [Figure panels: Fine-tuning, Retraining]
    • Retrained and fine-tuned models systematically beat the baseline model
      ◦ GPR87: 5 / 15 hits in ~3,100 molecules
      ◦ MC4R: 2 / 3 hits in ~80,000 molecules
    • GPR87 showed the best improvement
      ◦ Worst baseline
    • 50% of the augmentation data belongs to GPR87
      ◦ Predominance of GPR87 samples could have biased the models towards GPR87
    Improving coverage can improve performance. However, the model seems to be biased towards this added coverage. How so?
  8. Understanding data bias with protein clustering

    • Cluster preparation
      ◦ Based on protein similarity: inverse of the Euclidean distance between ESM2-generated feature vectors
    • 9 clusters based on similarity, 2 clusters based on dissimilarity
    [Figure: protein similarity map, using multidimensional scaling (MDS). The bigger the dot, the higher the intra-similarity of the cluster. This is only a representation, not an actual depiction of the distances between proteins.]
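The similarity measure named on this slide (inverse of the Euclidean distance between feature vectors) can be sketched as below. This is an illustration under stated assumptions: the two-dimensional vectors stand in for real ESM2 embeddings, the protein names are hypothetical, and returning infinity for identical vectors is a choice made here because the deck does not say how zero distance is handled.

```python
import math
from typing import Dict, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equally sized feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Inverse of the Euclidean distance; identical vectors map to infinity (assumption)."""
    distance = euclidean_distance(a, b)
    return math.inf if distance == 0 else 1.0 / distance


def similarity_matrix(
    embeddings: Dict[str, Sequence[float]]
) -> Dict[str, Dict[str, float]]:
    """Pairwise similarity between all proteins, keyed by protein name."""
    return {
        name_a: {name_b: similarity(vec_a, vec_b)
                 for name_b, vec_b in embeddings.items()}
        for name_a, vec_a in embeddings.items()
    }


# Toy 2-D vectors standing in for ESM2 embeddings (hypothetical values).
toy_embeddings = {
    "prot_a": [0.0, 0.0],
    "prot_b": [3.0, 4.0],  # distance 5 from prot_a
    "prot_c": [0.0, 2.0],  # distance 2 from prot_a
}
sim = similarity_matrix(toy_embeddings)
```

In this toy case `prot_c` is more similar to `prot_a` than `prot_b` is, because it lies closer in feature space.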
  9. Protein clustering - impact of training coverage

    • Intra-cluster similarity
      ◦ Sensitivity: strong correlation
      ◦ Precision: moderate correlation
    • Cluster topology (among high intra-similarity clusters)
      ◦ No significant correlation with performance metrics
      ◦ Reasonable %positive (> 15%)
    [Figure annotations: the 4 lowest-similarity clusters are removed in the 3 graphs below; high-performing clusters; the topology of those clusters defines a good cluster]
    High intra-cluster similarity is correlated with good performance. So, adding more coverage should boost performance, right?
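For reference, the two metrics named on this slide have standard definitions: sensitivity is the share of true binders that the model recovers, and precision is the share of predicted binders that are true binders. The sketch below uses hypothetical counts chosen only to show the arithmetic. They are not results from the deck.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of actual binders that were predicted as binders (recall)."""
    total_binders = true_positives + false_negatives
    if total_binders == 0:
        raise ValueError("no actual binders: sensitivity is undefined")
    return true_positives / total_binders


def precision(true_positives: int, false_positives: int) -> float:
    """Share of predicted binders that are actual binders."""
    total_predicted = true_positives + false_positives
    if total_predicted == 0:
        raise ValueError("no predicted binders: precision is undefined")
    return true_positives / total_predicted


# Hypothetical confusion counts for one cluster (illustration only).
tp, fp, fn = 3, 97, 4
cluster_sensitivity = sensitivity(tp, fn)  # 3 of 7 binders recovered
cluster_precision = precision(tp, fp)      # 3 of 100 predictions correct
```

With few binders in a large screening library, precision tends to stay low even when sensitivity is acceptable, so the two are worth reading together.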
  10. Focus on SLC7A1 - is it always worth collecting more data?

    • SLC7A1 (part of training data)
      ◦ 23 positive samples
      ◦ Belongs to cluster 3
        ▪ High intra-similarity
        ▪ 18 proteins
        ▪ 37% of positive samples
      ◦ Coverage from clusters 4, 5, 7
    • 3 training modalities
      1. Without cluster 3
      2. With samples of SLC7A1
      3. With cluster of SLC7A1
  11. Improving coverage with similar proteins - results

    [Figure panels: Fine-tuned with SLC7A1 samples, Fine-tuned with clusters]
    • No improvements with enrichment
      ◦ Baseline is already “good”
      ◦ SLC7A1 is covered by clusters 4, 5 and 7
    Clusters 3, 4, 5 and 7 cover similar space. Compared to cluster 1 or 2, they receive coverage from other clusters.
    Improving coverage has its limits. What about the impact of distant data points?
  12. Tailored clusters from proprietary dataset

    • Training model on limited protein space
    • Preparation of custom clusters for protein of interest
      1. Protein of interest is the centroid of the cluster
      2. A good cluster needs to satisfy topology metrics
    • Low to no impact for GPR87 and MC4R
    • Lowered performance for SLC40A1
    [Figure axes: intra-cluster similarity, size, % of positive (11%, 36%, 38%)]
    Better to use the whole training dataset in both cases.
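One way to build a cluster centered on a protein of interest is to keep every training protein within a fixed distance of it in feature space. The sketch below assumes that approach. The deck's appendix mentions a threshold distance of 2.0 for the SLC40A1 control cluster, but the exact selection procedure is not described, so the function name, the toy vectors, and the protein names here are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def custom_cluster(embeddings: Dict[str, Sequence[float]],
                   target: str,
                   max_distance: float) -> List[str]:
    """Proteins within max_distance of the target, nearest first.

    The target acts as the cluster centroid and is always included
    (its distance to itself is 0).
    """
    center = embeddings[target]
    members = []
    for name, vector in embeddings.items():
        distance = math.dist(center, vector)  # Euclidean distance
        if distance <= max_distance:
            members.append((distance, name))
    return [name for _, name in sorted(members)]


# Toy 2-D vectors standing in for protein embeddings (hypothetical values).
toy_embeddings = {
    "target": [0.0, 0.0],
    "near": [1.0, 0.0],
    "edge": [0.0, 2.0],
    "far": [3.0, 0.0],
}
cluster = custom_cluster(toy_embeddings, "target", max_distance=2.0)
```

A cluster selected this way would still need to be checked against the topology metrics listed above (intra-cluster similarity, size, % of positive samples) before being used for training.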
  13. Conclusion

    • Protein coverage, good cluster: depends on the data
      → Similar analysis should be performed with available training ip data and library to screen
    • Performance conditioned by models
      ◦ Protein-ligand interaction not used!
        ▪ Pharmacophore models, …
    • Another use case leveraging our approach
      ◦ Hard to express human proteins → Use animal proteins for experiments
      ◦ Data for animal proteins could be used for enriching training data
      → More confidence in the binding affinity with ML
  14. Model architecture

    • GCN pretrained on publicly available GPCR data
    • Combines numerous featurization and aggregation processes for both ligand and protein
    [Diagram panels: Graph Convolutional NN, Augmented tiered GCN, Feature aggregation. Stages: ligand featurization, protein featurization, aggregation. Block labels: Molecular SMILES, Molecular Graph, Molecular Features, Graph Features, Protein Sequence, ESM-2 Network, Protein Features, Protein-Molecule Features, Aggregation, Linear Network, Prediction]
    Since the augmented tiered GCN performs slightly better and is lighter, it is the default model unless stated otherwise.
  15. Appendix

    Cluster    | Average intra-distance | Unique samples / total samples | Number of proteins | % of positive samples
    Cluster 1  | 1.89                   | 1703 / 1728                    | 33                 | 16.72%
    Cluster 2  | 2.26                   | 5882 / 6192                    | 117                | 33.43%
    Cluster 3  | 1.64                   | 859 / 864                      | 18                 | 37.38%
    Cluster 4  | 1.51                   | 432 / 432                      | 9                  | 15.74%
    Cluster 5  | 1.36                   | 240 / 240                      | 5                  | 14.16%
    Cluster 6  | 1.54                   | 432 / 432                      | 9                  | 29.86%
    Cluster 7  | 1.02                   | 232 / 240                      | 5                  | 46.67%
    Cluster 8  | 1.41                   | 236 / 240                      | 5                  | 18.75%
    Cluster 9  | 1.51                   | 320 / 336                      | 7                  | 37.80%
    Cluster 10 | 5.36                   | 262 / 288                      | 5                  | 35.42%
    Cluster 11 | 4.43                   | 5746 / 5953                    | 117                | 19.05%

    Detailed distributions of the clusters, obtained with manual inspection of DBSCAN clustering.
    [Figure: protein similarity map, using multidimensional scaling (MDS). The bigger the dot, the higher the intra-similarity of the cluster. This is only a representation, not an actual depiction of the distances between proteins.]
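The columns of this table can be computed from per-protein feature vectors and per-sample labels. The sketch below assumes that "average intra-distance" means the mean pairwise Euclidean distance between the proteins of a cluster, which the deck does not state explicitly. The helper names and toy values are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def average_intra_distance(vectors: List[Sequence[float]]) -> float:
    """Mean pairwise Euclidean distance between cluster members (assumed definition)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # a single-member cluster has no pairwise distance
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def percent_positive(labels: Sequence[int]) -> float:
    """Percentage of samples labeled as binder (1)."""
    if not labels:
        raise ValueError("at least one sample label is required")
    return 100.0 * sum(labels) / len(labels)


# Toy cluster: three protein vectors forming a 3-4-5 triangle,
# and four sample labels (hypothetical values).
toy_vectors = [[0.0, 0.0], [0.0, 3.0], [4.0, 0.0]]
toy_labels = [1, 0, 0, 1]

intra = average_intra_distance(toy_vectors)  # (3 + 4 + 5) / 3
positive_share = percent_positive(toy_labels)
```

A lower average intra-distance corresponds to a tighter cluster, which matches the deck's use of intra-similarity as the inverse notion.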
  16. Appendix: custom clusters

    Cluster | Average intra-distance | Unique samples / total samples | Number of proteins | % of positive samples
    GPR87   | 1.22                   | 717 / 720                      | 13                 | 11.11%
    MC4R    | 1.25                   | 240 / 240                      | 5                  | 35.58%
    SLC40A1 | 1.82                   | 192 / 192                      | 4                  | 38.02%

    Detailed distributions of the custom clusters. SLC40A1 is a control cluster: it has only 4 proteins and the threshold distance for selection was 2.0.