Slide 1

Slide 1 text

Protein-ligand affinity prediction: strategizing data usage for virtual screening
Thomas Auzard (presenter), David Jimenez Barrero, Nazim Medzhidov, PhD - Elix, Inc.
Naoki Tarui, PhD - SEEDSUPPLY
Chem-Bio Informatics Society (CBI) Annual Meeting 2023, Tokyo, Japan | October 25th, 2023

Slide 2

Slide 2 text

© 2023 Elix Inc. | Protein-ligand binding prediction: strategizing data usage
Proposed solution for virtual hit screening under various data availability scenarios

Slide 3

Slide 3 text

Proprietary protein-ligand binding dataset (positive and negative samples)
● Proprietary training dataset
  ○ 689 proteins (GPCRs and SLC transporters)
  ○ Binder or non-binder molecules (SMILES)
● Testing dataset
  ○ 446,559 molecules
  ○ Activity for 4 proteins
    ■ GPR87, MC4R, GLP2R, SLC40A1
    ■ Respectively 26, 4, 7, and 9 positive hits
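To get a sense of how imbalanced this screening problem is, the hit rates implied by the numbers on this slide can be computed directly (the protein names and counts come from the slide; the script itself is only illustrative):

```python
# Hit counts for the 4 unseen test proteins (from this slide).
test_hits = {"GPR87": 26, "MC4R": 4, "GLP2R": 7, "SLC40A1": 9}
n_molecules = 446_559  # molecules screened per protein

# Per-protein hit rate: even the richest target (GPR87) is below 0.006% positive.
hit_rates = {p: hits / n_molecules for p, hits in test_hits.items()}
for protein, rate in hit_rates.items():
    print(f"{protein}: {rate:.6%}")
```

These vanishing hit rates are why the decision-threshold handling discussed on the next slide matters so much.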

Slide 4

Slide 4 text

Model training and inference pipeline
[1] Esposito C, Landrum GA, Schneider N, Stiefl N, Riniker S. GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning. J Chem Inf Model. 2021;61(6):2623-2640. doi:10.1021/acs.jcim.1c00160
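The cited GHOST method [1] shifts the classification threshold to cope with the extreme class imbalance by picking the probability cutoff that maximizes Cohen's kappa on training predictions. A minimal numpy sketch of that idea (not the reference implementation, which also uses subsampling):

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels: agreement corrected for chance."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    po = np.mean(y_true == y_pred)                    # observed agreement
    p_pos = np.mean(y_true) * np.mean(y_pred)         # chance agreement on 1s
    p_neg = np.mean(1 - y_true) * np.mean(1 - y_pred) # chance agreement on 0s
    pe = p_pos + p_neg
    return (po - pe) / (1 - pe) if pe < 1 else 0.0

def ghost_threshold(y_true, probs, grid=np.arange(0.05, 0.95, 0.05)):
    """Scan candidate cutoffs and keep the one maximizing kappa."""
    kappas = [cohen_kappa(y_true, probs >= t) for t in grid]
    return grid[int(np.argmax(kappas))]
```

On an imbalanced set, the selected cutoff typically lands well below the naive 0.5, which is exactly the adjustment GHOST formalizes.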

Slide 5

Slide 5 text

Does it work for any unseen target?

[Figure: neighborhood of GPR87 and MC4R. Each point represents a protein. Neighbors of MC4R have a better balance of positive and negative samples than GPR87.]

Protein                     GPR87           MC4R           GLP2R           SLC40A1
Prediction / ground truth   0 / 26          3 / 7          0 / 4           2 / 9
Shortlist size              49521 (11.0%)   37885 (8.4%)   45496 (10.2%)   43794 (9.8%)
Confidence                  40%             64%            40%             42%

Here, 3 different architectures were combined. The 64% and 40% confidence levels are respectively equivalent to 20 and 12 models agreeing on the prediction. The 4 test proteins are unseen (not in the training dataset).

Coverage of the target protein in the training dataset impacts performance. How so?
● Defining protein coverage
  ○ Presence of similar proteins in the training data
  ○ Quality of the coverage
    ■ Positive / negative balance
    ■ Else?
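The confidence level on this slide is simply the fraction of ensemble members that agree a molecule is a hit. A minimal sketch of how a shortlist could be derived from such votes; the model count (31) and the random votes are illustrative, since the deck does not state the exact ensemble size:

```python
import numpy as np

def shortlist(votes, min_agreeing):
    """votes: (n_models, n_molecules) boolean hit predictions.
    Keep molecules that at least `min_agreeing` models flag as hits."""
    agree = votes.sum(axis=0)              # models voting "hit" per molecule
    return np.flatnonzero(agree >= min_agreeing)

# Toy ensemble: 31 models (illustrative count), 5 molecules.
rng = np.random.default_rng(0)
votes = rng.random((31, 5)) < 0.5

# A 64% confidence cutoff with 31 models means ceil(0.64 * 31) = 20 agreeing models.
n_models = votes.shape[0]
cutoff = int(np.ceil(0.64 * n_models))
hits = shortlist(votes, cutoff)
```

Raising the agreement cutoff shrinks the shortlist at the cost of missing weakly supported hits, which is the trade-off visible in the table above.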

Slide 6

Slide 6 text

Training dataset augmentation scenario
● Adding samples (including positives) for the target to the training dataset
● 3 training modalities
  1. Target unseen
  2. Target seen
  3. Fine-tuning on target
Can we improve performance with additional data collection?

Slide 7

Slide 7 text

Training dataset augmentation scenario - results (fine-tuning and retraining)
● Retrained and fine-tuned models systematically beat the baseline model
  ○ GPR87: 5 / 15 hits in ~3,100 molecules
  ○ MC4R: 2 / 3 hits in ~80,000 molecules
● GPR87 showed the best improvement
  ○ Worst baseline
● 50% of the augmentation data belongs to GPR87
  ○ The predominance of GPR87 samples could have biased the models towards GPR87
Improving coverage can improve performance. However, the model seems to be biased toward this added coverage. How so?

Slide 8

Slide 8 text

Understanding data bias with protein clustering
● Cluster preparation
  ○ Based on protein similarity: the inverse of the euclidean distance between ESM2-generated feature vectors
● 9 clusters based on similarity, 2 clusters based on dissimilarity

[Figure: protein similarity map, using multidimensional scaling (MDS). The bigger the dot, the higher the intra-similarity of the cluster. This is only a representation, not an actual depiction of the distances between proteins.]
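The appendix notes that the clusters come from manual inspection of DBSCAN clustering. A minimal sketch of that step, assuming protein feature vectors are already available (the embeddings and the eps value here are synthetic stand-ins, not the actual ESM-2 vectors):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

# Synthetic stand-ins for ESM2-generated protein feature vectors:
# two tight groups of 10 proteins each, far apart in feature space.
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(0.0, 0.1, (10, 8)),
                 rng.normal(5.0, 0.1, (10, 8))])

# Pairwise euclidean distances (similarity in the deck is their inverse;
# DBSCAN can consume the distance matrix directly).
dist = squareform(pdist(emb))
labels = DBSCAN(eps=1.0, min_samples=3, metric="precomputed").fit(dist).labels_
```

In practice the eps/min_samples choices and the final split into similarity vs. dissimilarity clusters would come from the manual inspection mentioned in the appendix.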

Slide 9

Slide 9 text

Protein clustering - impact of training coverage
● Intra-cluster similarity
  ○ Sensitivity: strong correlation
  ○ Precision: moderate correlation
(The 4 lowest-similarity clusters are removed in 3 of the graphs below.)
High intra-cluster similarity is correlated with good performance. So, adding more coverage should boost performance, right?
● Cluster topology (among high intra-similarity clusters)
  ○ No significant correlation with performance metrics
  ○ Reasonable %positive (> 15%)
The topology of the high-performing clusters defines what makes a good cluster.
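The similarity-performance relationship claimed on this slide amounts to a plain Pearson correlation across clusters. A sketch with made-up per-cluster values (the real cluster statistics are in the appendix tables; these numbers are hypothetical):

```python
import numpy as np

# Hypothetical per-cluster values: intra-cluster similarity vs. sensitivity.
similarity = np.array([0.90, 0.80, 0.70, 0.55, 0.40])
sensitivity = np.array([0.85, 0.78, 0.70, 0.52, 0.35])

# Pearson correlation: a value near 1 would support the slide's claim.
r = np.corrcoef(similarity, sensitivity)[0, 1]
print(f"Pearson r = {r:.3f}")
```

The same computation against precision would yield the "moderate" correlation the slide mentions.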

Slide 10

Slide 10 text

Focus on SLC7A1 - is it always worth collecting more data?
● SLC7A1 (part of the training data)
  ○ 23 positive samples
  ○ Belongs to cluster 3
    ■ High intra-similarity
    ■ 18 proteins
    ■ 37% of positive samples
  ○ Coverage from clusters 4, 5, 7
● 3 training modalities
  1. Without cluster 3
  2. With samples of SLC7A1
  3. With cluster of SLC7A1

Slide 11

Slide 11 text

Improving coverage with similar proteins - results (fine-tuned with SLC7A1 samples vs. fine-tuned with clusters)
● No improvements with enrichment
  ○ The baseline is already "good"
  ○ SLC7A1 is covered by clusters 4, 5 and 7
Clusters 3, 4, 5 and 7 cover a similar space. Compared to clusters 1 or 2, they receive coverage from other clusters.
Improving coverage has its limits. What about the impact of distant data points?

Slide 12

Slide 12 text

Tailored clusters from the proprietary dataset
● Training the model on a limited protein space
● Preparation of custom clusters for the protein of interest
  1. The protein of interest is the centroid of the cluster
  2. A good cluster needs to satisfy topology metrics
● Low to no impact for GPR87 and MC4R
● Lowered performance for SLC40A1
It is better to use the whole training dataset in both cases.

[Figure: custom-cluster metrics - intra-cluster similarity, size, % of positive (11%, 36%, 38%)]
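The two-step recipe above (protein of interest as centroid, then a topology check such as %positive) could be sketched as follows; the embeddings, distance threshold, and the 15% positive-fraction criterion from slide 9 are used here illustratively:

```python
import numpy as np

def tailored_cluster(center_idx, embeddings, max_dist, positives, min_pos_frac=0.15):
    """Select proteins within `max_dist` of the protein of interest (the
    centroid), then check the %positive topology criterion."""
    d = np.linalg.norm(embeddings - embeddings[center_idx], axis=1)
    members = np.flatnonzero(d <= max_dist)
    pos_frac = positives[members].mean()
    return members, pos_frac, pos_frac >= min_pos_frac

# Toy example: 4 proteins in a 2-D feature space, protein 0 is the target.
emb = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 0.0], [0.2, 0.1]])
pos = np.array([0.10, 0.40, 0.90, 0.20])  # per-protein fraction of positives
members, pos_frac, ok = tailored_cluster(0, emb, max_dist=1.0, positives=pos)
```

Here protein 2 is excluded as too distant, and the remaining cluster passes the %positive check.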

Slide 13

Slide 13 text

Conclusion
● Protein coverage and what makes a good cluster depend on the data
  → A similar analysis should be performed with the available training IP data and the library to screen
● Performance is conditioned by the models
  ○ Protein-ligand interactions are not used!
    ■ Pharmacophore models, …
● Another use case leveraging our approach
  ○ Some human proteins are hard to express → use animal proteins for experiments
  ○ Data for animal proteins could be used to enrich the training data → more confidence in the binding affinity obtained with ML

Slide 14

Slide 14 text

Thank you for your attention

Slide 15

Slide 15 text

Elix Inc. https://www.elix-inc.com (株式会社Elix http://ja.elix-inc.com/) & SEEDSUPPLY https://www.seedsupply.co.jp

Slide 16

Slide 16 text

Appendix

Slide 17

Slide 17 text


Slide 18

Slide 18 text

Model architectures
● Two architectures were compared: a feature-aggregation model and an augmented tiered GCN
  ○ Ligand featurization: molecular SMILES → molecular graph (Graph Convolutional NN) and molecular features
  ○ Protein featurization: protein sequence → ESM-2 protein features
  ○ Aggregation: graph, molecular, protein, and protein-molecule features are aggregated and fed to a linear network for the prediction
● GCN pretrained on publicly available GPCR data
● Combines numerous featurization and aggregation processes for both ligand and protein
Since the augmented tiered GCN performs slightly better and is lighter, it is the default model unless stated otherwise.
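The aggregation stage in the diagram (ligand and protein feature vectors combined and passed to a linear head) can be sketched shape-wise in numpy; the feature dimensions, random weights, and concatenation as the aggregation choice are illustrative, not the deck's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
graph_feat = rng.normal(size=128)  # GCN output for the molecular graph (dim illustrative)
mol_feat = rng.normal(size=64)     # molecular descriptors (dim illustrative)
prot_feat = rng.normal(size=320)   # pooled ESM-2 protein embedding (dim illustrative)

# Aggregation sketched as a simple concatenation of ligand and protein features.
x = np.concatenate([graph_feat, mol_feat, prot_feat])  # shape (512,)

# Linear network + sigmoid -> binding probability.
W, b = rng.normal(size=(1, x.size)) * 0.01, 0.0
prob = 1.0 / (1.0 + np.exp(-(W @ x + b)))
```

In the real pipeline the aggregation and the linear network are trained end to end; this sketch only shows how the featurized pieces fit together dimensionally.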

Slide 19

Slide 19 text

Appendix - detailed distributions of the clusters, obtained by manual inspection of DBSCAN clustering

Cluster      Avg intra-distance   Unique / total samples   # proteins   % positive
Cluster 1    1.89                 1703 / 1728              33           16.72%
Cluster 2    2.26                 5882 / 6192              117          33.43%
Cluster 3    1.64                 859 / 864                18           37.38%
Cluster 4    1.51                 432 / 432                9            15.74%
Cluster 5    1.36                 240 / 240                5            14.16%
Cluster 6    1.54                 432 / 432                9            29.86%
Cluster 7    1.02                 232 / 240                5            46.67%
Cluster 8    1.41                 236 / 240                5            18.75%
Cluster 9    1.51                 320 / 336                7            37.80%
Cluster 10   5.36                 262 / 288                5            35.42%
Cluster 11   4.43                 5746 / 5953              117          19.05%

[Figure: protein similarity map, using multidimensional scaling (MDS). The bigger the dot, the higher the intra-similarity of the cluster. This is only a representation, not an actual depiction of the distances between proteins.]

Slide 20

Slide 20 text

Detailed distributions of the custom clusters. SLC40A1 is a control cluster: it has only 4 proteins and the threshold distance for selection was 2.0.

Custom cluster   Avg intra-distance   Unique / total samples   # proteins   % positive
GPR87            1.22                 717 / 720                13           11.11%
MC4R             1.25                 240 / 240                5            35.58%
SLC40A1          1.82                 192 / 192                4            38.02%

Slide 21

Slide 21 text

株式会社Elix http://ja.elix-inc.com/