
Jobim Talk 2024/06/26

Guichaoua

June 28, 2024

Transcript

  1. Drug-Target Interactions Prediction at Scale The Komet Algorithm With the

    LCIdb Dataset (Jobim, 26/06/2024). Gwenn Guichaoua 1,2,3, Philippe Pinel 1,2,3,4, Brice Hoffmann 4, Chloé Azencott 1,2,3, Véronique Stoven 1,2,3. 1 Institut Curie, PSL Research University, 75428 Paris, France; 2 Center for Computational Biology, Mines Paris, PSL Research University, 75006 Paris, France; 3 INSERM U900, 75005 Paris, France; 4 Iktos SAS, 75017 Paris, France
  2.–5. Context [Pinel et al, BioRxiv 2024]

     Drug-Target interaction (DTI): a small molecule (drug) that interacts with a protein (target) and modulates its function to prevent disease progression. Drug discovery process: Target Identification, Hit Discovery, Hit to Lead, Lead Optimisation, Drug Development. DTI predictions at large scale in the chemical and protein spaces. Applications: de-orphanising a phenotypic drug; anticipating unexpected off-targets by predicting drug interaction profiles; finding unwanted side effects; offering drug repositioning opportunities; de-orphanising a new therapeutic target.
  6.–11. DTIs prediction at large scale

     Supervised learning: input database of interactions between molecules and proteins, binary classification problem (labels 1 / -1); output: predicted interactions. Goals: predict a protein's molecular interaction profile (find off-targets) and predict a drug's protein interaction profile (de-orphan). Challenges: the most complete training base (large and diverse, confident positive and negative interactions, all prediction scenarios) and the most efficient algorithm (reasonable computing resources, predictions in a timely manner).
  12. Plan

     LCIdb: a large new training database (motivation and construction; coverage of the protein and molecule spaces). Komet: a large-scale DTI prediction method (features of proteins and molecules; mixing for pair's features; DTI classification). Results (parameters set-up of the model; impact of the mixing for pair's features; comparison with ML algorithms).
  13. Why a new training database? Binary interaction databases: DrugBank v1.5.1 [Wishart et al, 2018], BIOSNAP [Zitnik et al, 2018]

     2,513 proteins, 4,813 molecules, 13,716 (molecule, protein) interactions. + well curated, + FDA-approved drugs; - only interactions, - medium-sized.
  14. Back to bioactivity databases: BindingDB [Tiqing et al, 2007], PubChem [Kim et al, 2019], ChEMBL [Mendez et al, 2019]

     Kd, Ki, IC50 measurements. + large-sized, + experimental measures, including thermodynamic values; - may have different bioactivity measures for a DTI; - all molecules and proteins.
  15.–20. Building of a large consensus interactions dataset

     Sources: PubChem, BindingDB, Probes&Drugs, IUPHAR/BPS, ChEMBL; consensus dataset [Isigkeit et al, 2022].
     Chemical structure quality filter: molecular weights between 100 and 900 g/mol; structure check (same SMILES representation in all sources); keep molecules that target at least 1 human protein.
     Bioactivity filter: keep DTIs whose negative logarithm of Kd, Ki or IC50 is known; DTIs with multiple annotated bioactivities are kept only if they agree within 1 log unit.
     Binary labelling of DTIs (illustrated in the sketch below): measure = first Kd, then Ki, then IC50; measure < 100 nM (10⁻⁷ M): positive DTI; measure > 100 µM (10⁻⁴ M): negative DTI.
     Resulting LCIdb: 2,069 proteins, 274,515 molecules, 402,538 positive DTIs, 8,296 negative DTIs.
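A minimal Python sketch of the labelling rule above. The function name and the per-pair dictionary of affinity measures (in mol/L) are illustrative assumptions, not the actual LCIdb construction code.

    def label_dti(measures):
        # `measures` maps available affinity types to values in mol/L,
        # e.g. {"Kd": 3e-8, "IC50": 2e-7}. Kd is preferred, then Ki, then IC50,
        # as on the slide; thresholds are 1e-7 M (positive) and 1e-4 M (negative).
        for kind in ("Kd", "Ki", "IC50"):      # priority order from the slide
            if kind in measures:
                value = measures[kind]
                if value < 1e-7:               # < 100 nM: positive DTI
                    return +1
                if value > 1e-4:               # > 100 uM: negative DTI
                    return -1
                return None                    # intermediate affinity: left unlabelled
        return None                            # no usable measure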
  21. Analysis of our dataset LCIdb Representation of the molecular space

    with the t-SNE algorithm on Tanimoto molecule features Comparisons with literature medium-sized datasets Coverage of the molecular space 7 LCIdb and BindingDB LCIdb and DrugBank LCIdb and BIOSNAP
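The molecular-space map on this slide is a t-SNE embedding of Tanimoto molecule features. The sketch below, assuming RDKit and scikit-learn are available, shows one way such a map could be computed from Morgan fingerprints; it is an illustrative reconstruction, not the authors' plotting code.

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    from sklearn.manifold import TSNE

    def tsne_chemical_space(smiles_list, perplexity=30.0, random_state=0):
        # ECFP4-like Morgan fingerprints for every molecule
        mols = [Chem.MolFromSmiles(s) for s in smiles_list]
        fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=1024) for m in mols]
        # Pairwise Tanimoto distances (1 - similarity)
        n = len(fps)
        dist = np.zeros((n, n))
        for i in range(n):
            dist[i] = 1.0 - np.array(DataStructs.BulkTanimotoSimilarity(fps[i], fps))
        # 2-D t-SNE map on the precomputed distance matrix
        return TSNE(n_components=2, metric="precomputed", init="random",
                    perplexity=perplexity, random_state=random_state).fit_transform(dist)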
  22. Coverage of the protein space of our dataset LCIdb

     t-SNE algorithm on protein features derived from the LA kernel; panels: LCIdb and BindingDB, LCIdb and DrugBank, LCIdb and BIOSNAP. Protein families highlighted: Protein kinase, G-protein coupled receptor 1, Cytochrome P450, Tubulin, Ligand-gated ion channel, PI3/PI4-kinase, SDRs, Major facilitator, Sodium channel, ARTD/PARP, Aldo/keto reductase, Cyclic nucleotide phosphodiesterase, Transient receptor, Calcium channel alpha-1 subunit, Calycin, Integrin alpha chain, G-protein coupled receptor 2, G-protein coupled receptor 3, Adenylyl/guanylyl cyclase, Nuclear hormone receptor, Cyclins, Bcl-2, Alpha-carbonic anhydrase, Phospholipase A2, Histone deacetylase, Small GTPase, ABC transporter.
  23. Komet: a Large-scale DTI prediction method

     Features of proteins and molecules; mixing for pair's features; DTI classification. (Plan: LCIdb, a large new training database: motivation and construction, coverage of the protein and molecule spaces. Results: parameters set-up of the model, impact of the mixing for pair's features, comparison of ML algorithms.)
  24.–28. General DTI prediction pipeline

     Step 1, features for protein and molecule: SMILES and the Tanimoto kernel give molecule features m_ik (via ψ_M, Nyström approximation and dimension reduction); the protein sequence and the Local Alignment kernel give protein features p_jk (via ψ_P).
     Step 2, features for (molecule, protein) pairs: combined features z_k = m_ik p_jk^⊤ (Kronecker kernel).
     Step 3, DTI prediction model: supervised model (SVM) with prediction φ, loss ℓ and parameter w, trained on labels y_k.
     Training dataset built from LCIdb, pairs (i_k, j_k): # of proteins: 2,060; # of molecules: 271,180; # of positive DTI: 396,798; # of negative DTI: 7,965 (+ 388,833 balanced).
  29.–34. Step 1: Features for protein and molecule

     Features compatible with large scale, computed on simple representations of the protein and the molecule.
     Fixed features: encode various characteristics (Morgan fingerprints ECFP4 [Rogers&al,2010]) or are derived from kernel theory (Tanimoto kernel / Local Alignment kernel [Saigo&al,2004]).
     Learned features: computed by networks pre-trained on another task (ProtBert [Elnaggar&al,2021]) or learned by neural networks inside the DTI prediction pipeline (DeepPurpose [Huang&al,2020]).
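To make the fixed-feature option concrete, the sketch below (assuming RDKit is installed) computes ECFP4 Morgan fingerprints and the Tanimoto kernel value between two molecules; the SMILES strings and the 1024-bit size are illustrative choices, and the protein-side Local Alignment kernel is not shown.

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def ecfp4(smiles, n_bits=1024):
        # ECFP4 corresponds to a Morgan fingerprint of radius 2
        mol = Chem.MolFromSmiles(smiles)
        return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)

    def tanimoto_kernel(smiles_a, smiles_b):
        # Tanimoto kernel k_M(m, m') between two molecules
        return DataStructs.TanimotoSimilarity(ecfp4(smiles_a), ecfp4(smiles_b))

    # Example: aspirin vs paracetamol
    print(tanimoto_kernel("CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Nc1ccc(O)cc1"))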
  35.–40. Features for protein and molecule for Komet

     Fixed features derived from kernel theory: SMILES and the Tanimoto kernel give molecule features m_ik (via ψ_M); the protein sequence and the Local Alignment kernel give protein features p_jk (via ψ_P).
     LCIdb training dataset, pairs (𝔪_ik, 𝔭_jk): # of proteins: 2,069; # of molecules: 274,515; # of positive DTI: 402,538; # of negative DTI: 8,296 (+ 394,242 balanced).
     Protein features: n_P = 2,069, and p_jk has dimension d_P = 2,069.
     Molecule features: approximated features derived from the molecule kernel, fast to compute. The method depends on 2 parameters: m_M, the number of landmark molecules drawn from the training set (quality and precision of the approximation), and d_M ≤ m_M, the dimension of the encoding (expressivity).
  41. Approximated features derived from the molecule kernel

     Molecule kernel: k_M(𝔪, 𝔪′) = ⟨ψ_M(𝔪), ψ_M(𝔪′)⟩. Empirical molecule kernel K_M = X_M X_M^⊤ with n_M = 274k molecules: computing, storage and factorisation impossible.
     Nyström approximation: draw m_M random landmark molecules; with K̂_M the m_M × m_M kernel between landmarks and Z the n_M × m_M kernel between all molecules and the landmarks, K_M ≈ Z K̂_M⁻¹ Z^⊤.
     Singular value decomposition (SVD): K̂_M = U diag(λ) U^⊤, so K_M ≈ X_M X_M^⊤ where X_M = Z U diag(1/√λ).
     Dimension reduction: X_M ≈ X̃_M, where X̃_M ∈ ℝ^{n_M × d_M} keeps only d_M columns.
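The slide's construction can be written in a few lines of PyTorch. The sketch below is an illustrative reimplementation under stated assumptions (the landmark and cross kernels are passed in as precomputed matrices; the toy kernel at the end is a stand-in for the Tanimoto kernel), not the published Komet code.

    import torch

    def nystrom_features(K_hat, Z, d_M):
        # K_hat: (m_M, m_M) kernel between randomly chosen landmark molecules
        # Z:     (n_M, m_M) kernel between all molecules and the landmarks
        # Returns X_tilde of shape (n_M, d_M) with K_M ~ X_tilde @ X_tilde.T
        lam, U = torch.linalg.eigh(K_hat)            # K_hat = U diag(lam) U^T
        lam, U = lam.flip(0), U.flip(1)              # eigenvalues in decreasing order
        lam = lam.clamp(min=1e-8)                    # numerical safety for 1/sqrt(lam)
        X = Z @ U @ torch.diag(1.0 / lam.sqrt())     # X_M = Z U diag(1/sqrt(lam))
        return X[:, :d_M]                            # dimension reduction: keep d_M columns

    # Toy usage (landmarks = the first m_M molecules of a random feature set)
    n_M, m_M, d_M = 1000, 100, 20
    feats = torch.rand(n_M, 50)
    Z = feats @ feats[:m_M].T                        # stand-in kernel, shape (n_M, m_M)
    X_tilde = nystrom_features(Z[:m_M], Z, d_M)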
  42.–47. Step 2: Features for (molecule, protein) pairs

     Mixing of the molecule features ψ_M(𝔪) = m and the protein features ψ_P(𝔭) = p into a pair feature z. Options: linear mixing by concatenation of the features; non-linear mixing with learned features, using a neural network; or fixed features using a tensor product, z = m p^⊤, of dimension d_Z = d_M × d_P.
     The tensor product is a key element of Komet: it captures information about the interaction well and has favourable mathematical properties.
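The usefulness of the tensor product rests on the identity ⟨w, m p^⊤⟩ = ⟨W p, m⟩, where the parameter w is reshaped into a matrix W; the next slides rely on it. A minimal numerical check in PyTorch (toy dimensions, illustrative only):

    import torch

    dM, dP = 4, 3
    m = torch.randn(dM)          # molecule feature
    p = torch.randn(dP)          # protein feature
    W = torch.randn(dM, dP)      # parameter w seen as a dM x dP matrix

    z = torch.outer(m, p)        # pair feature z = m p^T, of size dM x dP
    lhs = (W * z).sum()          # <w, z> with everything flattened
    rhs = m @ (W @ p)            # <W p, m>: never materialises z
    assert torch.allclose(lhs, rhs, atol=1e-5)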
  48. Step 3 : DTI prediction model Tree-based methods [Shi et

    al, 2019] Neural network with Logistic loss [Huang et al, 2021],[Singh et al, 2023] ( 𝔪 ik , 𝔭 jk ) Training dataset 15 SVM with Hinge loss : [Jacob et al, 2008],[Playe et al, 2018] ℓ(y′  , y) = max(0,1 − yy′  ) Linear model Network-based inference approaches [Cheng et al, 2012]
  49.–54. Step 3: Komet model

     Optimisation problem (SVM): min over w ∈ ℝ^(d_P × d_M) of Σ_{k=1..n_Z} ℓ(⟨w, z_k⟩, y_k) + (λ/2) ‖w‖², with Z ∈ ℝ^{n_Z × d_Z}, n_Z = 460k and d_Z = d_M × d_P. The problem is too big for both storage and computation of Z and Zw.
     Efficient computation [Airola&Pahikkala,2017]: (Zw)_k = ⟨w, z_k⟩ = ⟨w, m_ik p_jk^⊤⟩ = ⟨W p_jk, m_ik⟩, so after precomputing q_j = W p_j for j = 1, …, n_P, Zw can be computed in only n_P × d_Z + n_Z × d_M operations (versus n_Z × d_Z for the explicit product).
     GPU code in PyTorch: https://komet.readthedocs.io. A full-batch BFGS method solves the optimisation problem (see the sketch below).
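A compact PyTorch sketch of this step: the implicit computation of Zw through q_j = W p_j, a hinge-loss SVM objective with ridge penalty, and a full-batch L-BFGS solver. Sizes, names and the random data are illustrative; this is not the published Komet implementation.

    import torch

    # Toy sizes (far smaller than LCIdb)
    n_P, n_M, d_P, d_M, n_Z = 50, 200, 16, 32, 1000
    P = torch.randn(n_P, d_P)                        # protein features p_j
    M = torch.randn(n_M, d_M)                        # molecule features m_i
    j_idx = torch.randint(0, n_P, (n_Z,))            # protein index of each pair
    i_idx = torch.randint(0, n_M, (n_Z,))            # molecule index of each pair
    y = torch.randint(0, 2, (n_Z,)).float() * 2 - 1  # labels in {-1, +1}

    W = torch.zeros(d_M, d_P, requires_grad=True)    # parameter w as a d_M x d_P matrix
    lam = 1e-3
    opt = torch.optim.LBFGS([W], max_iter=50)        # full-batch quasi-Newton solver

    def closure():
        opt.zero_grad()
        Q = P @ W.T                                  # q_j = W p_j, cost n_P x d_Z
        scores = (M[i_idx] * Q[j_idx]).sum(dim=1)    # (Zw)_k = <W p_jk, m_ik>, cost n_Z x d_M
        hinge = torch.clamp(1 - y * scores, min=0).sum()   # l(y', y) = max(0, 1 - y y')
        loss = hinge + 0.5 * lam * (W ** 2).sum()
        loss.backward()
        return loss

    opt.step(closure)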
  55. Plan

     LCIdb: a large new training database (motivation and construction; coverage of the protein and molecule spaces). Komet: a large-scale DTI prediction method (features of proteins and molecules; mixing for pair's features; DTI classification). Results (parameters set-up of the model; impact of the mixing for pair's features; comparison of ML algorithms).
  56.–61. Parameters set-up of the Komet model (LCIdb_Orphan dataset, d_M ≤ m_M)

     Impact of the number of molecule landmarks m_M: same performance while saving computational resources; m_M = 3000 is retained.
     Impact of the reduction dimension d_M: same performance while saving a factor of 2 in training time and GPU memory; d_M = 1000 is retained.
     [Plots: AUPR, peak GPU memory (Gb) and train time (s) as a function of d_M, for d_M up to 3000.]
  62.–63. Impact of the mixing for pair's features

     Comparison of concatenated and Komet (tensor-product) features on large-sized datasets [bar charts: AUPR and training time]. The tensor product is a key element for the expressivity of the pair features.
  64. Comparison with Deep Learning algorithms [Singh, 2023], [Huang, 2021]

     AUPR and training time on large-scale datasets; AUPR on an external dataset. DrugBank_Ext: DrugBank without the DTIs present in LCIdb and BindingDB.
  65. Why does it work?

     The tensor product is a key element for capturing information about the interaction; the optimisation scales to the full problem; and the training dataset is large and diverse [t-SNE panels of the molecule and protein spaces: LCIdb and BindingDB, LCIdb and Drugbank_Ext, BIOSNAP].
  66. Conclusion 22 Initial problem DTI prediction at large scale in

    protein and molecule spaces Contributions: A large new molecule/protein interactions dataset https://zenodo.org/records/10731712 Komet: Fast & State of the Art https://komet.readthedocs.io
  67. Conclusion 22 Initial problem DTI prediction at large scale in

    protein and molecule spaces Contributions: A large new molecule/protein interactions dataset https://zenodo.org/records/10731712 Komet: Fast & State of the Art https://komet.readthedocs.io Perspectives: Analysis of the target proteins predicted for the molecules found in phenotypic drug screening
  68. Acknowledgments Project supported by the Île-de-France Region as part of

    the “DIM AI4IDF” Clara NAHMIAS Sylvie RODRIGUES-FERREIRA Philippe PINEL Véronique STOVEN Chloé AZENCOTT Brice Hoffmann Thanks for your attention!
  69.–80. Approximated features derived from the molecule kernel

     Molecule kernel: k_M(𝔪, 𝔪′) = ⟨ψ_M(𝔪), ψ_M(𝔪′)⟩. Empirical molecule kernel K_M = X_M X_M^⊤ with n_M = 274k: computing, storage and factorisation impossible.
     Nyström approximation: draw m_M random landmark molecules; with K̂_M the m_M × m_M kernel between landmarks and Z the n_M × m_M kernel between all molecules and the landmarks, K_M ≈ Z K̂_M⁻¹ Z^⊤.
     Singular value decomposition (SVD): K̂_M = U diag(λ) U^⊤, so K_M ≈ X_M X_M^⊤ where X_M = Z U diag(1/√λ).
     Dimension reduction: X_M ≈ X̃_M, where X̃_M ∈ ℝ^{n_M × d_M} keeps only d_M columns.
  81. From kernel back to features

     Protein kernel: Local Alignment kernel [Saigo&al,2004], k_P(𝔭, 𝔭′) = ⟨ψ_P(𝔭), ψ_P(𝔭′)⟩. Empirical protein kernel K_P ∈ ℝ^{n_P × n_P}, with n_P = 2069.
     Factorisation of the empirical protein kernel by singular value decomposition (SVD): K_P = U diag(λ) U^⊤, giving empirical features X_P = U diag(√λ) ∈ ℝ^{n_P × d_P} with d_P = n_P, so that K_P = X_P X_P^⊤.
     Approximation for d_P << n_P: X̃_P = U[:, :d_P] diag(√λ[:d_P]).
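The same construction in code: the sketch below (PyTorch, illustrative; it assumes the Local Alignment kernel matrix K_P has already been computed and is symmetric positive semi-definite) turns an empirical kernel matrix into explicit features, optionally truncated to d_P dimensions.

    import torch

    def kernel_to_features(K_P, d_P=None):
        # K_P: (n_P, n_P) empirical protein kernel; returns X_P with K_P ~ X_P @ X_P.T
        lam, U = torch.linalg.eigh(K_P)      # K_P = U diag(lam) U^T
        lam, U = lam.flip(0), U.flip(1)      # eigenvalues in decreasing order
        lam = lam.clamp(min=0)               # guard against tiny negative eigenvalues
        if d_P is not None:                  # truncated features, as on the slide
            lam, U = lam[:d_P], U[:, :d_P]
        return U * lam.sqrt()                # X_P = U diag(sqrt(lam))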
  82. Impact of molecule and protein features

     Comparison of fixed and learned features in Komet on the LCIdb_Orphan dataset (d_M = 1000, m_M = 3000) [table: AUPR per molecule feature and protein feature]. Fixed features (Tanimoto kernel) perform better than deep-learning features for this specific problem (drug-like molecules and human druggable proteins), and the Komet performance is stable across feature choices.
  83. Comparison to Deep Learning algorithms

     ConPlex model [Singh, 2023], MolTrans model [Huang, 2021]: almost the same pipeline structure in 3 steps. Differences: the optimisation algorithm (SGD, which is less precise, less stable and less quick than BFGS here), and everything trained end-to-end versus fixed features in Komet.
  84. Case study: a scaffold hopping problem [Pinel, 2023]

     Komet recovers more out-of-scaffold ligands; LCIdb is the better training dataset; Komet outperforms on all criteria.