Small molecule (drug) that interacts with a protein (target)
Modulates protein function to prevent disease progression
Drug discovery process: Target Identification, Hit Discovery, Hit to Lead, Lead Optimisation, Drug Development
DTI Predictions
• At large scales
• In the chemical and protein spaces
Applications
• De-orphanising a phenotypic drug
• Anticipating unexpected off-targets by predicting drug interaction profiles: find unwanted side effects, offer drug repositioning opportunities
• De-orphanising a new therapeutic target
DTIs prediction at large scale
Classification problem (labels 1 and -1)
Input: Molecules, Proteins
Output: predicted interactions
Goal
• Predict proteins' molecular interaction profile
• Predict drugs' protein interaction profile
• Find off-targets
• De-orphan
Challenges
• The most complete training base: large and diverse, confident interactions (positive and negative)
• The most efficient algorithm: all prediction scenarios, timely manner, reasonable computing resources
Plan
LCIdb: a large new training database
• Motivation and Construction
• Coverage of the protein and molecule spaces
Komet: a Large-scale DTI prediction method
• Features of proteins and molecules
• Mixing for pair's features
• DTI classification
Results
• Parameters set-up of the model
• Impact of the mixing for pair's features
• Comparison with ML algorithms
Drugbank v1.5.1 [Wishart et al, 2018], BIOSNAP [Zitnik et al, 2018]
… proteins, 4,813 molecules, 13,716 interactions
+ well curated
+ FDA-approved drugs
- only positive (+) interactions
- medium-sized
Back to bioactivity databases: BindingDB [Tiqing et al, 2007], PubChem [Kim et al, 2019], ChEMBL [Mendez et al, 2019]
+ Large-sized
+ Experimental measures, including thermodynamic values (Kd, Ki, IC50)
- May have different bioactivity measures for a DTI
- All molecules and proteins
Consensus dataset [Isigkei et al, 2022] (Probes&Drugs, IUPHAR/BPS, ChEMBL)
Chemical structure quality filter
• Molecular weights between 100 and 900 g/mol
• Structure check: same SMILES representation in all sources
• Keep molecules which target at least 1 human protein
Bioactivity filter
• Keep DTIs for which the negative logarithm of Kd, Ki or IC50 is known
• Keep multiple annotated bioactivities only when they are within 1 log unit of each other
Binary labelling of DTIs
• measure = first Kd, then Ki, then IC50
• measure < 100 nM ($10^{-7}$ M): Positive DTI
• measure > 100 μM ($10^{-4}$ M): Negative DTI
LCIdb
• 2,069 proteins
• 274,515 molecules
• 402,538 Positive DTI
• 8,296 Negative DTI
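The binary labelling rule can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, bioactivities are assumed to be given in mol/L, and the labels 1 and -1 follow the classification setting of the talk. The thresholds are the molar values $10^{-7}$ M and $10^{-4}$ M from the slide.

```python
from typing import Optional

# Thresholds in mol/L, taken from the slide (10^-7 M and 10^-4 M).
POSITIVE_BELOW_M = 1e-7
NEGATIVE_ABOVE_M = 1e-4


def pick_measure(kd: Optional[float] = None,
                 ki: Optional[float] = None,
                 ic50: Optional[float] = None) -> Optional[float]:
    """Return the first available measure, in the order Kd, Ki, IC50."""
    for value in (kd, ki, ic50):
        if value is not None:
            return value
    return None


def label_dti(kd: Optional[float] = None,
              ki: Optional[float] = None,
              ic50: Optional[float] = None) -> Optional[int]:
    """Return 1 (positive DTI), -1 (negative DTI) or None (unlabelled)."""
    measure = pick_measure(kd, ki, ic50)
    if measure is None:
        return None
    if measure < POSITIVE_BELOW_M:
        return 1
    if measure > NEGATIVE_ABOVE_M:
        return -1
    # Intermediate bioactivities are neither confident positives nor negatives.
    return None
```

Pairs whose measure falls between the two thresholds receive no label in this sketch, which is one way to keep only confident interactions.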
Coverage of the molecular space, with the t-SNE algorithm on Tanimoto molecule features
Comparisons with literature medium-sized datasets
[Figure: three panels: LCIdb and BindingDB, LCIdb and DrugBank, LCIdb and BIOSNAP]
Step 1: Features for protein and molecule
• Molecule $\mathfrak{m}_{i_k}$ (SMILES): Tanimoto kernel, $\psi_M$, Molecule Features $m_{i_k}$
• Protein $\mathfrak{p}_{j_k}$ (Protein Sequence): Local Alignment kernel, $\psi_P$, Protein Features $p_{j_k}$
• Nystrom approximation, dimension reduction
Step 2: Features for (molecule, protein) pairs
• Kronecker kernel, Combined Features $z_k = m_{i_k} p_{j_k}^\top$
Step 3: DTI prediction model
• SVM: supervised model with parameter $w$, prediction $\varphi$, loss $\ell$, trained on the labels $y_k$
Training dataset (LCIdb), pairs $(\mathfrak{m}_{i_k}, \mathfrak{p}_{j_k})$
• # of proteins: 2,060
• # of molecules: 271,180
• # of Positive DTI: 396,798
• # of Negative DTI: 7,965 (+ 388,833 balanced)
Features compatible with large-scale
Computed on simple representations for protein and molecule
• Encode various characteristics (Morgan fingerprints ECFP4 [Rogers&al, 2010])
• Derived from kernel theory (Tanimoto kernel / Local Alignment kernel [Saigo&al, 2004])
• Computed from pre-trained networks on another task (ProtBert [Elnaggar&al, 2021])
• Learned features: learned by neural networks in DTI prediction pipeline (DeepPurpose [Huang&al, 2020])
Features of protein and molecule for Komet
Fixed features derived from kernel theory
LCIdb training dataset, pairs $(\mathfrak{m}_{i_k}, \mathfrak{p}_{j_k})$
• # of proteins: 2,069
• # of molecules: 274,515
• # of Positive DTI: 402,538
• # of Negative DTI: 8,296 (+ 394,242 balanced)
Protein Features $p_{j_k}$: Protein Sequence $\mathfrak{p}_{j_k}$, Local Alignment kernel, $\psi_P$
• $n_P = 2069$, $d_P = 2069$
• Fast computed
Molecule Features $m_{i_k}$: SMILES $\mathfrak{m}_{i_k}$, Tanimoto kernel, $\psi_M$
• Method depends on 2 parameters
• $m_M$: # of landmark molecules from the training set
• $d_M \le m_M$: dimension of the encoding
• Expressivity, quality and precision
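As a reminder of what the Tanimoto kernel computes, here is a minimal sketch on binary fingerprints stored as sets of on-bit indices. The fingerprints below are made-up toy values, and the convention for two empty fingerprints is an assumption of this sketch.

```python
from typing import FrozenSet, List


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity: |a & b| / |a | b| on sets of on-bit indices."""
    if not a and not b:
        return 1.0  # assumed convention for two empty fingerprints
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def kernel_matrix(fps: List[FrozenSet[int]]) -> List[List[float]]:
    """Pairwise Tanimoto kernel matrix for a list of fingerprints."""
    return [[tanimoto(x, y) for y in fps] for x in fps]


# Toy fingerprints (hypothetical on-bit indices, not real molecules).
toy_fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8})]
K = kernel_matrix(toy_fps)
```

Computing this matrix for every pair of molecules scales quadratically with the number of molecules, which is what motivates restricting the computation to a set of landmark molecules.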
Nyström approximation
Approximated features derived from kernel
Molecule kernel: $k_M(\mathfrak{m}, \mathfrak{m}') = \langle \psi_M(\mathfrak{m}), \psi_M(\mathfrak{m}') \rangle$
$K_M = X_M X_M^\top$ with $n_M = 274\text{k}$: computing, storage and factorisation impossible
$m_M$ random landmark molecules: $K_M \approx Z \hat{K}_M^{-1} Z^\top$, with $Z \in \mathbb{R}^{n_M \times m_M}$ and $\hat{K}_M \in \mathbb{R}^{m_M \times m_M}$
Singular value decomposition (SVD): $\hat{K}_M = U \operatorname{diag}(\lambda) U^\top$
$K_M \approx X_M X_M^\top$ where $X_M = Z U \operatorname{diag}(1/\sqrt{\lambda})$, $X_M \in \mathbb{R}^{n_M \times m_M}$
Dimension reduction: $X_M \approx \tilde{X}_M$ where $\tilde{X}_M \in \mathbb{R}^{n_M \times d_M}$
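The Nyström construction can be sketched with NumPy on random toy fingerprints. Everything here is illustrative: the function names are hypothetical, the data are random bits, and the dimension reduction step simply keeps the $d_M$ leading eigen-directions of $\hat{K}_M$, which is one simple choice and not necessarily the one used in Komet.

```python
import numpy as np


def tanimoto_kernel(A, B):
    """Tanimoto kernel between the rows of two binary (0/1) matrices."""
    A = A.astype(float)
    B = B.astype(float)
    inter = A @ B.T
    union = A.sum(axis=1)[:, None] + B.sum(axis=1)[None, :] - inter
    return inter / np.maximum(union, 1.0)  # guard against empty fingerprints


def nystrom_features(fps, m_M, d_M, rng):
    """Return approximate features (n_M x d_M) and the landmark indices."""
    n_M = fps.shape[0]
    landmarks = rng.choice(n_M, size=m_M, replace=False)
    Z = tanimoto_kernel(fps, fps[landmarks])   # n_M x m_M
    K_hat = Z[landmarks]                       # m_M x m_M, landmark kernel
    # K_hat is symmetric PSD, so its eigendecomposition plays the role of the SVD.
    lam, U = np.linalg.eigh(K_hat)
    keep = np.argsort(lam)[::-1][:d_M]         # d_M largest eigenvalues
    lam_top = np.maximum(lam[keep], 1e-10)     # avoid division by ~0
    # Column-wise scaling is the same as multiplying by diag(1 / sqrt(lambda)).
    X_tilde = Z @ U[:, keep] / np.sqrt(lam_top)
    return X_tilde, landmarks


rng = np.random.default_rng(0)
fps = (rng.random((200, 64)) < 0.2).astype(int)   # toy fingerprints
X_full, landmarks = nystrom_features(fps, m_M=20, d_M=20, rng=rng)
X_small, _ = nystrom_features(fps, m_M=20, d_M=5, rng=np.random.default_rng(1))
```

With $d_M = m_M$ the features reproduce the landmark kernel exactly, since $X_M X_M^\top$ restricted to the landmarks equals $U \operatorname{diag}(\lambda) U^\top$. Choosing $d_M < m_M$ trades some of that precision for a smaller encoding.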
Mixing of molecule and protein features
$\psi_M(\mathfrak{m}) = m$, $\psi_P(\mathfrak{p}) = p$, pair features $z$
Linear mixing
• Concatenation of the features
Non linear mixing
• Learned features: using a neural network
• Fixed features: using a tensor product, $z = m p^\top$, $d_Z = d_M \times d_P$
Tensor product: key element of Komet
• Captures information about the interaction well
• Favorable mathematical properties
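A small NumPy sketch contrasts the two fixed mixings and checks one well-known property of the tensor product: the inner product of two pair features factorises into the product of the molecule and protein inner products (the Kronecker kernel). The vectors are random toy values and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_M, d_P = 4, 3

# Toy molecule and protein features for two (molecule, protein) pairs.
m, m_prime = rng.normal(size=d_M), rng.normal(size=d_M)
p, p_prime = rng.normal(size=d_P), rng.normal(size=d_P)

# Linear mixing: concatenation, dimension d_M + d_P.
z_concat = np.concatenate([m, p])

# Tensor product: z = m p^T, dimension d_M x d_P.
z = np.outer(m, p)
z_prime = np.outer(m_prime, p_prime)

# <z, z'> (Frobenius inner product) equals <m, m'> * <p, p'>.
lhs = np.sum(z * z_prime)
rhs = (m @ m_prime) * (p @ p_prime)
```

Concatenation only lets a linear model add a molecule term and a protein term, whereas each entry of $m p^\top$ couples one molecule coordinate with one protein coordinate.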
Training dataset: pairs $(\mathfrak{m}_{i_k}, \mathfrak{p}_{j_k})$
• … al, 2019]
• Neural network with Logistic loss [Huang et al, 2021], [Singh et al, 2023]
• Linear model: SVM with Hinge loss [Jacob et al, 2008], [Playe et al, 2018], $\ell(y', y) = \max(0, 1 - y y')$
• Network-based inference approaches [Cheng et al, 2012]
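The hinge loss formula above translates directly into code. This is a plain illustration with a hypothetical function name, where `score` stands for the model output $y'$ and `label` for the true class $y$ in {1, -1}.

```python
def hinge_loss(score: float, label: int) -> float:
    """Hinge loss max(0, 1 - y * y') for a label y in {1, -1} and a score y'."""
    return max(0.0, 1.0 - label * score)


# A confidently correct prediction costs nothing; a wrong one is penalised linearly.
loss_correct = hinge_loss(2.0, 1)
loss_margin = hinge_loss(0.5, 1)
loss_wrong = hinge_loss(0.5, -1)
```

The loss is zero as soon as the score has the right sign with a margin of at least 1.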
SVM
$\min_{w \in \mathbb{R}^{d_P \times d_M}} \sum_{k=1}^{n_Z} \ell(\langle w, z_k \rangle, y_k) + \frac{\lambda}{2} \|w\|^2$
$Z \in \mathbb{R}^{n_Z \times d_Z}$ with rows $z_k$, $n_Z = 460\text{k}$, $d_Z = d_M \times d_P$
Problem is too big for both storage and computation of $Z$
Efficient computation [Airola&Pahikkala, 2017]
$\langle w, z_k \rangle_{\mathbb{R}^{d_Z}} = \langle w, m_{i_k} p_{j_k}^\top \rangle_{\mathbb{R}^{d_P \times d_M}} = \langle W p_{j_k}, m_{i_k} \rangle_{\mathbb{R}^{d_M}}$, where $W$ is $w$ written as a matrix and $q_{j_k} = W p_{j_k}$
$Zw$ can be computed in only $n_P \times d_Z + n_Z \times d_M$ operations: computing $(q_j)_{j=1}^{n_P}$ costs $n_P \times d_Z$
Complexity
• Explicit $Zw$: $n_Z \times d_Z$
• Implicit computation: $n_P \times d_Z + n_Z \times d_M$
GPU code in PyTorch: https://komet.readthedocs.io
Full batch BFGS method to solve the optimisation problem
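The implicit computation of $Zw$ can be checked numerically on toy data. The sketch below uses NumPy rather than the PyTorch code of the talk, with random features and hypothetical variable names; it only illustrates the identity $\langle w, m p^\top \rangle = \langle W p, m \rangle$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_M, n_P, d_M, d_P, n_Z = 50, 8, 6, 5, 200

M = rng.normal(size=(n_M, d_M))        # molecule features, one row per molecule
P = rng.normal(size=(n_P, d_P))        # protein features, one row per protein
i = rng.integers(0, n_M, size=n_Z)     # molecule index of each pair k
j = rng.integers(0, n_P, size=n_Z)     # protein index of each pair k
W = rng.normal(size=(d_M, d_P))        # model parameter w in matrix form

# Explicit: build every z_k = m_{i_k} p_{j_k}^T, flatten, multiply by w.
Z = np.einsum("ka,kb->kab", M[i], P[j]).reshape(n_Z, d_M * d_P)
scores_explicit = Z @ W.reshape(-1)    # about n_Z * d_Z operations

# Implicit: q_j = W p_j once per protein, then one dot product per pair.
Q = P @ W.T                            # n_P x d_M, about n_P * d_Z operations
scores_implicit = np.einsum("kd,kd->k", Q[j], M[i])  # about n_Z * d_M operations
```

The implicit route never materialises $Z$, so memory stays at the size of the molecule and protein feature matrices while the two score vectors agree up to floating-point error.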
Parameters set-up of the model: impact of dimension reduction in Komet ($d_M \le m_M$)
• $m_M = 3000$: same performance, saves computational resources
• $d_M = 1000$: same performance, saves time and GPU by a factor of 2
[Figure: three panels plotted against $d_M$ from 0 to 3000, including Train Time (s) and Peak_GPU (Gb)]
Datasets: BIOSNAP, BindingDB, LCIdb, Drugbank
• Optimisation (scale)
• Capturing information about interaction
• Large and diverse training dataset
[Figure: Molecule and Protein panels for LCIdb and BindingDB, and for LCIdb and Drugbank_Ext]
… protein and molecule spaces
Contributions:
• A large new molecule/protein interactions dataset: https://zenodo.org/records/10731712
• Komet: Fast & State of the Art: https://komet.readthedocs.io
Perspectives:
• Analysis of the target proteins predicted for the molecules found in phenotypic drug screening
… learned features
Komet on the LCIdb_Orphan Dataset
• Fixed features better than DL features for this specific problem (drug-like molecules and human druggable proteins)
• Stability of the Komet performance ($d_M = 1000$, $m_M = 3000$)
[Figure: AUPR by Protein features and Molecule feature, including the Tanimoto kernel]
… model [Huang, 2021]
• Almost same structure: pipeline in 3 steps
• Optimization algorithm: SGD, less precise, less stable and quick than BFGS
• Everything trained vs fixed features