
Jobim Talk 2024/06/26

Guichaoua

June 28, 2024

Transcript

  1. Drug-Target Interactions Prediction at Scale The Komet Algorithm With the

    LCIdb Dataset (Jobim, 26/06/2024). Gwenn Guichaoua 1,2,3, Philippe Pinel 1,2,3,4, Brice Hoffmann 4, Chloé Azencott 1,2,3, Véronique Stoven 1,2,3. 1 Institut Curie, PSL Research University, 75428 Paris, France; 2 Center for Computational Biology, Mines Paris, PSL Research University, 75006 Paris, France; 3 INSERM U900, 75005 Paris, France; 4 Iktos SAS, 75017 Paris, France
  2.–5. Context [Pinel et al, BioRxiv 2024]

     Drug-Target interaction (DTI): a small molecule (drug) that interacts with a protein (target) and modulates its function to prevent disease progression. Drug discovery process: Target Identification, Hit Discovery, Hit to Lead, Lead Optimisation, Drug Development. DTI predictions at large scale in the chemical and protein spaces. Applications: de-orphanising a phenotypic drug; anticipating unexpected off-targets by predicting drug interaction profiles; finding unwanted side effects; offering drug repositioning opportunities; de-orphanising a new therapeutic target.
  6.–11. DTIs prediction at large scale

     Supervised learning: input database of interactions between molecules and proteins, binary classification problem (labels 1 / -1); output: predicted interactions. Goals: predict a protein's molecular interaction profile (find off-targets) and predict a drug's protein interaction profile (de-orphan). Challenges: the most complete training base (large and diverse, confident positive and negative interactions, all prediction scenarios) and the most efficient algorithm (reasonable computing resources, predictions in a timely manner).
  12. Plan

     LCIdb: a large new training database (motivation and construction; coverage of the protein and molecule spaces). Komet: a large-scale DTI prediction method (features of proteins and molecules; mixing for pair's features; DTI classification). Results (parameters set-up of the model; impact of the mixing for pair's features; comparison with ML algorithms).
  13. Why a new training database? Binary interaction databases: DrugBank v1.5.1 [Wishart et al, 2018], BIOSNAP [Zitnik et al, 2018]

     2,513 proteins, 4,813 molecules, 13,716 (molecule, protein) interactions. + well curated, + FDA-approved drugs; - only interactions, - medium-sized.
  14. Back to bioactivity databases: BindingDB [Tiqing et al, 2007], PubChem [Kim et al, 2019], ChEMBL [Mendez et al, 2019]

     Kd, Ki, IC50 measurements. + large-sized, + experimental measures, including thermodynamic values; - may have different bioactivity measures for a DTI; - all molecules and proteins.
  15.–20. Building of a large consensus interactions dataset

     Sources: PubChem, BindingDB, Probes&Drugs, IUPHAR/BPS, ChEMBL; consensus dataset [Isigkeit et al, 2022].
     Chemical structure quality filter: molecular weights between 100 and 900 g/mol; structure check (same SMILES representation in all sources); keep molecules that target at least 1 human protein.
     Bioactivity filter: keep DTIs whose negative logarithm of Kd, Ki or IC50 is known; DTIs with multiple annotated bioactivities are kept only if they agree within 1 log unit.
     Binary labelling of DTIs (illustrated in the sketch below): measure = first Kd, then Ki, then IC50; measure < 100 nM (10⁻⁷ M): positive DTI; measure > 100 µM (10⁻⁴ M): negative DTI.
     Resulting LCIdb: 2,069 proteins, 274,515 molecules, 402,538 positive DTIs, 8,296 negative DTIs.
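A minimal Python sketch of the labelling rule above. The function name and the per-pair dictionary of affinity measures (in mol/L) are illustrative assumptions, not the actual LCIdb construction code.

    def label_dti(measures):
        # `measures` maps available affinity types to values in mol/L,
        # e.g. {"Kd": 3e-8, "IC50": 2e-7}. Kd is preferred, then Ki, then IC50,
        # as on the slide; thresholds are 1e-7 M (positive) and 1e-4 M (negative).
        for kind in ("Kd", "Ki", "IC50"):      # priority order from the slide
            if kind in measures:
                value = measures[kind]
                if value < 1e-7:               # < 100 nM: positive DTI
                    return +1
                if value > 1e-4:               # > 100 uM: negative DTI
                    return -1
                return None                    # intermediate affinity: left unlabelled
        return None                            # no usable measure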
  21. Analysis of our dataset LCIdb Representation of the molecular space

    with the t-SNE algorithm on Tanimoto molecule features Comparisons with literature medium-sized datasets Coverage of the molecular space 7 LCIdb and BindingDB LCIdb and DrugBank LCIdb and BIOSNAP
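The molecular-space map on this slide is a t-SNE embedding of Tanimoto molecule features. The sketch below, assuming RDKit and scikit-learn are available, shows one way such a map could be computed from Morgan fingerprints; it is an illustrative reconstruction, not the authors' plotting code.

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    from sklearn.manifold import TSNE

    def tsne_chemical_space(smiles_list, perplexity=30.0, random_state=0):
        # ECFP4-like Morgan fingerprints for every molecule
        mols = [Chem.MolFromSmiles(s) for s in smiles_list]
        fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=1024) for m in mols]
        # Pairwise Tanimoto distances (1 - similarity)
        n = len(fps)
        dist = np.zeros((n, n))
        for i in range(n):
            dist[i] = 1.0 - np.array(DataStructs.BulkTanimotoSimilarity(fps[i], fps))
        # 2-D t-SNE map on the precomputed distance matrix
        return TSNE(n_components=2, metric="precomputed", init="random",
                    perplexity=perplexity, random_state=random_state).fit_transform(dist)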
  22. Coverage of the protein space of our dataset LCIdb

     t-SNE algorithm on protein features derived from the LA kernel; panels: LCIdb and BindingDB, LCIdb and DrugBank, LCIdb and BIOSNAP. Protein families highlighted: Protein kinase, G-protein coupled receptor 1, Cytochrome P450, Tubulin, Ligand-gated ion channel, PI3/PI4-kinase, SDRs, Major facilitator, Sodium channel, ARTD/PARP, Aldo/keto reductase, Cyclic nucleotide phosphodiesterase, Transient receptor, Calcium channel alpha-1 subunit, Calycin, Integrin alpha chain, G-protein coupled receptor 2, G-protein coupled receptor 3, Adenylyl/guanylyl cyclase, Nuclear hormone receptor, Cyclins, Bcl-2, Alpha-carbonic anhydrase, Phospholipase A2, Histone deacetylase, Small GTPase, ABC transporter.
  23. Komet: a Large-scale DTI prediction method

     Features of proteins and molecules; mixing for pair's features; DTI classification. (Plan: LCIdb, a large new training database: motivation and construction, coverage of the protein and molecule spaces. Results: parameters set-up of the model, impact of the mixing for pair's features, comparison of ML algorithms.)
  24.–28. General DTI prediction pipeline

     Step 1, features for protein and molecule: SMILES and the Tanimoto kernel give molecule features m_ik (via ψ_M, Nyström approximation and dimension reduction); the protein sequence and the Local Alignment kernel give protein features p_jk (via ψ_P).
     Step 2, features for (molecule, protein) pairs: combined features z_k = m_ik p_jk^⊤ (Kronecker kernel).
     Step 3, DTI prediction model: supervised model (SVM) with prediction φ, loss ℓ and parameter w, trained on labels y_k.
     Training dataset built from LCIdb, pairs (i_k, j_k): # of proteins: 2,060; # of molecules: 271,180; # of positive DTI: 396,798; # of negative DTI: 7,965 (+ 388,833 balanced).
  29.–34. Step 1: Features for protein and molecule

     Features compatible with large scale, computed on simple representations of the protein and the molecule.
     Fixed features: encode various characteristics (Morgan fingerprints ECFP4 [Rogers&al,2010]) or are derived from kernel theory (Tanimoto kernel / Local Alignment kernel [Saigo&al,2004]).
     Learned features: computed by networks pre-trained on another task (ProtBert [Elnaggar&al,2021]) or learned by neural networks inside the DTI prediction pipeline (DeepPurpose [Huang&al,2020]).
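To make the fixed-feature option concrete, the sketch below (assuming RDKit is installed) computes ECFP4 Morgan fingerprints and the Tanimoto kernel value between two molecules; the SMILES strings and the 1024-bit size are illustrative choices, and the protein-side Local Alignment kernel is not shown.

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def ecfp4(smiles, n_bits=1024):
        # ECFP4 corresponds to a Morgan fingerprint of radius 2
        mol = Chem.MolFromSmiles(smiles)
        return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)

    def tanimoto_kernel(smiles_a, smiles_b):
        # Tanimoto kernel k_M(m, m') between two molecules
        return DataStructs.TanimotoSimilarity(ecfp4(smiles_a), ecfp4(smiles_b))

    # Example: aspirin vs paracetamol
    print(tanimoto_kernel("CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Nc1ccc(O)cc1"))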
  35.–40. Features for protein and molecule for Komet

     Fixed features derived from kernel theory: SMILES and the Tanimoto kernel give molecule features m_ik (via ψ_M); the protein sequence and the Local Alignment kernel give protein features p_jk (via ψ_P).
     LCIdb training dataset, pairs (𝔪_ik, 𝔭_jk): # of proteins: 2,069; # of molecules: 274,515; # of positive DTI: 402,538; # of negative DTI: 8,296 (+ 394,242 balanced).
     Protein features: n_P = 2,069, and p_jk has dimension d_P = 2,069.
     Molecule features: approximated features derived from the molecule kernel, fast to compute. The method depends on 2 parameters: m_M, the number of landmark molecules drawn from the training set (quality and precision of the approximation), and d_M ≤ m_M, the dimension of the encoding (expressivity).
  41. Approximated features derived from the molecule kernel

     Molecule kernel: k_M(𝔪, 𝔪′) = ⟨ψ_M(𝔪), ψ_M(𝔪′)⟩. Empirical molecule kernel K_M = X_M X_M^⊤ with n_M = 274k molecules: computing, storage and factorisation impossible.
     Nyström approximation: draw m_M random landmark molecules; with K̂_M the m_M × m_M kernel between landmarks and Z the n_M × m_M kernel between all molecules and the landmarks, K_M ≈ Z K̂_M⁻¹ Z^⊤.
     Singular value decomposition (SVD): K̂_M = U diag(λ) U^⊤, so K_M ≈ X_M X_M^⊤ where X_M = Z U diag(1/√λ).
     Dimension reduction: X_M ≈ X̃_M, where X̃_M ∈ ℝ^{n_M × d_M} keeps only d_M columns.
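The slide's construction can be written in a few lines of PyTorch. The sketch below is an illustrative reimplementation under stated assumptions (the landmark and cross kernels are passed in as precomputed matrices; the toy kernel at the end is a stand-in for the Tanimoto kernel), not the published Komet code.

    import torch

    def nystrom_features(K_hat, Z, d_M):
        # K_hat: (m_M, m_M) kernel between randomly chosen landmark molecules
        # Z:     (n_M, m_M) kernel between all molecules and the landmarks
        # Returns X_tilde of shape (n_M, d_M) with K_M ~ X_tilde @ X_tilde.T
        lam, U = torch.linalg.eigh(K_hat)            # K_hat = U diag(lam) U^T
        lam, U = lam.flip(0), U.flip(1)              # eigenvalues in decreasing order
        lam = lam.clamp(min=1e-8)                    # numerical safety for 1/sqrt(lam)
        X = Z @ U @ torch.diag(1.0 / lam.sqrt())     # X_M = Z U diag(1/sqrt(lam))
        return X[:, :d_M]                            # dimension reduction: keep d_M columns

    # Toy usage (landmarks = the first m_M molecules of a random feature set)
    n_M, m_M, d_M = 1000, 100, 20
    feats = torch.rand(n_M, 50)
    Z = feats @ feats[:m_M].T                        # stand-in kernel, shape (n_M, m_M)
    X_tilde = nystrom_features(Z[:m_M], Z, d_M)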
  42.–47. Step 2: Features for (molecule, protein) pairs

     Mixing of the molecule features ψ_M(𝔪) = m and the protein features ψ_P(𝔭) = p into a pair feature z. Options: linear mixing by concatenation of the features; non-linear mixing with learned features, using a neural network; or fixed features using a tensor product, z = m p^⊤, of dimension d_Z = d_M × d_P.
     The tensor product is a key element of Komet: it captures information about the interaction well and has favourable mathematical properties.
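The usefulness of the tensor product rests on the identity ⟨w, m p^⊤⟩ = ⟨W p, m⟩, where the parameter w is reshaped into a matrix W; the next slides rely on it. A minimal numerical check in PyTorch (toy dimensions, illustrative only):

    import torch

    dM, dP = 4, 3
    m = torch.randn(dM)          # molecule feature
    p = torch.randn(dP)          # protein feature
    W = torch.randn(dM, dP)      # parameter w seen as a dM x dP matrix

    z = torch.outer(m, p)        # pair feature z = m p^T, of size dM x dP
    lhs = (W * z).sum()          # <w, z> with everything flattened
    rhs = m @ (W @ p)            # <W p, m>: never materialises z
    assert torch.allclose(lhs, rhs, atol=1e-5)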
  48. Step 3 : DTI prediction model Tree-based methods [Shi et

    al, 2019] Neural network with Logistic loss [Huang et al, 2021],[Singh et al, 2023] ( 𝔪 ik , 𝔭 jk ) Training dataset 15 SVM with Hinge loss : [Jacob et al, 2008],[Playe et al, 2018] ℓ(y′  , y) = max(0,1 − yy′  ) Linear model Network-based inference approaches [Cheng et al, 2012]
  49.–54. Step 3: Komet model

     Optimisation problem (SVM): min over w ∈ ℝ^(d_P × d_M) of Σ_{k=1..n_Z} ℓ(⟨w, z_k⟩, y_k) + (λ/2) ‖w‖², with Z ∈ ℝ^{n_Z × d_Z}, n_Z = 460k and d_Z = d_M × d_P. The problem is too big for both storage and computation of Z and Zw.
     Efficient computation [Airola&Pahikkala,2017]: (Zw)_k = ⟨w, z_k⟩ = ⟨w, m_ik p_jk^⊤⟩ = ⟨W p_jk, m_ik⟩, so after precomputing q_j = W p_j for j = 1, …, n_P, Zw can be computed in only n_P × d_Z + n_Z × d_M operations (versus n_Z × d_Z for the explicit product).
     GPU code in PyTorch: https://komet.readthedocs.io. A full-batch BFGS method solves the optimisation problem (see the sketch below).
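A compact PyTorch sketch of this step: the implicit computation of Zw through q_j = W p_j, a hinge-loss SVM objective with ridge penalty, and a full-batch L-BFGS solver. Sizes, names and the random data are illustrative; this is not the published Komet implementation.

    import torch

    # Toy sizes (far smaller than LCIdb)
    n_P, n_M, d_P, d_M, n_Z = 50, 200, 16, 32, 1000
    P = torch.randn(n_P, d_P)                        # protein features p_j
    M = torch.randn(n_M, d_M)                        # molecule features m_i
    j_idx = torch.randint(0, n_P, (n_Z,))            # protein index of each pair
    i_idx = torch.randint(0, n_M, (n_Z,))            # molecule index of each pair
    y = torch.randint(0, 2, (n_Z,)).float() * 2 - 1  # labels in {-1, +1}

    W = torch.zeros(d_M, d_P, requires_grad=True)    # parameter w as a d_M x d_P matrix
    lam = 1e-3
    opt = torch.optim.LBFGS([W], max_iter=50)        # full-batch quasi-Newton solver

    def closure():
        opt.zero_grad()
        Q = P @ W.T                                  # q_j = W p_j, cost n_P x d_Z
        scores = (M[i_idx] * Q[j_idx]).sum(dim=1)    # (Zw)_k = <W p_jk, m_ik>, cost n_Z x d_M
        hinge = torch.clamp(1 - y * scores, min=0).sum()   # l(y', y) = max(0, 1 - y y')
        loss = hinge + 0.5 * lam * (W ** 2).sum()
        loss.backward()
        return loss

    opt.step(closure)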
  55. Plan

     LCIdb: a large new training database (motivation and construction; coverage of the protein and molecule spaces). Komet: a large-scale DTI prediction method (features of proteins and molecules; mixing for pair's features; DTI classification). Results (parameters set-up of the model; impact of the mixing for pair's features; comparison of ML algorithms).
  56.–61. Parameters set-up of the Komet model (LCIdb_Orphan dataset, d_M ≤ m_M)

     Impact of the number of molecule landmarks m_M: same performance while saving computational resources; m_M = 3000 is retained.
     Impact of the reduction dimension d_M: same performance while saving a factor of 2 in training time and GPU memory; d_M = 1000 is retained.
     [Plots: AUPR, peak GPU memory (Gb) and train time (s) as a function of d_M, for d_M up to 3000.]
  62.–63. Impact of the mixing for pair's features

     Comparison of concatenated and Komet (tensor-product) features on large-sized datasets [bar charts: AUPR and training time]. The tensor product is a key element for the expressivity of the pair features.
  64. Comparison with Deep Learning algorithms [Singh, 2023], [Huang, 2021]

     AUPR and training time on large-scale datasets; AUPR on an external dataset. DrugBank_Ext: DrugBank without the DTIs present in LCIdb and BindingDB.
  65. Why does it work?

     The tensor product is a key element for capturing information about the interaction; the optimisation scales to the full problem; and the training dataset is large and diverse [t-SNE panels of the molecule and protein spaces: LCIdb and BindingDB, LCIdb and Drugbank_Ext, BIOSNAP].
  66. Conclusion 22 Initial problem DTI prediction at large scale in

    protein and molecule spaces Contributions: A large new molecule/protein interactions dataset https://zenodo.org/records/10731712 Komet: Fast & State of the Art https://komet.readthedocs.io
  67. Conclusion 22 Initial problem DTI prediction at large scale in

    protein and molecule spaces Contributions: A large new molecule/protein interactions dataset https://zenodo.org/records/10731712 Komet: Fast & State of the Art https://komet.readthedocs.io Perspectives: Analysis of the target proteins predicted for the molecules found in phenotypic drug screening
  68. Acknowledgments Project supported by the Île-de-France Region as part of

    the “DIM AI4IDF” Clara NAHMIAS Sylvie RODRIGUES-FERREIRA Philippe PINEL Véronique STOVEN Chloé AZENCOTT Brice Hoffmann Thanks for your attention!
  69.–80. Approximated features derived from the molecule kernel

     Molecule kernel: k_M(𝔪, 𝔪′) = ⟨ψ_M(𝔪), ψ_M(𝔪′)⟩. Empirical molecule kernel K_M = X_M X_M^⊤ with n_M = 274k: computing, storage and factorisation impossible.
     Nyström approximation: draw m_M random landmark molecules; with K̂_M the m_M × m_M kernel between landmarks and Z the n_M × m_M kernel between all molecules and the landmarks, K_M ≈ Z K̂_M⁻¹ Z^⊤.
     Singular value decomposition (SVD): K̂_M = U diag(λ) U^⊤, so K_M ≈ X_M X_M^⊤ where X_M = Z U diag(1/√λ).
     Dimension reduction: X_M ≈ X̃_M, where X̃_M ∈ ℝ^{n_M × d_M} keeps only d_M columns.
  81. From kernel back to features

     Protein kernel: Local Alignment kernel [Saigo&al,2004], k_P(𝔭, 𝔭′) = ⟨ψ_P(𝔭), ψ_P(𝔭′)⟩. Empirical protein kernel K_P ∈ ℝ^{n_P × n_P}, with n_P = 2069.
     Factorisation of the empirical protein kernel by singular value decomposition (SVD): K_P = U diag(λ) U^⊤, giving empirical features X_P = U diag(√λ) ∈ ℝ^{n_P × d_P} with d_P = n_P, so that K_P = X_P X_P^⊤.
     Approximation for d_P << n_P: X̃_P = U[:, :d_P] diag(√λ[:d_P]).
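The same construction in code: the sketch below (PyTorch, illustrative; it assumes the Local Alignment kernel matrix K_P has already been computed and is symmetric positive semi-definite) turns an empirical kernel matrix into explicit features, optionally truncated to d_P dimensions.

    import torch

    def kernel_to_features(K_P, d_P=None):
        # K_P: (n_P, n_P) empirical protein kernel; returns X_P with K_P ~ X_P @ X_P.T
        lam, U = torch.linalg.eigh(K_P)      # K_P = U diag(lam) U^T
        lam, U = lam.flip(0), U.flip(1)      # eigenvalues in decreasing order
        lam = lam.clamp(min=0)               # guard against tiny negative eigenvalues
        if d_P is not None:                  # truncated features, as on the slide
            lam, U = lam[:d_P], U[:, :d_P]
        return U * lam.sqrt()                # X_P = U diag(sqrt(lam))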
  82. Impact of molecule and protein features

     Comparison of fixed and learned features in Komet on the LCIdb_Orphan dataset (d_M = 1000, m_M = 3000) [table: AUPR per molecule feature and protein feature]. Fixed features (Tanimoto kernel) perform better than deep-learning features for this specific problem (drug-like molecules and human druggable proteins), and the Komet performance is stable across feature choices.
  83. Comparison to Deep Learning algorithms

     ConPlex model [Singh, 2023], MolTrans model [Huang, 2021]: almost the same pipeline structure in 3 steps. Differences: the optimisation algorithm (SGD, which is less precise, less stable and less quick than BFGS here), and everything trained end-to-end versus fixed features in Komet.
  84. Case study: a scaffold hopping problem [Pinel, 2023]

     Komet recovers more out-of-scaffold ligands; LCIdb is the better training dataset; Komet outperforms on all criteria.