Efficient and Scalable Framework for Activity Prediction with kMol, Elix, CBI 2023

Slide 1

Slide 1 text

Efficient and Scalable Framework for Activity Prediction with kMol
Elix, Inc.
Jun Jin Choong, Research Engineer
October 25th, 2023

Slide 2

Slide 2 text

Table of Contents
● Introduction
  ○ Why kMol?
● Use Cases:
  ○ Federated Learning
  ○ Usage beyond federated learning
● Conclusion

Slide 3

Slide 3 text

Introduction: Why kMol?
● Complexity: It is difficult to find the best model for a given problem.
● Speed: Drug discovery today demands much faster and more efficient computational methods.
● Scalability: Scalability is a problem for many pharmaceutical companies.
In this work, we present kMol, a machine learning library built for this very purpose.

Slide 4

Slide 4 text

Problem Statement
Case 1: Data, Security and Privacy Concerns
● Due to data security and privacy reasons, data cannot be shared externally.
● Some data are confidential and cannot be shared.
Pharmaceutical companies would like to utilize state-of-the-art models, but:
● Training deep models requires a lot of data for better performance.
● Not everyone has the ability to train on large amounts of data with big machines.

Slide 5

Slide 5 text

Problem Statement
Case 2: Domain Expertise
Utilizing deep learning models is potentially complicated and requires expert knowledge. It is not easy to implement existing models from white papers.
● Cost of implementation
● Expert knowledge and scalability of models are constraints for most pharmaceutical companies.

Slide 6

Slide 6 text

kMol for Federated Learning
kMol is an open-source machine learning library for drug discovery and life sciences, with federated learning capabilities. It is a scalable and highly customizable library with batteries included. It can be found at https://github.com/elix-tech/kmol
kMol was developed in collaboration with researchers from Kyoto University. The main goal of kMol was to establish a federated learning framework. However, continued development has extended its capabilities beyond federated learning.

Slide 7

Slide 7 text

Our Solution
Case 1: kMol addresses security and privacy concerns by introducing federated learning capabilities. Model architectures compatible with kMol have this capability enabled by default. The source code is open and available for scrutiny, and kMol is designed to be easily maintained.
Case 2: Domain Expertise. kMol is developed by Elix and actively supported by a group of talented AI researchers. Questions can be directed to our GitHub repository, and further customization can be provided through Elix’s consultation services.

Slide 8

Slide 8 text

kMol and Federated Learning

Slide 9

Slide 9 text

Preliminaries: Federated Learning
Federated learning is an approach that replaces the conventional, centralized way of training machine learning models with a collective strategy. Ultimately, we are interested in the final state of the model: a fully trained model with state-of-the-art performance.
kMol’s approach to federated learning in practice:
● Global model: held by the master node, which aggregates the trained model weights from the distributed worker nodes.
● Local model: an identical copy of the global model, but trained on a different set of data.
For every epoch, the trained local model is sent to the master node for aggregation. Data security is preserved, since the training data itself never leaves the worker nodes.
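The aggregation step can be illustrated with a minimal sketch of weighted federated averaging. This is only an assumption about what an aggregator might do, not kMol's actual implementation: the function name `federated_average`, the parameter names and the toy weights are all hypothetical, and each model is represented as a plain dictionary of flat lists.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def federated_average(local_weights: List[Weights], sample_counts: List[int]) -> Weights:
    """Average local model weights, weighting each worker by its sample count."""
    if not local_weights or len(local_weights) != len(sample_counts):
        raise ValueError("need one sample count per local model")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Weights = {}
    for name in local_weights[0]:
        size = len(local_weights[0][name])
        # Weighted mean of each parameter value across all workers.
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(local_weights, sample_counts)) / total
            for i in range(size)
        ]
    return averaged


# Two workers holding 80 and 20 samples (illustrative numbers only).
worker_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
worker_b = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}
global_weights = federated_average([worker_a, worker_b], [80, 20])
print(global_weights)
```

Only the weights and the sample counts reach the aggregating function, so the raw training data stays on each worker.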

Slide 10

Slide 10 text

kMol
kMol is a library meant to be run from the command line.
Prerequisites
- Some Linux command-line knowledge is required
Installation
- Comes with batteries included (i.e. example configuration scripts)
- Installation is straightforward
- Two lines to perform the installation
- or run with Docker

Slide 11

Slide 11 text

Configuration in kMol (1)
Configurations
Sample configurations are available in the /data directory.
Configurations are available for
- Federated Learning (MILA)
- ADME
- AMES
- Ligand-Protein Activity Prediction

Slide 12

Slide 12 text

Configuration in kMol (2)
● kMol settings can be shared easily between users. They are written in JSON; as of version 1.1.4, YAML is also supported.
● Configuration covers:
  ○ Model configuration
  ○ Data configuration
  ○ Featurization/Preprocessing
● Configurations are also extensible: one can import an existing configuration and make only minor changes. The parent configuration file is loaded, and its parameters can be overridden in the child configuration.
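The parent/child override idea can be sketched as a recursive dictionary merge. This is an illustration under assumptions, not kMol's loader: the function `merge_config` and every key in the two example configurations (`model`, `hidden_size`, `data`, `batch_size`, `epochs`) are hypothetical and are not taken from kMol's schema.

```python
import json
from typing import Any, Dict


def merge_config(parent: Dict[str, Any], child: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new config in which values from `child` override those in `parent`."""
    # Shallow copy: nested sections that the child does not touch are shared with the parent.
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides define this section, so merge it key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars, lists and new keys simply replace the parent value.
            merged[key] = value
    return merged


# Hypothetical parent configuration (keys are illustrative only).
parent_config = json.loads("""
{
  "model": {"type": "graph", "hidden_size": 128},
  "data": {"batch_size": 32},
  "epochs": 100
}
""")

# The child only states what differs from the parent.
child_config = json.loads('{"model": {"hidden_size": 256}, "epochs": 10}')

effective_config = merge_config(parent_config, child_config)
print(json.dumps(effective_config, indent=2))
```

With this design a child file stays short and the shared defaults live in one place; the parent dictionary itself is left unmodified.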

Slide 13

Slide 13 text

Running kMol
kMol is launched with the kmol command. Additional commands can be found in the documentation.
kMol can also perform hyperparameter optimization and other related subtasks.
Evaluation can be performed on trained models. The checkpoint has to be specified in the configuration file.
A fully trained model can be used to perform prediction as well.

Slide 14

Slide 14 text

Federated Learning with kMol
kMol can be executed in a federated learning scenario by launching a server and multiple clients. The client-server model works by sharing a configuration file among all nodes of the federated learning network. The target localhost:8024 in this case is the aggregating server.
Example: 80-20 Tox21 Configuration
Server: By default, grpc_configuration can be left empty, in which case federated learning is performed on a local machine.
Client 1, Client 2: The client configurations have a similar setup.
[Figure panels: Client 1, Client 2, Server]

Slide 15

Slide 15 text

Federated Learning with kMol
Example
The two clients start training, and the aggregator (server) waits for each client to complete the specified number of epochs, then aggregates the results according to the chosen aggregator.
Upon aggregation, checkpoints are shared with all clients.
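The wait-then-aggregate-then-share cycle can be sketched as a small simulation. It is a toy under stated assumptions and does not use kMol or gRPC: clients are simulated with threads, `local_training` is a stand-in that only shifts the weights, and the names `run_round` and `mean_aggregator` are hypothetical. The aggregator is passed in as a function to mirror the idea that the aggregation strategy is a choice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Weights = List[float]
Aggregator = Callable[[List[Weights]], Weights]


def mean_aggregator(updates: List[Weights]) -> Weights:
    """Element-wise mean of all client updates."""
    return [sum(values) / len(updates) for values in zip(*updates)]


def local_training(weights: Weights, shift: float, epochs: int) -> Weights:
    """Stand-in for real training: add `shift` to every weight once per epoch."""
    for _ in range(epochs):
        weights = [w + shift for w in weights]
    return weights


def run_round(global_weights: Weights, client_shifts: List[float],
              epochs: int, aggregate: Aggregator) -> Weights:
    """One round: every client trains on a copy, the server waits, then aggregates."""
    with ThreadPoolExecutor(max_workers=len(client_shifts)) as pool:
        futures = [
            pool.submit(local_training, list(global_weights), shift, epochs)
            for shift in client_shifts
        ]
        # result() blocks, so aggregation only starts once all clients are done.
        updates = [future.result() for future in futures]
    return aggregate(updates)


weights = [0.0, 1.0]
for _ in range(3):
    # The aggregated weights act as the checkpoint shared with all clients.
    weights = run_round(weights, client_shifts=[1.0, 3.0], epochs=2,
                        aggregate=mean_aggregator)
print(weights)
```

Swapping `mean_aggregator` for another function with the same signature changes the aggregation strategy without touching the round logic.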

Slide 16

Slide 16 text

Federated Learning with kMol - Transparency
In cases where sharing checkpoints is a concern, kMol supports uploading checkpoints to Box.

Slide 17

Slide 17 text

Beyond Federated Learning: Scalability and Extensibility of kMol

Slide 18

Slide 18 text

Recent Developments
Over the past few years, a lot of development has gone into making kMol better. We have thus far included the following features:
- State-of-the-art graph models
- State-of-the-art protein-ligand activity prediction architectures (developed by Elix)
- Distributed computation with kMol (compatible with Fugaku)
- Visualization tools such as Integrated Gradients
[Figure panels: ClusterGCN; Explainability with Integrated Gradients]
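For readers unfamiliar with Integrated Gradients, the idea can be sketched on a toy function. This is not kMol's visualization code: the quadratic `model`, its hand-written gradient, the `WEIGHTS` values and the all-zero baseline are assumptions chosen so the result can be checked by hand. The attribution of each input is its difference from the baseline, multiplied by the gradient averaged along the straight path from baseline to input.

```python
from typing import Callable, List

Vector = List[float]


def integrated_gradients(grad_fn: Callable[[Vector], Vector], x: Vector,
                         baseline: Vector, steps: int = 50) -> Vector:
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    # Scale the path-averaged gradient by the input's distance from the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Toy differentiable "model": F(x) = sum_i w_i * x_i^2 (illustrative only).
WEIGHTS = [1.0, -2.0, 0.5]


def model(x: Vector) -> float:
    return sum(w * xi * xi for w, xi in zip(WEIGHTS, x))


def model_gradient(x: Vector) -> Vector:
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(model_gradient, x, baseline)
print(attributions)
# Completeness check: attributions sum to F(x) - F(baseline).
print(sum(attributions), model(x) - model(baseline))
```

For a real network the gradient would come from automatic differentiation rather than a hand-written function, and the choice of baseline matters for how the attributions are read.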

Slide 19

Slide 19 text

Recent Developments
More recently, support for the following is being added:
- Activity prediction with 3D information (i.e. from docking simulation results)
- MSA feature extraction from AlphaFold/OpenFold’s dataset
[Figure: Pipeline Integration with 3D Information. Labels: MSA features; protein sequence (GPHDK...); compound structure; docking structure; graph or 3D-graph; kMol featurizer; model; token, bag-of-words or AF2 feature; 3D graph descriptors; interaction descriptor; activity value]

Slide 20

Slide 20 text

Conclusion
• kMol is a machine learning library for drug discovery and life sciences, with federated learning capabilities. It is a scalable and highly customizable library with batteries included.
• It is actively being developed by Elix in collaboration with researchers from Kyoto University.
• There is still a lot of room for improvement. kMol is mainly presented as a library for research purposes, but its federated learning capabilities are also suited for enterprise environments.
• The source code is open, and contributions of any form are welcome.