
Leveraging Self-Supervised Contextual Language Models for Deep Neural Network Antibody CDR-H3 Loop Prediction, Elix, CBI 2021

Elix
October 27, 2021

Transcript

  1. Leveraging Self-Supervised
    Contextual Language Models for
    Deep Neural Network Antibody
    CDR-H3 Loop Prediction
    David Jimenez & Nazim Medzhidov, Ph.D
    Elix Inc.
    27/10/2021


  2. Introduction
    ● Antibodies are proteins of the immune system that can bind a huge variety of antigens with high affinity and
    specificity.
    ● Antibody structure, particularly the structure of the Complementarity Determining Regions (CDRs), determines
    the strength of antigen recognition.
    ● Knowledge of the antibody 3D structure is important when designing/optimizing a therapeutic candidate.
    Challenges:
    ● The CDR-H3 loop plays a crucial role in antigen binding; however, it is observed in a wide variety of
    conformations, making it the most challenging part of the antibody to model.
    ● Compared to general proteins, the antibody realm has relatively scarce structure-annotated data, which makes
    model training and generalization difficult.


  3. Recent progress
    ● DeepH3[1] is a neural network based on RaptorX
    (a general protein structure prediction method), adapted for
    antibody structure prediction.
    ● It predicts inter-residue distances and angles as 26
    discretized classes in the ranges:
    ○ Distance: [4 Å, 16 Å]
    ○ Omega and Theta: [-180°, 180°]
    ○ Phi: [0°, 180°]
    ● A relatively shallow architecture compensates for the
    scarcity of structure-annotated data.
    [Figure: DeepH3 architecture — input sequence → 1D ResNet (×3 blocks) → 1D-to-2D transformation →
    2D ResNet (×21 blocks) → four parallel 2D conv output heads for d, 𝜑, 𝜃, 𝜔; with a diagram of the
    inter-residue geometry between residues 1 and 2: distance d and angles ω, θ12, θ21, φ12, φ21
    defined over the backbone N and C atoms.]
    [1] Ruffolo J.A, et al., Bioinformatics (2020)
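    The discretization above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the uniform-bin convention and the clamping of out-of-range values into the first/last bin are assumptions.

```python
import numpy as np

# Hedged sketch (not the DeepH3 source): discretize an inter-residue
# distance or angle into the 26 classes predicted over a fixed range,
# e.g. [4 A, 16 A] for distances. Uniform bin edges are assumed, and
# out-of-range values are clamped into the first/last bin.
def discretize(value, lo, hi, n_bins=26):
    edges = np.linspace(lo, hi, n_bins + 1)            # 27 uniform edges
    idx = np.searchsorted(edges, value, side="right") - 1
    return int(np.clip(idx, 0, n_bins - 1))

print(discretize(4.0, 4.0, 16.0))    # → 0  (first distance bin)
print(discretize(15.99, 4.0, 16.0))  # → 25 (last distance bin)
```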


  4. Can we improve an H3 loop prediction model’s
    performance by leveraging similar unlabeled datasets?


  5. Proposed Approach: ATRACH3
    Antibody TRAnsformer Cdr-H3 (ATRACH3):
    ● Augment an antibody H3 loop structure prediction model with a language model, trained unsupervised
    on a large dataset of protein sequences.
    [Figure: pipeline — input protein sequence → Language Model (ESM-1B, trained unsupervised on a proxy task) →
    representation space → H3 Loop Prediction Model (DeepH3, trained supervised) → d, 𝜑, 𝜃, 𝜔.]


  6. ATRACH3
    Antibody TRAnsformer Cdr-H3 (ATRACH3):
    ● Selected Language Model: ESM-1B[2]; Contextual language model trained unsupervised on large protein
    datasets to reconstruct sequences with masked amino acids.
    ○ While the model cannot observe protein structure directly, it observes patterns in the sequences which are
    determined by structure.
    ○ The model spans a representation space reflecting structural knowledge.
    ● Selected Antibody Structure Prediction Model: DeepH3; learns to predict the inter-residue distances and
    angles.
    ○ It is “hooked” to the second-to-last layer of ESM-1B, which contains a richer representation: not only the
    underlying amino acid sequence, but also encoded features relating to structural data.
    ● Datasets:
    ○ Unlabeled dataset: The UniProt Archive (UniParc)[3] with approximately 250 million sequences.
    ○ Labeled dataset: SAbDab[4] dataset containing all the structure-labeled antibody sequences in the Protein Data
    Bank. After pre-processing, 1433 sequences were selected.
    ○ Test Set: the Rosetta antibody benchmark dataset[5], comprising 49 curated antibody targets.
    [2] Rives A, et al., PNAS (2020) [3] Leinonen R, et al., Bioinformatics (2004) [4] Dunbar J, et al., Nucleic Acids Res. (2014) [5] Marze N.A, et al., Prot. Eng. Des. Selection (2016)
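    The “hooking” of a 2D prediction model onto per-residue language-model representations can be illustrated with toy shapes. This is a hedged sketch, not the ATRACH3 implementation: `emb` stands in for the (L, D) per-residue embeddings taken from ESM-1B's second-to-last layer, and the pairwise concatenation shown is one common convention for the 1D-to-2D transformation, assumed here for illustration.

```python
import numpy as np

# Hedged sketch (toy shapes, not ATRACH3's actual code): per-residue
# embeddings of shape (L, D) -- standing in for ESM-1B's second-to-last
# layer output -- are lifted to pairwise features of shape (L, L, 2*D)
# by concatenating the embeddings of residues i and j. A 2D conv head
# (as in DeepH3) can then predict d, phi, theta, omega per residue pair.
def to_pairwise(emb):
    L, D = emb.shape
    row = np.broadcast_to(emb[:, None, :], (L, L, D))  # residue i features
    col = np.broadcast_to(emb[None, :, :], (L, L, D))  # residue j features
    return np.concatenate([row, col], axis=-1)         # (L, L, 2*D)

emb = np.random.default_rng(0).normal(size=(7, 16))    # toy: 7 residues, D=16
pair = to_pairwise(emb)
print(pair.shape)  # → (7, 7, 32)
```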


  7. Label Imbalance and Focal Loss
    [Figure: label distribution histograms — Distance (4 Å to 16 Å), Omega (-180° to 180°),
    Theta (-180° to 180°), Phi (0° to 180°) — alongside a comparison of DeepH3 trained with
    Cross Entropy Loss vs. Focal Loss.]
    *CCC: Circular Correlation Coefficient
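    A minimal sketch of the focal loss used to counter this label imbalance, for a single residue pair over K discretized classes. This assumes the standard form of Lin et al. (2017); the value of γ is an illustrative choice, not reported in the deck.

```python
import numpy as np

# Hedged sketch: focal loss for one residue pair over K discretized
# classes. gamma down-weights easy (high-confidence) examples, so rare
# bins contribute relatively more; gamma=0 recovers plain cross-entropy.
def focal_loss(logits, target, gamma=2.0):
    z = logits - logits.max()            # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()      # softmax probabilities
    pt = p[target]                       # probability of the true class
    return float(-((1.0 - pt) ** gamma) * np.log(pt))

logits = np.array([2.0, 0.5, -1.0])
# An easy example (true class already dominant) is down-weighted:
print(focal_loss(logits, target=0, gamma=2.0)
      < focal_loss(logits, target=0, gamma=0.0))  # → True
```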


  8. Results with a 95:5 (Training Set : Validation Set) Ratio
    DeepH3 trained with focal loss was compared with
    ATRACH3 trained with focal loss.
    ATRACH3 improved by 1.16%, 10.9%, 3.66%, and 6.45% in
    Distance, Omega, Theta, and Phi, respectively.
    On average, it improved by 4.9% over DeepH3 trained under
    the same conditions.
    *CCC: Circular Correlation Coefficient


  9. Can unsupervised training on a large protein dataset
    compensate for fewer labeled data in the H3 loop prediction task?


  10. ATRACH3 Performance with Fewer Data
    To test the efficacy of ATRACH3 in reduced-data situations, we further reduced the available training set size to
    95%, 90%, 80%, 66%, 50%, and 33% of the original training set, as follows:
    [Figure: subsets of the Antibody Annotated Structure Dataset.]
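    One plausible way to build such reduced training sets is sketched below. The nesting (each smaller subset contained in the larger ones) and the fixed seed are illustrative assumptions, not details taken from the deck; 1433 is the labeled-set size given earlier.

```python
import numpy as np

# Hedged sketch: nested training subsets (95%, ..., 33%) of the 1433
# structure-labeled sequences. Using prefixes of a single fixed shuffle
# makes each smaller subset a subset of the larger ones, so results
# across fractions stay comparable. Seed and nesting are assumptions.
rng = np.random.default_rng(0)
order = rng.permutation(1433)                       # one fixed shuffle
fractions = [0.95, 0.90, 0.80, 0.66, 0.50, 0.33]
subsets = {f: order[: int(round(f * 1433))] for f in fractions}
for f in fractions:
    print(f, len(subsets[f]))
```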


  11. Results on the Test Set with Reduced Training Data Points


  12. Summary and Future Directions
    Findings:
    ● Using focal loss improved baseline (DeepH3) performance.
    ● Focal loss was therefore also used in ATRACH3.
    ● Extending DeepH3 to leverage a similar unlabeled dataset via unsupervised learning improved inter-residue angle
    and distance predictions for the antibody H3 loop.
    ● When trained with smaller datasets, ATRACH3 outperformed the baseline on all four tasks.
    Furthermore, ATRACH3's performance decreases less rapidly than DeepH3's as the training data is reduced.
    Future Directions:
    ● Investigate ATRACH3 performance when trained unsupervised on a focused dataset of antibody sequences.
