Slide 1

Leveraging Self-Supervised Contextual Language Models for Deep Neural Network Antibody CDR-H3 Loop Prediction

David Jimenez & Nazim Medzhidov, Ph.D.
Elix Inc.
27/10/2021

Slide 2

Introduction

● Antibodies are proteins of the immune system that can bind a huge variety of antigens with high affinity and specificity.
● Antibody structure, particularly the structure of the Complementarity Determining Regions (CDRs), determines the strength of antigen recognition.
● Knowledge of an antibody's 3D structure is important when designing or optimizing a therapeutic candidate.

Challenges:

● The CDR-H3 loop has a crucial role in antigen binding, yet it is observed in many conformations, making it the most challenging part of the antibody to model.
● The antibody realm has relatively scarce structure-annotated data compared to general proteins, which makes training and generalization of models difficult.

Slide 3

Recent progress

● DeepH3 [1] is a neural network based on RaptorX (general protein structure prediction), adapted for antibody structure prediction.
● It predicts inter-residue distances and angles as 26 discretized classes over the ranges:
○ Distance: [4 Å, 16 Å]
○ Omega and theta: [-180°, 180°]
○ Phi: [0°, 180°]
● Its architecture is relatively shallow to compensate for the scarcity of structure-annotated data.

[Architecture diagram: input sequence → 1D ResNet (x3) → 1D-to-2D transformation → 2D ResNet (x21) → four 2D conv output heads for d, φ, θ, ω; inset defines the inter-residue distance d and the angles ω, θ, φ between the N, Cα, C, Cβ atoms of residue pairs 1 and 2.]

[1] Ruffolo J.A., et al., Bioinformatics (2020)
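The 26-class discretization above can be illustrated as simple equal-width binning over each range. A minimal sketch (the helper name and out-of-range clamping are my assumptions, not from the talk):

```python
def to_bin(value, lo, hi, n_bins=26):
    """Map a continuous value in [lo, hi] to one of n_bins equal-width classes."""
    value = max(lo, min(value, hi))           # clamp to the supported range
    width = (hi - lo) / n_bins
    return min(int((value - lo) / width), n_bins - 1)

# Distance d is binned over [4 A, 16 A]; omega/theta over [-180, 180]; phi over [0, 180].
d_bin     = to_bin(5.3,   4.0, 16.0)
omega_bin = to_bin(-90.0, -180.0, 180.0)
phi_bin   = to_bin(180.0, 0.0, 180.0)         # upper edge falls into the last class
```

With 26 bins, each distance class is about 0.46 Å wide and each omega/theta class about 13.8° wide, which sets the resolution of the predicted geometry.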

Slide 4

Can we improve an H3 loop prediction model's performance by leveraging similar unlabeled datasets?

Slide 5

Proposed Approach: ATRACH3

Antibody TRAnsformer Cdr-H3 (ATRACH3):
● Augment an antibody H3 loop structure prediction model with a language model trained unsupervised on a large dataset of protein sequences.

[Diagram: input protein sequence → language model (ESM-1B, trained unsupervised on a proxy task) → representation space → H3 loop prediction model (DeepH3, trained supervised) → d, φ, ω, θ.]
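The unsupervised proxy task is masked-residue reconstruction: hide some amino acids and train the model to recover them from context. A toy sketch of preparing one such training example (function name, mask token, and masking scheme are my assumptions; ESM-1B's actual masking procedure differs in detail):

```python
import random

def mask_sequence(seq, n_mask=3, mask_token="#", seed=0):
    """Hide n_mask residues; the language model's proxy task is to
    reconstruct the original amino acids at the masked positions."""
    rng = random.Random(seed)
    positions = set(rng.sample(range(len(seq)), n_mask))
    masked = "".join(mask_token if i in positions else aa
                     for i, aa in enumerate(seq))
    targets = {i: seq[i] for i in positions}   # ground truth to recover
    return masked, targets

seq = "QVQLVQSGAEVKKPGASVKV"   # toy heavy-chain fragment
masked, targets = mask_sequence(seq)
```

Because the surrounding context constrains which residues are plausible at each masked position, solving this task forces the model to internalize sequence patterns that are themselves shaped by structure.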

Slide 6

ATRACH3

Antibody TRAnsformer Cdr-H3 (ATRACH3):
● Selected language model: ESM-1B [2], a contextual language model trained unsupervised on large protein datasets to reconstruct sequences with masked amino acids.
○ While the model cannot observe protein structure directly, it observes patterns in the sequences that are determined by structure.
○ The model spans a representation space reflecting structural knowledge.
● Selected antibody structure prediction model: DeepH3, which learns to predict inter-residue distances and angles.
○ It is "hooked" to the second-to-last layer of ESM-1B, which contains a richer representation: not only the underlying amino acid sequence, but also encoded features relating to structural data.
● Datasets:
○ Unlabeled dataset: the UniProt Archive (UniParc) [3], with approximately 250 million sequences.
○ Labeled dataset: the SAbDab [4] dataset, containing all structure-labeled antibody sequences in the Protein Data Bank; after pre-processing, 1433 sequences were selected.
○ Test set: the Rosetta antibody benchmark dataset [5], comprising 49 curated antibody targets.

[2] Rives A., et al., PNAS (2020)
[3] Leinonen R., et al., Bioinformatics (2004)
[4] Dunbar J., et al., Nucleic Acids Res. (2014)
[5] Marze N.A., et al., Prot. Eng. Des. Selection (2016)
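The "hook" delivers per-residue embedding vectors to a network whose 2D convolutional stack consumes pairwise features, so the 1D embeddings must be lifted to an L x L grid. A toy sketch of one common 1D-to-2D scheme, concatenating the two residues' vectors per cell (the function name and concatenation choice are my assumptions; DeepH3's actual transformation may differ):

```python
def pairwise_features(embeddings):
    """Turn an L x D list of per-residue embedding vectors into an
    L x L grid of pairwise features by concatenating the vectors of
    residues i and j -- the 1D-to-2D step feeding the 2D conv stack."""
    L = len(embeddings)
    return [[embeddings[i] + embeddings[j] for j in range(L)]
            for i in range(L)]

emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # toy L=3, D=2 embeddings
grid = pairwise_features(emb)                # grid[i][j] has length 2*D
```

Each cell (i, j) then carries the information the output heads need to predict the distance and angles between residues i and j.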

Slide 7

Label Imbalance and Focal Loss

[Label distribution histograms: distance over 4–16 Å, omega and theta over -180° to 180°, phi over 0° to 180°, illustrating the imbalance across classes.]

DeepH3 trained with cross-entropy loss vs. focal loss

*CCC: Circular Correlation Coefficient
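Focal loss addresses the label imbalance shown above by down-weighting well-classified examples, so rare bins in the tails of the histograms contribute relatively more to training. A minimal single-example sketch (default gamma = 2 is my assumption; the talk does not state the value used):

```python
import math

def cross_entropy(p_true):
    """Standard cross-entropy for the probability assigned to the true class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """Focal loss scales cross-entropy by (1 - p)^gamma, shrinking the
    contribution of confidently correct (i.e. frequent, easy) labels."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

easy = focal_loss(0.9)   # confident prediction: loss scaled by 0.1**2 = 0.01
hard = focal_loss(0.1)   # poor prediction, e.g. a rare bin: scaled by 0.9**2 = 0.81
```

The easy example's loss is cut by a factor of 100 while the hard example's is barely reduced, which is exactly the rebalancing effect wanted for the skewed distance/angle label distributions.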

Slide 8

Results with 95:5 (Training Set : Validation Set) Ratio

● DeepH3 trained with focal loss was compared with ATRACH3 trained with focal loss.
● ATRACH3 improved by 1.16%, 10.9%, 3.66%, and 6.45% on distance, omega, theta, and phi, respectively; on average, it improved 4.9% over DeepH3 trained under the same conditions.

*CCC: Circular Correlation Coefficient

Slide 9

Can unsupervised training on a large protein dataset compensate for having fewer labeled data on the H3 loop prediction task?

Slide 10

ATRACH3 Performance with Fewer Data

To test the efficacy of ATRACH3 in reduced-data situations, we further reduced the available training set to 95%, 90%, 80%, 66%, 50%, and 33% of its original size, sampling from the antibody structure-annotated dataset.
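As an illustration of roughly how many labeled examples each reduction level leaves, assuming the 1433 SAbDab sequences and the 95:5 split from earlier slides (the exact counts and truncation-style rounding here are my arithmetic, not figures from the talk):

```python
def subset_sizes(n_train, fractions):
    """Number of labeled training examples left at each reduction level."""
    return {f: int(n_train * f) for f in fractions}

n_total = 1433                  # SAbDab antibody structures after pre-processing
n_train = int(n_total * 0.95)   # 95:5 training:validation split
sizes = subset_sizes(n_train, [0.95, 0.90, 0.80, 0.66, 0.50, 0.33])
```

At the 33% level only a few hundred labeled structures remain, which is the regime where transfer from the unsupervised representation would matter most.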

Slide 11

Results on the Test Set with Reduced Training Data

Slide 12

Summary and Future Directions

Findings:
● Using focal loss improved the baseline (DeepH3) performance; focal loss was therefore also used in ATRACH3.
● Extending DeepH3 to leverage a similar dataset through unsupervised learning improved inter-residue angle and distance predictions for the antibody H3 loop.
● When trained with smaller datasets, ATRACH3 outperformed the baseline on all four tasks. Furthermore, ATRACH3's performance appears to degrade less rapidly than DeepH3's as the data is reduced.

Future Directions:
● Investigate ATRACH3's performance when trained unsupervised on a focused dataset of antibody sequences.