
Leveraging Self-Supervised Contextual Language Models for Deep Neural Network Antibody CDR-H3 Loop Prediction, Elix, CBI 2021

Elix
October 27, 2021

Transcript

  1. Leveraging Self-Supervised Contextual Language Models for Deep Neural Network Antibody CDR-H3 Loop Prediction. David Jimenez & Nazim Medzhidov, Ph.D., Elix Inc., 27/10/2021
  2. Introduction
     • Antibodies are proteins of the immune system that can bind to a huge variety of antigens with high affinity and specificity.
     • Antibody structure, particularly the structure of the Complementarity Determining Regions (CDRs), determines the strength of antigen recognition.
     • Knowledge of the antibody 3D structure is important when designing/optimizing a therapeutic candidate.
     Challenges:
     • The CDR-H3 loop plays a crucial role in antigen binding; however, it is observed in various conformations, making it the most challenging part of the antibody to model.
     • The antibody realm has relatively scarce structure-annotated data compared to general proteins, which makes training and generalizing models difficult.
  3. Recent progress
     • DeepH3 [1] is a neural network based on RaptorX (general protein structure prediction) used for antibody structure prediction.
     • It predicts inter-residue distances and angles as 26 discretized classes in the ranges:
       ◦ Distance: [4 Å, 16 Å]
       ◦ Omega and Theta: [-180°, 180°]
       ◦ Phi: [0°, 180°]
     • It uses a relatively shallow architecture to compensate for the scarcity of structure-annotated data.
     [Architecture diagram: input sequence → 1D ResNet (×3) → 1D-to-2D transformation → 2D ResNet (×21) → four 2D conv output heads for d, φ, θ, ω; geometry panel defines the inter-residue distance d and angles ω, θ12, θ21, φ12, φ21 over the backbone atoms N, Cα, Cβ, C of residue pairs.]
     [1] Ruffolo J.A., et al., Bioinformatics (2020)
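The output discretization described above can be sketched in a few lines. This is an illustrative NumPy snippet, not DeepH3's code: it assumes uniform bin edges over the stated ranges, and DeepH3's exact binning may differ.

```python
import numpy as np

N_CLASSES = 26  # number of discretized output classes, per the slide

def discretize(values, lo, hi):
    """Map continuous values to integer class labels 0..N_CLASSES-1
    using uniform bins over [lo, hi] (an assumption)."""
    edges = np.linspace(lo, hi, N_CLASSES + 1)
    # np.digitize is 1-based for in-range values; shift to 0-based and
    # clip so boundary/out-of-range values fall into the edge bins.
    return np.clip(np.digitize(values, edges) - 1, 0, N_CLASSES - 1)

# Per-output ranges from the slide (example values are arbitrary).
dist_labels  = discretize(np.array([4.0, 9.7, 16.0]), 4.0, 16.0)
omega_labels = discretize(np.array([-180.0, 10.0, 180.0]), -180.0, 180.0)
phi_labels   = discretize(np.array([0.0, 91.0, 180.0]), 0.0, 180.0)
```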
  4. Can we improve an H3-loop-predicting model's performance by leveraging similar unlabeled datasets?
  5. Proposed Approach: ATRACH3
     Antibody TRAnsformer Cdr-H3 (ATRACH3): augment an antibody H3 loop structure prediction model with a language model, trained unsupervised on a large dataset of protein sequences.
     [Pipeline diagram: input protein sequence → language model (ESM-1B, trained unsupervised on a masked-reconstruction proxy task) → representation space → H3 loop prediction model (DeepH3, trained supervised) → d, φ, ω, θ.]
  6. ATRACH3
     Antibody TRAnsformer Cdr-H3 (ATRACH3):
     • Selected language model: ESM-1B [2], a contextual language model trained unsupervised on large protein datasets to reconstruct sequences with masked amino acids.
       ◦ While the model cannot observe protein structure directly, it observes patterns in the sequences which are determined by structure.
       ◦ The model spans a representation space reflecting structural knowledge.
     • Selected antibody structure prediction model: DeepH3, which learns to predict the inter-residue distances and angles.
       ◦ It is "hooked" to the second-to-last layer of ESM-1B, which contains a richer representation, not only of the underlying amino acid sequence but also of encoded features relating to structural data.
     • Datasets:
       ◦ Unlabeled dataset: the UniProt Archive (UniParc) [3], with approximately 250 million sequences.
       ◦ Labeled dataset: the SAbDab [4] dataset, containing all the structure-labeled antibody sequences in the Protein Data Bank. After pre-processing, 1433 sequences were selected.
       ◦ Test set: the Rosetta antibody benchmark dataset [5], comprising 49 curated antibody targets.
     [2] Rives A., et al., PNAS (2020) [3] Leinonen R., et al., Bioinformatics (2004) [4] Dunbar J., et al., Nucleic Acids Res. (2014) [5] Marze N.A., et al., Prot. Eng. Des. Selection (2016)
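The slides say DeepH3 is "hooked" to ESM-1B's second-to-last layer but do not show the wiring. The NumPy sketch below illustrates only the 1D-to-2D transformation step that turns per-residue embeddings into pairwise features for 2D convolutions; a random array stands in for the ESM-1B output, and the sequence length 120 is an arbitrary assumption (1280 matches ESM-1B's hidden size).

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 120, 1280                    # sequence length (assumed); ESM-1B width
h = rng.standard_normal((L, d))     # stand-in for the second-to-last layer

# Pair feature for residues (i, j): concatenation of both embeddings,
# the "1D to 2D transformation" feeding a DeepH3-style 2D ResNet.
row = np.broadcast_to(h[:, None, :], (L, L, d))   # h[i] tiled along j
col = np.broadcast_to(h[None, :, :], (L, L, d))   # h[j] tiled along i
pair = np.concatenate([row, col], axis=-1)        # shape (L, L, 2d)
```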
  7. Label Imbalance and Focal Loss
     [Label distribution histograms: distance over [4 Å, 16 Å], omega and theta over [-180°, 180°], phi over [0°, 180°].]
     DeepH3 trained with cross-entropy loss vs. focal loss.
     *CCC: circular correlation coefficient
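Focal loss (Lin et al.) down-weights well-classified examples so that rare labels contribute more to the gradient, which is why it helps with the imbalance shown on this slide. A minimal NumPy sketch, assuming the standard formulation with gamma = 2 and no per-class alpha weights (the slides do not state the hyperparameters):

```python
import numpy as np

def focal_loss(logits, labels, gamma=2.0):
    """Mean focal loss over a batch: logits (N, K), labels (N,) int.
    FL(p_t) = -(1 - p_t)^gamma * log(p_t); gamma=0 recovers cross entropy."""
    z = logits - logits.max(axis=1, keepdims=True)   # stabilized softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    pt = p[np.arange(len(labels)), labels]           # prob. of the true class
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt)))
```

With gamma = 0 this reduces to plain cross entropy; for gamma > 0, confidently classified (typically majority-class) examples are suppressed.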
  8. Results with a 95:5 (training set : validation set) ratio
     DeepH3 trained with focal loss was compared with ATRACH3 trained with focal loss. ATRACH3 improved by 1.16%, 10.9%, 3.66%, and 6.45% in Distance, Omega, Theta, and Phi, respectively. On average it improved 4.9% over DeepH3 trained under the same conditions.
     *CCC: circular correlation coefficient
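The angle results are reported as CCC (circular correlation coefficient), which handles the wrap-around at ±180°. The slides do not specify the exact formula; the sketch below implements the Fisher-Lee circular correlation, one common choice.

```python
import numpy as np

def circ_corr(a, b):
    """Fisher-Lee circular correlation of two angle arrays (radians)."""
    am = np.angle(np.exp(1j * a).mean())   # circular mean of a
    bm = np.angle(np.exp(1j * b).mean())   # circular mean of b
    sa, sb = np.sin(a - am), np.sin(b - bm)
    return float((sa * sb).sum() / np.sqrt((sa**2).sum() * (sb**2).sum()))
```

Unlike Pearson correlation on raw degrees, this statistic is unchanged when all angles are rotated by a constant, so predictions near the -180°/180° boundary are scored sensibly.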
  9. Can unsupervised training on large protein data compensate for having fewer labeled data on the H3 loop prediction task?
  10. ATRACH3 performance with fewer data
      To test the efficacy of ATRACH3 in reduced-data situations, we further reduced the available training set size to 95%, 90%, 80%, 66%, 50%, and 33% of the original antibody annotated structure dataset.
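One way to realize the data-reduction protocol above is with nested random subsets, so each smaller training set is contained in the larger ones and curves are comparable. This is a sketch of that reading; the nesting, the seed, and rounding are assumptions, since the slide only lists the fractions and the dataset size (1433 after pre-processing).

```python
import numpy as np

rng = np.random.default_rng(42)
n_total = 1433                       # labeled SAbDab sequences after pre-processing
order = rng.permutation(n_total)     # shuffle once, then take prefixes (assumed)
fractions = [0.95, 0.90, 0.80, 0.66, 0.50, 0.33]
subsets = {f: order[: int(round(f * n_total))] for f in fractions}
```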
  11. 11 Results on Test Set with reduced training data points

  12. Summary and Future Directions
      Findings:
      • Using focal loss improved baseline (DeepH3) performance; focal loss was therefore also used in ATRACH3.
      • Extending DeepH3 to leverage a similar dataset using unsupervised learning improved inter-residue angle and distance predictions for the antibody H3 loop.
      • When trained with smaller datasets, ATRACH3 was able to outperform the baseline on all four tasks. Furthermore, the performance of ATRACH3 seems to decrease less rapidly than DeepH3's as the data is reduced.
      Future directions:
      • Investigate ATRACH3 performance when trained unsupervised on a focused dataset of antibody sequences.