
Leveraging Self-Supervised Contextual Language Models for Deep Neural Network Antibody CDR-H3 Loop Prediction, Elix, CBI 2021

Elix
October 27, 2021

Transcript

  1. Leveraging Self-Supervised
    Contextual Language Models for
    Deep Neural Network Antibody
    CDR-H3 Loop Prediction
    David Jimenez & Nazim Medzhidov, Ph.D
    Elix Inc.
    27/10/2021


  2. Introduction
    ● Antibodies are proteins of the immune system that can bind a huge variety of antigens with high affinity and
    specificity.
    ● Antibody structure, particularly the structure of the Complementarity Determining Regions (CDRs), determines
    the strength of antigen recognition.
    ● Knowledge of the antibody 3D structure is important when designing/optimizing a therapeutic candidate.
    Challenges:
    ● The CDR-H3 loop plays a crucial role in antigen binding; however, it is observed in a wide variety of
    conformations, making it the most challenging part of the antibody to model.
    ● Compared to general proteins, the antibody realm has relatively scarce structure-annotated data, which makes
    model training and generalization difficult.


  3. Recent progress
    ● DeepH3[1] is a neural network based on RaptorX
    (a general protein structure prediction method), adapted for
    antibody structure prediction.
    ● It predicts inter-residue distances and angles as 26
    discretized classes in the ranges:
    ○ Distance: [4 Å, 16 Å]
    ○ Omega and Theta: [-180°, 180°]
    ○ Phi: [0°, 180°]
    ● A relatively shallow architecture compensates for the
    scarcity of structure-annotated data.
    [Figure: DeepH3 architecture — input sequence → 1D ResNet (×3 blocks) → 1D-to-2D transformation →
    2D ResNet (×21 blocks) → four parallel 2D conv output heads for d, 𝜑, 𝜃, 𝜔; with a diagram of the
    inter-residue geometry between residues 1 and 2: distance d and angles ω, θ12, θ21, φ12, φ21
    defined over the backbone N and C atoms.]
    [1] Ruffolo J.A, et al., Bioinformatics (2020)
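    The discretization above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the uniform-bin convention and the clamping of out-of-range values into the first/last bin are assumptions.

```python
import numpy as np

# Hedged sketch (not the DeepH3 source): discretize an inter-residue
# distance or angle into the 26 classes predicted over a fixed range,
# e.g. [4 A, 16 A] for distances. Uniform bin edges are assumed, and
# out-of-range values are clamped into the first/last bin.
def discretize(value, lo, hi, n_bins=26):
    edges = np.linspace(lo, hi, n_bins + 1)            # 27 uniform edges
    idx = np.searchsorted(edges, value, side="right") - 1
    return int(np.clip(idx, 0, n_bins - 1))

print(discretize(4.0, 4.0, 16.0))    # → 0  (first distance bin)
print(discretize(15.99, 4.0, 16.0))  # → 25 (last distance bin)
```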


  4. Can we improve an H3 loop prediction model’s
    performance by leveraging similar unlabeled datasets?


  5. Proposed Approach: ATRACH3
    Antibody TRAnsformer Cdr-H3 (ATRACH3):
    ● Augment an antibody H3 loop structure prediction model with a language model, trained unsupervised
    on a large dataset of protein sequences.
    [Figure: pipeline — input protein sequence → Language Model (ESM-1B, trained unsupervised on a proxy task) →
    representation space → H3 Loop Prediction Model (DeepH3, trained supervised) → d, 𝜑, 𝜃, 𝜔.]


  6. ATRACH3
    Antibody TRAnsformer Cdr-H3 (ATRACH3):
    ● Selected Language Model: ESM-1B[2]; Contextual language model trained unsupervised on large protein
    datasets to reconstruct sequences with masked amino acids.
    ○ While the model cannot observe protein structure directly, it observes patterns in the sequences which are
    determined by structure.
    ○ The model spans a representation space reflecting structural knowledge.
    ● Selected Antibody Structure Prediction Model: DeepH3; learns to predict the inter-residue distances and
    angles.
    ○ It is “hooked” to the second-to-last layer of ESM-1B, which contains a richer representation: not only the
    underlying amino acid sequence, but also encoded features relating to structural data.
    ● Datasets:
    ○ Unlabeled dataset: The UniProt Archive (UniParc)[3] with approximately 250 million sequences.
    ○ Labeled dataset: SAbDab[4] dataset containing all the structure-labeled antibody sequences in the Protein Data
    Bank. After pre-processing, 1433 sequences were selected.
    ○ Test Set: the Rosetta antibody benchmark dataset[5], comprising 49 curated antibody targets.
    [2] Rives A, et al., PNAS (2020) [3] Leinonen R, et al., Bioinformatics (2004) [4] Dunbar J, et al., Nucleic Acids Res. (2014) [5] Marze N.A, et al., Prot. Eng. Des. Selection (2016)
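    The “hooking” of a 2D prediction model onto per-residue language-model representations can be illustrated with toy shapes. This is a hedged sketch, not the ATRACH3 implementation: `emb` stands in for the (L, D) per-residue embeddings taken from ESM-1B's second-to-last layer, and the pairwise concatenation shown is one common convention for the 1D-to-2D transformation, assumed here for illustration.

```python
import numpy as np

# Hedged sketch (toy shapes, not ATRACH3's actual code): per-residue
# embeddings of shape (L, D) -- standing in for ESM-1B's second-to-last
# layer output -- are lifted to pairwise features of shape (L, L, 2*D)
# by concatenating the embeddings of residues i and j. A 2D conv head
# (as in DeepH3) can then predict d, phi, theta, omega per residue pair.
def to_pairwise(emb):
    L, D = emb.shape
    row = np.broadcast_to(emb[:, None, :], (L, L, D))  # residue i features
    col = np.broadcast_to(emb[None, :, :], (L, L, D))  # residue j features
    return np.concatenate([row, col], axis=-1)         # (L, L, 2*D)

emb = np.random.default_rng(0).normal(size=(7, 16))    # toy: 7 residues, D=16
pair = to_pairwise(emb)
print(pair.shape)  # → (7, 7, 32)
```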


  7. Label Imbalance and Focal Loss
    [Figure: label distribution histograms — Distance (4 Å to 16 Å), Omega (-180° to 180°),
    Theta (-180° to 180°), Phi (0° to 180°) — alongside a comparison of DeepH3 trained with
    Cross Entropy Loss vs. Focal Loss.]
    *CCC: Circular Correlation Coefficient
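    A minimal sketch of the focal loss used to counter this label imbalance, for a single residue pair over K discretized classes. This assumes the standard form of Lin et al. (2017); the value of γ is an illustrative choice, not reported in the deck.

```python
import numpy as np

# Hedged sketch: focal loss for one residue pair over K discretized
# classes. gamma down-weights easy (high-confidence) examples, so rare
# bins contribute relatively more; gamma=0 recovers plain cross-entropy.
def focal_loss(logits, target, gamma=2.0):
    z = logits - logits.max()            # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()      # softmax probabilities
    pt = p[target]                       # probability of the true class
    return float(-((1.0 - pt) ** gamma) * np.log(pt))

logits = np.array([2.0, 0.5, -1.0])
# An easy example (true class already dominant) is down-weighted:
print(focal_loss(logits, target=0, gamma=2.0)
      < focal_loss(logits, target=0, gamma=0.0))  # → True
```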


  8. Results with a 95:5 (Training Set : Validation Set) Ratio
    DeepH3 trained with focal loss was compared with
    ATRACH3 trained with focal loss.
    ATRACH3 improved by 1.16%, 10.9%, 3.66%, and 6.45% in
    Distance, Omega, Theta, and Phi, respectively.
    On average, it improved by 4.9% over DeepH3 trained under
    the same conditions.
    *CCC: Circular Correlation Coefficient


  9. Can unsupervised training on a large protein dataset
    compensate for fewer labeled data in the H3 loop prediction task?


  10. ATRACH3 Performance with Fewer Data
    To test the efficacy of ATRACH3 in reduced-data situations, we further reduced the available training set size to
    95%, 90%, 80%, 66%, 50%, and 33% of the original training set, as follows:
    [Figure: subsets of the Antibody Annotated Structure Dataset.]
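    One plausible way to build such reduced training sets is sketched below. The nesting (each smaller subset contained in the larger ones) and the fixed seed are illustrative assumptions, not details taken from the deck; 1433 is the labeled-set size given earlier.

```python
import numpy as np

# Hedged sketch: nested training subsets (95%, ..., 33%) of the 1433
# structure-labeled sequences. Using prefixes of a single fixed shuffle
# makes each smaller subset a subset of the larger ones, so results
# across fractions stay comparable. Seed and nesting are assumptions.
rng = np.random.default_rng(0)
order = rng.permutation(1433)                       # one fixed shuffle
fractions = [0.95, 0.90, 0.80, 0.66, 0.50, 0.33]
subsets = {f: order[: int(round(f * 1433))] for f in fractions}
for f in fractions:
    print(f, len(subsets[f]))
```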


  11. Results on the Test Set with Reduced Training Data Points


  12. Summary and Future Directions
    Findings:
    ● Using focal loss improved baseline (DeepH3) performance.
    ● Focal loss was therefore also used in ATRACH3.
    ● Extending DeepH3 to leverage a similar unlabeled dataset via unsupervised learning improved inter-residue angle
    and distance predictions for the antibody H3 loop.
    ● When trained with smaller datasets, ATRACH3 outperformed the baseline on all four tasks.
    Furthermore, ATRACH3's performance decreases less rapidly than DeepH3's as the training data is reduced.
    Future Directions:
    ● Investigate ATRACH3 performance when trained unsupervised on a focused dataset of antibody sequences.
