Slide 6
ATRACH3
Antibody TRAnsformer Cdr-H3 (ATRACH3):
● Selected Language Model: ESM-1B[2], a contextual language model trained without supervision on large protein
sequence datasets to reconstruct sequences in which some amino acids have been masked.
○ Although the model never observes protein structure directly, it learns patterns in the sequences that are
determined by structure.
○ Its learned representation space therefore reflects structural knowledge.
● Selected Antibody Structure Prediction Model: DeepH3, which learns to predict inter-residue distances and
angles.
○ It is “hooked” to the second-to-last layer of ESM-1B, whose representation is richer: it encodes not only the
underlying amino acid sequence but also features related to structure.
● Datasets:
○ Unlabeled dataset: The UniProt Archive (UniParc)[3] with approximately 250 million sequences.
○ Labeled dataset: SAbDab[4], which contains all structure-labeled antibody sequences in the Protein Data
Bank. After pre-processing, 1433 sequences were selected.
○ Test Set: Rosetta antibody benchmark dataset[5], comprising 49 curated antibody targets.
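The masked-reconstruction objective described for the language model can be illustrated with a minimal sketch. This is not the ESM-1B training code: the `<mask>` symbol, the 15% masking rate, the helper names, and the example sequence are all assumptions chosen for illustration.

```python
import random

MASK = "<mask>"  # placeholder symbol (assumption); a real tokenizer may use another


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of residues.

    Returns the corrupted token list and a dict {position: original residue}
    holding the targets a model would be trained to reconstruct.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]
        tokens[i] = MASK
    return tokens, targets


def reconstruction_accuracy(predictions, targets):
    """Fraction of masked positions whose residue was predicted correctly."""
    hits = sum(predictions.get(i) == aa for i, aa in targets.items())
    return hits / len(targets)


# Hypothetical CDR H3-like fragment, for illustration only.
sequence = "ARDYYGSSYWYFDV"
tokens, targets = mask_sequence(sequence)
```

Only the masked positions contribute to the score, so the model is pushed to infer each hidden residue from its sequence context, which is where structure-dependent patterns can be picked up.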
[2] Rives A, et al., PNAS (2020)
[3] Leinonen R, et al., Bioinformatics (2004)
[4] Dunbar J, et al., Nucleic Acids Res. (2014)
[5] Marze N.A., et al., Prot. Eng. Des. Selection (2016)
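To make "predicting inter-residue distances" concrete, the sketch below shows one common way such targets can be built from a labeled structure: compute all pairwise distances, then discretize them into classes. It covers distances only, not angles. The bin range (4 to 16 Å), the bin count, the function names, and the coordinates are illustrative assumptions, not the settings used by DeepH3 or ATRACH3.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates (in angstroms)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def bin_distances(dmat, d_min=4.0, d_max=16.0, n_bins=24):
    """Map each distance to a class label.

    Label 0 means closer than d_min, label n_bins + 1 means d_max or farther,
    and labels 1..n_bins are equal-width bins in between.
    All thresholds here are hypothetical.
    """
    width = (d_max - d_min) / n_bins
    labels = []
    for row in dmat:
        out = []
        for d in row:
            if d < d_min:
                out.append(0)
            elif d >= d_max:
                out.append(n_bins + 1)
            else:
                out.append(1 + int((d - d_min) / width))
        labels.append(out)
    return labels


# Made-up coordinates for four residues, for illustration only.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (7.6, 20.0, 0.0)]
labels = bin_distances(distance_matrix(coords))
```

Framing the target as a class per residue pair turns structure prediction into a classification problem over an L × L grid, which a network attached to per-residue language-model features can be trained on.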