Visualize BERT Attention

Antje Barth
September 26, 2020

BERT is a revolutionary machine learning model for Natural Language Understanding (NLU) and Natural Language Processing (NLP).

In this talk, I describe how to use Amazon SageMaker Model Debugger to visualize how BERT learns to understand the language of a given set of documents.



  1. Visualize BERT Attention for Natural Language Understanding (NLU) Use Cases using Amazon SageMaker

    Antje Barth, Sr. Developer Advocate AI/ML, Amazon Web Services, @anbarth
  2. About me

    Sr. Developer Advocate for AI and Machine Learning. Co-author of the O'Reilly book "Data Science on AWS". Co-founder of the Düsseldorf chapter of Women in Big Data.
    Chris Fregly @cfregly
  3. Agenda

    • Introduction to NLP Algorithms and Concepts
    • BERT
    • Attention Is All You Need
    • Demo: Visualize BERT attention in model training using Amazon SageMaker
  4. © 2020, Amazon Web Services, Inc. or its affiliates. All

    rights reserved.
  5. Problem statement

    • Natural Language Processing (NLP) is a major field in AI
    • NLP apps require a language model in order to predict the next word
    • Vocabulary size can be hundreds of thousands of words … in millions of documents
    • Can we build a compact mathematical representation of language that will help with a variety of domain-specific NLP tasks?
  6. « You shall know a word by the company it keeps », Firth (1957)

    • Word vectors are built from co-occurrence counts
      Also called word embeddings
      High dimensional: at least 50, up to 300
    • Words with similar meanings should have similar vectors
      "car" ≈ "automobile" ≈ "sedan"
    • The distance between vectors for the same concepts should be similar
      distance("Paris", "France") ≈ distance("Berlin", "Germany")
      distance("hot", "hotter") ≈ distance("cold", "colder")
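
The two properties on this slide (similar words have similar vectors, and related pairs are separated by similar distances) can be checked numerically. The sketch below is only an illustration: the 3-dimensional vectors are invented by hand for this example, whereas real embeddings are learned from a corpus.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real word embeddings are learned and have 50 to 300 dimensions.
vectors = {
    "car":        [0.90, 0.10, 0.00],
    "automobile": [0.88, 0.12, 0.02],
    "banana":     [0.00, 0.20, 0.95],
    "Paris":      [0.50, 0.80, 0.10],
    "France":     [0.50, 0.30, 0.10],
    "Berlin":     [0.20, 0.85, 0.30],
    "Germany":    [0.20, 0.35, 0.30],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Similar meanings should give a high similarity, unrelated words a low one.
sim_close = cosine_similarity(vectors["car"], vectors["automobile"])
sim_far = cosine_similarity(vectors["car"], vectors["banana"])

# The capital-to-country pairs should be separated by a similar distance.
d_fr = euclidean_distance(vectors["Paris"], vectors["France"])
d_de = euclidean_distance(vectors["Berlin"], vectors["Germany"])

print(f"car vs automobile: {sim_close:.3f}, car vs banana: {sim_far:.3f}")
print(f"Paris-France: {d_fr:.3f}, Berlin-Germany: {d_de:.3f}")
```
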
  7. High-level steps

    1. Start from a large text corpus (100s of millions of words, even billions)
    2. Preprocess the corpus into tokens
       Tokenize: « hello, world! » → « <BOS>hello<SP>world<SP>!<EOS> »
    3. Build the vocabulary from the tokens
    4. Learn vector representations for the vocabulary
    … or use pre-trained models with existing vector representations.
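
Steps 2 and 3 above can be sketched in a few lines. This is a simplified illustration, not the tokenizer used in the talk: the regular expression, the `<UNK>` token, and the function names `tokenize` and `build_vocabulary` are assumptions made for this example, and punctuation is kept as separate tokens rather than using `<SP>` markers.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, split into words and punctuation, add boundary markers."""
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    return ["<BOS>"] + pieces + ["<EOS>"]

def build_vocabulary(corpus, min_count=1):
    """Map every token seen at least min_count times to an integer id."""
    counts = Counter(tok for doc in corpus for tok in tokenize(doc))
    vocab = {"<UNK>": 0}  # id 0 is reserved for unknown tokens
    for token, count in sorted(counts.items()):
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab

corpus = ["Hello, world!", "Hello again, world."]
vocab = build_vocabulary(corpus)

# Encode a new sentence; "moon" is not in the vocabulary and maps to <UNK>.
new_tokens = tokenize("Hello, moon!")
ids = [vocab.get(tok, vocab["<UNK>"]) for tok in new_tokens]
print(new_tokens)
print(ids)
```
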
  8. Popular NLP use cases

    Representation learning, Machine Translation, Text Classification, Language Modeling, Sentiment Analysis, Named Entity Recognition, Question Answering
  10. Evolution of NLP algorithms

    Timeline: Word2Vec (Jan 2013), GloVe (Jan 2014), FastText (Jul 2016), Transformer (Jun 2017), ELMo (Feb 2018), BERT (Oct 2018)
    • Word2Vec: Shallow neural network; continuous bag-of-words and continuous skip-gram
    • GloVe: Global Vectors for Word Representation; matrix factorization
    • FastText: Extension of Word2Vec; each word is treated as a set of sub-words (character n-grams)
  11. Limitations of Word2Vec (and family)

    • Some words have different meanings (aka polysemy)
      « Kevin, stop throwing rocks! » vs. « Machine Learning rocks »
      Word2Vec encodes the different meanings of a word as the same vector
    • Bidirectional context is not taken into account
      Previous words (left-to-right) and next words (right-to-left)
  12. Evolution of NLP algorithms

    Timeline: Word2Vec (Jan 2013), GloVe (Jan 2014), FastText (Jul 2016), Transformer (Jun 2017), ELMo (Feb 2018), BERT (Oct 2018)
    • Transformer: "Attention Is All You Need"
    • ELMo: "Embeddings from Language Models"; (pseudo-)bidirectional context using two unidirectional LSTMs
    • BERT: Replaces LSTMs with Transformers implementing true bidirectional attention
  14. BERT: Bidirectional Encoder Representations from Transformers

    BERT improves on ELMo
    • Replace LSTM with Transformers, which deal better with long-term dependencies
    • Truly bidirectional architecture: left-to-right and right-to-left contexts are learned by the same network
    • Pre-Training: Model is trained on unlabeled data over different pre-training tasks.
    • Fine-Tuning: Model is initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream NLP tasks.
    • Pre-trained models: BERT Base and BERT Large

                   Layers   Hidden Units   Parameters
      BERT base    12       768            110M
      BERT large   24       1024           340M
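
The parameter counts in the table can be roughly reproduced from the layer and hidden-unit numbers. The sketch below is a back-of-the-envelope estimate, not an official formula: the vocabulary size of 30522, the 512 positions, the feed-forward width of 4 × hidden, and the pooler layer are assumptions taken from the publicly released BERT configurations and do not appear on the slide.

```python
def bert_parameter_count(layers, hidden, vocab_size=30522,
                         max_positions=512, segment_types=2):
    """Estimate the number of trainable parameters of a BERT encoder."""
    intermediate = 4 * hidden          # assumed feed-forward width
    layer_norm = 2 * hidden            # gain and bias vectors

    # Token, position and segment embedding tables plus one layer norm
    embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm

    # Query, key, value and output projections (weights plus biases)
    attention = 4 * (hidden * hidden + hidden)

    # Two dense layers: hidden -> intermediate -> hidden
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)

    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden  # dense layer applied to the [CLS] output

    return embeddings + layers * per_layer + pooler

base = bert_parameter_count(layers=12, hidden=768)
large = bert_parameter_count(layers=24, hidden=1024)
print(f"BERT base:  {base / 1e6:.1f}M parameters")
print(f"BERT large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the estimates land close to the 110M and 340M figures in the table; most of the parameters sit in the feed-forward and attention projections of each layer.
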
  15. BERT Pre-Training and Fine-Tuning

    • BERT Pre-Training: unsupervised training, e.g. on Wikipedia and BooksCorpus
      Words are randomly masked during training to improve learning
      Sentences are randomly paired to improve Next Sentence Prediction
    • BERT Fine-Tuning: supervised training, e.g. on customer reviews
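
Random masking during pre-training can be sketched as follows. This is a simplified illustration: the 15% masking probability matches the original BERT setup but is not stated on the slide, the function name `mask_tokens` is made up for this example, and the refinement where some selected tokens are kept or replaced by random words is left out.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace random tokens with [MASK] and remember the originals as labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        # Special tokens are never masked
        if tok not in ("[CLS]", "[SEP]") and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)    # the model has to predict this token
        else:
            masked.append(tok)
            labels.append(None)   # position is ignored by the training loss
    return masked, labels

tokens = ["[CLS]", "the", "movie", "was", "great", "[SEP]"]
# A higher probability than 15% is used here so that a short example gets masked
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```
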
  16. BERT MANIA!

    BERT (Original), BioBERT (Biomedical), BERTje (Dutch), SciBERT (Scientific), GermanBERT (German), ClinicalBERT (Healthcare), PatentBERT (Patents), CamemBERT (French)
  17. Under the covers
  18. BERT Create Input Representations

    • Terminology: "Sentence" can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. "Sequence" refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.
    • BERT uses WordPiece embeddings with a 30,000 token vocabulary to convert the raw input text into tokens. It also adds the following special tokens:
      [CLS] The first token of every sequence is always a special classification token.
      [SEP] Sentence pairs are packed together into a single sequence, separated by [SEP], and a learned embedding is added to every token indicating whether it belongs to sentence A or B.
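
Packing one or two sentences into a single sequence can be sketched like this. The function name `pack_sequence` is made up for this example, the inputs are assumed to be already tokenized, and the closing `[SEP]` after the last sentence follows the usual BERT convention rather than anything stated on the slide.

```python
def pack_sequence(tokens_a, tokens_b=None):
    """Build the BERT input tokens and segment ids for one or two sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)            # sentence A, including [CLS] and [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its [SEP]
    return tokens, segment_ids

pair_tokens, pair_segments = pack_sequence(["the", "movie"], ["it", "rocks"])
single_tokens, single_segments = pack_sequence(["great", "film"])
print(pair_tokens, pair_segments)
print(single_tokens, single_segments)
```
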
  19. BERT Example Input Representations

    For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.
    This example shows the creation of BERT embeddings for
    • Sentiment analysis (text classification task).
    • Using a fine-tuned BertBase model.
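
The element-wise sum of the three embeddings can be sketched with small lookup tables. Everything here is a toy assumption: the tables are filled with random numbers instead of learned values, and the hidden size of 4 stands in for the 768 dimensions of BERT base.

```python
import random

HIDDEN = 4                 # toy size; BERT base uses 768
rng = random.Random(0)

def random_table(rows):
    """Stand-in for a learned embedding table with one row per id."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)] for _ in range(rows)]

token_table = random_table(10)     # one row per vocabulary id
segment_table = random_table(2)    # sentence A and sentence B
position_table = random_table(8)   # one row per position in the sequence

def input_embeddings(token_ids, segment_ids):
    """Sum token, segment and position embeddings for every input position."""
    result = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows = zip(token_table[tok], segment_table[seg], position_table[position])
        result.append([t + s + p for t, s, p in rows])
    return result

emb = input_embeddings(token_ids=[1, 5, 3], segment_ids=[0, 0, 1])
print(len(emb), "vectors of size", len(emb[0]))
```
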
  20. BERT Single Layer Components

    • Attention module
      Transformer models are sequence models that rely on the attention mechanism to capture long-term dependencies.
    • Feed-forward module
      Contains most of the parameters of the model.
    • Residual connections and layer normalization modules
      Each module is wrapped with a skip connection, and layer normalization is applied after it.
    Diagram labels: Feed-Forward, Layer Norm, Attention
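
The wrapping described in the last bullet (skip connection, then layer normalization) can be sketched for a single vector. This is a minimal illustration: the learned gain and bias of layer normalization are omitted, and `sublayer` and the doubling `module` are hypothetical names standing in for the attention or feed-forward module.

```python
import math

def layer_norm(vector, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def sublayer(vector, module):
    """Apply a module, add the skip connection, then layer-normalize."""
    transformed = module(vector)
    residual = [x + y for x, y in zip(vector, transformed)]
    return layer_norm(residual)

# Hypothetical module that simply doubles its input
out = sublayer([1.0, 2.0, 3.0, 4.0], module=lambda v: [2.0 * x for x in v])
print([round(x, 3) for x in out])
```
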
  21. BERT Model With 12 or 24 Layers

    • BERT uses multiple (12 or 24) layers, each layer with multiple (12 or 16) attention heads.
    • Model weights are not shared between layers → max. 24*16 = 384 different attention mechanisms
    • The outputs of all heads in the same layer are combined and run through a fully-connected feed-forward module.
    • Multiple attention heads can capture the different meanings of one word in different sentences:
      Tom gives Sarah chocolates.
      Sarah gives Tom chocolates.
  23. Query, Key and Value Vectors

    • Each row in Input X is the embedding vector of one token
    • The Query Vector (Q) represents what kind of information we are looking for (= the token paying attention)
    • The Key Vector (K) represents the relevance to the query (= the token being paid attention to)
    • The Value Vector (V) represents the actual contents of the input.
    • BERT multiplies the same input X by the weights WQ, WK, WV to get the Query, Key, and Value Vectors. These weights are learned during training and are used to generalize the model.
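
The projection of the same input X into query, key and value vectors can be sketched with plain matrix multiplication. The numbers below are hand-picked toy values, not learned weights, and the sizes (2 tokens, embedding size 3, projection size 2) are chosen only to keep the example readable.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Input X: one row per token (2 tokens, embedding size 3)
X = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.0]]

# Toy weight matrices (3 x 2); in BERT these are learned during training
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_V = [[1.0, 1.0], [0.0, 0.0], [0.0, 1.0]]

Q = matmul(X, W_Q)   # what each token is looking for
K = matmul(X, W_K)   # what each token offers to be matched against
V = matmul(X, W_V)   # the content each token contributes

print("Q =", Q)
print("K =", K)
print("V =", V)
```
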
  24. Attention

    Attention is simply a function that takes a sequence of vectors X as input and returns another sequence Y of the same length, composed of vectors of the same length as those in X.
    Example: Sentiment Analysis (Text Classification Task)
    • Each vector in Y is a weighted average of the vectors in X.
    • The attention weights are computed from the dot product of the Query Vector with each Key Vector.
    • Attention weights show how much the model attends to each input in X when computing the weighted average.
    Attention = (formula shown on the slide)
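
For reference, the description above corresponds to the scaled dot-product attention from "Attention Is All You Need". The formula below is written out from that paper, not recovered from the slide; the scaling by the square root of the key dimension d_k and the symbols n, a_ij and v_j are introduced here and are not part of the transcript.

```latex
\[
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V
\]
\[
Y = \operatorname{Attention}(Q, K, V)
  = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
\[
y_i = \sum_{j=1}^{n} a_{ij}\, v_j, \qquad a_{ij} \ge 0, \qquad \sum_{j=1}^{n} a_{ij} = 1
\]
```

Each output vector y_i is therefore a weighted average of the value vectors v_j, which are themselves linear projections of the rows of X.
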
  25. Attention Weights

    How is BERT calculating the attention weights? A compatibility function assigns a score to each pair of words indicating how strongly they should attend to one another.
    • The compatibility score is the dot product of the query vector of one word and the key vector of the other.
    • Attention weights are the normalized scores (after applying the Softmax function).
    • Every word is scored against every other word, so the cost is quadratic in the sequence length: O(n^2).
    Figure: Attention weights for the Query Vector "movie" to each Key Vector
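
The scoring and normalization steps can be sketched as follows. The query and key vectors are toy values, and the division by the square root of the key dimension comes from the Transformer paper rather than from the slide. The nested loop over queries and keys is where the O(n^2) cost comes from.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return an n x n matrix: row i holds the weights of query i over all keys."""
    scale = math.sqrt(len(keys[0]))          # assumed scaling from the Transformer paper
    weights = []
    for q in queries:                        # n queries ...
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]  # ... times n keys
        weights.append(softmax(scores))
    return weights

# Toy query and key vectors for a 3-token input
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

weights = attention_weights(queries, keys)
for row in weights:
    print([round(w, 3) for w in row])
```
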
  26. Sample Attention Patterns

    • Learned by Attention Heads
    • Delimiter-focused attention patterns (aka NoOp)
      Attention head focuses on the [SEP] tokens when it can't find anything else in the input sentence to focus on.
    • Bag of Words attention pattern
      Attention is divided fairly evenly across all words in the same sentence.
    • Next-word attention patterns
      Most attention is focused on the next word in the input sequence.
    Examples shown: e.g. Head 0, e.g. Head 7, e.g. Head 8
  27. BERT Putting All Pieces Together (Example: Text Classification)

    (1) Convert raw text input into (n) tokens using WordPiece tokenization.
    (2) For each token, look up the BERT token embedding (768-dimensional representation), segment embedding, and position embedding. The BERT input embedding is the element-wise sum of these 3 embeddings.
    (3) The input sequence for the BERT model consists of (n) BERT embeddings.
    (4) These BERT embeddings are distributed across all attention heads in the first model layer.
    (5) To calculate attention, each input embedding is transformed into a query, key and value vector by multiplying with three different weights (used to generalize the model).
    (6) The attention head computes the dot product of each query vector with each key vector. The resulting scores are normalized by applying a Softmax function. The results are the attention weights.
    (7) The attention weights are used in combination with the value vectors to compute the weighted sum, aka Attention.
    (8) The attention outputs of the individual heads are then fed into the next BERT layer as the new input.
    (9) As the embeddings pass through the BERT model layers, they pick up more and more contextual information with each layer.
    (10) The final layer output embedding can then be used as input for the specific task, e.g. as input for a Classification layer.
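
Steps (4) to (8) can be sketched for one layer with several attention heads. This is a toy illustration under stated assumptions: weights and inputs are random instead of learned, the sizes (hidden size 4, 2 heads) are far smaller than in BERT, the scaling by the square root of the head dimension comes from the Transformer paper, and the output projection, feed-forward module and layer normalization that follow in a real layer are omitted.

```python
import math
import random

rng = random.Random(0)

def random_matrix(rows, cols):
    """Stand-in for learned weights or input embeddings."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, W_q, W_k, W_v):
    """Steps (5) to (7): project, score, normalize, take the weighted sum."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    scale = math.sqrt(len(K[0]))
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, V))
                        for c in range(len(V[0]))])
    return outputs

def multi_head_attention(X, heads):
    """Steps (4) and (8): run every head on X and concatenate per token."""
    per_head = [attention_head(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    return [[value for head in per_head for value in head[i]]
            for i in range(len(X))]

HIDDEN, NUM_HEADS = 4, 2
HEAD_DIM = HIDDEN // NUM_HEADS
heads = [tuple(random_matrix(HIDDEN, HEAD_DIM) for _ in range(3))
         for _ in range(NUM_HEADS)]

X = random_matrix(3, HIDDEN)          # 3 input embeddings
Y = multi_head_attention(X, heads)    # 3 contextualized vectors of the same size
print(len(Y), "tokens,", len(Y[0]), "values per token")
```

Because each head has its own projection weights, each one can attend to a different aspect of the same input; concatenating the head outputs restores the original hidden size so the result can be passed on to the next layer.
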
  30. BertViz

    Open source tool for visualizing attention in Transformer models.
    • Attention Head view: Visualizes the attention patterns produced by one or more attention heads in a given transformer layer.
    • Model view: Provides a bird's-eye view of attention across all of the model's layers and heads.
    • Neuron view: Visualizes the individual neurons in the query and key vectors and shows how they are used to compute attention.
  31. Example: BERT-based paraphrase detection model

    Sentence 1: "the eminem tune was totally awesome."
    Sentence 2: "dude, that song from eminem was sick!" ("sick" is used as a synonym of "awesome" here)
    BERT classified those two inputs as paraphrase = FALSE.
  33. Amazon SageMaker

    Helps you build, train, and deploy models
    • Collect and prepare training data: Fully managed data processing jobs / data labeling workflows
    • Choose or bring your own ML algorithm: Collaborative notebooks, built-in algorithms/models
    • Set up and manage environments for training: One-click training
    • Train, debug, and tune models: Debugging and optimization
    • Manage training runs: Visually track and compare experiments
    • Deploy model in production: One-click deployment and auto-scaling
    • Monitor models: Automatically spot concept drift
    • Validate predictions: Add human review of predictions
    • Scale and manage the production environment: Fully managed with auto-scaling for 75% less
    Web-based IDE for ML. Automatically build and train models.
  34. Amazon SageMaker Debugger

    • Relevant data capture: Zero code change
    • Automatic error detection: Built-in rules to detect training issues
    • Real-time monitoring: Debug data while training is ongoing
    • Amazon SageMaker Studio integration: Analyze & debug from Amazon SageMaker Studio
  35. Real-time monitoring

    Built-in rules:
    • Vanishing gradients
    • Exploding gradients
    • Activations: Tanh and sigmoid saturation, dying ReLU
    • Weight update ratio
    • Loss: Not decreasing, early stopping, overfitting
    • Class imbalance
    • Tensor: Variance, only zeros, no update across steps
    • Stalled training
    Diagram labels: weights, gradients, biases, losses; Training Environment
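
To give an idea of what a rule such as "vanishing gradients" checks, here is a conceptual sketch. It is not the implementation used by Amazon SageMaker Debugger and does not use its API: the function name `vanishing_gradient_step`, the dictionary of captured gradients, and the threshold value are all hypothetical choices for this example.

```python
def vanishing_gradient_step(gradients_by_step, threshold=1e-7):
    """Return the first step whose mean absolute gradient is below threshold.

    gradients_by_step maps a training step number to a flat list of
    gradient values captured at that step. Returns None if no step
    falls below the (hypothetical) threshold.
    """
    for step in sorted(gradients_by_step):
        grads = gradients_by_step[step]
        mean_abs = sum(abs(g) for g in grads) / len(grads)
        if mean_abs < threshold:
            return step
    return None

# Hypothetical captured gradients: they shrink as training progresses
captured = {
    0: [0.1, -0.2, 0.05],
    100: [1e-3, -2e-3, 5e-4],
    200: [1e-9, -3e-9, 2e-9],
}
healthy = {0: [0.1, -0.2], 100: [0.05, -0.07]}

print(vanishing_gradient_step(captured))
print(vanishing_gradient_step(healthy))
```
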
  38. AllenNLP

    • A framework to explain predictions of NLP models
    • Uses a gradient-based approach, which is better for explaining classification predictions such as sentiment analysis.
    • Live Demo: Sentiment Analysis with RoBERTa
    • Why is the sentence "i love this movie so much" classified as positive? → AllenNLP visualizes the top n most important words that led to that prediction.
  39. More resources

    • Data Science on AWS workshop:
    • GitHub Repo:
    • Monitor BERT Attention: aws/workshop/blob/59b412cf7715a2175d765eaf9217df627ecc3cda/07_train/wip/sm_bert_viz/bert_attention_head_view.ipynb
    • AllenNLP for NLP Explainability: aws/workshop/blob/ffaac4e5ef20922864613f28290c5abb0df07f5f/07_train/wip/99_AllenNLP_RoBERTa_Prediction.ipynb
  40. Thank you!

    Antje Barth @anbarth