
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Slides for the EMNLP paper reading group in my lab.

ryoma yoshimura

January 27, 2020

1. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers and Iryna Gurevych
EMNLP 2019

2020/01/27 EMNLP paper reading group
Presenter: Yoshimura

2. Introduction
• BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have SOTA performance on sentence-pair regression tasks.
• Problem: massive computational overhead.
◦ Finding the most similar pair in a collection of 10,000 sentences requires about 65 hours with BERT.
◦ Not applicable to large-scale semantic similarity comparison, clustering, or information retrieval via semantic search.
• Proposal: Sentence-BERT (SBERT)
◦ Uses siamese and triplet network structures.
◦ Fine-tuned on SNLI and Multi-NLI data.
• Experiments: STS tasks and transfer learning tasks.
◦ Outperforms other SOTA sentence embedding methods.
◦ 65 hours with BERT vs. about 5 seconds with SBERT.
◦ BERT embeddings are unsuitable for use with common similarity measures like cosine similarity.

3. Related Work
• InferSent (Conneau et al., 2017)
◦ Siamese BiLSTM network with max-pooling over the output.
◦ Trained on the SNLI dataset and the Multi-Genre NLI dataset.
◦ Outperforms unsupervised methods like SkipThought.
• Universal Sentence Encoder (Cer et al., 2018)
◦ Transformer network; augments unsupervised learning with training on SNLI.
• BERT (Devlin et al., 2018)
◦ Researchers have started to input individual sentences into BERT and derive fixed-size sentence embeddings.
◦ The commonly used approach is to average the BERT output layer or to use the output of the first token (the [CLS] token).
◦ There is so far no evaluation of whether these methods lead to useful sentence embeddings.

4. Model
• Adds a pooling operation to the output of BERT/RoBERTa.
◦ Pooling strategy: CLS-token, MEAN, or MAX.
• Creates siamese and triplet networks (Schroff et al., 2015).
◦ Weights are updated such that the sentence embeddings are semantically meaningful and can be compared with cosine similarity (see the sketch below).
• The network structure depends on the available training data.
◦ Classification
◦ Regression
◦ Triplet
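
A minimal sketch of the siamese setup with MEAN pooling, assuming the Hugging Face transformers API and the bert-base-uncased checkpoint (the encode helper is illustrative, not the authors' code):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def encode(sentences):
    # Both branches of the siamese network share the same BERT weights.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state            # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()    # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # MEAN pooling

u, v = encode(["A man is playing a guitar.", "Someone plays an instrument."])
similarity = torch.cosine_similarity(u, v, dim=0)           # comparison at inference time
```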

5. Classification and Regression
• Classification objective function
◦ Cross-entropy loss over a softmax classifier.
◦ W_t: trainable parameter; u, v: sentence embeddings.
• Regression objective function
◦ Mean squared error loss on the cosine similarity between u and v.
(both objectives are written out below)
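
Reconstructed from the definitions above (the concatenation (u, v, |u − v|) is the one used in the paper's classification setup):

```latex
% Classification objective: 3-way softmax over the concatenated features,
% trained with cross-entropy loss against the NLI label
o = \mathrm{softmax}\bigl( W_t \, (u, v, |u - v|) \bigr)

% Regression objective: mean squared error on the cosine similarity
\mathcal{L} = \bigl( \cos(u, v) - \text{gold similarity} \bigr)^2
```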

6. Triplet
• Loss function (reconstructed below).
• The triplet loss tunes the network such that the distance between a and p is smaller than the distance between a and n.
• As the distance metric they use Euclidean distance and set ε = 1.

a: anchor sentence
p: positive sentence
n: negative sentence
s_x: sentence embedding of sentence x
ε: margin
‖·‖: a distance metric
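
The triplet loss, written out from the variables defined above:

```latex
\mathcal{L} = \max\bigl( \lVert s_a - s_p \rVert - \lVert s_a - s_n \rVert + \epsilon,\; 0 \bigr)
```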
 

7. Training details
• Data
◦ SNLI and the Multi-Genre NLI dataset.
◦ SNLI: 570K sentence pairs with the labels contradiction, entailment, and neutral.
◦ Multi-NLI: 430K pairs covering a range of spoken and written text.
• Fine-tuning SBERT with a 3-way softmax-classifier objective function for one epoch.
• Hyperparameters
◦ Batch size: 16
◦ Optimizer: Adam
◦ Learning rate: 2e-5
◦ Linear learning rate warmup over 10% of the training data
◦ Pooling strategy: MEAN
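
A minimal sketch of this fine-tuning setup, assuming the sentence-transformers library API (the toy training pair and the warmup step count are placeholders, not the paper's data):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# BERT + MEAN pooling, as in the slide above.
word_emb = models.Transformer("bert-base-uncased")
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_emb, pooling])

# NLI pairs with integer labels (contradiction / entailment / neutral); toy example only.
train_examples = [InputExample(texts=["A man inspects a uniform.", "The man is sleeping."], label=0)]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(model=model,
                                sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
                                num_labels=3)

# One epoch, learning rate 2e-5, linear warmup (placeholder of 1000 steps).
model.fit(train_objectives=[(train_loader, train_loss)],
          epochs=1,
          warmup_steps=1000,
          optimizer_params={"lr": 2e-5})
```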

8. Evaluation - Unsupervised STS
• Directly using the output of BERT leads to rather poor performance.
• Using the siamese network substantially improves the correlation.
• SBERT performs worse than Universal Sentence Encoder on SICK-R.
◦ USE is trained on various datasets (news, QA, discussion forums).
◦ SBERT is pre-trained only on Wikipedia and NLI.
• Only minor differences between SBERT and SRoBERTa.

9. Evaluation - Supervised STS
• Two setups
◦ Training only on STSb.
◦ First training on NLI, then training on STSb.
• The latter strategy leads to a slight improvement of 1-2 points.
• The two-step approach had an especially large impact for the BERT cross-encoder.
 

10. Evaluation - Argument Facet Similarity
• Argument Facet Similarity (AFS) corpus (Misra et al., 2016)
◦ 6,000 sentential argument pairs.
◦ From social media dialogs on three controversial topics.
▪ Gun control, gay marriage, and the death penalty.
◦ Annotated on a scale from 0 to 5.
▪ 0: different topic, 5: completely equivalent.
◦ The similarity notion is fairly different from the STS datasets.
▪ To be considered similar, arguments must not only make similar claims, but also provide similar reasoning.
▪ Simple unsupervised methods perform badly on this dataset (Reimers et al., 2019).
11. Evaluation - Argument Facet Similarity
• BERT is able to use attention to compare both sentences directly.
• SBERT must map individual sentences such that arguments with similar claims and reasons are close.
• It appears to require more than just two topics for training to work on par with BERT.
 
 
 

12. Evaluation - Wikipedia Sections Distinction
• Large dataset of labeled sentence triplets (Dor et al., 2018)
◦ Sentences in the same section are thematically closer than sentences in different sections.
◦ The anchor and the positive example come from the same section.
◦ The negative example comes from a different section of the same article.
• Train: 1.8 million triplets / Test: 222,957 triplets
• Evaluation metric: triplet accuracy (see the sketch below).
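
A minimal sketch of this accuracy metric, assuming any sentence encoder (for example the encode helper sketched earlier) that returns one vector per sentence:

```python
import torch

def triplet_accuracy(anchors, positives, negatives, encode):
    # A triplet counts as correct when the anchor embedding is closer
    # (Euclidean distance) to the positive example than to the negative one.
    a, p, n = encode(anchors), encode(positives), encode(negatives)
    correct = torch.norm(a - p, dim=1) < torch.norm(a - n, dim=1)
    return correct.float().mean().item()
```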

13. Evaluation - Wikipedia Sections Distinction
• Dor et al. fine-tuned a BiLSTM with triplet loss to derive sentence embeddings for this dataset.
• SBERT clearly outperforms the BiLSTM approach.

14. Evaluation - SentEval
• SentEval (Conneau and Kiela, 2018)
◦ Toolkit to evaluate the quality of sentence embeddings.
• The purpose of SBERT sentence embeddings is not transfer learning to other tasks.
◦ Fine-tuning BERT as described by Devlin et al. is more suitable for that.
◦ However, SentEval can still give an impression of the quality of the sentence embeddings for various tasks.
• Experiments on the following seven SentEval transfer tasks:
◦ MR: Sentiment prediction for movie review snippets.
◦ CR: Sentiment prediction of customer product reviews.
◦ SUBJ: Subjectivity prediction of sentences from movie reviews.
◦ MPQA: Phrase-level opinion polarity classification from newswire.
◦ SST: Stanford Sentiment Treebank with binary labels.
◦ TREC: Fine-grained question-type classification from TREC.
◦ MRPC: Microsoft Research Paraphrase Corpus.

15. Evaluation - SentEval
• SBERT achieves the best performance on 5 out of 7 tasks.
• Even though transfer learning is not the purpose of SBERT, it outperforms other SOTA sentence embeddings.
• Sentence embeddings from SBERT capture sentiment information well.
• USE was pre-trained on QA data, which appears to explain its advantage on the TREC question-type task.
 

16. Evaluation - SentEval
• They conclude that averaged BERT embeddings and the [CLS]-token output are infeasible to use with cosine similarity or with Manhattan / Euclidean distance.
• Using the described fine-tuning setup with a siamese network structure on NLI datasets yields sentence embeddings that achieve a SOTA for the SentEval toolkit.

17. Ablation Study
Pooling strategy
• Minor impact on classification.
• Large impact on regression.
◦ Conneau et al. (2017) found it beneficial for InferSent to use MAX instead of MEAN.
Concatenation
• InferSent and USE both use (u, v, |u−v|, u∗v).
◦ In this architecture, adding u∗v decreased the performance.
• |u−v| is the most important element (see the sketch below).
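
A minimal sketch of the concatenation variants compared in the ablation (the helper name is illustrative):

```python
import torch

def classifier_features(u: torch.Tensor, v: torch.Tensor, with_product: bool = False) -> torch.Tensor:
    # SBERT's classification head feeds (u, v, |u - v|) to the softmax classifier W_t;
    # InferSent/USE additionally append u * v, which hurt performance in this architecture.
    parts = [u, v, torch.abs(u - v)]
    if with_product:
        parts.append(u * v)
    return torch.cat(parts, dim=-1)
```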
 

18. Computational Efficiency
• Intel i7-5820K CPU @ 3.30GHz, Nvidia Tesla V100 GPU
• Smart batching
◦ Sentences with similar lengths are grouped together and are only padded to the longest element in a mini-batch (see the sketch below).
• On a GPU, SBERT is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder.
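
A minimal sketch of the smart-batching idea (a rough whitespace token count stands in for proper tokenization):

```python
def smart_batches(sentences, batch_size=32):
    # Group sentences of similar length so each mini-batch only needs
    # padding up to its own longest member, not the corpus maximum.
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    for start in range(0, len(order), batch_size):
        yield [sentences[i] for i in order[start:start + batch_size]]
```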
 

19. Conclusions
• They showed that BERT embeddings are unsuitable for use with common similarity measures like cosine similarity.
• To overcome this shortcoming, they presented SBERT.
◦ SBERT fine-tunes BERT in a siamese / triplet network architecture.
• Evaluation on various common benchmarks.
◦ Improvement over SOTA sentence embedding methods.
• SBERT is computationally efficient.