Slide 1


Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
 Nils Reimers and Iryna Gurevych
 EMNLP 2019
 
 2020/01/27 EMNLP Paper Reading Group 
 Presenter: Yoshimura
 


Slide 2


Introduction
 ● BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have SOTA performance on sentence-pair regression tasks.
 ● Problem: Massive computational overhead
 ○ Finding the most similar pair in a collection of 10,000 sentences requires about 65 hours with BERT (pair count sketched below).
 ○ Not applicable to large-scale semantic similarity comparison, clustering, and information retrieval via semantic search.
 ● Proposal: Sentence-BERT (SBERT)
 ○ Using siamese and triplet network structures.
 ○ Fine-tuning on SNLI and Multi-NLI data.
 ● Experiments: STS tasks and transfer learning tasks.
 ○ Outperforms other SOTA sentence embedding methods. 
 ○ The ~65-hour BERT computation drops to about 5 seconds with SBERT.
 ○ BERT embeddings are unsuitable for use with common similarity measures like cosine-similarity.
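As a quick back-of-the-envelope check on the pair count behind the 65-hour figure (a sketch, not from the slides): a cross-encoder must run BERT on every sentence pair, while a bi-encoder like SBERT encodes each sentence once and compares embeddings with cosine-similarity.

```python
from math import comb

n = 10_000
cross_encoder_passes = comb(n, 2)  # every sentence pair goes through BERT: 49,995,000 forward passes
bi_encoder_passes = n              # SBERT encodes each sentence once: 10,000 forward passes
print(cross_encoder_passes, bi_encoder_passes)
```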


Slide 3


Related Work
 ● InferSent (Conneau et al., 2017)
 ○ Siamese BiLSTM network with max-pooling over the output.
 ○ Trained on the SNLI dataset and the Multi-Genre NLI dataset.
 ○ Outperforms unsupervised methods like SkipThought.
 ● Universal Sentence Encoder (Cer et al., 2018)
 ○ Transformer network; augments unsupervised learning with training on SNLI.
 ● BERT (Devlin et al., 2018)
 ○ Researchers have started to input individual sentences into BERT and to derive fixed-size sentence embeddings.
 ○ A commonly used approach is to average the BERT output layer or to use the output of the first token (the [CLS] token).
 ○ So far there has been no evaluation of whether these methods lead to useful sentence embeddings.


Slide 4


Model
 
 ● Adds a pooling operation to the output of BERT/RoBERTa.
 ○ Pooling strategy: CLS-token, MEAN, or MAX (encoding path sketched below).
 ● Creates siamese and triplet networks (Schroff et al., 2015).
 ○ Weights are updated such that the sentence embeddings are semantically meaningful and can be compared with cosine-similarity.
 ● The network structure depends on the available training data.
 ○ Classification
 ○ Regression 
 ○ Triplet
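As referenced above, a minimal sketch of the encoding path (BERT + MEAN pooling, compared with cosine-similarity). It assumes the Hugging Face transformers package; the model name and example sentences are illustrative, not the authors' exact setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = encoder(**batch).last_hidden_state      # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()           # ignore padding tokens
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)  # MEAN pooling

u, v = encode(["A man is playing guitar.", "Someone plays an instrument."])
similarity = torch.nn.functional.cosine_similarity(u, v, dim=0)    # compare with cosine-similarity
```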


Slide 5


Classification and Regression
 ● Classification Objective Function
 ○ Cross-entropy loss.
 W_t: trainable weight matrix
 u, v: sentence embeddings 
 ● Regression Objective Function
 ○ Mean-squared-error loss on the cosine-similarity between u and v (both objectives written out below).
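The two objectives written out (the classification form follows the paper; the gold score y and the loss symbol in the regression line are my notation):

```latex
% Classification: concatenate u, v and |u - v|, multiply by trainable W_t (R^{3n x k}),
% train with cross-entropy on the softmax output.
o = \operatorname{softmax}\bigl(W_t\,(u, v, |u - v|)\bigr)

% Regression: cosine-similarity between u and v, trained with mean-squared-error
% against the gold similarity score y.
\mathcal{L}_{\text{reg}} = \bigl(\cos(u, v) - y\bigr)^{2}
```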


Slide 6


Triplet 
 ● Loss function
 max(‖s_a − s_p‖ − ‖s_a − s_n‖ + ε, 0)
 
 ● The triplet loss tunes the network such that the distance between a and p is smaller than the distance between a and n (PyTorch sketch at the end of this slide). 
 ● As the distance metric they use Euclidean distance and set ε = 1.
 
 a: anchor sentence
 p: positive sentence 
 n: negative sentence 
 s_x: sentence embedding of sentence x 
 ε: margin
 ‖·‖: a distance metric 
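A minimal PyTorch sketch of this objective (Euclidean distance, ε = 1); the embeddings below are random placeholders rather than real SBERT outputs.

```python
import torch

# Built-in triplet loss with Euclidean distance (p=2) and margin 1, matching the setting above.
triplet_loss = torch.nn.TripletMarginLoss(margin=1.0, p=2)

s_a = torch.randn(16, 768)  # anchor sentence embeddings (placeholder values)
s_p = torch.randn(16, 768)  # positive sentence embeddings
s_n = torch.randn(16, 768)  # negative sentence embeddings

loss = triplet_loss(s_a, s_p, s_n)  # mean of max(‖s_a − s_p‖ − ‖s_a − s_n‖ + 1, 0)
```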
 


Slide 7


Training details
 ● Data
 ○ SNLI and the Multi-Genre NLI dataset 
 ○ SNLI: 570K sent-pairs with labels contradiction, entailment, neutral.
 ○ Multi-NLI: 430K sentence pairs covering a range of genres of spoken and written text.
 ● Fine-tuning SBERT with a 3-way softmax-classifier objective function for one epoch.
 ● Hyperparameters (training setup sketched below)
 ○ Batch-size: 16
 ○ Optimizer: Adam
 ○ Learning rate: 2e-5
 ○ Linear learning-rate warm-up over 10% of the training data
 ○ Pooling strategy: MEAN
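A sketch of the optimizer/scheduler setup these hyperparameters imply, assuming PyTorch and Hugging Face transformers; the linear layer is a stand-in for the full SBERT network with its 3-way softmax head, not the authors' code.

```python
import torch
from transformers import get_linear_schedule_with_warmup

classifier_head = torch.nn.Linear(768 * 3, 3)          # stand-in head over (u, v, |u − v|), 3 NLI labels

batch_size, epochs, lr = 16, 1, 2e-5
num_pairs = 570_000 + 430_000                          # SNLI + Multi-NLI sentence pairs
num_training_steps = num_pairs // batch_size * epochs
num_warmup_steps = int(0.1 * num_training_steps)       # linear warm-up over 10% of the training data

optimizer = torch.optim.Adam(classifier_head.parameters(), lr=lr)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)
criterion = torch.nn.CrossEntropyLoss()                # 3-way softmax-classifier objective
```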


Slide 8


Evaluation - Unsupervised STS
 ● Directly using the output of BERT leads to rather poor performances.
 ● Using siamese network substantially improves the correlation.
 ● SBERT performs worse than Universal Sentence Encoder on SICK-R.
 ○ USE was trained on various datasets (news, question-answer pages, discussion forums).
 ○ SBERT is pre-trained only on Wikipedia (via BERT) and on NLI data. 
 ● Only minor differences between SBERT and SRoBERTa.


Slide 9


Evaluation - Supervised STS
 ● Two setups:
 ○ Training only on STSb.
 ○ First training on NLI, then training on STSb.
 ● The latter strategy leads to a slight improvement of 1-2 points.
 ● The two-step approach had an especially large impact for the BERT cross-encoder.
 


Slide 10


Evaluation - Argument Facet Similarity
 ● Argument Facet Similarity (AFS) corpus. (Misra et al., 2016)
 ○ 6,000 sentential argument pairs.
 ○ From social media dialogs on three controversial topics.
 ■ gun control, gay marriage, and death penalty.
 ○ Annotated on a scale from 0 to 5.
 ■ 0: different topic, 5: completely equivalent.
 ○ The similarity notion is fairly different from the STS datasets.
 ■ To be considered similar, arguments must not only make similar claims, but also provide similar reasoning.
 ■ Simple unsupervised methods perform badly on this dataset (Reimers et al., 2019).

Slide 11


Evaluation - Argument Facet Similarity
 
 
 ● BERT (as a cross-encoder) can use attention to compare both sentences directly.
 ● SBERT must map individual sentences into a space where arguments with similar claims and reasons are close. 
 ● It appears to require more than just two topics of training data to work on par with BERT.
 
 
 


Slide 12


Evaluation - Wikipedia Sections Distinction
 ● Large dataset of labeled sentence triplets (Dor et al., 2018)
 ○ sentences in the same section are thematically closer than sentences in different sections.
 ○ The anchor and the positive example come from the same section
 ○ The negative example comes from a different section of the same article.
 ● Train: 1.8 million triplets / Test: 222,957 triplets
 ● Evaluation metric: accuracy, i.e., whether the positive example is closer to the anchor than the negative example (sketched below)
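A small sketch of how that accuracy can be computed, assuming a triplet counts as correct when the positive is closer to the anchor than the negative; dummy embeddings stand in for SBERT outputs.

```python
import torch

def triplet_accuracy(s_a, s_p, s_n):
    d_pos = torch.norm(s_a - s_p, dim=-1)   # anchor-positive Euclidean distance
    d_neg = torch.norm(s_a - s_n, dim=-1)   # anchor-negative Euclidean distance
    return (d_pos < d_neg).float().mean()   # fraction of correctly ordered triplets

acc = triplet_accuracy(torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 768))
```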


Slide 13


Evaluation - Wikipedia Sections Distinction
 ● Dor et al. fine-tuned a BiLSTM with triplet loss to derive sentence embeddings for this dataset.
 ● SBERT clearly outperforms the BiLSTM approach.


Slide 14


Evaluation - SentEval
 ● SentEval (Conneau and Kiela, 2018)
 ○ Toolkit to evaluate the quality of sentence embeddings.
 ● The purpose of SBERT sentence embeddings (SE) is not transfer learning to other tasks.
 ○ Fine-tuning BERT as described by Devlin et al. is more suitable there.
 ○ However, SentEval can still give an impression of the quality of the SE for various tasks.
 ● Experiments on the following seven SentEval transfer tasks:
 ○ MR: Sentiment prediction for movie review snippets. 
 ○ CR: Sentiment prediction of customer product reviews. 
 ○ SUBJ: Subjectivity prediction of sentences from movie reviews. 
 ○ MPQA: Phrase-level opinion polarity classification from newswire. 
 ○ SST: Stanford Sentiment Treebank with binary labels. 
 ○ TREC: Fine grained question-type classification from TREC. 
 ○ MRPC: Microsoft Research Paraphrase Corpus. 


Slide 15


Evaluation - SentEval
 ● SBERT achieves the best performance on 5 out of 7 tasks.
 ● Even though transfer learning is not the purpose of SBERT, it outperforms other SOTA sentence embeddings.
 ● Sentence embeddings from SBERT capture sentiment information well.
 ● Exception: TREC (question-type classification), where USE performs better, as it was pre-trained on QA data.
 


Slide 16


Evaluation - SentEval
 ● They conclude that averaged BERT embeddings and the CLS-token output are infeasible to use with cosine-similarity or with Manhattan / Euclidean distance. 
 ● Using the described fine-tuning setup with a siamese network structure on NLI datasets yields sentence embeddings that achieve SOTA on the SentEval toolkit.


Slide 17


Ablation Study
 Pooling Strategy
 ● Minor impact for the classification objective.
 ● Large impact for the regression objective.
 ○ Conneau et al. (2017) found it beneficial for InferSent to use MAX instead of MEAN.
 Concatenation
 ● InferSent and USE both use (u, v, |u−v|, u∗v).
 ○ In this architecture, adding u∗v decreased the performance.
 ● |u−v| is the most important component (concatenation modes sketched below).
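A minimal sketch of the concatenation modes compared here (tensor sizes are illustrative):

```python
import torch

u, v = torch.randn(16, 768), torch.randn(16, 768)  # sentence embeddings (batch, dim)

sbert_features     = torch.cat([u, v, torch.abs(u - v)], dim=-1)         # (u, v, |u−v|): SBERT's choice
infersent_features = torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)  # adding u∗v hurt in this setup
```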
 


Slide 18


Computational Efficiency
 ● Intel i7-5820K CPU@ 3.30GHz, Nvidia Tesla V100 GPU
 ● Smart Batching (sketched below)
 ○ Sentences with similar lengths are grouped together and only padded to the longest element in a mini-batch. 
 ● On a GPU, SBERT is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. 
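A minimal sketch of the smart-batching idea (illustrative only; in the real implementation padding happens at the token level inside the tokenizer):

```python
def smart_batches(sentences, batch_size=16, pad_token="[PAD]"):
    ordered = sorted(sentences, key=lambda s: len(s.split()))   # group similar lengths together
    for i in range(0, len(ordered), batch_size):
        batch = [s.split() for s in ordered[i:i + batch_size]]
        max_len = max(len(tokens) for tokens in batch)          # longest element in this mini-batch
        yield [tokens + [pad_token] * (max_len - len(tokens)) for tokens in batch]
```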
 


Slide 19


Conclusions
 ● They showed that BERT embeddings are unsuitable for use with common similarity measures like cosine-similarity.
 ● To overcome this shortcoming, they presented SBERT.
 ○ SBERT fine-tunes BERT in a siamese / triplet network architecture.
 ● Evaluation on various common benchmarks
 ○ Improvement over SOTA sentence embedding methods. 
 ● SBERT is computationally efficient.