
[Reading Group] SimCSE: Simple Contrastive Learning of Sentence Embeddings

Hayato Tsukagoshi
February 11, 2022

An overview of SimCSE, a remarkably simple method that set a new state of the art for sentence embeddings by combining pre-trained language models with contrastive learning.

Transcript

  1. SimCSE: Simple Contrastive Learning of Sentence Embeddings
     Tianyu Gao, Xingcheng Yao, and Danqi Chen. EMNLP 2021
     URL: https://aclanthology.org/2021.emnlp-main.552.pdf
     Presenter: Hayato Tsukagoshi, Graduate School of Informatics, Nagoya University, Japan
  2. Paper overview
     •Proposes SimCSE, a sentence embedding method based on contrastive learning
     •Comes in two variants, depending on how positive pairs are constructed
     Unsupervised SimCSE (unsup-SimCSE)
     •Treats two embeddings of the same sentence, produced with different dropout masks, as a positive pair
      • Simply feed the same sentence through the same model twice
     •SOTA among unsupervised methods on the STS tasks
      • On par with the existing supervised baseline (Sentence-BERT)
     Supervised SimCSE (sup-SimCSE)
     •Uses entailment pairs from NLI datasets as positives
      • Using contradiction pairs as hard negatives improves performance further
     •SOTA on the STS tasks, far ahead of other sentence embedding methods
  3. Outline
     •Introduction
      • Sentence embeddings
      • Contrastive learning
      • The Semantic Textual Similarity (STS) task
     •Related work on sentence embeddings
      • Building sentence embeddings from word embeddings / sentence similarity measures
      • Pre-BERT sentence embedding models and methods
      • Post-BERT sentence embedding models and methods
     •SimCSE
      • Preliminary studies
      • Method overview
      • Experimental setup / results
  4. Disclaimer / notes
     •Unless noted otherwise, figures and tables are quoted from the paper
     •If you spot mistakes in these slides or anything bothers you, please contact @hayato_tkgs
      • Casual chat and research/career consultations are also fine; feel free to reach out about anything
     •The related-work sections cover sentence embedding research in general
      • If you only want the related work mentioned in the SimCSE paper itself, please consult the original paper
      • These slides may contain misunderstandings or overlook related work that deserves mention; again, I would appreciate it if you pointed such cases out
     •🧐← statements marked with this emoji are the presenter's personal impressions
  5. Introduction: sentence embeddings
     •Sentence embedding: representing a sentence as a fixed-length vector
      • Useful for semantic similarity computation, document / question-answering retrieval, and sentence/document clustering
     •In principle, any information can be embedded (e.g., syntactic information rather than meaning)
      • Most research focuses on vector representations that encode sentence meaning
     •Similarity can be computed easily and cheaply
      • Approximate nearest-neighbor libraries [1, 2] find similar vectors at high speed
     •Sentence embeddings can also serve as features for other tasks
     Evaluation measures
     •The Semantic Textual Similarity (STS) task
     •Performance on downstream tasks such as text classification … representative example: SentEval [3]
     •Clustering performance (clustering accuracy)
     [1] https://github.com/facebookresearch/faiss
     [2] https://github.com/nmslib/nmslib
     [3] Conneau+: SentEval: An Evaluation Toolkit for Universal Sentence Representations, LREC '18
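As a concrete illustration of the cheap similarity search mentioned above, here is a minimal sketch using faiss (library [1]); the embeddings are random stand-ins for real sentence embeddings, and L2-normalizing them makes inner-product search equivalent to cosine similarity.

    import faiss  # pip install faiss-cpu
    import numpy as np

    dim = 384
    embeddings = np.random.rand(10_000, dim).astype("float32")  # stand-in sentence embeddings
    faiss.normalize_L2(embeddings)  # after normalization, inner product == cosine similarity

    index = faiss.IndexFlatIP(dim)  # exact inner-product index
    index.add(embeddings)
    scores, ids = index.search(embeddings[:5], 3)  # top-3 neighbors for the first 5 "sentences"
    print(ids)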
  6. Introduction: the Semantic Textual Similarity (STS) task
     •Uses STS datasets: sentence pairs annotated with semantic similarity scores
     •The model computes the semantic similarity of each pair; correlating these scores with human judgments measures how well the model captures sentence meaning
     •Common practice for evaluating sentence embeddings:
      • Use the cosine similarity between two sentence embeddings as the semantic similarity of the sentences
      • Unsupervised setting: no training on the STS datasets (e.g., no regression model over similarity scores)
      • Report Spearman's rank correlation between the model's similarities and the human-judged similarities
     (Figure: actual examples from an STS dataset)
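A minimal sketch of this evaluation protocol, assuming a hypothetical encode function that maps a list of sentences to an (n, dim) NumPy array:

    import numpy as np
    from scipy.stats import spearmanr

    def sts_eval(encode, sents1, sents2, gold_scores):
        # encode: list[str] -> np.ndarray of shape (n, dim); assumed, not part of the paper
        e1, e2 = encode(sents1), encode(sents2)
        cos = (e1 * e2).sum(axis=1) / (
            np.linalg.norm(e1, axis=1) * np.linalg.norm(e2, axis=1))
        # Spearman's rank correlation between model similarity and human judgments
        return spearmanr(cos, gold_scores).correlation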
  7. Introduction: the Semantic Textual Similarity (STS) task
     •STS evaluation typically uses STS12-16 [4-8], STS Benchmark [9], and SICK-R [10]
      • Every dataset labels sentence pairs with a real-valued semantic similarity
      • The similarity range is 0-5 for STS12-16 and STS Benchmark, 1-5 for SICK-R
      • STS12-16 provide only test sets; STS Benchmark has train / dev / test sets
     •The STS Benchmark dev set is sometimes used for hyperparameter tuning
      • Besides tuning the learning rate etc., SimCSE evaluates on it every 250 training steps and keeps the best checkpoint for final evaluation
     •Beware: the STS evaluation protocol differs across papers
      • e.g., simple vs. weighted averages of Spearman / Pearson (rank) correlation coefficients
      • Appendix B of the SimCSE paper describes this; recommended reading
     [4] Agirre+: SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity, *SEM '12
     [5] Agirre+: *SEM 2013 shared task: Semantic Textual Similarity, *SEM '13
     [6] Agirre+: SemEval-2014 Task 10: Multilingual Semantic Textual Similarity, SemEval '14
     [7] Agirre+: SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability, SemEval '15
     [8] Agirre+: SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation, SemEval '16
     [9] Cer+: SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation, SemEval '17
     [10] Marelli+: A SICK cure for the evaluation of compositional distributional semantic models, LREC '14
  8. Introduction: the Semantic Textual Similarity (STS) task, continued
     •Sentence embeddings extracted naively from BERT are known to perform poorly on STS [11]
      • e.g., the average of BERT's contextualized word embeddings, or the [CLS] vector
      • Averaging static word embeddings such as GloVe or fastText outperforms BERT
     •Note, however, that BERT-derived sentence embeddings do perform reasonably well on downstream tasks (e.g., sentiment classification)
     •The embedding space of pre-trained language models such as BERT is anisotropic [12], and this anisotropy has been suggested to hurt STS performance [13]
     [11] Reimers+: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, EMNLP '19
     [12] Ethayarajh: How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings, EMNLP '19
     [13] Li+: On the Sentence Embeddings from Pre-trained Language Models, EMNLP '20
  9. Introduction: contrastive learning
     •The model outputs feature representations for positive and negative examples
     •Training pulls the representations of positive pairs together
     •Hugely popular in computer vision; now trending in NLP as well
     SimCLR [15]
     •Treats two differently augmented views of the same image as a positive pair
     •High performance on downstream tasks such as image classification
     •Effective as pre-training for representation learning in CV
     (Image quoted from blog post [16])
     [14] Oord+: Representation Learning with Contrastive Predictive Coding, arXiv '18
     [15] Chen+: A Simple Framework for Contrastive Learning of Visual Representations, ICML '20
     [16] Advancing Self-Supervised and Semi-Supervised Learning with SimCLR, '20
     [17] Chen+: Big Self-Supervised Models are Strong Semi-Supervised Learners, NeurIPS '20
  10. Introduction: contrastive learning / training procedure
     •For each example in a mini-batch, treat the other examples as negatives → in-batch negatives
     •Compute a (batch size × batch size) similarity matrix; the diagonal entries (each example with its own view) are the targets
      • Maximizing the diagonal similarities == maximizing similarity with one's own positive
     •In implementation, this reduces to a cross-entropy loss; the loss function is

       \mathcal{L}_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i, h_j^+)/\tau}}

     •Cosine similarity is the usual choice of similarity
     •How positives are constructed matters a great deal
      • In SimCLR, performance varies with the augmentation method
     (Figure: similarity matrix over a batch of images and their augmented views; the blue diagonal cells are the targets)
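A minimal PyTorch sketch of this loss; h1 and h2 are assumed to be row-aligned embeddings of the two views, so the diagonal of the similarity matrix holds the positive pairs:

    import torch
    import torch.nn.functional as F

    def info_nce_loss(h1, h2, tau=0.05):
        # h1, h2: (N, d); row i of h1 and row i of h2 form a positive pair
        h1, h2 = F.normalize(h1, dim=1), F.normalize(h2, dim=1)
        sim = h1 @ h2.T / tau  # (N, N) cosine-similarity matrix
        labels = torch.arange(sim.size(0), device=sim.device)  # diagonal entries are the targets
        return F.cross_entropy(sim, labels)

Cross-entropy over each row, with the diagonal index as the target class, is exactly the loss above averaged over the batch.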
  11. Introduction: contrastive learning / alignment and uniformity
     •Wang+ [18] propose (differentiable) metrics that score feature representations by the properties desirable under contrastive learning
      • Lower is better for both
     Alignment
     •Do similar samples end up close together in feature space?
     Uniformity
     •How uniformly are the representations distributed over the unit hypersphere of the feature space?
     (Figure quoted from the paper)
     [18] Wang+: Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, ICML '20
  12. Related work: word embeddings → sentence embeddings / sentence similarity measures
     p-mean: computes the sentence embedding as the power mean \left( \frac{x_1^p + x_2^p + \cdots + x_n^p}{n} \right)^{1/p} of the word embeddings [19]
     SWEM: pools word embeddings by averaging / max / concatenating average and max / averaging over local windows then taking max [20]
     GEM: builds an orthogonal basis of the in-sentence word embeddings, derives novelty-style weights from it, and takes a weighted sum [21]
     DynaMax: stacks the word embeddings of the two sentences into a matrix and computes a fuzzy Jaccard coefficient based on fuzzy set theory [22]
     SIF: computes the weighted average \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)} v_w, applies SVD to the embedding matrix, and removes the first singular vector u via v_s - u u^\top v_s [23]
     uSIF: a tuning-free SIF that uses several singular vectors weighted by their singular values [24]
     P-SIF: SIF using per-topic word vectors [25]
     All-but-the-Top: applies PCA to the set of word embeddings and removes the top principal components [26]
     [19] Rückle+: Concatenated Power Mean Word Embeddings as Universal Cross-Lingual Sentence Representations, arXiv '18
     [20] Shen+: Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms, ACL '18
     [21] Yang+: Parameter-free Sentence Embedding via Orthogonal Basis, EMNLP-IJCNLP '19
     [22] Zhelezniak+: Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors, ICLR '19
     [23] Arora+: A Simple but Tough-to-Beat Baseline for Sentence Embeddings, ICLR '17
     [24] Ethayarajh: Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline, Rep4NLP '18
     [25] Gupta+: P-SIF: Document Embeddings Using Partition Averaging, AAAI '20
     [26] Mu+: All-but-the-Top: Simple and Effective Postprocessing for Word Representations, ICLR '18
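To make the SIF recipe above concrete, a minimal sketch assuming word_vecs (token → vector) and word_prob (token → unigram probability p(w)) lookups, which are not part of the original slide:

    import numpy as np

    def sif_embeddings(sentences, word_vecs, word_prob, a=1e-3):
        # Weighted average: (1/|s|) * sum_w  a / (a + p(w)) * v_w
        embs = np.stack([
            np.mean([a / (a + word_prob[w]) * word_vecs[w] for w in s], axis=0)
            for s in sentences  # sentences: list of token lists
        ])
        # Remove the first singular direction: v_s <- v_s - u u^T v_s
        u = np.linalg.svd(embs, full_matrices=False)[2][0]
        return embs - np.outer(embs @ u, u)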
  13. Related work: word embeddings → sentence embeddings / sentence similarity measures
     Word Mover's Distance: optimal transport with uniform probability mass (the inverse of sentence length) per word and Euclidean distance as cost [27]
     Word Mover's Embedding: represents a sentence (document) by the vector of its WMDs to several sampled sentences (documents) [28]
     Word Rotator's Distance: optimal transport with word-embedding norms as probability mass and cosine distance as cost [29]
     [27] Kusner+: From Word Embeddings To Document Distances, ICML '15
     [28] Wu+: Word Mover's Embedding: From Word2Vec to Document Embedding, EMNLP '18
     [29] Yokoi+: Word Rotator's Distance, EMNLP '20
  14. Related work on sentence embeddings: before BERT
     SkipThought: trains an LSTM encoder-decoder on the Books Corpus to reconstruct the surrounding sentences [30]
     FastSent: trains a Skip-gram-like model to reconstruct the bag-of-words of the surrounding sentences [31]
     SDAE: trains an LSTM to reconstruct the clean sentence from a noised input [31]
     SCDV: clusters word embeddings → obtains topic-aware word embeddings and sparse document embeddings [32]
     QuickThought: contrastive learning with a GRU, using the next sentence as the positive and other sentences as negatives [33]
     Sent2Vec: learns a CBOW model extended with n-grams [34]
     InferSent*: trains an LSTM on NLI classification [35]
     Universal Sentence Encoder: trains a DAN/Transformer with SkipThought-style unsupervised learning plus the NLI dataset [36]
     * = supervised learning
     [30] Kiros+: Skip-Thought Vectors, NIPS '15
     [31] Hill+: Learning Distributed Representations of Sentences from Unlabelled Data, NAACL '16
     [32] Mekala+: SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations, ACL '17
     [33] Logeswaran+: An efficient framework for learning sentence representations, ICLR '18
     [34] Pagliardini+: Unsupervised Learning of Sentence Embeddings Using Compositional n-Gram Features, NAACL '18
     [35] Conneau+: Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, EMNLP '17
     [36] Cer+: Universal Sentence Encoder, arXiv '18
  15. Related work on sentence embeddings: after BERT / post-processing
     BERT-flow: learns a mapping from BERT's anisotropic sentence embedding space to an isotropic latent space [37]
     BERT-whitening: linearly transforms sentence embeddings to zero mean and identity covariance (+ dimensionality reduction) [38]
     WhiteningBERT: the same whitening transform, studied across various models [39]
     SBERT-WK: builds sentence embeddings from layer-wise features of BERT/SBERT using novelty-style scores [40]
     All of these methods use BERT or Sentence-BERT as the base model and do no fine-tuning
     [37] Li+: On the Sentence Embeddings from Pre-trained Language Models, EMNLP '20
     [38] Su+: Whitening Sentence Representations for Better Semantics and Faster Retrieval, arXiv '21
     [39] Huang+: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach, arXiv '21
     [40] Wang+: SBERT-WK: A Sentence Embedding Method by Dissecting BERT-Based Word Models, IEEE/ACM Transactions on Audio, Speech, and Language Processing '20
  16. Related work on sentence embeddings: after BERT / unsupervised
     IS-BERT: maximizes the mutual information between a sentence embedding and the embeddings of its n-grams [41]
     ★DeCLUTR: contrastive learning with different spans of the same document as positives [42]
     ★BERT-CT: trains two separate copies of the same model so that their embeddings of the same sentence have a large dot product [43]
     ★ConSERT: contrastive learning against the original sentence, perturbed by adversarial attack / token or feature cutoff / shuffling / dropout [44]
     ★SG/SG-OPT: contrastive learning that pulls the sentence embedding toward embeddings from BERT's intermediate layers [45]
     ★CLEAR: contrastive learning during pre-training, with word/span deletion, reordering, and synonym substitution producing the positives [46]
     ★COCO-LM: contrastive learning with a word-replaced and a word-deleted version of the same sentence as positives [47]
     TSDAE**: a Transformer version of SDAE; learns a sentence embedding model by reconstructing the sentence [48]
     ★ = methods using contrastive learning
     ** = trains the model from scratch rather than fine-tuning a pre-trained language model
     [41] Zhang+: An Unsupervised Sentence Embedding Method by Mutual Information Maximization, EMNLP '20
     [42] Giorgi+: DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations, ACL '21
     [43] Carlsson+: Semantic Re-tuning with Contrastive Tension, ICLR '21
     [44] Yan+: ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer, ACL '21
     [45] Kim+: Self-Guided Contrastive Learning for BERT Sentence Representations, ACL '21
     [46] Wu+: CLEAR: Contrastive Learning for Sentence Representation, arXiv '20
     [47] Meng+: COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining, NeurIPS '21
     [48] Wang+: TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning, Findings of EMNLP '21
  17. Related work on sentence embeddings: after BERT / supervised
     Sentence-BERT: fine-tunes BERT on NLI classification built on top of sentence embeddings [11]
     Augmented SBERT: labels a pseudo dataset with a cross-encoder's similarity scores, then trains a bi-encoder on it [49]
     DefSent: reuses the pre-trained word-prediction layer to predict a word from the embedding of its definition sentence [50]
     ★PairSupCon※: contrastive learning that pulls entailment pairs together, trained jointly with NLI classification [51]
     ★PromptBERT***: contrastive learning that pulls entailment pairs together + removal of template-induced noise [58]
     ★ = methods using contrastive learning
     ※ Very similar to SimCSE; its paper cites SimCSE as concurrent work
     *** Published after SimCSE and outperforms it; worth checking out
     [11] Reimers+: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, EMNLP '19
     [49] Thakur+: Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks, NAACL '21
     [50] Tsukagoshi+: DefSent: Sentence Embeddings using Definition Sentences, ACL '21
     [51] Zhang+: Pairwise Supervised Contrastive Learning of Sentence Representations, EMNLP '21
     [58] Jiang+: PromptBERT: Improving BERT Sentence Embeddings with Prompts, arXiv '22
  18. SimCSE: Simple Contrastive Sentence Embedding
     •A sentence embedding method that fine-tunes a pre-trained language model with contrastive learning
     •Two variants, depending on how positive pairs are constructed
     Unsupervised SimCSE (unsup-SimCSE)
     •Treats two embeddings of the same sentence, produced with different dropout masks, as a positive pair
      • Simply feed the same sentence through the same model twice (see the sketch below)
      • Dropout as "minimal" data augmentation
     Supervised SimCSE (sup-SimCSE)
     •Uses entailment pairs from NLI datasets as positives
      • No classification problem is solved directly; the supervision labels are used only indirectly, to build positive pairs
      • Using contradiction pairs as hard negatives improves performance further
     •Trains sentences in an entailment relation to lie close together in embedding space
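A minimal sketch of the unsupervised positive-pair construction with Hugging Face transformers; keeping the model in train mode means the two forward passes sample different dropout masks (plain [CLS] pooling here is a simplification of the paper's pooler):

    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    model.train()  # keep dropout active, so each pass samples a new mask

    sentences = ["A man is playing guitar.", "Two dogs run in the park."]
    batch = tokenizer(sentences, padding=True, return_tensors="pt")

    h1 = model(**batch).last_hidden_state[:, 0]  # [CLS] embeddings, view 1
    h2 = model(**batch).last_hidden_state[:, 0]  # [CLS] embeddings, view 2
    # h1[i] and h2[i] form the positive pair for sentence i;
    # train with the in-batch InfoNCE loss from slide 10.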
  19. SimCSE: comparison with prior work
     Unsupervised SimCSE
     •Matches existing supervised baselines using contrastive learning on unlabeled text alone
     •Needs no complex data augmentation and no additional networks
     •Very easy to implement, highly extensible
     Supervised SimCSE
     •Proposes a way to learn sentence embeddings by combining contrastive learning with labeled datasets
      • Applicable in many settings beyond meaning-oriented sentence embeddings
     •Far exceeds existing supervised baselines
     •Very easy to implement, highly extensible
     🧐Training is very efficient and stable (performance jumps early in fine-tuning)
     •Using supervision labels in contrastive learning had been done before, although the paper does not cite it [52]
     [52] Khosla+: Supervised Contrastive Learning, arXiv '20
  20. SimCSE: contradiction as hard negatives
     How the NLI datasets (SNLI [53], MNLI [54]) were built:
     •An annotator is shown one premise
     •For each premise, the annotator writes hypotheses in an entailment, neutral, and contradiction relation
     → every premise comes with both an entailment and a contradiction sentence
     •Adding the contradiction as a hard negative improves STS performance

     premise                                      | hypothesis                         | label
     A man playing an electric guitar on stage.   | A man playing banjo on the floor.  | contradiction
     A man playing an electric guitar on stage.   | A man playing guitar on stage.     | entailment
     A man playing an electric guitar on stage.   | A man is performing for cash.      | neutral
     (Examples from SNLI [53])
     [53] Bowman+: A large annotated corpus for learning natural language inference, EMNLP '15
     [54] Williams+: A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, NAACL '18
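A minimal sketch of how the hard negatives extend the in-batch loss: the logits matrix gains an extra (N, N) block of premise-vs-contradiction similarities, while the targets stay on the diagonal of the first block (the paper's optional hard-negative weighting is omitted):

    import torch
    import torch.nn.functional as F

    def sup_simcse_loss(h, h_pos, h_neg, tau=0.05):
        # h: premises (N, d); h_pos: entailment hypotheses; h_neg: contradiction hypotheses
        h, h_pos, h_neg = (F.normalize(x, dim=1) for x in (h, h_pos, h_neg))
        sim_pos = h @ h_pos.T / tau  # diagonal = positives, off-diagonal = in-batch negatives
        sim_neg = h @ h_neg.T / tau  # every entry is a hard negative
        logits = torch.cat([sim_pos, sim_neg], dim=1)  # (N, 2N)
        labels = torch.arange(h.size(0), device=h.device)
        return F.cross_entropy(logits, labels)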
  21. Preliminary studies
     •Several experiments probe when and why SimCSE is effective
     Items examined for Unsupervised SimCSE
     •Effect of different data augmentation methods on performance
     •Comparison with other unsupervised objectives such as next sentence prediction
     •Sweep over the dropout rate and its effect on performance
     •Alignment and uniformity of each method
     Items examined for Supervised SimCSE
     •Training on several datasets, including NLI, and comparing performance
     •Effect of adding hard negatives
  22. Unsupervised SimCSE: effect of data augmentation
     •Compares dropout against other data augmentation methods
     •Training sentences: one million sentences randomly drawn from English Wikipedia
     •Evaluated on the STS Benchmark dev set
     •Dropout outperforms every discrete data augmentation method
     •Are discrete augmentations simply ineffective for learning meaning-encoding vector representations?
      • Perhaps because, in natural language, even a small discrete perturbation can easily change the meaning of a sentence
  23. Unsupervised SimCSE: comparison with other unsupervised objectives
     •Compares Unsupervised SimCSE with next sentence prediction (NSP) and other unsupervised objectives
      • NSP is used in BERT's pre-training
     •Unsupervised SimCSE achieves the best STS Benchmark dev performance among all compared methods
     🧐The point of comparing pre-training objectives with sentence embedding objectives is not entirely clear to me
  24. Unsupervised SimCSE: effect of the dropout rate
     •Varies the dropout rate p and checks performance on the STS Benchmark dev set
     •Two settings are extreme cases in which exactly the same embedding is computed twice:
      • p = 0.0: no dropout at all
      • Fixed 0.1: dropout with p = 0.1, but the same dropout mask for both embeddings
     •The Transformer/BERT default, p = 0.1, performs best
      • Performance is fairly sensitive to the dropout rate and to how dropout is applied
     •The p = 0.0 and Fixed 0.1 results suggest that fine-tuning on two identical embeddings degrades performance
     🧐Conversely, even p = 0.0 still learns reasonably well (performance does improve)
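For reference, a sketch of how such a sweep could be set up with Hugging Face transformers; hidden_dropout_prob and attention_probs_dropout_prob are BERT's two dropout rates, and the sweep values here are illustrative, not the paper's grid:

    from transformers import AutoModel

    for p in (0.0, 0.01, 0.05, 0.1, 0.15):
        model = AutoModel.from_pretrained(
            "bert-base-uncased",
            hidden_dropout_prob=p,
            attention_probs_dropout_prob=p,
        )
        # ... fine-tune with the unsup-SimCSE objective, evaluate on STS-B dev ...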
  25. Unsupervised SimCSE: alignment and uniformity
     •Tracks how alignment and uniformity change during SimCSE training
     •Lower is better for both
     •"No dropout" and "Fixed 0.1" reduce uniformity but increase (worsen) alignment
      • They push all sentence embeddings apart indiscriminately, even semantically close ones
     •"Unsupervised SimCSE" reduces uniformity without letting alignment increase (worsen)
     🧐Judging from alignment and uniformity alone, "Delete one word" looks best, though…?
      • These metrics do not correlate cleanly with STS performance; a rough guide at most?
  26. Supervised SimCSE: comparison across datasets
     •Supervised SimCSE could in principle use datasets other than NLI
      • Ignoring hard negatives, any sentence-pair dataset would do
     •The NLI datasets give the best STS results
     •The NLI datasets are human-built and high quality
      • They are also designed to keep lexical overlap between paired sentences small
     (Compared settings: Quora Question Pairs, image captions, back-translation paraphrases; positives from entailment / neutral / contradiction / all pairs)
  27. Evaluation experiments
     •Evaluation on unsupervised STS tasks
      • STS12-16, STS Benchmark, and SICK-R as datasets (the standard setting)
     •Evaluation with SentEval (Appendix E)
      • Trains a (logistic regression) classifier that takes the sentence embedding as input for text classification tasks such as sentiment classification, probing the properties of the embeddings
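A minimal sketch of this probing protocol with scikit-learn, assuming precomputed embedding matrices X_train / X_test and label arrays y_train / y_test (hypothetical names):

    from sklearn.linear_model import LogisticRegression

    probe = LogisticRegression(max_iter=1000)  # the sentence encoder itself stays frozen
    probe.fit(X_train, y_train)  # X_*: (n, dim) sentence embeddings
    print("downstream accuracy:", probe.score(X_test, y_test))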
  28. Evaluation: STS
     •"Unsupervised models" are methods that use no supervision labels; "supervised models" are methods that do
      • Note that neither uses the STS datasets themselves as supervision
     •SimCSE performs best in both the unsupervised and the supervised setting
     •Unsup-SimCSE-BERT_base outperforms SBERT_base
      • An unsupervised method beating the supervised baseline
     •Sup-SimCSE sets a new SOTA
  30. Evaluation: SentEval
     🧐Unlike on STS, performance is not outstandingly high
      • Especially for Unsup-SimCSE
      • Perhaps because its main effect is refining the sentence embedding space?
     •SBERT beats Sup-SimCSE-BERT, but Sup-SimCSE-RoBERTa beats SRoBERTa
     •Downstream performance depends on the capability of the underlying model itself
     •Adding an MLM objective during fine-tuning improves downstream performance
      • It prevents catastrophic forgetting
      • But brings no STS improvement
  33. Evaluation: alignment and uniformity by method
     •Post-processing methods such as BERT-flow and BERT-whitening have quite poor (high) alignment
      • Nevertheless, SBERT-flow and the like achieve quite high STS performance
     •Averaged BERT embeddings have poor uniformity
      • Consistent with prior work on anisotropy
     •SBERT, which does no contrastive learning, also improves uniformity over BERT
     •Supervised / Unsupervised SimCSE improve uniformity as well
  36. Summary
     •Proposes a very simple sentence embedding method based on contrastive learning
     Unsupervised SimCSE
     •Positive pairs: embeddings of the same input sentence under different dropout masks
     •Beats supervised baselines on the STS tasks
     Supervised SimCSE
     •Positive pairs: entailment sentence pairs from the NLI datasets
     •New SOTA, far ahead of existing methods
     •From experiments on the embedding distribution, the paper argues that improving alignment and uniformity removes the anisotropy of pre-trained language models and yields high-quality sentence embeddings
     •Its simplicity makes it broadly applicable and easy to build on
  37. Impressions 🧐
     •The simplicity of Unsupervised SimCSE, just two passes through the same model, is wonderful
      • Many tweaks are possible, so lots of derivative work seems likely
      • e.g., how do you build good hard negatives without supervision?
     •The bundling of Unsupervised and Supervised SimCSE in one paper is a bit curious
      • Supervised SimCSE amounts to applying SupCon / SimCLR to sentence embeddings, so on its own its contribution might have looked small; perhaps that is why?
     •The verification that improving alignment and uniformity actually helps the STS tasks feels thin
      • The correlation with STS performance is not strong
      • Nor does it seem to correlate much with SentEval…? Maybe CV and language differ here?
      • The later PromptBERT [58] analyzes this and finds word-frequency information to be the most influential factor
  38. Derivative work on SimCSE
     GS-InfoNCE: smoothing to reduce the impact of false negatives (pairs that are semantically identical yet treated as negatives) [55]
     ESimCSE: word repetition to reduce the sentence-length bias in the embeddings && a momentum encoder to increase the number of negatives [56]
     S-SimCSE: varies the dropout rate per sentence embedding [57]
     Presenter's impression: new variants keep sprouting like bamboo shoots after rain
     [55] Wu+: Smoothed Contrastive Learning for Unsupervised Sentence Embedding, arXiv Sep. '21
     [56] Wu+: ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding, arXiv Sep. '21
     [57] Zhang+: S-SimCSE: Sampled Sub-networks for Contrastive Learning of Sentence Embedding, arXiv Nov. '21