[Reading Group Slides] One Embedder, Any Task: Instruction-Finetuned Text Embeddings

Hayato Tsukagoshi
December 18, 2023

Reading-group slides explaining InstructOR, a model that generates sentence embeddings conditioned on instructions.

Original paper: https://aclanthology.org/2023.findings-acl.71/

Transcript

  1. One Embedder, Any Task: Instruction-Finetuned Text Embeddings
     Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, Tao Yu
     ACL 2023 Findings
     https://aclanthology.org/2023.findings-acl.71/
     Presenter: Hayato Tsukagoshi (D1, Graduate School of Informatics, Nagoya University, Japan)
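  2. Overview
     • Sentence embeddings are weak on out-of-domain (OOD) data
     • The paper proposes InstructOR, a model that embeds sentences according to instructions
       • Sentence embeddings can be generated with a different instruction for each task and domain
     • Collect diverse datasets
       → annotate them with instructions (MEDI dataset)
       → turn instruction + sentence into an embedding
       → train the model with contrastive learning
     • Average performance improves on multiple benchmarks, including MTEB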
  3. Introduction: sentence embeddings
     • Dense vector representations of natural-language sentences
     • The distance between vectors expresses how close the sentences are in meaning
     • Example sentences and their positions in the sentence embedding space:
       • "A child is heading home." → [0.1, 0.2, ...]
       • "A child is heading home from school." → [0.1, 0.3, ...]
       • "A child is in the library." → [0.9, 0.8, ...]
       • "A child is walking in the afternoon." → [0.5, 0.7, ...]
     • Semantically similar: sentences with close meanings are distributed close together
     • The distance between vectors expresses the semantic relationship
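To make the "distance expresses meaning" idea concrete, here is a minimal sketch. It assumes the four vectors on the slide are paired with the four example sentences in order, and it keeps only the two components that are visible (the slide truncates the rest), so the 2-D vectors are hypothetical stand-ins rather than real model outputs. Euclidean distance is also an assumption; the slide only says "distance".

```python
import math

# Hypothetical 2-D stand-ins: only the two components visible on the slide.
embeddings = {
    "A child is heading home.": [0.1, 0.2],
    "A child is heading home from school.": [0.1, 0.3],
    "A child is in the library.": [0.9, 0.8],
    "A child is walking in the afternoon.": [0.5, 0.7],
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(query, table):
    """Return the sentence whose vector is closest to the query sentence's vector."""
    q = table[query]
    candidates = [(euclidean(q, v), s) for s, v in table.items() if s != query]
    return min(candidates)[1]


closest = nearest("A child is heading home.", embeddings)
print(closest)
```

With these toy values the closest neighbour of the first sentence is the second one, which matches the "semantically similar" annotation on the slide.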
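  5. Method overview
     • Sentence embeddings are weak on out-of-domain (OOD) data
     • The paper proposes InstructOR, a model that embeds sentences according to instructions
       • InstructOR: Instruction-based Omnifarious Representations
       • Changing the instruction per sentence yields diverse embeddings
     • Having instructions also improves OOD performance
     • Collect diverse datasets
       → annotate them with instructions (MEDI dataset)
       → turn instruction + sentence into an embedding
       → train the model with contrastive learning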
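  6. InstructOR: training procedure
     • Training uses relatively simple contrastive learning
     • The instruction + sentence is fed to the model as a whole
       • This yields sentence embeddings that take the instruction into account
     Training procedure
     1. Prepare instruction-sentence pairs (x, I_x, y, I_y)
     2. Embed them as E_x = E(I_x ⊕ x), E_y = E(I_y ⊕ y)
     3. Apply contrastive learning so that positive examples move closer together
     • In-batch negatives are used as negative examples
     Figure: I_x ⊕ x and I_y ⊕ y each pass through the model; the loss function pulls the embeddings of positive examples closer together.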
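The training procedure above can be sketched as a contrastive loss with in-batch negatives. This is a minimal NumPy illustration, not the authors' implementation: the encoder E is left out and random vectors stand in for E(I_x ⊕ x) and E(I_y ⊕ y), the softmax cross-entropy form is an assumption about the loss, and the default temperature of 0.01 is taken from the training-settings slide.

```python
import numpy as np


def l2_normalize(m):
    """Scale each row to unit length so dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def in_batch_contrastive_loss(e_x, e_y, temperature=0.01):
    """Row i of e_x and row i of e_y form a positive pair.

    Every other row of e_y in the batch acts as a negative for e_x[i].
    """
    sims = l2_normalize(e_x) @ l2_normalize(e_y).T / temperature  # shape (B, B)
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.default_rng(0)
e_x = rng.normal(size=(4, 8))  # stand-in for E(I_x ⊕ x)
aligned = in_batch_contrastive_loss(e_x, e_x + 0.01 * rng.normal(size=(4, 8)))
shuffled = in_batch_contrastive_loss(e_x, np.roll(e_x, 1, axis=0))
print(aligned, shuffled)
```

The loss is close to zero when each positive pair is well aligned and grows when the pairs are mismatched, which is the behaviour step 3 relies on.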
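  7. MEDI: Multitask Embedding Data with Instructions
     • Training InstructOR requires datasets in which instructions and sentences are paired
     • Existing datasets are merged for training (300 in total)
       • Super-NaturalInstructions (super-NI)
       • Sentence Transformers embedding data
     • super-NI pairs instructions with sentences, but it has no positive or negative examples
       → generate sentence embeddings with Sentence-T5 (without instructions)
       → use similarity to generate positive/negative pairs automatically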
  8. MEDI: automatic generation of positive/negative pairs
     • Seq2Seq-style datasets have no labels, so a workaround is needed
     • Prepare a score S_pos for positives and a score S_neg for negatives
     • The pair with the highest S_pos becomes the positive example; the pair with the highest S_neg becomes the negative example
     • S_pos combines input-sentence similarity and output-sentence similarity: the score is high when both the inputs and the outputs are similar
     • S_neg combines input-sentence similarity and output-sentence similarity: the score is high when the inputs are similar but the outputs are not
       • These negatives play a role similar to hard negatives
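A minimal sketch of this pair-mining step follows. The exact formulas did not survive in the transcript, so the code assumes the simplest forms consistent with the slide's description: S_pos is the sum of input similarity and output similarity, and S_neg is input similarity minus output similarity. The function name `mine_pairs` and the toy vectors are hypothetical; in the slides the embeddings come from Sentence-T5 without instructions.

```python
import numpy as np


def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def mine_pairs(inputs, outputs, anchor):
    """Return (positive_index, negative_index) for the example at `anchor`."""
    best_pos, best_neg = None, None
    for j in range(len(inputs)):
        if j == anchor:
            continue
        sim_in = cos(inputs[anchor], inputs[j])
        sim_out = cos(outputs[anchor], outputs[j])
        s_pos = sim_in + sim_out  # assumed form: high when inputs AND outputs are similar
        s_neg = sim_in - sim_out  # assumed form: high when inputs are similar, outputs are not
        if best_pos is None or s_pos > best_pos[0]:
            best_pos = (s_pos, j)
        if best_neg is None or s_neg > best_neg[0]:
            best_neg = (s_neg, j)
    return best_pos[1], best_neg[1]


# Toy embeddings of input and output texts for four examples.
inputs = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0]])
outputs = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.0, 1.0]])
pos_idx, neg_idx = mine_pairs(inputs, outputs, anchor=0)
print(pos_idx, neg_idx)
```

In the toy data, example 1 resembles the anchor in both input and output and is picked as the positive, while example 2 has a near-identical input but a different output and is picked as the hard-negative-like example.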
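  12. MEDI: constructing instructions
     • Instructions are built from the following template
       • "Represent the (DOMAIN) TEXT TYPE for TASK OBJECTIVE:"
     • The actual instructions are listed in a table on the slide (excerpted from a table in the paper)
     • Important: the instruction lets the embedding change depending on the type of sentence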
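The template can be filled in programmatically. The sketch below is illustrative only: the helper names `build_instruction` and `with_instruction` are hypothetical, the example field values are made up for demonstration, and joining instruction and text with a single space is an assumption about how the concatenation I ⊕ x is realized.

```python
def build_instruction(text_type, task_objective, domain=None):
    """Fill the template "Represent the (DOMAIN) TEXT TYPE for TASK OBJECTIVE:".

    The domain is optional, matching the parentheses in the template.
    """
    parts = ["Represent the"]
    if domain:
        parts.append(domain)
    parts.append(text_type)
    parts.append(f"for {task_objective}:")
    return " ".join(parts)


def with_instruction(instruction, text):
    """Concatenate instruction and text (assumed here to be a plain space join)."""
    return f"{instruction} {text}"


doc_instruction = build_instruction("document", "retrieval", domain="Wikipedia")
plain_instruction = build_instruction("sentence", "classification")
model_input = with_instruction(doc_instruction, "InstructOR embeds text with instructions.")
print(doc_instruction)
print(plain_instruction)
```

Swapping the text type or task objective changes the instruction, and therefore the embedding, without changing the model.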
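  13. Related work: FLAN
     • Tunes an LM so that it can solve tasks by following instructions (instruction tuning)
     • Solves a variety of generation tasks well in the zero-shot setting
     • InstructOR can be described as the embedding version of FLAN
     • FLAN apparently stands for "Finetuned Language Net" 🧐
     • Flan-T5 and Flan-UL2 are extremely strong models and are recommended (they are still strong even today)
     Wei+: Finetuned Language Models Are Zero-Shot Learners, ICLR ’22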
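  14. Evaluation: Massive Text Embedding Benchmark (MTEB)
     • A sentence embedding benchmark made up of 58 datasets and 112 languages
     • Covers 8 kinds of tasks, including STS
     • (Aside) The MTEB results suggest that larger models tend to be stronger for sentence embeddings as well
     • Evaluation focus: how good the sentence embeddings generated by InstructOR are
       • In particular, whether adding instructions improves performance over existing methods
     • A project whose last author is Nils Reimers, an author of Sentence-BERT
     Muennighoff+: MTEB: Massive Text Embedding Benchmark, arXiv ’22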
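  15. Evaluation: Billboard (similarity evaluation of summaries)
     • A leaderboard (framework) where both generation models and evaluation metrics can be registered
       • An ensemble of the individual metrics is also computed automatically
       • Development and evaluation of generation models promote each other
     • Evaluation focus: how close the similarity between generated text and reference summaries is to human evaluation
     • Pearson correlation between human judgments and the maximum cosine similarity between the generated text and each reference summary
     • The correlation coefficient is computed on each of 3 datasets, and the average is the evaluation score
     Kasai+: Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand, NAACL ’22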
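The scoring rule described on this slide can be sketched as follows. This is a toy illustration under stated assumptions: the function names are hypothetical, the 2-D vectors stand in for real sentence embeddings, and the human scores are invented for the example. Only two toy datasets are averaged here, whereas the slide mentions three.

```python
import numpy as np


def max_cos_to_references(gen, refs):
    """Maximum cosine similarity between one generated text and its reference summaries."""
    gen = gen / np.linalg.norm(gen)
    refs = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    return float((refs @ gen).max())


def dataset_score(gen_embs, ref_embs, human_scores):
    """Pearson correlation between max-cosine similarities and human judgments."""
    sims = [max_cos_to_references(g, r) for g, r in zip(gen_embs, ref_embs)]
    return float(np.corrcoef(sims, human_scores)[0, 1])


def averaged_score(datasets):
    """Average the per-dataset correlation coefficients."""
    return float(np.mean([dataset_score(*d) for d in datasets]))


# Toy embeddings: three generated texts, each with its own set of references.
gen_embs = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])]
ref_embs = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[1.0, 0.0]]),
    np.array([[1.0, 0.0], [1.0, 1.0]]),
]
agreeing = dataset_score(gen_embs, ref_embs, [5.0, 3.0, 3.0])
disagreeing = dataset_score(gen_embs, ref_embs, [1.0, 3.0, 3.0])
overall = averaged_score([
    (gen_embs, ref_embs, [5.0, 3.0, 3.0]),
    (gen_embs, ref_embs, [1.0, 3.0, 3.0]),
])
print(agreeing, disagreeing, overall)
```

A better embedding model is one whose similarities track the human scores more closely, so a higher averaged correlation is better.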
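  16. Evaluation: Prompt Retrieval
     • In-context learning requires choosing the few-shot examples used at test time
       • Narrowing them down to a few in advance is laborious, and it is unclear which ones are good
     • Convert the few-shot examples to embeddings → include the examples closest to the test example to improve performance
     • Evaluation focus: how well generation works
       • The measure is whether good few-shot examples that contribute to performance can be retrieved
     • Generation uses GPT-J
     • Comment: this looks like the SentEval of the LLM era (SentEval: train a linear classifier on top of the embeddings and evaluate by its accuracy)
     Su+: Selective Annotation Makes Language Models Better Few-Shot Learners, arXiv ’22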
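The retrieval step can be sketched as a nearest-neighbour lookup over embeddings. This is an illustration under assumptions: `select_few_shot` is a hypothetical helper, cosine similarity is assumed as the closeness measure, and the toy vectors stand in for embeddings of the candidate examples and the test example.

```python
import numpy as np


def select_few_shot(test_emb, pool_embs, k):
    """Return indices of the k pool examples most similar to the test example."""
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    query = test_emb / np.linalg.norm(test_emb)
    sims = pool @ query  # cosine similarity to every candidate
    return np.argsort(-sims)[:k].tolist()


# Toy embeddings of four candidate few-shot examples and one test example.
pool_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
test_emb = np.array([1.0, 0.05])
chosen = select_few_shot(test_emb, pool_embs, k=2)
print(chosen)
```

The selected examples would then be placed in the prompt of the generation model, and downstream task performance serves as the measure of embedding quality.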
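  17. Evaluation: compared methods
     • SimCSE: contrastive learning that uses as positives either the same sentence under different dropout masks or sentence pairs in an entailment relation
     • Contriever: investigates the limits of contrastive learning for vector retrieval; confirms gains from fine-tuning on MS MARCO
     • GTR: confirms that scaling improves vector-retrieval performance and generalization for sentence embeddings as well
     • coCondenser: adds a corpus-dependent loss during pre-training, making retriever training more robust and improving performance
     • Sentence-T5: fine-tunes T5 with contrastive learning on NLI/QA data → investigates scaling laws and evaluates on SentGLUE
     • SGPT: uses GPT for information retrieval tasks; experiments on BEIR with both Cross-Encoder and Bi-Encoder setups
     References
     • Gao+: SimCSE: Simple Contrastive Learning of Sentence Embeddings, EMNLP ’21
     • Izacard+: Unsupervised Dense Information Retrieval with Contrastive Learning, TMLR ’23
     • Ni+: Large Dual Encoders Are Generalizable Retrievers, EMNLP ’22
     • Gao+: Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval, ACL ’22
     • Ni+: Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models, CoRR ’21
     • Muennighoff: SGPT: GPT Sentence Embeddings for Semantic Search, arXiv ’22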
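  18. Evaluation: InstructOR training settings
     • Fine-tunes GTR
       • Temperature parameter: 0.01
       • Learning rate: 2e-5
       • Optimizer: AdamW
     Minibatch sampling method
     • Each minibatch consists of examples from a single dataset
       • This keeps the model from learning differences between tasks or datasets
       • It reduces the adverse effect of in-batch negatives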
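The minibatch rule can be sketched with a small sampler in which every batch draws from exactly one dataset. This is a hypothetical illustration, not the authors' code: the function name, the dataset names, the choice to keep a smaller final batch, and the uniform shuffling of batches are all assumptions, since the slide states only the single-dataset constraint.

```python
import random


def single_dataset_batches(datasets, batch_size, seed=0):
    """Build minibatches so that each one contains examples from one dataset only.

    `datasets` maps a dataset name to its list of examples. Batches from all
    datasets are shuffled together, but no batch mixes datasets, so in-batch
    negatives always come from the same task.
    """
    rng = random.Random(seed)
    batches = []
    for name, examples in datasets.items():
        shuffled = list(examples)
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            # The last batch of a dataset may be smaller than batch_size.
            batches.append((name, shuffled[start:start + batch_size]))
    rng.shuffle(batches)
    return batches


# Toy data: example IDs are prefixed with their (made-up) dataset name.
toy = {
    "qa": [f"qa-{i}" for i in range(5)],
    "nli": [f"nli-{i}" for i in range(4)],
}
batches = single_dataset_batches(toy, batch_size=2)
for name, batch in batches:
    print(name, batch)
```

Because every negative in a batch shares the task of its anchor, the model cannot lower the loss simply by telling tasks apart.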