Slide 1

Slide 1 text

Text Embeddings by Weakly-Supervised Contrastive Pre-training
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei
Presenter: Hayato Tsukagoshi, Graduate School of Informatics, Nagoya University, Japan
https://arxiv.org/abs/2212.03533

Slide 2

Slide 2 text

Overview

• Proposes E5, a text embedding model built through large-scale contrastive pre-training
• Builds a weakly supervised dataset from semi-structured data with filtering
• Pre-training with a batch size of 32,000
• Fine-tuning that leverages hard negatives and knowledge distillation from a Cross-Encoder
• In evaluation, outperforms existing models on average across a variety of benchmarks

Model     #Layers  Hidden size  #Params
E5-small  12       384          33M
E5-base   12       768          110M
E5-large  24       1024         330M

Slide 3

Slide 3 text

Why This Paper Was Selected

• It is one of the most widely used sentence embedding models of recent years
• It is a useful reference for understanding where sentence embedding research is heading
  • Knowledge distillation from a Cross-Encoder
  • Multi-stage contrastive learning

Disclaimer
• Figures and tables in these slides are quoted from the papers cited on each slide
• Notation may differ from the equations in the original papers

Slide 4

Slide 4 text

Background

Slide 5

Slide 5 text

Language Models

• Most recent models are built on the Transformer, which is based on the attention mechanism
• Many variants exist

Many other kinds of language models also exist, e.g. XLNet, ELECTRA, UL2, …
(Figure: overview diagram of BERT)

Slide 6

Slide 6 text

Language Models

• Most recent models are built on the Transformer, which is based on the attention mechanism
• Many variants exist

Autoregressive language models (Causal LM)
• Trained to predict words from left to right
• Examples: GPT, GPT-2, GPT-3, Llama 2, …

Many other kinds of language models also exist, e.g. XLNet, ELECTRA, UL2, …
(Figure: overview diagram of BERT)

Slide 7

Slide 7 text

Language Models

• Most recent models are built on the Transformer, which is based on the attention mechanism
• Many variants exist

Autoregressive language models (Causal LM)
• Trained to predict words from left to right
• Examples: GPT, GPT-2, GPT-3, Llama 2, …

Masked language models (Masked LM)
• Trained by masking parts of the input text and predicting them
• Examples: BERT, RoBERTa, DeBERTa, …

Many other kinds of language models also exist, e.g. XLNet, ELECTRA, UL2, …
(Figure: overview diagram of BERT)

Slide 8

Slide 8 text

Attention Mechanism

• A mechanism that takes a sequence of vectors as input and outputs a sequence of vectors
• The input is split into Q (Query), K (Key), and V (Value) for the computation
  • K, V: n d-dimensional vectors
  • Q: m d-dimensional vectors

Figure quoted from Jaegle et al., Perceiver IO: A General Architecture for Structured Inputs & Outputs, ICLR 2022.
An accessible explainer (Japanese video): "Deep Learning World" vol. 24, on the attention mechanism.

Slide 9

Slide 9 text

Attention Mechanism

• A mechanism that takes a sequence of vectors as input and outputs a sequence of vectors
• The input is split into Q (Query), K (Key), and V (Value) for the computation
  • K, V: n d-dimensional vectors
  • Q: m d-dimensional vectors
• The importance of each V with respect to Q is computed from the dot products of Q and K, followed by a softmax
  • Attention weights: the (m × n) matrix obtained from this computation

Figure quoted from Jaegle et al., Perceiver IO: A General Architecture for Structured Inputs & Outputs, ICLR 2022.
An accessible explainer (Japanese video): "Deep Learning World" vol. 24, on the attention mechanism.

Slide 10

Slide 10 text

Attention Mechanism

• A mechanism that takes a sequence of vectors as input and outputs a sequence of vectors
• The input is split into Q (Query), K (Key), and V (Value) for the computation
  • K, V: n d-dimensional vectors
  • Q: m d-dimensional vectors
• The importance of each V with respect to Q is computed from the dot products of Q and K, followed by a softmax
  • Attention weights: the (m × n) matrix obtained from this computation
• Self-Attention: Q, K, and V are built from the same vector sequence (i.e. n = m)
• Cross-Attention: Q and (K, V) are built from different vector sequences
• A minimal code sketch follows below

Figure quoted from Jaegle et al., Perceiver IO: A General Architecture for Structured Inputs & Outputs, ICLR 2022.
An accessible explainer (Japanese video): "Deep Learning World" vol. 24, on the attention mechanism.
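As a reference point, here is a minimal sketch of scaled dot-product attention in PyTorch (not from the slides; shapes follow the Q/K/V description above):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q: (m, d), k: (n, d), v: (n, d) -> output: (m, d).

    The (m, n) attention-weight matrix comes from the dot products
    of Q and K, scaled by sqrt(d) and passed through a softmax.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5  # (m, n)
    weights = F.softmax(scores, dim=-1)        # importance of each V per Q
    return weights @ v                         # weighted sum of V

# Self-attention: Q, K, V come from the same sequence (n = m).
x = torch.randn(5, 64)
out_self = scaled_dot_product_attention(x, x, x)   # (5, 64)

# Cross-attention: Q comes from a different sequence than K, V.
q = torch.randn(3, 64)
out_cross = scaled_dot_product_attention(q, x, x)  # (3, 64)
```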

Slide 11

Slide 11 text

Transformer

• A model architecture built solely from attention mechanisms
  • Eliminates the RNNs, LSTMs, and CNNs that had been standard in NLP until then

Vaswani et al., Attention Is All You Need, NeurIPS 2017.
An accessible explainer (Japanese video): "Deep Learning World" vol. 28, on Transformer and Multi-Head Attention.
(Figure: overview diagram showing the Encoder and Decoder stacks)

Slide 12

Slide 12 text

Transformer

• A model architecture built solely from attention mechanisms
  • Eliminates the RNNs, LSTMs, and CNNs that had been standard in NLP until then
• A mechanism that maps a sequence of vectors to a sequence of vectors
  • Takes interactions among the input vectors into account

Vaswani et al., Attention Is All You Need, NeurIPS 2017.
An accessible explainer (Japanese video): "Deep Learning World" vol. 28, on Transformer and Multi-Head Attention.
(Figure: overview diagram showing the Encoder and Decoder stacks)

Slide 13

Slide 13 text

Transformer

• A model architecture built solely from attention mechanisms
  • Eliminates the RNNs, LSTMs, and CNNs that had been standard in NLP until then
• A mechanism that maps a sequence of vectors to a sequence of vectors
  • Takes interactions among the input vectors into account
• Comes in two parts, Encoder and Decoder
  • Encoder-only: BERT, LUKE, …
  • Decoder-only: GPT, GPT-2, GPT-3, …
  • Encoder-Decoder: BART, T5, UL2, …

Vaswani et al., Attention Is All You Need, NeurIPS 2017.
An accessible explainer (Japanese video): "Deep Learning World" vol. 28, on Transformer and Multi-Head Attention.
(Figure: overview diagram showing the Encoder and Decoder stacks)

Slide 14

Slide 14 text

BERT: Bidirectional Encoder Representations from Transformers

• A model that stacks multiple Transformer Encoder layers and pre-trains them at large scale
  • base has 12 layers (about 110M parameters); large has 24 layers (about 330M parameters)
• Popularized the pre-training → fine-tuning paradigm

Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019.

Slide 15

Slide 15 text

Sentence Embeddings

• Dense vector representations of natural language sentences
• Distances between the vectors represent how close the sentences are in meaning

(Figure: example sentences mapped into a sentence embedding space, e.g. "A child is heading home." → [0.1, 0.2, ...]; "A child is heading home from school." → [0.1, 0.3, ...]; "A child is in the library." → [0.9, 0.8, ...]; "A child is walking in the afternoon." → [0.5, 0.7, ...])

Slide 16

Slide 16 text

Sentence Embeddings

• Dense vector representations of natural language sentences
• Distances between the vectors represent how close the sentences are in meaning

(Figure: the same embedding space, annotated: sentences with close meanings, i.e. semantically similar ones, are placed near each other, and the distances between vectors express their semantic relations.)

Slide 17

Slide 17 text

The Evolution of Sentence Embedding Research

Early period (~2018)
• Methods that compose sentence embeddings from static word embeddings (Word2Vec, GloVe) were mainstream
  • SIF, uSIF, All-but-the-Top, …
• A few methods trained from scratch, e.g. with LSTMs
  • SkipThought, InferSent, Universal Sentence Encoder (USE), …

The rise of fine-tuned pre-trained models (2019–2021)
• A growing number of methods obtain sentence embedding models by fine-tuning BERT
  • BERT-flow, Sentence-BERT (SBERT), …

Note: the dates are approximate.

Slide 18

Slide 18 text

NLI Datasets

Natural Language Inference (NLI)
• Sentence pairs (premise, hypothesis) annotated with labels (entailment, contradiction, neutral)
• The task is to predict the semantic relation between the two sentences

Premise                                     Hypothesis                          Label
A man playing an electric guitar on stage.  A man playing guitar on stage.      entailment
A man playing an electric guitar on stage.  A man playing banjo on the floor.   contradiction
A man playing an electric guitar on stage.  A man is performing for cash.       neutral

Slide 19

Slide 19 text

The Evolution of Sentence Embedding Research

The contrastive learning boom (2021~)
• Contrastive learning methods that had taken off in computer vision arrived in sentence embeddings
• SimCSE in particular became the representative method
  • Proposes both an unsupervised and a supervised variant

Unsupervised SimCSE
1. Forward the same input twice with different dropout masks
2. Treat the resulting pair of "different outputs for the same input" as positives for contrastive learning

Supervised SimCSE
• Use sentence pairs in the "entailment" relation from NLI datasets as positives for contrastive learning

(Footnote: a technical report on Japanese SimCSE is linked here.)

Slide 20

Slide 20 text

Contrastive Learning

• One approach to representation learning
• Trains embeddings so that positives move closer together and negatives move farther apart
  • Maximize the similarity of positive pairs and minimize the similarity with negatives

Slide 21

Slide 21 text

Contrastive Learning

• One approach to representation learning
• Trains embeddings so that positives move closer together and negatives move farther apart
  • Maximize the similarity of positive pairs and minimize the similarity with negatives

Computing the loss (InfoNCE)
• Compute the cosine similarity between the embeddings of the positive pair
• Compute the cosine similarities between the embeddings of the negative pairs
• Stack the similarities and apply the temperature parameter
• Apply the softmax function and treat the result as a probability distribution
• Push this distribution toward the one that puts probability 1 on the positive alone
• A code sketch of this computation follows below
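A minimal sketch of the InfoNCE computation just described, in PyTorch, assuming one positive per query and the rest of the batch serving as negatives (not from the slides):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, temperature=0.05):
    """query_emb, pos_emb: (batch, dim). Row i of pos_emb is the positive
    for row i of query_emb; all other rows act as in-batch negatives.
    (0.05 is SimCSE's temperature; E5 uses 0.01, per the settings table later.)"""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    # (batch, batch) matrix of cosine similarities, scaled by the temperature.
    logits = q @ p.T / temperature
    # Softmax over each row; the target distribution puts probability 1 on
    # the diagonal (the positive), which cross-entropy with labels 0..B-1
    # realizes exactly.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```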

Slide 22

Slide 22 text

SimCSE: Overview Diagram

Unsupervised SimCSE: "regularize + push sentence embeddings apart"
Supervised SimCSE: "pull semantically close sentence embeddings together + push all other sentence embeddings apart"

Slide 23

Slide 23 text

The Evolution of Sentence Embedding Research

The scaling era (2022~)
• Training began to be scaled up aggressively
  • Scaling contrastive learning by increasing data volume and batch size
  • Scaling model parameters
• Introduction of multi-stage contrastive learning
  • Pre-training on weakly supervised data → fine-tuning with supervised learning
  • Example models: E5, GTE, BGE, …
• Research on text embeddings built with large language models (LLMs) is also advancing
  • PromptEOL, E5-Mistral, LLM2Vec, …

Slide 24

Slide 24 text

E5

Slide 25

Slide 25 text

Overview (recap)

• Proposes E5, a text embedding model built through large-scale contrastive pre-training
• Builds a weakly supervised dataset from semi-structured data with filtering
• Pre-training with a batch size of 32,000
• Fine-tuning that leverages hard negatives and knowledge distillation from a Cross-Encoder
• In evaluation, outperforms existing models on average across a variety of benchmarks

Model     #Layers  Hidden size  #Params
E5-small  12       384          33M
E5-base   12       768          110M
E5-large  24       1024         330M

Slide 26

Slide 26 text

E5: The Big Picture

Components of E5
• Construction of a large-scale text-pair dataset
• Large-scale pre-training with contrastive learning
• Fine-tuning on a high-quality labeled dataset, combined with knowledge distillation
• Prefixes added to the model inputs, so the model learns to capture asymmetric relations in the data

Slide 27

Slide 27 text

CCPairs: A Large-Scale Dataset for Contrastive Pre-training

• In deep learning, data quality and diversity strongly determine model performance
• Yet few datasets exist for training text embedding models
  • Prior work relies on small human-annotated datasets such as Stanford NLI and MS-MARCO
• The authors build a large-scale dataset for training text embedding models
  • Text pairs are collected from semi-structured data (before filtering: 1.3B pairs)

Examples of the collected data and their format:
Source        Query                   Passage          Size
Wikipedia     entity + section title  passage          24M
Reddit        post                    upvoted comment  60M
Common Crawl  title                   passage          69M

Slide 28

Slide 28 text

CCPairs: Noise Removal with a Consistency Filter

• Filtering is applied to improve data quality and reduce training cost
  • The dataset is reduced to a final 270M pairs
• Exploits the tendency of deep models to learn the clean examples in noisy data first
  • A model memorizes the clean examples within dirty data → examples it fails to memorize are deemed "dirty"

(Note: the follow-up GTE does not apply this filtering.)

Slide 29

Slide 29 text

CCPairs: Noise Removal with a Consistency Filter

• Filtering is applied to improve data quality and reduce training cost
  • The dataset is reduced to a final 270M pairs
• Exploits the tendency of deep models to learn the clean examples in noisy data first

Consistency-based data filtering (a code sketch follows below)
1. Train a model on the noisy 1.3B-pair dataset
2. Randomly sample 1M passages
3. For each query, use the model from step 1 to score its positive passage and each of the sampled passages
4. Keep only the examples whose positive passage ranks in the top 2 by similarity

(Note: the follow-up GTE does not apply this filtering.)
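A rough sketch of the consistency-based filter under the assumptions above; the `encode` function standing in for the step-1 model is hypothetical, and batching over the 1M passages is elided for clarity:

```python
import torch
import torch.nn.functional as F

def consistency_filter(pairs, encode, sampled_passages, top_k=2):
    """pairs: list of (query, positive_passage) strings.
    encode: hypothetical embedding function of the model trained in step 1.
    sampled_passages: ~1M randomly sampled passages used as distractors.
    Keeps a pair only if its positive ranks in the top_k by similarity."""
    pool = F.normalize(encode(sampled_passages), dim=-1)  # (N, dim)
    kept = []
    for query, positive in pairs:
        q = F.normalize(encode([query]), dim=-1)          # (1, dim)
        p = F.normalize(encode([positive]), dim=-1)       # (1, dim)
        pos_sim = (q @ p.T).item()
        pool_sims = (q @ pool.T).squeeze(0)               # (N,)
        # Rank of the positive = 1 + number of distractors scoring higher.
        rank = 1 + (pool_sims > pos_sim).sum().item()
        if rank <= top_k:
            kept.append((query, positive))
    return kept
```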

Slide 30

Slide 30 text

Training Procedure

• A two-stage training procedure is adopted (a usage sketch for the prefixes follows below)

1. Contrastive Pre-training
• Train with the standard contrastive loss on CCPairs, using a huge batch size
  • The noisier the data, the larger the batch size probably ought to be
• A prefix indicating the role of the input is prepended
  • Two prefixes: "query:" and "passage:"

2. Fine-tuning
• Fine-tune on human-annotated labeled datasets
  • A knowledge distillation loss is used in addition to the contrastive loss
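For illustration, a hedged sketch of encoding such prefixed inputs with the Hugging Face transformers API and the published E5 checkpoint; this mirrors the model card's recommended usage, and the average pooling it uses is explained later in the deck:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base")
model = AutoModel.from_pretrained("intfloat/e5-base")

# Asymmetric prefixes: queries and passages are marked differently.
texts = ["query: how do text embeddings work",
         "passage: Text embeddings map sentences to dense vectors."]

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state       # (batch, seq, dim)

# Average pooling over non-padding tokens -> one vector per text.
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
emb = F.normalize(emb, dim=-1)
print((emb[0] @ emb[1]).item())                     # cosine similarity
```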

Slide 31

Slide 31 text

Training Procedure: Contrastive Learning / Contrastive Pre-training

• Similarities are computed between query and passage embeddings
  • Train so that the query–positive similarity becomes higher than the query–negative similarities
  • Loosely put: maximize the diagonal of the query–passage similarity matrix
• The other examples in the same batch serve as negatives: in-batch negatives

Slide 32

Slide 32 text

Training Procedure: Contrastive Learning / Contrastive Pre-training

• Similarities are computed between query and passage embeddings
  • Train so that the query–positive similarity becomes higher than the query–negative similarities
  • Loosely put: maximize the diagonal of the query–passage similarity matrix
• The other examples in the same batch serve as negatives: in-batch negatives

(Figure: the query and the positive passage are each encoded by the model, and the embeddings of the positive pair are pulled together; the similarity is typically cosine similarity.)

Slide 33

Slide 33 text

Training Procedure: Contrastive Learning / Contrastive Pre-training

• Similarities are computed between query and passage embeddings
  • Train so that the query–positive similarity becomes higher than the query–negative similarities
  • Loosely put: maximize the diagonal of the query–passage similarity matrix
• The other examples in the same batch serve as negatives: in-batch negatives

(Figure: a batch of queries and a batch of passages are encoded separately; because queries and passages are encoded independently, this architecture is called a Dual-Encoder. The weights are shared, i.e. it is the same model, and the similarity is typically cosine similarity.)

Slide 34

Slide 34 text

Intuition for the Loss: Contrastive Loss

• Aim for the probability distribution that puts probability 1 on the positive alone
• To lower the loss:
  • Raise the similarity of the positive
  • Lower the similarities of the negatives

Applying a softmax to the dot products lets them be treated as a probability distribution.
(Figure: the predicted distribution over the positive, the in-batch negatives, and the hard negatives is pushed toward a one-hot distribution; the loss is written out below.)
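In symbols (with notation that may differ from the paper, as the disclaimer notes), the contrastive loss for a query q with positive passage p⁺, negatives pᵢ, similarity s(·,·), and temperature τ is:

```latex
\mathcal{L}_{\mathrm{cont}}
  = -\log
    \frac{\exp\!\big(s(q, p^{+})/\tau\big)}
         {\exp\!\big(s(q, p^{+})/\tau\big) + \sum_{i}\exp\!\big(s(q, p_{i})/\tau\big)}
```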

Slide 35

Slide 35 text

Training Procedure: Fine-tuning

• For fine-tuning on labeled data, data quality is crucial

1. Using hard negatives
• Difficult examples that are not obviously negative at a glance
  • These sharpen the model's representations and help it capture fine-grained information
• For MS-MARCO and Natural Questions (NQ), negatives are mined and used

2. Combining contrastive learning and knowledge distillation
• To enrich the supervision signal, the outputs of a Cross-Encoder are used as the teacher
• Multi-task learning that combines the contrastive loss and the knowledge distillation loss

(Figure: knowledge distillation loss + contrastive loss.)

Slide 36

Slide 36 text

Training Procedure: Knowledge Distillation from a Cross-Encoder

• A Dual-Encoder (DE) cannot take query–passage interactions into account
• A Cross-Encoder (CE) can see those interactions, but is inefficient
• So wouldn't training the DE to mimic the CE solve this? 🧐
• Bring the DE's output similarity distribution close to the CE's similarity distribution

(Figure: the Cross-Encoder feeds the concatenated query + passage through one model to get a similarity score; the Dual-Encoder encodes the query and the passage separately and compares the embeddings. A schematic sketch follows below.)
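A schematic contrast of the two scoring styles; this is a hedged sketch, with `encoder` and `scorer` standing in for the actual models:

```python
import torch

def dual_encoder_score(encoder, query, passage):
    """Dual-Encoder: two independent forward passes, compared afterwards.
    Passage embeddings can be precomputed and indexed -> efficient retrieval,
    but attention never sees query and passage together."""
    q = encoder(query)    # (dim,)
    p = encoder(passage)  # (dim,)
    return torch.dot(q, p) / (q.norm() * p.norm())

def cross_encoder_score(scorer, query, passage):
    """Cross-Encoder: one forward pass over the concatenated pair, so
    attention can model query-passage interactions -- accurate but slow,
    since every (query, passage) pair needs its own forward pass."""
    return scorer(query + " [SEP] " + passage)  # scalar relevance score
```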

Slide 37

Slide 37 text

Intuition for the Loss: Knowledge Distillation Loss

• Bring the DE's distribution close to the CE's similarity distribution
• Unlike in contrastive learning, whether a given similarity goes up or down is not fixed in advance
  • Even a negative can be trained toward higher similarity
  • This teaches the DE fine-grained similarity relationships

Applying a softmax to the dot products lets them be treated as a probability distribution.
(Figure: the predicted distribution is pushed toward the CE's distribution. A code sketch follows below.)
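A minimal sketch of the distillation loss as a KL divergence between the two softmax distributions, with the total fine-tuning loss as a weighted combination; which term α scales is my assumption, and the paper's exact weighting may differ:

```python
import torch.nn.functional as F

def kd_loss(de_logits, ce_logits, temperature=1.0):
    """de_logits, ce_logits: (batch, num_candidates) similarity scores from
    the Dual-Encoder (student) and Cross-Encoder (teacher) over the same
    candidates. KL divergence pulls the student's distribution toward the
    teacher's."""
    student = F.log_softmax(de_logits / temperature, dim=-1)
    teacher = F.softmax(ce_logits / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

def finetune_loss(de_logits, ce_logits, labels, alpha=0.2):
    # Assumption: alpha (0.2 in the settings table later) weights the
    # contrastive term against the distillation term.
    contrastive = F.cross_entropy(de_logits, labels)
    return kd_loss(de_logits, ce_logits) + alpha * contrastive
```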

Slide 38

Slide 38 text

Benefit of Knowledge Distillation: Suppressing False Negatives

• Contrastive learning uses in-batch negatives, but this can produce false negatives
  • A phenomenon where an example that is really a positive (one we want to pull closer) gets treated as a negative

(Figure: with the contrastive loss, a passage that is actually relevant gets trained toward excessively low similarity.)

Slide 39

Slide 39 text

Benefit of Knowledge Distillation: Suppressing False Negatives

• Contrastive learning uses in-batch negatives, but this can produce false negatives
  • A phenomenon where an example that is really a positive (one we want to pull closer) gets treated as a negative
• In knowledge distillation, the Cross-Encoder's outputs are used as pseudo teacher labels
  • This mitigates the false negative problem

(Figure: with the distillation loss, a passage that is actually relevant but currently has low similarity can have its similarity trained upward, easing the false negative problem.)

Slide 40

Slide 40 text

Intuition for the Loss: Comparing What Each Loss Pulls Toward

(Figure: both losses push the predicted distribution toward a target via the KL divergence D_KL; the contrastive loss targets the one-hot ground-truth distribution, while the distillation loss targets the CE's distribution. The two objectives differ almost only in which distribution they pull toward; see the formulas below.)
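In symbols (again, notation may differ from the paper), both objectives are KL divergences that differ only in the target distribution; note that cross-entropy against a one-hot target equals the KL divergence, since a one-hot distribution has zero entropy:

```latex
\mathcal{L}_{\mathrm{cont}}
  = D_{\mathrm{KL}}\big(\,\mathrm{onehot}(p^{+}) \,\big\|\, P_{\mathrm{DE}}\,\big),
\qquad
\mathcal{L}_{\mathrm{kd}}
  = D_{\mathrm{KL}}\big(\, P_{\mathrm{CE}} \,\big\|\, P_{\mathrm{DE}}\,\big)
```

where P_DE and P_CE are the softmax distributions over candidate passages produced by the Dual-Encoder and the Cross-Encoder, respectively.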

Slide 41

Slide 41 text

Summary: How E5 Is Built

(Pipeline diagram: Unfiltered Corpus → Consistency-based Filtering → CCPairs. An encoder-only Masked LM undergoes Contrastive Pre-training on CCPairs, yielding E5-PT; E5-PT then undergoes Contrastive Fine-tuning on Labeled Data with Knowledge Distillation from a Reranker, yielding E5.)

Slide 42

Slide 42 text

Summary: How E5 Is Built

(Pipeline diagram: as on the previous slide, now with the Reranker's provenance added: the Reranker is itself built on top of retrievers, Retriever 1 and Retriever 2. A lot of engineering effort goes into every stage.)

• The procedure for building the reranker is not described at all in the original paper
• For details, see the SimLM paper, the prior work this builds on

Slide 43

Slide 43 text

Model Settings and Training Details

• Pooling: average pooling (take the mean of the output embeddings)
  • The Transformer outputs a sequence of vectors; pooling is the operation that reduces it to a single vector (a sketch follows below)

E5-large          pre-training  fine-tuning
#GPUs (V100)      64            8
Batch size        32000         256
Max length        128           192
#Iterations       20000 steps   3 epochs
Temperature (τ)   0.01          0.01
Loss weight (α)   N/A           0.2
#Hard negatives   N/A           7
Dataset           CCPairs       MS-MARCO, NQ, NLI
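A minimal sketch of masked average pooling; padding tokens must be excluded from the mean (not from the slides):

```python
import torch

def average_pool(last_hidden_state, attention_mask):
    """last_hidden_state: (batch, seq, dim), attention_mask: (batch, seq).
    Returns one embedding per input by averaging over non-padding tokens."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # zero out padding
    counts = mask.sum(dim=1).clamp(min=1e-9)         # tokens per sequence
    return summed / counts
```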

Slide 44

Slide 44 text

Model Settings and Training Details

• Pooling: average pooling (take the mean of the output embeddings)
  • The Transformer outputs a sequence of vectors; pooling is the operation that reduces it to a single vector

(Same settings table as the previous slide, annotated:)
• The temperature τ = 0.01 is smaller than SimCSE's 0.05
  • With a small temperature, the model does not have to sharpen the similarity distribution that much on its own
  • Perhaps a concession to training on diverse data (so the model is not forced to push similarities up unnaturally)

Slide 45

Slide 45 text

Temperature Parameter: Intuition

• One of the most important hyperparameters in contrastive learning
• Rescales the pre-softmax values, changing the shape of the post-softmax distribution (a numeric demo follows below)

(Figure: with a high temperature, e.g. 10, the predicted distribution becomes flat; with a low temperature, e.g. 0.01, it becomes sharply peaked.)
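A quick numeric illustration of how the temperature reshapes the softmax (the similarity values are arbitrary):

```python
import torch
import torch.nn.functional as F

sims = torch.tensor([0.9, 0.7, 0.5])  # raw similarity scores

for tau in (10.0, 1.0, 0.01):
    probs = F.softmax(sims / tau, dim=-1)
    print(tau, probs.tolist())

# tau = 10   -> nearly uniform (flat distribution)
# tau = 1    -> mildly peaked
# tau = 0.01 -> essentially one-hot on the highest similarity (sharp)
```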

Slide 46

Slide 46 text

Temperature Parameter: Intuition

• One of the most important hyperparameters in contrastive learning
• Rescales the pre-softmax values, changing the shape of the post-softmax distribution

(Figure: with a high temperature, e.g. 10, the distribution flattens, so the model must work hard to sharpen it on its own; with a low temperature, e.g. 0.01, the distribution sharpens even without the model working hard.)

Slide 47

Slide 47 text

Evaluation Experiments

• Benchmarks designed for text embedding evaluation are used (a usage sketch follows below)

BEIR
• A benchmark specialized for information retrieval tasks, comprising 19 datasets
• Evaluated with nDCG@10

MTEB
• A general-purpose benchmark of 6 task types and 56 datasets
• Comes with a leaderboard and has seen heavy use in recent years
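For reference, a hedged sketch of running one MTEB task with the mteb Python package; API details may differ across versions, and the task choice here is purely illustrative:

```python
# pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing .encode(list_of_texts) -> embeddings works here.
# Note: E5 models expect "query:"/"passage:" prefixes for best results.
model = SentenceTransformer("intfloat/e5-base")

evaluation = MTEB(tasks=["STSBenchmark"])  # one illustrative task
results = evaluation.run(model, output_folder="results/e5-base")
```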

Slide 48

Slide 48 text

Results: BEIR 🍺

• Outperforms existing methods such as SimCSE and Contriever
  • E5 puts far more work into dataset construction than these methods
• Contrastive pre-training alone is already quite strong
  • Pre-training aimed at text embeddings is effective
• E5's fine-tuning datasets are limited
  • It still loses to existing methods on a fair number of tasks
  • Diversifying and scaling up the data looks like a route to further gains

Average nDCG@10 on BEIR (heavily abridged; see the original paper for full results):
Unsupervised:  BM25 41.7 / SimCSE-base 20.3 / Contriever (unsup.) 36.0 / E5-PT-large 44.2
Supervised:    Contriever (sup.) 46.6 / ColBERT 44.4 / E5-large 50.0

Slide 49

Slide 49 text

Results: MTEB

• High performance on average across diverse datasets

Slide 50

Slide 50 text

Analysis: Effect of Batch Size

• Performance on BEIR is evaluated while varying the batch size of contrastive pre-training
• Larger batch sizes yield better performance
• 🧐 Though a peak at 16k cannot be ruled out
  • In the follow-up GTE, 8k or 16k was best

Slide 51

Slide 51 text

Analysis: Diversity of the Fine-tuning Datasets

• Performance on MTEB is evaluated while varying the fine-tuning datasets
• On average, fine-tuning beats contrastive pre-training alone
  • However, fine-tuning on the NLI dataset alone actually degrades retrieval performance
  • Retrieval + QA data improve performance considerably, but NLI is needed to maximize STS performance
• Mixing everything gives the best average performance; diversification is key

(Presenter's personal view: sentence embedding models tuned on NLI are not suited to retrieval use cases.)

Slide 52

Slide 52 text

Analysis: Importance of Filtering

• Performance is compared with and without filtering
  • To see the effect of data size, experiments are also run on small-scale data
• Filtering improves performance regardless of data size
  • The effect of filtering is especially large on small-scale data

Slide 53

Slide 53 text

Summary

• Proposes E5, a text embedding model built through large-scale contrastive pre-training
• Builds a weakly supervised dataset from semi-structured data with filtering
• Adopts a two-stage training procedure:
  1. Contrastive pre-training
  2. Fine-tuning that combines contrastive learning and knowledge distillation

Things that bothered me
• The paper reads very much like a technical report, with few explicit research questions
  • The usefulness of the knowledge distillation is never validated, and how the hard negatives are constructed remains vague
• The prefixes seem useful, but their effect is not verified

Slide 54

Slide 54 text

Supplement

• Parts of E5 presuppose knowledge of its predecessor, SimLM
  • In particular, the SimLM paper details how the Reranker is built and how negatives are mined
• The follow-up work GTE is also instructive and worth reading alongside
• Not using the reranker during pre-training appears to be a matter of cost
  • Likewise, no hard negative mining is done during contrastive pre-training
• Multilingual E5, the multilingual version of E5, is very strong and makes a good baseline
  • A paper evaluating Japanese retrieval performance and retrieval-augmented generation is linked here