Slide 1

Slide 1 text

2018/07/08 Yuji Yamamoto (@y_yammt) ACL 2018 Reading Group (@LINE Corp)

Slide 2

Slide 2 text

Paper introduced in this talk
• https://arxiv.org/abs/1805.03642
• Authors contributed equally.
• Apparently the result of an internship at Borealis AI (I'm jealous).

Slide 3

Slide 3 text

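Overview
• An improvement to Noise Contrastive Estimation (NCE), which is used for parameter estimation in models such as word embeddings.
• Brings the mechanism of Generative Adversarial Networks (GAN) into negative sampling.
• Experiments confirm that it converges faster than NCE.
• Several metrics on downstream tasks are also confirmed to improve.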

Slide 4

Slide 4 text

Outline of the talk
1. Introduction: the Skip-gram model and Noise Contrastive Estimation
2. Proposed method: Adversarial Contrastive Estimation
3. Experiments
4. Summary

Slide 5

Slide 5 text

The Skip-gram model and Noise Contrastive Estimation

Slide 6

Slide 6 text

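What was the Skip-gram model again? (1/2)
• One way of mapping words to vectors (word embedding).
• It produces vectors that are good at predicting the surrounding words from the word in focus.
[Figure: words are mapped to vectors; given the target word $w_t$, the model tries to predict the context word $w_c$.]
$$p_{U,V}(w_c \mid w_t) = \frac{\exp(u(w_t)^\top v(w_c))}{\sum_{w_{c'} \in A'} \exp(u(w_t)^\top v(w_{c'}))}$$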

Slide 7

Slide 7 text

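What was the Skip-gram model again? (2/2)
[Figure: $w_t$ is mapped to $u(w_t)$ via $U$, and $w_c$ is mapped to $v(w_c)$ via $V$; the model tries to predict the context word $w_c$.]
$$p_{U,V}(w_c \mid w_t) = \frac{\exp(u(w_t)^\top v(w_c))}{\sum_{w_{c'} \in A'} \exp(u(w_t)^\top v(w_{c'}))}$$
$$u(\cdot),\, v(\cdot) \in \mathbb{R}^{d}, \qquad U \in \mathbb{R}^{A \times d}, \qquad V \in \mathbb{R}^{A' \times d}$$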
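The softmax above can be made concrete with a minimal sketch. It uses plain Python lists instead of a numerical library, and the vectors and the helper names `dot` and `skipgram_probs` are made up for illustration; they do not come from the slides or the paper.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def skipgram_probs(u_t, V):
    """Return p(w_c | w_t) for every context word, given u(w_t) and the rows v(w_c) of V."""
    scores = [dot(u_t, v) for v in V]
    top = max(scores)  # subtracting the max leaves the ratio unchanged but avoids overflow
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)  # the denominator: a sum over every context word in A'
    return [e / total for e in exps]

# Toy values, made up for illustration: d = 2 and three context words.
u_t = [1.0, 0.0]
V = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
probs = skipgram_probs(u_t, V)
```

The context word whose vector has the largest inner product with `u_t` receives the largest probability, and the denominator touches every row of `V`.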

Slide 8

Slide 8 text

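Objective function of the Skip-gram model
• Take the negative log-likelihood of $w_t$ and $w_c$ at a given position in the text, and optimize $u$, $v$ so that it is minimized:
$$l = -\log p_{U,V}(w_c \mid w_t) = -\log \frac{\exp(u(w_t)^\top v(w_c))}{\sum_{w_{c'} \in A'} \exp(u(w_t)^\top v(w_{c'}))}$$
• The parameters can be estimated once the partial derivatives $\partial l / \partial u(w_t)$ and $\partial l / \partial v(w_c)$ are available, but...
$$\frac{\partial l}{\partial u(w_t)} = -v(w_c) + \mathbb{E}_{p(w_{c'} \mid w_t)}\left[v(w_{c'})\right]$$
• The expectation costs $O(A')$: the work needed for the gradient is proportional to the size of the context vocabulary (the number of rows of $V \in \mathbb{R}^{A' \times d}$), so it can become an expensive computation.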

Slide 9

Slide 9 text

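Tricks for reducing the computation
• Noise Contrastive Estimation (NCE), among others.
• This talk introduces the simplified version of Noise Contrastive Estimation (Negative Sampling, NS) that appears in Mikolov's paper.
• For the paper presented here, it makes no difference (it is "inconsequential") whether the simplified version is used or not.
• NCE and NS are also explained in detail in the book 「深層学習による自然言語処理」 (Deep Learning for Natural Language Processing).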

Slide 10

Slide 10 text

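Simplified Noise Contrastive Estimation
• Learn to discriminate between the context word $w_c$ of one training example and $k$ noise context words $S'$.
$$S' = \{\bar{w}_{c1}, \cdots, \bar{w}_{ck}\}$$
• $S'$ is a set of $k$ randomly drawn context words (not necessarily drawn from a uniform distribution).
• The objective function is changed from
$$l = -\log \exp(u(w_t)^\top v(w_c)) + \log \sum_{w_{c'} \in A'} \exp(u(w_t)^\top v(w_{c'}))$$
to
$$l_{NS} = -\log \sigma(u(w_t)^\top v(w_c)) - \sum_{w_{c'} \in S'} \log\left(1 - \sigma(u(w_t)^\top v(w_{c'}))\right)$$
where $\sigma$ is the sigmoid function.
• $\sigma(u(w_t)^\top v(w_c))$ is the probability that the positive example occurs; $1 - \sigma(u(w_t)^\top v(w_{c'}))$ is the probability that a negative (noise) example does not occur.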
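A minimal sketch of the negative-sampling loss $l_{NS}$ follows, assuming $\sigma$ is the logistic sigmoid. The noise weights, the vectors, and the helper names (`log_sigmoid`, `ns_loss`, `sample_negatives`) are hypothetical and chosen only for illustration. The noise distribution here ignores $w_t$ entirely.

```python
import math
import random

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def log_sigmoid(z):
    """Numerically stable log(sigma(z)), where sigma is the logistic sigmoid."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def ns_loss(u_t, v_pos, v_negs):
    """l_NS = -log sigma(u.v+) - sum over negatives of log(1 - sigma(u.v-))."""
    positive_term = -log_sigmoid(dot(u_t, v_pos))
    # 1 - sigma(z) equals sigma(-z), so log(1 - sigma(z)) is log_sigmoid(-z).
    negative_term = -sum(log_sigmoid(-dot(u_t, v)) for v in v_negs)
    return positive_term + negative_term

def sample_negatives(V, noise_weights, k, rng):
    """Draw k context vectors from a noise distribution that does not depend on w_t."""
    indices = rng.choices(range(len(V)), weights=noise_weights, k=k)
    return [V[i] for i in indices]

# Toy values, made up for illustration.
u_t = [1.0, 0.0]
V = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
rng = random.Random(0)
negatives = sample_negatives(V, noise_weights=[1.0, 1.0, 2.0], k=2, rng=rng)
loss = ns_loss(u_t, V[0], negatives)
```

The cost per training example now depends on $k$ rather than on the size of the context vocabulary.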

Slide 11

Slide 11 text

Adversarial Contrastive Estimation (ACE)

Slide 12

Slide 12 text

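Revisiting simplified NCE
$$l_{NS} = -\log \sigma(u(w_t)^\top v(w_c)) - \sum_{w_{c'} \in S'} \log\left(1 - \sigma(u(w_t)^\top v(w_{c'}))\right)$$
• The first term uses the probability that the positive example occurs; the second uses the probability that the negative (noise) examples do not occur. $S'$ is a set of $k$ randomly drawn context words.
• Negatives are generated without looking at the target word $w_t$, so they may end up being easy to tell apart from the positive example.
→ We want to be able to generate harder negatives.
→ Bring in the mechanism of Generative Adversarial Networks.
[Figure: $w_t$ and $w_c$ mapped to vectors, together with noise words $\bar{w}_{c1}$ and $\bar{w}_{c2}$; label: "concentrate more".]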

Slide 13

Slide 13 text

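Rewriting NCE in a slightly more general form
• Let $\omega$ be the parameters to optimize (← $U, V$).
• Given a target $x$ (← $w_t$),
• let $y^+$ be the outcome that was observed (positive example, ← $w_c$), and
• let $y^-$ be a noise outcome (negative example, ← $w_{c'}$).
The loss function is then
$$L(\omega; x) = \mathbb{E}_{p(y^+ \mid x)\, p_{nce}(y^-)}\; l_\omega(x, y^+, y^-)$$
which is minimized. Negatives are generated independently of $x$. (Slide annotation: "likelihood function".)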

Slide 14

Slide 14 text

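Adversarial Contrastive Estimation
• Loss function of the proposed method:
$$L(\omega, \theta; x) = \lambda\, \mathbb{E}_{p(y^+ \mid x)\, p_{nce}(y^-)}\; l_\omega(x, y^+, y^-) + (1 - \lambda)\, \mathbb{E}_{p(y^+ \mid x)\, g_\theta(y^- \mid x)}\; l_\omega(x, y^+, y^-)$$
  In the second term, negatives are generated from the target $x$.
• Optimization (GAN-style minimax game):
$$\min_\omega \max_\theta\; \mathbb{E}_{p^+(x)}\, L(\omega, \theta; x)$$
  Generator ($\max_\theta$): "I'll produce hard negatives." Discriminator ($\min_\omega$): "I'll tell positives and negatives apart properly."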
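To see how the mixture loss behaves, here is a minimal sketch for a single fixed pair $(x, y^+)$. It makes several assumptions that are not in the slides: $l_\omega$ takes the negative-sampling form with one negative, the expectations are computed exactly over a tiny candidate list instead of by sampling, and `g_theta` is a hand-set distribution standing in for a trained generator. No generator training is shown.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def log_sigmoid(z):
    """Numerically stable log(sigma(z))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def pair_loss(u_t, v_pos, v_neg):
    """l_omega(x, y+, y-) for one negative, using the negative-sampling form (an assumption)."""
    return -log_sigmoid(dot(u_t, v_pos)) - log_sigmoid(-dot(u_t, v_neg))

def ace_loss(u_t, v_pos, candidates, p_nce, g_theta, lam):
    """lam * E_{p_nce}[l] + (1 - lam) * E_{g_theta}[l] for a fixed (x, y+).

    Both expectations are exact sums over the candidate negatives.
    """
    losses = [pair_loss(u_t, v_pos, v_neg) for v_neg in candidates]
    nce_term = sum(p * l for p, l in zip(p_nce, losses))
    adv_term = sum(g * l for g, l in zip(g_theta, losses))
    return lam * nce_term + (1.0 - lam) * adv_term

# Toy values, made up for illustration.
u_t = [1.0, 0.0]
v_pos = [2.0, 0.0]
candidates = [[1.5, 0.0], [0.0, 1.0], [-1.0, 0.0]]
p_nce = [1.0 / 3, 1.0 / 3, 1.0 / 3]  # noise that ignores x
g_theta = [0.8, 0.15, 0.05]          # hypothetical generator favouring the hardest negative

nce_only = ace_loss(u_t, v_pos, candidates, p_nce, g_theta, lam=1.0)
adv_only = ace_loss(u_t, v_pos, candidates, p_nce, g_theta, lam=0.0)
mixed = ace_loss(u_t, v_pos, candidates, p_nce, g_theta, lam=0.5)
```

In this toy setting the generator term is larger than the NCE term, because `g_theta` puts most of its mass on the negative that looks most like the positive. That is the direction the $\max_\theta$ player pushes in.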

Slide 15

Slide 15 text

Finer details of ACE
• Entropy regularization for the Generator
• Special handling for the case where a false negative (actually a positive example) is drawn as noise
• And so on
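The slide names these two techniques without giving formulas, so the sketch below is only one plausible reading. The hinge-style penalty in `entropy_penalty` and the filtering rule in `mask_false_negatives` are both assumptions made for illustration, and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_penalty(probs, threshold):
    """Assumed hinge-style regularizer: positive only when the entropy drops below threshold."""
    return max(0.0, threshold - entropy(probs))

def mask_false_negatives(negative_ids, positive_id):
    """Assumed handling: discard sampled negatives that coincide with the observed positive."""
    return [i for i in negative_ids if i != positive_id]

# Toy generator outputs over four candidate words, made up for illustration.
spread_out = [0.25, 0.25, 0.25, 0.25]
collapsed = [0.97, 0.01, 0.01, 0.01]
threshold = math.log(2)

penalty_spread = entropy_penalty(spread_out, threshold)
penalty_collapsed = entropy_penalty(collapsed, threshold)
kept = mask_false_negatives([3, 1, 3, 0], positive_id=3)
```

A penalty of this kind discourages the generator from collapsing onto a handful of words, and the mask keeps the discriminator from being told that an observed positive is noise.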

Slide 16

Slide 16 text

Experiments

Slide 17

Slide 17 text

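Overview of the experimental tasks
1. Word embeddings
• For word pairs, compute the rank correlation between human-assigned similarity ratings and the similarity given by the word embeddings, and use it as the evaluation score.
• Results are shown on the following slides.
2. Hypernym prediction
• Given a word pair (word1, word2), predict whether "word1 is a word2" holds.
• e.g. (New York, city) → True
3. Knowledge graph embeddings
• Learn from relational data (entity1, relation, entity2) and predict missing links (a.k.a. link prediction).
• http://letra418.hatenablog.com/entry/2017/07/24/223257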

Slide 18

Slide 18 text

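Word embedding results (Spearman score)
• Trained with a single pass over the English Wikipedia.
• For word pairs, compute the rank correlation between human-assigned similarity ratings and the similarity given by the word embeddings, and use it as the evaluation score.
• ADV: negatives are generated by the Generator only (λ=0). ACE: Generator and NS.
• What is an "Iteration"? (What problem is being solved at each Iteration?)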
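The evaluation described above can be sketched as follows: rank the human ratings and the embedding similarities separately, then take the Pearson correlation of the ranks. The embeddings, the word pairs, and the human ratings are made up for illustration, and cosine similarity is an assumed choice of similarity measure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def average_ranks(values):
    """1-based ranks in ascending order; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data, made up for illustration.
embeddings = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0], "bus": [0.3, 0.8]}
pairs = [("cat", "dog"), ("car", "bus"), ("cat", "car"), ("dog", "bus")]
human = [9.0, 8.5, 2.0, 3.0]
model = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
rho = spearman(human, model)
```

Only the ordering of the pairs matters, so the score does not change if the human ratings and the cosine similarities are on different scales.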

Slide 19

Slide 19 text

Word embedding results (Nearest neighbors)

Slide 20

Slide 20 text

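Limitations of ACE
• The Generator is computationally expensive.
  • Generating negatives involves a softmax, so a computation similar to the expression before the NCE approximation creeps back in.
  • Training word embeddings is a precomputation for downstream tasks, so isn't it justified even if it takes time? (It would be nice to be able to say that it converges faster than MLE, or that the downstream-task metrics improved.)
• It is unclear how many of the properties that NCE satisfies still hold.
  • Under certain conditions, NCE behaves similarly to MLE. https://qiita.com/Quasi-quant2010/items/a15b0d1b6428dc49c6c2
  • Because ACE brings in the GAN mechanism, it is unclear whether this still holds.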

Slide 21

Slide 21 text

Summary

Slide 22

Slide 22 text

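Summary
• An improvement to the kind of supervised learning that learns by contrasting observed samples with fake samples.
• Adversarial Contrastive Estimation (ACE)
  • Uses a generator network in a GAN-like setup that can propose hard negatives to the discriminative model.
  • Entropy regularization for the Generator and proper handling of false negatives turned out to be important for training to go well.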

Slide 23

Slide 23 text

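Impressions
• On the word embedding task, the learned vectors look reasonable as a similarity measure.
  → Could this be used for product recommendation?
  → In fact, similar work appeared at RecSys 2018 (around the same time): Adversarial Training of Word2Vec for Basket Completion, https://arxiv.org/abs/1805.08720
• Many aspects of the implementation are unclear. I would like the implementation to be released.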

Slide 24

Slide 24 text

Supplementary slides

Slide 25

Slide 25 text

Relating the Skip-gram model to the paper's notation

Slide 26

Slide 26 text

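Objective function of the Skip-gram model (1/2)
• Take the negative log-likelihood over the word pairs that can occur in the text.
• Formulations for a single word pair are fairly common, but to match the paper's notation we consider all pairs here.
$$-\sum_{w_t \in A} \sum_{w_c \in A'} \mathrm{freq}(w_t, w_c) \log p_{U,V}(w_c \mid w_t)$$
→ Optimize $U, V$ so that this is minimized.
• Generalizing:
$$L = -\sum_{w_t \in A} \sum_{w_c \in A'} p(w_t, w_c) \log p_{U,V}(w_c \mid w_t) = -\sum_{w_t \in A} p(w_t) \sum_{w_c \in A'} p(w_c \mid w_t) \log p_{U,V}(w_c \mid w_t)$$
→ Minimize. If we set $p(w_t, w_c) \propto \mathrm{freq}(w_t, w_c)$, the two are equivalent as far as minimization is concerned.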
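The equivalence claimed above can be spelled out in one step. The symbol $N$ for the total pair count is introduced here for illustration and does not appear on the slide.

```latex
% Assumption: N is the total number of observed (w_t, w_c) pairs, so that
% p(w_t, w_c) = freq(w_t, w_c) / N is a proper joint distribution.
\begin{align*}
N &= \sum_{w_t \in A} \sum_{w_c \in A'} \mathrm{freq}(w_t, w_c), \qquad
p(w_t, w_c) = \frac{\mathrm{freq}(w_t, w_c)}{N}, \\
-\sum_{w_t \in A} \sum_{w_c \in A'} \mathrm{freq}(w_t, w_c) \log p_{U,V}(w_c \mid w_t)
  &= N \left( -\sum_{w_t \in A} \sum_{w_c \in A'} p(w_t, w_c) \log p_{U,V}(w_c \mid w_t) \right)
   = N \cdot L .
\end{align*}
```

Since $N > 0$ does not depend on $U$ or $V$, both objectives are minimized by the same parameters.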

Slide 27

Slide 27 text

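Objective function of the Skip-gram model (2/2)
$$L = -\sum_{w_t \in A} p(w_t) \sum_{w_c \in A'} p(w_c \mid w_t) \log p_{U,V}(w_c \mid w_t) = -\sum_{w_t \in A} p(w_t) \sum_{w_c \in A'} p(w_c \mid w_t) \left[ \log \exp(u(w_t)^\top v(w_c)) - \log \sum_{w_{c'} \in A'} \exp(u(w_t)^\top v(w_{c'})) \right]$$
• The second term inside the brackets costs $O(A')$, one row of $V \in \mathbb{R}^{A' \times d}$ per context word: with a large vocabulary the computation takes time.
• Tricks for reducing the computation: Noise Contrastive Estimation, Negative Sampling, etc.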

Slide 28

Slide 28 text

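Simplified Noise Contrastive Estimation (1/2)
• This talk introduces the simplified version of Noise Contrastive Estimation (Negative Sampling, NS) that appears in Mikolov's paper.
• For the paper presented here, it makes no difference (it is "inconsequential") whether the simplified version is used or not.
• NCE and NS are also explained in detail in the book 「深層学習による自然言語処理」 (Deep Learning for Natural Language Processing).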

Slide 29

Slide 29 text

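Simplified Noise Contrastive Estimation (2/2)
• Learn to discriminate between the context word $w_c$ of one training example and $k$ noise context words $S'$.
$$S' = \{\bar{w}_{c1}, \cdots, \bar{w}_{ck}\}$$
• The objective function is changed from
$$L = -\sum_{w_t \in A} p(w_t) \sum_{w_c \in A'} p(w_c \mid w_t) \left[ \log \exp(u(w_t)^\top v(w_c)) - \log \sum_{w_{c'} \in A'} \exp(u(w_t)^\top v(w_{c'})) \right]$$
to
$$L_{NS} = -\sum_{w_t \in A} p(w_t) \sum_{w_c \in A'} p(w_c \mid w_t) \left[ \log \sigma(u(w_t)^\top v(w_c)) + \sum_{w_{c'} \in S'} \log\left(1 - \sigma(u(w_t)^\top v(w_{c'}))\right) \right]$$
where $\sigma$ is the sigmoid function.
• $\sigma(u(w_t)^\top v(w_c))$ is the probability that the positive example occurs; $1 - \sigma(u(w_t)^\top v(w_{c'}))$ is the probability that a negative (noise) example does not occur.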

Slide 30

Slide 30 text

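Relating the general form of NCE to Skip-gram
$$\mathbb{E}_{p^+(x)} \left[ \mathbb{E}_{p(y^+ \mid x)\, p_{nce}(y^-)}\; l_\omega(x, y^+, y^-) \right]$$
• The Skip-gram formulation shown earlier is also a special case of the expression above:
$$\mathbb{E}_{p(w_t)} \left[ \mathbb{E}_{p(w_c \mid w_t)\, p_{nce}(w_{c'})}\; l_{U,V}(w_t, w_c, w_{c'}) \right]$$
$$l_{U,V}(w_t, w_c, w_{c'}) = -\log \sigma(u(w_t)^\top v(w_c)) - k \log\left(1 - \sigma(u(w_t)^\top v(w_{c'}))\right)$$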
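The factor $k$ in $l_{U,V}$ can be read as the sum over the $k$ noise words in $S'$ taken in expectation. The short derivation below assumes that each of the $k$ noise words is drawn from the same distribution $p_{nce}$. That assumption is made here for illustration and is not stated on the slide.

```latex
% Assumption: every \bar{w}_{ci} in S' is drawn from p_nce.
% Linearity of expectation turns the sum over S' into k copies of one expectation.
\begin{align*}
\mathbb{E}_{S'} \left[ \sum_{w_{c'} \in S'} \log\left(1 - \sigma(u(w_t)^\top v(w_{c'}))\right) \right]
  &= \sum_{i=1}^{k} \mathbb{E}_{\bar{w}_{ci} \sim p_{nce}}
       \left[ \log\left(1 - \sigma(u(w_t)^\top v(\bar{w}_{ci}))\right) \right] \\
  &= k \, \mathbb{E}_{w_{c'} \sim p_{nce}}
       \left[ \log\left(1 - \sigma(u(w_t)^\top v(w_{c'}))\right) \right].
\end{align*}
```

Taking the expectation of $l_{U,V}$ over a single $w_{c'} \sim p_{nce}$ therefore matches the expected value of the bracketed term in $L_{NS}$, up to the overall sign.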