## Slide 1

### Slide 1 text

2018/07/08 Yuji Yamamoto (@y_yammt), ACL 2018 paper-reading meetup (@LINE Corp)

## Slide 2

### Slide 2 text

The paper introduced in this talk

- https://arxiv.org/abs/1805.03642
- Authors contributed equally.
- Apparently the result of an internship at Borealis AI (enviable!).

## Slide 3

### Slide 3 text

Overview

- An improvement on Noise Contrastive Estimation (NCE), which is used for parameter estimation in word embeddings and similar models.
- Incorporates the mechanism of Generative Adversarial Networks (GANs) into negative sampling.
- Experiments confirm that it converges faster than NCE.
- Improvements on several metrics in downstream tasks are also confirmed.

## Slide 4

### Slide 4 text

Outline of the talk

1. Introduction: the skip-gram model and Noise Contrastive Estimation
2. Proposed method: Adversarial Contrastive Estimation
3. Experiments
4. Summary

## Slide 5

### Slide 5 text

The skip-gram model and Noise Contrastive Estimation

## Slide 6

### Slide 6 text

What was the skip-gram model again? (1/2)

- One method of associating words with vectors (a word embedding).
- It produces vectors such that the words surrounding a word of focus can be predicted well from that word:

$$p_{U,V}(w_c \mid w_t) = \frac{\exp(u(w_t)^\top v(w_c))}{\sum_{w_{c'} \in A'} \exp(u(w_t)^\top v(w_{c'}))}$$

(Figure: "Words are mapped to vectors"; from the target word $w_t$ we try to guess the surrounding context words $w_c$.)

## Slide 7

### Slide 7 text

What was the skip-gram model again? (2/2)

$$p_{U,V}(w_c \mid w_t) = \frac{\exp(u(w_t)^\top v(w_c))}{\sum_{w_{c'} \in A'} \exp(u(w_t)^\top v(w_{c'}))}$$

where $u(\cdot), v(\cdot) \in \mathbb{R}^d$ are embedding lookups: $w_t \mapsto u(w_t)$ is a row of the target-word matrix $U \in \mathbb{R}^{A \times d}$, and $w_c \mapsto v(w_c)$ is a row of the context-word matrix $V \in \mathbb{R}^{A' \times d}$.
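The conditional probability above can be sketched numerically. This is a minimal illustration with made-up sizes and random parameters (the name `p_context_given_target` is mine, not from the paper):

```python
import numpy as np

# A minimal sketch of the skip-gram conditional probability
#   p_{U,V}(w_c | w_t) = exp(u(w_t)^T v(w_c)) / sum_{w_c' in A'} exp(u(w_t)^T v(w_c')).
rng = np.random.default_rng(0)
A, A_ctx, d = 5, 7, 3            # target vocab size A, context vocab size A', dim d
U = rng.normal(size=(A, d))      # row w_t is u(w_t)
V = rng.normal(size=(A_ctx, d))  # row w_c is v(w_c)

def p_context_given_target(w_t: int) -> np.ndarray:
    """Softmax over the full context vocabulary for the target word w_t."""
    scores = V @ U[w_t]                # u(w_t)^T v(w_c') for every w_c' in A'
    e = np.exp(scores - scores.max())  # shift by the max for numerical stability
    return e / e.sum()

print(p_context_given_target(2))       # a length-A' probability vector
```

Note that the denominator already touches every context word, which is the cost the rest of the talk is about.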

## Slide 8

### Slide 8 text

The objective function of the skip-gram model

- Take the negative log-likelihood of a pair $(w_t, w_c)$ found at some position in the text:

$$l = -\log p_{U,V}(w_c \mid w_t) = -\log \frac{\exp(u(w_t)^\top v(w_c))}{\sum_{w_{c'} \in A'} \exp(u(w_t)^\top v(w_{c'}))}$$

→ optimize $u, v$ so that this is minimized. If we can compute the partial derivatives $\partial l / \partial u(w_t)$ and $\partial l / \partial v(w_c)$ we can estimate the parameters, but...

$$\frac{\partial l}{\partial u(w_t)} = -v(w_c) + \mathbb{E}_{p(w_{c'} \mid w_t)}\left[v(w_{c'})\right]$$

The computation needed to obtain this gradient is $O(A')$, proportional to the size of the context vocabulary → it can become a heavy computation.
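The $O(A')$ gradient can be sketched as follows, assuming the full-softmax skip-gram loss; sizes are illustrative and the helper names `loss` and `grad_u` are mine:

```python
import numpy as np

# Sketch of the gradient  d l / d u(w_t) = -v(w_c) + E_{p(w_c'|w_t)}[v(w_c')].
# The expectation touches every row of V, which is why one step costs O(A').
rng = np.random.default_rng(1)
A, A_ctx, d = 4, 6, 3
U = rng.normal(size=(A, d))
V = rng.normal(size=(A_ctx, d))

def loss(u: np.ndarray, w_c: int) -> float:
    """l = -log p_{U,V}(w_c | w_t) for a target embedding u = u(w_t)."""
    scores = V @ u
    return float(np.log(np.exp(scores).sum()) - scores[w_c])

def grad_u(w_t: int, w_c: int) -> np.ndarray:
    scores = V @ U[w_t]
    p = np.exp(scores - scores.max())
    p /= p.sum()                 # p(w_c' | w_t) over all A' context words
    return -V[w_c] + p @ V       # O(A' * d) work per gradient step

g = grad_u(0, 2)
```

A finite-difference check against `loss` confirms the closed form.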

## Slide 9

### Slide 9 text

Tricks to reduce the computation

- Noise Contrastive Estimation (NCE), among others.
- Here I introduce the simplified Noise Contrastive Estimation (Negative Sampling) that appears in Mikolov's papers.
- For the paper introduced today it is inconsequential whether we use the simplified version or the original.
- The book 「深層学習による自然言語処理」 (*Natural Language Processing by Deep Learning*) also explains NCE and NS in detail.

## Slide 10

### Slide 10 text

Simplified Noise Contrastive Estimation

- Learn to discriminate the context word $w_c$ of a single training example from $k$ noise context words $S' = \{\bar{w}_{c_1}, \cdots, \bar{w}_{c_k}\}$, a set of $k$ context words drawn at random (though not necessarily from a uniform distribution).
- The objective function is changed from

$$l = -\log \exp(u(w_t)^\top v(w_c)) + \log \sum_{w_{c'} \in A'} \exp(u(w_t)^\top v(w_{c'}))$$

to

$$l_{NS} = -\log \sigma(u(w_t)^\top v(w_c)) - \sum_{w_{c'} \in S'} \log\left(1 - \sigma(u(w_t)^\top v(w_{c'}))\right)$$

where the first term is the probability that the positive example occurs and the second is the probability that the negatives (noise) do not occur.
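The negative-sampling loss above can be sketched directly, with $\sigma$ the logistic sigmoid. Sizes and the (uniform) noise distribution are illustrative assumptions:

```python
import numpy as np

# Sketch of the negative-sampling loss
#   l_NS = -log sigma(u(w_t)^T v(w_c)) - sum_{w_c' in S'} log(1 - sigma(u(w_t)^T v(w_c'))).
# Only the positive pair and the k sampled negatives are touched, so one
# step costs O(k) instead of O(A').
rng = np.random.default_rng(2)
A, A_ctx, d, k = 4, 8, 3, 2
U = rng.normal(size=(A, d))
V = rng.normal(size=(A_ctx, d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_loss(w_t: int, w_c: int, negatives: np.ndarray) -> float:
    pos = np.log(sigmoid(U[w_t] @ V[w_c]))                    # positive pair
    neg = np.log(1.0 - sigmoid(V[negatives] @ U[w_t])).sum()  # k noise pairs
    return float(-(pos + neg))

S_prime = rng.integers(0, A_ctx, size=k)  # negatives drawn without looking at w_t
print(ns_loss(0, 3, S_prime))
```

Drawing `S_prime` without conditioning on `w_t` is exactly the property the next slide criticizes.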

## Slide 12

### Slide 12 text

Re-examining simplified NCE

$$l_{NS} = -\log \sigma(u(w_t)^\top v(w_c)) - \sum_{w_{c'} \in S'} \log\left(1 - \sigma(u(w_t)^\top v(w_{c'}))\right)$$

- Because the negatives ($S'$, the $k$ randomly drawn context words) are generated without looking at the target word $w_t$, they may end up being negatives that are easy to tell apart from the positive.
- → We want to be able to generate harder negatives.
- → Bring in the mechanism of Generative Adversarial Networks.

## Slide 13

### Slide 13 text

Rewriting NCE a bit more generally

- Let $\omega$ be the parameters to optimize; given a target $x$, let $y^+$ be the observed outcome (positive example) and $y^-$ the noise outcome (negative example).
- The loss function is then

$$L(\omega; x) = \mathbb{E}_{p(y^+ \mid x)\, p_{nce}(y^-)}\left[l_\omega(x, y^+, y^-)\right] \quad \leftarrow \text{minimize}$$

- Correspondence with skip-gram: $x \leftarrow w_t$, $y^+ \leftarrow w_c$, $y^- \leftarrow w_{c'}$, $\omega \leftarrow U, V$; $l_\omega$ plays the role of the (negative log-)likelihood, and the negatives are generated independently of $x$.

## Slide 14

### Slide 14 text

Adversarial Contrastive Estimation

- The loss function of the proposed method:

$$L(\omega, \theta; x) = \lambda\, \mathbb{E}_{p(y^+ \mid x)\, p_{nce}(y^-)}\left[l_\omega(x, y^+, y^-)\right] + (1 - \lambda)\, \mathbb{E}_{p(y^+ \mid x)\, g_\theta(y^- \mid x)}\left[l_\omega(x, y^+, y^-)\right]$$

where $g_\theta(y^- \mid x)$ generates negatives conditioned on the target.

- Optimization (a GAN-style minimax game):

$$\min_\omega \max_\theta \mathbb{E}_{p^+(x)}\left[L(\omega, \theta; x)\right]$$

The Generator tries to produce hard negatives; the Discriminator tries to tell positives and negatives apart properly.
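A toy sketch of the ACE loss may help: a $\lambda$-weighted mix of the NCE term (negatives from a fixed noise distribution) and the adversarial term (negatives from a generator conditioned on the target). The tiny tabular generator `G` here is an assumption for illustration, not the paper's model:

```python
import numpy as np

# Toy ACE loss: lambda * E_{p_nce(y-)}[l_omega] + (1-lambda) * E_{g_theta(y-|x)}[l_omega],
# with the expectations over negatives computed exactly on a tiny vocabulary.
rng = np.random.default_rng(3)
A, A_ctx, d, lam = 4, 6, 3, 0.5
U = rng.normal(size=(A, d))      # discriminator params omega = (U, V)
V = rng.normal(size=(A_ctx, d))
G = rng.normal(size=(A, A_ctx))  # generator logits: row x -> distribution over y-

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l_omega(x: int, y_pos: int, y_neg: int) -> float:
    """Per-example contrastive loss l_omega(x, y+, y-)."""
    return float(-np.log(sigmoid(U[x] @ V[y_pos]))
                 - np.log(1.0 - sigmoid(U[x] @ V[y_neg])))

def ace_loss(x: int, y_pos: int) -> float:
    p_nce = np.full(A_ctx, 1.0 / A_ctx)          # fixed noise distribution
    g = np.exp(G[x] - G[x].max())
    g /= g.sum()                                 # g_theta(y- | x), conditioned on x
    nce_term = sum(p_nce[y] * l_omega(x, y_pos, y) for y in range(A_ctx))
    adv_term = sum(g[y] * l_omega(x, y_pos, y) for y in range(A_ctx))
    return lam * nce_term + (1.0 - lam) * adv_term

print(ace_loss(1, 2))
```

In the minimax game, the discriminator updates $(U, V)$ to decrease this value while the generator updates $G$ to increase it.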

## Slide 15

### Slide 15 text

Finer points of ACE

- Entropy regularization of the Generator.
- Exception handling for when a false negative (an actual positive) is drawn as noise.
- And so on.
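The entropy regularizer can be sketched in a few lines: adding the entropy of $g_\theta(\cdot \mid x)$ to the Generator objective discourages it from collapsing onto a handful of negatives. The coefficient and parameterization are assumptions, not the paper's exact recipe:

```python
import numpy as np

# Entropy H(p) = -sum_y p(y) log p(y) of a generator distribution over negatives.
# A peaked (collapsed) distribution has low entropy, a spread-out one high entropy,
# so rewarding entropy keeps the generator's proposals diverse.
def entropy(p: np.ndarray) -> float:
    p = p[p > 0]                       # drop zero-probability entries (0 log 0 = 0)
    return float(-(p * np.log(p)).sum())

uniform = np.full(4, 0.25)
peaked = np.array([0.97, 0.01, 0.01, 0.01])
print(entropy(uniform), entropy(peaked))  # the uniform distribution has larger entropy
```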

Experiments

## Slide 17

### Slide 17 text

Overview of the experimental tasks

1. Word embeddings
   - For word pairs, evaluate the rank correlation between human-annotated similarity and the similarity given by the embeddings.
   - Results are shown on the following pages.
2. Hypernym prediction
   - Given a word pair (word1, word2), predict whether "word1 is a word2".
   - e.g. (New York, city) → True
3. Knowledge-graph embeddings
   - Learn from relational data (entity1, relation, entity2) and predict missing links (a.k.a. link prediction).
   - http://letra418.hatenablog.com/entry/2017/07/24/223257

## Slide 18

### Slide 18 text

Word-embedding results (Spearman score)

- Trained with a single pass over English Wikipedia.
- For word pairs, the rank correlation between human-annotated similarity and embedding similarity is evaluated.
- ADV: negatives from the Generator only (λ = 0). ACE: Generator plus NS.
- What is an "iteration" here? (What problem is being solved at each iteration?)

## Slide 19

### Slide 19 text

Word-embedding results (nearest neighbors)

## Slide 20

### Slide 20 text

On the limitations of ACE

- The Generator's computation is heavy.
  - Generating a negative involves a softmax, so a computation resembling the expression NCE was meant to approximate away creeps back in.
  - Since training word embeddings is a precomputation for downstream tasks, perhaps the extra time is justified? (It would help to be able to say, e.g., that convergence is faster than MLE or that downstream metrics improve.)
- It is unclear how many of NCE's properties still hold.
  - NCE behaves similarly to MLE under certain conditions: https://qiita.com/Quasi-quant2010/items/a15b0d1b6428dc49c6c2
  - Whether this still holds for ACE, after bringing in the GAN mechanism, is not clear.

Summary

## Slide 22

### Slide 22 text

Summary

- An improvement to supervised learning schemes that learn by contrasting observed samples with fake samples.
- Adversarial Contrastive Estimation (ACE)
  - Uses a generative network in a GAN-like setting that can propose hard negatives to the discriminative model.
  - Entropy regularization of the Generator and proper handling of false negatives turned out to be important for learning well.

## Slide 23

### Slide 23 text

Impressions

- The word-embedding task yields vectors whose similarities look plausible → could this be used for product recommendation?
  - In fact, similar work appeared at RecSys 2018 (at roughly the same time): Adversarial Training of Word2Vec for Basket Completion, https://arxiv.org/abs/1805.08720
- Many implementation details are unclear; I wish the implementation were made public.

Supplementary slides

## Slide 25

### Slide 25 text

Relating the skip-gram model to the paper's mathematical notation

## Slide 26

### Slide 26 text

The skip-gram objective function (1/2)

- Take the negative log-likelihood over the word pairs that can occur in the text.
- One often sees the formulation for a single word pair, but to match the paper's notation we consider all pairs here.

$$L = -\sum_{w_t \in A} \sum_{w_c \in A'} p(w_t, w_c) \log p_{U,V}(w_c \mid w_t) = -\sum_{w_t \in A} p(w_t) \sum_{w_c \in A'} p(w_c \mid w_t) \log p_{U,V}(w_c \mid w_t)$$

→ optimize $U, V$ to minimize $L$.

- More generally, if we set $p(w_t, w_c) \propto \mathrm{freq}(w_t, w_c)$, then minimizing $-\sum_{w_t \in A} \sum_{w_c \in A'} \mathrm{freq}(w_t, w_c) \log p_{U,V}(w_c \mid w_t)$ is equivalent to the above in the sense of having the same minimizer.

## Slide 27

### Slide 27 text

The skip-gram objective function (2/2)

$$L = -\sum_{w_t \in A} p(w_t) \sum_{w_c \in A'} p(w_c \mid w_t) \left[\log \exp(u(w_t)^\top v(w_c)) - \log \sum_{w_{c'} \in A'} \exp(u(w_t)^\top v(w_{c'}))\right]$$

- Computing the normalizer is $O(A')$: with a large vocabulary the computation takes time.
- Tricks to reduce the computation: Noise Contrastive Estimation, Negative Sampling, etc.

## Slide 28

### Slide 28 text

Simplified Noise Contrastive Estimation (1/2)

- Here I introduce the simplified Noise Contrastive Estimation (Negative Sampling) that appears in Mikolov's papers.
- For the paper introduced today it is inconsequential whether we use the simplified version or the original.
- The book 「深層学習による自然言語処理」 (*Natural Language Processing by Deep Learning*) also explains NCE and NS in detail.

## Slide 29

### Slide 29 text

Simplified Noise Contrastive Estimation (2/2)

- Learn to discriminate the context word $w_c$ of each training example from $k$ noise context words $S' = \{\bar{w}_{c_1}, \cdots, \bar{w}_{c_k}\}$.
- The objective function is changed from

$$L = -\sum_{w_t \in A} p(w_t) \sum_{w_c \in A'} p(w_c \mid w_t) \left[\log \exp(u(w_t)^\top v(w_c)) - \log \sum_{w_{c'} \in A'} \exp(u(w_t)^\top v(w_{c'}))\right]$$

to

$$L_{NS} = -\sum_{w_t \in A} p(w_t) \sum_{w_c \in A'} p(w_c \mid w_t) \left[\log \sigma(u(w_t)^\top v(w_c)) + \sum_{w_{c'} \in S'} \log\left(1 - \sigma(u(w_t)^\top v(w_{c'}))\right)\right]$$

where the first term inside the brackets is the probability that the positive example occurs and the second is the probability that the negatives (noise) do not occur.

## Slide 30

### Slide 30 text

The general form of NCE and its relation to skip-gram

$$\mathbb{E}_{p^+(x)}\left[\mathbb{E}_{p(y^+ \mid x)\, p_{nce}(y^-)}\left[l_\omega(x, y^+, y^-)\right]\right]$$

- The skip-gram formulation shown earlier is a special case of the above:

$$\mathbb{E}_{p(w_t)}\left[\mathbb{E}_{p(w_c \mid w_t)\, p_{nce}(w_{c'})}\left[l_{U,V}(w_t, w_c, w_{c'})\right]\right], \qquad l_{U,V}(w_t, w_c, w_{c'}) = -\log \sigma(u(w_t)^\top v(w_c)) - k \log\left(1 - \sigma(u(w_t)^\top v(w_{c'}))\right)$$