Hayato Tsukagoshi
October 18, 2022

# [Reading Group Material] Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space

A careful introduction to Optimus, starting from the derivation of the VAE objective function that underpins it.

## Transcript

1. Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space
Graduate School of Informatics, Nagoya University, Japan.
Presenter: Hayato Tsukagoshi
Chunyuan Li, Xiang Gao, Yuan Li, Baolin Peng, Xiujun Li, Yizhe Zhang, and Jianfeng Gao

EMNLP 2020

URL: https://aclanthology.org/2020.emnlp-main.378/

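2. Paper overview
• Proposes Optimus, a pre-trained language model based on a VAE (variational auto-encoder)
  • Note: it makes full use of existing pre-trained language models
• The encoder is BERT and the decoder is GPT-2
  • The two models are combined into a VAE, with some care taken over how they are integrated
• Strong results on text generation metrics, conditional generation, and low-resource tasks
  • Linear interpolation of latent representations yields semantically smooth sentence generation
  • Demonstrates the usefulness of VAE + pre-training in NLP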

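3. Why this paper
• VAEs are theoretically and technically interesting but get little attention (especially in NLP)
  • Pre-trained models such as BERT are simple and powerful
  • They do seem to be used in text generation tasks that value diversity, though…
• I wanted to study VAEs and reorganize my background knowledge for my own benefit
  • I did not remember the equations very well…
• I might use VAEs in my own research, so I was interested
  • VAE-based sentence embedding models are rarely seen
  • BERT-flow seems close in spirit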

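4. Outline
Introduction
• What is a VAE?
• Deriving the VAE objective function
Optimus
• Model architecture
• Loss function
• Integrating BERT and GPT-2
• Evaluation experiments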

5. Introduction

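7. Generative models
• Methods that model the distribution of the inputs as well as the outputs *
  • What we usually use are discriminative models (classification only)
• The data is assumed to be generated from some probability distribution
  • The distribution that the observed data follows is estimated from the observed data
• GANs in the image domain are well known
  • Surprisingly rare in NLP?
* See Pattern Recognition and Machine Learning (Japanese edition), Vol. 1, p. 42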

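8. Auto-Encoder (AE)
• A model trained to reconstruct its input from an intermediate representation
  • A kind of generative model
  • Can be trained without supervision
• The intermediate representation can be seen as a compressed representation of the input
  • Enables nonlinear, complex dimensionality reduction
  • Also used for clustering, anomaly detection, denoising, and so on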

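9. Variational Auto-Encoder (VAE)
• An Auto-Encoder with a constraint placed on the distribution of its latent representation (or it can be viewed as one)
  • Its motivation and theoretical background differ from the AE, but it can be interpreted as something similar
  • The constraint on the latent representation makes data generation easy
  • Proposed in Kingma et al., 2013, Auto-Encoding Variational Bayes
• Any prior distribution can be chosen for the latent representation
  • In most cases the standard normal distribution is used
• The loss function is the sum of two losses
  • Reconstruction error
  • A loss on the distribution of the latent representation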

10. VAE model structure
Figure: x → Encoder → (μ, σ) → latent representation z → Decoder → x'
• The encoder converts the input into a vector representation
• From the vector representation, the mean and the variance-covariance matrix of a Gaussian distribution are output
  • A full covariance matrix is cumbersome, so it is basically treated as a diagonal matrix
• The latent representation is obtained by sampling from the Gaussian with that mean and covariance
• The decoder reconstructs the output from the latent representation

15. Comparing the AE and VAE model structures
AE: x → Encoder → latent representation z → Decoder → x'
VAE: x → Encoder → (μ, σ) → latent representation z → Decoder → x'
• The VAE only adds a step that samples the latent representation and a loss on the distribution of the latent representation

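17. Advantages of the VAE
• In an AE, it is unknown how the latent representations are distributed
  • A VAE is trained so that they approach a known probability distribution
  • Sampling from the known distribution can generate natural-looking data
• It has a regularizing effect and is more robust than an AE
  • Similar to, for example, the Denoising Auto-Encoder
  • Unlike PCA or SVD, it can compress the input data with a nonlinear transformation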

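18. Comparing the VAE with other generative models
GAN
• Trained so that the discriminator cannot classify the generator's outputs
VAE
• Trained so that the latent distribution approaches the prior and the input is reconstructed
Normalizing flow
• Learns an invertible mapping to build a complex distribution over latent representations
• Can be combined with a VAE
Diffusion Models
• Trained to add noise in the forward direction and remove it in the reverse direction
See also: Variational Inference and Normalizing Flow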

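19. The VAE objective function
• The VAE loss is the sum of the following two terms
  • Reconstruction error
  • Regularization term (a loss on the distribution of the latent representation)
  • $\phi$ denotes the encoder parameters and $\theta$ the decoder parameters
$$\mathcal{L} = \underbrace{D_{KL}\left( q_\phi(z|X) \,\|\, p_\theta(z) \right)}_{\text{regularization term}} \; \underbrace{-\, \mathbb{E}_{q_\phi(z|X)}\left[ \log p_\theta(X|z) \right]}_{\text{reconstruction error}}$$
(On this slide $\mathcal{L}$ denotes the loss to be minimized.)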

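20. How to obtain the VAE objective
• The underlying idea of the VAE (or of variational Bayes)
  • We want the posterior distribution $p_\theta(Z|X)$ that describes the hidden properties $Z$ of the data $X$
• In practice, $p_\theta(X)$ and $p_\theta(Z|X)$ are almost never known
  • Settle for $q_\phi(Z|X)$, an approximation of $p_\theta(Z|X)$
• How do we find $q_\phi(Z|X)$?
  • We do not know what this distribution looks like either
  • Take $p_\theta(X)$ as a starting point and manipulate the expression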

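22. How to obtain the VAE objective
Try rewriting the expression as follows:
$$\begin{aligned}
\log p_\theta(X) &= \log \int p_\theta(X, z)\, dz && \text{(viewed as the joint marginalized over } z \text{)} \\
&= \log \int p_\theta(X, z)\, \frac{q_\phi(z|X)}{q_\phi(z|X)}\, dz && \text{(multiplying by 1 changes nothing)} \\
&= \log \int q_\phi(z|X)\, \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz
\end{aligned}$$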

24. How to obtain the VAE objective
By Jensen's inequality, noting that $f(x) = \log(x)$ is a concave function,
$$\log \int q_\phi(z|X)\, \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz \;\ge\; \int q_\phi(z|X) \log \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz$$
that is,
$$\log p_\theta(X) \;\ge\; \int q_\phi(z|X) \log \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz$$

26. How to obtain the VAE objective
Writing the right-hand side as
$$\mathcal{L}(\theta, \phi; X) = \int q_\phi(z|X) \log \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz$$
the inequality becomes
$$\log p_\theta(X) \ge \mathcal{L}(\theta, \phi; X)$$
This $\mathcal{L}(\theta, \phi; X)$ is called the ELBO (Evidence Lower BOund), or variational lower bound (変分下界).
Note: in Japanese the ELBO is sometimes written 変分下限, but since it is a lower bound rather than a lower limit (下限), I think 下界 is the correct term.

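28. How to obtain the VAE objective
Now consider the difference between the two sides of the inequality above, $\log p_\theta(X) - \mathcal{L}(\theta, \phi; X)$. Using $\int q_\phi(z|X)\, dz = 1$ and $p_\theta(X, z) = p_\theta(z|X)\, p_\theta(X)$:
$$\begin{aligned}
\log p_\theta(X) - \mathcal{L}(\theta, \phi; X)
&= \log p_\theta(X) - \int q_\phi(z|X) \log \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz \\
&= \log p_\theta(X) \int q_\phi(z|X)\, dz - \int q_\phi(z|X) \log \frac{p_\theta(z|X)\, p_\theta(X)}{q_\phi(z|X)}\, dz \\
&= \int q_\phi(z|X) \log p_\theta(X)\, dz - \int q_\phi(z|X) \log \frac{p_\theta(z|X)\, p_\theta(X)}{q_\phi(z|X)}\, dz \\
&= \int q_\phi(z|X) \log \frac{p_\theta(X)\, q_\phi(z|X)}{p_\theta(z|X)\, p_\theta(X)}\, dz \\
&= \int q_\phi(z|X) \log \frac{q_\phi(z|X)}{p_\theta(z|X)}\, dz \\
&= D_{KL}\left( q_\phi(z|X) \,\|\, p_\theta(z|X) \right)
\end{aligned}$$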

31. How to obtain the VAE objective
From the above,
$$\log p_\theta(X) = \mathcal{L}(\theta, \phi; X) + D_{KL}\left( q_\phi(z|X) \,\|\, p_\theta(z|X) \right)$$
• The original goal is to find $q_\phi(z|X)$ that approximates $p_\theta(z|X)$
  → it suffices to minimize $D_{KL}\left( q_\phi(z|X) \,\|\, p_\theta(z|X) \right)$
• $\log p_\theta(X)$ is constant for a given $\theta$, so
  minimizing $D_{KL}\left( q_\phi(z|X) \,\|\, p_\theta(z|X) \right)$ ⇔ maximizing $\mathcal{L}(\theta, \phi; X)$
  • The sum of the first and second terms on the right-hand side is fixed, so if the second term shrinks, the first term must grow
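
The identity above can be checked numerically on a toy model. The following is a minimal sketch, assuming a discrete latent variable with three values so that the integrals become sums; every probability in it is a made-up illustrative number, not something from the paper.

```python
import math

# Toy model with a discrete latent z in {0, 1, 2}; all numbers are illustrative.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.40, 0.70]  # p(X | z) for one fixed observation X
q = [0.2, 0.3, 0.5]              # q(z | X), an arbitrary approximate posterior

joint = [pz * px for pz, px in zip(prior, likelihood)]  # p(X, z)
evidence = sum(joint)                                   # p(X)
posterior = [j / evidence for j in joint]               # p(z | X)

# ELBO = sum_z q(z|X) * log( p(X, z) / q(z|X) )
elbo = sum(qz * math.log(j / qz) for qz, j in zip(q, joint))
# KL( q(z|X) || p(z|X) )
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

log_evidence = math.log(evidence)
print(log_evidence, elbo + kl)  # the two values agree up to rounding
```

Because the KL term is non-negative, the ELBO never exceeds `log_evidence`, and the gap closes only when `q` equals the true posterior.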

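34. How to obtain the VAE objective
Decomposing the variational lower bound further:
$$\begin{aligned}
\mathcal{L}(\theta, \phi; X)
&= \int q_\phi(z|X) \log \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz \\
&= \int q_\phi(z|X) \log \frac{p_\theta(X|z)\, p_\theta(z)}{q_\phi(z|X)}\, dz \\
&= \int q_\phi(z|X) \log p_\theta(X|z)\, dz + \int q_\phi(z|X) \log \frac{p_\theta(z)}{q_\phi(z|X)}\, dz \\
&= \int q_\phi(z|X) \log p_\theta(X|z)\, dz - \int q_\phi(z|X) \log \frac{q_\phi(z|X)}{p_\theta(z)}\, dz \\
&= \underbrace{\int q_\phi(z|X) \log p_\theta(X|z)\, dz}_{\text{likelihood}} - \underbrace{D_{KL}\left( q_\phi(z|X) \,\|\, p_\theta(z) \right)}_{\text{regularization term}}
\end{aligned}$$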

36. How to obtain the VAE objective
Maximizing $\mathcal{L}(\theta, \phi; X)$ ⇔ minimizing $-\mathcal{L}(\theta, \phi; X)$, so the loss function is defined as follows:
$$\begin{aligned}
-\mathcal{L}(\theta, \phi; X)
&= -\int q_\phi(z|X) \log p_\theta(X|z)\, dz + D_{KL}\left( q_\phi(z|X) \,\|\, p_\theta(z) \right) \\
&= \underbrace{D_{KL}\left( q_\phi(z|X) \,\|\, p_\theta(z) \right)}_{\text{regularization term}} \; \underbrace{-\, \mathbb{E}_{q_\phi(z|X)}\left[ \log p_\theta(X|z) \right]}_{\text{reconstruction error}}
\end{aligned}$$
• If a Gaussian distribution is assumed for $p_\theta(z)$, the loss function (in particular its KL term) can be computed analytically
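
To make the "analytic" part concrete, here is a small sketch of the closed-form KL term. It assumes a diagonal Gaussian $q_\phi(z|X)$ parameterized by a mean and a log-variance, and a standard normal prior; the function name and the log-variance parameterization are illustrative choices, not taken from the slides.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv  # per-dimension term, always >= 0
        for m, lv in zip(mu, log_var)
    )

# When q already equals the prior, the penalty vanishes.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Shifting the mean by 1 in one dimension costs 0.5 nats.
print(kl_to_standard_normal([1.0], [0.0]))
```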

38. VAE model structure (recap)
Figure: x → Encoder → (μ, σ) → latent representation z → Decoder → x'
• In reality, a technique called the reparameterization trick is inserted at the step that samples z from μ and σ
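
A minimal sketch of the reparameterization trick follows, assuming a diagonal Gaussian parameterized by a mean and a log-variance (the names `mu`, `log_var`, and `reparameterize` are illustrative). Instead of sampling z directly, noise is drawn from a fixed standard normal and then shifted and scaled, so z stays a differentiable function of the encoder outputs.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Return z = mu + sigma * eps with eps ~ N(0, I), one value per dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)  # sigma = exp(log_var / 2)
        for m, lv in zip(mu, log_var)
    ]

rng = random.Random(0)
# Empirical check: samples should have mean 2.0 and variance 0.25.
samples = [reparameterize([2.0], [math.log(0.25)], rng)[0] for _ in range(20000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

In an autodiff framework the same expression lets gradients flow into `mu` and `log_var`, because the randomness lives only in `eps`.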

40. VAE pseudocode: Encoder
• In the implementation, the representation is simply passed through two parallel linear layers (one for μ, one for σ)

42. VAE pseudocode: overall
• The input is reconstructed from the latent representation obtained by sampling
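
The code shown on these slides did not survive in the transcript, so the following is a stand-in sketch rather than the presenter's code. It assumes a tiny dense encoder with two linear heads (mean and log-variance), a linear decoder, squared error as the reconstruction loss, and a standard normal prior; all sizes, names, and initial weights are hypothetical.

```python
import math
import random

def linear(weights, bias, x):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_linear(n_out, n_in, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def vae_forward(x, params, rng):
    h = [math.tanh(v) for v in linear(*params["enc"], x)]  # encoder body
    mu = linear(*params["mu"], h)                          # head 1: mean
    log_var = linear(*params["log_var"], h)                # head 2: log-variance
    # Reparameterized sample of the latent representation.
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    x_rec = linear(*params["dec"], z)                      # decoder
    recon = sum((a - b) ** 2 for a, b in zip(x, x_rec))    # reconstruction error
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
    return x_rec, recon + kl, kl

rng = random.Random(0)
params = {
    "enc": init_linear(8, 4, rng),
    "mu": init_linear(2, 8, rng),
    "log_var": init_linear(2, 8, rng),
    "dec": init_linear(4, 2, rng),
}
x_rec, loss, kl = vae_forward([1.0, 0.0, -1.0, 0.5], params, rng)
print(loss, kl)
```

Training would minimize `loss` with respect to all four weight sets; that part is omitted here because it needs an autodiff library.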

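44. VAE summary
• A generative model trained by assuming a known probability distribution over the latent representation
  • Enables dimensionality reduction, extraction of meaningful representations, and generation by sampling
• Its origins differ, but its architecture resembles the Auto-Encoder
  • It can be viewed as an Auto-Encoder with an added regularization term on the latent representation
  • The regularization term makes the VAE more robust than the AE (or so it is said)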

45. Optimus

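47. Paper overview (recap)
• Proposes Optimus, a pre-trained language model based on a VAE (variational auto-encoder)
  • Note: it makes full use of existing pre-trained language models
• The encoder is BERT and the decoder is GPT-2
  • The two models are combined into a VAE, with some care taken over how they are integrated
• Strong results on text generation metrics, conditional generation, and low-resource tasks
  • Linear interpolation of latent representations yields semantically smooth sentence generation
  • Demonstrates the usefulness of VAE + pre-training in NLP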

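48. Model architecture: simple version
• BERT is used as the encoder, with [CLS] as the sentence representation
• GPT-2 is used as the decoder and generates text conditioned on the latent representation
• The whole model is trained, VAE-style, to reconstruct the input sentence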

49. Model architecture: slightly more detailed version
Figure: [CLS] w1 w2 … → BERT → W_E → (μ, σ) → sampling (reparameterization trick) → z → W_M / W_D → GPT-2, which takes [CLS] w1 w2 … as input and outputs w1 w2 w3 …

53. Loss function
• Hyperparameters $\beta, \lambda$ are added to the usual VAE loss
  • $\beta$ adjusts the strength of the regularization
    • With $\beta = 0$ the model is almost the same as an Auto-Encoder (sampling is still performed)
  • $\lambda$ prevents the latent representation from getting "excessively" close to the prior
• That is a lot of hyperparameters 😇
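
The loss equation itself is not in the transcript, so here is a hedged sketch of one way the two hyperparameters can enter. It assumes the per-dimension KL thresholding form (often called free bits), where each latent dimension's KL is floored at $\lambda$ before being scaled by $\beta$; the function name and the diagonal-Gaussian parameterization are illustrative.

```python
import math

def optimus_style_loss(recon_loss, mu, log_var, beta, lam):
    """Reconstruction loss + beta * sum_i max(lam, KL_i) for a standard normal prior."""
    kl_per_dim = [
        0.5 * (m * m + math.exp(lv) - 1.0 - lv) for m, lv in zip(mu, log_var)
    ]
    # Dimensions whose KL is already below lam contribute a constant,
    # so they receive no gradient pushing them further toward the prior.
    regularizer = sum(max(lam, kl) for kl in kl_per_dim)
    return recon_loss + beta * regularizer

# Per-dimension KL here is [0.0, 2.0].
print(optimus_style_loss(1.0, [0.0, 2.0], [0.0, 0.0], beta=1.0, lam=0.5))
print(optimus_style_loss(1.0, [0.0, 2.0], [0.0, 0.0], beta=0.0, lam=0.5))
```

With `beta=0.0` only the reconstruction term remains, which matches the remark that the model then behaves almost like a plain Auto-Encoder.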

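57. Integrating BERT and GPT-2
• Building a VAE from BERT and GPT-2 involves roughly two problems
1. Tokenization
• BERT and GPT-2 have different vocabularies and different tokenization methods
• Solved by using different tokenizers for the input and the output
2. Conditional generation from the latent representation
• GPT-2 has no built-in mechanism for conditional text generation (prompting is a separate topic)
• How should text be generated from the latent representation obtained with BERT?
  • Two methods for integrating the latent representation with GPT-2's generation mechanism are tested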

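60. Integrating BERT and GPT-2: conditional generation from the latent representation
Memory
• The latent representation is converted into as many vectors as there are layers
• During generation, each layer looks at its vector
Embedding
• The latent representation is transformed and added to the word embeddings
• The latent representation is used in the same way as BERT's position embedding
(prompting is a separate topic)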
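
The two injection schemes can be sketched at the level of shapes. This is only an illustration under assumptions: `w_m` and `w_d` stand for the projection matrices W_M and W_D from the architecture figure, the sizes are tiny made-up values, and the attention step that would consume the memory vectors inside GPT-2 is not modeled.

```python
def project(weights, z):
    """Matrix-vector product with the matrix given as a list of rows."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]

def embedding_injection(token_embeddings, z, w_d):
    """Embedding scheme: add the projected latent vector to every token embedding."""
    z_h = project(w_d, z)  # one vector of size `hidden`
    return [[e + zh for e, zh in zip(tok, z_h)] for tok in token_embeddings]

def memory_injection(z, w_m, n_layers, hidden):
    """Memory scheme: project z to n_layers * hidden values, one vector per layer."""
    flat = project(w_m, z)
    return [flat[i * hidden:(i + 1) * hidden] for i in range(n_layers)]

hidden, n_layers = 3, 2
z = [1.0, -1.0]
w_d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                 # hidden x latent
w_m = [[float(i), 0.0] for i in range(n_layers * hidden)]  # (layers * hidden) x latent
tokens = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]

shifted = embedding_injection(tokens, z, w_d)
memory = memory_injection(z, w_m, n_layers, hidden)
print(shifted)
print(memory)
```

In a real decoder, each entry of `memory` would be made visible to the corresponding layer as an extra position to attend to, while `shifted` would simply replace the usual input embeddings.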

61. Memory
•જࡏදݱΛ૚ͷ਺ͷϕΫτϧʹม׵

•จੜ੒࣌ʹ֤૚ͰϕΫτϧΛݟͳ͕Βੜ੒

Embedding
•જࡏදݱΛม׵ͯ͠୯ޠຒΊࠐΈʹՃࢉ

•BERTͷposition embeddingͷΑ͏ʹ
જࡏදݱΛ༻͍Δ
BERTͱGPT-2ͷ౷߹: જࡏදݱΛ༻͍ͨ৚݅෇͖ੜ੒
61
prompting͸·ͨผͷ࿩

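62. Evaluation experiments
Language Modeling
• Evaluates whether Optimus can generate sentences correctly
• Perplexity (PPL) and MI (mutual information) for text generation
Guided Language Generation
• Evaluates whether sentences that follow given conditions can be generated correctly
• Dialogue response generation, response generation in a specific style, label-conditioned sentence generation
Low-resource Language Understanding
• Verifies the usefulness of Optimus in low-resource settings
• Performance is checked by solving GLUE with sentence embeddings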
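
As a reminder of what PPL measures, here is a minimal sketch that converts per-token log-probabilities into perplexity. It assumes natural-log probabilities are already available; for a VAE the exact log-likelihood is intractable, so reported values are typically based on a bound or an estimate, and this sketch only shows the final conversion.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
```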

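63. Experimental setup
• Latent dimension: 32
  • Judged from the publicly released implementation (the paper does not seem to state the latent dimension explicitly…)
• Training data for the VAE: 1.99 million sentences from English Wikipedia
• For the text generation tasks, one additional epoch of training on each dataset
• Various training tricks
  • For example, increasing $\beta$ during training
• Low-resource Language Understanding uses the representation corresponding to [CLS] of the encoder (BERT)
  • So the vector has 768 dimensions, not 32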
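
The slide only says that $\beta$ is increased during training, so the schedule below is an assumption made for illustration: a linear ramp from 0 to a maximum over the first part of training, then held constant. The function name and the parameters `ramp_fraction` and `beta_max` are hypothetical; the schedule actually used in the paper may differ.

```python
def beta_schedule(step, total_steps, ramp_fraction=0.5, beta_max=1.0):
    """Linearly increase beta from 0 to beta_max, then keep it at beta_max."""
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    return beta_max * min(1.0, step / ramp_steps)

for step in (0, 25, 50, 99):
    print(step, beta_schedule(step, total_steps=100))
```

Starting with a small `beta` lets the decoder learn to use the latent representation before the KL penalty starts pulling it toward the prior.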

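64. Evaluation: Language Modeling
• Far better performance than existing small VAEs
  • Pre-training a huge model on a huge corpus is effective for VAEs too
  • There is a trade-off, controlled by $\lambda$, between text generation performance and the quality of the latent representation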

67. Evaluation: Language Modeling
• Lower PPL than GPT-2 on 3 of the 4 datasets
  • Especially strong on datasets such as SNLI that contain many characteristic, formulaic sentences

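68. Evaluation: Guided Language Generation
• Optimus's latent representations allow arithmetic on sentence representations
  • Sentences are generated from $z_D = z_B - z_A + z_C$
• How should this result be interpreted…?
• The demo site introduced in the paper is no longer accessible 😇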
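
The arithmetic itself is a one-liner on latent vectors. A minimal sketch, assuming the three latent representations have already been obtained from the encoder (the vectors below are made-up numbers, and decoding `z_d` back into text is not shown):

```python
def latent_analogy(z_a, z_b, z_c):
    """Return z_D = z_B - z_A + z_C, computed dimension by dimension."""
    return [b - a + c for a, b, c in zip(z_a, z_b, z_c)]

z_d = latent_analogy([1.0, 0.0], [2.0, 1.0], [0.0, 5.0])
print(z_d)
```

The result would then be fed to the decoder in place of a sampled z.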

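69. Evaluation: Guided Language Generation
• Generation by linear interpolation between the latent representations of two sentences
• A benefit of the VAE's smooth latent space
• The takeaway is roughly that completely unrelated sentences do not appear
  • The vocabulary of the interpolated sentences resembles that of the original sentences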
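
Linear interpolation between two latent representations can be sketched as follows, assuming evenly spaced mixing weights $\tau \in [0, 1]$; the function name, the step count, and the example vectors are illustrative, and each intermediate point would be passed to the decoder to produce a sentence.

```python
def interpolate(z_1, z_2, num_steps):
    """Return num_steps + 1 points z_tau = (1 - tau) * z_1 + tau * z_2."""
    points = []
    for k in range(num_steps + 1):
        tau = k / num_steps
        points.append([(1.0 - tau) * a + tau * b for a, b in zip(z_1, z_2)])
    return points

path = interpolate([0.0, 2.0], [4.0, 0.0], num_steps=4)
for point in path:
    print(point)
```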

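71. Evaluation: Guided Language Generation
• Experiments on three tasks, with strong results
  • Dialogue response generation
  • Sentence generation in a specific style
  • Conditional generation
• Conditional generation produces text based on sentiment classification labels
  • Strong results on the label classification probability of the generated sentences and on their diversity
• See the original paper for detailed experimental settings and task descriptions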

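73. Evaluation: Low-resource Language Understanding
• A linear classifier is trained on Optimus's encoder representations
  • Sentiment classification task on the Yelp dataset
• Performance is observed as the number of training examples varies
• Optimus achieves relatively high classification performance even with few training examples
  • Performance rises slightly faster
  • In particular, without fine-tuning it outperforms the original BERT
  • This suggests that VAE training yields a good latent space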

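75. Evaluation: Low-resource Language Understanding
• Visualizes the distributions of the sentence representations of Optimus and BERT
  • The development set of the Yelp dataset is converted into sentence representations
• With Optimus, the sentence representations are distributed more uniformly and the per-label clusters are clearer
  • In particular, the latent representations are distributed more uniformly than BERT's
  • Or so it seems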

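77. Evaluation: Low-resource Language Understanding
• Evaluates Optimus on GLUE
  • How well does a linear classifier that takes sentence embeddings as input perform?
• Without fine-tuning, it outperforms the original BERT
  • Does Optimus obtain better sentence representations than BERT?
  • BERT's unusually low QQP score is a concern, though…
• With fine-tuning, there is not much difference (unsurprising, since it is based on BERT)
• If anything, I would have liked to see experiments on SentEval as well…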

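80. Summary
• Proposes Optimus, a large-scale pre-trained language model based on a VAE
• BERT as the encoder and GPT-2 as the decoder are skillfully combined into a VAE
• Strong results on text generation, conditional generation, and low-resource tasks
  • In particular, it far outperforms existing small VAEs
  • Shows that pre-training is effective for VAEs
Impressions
• I am curious what would happen when training from scratch, without existing pre-trained language models such as BERT
  • Apparently this was too demanding in terms of compute (see Sec. 6, Discussion)
• Interesting as a pre-training + VAE story, but the range of applications may be limited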
