Hayato Tsukagoshi
October 18, 2022

# [Reading group material] Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space

A careful introduction that starts from the derivation of the VAE objective function underlying Optimus.

## Transcript

1. ### Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space

Graduate School of Informatics, Nagoya University, Japan. Presenter: Hayato Tsukagoshi. Paper authors: Chunyuan Li, Xiang Gao, Yuan Li, Baolin Peng, Xiujun Li, Yizhe Zhang, and Jianfeng Gao. EMNLP 2020. URL: https://aclanthology.org/2020.emnlp-main.378/

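2. ### Paper overview

- Proposes Optimus, a pre-trained language model based on a VAE (variational auto-encoder).
  - Note: it makes full use of existing pre-trained language models.
- The encoder is BERT and the decoder is GPT-2.
  - The two models are carefully combined into a VAE, and the integration method itself is also refined.
- Achieves high performance on text-generation metrics, conditional generation, and low-resource tasks.
  - Linear interpolation of latent representations enables semantically smooth sentence generation.
  - Demonstrates the usefulness of VAE + pre-training in NLP.
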
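3. ### Why this paper was chosen

- VAEs are theoretically and technically interesting, but they receive little attention (especially in NLP).
  - Pre-trained models such as BERT are easy to use and powerful.
  - VAEs do seem to be used in text-generation tasks that emphasize diversity, though.
- I wanted to study VAEs for myself and reorganize the background knowledge.
  - I did not remember the equations very well.
- I was interested because I might use VAEs in my own research.
  - VAE-based sentence embedding models are rarely seen.
  - BERT-flow seems close in spirit.
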

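7. ### Generative models

- Methods that model the distribution of the inputs as well as the outputs. *
  - What we usually use are discriminative models (which only classify).
- The data are assumed to be generated from some probability distribution.
  - The distribution that the observed data follow is estimated from the observed data.
- GANs are the well-known example in the image domain.
  - Surprisingly rare in NLP?

\* See Pattern Recognition and Machine Learning (Japanese edition), Vol. 1, p. 42.
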
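8. ### Auto-Encoder (AE)

- A model trained so that the input can be reconstructed from an intermediate representation.
  - A kind of generative model.
  - Can be trained without supervision.
- The intermediate representation can be regarded as a compressed representation of the input.
  - Allows nonlinear, complex dimensionality reduction.
  - Also used for clustering, anomaly detection, denoising, and so on.
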
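9. ### Variational Auto-Encoder (VAE)

- An Auto-Encoder with a constraint added to the distribution of its latent representation (or at least it can be viewed that way).
  - It has a different motivation and theoretical background from the AE, but can be interpreted as something similar.
  - The constraint on the latent representation makes data generation easy.
  - Proposed in Kingma and Welling, 2013, "Auto-Encoding Variational Bayes".
- Any prior distribution can be chosen for the latent representation.
  - In most cases it is the standard normal distribution.
- The loss function is the sum of two losses:
  - a reconstruction error
  - a loss on the distribution of the latent representation
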

11. ### VAE model structure

Figure (built up over slides 11 to 14): x → Encoder → W_μ, W_σ → μ, σ → latent representation z → Decoder → x'.

- The encoder converts the input into a vector representation.
- From that vector representation, the mean and the variance-covariance matrix of a Gaussian distribution are output.
- The latent representation is obtained by sampling from the Gaussian distribution with that mean and variance-covariance matrix.
- The decoder reconstructs the output from the latent representation.

Note: a full variance-covariance matrix is cumbersome, so it is basically treated as a diagonal matrix.

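
The four steps in the figure (encode, predict the mean and variance, sample, decode) can be sketched in plain Python. This is only a minimal sketch under stated assumptions: the tiny linear encoder and decoder, all weight values, and the names `encode`, `sample_latent`, and `decode` are hypothetical, and the `W_SIGMA` branch is assumed to output the log-variance of a diagonal Gaussian, which is a common convention rather than something the slides state.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

# Hypothetical toy weights: 3-dim input, 2-dim hidden vector, 2-dim latent.
W_ENC = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]     # input -> vector representation
W_MU = [[1.0, 0.0], [0.0, 1.0]]                  # vector representation -> mean
W_SIGMA = [[0.1, 0.0], [0.0, 0.1]]               # vector representation -> log-variance
W_DEC = [[0.4, 0.3], [-0.2, 0.9], [0.7, -0.1]]   # latent -> reconstruction

def encode(x):
    """Steps 1 and 2: map the input to the mean and log-variance of a diagonal Gaussian."""
    hidden = [math.tanh(a) for a in matvec(W_ENC, x)]
    return matvec(W_MU, hidden), matvec(W_SIGMA, hidden)

def sample_latent(mu, log_var, rng):
    """Step 3: draw z from N(mu, diag(exp(log_var))), one dimension at a time."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def decode(z):
    """Step 4: reconstruct the input from the latent representation."""
    return matvec(W_DEC, z)

rng = random.Random(0)
x = [1.0, 0.5, -1.0]
mu, log_var = encode(x)
z = sample_latent(mu, log_var, rng)
x_rec = decode(z)
```

Treating the covariance as diagonal is what lets `sample_latent` handle each latent dimension independently.
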
15. ### Comparison of AE and VAE model structures

Figure: AE: x → Encoder → latent representation z → Decoder → x'. VAE: x → Encoder → W_μ, W_σ → μ, σ → latent representation z → Decoder → x'.

Compared with the AE, the VAE only adds the processing needed to sample the latent representation and a loss on the distribution of the latent representation.

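17. ### Advantages of the VAE

- With an AE, it is unknown how the latent representations are distributed.
  - A VAE is trained so that they approach a known probability distribution.
  - Sampling from that known distribution makes it possible to generate natural data.
- It has a regularizing effect and is more robust than an AE.
  - Similar to the Denoising Auto-Encoder in this respect.
- Unlike PCA or SVD, it can compress the input data with nonlinear transformations.
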
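18. ### Comparison of the VAE with other generative models

- GAN
  - Trained so that the discriminator cannot classify the output of the generator.
- VAE
  - Trained so that the distribution of the latent representation approaches the prior and so that the input is reconstructed.
- Normalizing flow
  - Learns an invertible mapping and builds a complex distribution for the latent representation.
  - Can be combined with a VAE.
- Diffusion models
  - Noise is added in the forward direction, and the model is trained to remove the noise in the reverse direction.

Reference shown on the slide: "Variational Inference and Normalizing Flow".
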
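19. ### VAE objective function

- The VAE loss function is the sum of the following two terms:
  - a reconstruction error
  - a regularization term (a loss on the distribution of the latent representation)
- $\phi$ denotes the encoder parameters and $\theta$ the decoder parameters.

$$\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(z|X)}\left[\log p_\theta(X|z)\right]}_{\text{reconstruction term}} - \underbrace{D_{KL}\left(q_\phi(z|X) \,\|\, p_\theta(z)\right)}_{\text{regularization term}}$$

$\mathcal{L}$ is maximized, so the loss that is actually minimized is $-\mathcal{L}$: the reconstruction error $-\mathbb{E}_{q_\phi(z|X)}[\log p_\theta(X|z)]$ plus the regularization term.
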
20. ### Deriving the VAE objective function

- The basic intuition behind the VAE (or variational Bayes):
  - We want to know the posterior distribution $p_\theta(Z|X)$ that expresses the hidden property $Z$ of the data $X$.
- In practice, $p_\theta(X)$ and $p_\theta(Z|X)$ are almost always unknown.
  - We settle for $q_\phi(Z|X)$, an approximation of $p_\theta(Z|X)$.
  - How do we obtain $q_\phi(Z|X)$?
  - We do not know what this distribution looks like either.
  - Use $p_\theta(X)$ as a starting point and manipulate the equations.

22. ### Deriving the VAE objective function

Rewrite $\log p_\theta(X)$ as follows:

$$\log p_\theta(X) = \log \int p_\theta(X, z)\,dz = \log \int q_\phi(z|X)\,\frac{p_\theta(X, z)}{q_\phi(z|X)}\,dz$$

- First equality: $p_\theta(X)$ is viewed as the joint distribution marginalized over $z$.
- Second equality: multiplying by $q_\phi(z|X) / q_\phi(z|X) = 1$ changes nothing.

24. ### Deriving the VAE objective function

By Jensen's inequality, noting that $f(x) = \log(x)$ is a concave function,

$$\log \int q_\phi(z|X)\,\frac{p_\theta(X, z)}{q_\phi(z|X)}\,dz \;\ge\; \int q_\phi(z|X)\,\log \frac{p_\theta(X, z)}{q_\phi(z|X)}\,dz$$

that is,

$$\log p_\theta(X) \;\ge\; \int q_\phi(z|X)\,\log \frac{p_\theta(X, z)}{q_\phi(z|X)}\,dz$$

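
Jensen's inequality for the concave function log states that log E[Y] ≥ E[log Y]. In the derivation, the weights play the role of $q_\phi(z|X)$ and the values play the role of the ratio $p_\theta(X, z) / q_\phi(z|X)$. The following is a small numerical sanity check on random discrete distributions, not a proof; the helper name `jensen_gap` and all numbers are hypothetical.

```python
import math
import random

def jensen_gap(weights, values):
    """Return log(E[Y]) - E[log Y] for a discrete distribution over positive values."""
    mean_of_values = sum(w * y for w, y in zip(weights, values))
    mean_of_logs = sum(w * math.log(y) for w, y in zip(weights, values))
    return math.log(mean_of_values) - mean_of_logs

rng = random.Random(0)
gaps = []
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in range(5)]
    total = sum(raw)
    weights = [r / total for r in raw]                     # a random probability vector
    values = [rng.uniform(0.01, 10.0) for _ in range(5)]   # random positive values
    gaps.append(jensen_gap(weights, values))

# Every gap should be non-negative (up to rounding), as the inequality predicts.
smallest_gap = min(gaps)
```

The gap is zero only when all values are equal, which corresponds to the ratio being constant in $z$.
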
26. ### Deriving the VAE objective function

Writing the right-hand side as

$$\mathcal{L}(\theta, \phi; X) = \int q_\phi(z|X)\,\log \frac{p_\theta(X, z)}{q_\phi(z|X)}\,dz$$

we obtain

$$\log p_\theta(X) \ge \mathcal{L}(\theta, \phi; X)$$

This $\mathcal{L}(\theta, \phi; X)$ is called the ELBO (Evidence Lower BOund), or variational lower bound.

Note on Japanese terminology: the ELBO is sometimes written as 変分下限 (lower limit), but since it is a lower bound rather than a lower limit, I think 変分下界 (lower bound) is the correct term.

28. ### Deriving the VAE objective function

Now consider the difference between the two sides of the inequality above:

$$
\begin{aligned}
\log p_\theta(X) - \mathcal{L}(\theta, \phi; X)
&= \log p_\theta(X) - \int q_\phi(z|X)\,\log \frac{p_\theta(X, z)}{q_\phi(z|X)}\,dz \\
&= \log p_\theta(X) \int q_\phi(z|X)\,dz - \int q_\phi(z|X)\,\log \frac{p_\theta(z|X)\,p_\theta(X)}{q_\phi(z|X)}\,dz \\
&= \int q_\phi(z|X)\,\log p_\theta(X)\,dz - \int q_\phi(z|X)\,\log \frac{p_\theta(z|X)\,p_\theta(X)}{q_\phi(z|X)}\,dz \\
&= \int q_\phi(z|X)\,\log \frac{p_\theta(X)\,q_\phi(z|X)}{p_\theta(z|X)\,p_\theta(X)}\,dz \\
&= \int q_\phi(z|X)\,\log \frac{q_\phi(z|X)}{p_\theta(z|X)}\,dz \\
&= D_{KL}\left(q_\phi(z|X) \,\|\, p_\theta(z|X)\right)
\end{aligned}
$$

31. ### Deriving the VAE objective function

From the above,

$$\log p_\theta(X) = \mathcal{L}(\theta, \phi; X) + D_{KL}\left(q_\phi(z|X) \,\|\, p_\theta(z|X)\right)$$

- The original goal is to find $q_\phi(z|X)$ that approximates $p_\theta(z|X)$, so we want to minimize $D_{KL}\left(q_\phi(z|X) \,\|\, p_\theta(z|X)\right)$.
- $\log p_\theta(X)$ is constant for a fixed $\theta$ (it does not depend on $\phi$), so minimizing $D_{KL}\left(q_\phi(z|X) \,\|\, p_\theta(z|X)\right)$ is equivalent to maximizing $\mathcal{L}(\theta, \phi; X)$.
- In other words, the sum of the first and second terms on the right-hand side is fixed, so if the second term gets smaller, the first term must get larger.

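
The identity log p(X) = ELBO + KL can be checked numerically on a toy model in which the latent variable takes only three values, so every integral becomes a finite sum. This is a sketch for intuition only; the joint probabilities and the candidate distributions `q` are made-up numbers, and `elbo` and `kl_divergence` are hypothetical helper names.

```python
import math

def elbo(q, joint):
    """Sum over z of q(z|X) * log(p(X, z) / q(z|X))."""
    return sum(qz * math.log(pxz / qz) for qz, pxz in zip(q, joint))

def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions on the same support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p))

# Hypothetical joint probabilities p(X, z) for one observed X and z in {0, 1, 2}.
joint = [0.10, 0.05, 0.15]
evidence = sum(joint)                            # p(X), obtained by marginalizing z
posterior = [pxz / evidence for pxz in joint]    # p(z|X)

results = []
for q in ([0.2, 0.5, 0.3], [0.6, 0.3, 0.1], posterior):
    bound = elbo(q, joint)
    divergence = kl_divergence(q, posterior)
    gap = math.log(evidence) - bound
    results.append((bound, divergence, gap))     # gap should equal divergence
```

In every row the gap between log p(X) and the ELBO matches the KL divergence, and it vanishes when `q` equals the true posterior.
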
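34. ### Deriving the VAE objective function

Decomposing the variational lower bound further:

$$
\begin{aligned}
\mathcal{L}(\theta, \phi; X)
&= \int q_\phi(z|X)\,\log \frac{p_\theta(X, z)}{q_\phi(z|X)}\,dz \\
&= \int q_\phi(z|X)\,\log \frac{p_\theta(X|z)\,p_\theta(z)}{q_\phi(z|X)}\,dz \\
&= \int q_\phi(z|X)\,\log p_\theta(X|z)\,dz + \int q_\phi(z|X)\,\log \frac{p_\theta(z)}{q_\phi(z|X)}\,dz \\
&= \int q_\phi(z|X)\,\log p_\theta(X|z)\,dz - \int q_\phi(z|X)\,\log \frac{q_\phi(z|X)}{p_\theta(z)}\,dz \\
&= \underbrace{\int q_\phi(z|X)\,\log p_\theta(X|z)\,dz}_{\text{likelihood}} - \underbrace{D_{KL}\left(q_\phi(z|X) \,\|\, p_\theta(z)\right)}_{\text{regularization term}}
\end{aligned}
$$
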
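36. ### Deriving the VAE objective function

Maximizing $\mathcal{L}(\theta, \phi; X)$ is equivalent to minimizing $-\mathcal{L}(\theta, \phi; X)$, so the loss function is defined as follows:

$$
\begin{aligned}
-\mathcal{L}(\theta, \phi; X)
&= -\int q_\phi(z|X)\,\log p_\theta(X|z)\,dz + D_{KL}\left(q_\phi(z|X) \,\|\, p_\theta(z)\right) \\
&= \underbrace{-\mathbb{E}_{q_\phi(z|X)}\left[\log p_\theta(X|z)\right]}_{\text{reconstruction error}} + \underbrace{D_{KL}\left(q_\phi(z|X) \,\|\, p_\theta(z)\right)}_{\text{regularization term}}
\end{aligned}
$$

If a Gaussian distribution is assumed for $p_\theta(z)$ (with $q_\phi(z|X)$ also Gaussian, as produced by the encoder), the regularization term can be computed analytically.
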
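
For a diagonal Gaussian $q_\phi(z|X) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and a standard normal prior, the KL term has the well-known closed form $\frac{1}{2}\sum_i (\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2)$. The sketch below compares that formula with a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$. The function names, the log-variance parameterization, and the example values are assumptions made for illustration.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def log_normal_pdf(z, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)

def kl_monte_carlo(mu, log_var, n_samples, rng):
    """Estimate the same KL by averaging log q(z) - log p(z) over samples from q."""
    total = 0.0
    for _ in range(n_samples):
        for m, lv in zip(mu, log_var):
            var = math.exp(lv)
            z = rng.gauss(m, math.sqrt(var))
            total += log_normal_pdf(z, m, var) - log_normal_pdf(z, 0.0, 1.0)
    return total / n_samples

mu = [0.5, -1.0]
log_var = [0.0, math.log(0.25)]
kl_exact = kl_to_standard_normal(mu, log_var)
kl_estimate = kl_monte_carlo(mu, log_var, 50000, random.Random(0))
```

Because the closed form needs no sampling, only the reconstruction term has to be estimated from samples of z during training.
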

39. ### VAE model structure (recap)

Figure: x → Encoder → W_μ, W_σ → μ, σ → latent representation z → Decoder → x'.

In practice, a technique called the reparameterization trick is inserted at the sampling step.

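
The reparameterization trick rewrites sampling from $\mathcal{N}(\mu, \sigma^2)$ as $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, so that z becomes a deterministic, differentiable function of μ and σ once ε is drawn. The following one-dimensional sketch checks both properties numerically; the function name `reparameterize` and the chosen values are hypothetical.

```python
import random

def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, where eps is noise drawn from N(0, 1)."""
    return mu + sigma * eps

rng = random.Random(0)
mu, sigma = 1.5, 0.5
eps_samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
z_samples = [reparameterize(mu, sigma, e) for e in eps_samples]

# The samples should follow N(mu, sigma^2).
sample_mean = sum(z_samples) / len(z_samples)
sample_var = sum((z - sample_mean) ** 2 for z in z_samples) / (len(z_samples) - 1)
sample_std = sample_var ** 0.5

# With eps held fixed, z is differentiable in mu and sigma:
# dz/dmu = 1 and dz/dsigma = eps (checked here by finite differences).
eps = eps_samples[0]
h = 1e-6
dz_dmu = (reparameterize(mu + h, sigma, eps) - reparameterize(mu, sigma, eps)) / h
dz_dsigma = (reparameterize(mu, sigma + h, eps) - reparameterize(mu, sigma, eps)) / h
```

The randomness lives entirely in ε, which is why gradients can flow back to the encoder outputs μ and σ.
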

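44. ### VAE summary

- A generative model that is trained by assuming a known probability distribution for the latent representation.
  - Supports dimensionality reduction, extraction of meaningful representations, and generation by sampling.
- Its origins differ, but it has an architecture similar to the Auto-Encoder.
  - It can be viewed as an Auto-Encoder with an added regularization term on the latent representation.
  - Thanks to the regularization term, the VAE is (said to be) more robust than the AE.
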

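47. ### Paper overview (recap)

- Proposes Optimus, a pre-trained language model based on a VAE (variational auto-encoder).
  - Note: it makes full use of existing pre-trained language models.
- The encoder is BERT and the decoder is GPT-2.
  - The two models are carefully combined into a VAE, and the integration method itself is also refined.
- Achieves high performance on text-generation metrics, conditional generation, and low-resource tasks.
  - Linear interpolation of latent representations enables semantically smooth sentence generation.
  - Demonstrates the usefulness of VAE + pre-training in NLP.
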

50. ### Model structure: a slightly more detailed version

Figure (built up over slides 50 to 52): input tokens [CLS] w1 w2 … → BERT → W_E → μ, σ → sampling with the reparameterization trick → z → W_M / W_D → GPT-2 → output tokens w1 w2 w3 ….

53. ### Loss function

- Hyperparameters $\beta$ and $\lambda$ are added to the usual VAE loss function.
  - $\beta$ adjusts the strength of the regularization.
  - With $\beta = 0$ the model is almost the same as an Auto-Encoder (sampling is still performed).
  - $\lambda$ prevents the latent representation from getting "excessively" close to the prior.

Comment: that is a lot of hyperparameters 😇

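
One way β and λ can enter the loss is sketched below. The transcript does not preserve the exact formula, so this is an assumption: the KL term is computed per latent dimension, clipped from below at λ (the "free bits" style of thresholding), summed, and scaled by β. The function names and the example values are hypothetical.

```python
import math

def kl_per_dimension(mu, log_var):
    """KL between N(mu_i, exp(log_var_i)) and N(0, 1) for each latent dimension."""
    return [0.5 * (m * m + math.exp(lv) - 1.0 - lv) for m, lv in zip(mu, log_var)]

def thresholded_beta_vae_loss(reconstruction_nll, mu, log_var, beta, lam):
    """Reconstruction error + beta * sum_i max(lam, KL_i).

    A dimension whose KL is already below lam contributes the constant lam,
    so it is no longer pushed any closer to the prior.
    """
    regularizer = sum(max(lam, kl) for kl in kl_per_dimension(mu, log_var))
    return reconstruction_nll + beta * regularizer

mu = [0.0, 2.0]
log_var = [0.0, 0.0]
loss_with_threshold = thresholded_beta_vae_loss(10.0, mu, log_var, beta=0.5, lam=0.5)
loss_as_autoencoder = thresholded_beta_vae_loss(10.0, mu, log_var, beta=0.0, lam=0.5)
```

With `beta=0.0` only the reconstruction error remains, which matches the remark that the model then behaves almost like an Auto-Encoder.
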
57. ### Integrating BERT and GPT-2

Building a VAE by combining BERT and GPT-2 raises roughly two problems.

1. Tokenization
   - BERT and GPT-2 have different vocabularies and use different tokenization methods.
   - Solved by using different tokenizers for the input and the output.
2. Conditional generation from the latent representation
   - GPT-2 has no built-in mechanism for conditional text generation.
   - How should text be generated from the latent representation obtained with BERT?
   - The paper experiments with two methods for combining the latent representation with GPT-2's generation mechanism.

Note: prompting is a separate topic.

60. ### Integrating BERT and GPT-2: conditional generation from the latent representation

- Memory
  - The latent representation is converted into as many vectors as there are layers.
  - During generation, each layer attends to its vector.
- Embedding
  - The latent representation is transformed and added to the word embeddings.
  - The latent representation is used in the same way as BERT's position embeddings.

Note: prompting is a separate topic.

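
The two injection schemes can be illustrated on plain Python lists. This is a toy sketch, not the paper's implementation: the shapes, the weight matrices `w_m` and `w_d`, and the function names are hypothetical, and the attention step in which each GPT-2 layer would actually read its memory vector is omitted.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def latent_to_memory(z, w_m, num_layers, hidden_size):
    """Memory scheme: project z to num_layers * hidden_size values and
    split them into one hidden-size vector per decoder layer."""
    flat = matvec(w_m, z)
    return [flat[i * hidden_size:(i + 1) * hidden_size] for i in range(num_layers)]

def add_latent_to_embeddings(token_embeddings, z, w_d):
    """Embedding scheme: project z to the hidden size and add the result
    to every token embedding, much like a position embedding is added."""
    latent_embedding = matvec(w_d, z)
    return [[e + l for e, l in zip(token, latent_embedding)] for token in token_embeddings]

NUM_LAYERS, HIDDEN_SIZE = 3, 4   # toy sizes, far smaller than a real GPT-2
z = [1.0, -1.0]
w_m = [[0.1 * (r + 1), -0.05 * (r + 1)] for r in range(NUM_LAYERS * HIDDEN_SIZE)]
w_d = [[0.5, 0.0], [0.0, 0.5], [0.25, 0.25], [1.0, 1.0]]

memory = latent_to_memory(z, w_m, NUM_LAYERS, HIDDEN_SIZE)
token_embeddings = [[0.0] * HIDDEN_SIZE, [1.0] * HIDDEN_SIZE]
conditioned = add_latent_to_embeddings(token_embeddings, z, w_d)
```

The Memory scheme gives every layer its own view of z, while the Embedding scheme only changes the input to the first layer.
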
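62. ### Evaluation experiments

- Language Modeling
  - Evaluates whether Optimus can generate sentences correctly.
  - Metrics for text generation: perplexity (PPL) and MI.
- Guided Language Generation
  - Evaluates whether sentences that follow a given condition can be generated correctly.
  - Dialogue response generation, response generation in a specific style, and label-conditioned sentence generation.
- Low-resource Language Understanding
  - Verifies the usefulness of Optimus in low-resource settings.
  - Performance is checked by solving GLUE with sentence embeddings.
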
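63. ### Experimental settings

- Latent dimension: 32
  - Judged from the publicly available implementation. (The latent dimension does not seem to be stated explicitly in the paper.)
- Training data for the VAE: 1.99 million sentences from English Wikipedia.
- For the text-generation tasks, the model is further trained for only 1 epoch on each dataset.
- Various training tricks are used.
  - For example, $\beta$ is increased during training.
- Low-resource Language Understanding uses the representation corresponding to [CLS] from the encoder (BERT).
  - So the vector has 768 dimensions, not 32.
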
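
Increasing β during training can be as simple as a warm-up schedule. The slide only states that β is increased, so the linear shape, the function name `linear_beta`, and the parameters `warmup_steps` and `beta_max` are all assumptions for illustration.

```python
def linear_beta(step, warmup_steps, beta_max=1.0):
    """Grow beta linearly from 0 to beta_max over warmup_steps, then hold it."""
    return beta_max * min(1.0, step / warmup_steps)

# Beta at a few points of a hypothetical 100-step warm-up.
schedule = [linear_beta(step, warmup_steps=100) for step in (0, 50, 100, 200)]
```

Starting near β = 0 lets the model first learn to reconstruct like an Auto-Encoder before the regularization term takes effect.
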

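68. ### Evaluation: Guided Language Generation

- Using Optimus's latent representations makes arithmetic on sentence representations possible.
  - Sentences are generated from $z_D = z_B - z_A + z_C$.
- How should this result be interpreted...?

Note: the demo site introduced in the paper is no longer accessible 😇
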
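
The latent arithmetic itself is an element-wise vector operation. The sketch below computes $z_D = z_B - z_A + z_C$ on toy vectors; the function name and the values are hypothetical, and decoding $z_D$ back into a sentence with the GPT-2 decoder is outside the scope of this example.

```python
def analogy_latent(z_a, z_b, z_c):
    """Return z_D = z_B - z_A + z_C, computed element by element."""
    return [b - a + c for a, b, c in zip(z_a, z_b, z_c)]

z_a = [1.0, 0.0]
z_b = [2.0, 1.0]
z_c = [0.0, 5.0]
z_d = analogy_latent(z_a, z_b, z_c)
```

The offset from A to B is applied to C, in the same spirit as word-vector analogies.
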
69. ### Evaluation: Guided Language Generation

- Generation by linear interpolation between the latent representations of two sentences.
- This benefits from the smoothness of the VAE latent space.
- My reading: at least no completely unrelated sentences appear.
  - The vocabulary of the interpolated sentences is similar to that of the original sentences.

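
Linear interpolation between two latent vectors follows $z_\tau = (1 - \tau)\,z_1 + \tau\,z_2$ for $\tau$ from 0 to 1. The sketch below builds such a path on toy vectors; the function name, the number of steps, and the values are hypothetical, and each point on the path would still need to be decoded into a sentence.

```python
def interpolate_latents(z_start, z_end, num_steps):
    """Return num_steps + 1 evenly spaced points from z_start to z_end."""
    path = []
    for i in range(num_steps + 1):
        tau = i / num_steps
        path.append([(1.0 - tau) * s + tau * e for s, e in zip(z_start, z_end)])
    return path

path = interpolate_latents([0.0, 2.0], [4.0, -2.0], num_steps=4)
```

The first and last points reproduce the two original latent vectors, and the points in between are what the decoder turns into intermediate sentences.
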
71. ### Evaluation: Guided Language Generation

- Experiments on three tasks, with high performance:
  - dialogue response generation
  - sentence generation in a specific style
  - conditional generation
- In conditional generation, text is generated according to a sentiment-classification label.
  - High performance in terms of the label classification probability and the diversity of the generated sentences.

See the original paper for the detailed experimental settings and task descriptions.

73. ### Evaluation: Low-resource Language Understanding

- A linear classifier is trained on the representations from the Optimus encoder.
  - Sentiment classification task on the Yelp dataset.
- The change in performance with the number of training examples is observed.
- Optimus achieves relatively high classification performance even with few training examples.
  - Performance rises slightly faster.
  - Especially without fine-tuning, it outperforms the original BERT.
  - This suggests that a good latent space is acquired through VAE training.

75. ### Evaluation: Low-resource Language Understanding

- The distributions of the sentence representations of Optimus and BERT are visualized.
  - The development set of the Yelp dataset is converted into sentence representations.
- With Optimus, the sentence representations are distributed more uniformly and the clusters for each label are clearer.
  - In particular, the latent representations are distributed more uniformly than with BERT.
  - At least, it looks that way.

77. ### Evaluation: Low-resource Language Understanding

- The performance of Optimus on GLUE is evaluated.
  - How well does a linear classifier that takes sentence embeddings as input perform?
- Without fine-tuning, it outperforms the original BERT.
  - Does Optimus acquire better sentence representations than BERT?
  - BERT's performance on QQP is suspiciously low, though.
- With fine-tuning, the difference is small (unsurprising, since the encoder is originally BERT).

Comment: I would have liked to see experiments on SentEval as well.

80. ### Summary

- Proposes Optimus, a large-scale pre-trained language model based on a VAE.
- Skillfully combines BERT as the encoder and GPT-2 as the decoder into a VAE.
- High performance on text generation, conditional generation, and low-resource tasks.
  - In particular, it greatly outperforms existing small VAEs.
  - Demonstrates the effectiveness of pre-training for VAEs.

Impressions

- I am curious what would happen if the model were trained from scratch without using existing pre-trained language models such as BERT.
  - This appears to have been too demanding in terms of compute (see Sec. 6, Discussion).
- It is interesting as a story about pre-training + VAE, but the range of applications may be limited.