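These slides explain Optimus, a large-scale Variational Auto-Encoder (VAE) built by integrating pre-trained language models, and the paper that proposed it. They start with a careful derivation of the VAE objective function that underpins Optimus.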
Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space
Chunyuan Li, Xiang Gao, Yuan Li, Baolin Peng, Xiujun Li, Yizhe Zhang, and Jianfeng Gao
EMNLP 2020
URL: https://aclanthology.org/2020.emnlp-main.378/
Presenter: Hayato Tsukagoshi
Graduate School of Informatics, Nagoya University, Japan
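Paper overview
• Proposes Optimus, a pre-trained language model based on the VAE (Variational Auto-Encoder)
  • Note: it makes full use of existing pre-trained language models
• BERT as the Encoder, GPT-2 as the Decoder
  • Builds a VAE by carefully integrating the two models, and examines integration methods
• High performance on text generation metrics, conditional generation, and low-resource tasks
  • Linear interpolation of latent representations enables semantically smooth text generation
  • Shows the usefulness of VAE + pre-training in NLP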
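Why this paper
• VAEs are theoretically and technically interesting, but (especially in NLP) they receive little attention
  • Pre-trained models such as BERT are easy to use and powerful
  • They do seem to be used in text generation tasks that value diversity, though…
• I wanted to study VAEs and organize the background knowledge for myself
  • I did not remember the equations well…
• I was interested because I might use them in my own research
  • VAE-based sentence embedding models are rarely seen
  • BERT-flow seems close in spirit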
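Contents
Introduction
• What is a VAE?
• Deriving the VAE objective
Optimus
• Model architecture
• Loss function
• Integrating BERT and GPT-2
• Evaluation experiments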
Introduction
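Generative models
• Methods that model the input as well as the output *
  • What we usually use are discriminative models (which only classify)
• Assume the data are generated from some probability distribution
  • Estimate, from observed data, the distribution that the observed data follow
• GANs for images are the best-known example
  • Surprisingly rare in NLP?
* See Pattern Recognition and Machine Learning (Japanese edition), Vol. 1, p. 42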
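Auto-Encoder (AE)
• A model trained to reconstruct its input from an intermediate representation
  • A kind of generative model
  • Can be trained without supervision
• The intermediate representation can be viewed as a compressed representation of the input
  • Enables nonlinear, complex dimensionality reduction
  • Used for clustering, anomaly detection, denoising, and so on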
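Variational Auto-Encoder (VAE)
• An Auto-Encoder with a constraint on the distribution of its latent representation (or at least it can be viewed that way)
  • It has a different motivation and theoretical background from the AE, but can be interpreted as something similar
  • The constraint on the latent representation makes data generation easy
  • Proposed in Kingma et al., 2013. Auto-Encoding Variational Bayes
• Any prior distribution can be chosen for the latent representation
  • In most cases the standard normal distribution
• The loss function is the sum of two losses
  • Reconstruction error
  • A loss on the distribution of the latent representation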
VAE model architecture
[Figure: x → Encoder → (W_μ, W_σ) → μ, σ → latent representation z → Decoder → x']
1. The Encoder converts the input x into a vector representation.
2. From that vector, W_μ and W_σ output the mean and the covariance matrix of a Gaussian. A full covariance matrix is cumbersome, so it is usually treated as a diagonal matrix.
3. The latent representation z is obtained by sampling from the Gaussian defined by that mean and covariance.
4. The Decoder reconstructs the output x' from the latent representation.
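Comparing the AE and VAE architectures
[Figure: AE: x → Encoder → latent representation z → Decoder → x'. VAE: x → Encoder → (W_μ, W_σ) → μ, σ → latent representation z → Decoder → x']
• Compared with the AE, the VAE only adds a step that samples the latent representation and a loss on the distribution of the latent representation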
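Advantages of VAEs
• With an AE, it is unclear how the latent representations are distributed
  • A VAE is trained so that they approach a known probability distribution
  • Sampling from that known distribution yields natural-looking data
• It has a regularizing effect and is more robust than an AE
  • Similar to the Denoising Auto-Encoder and related models
  • Unlike PCA or SVD, it can compress input data with nonlinear transformations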
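Comparing VAEs with other generative models
GAN
• Trained so that the Discriminator cannot tell the Generator's outputs apart from real data
VAE
• Trained so that the distribution of the latent representation approaches the prior and the input is reconstructed
Normalizing flow
• Learns an invertible mapping and builds a complex distribution over latent representations
• Can be combined with a VAE
Diffusion Models
• Noise is added in the forward direction, and the model is trained to remove it in the reverse direction
Reference: "Variational Inference and Normalizing Flow"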
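The VAE objective
• The VAE loss function is the sum of two terms
  • Reconstruction error
  • Regularization term (a loss on the distribution of the latent representation)
  • $\phi$ denotes the Encoder parameters and $\theta$ the Decoder parameters

$$\mathcal{L} = \underbrace{-D_{KL}\big(q_\phi(z|X) \,\|\, p_\theta(z)\big)}_{\text{regularization term}} + \underbrace{\mathbb{E}_{q_\phi(z|X)}\big[\log p_\theta(X|z)\big]}_{\text{reconstruction term}}$$

Training maximizes $\mathcal{L}$, which is the same as minimizing the loss $-\mathcal{L}$.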
Deriving the VAE objective
• The basic intuition behind the VAE (or variational Bayes)
  • We want the posterior distribution $p_\theta(Z|X)$ that describes the hidden property $Z$ behind the data $X$
• In practice this posterior is almost never known
  • So we settle for $q_\phi(Z|X)$, an approximation of $p_\theta(Z|X)$
• How do we find it?
  • We do not know what this distribution $q_\phi(Z|X)$ looks like
  • Start from $p_\theta(X)$ and manipulate the expression
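Deriving the VAE objective
Rewrite $\log p_\theta(X)$ as follows:

$$\begin{aligned}
\log p_\theta(X) &= \log \int p_\theta(X, z)\, dz \\
&= \log \int p_\theta(X, z)\, \frac{q_\phi(z|X)}{q_\phi(z|X)}\, dz \\
&= \log \int q_\phi(z|X)\, \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz
\end{aligned}$$

• First line: $p_\theta(X)$ is viewed as the joint distribution marginalized over $z$
• Second line: this only multiplies by 1, so nothing changes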
Deriving the VAE objective
By Jensen's inequality, noting that $f(x) = \log(x)$ is a concave function,

$$\log \int q_\phi(z|X)\, \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz \;\ge\; \int q_\phi(z|X) \log \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz$$

that is,

$$\log p_\theta(X) \;\ge\; \int q_\phi(z|X) \log \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz$$
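Deriving the VAE objective
Denote the right-hand side by

$$\mathcal{L}(\theta, \phi; X) = \int q_\phi(z|X) \log \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz$$

so that we can write

$$\log p_\theta(X) \ge \mathcal{L}(\theta, \phi; X)$$

This $\mathcal{L}(\theta, \phi; X)$ is called the ELBO (Evidence Lower BOund), or variational lower bound.
• Note on terminology: in Japanese the ELBO is sometimes written 変分下限 ("variational lower limit"), but since it is a lower bound rather than a lower limit, 変分下界 ("variational lower bound") seems to be the correct term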
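Deriving the VAE objective
Now consider the gap between the two sides of the inequality above, using $\int q_\phi(z|X)\, dz = 1$ and $p_\theta(X, z) = p_\theta(z|X)\, p_\theta(X)$:

$$\begin{aligned}
\log p_\theta(X) - \mathcal{L}(\theta, \phi; X)
&= \log p_\theta(X) - \int q_\phi(z|X) \log \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz \\
&= \log p_\theta(X) \int q_\phi(z|X)\, dz - \int q_\phi(z|X) \log \frac{p_\theta(z|X)\, p_\theta(X)}{q_\phi(z|X)}\, dz \\
&= \int q_\phi(z|X) \log p_\theta(X)\, dz - \int q_\phi(z|X) \log \frac{p_\theta(z|X)\, p_\theta(X)}{q_\phi(z|X)}\, dz \\
&= \int q_\phi(z|X) \log \frac{p_\theta(X)\, q_\phi(z|X)}{p_\theta(z|X)\, p_\theta(X)}\, dz \\
&= \int q_\phi(z|X) \log \frac{q_\phi(z|X)}{p_\theta(z|X)}\, dz \\
&= D_{KL}\big(q_\phi(z|X) \,\|\, p_\theta(z|X)\big)
\end{aligned}$$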
Deriving the VAE objective
From the above,

$$\log p_\theta(X) = \mathcal{L}(\theta, \phi; X) + D_{KL}\big(q_\phi(z|X) \,\|\, p_\theta(z|X)\big)$$

• The original goal is to find $q_\phi(z|X)$ that approximates $p_\theta(z|X)$
  → it suffices to minimize $D_{KL}\big(q_\phi(z|X) \,\|\, p_\theta(z|X)\big)$
• $\log p_\theta(X)$ is constant for a fixed $\theta$, so
  minimizing $D_{KL}\big(q_\phi(z|X) \,\|\, p_\theta(z|X)\big)$ ⇔ maximizing $\mathcal{L}(\theta, \phi; X)$
• The sum of the first and second terms on the right-hand side does not change
  → if the second term gets smaller, the first term has to get larger
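Deriving the VAE objective
Decomposing the variational lower bound further, with $p_\theta(X, z) = p_\theta(X|z)\, p_\theta(z)$:

$$\begin{aligned}
\mathcal{L}(\theta, \phi; X)
&= \int q_\phi(z|X) \log \frac{p_\theta(X, z)}{q_\phi(z|X)}\, dz \\
&= \int q_\phi(z|X) \log \frac{p_\theta(X|z)\, p_\theta(z)}{q_\phi(z|X)}\, dz \\
&= \int q_\phi(z|X) \log p_\theta(X|z)\, dz + \int q_\phi(z|X) \log \frac{p_\theta(z)}{q_\phi(z|X)}\, dz \\
&= \int q_\phi(z|X) \log p_\theta(X|z)\, dz - \int q_\phi(z|X) \log \frac{q_\phi(z|X)}{p_\theta(z)}\, dz \\
&= \int q_\phi(z|X) \log p_\theta(X|z)\, dz - \underbrace{D_{KL}\big(q_\phi(z|X) \,\|\, p_\theta(z)\big)}_{\text{regularization term}}
\end{aligned}$$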
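Deriving the VAE objective
Maximizing $\mathcal{L}(\theta, \phi; X)$ ⇔ minimizing $-\mathcal{L}(\theta, \phi; X)$, so the loss function is defined as

$$\begin{aligned}
-\mathcal{L}(\theta, \phi; X)
&= -\int q_\phi(z|X) \log p_\theta(X|z)\, dz + D_{KL}\big(q_\phi(z|X) \,\|\, p_\theta(z)\big) \\
&= \underbrace{D_{KL}\big(q_\phi(z|X) \,\|\, p_\theta(z)\big)}_{\text{regularization term}} \; \underbrace{-\, \mathbb{E}_{q_\phi(z|X)}\big[\log p_\theta(X|z)\big]}_{\text{reconstruction error}}
\end{aligned}$$

• If a Gaussian is assumed for $p_\theta(z)$ (with a Gaussian $q_\phi(z|X)$), the regularization term can be computed analytically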
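For the common case of a diagonal Gaussian $q_\phi(z|X) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and a standard normal prior, the regularization term has the well-known closed form $\tfrac{1}{2}\sum_i (\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2)$. A minimal sketch, with a function name chosen here purely for illustration:

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - 2.0 * math.log(s)   # 2*log(s) equals log(sigma^2)
        for m, s in zip(mu, sigma)
    )

# The KL is zero when q already equals the prior, and grows as q moves away from it.
kl_at_prior = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [1.0, 1.0])
kl_shifted = kl_diag_gaussian_to_standard_normal([1.0], [1.0])
print(kl_at_prior, kl_shifted)
```

Because no sampling is involved, this term contributes no Monte Carlo noise to the gradient; only the reconstruction term has to be estimated from samples of $z$.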
VAE model architecture (revisited)
[Figure: x → Encoder → (W_μ, W_σ) → μ, σ → latent representation z → Decoder → x']
• Strictly speaking, a technique called the reparameterization trick is inserted at the sampling step
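The reparameterization trick writes the sample as a deterministic function of $\mu$, $\sigma$, and parameter-free noise, $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, so that gradients can flow through $\mu$ and $\sigma$. A minimal NumPy sketch, assuming the encoder outputs a log-variance (a common convention, not something stated on the slide):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, diag(exp(log_var))) as mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)        # noise independent of the model parameters
    return mu + np.exp(0.5 * log_var) * eps    # sigma = exp(log_var / 2)

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
log_var = np.array([0.0, -20.0])               # the second dimension has almost no variance
z = reparameterize(mu, log_var, rng)

# Averaging many samples recovers mu, which confirms the sampler targets the right mean.
samples = np.stack([reparameterize(mu, log_var, rng) for _ in range(20000)])
print(z, samples.mean(axis=0))
```

NumPy does not compute gradients; in an autodiff framework the same expression is what lets the loss be differentiated with respect to `mu` and `log_var`.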
VAE pseudo-code: Encoder
• In the implementation, the hidden vector is simply passed through two separate linear layers, one for μ and one for σ
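The "two linear heads" idea can be sketched as follows. This is not the code shown on the slide; the class name, layer sizes, tanh hidden layer, and the choice to predict a log-variance are all assumptions made for illustration.

```python
import numpy as np

class GaussianEncoder:
    """Maps an input vector to the mean and log-variance of a diagonal Gaussian."""

    def __init__(self, input_dim, hidden_dim, latent_dim, rng):
        scale = 0.1
        self.W_h = rng.standard_normal((input_dim, hidden_dim)) * scale
        self.b_h = np.zeros(hidden_dim)
        # Two separate linear heads that read the same hidden vector.
        self.W_mu = rng.standard_normal((hidden_dim, latent_dim)) * scale
        self.b_mu = np.zeros(latent_dim)
        self.W_logvar = rng.standard_normal((hidden_dim, latent_dim)) * scale
        self.b_logvar = np.zeros(latent_dim)

    def __call__(self, x):
        h = np.tanh(x @ self.W_h + self.b_h)            # shared vector representation
        mu = h @ self.W_mu + self.b_mu                  # head 1: mean
        log_var = h @ self.W_logvar + self.b_logvar     # head 2: log-variance (diagonal)
        return mu, log_var

encoder = GaussianEncoder(input_dim=8, hidden_dim=16, latent_dim=4,
                          rng=np.random.default_rng(0))
mu, log_var = encoder(np.ones((3, 8)))   # a batch of 3 inputs
print(mu.shape, log_var.shape)
```

Predicting the log-variance rather than σ directly keeps the variance positive without any explicit constraint.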
VAE pseudo-code: full model
• The input is reconstructed from the latent representation obtained by sampling
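Putting the pieces together, one forward pass and loss computation might look like the sketch below. It is a simplified stand-in for the slide's pseudo-code, not a reproduction of it: the single-layer encoder heads and decoder, the unit-variance Gaussian likelihood, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim = 8, 2

# Hypothetical single-layer encoder heads and decoder; a real model would be deeper.
W_mu = rng.standard_normal((input_dim, latent_dim)) * 0.1
W_logvar = rng.standard_normal((input_dim, latent_dim)) * 0.1
W_dec = rng.standard_normal((latent_dim, input_dim)) * 0.1

def vae_loss(x, rng):
    # Encoder: mean and log-variance of q(z | x).
    mu, log_var = x @ W_mu, x @ W_logvar
    # Reparameterization trick: z = mu + sigma * eps.
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    # Decoder: reconstruct the input from the sampled latent representation.
    x_recon = z @ W_dec
    # Reconstruction error: negative log-likelihood of a unit-variance Gaussian,
    # up to an additive constant.
    recon = 0.5 * np.sum((x - x_recon) ** 2, axis=1)
    # Regularization term: closed-form KL to the standard normal prior.
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=1)
    return np.mean(recon + kl), np.mean(recon), np.mean(kl)

x = rng.standard_normal((5, input_dim))
loss, recon, kl = vae_loss(x, rng)
print(loss, recon, kl)
```

The returned loss is a single-sample Monte Carlo estimate of $-\mathcal{L}$ averaged over the batch; for text, the squared error would be replaced by a token-level cross-entropy.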
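VAE summary
• A generative model trained under the assumption that the latent representation follows a known probability distribution
  • Supports dimensionality reduction, extraction of meaningful representations, and generation by sampling
• Its origins differ, but its architecture resembles the Auto-Encoder
  • It can be viewed as an Auto-Encoder with an added regularization term on the latent representation
  • The regularization term makes the VAE more robust than the AE (or so it is said)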
Optimus
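Model architecture: simple version
• BERT is used as the Encoder, with [CLS] taken as the sentence representation
• GPT-2 is used as the Decoder and generates text conditioned on the latent representation
• The whole model is trained, VAE-style, to reconstruct the input sentence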
Model architecture: slightly more detailed version
[Figure: input tokens [CLS] w1 w2 … → BERT → W_E → μ, σ → sampling with the reparameterization trick → z → W_M / W_D → GPT-2 (input [CLS] w1 w2 …, output w1 w2 w3 …)]
Loss function
• The usual VAE loss function is used with additional hyperparameters β and λ
  • β adjusts the strength of the regularization
  • With β = 0 the model is almost the same as an Auto-Encoder (sampling is still performed)
  • λ prevents the latent representation from getting "excessively" close to the prior
• That is a lot of hyperparameters 😇
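One way to read the roles of β and λ is as a β-weighted KL term with a per-dimension floor λ, the scheme often called "free bits". The sketch below assumes that form; the function name and its inputs are hypothetical, and the exact equation used by Optimus should be checked against the paper.

```python
def beta_lambda_vae_loss(recon_error, kl_per_dim, beta, lam):
    """recon_error + beta * sum_i max(lam, KL_i).

    kl_per_dim holds the KL of each latent dimension to the prior. Dimensions whose
    KL is already below lam are clamped to lam, so pushing them even closer to the
    prior does not lower the loss any further.
    """
    thresholded_kl = sum(max(lam, kl_i) for kl_i in kl_per_dim)
    return recon_error + beta * thresholded_kl

# beta = 0 drops the regularizer entirely, leaving only the reconstruction error.
ae_like = beta_lambda_vae_loss(10.0, [0.1, 2.0], beta=0.0, lam=0.5)
# With beta = 1 and lam = 0.5, the first dimension (KL 0.1) is clamped to 0.5.
regularized = beta_lambda_vae_loss(10.0, [0.1, 2.0], beta=1.0, lam=0.5)
print(ae_like, regularized)
```

In an autodiff framework the clamped dimensions receive zero gradient from the KL term, which is how the floor keeps the posterior from collapsing onto the prior.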
Integrating BERT and GPT-2
• Building a VAE from BERT and GPT-2 raises two main problems
1. Tokenization
• BERT and GPT-2 have different vocabularies and use different tokenization methods
• Solved by using different tokenizers for the input and the output
2. Conditional generation from the latent representation
• GPT-2 has no built-in mechanism for conditional text generation
• How can text be generated from the latent representation obtained with BERT?
  • Two methods for connecting the latent representation to GPT-2's generation mechanism are tested
• Note: prompting is a separate topic
Integrating BERT and GPT-2: conditional generation from the latent representation
Memory
• The latent representation is converted into one vector per layer
• During generation, each layer attends to its vector
Embedding
• The latent representation is transformed and added to the word embeddings
• The latent representation is used in the same way as BERT's position embedding
• Note: prompting is a separate topic
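The two injection schemes can be sketched at the level of tensor shapes. The projections are named `W_M` and `W_D` after the labels in the architecture figure; the sizes (32-dimensional latent, 12 layers, hidden size 768) and the random weights are assumptions for illustration, and no actual GPT-2 model is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, hidden_dim, num_layers, seq_len = 32, 768, 12, 5   # assumed sizes

z = rng.standard_normal(latent_dim)                              # latent from the encoder
W_M = rng.standard_normal((latent_dim, num_layers * hidden_dim)) * 0.02
W_D = rng.standard_normal((latent_dim, hidden_dim)) * 0.02

# Memory: project z once, then split it into one extra vector per decoder layer
# that every position can attend to.
memory = (z @ W_M).reshape(num_layers, hidden_dim)

# Embedding: project z to the hidden size and add it to every token embedding,
# in the same additive way position embeddings are applied.
token_embeddings = rng.standard_normal((seq_len, hidden_dim))
conditioned_embeddings = token_embeddings + z @ W_D

print(memory.shape, conditioned_embeddings.shape)
```

The Memory scheme gives every layer direct access to the latent, whereas the Embedding scheme only touches the decoder input.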
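Evaluation experiments
Language Modeling
• Evaluates whether Optimus can generate sentences correctly
• Perplexity (PPL), MI, and other text generation metrics
Guided Language Generation
• Evaluates whether sentences can be generated correctly under specific conditions
• Dialogue response generation, response generation in a specific style, label-conditioned generation
Low-resource Language Understanding
• Tests the usefulness of Optimus in low-resource settings
• Measures performance by solving GLUE with sentence embeddings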
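Experimental setup
• Latent dimension: 32
  • Inferred from the released implementation; the dimension does not seem to be stated explicitly in the paper…
• VAE training data: 1.99 million sentences from English Wikipedia
• For text generation tasks, the model is trained for one additional epoch on each dataset
• Various training tricks are used
  • For example, increasing β during training
• Low-resource Language Understanding uses the Encoder (BERT) representation corresponding to [CLS]
  • So the vector dimension is 768, not 32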
Evaluation: Language Modeling
• Much higher performance than existing small VAEs
  • Large models and pre-training on large corpora are effective for VAEs too
  • There is a trade-off, controlled by λ, between generation performance and the quality of the latent representation
Evaluation: Language Modeling
• Achieves lower PPL than GPT-2 on 3 of 4 datasets
  • Especially strong on datasets such as SNLI that contain many characteristic, stereotypical sentences
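Evaluation: Guided Language Generation
• Optimus's latent representations allow arithmetic on sentence representations
  • Sentences are generated from $z_D = z_B - z_A + z_C$
• How should this result be interpreted…?
• The demo site introduced in the paper is no longer accessible 😇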
Evaluation: Guided Language Generation
• Generation by linear interpolation between the latent representations of two sentences
• A benefit of the VAE's smooth latent space
• The takeaway is roughly that no completely unrelated sentences appear
  • The vocabulary of the interpolated sentences resembles that of the original sentences
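Both latent-space operations used in these experiments are simple vector computations. A minimal sketch on plain Python lists; the function names and the interpolation weight `tau` are illustrative, and decoding the resulting vectors back into sentences would require the trained decoder, which is not shown.

```python
def interpolate(z1, z2, tau):
    """Linear interpolation between two latent vectors: (1 - tau) * z1 + tau * z2."""
    return [(1.0 - tau) * a + tau * b for a, b in zip(z1, z2)]

def analogy(z_a, z_b, z_c):
    """Latent arithmetic z_D = z_B - z_A + z_C."""
    return [b - a + c for a, b, c in zip(z_a, z_b, z_c)]

z1, z2 = [0.0, 0.0], [2.0, 4.0]
# Walking tau from 0 to 1 traces a straight path from z1 to z2 in latent space.
path = [interpolate(z1, z2, tau) for tau in (0.0, 0.5, 1.0)]
z_d = analogy([1.0, 1.0], [2.0, 3.0], [0.0, 0.0])
print(path, z_d)
```

Each intermediate point on `path` would be passed to the decoder to obtain one interpolated sentence.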
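Evaluation: Guided Language Generation
• Experiments on three tasks, with high performance
  • Dialogue response generation
  • Generation in a specific style
  • Conditional generation
• In conditional generation, text is generated according to a sentiment classification label
  • High performance both in how well generated sentences are classified under the target label and in the diversity of generated sentences
• See the original paper for detailed experimental settings and task descriptions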
Evaluation: Low-resource Language Understanding
• A linear classifier is trained on Optimus's Encoder representations
  • Sentiment classification on the Yelp dataset
• Performance is observed as the number of training examples varies
• Optimus achieves relatively high classification performance even with few training examples
  • Its performance rises slightly faster
  • In particular, without fine-tuning it outperforms the original BERT
  • This suggests that VAE training yields a good latent space
Evaluation: Low-resource Language Understanding
• Visualizes the distribution of sentence representations from Optimus and BERT
  • The Yelp development set is converted into sentence representations
• With Optimus, the sentence representations are distributed more uniformly and the per-label clusters are clearer
  • In particular, the latent representations are more uniformly distributed than BERT's
  • Or at least it feels as though one could say so
Evaluation: Low-resource Language Understanding
• Evaluates Optimus on GLUE
  • How well does a linear classifier perform when given sentence embeddings as input?
• Without fine-tuning, Optimus outperforms the original BERT
  • Does Optimus learn better sentence representations than BERT?
  • BERT's QQP score looks off, though…
• With fine-tuning, the difference is small (unsurprising, since the Encoder is BERT to begin with)
• It would have been nice to see experiments on SentEval as well…
Summary
• Proposes Optimus, a large-scale pre-trained language model based on the VAE
• Builds a VAE by skillfully integrating BERT as the Encoder and GPT-2 as the Decoder
• High performance on text generation, conditional generation, and low-resource tasks
  • In particular, far better than existing small VAEs
  • Shows that pre-training is effective for VAEs
Impressions
• I am curious what would happen when training from scratch, without existing pre-trained language models such as BERT
  • It seems this was too demanding in terms of compute (see Sec. 6 Discussion)
• Pre-training + VAE is an interesting story, but its range of applications may be limited