Generalized Data Augmentation for Low-Resource Translation
RIKEN AIP / Tohoku University, Inui-Suzuki Laboratory
Shun Kiyono

Paper being presented:
Generalized Data Augmentation for Low-Resource Translation
Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, Graham Neubig
Language Technologies Institute, Carnegie Mellon University
{mengzhox, xiangk, aanastas, gneubig}
Abstract (excerpt): Translation to or from low-resource languages (LRLs) poses challenges for machine translation in terms of both adequacy and fluency.

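What kind of paper is this?
• Background and problem
  • For low-resource language pairs, it is hard to improve performance with back-translation
• Idea
  • Make good use of data from a linguistically close high-resource language pair (e.g., Azerbaijani and Turkish)
• Contributions
  • How best to use the high-resource language is non-trivial, so the authors exhaustively test combinations of various methods
  • Replacing high-resource-language text with the low-resource language word by word works well
  • Shows that monolingual data can be exploited by pivoting through the high-resource language
[Image caption: Turkey and Azerbaijan are close to each other]
September 28, 2019, RIKEN AIP / Inui-Suzuki Laboratory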

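Background: back-translation is great!
• Back-translation
  • A method that uses the translations produced by a reverse-direction translation model as new pseudo-parallel data
  • Large, high-quality monolingual corpora become usable!
• A standard data augmentation method for machine translation
• Performance scales with the amount of pseudo-parallel data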
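As a rough illustration of the bullets above, the sketch below pairs each genuine target-side sentence with a synthetic source sentence and mixes the result with real parallel data. The names `back_translate` and `toy_reverse_model` are made up for this example; the toy "model" simply upper-cases its input and stands in for a trained target-to-source NMT system.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_monolingual: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Pair each genuine target sentence with a synthetic source sentence."""
    return [(reverse_translate(tgt), tgt) for tgt in target_monolingual]


def toy_reverse_model(sentence: str) -> str:
    """Hypothetical stand-in for a trained target-to-source translation model."""
    return sentence.upper()


# Real parallel data (toy) plus pseudo-parallel data from monolingual text.
real_pairs: List[Pair] = [("ein haus", "a house")]
pseudo_pairs = back_translate(["a cat", "a dog"], toy_reverse_model)
training_data = real_pairs + pseudo_pairs
```

The target side of every pseudo pair is human-written text; only the source side is machine-generated, which is why monolingual corpora on the target side are useful here.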

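Background: in the low-resource case, back-translation is not so great...
• The low-resource language (LRL) case
  • Here, a few thousand to a few tens of thousands of sentence pairs count as low-resource
• The improvement from back-translation is limited
  • In some cases performance even gets worse...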

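Idea: exploit a high-resource language pair
• We want to make good use of a High Resource Language (HRL)
• However, how exactly the HRL should be used is non-trivial
• The authors exhaustively try combinations of various methods and report the results
  • (This is not a proposal of a new methodology)
  • (Sharing findings is the main contribution)
  • (So what is "Generalized Data Augmentation"?)
[Diagram: LRL-ENG and HRL-ENG pairs. The LRL and HRL are assumed to be highly related in syntactic structure and vocabulary.]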

The big picture of Generalized Data Augmentation
[Figure 1 from the paper. Caption (cropped in the slide): "With a low-resource language (LRL) and a related high-resource language (HRL), typical data augmentation scenarios use any available parallel data [b] ..."
 Legend: available resource / generated resource.
 Resources: [a] ENG, [b] HRL-ENG, [c] LRL-ENG.
 Generation paths: [1] ENG→LRL, [2] ENG→HRL, [3] HRL→LRL, [4] HRL→LRL.]
Slide annotations:
• Back-translation alone makes it hard to improve performance...
• The paper tries combinations of these paths

Usage 1: HRL→LRL
• A reasonable amount of HRL-ENG parallel data exists
• Translating the HRL side into the LRL yields an LRL-ENG pseudo-parallel corpus
• Train on it mixed with the true parallel corpus
[Figure 1 from the paper, repeated]

Usage 2: ENG→HRL→LRL
• Translate from ENG into the LRL by way of the HRL
  • Uses an ENG→HRL back-translation model
• LRL→ENG pseudo-parallel data can be obtained from an ENG monolingual corpus
• Benefit: ENG monolingual corpora are available in almost unlimited quantity
[Figure 1 from the paper, repeated]
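The two usages can be sketched as small data pipelines. This is only an illustration under assumptions: `augment_from_hrl`, `augment_by_pivoting`, and the two toy converters are hypothetical names, and the toy converters just rewrite a language marker instead of translating.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (LRL-side source, ENG target)
Translator = Callable[[str], str]


def augment_from_hrl(hrl_eng: List[Tuple[str, str]], hrl_to_lrl: Translator) -> List[Pair]:
    """Usage 1: convert the HRL side of HRL-ENG parallel data into the LRL."""
    return [(hrl_to_lrl(hrl), eng) for hrl, eng in hrl_eng]


def augment_by_pivoting(
    eng_mono: List[str], eng_to_hrl: Translator, hrl_to_lrl: Translator
) -> List[Pair]:
    """Usage 2: ENG monolingual text -> HRL (back-translation) -> LRL."""
    return [(hrl_to_lrl(eng_to_hrl(eng)), eng) for eng in eng_mono]


def toy_eng_to_hrl(sentence: str) -> str:
    """Hypothetical stand-in for an ENG->HRL translation model."""
    return "<hrl> " + sentence


def toy_hrl_to_lrl(sentence: str) -> str:
    """Hypothetical stand-in for the HRL->LRL conversion step."""
    return sentence.replace("<hrl>", "<lrl>")


from_parallel = augment_from_hrl([("<hrl> hello", "hello")], toy_hrl_to_lrl)
from_pivot = augment_by_pivoting(["good morning"], toy_eng_to_hrl, toy_hrl_to_lrl)
```

In both cases the ENG side stays untouched, so the resulting pairs can be mixed directly into LRL→ENG training data.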

How do we do HRL→LRL?
• Translation from the HRL to the LRL is itself low-resource
• Assumption: the HRL and LRL are linguistically similar
  • Even unsupervised methods can be expected to reach reasonable translation accuracy
• Use word-level substitution and unsupervised MT

Table 3 (from the paper): A POR-GLG pivoting example with corresponding pivot BLEU scores. Edits by word substitution or M-UMT are highlighted in the original.
  S_LE (GLG):  Pero con todo, veste obrigado a agardar nas mans dunha serie de estraños moi profesionais.
  S_HE (POR):  Em vez disso, somos obrigados a esperar nas mãos de uma série de estranhos muito profissionais.  [pivot BLEU 0.09]
  Ŝ^w_{H→L}:   En vez disso, somos obrigados a esperar nas mans de unha serie de estraños moito profesionais.  [pivot BLEU 0.18]  (word-level substitution)
  Ŝ^m_{H→L}:   En vez diso, somos obrigados a esperar nas mans dunha serie de estraños moi profesionais.  [pivot BLEU 0.54]  (unsupervised MT)
  T_LE:        But instead, you are forced there to wait in the hands of a series of very professional strangers.

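(1) Word-level substitution
1. Train word vectors for each language separately, then learn a mapping W between them [Xing+2015]
2. Add word pairs that are nearest neighbors in the word vector space to a dictionary
3. Replace each word in the HRL text with its corresponding LRL word
  • If there is no corresponding word, skip it (the word is left as is)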
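The three steps above can be sketched with NumPy as follows. This is a minimal sketch, not the paper's implementation: it assumes W is an orthogonal map fitted on a set of seed word pairs via the Procrustes solution (whether the paper uses seed pairs or an unsupervised initialization is not stated in these slides), and the function names and the 2-dimensional toy vectors are invented for illustration.

```python
import numpy as np


def fit_orthogonal_map(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Step 1: orthogonal W minimizing ||X W - Y||_F (Procrustes solution).

    Rows of X are HRL vectors, rows of Y the vectors of their LRL counterparts.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def induce_dictionary(hrl_vocab, hrl_vecs, lrl_vocab, lrl_vecs, W):
    """Step 2: map HRL vectors with W and take the cosine-nearest LRL word."""
    mapped = hrl_vecs @ W
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    target = lrl_vecs / np.linalg.norm(lrl_vecs, axis=1, keepdims=True)
    nearest = (mapped @ target.T).argmax(axis=1)
    return {h: lrl_vocab[j] for h, j in zip(hrl_vocab, nearest)}


def substitute(sentence: str, dictionary: dict) -> str:
    """Step 3: replace each HRL word; words without an entry are left as is."""
    return " ".join(dictionary.get(tok, tok) for tok in sentence.split())


# Toy data: the LRL space is the HRL space rotated by 90 degrees.
hrl_vocab = ["mãos", "série"]
lrl_vocab = ["mans", "serie"]
hrl_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
rotation = np.array([[0.0, 1.0], [-1.0, 0.0]])
lrl_vecs = hrl_vecs @ rotation

W = fit_orthogonal_map(hrl_vecs, lrl_vecs)
dictionary = induce_dictionary(hrl_vocab, hrl_vecs, lrl_vocab, lrl_vecs, W)
converted = substitute("nas mãos de uma série", dictionary)
```

Here `converted` keeps "nas", "de", and "uma" unchanged because they have no dictionary entry, matching the behavior described in step 3.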

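(2) Unsupervised MT
• Almost identical to existing unsupervised MT methods
  • (The paper calls it "Modified UMT", but I could not tell what the difference is...)
  • (Probably the difference is that HRL→LRL word substitution is applied beforehand)
• Losses are computed from two components, a denoising auto-encoder and iterative back-translation, and their weighted sum is used as the training objective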
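The weighted-sum objective in the last bullet can be written as follows. The symbols are assumptions chosen for this note (the slides do not give the paper's notation): $\mathcal{L}_{\mathrm{DAE}}$ is the denoising auto-encoder loss, $\mathcal{L}_{\mathrm{BT}}$ the iterative back-translation loss, and $\lambda_{\mathrm{DAE}}, \lambda_{\mathrm{BT}} \ge 0$ their weights.

```latex
\mathcal{L}(\theta)
  = \lambda_{\mathrm{DAE}} \, \mathcal{L}_{\mathrm{DAE}}(\theta)
  + \lambda_{\mathrm{BT}} \, \mathcal{L}_{\mathrm{BT}}(\theta)
```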

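Experimental setup
• Data: Multilingual TED corpus [Qi+2018]
• Language pairs (LRL with its related HRL, as in the header of the results table): AZE (TUR), BEL (RUS), GLG (POR), SLK (CES)
• Wikipedia is used for the monolingual corpora
• Model: Transformer (4 layers)
• Baseline: multilingual NMT over the HRL and LRL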

Experimental results: a lot of hard work

Table 2 (from the paper): Evaluation of translation performance over four language pairs. Rows 1 and 2 show pre-training BLEU (caption cropped in the slide). Scores are BLEU for X→ENG.

  #   Training data                                          AZE     BEL     GLG     SLK
                                                            (TUR)   (RUS)   (POR)   (CES)
  Results from literature
      SDE (Wang et al., 2019)                               12.89   18.71   31.16   29.16
      many-to-many (Aharoni et al., 2019)                   12.78   21.73   30.65   29.54
  Standard NMT
  1   {S_LE S_HE, T_LE T_HE} (supervised MT)                11.83   16.34   29.51   28.12
  2   {M_L, M_E} (unsupervised MT)                           0.47    0.18    1.15    0.75
  Standard supervised back-translation
  3   + {Ŝ^s_{E→L}, M_E}                                    11.84   15.72   29.19   29.79
  4   + {Ŝ^s_{E→H}, M_E}                                    12.46   16.40   30.07   30.60
  Augmentation from HRL-ENG
  5   + {Ŝ^s_{H→L}, T_HE} (supervised MT)                   11.92   15.79   29.91   28.52
  6   + {Ŝ^u_{H→L}, T_HE} (unsupervised MT)                 11.86   13.83   29.80   28.69
  7   + {Ŝ^w_{H→L}, T_HE} (word subst.)                     14.87   23.56   32.02   29.60
  8   + {Ŝ^m_{H→L}, T_HE} (modified UMT)                    14.72   23.31   32.27   29.55
  9   + {Ŝ^w_{H→L} Ŝ^m_{H→L}, T_HE T_HE}                    15.24   24.25   32.30   30.00
  Augmentation from ENG by pivoting
  10  + {Ŝ^w_{E→H→L}, M_E} (word subst.)                    14.18   21.74   31.72   30.90
  11  + {Ŝ^m_{E→H→L}, M_E} (modified UMT)                   13.71   19.94   31.39   30.22
  Combinations
  12  + {Ŝ^w_{H→L} Ŝ^w_{E→H→L}, T_HE M_E} (word subst.)     15.74   24.51   33.16   32.07
  13  + {Ŝ^w_{H→L} Ŝ^m_{H→L}, T_HE T_HE}                    15.91   23.69   32.55   31.58
      + {Ŝ^w_{E→H→L} Ŝ^m_{E→H→L}, M_E M_E}

Slide annotations:
• Results from literature: numbers reported in prior papers
• Baseline
• Best scores with data augmentation (you can get about this far)

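Analysis 1: using ordinary back-translation
• Back-translating with ENG→LRL does not improve performance
  • If anything, it gets worse
• Adding ENG→HRL gives a slight improvement
  • For some corpora (BEL) the effect is limited
  • (The similarity between the HRL and LRL seems to play a role)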

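Analysis 2: results with HRL→LRL
• Simply substituting the HRL side word by word improves performance dramatically
• Unsupervised MT also improves performance, but only as much as word substitution or less
  • (Unsupervised MT is the harder one to train, so this result is a little sad)
• Combining word substitution and unsupervised MT improves performance further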

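Analysis 3: the effect of ENG→HRL→LRL
• Back-translating by way of the HRL makes it possible to use monolingual corpora effectively
• The trend is similar to the HRL→LRL case
  • Word-level substitution > unsupervised MT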

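Analysis 4: in the end, unsupervised MT does not help
• Combining word-substituted HRL→LRL data with ENG→HRL→LRL data gives the best performance (row 12)
• Adding unsupervised MT on top of that (row 13) lowers BLEU on three of the four language pairs...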
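To make the comparison concrete, the arithmetic below recomputes BLEU differences from the Table 2 numbers shown earlier (rows 1, 12, and 13). Only the `delta` helper and the variable names are introduced here; the scores are copied from the table.

```python
LANGS = ["AZE", "BEL", "GLG", "SLK"]

# BLEU scores copied from Table 2 (row number -> scores in LANGS order).
ROWS = {
    1: [11.83, 16.34, 29.51, 28.12],   # supervised baseline
    12: [15.74, 24.51, 33.16, 32.07],  # word subst. on HRL->LRL and ENG->HRL->LRL
    13: [15.91, 23.69, 32.55, 31.58],  # row 12 plus modified UMT data
}


def delta(row_a: int, row_b: int) -> dict:
    """BLEU difference (row_a minus row_b) per language, rounded to 2 decimals."""
    return {
        lang: round(a - b, 2)
        for lang, a, b in zip(LANGS, ROWS[row_a], ROWS[row_b])
    }


gain_over_baseline = delta(12, 1)   # improvement of the best word-subst. setup
effect_of_adding_umt = delta(13, 12)  # what adding unsupervised MT changes

for name, result in [("12 vs 1", gain_over_baseline), ("13 vs 12", effect_of_adding_umt)]:
    print(name, result)
```

Row 12 gains between roughly 3.6 and 8.2 BLEU over the baseline, while row 13 improves on row 12 only for AZE and is lower for BEL, GLG, and SLK.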

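Why doesn't unsupervised MT work well?
• In the experiments, word-level substitution consistently beats unsupervised MT
• Is unsupervised MT actually translating properly?
  • → It does translate (pivot BLEU goes up), but this does not contribute to final performance (translation BLEU goes down)
• According to the authors, the cause is that unsupervised MT is applied after word-level substitution (I am not convinced...)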

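(Recap) What kind of paper is this?
• Background and problem
  • For low-resource language pairs, it is hard to improve performance with back-translation
• Idea
  • Make good use of data from a linguistically close high-resource language pair (e.g., Azerbaijani and Turkish)
• Contributions
  • How best to use the high-resource language is non-trivial, so the authors exhaustively test combinations of various methods
  • Replacing high-resource-language text with the low-resource language word by word works well
  • Shows that monolingual data can be exploited by pivoting through the high-resource language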

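Impressions
• It does not feel very much like "Generalized Data Augmentation"
  • What exactly is "generalized"?
  • I would like more guideline-like information and experiments
• Why is unsupervised MT so ineffective?
  • HRL→LRL translation quality itself is high, though...
  • Intuitively, the higher the quality of the pseudo data, the more it should contribute to performance
  • Could the cause be that the quality is only halfway high?
  • There are also reports that it is better when pseudo data can be distinguished as pseudo data
    • [Edunov+2018] Understanding Back-Translation at Scale
    • [Caswell+2019] Tagged Back-Translation
• Figure 1 is excellent (it gives an overview of the general ways to create pseudo data)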
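For reference, the idea of making pseudo data distinguishable, as in tagged back-translation, can be sketched as prepending a reserved token to the source side of every synthetic pair. This is not something the presented paper does; the function name `tag_synthetic` and the token string "<BT>" are assumptions for this example.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def tag_synthetic(pairs: List[Pair], tag: str = "<BT>") -> List[Pair]:
    """Mark synthetic source sentences with a reserved token (assumed: "<BT>")."""
    return [(f"{tag} {src}", tgt) for src, tgt in pairs]


real_pairs: List[Pair] = [("nas mans dunha serie", "in the hands of a series")]
pseudo_pairs: List[Pair] = [("nas mans de unha serie", "in the hands of a series")]

# Real pairs stay untouched; only pseudo pairs carry the tag.
training_data = real_pairs + tag_synthetic(pseudo_pairs)
```

With this marking, the model can in principle learn to treat genuine and synthetic sources differently, which is one way to test the "halfway-high quality" question raised above.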