
An Introduction to Multilingual Language Models: Where We've Been and Where We're Going

Ryokan RI
August 31, 2023


Invited talk at the 257th IPSJ SIG Natural Language Processing meeting (2023)

https://sites.google.com/sig-nl.ipsj.or.jp/sig-nl/%E7%A0%94%E7%A9%B6%E7%99%BA%E8%A1%A8%E4%BC%9A/NL257


  1. Self-introduction
     ‣ 2018 - 2023: The University of Tokyo, Tsuruoka Lab (Master's to PhD)
     ‣ 2023 -: LINE Corporation (NLP engineer)
     Recently, I wrote a book together with the people who mentored me during my internships as a student.
  2. The world's languages and their data volumes: 7 languages (Japanese, English, etc.), 2191 languages, 222 languages.
     From Figure 1 of The State and Fate of Linguistic Diversity and Inclusion in the NLP World (Joshi et al., ACL 2020).
  3. According to the experiments in the InstructGPT paper…
     ‣ 96% of the instruction-tuning training data is English.
     ‣ Even so, the fine-tuned GPT-3 generalizes well to other languages.
     Is ChatGPT also doing cross-lingual transfer learning?
     Q. Why is this possible?
     ‣ The base language model is trained on multiple languages, so it already has the capacity for cross-lingual transfer learning.
  4. How to build multilingual word vectors
     ① Learn the vectors from parallel data.
     From Figure 1 of Bilingual Word Representations with Monolingual Quality in Mind (Luong et al., LatentVar 2015).
     In addition to predicting the surrounding words in the text, each word also predicts its translation.
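The objective above can be sketched at the data-generation step of a bilingual skip-gram: from a word-aligned sentence pair, each word predicts its monolingual context words plus its aligned translation. A minimal sketch, assuming a toy alignment format; the function name and example are illustrative, not taken from the paper:

```python
def training_pairs(src, tgt, alignment, window=2):
    """Generate (center, word-to-predict) pairs for a joint bilingual
    skip-gram: monolingual context pairs plus cross-lingual pairs
    taken from a word alignment of the parallel sentence."""
    pairs = []
    # monolingual context pairs, as in an ordinary skip-gram
    for sent in (src, tgt):
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    pairs.append((w, sent[j]))
    # cross-lingual pairs: each word also predicts its aligned translation
    for i, j in alignment:
        pairs.append((src[i], tgt[j]))
        pairs.append((tgt[j], src[i]))
    return pairs

src = ["i", "sing"]
tgt = ["ich", "singe"]
pairs = training_pairs(src, tgt, alignment=[(0, 0), (1, 1)])
# pairs now mixes monolingual pairs like ("i", "sing")
# with cross-lingual pairs like ("sing", "singe")
```

These pairs would then be fed to a standard skip-gram trainer, so the same embedding matrix is shaped by both monolingual and cross-lingual prediction.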
  5. Obtaining multilingual word vectors via mapping
     With parallel data:
     ‣ Prepare a bilingual dictionary, and learn a mapping that makes the vectors of translation pairs coincide.
     Without parallel data (alternating bilingual-dictionary induction ⇄ mapping acquisition):
     ‣ Use adversarial training to learn a mapping under which source- and target-language vectors become indistinguishable.
     ‣ Induce a bilingual dictionary from the similarity of the words' similarity vectors.
     Exploiting Similarities among Languages for Machine Translation (Mikolov et al., arXiv 2013); Word translation without parallel data (Lample et al., ICLR 2018); A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings (Artetxe et al., ACL 2018).
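The supervised "with parallel data" variant is commonly solved as an orthogonal Procrustes problem: stack the dictionary-paired vectors as rows of X (source) and Y (target), and the best orthogonal map has a closed form via SVD. A minimal NumPy sketch on synthetic data, not any specific paper's code:

```python
import numpy as np

def procrustes_map(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F, where row i of X
    (source-language vector) and row i of Y (target-language vector)
    form one translation pair from the bilingual dictionary."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                 # 50 source word vectors, dim 8
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # hidden "true" rotation
Y = X @ Q                                     # target vectors = rotated source
W = procrustes_map(X, Y)                      # recovers the rotation, X @ W ≈ Y
```

Constraining W to be orthogonal preserves distances and angles in the source space, which empirically works better than an unconstrained linear map.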
  6. Limits of the assumption that word vector spaces are isomorphic
     For the shapes of the word vector spaces to match,
     ‣ the domains of the training corpora must be reasonably similar, and
     ‣ the languages must be reasonably close typologically.
     On the Limitations of Unsupervised Bilingual Dictionary Induction (Søgaard et al., ACL 2018).
     So obtaining multilingual word vectors without any parallel data does not always go smoothly… 😢
  7. Performance of multilingual BERT
     From Table 2 of XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization (Hu et al., ICML 2020).
     ‣ mBERT is pretrained on Wikipedia articles in more than 100 languages.
     ‣ Across a wide range of tasks, it scores far above the chance rate when evaluated on languages it never saw during fine-tuning.
  8. A rumor whispered early on: the shared-string hypothesis
     Partly thanks to borrowing, many words have matching surface forms!
     EN sing / DE singen 🎶  EN banana / FR banane 🍌
     Languages in the same family share many similar strings. Because text is split into subwords, the same subword ends up carrying the same meaning across different languages.
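The subword-sharing effect can be illustrated with a toy greedy longest-match segmenter over a shared vocabulary, a simplified stand-in for BPE/WordPiece; the vocabulary here is made up for the example:

```python
def tokenize(text, vocab):
    """Greedy longest-match subword segmentation over a shared
    vocabulary (a toy stand-in for BPE/WordPiece)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # fall back to a single character
            i += 1
    return tokens

vocab = {"sing", "en", "banan", "a", "e"}
tokenize("singen", vocab)   # ["sing", "en"]
tokenize("banane", vocab)   # ["banan", "e"]
tokenize("banana", vocab)   # ["banan", "a"]
```

Since "sing" and "banan" are single vocabulary entries, English and German/French text hit the same token IDs, which is exactly the overlap the hypothesis credits for cross-lingual transfer.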
  9. Testing the shared-string hypothesis
     Are shared strings what matters? ➡︎ What happens if we train a multilingual BERT in a setting where shared strings never appear at all?
     It suffices that token IDs from the different languages do not overlap. How the experimental settings were built:
     ‣ Shift the Unicode code points of one language's characters, turning them into unusual characters.
     ‣ Shift the token IDs.
     Emerging Cross-lingual Structure in Pretrained Language Models (Conneau et al., ACL 2020); Cross-Lingual Ability of Multilingual BERT: An Empirical Study (K et al., ICLR 2020).
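The first manipulation fits in a few lines: shifting every code point by a fixed offset (the choice of the Private Use Area below is an illustrative assumption, not taken from the papers) yields a "fake" language with zero string overlap while leaving the corpus statistics untouched:

```python
def shift_unicode(text, offset=0xE000):
    """Shift every code point by a fixed offset (here toward the
    Unicode Private Use Area), so the shifted corpus shares no
    characters -- and hence no subwords -- with the original."""
    return "".join(chr(ord(c) + offset) for c in text)

fake = shift_unicode("hello")
# `fake` has the same length and character frequencies as "hello",
# but no character in common with any real-language corpus
```

The transformation is invertible (subtract the same offset), so the fake language is distributionally identical to the real one; any remaining transfer ability cannot be explained by shared strings.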
  10. Are multilingual models already good enough with ChatGPT? - The inefficiency of tokenization -
      From Figure 5 of Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models (Ahia et al., NAACL 2023).
      Even for text with the same content, Japanese consumes roughly twice as many tokens as English.
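One intuition for why this happens: under a byte-fallback BPE whose merges are learned mostly from English data, underrepresented scripts are charged close to one token per UTF-8 byte, and most Japanese characters take 3 bytes each. As a rough sketch only, UTF-8 byte counts stand in for real commercial tokenizers here, which behave differently in detail:

```python
def utf8_cost(text):
    """UTF-8 byte count: a crude proxy for token cost under a
    byte-level BPE with English-dominated merges, where non-Latin
    scripts largely fall back to raw bytes."""
    return len(text.encode("utf-8"))

en = "Good morning"        # 12 ASCII characters -> 12 bytes
ja = "おはようございます"     # 9 characters x 3 bytes -> 27 bytes
ratio = utf8_cost(ja) / utf8_cost(en)
```

Even for a greeting with the same content, the Japanese version costs more than twice as much under this proxy, mirroring the roughly 2x token gap reported in the paper.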