Neural Machine Translation with Byte-Level Subwords


Scatter Lab Inc.

May 15, 2020

Transcript

  1. Byte-level BPE: Neural Machine Translation with Byte-level Subwords
     Joohong Lee (ML Research Scientist, Pingpong)
  2. Overview
     • "Neural Machine Translation with Byte-level Subwords"
     • Changhan Wang, Kyunghyun Cho, and Jiatao Gu (Facebook AI Research)
     • AAAI 2020 (arXiv 2019)
  3. 1. Introduction Neural Machine Translation with Byte-level Subwords (Wang et

    al., 2019)
  4. Byte-Pair Encoding (BPE) (1. Introduction)
     • Iteratively merges the most frequent pair of symbols:
       vocab = all_unique_characters
       while len(vocab) < max_vocab_size:
           pair = get_max_pair(corpus)
           corpus = merge_vocab(corpus, pair)
           vocab.append(pair)
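The pseudocode above can be fleshed out into a minimal runnable sketch. The toy corpus, helper names, and merge bookkeeping are illustrative assumptions, not the paper's implementation:

```python
from collections import Counter

def get_max_pair(corpus):
    """Return the most frequent adjacent symbol pair in the corpus, or None."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_vocab(corpus, pair):
    """Merge every occurrence of `pair` into a single symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in corpus.items()}

def learn_bpe(corpus, max_vocab_size):
    # Start from the set of unique characters, then add merged symbols.
    vocab = sorted({sym for word in corpus for sym in word.split()})
    while len(vocab) < max_vocab_size:
        pair = get_max_pair(corpus)
        if pair is None:
            break
        corpus = merge_vocab(corpus, pair)
        vocab.append("".join(pair))
    return vocab, corpus

# Toy corpus (illustrative): words pre-split into characters, with frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
vocab, merged = learn_bpe(corpus, max_vocab_size=12)
```

With 10 initial characters and a budget of 12, two merges happen: first `e s` → `es`, then `es t` → `est`. A production implementation (e.g. subword-nmt or SentencePiece) would additionally guard against merges matching across symbol boundaries.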
  5. Character? Byte? (1. Introduction)
     • Character (a, b, c, 가, 나, 다, …)
       • "BPE" usually means character-level BPE
       • Natural, since text is most naturally represented as a sequence of characters
     • Byte (E3, 81, AE, …)
       • Compactness: with only 256 tokens, anything can be represented
       • Usable regardless of language
  7. Limitations of character-level BPE (1. Introduction)
     • Characters can occupy too many vocabulary slots
       • Rare characters from noisy text
       • Character-rich languages (such as CJK languages)
     • Poorly suited to handling multiple languages
       • bilingual and multilingual settings
       • Covering 150 languages requires 138K Unicode characters
       • In contrast, 248 of the 256 possible UTF-8 byte values cover everything
  8. 2. Byte-level BPE Neural Machine Translation with Byte-level Subwords (Wang

    et al., 2019)
  9. Byte-level BPE (BBPE) (2. Byte-level BPE)
     • Basically, encode Unicode characters as UTF-8
       • 1 Unicode character = 1-4 bytes
     • Run BPE training on the encoded byte sequences
     • Final vocab: the UTF-8 byte set + variable-length byte n-grams added by BPE
     Byte sequence: EA B0 80 EB 82 98 EB 8B A4 EB 9D BC EB A7 88 EB B0 94 EC 82 AC
     Byte set: EA, B0, 80, EB, 82, 98, 8B, A4, 9D, BC, A7, 88, 94, EC, AC
     Variable-length byte n-grams: EA B0, EB 82 98, A4 EB, …
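The slide's byte sequence is simply the UTF-8 encoding of the example string 가나다라마바사 (each Hangul syllable takes 3 bytes). A quick check in Python:

```python
# Each Hangul syllable encodes to 3 UTF-8 bytes (Unicode characters take 1-4 bytes).
text = "가나다라마바사"
byte_seq = text.encode("utf-8")

print(byte_seq.hex(" ").upper())
# → EA B0 80 EB 82 98 EB 8B A4 EB 9D BC EB A7 88 EB B0 94 EC 82 AC

# BBPE's base vocabulary before any merges is just the set of bytes that occur.
byte_set = sorted({f"{b:02X}" for b in byte_seq})
```

The 21 bytes contain 15 distinct values; BPE merges then grow the vocabulary with frequent byte n-grams on top of this base set.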
  10. Byte-level BPE (BBPE) (2. Byte-level BPE)
      • Encode Unicode characters as UTF-8, then run BPE on the byte sequences
      • As a result, a single character can be split across byte tokens
  11. Contextualization (2. Byte-level BPE)
      • The authors report that contextualization is needed before the main model
      • Token embeddings are run through a simple CNN or GRU before entering the main Transformer model
  12. Decoding (2. Byte-level BPE)
      • Every sentence can be represented as a byte sequence,
        but some byte sequences are ambiguous to decode back into a sentence
      • Ex) Generation, Translation
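A concrete way to see the ambiguity: a model emitting bytes can stop mid-character, leaving output that is not valid UTF-8. A small illustration (the truncation is purely for demonstration):

```python
# "가" is the 3-byte UTF-8 sequence EA B0 80; truncating it leaves an invalid prefix.
valid = bytes.fromhex("EAB080")
assert valid.decode("utf-8") == "가"

truncated = valid[:2]  # EA B0 — not decodable on its own
try:
    truncated.decode("utf-8")
except UnicodeDecodeError:
    print("invalid byte sequence")
```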
  13. Decoding (2. Byte-level BPE)
      • Empirically, trained models are said to rarely emit invalid byte sequences
        • Almost never in the authors' experiments, and rare even on a large test set of 165K examples
      • Slightly under-trained models, however, tend to repeat duplicated bytes
      • The goal is to recover as many Unicode characters as possible from such error patterns, in linear time
      • A dynamic-programming-based algorithm is proposed
  14. Decoding: algorithm (2. Byte-level BPE)
      • A byte sequence {B}_{k=1}^{N} is given
      • Let f(k) be the maximum number of characters recoverable from the first k bytes
      • f(k) can be computed with dynamic programming:
        f(k) = max_{t ∈ {1, 2, 3, 4}} [ f(k − t) + g(k − t + 1, k) ]
        where g(i, j) = 1 if {B}_{k=i}^{j} is a valid character, and 0 otherwise
      • Backtracking recursively through f(k) recovers the solution
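The recurrence can be turned into a short sketch. This is my reading of the slide's algorithm: g(i, j) is implemented as a UTF-8 decode check, and the backtracking bookkeeping is my own choice, not the paper's code:

```python
def is_valid_char(seg: bytes) -> bool:
    """g(i, j): does this byte span decode to exactly one character?"""
    try:
        return len(seg.decode("utf-8")) == 1
    except UnicodeDecodeError:
        return False

def recover(b: bytes) -> str:
    """Recover a maximal character subsequence from a possibly corrupted byte sequence.
    f[k] = max number of characters recoverable from the first k bytes."""
    n = len(b)
    f = [0] * (n + 1)
    back = [None] * (n + 1)  # length of the character ending at k on the optimal path
    for k in range(1, n + 1):
        # Either drop byte k (g = 0 case), or end a valid 1-4 byte character at k.
        f[k], back[k] = f[k - 1], None
        for t in range(1, min(4, k) + 1):
            if is_valid_char(b[k - t:k]) and f[k - t] + 1 > f[k]:
                f[k], back[k] = f[k - t] + 1, t
    # Backtrack to collect the recovered characters.
    chars, k = [], n
    while k > 0:
        t = back[k]
        if t is None:
            k -= 1  # byte k was dropped
        else:
            chars.append(b[k - t:k].decode("utf-8"))
            k -= t
    return "".join(reversed(chars))
```

For example, prepending the two stray bytes `EA B0` (a truncated 가) to a valid sequence just drops them: `recover(b"\xea\xb0" + "나다".encode("utf-8"))` yields "나다". Each position looks back at most 4 bytes, so the whole recovery runs in linear time, as the slide claims.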
  15. 3. Experiments
      Neural Machine Translation with Byte-level Subwords (Wang et al., 2019)
  16. Experimental Setting (3. Experiments)
      • Dataset
        • Bilingual: En-De, Ja-En, Si-En
        • Multilingual: Many-to-English (X-En) → TED Talks corpus, parallel data for 59 languages
      • BPE & BBPE: trained with SentencePiece on source + target sentences
  17. Experimental Setting (3. Experiments)
      • Model and Learning
        • Transformer is used
        • Largely follows the settings of Vaswani et al., 2017
      • Inference and Evaluation
        • Beam size: 4 for En-De, 5 for the rest
        • "We calculate case-sensitive tokenized BLEU (Papineni et al. 2002) as the metrics using sacreBLEU (Post 2018)."
  18. Results: Qualitative Comparison: BPE vs. BBPE (3. Experiments)
      • Symbol Frequency Distribution
      • BBPE's distribution is much more even: there is almost no long tail,
        and rare words are represented as subwords
  19. Results: Qualitative Comparison: BPE vs. BBPE (3. Experiments)
      • Ratio of BBPE tokens with partial characters
      • For Japanese and the multilingual setting, the ratio of partial characters is substantial
      • Character set sizes: Japanese (8K), multilingual (11K)
  20. Results: Qualitative Comparison: BPE vs. BBPE 3. Experiments

  21. Results: Qualitative Comparison: BPE vs. BBPE (3. Experiments)
      • Cross-lingual Sharing
        • How much do each language's symbols overlap with those of X-En?
        • Tested on Ar, He, Ru, Ko, It
        • Overall, BBPE symbols overlap much more
        • On the model side, the benefit of parameter sharing
        • On the vocab side, the benefit of universal modeling
  22. Results: Qualitative Comparison: BPE vs. BBPE (3. Experiments)
      • Impact on Sequence Length
      • One might expect BBPE sequences to grow long since it works at a finer granularity,
        but they are not that much longer
  23. Results: Importance of Contextualization (3. Experiments)
      • Three settings compared on X-En:
        • none
        • 1-layer CNN
        • 1-layer Bi-GRU
      • The finer-grained the vocab, the larger the effect
  24. Results: BBPE on Noisy Character Sets (3. Experiments)
      • The En-De dataset contains some non-Latin characters
        • Because of this, the character set reaches 3.4K
      • Since BPE must include the entire character set, these entries waste many vocab slots
      • BBPE 2K and 4K achieve results comparable to BPE 32K,
        with a large saving in parameter count
  25. Results: BBPE on Character-Rich Languages (3. Experiments)
      • Chinese and Japanese have character sets exceeding 50K
      • The Ja-En dataset has an 8K character set in total,
        and the top 2.4K characters cover 99% of occurrences
      • Considering this, the BBPE size is set to 4K
      • Performance is comparable to BPE
  26. Results: BBPE on Many-to-En Translation (3. Experiments)
      • BBPE performs well; char/byte-level is even better (?)
  27. Results: BBPE on Many-to-En Translation (3. Experiments)
      • Impact on Sequence Length
      • Large length mismatches between source and target seem to make attention harder;
        perhaps that is why (B)BPE performance drops?
  28. Results: BBPE on Many-to-En Translation (3. Experiments)
      • Nevertheless, BBPE seems to offer the best overall balance of performance and speed
  29. Results: Transfer Learning on Unseen Characters (3. Experiments)
      • Since BBPE includes every UTF-8 byte, OOV problems cannot occur
      • Transfer is therefore possible even between two languages whose character sets do not overlap at all
      • Fine-tuning an X-En pre-trained model on Si-En shows that transfer works well
  30. 4. Conclusion
      Neural Machine Translation with Byte-level Subwords (Wang et al., 2019)
  31. Contributions (4. Conclusion)
      • Proposes BBPE, which builds a byte-level subword vocabulary
      • Maintains performance on par with character-based methods while making the vocabulary much smaller
        • In multilingual settings it is sometimes even better
      • No OOV problem at all
      • Transfers across diverse languages; this is very generic and brings gains in
        performance and training acceleration
      • Sequence lengths are also shorter than with character-based methods,
        enabling faster training and inference
  32. Future Work (4. Conclusion)
      • Address the performance drop when source and target lengths differ greatly
      • Evaluate in one-to-many and many-to-many settings as well
  33. Thank you! ✌ If you have further questions or anything you are curious about,
      feel free to reach out anytime via the contacts below!
      Joohong Lee (ML Research Scientist, Pingpong)
      Email. joohong@scatterlab.co.kr  Facebook. @roomylee  LinkedIn. @roomylee