Slide 1

Slide 1 text

Byte-level BPE: Neural Machine Translation with Byte-level Subwords Joohong Lee (ML Research Scientist, Pingpong)

Slide 2

Slide 2 text

Neural Machine Translation with Byte-level Subwords Overview • “Neural Machine Translation with Byte-level Subwords” • Changhan Wang, Kyunghyun Cho, and Jiatao Gu (Facebook AI Research) • AAAI 2020 (arXiv 2019)

Slide 3

Slide 3 text

1. Introduction Neural Machine Translation with Byte-level Subwords (Wang et al., 2019)

Slide 4

Slide 4 text

Byte-Pair Encoding (BPE) 1. Introduction
• Repeatedly merges the most frequent character pair

  vocab = all_unique_characters
  while len(vocab) < max_vocab_size:
      pair = get_max_pair(corpus)         # most frequent adjacent symbol pair
      corpus = merge_vocab(corpus, pair)  # merge that pair throughout the corpus
      vocab.append(pair)                  # the merged pair becomes a new symbol
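For concreteness, here is a small runnable version of this merge loop, in the style of the reference BPE implementation from Sennrich et al. (2016). The toy corpus and the vocabulary size of 15 are made up for illustration, not taken from the paper.

import collections
import re

def get_max_pair(corpus):
    """Return the most frequent adjacent symbol pair in the corpus."""
    pairs = collections.Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_vocab(corpus, pair):
    """Merge every occurrence of `pair` into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    new_symbol = "".join(pair)
    return {pattern.sub(new_symbol, word): freq for word, freq in corpus.items()}

# Toy corpus: words as space-separated symbols, with frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
vocab = sorted({sym for word in corpus for sym in word.split()})
max_vocab_size = 15
while len(vocab) < max_vocab_size:
    pair = get_max_pair(corpus)
    corpus = merge_vocab(corpus, pair)
    vocab.append("".join(pair))
print(vocab)  # 10 characters plus the learned merges: 'es', 'est', 'lo', 'low', 'ne'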

Slide 5

Slide 5 text

Character? Byte? 1. Introduction
• Character (a, b, c, 가, 나, 다, …)
  • "BPE" usually refers to this character-level variant
  • It feels natural because text is most naturally seen as a sequence of characters
• Byte (E3, 81, AE, …)
  • Compactness: with only 256 tokens, anything can be represented
  • Can be used regardless of language

Slide 6

Slide 6 text

Character? Byte? 1. Introduction
• Character (a, b, c, 가, 나, 다, …)
  • "BPE" usually refers to this character-level variant
  • It feels natural because text is most naturally seen as a sequence of characters
• Byte (E3, 81, AE, …)
  • Compactness: with only 256 tokens, anything can be represented
  • Can be used regardless of language

Slide 7

Slide 7 text

Limitations of character-level BPE 1. Introduction
• Characters can occupy too many slots in the vocabulary
  • Rare characters from noisy text
  • Character-rich languages (such as CJK languages)
• Poorly suited to handling multiple languages
  • Bilingual and multilingual settings
  • Covering 150 languages would require 138K Unicode characters
  • In contrast, only 248 of the 256 possible UTF-8 byte values are needed to cover them all

Slide 8

Slide 8 text

2. Byte-level BPE Neural Machine Translation with Byte-level Subwords (Wang et al., 2019)

Slide 9

Slide 9 text

Byte-level BPE (BBPE) 2. Byte-level BPE
• Each Unicode character is first encoded into UTF-8
  • 1 Unicode character = 1~4 bytes
• BPE is then learned over the encoded byte sequence
• Final vocab: the UTF-8 byte set + the variable-length byte n-grams added by BPE merges
Byte Sequence: EA B0 80 EB 82 98 EB 8B A4 EB 9D BC EB A7 88 EB B0 94 EC 82 AC
Byte set: EA, B0, 80, EB, 82, 98, 8B, A4, 9D, BC, A7, 88, 94, EC, AC
Variable-length n-gram bytes: EA B0, EB 82 98, A4 EB, …

Slide 10

Slide 10 text

Byte-level BPE (BBPE) 2. Byte-level BPE
• Each Unicode character is first encoded into UTF-8
  • 1 Unicode character = 1~4 bytes
• BPE is then learned over the encoded byte sequence
• Final vocab: the UTF-8 byte set + the variable-length byte n-grams added by BPE merges
• A single character can be split across BBPE tokens
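To make the slide's example concrete, a short snippet (not from the paper) showing that the byte sequence above is simply the UTF-8 encoding of 가나다라마바사, and that a merge such as "EB 82" can indeed stop in the middle of a character:

text = "가나다라마바사"
byte_tokens = [f"{b:02X}" for b in text.encode("utf-8")]
print(" ".join(byte_tokens))
# EA B0 80 EB 82 98 EB 8B A4 EB 9D BC EB A7 88 EB B0 94 EC 82 AC

# A learned n-gram like "EB 82" covers only part of 나 (= EB 82 98),
# so a single character can end up split across BBPE tokens.
print(bytes.fromhex("EB82").decode("utf-8", errors="replace"))  # not a complete character on its own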

Slide 11

Slide 11 text

Contextualization 2. Byte-level BPE
• The paper argues that contextualization is needed before the main model
• The token embeddings are passed through a light CNN or GRU before being fed into the main Transformer model
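A minimal PyTorch sketch of this idea. The paper uses a depth-wise convolution or a bi-GRU over the (B)BPE embeddings; the plain Conv1d, the layer sizes, and the class name here are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class ByteContextualizer(nn.Module):
    # Embeds (B)BPE token ids and runs a light convolution over the sequence
    # before handing the representations to the main Transformer.
    def __init__(self, vocab_size=4096, d_model=512, kernel_size=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        x = self.embed(token_ids)              # (batch, seq_len, d_model)
        x = self.conv(x.transpose(1, 2))       # convolve along the sequence dimension
        return x.transpose(1, 2)               # (batch, seq_len, d_model), input to the Transformer

tokens = torch.randint(0, 4096, (2, 20))       # a dummy batch of BBPE token ids
print(ByteContextualizer()(tokens).shape)      # torch.Size([2, 20, 512])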

Slide 12

Slide 12 text

Decoding 2. Byte-level BPE
• Every sentence can be represented as a byte sequence, but some byte sequences cannot be decoded back into a valid sentence
• Ex) Generation, Translation
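A tiny illustration of the problem, using nothing beyond Python's built-in UTF-8 codec:

# A valid sentence always has a valid UTF-8 encoding, but the reverse is not true:
# a generated byte sequence may stop in the middle of a character and fail to decode.
good = "가나다".encode("utf-8")
print(good.decode("utf-8"))          # 가나다

truncated = good[:-1]                # drop the last byte of 다
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid byte sequence:", e)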

Slide 13

Slide 13 text

Decoding 2. Byte-level BPE
• Empirically, trained models rarely emit invalid byte sequences as output
• In the authors' experiments it was almost never observed, and was rare even on a large test set of 165K examples
• Under-trained models, however, sometimes repeat duplicated bytes
• The goal is to recover such error patterns into as many valid Unicode characters as possible, in linear time
• A dynamic-programming-based algorithm is proposed

Slide 14

Slide 14 text

Decoding: algorithm 2. Byte-level BPE
• A byte sequence B_1, …, B_N is given
• Let f(k) be the maximum number of characters recoverable from the prefix B_1, …, B_k
• Let g(i, j) = 1 if B_i, …, B_j form a valid character, and 0 otherwise
• f(k) can then be computed by dynamic programming:
  f(k) = max_{i ≤ k, g(i, k) = 1} f(i − 1) + 1
• Backtracking through f recursively yields the recovered character sequence
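A minimal sketch of this dynamic program, assuming (as the previous slide suggests) that bytes which cannot belong to any valid character are simply dropped; the function names are mine, not the paper's.

def is_valid_char(b: bytes) -> bool:
    """True if the byte slice decodes to exactly one Unicode character."""
    try:
        return len(b.decode("utf-8")) == 1
    except UnicodeDecodeError:
        return False

def recover(byte_seq: bytes) -> str:
    """Recover a character sequence with the maximum number of valid characters."""
    n = len(byte_seq)
    f = [0] * (n + 1)         # f[k]: max characters recoverable from the first k bytes
    start = [None] * (n + 1)  # start[k]: start of the character ending at byte k, or None if byte k is dropped
    for k in range(1, n + 1):
        f[k], start[k] = f[k - 1], None          # default: drop byte k (assumption: invalid bytes are skipped)
        for width in range(1, 5):                # a UTF-8 character is 1-4 bytes long
            i = k - width
            if i >= 0 and is_valid_char(byte_seq[i:k]) and f[i] + 1 > f[k]:
                f[k], start[k] = f[i] + 1, i
    chars, k = [], n
    while k > 0:                                 # backtrack to read off the characters
        if start[k] is None:
            k -= 1
        else:
            chars.append(byte_seq[start[k]:k].decode("utf-8"))
            k = start[k]
    return "".join(reversed(chars))

# A corrupted sequence: 가나 + a stray continuation byte + 다
print(recover("가나".encode("utf-8") + b"\x98" + "다".encode("utf-8")))  # 가나다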

Slide 15

Slide 15 text

3. Experiments Neural Machine Translation with Byte-level Subwords (Wang et al., 2019)

Slide 16

Slide 16 text

Experimental Setting 3. Experiments
• Dataset
  • Bilingual: En-De, Ja-En, Si-En
  • Multilingual: Many-to-English (X-En) → TED Talk Corpus, parallel data covering 59 languages
• BPE & BBPE: learned with SentencePiece over the combined source + target sentences
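A hedged sketch of how such a vocabulary could be trained with the SentencePiece Python API; the file name and vocabulary size are placeholders, and for BBPE the text would first have to be mapped to byte-level symbols, which is not shown here.

import sentencepiece as spm

# Learn a BPE vocabulary on source and target sentences concatenated into one file.
spm.SentencePieceTrainer.train(
    input="train.src-tgt.txt",   # placeholder path
    model_prefix="bpe32k",
    model_type="bpe",
    vocab_size=32000,
)
sp = spm.SentencePieceProcessor(model_file="bpe32k.model")
print(sp.encode("Neural machine translation with byte-level subwords", out_type=str))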

Slide 17

Slide 17 text

Experimental Setting 3. Experiments
• Model and Learning
  • Uses the Transformer
  • Largely follows the settings of Vaswani et al. (2017)
• Inference and Evaluation
  • Beam size: 4 for En-De, 5 for the rest
  • "We calculate case-sensitive tokenized BLEU (Papineni et al. 2002) as the metrics using sacreBLEU (Post 2018)."

Slide 18

Slide 18 text

Results: Qualitative Comparison: BPE vs. BBPE 3. Experiments
• Symbol Frequency Distribution
BBPE is distributed much more evenly: there is almost no long tail, and such rare words are represented as subwords

Slide 19

Slide 19 text

Results: Qualitative Comparison: BPE vs. BBPE 3. Experiments
• Ratio of BBPE tokens with partial characters
For Japanese and the multilingual setting, the ratio of partial characters is substantial. Character set sizes: Japanese (8K), multilingual (11K)

Slide 20

Slide 20 text

Results: Qualitative Comparison: BPE vs. BBPE 3. Experiments

Slide 21

Slide 21 text

Results: Qualitative Comparison: BPE vs. BBPE 3. Experiments
• Cross-lingual Sharing
  • How much does each language's symbol set overlap with the X-En symbols?
  • Tested on Ar, He, Ru, Ko, and It
  • Overall, BBPE shares far more symbols
  • On the model side, this benefits parameter sharing
  • On the vocab side, this benefits universal modeling

Slide 22

Slide 22 text

Results: Qualitative Comparison: BPE vs. BBPE 3. Experiments
• Impact on Sequence Length
Since BBPE works at a finer granularity, you might expect sequences to get much longer, but in practice they are not dramatically longer

Slide 23

Slide 23 text

Results: Importance of Contextualization 3. Experiments
• Three settings compared on X-En
  • none
  • 1-layer CNN
  • 1-layer Bi-GRU
• The finer-grained the vocabulary, the larger the gain from contextualization

Slide 24

Slide 24 text

Results: BBPE on Noisy Character Sets 3. Experiments
• The En-De dataset contains some non-Latin characters
• Because of this, its character set is as large as 3.4K
• Since BPE must include the entire character set, these characters waste many vocabulary slots
• BBPE 2K and 4K achieve results comparable to BPE 32K
• But with a large saving in the number of parameters

Slide 25

Slide 25 text

Results: BBPE on Character-Rich Languages 3. Experiments
• Chinese and Japanese have character sets of over 50K
• The Ja-En dataset has a character set of 8K in total, and the top 2.4K characters cover 99% of the text
• With this in mind, the BBPE vocabulary size is set to 4K
• It shows performance comparable to BPE

Slide 26

Slide 26 text

Results: BBPE on Many-to-En Translation 3. Experiments
BBPE does well; Char/Byte does even better (?)

Slide 27

Slide 27 text

Results: BBPE on Many-to-En Translation 3. Experiments
• Impact on Sequence Length
The large length mismatch between source and target seems to make attention harder; perhaps that is why (B)BPE performance drops

Slide 28

Slide 28 text

Results: BBPE on Many-to-En Translation 3. Experiments
Nevertheless, overall BBPE seems to strike a good balance between performance and speed

Slide 29

Slide 29 text

Results: Transfer Learning on Unseen Characters 3. Experiments
• Since BBPE covers every UTF-8 byte, there can be no OOV problem
• Transfer is therefore possible between two languages whose character sets do not overlap at all
• Fine-tuning a model pre-trained on X-En for Si-En shows that the transfer works well

Slide 30

Slide 30 text

4. Conclusion Neural Machine Translation with Byte-level Subwords (Wang et al., 2019)

Slide 31

Slide 31 text

Contributions 4. Conclusion
• Proposes BBPE, which builds a byte-level subword vocabulary
• Compared to character-based methods, the vocabulary can be made much smaller while maintaining performance
  • In multilingual settings it is sometimes even better
• There is no OOV problem at all
• Transfer to other languages is also possible, which is very generic and brings gains in performance and training acceleration
• Sequence lengths are also shorter than with character-based methods, enabling faster training and inference

Slide 32

Slide 32 text

Future Work 4. Conclusion
• Address the performance drop when the source and target lengths differ greatly
• Also evaluate in one-to-many and many-to-many settings

Slide 33

Slide 33 text

Thank you! ✌ If you have any further questions or anything you are curious about, feel free to reach out anytime via the contacts below! Joohong Lee (ML Research Scientist, Pingpong) Email. [email protected] Facebook. @roomylee LinkedIn. @roomylee