
Teaching English Speaking Language Models to Speak Taiwanese Mandarin

An attempt to re-train English language models to understand and generate fluent Taiwanese Mandarin (Traditional Chinese).

https://github.com/zetavg/twlm

Pokai Chang

June 16, 2023

Transcript

  1. Teaching English Speaking Language Models to Speak Taiwanese Mandarin. Experiments on transfer learning of English-trained language models to understand and generate Taiwanese Mandarin (Traditional Chinese). Pokai Chang, 2023/06
  2. I’m not a machine learning professional. Everything in this talk should be treated as non-scientific hobbyist content.
  3. Why train a new language model? Open source language models are emerging rapidly. However, most open source language models can’t understand and generate Traditional Chinese well.
  4.–9. Most open source language models can’t understand and generate Traditional Chinese well. (Screenshots from https://chat.lmsys.org/, annotated: starts looping; invalid grammar; broken.)
  10.–18. Why train a new language model? Most open source models can’t generate Traditional Chinese well, and if one can, it’s very likely to be “polluted” by Simplified Chinese. (Screenshots from https://chat.lmsys.org/ and of the model ikala/bloom-zh-3b-chat, annotated: prompt in Traditional Chinese, reply in Simplified Chinese.)
  19.–22. Why train a new language model? This “pollution” makes sense, since Simplified Chinese data is more than 2× larger than Traditional Chinese in most of the corpora used to train those language models. For example, in the BigScience ROOTS Corpus (used to train the BLOOM language model), Simplified Chinese is 342× larger than Traditional Chinese by size in bytes (arXiv:2303.03915). (Chart: shares of Simplified vs. Traditional Chinese in the corpus.)
  23.–24. Why train a new language model? LLMs are known for their ability to do transfer learning. What if we can find a way to let an English language model learn Traditional Chinese? We can take advantage of the emerging English language models and turn them into Traditional Chinese language models! (Reproducible process: LLaMA, Pythia, MPT, … → zh-tw LLaMA, zh-tw Pythia, zh-tw MPT, …)
  25. The Outcome: TW-Pythia-6.9B-Chat, a Traditional Chinese & English bilingual model based on pythia-6.9b. In the video, the model claims that it was trained by OpenAI. That’s not true: the model shown was trained based on pythia-6.9b. It says so because it was trained with ShareGPT data, which includes conversations in which the AI introduces itself as ChatGPT, trained by OpenAI.
  26. Acceptable English ↔ Traditional Chinese translation capabilities of TW-Pythia-6.9B-Chat. Sometimes it might outperform Google Translate. (Annotated comparison: Google Translate vs. the self-trained model; hard to understand vs. better; a Simplified Chinese word vs. the Traditional Chinese word.)
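A minimal sketch of how one might prompt such a bilingual chat model for translation with Hugging Face transformers. The repo id and the prompt template below are assumptions for illustration, not the project’s published ones (see https://github.com/zetavg/twlm for the real artifacts):

```python
# Sketch: prompting the bilingual chat model for translation.
# The repo id and prompt format are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zetavg/tw-pythia-6.9b-chat"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = (
    "Human: 請將下面的句子翻譯成繁體中文:\n"
    "Language models are few-shot learners.\n"
    "Assistant:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated continuation.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```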
  27.–29. The Training Process. Why base it on Pythia? • Deliberately choose a model with poor Chinese: this mitigates the impact of Simplified Chinese and makes the learning results stand out. • Pythia has various versions with different sizes (70M ~ 12B), so we can do experiments on small models and scale them up to larger ones.
  30. The Training Process: Three Major Steps. 1. Expand the vocabulary: add 8k new Traditional Chinese tokens. 2. Train the embedding layers. 3. Instruction tuning. (A sketch of step 1 follows.)
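A minimal sketch of step 1: extend the tokenizer with new Traditional Chinese tokens and resize the model’s embedding matrices to match. The base checkpoint and the tiny token list are placeholders; the actual project adds about 8k mined tokens:

```python
# Sketch of step 1: add new Traditional Chinese tokens and grow the
# embedding layers. Token list here is a placeholder for ~8k real tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

new_tokens = ["你好", "臺灣", "語言", "模型"]  # placeholder token list
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")

# Resize input/output embeddings to the new vocab size. With the full
# ~8k token list this grows the matrix; the new rows start out randomly
# initialized and are what step 2 trains.
model.resize_token_embeddings(len(tokenizer))
```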
  31.–32. The Training Process, step 1: Expand the Vocabulary. Without dedicated Traditional Chinese tokens: • Hard for the model to understand the language. • Resource consuming: more tokens are needed to process the same amount of text. (A tokenization example follows.)
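A quick sketch of the cost argument: with a tokenizer trained mostly on English, each Chinese character falls back to several byte-level BPE tokens, so equivalent text costs far more tokens:

```python
# Sketch: compare token counts for English vs. Traditional Chinese text
# under the original (mostly English) Pythia tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

en = "Hello, how are you today?"
zh = "你好,今天過得如何?"  # roughly the same meaning in Traditional Chinese

print(len(tokenizer(en)["input_ids"]))  # a handful of tokens
print(len(tokenizer(zh)["input_ids"]))  # typically 2-3 tokens per character
```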
  33.–35. The Training Process, step 2: Train the Embedding Layers. • When we learn a new language, we do not need to re-learn basic logic and reasoning abilities. • It might be the same for language models: we might not need to train all the parameters (a sketch of freezing everything but the embeddings follows this list). • How do we know that the model is really learning?
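A minimal sketch of training only the embedding layers, assuming a Pythia (GPT-NeoX) checkpoint: freeze every parameter, then un-freeze the input and output embeddings:

```python
# Sketch of step 2: freeze everything except the embedding layers, so only
# the (new) token embeddings get gradient updates.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

for param in model.parameters():
    param.requires_grad = False
# Un-freeze just the embeddings (embed_in / embed_out in GPT-NeoX).
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True
for param in model.get_output_embeddings().parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable:,} of {total:,} parameters")
```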
  36.–43. How do we know that the model is really learning? Print the model output during training! (Slides step through training printouts, annotated: the model’s prediction vs. the correct answer, plus the token ID of a new word.)
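One hedged way to implement this with the Hugging Face Trainer (not necessarily what the project does) is a callback that periodically generates from a fixed prompt, so progress is visible by eye:

```python
# Sketch: a TrainerCallback that periodically prints the model's output for
# a fixed prompt during training. The prompt is an illustrative placeholder.
from transformers import TrainerCallback

class PrintSampleCallback(TrainerCallback):
    def __init__(self, tokenizer, prompt, every_n_steps=100):
        self.tokenizer = tokenizer
        self.prompt = prompt
        self.every_n_steps = every_n_steps

    def on_step_end(self, args, state, control, model=None, **kwargs):
        if model is None or state.global_step % self.every_n_steps != 0:
            return
        inputs = self.tokenizer(self.prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
        text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        print(f"[step {state.global_step}] {text}")

# Usage: Trainer(..., callbacks=[PrintSampleCallback(tokenizer, "你好,")])
```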
  44. Future Work • A better way to measure how the model is learning, by monitoring the embedding of each new token (see the sketch below). • Train a complete model and see how it performs. • Apply the training to more open source models. • Take the trained models and have some fun! • Use the content of politicians’ Q&A sessions to train a model……
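A minimal sketch of that first idea, under the assumption that one snapshots the embedding matrix before training and later compares per-token cosine similarity; all names here are illustrative:

```python
# Sketch: measure how far each new token's embedding has drifted from its
# random initialization. Similarity far below 1.0 suggests the token is
# actually being learned.
import torch

@torch.no_grad()
def embedding_drift(model, initial_embeddings, new_token_ids):
    current = model.get_input_embeddings().weight[new_token_ids]
    initial = initial_embeddings[new_token_ids]
    return torch.nn.functional.cosine_similarity(current, initial, dim=-1)

# Usage sketch: snapshot once before training ...
# initial = model.get_input_embeddings().weight.detach().clone()
# ... then after some steps:
# print(embedding_drift(model, initial, new_token_ids))
```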