
Teaching English Speaking Language Models to Speak Taiwanese Mandarin

An attempt to re-train English language models to understand and generate fluent Taiwanese Mandarin (Traditional Chinese).

https://github.com/zetavg/twlm

Pokai Chang

June 16, 2023

Transcript

  1. Teaching English Speaking Language Models to Speak Taiwanese Mandarin. Experiments on transfer learning of English-trained language models to understand and generate Taiwanese Mandarin (Traditional Chinese). Pokai Chang, 2023/06
  2. I’m not a machine learning professional. Everything in this talk should be treated as non-scientific hobbyist content.
  3. Why train a new language model? Open source language models are emerging rapidly. However, most open source language models can’t understand and generate Traditional Chinese well.
  4.–9. Most open source language models can’t understand and generate Traditional Chinese well. (Screenshots from https://chat.lmsys.org/, annotated: starts looping; invalid grammar; broken.)
  10.–18. Why train a new language model? Most open source models can’t generate Traditional Chinese well, and if one can, it’s very likely to be “polluted” by Simplified Chinese. (Screenshots from https://chat.lmsys.org/ and of the model ikala/bloom-zh-3b-chat, annotated: prompt in Traditional Chinese, reply in Simplified Chinese.)
  19.–22. Why train a new language model? This “pollution” makes sense, since Simplified Chinese data is more than 2× larger than Traditional Chinese in most of the corpora used to train those language models. For example, in the BigScience ROOTS Corpus (used to train the BLOOM language model), Simplified Chinese is 342× larger than Traditional Chinese by size in bytes (arXiv:2303.03915). (Chart: shares of Simplified vs. Traditional Chinese in the corpus.)
  23.–24. Why train a new language model? LLMs are known for their ability to do transfer learning. What if we can find a way to let an English language model learn Traditional Chinese? We can take advantage of the emerging English language models and turn them into Traditional Chinese language models! (Reproducible process: LLaMA, Pythia, MPT, … → zh-tw LLaMA, zh-tw Pythia, zh-tw MPT, …)
  25. The Outcome: TW-Pythia-6.9B-Chat, a Traditional Chinese & English bilingual model based on pythia-6.9b. In the video, the model claims that it was trained by OpenAI. That’s not true: the model shown was trained based on pythia-6.9b. It says so because it was trained with ShareGPT data, which includes conversations in which the AI introduces itself as ChatGPT, trained by OpenAI.
  26. Acceptable English ↔ Traditional Chinese translation capabilities of TW-Pythia-6.9B-Chat. Sometimes it might outperform Google Translate. (Annotated comparison: Google Translate vs. the self-trained model; hard to understand vs. better; a Simplified Chinese word vs. the Traditional Chinese word.)
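A minimal sketch of how one might prompt such a bilingual chat model for translation with Hugging Face transformers. The repo id and the prompt template below are assumptions for illustration, not the project’s published ones (see https://github.com/zetavg/twlm for the real artifacts):

```python
# Sketch: prompting the bilingual chat model for translation.
# The repo id and prompt format are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zetavg/tw-pythia-6.9b-chat"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = (
    "Human: 請將下面的句子翻譯成繁體中文:\n"
    "Language models are few-shot learners.\n"
    "Assistant:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated continuation.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```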
  27.–29. The Training Process. Why base it on Pythia? • Deliberately choose a model with poor Chinese: this mitigates the impact of Simplified Chinese and makes the learning results stand out. • Pythia has various versions with different sizes (70M ~ 12B), so we can do experiments on small models and scale them up to larger ones.
  30. The Training Process: Three Major Steps. 1. Expand the vocabulary: add 8k new Traditional Chinese tokens. 2. Train the embedding layers. 3. Instruction tuning. (A sketch of step 1 follows.)
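A minimal sketch of step 1: extend the tokenizer with new Traditional Chinese tokens and resize the model’s embedding matrices to match. The base checkpoint and the tiny token list are placeholders; the actual project adds about 8k mined tokens:

```python
# Sketch of step 1: add new Traditional Chinese tokens and grow the
# embedding layers. Token list here is a placeholder for ~8k real tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

new_tokens = ["你好", "臺灣", "語言", "模型"]  # placeholder token list
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")

# Resize input/output embeddings to the new vocab size. With the full
# ~8k token list this grows the matrix; the new rows start out randomly
# initialized and are what step 2 trains.
model.resize_token_embeddings(len(tokenizer))
```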
  31.–32. The Training Process, step 1: Expand the Vocabulary. Without dedicated Traditional Chinese tokens: • Hard for the model to understand the language. • Resource consuming: more tokens are needed to process the same amount of text. (A tokenization example follows.)
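A quick sketch of the cost argument: with a tokenizer trained mostly on English, each Chinese character falls back to several byte-level BPE tokens, so equivalent text costs far more tokens:

```python
# Sketch: compare token counts for English vs. Traditional Chinese text
# under the original (mostly English) Pythia tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

en = "Hello, how are you today?"
zh = "你好,今天過得如何?"  # roughly the same meaning in Traditional Chinese

print(len(tokenizer(en)["input_ids"]))  # a handful of tokens
print(len(tokenizer(zh)["input_ids"]))  # typically 2-3 tokens per character
```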
  33.–35. The Training Process, step 2: Train the Embedding Layers. • When we learn a new language, we do not need to re-learn basic logic and reasoning abilities. • It might be the same for language models: we might not need to train all the parameters (a sketch of freezing everything but the embeddings follows this list). • How do we know that the model is really learning?
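A minimal sketch of training only the embedding layers, assuming a Pythia (GPT-NeoX) checkpoint: freeze every parameter, then un-freeze the input and output embeddings:

```python
# Sketch of step 2: freeze everything except the embedding layers, so only
# the (new) token embeddings get gradient updates.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

for param in model.parameters():
    param.requires_grad = False
# Un-freeze just the embeddings (embed_in / embed_out in GPT-NeoX).
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True
for param in model.get_output_embeddings().parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable:,} of {total:,} parameters")
```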
  36.–43. How do we know that the model is really learning? Print the model output during training! (Slides step through training printouts, annotated: the model’s prediction vs. the correct answer, plus the token ID of a new word.)
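One hedged way to implement this with the Hugging Face Trainer (not necessarily what the project does) is a callback that periodically generates from a fixed prompt, so progress is visible by eye:

```python
# Sketch: a TrainerCallback that periodically prints the model's output for
# a fixed prompt during training. The prompt is an illustrative placeholder.
from transformers import TrainerCallback

class PrintSampleCallback(TrainerCallback):
    def __init__(self, tokenizer, prompt, every_n_steps=100):
        self.tokenizer = tokenizer
        self.prompt = prompt
        self.every_n_steps = every_n_steps

    def on_step_end(self, args, state, control, model=None, **kwargs):
        if model is None or state.global_step % self.every_n_steps != 0:
            return
        inputs = self.tokenizer(self.prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
        text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        print(f"[step {state.global_step}] {text}")

# Usage: Trainer(..., callbacks=[PrintSampleCallback(tokenizer, "你好,")])
```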
  44. Future Work • A better way to measure how the model is learning, by monitoring the embedding of each new token (see the sketch below). • Train a complete model and see how it performs. • Apply the training to more open source models. • Take the trained models and have some fun! • Use the content of politicians’ Q&A sessions to train a model……
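A minimal sketch of that first idea, under the assumption that one snapshots the embedding matrix before training and later compares per-token cosine similarity; all names here are illustrative:

```python
# Sketch: measure how far each new token's embedding has drifted from its
# random initialization. Similarity far below 1.0 suggests the token is
# actually being learned.
import torch

@torch.no_grad()
def embedding_drift(model, initial_embeddings, new_token_ids):
    current = model.get_input_embeddings().weight[new_token_ids]
    initial = initial_embeddings[new_token_ids]
    return torch.nn.functional.cosine_similarity(current, initial, dim=-1)

# Usage sketch: snapshot once before training ...
# initial = model.get_input_embeddings().weight.detach().clone()
# ... then after some steps:
# print(embedding_drift(model, initial, new_token_ids))
```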