“Swallow” is a Large Language Model (LLM) developed by research teams at Tokyo Institute of Technology and the National Institute of Advanced Industrial Science and Technology (AIST). We are working on building better Japanese LLMs by continually pre-training English-centric LLMs on Japanese data. So far, we have successfully enhanced Japanese performance through continual pre-training of models such as Llama 2, Mistral, Mixtral, and Llama 3. The use of a high-quality pre-training corpus has been a crucial factor in this achievement. In this presentation, I will mainly introduce the construction procedure of the “Swallow Corpus”, our proprietary large-scale Japanese web corpus, and how we combine it with other corpora. If time permits, I will also discuss future directions in light of Swallow’s current challenges and recent research trends.
These slides were prepared for a presentation at Tokyo AI Advanced AI (TAI AAI) on 8/7.
Our team has also published a paper, which was accepted at COLM 2024; these slides are largely based on that paper.
https://arxiv.org/abs/2404.17733