The Significance of Training LLMs on Japanese Text

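The 261st Natural Language Processing Research Meeting (第261回自然言語処理研究発表会)
齋藤幸史郎1, 水木栄2,1, 大井聖也1, 中村泰士1, 塩谷泰平1, 前田航希1, Youmi Ma1, 服部翔1, 藤井一喜1, 岡本拓己1, 石田茂樹1, 高村大也2, 横田理央1, 岡崎直観1
1: Tokyo Institute of Technology 2: National Institute of Advanced Industrial Science and Technology (AIST)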

1. Background and Overview

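- Background: developing LLMs that are strong in Japanese
LLMs can be grouped into four types according to how Japanese text enters training:
  1. "English from-scratch" (non-Japanese): the training data contains almost no Japanese (Llama, Mistral, Mixtral, …)
  2. "Japanese from-scratch": the training data contains a large share of Japanese (CyberAgentLM, LLM-jp, Sarashina 2, …)
  3. "Japanese continual pre-training": additional training on Japanese text (Swallow, ELYZA LLM for JP, RakutenAI, …)
  4. "Multilingual from-scratch": the training data is multilingual and includes Japanese (C4AI Command-R, Qwen2, …)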

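- Research purpose and findings
Purpose: compare a diverse set of LLMs to identify what characterizes LLMs that are strong in Japanese, and to examine the effect and significance of training on Japanese text.
Method: evaluate 35 LLMs on 19 Japanese and English tasks, then analyze the results from two angles: the correlation of scores between evaluation tasks, and the language specificity of task performance.
Findings:
  - Possibly learnable from English resources as well: Japanese general knowledge, arithmetic reasoning, and code generation.
  - Training on Japanese resources has a marked effect: question answering about Japanese knowledge and English-to-Japanese machine translation.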

2. Related Work and Novelty of This Study

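- Related work: the effect of training on non-English text
Reports from prior work on multilingual training:
  - According to Wu et al. [1], multilingual training does not necessarily improve performance in each language.
Prior work that measured gains in a specific language:
  - Choi et al. [2] aimed to improve the Korean ability of Llama2.
  - Zhao et al. [3] aimed to improve the Chinese performance of Llama2.
None of these evaluates the effect of training on non-English text comprehensively across diverse models and tasks.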

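- Related work: task-performance correlations and ability factors
Reports from prior work on the ability factors behind task solving [4], [5], [6]:
  - QA tasks that require knowledge correlate highly with one another.
  - There is a factor related to arithmetic reasoning and code generation ability.
Prior work on the source of ability factors:
  - Ruan et al. [4] analyzed models with different parameter counts within the same model family and discussed scaling laws between compute budget and ability factors.
All of these analyses are limited to English alone.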

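- Novelty of this study
By evaluating 35 LLMs on 19 Japanese and English tasks, this study seeks general findings on the significance of training on Japanese text. That is its novelty, and it contributes to the development of Japanese LLMs.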

3. Methodology

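- Model selection: 35 models that differ in construction method and training corpus
  - "English from-scratch": Llama 2, Llama 3, Mistral, Mixtral, Yi-1.5
  - "Japanese continual pre-training": Llama 3 Swallow, Swallow, Swallow-MX, Japanese Stable LM, Rakuten, Youri, ELYZA-Japanese-Llama 2, KARAKURI LM
  - "Japanese from-scratch": CyberAgentLM2, Sarashina2, Fugaku-LLM, LLM-jp
  - "Multilingual from-scratch": C4AI Command-R, Qwen2, Qwen1.5
35 models in total. The expectation is that the resulting findings are general and do not depend on construction method or training corpus.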

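- Model selection: collecting information on how each model was built
For each model in the four groups ("English from-scratch", "Japanese from-scratch", "Japanese continual pre-training", "Multilingual from-scratch"), we surveyed how it was developed, how large it is, and on what corpus it was trained. The full version is given in A. Evaluation Details.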

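- Task selection: 19 Japanese and English tasks for measuring ability comprehensively
The aim is to examine how results differ when similar tasks are solved in Japanese and in English, and to capture as many ability factors as possible. 19 tasks in total:
  - Encyclopedic knowledge and common sense: NIILC[8], TriviaQA[9], OpenBookQA[10], JCommonsenseQA[11], HellaSwag[12]
  - Reading comprehension: JSQuAD[11], SQuAD2[20], XWINO[19]
  - General knowledge: JMMLU[21], MMLU[22]
  - Logical and arithmetic reasoning: JEMHopQA[13], BBH[14], MGSM[15], GSM8K[16]
  - Code generation: JHumanEval[23], HumanEval[24]
  - Summarization and translation: XL-Sum[17], WMT20 (en-ja)[18], WMT20 (ja-en)[18]
(Figure legend: blue = Japanese task, black = English task, yellow = corresponding pair, green = parallel-translation pair. Task examples are given in A. Evaluation Details.)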

- Aggregating the results: evaluation results compiled per task and per model
The 35 models were evaluated on the 19 tasks under fair conditions.
  - CSV: Swallow LLM. (2024). Evaluation Results of English / Japanese LLMs Using Swallow-Evaluation ver.202407 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13219138
  - Interactive table: Swallow LLM. 日本語LLM評価 (Japanese LLM Evaluation). https://swallow-llm.github.io/evaluation/index.ja.html

4. Analysis of Evaluation Results: The Effect of Training on Japanese

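- Correlation of scores between tasks: task groups that correlate highly across languages
  - General knowledge (Japanese: JMMLU, English: MMLU): 0.91
  - Code generation (Japanese: JHumanEval, English: HumanEval): 0.98
  - Arithmetic reasoning (Japanese: MGSM, English: GSM8K): 0.94
For these tasks, which are parallel translations of each other, training on English text alone may also improve performance in Japanese.
(Figure legend: blue = Japanese task, black = English task.)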
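The cross-lingual correlations above can be illustrated with a minimal sketch. It assumes the reported values are Pearson correlations computed over per-model scores, which the slide does not state explicitly; the score values below are hypothetical placeholders, not the published results.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(scores):
    """Task-by-task correlation table from {task: [score per model]}."""
    tasks = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in tasks} for a in tasks}

# Hypothetical scores, one entry per model (NOT the published results).
scores = {
    "JMMLU": [0.30, 0.42, 0.55, 0.61, 0.70],
    "MMLU":  [0.35, 0.47, 0.58, 0.66, 0.78],
    "NIILC": [0.60, 0.25, 0.55, 0.30, 0.50],
}

matrix = correlation_matrix(scores)
for task, row in matrix.items():
    print(task, {other: round(value, 2) for other, value in row.items()})
```

In this toy data the JMMLU/MMLU pair moves together across models while NIILC does not, which is the kind of pattern the slide reads off the real correlation matrix.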

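- Correlation of scores between tasks: task groups with low correlation
Japanese question answering (JEMHopQA, NIILC) and English-to-Japanese machine translation (WMT20-en-ja) correlate relatively weakly with the scores of other tasks. Scores on these tasks change for reasons different from those driving the other tasks.
(Figure legend: blue = Japanese task, black = English task.)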

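- Principal component analysis: three distinct ability factors
  - First principal component (r=65.2): positive across all tasks → basic ability
  - Second principal component (r=15.4): strongly positive on Japanese question answering and English-to-Japanese machine translation → Japanese ability
  - Third principal component (r=7.0): arithmetic reasoning and code generation ability
The three abilities show different tendencies.
(Figure legend: blue = Japanese task, black = English task.)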
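As a rough illustration of the principal component analysis, the sketch below extracts the leading components of a task-by-task correlation matrix with power iteration and deflation. It is not the authors' implementation: the choice of the correlation matrix, the pure-Python solver, and every score value are assumptions made for the example.

```python
import math

def standardize(column):
    """Z-score one task column (mean 0, population standard deviation 1)."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

def correlation_matrix(columns):
    """Task-by-task Pearson correlation matrix from per-task score columns."""
    z = [standardize(col) for col in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def dominant_eigenpair(matrix, iterations=1000):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    size = len(matrix)
    # Non-uniform start vector, so it is unlikely to be orthogonal to the target.
    vec = [float(i + 1) for i in range(size)]
    for _ in range(iterations):
        product = [sum(row[j] * vec[j] for j in range(size)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in product))
        vec = [v / norm for v in product]
    product = [sum(row[j] * vec[j] for j in range(size)) for row in matrix]
    eigenvalue = sum(v * p for v, p in zip(vec, product))
    return eigenvalue, vec

def principal_components(columns, k):
    """Top-k components as (explained variance ratio, loadings) pairs."""
    matrix = correlation_matrix(columns)
    size = len(matrix)
    total = sum(matrix[i][i] for i in range(size))  # trace = number of tasks
    components = []
    for _ in range(k):
        value, vec = dominant_eigenpair(matrix)
        components.append((value / total, vec))
        # Deflation: subtract the component just found before the next search.
        matrix = [[matrix[i][j] - value * vec[i] * vec[j] for j in range(size)]
                  for i in range(size)]
    return components

# Hypothetical scores for six models (NOT the published results).
task_names = ["MMLU", "JMMLU", "NIILC", "WMT20-en-ja"]
task_scores = [
    [0.40, 0.50, 0.60, 0.70, 0.45, 0.65],
    [0.35, 0.46, 0.55, 0.66, 0.44, 0.60],
    [0.20, 0.25, 0.30, 0.40, 0.55, 0.60],
    [0.15, 0.18, 0.22, 0.30, 0.28, 0.33],
]

for ratio, loadings in principal_components(task_scores, k=2):
    print(round(ratio, 3), dict(zip(task_names, (round(v, 2) for v in loadings))))
```

When all tasks correlate positively, the first component loads on every task with the same sign, matching the "basic ability" reading; later components contrast groups of tasks, which is how a separate "Japanese ability" axis can emerge. The sign of any component after the first is arbitrary.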

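- Compute budget: does training on Japanese text correlate with the second principal component?
  - Log of the Japanese compute budget [7] = log(number of parameters × number of Japanese training tokens)
  - Second principal component, interpreted as Japanese ability
A moderate positive correlation (r = 0.768) was found between the two. The second principal component scales with the Japanese compute budget.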
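The budget definition above can be sketched in a few lines. This is a minimal example under stated assumptions: the natural logarithm is assumed (the slide only says "log"), the model names, parameter counts, token counts, and component scores are all hypothetical, and models with zero Japanese tokens are rejected because the logarithm is undefined for them.

```python
import math

def log_japanese_budget(num_parameters, japanese_tokens):
    """log(parameters x Japanese training tokens); natural log is an assumption."""
    if num_parameters <= 0 or japanese_tokens <= 0:
        raise ValueError("both quantities must be positive to take the logarithm")
    return math.log(num_parameters * japanese_tokens)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical models: (parameter count, Japanese training tokens, PC2 score).
models = {
    "model-a": (7e9, 2e9, -1.2),
    "model-b": (7e9, 1e11, 0.4),
    "model-c": (13e9, 3e11, 0.9),
    "model-d": (70e9, 1e11, 1.1),
    "model-e": (8e9, 5e9, -0.8),
}

budgets = [log_japanese_budget(n, d) for n, d, _ in models.values()]
pc2_scores = [score for _, _, score in models.values()]
r = pearson(budgets, pc2_scores)
print(f"r = {r:.3f}")
```

The choice of logarithm base does not affect the result: changing the base rescales every budget by the same positive constant, and the Pearson correlation is invariant to such rescaling.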

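- Re-verification with a different model set: robustness of the findings to the choice of evaluated models
As one example of a different model set, the same experiment was run with the continual pre-training models excluded. The results were similar (r = 0.803), so the findings are robust to the choice of models.
(Figure legend: blue = Japanese task, black = English task.)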

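- A factor that blends into the common factors: maximum likelihood method × promax rotation
The other factors are labeled English ability, arithmetic and code (≈ third principal component), and Japanese ability (≈ second principal component).
Fourth factor: strongly positive on JCom. (encyclopedic knowledge and common sense) and JSQuAD (reading comprehension).
This factor appears separately from the third factor, and its correlation with the log compute budget is higher for the English budget than for the Japanese one (r_ja = 0.241, r_en = 0.788). These tasks therefore correlate weakly with training on Japanese text.
(Figure legend: blue = Japanese task, black = English task.)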

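- Summary
Purpose: compare a diverse set of LLMs to identify what characterizes LLMs that are strong in Japanese, and to examine the effect and significance of training on Japanese text.
Method: evaluate 35 LLMs on 19 Japanese and English tasks, then analyze the results from two angles: the correlation of scores between evaluation tasks, and the language specificity of task performance.
Findings:
  - Possibly learnable from English resources as well: Japanese general knowledge, arithmetic reasoning, and code generation.
  - Training on Japanese resources has a marked effect: question answering about Japanese knowledge and English-to-Japanese machine translation.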

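- Future work
① Broaden the coverage of the analyzed models and tasks
  ⚫ Are general knowledge, arithmetic reasoning, and code generation highly correlated between Japanese and English because the tasks are parallel translations?
  ⚫ Do the same findings hold when models with highly distinctive training data, such as the Phi series, are included?
② Explore concrete methodologies that contribute to building Japanese LLMs
  ⚫ With what kinds of corpora can each ability factor identified in this study be strengthened?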

Appendix

Index: A. Evaluation Details, B. Detailed Analysis of Results, C. References, D. Corrections

A. Evaluation Details

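- Details of the evaluated models
Note: all are pre-trained (base) models with no instruction tuning or alignment applied.
(Tables on the original slide, one per group: "English from-scratch", "Japanese from-scratch", "Japanese continual pre-training", "Multilingual from-scratch".)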

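- Details of the evaluation tasks
(Figure on the original slide: Japanese and English tasks, with arrows connecting tasks that correspond between the two languages.)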

- Task examples: encyclopedic knowledge and common sense
NIILC[8]
Question: 慶応大学を作った人は？ (Who founded Keio University?)
Answer: 福沢諭吉 (Fukuzawa Yukichi)
TriviaQA[9]
Question: Miami Beach in Florida borders which ocean?
Answer: Atlantic
OpenBookQA[10]
Question: The sun is responsible for
  A. puppies learning new tricks
  B. children growing up and getting old
  C. flowers wilting in a vase
  D. plants sprouting, blooming and wilting
Answer: D

- Task examples: encyclopedic knowledge and common sense (continued)
JCommonsenseQA[11]
Question: タバコを吸う事を何と言う？ (What is the act of smoking tobacco called?)
Choices: 0. 電話 (telephone), 1. 食事 (meal), 2. 喫煙 (smoking), 3. 仏 (Buddha), 4. 飲酒 (drinking alcohol)
Answer: 2
HellaSwag[12]
Question: We see a fitness center sign. We then see a man talking to the camera and sitting and laying on an exercise ball. The man
  a) demonstrates how to increase efficient exercise work by running up and down balls.
  b) moves all his arms and legs and builds up a lot of muscle.
  c) then plays the ball and we see a graphics and hedge trimming demonstration.
  d) performs sits ups while on the ball and talking.
Answer: d

- Task examples: logical and arithmetic reasoning
JEMHopQA[13]
Question: 島津斉興が藩主となった藩の藩庁があった城は？ (Which castle housed the administrative seat of the domain of which Shimazu Narioki became lord?)
Answer: 鹿児島城 (Kagoshima Castle)
BBH[14]
Question: The following paragraphs each describe a set of three objects arranged in a fixed order. The statements are logically consistent within each paragraph. On a branch, there are three birds: a blue jay, a quail, and a falcon. The falcon is to the right of the blue jay. The blue jay is to the right of the quail.
Options:
  (A) The blue jay is the second from the left
  (B) The quail is the second from the left
  (C) The falcon is the second from the left
Answer: (A)

- Task examples: logical and arithmetic reasoning (continued)
MGSM[15]
Question: ローブを作成するには、青色の繊維を2巻分、白色の繊維をその半分用いる必要があります。全体で何巻必要ですか？
Answer: 3
GSM8K[16]
Question: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?
Answer: 3

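- Task examples: summarization and translation
XL-Sum[17] ([…] marks an omission)
Question: ベッチウ枢機卿（左）は、[…]権を放棄している。（英語記事 Vatican cardinal resigns unexpectedly）
Answer: キリスト教カトリック教会のローマ教皇庁（ヴァチカン）は、列聖省長官を務めるジョヴァンニ・アンジェロ・ベッチウ枢機卿が突然、辞任したと発表した。 (The Holy See (Vatican) of the Roman Catholic Church announced that Cardinal Giovanni Angelo Becciu, prefect of the Congregation for the Causes of Saints, has suddenly resigned.)
WMT20-en-ja[18]
Question: Why the Queen made an error of judgement on proroguing Parliament - Joyce McMillan
Answer: なぜ女王は議会停止という判断ミスを犯したのか - ジョイス・マクミラン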

- Task examples: summarization and translation (continued), reading comprehension
WMT20-ja-en[18]
Question: スタジオ騒然のしくじりエピソードを披露する。
Answer: During the episode she will be introducing various episodes in which her gaffes caused an uproar in the studio.
XWINO[19] (reading comprehension)
Question: "The city councilmen refused the demonstrators a permit because they feared violence." "they" refers
  - "The city councilmen"
  - "the demonstrators"
Answer: "The city councilmen"

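- Task examples: reading comprehension
([…] marks an omission; on the original slide, red text marks the span to extract.)
JSQuAD[11]
Question: 梅雨（つゆ、ばいう）は、[…] 東アジアの広範囲においてみられる特有の気象現象で、[…] 。雨季の一種である。梅雨は、世界的にどのあたりで見られる気象ですか？ (Tsuyu, or baiu, is […] a characteristic weather phenomenon seen across a wide area of East Asia […]. It is a kind of rainy season. In which part of the world is tsuyu observed?)
Answer: 東アジアの広範囲 (a wide area of East Asia)
SQuAD2.0[20]
Question: The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. […] In what country is Normandy located?
Answer: France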

- Task examples: general knowledge
JMMLU[21]
Question: 次のうち、体内で尿を集める構造を最もよく表しているものはどれか？
  A. 膀胱
  B. 腎臓
  C. 尿管
  D. 尿道
Answer: A
MMLU[22]
Question: Which of the following best describes the structure that collects urine in the body?
  A. Bladder
  B. Kidney
  C. Ureter
  D. Urethra
Answer: A

- Task examples: code generation
JHumanEval[23]
Question:
    def greatest_common_divisor(a: int, b: int) -> int:
        """整数 a と b の最大公約数を返す
        >>> greatest_common_divisor(3, 5)
        1
        >>> greatest_common_divisor(25, 15)
        5
        """
Answer:
        while b:
            a, b = b, a % b
        return a
HumanEval[24]
Question:
    def greatest_common_divisor(a: int, b: int) -> int:
        """Return a greatest common divisor of two integers a and b
        >>> greatest_common_divisor(3, 5)
        1
        >>> greatest_common_divisor(25, 15)
        5
        """
Answer:
        while b:
            a, b = b, a % b
        return a

- Details of the evaluation results
  - Available as tabular data (.csv): Swallow LLM. (2024). Evaluation Results of English / Japanese LLMs Using Swallow-Evaluation ver.202407 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13219138
  - Available as an interactive table for browsing and analysis: Swallow LLM. 日本語LLM評価 (Japanese LLM Evaluation). https://swallow-llm.github.io/evaluation/index.ja.html

B. Detailed Analysis of Results

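- Evaluation results
① Models pre-trained on language data centered on Japanese show relatively low performance.
② Models continually pre-trained on Japanese language data score well on most Japanese tasks, especially NIILC (question answering about Japanese knowledge) and WMT20-en-ja (English-to-Japanese machine translation).
(Figure legend: blue = Japanese task, black = English task.)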

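- Evaluation results: scatter plots of scores
  - Example of a strong positive correlation (vertical: HumanEval / horizontal: JHumanEval)
  - Example of a negative correlation (vertical: HumanEval / horizontal: NIILC)
The (correlation) matrices built from these relationships were used in the analysis.
Source: Swallow LLM. 日本語LLM評価 (Japanese LLM Evaluation). https://swallow-llm.github.io/evaluation/index.ja.html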

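- Factor scores of each model
  - The first principal component, interpreted as basic ability, appears to correlate positively with parameter count.
  - On the second principal component, interpreted as Japanese ability, the Swallow series, which was continually pre-trained on Japanese, scores higher than the original Llama series.
  - On the third principal component, interpreted as arithmetic reasoning and code generation ability, Qwen scores high.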

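- On the fourth principal component
The fourth principal component takes positive values on English tasks. However, CyberAgentLM2-7B, a Japanese LLM, shows a larger factor score than Llama-3-70B and Qwen2-72B, which are strong in English, so the interpretation is not consistent.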

C. References

- List of references
[1] Berend, G.: Combating the Curse of Multilinguality in Cross-Lingual WSD by Aligning Sparse Contextualized Word Representations, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, pp. 2459–2471 (online), DOI: 10.18653/v1/2022.naacl-main.176 (2022).
[2] Choi, C., Jeong, Y., Park, S., Won, I., Lim, H., Kim, S., Kang, Y., Yoon, C., Park, J., Lee, Y., Lee, H., Hahm, Y., Kim, H. and Lim, K.: Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, pp. 12514–12526 (2024).
[3] Zhao, J., Zhang, Z., Gao, L., Zhang, Q., Gui, T. and Huang, X.: LLaMA Beyond English: An Empirical Study on Language Capability Transfer, arXiv:2401.01055 (2024).
[4] Ruan, Y., Maddison, C. J. and Hashimoto, T.: Observational Scaling Laws and the Predictability of Language Model Performance, arXiv:2405.10938 (2024).
[5] Ni, J., Xue, F., Yue, X., Deng, Y., Shah, M., Jain, K., Neubig, G. and You, Y.: MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures, arXiv:2406.06565 (2024).

[6] Tiong, A., Zhao, J., Li, B., Li, J., Hoi, S. and Xiong, C.: What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, pp. 3427–3454 (online), DOI: 10.18653/v1/2024.naacl-long.188 (2024).
[7] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O. and Sifre, L.: Training Compute-Optimal Large Language Models, arXiv:2203.15556 (2022).
[8] 関根聡：百科事典を対象とした質問応答システムの開発，言語処理学会第 9 回年次大会，pp. 637–640 (2003).

[9] Joshi, M., Choi, E., Weld, D. and Zettlemoyer, L.: TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 1601–1611 (online), DOI: 10.18653/v1/P17-1147 (2017).
[10] Mihaylov, T., Clark, P., Khot, T. and Sabharwal, A.: Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 2381–2391 (online), DOI: 10.18653/v1/D18-1260 (2018).
[11] Kurihara, K., Kawahara, D. and Shibata, T.: JGLUE: Japanese General Language Understanding Evaluation, Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, pp. 2957–2966 (2022).
[12] Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A. and Choi, Y.: HellaSwag: Can a Machine Really Finish Your Sentence?, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 4791–4800 (online), DOI: 10.18653/v1/P19-1472 (2019).

[13] Ishii, A., Inoue, N., Suzuki, H. and Sekine, S.: JEMHopQA: Dataset for Japanese Explainable Multi-Hop Question Answering, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, pp. 9515–9525 (2024).
[14] Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q., Chi, E., Zhou, D. and Wei, J.: Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, Findings of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 13003–13051 (online), DOI: 10.18653/v1/2023.findings-acl.824 (2023).
[15] Shi, F., Suzgun, M., Freitag, M., Wang, X., Srivats, S., Vosoughi, S., Chung, H. W., Tay, Y., Ruder, S., Zhou, D., Das, D. and Wei, J.: Language models are multilingual chain-of-thought reasoners, The Eleventh International Conference on Learning Representations (2023).
[16] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C. and Schulman, J.: Training Verifiers to Solve Math Word Problems, arXiv:2110.14168 (2021).

[17] Hasan, T., Bhattacharjee, A., Islam, M. S., Mubasshir, K., Li, Y.-F., Kang, Y.-B., Rahman, M. S. and Shahriyar, R.: XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages, Findings of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 4693–4703 (online), DOI: 10.18653/v1/2021.findings-acl.413 (2021).
[18] Barrault, L., Biesialska, M., Bojar, O., Costa-jussà, M. R., Federmann, C., Graham, Y., Grundkiewicz, R., Haddow, B., Huck, M., Joanis, E., Kocmi, T., Koehn, P., Lo, C.-k., Ljubešić, N., Monz, C., Morishita, M., Nagata, M., Nakazawa, T., Pal, S., Post, M. and Zampieri, M.: Findings of the 2020 Conference on Machine Translation, Proceedings of the Fifth Conference on Machine Translation, Association for Computational Linguistics, pp. 1–55 (2020).
[19] Tikhonov, A. and Ryabinin, M.: It’s All in the Heads: Using Attention Heads as a Baseline for Cross-Lingual Transfer in Commonsense Reasoning, Findings of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 3534–3546 (online), DOI: 10.18653/v1/2021.findings-acl.310 (2021).

[20] Rajpurkar, P., Jia, R. and Liang, P.: Know What You Don’t Know: Unanswerable Questions for SQuAD, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 784–789 (online), DOI: 10.18653/v1/P18-2124 (2018).
[21] 尹子旗，王昊，堀尾海斗，河原大輔，関根聡：プロンプトの丁寧さと大規模言語モデルの性能の関係検証，言語処理学会第 30 回年次大会発表論文集 (2024).
[22] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J.: Measuring Massive Multitask Language Understanding, International Conference on Learning Representations (2021).
[23] 佐藤美唯，高野志歩，梶浦照乃，倉光君郎：LLM は日本語追加学習により言語間知識転移を起こすのか？，言語処理学会第 30 回年次大会発表論文集 (2024).

[24] Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I. and Zaremba, W.: Evaluating Large Language Models Trained on Code, arXiv:2107.03374 (2021).

D. Corrections

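- Corrections to the proceedings draft (as of 24/08/30)
  - The figure showing the re-verification results with continual pre-training excluded (from-scratch models only) has been replaced with a corrected version.
  - The wording in "4.1.1 Performance differences by construction method" has been corrected. The slide shows the passage in two versions, in this order:
    - […] the scores for arithmetic (MGSM, GSM8K) and code generation (JHumanEval, HumanEval) […] the Qwen-family LLMs, which account for 4 of the 5 member models, […]
    - […] the scores for arithmetic (MGSM, GSM8K) and code generation (JHumanEval, HumanEval) […] the Qwen-family LLMs, which account for 3 of the 4 member models, […]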
