LLM-jp-3 and beyond: Training Large Language Models

Slides for an NII course on 2025-10-29.


Yusuke Oda

October 29, 2025
Transcript

  1. Contents
     • Basics of LLMs
     • Introducing LLM-jp
     • Corpus construction (digest)
     • Model training (pretraining)
     • Release policy
  2. Definition of language models
     • Language model = probability distribution over texts
       ◦ Desirable text (e.g., human-readable text) gets a high value
       ◦ Undesirable text (e.g., random text) gets a low value
     • A text = an array of tokens
       ◦ (Figure: the text "<s> Hello world </s>" as a token sequence: 1st token <s>, 2nd token Hello, 3rd token world, last token </s>)
     • What is the relationship between LMs and generative AI?
  3. Next token prediction – Basis of Generative AI (1)
     • Decomposing the LM distribution using the "chain rule" of conditional probability:
       predict the 1st word, then the 2nd word given the 1st, then the 3rd word given the 1st/2nd, and so on
     • Each factor is a next-token prediction model (autoregressive model): the probability of the token to append given the history of past tokens
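Written out, the chain-rule factorization the slide refers to is the standard one (the notation below is assumed here, not taken from the slide):

```latex
% A text w_1, ..., w_T (with w_1 = <s> and w_T = </s>) factorizes as
P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```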
  4. Next token prediction – Basis of Generative AI (2)
     • The next-token prediction model derives a basic generation algorithm (a runnable version appears below):
       history = ["<s>"]
       while history[-1] != "</s>": history.append(sample w from P)
       return history
       ◦ (Figure: "<s>" → sample → "<s> Hello" → sample → "<s> Hello world" → sample → "<s> Hello world </s>")
     • Very simple, but a recent line of research revealed that if the LLM is intelligent enough, next-token prediction can solve tasks described in natural language.
     • Anyway, how do we construct the next-token distribution P?
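A minimal runnable sketch of the generation loop above, assuming a toy hand-written next-token table `P` in place of a trained model (the table and its probabilities are illustrative only):

```python
import random

# Toy next-token distribution P(next | last token); a real LLM conditions on
# the full history and is a neural network, not a lookup table.
P = {
    "<s>":   {"Hello": 0.9, "</s>": 0.1},
    "Hello": {"world": 0.8, "Hello": 0.1, "</s>": 0.1},
    "world": {"</s>": 0.9, "world": 0.1},
}

def generate(max_len=20):
    history = ["<s>"]
    while history[-1] != "</s>" and len(history) < max_len:
        dist = P[history[-1]]                 # conditional distribution
        tokens, probs = zip(*dist.items())
        history.append(random.choices(tokens, weights=probs, k=1)[0])  # sample w from P
    return history

print(generate())  # e.g. ['<s>', 'Hello', 'world', '</s>']
```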
  5. Modeling
     • Historical methods – count-based
       ◦ E.g., n-gram models (a minimal example appears after this slide)
     • Neural network methods (2001~)
       ◦ Feed-forward NN (2001)
       ◦ Recurrent NN (2010)
       ◦ Transformer (2017)
         ▪ The majority of current LLM architectures
         ▪ Handles histories of any length (in theory)
     • (Figure taken from https://arxiv.org/abs/1706.03762; only the right part (decoder) of the Transformer is used to construct LMs)
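To make the count-based idea concrete, a minimal bigram model estimated from raw counts (the tiny corpus and the plain maximum-likelihood estimator are illustrative assumptions, not LLM-jp's setup):

```python
from collections import Counter, defaultdict

corpus = [["<s>", "Hello", "world", "</s>"],
          ["<s>", "Hello", "there", "</s>"]]

# Count bigrams: how often each token follows each context token.
bigram = defaultdict(Counter)
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram[prev][cur] += 1

def p_next(prev, cur):
    """Maximum-likelihood estimate P(cur | prev) = count(prev, cur) / count(prev, *)."""
    total = sum(bigram[prev].values())
    return bigram[prev][cur] / total if total else 0.0

print(p_next("<s>", "Hello"))    # 1.0
print(p_next("Hello", "world"))  # 0.5
```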
  6. Evolution of GPT models (~2023)
     • 2018 GPT (117M params)
     • 2019 GPT-2 (1.5B params)
     • 2020 GPT-3 (175B params)
     • 2022 GPT-3.5 / InstructGPT
     • 2022 ChatGPT (application)
       ◦ Huge social impact
     • 2023 GPT-4 (2T? params)
       ◦ High performance on national exams: US legal bar exam, USMLE (medical), SAT
     • Exponential increase of #params (more layers, hidden units, attention heads)
     • Large models + next-token prediction = handling complex tasks
  7. Impact of ChatGPT
     • Boosting research (figure taken from https://arxiv.org/abs/2307.06435; log-scale axis)
     • Boosting development of their own LLMs (figure taken from https://link.springer.com/article/10.1007/s11044-023-09962-0)
  8. Introducing LLM-jp
     LLM-jp (LLM勉強会, "LLM study group")
     • Construct open and Japanese-oriented LLMs
     • Reveal the operating principles of LLMs
     • Release models, data, tools, and related techniques and materials, including not only successful work but also progress and/or failures
     • Anyone who agrees with the above policy can participate
     Timeline:
     • 2023-05: First meetup (started with 30 NLP researchers; now ~2000 members)
     • 2023-10: First 13B model
     • 2024-04: NII R&D Center for LLMs (LLMC, dedicated lab)
     • 2024-09: LLM-jp-3 ~13B (7 models)
     • 2024-12: LLM-jp-3 172B
     • 2025-03: LLM-jp-3 MoE 8x1.8B, 8x13B
     • 2025-05: LLM-jp-3.1 (3 models)
     • Now: Training v4 models
  9. Why develop LLMs in Japan (1)
     • Enhancing local knowledge
       ◦ Flagship LLMs are not trained well on local knowledge/commonsense
         ▪ They work anyway, with limited internal knowledge and RAG
       ◦ Necessity to train LLMs (pretraining and/or tuning) with local knowledge: the Japanese language, Japanese culture, geolocational information in/around Japan
  10. Why develop LLMs in Japan (2)
     • Reducing geopolitical risks around LLMs = technical self-sufficiency
       ◦ Secure the technical capabilities, engineers, and resources to develop LLMs independently within one's own country
     • (Figure: model dependency vs. self-sufficiency)
  11. Challenges in developing LLMs (1)
     • Data
       ◦ A huge amount of text data is required to train LLMs
         ▪ Trillions of tokens should be prepared
           • LLaMA 2: 2T tokens
           • LLaMA 3: 15T tokens
           • Qwen3: 36T tokens
       ◦ Collecting data is challenging, especially for non-English languages (even Japanese)
         ▪ Most open web data (Common Crawl) is written in English
         ▪ Only ~1T tokens of open data are available in Japanese
  12. Challenges in developing LLMs (2)
     • Compute
       ◦ A huge computing cluster is required to handle training jobs
         ▪ GPT-3-scale models (175B) require hundreds to thousands of flagship GPUs to train
         ▪ Even small models (1B) require tens of H100 GPUs to train within a handy time
       ◦ Operating a computing cluster is costly
         ▪ Training a 32B model (LLM-jp-4 flagship) on 10T tokens requires ~12k GPU (H200) days, ~$10M
     • Engineering cost/workforce
       ◦ Human experts are also required for large-scale data collection, developing/managing training pipelines, and operating computing resources
  13. Governmental support for LLMs in Japan (FY2024)
     • METI (経済産業省)
       ◦ Financial support for setting up 3rd-party compute resources
       ◦ ABCI supercomputer: compute support for LLM development
       ◦ GENIAC (NEDO): financial/compute support for LLM development
     • Cabinet Office (内閣府)
       ◦ Support for LLM development in the medical domain (SIP)
     • MIC (総務省)
       ◦ NICT: developing its own LLM with its own corpus/resources
     • MEXT (文部科学省)
       ◦ University of Tokyo: preparing computing resources for LLMs and other foundation models
       ◦ RIKEN: experiments employing the Fugaku supercomputer for LLM training
       ◦ National Institute of Informatics: organizes LLM-jp; R&D center for LLMs
  14. Working groups in LLM-jp
     • (Diagram of the WG organization: Corpus WG, Model WG (computing cluster), and Tuning/evaluation WG (large-scale corpus, evaluation data, tuning data); Safety WG (Prof. Sekine, NII); Multimodal WG (Prof. Okazaki, Science Tokyo); Real-world Interaction WG (Prof. Kawahara, Waseda U.); Principle WG (Prof. Ozeki, U. Tokyo); other leads include Prof. Ogata (Waseda U.), Prof. Suzuki (Tohoku U.), Prof. Miyao (U. Tokyo), Prof. Taura (U. Tokyo), and Prof. Yokota (Science Tokyo))
  15. Working groups in LLM-jp (focus of this talk)
     • Corpus WG
       ◦ Collect a trillion-scale text corpus to train LLMs
       ◦ Develop tokenizers
       ◦ Develop the corpus filtering pipeline
       ◦ Data augmentation/synthesis
       ◦ Collaboration with other national labs
     • Model WG
       ◦ Pretraining/mid-training of LLMs
       ◦ Explore model architectures/hyperparameters
       ◦ Manage the computing cluster
  16. LLM-jp-3 model series: model sizes
      Model        150M   440M   980M   1.8B   3.7B   7.2B   13B    172B
      Vocab size   99487 (same for all models)
      #Layers      12     16     20     24     28     32     40     96
      FFN size     2048   3584   5376   7168   8192   11008  13824  38464
      Hid. size    512    1024   1536   2048   3072   4096   5120   12288
      #Att. heads  8 / 16 / 24 / 32 / 40 / 96
      #Query grps  8 / 16 / 24 / 32 / 40 / 16
  17. Training curves of LLM-jp models
     (Figure: average of subtask scores vs. trained tokens [in billions, log scale]; GPT-3.5 and GPT-4 shown as references)
     • LLM-jp-3 MoE 8x13B: trained on 2.1T tokens starting from LLM-jp-3 13B; our best model so far, comparable with GPT-3.5 without tuning
     • LLM-jp-3 150M~172B (8 models): trained on 2.1T tokens
     • LLM-jp-4 experimental models (ongoing): planning to train on 15.6T tokens
  18. Collecting a trillion-scale corpus
     • Corpus construction = the most crucial part of LLM development
       ◦ A larger corpus improves model quality
       ◦ A higher-quality corpus also improves model quality
     • How LLM-jp collects a huge corpus
       ◦ Leveraging public corpora
         ▪ Common Crawl (CC): a huge collection of Web pages
       ◦ Our own crawling
       ◦ Collaborative work with other institutions
         ▪ National Diet Library (NDL: 国立国会図書館)
         ▪ National Institute of Japanese Literature (NIJL: 国文学研究資料館)
  19. Official announcements of the NII/NDL collaboration (2024, 2025)
     • 2024: Providing a Web URL list of government officials
     • 2025: Providing OCR texts of governmental documents
  20. Corpus ratio (1)
     • Determining the number of repeats of each subset's tokens during training
       ◦ Many repeats: increases the subset's ratio, but may cause overfitting
       ◦ Few repeats: may cause underfitting
       ◦ The corpus ratio also affects the balance of task performance
     • Ablation study conducted with various patterns of repeats (a small sketch of the ratio computation follows below)
       ◦ (Figure: subsets and candidate repeat patterns; color represents the number of repeats)
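A minimal sketch, assuming made-up subset names and token counts, of how repeat counts translate into effective corpus ratios:

```python
# Effective mixture ratio when each subset is repeated a given number of times.
subsets = {          # name: (tokens in billions, number of repeats) -- illustrative values
    "ja_web":  (400, 2),
    "en_web":  (900, 1),
    "code":    (150, 2),
    "ja_wiki": (5,   4),
}

effective = {name: size * repeats for name, (size, repeats) in subsets.items()}
total = sum(effective.values())
for name, tokens in effective.items():
    print(f"{name:8s} {tokens:6.0f}B tokens  ratio {tokens / total:.1%}")
```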
  21. Corpus ratio (2)
     • Adopted corpus ratios (LLM-jp-3 total: 2.1T tokens / LLM-jp-4 total: 22T tokens)
       ◦ LLM-jp-3: balancing between Japanese and English
       ◦ LLM-jp-4 (ongoing): introducing many more English tokens
         ▪ Based on the observation that this setting achieves comparable performance on Japanese downstream tasks
  22. Measuring the effectiveness of additional corpora (1)
     • Acceptance testing of new corpus configurations (LLM-jp-4)
       ◦ 5 candidate changes are investigated
         ▪ Candidate 1: Replace the Stack (coding corpus) v1 with v2
         ▪ Candidate 2: Reduce FineWeb (Web corpus) to enhance STEM data
         ▪ Candidate 3: Add MegaMath (math corpus)
         ▪ Candidate 4: Add the Laboro corpus (parallel corpus)
         ▪ Candidate 5: Add the FinePDFs corpus (table corpus)
       ◦ We need to determine which changes are actually effective, independently of the other settings (= avoid interaction effects)
       ◦ Directly measuring every on/off combination is intractable
         ▪ Training 32 (=2^5) models, each costing millions of JPY
         ▪ Need to reduce the number of experiments
  23. Measuring the effectiveness of additional corpora (2)
     • Applying a fractional factorial design (see the sketch below)
       ◦ Design each experiment as a mixture of on/off settings
       ◦ Keep the design columns orthogonal across experiments
     • Perform only 8 model trainings
       ◦ Then aggregate the results to measure each effect
       ◦ (Table columns: effect size (Cohen's f), p-value of the F test, difference of means (on-results − off-results))
     • Result: only setting C (Add MegaMath) has a clear positive effect
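A hedged sketch of the technique, not the exact design LLM-jp used: a 2^(5-2) fractional factorial builds 8 orthogonal on/off runs for the five candidates (using D = A·B and E = A·C as generator columns), and each main effect is the difference of mean scores between its on and off runs. The scores below are invented placeholders.

```python
import itertools
import statistics

# 2^(5-2) fractional factorial: A, B, C form a full 2^3 design,
# D = A*B and E = A*C are generator columns (levels coded as +1 / -1).
runs = []
for a, b, c in itertools.product((+1, -1), repeat=3):
    runs.append({"A": a, "B": b, "C": c, "D": a * b, "E": a * c})

# Hypothetical evaluation scores of the 8 trained models (illustrative numbers only).
scores = [0.52, 0.49, 0.55, 0.50, 0.51, 0.48, 0.56, 0.51]

# Main effect of each factor = mean(score | on) - mean(score | off).
for factor in "ABCDE":
    on  = [s for run, s in zip(runs, scores) if run[factor] == +1]
    off = [s for run, s in zip(runs, scores) if run[factor] == -1]
    print(factor, round(statistics.mean(on) - statistics.mean(off), 4))
```

Because the columns are orthogonal, each main effect can be estimated from only 8 runs instead of 32, at the cost of confounding main effects with some interactions.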
  24. Training parallelism (1)
     • Training with multiple GPUs
       ◦ We need to determine which GPU is responsible for which calculation
       ◦ 3 typical configurations (figure omitted):
         ▪ Data parallel (DP): replicate the same parameters onto multiple GPUs; each GPU processes different data
         ▪ Tensor parallel (TP): split each stage (matrix) into small parts; each GPU processes a single part
         ▪ Pipeline parallel (PP): split the entire model into multiple sections; each GPU processes a single section
  25. Training parallelism (2)
     • An actual parallelism configuration involves DP/TP/PP simultaneously
     • Number of required GPUs = DP × TP × PP
       ◦ (Figure: example with DP=2, TP=2, PP=2 on 8 GPUs; colors represent the same parameter group)
     • How to determine the optimal combination of parallelisms? (see the sketch below)
       a. VRAM size of a single GPU
          ▪ Larger VRAM reduces the required parallelism
          ▪ B300 > H200 > H100
       b. Communication overhead
          ▪ Intra-node: GPU-GPU on the same machine (NVLink > PCI Express)
          ▪ Inter-node: GPU-GPU on different machines (InfiniBand > Ethernet)
       c. Minibatch size (global batch size: GBS)
          ▪ DP must be smaller than the GBS
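A small sketch enumerating feasible (DP, TP, PP) combinations under the constraints listed above; the GPU budget, layer/head counts, and GBS are assumed values, not LLM-jp's configuration:

```python
# Enumerate parallelism configurations for a fixed GPU budget.
n_gpus = 64       # total GPUs (assumed)
n_layers = 40     # model layers; PP should divide this (assumed)
n_heads = 40      # attention heads; TP should divide this (assumed)
gbs = 256         # global batch size: DP is bounded by the GBS

for tp in (1, 2, 4, 8):
    for pp in (1, 2, 4, 8):
        if n_gpus % (tp * pp):
            continue
        dp = n_gpus // (tp * pp)   # DP x TP x PP = total GPUs
        if n_heads % tp or n_layers % pp or dp > gbs:
            continue
        print(f"DP={dp:3d} TP={tp} PP={pp}")
```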
  26. Training parallelism (3)
     • Reality in LLM-jp: LLM-jp-4 32B (ongoing)
       ◦ The actual configuration is chosen based not only on throughput but also on training stability
         ▪ Adopted configuration: not the fastest, but stable
         ▪ "Best" configuration: faster, but flaky; it performed well in the ablation study, but actual training tends to hit OOM/hardware errors
         ▪ Some configurations can't be selected due to the GBS restriction (256, 512)
  27. Low-precision computation (1)
     • LLM training with 16-bit floats
       ◦ Typical bit pattern: BFloat16 (BF16)
         ▪ Designed for machine learning
         ▪ 8-bit exponent: same dynamic range as single precision (IEEE754 binary32)
         ▪ 7-bit fraction: diminished resolution
       ◦ (Figure: bit layouts of BF16 and single precision: sign, exponent, fraction fields)
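A minimal sketch showing that BF16 is essentially the top 16 bits of an IEEE754 float32 (same 8-bit exponent, mantissa truncated to 7 bits), which illustrates the reduced resolution mentioned above:

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate a float32 to BF16 precision by keeping its top 16 bits
    (real hardware rounds to nearest; truncation is enough to show the effect)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # float32 bit pattern
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(to_bf16(3.141592653589793))      # 3.140625 -> only ~3 decimal digits survive
print(to_bf16(1e38), to_bf16(1e-38))   # very large/small values still representable (same exponent range)
```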
  28. Low-precision computation (2)
     • Even lower precision is often applied: FP8
       ◦ E4M3: typically used for the forward pass
       ◦ E5M2: typically used for the backward pass (gradients)
     • In general, using lower-precision numbers:
       ◦ Reduces calculation cost
       ◦ Negatively impacts model quality
     • Ablation study by LLM-jp:
       ◦ A 13B Transformer with FP8 (E4M3/E5M2) performed slightly worse than BF16
       ◦ We adopted BF16 to prioritize model quality
  29. Configuration of the optimizer
     • Transformers are trained using stochastic gradient descent (SGD)
     • Actual training usually adopts a variant of SGD: AdamW (see the sketch below)
       ◦ SGD + momentum + adaptive gradient decay + weight decay
     • Many hyperparameters are required:
       ◦ Learning rate (η) … depends on the model (1e-3 ~ 1e-6)
       ◦ Momentum strength (β1) … usually set to 0.9 for any model
       ◦ Gradient decay strength (β2) … usually set to 0.95 for LLMs
       ◦ A factor to avoid division by zero (ε) … usually set to 1e-8
       ◦ Weight decay strength … usually set to 0.1
     • An inappropriate hyperparameter causes inaccurate training; every one of them is crucial
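These hyperparameters map directly onto PyTorch's AdamW; a minimal sketch with a placeholder model and an assumed learning rate (not the LLM-jp settings):

```python
import torch

model = torch.nn.Linear(1024, 1024)  # placeholder for the Transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # learning rate (eta); model-dependent, roughly 1e-3 ~ 1e-6
    betas=(0.9, 0.95),  # momentum (beta1) and gradient-decay (beta2) strengths
    eps=1e-8,           # factor to avoid division by zero (epsilon)
    weight_decay=0.1,   # decoupled weight decay strength
)
```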
  30. Determining the learning rate
     (Figure: loss curve with a spike; instability of gradients)
     • Effect of the learning rate (LR) on LLMs:
       ◦ Keeping a large LR gives quick convergence, but causes instability
       ◦ Loss spikes: typical abnormal behavior during training
         ▪ Direct cause: gradient explosion; a large LR + a deep network increases the possibility
         ▪ A single loss spike can usually be recovered from; many overlapping loss spikes tend to cause overall failure of training
       ◦ Need to set the LR as high as possible while keeping the frequency of loss spikes small
  31. Learning rate scheduling (1)
     • The LR is controlled throughout the entire training process
       ◦ Traditional configuration: linear warmup + cosine decay
       ◦ LLM-jp-3 adopted cosine scheduling
     • Some studies argue for the advantage of warmup-stable-decay (WSD) scheduling (a sketch follows below)
       ◦ Keep the maximum LR until near the end of training, with a small number of decay steps
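A minimal sketch of a warmup-stable-decay (WSD) schedule as described above; the step counts and the linear shape of the warmup/decay phases are assumptions, not LLM-jp's exact schedule:

```python
def wsd_lr(step, max_lr, warmup_steps, total_steps, decay_steps, min_lr=0.0):
    """Linear warmup, then hold max_lr, then decay linearly over the last decay_steps."""
    if step < warmup_steps:                   # warmup phase
        return max_lr * (step + 1) / warmup_steps
    if step < total_steps - decay_steps:      # stable phase: keep the maximum LR
        return max_lr
    remaining = total_steps - step            # decay phase
    return min_lr + (max_lr - min_lr) * remaining / decay_steps

# Example: 500k steps with 2k warmup steps and 50k decay steps.
for s in (0, 1_000, 100_000, 449_999, 475_000, 499_999):
    print(s, wsd_lr(s, max_lr=3e-4, warmup_steps=2_000,
                    total_steps=500_000, decay_steps=50_000))
```

Compared with cosine decay, which lowers the LR continuously after warmup, WSD holds the maximum LR and compresses the decay into the final steps.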
  32. Learning rate scheduling (2)
     • Ablation study in LLM-jp
       ◦ Conducted very long-run training (~500k steps) to investigate the actual behavior of real training scenarios
       ◦ WSD with enough decay steps achieves higher performance
       ◦ Adopted WSD for LLM-jp-4
     • (Figure: schedulers compared)
  33. Problem with the Adam epsilon
     • The epsilon (ε) hyperparameter of the Adam optimizer
       ◦ Plays an important role in model convergence
         ▪ Should be set to 1e-8
         ▪ It was 1e-5 in the LLaMA 2 technical report, but that setting led to failed training of large models
       ◦ LLM-jp conducted ablation studies and reported the results to the Japanese community
         ▪ Confirmed that some other organizations faced the same problem
         ▪ A huge amount of compute was lost
       ◦ (Figure: 1e-8 converges ~3x faster; 1e-5 shows very slow convergence)
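For reference, the standard AdamW update (not taken from the slide) shows why a large ε hurts: when ε dominates √v̂_t in the denominator, the adaptive scaling is flattened and the effective step size shrinks, which slows convergence.

```latex
\theta_{t+1} = \theta_t
  - \eta \,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
  - \eta\,\lambda\,\theta_t,
\qquad
\hat{m}_t = \frac{m_t}{1-\beta_1^t},\quad
\hat{v}_t = \frac{v_t}{1-\beta_2^t}
```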
  34. Hyperparameter problem → overall delay
     (Gantt-style timeline, 2024-04 through 2024-12: release of the LLM-jp v2 models; initial v3 172B / 13B / 1.8B / 70B runs; the epsilon problem detected & investigated; retries of v3 172B / 13B / 1.8B plus v3 3.7B / 7.2B; release of the LLM-jp-3 models (1.8B, 3.7B, 13B, 172B beta1); parallel tracks for BERT, MoE, and VLM; LLM-jp-3 172B training finished on 2024-12-13)
  35. LLM-jp and open science
     • Encouraging open science
       ◦ LLM-jp is working to share our knowledge with a wide range of people
         ▪ Academic researchers
         ▪ Industry developers
         ▪ Users
     • Materials developed by LLM-jp should also be released publicly, with as few restrictions as possible
       ◦ Transparency of model development
     • Several policies have been formulated to maintain our transparency
  36. Policy of corpus use
     • Corpus Release Level
       ◦ Specifies how each collected corpus may be used in LLM-jp (train/test)

       Release level                      Purpose
       L1: train, search, distribute      May be used for any purpose (if the license allows)
       L2: train, search                  Redistribution prohibited
       L3: train                          Must not be exposed outside LLM-jp
       LX: no-train                       Training prohibited; test-time use only
       LZ: no-use                         Must not be used for any purpose (including illegal data)

     • (Figure: actual release levels of the collected subsets)
     • Only subsets at levels L1, L2, and L3 are used to train LLM-jp models
  37. Policy of model/corpus release (1)
     • FY2024
       ◦ We did not have any written policy for our released materials
       ◦ Licenses were determined through case-by-case discussion
       ◦ Release of LLM-jp-3 172B
         ▪ Under a restricted license, but announced as an "open model"
           • The other LLM-jp-3 variants (~13B) were released under the Apache License 2.0
           • Only the 172B model was released under a special license
         ▪ Received negative attention from the public
  38. Policy of model/corpus release (2)
     • FY2025
       ◦ Conducted an internal policy-making process
       ◦ Released our official licensing policy
         ▪ LLM-jp does not apply restrictive licenses (unless necessary)
         ▪ LLM-jp releases LLM/MMLM instances under the Apache License 2.0
  39. End of lecture
     Homework assignment
     • Pick your favorite LLM
       ◦ Focus only on publicly available models
       ◦ Both proprietary and open models are allowed
     • Investigate the important technical breakthroughs applied to construct the selected LLM
       ◦ Bonus: do such breakthroughs affect your own research topic?
       ◦ Bonus: what is NOT yet achieved by such breakthroughs?
     • Write a 1-page summary of the above survey
       ◦ LLM-assisted writing is allowed, but you must guarantee the soundness of the whole content yourself