LLM-jp-3 and beyond: Training Large Language Models

Slides for an NII course on 2025-10-29.


Yusuke Oda

October 29, 2025
Transcript

  1. Contents
     • Basics of LLMs
     • Introducing LLM-jp
     • Corpus construction (digest)
     • Model training (pretraining)
     • Release policy
  2. Definition of language models
     • Language model = probability distribution over texts
       ◦ Desirable text (e.g., human-readable text) gets a high value
       ◦ Undesirable text (e.g., random text) gets a low value
     • A text = an array of tokens
       ◦ (Figure: the text "<s> Hello world </s>" as a token sequence: 1st token <s>, 2nd token Hello, 3rd token world, last token </s>)
     • What is the relationship between LMs and generative AI?
  3. Next token prediction – Basis of Generative AI (1)
     • Decomposing the LM distribution using the "chain rule" of conditional probability:
       predict the 1st word, then the 2nd word given the 1st, then the 3rd word given the 1st/2nd, and so on
     • Each factor is a next-token prediction model (autoregressive model): the probability of the token to append given the history of past tokens
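Written out, the chain-rule factorization the slide refers to is the standard one (the notation below is assumed here, not taken from the slide):

```latex
% A text w_1, ..., w_T (with w_1 = <s> and w_T = </s>) factorizes as
P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```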
  4. Next token prediction – Basis of Generative AI (2)
     • The next-token prediction model derives a basic generation algorithm (a runnable version appears below):
       history = ["<s>"]
       while history[-1] != "</s>": history.append(sample w from P)
       return history
       ◦ (Figure: "<s>" → sample → "<s> Hello" → sample → "<s> Hello world" → sample → "<s> Hello world </s>")
     • Very simple, but a recent line of research revealed that if the LLM is intelligent enough, next-token prediction can solve tasks described in natural language.
     • Anyway, how do we construct the next-token distribution P?
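A minimal runnable sketch of the generation loop above, assuming a toy hand-written next-token table `P` in place of a trained model (the table and its probabilities are illustrative only):

```python
import random

# Toy next-token distribution P(next | last token); a real LLM conditions on
# the full history and is a neural network, not a lookup table.
P = {
    "<s>":   {"Hello": 0.9, "</s>": 0.1},
    "Hello": {"world": 0.8, "Hello": 0.1, "</s>": 0.1},
    "world": {"</s>": 0.9, "world": 0.1},
}

def generate(max_len=20):
    history = ["<s>"]
    while history[-1] != "</s>" and len(history) < max_len:
        dist = P[history[-1]]                 # conditional distribution
        tokens, probs = zip(*dist.items())
        history.append(random.choices(tokens, weights=probs, k=1)[0])  # sample w from P
    return history

print(generate())  # e.g. ['<s>', 'Hello', 'world', '</s>']
```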
  5. Modeling
     • Historical methods – count-based
       ◦ E.g., n-gram models (a minimal example appears after this slide)
     • Neural network methods (2001~)
       ◦ Feed-forward NN (2001)
       ◦ Recurrent NN (2010)
       ◦ Transformer (2017)
         ▪ The majority of current LLM architectures
         ▪ Handles histories of any length (in theory)
     • (Figure taken from https://arxiv.org/abs/1706.03762; only the right part (decoder) of the Transformer is used to construct LMs)
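To make the count-based idea concrete, a minimal bigram model estimated from raw counts (the tiny corpus and the plain maximum-likelihood estimator are illustrative assumptions, not LLM-jp's setup):

```python
from collections import Counter, defaultdict

corpus = [["<s>", "Hello", "world", "</s>"],
          ["<s>", "Hello", "there", "</s>"]]

# Count bigrams: how often each token follows each context token.
bigram = defaultdict(Counter)
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram[prev][cur] += 1

def p_next(prev, cur):
    """Maximum-likelihood estimate P(cur | prev) = count(prev, cur) / count(prev, *)."""
    total = sum(bigram[prev].values())
    return bigram[prev][cur] / total if total else 0.0

print(p_next("<s>", "Hello"))    # 1.0
print(p_next("Hello", "world"))  # 0.5
```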
  6. Evolution of GPT models (~2023)
     • 2018 GPT (117M params)
     • 2019 GPT-2 (1.5B params)
     • 2020 GPT-3 (175B params)
     • 2022 GPT-3.5 / InstructGPT
     • 2022 ChatGPT (application)
       ◦ Huge social impact
     • 2023 GPT-4 (2T? params)
       ◦ High performance on national exams: US legal bar exam, USMLE (medical), SAT
     • Exponential increase of #params (more layers, hidden units, attention heads)
     • Large models + next-token prediction = handling complex tasks
  7. Impact of ChatGPT
     • Boosting research (figure taken from https://arxiv.org/abs/2307.06435; log-scale axis)
     • Boosting development of their own LLMs (figure taken from https://link.springer.com/article/10.1007/s11044-023-09962-0)
  8. Introducing LLM-jp
     LLM-jp (LLM勉強会, "LLM study group")
     • Construct open and Japanese-oriented LLMs
     • Reveal the operating principles of LLMs
     • Release models, data, tools, and related techniques and materials, including not only successful work but also progress and/or failures
     • Anyone who agrees with the above policy can participate
     Timeline:
     • 2023-05: First meetup (started with 30 NLP researchers; now ~2000 members)
     • 2023-10: First 13B model
     • 2024-04: NII R&D Center for LLMs (LLMC, dedicated lab)
     • 2024-09: LLM-jp-3 ~13B (7 models)
     • 2024-12: LLM-jp-3 172B
     • 2025-03: LLM-jp-3 MoE 8x1.8B, 8x13B
     • 2025-05: LLM-jp-3.1 (3 models)
     • Now: Training v4 models
  9. Why develop LLMs in Japan (1)
     • Enhancing local knowledge
       ◦ Flagship LLMs are not trained well on local knowledge/commonsense
         ▪ They work anyway, with limited internal knowledge and RAG
       ◦ Necessity to train LLMs (pretraining and/or tuning) with local knowledge: the Japanese language, Japanese culture, geolocational information in/around Japan
  10. Why develop LLMs in Japan (2)
     • Reducing geopolitical risks around LLMs = technical self-sufficiency
       ◦ Secure the technical capabilities, engineers, and resources to develop LLMs independently within one's own country
     • (Figure: model dependency vs. self-sufficiency)
  11. Challenges in developing LLMs (1)
     • Data
       ◦ A huge amount of text data is required to train LLMs
         ▪ Trillions of tokens should be prepared
           • LLaMA 2: 2T tokens
           • LLaMA 3: 15T tokens
           • Qwen3: 36T tokens
       ◦ Collecting data is challenging, especially for non-English languages (even Japanese)
         ▪ Most open web data (Common Crawl) is written in English
         ▪ Only ~1T tokens of open data are available in Japanese
  12. Challenges in developing LLMs (2)
     • Compute
       ◦ A huge computing cluster is required to handle training jobs
         ▪ GPT-3-scale models (175B) require hundreds to thousands of flagship GPUs to train
         ▪ Even small models (1B) require tens of H100 GPUs to train within a handy time
       ◦ Operating a computing cluster is costly
         ▪ Training a 32B model (LLM-jp-4 flagship) on 10T tokens requires ~12k GPU (H200) days, ~$10M
     • Engineering cost/workforce
       ◦ Human experts are also required for large-scale data collection, developing/managing training pipelines, and operating computing resources
  13. Governmental support for LLMs in Japan (FY2024)
     • METI (経済産業省)
       ◦ Financial support for setting up 3rd-party compute resources
       ◦ ABCI supercomputer: compute support for LLM development
       ◦ GENIAC (NEDO): financial/compute support for LLM development
     • Cabinet Office (内閣府)
       ◦ Support for LLM development in the medical domain (SIP)
     • MIC (総務省)
       ◦ NICT: developing its own LLM with its own corpus/resources
     • MEXT (文部科学省)
       ◦ University of Tokyo: preparing computing resources for LLMs and other foundation models
       ◦ RIKEN: experiments employing the Fugaku supercomputer for LLM training
       ◦ National Institute of Informatics: organizes LLM-jp; R&D center for LLMs
  14. Working groups in LLM-jp
     • (Diagram of the WG organization: Corpus WG, Model WG (computing cluster), and Tuning/evaluation WG (large-scale corpus, evaluation data, tuning data); Safety WG (Prof. Sekine, NII); Multimodal WG (Prof. Okazaki, Science Tokyo); Real-world Interaction WG (Prof. Kawahara, Waseda U.); Principle WG (Prof. Ozeki, U. Tokyo); other leads include Prof. Ogata (Waseda U.), Prof. Suzuki (Tohoku U.), Prof. Miyao (U. Tokyo), Prof. Taura (U. Tokyo), and Prof. Yokota (Science Tokyo))
  15. Working groups in LLM-jp (focus of this talk)
     • Corpus WG
       ◦ Collect a trillion-scale text corpus to train LLMs
       ◦ Develop tokenizers
       ◦ Develop the corpus filtering pipeline
       ◦ Data augmentation/synthesis
       ◦ Collaboration with other national labs
     • Model WG
       ◦ Pretraining/mid-training of LLMs
       ◦ Explore model architectures/hyperparameters
       ◦ Manage the computing cluster
  16. LLM-jp-3 model series: model sizes
      Model        150M   440M   980M   1.8B   3.7B   7.2B   13B    172B
      Vocab size   99487 (same for all models)
      #Layers      12     16     20     24     28     32     40     96
      FFN size     2048   3584   5376   7168   8192   11008  13824  38464
      Hid. size    512    1024   1536   2048   3072   4096   5120   12288
      #Att. heads  8 / 16 / 24 / 32 / 40 / 96
      #Query grps  8 / 16 / 24 / 32 / 40 / 16
  17. Training curves of LLM-jp models
     (Figure: average of subtask scores vs. trained tokens [in billions, log scale]; GPT-3.5 and GPT-4 shown as references)
     • LLM-jp-3 MoE 8x13B: trained on 2.1T tokens starting from LLM-jp-3 13B; our best model so far, comparable with GPT-3.5 without tuning
     • LLM-jp-3 150M~172B (8 models): trained on 2.1T tokens
     • LLM-jp-4 experimental models (ongoing): planning to train on 15.6T tokens
  18. Collecting a trillion-scale corpus
     • Corpus construction = the most crucial part of LLM development
       ◦ A larger corpus improves model quality
       ◦ A higher-quality corpus also improves model quality
     • How LLM-jp collects a huge corpus
       ◦ Leveraging public corpora
         ▪ Common Crawl (CC): a huge collection of Web pages
       ◦ Our own crawling
       ◦ Collaborative work with other institutions
         ▪ National Diet Library (NDL: 国立国会図書館)
         ▪ National Institute of Japanese Literature (NIJL: 国文学研究資料館)
  19. Official announcements of the NII/NDL collaboration (2024, 2025)
     • 2024: Providing a Web URL list of government officials
     • 2025: Providing OCR texts of governmental documents
  20. Corpus ratio (1)
     • Determining the number of repeats of each subset's tokens during training
       ◦ Many repeats: increases the subset's ratio, but may cause overfitting
       ◦ Few repeats: may cause underfitting
       ◦ The corpus ratio also affects the balance of task performance
     • Ablation study conducted with various patterns of repeats (a small sketch of the ratio computation follows below)
       ◦ (Figure: subsets and candidate repeat patterns; color represents the number of repeats)
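A minimal sketch, assuming made-up subset names and token counts, of how repeat counts translate into effective corpus ratios:

```python
# Effective mixture ratio when each subset is repeated a given number of times.
subsets = {          # name: (tokens in billions, number of repeats) -- illustrative values
    "ja_web":  (400, 2),
    "en_web":  (900, 1),
    "code":    (150, 2),
    "ja_wiki": (5,   4),
}

effective = {name: size * repeats for name, (size, repeats) in subsets.items()}
total = sum(effective.values())
for name, tokens in effective.items():
    print(f"{name:8s} {tokens:6.0f}B tokens  ratio {tokens / total:.1%}")
```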
  21. Corpus ratio (2)
     • Adopted corpus ratios (LLM-jp-3 total: 2.1T tokens / LLM-jp-4 total: 22T tokens)
       ◦ LLM-jp-3: balancing between Japanese and English
       ◦ LLM-jp-4 (ongoing): introducing many more English tokens
         ▪ Based on the observation that this setting achieves comparable performance on Japanese downstream tasks
  22. Measuring the effectiveness of additional corpora (1)
     • Acceptance testing of new corpus configurations (LLM-jp-4)
       ◦ 5 candidate changes are investigated
         ▪ Candidate 1: Replace the Stack (coding corpus) v1 with v2
         ▪ Candidate 2: Reduce FineWeb (Web corpus) to enhance STEM data
         ▪ Candidate 3: Add MegaMath (math corpus)
         ▪ Candidate 4: Add the Laboro corpus (parallel corpus)
         ▪ Candidate 5: Add the FinePDFs corpus (table corpus)
       ◦ We need to determine which changes are actually effective, independently of the other settings (= avoid interaction effects)
       ◦ Directly measuring every on/off combination is intractable
         ▪ Training 32 (=2^5) models, each costing millions of JPY
         ▪ Need to reduce the number of experiments
  23. Measuring the effectiveness of additional corpora (2)
     • Applying a fractional factorial design (see the sketch below)
       ◦ Design each experiment as a mixture of on/off settings
       ◦ Keep the design columns orthogonal across experiments
     • Perform only 8 model trainings
       ◦ Then aggregate the results to measure each effect
       ◦ (Table columns: effect size (Cohen's f), p-value of the F test, difference of means (on-results − off-results))
     • Result: only setting C (Add MegaMath) has a clear positive effect
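A hedged sketch of the technique, not the exact design LLM-jp used: a 2^(5-2) fractional factorial builds 8 orthogonal on/off runs for the five candidates (using D = A·B and E = A·C as generator columns), and each main effect is the difference of mean scores between its on and off runs. The scores below are invented placeholders.

```python
import itertools
import statistics

# 2^(5-2) fractional factorial: A, B, C form a full 2^3 design,
# D = A*B and E = A*C are generator columns (levels coded as +1 / -1).
runs = []
for a, b, c in itertools.product((+1, -1), repeat=3):
    runs.append({"A": a, "B": b, "C": c, "D": a * b, "E": a * c})

# Hypothetical evaluation scores of the 8 trained models (illustrative numbers only).
scores = [0.52, 0.49, 0.55, 0.50, 0.51, 0.48, 0.56, 0.51]

# Main effect of each factor = mean(score | on) - mean(score | off).
for factor in "ABCDE":
    on  = [s for run, s in zip(runs, scores) if run[factor] == +1]
    off = [s for run, s in zip(runs, scores) if run[factor] == -1]
    print(factor, round(statistics.mean(on) - statistics.mean(off), 4))
```

Because the columns are orthogonal, each main effect can be estimated from only 8 runs instead of 32, at the cost of confounding main effects with some interactions.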
  24. Training parallelism (1)
     • Training with multiple GPUs
       ◦ We need to determine which GPU is responsible for which calculation
       ◦ 3 typical configurations (figure omitted):
         ▪ Data parallel (DP): replicate the same parameters onto multiple GPUs; each GPU processes different data
         ▪ Tensor parallel (TP): split each stage (matrix) into small parts; each GPU processes a single part
         ▪ Pipeline parallel (PP): split the entire model into multiple sections; each GPU processes a single section
  25. Training parallelism (2)
     • An actual parallelism configuration involves DP/TP/PP simultaneously
     • Number of required GPUs = DP × TP × PP
       ◦ (Figure: example with DP=2, TP=2, PP=2 on 8 GPUs; colors represent the same parameter group)
     • How to determine the optimal combination of parallelisms? (see the sketch below)
       a. VRAM size of a single GPU
          ▪ Larger VRAM reduces the required parallelism
          ▪ B300 > H200 > H100
       b. Communication overhead
          ▪ Intra-node: GPU-GPU on the same machine (NVLink > PCI Express)
          ▪ Inter-node: GPU-GPU on different machines (InfiniBand > Ethernet)
       c. Minibatch size (global batch size: GBS)
          ▪ DP must be smaller than the GBS
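A small sketch enumerating feasible (DP, TP, PP) combinations under the constraints listed above; the GPU budget, layer/head counts, and GBS are assumed values, not LLM-jp's configuration:

```python
# Enumerate parallelism configurations for a fixed GPU budget.
n_gpus = 64       # total GPUs (assumed)
n_layers = 40     # model layers; PP should divide this (assumed)
n_heads = 40      # attention heads; TP should divide this (assumed)
gbs = 256         # global batch size: DP is bounded by the GBS

for tp in (1, 2, 4, 8):
    for pp in (1, 2, 4, 8):
        if n_gpus % (tp * pp):
            continue
        dp = n_gpus // (tp * pp)   # DP x TP x PP = total GPUs
        if n_heads % tp or n_layers % pp or dp > gbs:
            continue
        print(f"DP={dp:3d} TP={tp} PP={pp}")
```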
  26. Training parallelism (3)
     • Reality in LLM-jp: LLM-jp-4 32B (ongoing)
       ◦ The actual configuration is chosen based not only on throughput but also on training stability
         ▪ Adopted configuration: not the fastest, but stable
         ▪ "Best" configuration: faster, but flaky; it performed well in the ablation study, but actual training tends to hit OOM/hardware errors
         ▪ Some configurations can't be selected due to the GBS restriction (256, 512)
  27. Low-precision computation (1)
     • LLM training with 16-bit floats
       ◦ Typical bit pattern: BFloat16 (BF16)
         ▪ Designed for machine learning
         ▪ 8-bit exponent: same dynamic range as single precision (IEEE754 binary32)
         ▪ 7-bit fraction: diminished resolution
       ◦ (Figure: bit layouts of BF16 and single precision: sign, exponent, fraction fields)
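A minimal sketch showing that BF16 is essentially the top 16 bits of an IEEE754 float32 (same 8-bit exponent, mantissa truncated to 7 bits), which illustrates the reduced resolution mentioned above:

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate a float32 to BF16 precision by keeping its top 16 bits
    (real hardware rounds to nearest; truncation is enough to show the effect)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # float32 bit pattern
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(to_bf16(3.141592653589793))      # 3.140625 -> only ~3 decimal digits survive
print(to_bf16(1e38), to_bf16(1e-38))   # very large/small values still representable (same exponent range)
```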
  28. Low-precision computation (2)
     • Even lower precision is often applied: FP8
       ◦ E4M3: typically used for the forward pass
       ◦ E5M2: typically used for the backward pass (gradients)
     • In general, using lower-precision numbers:
       ◦ Reduces calculation cost
       ◦ Negatively impacts model quality
     • Ablation study by LLM-jp:
       ◦ A 13B Transformer with FP8 (E4M3/E5M2) performed slightly worse than BF16
       ◦ We adopted BF16 to prioritize model quality
  29. Configuration of the optimizer
     • Transformers are trained using stochastic gradient descent (SGD)
     • Actual training usually adopts a variant of SGD: AdamW (see the sketch below)
       ◦ SGD + momentum + adaptive gradient decay + weight decay
     • Many hyperparameters are required:
       ◦ Learning rate (η) … depends on the model (1e-3 ~ 1e-6)
       ◦ Momentum strength (β1) … usually set to 0.9 for any model
       ◦ Gradient decay strength (β2) … usually set to 0.95 for LLMs
       ◦ A factor to avoid division by zero (ε) … usually set to 1e-8
       ◦ Weight decay strength … usually set to 0.1
     • An inappropriate hyperparameter causes inaccurate training; every one of them is crucial
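These hyperparameters map directly onto PyTorch's AdamW; a minimal sketch with a placeholder model and an assumed learning rate (not the LLM-jp settings):

```python
import torch

model = torch.nn.Linear(1024, 1024)  # placeholder for the Transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # learning rate (eta); model-dependent, roughly 1e-3 ~ 1e-6
    betas=(0.9, 0.95),  # momentum (beta1) and gradient-decay (beta2) strengths
    eps=1e-8,           # factor to avoid division by zero (epsilon)
    weight_decay=0.1,   # decoupled weight decay strength
)
```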
  30. Determining the learning rate
     (Figure: loss curve with a spike; instability of gradients)
     • Effect of the learning rate (LR) on LLMs:
       ◦ Keeping a large LR gives quick convergence, but causes instability
       ◦ Loss spikes: typical abnormal behavior during training
         ▪ Direct cause: gradient explosion; a large LR + a deep network increases the possibility
         ▪ A single loss spike can usually be recovered from; many overlapping loss spikes tend to cause overall failure of training
       ◦ Need to set the LR as high as possible while keeping the frequency of loss spikes small
  31. Learning rate scheduling (1)
     • The LR is controlled throughout the entire training process
       ◦ Traditional configuration: linear warmup + cosine decay
       ◦ LLM-jp-3 adopted cosine scheduling
     • Some studies argue for the advantage of warmup-stable-decay (WSD) scheduling (a sketch follows below)
       ◦ Keep the maximum LR until near the end of training, with a small number of decay steps
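A minimal sketch of a warmup-stable-decay (WSD) schedule as described above; the step counts and the linear shape of the warmup/decay phases are assumptions, not LLM-jp's exact schedule:

```python
def wsd_lr(step, max_lr, warmup_steps, total_steps, decay_steps, min_lr=0.0):
    """Linear warmup, then hold max_lr, then decay linearly over the last decay_steps."""
    if step < warmup_steps:                   # warmup phase
        return max_lr * (step + 1) / warmup_steps
    if step < total_steps - decay_steps:      # stable phase: keep the maximum LR
        return max_lr
    remaining = total_steps - step            # decay phase
    return min_lr + (max_lr - min_lr) * remaining / decay_steps

# Example: 500k steps with 2k warmup steps and 50k decay steps.
for s in (0, 1_000, 100_000, 449_999, 475_000, 499_999):
    print(s, wsd_lr(s, max_lr=3e-4, warmup_steps=2_000,
                    total_steps=500_000, decay_steps=50_000))
```

Compared with cosine decay, which lowers the LR continuously after warmup, WSD holds the maximum LR and compresses the decay into the final steps.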
  32. Learning rate scheduling (2)
     • Ablation study in LLM-jp
       ◦ Conducted very long-run training (~500k steps) to investigate the actual behavior of real training scenarios
       ◦ WSD with enough decay steps achieves higher performance
       ◦ Adopted WSD for LLM-jp-4
     • (Figure: schedulers compared)
  33. Problem with the Adam epsilon
     • The epsilon (ε) hyperparameter of the Adam optimizer
       ◦ Plays an important role in model convergence
         ▪ Should be set to 1e-8
         ▪ It was 1e-5 in the LLaMA 2 technical report, but that setting led to failed training of large models
       ◦ LLM-jp conducted ablation studies and reported the results to the Japanese community
         ▪ Confirmed that some other organizations faced the same problem
         ▪ A huge amount of compute was lost
       ◦ (Figure: 1e-8 converges ~3x faster; 1e-5 shows very slow convergence)
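For reference, the standard AdamW update (not taken from the slide) shows why a large ε hurts: when ε dominates √v̂_t in the denominator, the adaptive scaling is flattened and the effective step size shrinks, which slows convergence.

```latex
\theta_{t+1} = \theta_t
  - \eta \,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
  - \eta\,\lambda\,\theta_t,
\qquad
\hat{m}_t = \frac{m_t}{1-\beta_1^t},\quad
\hat{v}_t = \frac{v_t}{1-\beta_2^t}
```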
  34. Hyperparameter problem → overall delay
     (Gantt-style timeline, 2024-04 through 2024-12: release of the LLM-jp v2 models; initial v3 172B / 13B / 1.8B / 70B runs; the epsilon problem detected & investigated; retries of v3 172B / 13B / 1.8B plus v3 3.7B / 7.2B; release of the LLM-jp-3 models (1.8B, 3.7B, 13B, 172B beta1); parallel tracks for BERT, MoE, and VLM; LLM-jp-3 172B training finished on 2024-12-13)
  35. LLM-jp and open science
     • Encouraging open science
       ◦ LLM-jp is working to share our knowledge with a wide range of people
         ▪ Academic researchers
         ▪ Industry developers
         ▪ Users
     • Materials developed by LLM-jp should also be released publicly, with as few restrictions as possible
       ◦ Transparency of model development
     • Several policies have been formulated to maintain our transparency
  36. Policy of corpus use
     • Corpus Release Level
       ◦ Specifies how each collected corpus may be used in LLM-jp (train/test)

       Release level                      Purpose
       L1: train, search, distribute      May be used for any purpose (if the license allows)
       L2: train, search                  Redistribution prohibited
       L3: train                          Must not be exposed outside LLM-jp
       LX: no-train                       Training prohibited; test-time use only
       LZ: no-use                         Must not be used for any purpose (including illegal data)

     • (Figure: actual release levels of the collected subsets)
     • Only subsets at levels L1, L2, and L3 are used to train LLM-jp models
  37. Policy of model/corpus release (1)
     • FY2024
       ◦ We did not have any written policy for our released materials
       ◦ Licenses were determined through case-by-case discussion
       ◦ Release of LLM-jp-3 172B
         ▪ Under a restricted license, but announced as an "open model"
           • The other LLM-jp-3 variants (~13B) were released under the Apache License 2.0
           • Only the 172B model was released under a special license
         ▪ Received negative attention from the public
  38. Policy of model/corpus release (2)
     • FY2025
       ◦ Conducted an internal policy-making process
       ◦ Released our official licensing policy
         ▪ LLM-jp does not apply restrictive licenses (unless necessary)
         ▪ LLM-jp releases LLM/MMLM instances under the Apache License 2.0
  39. End of lecture
     Homework assignment
     • Pick your favorite LLM
       ◦ Focus only on publicly available models
       ◦ Both proprietary and open models are allowed
     • Investigate the important technical breakthroughs applied to construct the selected LLM
       ◦ Bonus: do such breakthroughs affect your own research topic?
       ◦ Bonus: what is NOT yet achieved by such breakthroughs?
     • Write a 1-page summary of the above survey
       ◦ LLM-assisted writing is allowed, but you must guarantee the soundness of the whole content yourself