Slide 13
Challenges in developing LLMs
• Data
  • A huge amount of text data is required for training
    • Trillions of tokens must be prepared (see the size estimate below)
    • e.g., LLaMA 2: 2T tokens; LLaMA 3: 15T tokens
  • Collecting data is challenging, especially for non-English languages
    • Only ~1T tokens of open data are available in Japanese
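As a back-of-envelope check of what "15T tokens" means on disk, here is a minimal sketch; the ~4 bytes of UTF-8 text per token is an assumed ballpark for BPE tokenizers, not a figure from the slide:

```python
# Rough raw-text size for LLaMA-3-scale pretraining data.
TOKENS = 15e12       # LLaMA 3: 15T tokens (from the slide)
BYTES_PER_TOKEN = 4  # assumption: rough BPE average for English text

raw_tb = TOKENS * BYTES_PER_TOKEN / 1e12
print(f"~{raw_tb:.0f} TB of raw text")  # -> ~60 TB
```

Multilingual and noisy web text is often less token-efficient, so the real figure varies with the corpus and tokenizer.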
• Compute
  • A huge computing cluster is required to handle training jobs
    • GPT-3-scale models (175B parameters) require hundreds to thousands of H100 GPUs to train (see the compute estimate below)
    • Even small models (1B) require tens of H100 GPUs to train in a reasonable time
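To see why those GPU counts follow, here is a sketch using the common C ≈ 6·N·D training-FLOPs approximation; the token count, H100 peak throughput, and utilization are illustrative assumptions, not figures from the slide:

```python
# Training-compute estimate via C ≈ 6 * N (params) * D (tokens).
N = 175e9           # GPT-3-scale parameter count (from the slide)
D = 300e9           # assumption: GPT-3 trained on ~300B tokens
flops = 6 * N * D   # ~3.15e23 FLOPs in total

H100_BF16 = 989e12  # H100 SXM dense BF16 peak, FLOP/s
MFU = 0.40          # assumption: ~40% model-FLOPs utilization

for gpus in (256, 1024):
    days = flops / (H100_BF16 * MFU * gpus) / 86400
    print(f"{gpus:4d} H100s -> ~{days:.0f} days")
# -> ~36 days on 256 GPUs, ~9 days on 1024 GPUs
```

Halving wall-clock time means doubling GPUs, which is why month-scale runs at this model size push clusters toward thousands of H100s.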
• Engineering
  • Human experts are also required to handle large-scale data collection, develop and manage training pipelines, and operate computing resources
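As one concrete taste of that engineering work, below is a minimal, hypothetical sketch of a single data-pipeline stage (exact-duplicate removal by content hash); real pipelines chain many such stages (language ID, quality filtering, fuzzy deduplication) and run them distributed:

```python
# Minimal sketch of one pipeline stage: exact-duplicate removal.
# Illustrative only; production systems also use fuzzy methods
# such as MinHash and run across many machines.
import hashlib
from typing import Iterable, Iterator

def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    seen: set[bytes] = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest not in seen:  # keep the first occurrence only
            seen.add(digest)
            yield doc

print(list(dedup_exact(["a cat", "a cat", "a dog"])))
# -> ['a cat', 'a dog']
```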