Bigger models or more data? The new scaling laws for LLMs

The famous Chinchilla paper changed the way we train LLMs. The authors - including the current Mistral CEO - outlined the scaling laws to maximise your model performance under a compute budget, balancing the number of parameters and training tokens.

These heuristics are now in jeopardy. LLaMA 3, for one, is trained on an unreasonably large number of tokens - but this is precisely why it's so good. How much data do we actually need to train LLMs? How do we use synthetic data? Will we ever run out of data?

Luca Baggi

June 17, 2024

Transcript

  1. Bigger models or more data? The new scaling laws for LLMs

    👤 Luca Baggi 💼 Machine Learning Engineer @ xtream 🔮 Maintainer @ functime 🇮🇹 AIConf (2024/06/17)
  2. 👾 Let's play a game - A tale of optimism

    You are the CEO of a BigTech™, and you want to jump on the GenAI bandwagon in the splashiest way possible. You gather your board and announce: you want to pre-train a large language model (LLM). The board agrees - resources are not a problem. After all, your business has been thriving since 2021, and your revenues are soaring. You don't really know LLMs in a technical sense, but you know this will actually have a positive impact on your business - it won't just be a marketing stunt.
  3. 👾 Let's play a game - A tale of optimism

    However, the board gives you a spending constraint C, which you can't exceed. You tell yourself that's not bad: after all, you just need to gather a bunch of data and make the model as big as you can under that constraint. You send an email to your data science team and ask them for some estimates. The reply comes in just two hours later. You are excited. Then you notice they have booked a two-hour slot on your calendar. They ask you to bring paper and a pencil - a sharp one. In the attachments, you see a long list of academic papers.
  4. ๐Ÿ“Outline ๐Ÿ’ธ How can I spend my budget compute? ๐Ÿ‹

    History of models, by their weights ๐Ÿง‘๐Ÿณ How do I train a large language model? ๐Ÿญ What is even a Chinchilla? ๐Ÿฆ™ Not just one LLaMA, but three ๐Ÿ“ˆ The latest trends Or, how science is made
  5. 🎯 Takeaways

    After the talk, you will be more familiar with: 1. How we used to train models and how big we made them 2. How a large language model (LLM) is trained 3. The Chinchilla paper, which introduced the scaling laws 4. How we are updating the rules 5. How much data current (open) models need, and how to process it 6. How models like LLaMA just don't care
  6. 💸 How can I spend my compute budget?

    Just two ingredients and one goal: argmin_{N,D} L(N, D) s.t. FLOPs(N, D) = C, where N is the number of parameters, D the number of training tokens, L(N, D) the pretraining loss to minimise, and C the compute budget you cannot exceed.
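To make the constrained minimisation above concrete, here is a minimal Python sketch of the compute-optimal split. It assumes the standard FLOPs(N, D) ≈ 6·N·D training-cost approximation and the roughly 20-tokens-per-parameter ratio implied by the Chinchilla results; the function name and the example budget are illustrative, not taken from the paper.

```python
# Minimal sketch: split a compute budget C between parameters (N) and tokens (D).
# Assumptions: FLOPs(N, D) ~= 6 * N * D (standard training-cost approximation)
# and the Chinchilla finding that D/N ~= 20 is roughly compute-optimal.

import math

TOKENS_PER_PARAM = 20       # approximate Chinchilla-optimal tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6   # forward + backward cost per parameter per token


def compute_optimal_allocation(flops_budget: float) -> tuple[float, float]:
    """Return (parameters N, training tokens D) for a given FLOPs budget C.

    Solves 6 * N * D = C with D = 20 * N, i.e. N = sqrt(C / 120).
    """
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Example budget: roughly the compute of a Chinchilla-scale run
    # (assumed here for illustration).
    n, d = compute_optimal_allocation(5.76e23)
    print(f"N ~= {n / 1e9:.0f}B parameters, D ~= {d / 1e12:.1f}T tokens")
```

Under these assumptions the split comes out around 70B parameters and 1.4T tokens, which is the regime the Chinchilla paper itself targeted.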
  7. 🦙 Not just one LLaMA, but three - Or, how to blow the scaling laws out of the water

    The LLaMA models go well beyond the optimum found by the scaling laws. All LLaMA 2 models, including the 7B, use 2T tokens (2,000B). In other words, LLaMA 2 7B uses the same data budget as Gopher (~10x larger). LLaMA 3 8B uses 15T tokens: over 7 times as much as its predecessor.
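A quick sanity check of those numbers (a sketch in Python; the figures come from the slide above plus the ~20 tokens-per-parameter Chinchilla rule of thumb) shows just how far past the compute-optimal ratio these models go:

```python
# Back-of-the-envelope: tokens per parameter for the models mentioned above,
# compared with the ~20 tokens/parameter Chinchilla-optimal rule of thumb.

CHINCHILLA_RATIO = 20  # approximate compute-optimal tokens per parameter

models = {
    # name: (parameters, training tokens)
    "LLaMA 2 7B": (7e9, 2e12),
    "LLaMA 3 8B": (8e9, 15e12),
}

for name, (params, tokens) in models.items():
    ratio = tokens / params
    print(f"{name}: {ratio:.0f} tokens/param "
          f"(~{ratio / CHINCHILLA_RATIO:.0f}x the Chinchilla-optimal ratio)")
```

LLaMA 2 7B lands around 286 tokens per parameter and LLaMA 3 8B around 1,875, i.e. roughly 14x and 94x the Chinchilla-optimal ratio.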
  8. 🦙 Not just one LLaMA, but three

    And both LLaMA 2 and 3 could have been trained for even longer.
  9. 📈 The latest trends - Are we done with very big models?

    No. There's a LLaMA 400B still being trained. We don't know much about Gemini 2/Claude 3 or GPT-4(o), but we are pretty sure they are much, much bigger. NVIDIA just released Nemotron-4 340B (also the reward model!).
  10. 📈 The latest trends - How can we train bigger and bigger models?

    We should just scale data appropriately: perhaps even more than we might think. This means more data, with some degree of repetition (i.e., longer training runs). What about synthetic data? It can be thrown into the mix, but might be insufficient on its own. For example, Microsoft's Phi series has strong benchmark scores, but does not pass the vibe check.
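The "more data, with some degree of repetition" point can be made concrete with a toy model of diminishing returns, in the spirit of the Scaling Data-Constrained Language Models paper cited below: repeated epochs count as progressively less effective data. The exponential-saturation form and the decay constant here are illustrative assumptions, not the paper's fitted values.

```python
import math

# Toy sketch: tokens seen again on later epochs are worth less than fresh
# tokens, with an exponentially saturating contribution. This is an assumed
# illustration of the diminishing-returns finding, not the paper's fit.

R_STAR = 15.0  # controls how quickly repeated epochs stop helping (assumed)


def effective_tokens(unique_tokens: float, epochs: float) -> float:
    """Effective data for `epochs` passes over `unique_tokens` unique tokens.

    epochs = 1 means every token is seen once; repetitions beyond the first
    epoch contribute with diminishing returns.
    """
    repetitions = max(epochs - 1.0, 0.0)
    return unique_tokens * (1.0 + R_STAR * (1.0 - math.exp(-repetitions / R_STAR)))


if __name__ == "__main__":
    unique = 1e12  # 1T unique tokens available
    for epochs in (1, 2, 4, 8, 16, 64):
        eff = effective_tokens(unique, epochs)
        print(f"{epochs:>3} epochs -> ~{eff / 1e12:.2f}T effective tokens "
              f"(vs {epochs * unique / 1e12:.0f}T raw tokens seen)")
```

In this toy model the first few repetitions are almost as valuable as fresh data, while very long repetition schedules saturate: the qualitative behaviour behind "longer training runs" on limited unique data.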
  11. 📚 References - Used to prepare this talk, and more

    • Large Language Models: A New Moore's Law?
    • chinchilla's wild implications
    • Chinchilla Explained: How to read DeepMind's paper on Compute-Optimal Scaling Laws
    • Training Compute-Optimal Large Language Models
    • Scaling Laws for Autoregressive Generative Modeling
    • Scaling Data-Constrained Language Models