Bigger models or more data? The new scaling laws for LLMs

The famous Chinchilla paper changed the way we train LLMs. The authors - including the current Mistral CEO - outlined the scaling laws to maximise your model performance under a compute budget, balancing the number of parameters and training tokens.

These heuristics are now in jeopardy. LLaMA 3, for one, is trained on an unreasonably large number of tokens - but this is precisely why it's so good. How much data do we actually need to train LLMs? How do we use synthetic data? Will we ever run out of data?

Luca Baggi

June 17, 2024

Transcript

  1. Bigger models or more data? The new scaling laws for LLMs

    👤 Luca Baggi 💼 Machine Learning Engineer @ xtream 🔮 Maintainer @ functime 🇮🇹 AIConf (2024/06/17)
  2. 👾 Let's play a game - A tale of optimism

    You are the CEO of a BigTech™, and you want to jump on the GenAI bandwagon in the splashiest way possible. You gather your board and announce: you want to pre-train a large language model (LLM). The board agrees - resources are not a problem. After all, your business has been thriving since 2021, and your revenues are soaring. You don't really know LLMs in a technical sense, but you know this will actually have a positive impact on your business - it won't just be a marketing stunt.
  3. 👾 Let's play a game - A tale of optimism

    However, the board gives you a spending constraint C, which you can't exceed. You tell yourself that's not bad: after all, you just need to gather a bunch of data and make the model as big as you can under that constraint. You send an email to your data science team and ask them for some estimates. The reply comes in just two hours later. You are excited. Then you notice they have booked a two-hour slot on your calendar. They ask you to bring paper and a pencil - a sharp one. In the attachments, you see a long list of academic papers.
  4. ๐Ÿ“Outline ๐Ÿ’ธ How can I spend my budget compute? ๐Ÿ‹

    History of models, by their weights ๐Ÿง‘๐Ÿณ How do I train a large language model? ๐Ÿญ What is even a Chinchilla? ๐Ÿฆ™ Not just one LLaMA, but three ๐Ÿ“ˆ The latest trends Or, how science is made
  5. 🎯 Takeaways

    After the talk, you will be more familiar with: 1. How we used to train models and how big we made them 2. How a large language model (LLM) is trained 3. The Chinchilla paper, which introduced the scaling laws 4. How we are updating the rules 5. How much data current (open) models need, and how to process it 6. How models like LLaMA just don't care
  6. 💸 How can I spend my compute budget?

    Just two ingredients and one goal: argmin_{N,D} L(N, D) s.t. FLOPs(N, D) = C, where N is the number of parameters, D the number of training tokens, L(N, D) the pretraining loss to minimise, and C the compute budget you cannot exceed.
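To make the constrained minimisation above concrete, here is a minimal Python sketch of the compute-optimal split. It assumes the standard FLOPs(N, D) ≈ 6·N·D training-cost approximation and the roughly 20-tokens-per-parameter ratio implied by the Chinchilla results; the function name and the example budget are illustrative, not taken from the paper.

```python
# Minimal sketch: split a compute budget C between parameters (N) and tokens (D).
# Assumptions: FLOPs(N, D) ~= 6 * N * D (standard training-cost approximation)
# and the Chinchilla finding that D/N ~= 20 is roughly compute-optimal.

import math

TOKENS_PER_PARAM = 20       # approximate Chinchilla-optimal tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6   # forward + backward cost per parameter per token


def compute_optimal_allocation(flops_budget: float) -> tuple[float, float]:
    """Return (parameters N, training tokens D) for a given FLOPs budget C.

    Solves 6 * N * D = C with D = 20 * N, i.e. N = sqrt(C / 120).
    """
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Example budget: roughly the compute of a Chinchilla-scale run
    # (assumed here for illustration).
    n, d = compute_optimal_allocation(5.76e23)
    print(f"N ~= {n / 1e9:.0f}B parameters, D ~= {d / 1e12:.1f}T tokens")
```

Under these assumptions the split comes out around 70B parameters and 1.4T tokens, which is the regime the Chinchilla paper itself targeted.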
  7. 🦙 Not just one LLaMA, but three - Or, how to blow the scaling laws out of the water

    The LLaMA models go well beyond the optimum found by the scaling laws. All LLaMA 2 models, including the 7B, use 2T tokens (2,000B). In other words, LLaMA 2 7B uses the same data budget as Gopher (~10x larger). LLaMA 3 8B uses 15T tokens: over 7 times as much as its predecessor.
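A quick sanity check of those numbers (a sketch in Python; the figures come from the slide above plus the ~20 tokens-per-parameter Chinchilla rule of thumb) shows just how far past the compute-optimal ratio these models go:

```python
# Back-of-the-envelope: tokens per parameter for the models mentioned above,
# compared with the ~20 tokens/parameter Chinchilla-optimal rule of thumb.

CHINCHILLA_RATIO = 20  # approximate compute-optimal tokens per parameter

models = {
    # name: (parameters, training tokens)
    "LLaMA 2 7B": (7e9, 2e12),
    "LLaMA 3 8B": (8e9, 15e12),
}

for name, (params, tokens) in models.items():
    ratio = tokens / params
    print(f"{name}: {ratio:.0f} tokens/param "
          f"(~{ratio / CHINCHILLA_RATIO:.0f}x the Chinchilla-optimal ratio)")
```

LLaMA 2 7B lands around 286 tokens per parameter and LLaMA 3 8B around 1,875, i.e. roughly 14x and 94x the Chinchilla-optimal ratio.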
  8. 🦙 Not just one LLaMA, but three

    And both LLaMA 2 and 3 could have been trained for even longer.
  9. 📈 The latest trends - Are we done with very big models?

    No. There's a LLaMA 400B still being trained. We don't know much about Gemini 2/Claude 3 or GPT-4(o), but we are pretty sure they are much, much bigger. NVIDIA just released Nemotron-4 340B (also the reward model!).
  10. 📈 The latest trends - How can we train bigger and bigger models?

    We should just scale data appropriately: perhaps even more than we might think. This means more data, with some degree of repetition (i.e., longer training runs). What about synthetic data? It can be thrown into the mix, but might be insufficient on its own. For example, Microsoft's Phi series has strong benchmark scores, but does not pass the vibe check.
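The "more data, with some degree of repetition" point can be made concrete with a toy model of diminishing returns, in the spirit of the Scaling Data-Constrained Language Models paper cited below: repeated epochs count as progressively less effective data. The exponential-saturation form and the decay constant here are illustrative assumptions, not the paper's fitted values.

```python
import math

# Toy sketch: tokens seen again on later epochs are worth less than fresh
# tokens, with an exponentially saturating contribution. This is an assumed
# illustration of the diminishing-returns finding, not the paper's fit.

R_STAR = 15.0  # controls how quickly repeated epochs stop helping (assumed)


def effective_tokens(unique_tokens: float, epochs: float) -> float:
    """Effective data for `epochs` passes over `unique_tokens` unique tokens.

    epochs = 1 means every token is seen once; repetitions beyond the first
    epoch contribute with diminishing returns.
    """
    repetitions = max(epochs - 1.0, 0.0)
    return unique_tokens * (1.0 + R_STAR * (1.0 - math.exp(-repetitions / R_STAR)))


if __name__ == "__main__":
    unique = 1e12  # 1T unique tokens available
    for epochs in (1, 2, 4, 8, 16, 64):
        eff = effective_tokens(unique, epochs)
        print(f"{epochs:>3} epochs -> ~{eff / 1e12:.2f}T effective tokens "
              f"(vs {epochs * unique / 1e12:.0f}T raw tokens seen)")
```

In this toy model the first few repetitions are almost as valuable as fresh data, while very long repetition schedules saturate: the qualitative behaviour behind "longer training runs" on limited unique data.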
  11. 📚 References - Used to prepare this talk, and more

    • Large Language Models: A New Moore's Law?
    • chinchilla's wild implications
    • Chinchilla Explained: How to read DeepMind's paper on Compute-Optimal Scaling Laws
    • Training Compute-Optimal Large Language Models
    • Scaling Laws for Autoregressive Generative Modeling
    • Scaling Data-Constrained Language Models