


From o1 to DeepSeek: New Scaling Laws for LLMs that Reason

With o1, OpenAI ushered in a new era of LLMs: reasoning capabilities. This new breed of models broadened the concept of scaling laws, shifting focus from **train-time** to **test-time** (or inference-time) compute. How do these models work? What do we think their architectures look like, and what data do we use to train them? And finally - perhaps most importantly - how expensive can they get, and what can we use them for?


Luca Baggi

October 15, 2025


Transcript

  1. From o1 to DeepSeek: New Scaling Laws for LLMs that Reason
     🇱🇹 Vilnius (2026/04/09) 👋 Luca Baggi 👾 AI Engineer @ xtream
  2. 📍 Outline
     🔙 In the previous episode: a recap of training-time scaling laws
     🚨 Chart crimes
     ⛓💥 “Take a deep breath and think step by step”
     ♻ Scaling laws before it was cool
     🍓 So, what is o1?
     🐳 The whale in the room: enter DeepSeek R1
     🔜 What’s next? From alchemy to reasoning
  3. 🎯 Takeaways
     After the talk, you will be more familiar with:
     1. Test-time compute is enabled by dramatic decreases in inference costs
     2. The two main “families” of test-time compute strategies:
        i. Parallel generations (search) with verifiers
        ii. Self-improvement, usually achieved via fine-tuning
  4. 🎯 Takeaways
     After the talk, you will be more familiar with:
     3. Two likely hypotheses for how o1 works
     4. Why DeepSeek-R1 matters:
        i. How it was trained
        ii. How we can achieve self-improvement (reasoning) with reinforcement learning with verifiable rewards (RLVR)
  5. 🔙 In the previous episode
     A recap of training-time scaling laws
     • Before circa 2020, we used to think we could just train bigger models to achieve better performance.
     • This led to hilariously big models that never found the same success as GPT-3.
     • A series of studies challenged this view, showing that when training models, the dataset size should scale in tandem with the number of parameters.
  6. 🔙 In the previous episode
     Step 1: Oh, don’t forget to scale dataset size eventually (OpenAI)
     Kaplan et al (2020)
  7. 🔙 In the previous episode
     Step 2: Scale dataset size equally (DeepMind)
     Hoffmann et al (2022)
  8. 🔙 In the previous episode
     Step 3: Just scale the data™ (Meta)
     Thomas Shalom @ Latent Space Podcast
     • “My intuition is that the web is full of 💩 in terms of text, and training on those tokens is a waste of compute.”
     • “Llama 3 post-training doesn't have any human written answers there basically… It's just leveraging pure synthetic data from Llama 2.”
  9. 🔙 In the previous episode
     Post-Chinchilla Scaling Laws, or the Chinchilla Trap (Databricks)
     • “[…] Chinchilla scaling laws, neglect to include the cost of inference”
     • Following the Chinchilla Scaling Laws leads to the "Chinchilla Trap", whereby you end up with a model that is way too large and therefore expensive to run at scale at inference time.
  10. 🔙 In the previous episode
     Post-Chinchilla Scaling Laws, or the Chinchilla Trap (Databricks)
     • In other words, to deploy a model that needs to serve requests at scale, you want to train a smaller model for longer.
     • It’s going to be more expensive at the training level, but cheaper at inference.
  11. 🔙 In the previous episode
     A recap, in figures

     | Model      | Year | Tokens/parameter |
     |------------|------|------------------|
     | GPT-3      | 2020 | 1.7              |
     | Chinchilla | 2022 | 20               |
     | Llama 1    | 2023 | 142              |
     | Llama 2    | 2023 | 284              |
     | Llama 3    | 2024 | 1875             |
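The tokens-per-parameter ratios on this slide can be reproduced from the commonly cited public training figures. A minimal sketch; the token and parameter counts below are approximate public numbers, not taken from the slides:

```python
# Approximate public figures: (training tokens, parameter count).
# Llama 1/2/3 use the smallest model of each family.
models = {
    "GPT-3":      (300e9,  175e9),
    "Chinchilla": (1.4e12, 70e9),
    "Llama 1":    (1.0e12, 7e9),
    "Llama 2":    (2.0e12, 7e9),
    "Llama 3":    (15e12,  8e9),
}

# Tokens per parameter: how much data the model saw relative to its size.
for name, (tokens, params) in models.items():
    print(f"{name:10s} {tokens / params:8.1f} tokens/parameter")
```

The trend is the point: post-Chinchilla models are trained on one to two orders of magnitude more tokens per parameter, trading extra training compute for cheaper inference.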
  12. 🚨 Chart crimes
     Crime solved - with o3 and ARC-AGI
     OpenAI o3 Breakthrough High Score on ARC-AGI-PUB (20/12/2024)
  13. ⛓💥 “Take a deep breath and think step by step”
     Something we’ve known since 2021
     • Just prompt the model to plan before it answers [Nye et al, 2021]
     • “[…] reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting […]” [Wei et al, 2022]
     • “[…] we show that LLMs are decent zero-shot reasoners by simply adding 'Let's think step by step' before each answer” [Kojima et al, 2022]
  14. ♻ Scaling laws before it was cool
     Two main families of test-time compute
     • Search against a Verifier (parallel sampling): This approach focuses on generating multiple candidate answers and using a verifier to select the best one.
     • Also simply known as search. It’s the same idea behind AlphaZero, AlphaGo…
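The search-against-a-verifier family reduces to a best-of-n loop: sample many candidates, score each with a verifier, keep the winner. A minimal sketch, where both `generate` and `verifier_score` are hypothetical stand-ins (a real system would call an LLM and an ORM or deterministic checker):

```python
import random

def generate(prompt):
    """Stand-in for an LLM call: returns a random guess.
    In practice this samples a completion at some temperature."""
    return random.randint(0, 100)

def verifier_score(prompt, answer):
    """Stand-in for a verifier (a reward model or deterministic check).
    Higher is better; here we pretend 42 is the checkable target."""
    return -abs(answer - 42)

def best_of_n(prompt, n=16):
    """Parallel sampling: draw n candidates, keep the verifier's favourite."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: verifier_score(prompt, a))

print(best_of_n("What is 6 * 7?"))
```

The test-time compute knob is simply `n`: more samples cost more inference but give the verifier more candidates to choose from.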
  15. ♻ Scaling laws before it was cool
     Two main families of test-time compute
     • Self-Refinement (sequential revision): Models iteratively refine their own outputs or “thoughts” by identifying and correcting errors in subsequent iterations.
     • Likely requires fine-tuning so the model can learn how to “self-correct”.
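Sequential revision has a different shape from parallel sampling: one chain of critique-and-revise steps instead of many independent samples. A toy sketch, with `refine` and `critique` as hypothetical stand-ins for the model revising and judging its own output:

```python
def refine(answer, feedback):
    """Stand-in for the model revising its answer given feedback."""
    return answer + 1  # toy 'improvement' step

def critique(answer, target=10):
    """Stand-in for self-critique: feedback string, or None if satisfied."""
    return None if answer >= target else "try a larger value"

def self_refine(initial_answer, max_rounds=20):
    """Sequential revision: critique and revise until done or out of budget."""
    answer = initial_answer
    for _ in range(max_rounds):
        feedback = critique(answer)
        if feedback is None:
            break
        answer = refine(answer, feedback)
    return answer

print(self_refine(0))  # → 10
```

Here the test-time compute knob is `max_rounds`; the slide's point is that, unlike search, this only helps if the model has been fine-tuned to actually produce useful critiques of itself.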
  16. ♻ Scaling laws before it was cool
     Parallel sampling: some strategies
     Scaling test time compute with Open Models (Beeching, Tunstall, Rush 2024)
  17. ♻ Scaling laws before it was cool
     Parallel sampling: What’s a verifier?
     • Deterministic processes: heuristics, solvers for equations, linters and unit tests for code…
     • Models:
        • Outcome Reward Models (ORM): trained to give a score to the final outcome
        • Process Reward Models (PRM): trained to give a score to the intermediate steps too (basically, how likely every step is to lead to the correct solution)
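The ORM/PRM distinction is where the scoring happens. A toy sketch; `score_step` is a hypothetical per-step scorer (a real PRM is a trained model), and aggregating by product is one common choice, not the only one:

```python
def score_step(step):
    """Hypothetical per-step scorer in [0, 1]; a real PRM is a trained model."""
    return 0.1 if "error" in step else 0.9

def orm_score(solution_steps):
    """Outcome reward model stand-in: only the final answer is scored."""
    return score_step(solution_steps[-1])

def prm_score(solution_steps):
    """Process reward model stand-in: every intermediate step is scored,
    then aggregated (here by product: every step must look plausible)."""
    score = 1.0
    for step in solution_steps:
        score *= score_step(step)
    return score

solution = ["2 + 2 = 5, an error", "so the answer is 12"]
print(orm_score(solution))  # the ORM never sees the bad intermediate step
print(prm_score(solution))  # the PRM penalises the whole chain
```

This is why PRMs tend to rank candidates better: a solution with a broken middle step can still end on a plausible-looking answer, and only step-level scoring catches that.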
  18. ♻ Scaling laws before it was cool
     Parallel sampling: What’s a verifier?
     • As usual, top research labs were experimenting with this way before we got o1:
     • Training Verifiers to Solve Math Word Problems (OpenAI, 2021)
        • Introduces the GSM8K dataset used for evaluations
        • Actually about training an outcome reward model (the verifier) to score answers to math problems
     • Solving math word problems with process- and outcome-based feedback (DeepMind, 2022)
        • Compares outcome reward models and process reward models on GSM8K
        • OpenAI does the same one year later: Let’s Verify Step by Step
  19. ♻ Scaling laws before it was cool
     Parallel sampling: What’s a verifier?
     • Google went back on this and derived scaling laws on how to trade test-time compute with pretraining compute:
     • Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Snell et al, 2024)
     • “[Y]ou can use smaller models or models that haven't been pre-trained for as long, and boost their performance using the test-time strategies […]. For hard problems the authors find that pre-training is likely to be more effective.” Scaling LLM Test Time Compute (2024) (blog)
  20. ♻ Scaling laws before it was cool
     Parallel sampling: conclusions
     • In general, PRMs seem to perform best.
     • However, there aren’t a lot of open PRMs around.
     • Reasoning problems, in contrast to general chat or writing requests, can be automatically verified or labeled.
  21. ♻ Scaling laws before it was cool
     Parallel sampling: conclusions
     • You can generate lots of synthetic data, and verify it - without human annotations.
     • This is a necessary condition to enable scaling these processes to training large language models.
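The synthetic-data pipeline this slide describes is essentially rejection sampling: generate candidates, keep only what the verifier accepts, and the survivors become training pairs with no human labels. A minimal sketch, where `sample` and `verify` are hypothetical stand-ins (here `sample` just evaluates the arithmetic, so every candidate happens to be correct):

```python
def build_training_set(problems, sample, verify, n_per_problem=8):
    """Rejection sampling: keep only generations the verifier accepts,
    producing (problem, solution) pairs with no human annotation."""
    dataset = []
    for problem, answer in problems:
        for _ in range(n_per_problem):
            candidate = sample(problem)
            if verify(candidate, answer):
                dataset.append((problem, candidate))
    return dataset

problems = [("2 + 2", "4"), ("3 * 3", "9")]

def sample(problem):
    # Toy stand-in for an LLM; a real pipeline samples free-form solutions.
    return str(eval(problem))

def verify(candidate, answer):
    return candidate == answer

data = build_training_set(problems, sample, verify, n_per_problem=2)
print(len(data))  # → 4
```

With a real model most candidates fail verification; the pipeline still scales because generating and checking are both automatic.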
  22. ♻ Scaling laws before it was cool
     Self-Refinement
     • Another idea that’s been explored since 2022 at least:
     • STaR: Bootstrapping Reasoning With Reasoning [Zelikman et al, 2022]
     • Beyond human data: Scaling self-training for problem-solving with language models [Singh et al, 2024]
     • Recursive introspection: Teaching foundation models how to self-improve (RISE) [Qu et al, 2024]
  23. ♻ Scaling laws before it was cool
     Self-Refinement
     Recursive introspection: Teaching foundation models how to self-improve
  24. 🍓 So, what is o1?
     It’s definitely parallel sampling
     • OpenAI had (willingly?) spread rumours about o1 for about a year before it came out, starting in November 2023.
     • The project was referred to as Q* [Reuters]. The training procedure was called Strawberry, and was supposedly used to train a new model, codenamed Orion (Interconnects AI, September 2024).
  25. 🍓 So, what is o1?
     It’s definitely parallel sampling
     • “As I’ve dug into this in more detail, I’ve become convinced that they are doing something powerful by searching over language steps via tree-of-thoughts reasoning” (Interconnects AI, November 2023)
  26. 🍓 So, what is o1?
     Until it isn’t?
     • OpenAI's o1 using "search" was a PSYOP (Interconnects AI, December 2024)
     • People at OpenAI clarified it was “‘just’ an LLM trained with RL” - does this mean it’s not a system: no search, no verifier?
  27. 🍓 So, what is o1?
     Until it isn’t?
     • They might tell you in the release post?
     • “Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process.”
     • “Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses.”
  28. 🍓 So, what is o1?
     Doesn’t this remind you of RISE?
     How Reasoning Works, OpenAI documentation
  29. 🍓 So, what is o1?
     Still unclear whether it’s search or self-improvement
     • A former OpenAI employee (now at Anthropic) explained that o3 “samples many solutions and uses a learned function [a verifier] to pick the best”
     • Claude 3.7 uses serial test-time compute (i.e., self-improvement), but “Our researchers have also been experimenting with improving the model’s performance using parallel test-time compute. They do this by sampling multiple independent thought processes and selecting the best one […].”
     • “Parallel test-time compute scaling isn’t available in our newly-deployed model”
     • Claude 3.5 was actually already doing “self-talk”
  30. 🐳 The whale in the room: enter DeepSeek R1
     Playing catch-up
     • Right at the turning point between 2024 and 2025, DeepSeek-R1 was released
     • The lab had been publishing very high-quality technical reports, praised for their novelties, throughout the year. This paper is no exception
  31. 🐳 The whale in the room: enter DeepSeek R1
     Playing catch-up
     • Most importantly, it uncovered a recipe for self-improvement (nowadays simply called reasoning) with reinforcement learning
     • Only three months after the announcement of the “preview” version of o1, and with performance comparable to o3 (which was released about a month prior)
  32. 🐳 The whale in the room: enter DeepSeek R1
     Playing catch-up
     • This learning procedure is now known as reinforcement learning with verifiable rewards (RLVR)
     • It uses smaller datasets of question and answer pairs, using verifiers to score the answers.
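The "verifiable reward" in RLVR can be as simple as an exact-match check against a reference answer. A minimal sketch; the `\boxed{...}` answer format is an assumption borrowed from common math-benchmark conventions, not something the talk specifies:

```python
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """RLVR-style reward sketch: 1.0 if the final answer matches the
    reference, else 0.0. Assumes the model was prompted to put its final
    answer inside \\boxed{...} (an assumed convention, not from the talk)."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parseable answer: no reward
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

print(verifiable_reward(r"... so the answer is \boxed{42}", "42"))  # → 1.0
```

Because the reward is computed by a deterministic check rather than a learned reward model, it cannot be gamed the way a neural verifier can, which is a big part of why the recipe scales.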
  33. 🔜 What’s next?
     A recap, and what’s in the near future
     • We are just at the beginning of what we can achieve with RL: we expect we can scale RLVR more and more.
     • This was finally possible because we have strong base models (i.e., we figured out pre-training).
     • Though we need to work on verifiers beyond the reasoning, math and code domains (which are easier to verify).
     • Does RL elicit existing capabilities in the latent space of the model, or uncover new ones? It seems like the latter.
  34. 📚 References
     Other references not in the previous slides
     • Why we think (Weng, 2025)
     • Scaling test time compute with Open Models (Beeching, Tunstall, Rush 2024)
     • Welcome to LLMflation – LLM inference cost is going down fast ⬇
     • Machine Learning Trends (very cool project by EpochAI)