
From o1 to DeepSeek: New Scaling Laws for LLMs that Reason

With o1, OpenAI ushered in a new era of LLMs: reasoning capabilities. This new breed of models broadened the concept of scaling laws, shifting focus from **train-time** to **test-time** (or inference-time) compute. How do these models work? What do we think their architectures look like, and what data do we use to train them? And finally, perhaps most importantly: how expensive can they get, and what can we use them for?


Luca Baggi

October 15, 2025

Transcript

  1. From o1 to DeepSeek: New Scaling Laws for LLMs that Reason. 🇮🇹 Codemotion (2025/10/15) 👋 Luca Baggi 👾 AI Engineer @ xtream
  2. 🫧 AI Doomers, 2024 edition. Gary Marcus: “The wall that I once warned about, in 2022, may finally be approaching. One more way to look at this, is this graph I just saw: enormous convergence on GPT-4 level performance, in multiple models released since, yet nothing decisively ahead. […] Reliable, trustworthy AI is surely achievable, but we may need to go back to the drawing board to get there.”
  3. 🎯 Takeaways. After the talk, you will be more familiar with:
     1. How dramatic improvements in inference (hardware and algorithms) enabled a new generation of Large Language Models, shifting the narrative from training-time scaling laws to inference-time scaling laws.
     2. How a reasoning model like DeepSeek-R1 is trained.
  4. 📍 Outline:
     🔙 Training-time scaling laws, a recap
     🚨 Chart crimes
     ♻ Inference-time compute, before it was cool
     🍓 So, what is o1?
     🐳 The whale in the room: enter DeepSeek R1
     🔜 What’s next? From alchemy to reasoning
  5. 🔙 In the previous episode: a recap of training-time scaling laws
     • Before circa 2020, we used to think we could just train bigger models to achieve better performance.
     • This led to hilariously big models that haven’t found the same success as GPT-3.
     • A series of studies challenged this view, outlining that when training models, the dataset size should scale in tandem with the number of parameters.
  6. 🔙 Training-time scaling laws, a recap. Step 1: Oh, don’t forget to scale dataset size eventually (OpenAI). Kaplan et al (2020)
  8. 🔙 Training-time scaling laws, a recap. Step 2: Scale dataset size equally (DeepMind). Hoffmann et al (2022)
  10. 🔙 Training-time scaling laws, a recap. Step 2: Scale dataset size equally (DeepMind). Hoffmann et al (2022): “We predict that for the compute budget used to train Gopher [280B], an optimal model should be 4 times smaller, while being trained on 4 times more tokens. We verify this by training a more compute-optimal 70B model, called Chinchilla, on 1.4 trillion tokens. Not only does Chinchilla outperform its much larger counterpart, Gopher, but its reduced model size reduces inference cost considerably and greatly facilitates downstream uses on smaller hardware.”
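For reference, the Chinchilla analysis fits a parametric loss of the form below (a sketch of Hoffmann et al's "Approach 3"; N is the number of parameters, D the number of training tokens, and the constants are the approximate fitted values reported in the paper):

```latex
% Chinchilla parametric loss, N = parameters, D = training tokens.
% Constants are the approximate fitted values reported by Hoffmann et al (2022).
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\quad A \approx 406.4,\quad B \approx 410.7,\quad \alpha \approx 0.34,\quad \beta \approx 0.28
```

Minimising this loss under a fixed compute budget (roughly 6·N·D training FLOPs) is what yields the prescription to scale N and D in equal proportion, which in practice lands near the ~20 tokens-per-parameter figure in the next slide's table.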
  11. 🔙 Training-time scaling laws, a recap. Training tokens per parameter ratio:

     | Model      | Year | Tokens / parameter |
     |------------|------|--------------------|
     | GPT-3      | 2020 | 1.7                |
     | Chinchilla | 2022 | 20                 |
     | Llama 1    | 2023 | 142                |
     | Llama 2    | 2023 | 284                |
     | Llama 3    | 2024 | 1875               |
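The ratios in the table follow directly from the published model and corpus sizes. A minimal sketch of the arithmetic; the parameter and token counts below are the commonly cited approximate figures (using the ~7-8B member of each Llama family), so treat them as assumptions rather than exact values:

```python
# Tokens-per-parameter ratios from publicly reported (approximate) figures:
# GPT-3 175B on ~300B tokens, Chinchilla 70B on 1.4T, Llama 1/2 7B on 1T/2T, Llama 3 8B on ~15T.
models = {
    # name: (parameters, training tokens)
    "GPT-3":      (175e9, 300e9),
    "Chinchilla": (70e9,  1.4e12),
    "Llama 1":    (7e9,   1e12),
    "Llama 2":    (7e9,   2e12),
    "Llama 3":    (8e9,   15e12),
}

for name, (params, tokens) in models.items():
    print(f"{name:<10} {tokens / params:>7.1f} tokens per parameter")
```

The small discrepancies with the slide (142 vs ~143, 284 vs ~286) come from rounding in the reported token counts.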
  12. 🔙 Training-time scaling laws, a recap. Step 4: Scale the data even more, or the Chinchilla trap (Databricks). Sardana et al (2023)
  13. 🔙 Training-time scaling laws, a recap. Step 4: Scale the data even more, or the Chinchilla trap (Databricks)
     • Accounting for both training and inference, how does one minimize the cost required to produce and serve a high quality model?
     • In other words, to deploy a model that needs to serve requests at scale, you want to train a smaller model for longer.
     • It’s going to be more expensive at the training level, but cheaper at inference (see the sketch below).
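A minimal sketch of that trade-off, using the standard approximations of ~6·N·D FLOPs for training and ~2·N FLOPs per generated token at inference; the lifetime inference volume and the 13B/7.5T configuration are illustrative assumptions, not figures from the paper:

```python
# Rough total-compute comparison: a Chinchilla-optimal model vs a smaller model
# trained on more tokens, once lifetime inference traffic is accounted for.
# Approximations: training ≈ 6 * N * D FLOPs, inference ≈ 2 * N FLOPs per token.

def total_flops(params: float, train_tokens: float, inference_tokens: float) -> float:
    return 6 * params * train_tokens + 2 * params * inference_tokens

inference_tokens = 5e12  # hypothetical lifetime serving volume, for illustration only

chinchilla_style = total_flops(params=70e9, train_tokens=1.4e12, inference_tokens=inference_tokens)
overtrained      = total_flops(params=13e9, train_tokens=7.5e12, inference_tokens=inference_tokens)

print(f"70B @ 1.4T tokens: {chinchilla_style:.2e} total FLOPs")
print(f"13B @ 7.5T tokens: {overtrained:.2e} total FLOPs")
```

With roughly the same training budget, the smaller, longer-trained model ends up far cheaper overall once serving volume grows, which is the point of the "Chinchilla trap".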
  15. 🚨 Chart crimes. Crime solved, with o3 and ARC-AGI: “OpenAI o3 Breakthrough High Score on ARC-AGI-PUB” (20/12/2024)
  16. ♻ Inference-time compute, before it was cool. Something we’ve known since 2021 (a prompt sketch follows this list):
     • Just tell the model to “Take a deep breath and think step by step”
     • [Nye et al, 2021]
     • “[…] reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting […]” [Wei et al, 2022]
     • “[…] we show that LLMs are decent zero-shot reasoners by simply adding 'Let's think step by step' before each answer” [Kojima et al, 2022]
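A minimal sketch of the zero-shot chain-of-thought trick from Kojima et al, assuming an OpenAI-style chat-completions client; the model name is only an example:

```python
from openai import OpenAI  # standard OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

# Zero-shot chain of thought: just append the magic phrase to the prompt.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name, swap in whatever you use
    messages=[{"role": "user", "content": f"{question}\nLet's think step by step."}],
)
print(response.choices[0].message.content)
```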
  17. ♻ Inference-time compute, before it was cool. Two main families of test-time compute:
     1. Search against a Verifier (parallel sampling)
     2. Self-Refinement (sequential revision)
  18. ♻ Inference-time compute, before it was cool. Parallel sampling
     • This approach focuses on generating multiple candidate answers and using a verifier to select the best one.
     • Also simply known as search.
     • It’s the same idea behind AlphaZero, AlphaGo…
  19. ♻ Scaling laws before it was cool. Parallel sampling: some strategies. Scaling test time compute with Open Models (Beeching, Tunstall, Rush 2024)
  20. ♻ Scaling laws before it was cool. Parallel sampling: what’s a verifier? (a best-of-N sketch follows this list)
     • Deterministic processes: heuristics, solvers for equations, linters and unit tests for code…
     • Models:
       • Outcome Reward Models (ORM): trained to give a score to the final outcome
       • Process Reward Models (PRM): trained to give a score to the intermediate steps too (basically, how likely every step is to lead to the correct solution)
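A minimal sketch of best-of-N parallel sampling against a verifier. `generate` is a hypothetical stand-in for an LLM sampling call, and the toy verifier is a deterministic check in the spirit of the first bullet:

```python
import random
from typing import Callable, List

def best_of_n(
    generate: Callable[[str], str],    # stand-in for an LLM sampling call (hypothetical helper)
    verifier: Callable[[str], float],  # higher score = better candidate
    prompt: str,
    n: int = 8,
) -> str:
    """Draw n candidate answers and return the one the verifier scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier)

# Toy usage: a fake sampler and a deterministic verifier that prefers answers close to 7.
fake_sampler = lambda prompt: str(random.randint(1, 10))
closeness_to_seven = lambda answer: -abs(int(answer) - 7)
print(best_of_n(fake_sampler, closeness_to_seven, "What is 3 + 4?"))
```

Swapping the toy check for an Outcome or Process Reward Model gives the ORM/PRM variants from the slide, and richer strategies (beam search over steps, weighted voting) build on this same primitive.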
  21. ♻ Scaling laws before it was cool. Parallel sampling: what’s a verifier?
     • As usual, top research labs were experimenting with this way before we got o1:
       • Training Verifiers to Solve Math Word Problems (OpenAI, 2021)
       • Solving math word problems with process- and outcome-based feedback (DeepMind, 2022)
       • OpenAI does the same one year later: Let’s Verify Step by Step
  22. ♻ Inference-time compute, before it was cool. Self-Refinement
     • Self-Refinement (sequential revision): models iteratively refine their own outputs or “thoughts” by identifying and correcting errors in subsequent iterations (see the sketch below).
     • Likely requires fine-tuning so the model can learn how to “self-correct”.
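A minimal sketch of the sequential-revision loop; `model` is a hypothetical callable standing in for an LLM (prompt in, text out), and the fixed round count is a placeholder stopping criterion:

```python
from typing import Callable

# Self-refinement / sequential revision: the model criticises and then revises its own
# previous answer for a fixed number of rounds. In practice the model is usually
# fine-tuned so that each revision actually improves on the previous attempt.

def self_refine(model: Callable[[str], str], question: str, rounds: int = 3) -> str:
    answer = model(f"Question: {question}\nAnswer step by step.")
    for _ in range(rounds):
        critique = model(
            f"Question: {question}\nProposed answer:\n{answer}\n"
            "List any mistakes in the proposed answer."
        )
        answer = model(
            f"Question: {question}\nPrevious answer:\n{answer}\n"
            f"Critique:\n{critique}\nWrite an improved answer."
        )
    return answer
```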
  23. ♻ Scaling laws before it was cool. Self-Refinement. Recursive introspection: Teaching foundation models how to self-improve
  24. ♻ Scaling laws before it was cool. Self-Refinement
     • Another idea that’s been explored since 2022 at least:
       • STaR: Bootstrapping Reasoning With Reasoning [Zelikman et al, 2022]
       • Beyond human data: Scaling self-training for problem-solving with language models [Singh et al, 2024]
       • Recursive introspection: Teaching foundation models how to self-improve (RISE) [Qu et al, 2024]
  25. 🍓 So, what is o1? It’s definitely parallel sampling
     • OpenAI had (willingly?) spread rumours about o1 for about a year before it came out; the rumours started in November 2023.
     • The project was referred to as Q* [Reuters]. The training procedure was called strawberry, and was supposedly used to train a new model, codenamed Orion (Interconnects AI, September 2024).
  26. 🍓 So, what is o1? It’s definitely parallel sampling
     • “As I’ve dug into this in more detail, I’ve become convinced that they are doing something powerful by searching over language steps via tree-of-thoughts reasoning” (Interconnects AI, November 2023)
  27. 🍓 So, what is o1? Until it isn’t?
     • OpenAI's o1 using "search" was a PSYOP (Interconnects AI, December 2024)
     • People at OpenAI clarified it was “‘just’ an LLM trained with RL”. Does this mean it’s not a system: no search, no verifier?
  28. 🍓 So, what is o1? Until it isn’t?
     • They might tell you in the release post?
     • “Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process.”
     • “Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses.”
  29. 🍓 So, what is o1? Doesn’t this remind you of RISE? (How Reasoning Works, OpenAI documentation)
  30. 🍓 So, what is o1? Still unclear whether it’s search or self-improvement
     • A former OpenAI employee (now at Anthropic) explained that o3 “samples many solutions and uses a learned function [a verifier] to pick the best”
     • Claude 3.7 uses serial test-time compute (i.e., self-improvement), but “Our researchers have also been experimenting with improving the model’s performance using parallel test-time compute. They do this by sampling multiple independent thought processes and selecting the best one […].”
     • “Parallel test-time compute scaling isn’t available in our newly-deployed model”
     • Claude 3.5 was actually already doing “self-talk”
  31. 🐳 The whale in the room: enter DeepSeek R1. Playing catch-up
     • Right at the turn of 2024 and 2025, DeepSeek-R1 was released.
     • The lab had been publishing very high-quality technical reports, praised for their novelties, throughout the year. This paper is no exception.
  32. 🐳 The whale in the room: enter DeepSeek R1. Playing catch-up
     • Most importantly, it uncovered a recipe for self-improvement (nowadays simply called reasoning) with reinforcement learning
     • Only three months after the announcement of the “preview” version of o1, and with performance comparable to o3 (which was released about a month prior)
  33. 🐳 The whale in the room: enter DeepSeek R1. Playing catch-up
     • This learning procedure is now known as reinforcement learning with verifiable rewards (RLVR); a reward sketch follows below.
     • It uses smaller datasets of question-and-answer pairs, with verifiers scoring the answers.
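A minimal sketch of what a verifiable reward can look like for question/answer pairs, in the spirit of the R1 recipe; the `<think>`/`<answer>` format and the 0.1 format bonus are illustrative assumptions, not DeepSeek's published reward values:

```python
import re

# Verifiable reward for RLVR: a rule-based check replaces a learned reward model.
# The tag format and the 0.1 format bonus are assumptions for illustration only.

def rlvr_reward(completion: str, gold_answer: str) -> float:
    reward = 0.0
    # Format reward: did the model wrap its reasoning and final answer in tags?
    if re.search(r"<think>.*</think>\s*<answer>.*</answer>", completion, re.DOTALL):
        reward += 0.1
    # Accuracy reward: exact match of the extracted final answer against the gold one.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == gold_answer.strip():
        reward += 1.0
    return reward

print(rlvr_reward("<think>3 + 4 = 7</think> <answer>7</answer>", "7"))  # 1.1
print(rlvr_reward("The answer is 7", "7"))                              # 0.0
```

Because the reward comes from a cheap, deterministic check rather than a reward model, this scales to large amounts of RL without the usual reward-hacking worries, which is why it works best in math and code, where answers are easy to verify.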
  34. 🐳 The whale in the room: enter DeepSeek R1. Phase zero: bootstrap a reasoning dataset
  35. 🐳 The whale in the room: enter DeepSeek R1. Phase two (synthetic data): teach the question -> answer format
  36. 🐳 The whale in the room: enter DeepSeek R1. Phase three (alignment): train a useful assistant
  37. 🔜 What’s next? What’s in the near future
     • We are just at the beginning of what we can achieve with RL: we expect we can scale RLVR more and more.
     • This was finally possible because we have strong base models (i.e., we figured out pre-training).
     • Though we need to work on verifiers beyond the reasoning, math and code domains (which are easier to verify).
  38. 📚 References. Other references not in the previous slides:
     • Why we think (Weng, 2025)
     • Scaling test time compute with Open Models (Beeching, Tunstall, Rush 2024)
     • Welcome to LLMflation – LLM inference cost is going down fast ⬇
     • Machine Learning Trends (very cool project by EpochAI)