


From o1 to DeepSeek: New Scaling Laws for LLMs that Reason

With o1, OpenAI ushered in a new era of LLMs: reasoning capabilities. This new breed of models broadened the concept of scaling laws, shifting focus from **train-time** to **test-time** (or inference-time) compute. How do these models work? What do we think their architectures look like, and what data do we use to train them? And finally - perhaps most importantly - how expensive can they get, and what can we use them for?


Luca Baggi

October 15, 2025


Transcript

  1. From o1 to DeepSeek: New Scaling Laws for LLMs that Reason
     🇱🇹 Vilnius (2026/04/09) 👋 Luca Baggi 👾 AI Engineer @ xtream
  2. 📍 Outline
     🔙 In the previous episode: a recap of training-time scaling laws
     🚨 Chart crimes
     ⛓💥 “Take a deep breath and think step by step”
     ♻ Scaling laws before it was cool
     🍓 So, what is o1?
     🐳 The whale in the room: enter DeepSeek R1
     🔜 What’s next? From alchemy to reasoning
  3. 🎯 Takeaways
     After the talk, you will be more familiar with:
     1. Test-time compute is enabled by dramatic decreases in inference costs
     2. The two main “families” of test-time compute strategies:
        i. Parallel generations (search) with verifiers
        ii. Self-improvement, usually achieved via fine-tuning
  4. 🎯 Takeaways
     After the talk, you will be more familiar with:
     3. Two likely hypotheses for how o1 works
     4. Why DeepSeek-R1 matters:
        i. How it was trained
        ii. How we can achieve self-improvement (reasoning) with reinforcement learning with verifiable rewards (RLVR)
  5. 🔙 In the previous episode
     A recap of training-time scaling laws
     • Before circa 2020, we used to think we could just train bigger models to achieve better performance.
     • This led to hilariously big models that never found the same success as GPT-3.
     • A series of studies challenged this view, showing that when training models, the dataset size should scale in tandem with the number of parameters.
  6. 🔙 In the previous episode
     Step 1: Oh, don’t forget to scale dataset size eventually (OpenAI)
     Kaplan et al (2020)
  7. 🔙 In the previous episode
     Step 2: Scale dataset size equally (DeepMind)
     Hoffmann et al (2022)
  8. 🔙 In the previous episode
     Step 3: Just scale the data™ (Meta)
     Thomas Shalom @ Latent Space Podcast
     • “My intuition is that the web is full of 💩 in terms of text, and training on those tokens is a waste of compute.”
     • “Llama 3 post-training doesn't have any human written answers there basically… It's just leveraging pure synthetic data from Llama 2.”
  9. 🔙 In the previous episode
     Post-Chinchilla Scaling Laws, or the Chinchilla Trap (Databricks)
     • “[…] Chinchilla scaling laws, neglect to include the cost of inference”
     • Following the Chinchilla Scaling Laws leads to the "Chinchilla Trap", whereby you end up with a model that is way too large and therefore expensive to run at scale at inference time.
  10. 🔙 In the previous episode
     Post-Chinchilla Scaling Laws, or the Chinchilla Trap (Databricks)
     • In other words, to deploy a model that needs to serve requests at scale, you want to train a smaller model for longer.
     • It’s going to be more expensive at the training level, but cheaper at inference.
  11. 🔙 In the previous episode
     A recap, in figures

     | Model      | Year | Tokens/parameter |
     |------------|------|------------------|
     | GPT-3      | 2020 | 1.7              |
     | Chinchilla | 2022 | 20               |
     | Llama 1    | 2023 | 142              |
     | Llama 2    | 2023 | 284              |
     | Llama 3    | 2024 | 1875             |
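The tokens-per-parameter ratios on this slide can be reproduced from the commonly cited public training figures. A minimal sketch; the token and parameter counts below are approximate public numbers, not taken from the slides:

```python
# Approximate public figures: (training tokens, parameter count).
# Llama 1/2/3 use the smallest model of each family.
models = {
    "GPT-3":      (300e9,  175e9),
    "Chinchilla": (1.4e12, 70e9),
    "Llama 1":    (1.0e12, 7e9),
    "Llama 2":    (2.0e12, 7e9),
    "Llama 3":    (15e12,  8e9),
}

# Tokens per parameter: how much data the model saw relative to its size.
for name, (tokens, params) in models.items():
    print(f"{name:10s} {tokens / params:8.1f} tokens/parameter")
```

The trend is the point: post-Chinchilla models are trained on one to two orders of magnitude more tokens per parameter, trading extra training compute for cheaper inference.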
  12. 🚨 Chart crimes
     Crime solved - with o3 and ARC-AGI
     OpenAI o3 Breakthrough High Score on ARC-AGI-PUB (20/12/2024)
  13. ⛓💥 “Take a deep breath and think step by step”
     Something we’ve known since 2021
     • Just prompt the model to plan before it answers [Nye et al, 2021]
     • “[…] reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting […]” [Wei et al, 2022]
     • “[…] we show that LLMs are decent zero-shot reasoners by simply adding 'Let's think step by step' before each answer” [Kojima et al, 2022]
  14. ♻ Scaling laws before it was cool
     Two main families of test-time compute
     • Search against a Verifier (parallel sampling): This approach focuses on generating multiple candidate answers and using a verifier to select the best one.
     • Also simply known as search. It’s the same idea behind AlphaZero, AlphaGo…
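The search-against-a-verifier family reduces to a best-of-n loop: sample many candidates, score each with a verifier, keep the winner. A minimal sketch, where both `generate` and `verifier_score` are hypothetical stand-ins (a real system would call an LLM and an ORM or deterministic checker):

```python
import random

def generate(prompt):
    """Stand-in for an LLM call: returns a random guess.
    In practice this samples a completion at some temperature."""
    return random.randint(0, 100)

def verifier_score(prompt, answer):
    """Stand-in for a verifier (a reward model or deterministic check).
    Higher is better; here we pretend 42 is the checkable target."""
    return -abs(answer - 42)

def best_of_n(prompt, n=16):
    """Parallel sampling: draw n candidates, keep the verifier's favourite."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: verifier_score(prompt, a))

print(best_of_n("What is 6 * 7?"))
```

The test-time compute knob is simply `n`: more samples cost more inference but give the verifier more candidates to choose from.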
  15. ♻ Scaling laws before it was cool
     Two main families of test-time compute
     • Self-Refinement (sequential revision): Models iteratively refine their own outputs or “thoughts” by identifying and correcting errors in subsequent iterations.
     • Likely requires fine-tuning so the model can learn how to “self-correct”.
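Sequential revision has a different shape from parallel sampling: one chain of critique-and-revise steps instead of many independent samples. A toy sketch, with `refine` and `critique` as hypothetical stand-ins for the model revising and judging its own output:

```python
def refine(answer, feedback):
    """Stand-in for the model revising its answer given feedback."""
    return answer + 1  # toy 'improvement' step

def critique(answer, target=10):
    """Stand-in for self-critique: feedback string, or None if satisfied."""
    return None if answer >= target else "try a larger value"

def self_refine(initial_answer, max_rounds=20):
    """Sequential revision: critique and revise until done or out of budget."""
    answer = initial_answer
    for _ in range(max_rounds):
        feedback = critique(answer)
        if feedback is None:
            break
        answer = refine(answer, feedback)
    return answer

print(self_refine(0))  # → 10
```

Here the test-time compute knob is `max_rounds`; the slide's point is that, unlike search, this only helps if the model has been fine-tuned to actually produce useful critiques of itself.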
  16. ♻ Scaling laws before it was cool
     Parallel sampling: some strategies
     Scaling test time compute with Open Models (Beeching, Tunstall, Rush 2024)
  17. ♻ Scaling laws before it was cool
     Parallel sampling: What’s a verifier?
     • Deterministic processes: heuristics, solvers for equations, linters and unit tests for code…
     • Models:
        • Outcome Reward Models (ORM): trained to give a score to the final outcome
        • Process Reward Models (PRM): trained to give a score to the intermediate steps too (basically, how likely every step is to lead to the correct solution)
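The ORM/PRM distinction is where the scoring happens. A toy sketch; `score_step` is a hypothetical per-step scorer (a real PRM is a trained model), and aggregating by product is one common choice, not the only one:

```python
def score_step(step):
    """Hypothetical per-step scorer in [0, 1]; a real PRM is a trained model."""
    return 0.1 if "error" in step else 0.9

def orm_score(solution_steps):
    """Outcome reward model stand-in: only the final answer is scored."""
    return score_step(solution_steps[-1])

def prm_score(solution_steps):
    """Process reward model stand-in: every intermediate step is scored,
    then aggregated (here by product: every step must look plausible)."""
    score = 1.0
    for step in solution_steps:
        score *= score_step(step)
    return score

solution = ["2 + 2 = 5, an error", "so the answer is 12"]
print(orm_score(solution))  # the ORM never sees the bad intermediate step
print(prm_score(solution))  # the PRM penalises the whole chain
```

This is why PRMs tend to rank candidates better: a solution with a broken middle step can still end on a plausible-looking answer, and only step-level scoring catches that.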
  18. ♻ Scaling laws before it was cool
     Parallel sampling: What’s a verifier?
     • As usual, top research labs were experimenting with this way before we got o1:
     • Training Verifiers to Solve Math Word Problems (OpenAI, 2021)
        • Introduces the GSM8K dataset used for evaluations
        • Actually about training an outcome reward model (the verifier) to score answers to math problems
     • Solving math word problems with process- and outcome-based feedback (DeepMind, 2022)
        • Compares outcome reward models and process reward models on GSM8K
        • OpenAI does the same one year later: Let’s Verify Step by Step
  19. ♻ Scaling laws before it was cool
     Parallel sampling: What’s a verifier?
     • Google went back on this and derived scaling laws on how to trade test-time compute with pretraining compute:
     • Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Snell et al, 2024)
     • “[Y]ou can use smaller models or models that haven't been pre-trained for as long, and boost their performance using the test-time strategies […]. For hard problems the authors find that pre-training is likely to be more effective.” Scaling LLM Test Time Compute (2024) (blog)
  20. ♻ Scaling laws before it was cool
     Parallel sampling: conclusions
     • In general, PRMs seem to perform best.
     • However, there aren’t a lot of open PRMs around.
     • Reasoning problems, in contrast to general chat or writing requests, can be automatically verified or labeled.
  21. ♻ Scaling laws before it was cool
     Parallel sampling: conclusions
     • You can generate lots of synthetic data, and verify it - without human annotations.
     • This is a necessary condition to enable scaling these processes to training large language models.
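The synthetic-data pipeline this slide describes is essentially rejection sampling: generate candidates, keep only what the verifier accepts, and the survivors become training pairs with no human labels. A minimal sketch, where `sample` and `verify` are hypothetical stand-ins (here `sample` just evaluates the arithmetic, so every candidate happens to be correct):

```python
def build_training_set(problems, sample, verify, n_per_problem=8):
    """Rejection sampling: keep only generations the verifier accepts,
    producing (problem, solution) pairs with no human annotation."""
    dataset = []
    for problem, answer in problems:
        for _ in range(n_per_problem):
            candidate = sample(problem)
            if verify(candidate, answer):
                dataset.append((problem, candidate))
    return dataset

problems = [("2 + 2", "4"), ("3 * 3", "9")]

def sample(problem):
    # Toy stand-in for an LLM; a real pipeline samples free-form solutions.
    return str(eval(problem))

def verify(candidate, answer):
    return candidate == answer

data = build_training_set(problems, sample, verify, n_per_problem=2)
print(len(data))  # → 4
```

With a real model most candidates fail verification; the pipeline still scales because generating and checking are both automatic.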
  22. ♻ Scaling laws before it was cool
     Self-Refinement
     • Another idea that’s been explored since 2022 at least:
     • STaR: Bootstrapping Reasoning With Reasoning [Zelikman et al, 2022]
     • Beyond human data: Scaling self-training for problem-solving with language models [Singh et al, 2024]
     • Recursive introspection: Teaching foundation models how to self-improve (RISE) [Qu et al, 2024]
  23. ♻ Scaling laws before it was cool
     Self-Refinement
     Recursive introspection: Teaching foundation models how to self-improve
  24. 🍓 So, what is o1?
     It’s definitely parallel sampling
     • OpenAI had (willingly?) spread rumours about o1 for about a year before it came out, starting in November 2023.
     • The project was referred to as Q* [Reuters]. The training procedure was called Strawberry, and was supposedly used to train a new model, codenamed Orion (Interconnects AI, September 2024).
  25. 🍓 So, what is o1?
     It’s definitely parallel sampling
     • “As I’ve dug into this in more detail, I’ve become convinced that they are doing something powerful by searching over language steps via tree-of-thoughts reasoning” (Interconnects AI, November 2023)
  26. 🍓 So, what is o1?
     Until it isn’t?
     • OpenAI's o1 using "search" was a PSYOP (Interconnects AI, December 2024)
     • People at OpenAI clarified it was “‘just’ an LLM trained with RL” - does this mean it’s not a system: no search, no verifier?
  27. 🍓 So, what is o1?
     Until it isn’t?
     • They might tell you in the release post?
     • “Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process.”
     • “Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses.”
  28. 🍓 So, what is o1?
     Doesn’t this remind you of RISE?
     How Reasoning Works, OpenAI documentation
  29. 🍓 So, what is o1?
     Still unclear whether it’s search or self-improvement
     • A former OpenAI employee (now at Anthropic) explained that o3 “samples many solutions and uses a learned function [a verifier] to pick the best”
     • Claude 3.7 uses serial test-time compute (i.e., self-improvement), but “Our researchers have also been experimenting with improving the model’s performance using parallel test-time compute. They do this by sampling multiple independent thought processes and selecting the best one […].”
     • “Parallel test-time compute scaling isn’t available in our newly-deployed model”
     • Claude 3.5 was actually already doing “self-talk”
  30. 🐳 The whale in the room: enter DeepSeek R1
     Playing catch-up
     • Right at the turning point between 2024 and 2025, DeepSeek-R1 was released
     • The lab had been publishing very high-quality technical reports, praised for their novelties, throughout the year. This paper is no exception
  31. 🐳 The whale in the room: enter DeepSeek R1
     Playing catch-up
     • Most importantly, it uncovered a recipe for self-improvement (nowadays simply called reasoning) with reinforcement learning
     • Only three months after the announcement of the “preview” version of o1, and with performance comparable to o3 (which was released about a month prior)
  32. 🐳 The whale in the room: enter DeepSeek R1
     Playing catch-up
     • This learning procedure is now known as reinforcement learning with verifiable rewards (RLVR)
     • It uses smaller datasets of question and answer pairs, using verifiers to score the answers.
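The "verifiable reward" in RLVR can be as simple as an exact-match check against a reference answer. A minimal sketch; the `\boxed{...}` answer format is an assumption borrowed from common math-benchmark conventions, not something the talk specifies:

```python
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """RLVR-style reward sketch: 1.0 if the final answer matches the
    reference, else 0.0. Assumes the model was prompted to put its final
    answer inside \\boxed{...} (an assumed convention, not from the talk)."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parseable answer: no reward
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

print(verifiable_reward(r"... so the answer is \boxed{42}", "42"))  # → 1.0
```

Because the reward is computed by a deterministic check rather than a learned reward model, it cannot be gamed the way a neural verifier can, which is a big part of why the recipe scales.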
  33. 🔜 What’s next?
     A recap, and what’s in the near future
     • We are just at the beginning of what we can achieve with RL: we expect we can scale RLVR more and more.
     • This was finally possible because we have strong base models (i.e., we figured out pre-training).
     • Though we need to work on verifiers beyond the reasoning, math and code domains (which are easier to verify).
     • Does RL elicit existing capabilities in the latent space of the model, or uncover new ones? It seems like the latter.
  34. 📚 References
     Other references not in the previous slides
     • Why we think (Weng, 2025)
     • Scaling test time compute with Open Models (Beeching, Tunstall, Rush 2024)
     • Welcome to LLMflation – LLM inference cost is going down fast ⬇
     • Machine Learning Trends (very cool project by EpochAI)