Uncertainty in the LLM era - Science, more than scale

Today's AI narrative is anchored in scale, and models are far from textbook statistical modeling. And yet, the plague that hallucinations are shows us that statistical concepts, such as uncertainty, still matter. I will discuss uncertainty quantification on a black-box classifier: in particular, how errors can be decomposed, connecting epistemic and calibration error, and the corresponding estimators. I will show how, roping in a bit of decision theory, these fairly theoretical tools can be used to build better AI systems.


Gael Varoquaux

December 05, 2025

Transcript

  1. Casting – acknowledgements

    On social impact – Sasha Luccioni, Meredith Whittaker. On NLP – Lihu Chen, Fabian Suchanek. On uncertainty – Marine Le Morvan, Alexandre Perez, Sébastien Melo. Broader teams: Soda (Inria), scikit-learn⋆, :probabl. ⋆ hit me up for scikit-learn stickers. G Varoquaux 1
  2. This talk Maths + LLMs + agents Cuz’ this is

    our special sauce Politics Cuz’ technology is political G Varoquaux 2
  3. An AI revolution

    Self-driving cars (Waymo is not fully autonomous: remote operators) G Varoquaux 5
  4. The narrative of progress [Varoquaux... 2025]

    More compute drives smarter AI. “leveraging computation is ultimately the most effective” [Sutton 2019] “our results can be improved simply by waiting for faster GPUs and bigger datasets” [Krizhevsky... 2012] G Varoquaux 7
  5. The narrative of progress [Varoquaux... 2025]

    More compute drives smarter AI. “leveraging computation is ultimately the most effective” [Sutton 2019] “our results can be improved simply by waiting for faster GPUs and bigger datasets” [Krizhevsky... 2012] Downplays progress in architectures (transformers), statistics (diffusion, flows), algorithms (Adam, Muon) G Varoquaux 7
  6. Biiiiiig AI [Varoquaux... 2025]

    “our results can be improved simply by waiting for faster GPUs and bigger datasets” [Krizhevsky... 2012] [Figure: training FLOP of notable models, 2000–2020, spanning 10¹² to 10²⁴ FLOP, with a line for one day of compute on the largest supercomputer; domains: Language, Vision, Games, Multimodal, Drawing, Speech, Other, Unknown] G Varoquaux 8
  7. The bitter lesson [Varoquaux... 2025]

    “leveraging computation is ultimately the most effective [...] The ultimate reason for this is Moore’s law” [Sutton 2019] Not Moore’s law, just pouring more & more money. [Figure: training cost of a single run, 2010–2020, from $1 to $1M] G Varoquaux 9
  8. Checking this narrative [Varoquaux... 2025]

    [Figure, six panels of performance vs resources: (a) medical imaging – organ segmentation performance vs GPU memory used; (b) computer vision – object detection (COCO) performance vs model size; (c) computer vision – scene parsing (ADE20K) performance vs model size; (d) data science – tabular learning, XGBoost vs neural nets vs tree models, fraction of best score vs train time; (e) text embedding – Massive Text Embedding Benchmark performance vs model size; (f) natural language – LLM benchmark performance vs model size] Bigger isn’t always better. Diminishing returns of scale. In many applications, utility does not require scale. Unrepresentative benchmarks magnify scale effects. G Varoquaux 10
  9. Financial consequences It costs a lot to run AI ∼

    Danish government expenditure: 90 Billion € G Varoquaux 13
  10. Financial consequences It costs a lot to run AI ∼

    Danish government expenditure: 90 Billion € But the promises are amazing G Varoquaux 13
  12. Can computer science fix the size inflation? Tech keeps improving

    More efficient algorithms Better compute and data infrastructure G Varoquaux 14
  13. More efficient computing

    Efficiency improvements are super useful. But demand will increase more (rebound effects). It’s all behavior. And thus very social. G Varoquaux 15
  14. Better infrastructure

    Cloud brings compute and data; enables spying and control. We need to be careful who we platform. G Varoquaux 16
  15. Are we the baddies?

    Varoquaux, Luccioni, Whittaker, FAccT 2025: Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI G Varoquaux 17
  16. The role of academia is independent consolidation of knowledge

    If we don’t ensure a grounded conversation, the trust in science and technology breaks down G Varoquaux 20
  17. Innovation is also crucial

    Being actors of our future. It requires industries + checks and balances. G Varoquaux 21
  18. As researchers: many questions do not require scale

    Fixation on scale disempowers many (GPU poor). We should legitimize AI research questions beyond scale: reasoning [Jolicoeur-Martineau 2025], causality, game theory, privacy, uncertainty, ... Reviewing shouldn’t ask for large compute increases. G Varoquaux 23
  19. Undoing the “cool” in scale Let’s not play others’ game

    We need to define our own cool and be proud of it We’re actors of the narratives via our papers, our reviews G Varoquaux 24
  20. Unpacking AI progress

    1. There are bubble dynamics in AI: promise-based economics. 2. These influence the narratives: a simplified view on scale. 3. Compute-rich actors set social norms for more compute. 4. We can do better: as scholars, document tradeoffs; as innovators, push those tradeoffs; as researchers, study diverse questions. G Varoquaux 25 [Varoquaux... 2025]
  21. Can we trust a black box predictor? G Varoquaux 27

    Tell me something about Albert Einstein e.g., a short bio with the birth date and place Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest scientists of all time. Born in the German Empire, on January 14, 1879, Einstein grew up in Ulm, Germany. In 1905, he submitted a successful PhD dissertation to the Humboldt University of Berlin
  22. Can we trust a black box predictor? G Varoquaux 27

    Tell me something about Albert Einstein e.g., a short bio with the birth date and place Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest scientists of all time. Born in the German Empire, on January 14, 1879, Einstein grew up in Ulm, Germany. In 1905, he submitted a successful PhD dissertation to the Humboldt University of Berlin Some fishy stuff
  23. Can we trust a black box predictor? We’d really want

    uncertainty quantification G Varoquaux 28 Tell me something about Albert Einstein e.g., a short bio with the birth date and place Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest scientists of all time. Born in the German Empire, on January 14, 1879, Einstein grew up in Ulm, Germany. In 1905, he submitted a successful PhD dissertation to the Humboldt University of Berlin High confidence Medium confidence Low confidence
  24. Uncertainty bins and error rates – Mistral 7B, on birth dates

    [Calibration plot: predicted probability vs fraction of positives] Estimate probabilities by counting on bins G Varoquaux 30
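    This binning estimate is what scikit-learn’s calibration_curve computes. A minimal sketch on synthetic stand-in data (the slide’s Mistral 7B birth-date answers are not included here):

        import numpy as np
        from sklearn.calibration import calibration_curve

        # Stand-in confidence scores and correctness labels (synthetic;
        # the slide uses Mistral 7B answers on birth dates).
        rng = np.random.default_rng(0)
        scores = rng.uniform(0, 1, size=2000)    # predicted probabilities
        labels = rng.binomial(1, scores ** 1.5)  # a miscalibrated ground truth

        # Bin the predictions, count the observed positive rate per bin.
        frac_pos, mean_pred = calibration_curve(labels, scores, n_bins=10)
        for p, f in zip(mean_pred, frac_pos):
            print(f"predicted {p:.2f} -> observed {f:.2f}")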
  26. Notion of calibrated classifiers

    [Figure: predicted probability vs observed positive rate; over-confident below the diagonal, under-confident above, perfect calibration on the diagonal] If the predictor says 60%, do I observe 60%? G Varoquaux 31
  27. Varying the queried nationality [Chen... 2024]

    [Figure: calibration curves (mean predicted probability vs fraction of positives) for Germany, United_States, India, France, China, against the perfectly calibrated diagonal] Calibration is a control on average. Need control on individual queries: ℙ(y|X) G Varoquaux 32
  28. To control probabilities: strictly proper scoring rules

    Question: f(X) ?= ℙ(y|X). Notation: s ≜ f(X). Strictly proper scoring rules provide divergence functions between two distributions: d(p, q) minimal for p = q; d(p, q) ≥ 0 and d(p, q) = 0 ⇔ p = q. Example: Brier score (squared loss): Brier score = Σᵢ (ŝᵢ − yᵢ)², with ŝᵢ the confidence score and yᵢ the observed (binary) label. Control probabilities, though only discrete events are observed. Good losses for machine learning. G Varoquaux 33
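    For concreteness, the Brier score in a few lines (synthetic numbers; note scikit-learn’s brier_score_loss returns the mean rather than the sum):

        import numpy as np
        from sklearn.metrics import brier_score_loss

        y = np.array([0, 1, 1, 0, 1])            # observed binary labels y_i
        s = np.array([0.1, 0.9, 0.8, 0.3, 0.6])  # confidence scores s_i = f(X)

        # Mean squared difference between confidence and outcome.
        print(brier_score_loss(y, s))            # same as ((s - y) ** 2).mean()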
  29. Decomposing errors: epistemic vs aleatoric

    Proposition (Expected loss decomposition) [Kull and Flach 2015]: 𝔼[d(f(X), Y)] (expected loss L) = 𝔼[d(f(X), f⋆(X))] (epistemic loss EL) + 𝔼[d(f⋆(X), Y)] (aleatoric loss AL). Epistemic error: how sub-optimal is the predictor. Can be reduced by fitting a better model / more samples. Aleatoric error: residual uncertainty on y|X. Can be reduced by a model on more features. G Varoquaux 34
  30. Decomposing errors: epistemic vs aleatoric

    [Same decomposition as above.] Some questions cannot be answered with high confidence G Varoquaux 34
  31. Decomposing errors: epistemic vs aleatoric

    [Same decomposition as above.] Some questions cannot be answered with high confidence. Very hard to estimate: would need access to f⋆ G Varoquaux 34
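    A numeric sanity check of this decomposition for the Brier score, with a synthetic oracle f⋆ (a sketch; in real settings f⋆ is unknown, which is the point of the next slides):

        import numpy as np

        rng = np.random.default_rng(0)
        n = 500_000
        f_star = rng.uniform(0.1, 0.9, size=n)   # oracle P(y=1 | X)
        y = rng.binomial(1, f_star)              # observed labels
        f = np.clip(f_star + 0.2, 0, 1)          # a deliberately biased predictor

        L  = np.mean((f - y) ** 2)               # expected Brier loss
        EL = np.mean((f - f_star) ** 2)          # epistemic loss
        AL = np.mean((f_star - y) ** 2)          # aleatoric loss
        print(L, EL + AL)                        # match up to sampling noise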
  32. Decomposing epistemic... calibration

    Theorem (Epistemic loss decomposition) [Kull and Flach 2015]: 𝔼[d(f(X), f⋆(X))] (epistemic loss EL) = 𝔼[d(f(X), c∘f(X))] (calibration loss CL) + 𝔼[d(c∘f(X), f⋆(X))] (grouping loss GL). [Figure, oracle view: true probability f⋆(X) vs probability estimate f(X), with the calibration curve c] Calibration is the bias: average error. “Easy to compute” on bins of f(X). G Varoquaux 35
  33. Decomposing epistemic... calibration

    [Same decomposition as above.] Grouping is the variance. Lemma (Grouping loss as a variance) [Perez-Lebel... 2023]: for the Brier score, GL = 𝔼[𝕍(f⋆(X) | f(X))]. G Varoquaux 35
  34. Decomposing epistemic... calibration

    [Same decomposition as above.] Challenge: f⋆(X) is unknown. We only observe Y ∼ Bernoulli(f⋆(X)). The figure needs oracle knowledge. G Varoquaux 35
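    The lemma can be checked directly with a synthetic oracle (a sketch: here the predictor groups distinct f⋆ values together, which is exactly what the grouping loss measures):

        import numpy as np

        rng = np.random.default_rng(0)
        f_star = rng.uniform(0, 1, size=200_000)  # oracle probabilities
        f = np.round(f_star, 1)                   # a predictor that groups
                                                  # distinct f_star together

        # Grouping loss for the Brier score: E[ V(f_star | f) ].
        gl = sum((f == v).mean() * f_star[f == v].var() for v in np.unique(f))
        print(gl)  # > 0: information lost by grouping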
  35. Estimation principle: Law of total variance

    𝕍[g(X)]. [Figure: input space X, function g(X)] Our goal is estimating 𝕍[g(X)]. G Varoquaux 36
  36. Estimation principle: Law of total variance

    [Figure: X split into parts 1–10] Split the input space X into a partition. G Varoquaux 36
  37. Estimation principle: Law of total variance

    [Figure: piecewise-constant local averages] Compute the average of g on each part. G Varoquaux 36
  38. Estimation principle: Law of total variance

    𝕍[g(X)] = 𝕍[𝔼[g(X) | P]] + ... [Figure: 𝔼[g(X)|P] over the partition] The variance of the local averages captures part of the variance of g. G Varoquaux 36
  39. Estimation principle: Law of total variance

    𝕍[g(X)] = 𝕍[𝔼[g(X) | P]] + ... [Figure: 𝕍[g(X)|P] over the partition] But there is unaccounted-for variance: the gap between g and the local average. G Varoquaux 36
  40. Estimation principle: Law of total variance

    [Same figure] The remainder term is the sum of these gaps. G Varoquaux 36
  41. Estimation principle: Law of total variance

    𝕍[g(X)] = 𝕍[𝔼[g(X) | P]] (explained variance) + 𝔼[𝕍[g(X) | P]] (residual variance). Law of total variance: decomposition of 𝕍[g(X)] into the variance of local averages plus this remainder (sum of gaps). G Varoquaux 36
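    A small numeric check of the decomposition (a sketch; the partition here is just 10 equal-width bins on a 1-d X):

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.uniform(0, 1, size=100_000)
        g = np.sin(6 * X)                            # the function g(X)

        # Partition X into 10 equal-width parts.
        part = np.minimum((X * 10).astype(int), 9)
        w = np.array([(part == k).mean() for k in range(10)])   # part weights
        m = np.array([g[part == k].mean() for k in range(10)])  # local averages
        v = np.array([g[part == k].var() for k in range(10)])   # local variances

        explained = np.sum(w * (m - g.mean()) ** 2)  # variance of local averages
        residual = np.sum(w * v)                     # mean of local variances
        print(g.var(), explained + residual)         # the two sides match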
  42. Capturing the variance of predicted probabilities

    Theorem (Grouping loss decomposition) [Perez-Lebel... 2023]: let P be a partitioning of the feature space X. Then GL(p) = 𝕍[f⋆(X) | f(X) = p] = 𝕍[𝔼[Y | f(X), P(X)] | f(X) = p] (GL_explained(p) ≥ 0) + 𝔼[𝕍[f⋆(X) | f(X), P(X)] | f(X) = p] (GL_residual(p) ≥ 0). 1. Find a partition to minimize GL_residual. 2. Compute the variance on regions. G Varoquaux 37
  43. Creating the partition of the input space [Perez-Lebel... 2023]

    Goal: find a partition P of X minimizing GL_residual, i.e. such that 𝔼[Y | f(X), P(X)] varies. Supervised partitioning (uses X and Y), e.g. decision trees: targets heterogeneity. 1. Train a decision tree on (X₁, Y₁). 2. The tree leaves give the partition P on X. 3. Compute estimates on left-out data (a rough sketch of these steps follows below). Improvements [Melo... 2025]: target 𝔼[Y − c | X]; lower bound + consistent GL estimator. G Varoquaux 38
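    A rough sketch of steps 1–3, not the estimator of [Perez-Lebel... 2023]: a stock decision tree, a single half split, and equal-width bins on f(X) stand in for the paper’s choices.

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        def grouping_loss_explained(f_scores, X, y, n_bins=10, seed=0):
            rng = np.random.default_rng(seed)
            half = rng.permutation(len(y)) < len(y) // 2
            # 1. Train a decision tree on one half of the data.
            tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0)
            tree.fit(X[half], y[half])
            # 2. The tree leaves give the partition P, applied to held-out data.
            leaves = tree.apply(X[~half])
            scores, labels = f_scores[~half], y[~half]
            bins = np.clip((scores * n_bins).astype(int), 0, n_bins - 1)
            # 3. Within each bin of f(X): variance of the per-leaf positive
            #    rates around the bin mean, weighted by leaf occupancy.
            gl = 0.0
            for b in np.unique(bins):
                m = bins == b
                ids = np.unique(leaves[m])
                rates = np.array([labels[m][leaves[m] == l].mean() for l in ids])
                w = np.array([np.mean(leaves[m] == l) for l in ids])
                gl += m.mean() * np.sum(w * (rates - labels[m].mean()) ** 2)
            return gl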
  44. From epistemic error to decisions [Perez-Lebel... 2025]

    [Figure, oracle view: true probability f⋆(X) vs probability estimate f(X), with calibration curve c and grouping loss GL] Epistemic error controls probabilistic predictions: how far to the best possible ℙ(y|X). Decisions are made on top: 𝟙{f(X) > t}. Errors (false detections, misses) incur costs (decision theory). New question: how far are we in terms of cost? “Sub-optimality gap” G Varoquaux 39
  45. Knowing oracle probabilities [Elkan 2001]

    Utility:   Event Ē   Event E
    Action 0:  U00       U01
    Action 1:  U10       U11
    Knowing f⋆ ≜ ℙ(y|X), to maximize expected utility the optimal decision is: take action 1 ⇔ f⋆(X) ≥ t⋆, with t⋆ ≜ (U00 − U10) / (U00 − U10 + U11 − U01). Examples: stop irrigation, avoid surgery, ban user, block payment, call customer, block content. G Varoquaux 40
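    A direct transcription of this rule; the utility values below are made up for illustration.

        def optimal_threshold(u00, u01, u10, u11):
            """Decision threshold maximizing expected utility [Elkan 2001]."""
            return (u00 - u10) / (u00 - u10 + u11 - u01)

        # Toy example: blocking a payment. Rows: action 0 (allow), 1 (block);
        # columns: event E-bar (legitimate), E (fraud). Values illustrative.
        t_star = optimal_threshold(u00=0.0, u01=-10.0, u10=-1.0, u11=0.0)
        print(t_star)  # block when P(fraud | X) >= 1/11, about 0.09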
  46. Sub-optimality of imperfect predictions

    [Figure: f⋆(X) vs f(X)] Plot true probabilities as a function of estimated ones. G Varoquaux 41
  47. Sub-optimality of imperfect predictions

    [Figure: oracle view, f⋆(X) vs f(X)] In practice: estimation errors between them. G Varoquaux 41
  48. Sub-optimality of imperfect predictions

    [Figure: oracle view with threshold t⋆ on f⋆, regions δ = 0 and δ = 1] The optimal decision is to threshold f⋆ at t⋆, allocating actions above and below the corresponding line. G Varoquaux 41
  49. Sub-optimality of imperfect predictions

    [Figure: oracle view with thresholds t on f and t⋆ on f⋆] A candidate decision thresholds f at t, allocating actions left and right of the corresponding line. G Varoquaux 41
  50. Sub-optimality of imperfect predictions

    [Same figure] Combining views shows where the two decisions agree and disagree. G Varoquaux 41
  51. Sub-optimality of imperfect predictions

    [Same figure, disagreement regions marked “Regret”] Regret is incurred where they disagree. G Varoquaux 41
  52. Sub-optimality of imperfect predictions

    [Same figure, with regret scale R(δf,t, x) × U∆ from 0.0 to 0.5] The further the true probability f⋆ is from t⋆, the more regret. G Varoquaux 41
  53. Sub-optimality of imperfect predictions

    Lemma (Pointwise regret) [Perez-Lebel... 2025]: R(δ, x) = U∆ · |f⋆(x) − t⋆| if δ⋆(x) ≠ δ(x), and 0 if δ⋆(x) = δ(x), with U∆ ≜ U00 − U10 + U11 − U01 > 0. [Oracle view] G Varoquaux 41
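    The lemma transcribes to a one-liner; the argument names are mine, and the example reuses the toy utilities from the threshold sketch above (t⋆ = 1/11, U∆ = 11).

        def pointwise_regret(f_star_x, t_star, u_delta, decision, oracle_decision):
            """Pointwise regret of a decision at x [Perez-Lebel... 2025]."""
            if decision == oracle_decision:
                return 0.0
            return u_delta * abs(f_star_x - t_star)

        # Disagreeing with the oracle far from the threshold is costly:
        print(pointwise_regret(0.5, 1 / 11, 11.0, decision=0, oracle_decision=1))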
  54. Sub-optimality of imperfect predictions

    [Same figure] But this plot is an oracle view. G Varoquaux 41
  55. Sub-optimality of imperfect predictions

    [Same figure, without the oracle] And we do not observe the oracle f⋆. G Varoquaux 41
  56. Sub-optimality of imperfect predictions

    [Same figure, with calibration curve c] We have information on the average distance to f⋆ (calibration). Incurs regret R_CL. G Varoquaux 41
  57. Sub-optimality of imperfect predictions

    [Same figure, with calibration curve c and grouping loss GL] We have an estimate of the variance (grouping). Incurs regret R_GL. G Varoquaux 41
  58. Sub-optimality of imperfect predictions

    Sub-optimality gap (regret) [Perez-Lebel... 2025] = calibration regret + grouping regret. Lower and upper bounds on the grouping regret from the spread. [Figure: penalty area around t for a distribution pZ with given mean µ and variance] G Varoquaux 41
  59. Estimating local suboptimality [Melo... 2025]

    Estimates on the parcellation ⇒ input-specific regret, i.e. how much better a model, given the same information, would do in expected costs. G Varoquaux 42
  60. f(X) ?= ℙ(y|X)

    We use partition-based estimates of bias and variance. Bound the cost of hard decisions. [Perez-Lebel... 2023, 2025, Melo... 2025] G Varoquaux 43
  61. f(X) ?= ℙ(y|X)

    We use partition-based estimates of bias and variance. Bound the cost of hard decisions. [Perez-Lebel... 2023, 2025, Melo... 2025] Position relative to conformal prediction? Conditional control. Bounds, not worst-case. Bridge to binary decisions. G Varoquaux 43
  62. Today’s AI systems: Many components, tools, and agents What to

    call? What to trust? The biggest? G Varoquaux 45
  63. The role of small models [Chen and Varoquaux 2024]

    [Figure: monthly downloads on the Hugging Face hub by model size bucket, [0, 200M] to [6B, +∞]; most downloaded per bucket: bert-base-uncased, roberta-large, xlm-roberta-large, Phi-3-mini-4k-instruct, sharegpt4video-8b] Small models matter G Varoquaux 46
  64. The role of small models [Chen and Varoquaux 2024]

    Calling the smallest possible model decreases inference cost. Model routing. Model cascading. G Varoquaux 47
  65. Model cascading

    [Diagram: models of increasing size, → → →] 0. Pick the smallest model. 1. Use the model. 2. See if the answer is good enough; if so, return it. 3. If not, pick the next model (in order of cost). 4. Go to 1. (A sketch of this loop follows below.) [Kag and Fedorov 2023, Chen... 2023, Hu... 2024]. Review in [Chen and Varoquaux 2024]. G Varoquaux 48
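    A minimal sketch of the cascade loop; `ask` and `good_enough` are placeholder interfaces, not an actual API.

        def cascade(query, models, ask, good_enough):
            """Try models from cheapest to most expensive; stop when the
            answer is judged good enough (placeholder interfaces)."""
            for model in models:           # 0./3. models sorted by cost
                answer = ask(model, query)      # 1. use the model
                if good_enough(answer):         # 2. check answer quality
                    return answer               #    if so, return it
            return answer                  # fall back to the biggest model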
  66. Cascading with epistemic [Melo... 2025]

    Epistemic / aleatoric error!? “What shirt is Gaël wearing today?” High aleatoric to the LLM: no use calling a bigger model. “What is the birth date of Einstein?” An LLM should know; if uncertain, call a bigger model. G Varoquaux 49
  67. Cascading with epistemic: Llama 1B → 3B → 8B → 70B [Melo... 2025]

    [Figure: accuracy (0.62–0.74) vs average cost per sample ($) for Llama 1B, 3B, 8B, 70B and a predictive router, two settings labeled =1 and =20] G Varoquaux 49
  68. Cascading with epistemic: Llama 1B → 3B → 8B → 70B [Melo... 2025]

    [Same figure, adding a CL cascade at thresholds t = 0, 0.05, 0.5] G Varoquaux 49
  69. Cascading with epistemic: Llama 1B → 3B → 8B → 70B [Melo... 2025]

    [Same figure, adding a CL + GL cascade at thresholds t = 0, 0.05, 0.5] G Varoquaux 49
  70. Make research on small cool again Once upon a time

    computer science was about making efficient sorting algorithms G Varoquaux 51
  71. Tech is political

    What we choose to work on: technology isn’t neutral. Vehicles: ambulances or tanks? What narratives we propagate: narratives shape societies, and AI research, which is social. Build to choose our future, navigating economic rationales. It’s all about tradeoffs. G Varoquaux 52
  72. Uncertainty in the LLM era

    Local black-box uncertainty control (grouping error) for better AI decisions and orchestration. The maths still matter for AI. We need to be more uncertain on LLM utility: naive overconfidence will fuel bubble dynamics, not economic or societal benefits. Research beyond scale, LLMs, “AGI”. Research where you own your success.
  73. References I

    L. Chen and G. Varoquaux. What is the role of small models in the LLM era: A survey. arXiv preprint arXiv:2409.06857, 2024.
    L. Chen, M. Zaharia, and J. Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint, 2023. URL https://arxiv.org/abs/2305.05176.
    L. Chen, A. Perez-Lebel, F. M. Suchanek, and G. Varoquaux. Reconfidencing LLMs from the grouping loss perspective. EMNLP Findings, 2024.
    C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 2001.
    Q. J. Hu, J. Bieker, X. Li, N. Jiang, B. Keigwin, G. Ranganath, K. Keutzer, and S. K. Upadhyay. RouterBench: A benchmark for multi-LLM routing systems, 2024.
    A. Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv preprint arXiv:2510.04871, 2025.
    A. Kag and I. Fedorov. Efficient edge inference by selective query. In International Conference on Learning Representations, 2023.
  74. References II

    A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
    M. Kull and P. Flach. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 68–85. Springer, 2015.
    S. Melo, G. Varoquaux, and M. Le Morvan. Epistemic uncertainty quantification to improve decisions from black-box models. Working paper or preprint, Dec. 2025. URL https://hal.science/hal-05393027.
    A. Perez-Lebel, M. Le Morvan, and G. Varoquaux. Beyond calibration: estimating the grouping loss of modern neural networks. ICLR, 2023.
    A. Perez-Lebel, G. Varoquaux, S. Koyejo, M. Doutreligne, and M. Le Morvan. Decision from suboptimal classifiers: Excess risk pre- and post-calibration. AISTATS, 2025.
    R. Sutton. The bitter lesson. Incomplete Ideas (blog), 13(1), 2019.
  75. References III

    G. Varoquaux, S. Luccioni, and M. Whittaker. Hype, sustainability, and the price of the bigger-is-better paradigm in AI. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pages 61–75, 2025.