Uncertainty in the LLM era - Science, more than scale

Today's AI narrative is anchored in scale, and models are far from textbook statistical modeling. And yet, the plague that hallucinations are shows us that statistical concepts, such as uncertainty, still matter. I will discuss uncertainty quantification on a black-box classifier: in particular, how errors can be decomposed, connecting epistemic and calibration error, and the corresponding estimators. I will show how, roping in a bit of decision theory, these fairly theoretical tools can be used to build better AI systems.


Gael Varoquaux

December 05, 2025

Transcript

  1. Casting – acknowledgements

    On social impact – Sasha Luccioni, Meredith Whittaker. On NLP – Lihu Chen, Fabian Suchanek. On uncertainty – Marine Le Morvan, Alexandre Perez, Sébastien Melo. Broader teams: Soda (Inria), scikit-learn⋆, :probabl. ⋆ hit me up for scikit-learn stickers. G Varoquaux 1
  2. This talk Maths + LLMs + agents Cuz’ this is

    our special sauce Politics Cuz’ technology is political G Varoquaux 2
  3. An AI revolution

    Self-driving cars (Waymo is not fully autonomous: remote operators) G Varoquaux 5
  4. The narrative of progress [Varoquaux... 2025]

    More compute drives smarter AI. “leveraging computation is ultimately the most effective” [Sutton 2019] “our results can be improved simply by waiting for faster GPUs and bigger datasets” [Krizhevsky... 2012] G Varoquaux 7
  5. The narrative of progress [Varoquaux... 2025]

    More compute drives smarter AI. “leveraging computation is ultimately the most effective” [Sutton 2019] “our results can be improved simply by waiting for faster GPUs and bigger datasets” [Krizhevsky... 2012] Downplays progress in architectures (transformers), statistics (diffusion, flows), algorithms (Adam, Muon) G Varoquaux 7
  6. Biiiiiig AI [Varoquaux... 2025]

    “our results can be improved simply by waiting for faster GPUs and bigger datasets” [Krizhevsky... 2012] [Figure: training FLOP of notable models, 2000–2020, spanning 10¹² to 10²⁴ FLOP, with a line for one day of compute on the largest supercomputer; domains: Language, Vision, Games, Multimodal, Drawing, Speech, Other, Unknown] G Varoquaux 8
  7. The bitter lesson [Varoquaux... 2025]

    “leveraging computation is ultimately the most effective [...] The ultimate reason for this is Moore’s law” [Sutton 2019] Not Moore’s law, just pouring more & more money. [Figure: training cost of a single run, 2010–2020, from $1 to $1M] G Varoquaux 9
  8. Checking this narrative [Varoquaux... 2025]

    [Figure, six panels of performance vs resources: (a) medical imaging – organ segmentation performance vs GPU memory used; (b) computer vision – object detection (COCO) performance vs model size; (c) computer vision – scene parsing (ADE20K) performance vs model size; (d) data science – tabular learning, XGBoost vs neural nets vs tree models, fraction of best score vs train time; (e) text embedding – Massive Text Embedding Benchmark performance vs model size; (f) natural language – LLM benchmark performance vs model size] Bigger isn’t always better. Diminishing returns of scale. In many applications, utility does not require scale. Unrepresentative benchmarks magnify scale effects. G Varoquaux 10
  9. Financial consequences It costs a lot to run AI ∼

    Danish government expenditure: 90 Billion € G Varoquaux 13
  10. Financial consequences It costs a lot to run AI ∼

    Danish government expenditure: 90 Billion € But the promises are amazing G Varoquaux 13
  12. Can computer science fix the size inflation? Tech keeps improving

    More efficient algorithms Better compute and data infrastructure G Varoquaux 14
  13. More efficient computing

    Efficiency improvements are super useful. But demand will increase more (rebound effects). It’s all behavior. And thus very social. G Varoquaux 15
  14. Better infrastructure

    Cloud brings compute and data; enables spying and control. We need to be careful who we platform. G Varoquaux 16
  15. Are we the baddies?

    Varoquaux, Luccioni, Whittaker, FAccT 2025: Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI G Varoquaux 17
  16. The role of academia is independent consolidation of knowledge

    If we don’t ensure a grounded conversation, the trust in science and technology breaks down G Varoquaux 20
  17. Innovation is also crucial

    Being actors of our future. It requires industries + checks and balances. G Varoquaux 21
  18. As researchers: many questions do not require scale

    Fixation on scale disempowers many (GPU poor). We should legitimize AI research questions beyond scale: reasoning [Jolicoeur-Martineau 2025], causality, game theory, privacy, uncertainty, ... Reviewing shouldn’t ask for large compute increases. G Varoquaux 23
  19. Undoing the “cool” in scale Let’s not play others’ game

    We need to define our own cool and be proud of it We’re actors of the narratives via our papers, our reviews G Varoquaux 24
  20. Unpacking AI progress

    1. There are bubble dynamics in AI: promise-based economics. 2. These influence the narratives: a simplified view on scale. 3. Compute-rich actors set social norms for more compute. 4. We can do better: as scholars, document tradeoffs; as innovators, push those tradeoffs; as researchers, study diverse questions. G Varoquaux 25 [Varoquaux... 2025]
  21. Can we trust a black box predictor? G Varoquaux 27

    Tell me something about Albert Einstein e.g., a short bio with the birth date and place Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest scientists of all time. Born in the German Empire, on January 14, 1879, Einstein grew up in Ulm, Germany. In 1905, he submitted a successful PhD dissertation to the Humboldt University of Berlin
  22. Can we trust a black box predictor? G Varoquaux 27

    Tell me something about Albert Einstein e.g., a short bio with the birth date and place Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest scientists of all time. Born in the German Empire, on January 14, 1879, Einstein grew up in Ulm, Germany. In 1905, he submitted a successful PhD dissertation to the Humboldt University of Berlin Some fishy stuff
  23. Can we trust a black box predictor? We’d really want

    uncertainty quantification G Varoquaux 28 Tell me something about Albert Einstein e.g., a short bio with the birth date and place Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest scientists of all time. Born in the German Empire, on January 14, 1879, Einstein grew up in Ulm, Germany. In 1905, he submitted a successful PhD dissertation to the Humboldt University of Berlin High confidence Medium confidence Low confidence
  24. Uncertainty bins and error rates – Mistral 7B, on birth dates

    [Calibration plot: predicted probability vs fraction of positives] Estimate probabilities by counting on bins G Varoquaux 30
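    This binning estimate is what scikit-learn’s calibration_curve computes. A minimal sketch on synthetic stand-in data (the slide’s Mistral 7B birth-date answers are not included here):

        import numpy as np
        from sklearn.calibration import calibration_curve

        # Stand-in confidence scores and correctness labels (synthetic;
        # the slide uses Mistral 7B answers on birth dates).
        rng = np.random.default_rng(0)
        scores = rng.uniform(0, 1, size=2000)    # predicted probabilities
        labels = rng.binomial(1, scores ** 1.5)  # a miscalibrated ground truth

        # Bin the predictions, count the observed positive rate per bin.
        frac_pos, mean_pred = calibration_curve(labels, scores, n_bins=10)
        for p, f in zip(mean_pred, frac_pos):
            print(f"predicted {p:.2f} -> observed {f:.2f}")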
  26. Notion of calibrated classifiers

    [Figure: predicted probability vs observed positive rate; over-confident below the diagonal, under-confident above, perfect calibration on the diagonal] If the predictor says 60%, do I observe 60%? G Varoquaux 31
  27. Varying the queried nationality [Chen... 2024]

    [Figure: calibration curves (mean predicted probability vs fraction of positives) for Germany, United_States, India, France, China, against the perfectly calibrated diagonal] Calibration is a control on average. Need control on individual queries: ℙ(y|X) G Varoquaux 32
  28. To control probabilities: strictly proper scoring rules

    Question: f(X) ?= ℙ(y|X). Notation: s ≜ f(X). Strictly proper scoring rules provide divergence functions between two distributions: d(p, q) minimal for p = q; d(p, q) ≥ 0 and d(p, q) = 0 ⇔ p = q. Example: Brier score (squared loss): Brier score = Σᵢ (ŝᵢ − yᵢ)², with ŝᵢ the confidence score and yᵢ the observed (binary) label. Control probabilities, though only discrete events are observed. Good losses for machine learning. G Varoquaux 33
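    For concreteness, the Brier score in a few lines (synthetic numbers; note scikit-learn’s brier_score_loss returns the mean rather than the sum):

        import numpy as np
        from sklearn.metrics import brier_score_loss

        y = np.array([0, 1, 1, 0, 1])            # observed binary labels y_i
        s = np.array([0.1, 0.9, 0.8, 0.3, 0.6])  # confidence scores s_i = f(X)

        # Mean squared difference between confidence and outcome.
        print(brier_score_loss(y, s))            # same as ((s - y) ** 2).mean()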
  29. Decomposing errors: epistemic vs aleatoric

    Proposition (Expected loss decomposition) [Kull and Flach 2015]: 𝔼[d(f(X), Y)] (expected loss L) = 𝔼[d(f(X), f⋆(X))] (epistemic loss EL) + 𝔼[d(f⋆(X), Y)] (aleatoric loss AL). Epistemic error: how sub-optimal is the predictor. Can be reduced by fitting a better model / more samples. Aleatoric error: residual uncertainty on y|X. Can be reduced by a model on more features. G Varoquaux 34
  30. Decomposing errors: epistemic vs aleatoric

    [Same decomposition as above.] Some questions cannot be answered with high confidence G Varoquaux 34
  31. Decomposing errors: epistemic vs aleatoric

    [Same decomposition as above.] Some questions cannot be answered with high confidence. Very hard to estimate: would need access to f⋆ G Varoquaux 34
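    A numeric sanity check of this decomposition for the Brier score, with a synthetic oracle f⋆ (a sketch; in real settings f⋆ is unknown, which is the point of the next slides):

        import numpy as np

        rng = np.random.default_rng(0)
        n = 500_000
        f_star = rng.uniform(0.1, 0.9, size=n)   # oracle P(y=1 | X)
        y = rng.binomial(1, f_star)              # observed labels
        f = np.clip(f_star + 0.2, 0, 1)          # a deliberately biased predictor

        L  = np.mean((f - y) ** 2)               # expected Brier loss
        EL = np.mean((f - f_star) ** 2)          # epistemic loss
        AL = np.mean((f_star - y) ** 2)          # aleatoric loss
        print(L, EL + AL)                        # match up to sampling noise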
  32. Decomposing epistemic... calibration

    Theorem (Epistemic loss decomposition) [Kull and Flach 2015]: 𝔼[d(f(X), f⋆(X))] (epistemic loss EL) = 𝔼[d(f(X), c∘f(X))] (calibration loss CL) + 𝔼[d(c∘f(X), f⋆(X))] (grouping loss GL). [Figure, oracle view: true probability f⋆(X) vs probability estimate f(X), with the calibration curve c] Calibration is the bias: average error. “Easy to compute” on bins of f(X). G Varoquaux 35
  33. Decomposing epistemic... calibration

    [Same decomposition as above.] Grouping is the variance. Lemma (Grouping loss as a variance) [Perez-Lebel... 2023]: for the Brier score, GL = 𝔼[𝕍(f⋆(X) | f(X))]. G Varoquaux 35
  34. Decomposing epistemic... calibration

    [Same decomposition as above.] Challenge: f⋆(X) is unknown. We only observe Y ∼ Bernoulli(f⋆(X)). The figure needs oracle knowledge. G Varoquaux 35
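    The lemma can be checked directly with a synthetic oracle (a sketch: here the predictor groups distinct f⋆ values together, which is exactly what the grouping loss measures):

        import numpy as np

        rng = np.random.default_rng(0)
        f_star = rng.uniform(0, 1, size=200_000)  # oracle probabilities
        f = np.round(f_star, 1)                   # a predictor that groups
                                                  # distinct f_star together

        # Grouping loss for the Brier score: E[ V(f_star | f) ].
        gl = sum((f == v).mean() * f_star[f == v].var() for v in np.unique(f))
        print(gl)  # > 0: information lost by grouping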
  35. Estimation principle: Law of total variance

    𝕍[g(X)]. [Figure: input space X, function g(X)] Our goal is estimating 𝕍[g(X)]. G Varoquaux 36
  36. Estimation principle: Law of total variance

    [Figure: X split into parts 1–10] Split the input space X into a partition. G Varoquaux 36
  37. Estimation principle: Law of total variance

    [Figure: piecewise-constant local averages] Compute the average of g on each part. G Varoquaux 36
  38. Estimation principle: Law of total variance

    𝕍[g(X)] = 𝕍[𝔼[g(X) | P]] + ... [Figure: 𝔼[g(X)|P] over the partition] The variance of the local averages captures part of the variance of g. G Varoquaux 36
  39. Estimation principle: Law of total variance

    𝕍[g(X)] = 𝕍[𝔼[g(X) | P]] + ... [Figure: 𝕍[g(X)|P] over the partition] But there is unaccounted-for variance: the gap between g and the local average. G Varoquaux 36
  40. Estimation principle: Law of total variance

    [Same figure] The remainder term is the sum of these gaps. G Varoquaux 36
  41. Estimation principle: Law of total variance

    𝕍[g(X)] = 𝕍[𝔼[g(X) | P]] (explained variance) + 𝔼[𝕍[g(X) | P]] (residual variance). Law of total variance: decomposition of 𝕍[g(X)] into the variance of local averages plus this remainder (sum of gaps). G Varoquaux 36
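    A small numeric check of the decomposition (a sketch; the partition here is just 10 equal-width bins on a 1-d X):

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.uniform(0, 1, size=100_000)
        g = np.sin(6 * X)                            # the function g(X)

        # Partition X into 10 equal-width parts.
        part = np.minimum((X * 10).astype(int), 9)
        w = np.array([(part == k).mean() for k in range(10)])   # part weights
        m = np.array([g[part == k].mean() for k in range(10)])  # local averages
        v = np.array([g[part == k].var() for k in range(10)])   # local variances

        explained = np.sum(w * (m - g.mean()) ** 2)  # variance of local averages
        residual = np.sum(w * v)                     # mean of local variances
        print(g.var(), explained + residual)         # the two sides match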
  42. Capturing the variance of predicted probabilities

    Theorem (Grouping loss decomposition) [Perez-Lebel... 2023]: let P be a partitioning of the feature space X. Then GL(p) = 𝕍[f⋆(X) | f(X) = p] = 𝕍[𝔼[Y | f(X), P(X)] | f(X) = p] (GL_explained(p) ≥ 0) + 𝔼[𝕍[f⋆(X) | f(X), P(X)] | f(X) = p] (GL_residual(p) ≥ 0). 1. Find a partition to minimize GL_residual. 2. Compute the variance on regions. G Varoquaux 37
  43. Creating the partition of the input space [Perez-Lebel... 2023]

    Goal: find a partition P of X minimizing GL_residual, i.e. such that 𝔼[Y | f(X), P(X)] varies. Supervised partitioning (uses X and Y), e.g. decision trees: targets heterogeneity. 1. Train a decision tree on (X₁, Y₁). 2. The tree leaves give the partition P on X. 3. Compute estimates on left-out data (a rough sketch of these steps follows below). Improvements [Melo... 2025]: target 𝔼[Y − c | X]; lower bound + consistent GL estimator. G Varoquaux 38
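    A rough sketch of steps 1–3, not the estimator of [Perez-Lebel... 2023]: a stock decision tree, a single half split, and equal-width bins on f(X) stand in for the paper’s choices.

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        def grouping_loss_explained(f_scores, X, y, n_bins=10, seed=0):
            rng = np.random.default_rng(seed)
            half = rng.permutation(len(y)) < len(y) // 2
            # 1. Train a decision tree on one half of the data.
            tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0)
            tree.fit(X[half], y[half])
            # 2. The tree leaves give the partition P, applied to held-out data.
            leaves = tree.apply(X[~half])
            scores, labels = f_scores[~half], y[~half]
            bins = np.clip((scores * n_bins).astype(int), 0, n_bins - 1)
            # 3. Within each bin of f(X): variance of the per-leaf positive
            #    rates around the bin mean, weighted by leaf occupancy.
            gl = 0.0
            for b in np.unique(bins):
                m = bins == b
                ids = np.unique(leaves[m])
                rates = np.array([labels[m][leaves[m] == l].mean() for l in ids])
                w = np.array([np.mean(leaves[m] == l) for l in ids])
                gl += m.mean() * np.sum(w * (rates - labels[m].mean()) ** 2)
            return gl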
  44. From epistemic error to decisions [Perez-Lebel... 2025]

    [Figure, oracle view: true probability f⋆(X) vs probability estimate f(X), with calibration curve c and grouping loss GL] Epistemic error controls probabilistic predictions: how far to the best possible ℙ(y|X). Decisions are made on top: 𝟙{f(X) > t}. Errors (false detections, misses) incur costs (decision theory). New question: how far are we in terms of cost? “Sub-optimality gap” G Varoquaux 39
  45. Knowing oracle probabilities [Elkan 2001]

    Utility:   Event Ē   Event E
    Action 0:  U00       U01
    Action 1:  U10       U11
    Knowing f⋆ ≜ ℙ(y|X), to maximize expected utility the optimal decision is: take action 1 ⇔ f⋆(X) ≥ t⋆, with t⋆ ≜ (U00 − U10) / (U00 − U10 + U11 − U01). Examples: stop irrigation, avoid surgery, ban user, block payment, call customer, block content. G Varoquaux 40
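    A direct transcription of this rule; the utility values below are made up for illustration.

        def optimal_threshold(u00, u01, u10, u11):
            """Decision threshold maximizing expected utility [Elkan 2001]."""
            return (u00 - u10) / (u00 - u10 + u11 - u01)

        # Toy example: blocking a payment. Rows: action 0 (allow), 1 (block);
        # columns: event E-bar (legitimate), E (fraud). Values illustrative.
        t_star = optimal_threshold(u00=0.0, u01=-10.0, u10=-1.0, u11=0.0)
        print(t_star)  # block when P(fraud | X) >= 1/11, about 0.09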
  46. Sub-optimality of imperfect predictions

    [Figure: f⋆(X) vs f(X)] Plot true probabilities as a function of estimated ones. G Varoquaux 41
  47. Sub-optimality of imperfect predictions

    [Figure: oracle view, f⋆(X) vs f(X)] In practice: estimation errors between them. G Varoquaux 41
  48. Sub-optimality of imperfect predictions

    [Figure: oracle view with threshold t⋆ on f⋆, regions δ = 0 and δ = 1] The optimal decision is to threshold f⋆ at t⋆, allocating actions above and below the corresponding line. G Varoquaux 41
  49. Sub-optimality of imperfect predictions

    [Figure: oracle view with thresholds t on f and t⋆ on f⋆] A candidate decision thresholds f at t, allocating actions left and right of the corresponding line. G Varoquaux 41
  50. Sub-optimality of imperfect predictions

    [Same figure] Combining views shows where the two decisions agree and disagree. G Varoquaux 41
  51. Sub-optimality of imperfect predictions

    [Same figure, disagreement regions marked “Regret”] Regret is incurred where they disagree. G Varoquaux 41
  52. Sub-optimality of imperfect predictions

    [Same figure, with regret scale R(δf,t, x) × U∆ from 0.0 to 0.5] The further the true probability f⋆ is from t⋆, the more regret. G Varoquaux 41
  53. Sub-optimality of imperfect predictions

    Lemma (Pointwise regret) [Perez-Lebel... 2025]: R(δ, x) = U∆ · |f⋆(x) − t⋆| if δ⋆(x) ≠ δ(x), and 0 if δ⋆(x) = δ(x), with U∆ ≜ U00 − U10 + U11 − U01 > 0. [Oracle view] G Varoquaux 41
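    The lemma transcribes to a one-liner; the argument names are mine, and the example reuses the toy utilities from the threshold sketch above (t⋆ = 1/11, U∆ = 11).

        def pointwise_regret(f_star_x, t_star, u_delta, decision, oracle_decision):
            """Pointwise regret of a decision at x [Perez-Lebel... 2025]."""
            if decision == oracle_decision:
                return 0.0
            return u_delta * abs(f_star_x - t_star)

        # Disagreeing with the oracle far from the threshold is costly:
        print(pointwise_regret(0.5, 1 / 11, 11.0, decision=0, oracle_decision=1))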
  54. Sub-optimality of imperfect predictions

    [Same figure] But this plot is an oracle view. G Varoquaux 41
  55. Sub-optimality of imperfect predictions

    [Same figure, without the oracle] And we do not observe the oracle f⋆. G Varoquaux 41
  56. Sub-optimality of imperfect predictions

    [Same figure, with calibration curve c] We have information on the average distance to f⋆ (calibration). Incurs regret R_CL. G Varoquaux 41
  57. Sub-optimality of imperfect predictions

    [Same figure, with calibration curve c and grouping loss GL] We have an estimate of the variance (grouping). Incurs regret R_GL. G Varoquaux 41
  58. Sub-optimality of imperfect predictions

    Sub-optimality gap (regret) [Perez-Lebel... 2025] = calibration regret + grouping regret. Lower and upper bounds on the grouping regret from the spread. [Figure: penalty area around t for a distribution pZ with given mean µ and variance] G Varoquaux 41
  59. Estimating local suboptimality [Melo... 2025]

    Estimates on the parcellation ⇒ input-specific regret, i.e. how much better a model, given the same information, would do in expected costs. G Varoquaux 42
  60. f(X) ?= ℙ(y|X)

    We use partition-based estimates of bias and variance. Bound the cost of hard decisions. [Perez-Lebel... 2023, 2025, Melo... 2025] G Varoquaux 43
  61. f(X) ?= ℙ(y|X)

    We use partition-based estimates of bias and variance. Bound the cost of hard decisions. [Perez-Lebel... 2023, 2025, Melo... 2025] Position relative to conformal prediction? Conditional control. Bounds, not worst-case. Bridge to binary decisions. G Varoquaux 43
  62. Today’s AI systems: Many components, tools, and agents What to

    call? What to trust? The biggest? G Varoquaux 45
  63. The role of small models [Chen and Varoquaux 2024]

    [Figure: monthly downloads on the Hugging Face hub by model size bucket, [0, 200M] to [6B, +∞]; most downloaded per bucket: bert-base-uncased, roberta-large, xlm-roberta-large, Phi-3-mini-4k-instruct, sharegpt4video-8b] Small models matter G Varoquaux 46
  64. The role of small models [Chen and Varoquaux 2024]

    Calling the smallest possible model decreases inference cost. Model routing. Model cascading. G Varoquaux 47
  65. Model cascading

    [Diagram: models of increasing size, → → →] 0. Pick the smallest model. 1. Use the model. 2. See if the answer is good enough; if so, return it. 3. If not, pick the next model (in order of cost). 4. Go to 1. (A sketch of this loop follows below.) [Kag and Fedorov 2023, Chen... 2023, Hu... 2024]. Review in [Chen and Varoquaux 2024]. G Varoquaux 48
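    A minimal sketch of the cascade loop; `ask` and `good_enough` are placeholder interfaces, not an actual API.

        def cascade(query, models, ask, good_enough):
            """Try models from cheapest to most expensive; stop when the
            answer is judged good enough (placeholder interfaces)."""
            for model in models:           # 0./3. models sorted by cost
                answer = ask(model, query)      # 1. use the model
                if good_enough(answer):         # 2. check answer quality
                    return answer               #    if so, return it
            return answer                  # fall back to the biggest model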
  66. Cascading with epistemic [Melo... 2025]

    Epistemic / aleatoric error!? “What shirt is Gaël wearing today?” High aleatoric to the LLM: no use calling a bigger model. “What is the birth date of Einstein?” An LLM should know; if uncertain, call a bigger model. G Varoquaux 49
  67. Cascading with epistemic: Llama 1B → 3B → 8B → 70B [Melo... 2025]

    [Figure: accuracy (0.62–0.74) vs average cost per sample ($) for Llama 1B, 3B, 8B, 70B and a predictive router, two settings labeled =1 and =20] G Varoquaux 49
  68. Cascading with epistemic: Llama 1B → 3B → 8B → 70B [Melo... 2025]

    [Same figure, adding a CL cascade at thresholds t = 0, 0.05, 0.5] G Varoquaux 49
  69. Cascading with epistemic: Llama 1B → 3B → 8B → 70B [Melo... 2025]

    [Same figure, adding a CL + GL cascade at thresholds t = 0, 0.05, 0.5] G Varoquaux 49
  70. Make research on small cool again Once upon a time

    computer science was about making efficient sorting algorithms G Varoquaux 51
  71. Tech is political

    What we choose to work on: technology isn’t neutral. Vehicles: ambulances or tanks? What narratives we propagate: narratives shape societies, and AI research, which is social. Build to choose our future, navigating economic rationales. It’s all about tradeoffs. G Varoquaux 52
  72. Uncertainty in the LLM era

    Local black-box uncertainty control (grouping error) for better AI decisions and orchestration. The maths still matter for AI. We need to be more uncertain on LLM utility: naive overconfidence will fuel bubble dynamics, not economic or societal benefits. Research beyond scale, LLMs, “AGI”. Research where you own your success.
  73. References I

    L. Chen and G. Varoquaux. What is the role of small models in the LLM era: A survey. arXiv preprint arXiv:2409.06857, 2024.
    L. Chen, M. Zaharia, and J. Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint, 2023. URL https://arxiv.org/abs/2305.05176.
    L. Chen, A. Perez-Lebel, F. M. Suchanek, and G. Varoquaux. Reconfidencing LLMs from the grouping loss perspective. EMNLP Findings, 2024.
    C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 2001.
    Q. J. Hu, J. Bieker, X. Li, N. Jiang, B. Keigwin, G. Ranganath, K. Keutzer, and S. K. Upadhyay. RouterBench: A benchmark for multi-LLM routing systems, 2024.
    A. Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv preprint arXiv:2510.04871, 2025.
    A. Kag and I. Fedorov. Efficient edge inference by selective query. In International Conference on Learning Representations, 2023.
  74. References II

    A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
    M. Kull and P. Flach. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 68–85. Springer, 2015.
    S. Melo, G. Varoquaux, and M. Le Morvan. Epistemic uncertainty quantification to improve decisions from black-box models. Working paper or preprint, Dec. 2025. URL https://hal.science/hal-05393027.
    A. Perez-Lebel, M. Le Morvan, and G. Varoquaux. Beyond calibration: estimating the grouping loss of modern neural networks. ICLR, 2023.
    A. Perez-Lebel, G. Varoquaux, S. Koyejo, M. Doutreligne, and M. Le Morvan. Decision from suboptimal classifiers: Excess risk pre- and post-calibration. AISTATS, 2025.
    R. Sutton. The bitter lesson. Incomplete Ideas (blog), 13(1), 2019.
  75. References III

    G. Varoquaux, S. Luccioni, and M. Whittaker. Hype, sustainability, and the price of the bigger-is-better paradigm in AI. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pages 61–75, 2025.