
My LLM is better than yours.

My session at OSCAFest (Open Source Community Africa) 2025.
OSCAFest is one of Africa’s largest gatherings of developers, designers and tech professionals.

There are hundreds of large language models (LLMs) out in the wild, and the list keeps on growing.
Every other month, the big players (Google, OpenAI, Anthropic, etc.) release shiny new models, along with their benchmark scores (MMLU, GPQA, etc.).

...but what do these cryptic benchmarks really mean? ...and how can one tell which model is best for their $100m "AI" app use case?

(Views are mine, and do not reflect those of my employer.)

Wale Olowonyo

December 06, 2025

Transcript

  1. OpenAI: GPT-5 (regular, nano, mini), GPT-oss (20b, 120b), GPT-4o, GPT-4.5, o1/o3/o4 (pro, mini), GPT-4.1 (regular, mini)

     Google: Gemini 2.5 (Pro, Flash, Flash-lite), Gemini 2.0 (Pro, Flash-lite, Flash-preview), Gemma 3 (1B, 4B, 12B, 27B)
     Meta: Llama 3, Llama 4 (Scout, Maverick, Behemoth)
     IBM (Yes, IBM): Granite series
     Hugging Face: Thousands of open models (Qwen, Deepseek, Falcon, etc.)
  2. The wheel of innovation keeps spinning

     • We now have powerful multimodal LMs
     • Thousands of LLMs are out there
     • New models are being released almost every month
  3. ???

  4. Benchmarks

     Humanity's Last Exam: 2,500+ expert-curated problems in 100+ domains
     MMLU (Massive Multitask Language Understanding): 15k MCQs across 57 subjects (law, chemistry, etc.)
     GPQA (Graduate-Level Google-Proof Q&A): 448 questions crafted by domain experts in biology, physics, and chemistry
     ARC-AGI 1/2 (Abstraction and Reasoning Corpus for Artificial General Intelligence): abstract puzzles that measure logical thinking and pattern recognition
     GSM-8K (Grade School Math 8K): 8,500 grade-school math riddles
     LiveCodeBench: fresh LeetCode-style coding challenges
     SimpleBench: a basic reasoning test where an average person outsmarts top LLMs
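Most of the benchmarks above boil down to the same mechanic: accuracy over a fixed set of questions. As a toy illustration only (the questions, options, and grading rule below are invented for this sketch, not taken from any real benchmark harness), MCQ-style scoring looks roughly like this:

```python
# Toy sketch of how MCQ-style benchmarks (MMLU, GPQA, etc.) report a score:
# exact-match accuracy over a fixed question set. Everything here is made up.
QUESTIONS = [
    {"question": "Which gas makes up most of Earth's atmosphere? A) Oxygen B) Nitrogen C) CO2 D) Argon",
     "answer": "B"},
    {"question": "2 + 2 * 3 = ? A) 12 B) 10 C) 8 D) 6",
     "answer": "C"},
]

def score(model_answers: list[str]) -> float:
    """Fraction of questions where the model picked the correct letter."""
    correct = sum(
        1
        for q, a in zip(QUESTIONS, model_answers)
        if a.strip().upper() == q["answer"]
    )
    return correct / len(QUESTIONS)

# A model that answers ["B", "D"] gets 1 of 2 right -> reported as "50%".
print(f"{score(['B', 'D']):.0%}")
```

A static question list like this is also exactly what makes contamination possible: once the questions and answers leak onto the web, a model can score well by recall alone, which is the issue the next slides get into.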
  5. The issue with benchmarks: data contamination

     …when an LLM is trained on data suspiciously similar to benchmark questions.
  6. Data contamination

     Most LLM benchmarks are static, which makes them prone to contamination. LLMs are trained on datasets scraped from the web, which include pages carrying answers to the very questions used to test the models. How can we evaluate them if they've studied the answers before the test?
  7. LiveCodeBench

     A holistic and contamination-free evaluation benchmark of LLMs for code that continuously collects new problems over time, drawn from periodic contests on LeetCode, AtCoder, and Codeforces.
  8. "When a measure becomes a target, it ceases to be a good measure."

     Goodhart's Law - Charles Goodhart
  9. Benchmark scores and leaderboard metrics look impressive, but production-grade LLM applications require evals that reflect real-world performance: reasoning quality, consistency, hallucinations, prompt injections, etc.
  10. Use benchmarks as a filter: use them to create a shortlist of models worth testing, but don't let them choose your final pick.

      Build your own tests: they'll frankly tell you more about your particular tasks than most public benchmarks will (see the sketch after this slide).

      Read the fine print: ask yourself, "Who made this benchmark?" and "What are they really trying to test?" Is it true reasoning beyond the knowledge an AI has, or simply the recall of a few facts that, when put together, simulates reasoning?
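A minimal sketch of the "build your own tests" idea, assuming nothing about any particular provider: the eval cases, the checks, and the `ask_model` callable below are hypothetical placeholders you would replace with your own product's tasks and your own API call.

```python
# Minimal, provider-agnostic eval harness sketch. The cases and checks are
# hypothetical; swap in prompts and assertions from your own use case.
from typing import Callable

EVAL_CASES = [
    {
        "prompt": "Extract the ISO currency code from: 'Total due: 1,250.00 NGN'",
        "check": lambda answer: "NGN" in answer,
    },
    {
        "prompt": "Summarise in one sentence: 'Our refund policy allows returns within 30 days.'",
        "check": lambda answer: "30" in answer and answer.count(".") <= 1,
    },
]

def run_evals(ask_model: Callable[[str], str]) -> float:
    """Run every case against one model and return the pass rate."""
    passed = 0
    for case in EVAL_CASES:
        answer = ask_model(case["prompt"])
        if case["check"](answer):
            passed += 1
        else:
            print(f"FAILED: {case['prompt']!r} -> {answer!r}")
    return passed / len(EVAL_CASES)

if __name__ == "__main__":
    def fake_model(prompt: str) -> str:
        # Canned stand-in so the sketch runs without an API key;
        # replace with a real call to whichever model you are shortlisting.
        return "NGN" if "currency" in prompt else "Returns are accepted within 30 days."

    print(f"pass rate: {run_evals(fake_model):.0%}")
```

Running the same handful of cases against each shortlisted model gives a like-for-like comparison on your own tasks, which is the point of the slide: benchmarks narrow the field, your evals pick the winner.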
  11. Other things to look out for when selecting a language model:

      • Context window
      • Rate limiting
      • Pricing
  12. Some other things to look out for

      Having a great prompt: great prompts drive better results. Guide the model step-by-step (1-2-3) to improve clarity and accuracy.
      Context window: understand how much the model can "remember" in one go. Structure information so key details stay in range.
      Pricing: maximize value for your budget.
      Guardrails: policy-driven rules that reject off-topic queries (rough sketch below).
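A rough sketch of the guardrails point, assuming a simple keyword-based policy: the allowed topics, the matching rule, and the refusal message below are invented for illustration, and production guardrails are usually more sophisticated (classifier- or policy-engine-based).

```python
# Rough sketch of a policy-driven guardrail that rejects off-topic queries
# before they ever reach the model. Topics and refusal text are hypothetical.
ALLOWED_TOPICS = {"billing", "refund", "invoice", "subscription"}

def guardrail(user_query: str) -> str | None:
    """Return a refusal message for off-topic queries, or None to let the query through."""
    words = {word.strip(".,?!").lower() for word in user_query.split()}
    if words & ALLOWED_TOPICS:
        return None  # on-topic: pass the query on to the LLM
    return "Sorry, I can only help with billing and subscription questions."

print(guardrail("How do I get a refund for last month's invoice?"))  # None -> allowed
print(guardrail("Write me a poem about the ocean."))                 # refusal message
```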