My session at OSCAFest (Open Source Community Africa Festival) 2025.
OSCAFest is one of Africa’s largest gatherings of developers, designers and tech professionals.
There are hundreds of large language models (LLMs) out in the wild, and the list keeps on growing.
Every other month, the big players (Google, OpenAI, Anthropic, etc.) release shiny new models along with their benchmark scores (MMLU, GPQA, etc.)
...but what do these cryptic benchmarks really mean? ...and how can one tell which model is best for their $100m "AI" app use case?
(Views are mine, and do not reflect those of my employer.)