Evaluating LLM Applications is hard

Anyscale
December 07, 2023

A specter is haunting generative AI: the specter of evaluation. In a rush of excitement at the new capabilities provided by open and proprietary foundation models, it seems everyone from homebrew hackers to engineering teams at NASDAQ companies has shipped products and features based on those capabilities. But how do we know whether those products and features are good? And how do we know whether our changes make them better? I will share some case studies and experiences on just how hard this problem is – from the engineering, product, and business perspectives – and a bit about what is to be done.


Transcript

  1. tl;dr: Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (impatient folks can leave after this slide)
  2. Evaluation Matters. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (it’s a critical differentiator)
  3. No One Does Evals Well. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (everyone agrees that everyone else is doing it wrong)
  4. “Ground Truth” Can’t Save You. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (and it’s not always possible)
  5. This is how we train models: take gradients of eval metrics. Fill-in-the-blank example (real predictions from huggingface.co/deepset/gbert-base): “Du sollst keine Maschine nach dem ____ eines menschlichen Geistes machen.” (“Thou shalt not make a machine in the likeness of a human mind.”) A runnable sketch of this example follows the transcript.
     Candidate     model output (μ)    target (y)
     Muster        0.455               0
     Vorbild       0.249               1
     Willen        0.057               0
     Geschmack     0.031               0
     Modell        0.024               0
  6. Annotators Are Flawed. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (good help is hard to find)
  7. This is also part of training for most models (arxiv.org/abs/2203.02155): ground-truth fine-tuning on top of pretraining, then learning from annotator feedback. A sketch of that paper’s pairwise preference loss follows the transcript.
  8. The more expensive the task we are trying to automate, the more expensive annotation will be.
  9. LLMs Are Flawed. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (yo dawg, I heard you like LLMs & eval, so I put eval LLMs in yr LLM eval; a minimal LLM-as-judge sketch follows the transcript)
  10. We’re “cramming more cognition into integrated circuits”. This is the most important long-term trend. Plan for it! (slide layout annotations: “2¢”, “3x”, “100x”, attributed to “Moore, 1965” and “Me, rn”; the title riffs on Moore’s 1965 paper “Cramming more components onto integrated circuits”)
  11. User Behavior: Best of Bad. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Behavior is Hard. All We Can Do is Our Best. (never trust your users)
  12. Just Do Your Best. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (best isn’t good but it’s better)
  13. tl;dr: Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best.
  14. Thanks! (and good luck, you’re gonna need it) @charles_irl on Twitter + Discord (DMs open). HMU if you’re interested in open source approaches to educating engineers about building with AI. Link to slides: bit.ly/llm-evals-12-06
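
The fill-in-the-blank predictions on slide 5 come from the public deepset/gbert-base checkpoint on Hugging Face. Below is a minimal sketch of how to reproduce that kind of output with the `transformers` fill-mask pipeline; it is an illustration assuming the library and checkpoint are available, not the exact code behind the slide, and the probabilities may differ slightly across library versions.

```python
# Sketch: reproduce the slide-5 fill-in-the-blank example with a Hugging Face pipeline.
# Assumes `pip install transformers torch`; deepset/gbert-base is a German BERT checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="deepset/gbert-base")

# "Thou shalt not make a machine in the likeness of a human mind."
prompt = "Du sollst keine Maschine nach dem [MASK] eines menschlichen Geistes machen."

for prediction in fill_mask(prompt, top_k=5):
    # Each entry carries a candidate token and its probability -- the model output (μ)
    # that training compares against the one-hot target (y) via a cross-entropy loss.
    print(f"{prediction['token_str']:>10s}  {prediction['score']:.3f}")
```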
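Slide 7 cites the InstructGPT paper (arxiv.org/abs/2203.02155), where annotator feedback enters training through a reward model fit on pairwise comparisons: the preferred completion should score higher than the rejected one. The PyTorch sketch below shows that pairwise preference loss; the function name and the dummy scalar rewards are illustrative stand-ins, not code or data from the paper.

```python
# Sketch: the pairwise preference loss used for reward modeling in InstructGPT
# (arxiv.org/abs/2203.02155). Rewards here are dummy scalars standing in for the
# outputs of a reward model scoring (prompt, completion) pairs.
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over a batch of comparisons."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Two comparisons: in the second one, the model currently prefers the rejected answer,
# so the loss (and its gradient) pushes the reward model toward the annotator's choice.
loss = preference_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
print(loss.item())
```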
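Slide 9’s joke (“eval LLMs in yr LLM eval”) refers to the LLM-as-judge pattern: asking one model to grade another’s output. The sketch below is a generic illustration assuming the `openai` Python client (v1+); the judge model name, rubric, and 1-5 scale are placeholder choices, not recommendations from the talk. The judge’s verdicts inherit that model’s own flaws (position bias, verbosity bias, self-preference), which is exactly why the slide calls this kind of evaluation hard.

```python
# Sketch: LLM-as-judge scoring. Assumes the `openai` Python client (v1+) and an API key
# in the environment; the model name and rubric below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> str:
    """Ask a judge model to grade an answer on a 1-5 scale; returns the raw verdict text."""
    rubric = (
        "Rate the answer to the question on a 1-5 scale for factual accuracy "
        "and helpfulness. Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative judge model
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        temperature=0,  # make the judge as repeatable as the API allows
    )
    return response.choices[0].message.content.strip()

print(judge("What does RLHF stand for?", "Reinforcement learning from human feedback."))
```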