Slide 1

Slide 1 text

Evaluating LLM Applications is Hard. (sorry) bit.ly/llm-evals-12-06 (link to slides)

Slide 2

Slide 2 text

tl;dr Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (impatient folks can leave after this slide)

Slide 3

Slide 3 text

Evaluation Matters. (it’s a critical differentiator)

Slide 4

Slide 4 text

sequoiacap.com/article/generative-ai-act-two/

Slide 5

Slide 5 text

🔥 eugeneyan.com/writing/llm-patterns/ Diagram by Josh Tobin

Slide 6

Slide 6 text

No One Does Evals Well. (everyone agrees that everyone else is doing it wrong)

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

anthropic.com/index/claude-2-1

Slide 9

Slide 9 text

reddit.com/r/ClaudeAI reddit.com/r/LocalLLaMA/comments/180p17f

Slide 10

Slide 10 text

youtu.be/U9mJuUkhUzk?t=857

Slide 11

Slide 11 text

reddit.com/r/ChatGPT reddit.com/r/ChatGPT/comments/187tpcs

Slide 12

Slide 12 text

“Okay, you guys figure it out then!”

Slide 13

Slide 13 text

“Ground Truth” Can’t Save You. (and it’s not always possible)

Slide 14

Slide 14 text

fullstackdeeplearning.com/llm-bootcamp/spring-2023/llmops
(metrics: accuracy, F1, perplexity, BLEU, ROUGE, cosine similarity)
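
To ground one of these reference-based metrics, here is a minimal sketch of token-level F1 between a model answer and a ground-truth reference; the example strings are invented, and real implementations (e.g. SQuAD's) add normalization and multiset overlap.

```python
# Minimal sketch: token-level F1 between prediction and reference.
# Real harnesses normalize punctuation/articles and use multiset overlap.
def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum(1 for t in pred if t in ref)  # rough overlap count
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the cat sat", "a cat sat down"), 2))  # -> 0.57
```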

Slide 15

Slide 15 text

This is how we train models: take gradients of eval metrics. (real predictions from huggingface.co/deepset/gbert-base)

Fill-in-the-Blank: “Du sollst keine Maschine nach dem ____ eines menschlichen Geistes machen.” (“You shall not make a machine in the ____ of a human mind.”)

candidate    model output (μ)    target (y)
Muster       0.455               0
Vorbild      0.249               1
Willen       0.057               0
Geschmack    0.031               0
Modell       0.024               0
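
A minimal sketch of the point, using the numbers on this slide: the training loss is the cross-entropy between the model's predicted distribution μ and the one-hot target y, and the gradient of that eval metric with respect to the weights is what updates the model (here we just compute the scalar loss).

```python
import math

# Values straight from the slide (deepset/gbert-base fill-in-the-blank).
candidates = ["Muster", "Vorbild", "Willen", "Geschmack", "Modell"]
mu = [0.455, 0.249, 0.057, 0.031, 0.024]  # model output over candidates
y = [0, 1, 0, 0, 0]                       # one-hot target: "Vorbild"

# Cross-entropy H(y, mu) = -sum_i y_i * log(mu_i). During training, the
# gradient of this loss w.r.t. the model's weights drives the update.
loss = -sum(t * math.log(p) for t, p in zip(y, mu) if t > 0)
print(f"cross-entropy loss: {loss:.3f}")  # -log(0.249) ≈ 1.390
```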

Slide 16

Slide 16 text

It’s also part of some popular benchmarks, like MMLU.

Slide 17

Slide 17 text

huggingface.co/blog/evaluating-mmlu-leaderboard But simplicity can be deceiving.
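
The linked post's point is that different harnesses score MMLU in materially different ways, and the leaderboard numbers move with the choice. A hedged sketch of one common variant, which ranks the answer letters by model log-probability; `logprob_of` is a hypothetical stand-in for a real model call.

```python
# One scoring variant (of several!): compare the log-probabilities the
# model assigns to the answer letters, not to the full answer texts.
# `logprob_of(prompt, continuation)` is a hypothetical model wrapper.
def grade(question: str, choices: list[str], answer_idx: int, logprob_of) -> bool:
    letters = ["A", "B", "C", "D"]
    prompt = (
        question
        + "\n"
        + "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
        + "\nAnswer:"
    )
    scores = [logprob_of(prompt, " " + l) for l in letters]
    return scores.index(max(scores)) == answer_idx
```

Scoring the full answer string instead of the letter, another harness's choice, can flip which model "wins".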

Slide 18

Slide 18 text

And ground truth isn’t always true. pubmedqa.github.io

Slide 19

Slide 19 text

twitter.com/mcxfrank/status/1643296199682961408 Learn from experimental psychology and cognitive science!

Slide 20

Slide 20 text

🔥 www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/#/12 This is not just theoretical!

Slide 21

Slide 21 text

fullstackdeeplearning.com/llm-bootcamp/spring-2023/llmops

Slide 22

Slide 22 text

🔥 arxiv.org/abs/2301.01751

Slide 23

Slide 23 text

Annotators Are Flawed. (good help is hard to find)

Slide 24

Slide 24 text

This is also part of training for most models: ground-truth fine-tuning on top of pretraining, then learning from human feedback. arxiv.org/abs/2203.02155
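
For the "learning from feedback" stage, the InstructGPT paper linked above trains a reward model on pairwise human comparisons; a minimal sketch of its preference loss, with invented scalar rewards standing in for reward-model outputs.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    # InstructGPT's pairwise loss: -log(sigmoid(r_chosen - r_rejected)).
    # In practice r_* are reward-model scores for two completions that a
    # human annotator ranked; the values below are made up.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(round(preference_loss(1.3, 0.2), 3))  # -> 0.288; shrinks as the gap grows
```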

Slide 25

Slide 25 text

Humans are lazy and have cognitive biases.

Slide 26

Slide 26 text

chat.lmsys.org/?arena

Slide 27

Slide 27 text

Fake screenshot, modified from Figure 1 of arxiv.org/abs/2305.15717

Slide 28

Slide 28 text

Fake screenshot, modified from Figure 1 of arxiv.org/abs/2305.15717

Slide 29

Slide 29 text

Fake screenshot, modified from Figure 1 of arxiv.org/abs/2305.15717

Slide 30

Slide 30 text

arxiv.org/abs/2305.15717

Slide 31

Slide 31 text

twitter.com/davidad/status/1663863725319770114

Slide 32

Slide 32 text

The more expensive the task we are trying to automate, the more expensive annotation will be.

Slide 33

Slide 33 text

LLMs Are Flawed. (yo dawg i heard you like LLMs & eval, so i put eval LLMs in yr LLM eval)

Slide 34

Slide 34 text

GPT-3.5 is (roughly) a median crowd-worker. arxiv.org/abs/2303.15056 So let’s just use models instead of annotators!
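
A hedged sketch of what "models instead of annotators" looks like in practice: an LLM-as-judge grading an answer PASS/FAIL. `call_model` is a hypothetical client function, and, as the next slides argue, the judge inherits its own biases.

```python
# Minimal LLM-as-judge sketch. `call_model(prompt) -> str` is a
# hypothetical wrapper around whatever completion API you use.
def llm_judge(question: str, answer: str, call_model) -> bool:
    prompt = (
        "Grade the answer to the question as PASS or FAIL.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with exactly one word: PASS or FAIL."
    )
    return call_model(prompt).strip().upper() == "PASS"
```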

Slide 35

Slide 35 text

We’re “cramming more cognition into integrated circuits”. This is the most important long-term trend. Plan for it!
[chart annotations: “3x”, “2¢” (Moore, 1965); “100x” (Me, rn)]

Slide 36

Slide 36 text

LLMs are lazy and have cognitive biases.

Slide 37

Slide 37 text

arxiv.org/abs/2207.07051 arxiv.org/abs/2305.17926 arxiv.org/abs/2309.17012

Slide 38

Slide 38 text

But wait! How can we use LLMs as verifiers for LLMs?

Slide 39

Slide 39 text

We do the same with software 1.0. Tests are simple and trusted.
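
A toy contrast with invented functions: in software 1.0 the test is far simpler than the code it checks, which is exactly why we trust it; an LLM checking an LLM rarely has that property.

```python
# Software 1.0: the checker is simpler than the thing checked.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

def test_slugify():
    assert slugify("Evaluating LLMs") == "evaluating-llms"

# An LLM "test" of an LLM inverts this: the verifier is as complex,
# and as fallible, as the system under test.
```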

Slide 40

Slide 40 text

twitter.com/mipsytipsy/status/1706737263357706568 But that’s harder with LLMs than with software 1.0.

Slide 41

Slide 41 text

fullstackdeeplearning.com Model improvements drive model improvements.

Slide 42

Slide 42 text

User Behavior: Best of Bad. (never trust your users)

Slide 43

Slide 43 text

twitter.com/gdb/status/1665851244915613700

Slide 44

Slide 44 text

But your users now are not your users forever.

Slide 45

Slide 45 text

Be careful what you optimize for! Imagine A/B testing this form against email signup rates.

Slide 46

Slide 46 text

Be careful what you optimize for! arxiv.org/abs/2303.06135

Slide 47

Slide 47 text

twitter.com/emollick/status/1730608737898176637 Beware the McNamara Fallacy.

Slide 48

Slide 48 text

Just Do Your Best. (best isn’t good but it’s better)

Slide 49

Slide 49 text

Get to observability.
- Adapt existing o11y solutions
- Embrace “LLMOps” tooling
- Roll your own with FOSS
- Use existing MLOps tooling
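
In the "roll your own with FOSS" spirit, a minimal sketch of step one of observability: wrap every model call and emit a structured event you can query later. `call_model` is a hypothetical client; in production you'd ship the event to your o11y backend instead of printing it.

```python
import json
import time
import uuid

# Log every LLM call as a structured event: id, timestamp, prompt,
# response, latency. User feedback (thumbs up/down) can join on `id`.
def logged_completion(prompt: str, call_model) -> str:
    event = {"id": str(uuid.uuid4()), "ts": time.time(), "prompt": prompt}
    start = time.perf_counter()
    try:
        event["response"] = call_model(prompt)
        return event["response"]
    finally:
        event["latency_s"] = round(time.perf_counter() - start, 3)
        print(json.dumps(event))  # swap for your o11y backend
```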

Slide 50

Slide 50 text

Measure actual end goals, like retention and activation. 🔥 honeycomb.io/blog/we-shipped-ai-product
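
To make "measure actual end goals" concrete, a tiny sketch of week-1 retention over invented (user_id, days_since_signup) usage events.

```python
# Invented usage events: (user_id, days_since_signup).
events = {("u1", 0), ("u1", 7), ("u2", 0), ("u3", 0), ("u3", 6)}

cohort = {u for u, d in events if d == 0}               # signed up on day 0
retained = {u for u, d in events if u in cohort and 1 <= d <= 7}
print(f"week-1 retention: {len(retained) / len(cohort):.0%}")  # -> 67%
```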

Slide 51

Slide 51 text

Optimize 1 metric, satisfice N-1. fullstackdeeplearning.com/spring2021/lecture-5/#5-metrics 🔥 Herbert Simon, satisficer
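
A minimal sketch of Simon's idea applied to model selection, with invented eval results: satisfice the N-1 guardrail metrics with hard thresholds, then optimize the single metric you care about.

```python
# Invented eval results for three candidate configurations.
candidates = [
    {"name": "A", "quality": 0.81, "latency_s": 1.2, "cost_usd": 0.004},
    {"name": "B", "quality": 0.85, "latency_s": 3.9, "cost_usd": 0.009},
    {"name": "C", "quality": 0.78, "latency_s": 0.8, "cost_usd": 0.002},
]

# Satisfice N-1 metrics (latency, cost) ...
feasible = [c for c in candidates if c["latency_s"] <= 2.0 and c["cost_usd"] <= 0.005]
# ... then optimize the one that matters (quality).
best = max(feasible, key=lambda c: c["quality"])
print(best["name"])  # -> "A": B wins on quality but blows the latency budget
```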

Slide 52

Slide 52 text

“Look at the data” is genchi genbutsu for ML.

Slide 53

Slide 53 text

linkedin.com/pulse/what-genchi-genbutsu-erik-vaal-l-i-o-n-/ Learn lessons from the factory floor, not just IT.

Slide 54

Slide 54 text

tl;dr Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best.

Slide 55

Slide 55 text

Thanks! (and good luck, you’re gonna need it)
@charles_irl on Twitter + Discord (DMs open). HMU if you’re interested in open source approaches to educating engineers about building with AI.
bit.ly/llm-evals-12-06 (link to slides)