Evaluating LLM Applications is hard

Anyscale
December 07, 2023

A specter is haunting generative AI: the specter of evaluation. In a rush of excitement at the new capabilities provided by open and proprietary foundation models, it seems everyone from homebrew hackers to engineering teams at NASDAQ companies has shipped products and features based on those capabilities. But how do we know whether those products and features are good? And how do we know whether our changes make them better? I will share some case studies and experiences on just how hard this problem is – from the engineering, product, and business perspectives – and a bit about what is to be done.


Transcript

  1. tl;dr: Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (impatient folks can leave after this slide)
  2. Evaluation Matters. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (it’s a critical differentiator)
  3. No One Does Evals Well. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (everyone agrees that everyone else is doing it wrong)
  4. “Ground Truth” Can’t Save You. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (and it’s not always possible)
  5. This is how we train models: take gradients of eval metrics. Fill-in-the-blank example (real predictions from huggingface.co/deepset/gbert-base): “Du sollst keine Maschine nach dem ____ eines menschlichen Geistes machen.” (“Thou shalt not make a machine in the likeness of a human mind.”) A runnable sketch of this example follows the transcript.
     Candidate     model output (μ)    target (y)
     Muster        0.455               0
     Vorbild       0.249               1
     Willen        0.057               0
     Geschmack     0.031               0
     Modell        0.024               0
  6. Annotators Are Flawed. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (good help is hard to find)
  7. This is also part of training for most models (arxiv.org/abs/2203.02155): ground-truth fine-tuning on top of pretraining, then learning from annotator feedback. A sketch of that paper’s pairwise preference loss follows the transcript.
  8. The more expensive the task we are trying to automate, the more expensive annotation will be.
  9. LLMs Are Flawed. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (yo dawg, I heard you like LLMs & eval, so I put eval LLMs in yr LLM eval; a minimal LLM-as-judge sketch follows the transcript)
  10. We’re “cramming more cognition into integrated circuits”. This is the most important long-term trend. Plan for it! (slide layout annotations: “2¢”, “3x”, “100x”, attributed to “Moore, 1965” and “Me, rn”; the title riffs on Moore’s 1965 paper “Cramming more components onto integrated circuits”)
  11. User Behavior: Best of Bad. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Behavior is Hard. All We Can Do is Our Best. (never trust your users)
  12. Just Do Your Best. Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best. (best isn’t good but it’s better)
  13. tl;dr: Evaluation Matters. No One Evaluates LLMs Well. Evaluating with Ground Truth is Hard. Evaluating with Annotators is Hard. Evaluating with LLMs is Hard. Evaluating with User Preferences is Hard. All We Can Do is Our Best.
  14. Thanks! (and good luck, you’re gonna need it) @charles_irl on Twitter + Discord (DMs open). HMU if you’re interested in open source approaches to educating engineers about building with AI. Link to slides: bit.ly/llm-evals-12-06
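
The fill-in-the-blank predictions on slide 5 come from the public deepset/gbert-base checkpoint on Hugging Face. Below is a minimal sketch of how to reproduce that kind of output with the `transformers` fill-mask pipeline; it is an illustration assuming the library and checkpoint are available, not the exact code behind the slide, and the probabilities may differ slightly across library versions.

```python
# Sketch: reproduce the slide-5 fill-in-the-blank example with a Hugging Face pipeline.
# Assumes `pip install transformers torch`; deepset/gbert-base is a German BERT checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="deepset/gbert-base")

# "Thou shalt not make a machine in the likeness of a human mind."
prompt = "Du sollst keine Maschine nach dem [MASK] eines menschlichen Geistes machen."

for prediction in fill_mask(prompt, top_k=5):
    # Each entry carries a candidate token and its probability -- the model output (μ)
    # that training compares against the one-hot target (y) via a cross-entropy loss.
    print(f"{prediction['token_str']:>10s}  {prediction['score']:.3f}")
```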
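Slide 7 cites the InstructGPT paper (arxiv.org/abs/2203.02155), where annotator feedback enters training through a reward model fit on pairwise comparisons: the preferred completion should score higher than the rejected one. The PyTorch sketch below shows that pairwise preference loss; the function name and the dummy scalar rewards are illustrative stand-ins, not code or data from the paper.

```python
# Sketch: the pairwise preference loss used for reward modeling in InstructGPT
# (arxiv.org/abs/2203.02155). Rewards here are dummy scalars standing in for the
# outputs of a reward model scoring (prompt, completion) pairs.
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over a batch of comparisons."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Two comparisons: in the second one, the model currently prefers the rejected answer,
# so the loss (and its gradient) pushes the reward model toward the annotator's choice.
loss = preference_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
print(loss.item())
```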
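Slide 9’s joke (“eval LLMs in yr LLM eval”) refers to the LLM-as-judge pattern: asking one model to grade another’s output. The sketch below is a generic illustration assuming the `openai` Python client (v1+); the judge model name, rubric, and 1-5 scale are placeholder choices, not recommendations from the talk. The judge’s verdicts inherit that model’s own flaws (position bias, verbosity bias, self-preference), which is exactly why the slide calls this kind of evaluation hard.

```python
# Sketch: LLM-as-judge scoring. Assumes the `openai` Python client (v1+) and an API key
# in the environment; the model name and rubric below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> str:
    """Ask a judge model to grade an answer on a 1-5 scale; returns the raw verdict text."""
    rubric = (
        "Rate the answer to the question on a 1-5 scale for factual accuracy "
        "and helpfulness. Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative judge model
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        temperature=0,  # make the judge as repeatable as the API allows
    )
    return response.choices[0].message.content.strip()

print(judge("What does RLHF stand for?", "Reinforcement learning from human feedback."))
```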