Slide 18
Evaluation
• Given a “ground truth” dataset of source documents and question/answer pairs, how do we
evaluate a document Q&A system?
• Human evaluation:
• Absolute rating: “Is the content of the generated text correct / equivalent to the reference
answer?” → Accuracy
• Relative rating: “Which of the two answers is more accurate / more similar to the reference?”
→ Ranking, Elo (rating-update sketch below)
• Issues: Does not scale, annotators need domain knowledge, low inter-annotator agreement, …
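A minimal sketch of how relative ratings can be turned into Elo scores: each pairwise “which answer is better?” judgment updates the ratings of the two systems being compared. The system names, starting rating of 1000, and K-factor of 32 are illustrative assumptions, not values from the slide.

```python
# Elo updates from pairwise comparisons (illustrative sketch).

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of system A against system B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a = 1.0 if A was judged better, 0.0 if B was better, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Hypothetical example: "system_a" wins one comparison against "system_b".
ratings = {"system_a": 1000.0, "system_b": 1000.0}
ratings["system_a"], ratings["system_b"] = update_elo(
    ratings["system_a"], ratings["system_b"], score_a=1.0
)
print(ratings)
```

Iterating this update over many judged pairs yields a ranking of the Q&A systems, with rating differences reflecting how often one system’s answers are preferred over another’s.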
• Evaluation by LLM:
• Same questions as before, but an LLM replaces the human annotator (prompt sketch below).
• Stronger model assesses weaker model; GPT-4 generally acknowledged as “strongest”.
• Issues: Unknown bias of GPT-4, token costs per test run, LLM drift (Chen et al., 2023)
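A minimal sketch of LLM-as-judge for the absolute-rating setting, assuming the OpenAI Python SDK: a stronger model is asked whether a candidate answer is equivalent to the reference. The prompt wording, the 1–5 scale, and the use of "gpt-4" as judge are illustrative assumptions.

```python
# LLM-as-judge sketch: a stronger model rates a candidate answer against the reference.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Is the candidate answer correct / equivalent to the reference?
Reply with a single digit from 1 (completely wrong) to 5 (fully equivalent)."""

def judge(question: str, reference: str, candidate: str, model: str = "gpt-4") -> int:
    """Return the judge model's 1-5 rating for one question/answer pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judgments, as far as the API allows
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return int(response.choices[0].message.content.strip()[0])
```

Averaging these scores over the ground-truth set gives an automatic accuracy proxy, but the issues listed above still apply: the judge model’s bias is unknown, each test run costs tokens, and drift in the judge model over time (Chen et al., 2023) can shift scores between runs.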