Slide 24
Human judge
→ Ask humans to score model generations on specific
properties (accuracy, relevance, toxicity, etc.).
Examples: vibe checks, Chatbot Arena, data annotation
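A minimal sketch of how such human scores might be aggregated once collected. The data, the 1-5 scale, and the names `ratings` and `aggregate_scores` are hypothetical and chosen only for illustration; real annotation pipelines typically also track inter-annotator agreement.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotations: (model, property, annotator, score on an assumed 1-5 scale).
ratings = [
    ("model_a", "accuracy", "ann1", 4),
    ("model_a", "accuracy", "ann2", 5),
    ("model_a", "relevance", "ann1", 3),
    ("model_b", "accuracy", "ann1", 2),
    ("model_b", "accuracy", "ann2", 3),
    ("model_b", "relevance", "ann1", 4),
]


def aggregate_scores(rows):
    """Average the human scores for each (model, property) pair."""
    buckets = defaultdict(list)
    for model, prop, _annotator, score in rows:
        buckets[(model, prop)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


summary = aggregate_scores(ratings)
for (model, prop), avg in sorted(summary.items()):
    print(f"{model:8s} {prop:10s} {avg:.2f}")
```

Averaging is the simplest choice here; with few annotators per item, a single biased rater can shift the result noticeably, which ties into the limitations listed on this slide.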
Advantages
✅ High flexibility
✅ No data contamination risk
✅ Direct human preferences
Limitations
❌ Costly & time-consuming
❌ Biased (tone, first impression, etc.)
❌ Limited scalability
Clémentine Fourrier and The Hugging Face Community, "LLM Evaluation Guidebook," 2024.
Chatbot Arena by LMSYS