measure the surface similarity between the reference and the candidate.
◦ correlate poorly with human judgment
• Metrics based on learned components
◦ High correlation with human judgment
◦ Fully learned metrics (BEER, RUSE, ESIM)
▪ are trained end-to-end, and rely on handcrafted features and/or learned embeddings
▪ offer great expressivity
◦ Hybrid metrics (YiSi, BERTScore)
▪ combine trained elements, e.g., contextual embeddings, with handwritten logic, e.g., token alignment rules
▪ offer robustness
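To make the hybrid idea concrete, here is a minimal sketch of a handwritten alignment rule applied on top of embeddings: each token is greedily matched to its most similar counterpart by cosine similarity, in the spirit of BERTScore. The function names and the toy 2-d vectors are illustrative assumptions. In a real metric the vectors would come from a trained contextual encoder, and this is not the reference implementation of any of the metrics named above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_alignment_f1(cand_vecs, ref_vecs):
    """Greedy-matching F1 over token embeddings (hypothetical helper).

    Precision: each candidate token is aligned to its most similar reference token.
    Recall: each reference token is aligned to its most similar candidate token.
    """
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy embeddings standing in for contextual embeddings (assumption).
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]   # two reference tokens
cand_full = [[1.0, 0.0], [0.0, 1.0]]  # candidate covering both tokens
cand_partial = [[1.0, 0.0]]           # candidate covering only the first token

score_full = greedy_alignment_f1(cand_full, ref_vecs)
score_partial = greedy_alignment_f1(cand_partial, ref_vecs)
```

The learned part (the embeddings) supplies the notion of similarity, while the fixed max-and-average rule needs no training, which is one plausible reading of why hybrid metrics are described as robust.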