Slide 21
Appendix
• GSM8K (Grade School Math 8K): A dataset of multi-step grade-school math word problems, testing quantitative reasoning.
• MATH: A benchmark of challenging problems drawn from high school math competitions, requiring advanced reasoning in topics such as algebra, geometry, number theory, and precalculus.
• GPQA (Graduate-Level Google-Proof Q&A): A set of very difficult, expert-level questions in biology, physics, and chemistry, written so they cannot be answered with a simple web search and therefore require deep domain knowledge.
• HumanEval: A standard benchmark for code generation. The model is given a function signature and a description of its behavior (in a docstring) and must write correct Python code; an illustrative example follows this list.
• HumanEval-FIM (Fill-in-the-Middle): A variation of HumanEval that tests the model's ability to complete code by filling in a missing span in the middle of a function; a second sketch after the list illustrates the prefix/suffix format.
• MBPP (Mostly Basic Programming Problems): A benchmark where models generate Python code from short, natural-language descriptions of programming tasks.
• C-MMLU (Chinese MMLU): A Chinese-language counterpart of the MMLU benchmark, covering a wide range of subjects with an emphasis on knowledge relevant to Chinese contexts.
• C-Eval: A comprehensive evaluation suite for Chinese LLMs covering subjects in the humanities, social sciences, and STEM, developed with a focus on Chinese knowledge domains.
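
For concreteness, below is an illustrative, hypothetical HumanEval-style task (the function and tests are invented for this slide, not taken from the dataset): the prompt is a signature plus a docstring, and the model must produce a body that is then executed against held-out unit tests (scored as pass@k).

def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards, ignoring case."""
    # A completion the model might generate:
    normalized = text.lower()
    return normalized == normalized[::-1]

# Scoring runs hidden test cases such as these against the completion:
assert is_palindrome("Level") is True
assert is_palindrome("Python") is False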
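
HumanEval-FIM rearranges the same kind of task: the model is shown a prefix and a suffix of a function and must generate the missing middle. A minimal sketch, again using an invented function for illustration:

# The model is given PREFIX and SUFFIX and must produce MIDDLE.
PREFIX = '''def count_vowels(text):
    """Count the vowels in a string."""
    total = 0
'''
SUFFIX = "    return total\n"

# A plausible middle the model is expected to fill in:
MIDDLE = '''    for ch in text.lower():
        if ch in "aeiou":
            total += 1
'''

# The reassembled program (PREFIX + MIDDLE + SUFFIX) is then checked with
# unit tests, just as in standard HumanEval.
print(PREFIX + MIDDLE + SUFFIX)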