multi-step mathematical word problems from grade school, testing quantitative reasoning.
• MATH: A benchmark of challenging problems from high school math competitions, requiring advanced reasoning across topics such as algebra, geometry, number theory, and precalculus.
• GPQA (Graduate-Level Google-Proof Q&A): A set of very difficult, expert-level questions in biology, physics, and chemistry that cannot be answered reliably with a simple web search and therefore demand deep domain knowledge.
• HumanEval: A standard benchmark for code generation. The model is given a function signature and a docstring describing the desired behavior and must write correct Python code (an illustrative sketch follows this list).
• HumanEval-FIM (Fill-in-the-Middle): A variation of HumanEval that tests the model's ability to complete code by filling in a missing segment in the middle of a function (see the second sketch after the list).
• MBPP (Mostly Basic Programming Problems): A benchmark in which models generate Python code from short, natural-language descriptions of programming tasks.
• C-MMLU (Chinese MMLU): A counterpart of the MMLU benchmark for the Chinese language, covering a wide range of subjects.
• C-Eval: A comprehensive evaluation suite for Chinese LLMs that spans humanities, social sciences, and STEM, developed with a focus on Chinese knowledge domains.
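To make the code-generation setup concrete, here is a minimal sketch of a HumanEval-style task. The problem below is hypothetical and written only in the style of the benchmark, not taken from it: the model sees the signature and docstring, produces the body, and hidden unit tests judge correctness.

```python
# Illustrative only: a hypothetical problem in the style of HumanEval.
# The model is shown the signature and docstring and must generate the body;
# unit tests then decide whether the completion is correct.

from typing import List

def running_sum(numbers: List[int]) -> List[int]:
    """Return the cumulative sum of the input list at each position.
    >>> running_sum([1, 2, 3, 4])
    [1, 3, 6, 10]
    """
    # --- a completion the model might produce ---
    totals: List[int] = []
    current = 0
    for n in numbers:
        current += n
        totals.append(current)
    return totals

# Evaluation mimics the benchmark's approach: run assertions against the
# generated function and count the fraction of problems whose tests all pass.
assert running_sum([1, 2, 3, 4]) == [1, 3, 6, 10]
assert running_sum([]) == []
```

MBPP works along the same lines, except the task is stated as a short natural-language description rather than a docstring inside a function stub.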
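The fill-in-the-middle variant changes how the context is presented rather than what is tested. The sketch below shows one common way such prompts are assembled, with the code before and after the gap supplied and the model asked to generate the missing middle; the sentinel strings used here are placeholders for illustration, as real models define their own special tokens.

```python
# A minimal sketch of a fill-in-the-middle (FIM) prompt, under the assumption
# of a prefix/suffix/middle layout. <PRE>, <SUF>, <MID> are placeholder
# sentinels for illustration; actual tokens differ by model.

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Order the context so the model generates the missing middle last."""
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

# The body between the signature and the return statement has been removed;
# the model must reconstruct it from the code before and after the gap.
prefix = "def is_palindrome(s: str) -> bool:\n    s = s.lower()\n"
suffix = "    return s == reversed_s\n"
print(build_fim_prompt(prefix, suffix))
# A correct middle here would be something like: "    reversed_s = s[::-1]\n"
```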