Slide 53
References
• Hugging Face. (2023), "Open LLM Leaderboard", https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
• Stability-AI. (2023), "JP Language Model Evaluation Harness", https://github.com/Stability-AI/lm-evaluation-harness/blob/jp-stable/README.md
• Weights & Biases Japan. (2023), "Nejumi LLMリーダーボード", https://wandb.ai/wandb/LLM_evaluation_Japan/reports/Nejumi-LLM---Vmlldzo0NTUzMDE2
• Chang et al. (2023), "A Survey on Evaluation of Large Language Models", arXiv:2307.03109.
• OpenAI. (2023), "GPT-4 Technical Report", arXiv:2303.08774.
• Google. (2023), "PaLM 2 Technical Report", arXiv:2305.10403.
• BIG-bench authors. (2022), "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models", arXiv:2206.04615.
• YuzuAI. (2023), "The Rakuda Ranking of Japanese AI", https://yuzuai.jp/benchmark
• Labrak et al. (2023), "A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks", arXiv:2307.12114.
• Zhuo et al. (2023), "On Robustness of Prompt-based Semantic Parsing with Large Pre-trained Language Model: An Empirical Study on Codex", arXiv:2301.12868.