Slide 21
References
[1] Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish
Sabharwal, and Niranjan Balasubramanian. "AppWorld: A controllable world of apps and people for benchmarking
interactive coding agents." arXiv preprint arXiv:2407.18901, 2024.
[2] Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion
Androutsopoulos, and Georgios Paliouras. "FiNER: Financial numeric entity recognition for XBRL tagging." arXiv
preprint arXiv:2203.06482, 2022.
[3] Dannong Wang, Jaisal Patel, Daochen Zha, Steve Y Yang, and Xiao-Yang Liu. "FinLoRA: Benchmarking LoRA methods
for fine-tuning LLMs on financial datasets." arXiv preprint arXiv:2505.19819, 2025.
[4] Rishabh Agarwal et al. "Many-shot in-context learning." Advances in Neural Information Processing Systems,
37:76930–76966, 2024.
[5] Lakshya A Agrawal et al. "GEPA: Reflective prompt evolution can outperform reinforcement learning." arXiv
preprint arXiv:2507.19457, 2025.
[6] Mirac Suzgun et al. "Dynamic cheatsheet: Test-time learning with adaptive memory." arXiv preprint
arXiv:2504.07952, 2025.
[7] Krista Opsahl-Ong et al. "Optimizing instructions and demonstrations for multi-stage language model
programs." arXiv preprint arXiv:2406.11695, 2024.