Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. arXiv preprint arXiv:2407.18901, 2024. [2]Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos, and Georgios Paliouras. "FiNER: Financial numeric entity recognition for XBRL tagging." arXiv preprint arXiv:2203.06482, 2022 [3]Dannong Wang, Jaisal Patel, Daochen Zha, Steve Y Yang, and Xiao-Yang Liu. "FinLoRA: Benchmarking LoRA methods for fine-tuning LLMs on financial datasets." arXiv preprint arXiv:2505.19819, 2025 [4]Rishabh Agarwal et al., "Many-shot in-context learning," Advances in Neural Information Processing Systems, 37:76930–76966, 2024 [5]Lakshya A Agrawal et al., "Gepa: Reflective prompt evolution can outperform reinforcement learning," arXiv preprint arXiv:2507.19457, 2025 [6]Mirac Suzgun et al., "Dynamic cheatsheet: Test-time learning with adaptive memory," arXiv preprint arXiv:2504.07952, 2025 [7]Krista Opsahl-Ong et al., "Optimizing instructions and demonstrations for multi-stage language model programs," arXiv preprint arXiv:2406.11695, 2024