Slide 4
Papers: October
Cognition
• VHELM: A Holistic Evaluation of Vision Language Models
Planning
• On The Planning Abilities of OpenAI’s o1 Models: Feasibility, Optimality, and Generalizability
• Benchmarking Agentic Workflow Generation
• Planning in the Dark: LLM-Symbolic Planning Pipeline without Experts
• LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
• Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1
Reasoning
• Inference Scaling for Long-Context Retrieval Augmented Generation
• Steering Large Language Models between Code Execution and Textual Reasoning
• Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely
• MARPLE: A Benchmark for Long-Horizon Inference
Evaluation
• The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends
• Evaluation of OpenAI o1: Opportunities and Challenges of AGI